This is the official MindSpore repository for the paper "Graph Neural Topic Model with Commonsense Knowledge".
https://www.sciencedirect.com/science/article/pii/S0306457322003168
Traditional topic models are based on the bag-of-words assumption, which states that the topic assignment of each word is independent of the others. However, this assumption ignores the relationships between words, which may hinder the quality of the extracted topics. To address this issue, some recent works formulate documents as graphs based on word co-occurrence patterns, assuming that if two words co-occur frequently, they should have the same topic. Nevertheless, this introduces noisy edges into the model and thus hinders topic quality, since frequent co-occurrence does not imply that two words belong to the same topic. In this paper, we use the commonsense relationships between words as a bridge to connect the words in each document. Compared to word co-occurrence, a commonsense relationship explicitly implies semantic relevance between words, which can be used to filter out noisy edges. We use a relational graph neural network to capture the relation information in the graph. Moreover, manifold regularization is applied to constrain the documents' topic distributions. Experimental results on a public dataset show that our method is effective at extracting topics compared to baseline methods.
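The manifold regularization mentioned in the abstract can be sketched as a penalty that pulls the topic distributions of related documents together. The snippet below is an illustrative, hypothetical implementation, not the repository's code: the function name, the affinity matrix `W`, and how neighbours are chosen are all assumptions.

```python
import numpy as np

def manifold_regularizer(theta, W, lam=0.1):
    """Illustrative manifold regularization over document-topic distributions.

    theta : (D, K) array; each row is a document's topic distribution
    W     : (D, D) symmetric affinity matrix (e.g. 1 for nearest
            neighbours under some document similarity, else 0)
    lam   : manifold coefficient (lambda in the paper's notation)
    """
    D = theta.shape[0]
    reg = 0.0
    for i in range(D):
        for j in range(D):
            if W[i, j] > 0:
                # penalize distance between topic distributions of
                # documents that the affinity graph says are related
                diff = theta[i] - theta[j]
                reg += W[i, j] * np.dot(diff, diff)
    return lam * reg
```

Adding a term of this shape to the training loss encourages nearby documents (under the chosen affinity) to receive similar topic distributions.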
======================================= How to run =============================================
[1] Setup the environment:
conda env create -f environment.yml
(optional) pip install -r requirements.txt
[2] Set ROOTPATH in the file 'settings.py' to the absolute path of this code directory.
[3] All the data files are saved in the folders 'data/reuters/' and 'data/reuters_2hop/'.
The file 'preprocess.py' in the folder 'dataPrepare/' removes stop words
and builds the vocabulary from the raw Reuters dataset.
The triples extracted from ConceptNet for each doc are saved in the files whose names
are prefixed 'all_doc_triples_'. The word pairs that have a commonsense relationship in each
doc are saved in the files whose names are prefixed 'all_doc_pairs_'.
To see how the documents are represented as graphs in the code, see the file 'graph_data_concept.py'
in the folder 'dataPrepare/'.
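The idea of turning a document into a graph via its commonsense word pairs can be sketched as follows. This is a hypothetical illustration, not the logic of 'graph_data_concept.py': the function name, the `(word, word) -> relation` dictionary, and the edge format are assumptions.

```python
def build_doc_graph(doc_tokens, commonsense_pairs):
    """Build an edge list for one document from commonsense word pairs.

    doc_tokens        : list of word tokens in the document
    commonsense_pairs : dict mapping (w1, w2) -> relation name,
                        e.g. derived from ConceptNet triples
    Returns (nodes, edges) where each edge is (src, relation, dst).
    """
    nodes = sorted(set(doc_tokens))  # unique words become graph nodes
    edges = []
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            # connect two words only if a commonsense relation links them,
            # checking both orderings of the pair
            rel = commonsense_pairs.get((u, v)) or commonsense_pairs.get((v, u))
            if rel is not None:
                edges.append((u, rel, v))
    return nodes, edges
```

Only word pairs backed by a commonsense relation become edges, which is how noisy co-occurrence edges are filtered out.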
[4] Pretrain the R-GCN using the script 'run_reuters_rgcn_pretrain_xhop.sh' (x=1 or x=2):
bash run_reuters_rgcn_pretrain_1hop.sh
This step obtains the initial node embeddings for the R-GCN.
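For reference, one R-GCN message-passing step (relation-specific weights plus a self-loop, as in Schlichtkrull et al.'s formulation) can be sketched in NumPy. This is an illustrative sketch, not the repository's pretraining code; the function signature and normalization choice are assumptions.

```python
import numpy as np

def rgcn_layer(H, adj_per_rel, W_rel, W_self):
    """One R-GCN message-passing step.

    H           : (N, d_in) node features
    adj_per_rel : list of (N, N) float adjacency matrices, one per relation
    W_rel       : list of (d_in, d_out) weight matrices, one per relation
    W_self      : (d_in, d_out) self-loop weight matrix
    """
    out = H @ W_self  # self-loop contribution
    for A, W in zip(adj_per_rel, W_rel):
        # normalize messages by each node's degree under this relation
        deg = A.sum(axis=1, keepdims=True)
        norm = np.divide(A, deg, out=np.zeros_like(A), where=deg > 0)
        out = out + norm @ H @ W
    return np.maximum(out, 0.0)  # ReLU activation
```

Stacking such layers over the commonsense document graphs yields the node embeddings that the pretraining step initializes.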
[5] Train the GCNTM-CK model with the scripts 'run_reuters_mr{}_path{}num{}{}hop.sh' according to
the desired settings. For instance, the script run_reuters_mr0.1_path50_num50_1hop.sh will train
a model with the following parameter settings:
- H (hop number) -> 1
- P (maximum number of pairs) -> 50
- R (maximum number of nearest neighbors) -> 50
- \lambda (manifold coefficient) -> 0.1
```
bash run_reuters_mr0.1_path50_num50_1hop.sh
```
We train 5 times for each topic setting and report the average results.
[6] After finishing training all the model variants, use the script 'overall_results_reuters.sh'
to obtain the final evaluation results, which will contain three topic coherence scores
(c_v, c_npmi, c_uci) and one topic diversity score (td). The results will be saved in an xlsx
file. The results for the main setting (H=1, P=100, R=100, \lambda=0.1) in the paper can be
found in the folder 'final/'.
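Of the reported metrics, topic diversity (td) has a simple standard definition: the fraction of unique words among the top-k words of all topics. The sketch below illustrates that metric; the function name and the top-k value are illustrative, not taken from 'overall_results_reuters.sh'.

```python
def topic_diversity(topics, topk=25):
    """Topic diversity: fraction of unique words among the top-k
    words across all topics. 1.0 means no topic shares any word.
    """
    words = [w for topic in topics for w in topic[:topk]]
    return len(set(words)) / len(words)
```

The coherence scores (c_v, c_npmi, c_uci) are more involved; in practice they are typically computed with a library such as gensim's CoherenceModel.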
An example log for a specific topic setting of our model can be found in the folder 'models/'.
========================================= End =====================================================
Developed by Prof. Yi Cai's group at South China University of Technology.