@@ -10,10 +10,11 @@ __pycache__/
*.zip
*.bin
*.pt
*.log
# numpy files
*.npy
*.npz
# yaml files
*.yaml
@@ -0,0 +1,29 @@
## Contributing to OpenHGNN
Contributions are always welcome. Please feel free to open an issue or email tyzhao@bupt.edu.cn.
## Contributors
- **[GAMMA LAB](https://github.com/BUPT-GAMMA) [BUPT]**
  - [Tianyu Zhao](https://github.com/Theheavens)
  - Hongyi Zhang
  - Yaoqi Liu
  - Hui Han
  - Fengqi Liang
  - Yibo Li
  - Yanhu Mo
  - Donglin Xia
  - Xinlong Zhai
  - Siyuan Zhang
  - Qi Zhang
  - [Chuan Shi](http://shichuan.org/)
  - Cheng Yang
  - Xiao Wang
- **BUPT**
  - Jiahang Li
  - Anke Hu
- **DGL Team**
  - Quan Gan
  - Minjie Wang
  - [Jian Zhang](https://github.com/zhjwy9343)
@@ -51,7 +51,7 @@ pip install -r requirements.txt
#### Running an existing baseline model on an existing benchmark [dataset](./openhgnn/dataset/#Dataset)

```bash
python main.py -m model_name -d dataset_name -t task_name -g 0 --use_best_config
python main.py -m model_name -d dataset_name -t task_name -g 0 --use_best_config --load_from_pretrained
```

usage: main.py [-h] [--model MODEL] [--task TASK] [--dataset DATASET]
@@ -73,6 +73,8 @@ usage: main.py [-h] [--model MODEL] [--task TASK] [--dataset DATASET]
``--use_hpo`` Besides ``--use_best_config``, we provide a hyper-parameter [example](./openhgnn/auto) to search for the best hyper-parameters automatically.

``--load_from_pretrained`` will load the model from a default checkpoint.

e.g.:
```bash
@@ -89,29 +91,32 @@ Refer to the [docs](https://openhgnn.readthedocs.io/en/latest/index.html) to get
The link gives some basic usage.
| Model                                           | Node classification | Link prediction    | Recommendation     |
| ----------------------------------------------- | ------------------- | ------------------ | ------------------ |
| [RGCN](./openhgnn/output/RGCN)[ESWC 2018]       | :heavy_check_mark:  | :heavy_check_mark: |                    |
| [HAN](./openhgnn/output/HAN)[WWW 2019]          | :heavy_check_mark:  |                    |                    |
| [KGCN](./openhgnn/output/KGCN)[WWW 2019]        |                     |                    | :heavy_check_mark: |
| [HetGNN](./openhgnn/output/HetGNN)[KDD 2019]    | :heavy_check_mark:  | :heavy_check_mark: |                    |
| [GTN](./openhgnn/output/GTN)[NeurIPS 2019]      | :heavy_check_mark:  |                    |                    |
| [RSHN](./openhgnn/output/RSHN)[ICDM 2019]       | :heavy_check_mark:  |                    |                    |
| [DMGI](./openhgnn/output/DMGI)[AAAI 2020]       | :heavy_check_mark:  |                    |                    |
| [MAGNN](./openhgnn/output/MAGNN)[WWW 2020]      | :heavy_check_mark:  |                    |                    |
| [CompGCN](./openhgnn/output/CompGCN)[ICLR 2020] | :heavy_check_mark:  | :heavy_check_mark: |                    |
| [NSHE](./openhgnn/output/NSHE)[IJCAI 2020]      | :heavy_check_mark:  |                    |                    |
| [NARS](./openhgnn/output/NARS)[arxiv]           | :heavy_check_mark:  |                    |                    |
| [MHNF](./openhgnn/output/MHNF)[arxiv]           | :heavy_check_mark:  |                    |                    |
| [HGSL](./openhgnn/output/HGSL)[AAAI 2021]       | :heavy_check_mark:  |                    |                    |
| [HGNN-AC](./openhgnn/output/HGNN_AC)[WWW 2021]  | :heavy_check_mark:  |                    |                    |
| [HeCo](./openhgnn/output/HeCo)[KDD 2021]        | :heavy_check_mark:  |                    |                    |
| [HPN](./openhgnn/output/HPN)[TKDE 2021]         | :heavy_check_mark:  |                    |                    |
| [RHGNN](./openhgnn/output/RHGNN)[arxiv]         | :heavy_check_mark:  |                    |                    |

### To be supported models
- Metapath2vec[KDD 2017]
| Model                                                    | Node classification | Link prediction    | Recommendation     |
| -------------------------------------------------------- | ------------------- | ------------------ | ------------------ |
| [Metapath2vec](./openhgnn/output/metapath2vec)[KDD 2017] | :heavy_check_mark:  |                    |                    |
| [RGCN](./openhgnn/output/RGCN)[ESWC 2018]                | :heavy_check_mark:  | :heavy_check_mark: |                    |
| [HERec](./openhgnn/output/HERec)[TKDE 2018]              | :heavy_check_mark:  |                    |                    |
| [HAN](./openhgnn/output/HAN)[WWW 2019]                   | :heavy_check_mark:  |                    |                    |
| [KGCN](./openhgnn/output/KGCN)[WWW 2019]                 |                     |                    | :heavy_check_mark: |
| [HetGNN](./openhgnn/output/HetGNN)[KDD 2019]             | :heavy_check_mark:  | :heavy_check_mark: |                    |
| HGAT[EMNLP 2019]                                         |                     |                    |                    |
| [GTN](./openhgnn/output/GTN)[NeurIPS 2019]               | :heavy_check_mark:  |                    |                    |
| [RSHN](./openhgnn/output/RSHN)[ICDM 2019]                | :heavy_check_mark:  |                    |                    |
| [DMGI](./openhgnn/output/DMGI)[AAAI 2020]                | :heavy_check_mark:  |                    |                    |
| [MAGNN](./openhgnn/output/MAGNN)[WWW 2020]               | :heavy_check_mark:  |                    |                    |
| HGT[WWW 2020]                                            |                     |                    |                    |
| [CompGCN](./openhgnn/output/CompGCN)[ICLR 2020]          | :heavy_check_mark:  | :heavy_check_mark: |                    |
| [NSHE](./openhgnn/output/NSHE)[IJCAI 2020]               | :heavy_check_mark:  |                    |                    |
| [NARS](./openhgnn/output/NARS)[arxiv]                    | :heavy_check_mark:  |                    |                    |
| [MHNF](./openhgnn/output/MHNF)[arxiv]                    | :heavy_check_mark:  |                    |                    |
| [HGSL](./openhgnn/output/HGSL)[AAAI 2021]                | :heavy_check_mark:  |                    |                    |
| [HGNN-AC](./openhgnn/output/HGNN_AC)[WWW 2021]           | :heavy_check_mark:  |                    |                    |
| [HeCo](./openhgnn/output/HeCo)[KDD 2021]                 | :heavy_check_mark:  |                    |                    |
| SimpleHGN[KDD 2021]                                      |                     |                    |                    |
| [HPN](./openhgnn/output/HPN)[TKDE 2021]                  | :heavy_check_mark:  |                    |                    |
| [RHGNN](./openhgnn/output/RHGNN)[arxiv]                  | :heavy_check_mark:  |                    |                    |
| [HDE](./openhgnn/output/HDE)[ICDM 2021]                  |                     | :heavy_check_mark: |                    |
### Candidate models
@@ -120,9 +125,7 @@ The link will give some basic usage.
## Contributors
**[GAMMA LAB](https://github.com/BUPT-GAMMA) [BUPT]**: [Tianyu Zhao](https://github.com/Theheavens), Yaoqi Liu, Fengqi Liang, Yibo Li, Yanhu Mo, Donglin Xia, Xinlong Zhai, Siyuan Zhang, Qi Zhang, [Chuan Shi](http://shichuan.org/), Cheng Yang, Xiao Wang
**BUPT**: Jiahang Li, Anke Hu
OpenHGNN Team[GAMMA LAB] & DGL Team.
**DGL Team**: Quan Gan, [Jian Zhang](https://github.com/zhjwy9343)

See more in [CONTRIBUTING](./CONTRIBUTING.md).
@@ -22,10 +22,15 @@ Node classification Task
Link prediction Task
---------------------

.. autoclass:: openhgnn.tasks.link_prediction.LinkPrediction
    :members:
    :undoc-members:
    :show-inheritance:

.. automodule:: openhgnn.tasks.link_prediction

Link prediction Model
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: openhgnn.models.link_prediction.LinkPrediction
    :members:
    :undoc-members:
    :show-inheritance:

Recommendation Task
---------------------
@@ -17,10 +17,12 @@ if __name__ == '__main__':
    parser.add_argument('--gpu', '-g', default='0', type=int, help='-1 means cpu')
    parser.add_argument('--use_best_config', action='store_true', help='will load utils.best_config')
    parser.add_argument('--use_hpo', action='store_true', help='hyper-parameter optimization')
    parser.add_argument('--load_from_pretrained', action='store_true', help='load model from the checkpoint')
    args = parser.parse_args()

    config_file = ["./openhgnn/config.ini"]
    config = Config(file_path=config_file, model=args.model, dataset=args.dataset, task=args.task, gpu=args.gpu)
    config.use_best_config = args.use_best_config
    config.use_hpo = args.use_hpo
    config.load_from_pretrained = args.load_from_pretrained
    OpenHGNN(args=config)
@@ -98,7 +98,6 @@ dropout = 0.2
seed = 0
in_dim = 64
hidden_dim = 64
out_dim = 64
n_bases = 40
n_layers = 3
@@ -151,22 +150,29 @@ mini_batch_flag = True
[Metapath2vec]
learning_rate = 0.01
;weight_decay = 0.0001
dim = 128
max_epoch = 50
max_epoch = 1
batch_size = 32
window_size = 5
num_workers = 4
;batches_per_epoch = 20
rw_length = 10
rw_length = 20
rw_walks = 10
;rwr_prob = 0.5
neg_size = 5
seed = 0
meta_path_key = APVPA

[HERec]
learning_rate = 0.01
dim = 128
max_epoch = 1
batch_size = 32
window_size = 2
num_workers = 4
rw_length = 100
rw_walks = 10
neg_size = 5
seed = 0
;patience = 100
;mini_batch_flag = True
meta_path_keys = APVPA,APA

[HAN]
seed = 0
@@ -400,4 +406,15 @@ stage_type = stack
num_heads = 8
feat = 0
subgraph = metapath
mini_batch_flag = false
mini_batch_flag = false

[HDE]
emb_dim = 128
num_neighbor = 5
use_bias = true
k_hop = 2
max_epoch = 400
batch_size = 32
max_dist = 3
lr = 0.001
@@ -14,7 +14,7 @@ class Config(object):
        if th.cuda.is_available():
            self.device = th.device('cuda', int(gpu))
        else:
            print("cuda is not available, please set 'gpu' -1")
            raise ValueError("cuda is not available, please set 'gpu' -1")
        try:
            conf.read(file_path)
@@ -107,7 +107,6 @@ class Config(object):
            self.in_dim = conf.getint("RGCN", "in_dim")
            self.hidden_dim = conf.getint("RGCN", "hidden_dim")
            self.out_dim = conf.getint("RGCN", "out_dim")
            self.n_bases = conf.getint("RGCN", "n_bases")
            self.n_layers = conf.getint("RGCN", "n_layers")
@@ -159,22 +158,28 @@ class Config(object):
            pass
        elif model == 'Metapath2vec':
            self.lr = conf.getfloat("Metapath2vec", "learning_rate")
            # self.weight_decay = conf.getfloat("Metapath2vec", "weight_decay")
            # self.dropout = conf.getfloat("CompGCN", "dropout")
            self.max_epoch = conf.getint("Metapath2vec", "max_epoch")
            self.dim = conf.getint("Metapath2vec", "dim")
            self.batch_size = conf.getint("Metapath2vec", "batch_size")
            self.window_size = conf.getint("Metapath2vec", "window_size")
            self.num_workers = conf.getint("Metapath2vec", "num_workers")
            self.neg_size = conf.getint("Metapath2vec", "neg_size")
            # self.batches_per_epoch = conf.getint("Metapath2vec", "batches_per_epoch")
            # self.seed = conf.getint("Metapath2vec", "seed")
            # self.patience = conf.getint("Metapath2vec", "patience")
            self.rw_length = conf.getint("Metapath2vec", "rw_length")
            self.rw_walks = conf.getint("Metapath2vec", "rw_walks")
            # self.rwr_prob = conf.getfloat("Metapath2vec", "rwr_prob")
            # self.mini_batch_flag = conf.getboolean("Metapath2vec", "mini_batch_flag")
            self.meta_path_key = conf.get("Metapath2vec", "meta_path_key")
        elif model == 'HERec':
            self.lr = conf.getfloat("HERec", "learning_rate")
            self.max_epoch = conf.getint("HERec", "max_epoch")
            self.dim = conf.getint("HERec", "dim")
            self.batch_size = conf.getint("HERec", "batch_size")
            self.window_size = conf.getint("HERec", "window_size")
            self.num_workers = conf.getint("HERec", "num_workers")
            self.neg_size = conf.getint("HERec", "neg_size")
            self.rw_length = conf.getint("HERec", "rw_length")
            self.rw_walks = conf.getint("HERec", "rw_walks")
            meta_path_keys = conf.get("HERec", "meta_path_keys")
            self.meta_path_keys = meta_path_keys.split(',')
        elif model == 'HAN':
            self.lr = conf.getfloat("HAN", "learning_rate")
@@ -480,6 +485,15 @@ class Config(object):
            self.validation = conf.getboolean('HeGAN', 'validation')
            self.emb_size = conf.getint("HeGAN", 'emb_size')
            self.patience = conf.getint("HeGAN", 'patience')
        elif model == 'HDE':
            self.emb_dim = conf.getint('HDE', 'emb_dim')
            self.num_neighbor = conf.getint('HDE', 'num_neighbor')
            self.use_bias = conf.getboolean('HDE', 'use_bias')
            self.k_hop = conf.getint('HDE', 'k_hop')
            self.max_epoch = conf.getint('HDE', 'max_epoch')
            self.batch_size = conf.getint('HDE', 'batch_size')
            self.max_dist = conf.getint('HDE', 'max_dist')
            self.lr = conf.getfloat('HDE', 'lr')

    def __repr__(self):
        return 'Model:' + self.model + '\nTask:' + self.task + '\nDataset:' + self.dataset
        return '[Config Info]\tModel: {},\tTask: {},\tDataset: {}'.format(self.model, self.task, self.dataset)
@@ -52,7 +52,7 @@ class LinkPredictionDataset(BaseDataset):
        random_int = th.randperm(num_edges)
        val_index = random_int[:int(num_edges * val_ratio)]
        val_edge = self.g.find_edges(val_index, etype)
        test_index = random_int[int(num_edges * test_ratio):int(num_edges * test_ratio)]
        test_index = random_int[int(num_edges * test_ratio):int(num_edges * (test_ratio + val_ratio))]
        test_edge = self.g.find_edges(test_index, etype)
        val_edge_dict[etype] = val_edge
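For clarity, here is a self-contained sketch of the split the corrected slice produces (a hypothetical helper, not OpenHGNN code; the original slice `[test_ratio:test_ratio]` was always empty):

```python
import torch as th

def split_edge_ids(num_edges: int, val_ratio: float = 0.1, test_ratio: float = 0.1):
    # Shuffle all edge ids, then carve out disjoint validation and test slices.
    perm = th.randperm(num_edges)
    val_index = perm[:int(num_edges * val_ratio)]
    test_index = perm[int(num_edges * val_ratio):int(num_edges * (val_ratio + test_ratio))]
    train_index = perm[int(num_edges * (val_ratio + test_ratio)):]
    return train_index, val_index, test_index

train_idx, val_idx, test_idx = split_edge_ids(100)
assert not set(val_idx.tolist()) & set(test_idx.tolist())  # disjoint by construction
```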
@@ -430,6 +430,9 @@ def comp_deg_norm(g):
@register_dataset('kg_link_prediction')
class KG_LinkPrediction(LinkPredictionDataset):
    """
    From `RGCN <https://arxiv.org/abs/1703.06103>`_: WN18 & FB15k suffer from data leakage.
    """
    def __init__(self, dataset_name):
        super(KG_LinkPrediction, self).__init__()
        if dataset_name in ['wn18', 'FB15k', 'FB15k-237']:
@@ -438,40 +441,70 @@ class KG_LinkPrediction(LinkPredictionDataset):
            self.num_rels = dataset.num_rels * 2
            # include inverse edges
            self.num_nodes = dataset.num_nodes
            self.homo_g = dataset[0]
            self.g = self.homo_to_hetero(dataset[0])
            self.train_hg = self.get_graph_directed_from_triples(dataset.train, 'graph')
            self.valid_hg = self.get_graph_directed_from_triples(dataset.valid, 'graph')
            self.test_hg = self.get_graph_directed_from_triples(dataset.test, 'graph')
            self.train_triplets, self.valid_triplets, self.test_triplets = self.get_all_triplets(dataset)
            self.target_link = self.test_hg.canonical_etypes
            self.g = self.train_hg
            # self.g = self.build_g(dataset.train)
            # self.dataset = dataset

    def get_triples_directed(self, mask_mode):
        if mask_mode == 'train_mask':
            data = self.dataset.train
        elif mask_mode == 'val_mask':
            data = self.dataset.valid
        elif mask_mode == 'test_mask':
            data = self.dataset.test
        s = th.LongTensor(data[:, 0])
        r = th.LongTensor(data[:, 1])
        o = th.LongTensor(data[:, 2])
        return th.stack([s, r, o])

    def get_triples(self, mask_mode):
    def get_graph_directed_from_triples(self, triples, format='graph'):
        s = th.LongTensor(triples[:, 0])
        r = th.LongTensor(triples[:, 1])
        o = th.LongTensor(triples[:, 2])
        if format == 'graph':
            edge_dict = {}
            for i in range(self.num_rels):
                mask = (r == i)
                edge_name = (self.category, str(i), self.category)
                edge_dict[edge_name] = (s[mask], o[mask])
            return dgl.heterograph(edge_dict, {self.category: self.num_nodes})

    def get_triples(self, g, mask_mode):
        '''
        :param g: a homogeneous graph carrying edge types and edge masks
        :param mask_mode: should be one of 'train_mask', 'val_mask', 'test_mask'
        :return: the (source, relation, object) triples selected by the mask
        '''
        g = self.homo_g
        edges = g.edges()
        etype = g.edata['etype']
        mask = g.edata.pop(mask_mode)
        return th.stack((edges[0][mask], etype[mask], edges[1][mask]))

    def homo_to_hetero(self, g):
    def get_all_triplets(self, dataset):
        train_data = th.LongTensor(dataset.train)
        valid_data = th.LongTensor(dataset.valid)
        test_data = th.LongTensor(dataset.test)
        return train_data, valid_data, test_data

    def get_idx(self):
        return self.train_hg, self.valid_hg, self.test_hg

    def split_graph(self, g, mode='train'):
        """
        Parameters
        ----------
        g: DGLGraph
            a graph in homogeneous format
        mode: str
            split the subgraph according to the mode

        Returns
        -------
        hg: DGLHeterograph
        """
        edges = g.edges()
        etype = g.edata['etype']
        train_mask = g.edata['train_mask']
        hg = self.build_graph((edges[0][train_mask], edges[1][train_mask]), etype[train_mask])
        if mode == 'train':
            mask = g.edata['train_mask']
        elif mode == 'valid':
            mask = g.edata['valid_edge_mask']
        elif mode == 'test':
            mask = g.edata['test_edge_mask']
        hg = self.build_graph((edges[0][mask], edges[1][mask]), etype[mask])
        return hg

    def build_graph(self, edges, etype):
@@ -178,6 +178,12 @@ class HIN_NodeClassification(NodeClassificationDataset):
            category = 'A'
            g = dataset[0].long()
            num_classes = 4
            self.meta_paths_dict = {
                'APVPA': [('A', 'A-P', 'P'), ('P', 'P-V', 'V'), ('V', 'V-P', 'P'), ('P', 'P-A', 'A')],
                'APA': [('A', 'A-P', 'P'), ('P', 'P-A', 'A')],
            }
            self.meta_paths = [(('A', 'A-P', 'P'), ('P', 'P-V', 'V'), ('V', 'V-P', 'P'), ('P', 'P-A', 'A')),
                               (('A', 'A-P', 'P'), ('P', 'P-A', 'A'))]
            self.in_dim = g.ndata['h'][category].shape[1]
        elif name_dataset == 'imdb4MAGNN':
@@ -197,6 +203,10 @@ class HIN_NodeClassification(NodeClassificationDataset):
            category = 'paper'
            g = dataset[0].long()
            num_classes = 3
            self.meta_paths_dict = {'PAPSP': [('paper', 'paper-author', 'author'), ('author', 'author-paper', 'paper'),
                                              ('paper', 'paper-subject', 'subject'),
                                              ('subject', 'subject-paper', 'paper')]
                                    }
            self.meta_paths = [(('paper', 'paper-author', 'author'), ('author', 'author-paper', 'paper'),
                                ('paper', 'paper-subject', 'subject'), ('subject', 'subject-paper', 'paper'))]
            self.in_dim = g.ndata['h'][category].shape[1]
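The metapaths above are lists of canonical edge types. As a hedged illustration on a toy graph (names and edges are made up), DGL's `metapath_reachable_graph` collapses such a metapath into a homogeneous graph, which is what metapath-based models like Metapath2vec and HERec operate on:

```python
import dgl
import torch as th

# Toy heterograph in the same schema style as above (hypothetical data).
hg = dgl.heterograph({
    ('A', 'A-P', 'P'): (th.tensor([0, 0, 1]), th.tensor([0, 1, 1])),
    ('P', 'P-A', 'A'): (th.tensor([0, 1, 1]), th.tensor([0, 0, 1])),
})
# Collapse the 'APA' metapath into an A-A graph: edges connect authors
# reachable through a shared paper.
g_apa = dgl.metapath_reachable_graph(hg, ['A-P', 'P-A'])
print(g_apa)  # a homogeneous graph over the 'A' nodes
```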
@@ -44,10 +44,14 @@ So dataset should load not only a heterograph[DGLGraph], but also some index inv
| imdb4GTN | 4,661 | 5,841 | 2,270 | 13,983 | 4,661 | 300 | 300 | 2,339 |
| imdb4MAGNN | 4,278 | 5,257 | 2,081 | 12,828 | 4,278 | 400 | 400 | 3,478 |

- **HGB_NodeClassification**
- **HGBn**

  The HGB datasets for node classification.

  **Note**: The test data labels are randomly replaced to prevent data leakage issues; refer to [HGB](https://github.com/THUDM/HGB).
  In OpenHGNN, you will get the test results in `./openhgnn/output/{model_name}/`. If you want to obtain test scores, you need to submit your predictions to HGB's [website](https://www.biendata.xyz/hgb/).

  - HGBn-ACM

    | paper | author | subject | term | paper-author | paper-paper | paper-subject | paper-term | Val | Test |
@@ -86,10 +90,14 @@ So dataset should load not only a heterograph[DGLGraph], but also some index inv
- ###### academic4HetGNN

- **HGBl-LinkPrediction**
- **HGBl**

  The HGB datasets for link prediction.

  **Note**: The test data labels are randomly replaced to prevent data leakage issues; refer to [HGB](https://github.com/THUDM/HGB).
  In OpenHGNN, you will get the test results in `./openhgnn/output/{model_name}/`. If you want to obtain test scores, you need to submit your predictions to HGB's [website](https://www.biendata.xyz/hgb/).

  - HGBl-amazon

    | | product | features | product-product0 | product-product1 | test : product-product0 | test : product-product1 |
@@ -50,7 +50,7 @@ def build_dataset(dataset, task):
        _dataset = 'rdf_' + task
    elif dataset in ['acm4NSHE', 'acm4GTN', 'academic4HetGNN', 'acm_han', 'acm_han_raw', 'acm4HeCo', 'dblp', 'dblp4MAGNN',
                     'imdb4MAGNN', 'imdb4GTN', 'acm4NARS', 'demo_graph', 'yelp4HeGAN', 'DoubanMovie', 'Book-Crossing',
                     'amazon4SLICE', 'HNE-PubMed', 'HGBl-ACM', 'HGBl-DBLP', 'HGBl-IMDB']:
                     'amazon4SLICE', 'MTWM', 'HNE-PubMed', 'HGBl-ACM', 'HGBl-DBLP', 'HGBl-IMDB']:
        _dataset = 'hin_' + task
    elif dataset in ['ogbn-mag']:
        _dataset = 'ogbn_' + task
@@ -0,0 +1,65 @@
import torch
import torch.nn as nn
import torch.nn.functional as F

from . import BaseModel, register_model


class GNN(nn.Module):
    """
    Aggregate 2-hop neighbors.
    """
    def __init__(self, input_dim, output_dim, num_neighbor, use_bias=True):
        super(GNN, self).__init__()
        self.input_dim = int(input_dim)
        self.num_fea = int(input_dim)
        self.output_dim = int(output_dim)
        self.num_neighbor = num_neighbor
        self.use_bias = use_bias
        self.linear1 = nn.Linear(self.input_dim * 2, 64)
        self.linear2 = nn.Linear(64 + self.num_fea, 64)
        self.linear3 = nn.Linear(64, self.output_dim)

    def forward(self, fea):
        # The flattened input packs [node | 1-hop neighbors | 2-hop neighbors].
        node = fea[:, :self.num_fea]
        neigh1 = fea[:, self.num_fea:self.num_fea * (self.num_neighbor + 1)]
        neigh1 = torch.reshape(neigh1, [-1, self.num_neighbor, self.num_fea])
        neigh2 = fea[:, self.num_fea * (self.num_neighbor + 1):]
        neigh2 = torch.reshape(neigh2, [-1, self.num_neighbor, self.num_neighbor, self.num_fea])
        # Mean-aggregate the 2-hop neighbors into their 1-hop parents.
        neigh2_agg = torch.mean(neigh2, dim=2)
        tmp = torch.cat([neigh1, neigh2_agg], dim=2)
        tmp = F.relu(self.linear1(tmp))
        emb = torch.cat([node, torch.mean(tmp, dim=1)], dim=1)
        emb = F.relu(self.linear2(emb))
        emb = F.relu(self.linear3(emb))
        return emb


@register_model('HDE')
class HDE(BaseModel):
    def __init__(self, input_dim, output_dim, num_neighbor, use_bias=True):
        super(HDE, self).__init__()
        self.input_dim = int(input_dim)
        self.output_dim = int(output_dim)
        self.num_neighbor = num_neighbor
        self.use_bias = use_bias
        self.aggregator = GNN(input_dim=input_dim, output_dim=output_dim, num_neighbor=num_neighbor)
        self.linear1 = nn.Linear(2 * self.output_dim, 32)
        self.linear2 = nn.Linear(32, 2)

    def forward(self, fea_a, fea_b):
        emb_a = self.aggregator(fea_a)
        emb_b = self.aggregator(fea_b)
        emb = torch.cat([emb_a, emb_b], dim=1)
        emb = F.relu(self.linear1(emb))
        output = self.linear2(emb)
        return output

    @classmethod
    def build_model_from_args(cls, args, hg):
        return cls(input_dim=args.input_dim,
                   output_dim=args.output_dim,
                   num_neighbor=args.num_neighbor,
                   use_bias=args.use_bias)
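A minimal smoke test of the expected input layout, assuming the flattened ego-graph encoding produced by `HDESampler` (per-node width `num_fea`, `k` neighbors per hop, so `num_fea * (1 + k + k^2)` columns); the tensors here are random placeholders, not real features:

```python
import torch

# Widths follow the trainer's convention (see hde_trainer below):
# num_fea = 2 * node_type * (max_dist + 1) + node_type = 18 for node_type=2, max_dist=3.
num_fea, k = 18, 5                  # per-node feature width, neighbors per hop
width = num_fea * (1 + k + k * k)   # node + 1-hop + 2-hop neighbors, flattened
model = HDE(input_dim=num_fea, output_dim=64, num_neighbor=k)
fea_a = torch.randn(8, width)       # a batch of 8 ego-graph encodings for endpoint A
fea_b = torch.randn(8, width)
logits = model(fea_a, fea_b)
print(logits.shape)                 # torch.Size([8, 2]): link / no-link logits
```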
@@ -10,14 +10,15 @@ from . import BaseModel, register_model
"""
@register_model('HERec')
@register_model('Metapath2vec')
class HeteroEmbedding(BaseModel):
class SkipGram(BaseModel):
    @classmethod
    def build_model_from_args(cls, args, hg):
        return cls(hg.num_nodes(), args.dim)

    def __init__(self, num_nodes, dim):
        super(HeteroEmbedding, self).__init__()
        super(SkipGram, self).__init__()
        self.embedding_dim = dim
        self.u_embeddings = nn.Embedding(num_nodes, self.embedding_dim,
@@ -56,7 +56,8 @@ SUPPORTED_MODELS = {
    'RGCN': 'openhgnn.models.RGCN',
    "RGAT": 'openhgnn.models.RGAT',
    'RSHN': 'openhgnn.models.RSHN',
    'Metapath2vec': 'openhgnn.models.Metapath2vec',
    'Metapath2vec': 'openhgnn.models.SkipGram',
    'HERec': 'openhgnn.models.SkipGram',
    'HAN': 'openhgnn.models.HAN',
    # 'HGT': 'openhgnn.models.HGT',
    'HeCo': 'openhgnn.models.HeCo',
@@ -77,4 +78,5 @@ SUPPORTED_MODELS = {
    'GAT': 'space4hgnn.homo_models.GAT',
    'homo_GNN': 'openhgnn.models.homo_GNN',
    'general_HGNN': 'openhgnn.models.general_HGNN',
    'HDE': 'openhgnn.models.HDE',
}
@@ -0,0 +1,55 @@
# HDE[ICDM2021]

Paper: [Heterogeneous Graph Neural Network with Distance Encoding](http://www.shichuan.org/doc/116.pdf)

## How to run

* Clone the Openhgnn-DGL

```bash
python main.py -m HDE -d HGBl-IMDB -t link_prediction -g 0 --use_best_config
```

If you do not have a GPU, set `--gpu -1`.

## Performance

| Dataset   | AUC_ROC |
| --------- | ------- |
| HGBl-IMDB | 0.9151  |
| HGBl-ACM  | 0.8741  |
| HGBl-DBLP | 0.9836  |

### TrainerFlow

```hde_trainer```

### Model

```HDE```

### Dataset

Supported datasets:

* HGBl-IMDB
* HGBl-ACM
* HGBl-DBLP

### Hyper-parameters specific to the model

```python
emb_dim = 128  # dimension of the HDE
k_hop = 2      # radius when sampling the ego graph of a node
max_dist = 3   # max value of the SPD, usually set to k_hop + 1
```
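For reference, the per-node feature width that the trainer derives from these values can be computed as follows (a sketch mirroring `hde_trainer.__init__`, where `node_type = 2` stands for the two endpoint types of the target link):

```python
node_type = 2                                             # 'A' and 'B' endpoint types
max_dist = 3
num_neighbor = 5
num_fea = 2 * node_type * (max_dist + 1) + node_type      # per-node width: 18
sample_size = num_fea * (1 + num_neighbor + num_neighbor ** 2)  # flattened ego graph: 558
```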
## More

#### Contributor

Zhiyuan Lu[GAMMA LAB]

#### If you have any questions,

Submit an issue or email [luzy@bupt.edu.cn](mailto:luzy@bupt.edu.cn).
@@ -0,0 +1,21 @@
# HERec[TKDE2018]

Paper: [**Heterogeneous Information Network Embedding for Recommendation**](https://ieeexplore.ieee.org/abstract/document/8355676)

Code from author: https://github.com/librahu/HERec

## How to run

```bash
python main.py -m HERec -t node_classification -d dblp4MAGNN -g 0
```

If you do not have a GPU, set `--gpu -1`.

## Performance

### Node Classification

| Dataset | Metapath | Macro-F1 | Micro-F1 |
| ------- | -------- | -------- | -------- |
| dblp    | APVPA    | 0.9277   | 0.9331   |
@@ -11,14 +11,37 @@
- Clone the Openhgnn-DGL

```bash
# For the node classification task
python main.py -m RGCN -t node_classification -d aifb -g 0 --use_best_config
# For the link prediction task
python main.py -m RGCN -t link_prediction -d HGBl-amazon -g 0 --use_best_config
```

If you do not have a GPU, set `--gpu -1`.

##### Supported dataset

###### Node classification:
[RDFDataset](../../dataset/#RDF_NodeCLassification)[aifb/mutag/bgs/am], [HGBn](../../dataset/#HGBn) and other datasets for node classification.

###### Link prediction:

## Performance

#### Task: Node classification

Evaluation metric: accuracy

| Method   | AIFB  | MUTAG | BGS   | AM    |
| -------- | ----- | ----- | ----- | ----- |
| **RGCN** | 97.22 | 72.06 | 96.55 | 88.89 |

-d means dataset; candidate datasets: aifb/mutag/bgs/am. Refer to [RDFDataset](../../dataset/#RDF_NodeCLassification) for more info.

#### Task: Link prediction
## Performance: Node classification

Evaluation metric: roc_auc

| Method               | AIFB  | MUTAG | BGS   | AM    |
| -------------------- | ----- | ----- | ----- | ----- |
@@ -1 +1,25 @@
# Metapath2vec[KDD2017]

Paper: [**metapath2vec: Scalable Representation Learning for Heterogeneous Networks**](https://ericdongyx.github.io/metapath2vec/m2v.html)

Code from author: https://www.dropbox.com/s/w3wmo2ru9kpk39n/code_metapath2vec.zip

Code from dgl Team: https://github.com/dmlc/dgl/tree/master/examples/pytorch/metapath2vec

## How to run

```bash
python main.py -m Metapath2vec -t node_classification -d dblp4MAGNN -g 0
```

If you do not have a GPU, set `--gpu -1`.

## Performance

### Node Classification

| Dataset | Metapath | Macro-F1 | Micro-F1 |
| ------- | -------- | -------- | -------- |
| dblp    | APVPA    | 0.9241   | 0.9288   |
@@ -0,0 +1,243 @@
import networkx as nx
import random
import torch
from tqdm import tqdm
import numpy as np
from itertools import combinations


class HDESampler:
    """
    Sampler for the HDE model. Its function is to perform data preprocessing
    and compute the HDE for each node. Please note that for different target node sets,
    the HDE of a node can be different.
    For more details, please refer to the original paper: http://www.shichuan.org/doc/116.pdf
    """
    def __init__(self, that):
        self.mini_batch = []
        self.type2idx = that.type2idx
        self.node_type = that.node_type
        self.target_link = that.target_link
        self.num_fea = that.num_fea
        self.sample_size = that.sample_size
        self.hg = that.hg
        self.device = that.device

    def preprocess(self):
        self.preprocess_feature()

    def dgl2nx(self):
        """
        Convert the dgl graph data to a networkx data structure.
        :return:
        """
        test_ratio = 0.1
        nx_g = nx.Graph()
        sp = 1 - test_ratio * 2
        edges = self.hg.edges(etype=self.target_link[0][1])
        a_list = edges[0]
        b_list = edges[1]
        edge_list = []
        for i in range(self.hg.num_edges(self.target_link[0][1])):
            edge_list.append(['A' + str(int(a_list[i])), 'B' + str(int(b_list[i]))])

        num_edge = len(edge_list)
        sp1 = int(num_edge * sp)
        sp2 = int(num_edge * test_ratio)
        self.g_train = nx.Graph()
        self.g_val = nx.Graph()
        self.g_test = nx.Graph()
        self.g_train.add_edges_from(edge_list[:sp1])
        self.g_val.add_edges_from(edge_list[sp1:sp1 + sp2])
        self.g_test.add_edges_from(edge_list[sp1 + sp2:])

    def compute_hde(self, args):
        """
        Compute the HDE for the training, validation and test sets.
        :param args: arguments
        :return:
        """
        print("Computing HDE for training set...")
        self.data_A_train, self.data_B_train, self.data_y_train = self.batch_data(self.g_train, args)
        print("Computing HDE for validation set...")
        self.data_A_val, self.data_B_val, self.data_y_val = self.batch_data(self.g_val, args)
        print("Computing HDE for test set...")
        self.data_A_test, self.data_B_test, self.data_y_test = self.batch_data(self.g_test, args)

        val_batch_A_fea = self.data_A_val.reshape(-1, self.sample_size)
        val_batch_B_fea = self.data_B_val.reshape(-1, self.sample_size)
        val_batch_y = self.data_y_val.reshape(-1)
        self.val_batch_A_fea = torch.FloatTensor(val_batch_A_fea).to(self.device)
        self.val_batch_B_fea = torch.FloatTensor(val_batch_B_fea).to(self.device)
        self.val_batch_y = torch.LongTensor(val_batch_y).to(self.device)

        test_batch_A_fea = self.data_A_test.reshape(-1, self.sample_size)
        test_batch_B_fea = self.data_B_test.reshape(-1, self.sample_size)
        test_batch_y = self.data_y_test.reshape(-1)
        self.test_batch_A_fea = torch.FloatTensor(test_batch_A_fea).to(self.device)
        self.test_batch_B_fea = torch.FloatTensor(test_batch_B_fea).to(self.device)
        self.test_batch_y = torch.LongTensor(test_batch_y).to(self.device)

        return self.data_A_train, self.data_B_train, self.data_y_train, \
            self.val_batch_A_fea, self.val_batch_B_fea, self.val_batch_y, \
            self.test_batch_A_fea, self.test_batch_B_fea, self.test_batch_y
    def batch_data(self, g, args):
        """
        Generate batch data.
        :param g: graph data
        :param args: arguments
        :return:
        """
        edge = list(g.edges)
        nodes = list(g.nodes)
        num_batch = int(len(edge) * 2 / args.batch_size)
        random.shuffle(edge)
        data = []
        edge = edge[0: num_batch * args.batch_size // 2]
        for bx in tqdm(edge):
            # One positive pair plus one uniformly corrupted negative pair per edge.
            posA, posB = self.subgraph_sampling_with_DE_node_pair(g, bx, args)
            data.append([posA, posB, 1])
            neg_tmpB_id = random.choice(nodes)
            negA, negB = self.subgraph_sampling_with_DE_node_pair(g,
                                                                  [bx[0], neg_tmpB_id],
                                                                  args)
            data.append([negA, negB, 0])
        random.shuffle(data)
        data = np.array(data)
        data = data.reshape(num_batch, args.batch_size, 3)
        data_A = data[:, :, 0].tolist()
        data_B = data[:, :, 1].tolist()
        data_y = data[:, :, 2].tolist()
        for i in range(len(data_A)):
            for j in range(len(data[0])):
                data_A[i][j] = data_A[i][j].tolist()
                data_B[i][j] = data_B[i][j].tolist()
        data_A = np.squeeze(np.array(data_A))
        data_B = np.squeeze(np.array(data_B))
        data_y = np.squeeze(np.array(data_y))
        return data_A, data_B, data_y
    def subgraph_sampling_with_DE_node_pair(self,
                                            G,
                                            node_pair,
                                            args):
        """
        Compute the distance encoding given a target node pair.
        :param G: graph data
        :param node_pair: target node pair
        :param args: arguments
        :return:
        """
        [A, B] = node_pair
        A_ego = nx.ego_graph(G, A, radius=args.k_hop)
        B_ego = nx.ego_graph(G, B, radius=args.k_hop)
        sub_G_for_AB = nx.compose(A_ego, B_ego)
        sub_G_for_AB.remove_edges_from(combinations(node_pair, 2))

        sub_G_nodes = sub_G_for_AB.nodes
        SPD_based_on_node_pair = {}
        for node in sub_G_nodes:
            tmpA = self.dist_encoder(A, node, sub_G_for_AB, args)
            tmpB = self.dist_encoder(B, node, sub_G_for_AB, args)
            SPD_based_on_node_pair[node] = np.concatenate([tmpA, tmpB], axis=0)

        A_fea_batch = self.gen_fea_batch(sub_G_for_AB,
                                         A,
                                         SPD_based_on_node_pair,
                                         args)
        B_fea_batch = self.gen_fea_batch(sub_G_for_AB,
                                         B,
                                         SPD_based_on_node_pair,
                                         args)
        return A_fea_batch, B_fea_batch
    def gen_fea_batch(self, G, root, fea_dict, args):
        """
        Neighbor sampling for each node. The neighbor features are concatenated here
        and separated again in the model.
        :param G: graph data
        :param root: root node
        :param fea_dict: node features
        :param args: arguments
        :return:
        """
        fea_batch = []
        self.mini_batch.append([root])
        # The root node carries only its type one-hot; the HDE slots are zero-padded.
        a = [0] * (self.num_fea - self.node_type) + self.type_encoder(root)
        fea_batch.append(np.asarray(a,
                                    dtype=np.float32
                                    ).reshape(-1, self.num_fea)
                         )
        # 1-order
        ns_1 = [list(np.random.choice(list(G.neighbors(node)) + [node],
                                      args.num_neighbor,
                                      replace=True))
                for node in self.mini_batch[-1]]
        self.mini_batch.append(ns_1[0])
        de_1 = [
            np.concatenate([fea_dict[dest], np.asarray(self.type_encoder(dest))], axis=0)
            for dest in ns_1[0]
        ]
        fea_batch.append(np.asarray(de_1,
                                    dtype=np.float32).reshape(1, -1)
                         )
        # 2-order
        ns_2 = [list(np.random.choice(list(G.neighbors(node)) + [node],
                                      args.num_neighbor,
                                      replace=True))
                for node in self.mini_batch[-1]]
        de_2 = []
        for i in range(len(ns_2)):
            tmp = []
            for j in range(len(ns_2[0])):
                tmp.append(
                    np.concatenate(
                        [fea_dict[ns_2[i][j]], np.asarray(self.type_encoder(ns_2[i][j]))],
                        axis=0)
                )
            de_2.append(tmp)
        fea_batch.append(np.asarray(de_2,
                                    dtype=np.float32).reshape(1, -1)
                         )
        return np.concatenate(fea_batch, axis=1)
    def dist_encoder(self, src, dest, G, args):
        """
        Compute the heterogeneous SPD (H_SPD) for a node pair.
        :param src: source node
        :param dest: target node
        :param G: graph data
        :param args: arguments
        :return:
        """
        paths = list(nx.all_simple_paths(G, src, dest, cutoff=args.max_dist + 1))
        cnt = [args.max_dist] * self.node_type  # truncate the SPD at max_dist
        for path in paths:
            res = [0] * self.node_type
            for i in range(1, len(path)):
                tmp = path[i][0]
                res[self.type2idx[tmp]] += 1
            # print(path, res)
            for k in range(self.node_type):
                cnt[k] = min(cnt[k], res[k])
        one_hot_list = [np.eye(args.max_dist + 1, dtype=np.float64)[cnt[i]]
                        for i in range(self.node_type)]
        return np.concatenate(one_hot_list)
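    # Note on shapes (an observation derived from the code above, not from the paper):
    # dist_encoder returns a vector of length node_type * (max_dist + 1), one truncated
    # one-hot SPD per node type. subgraph_sampling_with_DE_node_pair concatenates the
    # encodings w.r.t. both endpoints A and B, and gen_fea_batch appends the type
    # one-hot, giving num_fea = 2 * node_type * (max_dist + 1) + node_type per node.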
    def type_encoder(self, node):
        """
        Perform one-hot encoding based on the node type.
        :param node: node name whose first character encodes its type, e.g. 'A12'
        :return: a one-hot list of length node_type
        """
        res = [0] * self.node_type
        res[self.type2idx[node[0]]] = 1.0
        return res
@@ -4,19 +4,16 @@ from torch.utils.data import Dataset
import torch


class Metapath2VecSampler(Dataset):
    def __init__(self, hg, metapath, start_ntype: str, rw_length: int, rw_walks: int, window_size: int, neg_size: int):
class RandomWalkSampler(Dataset):
    def __init__(self, g, rw_walks: int, window_size: int, neg_size: int, metapath=None, rw_prob=None, rw_length=None):
        self.hg = hg
        self.g = g
        self.start_idx = [0]
        for i, num_nodes in enumerate([hg.num_nodes(ntype) for ntype in hg.ntypes]):
            if i < len(hg.ntypes) - 1:
        for i, num_nodes in enumerate([g.num_nodes(ntype) for ntype in g.ntypes]):
            if i < len(g.ntypes) - 1:
                self.start_idx.append(num_nodes + self.start_idx[i])
        self.node_frequency = [0] * hg.num_nodes()
        self.metapath = metapath
        self.start_ntype = start_ntype
        self.traces = []
        self.negatives = []
        self.node_frequency = [0] * g.num_nodes()
        self.rw_prob = rw_prob
        self.rw_length = rw_length
        self.rw_walks = rw_walks
        self.window_size = window_size
@@ -24,18 +21,28 @@ class Metapath2VecSampler(Dataset):
        self.neg_pos = 0
        self.neg_size = neg_size
        if metapath is not None:
            start_ntype = self.g.to_canonical_etype(metapath[0])[0]
            self.start_nodes = list(range(self.g.num_nodes(start_ntype))) * self.rw_walks
        else:
            self.start_nodes = list(range(self.g.num_nodes())) * self.rw_walks
        self.metapath = metapath
        self.traces = []
        self.negatives = []
        self.discards = []
        self.trace_ntype = []
        self.traces_idx_of_all_types = []
        self.__generate_metapath()
        self.__generate_node_frequency()
        self.__generate_negatives()
        self.__generate_discards()

    def __generate_metapath(self):
        start_nodes = list(
            range(self.hg.num_nodes(self.start_ntype))) * self.rw_walks  # start_nodes + [i] * self.rw_walks
        self.traces, self.trace_ntype = dgl.sampling.random_walk(g=self.hg, nodes=start_nodes,
                                                                 metapath=self.metapath * self.rw_length)
        self.traces, self.trace_ntype = dgl.sampling.random_walk(g=self.g, nodes=self.start_nodes,
                                                                 metapath=self.metapath, prob=self.rw_prob,
                                                                 length=self.rw_length)
        self.traces_idx_of_all_types = torch.index_select(torch.tensor(self.start_idx), 0, self.trace_ntype).repeat(
            len(self.traces), 1) + self.traces
@@ -56,20 +63,26 @@ class Metapath2VecSampler(Dataset):
        np.random.shuffle(self.negatives)
        self.sampling_prob = ratio

    def __generate_discards(self):
        # Sub-sampling of frequent nodes, as in word2vec (t is the frequency threshold).
        t = 0.0001
        f = np.array(self.node_frequency) / sum(self.node_frequency)
        self.discards = np.sqrt(t / f) + (t / f)
        return

    def get_center_context_negatives(self, idx):
        # return all (center, context, 5 negatives) on one trace
        # return all (center, context, n negatives) on one trace
        trace = self.traces_idx_of_all_types[idx]
        pair_catch = []
        for i, u in enumerate(trace):
            if np.random.rand() > self.discards[u]:
                continue
            for j, v in enumerate(
                    trace[max(i - self.window_size, 0):i + self.window_size]):
                # TODO: why can idx be < 0? (dgl.sampling.random_walk pads early-terminated traces with -1)
                if u < 0 or v < 0:
                    # print('warning: less than 0')
                    continue
                if u >= self.hg.num_nodes() or v >= self.hg.num_nodes():
                    # print('warning: bigger than num_nodes')
                if u >= self.g.num_nodes() or v >= self.g.num_nodes():
                    continue
                pair_catch.append((int(u), int(v), self.__get_negatives()))
        return pair_catch
@@ -1,5 +1,6 @@
import dgl
from .mp2vec_sampler import Metapath2VecSampler
from .random_walk_sampler import RandomWalkSampler
import torch

hg = dgl.heterograph({
    ('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
@@ -9,7 +10,21 @@ hg = dgl.heterograph({

def test_get_center_context_negatives():
    rw_walks = 5
    sampler = Metapath2VecSampler(hg=hg, metapath=['follow', 'view', 'viewed-by'], start_ntype='user', rw_length=3,
                                  rw_walks=rw_walks, window_size=3, neg_size=5)
    hetero_sampler = RandomWalkSampler(g=hg, metapath=['follow', 'view', 'viewed-by'] * 3,
                                       rw_walks=rw_walks, window_size=3, neg_size=5)
    for i in range(hg.num_nodes('user') * rw_walks):
        print(sampler.get_center_context_negatives(i))
        print(hetero_sampler.get_center_context_negatives(i))
    metapath = ['view', 'viewed-by']
    for i, elem in enumerate(metapath):
        if i == 0:
            adj = hg.adj(etype=elem)
        else:
            adj = torch.sparse.mm(adj, hg.adj(etype=elem))
    adj = adj.coalesce()
    g = dgl.graph(data=(adj.indices()[0], adj.indices()[1]))
    g.edata['rw_prob'] = adj.values()
    homo_sampler = RandomWalkSampler(g=g, rw_length=10, rw_walks=rw_walks, window_size=3, neg_size=5)
    for i in range(g.num_nodes() * rw_walks):
        print(homo_sampler.get_center_context_negatives(i))
@@ -1,4 +1,4 @@
from .utils import set_random_seed, set_best_config
from .utils import set_random_seed, set_best_config, Logger
from .trainerflow import build_flow
from .auto import hpo_experiment
@@ -6,12 +6,11 @@ from .auto import hpo_experiment

def OpenHGNN(args):
    if not getattr(args, 'seed', False):
        args.seed = 0
    set_random_seed(args.seed)
    args.logger = Logger(args)
    if getattr(args, "use_best_config", False):
        args = set_best_config(args)
    set_random_seed(args.seed)
    trainerflow = SpecificTrainerflow.get(args.model, args.task)
    print(args)
    if getattr(args, "use_hpo", False):
        # hyper-parameter search
        hpo_experiment(args, trainerflow)
@@ -31,6 +30,8 @@ SpecificTrainerflow = {
    'DMGI': 'DMGI_trainer',
    'KGCN': 'kgcntrainer',
    'Metapath2vec': 'mp2vec_trainer',
    'HERec': 'herec_trainer',
    'SLiCE': 'slicetrainer',
    'HeGAN': 'HeGAN_trainer',
    'HDE': 'hde_trainer',
}
@@ -1,4 +1,7 @@
import dgl
import torch as th
import torch.nn.functional as F
from dgl.dataloading.negative_sampler import Uniform

from . import BaseTask, register_task
from ..dataset import build_dataset
from ..utils import Evaluator
@@ -30,8 +33,27 @@ class LinkPrediction(BaseTask):
        self.dataset = build_dataset(args.dataset, 'link_prediction')
        # self.evaluator = Evaluator()
        self.train_hg, self.val_hg, self.test_hg = self.dataset.get_idx()
        self.val_hg = self.val_hg.to(args.device)
        self.test_hg = self.test_hg.to(args.device)
        self.evaluator = Evaluator(args.seed)
        if not hasattr(args, 'score_fn'):
            self.ScorePredictor = HeteroDistMultPredictor()
            args.score_fn = 'distmult'
        elif args.score_fn == 'dot-product':
            self.ScorePredictor = HeteroDotProductPredictor()
        elif args.score_fn == 'distmult':
            self.ScorePredictor = HeteroDistMultPredictor()
        self.negative_sampler = Uniform(1)
        if args.dataset in ['wn18', 'FB15k', 'FB15k-237']:
            self.evaluation_metric = 'mrr'
        else:
            self.evaluation_metric = 'roc_auc'
        args.logger.info('[Init Task] The task: link prediction, the dataset: {}, the evaluation metric: {}, '
                         'the score function: {}'.format(self.n_dataset, self.evaluation_metric, args.score_fn))

    def get_graph(self):
        return self.dataset.g
@@ -48,34 +70,156 @@ class LinkPrediction(BaseTask):
        elif name == 'roc_auc':
            return self.evaluator.cal_roc_auc

    def evaluate(self, name, logits):
    def evaluate(self, n_embedding, r_embedding=None, mode='test'):
        r"""
        Parameters
        ----------
        logits : th.Tensor
            the prediction of specific
            name
        n_embedding: th.Tensor
            the embedding of nodes
        r_embedding: th.Tensor
            the embedding of relation types
        mode: str
            the evaluation mode, train/valid/test

        Returns
        -------
        """
        if name == 'acc':
        if self.evaluation_metric == 'acc':
            return self.evaluator.author_link_prediction
        elif name == 'mrr':
            return self.evaluator.mrr_
        elif name == 'academic_lp':
            return self.evaluator.author_link_prediction(logits, self.dataset.train_batch, self.dataset.test_batch)
        elif self.evaluation_metric == 'mrr':
            return self.evaluator.mrr_(n_embedding['_N'], self.dict2emd(r_embedding), self.dataset.train_triplets,
                                       self.dataset.valid_triplets, self.dataset.test_triplets,
                                       hits=[1, 3, 10], eval_bz=100)
        elif self.evaluation_metric == 'academic_lp':
            return self.evaluator.author_link_prediction(n_embedding, self.dataset.train_batch, self.dataset.test_batch)
        elif self.evaluation_metric == 'roc_auc':
            if mode == 'test':
                eval_hg = self.test_hg
            elif mode == 'valid':
                eval_hg = self.val_hg
            else:
                raise ValueError('Mode error; supported modes: test and valid.')
            negative_graph = self.construct_negative_graph(eval_hg)
            p_score = th.sigmoid(self.ScorePredictor(eval_hg, n_embedding, r_embedding))
            n_score = th.sigmoid(self.ScorePredictor(negative_graph, n_embedding, r_embedding))
            p_label = th.ones(len(p_score))
            n_label = th.zeros(len(n_score))
            return self.evaluator.cal_roc_auc(th.cat((p_label, n_label)).cpu(), th.cat((p_score, n_score)).cpu())
        else:
            return self.evaluator.link_prediction

    def get_batch(self):
        return self.dataset.train_batch, self.dataset.test_batch

    def get_idx(self):
        return self.train_hg, self.val_hg, self.test_hg

    def get_train(self):
        return self.train_hg

    def get_labels(self):
        return self.dataset.get_labels()

    def dict2emd(self, r_embedding):
        r_emd = []
        for i in range(self.dataset.num_rels):
            r_emd.append(r_embedding[str(i)])
        return th.stack(r_emd).squeeze()

    def construct_negative_graph(self, hg):
        e_dict = {
            etype: hg.edges(etype=etype, form='eid')
            for etype in hg.canonical_etypes}
        neg_srcdst = self.negative_sampler(hg, e_dict)
        neg_pair_graph = dgl.heterograph(neg_srcdst,
                                         {ntype: hg.number_of_nodes(ntype) for ntype in hg.ntypes})
        return neg_pair_graph
class HeteroDotProductPredictor(th.nn.Module):
    """
    Reference: `DGL documentation <https://docs.dgl.ai/guide/training-link.html#heterogeneous-graphs>`_
    """
    def forward(self, edge_subgraph, x, *args, **kwargs):
        """
        Parameters
        ----------
        edge_subgraph: dgl.Heterograph
            the prediction graph that only contains the edges of the target link
        x: dict[str: th.Tensor]
            the embedding dict; the keys only contain the node types involved in the target link

        Returns
        -------
        score: th.Tensor
            the prediction for the edges in edge_subgraph
        """
        with edge_subgraph.local_scope():
            for ntype in edge_subgraph.ntypes:
                edge_subgraph.nodes[ntype].data['x'] = x[ntype]
            for etype in edge_subgraph.canonical_etypes:
                edge_subgraph.apply_edges(
                    dgl.function.u_dot_v('x', 'x', 'score'), etype=etype)
            score = edge_subgraph.edata['score']
            if isinstance(score, dict):
                result = []
                for _, value in score.items():
                    result.append(value)
                score = th.cat(result)
            return score.squeeze()
class HeteroDistMultPredictor(th.nn.Module):

    def forward(self, edge_subgraph, x, r_embedding, *args, **kwargs):
        r"""
        DistMult factorization (Yang et al. 2014) as the scoring function,
        which is known to perform well on standard link prediction benchmarks when used on its own.

        In DistMult, every relation r is associated with a diagonal matrix :math:`R_{r} \in \mathbb{R}^{d \times d}`
        and a triple (s, r, o) is scored as

        .. math::
            f(s, r, o) = e_{s}^{T} R_{r} e_{o}

        Parameters
        ----------
        edge_subgraph: dgl.Heterograph
            the prediction graph that only contains the edges of the target link
        x: dict[str: th.Tensor]
            the node embedding dict; the keys only contain the node types involved in the target link
        r_embedding: th.Tensor
            the embeddings of all relation types

        Returns
        -------
        score: th.Tensor
            the prediction for the edges in edge_subgraph
        """
        with edge_subgraph.local_scope():
            for ntype in edge_subgraph.ntypes:
                edge_subgraph.nodes[ntype].data['x'] = x[ntype]
            for etype in edge_subgraph.canonical_etypes:
                e = r_embedding[etype[1]]
                n = edge_subgraph.num_edges(etype)
                if 1 == len(edge_subgraph.canonical_etypes):
                    edge_subgraph.edata['e'] = e.expand(n, -1)
                else:
                    edge_subgraph.edata['e'] = {etype: e.expand(n, -1)}
                edge_subgraph.apply_edges(
                    dgl.function.u_mul_e('x', 'e', 's'), etype=etype)
                edge_subgraph.apply_edges(
                    dgl.function.e_mul_v('s', 'x', 'score'), etype=etype)
            score = edge_subgraph.edata['score']
            if isinstance(score, dict):
                result = []
                for _, value in score.items():
                    result.append(th.sum(value, dim=1))
                score = th.cat(result)
            else:
                score = th.sum(score, dim=1)
            return score
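A hedged usage sketch of the predictor above on a toy heterograph (the graph, the relation name `r0`, and the embeddings are made up for illustration):

```python
import dgl
import torch as th

# Toy prediction graph with one relation 'r0' between 'user' and 'item'.
g = dgl.heterograph({('user', 'r0', 'item'): (th.tensor([0, 1]), th.tensor([1, 2]))},
                    num_nodes_dict={'user': 2, 'item': 3})
d = 8
x = {'user': th.randn(2, d), 'item': th.randn(3, d)}  # node embeddings
r_embedding = {'r0': th.randn(1, d)}                   # diagonal of R_r per relation
predictor = HeteroDistMultPredictor()
score = predictor(g, x, r_embedding)
print(score.shape)  # torch.Size([2]): one score per edge
```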
@@ -59,7 +59,9 @@ SUPPORTED_FLOWS = {
    'kgcntrainer': 'openhgnn.trainerflow.kgcn_trainer',
    'HeGAN_trainer': 'openhgnn.trainerflow.HeGAN_trainer',
    'mp2vec_trainer': 'openhgnn.trainerflow.mp2vec_trainer',
    'herec_trainer': 'openhgnn.trainerflow.herec_trainer',
    'HeCo_trainer': 'openhgnn.trainerflow.HeCo_trainer',
    'DMGI_trainer': 'openhgnn.trainerflow.DMGI_trainer',
    'slicetrainer':'openhgnn.trainerflow.slice_trainer',
    'slicetrainer': 'openhgnn.trainerflow.slice_trainer',
    'hde_trainer': 'openhgnn.trainerflow.hde_trainer',
}
@@ -14,17 +14,27 @@ class BaseFlow(ABC):
    }

    def __init__(self, args):
        """
        Parameters
        ----------
        args

        Attributes
        -------------
        evaluate_interval: int
            the interval of evaluation in validation
        """
        super(BaseFlow, self).__init__()
        self.evaluator = None
        self.evaluate_interval = 1
        self.load_from_checkpoint = True
        if hasattr(args, '_checkpoint'):
            self._checkpoint = os.path.join(args._checkpoint,
                                            f"{args.model}_{args.dataset}.pt")
        else:
            if self.load_from_checkpoint:
            if hasattr(args, 'load_from_pretrained'):
                self._checkpoint = os.path.join("./openhgnn/output/{}".format(args.model),
                                                f"{args.model}_{args.dataset}.pt")
                                                f"{args.model}_{args.dataset}_{args.task}.pt")
            else:
                self._checkpoint = None
@@ -32,6 +42,7 @@ class BaseFlow(ABC):
            args.HGB_results_path = os.path.join("./openhgnn/output/{}/{}_{}.txt".format(args.model, args.dataset[5:], args.seed))

        self.args = args
        self.logger = self.args.logger
        self.model_name = args.model
        self.device = args.device
        self.task = build_task(args)
@@ -43,7 +54,7 @@ class BaseFlow(ABC):
        self.optimizer = None
        self.loss_fn = self.task.get_loss_fn()

    def preprocess_feature(self):
    def preprocess(self):
        r"""
        Every trainerflow should run this method if it wants feature preprocessing.
        The parameters in input_feature will be added to the optimizer, and input_feature will be added to the model.
@@ -66,38 +77,58 @@ class BaseFlow(ABC):
        if hasattr(self.args, 'feat'):
            pass
        else:
            # Default 0, nothing to do.
            self.args.feat = 0
        if self.args.feat == 0:
            print("feat0, pass!")
            if isinstance(self.hg.ndata['h'], dict):
                self.input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg),
                                                   self.args.hidden_dim, act=act).to(self.device)
            elif isinstance(self.hg.ndata['h'], torch.Tensor):
                self.input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg),
                                                   self.args.hidden_dim, act=act).to(self.device)
        self.feature_preprocess(act)
        self.optimizer.add_param_group({'params': self.input_feature.parameters()})
        # for early stop, load the model with the input_feature module.
        self.model.add_module('input_feature', self.input_feature)
        self.load_from_pretrained()

    def feature_preprocess(self, act):
        """
        Select the input features according to ``args.feat``:
        0 - use the raw node features of every node type;
        1 - preserve the raw features of the target node type only (node classification);
        2 - drop all raw features and learn embeddings instead.
        """
        if self.hg.ndata.get('h', {}) == {} or self.args.feat == 2:
            if self.hg.ndata.get('h', {}) == {}:
                self.logger.feature_info('Assign embedding as features, because hg.ndata is empty.')
            else:
                self.logger.feature_info('feat2, drop features!')
                self.hg.ndata.pop('h')
            self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
                                               act=act).to(self.device)
        elif self.args.feat == 0:
            self.input_feature = self.init_feature(act)
        elif self.args.feat == 1:
            if self.args.task != 'node_classification':
                print('it is only for node classification task!')
                self.logger.feature_info('\'feat 1\' is only for the node classification task, set feat 0!')
                self.input_feature = self.init_feature(act)
            else:
                h_dict = self.hg.ndata.pop('h')
                print('feat1, preserve target nodes!')
                self.logger.feature_info('feat1, preserve target nodes!')
                self.input_feature = HeteroFeature({self.category: h_dict[self.category]}, get_nodes_dict(self.hg), self.args.hidden_dim,
                                                   act=act).to(self.device)
        elif self.args.feat == 2:
            self.hg.ndata.pop('h')
            self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
                                               act=act).to(self.device)
            print('feat2, drop features!')
        # if isinstance(self.hg.ndata['h'], dict):
        #     self.input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg), self.args.hidden_dim, act=act).to(self.device)
        # elif isinstance(self.hg.ndata['h'], torch.Tensor):
        #     self.input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg), self.args.hidden_dim, act=act).to(self.device)
        # else:
        #     self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
        #                                        act=act).to(self.device)
        self.optimizer.add_param_group({'params': self.input_feature.parameters()})
        # for early stop, load the model with the input_feature module.
        self.model.add_module('input_feature', self.input_feature)

    def init_feature(self, act):
        self.logger.feature_info("Feat is 0, nothing to do!")
        if isinstance(self.hg.ndata['h'], dict):
            # The heterogeneous graph contains more than one node type.
            input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg),
                                          self.args.hidden_dim, act=act).to(self.device)
        elif isinstance(self.hg.ndata['h'], torch.Tensor):
            # The heterogeneous graph contains only one node type.
            input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg),
                                          self.args.hidden_dim, act=act).to(self.device)
        return input_feature

    @abstractmethod
    def train(self):
@@ -128,13 +159,15 @@ class BaseFlow(ABC):
        raise NotImplementedError

    def load_from_pretrained(self):
        if self.load_from_checkpoint:
        if self.args.load_from_pretrained:
            try:
                ck_pt = torch.load(self._checkpoint)
                self.model.load_state_dict(ck_pt)
                self.logger.info('[Load Model] Load model from pretrained model:' + self._checkpoint)
            except FileNotFoundError:
                print(f"'{self._checkpoint}' doesn't exist")
        return self.model
                self.logger.info('[Load Model] Do not load the model from pretrained, '
                                 '{} doesn\'t exist'.format(self._checkpoint))
        # return self.model
    def save_checkpoint(self):
        if self._checkpoint and hasattr(self.model, "_parameters"):
@@ -0,0 +1,107 @@ | |||
import torch as th | |||
from torch import nn | |||
from tqdm import tqdm | |||
import torch | |||
from . import BaseFlow, register_flow | |||
from ..models import build_model | |||
from ..utils import extract_embed, EarlyStopping, get_nodes_dict | |||
import torch.optim as optim | |||
from sklearn.metrics import roc_auc_score | |||
from ..sampler import hde_sampler | |||
@register_flow("hde_trainer") | |||
class hde_trainer(BaseFlow): | |||
""" | |||
HDE trainer flow. | |||
Supported Model: HDE | |||
Supported Dataset:imdb4hde | |||
The trainerflow supports link prediction task. It first calculates HDE for every node in the graph, | |||
and uses the HDE as a part of the initial feature for each node. | |||
And then it performs standard message passing and link prediction operations. | |||
Please notice that for different target node set, the HDE for each node can be different. | |||
For more details, please refer to the original paper: http://www.shichuan.org/doc/116.pdf | |||
""" | |||
def __init__(self, args): | |||
super(hde_trainer, self).__init__(args) | |||
self.target_link = self.task.dataset.target_link | |||
self.loss_fn = self.task.get_loss_fn() | |||
self.args.out_node_type = self.task.dataset.out_ntypes | |||
self.type2idx = {'A': 0, 'B': 1} | |||
self.node_type = len(self.type2idx) | |||
self.num_fea = (self.node_type * (args.max_dist + 1)) * 2 + self.node_type | |||
self.sample_size = self.num_fea * (1 + args.num_neighbor + args.num_neighbor * args.num_neighbor) | |||
self.args.patience = 10 | |||
args.input_dim = self.num_fea | |||
args.output_dim = args.emb_dim // 2 | |||
self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device) | |||
# initialization | |||
for m in self.model.modules(): | |||
if isinstance(m, (nn.Conv2d, nn.Linear)): | |||
nn.init.kaiming_normal_(m.weight, mode='fan_in') | |||
self.loss_fn = nn.CrossEntropyLoss().to(self.device) | |||
self.optimizer = optim.Adam(self.model.parameters(), lr=args.lr, weight_decay=1e-2) | |||
self.evaluator = roc_auc_score | |||
self.HDE_sampler = hde_sampler.HDESampler(self) | |||
self.HDE_sampler.dgl2nx() | |||
self.data_A_train, self.data_B_train, self.data_y_train, \ | |||
self.val_batch_A_fea, self.val_batch_B_fea, self.val_batch_y, \ | |||
self.test_batch_A_fea, self.test_batch_B_fea, self.test_batch_y = self.HDE_sampler.compute_hde(args) | |||
def train(self): | |||
        epoch_iter = tqdm(range(self.max_epoch), ncols=80)
        stopper = EarlyStopping(self.args.patience, self._checkpoint)
        for epoch in epoch_iter:
loss = self._mini_train_step() | |||
if epoch % 2 == 0: | |||
val_metric = self._test_step('valid') | |||
epoch_iter.set_description( | |||
f"Epoch: {epoch:03d}, roc_auc: {val_metric:.4f}, Loss:{loss:.4f}" | |||
) | |||
early_stop = stopper.step_score(val_metric, self.model) | |||
if early_stop: | |||
print('Early Stop!\tEpoch:' + str(epoch)) | |||
break | |||
print(f"Valid_score_ = {stopper.best_score: .4f}") | |||
stopper.load_model(self.model) | |||
test_auc = self._test_step(split="test") | |||
val_auc = self._test_step(split="valid") | |||
print(f"Test roc_auc = {test_auc:.4f}") | |||
return dict(Test_mrr=test_auc, Val_mrr=val_auc) | |||
def _mini_train_step(self,): | |||
self.model.train() | |||
all_loss = 0 | |||
for (train_batch_A_fea, train_batch_B_fea, train_batch_y) in zip(self.data_A_train, self.data_B_train, self.data_y_train): | |||
# train | |||
self.model.train() | |||
train_batch_A_fea = torch.FloatTensor(train_batch_A_fea).to(self.device) | |||
train_batch_B_fea = torch.FloatTensor(train_batch_B_fea).to(self.device) | |||
train_batch_y = torch.LongTensor(train_batch_y).to(self.device) | |||
logits = self.model(train_batch_A_fea, train_batch_B_fea) | |||
loss = self.loss_fn(logits, train_batch_y.squeeze()) | |||
self.optimizer.zero_grad() | |||
loss.backward() | |||
self.optimizer.step() | |||
all_loss += loss.item() | |||
return all_loss | |||
    def _test_step(self, split=None):
self.model.eval() | |||
with th.no_grad(): | |||
if split == 'valid': | |||
data_A_eval = self.val_batch_A_fea | |||
data_B_eval = self.val_batch_B_fea | |||
data_y_eval = self.val_batch_y | |||
elif split == 'test': | |||
data_A_eval = self.test_batch_A_fea | |||
data_B_eval = self.test_batch_B_fea | |||
data_y_eval = self.test_batch_y | |||
logits = self.model(data_A_eval, data_B_eval) | |||
pred = logits.argmax(dim=1) | |||
metric = self.evaluator(data_y_eval.cpu().numpy(), pred.cpu().numpy()) | |||
return metric |
@@ -0,0 +1,87 @@ | |||
import os | |||
import numpy | |||
from tqdm import tqdm | |||
import torch.sparse as sparse | |||
import torch.optim as optim | |||
from torch.utils.data import DataLoader | |||
from ..models import build_model | |||
from . import BaseFlow, register_flow | |||
from ..sampler import random_walk_sampler | |||
import dgl | |||
@register_flow("herec_trainer") | |||
class HERecTrainer(BaseFlow): | |||
def __init__(self, args): | |||
super(HERecTrainer, self).__init__(args) | |||
self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device) | |||
self.random_walk_sampler = None | |||
self.dataloader = None | |||
self.metapath = self.task.dataset.meta_paths_dict[self.args.meta_path_keys[0]] | |||
self.output_dir = './openhgnn/output/' + self.model_name | |||
self.embeddings_file_path = os.path.join(self.output_dir, self.args.dataset + '_' + | |||
self.args.meta_path_keys[0] + '_herec_embeddings.npy') | |||
self.load_trained_embeddings = False | |||
def preprocess(self): | |||
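        # Compose the metapath into a single homogeneous graph: chain the sparse adjacency
        # matrix of each edge type along the metapath, and keep the entries of the product
        # as random-walk transition weights ('rw_prob').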
for i, elem in enumerate(self.metapath): | |||
if i == 0: | |||
adj = self.hg.adj(etype=elem) | |||
else: | |||
adj = sparse.mm(adj, self.hg.adj(etype=elem)) | |||
adj = adj.coalesce() | |||
g = dgl.graph(data=(adj.indices()[0], adj.indices()[1])) | |||
g.edata['rw_prob'] = adj.values() | |||
self.random_walk_sampler = random_walk_sampler.RandomWalkSampler(g=g.to('cpu'), | |||
rw_length=self.args.rw_length, | |||
rw_walks=self.args.rw_walks, | |||
window_size=self.args.window_size, | |||
neg_size=self.args.neg_size, rw_prob='rw_prob') | |||
self.dataloader = DataLoader(self.random_walk_sampler, batch_size=self.args.batch_size, | |||
shuffle=True, num_workers=self.args.num_workers, | |||
collate_fn=self.random_walk_sampler.collate) | |||
def train(self): | |||
emb = self.load_embeddings() | |||
# todo: only supports node classification now | |||
self.task.evaluate(logits=emb, name='f1_lr') | |||
def load_embeddings(self): | |||
if not self.load_trained_embeddings or not os.path.exists(self.embeddings_file_path): | |||
self.train_embeddings() | |||
emb = numpy.load(self.embeddings_file_path) | |||
return emb | |||
def train_embeddings(self): | |||
self.preprocess() | |||
optimizer = optim.SparseAdam(list(self.model.parameters()), lr=self.args.lr) | |||
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader)) | |||
for epoch in range(self.max_epoch): | |||
print('\n\n\nEpoch: ' + str(epoch + 1)) | |||
running_loss = 0.0 | |||
for i, sample_batched in enumerate(tqdm(self.dataloader)): | |||
if len(sample_batched[0]) > 1: | |||
pos_u = sample_batched[0].to(self.device) | |||
pos_v = sample_batched[1].to(self.device) | |||
neg_v = sample_batched[2].to(self.device) | |||
                    optimizer.zero_grad()
                    loss = self.model.forward(pos_u, pos_v, neg_v)
                    loss.backward()
                    optimizer.step()
                    scheduler.step()
running_loss = running_loss * 0.9 + loss.item() * 0.1 | |||
if i > 0 and i % 50 == 0: | |||
print(' Loss: ' + str(running_loss)) | |||
self.model.save_embedding(self.embeddings_file_path) |
@@ -6,7 +6,6 @@ import torch | |||
import torch.nn.functional as F | |||
from . import BaseFlow, register_flow | |||
from ..models import build_model | |||
from ..utils import extract_embed, EarlyStopping, get_nodes_dict, add_reverse_edges | |||
@@ -14,7 +13,8 @@ from ..utils import extract_embed, EarlyStopping, get_nodes_dict, add_reverse_ed | |||
class LinkPrediction(BaseFlow): | |||
""" | |||
    Link Prediction trainer flow.
    Here is a tutorial that teaches you how to train a GNN for
    `link prediction <https://docs.dgl.ai/en/latest/tutorials/blitz/4_link_predict.html>`_.
When training, you will need to remove the edges in the test set from the original graph. | |||
DGL recommends you to treat the pairs of nodes as another graph, since you can describe a pair of nodes with an edge. | |||
@@ -27,34 +27,57 @@ class LinkPrediction(BaseFlow): | |||
""" | |||
def __init__(self, args): | |||
""" | |||
        Attributes
        ----------
        target_link: list
            List of edge types that are the target link types to be predicted.
        score_fn: str
            Score function used to calculate the scores of links; supported functions:
            distmult (default if not specified) and dot product.
        r_embedding: nn.ParameterDict
            In DistMult, the representations of edge types are involved in calculating the scores.
            General models do not generate representations of edge types, so we generate the
            embeddings of edge types. The dimension of the embedding is `self.args.hidden_dim`.
""" | |||
super(LinkPrediction, self).__init__(args) | |||
self.target_link = self.task.dataset.target_link | |||
self.loss_fn = self.task.get_loss_fn() | |||
        self.args.out_node_type = self.task.dataset.out_ntypes
        self.args.out_dim = self.args.hidden_dim
        self.train_hg, self.val_hg, self.test_hg = self.task.get_idx()
        self.train_hg = add_reverse_edges(self.train_hg)
        self.model = build_model(self.model_name).build_model_from_args(self.args, self.train_hg).to(self.device)
        if not hasattr(self.args, 'score_fn'):
            self.args.score_fn = 'distmult'
        if self.args.score_fn == 'distmult':
            # In DistMult, the representations of edge types are involved in calculating the scores.
            # General models do not generate representations of edge types, so we generate the
            # embeddings of edge types here.
            self.r_embedding = nn.ParameterDict({etype[1]: nn.Parameter(th.Tensor(1, self.args.hidden_dim))
                                                 for etype in self.hg.canonical_etypes}).to(self.device)
            for _, para in self.r_embedding.items():
                nn.init.xavier_uniform_(para)
        else:
            self.r_embedding = None
        self.evaluator = self.task.get_evaluator('roc_auc')
        self.optimizer = self.candidate_optimizer[args.optimizer](self.model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
        if self.args.score_fn == 'distmult':
            self.optimizer.add_param_group({'params': self.r_embedding.parameters()})
        self.patience = args.patience
        self.max_epoch = args.max_epoch
        self.train_hg = self.train_hg.to(self.device)
        self.val_hg = self.val_hg.to(self.device)
        self.test_hg = self.test_hg.to(self.device)
        self.positive_graph = self.train_hg.edge_type_subgraph(self.target_link)
def preprocess(self): | |||
@@ -62,47 +85,56 @@ class LinkPrediction(BaseFlow): | |||
In link prediction, you will have a positive graph consisting of all the positive examples as edges, | |||
and a negative graph consisting of all the negative examples. | |||
The positive graph and the negative graph will contain the same set of nodes as the original graph. | |||
        """
super(LinkPrediction, self).preprocess() | |||
def train(self): | |||
self.preprocess() | |||
        stopper = EarlyStopping(self.patience, self._checkpoint)
        for epoch in tqdm(range(self.max_epoch)):
if self.args.mini_batch_flag: | |||
loss = self._mini_train_step() | |||
else: | |||
loss = self._full_train_setp() | |||
            if epoch % self.evaluate_interval == 0:
val_metric = self._test_step('valid') | |||
                self.logger.train_info(f"Epoch: {epoch:03d}, roc_auc: {val_metric:.4f}, Loss:{loss:.4f}")
early_stop = stopper.step_score(val_metric, self.model) | |||
if early_stop: | |||
                    self.logger.train_info(f'Early Stop!\tEpoch:{epoch:03d}')
break | |||
print(f"Valid_score_ = {stopper.best_score: .4f}") | |||
self.logger.train_info(f"Valid score = {stopper.best_score: .4f}") | |||
stopper.load_model(self.model) | |||
        if self.args.dataset in ['HGBl-amazon', 'HGBl-LastFM', 'HGBl-PubMed']:
            # Test in HGB datasets.
self.model.eval() | |||
with torch.no_grad(): | |||
val_metric = self._test_step('valid') | |||
h_dict = self.model.input_feature() | |||
embedding = self.model(self.hg, h_dict) | |||
                score = th.sigmoid(self.task.ScorePredictor(self.task.test_hg, embedding, self.r_embedding))
                self.task.dataset.save_results(hg=self.task.test_hg, score=score, file_path=self.args.HGB_results_path)
return val_metric, val_metric, epoch | |||
        test_score = self._test_step(split="test")
        val_score = self._test_step(split="valid")
        self.logger.train_info(f"Test score = {test_score:.4f}")
        return dict(Test_mrr=test_score, Val_mrr=val_score)
def _full_train_setp(self): | |||
self.model.train() | |||
h_dict = self.model.input_feature() | |||
embedding = self.model(self.train_hg, h_dict) | |||
# construct a negative graph according to the positive graph in each training epoch. | |||
negative_graph = self.task.construct_negative_graph(self.positive_graph) | |||
loss = self.loss_calculation(self.positive_graph, negative_graph, embedding) | |||
# negative_graph = self.construct_negative_graph(self.train_hg) | |||
# loss = self.loss_calculation(self.train_hg, negative_graph, embedding) | |||
self.optimizer.zero_grad() | |||
loss.backward() | |||
self.optimizer.step() | |||
return loss.item() | |||
def _mini_train_step(self,): | |||
self.model.train() | |||
@@ -123,101 +155,20 @@ class LinkPrediction(BaseFlow): | |||
return all_loss | |||
def loss_calculation(self, positive_graph, negative_graph, embedding): | |||
        p_score = self.task.ScorePredictor(positive_graph, embedding, self.r_embedding)
        n_score = self.task.ScorePredictor(negative_graph, embedding, self.r_embedding)
p_label = th.ones(len(p_score), device=self.device) | |||
n_label = th.zeros(len(n_score), device=self.device) | |||
loss = F.binary_cross_entropy_with_logits(th.cat((p_score, n_score)), th.cat((p_label, n_label))) | |||
return loss | |||
def _test_step(self, split=None): | |||
self.model.eval() | |||
with th.no_grad(): | |||
h_dict = self.model.input_feature() | |||
embedding = self.model(self.train_hg, h_dict) | |||
return self.task.evaluate(embedding, self.r_embedding, mode=split) |
@@ -1,39 +1,32 @@ | |||
import os.path | |||
import numpy | |||
from tqdm import tqdm | |||
import torch.optim as optim | |||
from torch.utils.data import DataLoader | |||
from ..models import build_model | |||
from . import BaseFlow, register_flow | |||
from ..sampler import random_walk_sampler
@register_flow("mp2vec_trainer") | |||
class Metapath2VecTrainer(BaseFlow): | |||
def __init__(self, args): | |||
super(Metapath2VecTrainer, self).__init__(args) | |||
        self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device)
        self.mp2vec_sampler = None
        self.dataloader = None
        self.output_dir = './openhgnn/output/' + self.model_name
        self.embeddings_file_path = os.path.join(self.output_dir, self.args.dataset + '_mp2vec_embeddings.npy')
self.load_trained_embeddings = False | |||
def preprocess(self): | |||
metapath = self.task.dataset.meta_paths_dict[self.args.meta_path_key] | |||
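        # the metapath list is repeated rw_length times, so each random walk
        # traverses len(metapath) * rw_length edges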
self.mp2vec_sampler = random_walk_sampler.RandomWalkSampler(g=self.hg.to('cpu'), | |||
metapath=metapath * self.args.rw_length, | |||
rw_walks=self.args.rw_walks, | |||
window_size=self.args.window_size, | |||
neg_size=self.args.neg_size) | |||
self.dataloader = DataLoader(self.mp2vec_sampler, batch_size=self.args.batch_size, | |||
shuffle=True, num_workers=self.args.num_workers, | |||
@@ -47,11 +40,9 @@ class Metapath2VecTrainer(BaseFlow): | |||
self.task.evaluate(logits=emb[start_idx:end_idx], name='f1_lr') | |||
def load_embeddings(self): | |||
        if not self.load_trained_embeddings or not os.path.exists(self.embeddings_file_path):
            self.train_embeddings()
        emb = numpy.load(self.embeddings_file_path)
return emb | |||
def train_embeddings(self): | |||
@@ -61,25 +52,24 @@ class Metapath2VecTrainer(BaseFlow): | |||
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader)) | |||
for epoch in range(self.max_epoch): | |||
            self.logger.info('Epoch: ' + str(epoch + 1))
running_loss = 0.0 | |||
for i, sample_batched in enumerate(tqdm(self.dataloader)): | |||
if len(sample_batched[0]) > 1: | |||
pos_u = sample_batched[0].to(self.device) | |||
pos_v = sample_batched[1].to(self.device) | |||
neg_v = sample_batched[2].to(self.device) | |||
optimizer.zero_grad() | |||
loss = self.model.forward(pos_u, pos_v, neg_v) | |||
loss.backward() | |||
optimizer.step() | |||
scheduler.step() | |||
running_loss = running_loss * 0.9 + loss.item() * 0.1 | |||
if i > 0 and i % 50 == 0: | |||
self.logger.info(' Loss: ' + str(running_loss)) | |||
self.model.save_embedding(self.embeddings_file_path) | |||
def get_ntype_range(self, target_ntype): | |||
start_idx = 0 | |||
@@ -2,4 +2,5 @@ from .best_config import BEST_CONFIGS | |||
from .dgl_graph import * | |||
from .utils import * | |||
from .evaluator import * | |||
from .logger import Logger | |||
@@ -65,7 +65,7 @@ BEST_CONFIGS = { | |||
}, | |||
}, | |||
'GTN': { | |||
        'general': {'lr': 0.005, 'weight_decay': 0.001, 'hidden_dim': 64, 'max_epoch': 100, 'patience': 20,
'norm_emd_flag': True, 'mini_batch_flag': False}, | |||
'acm4GTN': { | |||
'num_layers': 2, 'num_channels': 2, 'adaptive_lr_flag': True, | |||
@@ -158,7 +158,7 @@ BEST_CONFIGS = { | |||
}, | |||
'NSHE': { | |||
'general': {}, | |||
        'acm4NSHE': {'weight_decay': 0.001, 'num_e_neg': 1, 'num_ns_neg': 4,
'max_epoch': 500, 'patience': 10, | |||
} | |||
}, | |||
@@ -210,17 +210,28 @@ BEST_CONFIGS = { | |||
} | |||
}, | |||
"link_prediction": { | |||
'NARS': { | |||
'general': {'num_hops': 3}, | |||
}, | |||
        'HetGNN': {
'general': {'max_epoch': 500, 'patience': 10, 'mini_batch_flag': True}, | |||
'academic4HetGNN': { | |||
'lr': 0.01, 'weight_decay': 0.0001, 'dim': 128, 'batch_size': 64, 'window_size': 5, | |||
'batches_per_epoch': 50, 'rw_length': 50, 'rw_walks': 10, 'rwr_prob': 0.5, | |||
} | |||
} | |||
}, | |||
'RGCN': { | |||
'general': { | |||
}, | |||
'FB15k-237': { | |||
'lr': 0.01, 'weight_decay': 0.0005, 'max_epoch': 100, | |||
'hidden_dim': 16, 'n_bases': 40, 'n_layers': 2, 'batch_size': 126, 'fanout': 4, 'dropout': 0, | |||
'validation': True | |||
}, | |||
}, | |||
}, | |||
"recommendation": { | |||
@@ -41,10 +41,9 @@ class Evaluator(): | |||
def cal_roc_auc(self, y_true, y_pred): | |||
return roc_auc_score(y_true, y_pred) | |||
    def mrr_(self, embedding, w, train_triplets, valid_triplets, test_triplets, hits=[], eval_bz=100, eval_p='raw'):
        if eval_p == "filtered":
            mrr = calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits)
else: | |||
mrr = calc_raw_mrr(embedding, w, test_triplets, hits, eval_bz) | |||
return mrr | |||
@@ -92,34 +91,25 @@ class Evaluator(): | |||
return micro_f1, macro_f1 | |||
def perturb_and_get_raw_rank(embedding, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets | |||
""" | |||
n_batch = (test_size + batch_size - 1) // batch_size | |||
ranks = [] | |||
for idx in range(n_batch): | |||
        # print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size | |||
batch_end = min(test_size, (idx + 1) * batch_size) | |||
batch_a = a[batch_start: batch_end] | |||
batch_r = r[batch_start: batch_end] | |||
emb_ar = embedding[batch_a] * w[batch_r] | |||
emb_ar = emb_ar.transpose(0, 1).unsqueeze(2) # size: D x E x 1 | |||
emb_c = embedding.transpose(0, 1).unsqueeze(1) # size: D x 1 x V | |||
# out-prod and reduce sum | |||
out_prod = th.bmm(emb_ar, emb_c) # size D x E x V | |||
score = th.sum(out_prod, dim=0) # size E x V | |||
# emb_ar = (embedding[batch_a] + w[batch_r]).unsqueeze(1).expand((batch_end - batch_start,embedding.shape[0], -1)) | |||
# emb_c = embedding.unsqueeze(0) | |||
# out_prod = th.sub(emb_ar, emb_c) # size D x E x V | |||
# score = -th.norm(out_prod, p=1, dim=2) # size E x V | |||
score = th.sigmoid(score) | |||
        target = b[batch_start: batch_end].to(score.device)
ranks.append(sort_and_rank(score, target)) | |||
return th.cat(ranks) | |||
@@ -134,17 +124,17 @@ def sort_and_rank(score, target): | |||
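# 'Raw' ranking scores the true entity against every entity in the graph, while the
# 'filtered' setting (calc_filtered_mrr below) first removes all other known true triplets
# from the candidate set, so that other correct answers do not depress the rank.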
# return MRR (raw), and Hits @ (1, 3, 10) | |||
def calc_raw_mrr(embedding, w, test_triplets, hits=[], eval_bz=100): | |||
with th.no_grad(): | |||
        s = test_triplets[:, 0]
        r = test_triplets[:, 1]
        o = test_triplets[:, 2]
        test_size = test_triplets.shape[0]
# perturb subject | |||
        ranks_s = perturb_and_get_raw_rank(embedding, w, o, r, s, test_size, eval_bz)
        # perturb object
        ranks_o = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)
        ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed | |||
mrr = th.mean(1.0 / ranks.float()) | |||
@@ -167,19 +157,56 @@ def perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_t | |||
target_s = s[idx] | |||
target_r = r[idx] | |||
target_o = o[idx] | |||
        filtered_s = filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities)
        target_s_idx = int((filtered_s == target_s).nonzero())
emb_s = embedding[filtered_s] | |||
        emb_r = w[str(target_r.item())]
emb_o = embedding[target_o] | |||
emb_triplet = emb_s * emb_r * emb_o | |||
scores = th.sigmoid(th.sum(emb_triplet, dim=1)) | |||
_, indices = th.sort(scores, descending=True) | |||
        rank = int((indices == target_s_idx).nonzero())
ranks.append(rank) | |||
return th.LongTensor(ranks) | |||
def perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter): | |||
""" Perturb object in the triplets | |||
""" | |||
num_entities = embedding.shape[0] | |||
ranks = [] | |||
for idx in range(test_size): | |||
if idx % 10000 == 0: | |||
print("test triplet {} / {}".format(idx, test_size)) | |||
target_s = s[idx] | |||
target_r = r[idx] | |||
target_o = o[idx] | |||
filtered_o = filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities) | |||
target_o_idx = int((filtered_o == target_o).nonzero()) | |||
emb_s = embedding[target_s] | |||
emb_r = w[str(target_r.item())] | |||
emb_o = embedding[filtered_o] | |||
emb_triplet = emb_s * emb_r * emb_o | |||
scores = th.sigmoid(th.sum(emb_triplet, dim=1)) | |||
_, indices = th.sort(scores, descending=True) | |||
rank = int((indices == target_o_idx).nonzero()) | |||
ranks.append(rank) | |||
return th.LongTensor(ranks) | |||
def filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities): | |||
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o) | |||
filtered_o = [] | |||
# Do not filter out the test triplet, since we want to predict on it | |||
if (target_s, target_r, target_o) in triplets_to_filter: | |||
triplets_to_filter.remove((target_s, target_r, target_o)) | |||
# Do not consider an object if it is part of a triplet to filter | |||
for o in range(num_entities): | |||
if (target_s, target_r, o) not in triplets_to_filter: | |||
filtered_o.append(o) | |||
return th.LongTensor(filtered_o) | |||
def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities): | |||
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o) | |||
filtered_s = [] | |||
@@ -195,20 +222,19 @@ def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities): | |||
def calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[]): | |||
with th.no_grad(): | |||
        s = test_triplets[:, 0]
        r = test_triplets[:, 1]
        o = test_triplets[:, 2]
        test_size = test_triplets.shape[0]
        triplets_to_filter = th.cat([train_triplets, valid_triplets, test_triplets]).tolist()
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter} | |||
print('Perturbing subject...') | |||
ranks_s = perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter) | |||
print('Perturbing object...') | |||
        ranks_o = perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
        ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed | |||
mrr = th.mean(1.0 / ranks.float()) | |||
@@ -279,5 +305,6 @@ def cal_loss_f1(y, node_data, loss_func, mode): | |||
macro_f1, micro_f1 = f1_node_classification(y_label.cpu(), y_pred.cpu()) | |||
return loss, macro_f1, micro_f1 | |||
def cal_acc(y_pred, y_true): | |||
return th.sum(y_pred.argmax(dim=1) == y_true).item() / len(y_true) |
@@ -1,3 +1,11 @@ | |||
import logging | |||
import os | |||
import colorlog | |||
import re | |||
import datetime | |||
from logging import getLogger | |||
from colorama import init | |||
def printInfo(metric, epoch, train_score, train_loss, val_score, val_loss): | |||
if metric == 'f1_lr': | |||
@@ -28,3 +36,184 @@ def printMetric(metric, score, mode): | |||
print(f"{mode}_macro_{metric} = {score[0]:.4f}, {mode}_micro_{metric}: {score[1]:.4f}") | |||
else: | |||
print(f"{mode}_{metric} = {score:.4f}") | |||
# UPDATE | |||
# Hu Anke 2021/11/07 | |||
log_colors_config = { | |||
'DEBUG': 'cyan', | |||
'WARNING': 'yellow', | |||
'ERROR': 'red', | |||
'CRITICAL': 'red', | |||
} | |||
def get_local_time(): | |||
r"""Get current time | |||
Returns: | |||
str: current time | |||
""" | |||
cur = datetime.datetime.now() | |||
cur = cur.strftime('%b-%d-%Y_%H-%M-%S') | |||
return cur | |||
def ensure_dir(dir_path): | |||
r"""Make sure the directory exists, if it does not exist, create it | |||
Args: | |||
dir_path (str): directory path | |||
""" | |||
if not os.path.exists(dir_path): | |||
os.makedirs(dir_path) | |||
class RemoveColorFilter(logging.Filter): | |||
def filter(self, record): | |||
if record: | |||
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])') | |||
record.msg = ansi_escape.sub('', str(record.msg)) | |||
return True | |||
def set_color(log, color, highlight=True): | |||
color_set = ['black', 'red', 'green', 'yellow', 'blue', 'pink', 'cyan', 'white'] | |||
try: | |||
index = color_set.index(color) | |||
    except ValueError:
index = len(color_set) - 1 | |||
prev_log = '\033[' | |||
if highlight: | |||
prev_log += '1;3' | |||
else: | |||
prev_log += '0;3' | |||
prev_log += str(index) + 'm' | |||
return prev_log + log + '\033[0m' | |||
# UPDATE | |||
# Hu Anke 2021/11/11 | |||
# UPDATE | |||
# Hu AnKe 2021/12/16 | |||
class Logger: | |||
def __init__(self, config): | |||
""" | |||
A logger that can show a message on standard output and write it into the | |||
file named `filename` simultaneously. | |||
        All messages that you want to log MUST be str.
Args: | |||
config (Config): An instance object of Config, used to record parameter information. | |||
Example: | |||
        >>> logger = Logger(config)
>>> logger.debug(train_state) | |||
>>> logger.info(train_result) | |||
""" | |||
init(autoreset=True) | |||
LOGROOT = f'./openhgnn/output/{config.model}/' | |||
dir_name = os.path.dirname(LOGROOT) | |||
ensure_dir(dir_name) | |||
logfilename = '{}-{}.log'.format(config.model, get_local_time()) | |||
logfilepath = os.path.join(LOGROOT, logfilename) | |||
filefmt = "%(asctime)-15s %(levelname)s %(message)s" | |||
filedatefmt = "%a %d %b %Y %H:%M:%S" | |||
fileformatter = logging.Formatter(filefmt, filedatefmt) | |||
sfmt = "%(log_color)s%(asctime)-15s %(levelname)s %(message)s" | |||
sdatefmt = "%d %b %H:%M" | |||
sformatter = colorlog.ColoredFormatter(sfmt, sdatefmt, log_colors=log_colors_config) | |||
if not hasattr(config, 'state') or config.state.lower() == 'info': | |||
level = logging.INFO | |||
elif config.state.lower() == 'debug': | |||
level = logging.DEBUG | |||
elif config.state.lower() == 'error': | |||
level = logging.ERROR | |||
elif config.state.lower() == 'warning': | |||
level = logging.WARNING | |||
elif config.state.lower() == 'critical': | |||
level = logging.CRITICAL | |||
else: | |||
level = logging.INFO | |||
fh = logging.FileHandler(logfilepath, mode='a') | |||
fh.setLevel(level) | |||
fh.setFormatter(fileformatter) | |||
remove_color_filter = RemoveColorFilter() | |||
fh.addFilter(remove_color_filter) | |||
sh = logging.StreamHandler() | |||
sh.setLevel(level) | |||
sh.setFormatter(sformatter) | |||
root_logger = logging.getLogger() | |||
for h in root_logger.handlers: | |||
root_logger.removeHandler(h) | |||
logging.basicConfig(level=level, handlers=[sh, fh]) | |||
self.logger = getLogger() | |||
self.logger.info(config) | |||
def info(self, s): | |||
self.logger.info(s) | |||
def load_best_config(self, s): | |||
self.logger.info('[Load Best Config] ' + s) | |||
def train_info(self, s): | |||
self.logger.info('[Train Info] ' + s) | |||
def feature_info(self, s): | |||
self.logger.info('[Feature Transformation] ' + s) | |||
# graph data analyze | |||
def log_data_info(self, g): | |||
num_nodes = g.num_nodes() | |||
num_edges = g.num_edges() | |||
node_types = len(g.ntypes) | |||
edge_types = len(g.etypes) | |||
c_etypes = len(g.canonical_etypes) | |||
        datainfo = {'total nodes': num_nodes, 'total edges': num_edges, 'node types': node_types,
                    'edge types': edge_types, 'c_etypes': c_etypes}
self.logger.info(datainfo) | |||
return | |||
# evaluate results | |||
def log_eval_info(self, metric, epoch, train_score, train_loss, val_score, val_loss): | |||
        # the two acc metrics share one message format; f1-style metrics share the other
        if metric in ('acc', 'acc-ogbn-mag'):
            eval_info = (f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_acc: {train_score:.4f}, "
                         f"Val_acc: {val_score:.4f}, ValLoss:{val_loss: .4f}")
        else:
            # 'f1_lr' and the default case
            eval_info = (f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_macro_f1: {train_score[0]:.4f}, "
                         f"Train_micro_f1: {train_score[1]:.4f}, Val_macro_f1: {val_score[0]:.4f}, "
                         f"Val_micro_f1: {val_score[1]:.4f}, ValLoss:{val_loss: .4f}")
self.logger.info(eval_info) | |||
return | |||
def log_metric_info_1(self, metric, score, mode): | |||
met_info = {f"{mode}_{metric} = {score:.4f}"} | |||
self.logger.info(met_info) | |||
return | |||
def log_metric_info_2(self, metric, score, mode): | |||
met_info = {f"{mode}_macro_{metric} = {score[0]:.4f}, {mode}_micro_{metric}: {score[1]:.4f}"} | |||
self.logger.info(met_info) |
@@ -5,7 +5,8 @@ import torch as th | |||
from scipy.sparse import coo_matrix | |||
import numpy as np | |||
import random | |||
from . import load_HIN, load_KG, load_OGB | |||
from .best_config import BEST_CONFIGS | |||
def sum_up_params(model): | |||
@@ -66,20 +67,20 @@ def add_reverse_edges(hg, copy_ndata=True, copy_edata=True, ignore_one_type=True | |||
def set_best_config(args): | |||
configs = BEST_CONFIGS.get(args.task) | |||
if configs is None: | |||
        args.logger.load_best_config('The task: {} does not have a best_config!'.format(args.task))
return args | |||
if args.model not in configs: | |||
        args.logger.load_best_config('The model: {} is not in the best config.'.format(args.model))
return args | |||
configs = configs[args.model] | |||
for key, value in configs["general"].items(): | |||
args.__setattr__(key, value) | |||
if args.dataset not in configs: | |||
        args.logger.load_best_config('The dataset: {} is not in the best config of model: {}.'.format(args.dataset, args.model))
return args | |||
for key, value in configs[args.dataset].items(): | |||
args.__setattr__(key, value) | |||
    args.logger.load_best_config('Load the best config of model: {} for dataset: {}.'.format(args.model, args.dataset))
return args | |||
@@ -334,3 +335,27 @@ def extract_metapaths(category, canonical_etypes, self_loop=False): | |||
if etype[0] != etype[2]: | |||
meta_paths.append((etype, dst_e)) | |||
return meta_paths | |||
# for etype in self.model.hg.etypes: | |||
# g = self.model.hg[etype] | |||
# for etype in ['paper-ref-paper','paper-cite-paper']: | |||
# g = self.hg[etype] | |||
# r = [] | |||
# for i in self.train_idx: | |||
# neigh = g.predecessors(i) | |||
# cen_label = self.labels[i] | |||
# neigh_label = self.labels[neigh] | |||
# if len(neigh) == 0: | |||
# pass | |||
# else: | |||
# r.append((cen_label == neigh_label).sum() / len(neigh)) | |||
# for i in self.valid_idx: | |||
# neigh = g.predecessors(i) | |||
# cen_label = self.labels[i] | |||
# neigh_label = self.labels[neigh] | |||
# if len(neigh) == 0: | |||
# pass | |||
# else: | |||
# r.append((cen_label == neigh_label).sum() / len(neigh)) | |||
# he = torch.stack(r).mean() | |||
# print(etype+ str(he)) |
@@ -12,7 +12,7 @@ The installation process is same with OpenHGNN [Get Started](https://github.com/ | |||
#### 2.1 Generate designs randomly | |||
Here we will generate a random design combination for each dataset and save it in a `.yaml` file. The candidate designs are listed in [`./space4hgnn/generate_yaml.py`](./generate_yaml.py).
```bash | |||
python ./space4hgnn/generate_yaml.py --gnn_type gcnconv --times 1 --key has_bn --configfile test | |||
@@ -85,7 +85,9 @@ For **Meta-path model family**, ``--model`` is general_HGNN and ``--subgraph_ext | |||
python space4hgnn.py -m general_HGNN -u metapath -t node_classification -d HGBn-ACM -g 0 -r 5 -a gcnconv -s 1 -k has_bn -v True -c test -p HGB | |||
``` | |||
**Note:**
Similar to generating a yaml file, the experiment will load the design configuration from ``yaml_file_path``, and it will save the results into a `.csv` file in `prediction_file_path`.
```python | |||
yaml_file_path = './space4hgnn/config/{}/{}/{}_{}.yaml'.format(configfile, key, gnn_type, times) | |||
@@ -94,7 +96,7 @@ prediction_file_path = './space4hgnn/prediction/excel/{}/{}_{}/{}_{}_{}_{}.csv'. | |||
# Here prediction_file_path = './space4hgnn/prediction/excel/test/has_bn_True/metapath_gcnconv_1_HGBn-ACM.csv'
``` | |||
### 3 Run a batch of experiments
An example: | |||
@@ -102,14 +104,14 @@ An example: | |||
./space4hgnn/parallel.sh 0 5 has_bn True node_classification test_paral test_paral | |||
``` | |||
It will generate configuration files for the batch of experiments and launch a batch of experiments.
The arguments are described as follows:
1. The first argument controls which GPU to use. Here it is 0.
2. Repeat times. Here it is 5.
3. Design dimension. Here it is BN.
4. Choice of design dimension. Here BN is set to ``True``.
5. Task name. Here it is node_classification.
6. Configfile is the path to save configuration files.
7. Predictfile is the path to save prediction files.
@@ -132,12 +134,12 @@ python ./space4hgnn/prediction/excel/gather_all_Csv.py -p ./space4hgnn/predictio | |||
##### 3.2.1 Ranking analysis | |||
We analyze the results with average ranking following [GraphGym](https://github.com/snap-stanford/GraphGym#3-analyze-the-results), and the corresponding code is in [`figure/rank.py`](./figure/rank.py).
![space4hgnn_rank](../docs/source/_static/space4hgnn_rank.png) | |||
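As a rough illustration (not the repository's own script), the average-ranking analysis boils down to ranking each design within every dataset and averaging the ranks across datasets; the file path and column names below are hypothetical:

```python
import pandas as pd

# gathered results from the step above; path and columns are assumptions
df = pd.read_csv('./space4hgnn/prediction/excel/test_gather.csv')
# rank designs within each dataset by score (1 = best), then average across datasets
df['rank'] = df.groupby('dataset')['score'].rank(ascending=False)
print(df.groupby('design_choice')['rank'].mean().sort_values())
```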
##### 3.2.2 Distribution estimates

We analyze the results with distribution estimates following [NDS](https://github.com/facebookresearch/nds), and the corresponding code is in [`figure/distribution.py`](./figure/distribution.py).
![space4hgnn_distribution](../docs/source/_static/space4hgnn_distribution.png) |
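As a rough sketch (again with a hypothetical file path and column names, not the exact `figure/distribution.py` code), a distribution estimate is simply the empirical CDF of scores over all sampled designs:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('./space4hgnn/prediction/excel/test_gather.csv')  # assumed path
scores = np.sort(df['score'].to_numpy())
ecdf = np.arange(1, len(scores) + 1) / len(scores)  # P(score <= s) over designs
```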