#1 main

Merged
clearhanhui merged 18 commits from GAMMALab/OpenHGNN:main into main 2 years ago
  1. .gitignore (+2, -1)
  2. CONTRIBUTING.md (+29, -0)
  3. README.md (+31, -28)
  4. docs/source/api/task.rst (+9, -4)
  5. main.py (+2, -0)
  6. openhgnn/config.ini (+27, -10)
  7. openhgnn/config.py (+25, -11)
  8. openhgnn/dataset/LinkPredictionDataset.py (+53, -20)
  9. openhgnn/dataset/NodeClassificationDataset.py (+10, -0)
  10. openhgnn/dataset/README.md (+10, -2)
  11. openhgnn/dataset/__init__.py (+1, -1)
  12. openhgnn/models/HDE.py (+65, -0)
  13. openhgnn/models/SkipGram.py (+3, -2)
  14. openhgnn/models/__init__.py (+3, -1)
  15. openhgnn/output/HDE/README.md (+55, -0)
  16. openhgnn/output/HERec/README.md (+21, -0)
  17. openhgnn/output/RGCN/README.md (+26, -3)
  18. openhgnn/output/metapath2vec/README.md (+24, -0)
  19. openhgnn/sampler/hde_sampler.py (+243, -0)
  20. openhgnn/sampler/random_walk_sampler.py (+32, -19)
  21. openhgnn/sampler/test_mp2vec_sampler.py (+19, -4)
  22. openhgnn/start.py (+5, -4)
  23. openhgnn/tasks/link_prediction.py (+157, -13)
  24. openhgnn/trainerflow/__init__.py (+3, -1)
  25. openhgnn/trainerflow/base_flow.py (+65, -32)
  26. openhgnn/trainerflow/hde_trainer.py (+107, -0)
  27. openhgnn/trainerflow/herec_trainer.py (+87, -0)
  28. openhgnn/trainerflow/link_prediction.py (+71, -120)
  29. openhgnn/trainerflow/mp2vec_trainer.py (+17, -27)
  30. openhgnn/utils/__init__.py (+1, -0)
  31. openhgnn/utils/best_config.py (+15, -4)
  32. openhgnn/utils/evaluator.py (+60, -33)
  33. openhgnn/utils/logger.py (+189, -0)
  34. openhgnn/utils/utils.py (+30, -5)
  35. space4hgnn/README.md (+10, -8)

.gitignore (+2, -1)

@@ -10,10 +10,11 @@ __pycache__/
*.zip
*.bin
*.pt
*.log
# numpy files
*.npy
*.npz

# yaml files
*.yaml



CONTRIBUTING.md (+29, -0)

@@ -0,0 +1,29 @@
## Contributing to OpenHGNN

Contributions are always welcome. Please feel free to open an issue or email tyzhao@bupt.edu.cn.

## Contributors

- **[GAMMA LAB](https://github.com/BUPT-GAMMA) [BUPT]**
- [Tianyu Zhao](https://github.com/Theheavens)
- Hongyi Zhang
- Yaoqi Liu
- Hui Han
- Fengqi Liang
- Yibo Li
- Yanhu Mo
- Donglin Xia
- Xinlong Zhai
- Siyuan Zhang
- Qi Zhang
- [Chuan Shi](http://shichuan.org/)
- Cheng Yang
- Xiao Wang
- **BUPT**
- Jiahang Li
- Anke Hu
- **DGL Team**
- Quan Gan
- Minjie Wang
- [Jian Zhang](https://github.com/zhjwy9343)


README.md (+31, -28)

@@ -51,7 +51,7 @@ pip install -r requirements.txt
#### Running an existing baseline model on an existing benchmark [dataset](./openhgnn/dataset/#Dataset)

```bash
python main.py -m model_name -d dataset_name -t task_name -g 0 --use_best_config
python main.py -m model_name -d dataset_name -t task_name -g 0 --use_best_config --load_from_pretrained
```

usage: main.py [-h] [--model MODEL] [--task TASK] [--dataset DATASET]
@@ -73,6 +73,8 @@ usage: main.py [-h] [--model MODEL] [--task TASK] [--dataset DATASET]

``--use_hpo`` Besides ``use_best_config``, we give a hyper-parameter [example](./openhgnn/auto) to search for the best hyper-parameters automatically.

``--load_from_pretrained`` will load the model from a default checkpoint.

e.g.:

```bash
@@ -89,29 +91,32 @@ Refer to the [docs](https://openhgnn.readthedocs.io/en/latest/index.html) to get

The model links below give some basic usage.

| Model | Node classification | Link prediction | Recommendation |
| ----------------------------------------------- | ------------------- | ------------------ | ------------------ |
| [RGCN](./openhgnn/output/RGCN)[ESWC 2018] | :heavy_check_mark: | :heavy_check_mark: | |
| [HAN](./openhgnn/output/HAN)[WWW 2019] | :heavy_check_mark: | | |
| [KGCN](./openhgnn/output/KGCN)[WWW 2019] | | | :heavy_check_mark: |
| [HetGNN](./openhgnn/output/HetGNN)[KDD 2019] | :heavy_check_mark: | :heavy_check_mark: | |
| [GTN](./openhgnn/output/GTN)[NeurIPS 2019] | :heavy_check_mark: | | |
| [RSHN](./openhgnn/output/RSHN)[ICDM 2019] | :heavy_check_mark: | | |
| [DGMI](./openhgnn/output/DMGI)[AAAI 2020] | :heavy_check_mark: | | |
| [MAGNN](./openhgnn/output/MAGNN)[WWW 2020] | :heavy_check_mark: | | |
| [CompGCN](./openhgnn/output/CompGCN)[ICLR 2020] | :heavy_check_mark: | :heavy_check_mark: | |
| [NSHE](./openhgnn/output/NSHE)[IJCAI 2020] | :heavy_check_mark: | | |
| [NARS](./openhgnn/output/NARS)[arxiv] | :heavy_check_mark: | | |
| [MHNF](./openhgnn/output/MHNF)[arxiv] | :heavy_check_mark: | | |
| [HGSL](./openhgnn/output/HGSL)[AAAI 2021] | :heavy_check_mark: | | |
| [HGNN-AC](./openhgnn/output/HGNN_AC)[WWW 2021] | :heavy_check_mark: | | |
| [HeCo](./openhgnn/output/HeCo)[KDD 2021] | :heavy_check_mark: | | |
| [HPN](./openhgnn/output/HPN)[TKDE 2021] | :heavy_check_mark: | | |
| [RHGNN](./openhgnn/output/RHGNN)[arxiv] | :heavy_check_mark: | | |

### To be supported models

- Metapath2vec[KDD 2017]
| Model | Node classification | Link prediction | Recommendation |
| -------------------------------------------------------- | ------------------- | ------------------ | ------------------ |
| [Metapath2vec](./openhgnn/output/metapath2vec)[KDD 2017] | :heavy_check_mark: | | |
| [RGCN](./openhgnn/output/RGCN)[ESWC 2018] | :heavy_check_mark: | :heavy_check_mark: | |
| [HERec](./openhgnn/output/HERec)[TKDE 2018] | :heavy_check_mark: | | |
| [HAN](./openhgnn/output/HAN)[WWW 2019] | :heavy_check_mark: | | |
| [KGCN](./openhgnn/output/KGCN)[WWW 2019] | | | :heavy_check_mark: |
| [HetGNN](./openhgnn/output/HetGNN)[KDD 2019] | :heavy_check_mark: | :heavy_check_mark: | |
| HGAT[EMNLP 2019] | | | |
| [GTN](./openhgnn/output/GTN)[NeurIPS 2019] | :heavy_check_mark: | | |
| [RSHN](./openhgnn/output/RSHN)[ICDM 2019] | :heavy_check_mark: | | |
| [DMGI](./openhgnn/output/DMGI)[AAAI 2020] | :heavy_check_mark: | | |
| [MAGNN](./openhgnn/output/MAGNN)[WWW 2020] | :heavy_check_mark: | | |
| HGT[WWW 2020] | | | |
| [CompGCN](./openhgnn/output/CompGCN)[ICLR 2020] | :heavy_check_mark: | :heavy_check_mark: | |
| [NSHE](./openhgnn/output/NSHE)[IJCAI 2020] | :heavy_check_mark: | | |
| [NARS](./openhgnn/output/NARS)[arxiv] | :heavy_check_mark: | | |
| [MHNF](./openhgnn/output/MHNF)[arxiv] | :heavy_check_mark: | | |
| [HGSL](./openhgnn/output/HGSL)[AAAI 2021] | :heavy_check_mark: | | |
| [HGNN-AC](./openhgnn/output/HGNN_AC)[WWW 2021] | :heavy_check_mark: | | |
| [HeCo](./openhgnn/output/HeCo)[KDD 2021] | :heavy_check_mark: | | |
| SimpleHGN[KDD 2021] | | | |
| [HPN](./openhgnn/output/HPN)[TKDE 2021] | :heavy_check_mark: | | |
| [RHGNN](./openhgnn/output/RHGNN)[arxiv] | :heavy_check_mark: | | |
| [HDE](./openhgnn/output/HDE)[ICDM 2021] | | :heavy_check_mark: | |
| | | | |

### Candidate models

@@ -120,9 +125,7 @@ The link will give some basic usage.

## Contributors

**[GAMMA LAB](https://github.com/BUPT-GAMMA) [BUPT]**: [Tianyu Zhao](https://github.com/Theheavens), Yaoqi Liu, Fengqi Liang, Yibo Li, Yanhu Mo, Donglin Xia, Xinlong Zhai, Siyuan Zhang, Qi Zhang, [Chuan Shi](http://shichuan.org/), Cheng Yang, Xiao Wang

**BUPT**: Jiahang Li, Anke Hu
OpenHGNN Team[GAMMA LAB] & DGL Team.

**DGL Team**: Quan Gan, [Jian Zhang](https://github.com/zhjwy9343)
See more in [CONTRIBUTING](./CONTRIBUTING.md).


docs/source/api/task.rst (+9, -4)

@@ -22,10 +22,15 @@ Node classification Task
Link prediction Task
---------------------

.. autoclass:: openhgnn.tasks.link_prediction.LinkPrediction
:members:
:undoc-members:
:show-inheritance:
.. automodule:: openhgnn.tasks.link_prediction

Link prediction Model
~~~~~~~~~~~~~~~~~~~~~~

.. autoclass:: openhgnn.models.link_prediction.LinkPrediction
:members:
:undoc-members:
:show-inheritance:

Recommendation Task
---------------------


main.py (+2, -0)

@@ -17,10 +17,12 @@ if __name__ == '__main__':
parser.add_argument('--gpu', '-g', default='0', type=int, help='-1 means cpu')
parser.add_argument('--use_best_config', action='store_true', help='will load utils.best_config')
parser.add_argument('--use_hpo', action='store_true', help='hyper-parameter optimization')
parser.add_argument('--load_from_pretrained', action='store_true', help='load model from the checkpoint')
args = parser.parse_args()

config_file = ["./openhgnn/config.ini"]
config = Config(file_path=config_file, model=args.model, dataset=args.dataset, task=args.task, gpu=args.gpu)
config.use_best_config = args.use_best_config
config.use_hpo = args.use_hpo
config.load_from_pretrained = args.load_from_pretrained
OpenHGNN(args=config)

openhgnn/config.ini (+27, -10)

@@ -98,7 +98,6 @@ dropout = 0.2
seed =0
in_dim = 64
hidden_dim = 64
out_dim = 64
n_bases = 40
n_layers = 3

@@ -151,22 +150,29 @@ mini_batch_flag = True

[Metapath2vec]
learning_rate = 0.01
;weight_decay = 0.0001
dim = 128
max_epoch = 50
max_epoch = 1
batch_size = 32
window_size = 5
num_workers = 4
;batches_per_epoch = 20

rw_length = 10
rw_length = 20
rw_walks = 10
;rwr_prob = 0.5
neg_size = 5
seed = 0
meta_path_key = APVPA

[HERec]
learning_rate = 0.01
dim = 128
max_epoch = 1
batch_size = 32
window_size = 2
num_workers = 4
rw_length = 100
rw_walks = 10
neg_size = 5
seed = 0
;patience = 100
;mini_batch_flag = True
meta_path_keys = APVPA,APA

[HAN]
seed = 0
@@ -400,4 +406,15 @@ stage_type = stack
num_heads = 8
feat = 0
subgraph=metapath
mini_batch_flag = false
mini_batch_flag = false

[HDE]

emb_dim = 128
num_neighbor = 5
use_bias = true
k_hop = 2
max_epoch = 400
batch_size = 32
max_dist = 3
lr = 0.001

openhgnn/config.py (+25, -11)

@@ -14,7 +14,7 @@ class Config(object):
if th.cuda.is_available():
self.device = th.device('cuda', int(gpu))
else:
print("cuda is not available, please set 'gpu' -1")
raise ValueError("cuda is not available, please set 'gpu' -1")

try:
conf.read(file_path)
@@ -107,7 +107,6 @@ class Config(object):

self.in_dim = conf.getint("RGCN", "in_dim")
self.hidden_dim = conf.getint("RGCN", "hidden_dim")
self.out_dim = conf.getint("RGCN", "out_dim")

self.n_bases = conf.getint("RGCN", "n_bases")
self.n_layers = conf.getint("RGCN", "n_layers")
@@ -159,22 +158,28 @@ class Config(object):
pass
elif model == 'Metapath2vec':
self.lr = conf.getfloat("Metapath2vec", "learning_rate")
# self.weight_decay = conf.getfloat("Metapath2vec", "weight_decay")

#self.dropout = conf.getfloat("CompGCN", "dropout")
self.max_epoch = conf.getint("Metapath2vec", "max_epoch")
self.dim = conf.getint("Metapath2vec", "dim")
self.batch_size = conf.getint("Metapath2vec", "batch_size")
self.window_size = conf.getint("Metapath2vec", "window_size")
self.num_workers = conf.getint("Metapath2vec", "num_workers")
self.neg_size = conf.getint("Metapath2vec", "neg_size")
# self.batches_per_epoch = conf.getint("Metapath2vec", "batches_per_epoch")
# self.seed = conf.getint("Metapath2vec", "seed")
# self.patience = conf.getint("Metapath2vec", "patience")
self.rw_length = conf.getint("Metapath2vec", "rw_length")
self.rw_walks = conf.getint("Metapath2vec", "rw_walks")
# self.rwr_prob = conf.getfloat("Metapath2vec", "rwr_prob")
# self.mini_batch_flag = conf.getboolean("Metapath2vec", "mini_batch_flag")
self.meta_path_key = conf.get("Metapath2vec", "meta_path_key")

elif model == 'HERec':
self.lr = conf.getfloat("HERec", "learning_rate")
self.max_epoch = conf.getint("HERec", "max_epoch")
self.dim = conf.getint("HERec", "dim")
self.batch_size = conf.getint("HERec", "batch_size")
self.window_size = conf.getint("HERec", "window_size")
self.num_workers = conf.getint("HERec", "num_workers")
self.neg_size = conf.getint("HERec", "neg_size")
self.rw_length = conf.getint("HERec", "rw_length")
self.rw_walks = conf.getint("HERec", "rw_walks")
meta_path_keys = conf.get("HERec", "meta_path_keys")
self.meta_path_keys = meta_path_keys.split(',')

elif model == 'HAN':
self.lr = conf.getfloat("HAN", "learning_rate")
@@ -480,6 +485,15 @@ class Config(object):
self.validation = conf.getboolean('HeGAN', 'validation')
self.emb_size = conf.getint("HeGAN", 'emb_size')
self.patience = conf.getint("HeGAN", 'patience')
elif model == 'HDE':
self.emb_dim = conf.getint('HDE', 'emb_dim')
self.num_neighbor = conf.getint('HDE', 'num_neighbor')
self.use_bias = conf.getboolean('HDE', 'use_bias')
self.k_hop = conf.getint('HDE', 'k_hop')
self.max_epoch = conf.getint('HDE', 'max_epoch')
self.batch_size = conf.getint('HDE', 'batch_size')
self.max_dist = conf.getint('HDE', 'max_dist')
self.lr = conf.getfloat('HDE', 'lr')

def __repr__(self):
return 'Model:' + self.model + '\nTask:' + self.task + '\nDataset:' + self.dataset
return '[Config Info]\tModel: {},\tTask: {},\tDataset: {}'.format(self.model, self.task, self.dataset)

openhgnn/dataset/LinkPredictionDataset.py (+53, -20)

@@ -52,7 +52,7 @@ class LinkPredictionDataset(BaseDataset):
random_int = th.randperm(num_edges)
val_index = random_int[:int(num_edges * val_ratio)]
val_edge = self.g.find_edges(val_index, etype)
test_index = random_int[int(num_edges * test_ratio):int(num_edges * test_ratio)]
test_index = random_int[int(num_edges * test_ratio):int(num_edges * (test_ratio + val_ratio))]
test_edge = self.g.find_edges(test_index, etype)

val_edge_dict[etype] = val_edge
@@ -430,6 +430,9 @@ def comp_deg_norm(g):

@register_dataset('kg_link_prediction')
class KG_LinkPrediction(LinkPredictionDataset):
"""
From `RGCN <https://arxiv.org/abs/1703.06103>`_: WN18 & FB15k suffer from data leakage.
"""
def __init__(self, dataset_name):
super(KG_LinkPrediction, self).__init__()
if dataset_name in ['wn18', 'FB15k', 'FB15k-237']:
@@ -438,40 +441,70 @@ class KG_LinkPrediction(LinkPredictionDataset):
self.num_rels = dataset.num_rels * 2
# include inverse edge
self.num_nodes = dataset.num_nodes
self.homo_g = dataset[0]
self.g = self.homo_to_hetero(dataset[0])
self.train_hg = self.get_graph_directed_from_triples(dataset.train, 'graph')
self.valid_hg = self.get_graph_directed_from_triples(dataset.valid, 'graph')
self.test_hg = self.get_graph_directed_from_triples(dataset.test, 'graph')
self.train_triplets, self.valid_triplets, self.test_triplets = self.get_all_triplets(dataset)
self.target_link = self.test_hg.canonical_etypes
self.g = self.train_hg
# self.g = self.build_g(dataset.train)
# self.dataset = dataset

def get_triples_directed(self, mask_mode):
if mask_mode == 'train_mask':
data = self.dataset.train
elif mask_mode == 'val_mask':
data = self.dataset.test
elif mask_mode == 'test_mask':
data = self.dataset.test
s = th.LongTensor(data[:, 0])
r = th.LongTensor(data[:, 1])
o = th.LongTensor(data[:, 2])
return th.stack([s, r, o])
def get_triples(self, mask_mode):
def get_graph_directed_from_triples(self, triples, format='graph'):
s = th.LongTensor(triples[:, 0])
r = th.LongTensor(triples[:, 1])
o = th.LongTensor(triples[:, 2])
if format == 'graph':
edge_dict = {}
for i in range(self.num_rels):
mask = (r == i)
edge_name = (self.category, str(i), self.category)
edge_dict[edge_name] = (s[mask], o[mask])
return dgl.heterograph(edge_dict, {self.category: self.num_nodes})
def get_triples(self, g, mask_mode):
'''
:param g:
:param mask_mode: should be one of 'train_mask', 'val_mask', 'test_mask'
:return:
'''
g = self.homo_g
edges = g.edges()
etype = g.edata['etype']
mask = g.edata.pop(mask_mode)
return th.stack((edges[0][mask], etype[mask], edges[1][mask]))

def homo_to_hetero(self, g):
def get_all_triplets(self, dataset):
train_data = th.LongTensor(dataset.train)
valid_data = th.LongTensor(dataset.valid)
test_data = th.LongTensor(dataset.test)
return train_data, valid_data, test_data

def get_idx(self):
return self.train_hg, self.valid_hg, self.test_hg

def split_graph(self, g, mode='train'):
"""

Parameters
----------
g: DGLGraph
a homogeneous graph format
mode: str
split the subgraph according to the mode

Returns
-------
hg: DGLHeterograph
"""
edges = g.edges()
etype = g.edata['etype']
train_mask = g.edata['train_mask']
hg = self.build_graph((edges[0][train_mask], edges[1][train_mask]), etype[train_mask])
if mode == 'train':
mask = g.edata['train_mask']
elif mode == 'valid':
mask = g.edata['valid_edge_mask']
elif mode == 'test':
mask = g.edata['test_edge_mask']
hg = self.build_graph((edges[0][mask], edges[1][mask]), etype[mask])
return hg

def build_graph(self, edges, etype):


openhgnn/dataset/NodeClassificationDataset.py (+10, -0)

@@ -178,6 +178,12 @@ class HIN_NodeClassification(NodeClassificationDataset):
category = 'A'
g = dataset[0].long()
num_classes = 4
self.meta_paths_dict = {
'APVPA': [('A', 'A-P', 'P'), ('P', 'P-V', 'V'), ('V', 'V-P', 'P'), ('P', 'P-A', 'A')],
'APA': [('A', 'A-P', 'P'), ('P', 'P-A', 'A')],
}
self.meta_paths = [(('A', 'A-P', 'P'), ('P', 'P-V', 'V'), ('V', 'V-P', 'P'), ('P', 'P-A', 'A')),
(('A', 'A-P', 'P'), ('P', 'P-A', 'A'))]
self.in_dim = g.ndata['h'][category].shape[1]

elif name_dataset == 'imdb4MAGNN':
@@ -197,6 +203,10 @@ class HIN_NodeClassification(NodeClassificationDataset):
category = 'paper'
g = dataset[0].long()
num_classes = 3
self.meta_paths_dict = {'PAPSP': [('paper', 'paper-author', 'author'), ('author', 'author-paper', 'paper'),
('paper', 'paper-subject', 'subject'),
('subject', 'subject-paper', 'paper')]
}
self.meta_paths = [(('paper', 'paper-author', 'author'), ('author', 'author-paper', 'paper'),
('paper', 'paper-subject', 'subject'), ('subject', 'subject-paper', 'paper'))]
self.in_dim = g.ndata['h'][category].shape[1]


openhgnn/dataset/README.md (+10, -2)

@@ -44,10 +44,14 @@ So dataset should load not only a heterograph[DGLGraph], but also some index inv
| imdb4GTN | 4,661 | 5,841 | 2,270 | 13,983 | 4,661 | 300 | 300 | 2,339 |
| imdb4MAGNN | 4,278 | 5,257 | 2,081 | 12,828 | 4,278 | 400 | 400 | 3,478 |

- **HGB_NodeClassification**
  - **HGBn**

The datasets are HGB for Node Classification

**Note**: The test data labels are randomly replaced to prevent data leakage issues, refer to [HGB](https://github.com/THUDM/HGB).

In OpenHGNN, you will get the test results in `./openhgnn/output/{model_name}/`. If you want to obtain test scores, you need to submit your prediction to HGB's [website](https://www.biendata.xyz/hgb/).

- HGBn-ACM

| paper | author | subject | term | paper-author | paper-paper | paper-subject | paper-term | Val | Test |
@@ -86,10 +90,14 @@ So dataset should load not only a heterograph[DGLGraph], but also some index inv

- ###### academic4HetGNN

- **HGBl-LinkPrediction**
- **HGBl**

The datasets are HGB for Link Prediction.

**Note**: The test data labels are randomly replaced to prevent data leakage issues, refer to [HGB](https://github.com/THUDM/HGB).

In OpenHGNN, you will get the test results in `./openhgnn/output/{model_name}/`. If you want to obtain test scores, you need to submit your prediction to HGB's [website](https://www.biendata.xyz/hgb/).

- HGBl-amazon

| | product | features | product-product0 | product-product1 | test : product-product0 | test : product-product1 |


openhgnn/dataset/__init__.py (+1, -1)

@@ -50,7 +50,7 @@ def build_dataset(dataset, task):
_dataset = 'rdf_' + task
elif dataset in ['acm4NSHE', 'acm4GTN', 'academic4HetGNN', 'acm_han', 'acm_han_raw', 'acm4HeCo', 'dblp', 'dblp4MAGNN',
'imdb4MAGNN', 'imdb4GTN', 'acm4NARS', 'demo_graph', 'yelp4HeGAN', 'DoubanMovie', 'Book-Crossing',
'amazon4SLICE', 'HNE-PubMed', 'HGBl-ACM', 'HGBl-DBLP', 'HGBl-IMDB']:
'amazon4SLICE', 'MTWM', 'HNE-PubMed', 'HGBl-ACM', 'HGBl-DBLP', 'HGBl-IMDB']:
_dataset = 'hin_' + task
elif dataset in ['ogbn-mag']:
_dataset = 'ogbn_' + task


openhgnn/models/HDE.py (+65, -0)

@@ -0,0 +1,65 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from . import BaseModel, register_model


class GNN(nn.Module):
"""
Aggregate 2-hop neighbor.
"""
def __init__(self, input_dim, output_dim, num_neighbor, use_bias=True):
super(GNN, self).__init__()
self.input_dim = int(input_dim)
self.num_fea = int(input_dim)
self.output_dim = int(output_dim)
self.num_neighbor = num_neighbor
self.use_bias = use_bias
self.linear1 = nn.Linear(self.input_dim * 2, 64)
self.linear2 = nn.Linear(64+self.num_fea, 64)
self.linear3 = nn.Linear(64, self.output_dim)

def forward(self, fea):
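        # fea packs, per node: [node feature | num_neighbor 1-hop features | num_neighbor**2 2-hop features],
        # each block num_fea columns wide; the slices below split these blocks apart.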
node = fea[:, :self.num_fea]
neigh1 = fea[:, self.num_fea:self.num_fea * (self.num_neighbor + 1)]
neigh1 = torch.reshape(neigh1, [-1, self.num_neighbor, self.num_fea])

neigh2 = fea[:, self.num_fea * (self.num_neighbor + 1):]
neigh2 = torch.reshape(neigh2, [-1, self.num_neighbor, self.num_neighbor, self.num_fea])
neigh2_agg = torch.mean(neigh2, dim=2)
tmp = torch.cat([neigh1, neigh2_agg], dim=2)
tmp = F.relu(self.linear1(tmp))
emb = torch.cat([node, torch.mean(tmp, dim=1)], dim=1)
emb = F.relu(self.linear2(emb))
emb = F.relu(self.linear3(emb))

return emb


@register_model('HDE')
class HDE(BaseModel):
def __init__(self, input_dim, output_dim, num_neighbor, use_bias=True):
super(HDE, self).__init__()
self.input_dim = int(input_dim)
self.output_dim = int(output_dim)
self.num_neighbor = num_neighbor
self.use_bias = use_bias
self.aggregator = GNN(input_dim=input_dim, output_dim=output_dim, num_neighbor=num_neighbor)
self.linear1 = nn.Linear(2*self.output_dim, 32)
self.linear2 = nn.Linear(32, 2)

def forward(self, fea_a, fea_b):
emb_a = self.aggregator(fea_a)
emb_b = self.aggregator(fea_b)
emb = torch.cat([emb_a, emb_b], dim=1)
emb = F.relu(self.linear1(emb))
output = self.linear2(emb)

return output

@classmethod
def build_model_from_args(cls, args, hg):
return cls(input_dim=args.input_dim,
output_dim=args.output_dim,
num_neighbor=args.num_neighbor,
use_bias=args.use_bias)

openhgnn/models/Metapath2vec.py → openhgnn/models/SkipGram.py (+3, -2)

@@ -10,14 +10,15 @@ from . import BaseModel, register_model
"""


@register_model('HERec')
@register_model('Metapath2vec')
class HeteroEmbedding(BaseModel):
class SkipGram(BaseModel):
@classmethod
def build_model_from_args(cls, args, hg):
return cls(hg.num_nodes(), args.dim)

def __init__(self, num_nodes, dim):
super(HeteroEmbedding, self).__init__()
super(SkipGram, self).__init__()
self.embedding_dim = dim

self.u_embeddings = nn.Embedding(num_nodes, self.embedding_dim,

openhgnn/models/__init__.py (+3, -1)

@@ -56,7 +56,8 @@ SUPPORTED_MODELS = {
'RGCN': 'openhgnn.models.RGCN',
"RGAT": 'openhgnn.models.RGAT',
'RSHN': 'openhgnn.models.RSHN',
'Metapath2vec': 'openhgnn.models.Metapath2vec',
'Metapath2vec': 'openhgnn.models.SkipGram',
'HERec': 'openhgnn.models.SkipGram',
'HAN': 'openhgnn.models.HAN',
#'HGT': 'openhgnn.models.HGT',
'HeCo': 'openhgnn.models.HeCo',
@@ -77,4 +78,5 @@ SUPPORTED_MODELS = {
'GAT': 'space4hgnn.homo_models.GAT',
'homo_GNN': 'openhgnn.models.homo_GNN',
'general_HGNN': 'openhgnn.models.general_HGNN',
'HDE': 'openhgnn.models.HDE',
}

openhgnn/output/HDE/README.md (+55, -0)

@@ -0,0 +1,55 @@
# HDE[ICDM2021]

Paper: [Heterogeneous Graph Neural Network with Distance Encoding](http://www.shichuan.org/doc/116.pdf)

## How to run

* Clone the Openhgnn-DGL

```bash
python main.py -m HDE -d HGBl-IMDB -t link_prediction -g 0 --use_best_config
```

If you do not have a GPU, set --gpu -1.

## Performance

| Dataset | AUC_ROC |
| --------- | ------- |
| HGBl-IMDB | 0.9151 |
| HGBl-ACM | 0.8741 |
| HGBl-DBLP | 0.9836 |

### TrainerFlow

```hde_trainer```

### model

```HDE```

### Dataset

Supported datasets:

* HGBl-IMDB
* HGBl-ACM
* HGBl-DBLP

### Hyper-parameter specific to the model

```python
emb_dim = 128 # dimension of HDE
k_hop = 2 # radius when sampling ego graph for a node
max_dist = 3 # max value of SPD, usually set to k_hop + 1
```
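
A minimal sketch (mirroring the feature-size computation in `hde_trainer.py`, and assuming the default two node types of the target link) of how these settings determine the width of the distance-encoded input:

```python
node_type = 2        # node types involved in the target link ('A', 'B')
max_dist = 3         # SPD truncation
num_neighbor = 5     # neighbors sampled per hop

# one-hot SPD to both anchor nodes, plus a one-hot node-type encoding
num_fea = (node_type * (max_dist + 1)) * 2 + node_type          # -> 18
# the node itself + 1-hop neighbors + 2-hop neighbors, each num_fea wide
sample_size = num_fea * (1 + num_neighbor + num_neighbor ** 2)  # -> 558
print(num_fea, sample_size)
```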

## More

#### Contributor

Zhiyuan Lu[GAMMA LAB]

#### If you have any questions,

Submit an issue or email to [luzy@bupt.edu.cn](luzy@bupt.edu.cn)

openhgnn/output/HERec/README.md (+21, -0)

@@ -0,0 +1,21 @@
# HERec[TKDE2018]

Paper: [**Heterogeneous Information Network Embedding for Recommendation**](https://ieeexplore.ieee.org/abstract/document/8355676)

Code from author: https://github.com/librahu/HERec

## How to run

```bash
python main.py -m HERec -t node_classification -d dblp4MAGNN -g 0
```

If you do not have a GPU, set --gpu -1.

## Performance

### Node Classification

| Dataset | Metapath | Macro-F1 | Micro-F1 |
| ------------- |-------- | -------- | -------- |
| dblp | APVPA | 0.9277 | 0.9331 |
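
As a rough sketch of the preprocessing step (mirroring `preprocess` in `herec_trainer.py`, with a toy heterograph assumed here purely for illustration): the adjacency matrices along the chosen metapath are multiplied to build a metapath-induced homogeneous graph, and the random walks are then run on that graph.

```python
import dgl
import torch.sparse as sparse

# Toy author-paper heterograph (an assumption, for illustration only).
hg = dgl.heterograph({
    ('author', 'ap', 'paper'): ([0, 0, 1], [0, 1, 1]),
    ('paper', 'pa', 'author'): ([0, 1, 1], [0, 0, 1]),
})

# Multiply adjacency matrices along the metapath (here A-P-A) to obtain the
# metapath-induced author-author graph, as herec_trainer.preprocess does.
adj = None
for i, etype in enumerate(['ap', 'pa']):
    adj = hg.adj(etype=etype) if i == 0 else sparse.mm(adj, hg.adj(etype=etype))
adj = adj.coalesce()

g = dgl.graph(data=(adj.indices()[0], adj.indices()[1]))
g.edata['rw_prob'] = adj.values()  # path counts reused as walk probabilities
```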

openhgnn/output/RGCN/README.md (+26, -3)

@@ -11,14 +11,37 @@
- Clone the Openhgnn-DGL

```bash
# For node classification task
python main.py -m RGCN -t node_classification -d aifb -g 0 --use_best_config
# For link prediction task
python main.py -m RGCN -t link_prediction -d HGBl-amazon -g 0 --use_best_config
```

If you do not have a GPU, set --gpu -1.
##### Supported dataset
###### Node classification:
[RDFDataset](../../dataset/#RDF_NodeCLassification)[aifb/mutag/bgs/am], [HGBn](../../dataset/#HGBn) and other datasets for node classification.
###### Link prediction:

## Performance

#### Task: Node classification

Evaluation metric: accuracy

| Method | AIFB | MUTAG | BGS | AM |
| -------- | ----- | ----- | ----- | ----- |
| **RGCN** | 97.22 | 72.06 | 96.55 | 88.89 |

-d means dataset, candidate dataset: aifb/mutag/bgs/am. Refer to [RDFDataset](../../dataset/#RDF_NodeCLassification) to get more infos.
#### Task: Link prediction

## Performance: Node classification
Evaluation metric: roc_auc

| Method | AIFB | MUTAG | BGS | AM |
| -------------------- | ----- | ----- | ----- | ----- |


openhgnn/output/metapath2vec/README.md (+24, -0)

@@ -1 +1,25 @@
# Metapath2vec[KDD2017]

Paper: [**metapath2vec: Scalable Representation Learning for Heterogeneous Networks**](https://ericdongyx.github.io/metapath2vec/m2v.html)

Code from author: https://www.dropbox.com/s/w3wmo2ru9kpk39n/code_metapath2vec.zip

Code from dgl Team: https://github.com/dmlc/dgl/tree/master/examples/pytorch/metapath2vec

## How to run

```bash
python main.py -m Metapath2vec -t node_classification -d dblp4MAGNN -g 0
```

If you do not have a GPU, set --gpu -1.

## Performance

### Node Classification

| Dataset | Metapath | Macro-F1 | Micro-F1 |
| ------------- |-------- | -------- | -------- |
| dblp | APVPA | 0.9241 | 0.9288 |
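
The Macro-F1/Micro-F1 above are computed by fitting a simple classifier on the learned node embeddings (an `f1_lr`-style evaluation, as used in `herec_trainer.py`). A minimal sketch of that kind of evaluation, with the embeddings, labels and logistic-regression setup assumed here for illustration rather than taken from the actual evaluator:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Hypothetical embeddings and labels, for illustration only.
emb = np.random.rand(200, 128)           # one 128-d embedding per labelled author
labels = np.random.randint(0, 4, 200)    # 4 classes, as in dblp4MAGNN

clf = LogisticRegression(max_iter=1000).fit(emb[:160], labels[:160])
pred = clf.predict(emb[160:])
print('Macro-F1:', f1_score(labels[160:], pred, average='macro'))
print('Micro-F1:', f1_score(labels[160:], pred, average='micro'))
```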



openhgnn/sampler/hde_sampler.py (+243, -0)

@@ -0,0 +1,243 @@
import networkx as nx
import random
import torch
from tqdm import tqdm
import numpy as np
from itertools import combinations


class HDESampler:
"""
    Sampler for the HDE model. This sampler performs the data preprocessing
    and computes the HDE for each node. Please note that the HDE of a node
    can differ for different target node sets.
    For more details, please refer to the original paper: http://www.shichuan.org/doc/116.pdf
"""
def __init__(self, that):
self.mini_batch = []
self.type2idx = that.type2idx
self.node_type = that.node_type
self.target_link = that.target_link
self.num_fea = that.num_fea
self.sample_size = that.sample_size
self.hg = that.hg
self.device = that.device

def preprocess(self):
self.preprocess_feature()

def dgl2nx(self):
"""
Convert the dgl graph data to networkx data structure.
:return:
"""
test_ratio = 0.1
nx_g = nx.Graph()
sp = 1 - test_ratio * 2

edges = self.hg.edges(etype=self.target_link[0][1])
a_list = edges[0]
b_list = edges[1]
edge_list = []
for i in range(self.hg.num_edges(self.target_link[0][1])):
edge_list.append(['A' + str(int(a_list[i])), 'B' + str(int(b_list[i]))])

num_edge = len(edge_list)
sp1 = int(num_edge * sp)
sp2 = int(num_edge * test_ratio)
self.g_train = nx.Graph()
self.g_val = nx.Graph()
self.g_test = nx.Graph()

self.g_train.add_edges_from(edge_list[:sp1])
self.g_val.add_edges_from(edge_list[sp1:sp1 + sp2])
self.g_test.add_edges_from(edge_list[sp1 + sp2:])

def compute_hde(self, args):
"""
Compute hde for training set, validation set and test set.
:param args: arguments
:return:
"""
print("Computing HDE for training set...")
self.data_A_train, self.data_B_train, self.data_y_train = self.batch_data(self.g_train, args)
print("Computing HDE for validation set...")
self.data_A_val, self.data_B_val, self.data_y_val = self.batch_data(self.g_val, args)
print("Computing HDE for test set...")
self.data_A_test, self.data_B_test, self.data_y_test = self.batch_data(self.g_test, args)

val_batch_A_fea = self.data_A_val.reshape(-1, self.sample_size)
val_batch_B_fea = self.data_B_val.reshape(-1, self.sample_size)
val_batch_y = self.data_y_val.reshape(-1)
self.val_batch_A_fea = torch.FloatTensor(val_batch_A_fea).to(self.device)
self.val_batch_B_fea = torch.FloatTensor(val_batch_B_fea).to(self.device)
self.val_batch_y = torch.LongTensor(val_batch_y).to(self.device)
test_batch_A_fea = self.data_A_test.reshape(-1, self.sample_size)
test_batch_B_fea = self.data_B_test.reshape(-1, self.sample_size)
test_batch_y = self.data_y_test.reshape(-1)
self.test_batch_A_fea = torch.FloatTensor(test_batch_A_fea).to(self.device)
self.test_batch_B_fea = torch.FloatTensor(test_batch_B_fea).to(self.device)
self.test_batch_y = torch.LongTensor(test_batch_y).to(self.device)

return self.data_A_train, self.data_B_train, self.data_y_train, \
self.val_batch_A_fea, self.val_batch_B_fea, self.val_batch_y, \
self.test_batch_A_fea, self.test_batch_B_fea, self.test_batch_y

def batch_data(self, g, args):
"""
Generate batch data.
:param g: graph data
:param args: arguments
:return:
"""
edge = list(g.edges)
nodes = list(g.nodes)
num_batch = int(len(edge) * 2 / args.batch_size)
random.shuffle(edge)
data = []
edge = edge[0: num_batch * args.batch_size // 2]
for bx in tqdm(edge):
posA, posB = self.subgraph_sampling_with_DE_node_pair(g, bx, args)
data.append([posA, posB, 1])

neg_tmpB_id = random.choice(nodes)
negA, negB = self.subgraph_sampling_with_DE_node_pair(g,
[bx[0], neg_tmpB_id],
args)
data.append([negA, negB, 0])

random.shuffle(data)
data = np.array(data)
data = data.reshape(num_batch, args.batch_size, 3)
data_A = data[:, :, 0].tolist()
data_B = data[:, :, 1].tolist()
data_y = data[:, :, 2].tolist()
for i in range(len(data_A)):
for j in range(len(data[0])):
data_A[i][j] = data_A[i][j].tolist()
data_B[i][j] = data_B[i][j].tolist()
data_A = np.squeeze(np.array(data_A))
data_B = np.squeeze(np.array(data_B))
data_y = np.squeeze(np.array(data_y))
return data_A, data_B, data_y

def subgraph_sampling_with_DE_node_pair(self,
G,
node_pair,
args):
"""
compute distance encoding given a target node set
:param G: graph data
:param node_pair: target node set
:param args: arguments
:return:
"""
[A, B] = node_pair
A_ego = nx.ego_graph(G, A, radius=args.k_hop)
B_ego = nx.ego_graph(G, B, radius=args.k_hop)
sub_G_for_AB = nx.compose(A_ego, B_ego)
sub_G_for_AB.remove_edges_from(combinations(node_pair, 2))

sub_G_nodes = sub_G_for_AB.nodes
SPD_based_on_node_pair = {}
for node in sub_G_nodes:
tmpA = self.dist_encoder(A, node, sub_G_for_AB, args)
tmpB = self.dist_encoder(B, node, sub_G_for_AB, args)
SPD_based_on_node_pair[node] = np.concatenate([tmpA, tmpB], axis=0)

A_fea_batch = self.gen_fea_batch(sub_G_for_AB,
A,
SPD_based_on_node_pair,
args)
B_fea_batch = self.gen_fea_batch(sub_G_for_AB,
B,
SPD_based_on_node_pair,
args)
return A_fea_batch, B_fea_batch

def gen_fea_batch(self, G, root, fea_dict, args):
"""
        Neighbor sampling for each node. The neighbor features are concatenated here
        and separated again inside the model.
:param G: graph data
:param root: root node
:param fea_dict: node features
:param args: arguments
:return:
"""
fea_batch = []
self.mini_batch.append([root])
a = [0] * (self.num_fea - self.node_type) + self.type_encoder(root)
fea_batch.append(np.asarray(a,
dtype=np.float32
).reshape(-1, self.num_fea)
)

ns_1 = [list(np.random.choice(list(G.neighbors(node)) + [node],
args.num_neighbor,
replace=True))
for node in self.mini_batch[-1]]
self.mini_batch.append(ns_1[0])
de_1 = [
np.concatenate([fea_dict[dest], np.asarray(self.type_encoder(dest))], axis=0)
for dest in ns_1[0]
]

fea_batch.append(np.asarray(de_1,
dtype=np.float32).reshape(1, -1)
)
# 2-order
ns_2 = [list(np.random.choice(list(G.neighbors(node)) + [node],
args.num_neighbor,
replace=True))
for node in self.mini_batch[-1]]
de_2 = []
for i in range(len(ns_2)):
tmp = []
for j in range(len(ns_2[0])):
tmp.append(
np.concatenate(
[fea_dict[ns_2[i][j]], np.asarray(self.type_encoder(ns_2[i][j]))],
axis=0)
)
de_2.append(tmp)

fea_batch.append(np.asarray(de_2,
dtype=np.float32).reshape(1, -1)
)
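        # Total width: num_fea * (1 + num_neighbor + num_neighbor**2) columns
        # (the node itself, its sampled 1-hop neighbors, then their sampled neighbors),
        # which matches sample_size in hde_trainer.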

return np.concatenate(fea_batch, axis=1)

def dist_encoder(self, src, dest, G, args):
"""
compute H_SPD for a node pair
:param src: source node
:param dest: target node
:param G: graph data
:param args: arguments
:return:
"""
paths = list(nx.all_simple_paths(G, src, dest, cutoff=args.max_dist + 1))
        cnt = [args.max_dist] * self.node_type  # truncate the SPD at max_dist
for path in paths:
res = [0] * self.node_type
for i in range(1, len(path)):
tmp = path[i][0]
res[self.type2idx[tmp]] += 1
# print(path, res)
for k in range(self.node_type):
cnt[k] = min(cnt[k], res[k])
one_hot_list = [np.eye(args.max_dist + 1, dtype=np.float64)[cnt[i]]
for i in range(self.node_type)]
return np.concatenate(one_hot_list)

def type_encoder(self, node):
"""
perform one-hot encoding based on the node type.
:param node:
:return:
"""
res = [0] * self.node_type
res[self.type2idx[node[0]]] = 1.0
return res


openhgnn/sampler/mp2vec_sampler.py → openhgnn/sampler/random_walk_sampler.py (+32, -19)

@@ -4,19 +4,16 @@ from torch.utils.data import Dataset
import torch


class Metapath2VecSampler(Dataset):
def __init__(self, hg, metapath, start_ntype: str, rw_length: int, rw_walks: int, window_size: int, neg_size: int):
class RandomWalkSampler(Dataset):
def __init__(self, g, rw_walks: int, window_size: int, neg_size: int, metapath=None, rw_prob=None, rw_length=None):

self.hg = hg
self.g = g
self.start_idx = [0]
for i, num_nodes in enumerate([hg.num_nodes(ntype) for ntype in hg.ntypes]):
if i < len(hg.ntypes) - 1:
for i, num_nodes in enumerate([g.num_nodes(ntype) for ntype in g.ntypes]):
if i < len(g.ntypes) - 1:
self.start_idx.append(num_nodes + self.start_idx[i])
self.node_frequency = [0] * hg.num_nodes()
self.metapath = metapath
self.start_ntype = start_ntype
self.traces = []
self.negatives = []
self.node_frequency = [0] * g.num_nodes()
self.rw_prob = rw_prob
self.rw_length = rw_length
self.rw_walks = rw_walks
self.window_size = window_size
@@ -24,18 +21,28 @@ class Metapath2VecSampler(Dataset):
self.neg_pos = 0
self.neg_size = neg_size

if metapath is not None:
start_ntype = self.g.to_canonical_etype(metapath[0])[0]
self.start_nodes = list(range(self.g.num_nodes(start_ntype))) * self.rw_walks
else:
self.start_nodes = list(range(self.g.num_nodes())) * self.rw_walks
self.metapath = metapath

self.traces = []
self.negatives = []
self.discards = []
self.trace_ntype = []
self.traces_idx_of_all_types = []

self.__generate_metapath()
self.__generate_node_frequency()
self.__generate_negatives()
self.__generate_discards()

def __generate_metapath(self):
start_nodes = list(
range(self.hg.num_nodes(self.start_ntype))) * self.rw_walks # start_nodes + [i] * self.rw_walks
self.traces, self.trace_ntype = dgl.sampling.random_walk(g=self.hg, nodes=start_nodes,
metapath=self.metapath * self.rw_length)
self.traces, self.trace_ntype = dgl.sampling.random_walk(g=self.g, nodes=self.start_nodes,
metapath=self.metapath, prob=self.rw_prob,
length=self.rw_length)

self.traces_idx_of_all_types = torch.index_select(torch.tensor(self.start_idx), 0, self.trace_ntype).repeat(
len(self.traces), 1) + self.traces
@@ -56,20 +63,26 @@ class Metapath2VecSampler(Dataset):
np.random.shuffle(self.negatives)
self.sampling_prob = ratio

def __generate_discards(self):
t = 0.0001
f = np.array(self.node_frequency) / sum(self.node_frequency)
self.discards = np.sqrt(t / f) + (t / f)
return

def get_center_context_negatives(self, idx):
# return all (center, context, 5 negatives) on one trace
# return all (center, context, n negatives) on one trace
trace = self.traces_idx_of_all_types[idx]
pair_catch = []
for i, u in enumerate(trace):
if np.random.rand() > self.discards[u]:
continue
for j, v in enumerate(
trace[max(i - self.window_size, 0):i + self.window_size]):
                # todo: why can idx be < 0?
if u < 0 or v < 0:
# print('warning: less than 0')
continue
if u >= self.hg.num_nodes() or v >= self.hg.num_nodes():
# print('warning: bigger than num_nodes')
if u >= self.g.num_nodes() or v >= self.g.num_nodes():
continue

pair_catch.append((int(u), int(v), self.__get_negatives()))
return pair_catch


openhgnn/sampler/test_mp2vec_sampler.py (+19, -4)

@@ -1,5 +1,6 @@
import dgl
from .mp2vec_sampler import Metapath2VecSampler
from .random_walk_sampler import RandomWalkSampler
import torch

hg = dgl.heterograph({
('user', 'follow', 'user'): ([0, 1, 1, 2, 3], [1, 2, 3, 0, 0]),
@@ -9,7 +10,21 @@ hg = dgl.heterograph({

def test_get_center_context_negatives():
rw_walks = 5
sampler = Metapath2VecSampler(hg=hg, metapath=['follow', 'view', 'viewed-by'], start_ntype='user', rw_length=3,
rw_walks=rw_walks, window_size=3, neg_size=5)
hetero_sampler = RandomWalkSampler(g=hg, metapath=['follow', 'view', 'viewed-by'] * 3,
rw_walks=rw_walks, window_size=3, neg_size=5)
for i in range(hg.num_nodes('user') * rw_walks):
print(sampler.get_center_context_negatives(i))
print(hetero_sampler.get_center_context_negatives(i))

metapath = ['view', 'viewed-by']
for i, elem in enumerate(metapath):
if i == 0:
adj = hg.adj(etype=elem)
else:
adj = torch.sparse.mm(adj, hg.adj(etype=elem))
adj = adj.coalesce()
g = dgl.graph(data=(adj.indices()[0], adj.indices()[1]))
g.edata['rw_prob'] = adj.values()
homo_sampler = RandomWalkSampler(g=g, rw_length=10, rw_walks=rw_walks, window_size=3, neg_size=5)

for i in range(g.num_nodes() * rw_walks):
print(homo_sampler.get_center_context_negatives(i))

openhgnn/start.py (+5, -4)

@@ -1,4 +1,4 @@
from .utils import set_random_seed, set_best_config
from .utils import set_random_seed, set_best_config, Logger
from .trainerflow import build_flow
from .auto import hpo_experiment

@@ -6,12 +6,11 @@ from .auto import hpo_experiment
def OpenHGNN(args):
if not getattr(args, 'seed', False):
args.seed = 0
set_random_seed(args.seed)

args.logger = Logger(args)
if getattr(args, "use_best_config", False):
args = set_best_config(args)
set_random_seed(args.seed)
trainerflow = SpecificTrainerflow.get(args.model, args.task)
print(args)
if getattr(args, "use_hpo", False):
# hyper-parameter search
hpo_experiment(args, trainerflow)
@@ -31,6 +30,8 @@ SpecificTrainerflow = {
'DMGI': 'DMGI_trainer',
'KGCN': 'kgcntrainer',
'Metapath2vec': 'mp2vec_trainer',
'HERec': 'herec_trainer',
'SLiCE':'slicetrainer',
'HeGAN': 'HeGAN_trainer',
'HDE': 'hde_trainer',
}

openhgnn/tasks/link_prediction.py (+157, -13)

@@ -1,4 +1,7 @@
import dgl
import torch as th
import torch.nn.functional as F
from dgl.dataloading.negative_sampler import Uniform
from . import BaseTask, register_task
from ..dataset import build_dataset
from ..utils import Evaluator
@@ -30,8 +33,27 @@ class LinkPrediction(BaseTask):
self.dataset = build_dataset(args.dataset, 'link_prediction')
# self.evaluator = Evaluator()
self.train_hg, self.val_hg, self.test_hg = self.dataset.get_idx()
self.val_hg = self.val_hg.to(args.device)
self.test_hg = self.test_hg.to(args.device)
self.evaluator = Evaluator(args.seed)

if not hasattr(args, 'score_fn'):
self.ScorePredictor = HeteroDistMultPredictor()
args.score_fn = 'distmult'
elif args.score_fn == 'dot-product':
self.ScorePredictor = HeteroDotProductPredictor()
elif args.score_fn == 'distmult':
self.ScorePredictor = HeteroDistMultPredictor()
self.negative_sampler = Uniform(1)
if args.dataset in ['wn18', 'FB15k', 'FB15k-237']:
self.evaluation_metric = 'mrr'
else:
self.evaluation_metric = 'roc_auc'
args.logger.info('[Init Task] The task: link prediction, the dataset: {}, the evaluation metric is {}, '
'the score function: {} '.format(self.n_dataset, self.evaluation_metric, args.score_fn))
def get_graph(self):
return self.dataset.g

@@ -48,34 +70,156 @@ class LinkPrediction(BaseTask):
elif name == 'roc_auc':
return self.evaluator.cal_roc_auc

def evaluate(self, name, logits):
def evaluate(self, n_embedding, r_embedding=None, mode='test'):
r"""

Parameters
----------
logits : th.Tensor
the prediction of specific
name

n_embedding: th.Tensor
the embedding of nodes
r_embedding: th.Tensor
the embedding of relation types
mode: str
the evaluation mode, train/valid/test
Returns
-------

"""
if name == 'acc':
if self.evaluation_metric == 'acc':
return self.evaluator.author_link_prediction
elif name == 'mrr':
return self.evaluator.mrr_
elif name == 'academic_lp':
return self.evaluator.author_link_prediction(logits, self.dataset.train_batch, self.dataset.test_batch)
elif self.evaluation_metric == 'mrr':

return self.evaluator.mrr_(n_embedding['_N'], self.dict2emd(r_embedding), self.dataset.train_triplets,
self.dataset.valid_triplets, self.dataset.test_triplets,
hits=[1, 3, 10], eval_bz=100)
elif self.evaluation_metric == 'academic_lp':
return self.evaluator.author_link_prediction(n_embedding, self.dataset.train_batch, self.dataset.test_batch)
elif self.evaluation_metric == 'roc_auc':
if mode == 'test':
eval_hg = self.test_hg
elif mode == 'valid':
eval_hg = self.val_hg
else:
raise ValueError('Mode error, supported test and valid.')
negative_graph = self.construct_negative_graph(eval_hg)
p_score = th.sigmoid(self.ScorePredictor(eval_hg, n_embedding, r_embedding))
n_score = th.sigmoid(self.ScorePredictor(negative_graph, n_embedding, r_embedding))
p_label = th.ones(len(p_score))
n_label = th.zeros(len(n_score))
return self.evaluator.cal_roc_auc(th.cat((p_label, n_label)).cpu(), th.cat((p_score, n_score)).cpu())
else:
return self.evaluator.link_prediction

def get_batch(self):
return self.dataset.train_batch, self.dataset.test_batch

def get_idx(self):
return self.train_hg, self.val_hg, self.test_hg
def get_train(self):
return self.train_hg

def get_labels(self):
return self.dataset.get_labels()

def dict2emd(self, r_embedding):
r_emd = []
for i in range(self.dataset.num_rels):
r_emd.append(r_embedding[str(i)])
return th.stack(r_emd).squeeze()

def construct_negative_graph(self, hg):
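        # Uniform(1) corrupts each positive edge once by re-sampling its destination node
        # uniformly at random; the corrupted pairs form the negative graph scored against
        # the positive edges in evaluate().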
e_dict = {
etype: hg.edges(etype=etype, form='eid')
for etype in hg.canonical_etypes}
neg_srcdst = self.negative_sampler(hg, e_dict)
neg_pair_graph = dgl.heterograph(neg_srcdst,
{ntype: hg.number_of_nodes(ntype) for ntype in hg.ntypes})
return neg_pair_graph


class HeteroDotProductPredictor(th.nn.Module):
"""
    References: `DGL documentation <https://docs.dgl.ai/guide/training-link.html#heterogeneous-graphs>`_
"""
def forward(self, edge_subgraph, x, *args, **kwargs):
"""
Parameters
----------
edge_subgraph: dgl.Heterograph
the prediction graph only contains the edges of the target link
x: dict[str: th.Tensor]
the embedding dict. The key only contains the nodes involving with the target link.
Returns
-------
score: th.Tensor
the prediction of the edges in edge_subgraph
"""
with edge_subgraph.local_scope():
for ntype in edge_subgraph.ntypes:
edge_subgraph.nodes[ntype].data['x'] = x[ntype]
for etype in edge_subgraph.canonical_etypes:
edge_subgraph.apply_edges(
dgl.function.u_dot_v('x', 'x', 'score'), etype=etype)
score = edge_subgraph.edata['score']
if isinstance(score, dict):
result = []
for _, value in score.items():
result.append(value)
score = th.cat(result)
return score.squeeze()
class HeteroDistMultPredictor(th.nn.Module):
def forward(self, edge_subgraph, x, r_embedding, *args, **kwargs):
"""
DistMult factorization (Yang et al. 2014) as the scoring function,
which is known to perform well on standard link prediction benchmarks when used on its own.
In DistMult, every relation r is associated with a diagonal matrix :math:`R_{r} \in \mathbb{R}^{d \times d}`
and a triple (s, r, o) is scored as
.. math::
f(s, r, o)=e_{s}^{T} R_{r} e_{o}
Parameters
----------
edge_subgraph: dgl.Heterograph
the prediction graph only contains the edges of the target link
x: dict[str: th.Tensor]
the node embedding dict. The key only contains the nodes involving with the target link.
r_embedding: th.Tensor
the all relation types embedding
Returns
-------
score: th.Tensor
the prediction of the edges in edge_subgraph
"""
with edge_subgraph.local_scope():
for ntype in edge_subgraph.ntypes:
edge_subgraph.nodes[ntype].data['x'] = x[ntype]
for etype in edge_subgraph.canonical_etypes:
e = r_embedding[etype[1]]
n = edge_subgraph.num_edges(etype)
if 1 == len(edge_subgraph.canonical_etypes):
edge_subgraph.edata['e'] = e.expand(n, -1)
else:
edge_subgraph.edata['e'] = {etype: e.expand(n, -1)}
edge_subgraph.apply_edges(
dgl.function.u_mul_e('x', 'e', 's'), etype=etype)
edge_subgraph.apply_edges(
dgl.function.e_mul_v('s', 'x', 'score'), etype=etype)
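            # Each edge now holds the element-wise product x_u * R_r * x_v; summing it over
            # the hidden dimension below yields the DistMult score f(u, r, v).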

score = edge_subgraph.edata['score']
if isinstance(score, dict):
result = []
for _, value in score.items():
result.append(th.sum(value, dim=1))
score = th.cat(result)
else:
score = th.sum(score, dim=1)
return score


openhgnn/trainerflow/__init__.py (+3, -1)

@@ -59,7 +59,9 @@ SUPPORTED_FLOWS = {
'kgcntrainer': 'openhgnn.trainerflow.kgcn_trainer',
'HeGAN_trainer': 'openhgnn.trainerflow.HeGAN_trainer',
'mp2vec_trainer': 'openhgnn.trainerflow.mp2vec_trainer',
'herec_trainer': 'openhgnn.trainerflow.herec_trainer',
'HeCo_trainer': 'openhgnn.trainerflow.HeCo_trainer',
'DMGI_trainer': 'openhgnn.trainerflow.DMGI_trainer',
'slicetrainer':'openhgnn.trainerflow.slice_trainer',
'slicetrainer': 'openhgnn.trainerflow.slice_trainer',
'hde_trainer': 'openhgnn.trainerflow.hde_trainer',
}

openhgnn/trainerflow/base_flow.py (+65, -32)

@@ -14,17 +14,27 @@ class BaseFlow(ABC):
}

def __init__(self, args):
"""

Parameters
----------
args

Attributes
-------------
evaluate_interval: int
the interval of evaluation in validation
"""
super(BaseFlow, self).__init__()
self.evaluator = None
self.evaluate_interval = 1
self.load_from_checkpoint = True
if hasattr(args, '_checkpoint'):
self._checkpoint = os.path.join(args._checkpoint,
f"{args.model}_{args.dataset}.pt")
else:
if self.load_from_checkpoint:
if hasattr(args, 'load_from_pretrained'):
self._checkpoint = os.path.join("./openhgnn/output/{}".format(args.model),
f"{args.model}_{args.dataset}.pt")
f"{args.model}_{args.dataset}_{args.task}.pt")
else:
self._checkpoint = None

@@ -32,6 +42,7 @@ class BaseFlow(ABC):
args.HGB_results_path = os.path.join("./openhgnn/output/{}/{}_{}.txt".format(args.model, args.dataset[5:], args.seed))

self.args = args
self.logger = self.args.logger
self.model_name = args.model
self.device = args.device
self.task = build_task(args)
@@ -43,7 +54,7 @@ class BaseFlow(ABC):
self.optimizer = None
self.loss_fn = self.task.get_loss_fn()

def preprocess_feature(self):
def preprocess(self):
r"""
        Every trainerflow should call preprocess if it needs feature preprocessing.
        The parameters of input_feature will be added to the optimizer and input_feature will be added to the model.
@@ -66,38 +77,58 @@ class BaseFlow(ABC):
if hasattr(self.args, 'feat'):
pass
else:
# Default 0, nothing to do.
self.args.feat = 0
if self.args.feat == 0:
print("feat0, pass!")
if isinstance(self.hg.ndata['h'], dict):
self.input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg),
self.args.hidden_dim, act=act).to(self.device)
elif isinstance(self.hg.ndata['h'], torch.Tensor):
self.input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg),
self.args.hidden_dim, act=act).to(self.device)
self.feature_preprocess(act)
self.optimizer.add_param_group({'params': self.input_feature.parameters()})
# for early stop, load the model with input_feature module.
self.model.add_module('input_feature', self.input_feature)
self.load_from_pretrained()

def feature_preprocess(self, act):
"""
Feat
0, 1 ,2
Node feature
1 node type & more than 1 node types
no feature

Returns
-------

"""

if self.hg.ndata.get('h', {}) == {} or self.args.feat == 2:
if self.hg.ndata.get('h', {}) == {}:
self.logger.feature_info('Assign embedding as features, because hg.ndata is empty.')
else:
self.logger.feature_info('feat2, drop features!')
self.hg.ndata.pop('h')
self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
act=act).to(self.device)
elif self.args.feat == 0:
self.input_feature = self.init_feature(act)
elif self.args.feat == 1:
if self.args.task != 'node_classification':
print('it is only for node classification task!')
self.logger.feature_info('\'feat 1\' is only for node classification task, set feat 0!')
self.input_feature = self.init_feature(act)
else:
h_dict = self.hg.ndata.pop('h')
print('feat1, preserve target nodes!')
self.logger.feature_info('feat1, preserve target nodes!')
self.input_feature = HeteroFeature({self.category: h_dict[self.category]}, get_nodes_dict(self.hg), self.args.hidden_dim,
act=act).to(self.device)
elif self.args.feat == 2:
self.hg.ndata.pop('h')
self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
act=act).to(self.device)
print('feat2, drop features!')
# if isinstance(self.hg.ndata['h'], dict):
# self.input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg), self.args.hidden_dim, act=act).to(self.device)
# elif isinstance(self.hg.ndata['h'], torch.Tensor):
# self.input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg), self.args.hidden_dim, act=act).to(self.device)
# else:
# self.input_feature = HeteroFeature({}, get_nodes_dict(self.hg), self.args.hidden_dim,
# act=act).to(self.device)
self.optimizer.add_param_group({'params': self.input_feature.parameters()})
# for early stop, load the model with input_feature module.
self.model.add_module('input_feature', self.input_feature)

def init_feature(self, act):
self.logger.feature_info("Feat is 0, nothing to do!")
if isinstance(self.hg.ndata['h'], dict):
# The heterogeneous contains more than one node type.
input_feature = HeteroFeature(self.hg.ndata['h'], get_nodes_dict(self.hg),
self.args.hidden_dim, act=act).to(self.device)
elif isinstance(self.hg.ndata['h'], torch.Tensor):
# The heterogeneous only contains one node type.
input_feature = HeteroFeature({self.hg.ntypes[0]: self.hg.ndata['h']}, get_nodes_dict(self.hg),
self.args.hidden_dim, act=act).to(self.device)
return input_feature

@abstractmethod
def train(self):
@@ -128,13 +159,15 @@ class BaseFlow(ABC):
raise NotImplementedError

def load_from_pretrained(self):
if self.load_from_checkpoint:
if self.args.load_from_pretrained:
try:
ck_pt = torch.load(self._checkpoint)
self.model.load_state_dict(ck_pt)
self.logger.info('[Load Model] Load model from pretrained model:' + self._checkpoint)
except FileNotFoundError:
print(f"'{self._checkpoint}' doesn't exists")
return self.model
self.logger.info('[Load Model] Do not load the model from pretrained, '
                                 '{} doesn\'t exist'.format(self._checkpoint))
# return self.model

def save_checkpoint(self):
if self._checkpoint and hasattr(self.model, "_parameters()"):


openhgnn/trainerflow/hde_trainer.py (+107, -0)

@@ -0,0 +1,107 @@
import torch as th
from torch import nn
from tqdm import tqdm
import torch
from . import BaseFlow, register_flow
from ..models import build_model
from ..utils import extract_embed, EarlyStopping, get_nodes_dict
import torch.optim as optim
from sklearn.metrics import roc_auc_score
from ..sampler import hde_sampler

@register_flow("hde_trainer")
class hde_trainer(BaseFlow):
"""
HDE trainer flow.
Supported Model: HDE
    Supported Dataset: imdb4hde
    The trainerflow supports the link prediction task. It first calculates the HDE for every node in the graph,
    and uses the HDE as part of the initial feature for each node.
    Then it performs standard message passing and link prediction operations.
    Please note that the HDE of a node can differ for different target node sets.
For more details, please refer to the original paper: http://www.shichuan.org/doc/116.pdf
"""

def __init__(self, args):
super(hde_trainer, self).__init__(args)
self.target_link = self.task.dataset.target_link
self.loss_fn = self.task.get_loss_fn()
self.args.out_node_type = self.task.dataset.out_ntypes
self.type2idx = {'A': 0, 'B': 1}
self.node_type = len(self.type2idx)
self.num_fea = (self.node_type * (args.max_dist + 1)) * 2 + self.node_type
self.sample_size = self.num_fea * (1 + args.num_neighbor + args.num_neighbor * args.num_neighbor)
self.args.patience = 10
args.input_dim = self.num_fea
args.output_dim = args.emb_dim // 2

self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device)
# initialization
for m in self.model.modules():
if isinstance(m, (nn.Conv2d, nn.Linear)):
nn.init.kaiming_normal_(m.weight, mode='fan_in')
self.loss_fn = nn.CrossEntropyLoss().to(self.device)
self.optimizer = optim.Adam(self.model.parameters(), lr=args.lr, weight_decay=1e-2)
self.evaluator = roc_auc_score
self.HDE_sampler = hde_sampler.HDESampler(self)
self.HDE_sampler.dgl2nx()

self.data_A_train, self.data_B_train, self.data_y_train, \
self.val_batch_A_fea, self.val_batch_B_fea, self.val_batch_y, \
self.test_batch_A_fea, self.test_batch_B_fea, self.test_batch_y = self.HDE_sampler.compute_hde(args)

def train(self):
epoch_iter = tqdm(range(self.max_epoch))
stopper = EarlyStopping(self.args.patience, self._checkpoint)
for epoch in tqdm(range(self.max_epoch), ncols=80):
loss = self._mini_train_step()
if epoch % 2 == 0:
val_metric = self._test_step('valid')
epoch_iter.set_description(
f"Epoch: {epoch:03d}, roc_auc: {val_metric:.4f}, Loss:{loss:.4f}"
)
early_stop = stopper.step_score(val_metric, self.model)
if early_stop:
print('Early Stop!\tEpoch:' + str(epoch))
break
print(f"Valid_score_ = {stopper.best_score: .4f}")
stopper.load_model(self.model)

test_auc = self._test_step(split="test")
val_auc = self._test_step(split="valid")
print(f"Test roc_auc = {test_auc:.4f}")
return dict(Test_mrr=test_auc, Val_mrr=val_auc)

def _mini_train_step(self,):
self.model.train()
all_loss = 0
for (train_batch_A_fea, train_batch_B_fea, train_batch_y) in zip(self.data_A_train, self.data_B_train, self.data_y_train):
# train
self.model.train()
train_batch_A_fea = torch.FloatTensor(train_batch_A_fea).to(self.device)
train_batch_B_fea = torch.FloatTensor(train_batch_B_fea).to(self.device)
train_batch_y = torch.LongTensor(train_batch_y).to(self.device)
logits = self.model(train_batch_A_fea, train_batch_B_fea)
loss = self.loss_fn(logits, train_batch_y.squeeze())
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
all_loss += loss.item()
return all_loss

def _test_step(self, split=None, logits=None):
self.model.eval()
with th.no_grad():
if split == 'valid':
data_A_eval = self.val_batch_A_fea
data_B_eval = self.val_batch_B_fea
data_y_eval = self.val_batch_y
elif split == 'test':
data_A_eval = self.test_batch_A_fea
data_B_eval = self.test_batch_B_fea
data_y_eval = self.test_batch_y
logits = self.model(data_A_eval, data_B_eval)
pred = logits.argmax(dim=1)
metric = self.evaluator(data_y_eval.cpu().numpy(), pred.cpu().numpy())

return metric

openhgnn/trainerflow/herec_trainer.py (+87, -0)

@@ -0,0 +1,87 @@
import os
import numpy
from tqdm import tqdm
import torch.sparse as sparse
import torch.optim as optim
from torch.utils.data import DataLoader
from ..models import build_model
from . import BaseFlow, register_flow
from ..sampler import random_walk_sampler
import dgl


@register_flow("herec_trainer")
class HERecTrainer(BaseFlow):
def __init__(self, args):
super(HERecTrainer, self).__init__(args)
self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device)
self.random_walk_sampler = None

self.dataloader = None

self.metapath = self.task.dataset.meta_paths_dict[self.args.meta_path_keys[0]]
self.output_dir = './openhgnn/output/' + self.model_name
self.embeddings_file_path = os.path.join(self.output_dir, self.args.dataset + '_' +
self.args.meta_path_keys[0] + '_herec_embeddings.npy')
self.load_trained_embeddings = False

def preprocess(self):

for i, elem in enumerate(self.metapath):
if i == 0:
adj = self.hg.adj(etype=elem)
else:
adj = sparse.mm(adj, self.hg.adj(etype=elem))
adj = adj.coalesce()

g = dgl.graph(data=(adj.indices()[0], adj.indices()[1]))
g.edata['rw_prob'] = adj.values()

self.random_walk_sampler = random_walk_sampler.RandomWalkSampler(g=g.to('cpu'),
rw_length=self.args.rw_length,
rw_walks=self.args.rw_walks,
window_size=self.args.window_size,
neg_size=self.args.neg_size, rw_prob='rw_prob')

self.dataloader = DataLoader(self.random_walk_sampler, batch_size=self.args.batch_size,
shuffle=True, num_workers=self.args.num_workers,
collate_fn=self.random_walk_sampler.collate)

def train(self):
emb = self.load_embeddings()

# todo: only supports node classification now
self.task.evaluate(logits=emb, name='f1_lr')

def load_embeddings(self):
if not self.load_trained_embeddings or not os.path.exists(self.embeddings_file_path):
self.train_embeddings()
emb = numpy.load(self.embeddings_file_path)
return emb

def train_embeddings(self):
self.preprocess()

optimizer = optim.SparseAdam(list(self.model.parameters()), lr=self.args.lr)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader))

for epoch in range(self.max_epoch):
print('\n\n\nEpoch: ' + str(epoch + 1))
running_loss = 0.0
for i, sample_batched in enumerate(tqdm(self.dataloader)):

if len(sample_batched[0]) > 1:
pos_u = sample_batched[0].to(self.device)
pos_v = sample_batched[1].to(self.device)
neg_v = sample_batched[2].to(self.device)

scheduler.step()
optimizer.zero_grad()
loss = self.model.forward(pos_u, pos_v, neg_v)
loss.backward()
optimizer.step()

running_loss = running_loss * 0.9 + loss.item() * 0.1
if i > 0 and i % 50 == 0:
print(' Loss: ' + str(running_loss))
self.model.save_embedding(self.embeddings_file_path)
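The `preprocess` method above collapses a metapath into a single homogeneous graph by chaining sparse adjacency multiplications and keeping the product's values as random-walk weights. A minimal sketch of that idea with toy tensors (hypothetical author/paper adjacencies, plain `torch.sparse` instead of DGL):

```python
import torch

# Toy author->paper and paper->author adjacencies (2 authors, 3 papers).
ap = torch.tensor([[1., 1., 0.],
                   [0., 1., 1.]]).to_sparse()
pa = torch.tensor([[1., 0.],
                   [1., 1.],
                   [0., 1.]]).to_sparse()

# Chain the per-hop adjacencies, as HERecTrainer.preprocess does for each
# edge type in the metapath, then coalesce duplicate entries.
apa = torch.sparse.mm(ap, pa).coalesce()

print(apa.indices())  # endpoints of the metapath-induced (A-P-A) edges
print(apa.values())   # path counts, usable as the 'rw_prob' edge weight
```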

+ 71
- 120
openhgnn/trainerflow/link_prediction.py View File

@@ -6,7 +6,6 @@ import torch
import torch.nn.functional as F
from . import BaseFlow, register_flow
from ..models import build_model
from dgl.dataloading.negative_sampler import Uniform
from ..utils import extract_embed, EarlyStopping, get_nodes_dict, add_reverse_edges


@@ -14,7 +13,8 @@ from ..utils import extract_embed, EarlyStopping, get_nodes_dict, add_reverse_ed
class LinkPrediction(BaseFlow):
"""
Link Prediction trainer flows.
Here is a tutorial teach you how to train a GNN for link prediction. <https://docs.dgl.ai/en/latest/tutorials/blitz/4_link_predict.html>_
Here is a tutorial that teaches you how to train a GNN for
`link prediction <https://docs.dgl.ai/en/latest/tutorials/blitz/4_link_predict.html>`_.

When training, you will need to remove the edges in the test set from the original graph.
DGL recommends that you treat the pairs of nodes as another graph, since you can describe a pair of nodes with an edge.
@@ -27,34 +27,57 @@ class LinkPrediction(BaseFlow):
"""

def __init__(self, args):
"""

Parameters
----------
args

Attributes
------------
target_link: list
list of edge types which are target link type to be predicted

score_fn: str
score function used to calculate the scores of links; supported functions: distmult (default if not specified) and dot-product

r_embedding: nn.ParameterDict
In DistMult, the representations of edge types are involved in the calculation of the score.
General models do not generate the representations of edge types, so we generate the embeddings of edge types.
The dimension of embedding is `self.args.hidden_dim`.

"""
super(LinkPrediction, self).__init__(args)

self.target_link = self.task.dataset.target_link
self.loss_fn = self.task.get_loss_fn()
self.train_hg = self.task.get_train().to(self.device)
if hasattr(self.args, 'flag_add_reverse_edges'):
self.train_hg = add_reverse_edges(self.train_hg)
if not hasattr(self.args, 'out_dim'):
self.args.out_dim = self.args.hidden_dim

self.args.out_node_type = self.task.dataset.out_ntypes
self.args.out_dim = self.args.hidden_dim
self.train_hg, self.val_hg, self.test_hg = self.task.get_idx()
self.train_hg = add_reverse_edges(self.train_hg)
self.model = build_model(self.model_name).build_model_from_args(self.args, self.train_hg).to(self.device)
self.args.score_fn = 'distmult'

if not hasattr(self.args, 'score_fn'):
self.args.score_fn = 'distmult'
if self.args.score_fn == 'distmult':
self.r_embedding = nn.ParameterDict({etype[1]: nn.Parameter(th.Tensor(1, self.args.out_dim))
"""
In DistMult, the representations of edge types are involved in the calculation of the score.
General models do not generate the representations of edge types, so we generate the embeddings of edge types.
"""
self.r_embedding = nn.ParameterDict({etype[1]: nn.Parameter(th.Tensor(1, self.args.hidden_dim))
for etype in self.hg.canonical_etypes}).to(self.device)
for _, para in self.r_embedding.items():
nn.init.xavier_uniform_(para)
else:
self.r_embedding = None

self.evaluator = self.task.get_evaluator('roc_auc')
self.optimizer = self.candidate_optimizer[args.optimizer](self.model.parameters(), lr=args.lr, weight_decay=args.weight_decay)
if self.args.score_fn == 'distmult':
self.optimizer.add_param_group({'params': self.r_embedding.parameters()})
self.patience = args.patience
self.max_epoch = args.max_epoch

self.train_hg = self.train_hg.to(self.device)
self.val_hg = self.val_hg.to(self.device)
self.test_hg = self.test_hg.to(self.device)
self.negative_sampler = Uniform(1)
self.positive_graph = self.train_hg.edge_type_subgraph(self.target_link)

def preprocess(self):
@@ -62,47 +85,56 @@ class LinkPrediction(BaseFlow):
In link prediction, you will have a positive graph consisting of all the positive examples as edges,
and a negative graph consisting of all the negative examples.
The positive graph and the negative graph will contain the same set of nodes as the original graph.
Returns
-------

"""
self.preprocess_feature()
super(LinkPrediction, self).preprocess()

def train(self):
self.preprocess()
epoch_iter = tqdm(range(self.max_epoch))
stopper = EarlyStopping(self.args.patience, self._checkpoint)
for epoch in tqdm(range(self.max_epoch), ncols=80):
stopper = EarlyStopping(self.patience, self._checkpoint)
for epoch in tqdm(range(self.max_epoch)):
if self.args.mini_batch_flag:
loss = self._mini_train_step()
else:
loss = self._full_train_setp()
if epoch % 2 == 0:
if epoch % self.evaluate_interval == 0:
val_metric = self._test_step('valid')
epoch_iter.set_description(
f"Epoch: {epoch:03d}, roc_auc: {val_metric:.4f}, Loss:{loss:.4f}"
)
self.logger.train_info(f"Epoch: {epoch:03d}, roc_auc: {val_metric:.4f}, Loss:{loss:.4f}")
early_stop = stopper.step_score(val_metric, self.model)
if early_stop:
print('Early Stop!\tEpoch:' + str(epoch))
self.logger.train_info(f'Early Stop!\tEpoch:{epoch:03d}')
break
print(f"Valid_score_ = {stopper.best_score: .4f}")
self.logger.train_info(f"Valid score = {stopper.best_score: .4f}")
stopper.load_model(self.model)

############ TEST SCORE #########
# Test
if self.args.dataset in ['HGBl-amazon', 'HGBl-LastFM', 'HGBl-PubMed']:
# Test in HGB datasets.
self.model.eval()
with torch.no_grad():
val_metric = self._test_step('valid')
h_dict = self.model.input_feature()
embedding = self.model(self.hg, h_dict)
score = th.sigmoid(self.ScorePredictor(self.test_hg, embedding))
self.task.dataset.save_results(hg=self.test_hg, score=score, file_path=self.args.HGB_results_path)
score = th.sigmoid(self.task.ScorePredictor(self.task.test_hg, embedding, self.r_embedding))
self.task.dataset.save_results(hg=self.task.test_hg, score=score, file_path=self.args.HGB_results_path)
return val_metric, val_metric, epoch
test_mrr = self._test_step(split="test")
val_mrr = self._test_step(split="valid")
print(f"Test mrr = {test_mrr:.4f}")
return dict(Test_mrr=test_mrr, Val_mrr=val_mrr)
test_score = self._test_step(split="test")
val_score = self._test_step(split="valid")
self.logger.train_info(f"Test score = {test_score:.4f}")
return dict(Test_mrr=test_score, Val_mrr=val_score)

def _full_train_setp(self):
self.model.train()
h_dict = self.model.input_feature()
embedding = self.model(self.train_hg, h_dict)
# construct a negative graph according to the positive graph in each training epoch.
negative_graph = self.task.construct_negative_graph(self.positive_graph)
loss = self.loss_calculation(self.positive_graph, negative_graph, embedding)
# negative_graph = self.construct_negative_graph(self.train_hg)
# loss = self.loss_calculation(self.train_hg, negative_graph, embedding)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
return loss.item()

def _mini_train_step(self,):
self.model.train()
@@ -123,101 +155,20 @@ class LinkPrediction(BaseFlow):
return all_loss

def loss_calculation(self, positive_graph, negative_graph, embedding):
p_score = self.ScorePredictor(positive_graph, embedding)
n_score = self.ScorePredictor(negative_graph, embedding)
p_score = self.task.ScorePredictor(positive_graph, embedding, self.r_embedding)
n_score = self.task.ScorePredictor(negative_graph, embedding, self.r_embedding)

p_label = th.ones(len(p_score), device=self.device)
n_label = th.zeros(len(n_score), device=self.device)
loss = F.binary_cross_entropy_with_logits(th.cat((p_score, n_score)), th.cat((p_label, n_label)))
return loss

def ScorePredictor(self, edge_subgraph, x):
if self.args.score_fn == 'dot-product':
with edge_subgraph.local_scope():

for ntype in edge_subgraph.ntypes:
edge_subgraph.nodes[ntype].data['x'] = x[ntype]
for etype in edge_subgraph.canonical_etypes:
edge_subgraph.apply_edges(
dgl.function.u_dot_v('x', 'x', 'score'), etype=etype)
score = edge_subgraph.edata['score']
if isinstance(score, dict):
result = []
for _, value in score.items():
result.append(value)
score = th.cat(result)
return score.squeeze()
elif self.args.score_fn == 'distmult':
score_list = []
with edge_subgraph.local_scope():
for ntype in edge_subgraph.ntypes:
edge_subgraph.nodes[ntype].data['x'] = x[ntype]
for etype in edge_subgraph.canonical_etypes:
e = self.r_embedding[etype[1]]
n = edge_subgraph.num_edges(etype)
if 1 == len(edge_subgraph.canonical_etypes):
edge_subgraph.edata['e'] = e.expand(n, -1)
else:
edge_subgraph.edata['e'] = {etype: e.expand(n, -1)}
edge_subgraph.apply_edges(
dgl.function.u_mul_e('x', 'e', 's'), etype=etype)
edge_subgraph.apply_edges(
dgl.function.e_mul_v('s', 'x', 'score'), etype=etype)
if 1 == len(edge_subgraph.canonical_etypes):
score = th.sum(edge_subgraph.edata['score'], dim=1)
else:
score = th.sum(edge_subgraph.edata['score'].pop(etype), dim=1)
#score = th.sum(th.mul(edge_subgraph.edata['score'].pop(etype), e), dim=1)
score_list.append(score)
return th.cat(score_list)

def regularization_loss(self, embedding):
return th.mean(embedding.pow(2)) + th.mean(self.r_embedding.pow(2))

def construct_negative_graph(self, hg):
e_dict = {
etype: hg.edges(etype=etype, form='eid')
for etype in hg.canonical_etypes}
neg_srcdst = self.negative_sampler(hg, e_dict)
neg_pair_graph = dgl.heterograph(neg_srcdst,
{ntype: hg.number_of_nodes(ntype) for ntype in hg.ntypes})
return neg_pair_graph

def _full_train_setp(self):
self.model.train()
h_dict = self.model.input_feature()
embedding = self.model(self.train_hg, h_dict)
negative_graph = self.construct_negative_graph(self.positive_graph)
loss = self.loss_calculation(self.positive_graph, negative_graph, embedding)
# negative_graph = self.construct_negative_graph(self.train_hg)
# loss = self.loss_calculation(self.train_hg, negative_graph, embedding)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
#print(loss.item())
return loss.item()

def _test_step(self, split=None, logits=None):
def _test_step(self, split=None):
self.model.eval()
with th.no_grad():
h_dict = self.model.input_feature()
embedding = self.model(self.train_hg, h_dict)
if split == 'valid':
eval_hg = self.val_hg
# label = self.task.dataset.val_label
elif split == 'test':
label = self.task.dataset.test_label
score = th.sigmoid(self.ScorePredictor(self.test_hg, embedding))
metric = self.evaluator(label.cpu(), score.cpu())
return metric

# score = th.sigmoid(self.ScorePredictor(eval_hg, embedding))
# metric = self.evaluator(label.cpu(), score.cpu())
negative_graph = self.construct_negative_graph(eval_hg)
p_score = th.sigmoid(self.ScorePredictor(eval_hg, embedding))
n_score = th.sigmoid(self.ScorePredictor(negative_graph, embedding))
p_label = th.ones(len(p_score))
n_label = th.zeros(len(n_score))
metric = self.evaluator(th.cat((p_label, n_label)).cpu(), th.cat((p_score, n_score)).cpu())

return metric
return self.task.evaluate(embedding, self.r_embedding, mode=split)
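For reference, the two score functions named in the LinkPrediction docstring above can be written as plain tensor operations. A toy sketch with hypothetical embeddings, not the trainer's DGL-based `ScorePredictor`:

```python
import torch
import torch.nn.functional as F

def distmult_score(h_u, h_v, r):
    # score(u, r, v) = sum_d h_u[d] * r[d] * h_v[d]
    return (h_u * r * h_v).sum(dim=-1)

def dot_product_score(h_u, h_v):
    # score(u, v) = <h_u, h_v>
    return (h_u * h_v).sum(dim=-1)

h_u = torch.randn(4, 8)   # 4 source-node embeddings, hidden_dim = 8
h_v = torch.randn(4, 8)   # 4 destination-node embeddings
r = torch.randn(8)        # one relation embedding per edge type (DistMult only)

pos_score = distmult_score(h_u, h_v, r)
neg_score = distmult_score(h_u, torch.randn(4, 8), r)   # corrupted tails

# Same positive/negative BCE objective as loss_calculation above.
labels = torch.cat([torch.ones(4), torch.zeros(4)])
loss = F.binary_cross_entropy_with_logits(torch.cat([pos_score, neg_score]), labels)
```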

+ 17
- 27
openhgnn/trainerflow/mp2vec_trainer.py View File

@@ -1,39 +1,32 @@
import os.path
import numpy
from tqdm import tqdm
import torch.optim as optim
from torch.utils.data import DataLoader
from ..models import build_model
from . import BaseFlow, register_flow
from ..tasks import build_task
from ..sampler import mp2vec_sampler
from ..sampler import random_walk_sampler


@register_flow("mp2vec_trainer")
class Metapath2VecTrainer(BaseFlow):
def __init__(self, args):
super(Metapath2VecTrainer, self).__init__(args)
self.args = args
self.model_name = args.model
self.device = args.device

self.task = build_task(args)
self.hg = self.task.get_graph().to(self.device)

self.model = build_model(self.model_name).build_model_from_args(self.args, self.hg).to(self.device)
self.model = self.model.to(self.device)
self.mp2vec_sampler = None
self.dataloader = None
self.embeddings_file_name = self.args.dataset + '_mp2vec_embeddings'
self.output_dir = './openhgnn/output/' + self.model_name
self.embeddings_file_path = os.path.join(self.output_dir, self.args.dataset + '_mp2vec_embeddings.npy')
self.load_trained_embeddings = False

def preprocess(self):
metapath = self.task.dataset.meta_paths[0]
start_ntype = metapath[0][0]
metapath_edges = [elem[1] for elem in metapath]
self.mp2vec_sampler = mp2vec_sampler.Metapath2VecSampler(hg=self.hg.to('cpu'), metapath=metapath_edges,
start_ntype=start_ntype, rw_length=self.args.rw_length,
rw_walks=self.args.rw_walks,
window_size=self.args.window_size,
neg_size=self.args.neg_size)
metapath = self.task.dataset.meta_paths_dict[self.args.meta_path_key]
self.mp2vec_sampler = random_walk_sampler.RandomWalkSampler(g=self.hg.to('cpu'),
metapath=metapath * self.args.rw_length,
rw_walks=self.args.rw_walks,
window_size=self.args.window_size,
neg_size=self.args.neg_size)

self.dataloader = DataLoader(self.mp2vec_sampler, batch_size=self.args.batch_size,
shuffle=True, num_workers=self.args.num_workers,
@@ -47,11 +40,9 @@ class Metapath2VecTrainer(BaseFlow):
self.task.evaluate(logits=emb[start_idx:end_idx], name='f1_lr')

def load_embeddings(self):
try:
emb = numpy.load(self.embeddings_file_name + '.npy')
except Exception:
if not self.load_trained_embeddings or not os.path.exists(self.embeddings_file_path):
self.train_embeddings()
emb = numpy.load(self.embeddings_file_name + '.npy')
emb = numpy.load(self.embeddings_file_path)
return emb

def train_embeddings(self):
@@ -61,25 +52,24 @@ class Metapath2VecTrainer(BaseFlow):
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, len(self.dataloader))

for epoch in range(self.max_epoch):
print('\n\n\nEpoch: ' + str(epoch + 1))
self.logger.info('Epoch: ' + str(epoch + 1))
running_loss = 0.0
for i, sample_batched in enumerate(tqdm(self.dataloader)):

if len(sample_batched[0]) > 1:
pos_u = sample_batched[0].to(self.device)
pos_v = sample_batched[1].to(self.device)
neg_v = sample_batched[2].to(self.device)

scheduler.step()
optimizer.zero_grad()
loss = self.model.forward(pos_u, pos_v, neg_v)
loss.backward()
optimizer.step()
scheduler.step()

running_loss = running_loss * 0.9 + loss.item() * 0.1
if i > 0 and i % 50 == 0:
print(' Loss: ' + str(running_loss))
self.model.save_embedding(self.embeddings_file_name)
self.logger.info(' Loss: ' + str(running_loss))
self.model.save_embedding(self.embeddings_file_path)

def get_ntype_range(self, target_ntype):
start_idx = 0
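One detail worth spelling out: passing `metapath * self.args.rw_length` to the sampler relies on Python list repetition to turn a single metapath into the edge-type schema of a full walk. A toy sketch with a hypothetical APA metapath:

```python
# Hypothetical metapath of edge-type names; repeating it rw_length times gives
# the sequence of edge types that one random walk will follow.
metapath = ['author-paper', 'paper-author']
rw_length = 3

walk_schema = metapath * rw_length
print(walk_schema)
# ['author-paper', 'paper-author', 'author-paper', 'paper-author',
#  'author-paper', 'paper-author']
```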


+ 1
- 0
openhgnn/utils/__init__.py View File

@@ -2,4 +2,5 @@ from .best_config import BEST_CONFIGS
from .dgl_graph import *
from .utils import *
from .evaluator import *
from .logger import Logger


+ 15
- 4
openhgnn/utils/best_config.py View File

@@ -65,7 +65,7 @@ BEST_CONFIGS = {
},
},
'GTN': {
'general': {'lr': 0.005, 'weight_decay': 0.001, 'hidden_dim': 64, 'max_epoch': 50, 'patience': 10,
'general': {'lr': 0.005, 'weight_decay': 0.001, 'hidden_dim': 64, 'max_epoch': 100, 'patience': 20,
'norm_emd_flag': True, 'mini_batch_flag': False},
'acm4GTN': {
'num_layers': 2, 'num_channels': 2, 'adaptive_lr_flag': True,
@@ -158,7 +158,7 @@ BEST_CONFIGS = {
},
'NSHE': {
'general': {},
'acm4SNHE': {'weight_decay': 0.001, 'num_e_neg': 1, 'num_ns_neg': 4,
'acm4NSHE': {'weight_decay': 0.001, 'num_e_neg': 1, 'num_ns_neg': 4,
'max_epoch': 500, 'patience': 10,
}
},
@@ -210,17 +210,28 @@ BEST_CONFIGS = {
}
},
"link_prediction": {
'NARS': {
'general': {'num_hops': 3},
},
'HetGNN':{
'HetGNN': {
'general': {'max_epoch': 500, 'patience': 10, 'mini_batch_flag': True},
'academic4HetGNN': {
'lr': 0.01, 'weight_decay': 0.0001, 'dim': 128, 'batch_size': 64, 'window_size': 5,
'batches_per_epoch': 50, 'rw_length': 50, 'rw_walks': 10, 'rwr_prob': 0.5,
}
}
},
'RGCN': {
'general': {
},
'FB15k-237': {
'lr': 0.01, 'weight_decay': 0.0005, 'max_epoch': 100,
'hidden_dim': 16, 'n_bases': 40, 'n_layers': 2, 'batch_size': 126, 'fanout': 4, 'dropout': 0,
'validation': True
},
},
},
"recommendation": {


+ 60
- 33
openhgnn/utils/evaluator.py View File

@@ -41,10 +41,9 @@ class Evaluator():
def cal_roc_auc(self, y_true, y_pred):
return roc_auc_score(y_true, y_pred)

def mrr_(self, embedding, w, train_triplets, valid_triplets, test_triplets, hits=[], eval_bz=100, eval_p="filtered"):
def mrr_(self, embedding, w, train_triplets, valid_triplets, test_triplets, hits=[], eval_bz=100, eval_p='raw'):
if eval_p == "filtered":
mrr = calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits)
pass
else:
mrr = calc_raw_mrr(embedding, w, test_triplets, hits, eval_bz)
return mrr
@@ -92,34 +91,25 @@ class Evaluator():
return micro_f1, macro_f1


def perturb_and_get_raw_rank(embedding, w, a, r, b, test_size, batch_size=1000):
def perturb_and_get_raw_rank(embedding, w, a, r, b, test_size, batch_size=100):
""" Perturb one element in the triplets
"""
n_batch = (test_size + batch_size - 1) // batch_size
ranks = []
for idx in range(n_batch):
#print("batch {} / {}".format(idx, n_batch))
# print("batch {} / {}".format(idx, n_batch))
batch_start = idx * batch_size
batch_end = min(test_size, (idx + 1) * batch_size)
batch_a = a[batch_start: batch_end]
batch_r = r[batch_start: batch_end]



emb_ar = embedding[batch_a] * w[batch_r]
emb_ar = emb_ar.transpose(0, 1).unsqueeze(2) # size: D x E x 1
emb_c = embedding.transpose(0, 1).unsqueeze(1) # size: D x 1 x V
# out-prod and reduce sum
out_prod = th.bmm(emb_ar, emb_c) # size D x E x V
score = th.sum(out_prod, dim=0) # size E x V

# emb_ar = (embedding[batch_a] + w[batch_r]).unsqueeze(1).expand((batch_end - batch_start,embedding.shape[0], -1))
# emb_c = embedding.unsqueeze(0)
# out_prod = th.sub(emb_ar, emb_c) # size D x E x V
# score = -th.norm(out_prod, p=1, dim=2) # size E x V

score = th.sigmoid(score)
target = b[batch_start: batch_end]
target = b[batch_start: batch_end].to(score.device)
ranks.append(sort_and_rank(score, target))
return th.cat(ranks)

@@ -134,17 +124,17 @@ def sort_and_rank(score, target):
# return MRR (raw), and Hits @ (1, 3, 10)
def calc_raw_mrr(embedding, w, test_triplets, hits=[], eval_bz=100):
with th.no_grad():
s = test_triplets[0]
r = test_triplets[1]
o = test_triplets[2]
test_size = test_triplets.shape[1]
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]

# perturb subject
ranks = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)
ranks_s = perturb_and_get_raw_rank(embedding, w, o, r, s, test_size, eval_bz)
# perturb object
#ranks_o = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)
ranks_o = perturb_and_get_raw_rank(embedding, w, s, r, o, test_size, eval_bz)

#ranks = th.cat([ranks_s, ranks_o])
ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed

mrr = th.mean(1.0 / ranks.float())
@@ -167,19 +157,56 @@ def perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_t
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_s = filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities).to(target_s.device)
target_s_idx = th.nonzero(filtered_s == target_s).item()
filtered_s = filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_s_idx = int((filtered_s == target_s).nonzero())
emb_s = embedding[filtered_s]
emb_r = w[target_r]
emb_r = w[str(target_r.item())]
emb_o = embedding[target_o]
emb_triplet = emb_s * emb_r * emb_o
scores = th.sigmoid(th.sum(emb_triplet, dim=1))
_, indices = th.sort(scores, descending=True)
rank = th.nonzero(indices == target_s_idx).item()
rank = int((indices == target_s_idx).nonzero())
ranks.append(rank)
return th.LongTensor(ranks)


def perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter):
""" Perturb object in the triplets
"""
num_entities = embedding.shape[0]
ranks = []
for idx in range(test_size):
if idx % 10000 == 0:
print("test triplet {} / {}".format(idx, test_size))
target_s = s[idx]
target_r = r[idx]
target_o = o[idx]
filtered_o = filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities)
target_o_idx = int((filtered_o == target_o).nonzero())
emb_s = embedding[target_s]
emb_r = w[str(target_r.item())]
emb_o = embedding[filtered_o]
emb_triplet = emb_s * emb_r * emb_o
scores = th.sigmoid(th.sum(emb_triplet, dim=1))
_, indices = th.sort(scores, descending=True)
rank = int((indices == target_o_idx).nonzero())
ranks.append(rank)
return th.LongTensor(ranks)


def filter_o(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_o = []
# Do not filter out the test triplet, since we want to predict on it
if (target_s, target_r, target_o) in triplets_to_filter:
triplets_to_filter.remove((target_s, target_r, target_o))
# Do not consider an object if it is part of a triplet to filter
for o in range(num_entities):
if (target_s, target_r, o) not in triplets_to_filter:
filtered_o.append(o)
return th.LongTensor(filtered_o)


def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities):
target_s, target_r, target_o = int(target_s), int(target_r), int(target_o)
filtered_s = []
@@ -195,20 +222,19 @@ def filter_s(triplets_to_filter, target_s, target_r, target_o, num_entities):

def calc_filtered_mrr(embedding, w, train_triplets, valid_triplets, test_triplets, hits=[]):
with th.no_grad():
s = test_triplets[0]
r = test_triplets[1]
o = test_triplets[2]
test_size = test_triplets.shape[1]
s = test_triplets[:, 0]
r = test_triplets[:, 1]
o = test_triplets[:, 2]
test_size = test_triplets.shape[0]

triplets_to_filter = th.transpose(th.cat([train_triplets, valid_triplets, test_triplets], dim=1), 0, 1).tolist()
triplets_to_filter = th.cat([train_triplets, valid_triplets, test_triplets]).tolist()
triplets_to_filter = {tuple(triplet) for triplet in triplets_to_filter}
print('Perturbing subject...')
ranks_s = perturb_s_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
print('Perturbing object...')
#ranks_o = perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)
ranks_o = perturb_o_and_get_filtered_rank(embedding, w, s, r, o, test_size, triplets_to_filter)

#ranks = th.cat([ranks_s, ranks_o])
ranks = ranks_s
ranks = th.cat([ranks_s, ranks_o])
ranks += 1 # change to 1-indexed

mrr = th.mean(1.0 / ranks.float())
@@ -279,5 +305,6 @@ def cal_loss_f1(y, node_data, loss_func, mode):
macro_f1, micro_f1 = f1_node_classification(y_label.cpu(), y_pred.cpu())
return loss, macro_f1, micro_f1


def cal_acc(y_pred, y_true):
return th.sum(y_pred.argmax(dim=1) == y_true).item() / len(y_true)
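Both MRR paths above end the same way: collect the 1-indexed rank of the true entity for every (possibly perturbed) test triplet, then reduce the ranks to MRR and Hits@k. A toy sketch of that reduction, assuming the ranks are already computed:

```python
import torch as th

# Hypothetical 1-indexed ranks of the true entity for five test triplets.
ranks = th.tensor([1, 3, 10, 2, 50])

mrr = th.mean(1.0 / ranks.float())                                  # mean reciprocal rank
hits = {k: th.mean((ranks <= k).float()).item() for k in (1, 3, 10)}

print(f"MRR = {mrr:.4f}")          # 0.3907 for these ranks
for k, v in hits.items():
    print(f"Hits@{k} = {v:.4f}")
```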

+ 189
- 0
openhgnn/utils/logger.py View File

@@ -1,3 +1,11 @@
import logging
import os
import colorlog
import re
import datetime
from logging import getLogger
from colorama import init


def printInfo(metric, epoch, train_score, train_loss, val_score, val_loss):
if metric == 'f1_lr':
@@ -28,3 +36,184 @@ def printMetric(metric, score, mode):
print(f"{mode}_macro_{metric} = {score[0]:.4f}, {mode}_micro_{metric}: {score[1]:.4f}")
else:
print(f"{mode}_{metric} = {score:.4f}")


# UPDATE
# Hu Anke 2021/11/07

log_colors_config = {
'DEBUG': 'cyan',
'WARNING': 'yellow',
'ERROR': 'red',
'CRITICAL': 'red',
}


def get_local_time():
r"""Get current time

Returns:
str: current time
"""
cur = datetime.datetime.now()
cur = cur.strftime('%b-%d-%Y_%H-%M-%S')

return cur


def ensure_dir(dir_path):
r"""Make sure the directory exists, if it does not exist, create it

Args:
dir_path (str): directory path

"""
if not os.path.exists(dir_path):
os.makedirs(dir_path)


class RemoveColorFilter(logging.Filter):

def filter(self, record):
if record:
ansi_escape = re.compile(r'\x1B(?:[@-Z\\-_]|\[[0-?]*[ -/]*[@-~])')
record.msg = ansi_escape.sub('', str(record.msg))
return True


def set_color(log, color, highlight=True):
color_set = ['black', 'red', 'green', 'yellow', 'blue', 'pink', 'cyan', 'white']
try:
index = color_set.index(color)
except:
index = len(color_set) - 1
prev_log = '\033['
if highlight:
prev_log += '1;3'
else:
prev_log += '0;3'
prev_log += str(index) + 'm'
return prev_log + log + '\033[0m'


# UPDATE
# Hu Anke 2021/11/11
# UPDATE
# Hu AnKe 2021/12/16
class Logger:

def __init__(self, config):
"""
A logger that shows a message on standard output and writes it into a log file
under `./openhgnn/output/<model>/` simultaneously.
All messages that you want to log MUST be str.

Args:
config (Config): An instance object of Config, used to record parameter information.

Example:
>>> logger = Logger(config)
>>> logger.debug(train_state)
>>> logger.info(train_result)
"""
init(autoreset=True)
LOGROOT = f'./openhgnn/output/{config.model}/'
dir_name = os.path.dirname(LOGROOT)
ensure_dir(dir_name)

logfilename = '{}-{}.log'.format(config.model, get_local_time())

logfilepath = os.path.join(LOGROOT, logfilename)

filefmt = "%(asctime)-15s %(levelname)s %(message)s"
filedatefmt = "%a %d %b %Y %H:%M:%S"
fileformatter = logging.Formatter(filefmt, filedatefmt)

sfmt = "%(log_color)s%(asctime)-15s %(levelname)s %(message)s"
sdatefmt = "%d %b %H:%M"
sformatter = colorlog.ColoredFormatter(sfmt, sdatefmt, log_colors=log_colors_config)
if not hasattr(config, 'state') or config.state.lower() == 'info':
level = logging.INFO
elif config.state.lower() == 'debug':
level = logging.DEBUG
elif config.state.lower() == 'error':
level = logging.ERROR
elif config.state.lower() == 'warning':
level = logging.WARNING
elif config.state.lower() == 'critical':
level = logging.CRITICAL
else:
level = logging.INFO

fh = logging.FileHandler(logfilepath, mode='a')
fh.setLevel(level)
fh.setFormatter(fileformatter)
remove_color_filter = RemoveColorFilter()
fh.addFilter(remove_color_filter)

sh = logging.StreamHandler()
sh.setLevel(level)
sh.setFormatter(sformatter)
root_logger = logging.getLogger()
for h in root_logger.handlers:
root_logger.removeHandler(h)

logging.basicConfig(level=level, handlers=[sh, fh])
self.logger = getLogger()
self.logger.info(config)

def info(self, s):
self.logger.info(s)
def load_best_config(self, s):
self.logger.info('[Load Best Config] ' + s)
def train_info(self, s):
self.logger.info('[Train Info] ' + s)
def feature_info(self, s):
self.logger.info('[Feature Transformation] ' + s)
# graph data analysis
def log_data_info(self, g):
num_nodes = g.num_nodes()
num_edges = g.num_edges()
node_types = len(g.ntypes)
edge_types = len(g.etypes)
c_etypes = len(g.canonical_etypes)
datainfo = {'total nodes':num_nodes, 'total edges':num_edges, 'node types':node_types, 'edge types': edge_types,
'c_etypes': c_etypes}

self.logger.info(datainfo)
return

# evaluate results
def log_eval_info(self, metric, epoch, train_score, train_loss, val_score, val_loss):
if metric == 'f1_lr':
eval_info = {f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_macro_f1: {train_score[0]:.4f}, Train_micro_f1: {train_score[1]:.4f}, "
f"Val_macro_f1: {val_score[0]:.4f}, Val_micro_f1: {val_score[1]:.4f}, ValLoss:{val_loss: .4f}"}

# use acc
elif metric == 'acc':
eval_info = {f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_acc: {train_score:.4f}, "
f"Val_acc: {val_score:.4f}, ValLoss:{val_loss: .4f}"}

elif metric == 'acc-ogbn-mag':
eval_info = {f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_acc: {train_score:.4f}, "
f"Val_acc: {val_score:.4f}, ValLoss:{val_loss: .4f}"}

else:
eval_info = {f"Epoch: {epoch:03d}, Train_loss: {train_loss:.4f}, Train_macro_f1: {train_score[0]:.4f}, Train_micro_f1: {train_score[1]:.4f}, "
f"Val_macro_f1: {val_score[0]:.4f}, Val_micro_f1: {val_score[1]:.4f}, ValLoss:{val_loss: .4f}"}
self.logger.info(eval_info)
return

def log_metric_info_1(self, metric, score, mode):
met_info = {f"{mode}_{metric} = {score:.4f}"}
self.logger.info(met_info)
return

def log_metric_info_2(self, metric, score, mode):
met_info = {f"{mode}_macro_{metric} = {score[0]:.4f}, {mode}_micro_{metric}: {score[1]:.4f}"}
self.logger.info(met_info)
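A short usage sketch of the `Logger` wrapper above, assuming OpenHGNN and its logging dependencies (`colorlog`, `colorama`) are installed; the `cfg` namespace here is hypothetical, and any object exposing a `model` attribute (and optionally `state`) works:

```python
from types import SimpleNamespace

from openhgnn.utils.logger import Logger

# Hypothetical config stand-in; Logger only reads `model` (log directory name)
# and, if present, `state` (log level: info/debug/warning/error/critical).
cfg = SimpleNamespace(model='RGCN', state='info')

logger = Logger(cfg)   # logs to console and ./openhgnn/output/RGCN/RGCN-<time>.log
logger.train_info('Epoch: 000, roc_auc: 0.8500, Loss: 0.4321')
logger.load_best_config('Load the best config of model: RGCN for dataset: FB15k-237.')
```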

+ 30
- 5
openhgnn/utils/utils.py View File

@@ -5,7 +5,8 @@ import torch as th
from scipy.sparse import coo_matrix
import numpy as np
import random
from . import load_HIN, load_KG, load_OGB, BEST_CONFIGS
from . import load_HIN, load_KG, load_OGB
from .best_config import BEST_CONFIGS


def sum_up_params(model):
@@ -66,20 +67,20 @@ def add_reverse_edges(hg, copy_ndata=True, copy_edata=True, ignore_one_type=True
def set_best_config(args):
configs = BEST_CONFIGS.get(args.task)
if configs is None:
print('The task do not have a best_config!')
args.logger.load_best_config('The task: {} does not have a best_config!'.format(args.task))
return args
if args.model not in configs:
print('The model is not in the best config.')
args.logger.load_best_config('The model: {} is not in the best config.'.format(args.model))
return args
configs = configs[args.model]
for key, value in configs["general"].items():
args.__setattr__(key, value)
if args.dataset not in configs:
print('The dataset is not in the best config.')
args.logger.load_best_config('The dataset: {} is not in the best config of model: {}.'.format(args.dataset, args.model))
return args
for key, value in configs[args.dataset].items():
args.__setattr__(key, value)
print('Use the best config.')
args.logger.load_best_config('Load the best config of model: {} for dataset: {}.'.format(args.model, args.dataset))
return args


@@ -334,3 +335,27 @@ def extract_metapaths(category, canonical_etypes, self_loop=False):
if etype[0] != etype[2]:
meta_paths.append((etype, dst_e))
return meta_paths

# for etype in self.model.hg.etypes:
# g = self.model.hg[etype]
# for etype in ['paper-ref-paper','paper-cite-paper']:
# g = self.hg[etype]
# r = []
# for i in self.train_idx:
# neigh = g.predecessors(i)
# cen_label = self.labels[i]
# neigh_label = self.labels[neigh]
# if len(neigh) == 0:
# pass
# else:
# r.append((cen_label == neigh_label).sum() / len(neigh))
# for i in self.valid_idx:
# neigh = g.predecessors(i)
# cen_label = self.labels[i]
# neigh_label = self.labels[neigh]
# if len(neigh) == 0:
# pass
# else:
# r.append((cen_label == neigh_label).sum() / len(neigh))
# he = torch.stack(r).mean()
# print(etype+ str(he))

+ 10
- 8
space4hgnn/README.md View File

@@ -12,7 +12,7 @@ The installation process is same with OpenHGNN [Get Started](https://github.com/

#### 2.1 Generate designs randomly

Here we will generate a random design combination for each dataset and save it in a `.yaml` file. The candidate designs are list in [`./space4hgnn/generate_yaml.py`](./generate_yaml.py).
Here we will generate a random design combination for each dataset and save it in a `.yaml` file. The candidate designs are listed in [`./space4hgnn/generate_yaml.py`](./generate_yaml.py).

```bash
python ./space4hgnn/generate_yaml.py --gnn_type gcnconv --times 1 --key has_bn --configfile test
@@ -85,7 +85,9 @@ For **Meta-path model family**, ``--model`` is general_HGNN and ``--subgraph_ext
python space4hgnn.py -m general_HGNN -u metapath -t node_classification -d HGBn-ACM -g 0 -r 5 -a gcnconv -s 1 -k has_bn -v True -c test -p HGB
```

**Note: ** Similar with generating yaml file, experiment will load the design configuration from ``yaml_file_path``. And it will save the results into a `.csv` file in `prediction_file_path`.
**Note:**

Similar to generating the yaml file, the experiment will load the design configuration from ``yaml_file_path`` and save the results into a `.csv` file in `prediction_file_path`.

```python
yaml_file_path = './space4hgnn/config/{}/{}/{}_{}.yaml'.format(configfile, key, gnn_type, times)
@@ -94,7 +96,7 @@ prediction_file_path = './space4hgnn/prediction/excel/{}/{}_{}/{}_{}_{}_{}.csv'.
# Here prediction_file_path = './space4hgnn/prediction/test/has_bn_True/metapath_gcnconv_1_HGBn-ACM.yaml'
```

### 3 Run a bach of experiments
### 3 Run a batch of experiments

An example:

@@ -102,14 +104,14 @@ An example:
./space4hgnn/parallel.sh 0 5 has_bn True node_classification test_paral test_paral
```

It will generate configuration files for the bach of experiments. And
It will generate configuration files for the batch of experiments and then launch them.

The following are the argument descriptions:

1. The first argument controls which GPU to use. Here it is 0.
2. Repeat times. Here it is 5.
3. Design dimension. Here it is BN.
4. Choice of design dimension. Here set BN`` True``.
4. Choice of design dimension. Here we set BN to ``True``.
5. Task name. Here it is node_classification.
6. Configfile is the path to save configuration files.
7. Predictfile is the path to save prediction files.
@@ -132,12 +134,12 @@ python ./space4hgnn/prediction/excel/gather_all_Csv.py -p ./space4hgnn/predictio

##### 3.2.1 Ranking analysis

We analyze the results with average ranking following [GraphGym](https://github.com/snap-stanford/GraphGym#3-analyze-the-results), and
We analyze the results with average ranking following [GraphGym](https://github.com/snap-stanford/GraphGym#3-analyze-the-results); the corresponding code is in [`figure/rank.py`](./figure/rank.py).

![space4hgnn_rank](../docs/source/_static/space4hgnn_rank.png)

##### 3.2.2 Distribution analysis
##### 3.2.2 Distribution estimates

We analyze the results with distribution estimates following [NDS](https://github.com/facebookresearch/nds), and
We analyze the results with distribution estimates following [NDS](https://github.com/facebookresearch/nds), and the corresponding code is in [`figure/distribution.py`](./figure/distribution.py).

![space4hgnn_distribution](../docs/source/_static/space4hgnn_distribution.png)
