Our pretraining process, in brief, consists of training a protein language model on unlabeled amino acid chains (the PFAM dataset) and then finetuning on the labeled downstream task datasets. The pretraining code and dataset can be found in the pretrain folder; this folder is used for finetuning and performance evaluation.
The environment requirements can be found here.
The base environment is used for the downstream tasks.
The torch1.7 environment is optional: you only need it if you use the pretrained model we provide. This is due to the change in serialization strategy between torch 1.4 and torch 1.7; when converting Megatron-ProteinLM's checkpoint to a TAPE checkpoint, you need a newer torch (version 1.7 or higher) to load Megatron's pretrained model checkpoint.
We provide a pretrained model checkpoint for the downstream tasks: you can download the checkpoint and finetune it on the downstream tasks, or you can pretrain your own model depending on your hardware.
The finetuning process can be split into two parts: checkpoint conversion, and finetuning plus evaluation.

First, run `megatron-tape-generator.sh $MEGA_CKPT` to convert the checkpoint (if you don't set `$MEGA_CKPT`, it defaults to `$PWD/models/mega`).

Then `tape_train.py` and `tape_eval.py` are used for finetuning and for evaluating the performance of the finetuned model. Below is the usage.

    # finetune train
    python tape_train.py $BASE_MODEL $DOWNSTREAM_TASK --from_pretrained $PATH_TO_CONVERTED_CHECKPOINT --batch_size $BS --num_train_epochs $NUM_EPOCH --learning_rate $LR --output_dir $PATH_TO_OUTPUT_DIR --warmup_steps $WS

    # finetune eval (--split is optional)
    python tape_eval.py $BASE_MODEL $DOWNSTREAM_TASK $PATH_TO_FINETUNED_MODEL --metrics $METRIC --split $DATA_SPLIT

    # Note that for the secondary_structure and remote_homology tasks, you need to set the --split parameter,
    # since there is more than one evaluation dataset split.

For our pretrained model, `$BASE_MODEL` should be `transformer`. If your GPU memory is not sufficient, consider setting `--gradient_accumulation_steps`, which stands for the number of forward passes to accumulate for each backward pass.
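To make that concrete, here is a minimal, self-contained sketch of gradient accumulation (illustrative only, not TAPE's actual training loop; the toy model, data, and step counts are made up): gradients from several forward/backward passes are accumulated before a single optimizer update, increasing the effective batch size without extra GPU memory.

```python
import torch
from torch import nn

model = nn.Linear(8, 1)                                   # toy stand-in for the finetuned model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
accumulation_steps = 4                                    # corresponds to --gradient_accumulation_steps

optimizer.zero_grad()
for step in range(16):
    x, y = torch.randn(2, 8), torch.randn(2, 1)           # one small micro-batch
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so accumulated grads average out
    loss.backward()                                       # gradients accumulate in .grad across iterations
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                  # one parameter update per accumulation_steps forwards
        optimizer.zero_grad()
```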
There are two more parameters that can make the finetuning process easier.

- `--force_save`: `python tape_train.py --force_save $PARAM`.
- `--time_or_name`: `python tape_train.py ... --time_or_name time/$NAME`. The finetuned checkpoint is saved to a folder named `task-base_model_name-suffix`. If `time_or_name` is not set (default=`time`), the suffix will be `timestamp + 6-digit_random_int`; otherwise, the suffix of the checkpoint will be the value you passed to `time_or_name`.
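For illustration, a default suffix of the form `timestamp + 6-digit_random_int` could be produced like this (a hypothetical helper, not code from this repo; the exact timestamp format used by `tape_train.py` may differ):

```python
import random
from datetime import datetime

# Default case (time_or_name not set): suffix = timestamp + 6-digit random int.
suffix = f"{datetime.now():%y-%m-%d-%H-%M-%S}_{random.randint(0, 999999):06d}"
print(f"secondary_structure-transformer-{suffix}")  # follows the task-base_model_name-suffix pattern
```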
The script `megatron-converter.py` is used for transferring the parameters from the pretrained language model (Megatron-LM framework) to TAPE.

Parameter explanation:

- `src`: the location of Megatron-ProteinLM's checkpoint.
- `dst`: the location of TAPE's (randomly generated) checkpoint.
- `out`: the location to save the transferred model.
- `dtp`: the destination data type, default=`torch.float32`. Used to specify the output model's data type; you can pass a value like `torch.float16` if you want an fp16 model.
- `hidden`: hidden size of each encoder layer.
- `heads`: number of attention heads for each attention layer in the ProteinBert encoder.
- `layers`: number of hidden layers in the ProteinBert encoder, default=16.

PS: if you run into problems when loading checkpoints (errors related to serialization), one possible solution is to set the `_use_new_zipfile_serialization` argument of `torch.save()` to `False`.
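A minimal sketch of that workaround (the state dict and file name below are placeholders): passing `_use_new_zipfile_serialization=False` to `torch.save()` writes the legacy, pre-torch-1.6 format, which older torch versions can also load.

```python
import torch

# Placeholder state dict standing in for the converted TAPE checkpoint.
state_dict = {"dummy_weight": torch.zeros(4, 4)}

# Save in the legacy (pre-torch-1.6) format so older torch versions can load it.
torch.save(state_dict, "converted_checkpoint.pt", _use_new_zipfile_serialization=False)

# Loading works as usual.
state_dict = torch.load("converted_checkpoint.pt", map_location="cpu")
```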
We modified the layer structure in `modeling_bert.py`, dividing each encoder layer into four sub-modules, with two residual connections:

- residual connection 1 wraps (`INPUT_LAYERNORM`, `ATTENTION`) and ends after them.
- residual connection 2 wraps (`POST_ATTN_LAYERNORM`, `FFN`) and ends after them.
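Below is a small PyTorch sketch of that layout (illustrative only; the module names mirror the labels above, and the real implementation in `modeling_bert.py` differs in details such as dropout, attention masking, and configuration):

```python
import torch
from torch import nn

class EncoderLayerSketch(nn.Module):
    """Four sub-modules, two residual connections (pre-LayerNorm style)."""
    def __init__(self, hidden=64, heads=4):
        super().__init__()
        self.input_layernorm = nn.LayerNorm(hidden)                   # INPUT_LAYERNORM
        self.attention = nn.MultiheadAttention(hidden, heads)         # ATTENTION
        self.post_attn_layernorm = nn.LayerNorm(hidden)               # POST_ATTN_LAYERNORM
        self.ffn = nn.Sequential(nn.Linear(hidden, 4 * hidden),       # FFN
                                 nn.GELU(),
                                 nn.Linear(4 * hidden, hidden))

    def forward(self, x):                     # x: [seq_len, batch, hidden]
        # residual connection 1: around (INPUT_LAYERNORM, ATTENTION)
        h = self.input_layernorm(x)
        h, _ = self.attention(h, h, h)
        x = x + h
        # residual connection 2: around (POST_ATTN_LAYERNORM, FFN)
        h = self.post_attn_layernorm(x)
        return x + self.ffn(h)

# quick smoke test
layer = EncoderLayerSketch()
out = layer(torch.randn(10, 2, 64))           # -> [10, 2, 64]
```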
Task | Metric | TAPE Transformer | ProteinLM (ours) |
---|---|---|---|
contact prediction | P@L/5 | 0.36 | 0.52 |
remote_homology | Top 1 Accuracy | 0.21 | 0.26 |
secondary_structure | Accuracy (3-class) | 0.73 | 0.75 |
fluorescence | Spearman's rho | 0.68 | 0.68 |
stability | Spearman's rho | 0.73 | 0.77 |
Our work is based on the following papers. In addition, part of the code is adapted from Megatron-LM and TAPE.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
@article{DBLP:journals/corr/abs-1909-08053,
author = {Mohammad Shoeybi and
Mostofa Patwary and
Raul Puri and
Patrick LeGresley and
Jared Casper and
Bryan Catanzaro},
title = {Megatron-LM: Training Multi-Billion Parameter Language Models Using
Model Parallelism},
journal = {CoRR},
volume = {abs/1909.08053},
year = {2019},
url = {http://arxiv.org/abs/1909.08053},
archivePrefix = {arXiv},
eprint = {1909.08053},
timestamp = {Tue, 24 Sep 2019 11:33:51 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1909-08053.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
Evaluating Protein Transfer Learning with TAPE
@article{DBLP:journals/corr/abs-1906-08230,
author = {Roshan Rao and
Nicholas Bhattacharya and
Neil Thomas and
Yan Duan and
Xi Chen and
John F. Canny and
Pieter Abbeel and
Yun S. Song},
title = {Evaluating Protein Transfer Learning with {TAPE}},
journal = {CoRR},
volume = {abs/1906.08230},
year = {2019},
url = {http://arxiv.org/abs/1906.08230},
archivePrefix = {arXiv},
eprint = {1906.08230},
timestamp = {Sat, 23 Jan 2021 01:20:25 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1906-08230.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}