For many RNA molecules, secondary structure is critical to correct function. Predicting RNA secondary structure from the nucleotide sequence is a long-standing problem in genomics, but prediction performance has plateaued over time. Traditional RNA secondary structure prediction algorithms are mainly thermodynamic models based on free-energy minimization, which impose strong a priori assumptions and run slowly. UFold, a deep-learning approach to RNA secondary structure prediction, is trained directly on annotated data and base-pairing rules. UFold proposes a new image-like representation of RNA sequences, which can be processed efficiently by fully convolutional networks (FCNs).
The input to the model is generated by taking the outer product of all combinations of the four one-hot-encoded base channels, which yields 16 channels. An additional channel representing the pairing probabilities is then concatenated with the 16-channel sequence representation, and together they form the model input. The UFold model is a U-Net variant that takes the 17-channel tensor as input and transforms the data through successive convolution and max-pooling operations.
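The 16-channel outer-product representation described above can be sketched as follows. This is a minimal NumPy illustration, not the repository's actual code, and it omits the pairing-probability channel:

```python
import numpy as np

BASES = "AUCG"

def one_hot(seq):
    # (L, 4) one-hot matrix over the four bases
    mat = np.zeros((len(seq), 4), dtype=np.float32)
    for i, b in enumerate(seq):
        mat[i, BASES.index(b)] = 1.0
    return mat

def pairwise_channels(seq):
    x = one_hot(seq)                      # (L, 4)
    # outer product of the 4 channels at every pair of positions
    img = np.einsum("ia,jb->abij", x, x)  # (4, 4, L, L)
    return img.reshape(16, len(seq), len(seq))

img = pairwise_channels("GGAUCC")
print(img.shape)  # (16, 6, 6)
```

Each of the 16 channels marks one ordered base combination (e.g. the G-G channel is 1 at position pair (0, 1) for the sequence above), giving the L x L "image" an FCN can process.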
Several benchmark datasets were used:
- RNAStralign, containing 30,451 unique sequences from 8 RNA families;
- ArchiveII, containing 3,975 sequences from 10 RNA families, the most widely used benchmark for RNA structure prediction performance;
- bpRNA-1m, containing 102,318 sequences from 2,588 families, one of the most comprehensive RNA structure datasets available;
- bpRNA new, derived from Rfam 14.2, containing sequences from 1,500 new RNA families.
To make the dataset easier to use, we processed the bpseq-format data files into pickle files. The data files used by the UFold model can be downloaded from the network disk; place them in the data folder before use.
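As an illustration of that pre-processing step, here is a hedged sketch of parsing one bpseq record (columns: position, base, paired position, with 0 meaning unpaired) and serialising it with pickle. The repository's actual pickle layout may differ:

```python
import pickle

def parse_bpseq(text):
    # Turn a bpseq record into (sequence, 0-based pair list, -1 = unpaired)
    seq, pairs = [], []
    for line in text.strip().splitlines():
        _, base, j = line.split()
        seq.append(base)
        pairs.append(int(j) - 1)
    return "".join(seq), pairs

record = parse_bpseq("1 G 6\n2 C 0\n3 A 0\n4 U 0\n5 G 0\n6 C 1\n")
blob = pickle.dumps(record)  # the kind of object one might store in data/
print(pickle.loads(blob))
```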
pip install -r requirements.txt
Download the pickle file and place it in the data folder.
Modify epoch_first in config.json to set the number of training epochs.
Execute the training script.
After the dataset is prepared, start the training as follows.
python train.py --train_files dataset_A dataset_B
--train_files: One or more datasets can be selected from (['ArchiveII','TS0','bpnew','TS1','TS2','TS3']) for training.
--device_target: Can be selected from (['GPU', 'Ascend']).
--device_id: Can be selected according to the environment.
python ufold_test.py --test_files TS2
--test_files: One or more datasets can be selected for inference from (['ArchiveII','TS0','bpnew','TS1','TS2','TS3']).
--ckpt_file: Select the ckpt file to load.
--device_target: Can be selected from (['GPU', 'Ascend']).
--device_id: Can be selected according to the environment.
.
└─UFold
    ├─README.md               # README
    ├─ckpt_models             # weight files generated by training with MindSpore
    ├─data                    # pickle files obtained by pre-processing the datasets
    ├─ascend310_infer         # application for Ascend 310 inference
    ├─src
    │   ├─config.json         # parameter settings
    │   ├─config.py           # settings loader
    │   ├─data_generator.py   # custom dataset classes
    │   ├─Network.py          # UFold network definition
    │   ├─utils.py            # helper functions
    │   └─postprocess.py      # post-processing for optimization
    ├─eval.py                 # evaluation script
    ├─train.py                # training script
    ├─preprocess.py           # pre-processing for Ascend 310 inference
    └─ascend_postprocess.py   # post-processing for Ascend 310 inference
"BATCH_SIZE":1,
"epoch_first":100
config.json
, including BATCH_SIZE and training epoch.train.py
to start the UFold training python train.py --train_files dataset_A
--train_files: One or more datasets can be selected from (['ArchiveII','TS0','bpnew','TS1','TS2','TS3']) for training.
--device_target: Can be selected from (['GPU', 'Ascend']).
--device_id: Can be selected according to the environment.
# grep "loss is " train.log
epoch: 1, loss: 1.4842823
epoch: 2, loss: 1.0897788
Run eval.py for evaluation. Note: you can use your own trained ckpt for evaluation, or use the provided pretrained ckpt files. (ufold_train_pdbfinetune.ckpt is used for TS1, TS2 and TS3; ufold_train.ckpt is used for bpnew, ArchiveII and TS0; ufold_train_99.ckpt can be used for all the test data.)
# Inference
python ufold_test.py --test_files TS2
--test_files: One or more datasets can be selected for inference from (['ArchiveII','TS0','bpnew','TS1','TS2','TS3']).
--ckpt_file: Select the ckpt file to load.
--device_target: Can be selected from (['GPU', 'Ascend']).
--device_id: Can be selected according to the environment.
The accuracy on the test datasets is as follows.
Average testing precision with pure post-processing: 0.781516432132
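For reference, base-pair precision over predicted and ground-truth contact maps can be computed as in this sketch. This is an assumption about how the metric is defined; the repository's postprocess.py may differ in details:

```python
import numpy as np

def pair_precision(pred, true):
    # pred/true are symmetric 0/1 contact maps; count each pair once
    # by looking only at the strict upper triangle.
    iu = np.triu_indices(pred.shape[0], k=1)
    p, t = pred[iu], true[iu]
    tp = np.sum((p == 1) & (t == 1))          # correctly predicted pairs
    predicted = np.sum(p == 1)                # all predicted pairs
    return tp / predicted if predicted else 0.0

true = np.zeros((4, 4)); true[0, 3] = true[3, 0] = 1
pred = np.zeros((4, 4)); pred[0, 3] = pred[3, 0] = 1; pred[1, 2] = pred[2, 1] = 1
print(pair_precision(pred, true))  # 0.5
```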
python export.py --ckpt_file [CKPT_PATH] --device_target [DEVICE_TARGET] --device_id [DEVICE_ID]
Before inference, you need to run preprocess.py to generate the required contacts, embedding_batch and seq_ori files.
# Ascend310 preprocess
cd ascend310_infer
python ../preprocess.py --test_files TS2
--test_files: One or more datasets can be selected from (['ArchiveII','TS0','bpnew','TS1','TS2','TS3']).
The preprocessed data are stored in the contact, ori and preprocess_Result folders respectively; then run the inference script.
# Ascend310 inference
bash build.sh [MINDIR_PATH] [DATA_PATH] [ANN_FILE] [DEVICE_ID]
Finally, use ascend_postprocess.py to post-process the data to match the model size and obtain the final accuracy.
# Ascend310 postprocess
python ../ascend_postprocess.py
Average testing precision with pure post-processing: 0.8701159951160506
This repository ports UFold to MindSpore for Huawei biocomputing.