| File | Last modified |
|---|---|
| README.md | 3 years ago |
| blacklist_urls.py | 3 years ago |
| cleanup_dataset.py | 3 years ago |
| find_duplicates.py | 3 years ago |
| group_duplicates_url.py | 3 years ago |
| make_gpt2_dataset.py | 3 years ago |
| make_gpt2_sizes.py | 3 years ago |
| merge_jsons.py | 3 years ago |
| remove_group_duplicates.py | 3 years ago |
| run_make_gpt2_dataset.sh | 3 years ago |
| tokenizer.py | 3 years ago |
The following steps show how to prepare the training dataset used to train the model.
Install the required Python libraries, then install the LSH package from source:

    pip install ftfy langdetect numpy torch pandas nltk sentencepiece boto3 tqdm regex bs4 newspaper3k htmlmin tldextract
    git clone https://github.com/mattilyra/LSH
    cd LSH
    python setup.py install
Remove blacklisted URLs from the downloaded deduplicated URL list:

    python blacklist_urls.py <path to the downloaded deduplicated URLs> <filename for clean urls, e.g. clean_urls.txt>
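As a rough illustration of the URL filtering idea (not the repository's blacklist_urls.py itself), the sketch below drops URLs whose registered domain appears in a hypothetical blacklist, using tldextract from the dependency list above; the file names and the blacklist entries are placeholders.

```python
# Hedged sketch of domain-based URL filtering; blacklist_urls.py's actual
# rules and blacklist are not reproduced here.
import tldextract

# Hypothetical set of registered domains to exclude.
BLACKLISTED_DOMAINS = {"example-spam.com", "example-mirror.net"}

def keep_url(url):
    # registered_domain is e.g. "example.co.uk" for "sub.example.co.uk".
    return tldextract.extract(url).registered_domain not in BLACKLISTED_DOMAINS

with open("deduplicated_urls.txt") as src, open("clean_urls.txt", "w") as dst:
    for line in src:
        url = line.strip()
        if url and keep_url(url):
            dst.write(url + "\n")
```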
Download the content from the clean URLs with openwebtext's utilities.
Merge the contents into one loose JSON file with one JSON object per line, of the format `{'text': text, 'url': unique_url}`. It is important for the url to be unique.
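To make the expected format concrete, the sketch below writes a few hypothetical (url, text) pairs as loose JSON, one object per line, skipping repeated URLs; the repository's own merge_jsons.py may handle this differently.

```python
# Hedged sketch of the loose-JSON format: one {'text', 'url'} object per line,
# with each URL appearing exactly once. The `documents` iterable is hypothetical.
import json

documents = [
    ("https://example.com/article-1", "First article text ..."),
    ("https://example.com/article-2", "Second article text ..."),
]

seen_urls = set()
with open("merged_data.json", "w", encoding="utf-8") as out:
    for url, text in documents:
        if url in seen_urls:
            continue  # the url field must be unique across the file
        seen_urls.add(url)
        out.write(json.dumps({"text": text, "url": url}) + "\n")
```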
Clean the merged dataset:

    python cleanup_dataset.py <input data file> <output cleaned data filename>
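As an illustration of the kind of per-document cleaning typically applied at this stage (text repair with ftfy and language filtering with langdetect, both listed in the dependencies above), a minimal sketch might look like the following; cleanup_dataset.py's exact filters and thresholds may differ.

```python
# Hedged sketch of a per-line cleaning pass over the merged loose-JSON file.
# The minimum-length cutoff and English-only filter are illustrative choices.
import json

import ftfy
from langdetect import detect

def clean_record(line, min_chars=128):
    record = json.loads(line)
    text = ftfy.fix_text(record["text"])
    if len(text) < min_chars:
        return None  # drop very short documents
    try:
        if detect(text) != "en":
            return None  # keep English documents only
    except Exception:
        return None  # language detection can fail on degenerate inputs
    record["text"] = text
    return json.dumps(record)

with open("merged_data.json") as src, open("cleaned_data.json", "w") as dst:
    for line in src:
        cleaned = clean_record(line)
        if cleaned is not None:
            dst.write(cleaned + "\n")
```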
Find possible duplicate documents and store their URLs for later processing:

    python find_duplicates.py <input cleaned data file> <output possible duplicate urls filename>
Based on the similarity measure defined inside the `is_similar` function (default: 0.9), group URLs that are similar. For each group, only one URL should be kept; the rest are removed.

    python group_duplicate_urls.py <possible duplicate urls file> <output file containing similar urls>
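To make the grouping step concrete, the hedged sketch below links candidate duplicate URL pairs with a small union-find and keeps one URL per group. The threshold check stands in for `is_similar`, whose actual implementation lives in the repository, and the candidate pairs are hypothetical.

```python
# Hedged sketch: group URLs connected by "similar" pairs, then keep one per group.
def find(parent, x):
    # Path-compressing find for a simple union-find over URLs.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_similar(pairs, threshold=0.9):
    parent = {}
    for url_a, url_b, similarity in pairs:
        for url in (url_a, url_b):
            parent.setdefault(url, url)
        if similarity >= threshold:  # stand-in for is_similar's default 0.9
            parent[find(parent, url_a)] = find(parent, url_b)
    groups = {}
    for url in parent:
        groups.setdefault(find(parent, url), []).append(url)
    return list(groups.values())

# Hypothetical candidate pairs with similarity scores from the previous step.
pairs = [
    ("https://a.example/1", "https://b.example/1", 0.95),
    ("https://a.example/1", "https://c.example/1", 0.92),
    ("https://d.example/2", "https://e.example/2", 0.40),
]
for group in group_similar(pairs):
    keep, *remove = group
    print("keep:", keep, "remove:", remove)
```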
Remove the similar documents detected in the previous step:

    python remove_group_duplicates.py <file containing similar documents> <cleaned data file> <output file containing deduplicated data>
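Conceptually, this step streams the cleaned data and drops records whose URL was marked for removal. Below is a minimal sketch under the assumption that the removal list contains one URL per line; remove_group_duplicates.py's actual input format may differ.

```python
# Hedged sketch: drop records whose URL appears in a removal list.
import json

with open("urls_to_remove.txt") as f:
    urls_to_remove = {line.strip() for line in f if line.strip()}

with open("cleaned_data.json") as src, open("deduped_data.json", "w") as dst:
    for line in src:
        record = json.loads(line)
        if record["url"] not in urls_to_remove:
            dst.write(line)
```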
Shuffle the dataset:

    shuf <cleaned deduped data file> -o train_data.json
Open-source models from the "WuDao" (悟道) project.