CLIP (Contrastive Language-Image Pre-Training) is a transformer model trained on image-text pairs. Once pre-trained, it can perform zero-shot classification of any given image without any fine-tuning.
Paper: Alec Radford, Jong Wook Kim, et al., "Learning Transferable Visual Models From Natural Language Supervision", 2021.
Note: the original CLIP training code has not been open-sourced, so MindFormers provides pretrain and finetune functionality without an accuracy guarantee; so far, only zero-shot image classification accuracy has been aligned.
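For intuition, zero-shot classification here reduces to ranking candidate labels by the similarity between one image embedding and one text embedding per label. Below is a minimal, framework-agnostic sketch in NumPy; the embedding size, temperature value, and random stand-in vectors are illustrative assumptions, not MindFormers internals.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    # image_emb: (d,) and text_embs: (num_labels, d), both L2-normalized,
    # so the dot products below are cosine similarities.
    logits = temperature * text_embs @ image_emb
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()               # one probability per label

# Random stand-in embeddings for 5 candidate labels (illustrative only):
rng = np.random.default_rng(0)
img = rng.normal(size=512)
img /= np.linalg.norm(img)
txt = rng.normal(size=(5, 512))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)
print(zero_shot_scores(img, txt))  # five scores summing to 1.0
```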
Dataset directory structure (Flickr8k):

```text
└─Flickr8k
    ├─Flickr8k_Dataset
    |    └─Flickr8k_Dataset
    └─Flickr8k_text
         ├─Flickr8k.devImages.txt
         ├─Flickr8k.testImages.txt
         ├─Flickr8k.trainImages.txt
         └─Flickr8k.token.txt
```
Dataset directory structure (CIFAR-100):

```text
└─cifar-100-python
    ├─meta
    ├─test
    └─train
```
Developers need to clone the repository in advance.
Please refer to "Launching via Scripts".
Test runs via script:
Multi-device accuracy for CLIP is currently abnormal, so only single-device runs are supported; this will be fixed in a later release.
```shell
# pretrain
python run_mindformer.py --config ./configs/clip/run_clip_vit_b_32_pretrain_flickr8k.yaml --run_mode train --train_dataset_dir [DATASET_PATH]

# evaluate
python run_mindformer.py --config ./configs/clip/run_clip_vit_b_32_zero_shot_image_classification_cifar100.yaml --run_mode eval --eval_dataset_dir [DATASET_PATH]

# predict
python run_mindformer.py --config ./configs/clip/run_clip_vit_b_32_zero_shot_image_classification_cifar100.yaml --run_mode predict --predict_data [PATH_TO_IMAGE]
```
Developers need to install MindFormers via pip in advance. For detailed interface descriptions, please refer to the API documentation.
```python
import mindspore; mindspore.set_context(mode=0, device_id=0)  # mode=0 selects graph mode
from mindformers import CLIPModel, CLIPConfig

CLIPModel.show_support_list()
# Output:
# - support list of CLIPModel is:
# -    ['clip_vit_b_32', 'clip_vit_b_16', 'clip_vit_l_14', 'clip_vit_l_14@336']
# - -------------------------------------

# Load the model by model name
model = CLIPModel.from_pretrained("clip_vit_b_32")

# Load the model from a model config
config = CLIPConfig.from_pretrained("clip_vit_b_32")
# {'text_config': {'hidden_size': 512, 'vocab_size': 49408, 'max_position_embeddings': 77,
# 'num_hidden_layers': 12}, 'vision_config': {'hidden_size': 768, 'image_size': 224, 'patch_size': 32,
# 'num_hidden_layers': 12}, 'projection_dim': 512, 'ratio': 64, 'checkpoint_name_or_path': 'clip_vit_b_32',
# 'dtype': 'float16'}
model = CLIPModel(config)
```
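The fields printed above appear to be plain config attributes, so they can presumably be adjusted before building the model. A minimal sketch under that assumption (the attribute write is not a documented workflow):

```python
# Assumption: config fields are writable attributes, as the printed dict suggests.
config = CLIPConfig.from_pretrained("clip_vit_b_32")
config.dtype = "float32"  # e.g. switch from float16 to full precision
model = CLIPModel(config)
```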
```python
import mindspore; mindspore.set_context(mode=0, device_id=0)
from mindformers.trainer import Trainer
from mindformers.tools.image_tools import load_image

# Initialize the pre-training task
trainer = Trainer(task='contrastive_language_image_pretrain',
                  model='clip_vit_b_32',
                  train_dataset='./Flickr8k')
trainer.train()  # start pre-training

# Initialize the downstream zero-shot image classification task
trainer = Trainer(task='zero_shot_image_classification',
                  model='clip_vit_b_32',
                  eval_dataset='./cifar-100-python')
img = load_image("https://ascend-repo-modelzoo.obs.cn-east-2.myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")

# Option 1: evaluate and predict with the weights just trained
trainer.evaluate(eval_checkpoint=True)
predict_result = trainer.predict(predict_checkpoint=True, input_data=img, top_k=3)
print(predict_result)

# Option 2: download pre-trained weights from OBS, then evaluate and predict
trainer.evaluate()  # downloads weights and evaluates
predict_result = trainer.predict(input_data=img, top_k=3)  # downloads weights and predicts
print(predict_result)
```
```python
import mindspore; mindspore.set_context(mode=0, device_id=0)
from mindformers import pipeline
from mindformers.tools.image_tools import load_image

classifier = pipeline("zero_shot_image_classification",
                      model="clip_vit_b_32",
                      candidate_labels=["sunflower", "tree", "dog", "cat", "toy"])
img = load_image("https://ascend-repo-modelzoo.obs.cn-east-2."
                 "myhuaweicloud.com/XFormer_for_mindspore/clip/sunflower.png")
classifier(img)
# Output:
# [[{'score': 0.99995565, 'label': 'sunflower'}, {'score': 2.5318595e-05, 'label': 'toy'},
#   {'score': 9.903885e-06, 'label': 'dog'}, {'score': 6.75336e-06, 'label': 'tree'},
#   {'score': 2.396818e-06, 'label': 'cat'}]]
```
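As the printed output shows, the pipeline returns one list per input image with candidate labels sorted by score, so the top prediction can be read off directly:

```python
result = classifier(img)
top = result[0][0]  # highest-scoring label for the first (and only) image
print(top["label"], top["score"])  # e.g. sunflower 0.99995565
```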
| model | task_type | model_type | datasets | Top1-accuracy | log | example |
|---|---|---|---|---|---|---|
| clip | pretrained | clip_vit_b_32<br>clip_vit_b_16<br>clip_vit_l_14<br>clip_vit_l_14@336 | flickr8k | \ | \ | pretrain link |
| clip | zero_shot_image_classification | clip_vit_b_32<br>clip_vit_b_16<br>clip_vit_l_14<br>clip_vit_l_14@336 | cifar100 | 57.24%<br>61.41%<br>69.67%<br>68.19% | \ | eval link<br>predict link |
The clip_vit_b_32 weights in this repository come from the ViT-B/32 checkpoint of openai/clip and were obtained via the following steps:

1. Download the ViT-B/32 model weights from the link above.
2. Run the conversion script to produce the converted output file clip_vit_b_32.ckpt.

The remaining weights are obtained in the same way.
```shell
python mindformers/models/clip/convert_weight.py --torch_path "PATH OF ViT-B/32.pt" --mindspore_path "SAVE PATH OF clip_vit_b_32.ckpt"
```
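Once converted, clip_vit_b_32.ckpt can be loaded with MindSpore's generic checkpoint APIs. A minimal sketch, assuming that setting checkpoint_name_or_path to None skips the automatic weight download:

```python
import mindspore
from mindformers import CLIPModel, CLIPConfig

config = CLIPConfig.from_pretrained("clip_vit_b_32")
config.checkpoint_name_or_path = None  # assumption: avoid auto-loading the hub weights
model = CLIPModel(config)

# load_checkpoint / load_param_into_net are standard MindSpore APIs
params = mindspore.load_checkpoint("clip_vit_b_32.ckpt")
mindspore.load_param_into_net(model, params)
```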