I first came across this model on Hugging Face: https://huggingface.co/nlpconnect/vit-gpt2-image-captioning.

My training honestly didn't go very well — the loss never came down (same story as with clip). I mainly approached this as a learning exercise: read the source code until it made sense, get the whole pipeline running, and share the details and the pitfalls I hit along the way.
Whoever came up with this model is really clever (what a glue monster!). Judging from the source code, it works roughly like this:

- vit serves as the encoder and outputs encoder_hidden_states (green part 1 in the figure).
- gpt2 serves as the decoder and consumes those encoder_hidden_states (green part 3).
- The encoder_hidden_states the encoder produces and the encoder_hidden_states the decoder expects have different dimensions, so a linear layer is inserted between them (green part 2)
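That glue layer can be sketched in isolation like this (toy dimensions — these are assumptions for illustration, not the real configs):

```python
import torch
from torch import nn

# Toy illustration of the "glue" linear layer (hypothetical sizes):
# the encoder emits hidden states of one width, the decoder expects
# another, so a single nn.Linear maps between them.
enc_hidden, dec_hidden = 768, 1024
batch, num_patches = 2, 197  # 196 patches + [CLS] for a 224x224 ViT

encoder_hidden_states = torch.randn(batch, num_patches, enc_hidden)  # green part 1
enc_to_dec_proj = nn.Linear(enc_hidden, dec_hidden)                  # green part 2
projected = enc_to_dec_proj(encoder_hidden_states)                   # fed to the decoder, green part 3
print(projected.shape)  # torch.Size([2, 197, 1024])
```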
During training, the model mainly needs two inputs:

- pixel_values: produced from the image by the processor.
- labels: the input_ids produced from the text by the tokenizer
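For illustration, one training batch might look like this (the shapes and vocabulary size are assumptions for a 224x224 ViT and a 16-token caption; in practice both tensors come from the image processor and the tokenizer):

```python
import torch

vocab_size = 21128  # hypothetical; depends on the gpt2_chinese tokenizer
batch = {
    # what the image processor produces: (batch, channels, height, width)
    "pixel_values": torch.randn(2, 3, 224, 224),
    # what the tokenizer produces: (batch, seq_len) token ids
    "labels": torch.randint(0, vocab_size, (2, 16)),
}
```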
The loss is computed exactly as in gpt2: it is autoregressive, so the targets are essentially just the labels shifted by one position. I have published my trained model on huggingface: https://huggingface.co/yuanzhoulvpi/vit-gpt2-image-chinese-captioning
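The gpt2-style shifted loss can be sketched like this (toy tensors, not real model outputs):

```python
import torch
import torch.nn.functional as F

batch, seq_len, vocab = 2, 5, 11  # toy sizes
logits = torch.randn(batch, seq_len, vocab)         # decoder outputs
labels = torch.randint(0, vocab, (batch, seq_len))  # tokenized caption

# Autoregressive shift: the logits at position t predict the token at t+1,
# so drop the last logit position and the first label position.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab), shift_labels.view(-1))
```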
This module processes data in much the same way as the clip model — see the neighboring folder for the clip data-processing approach.
For data processing, simply swap in the processdta_02.ipynb file (the notebooks involved are processdta_01.ipynb, processdta_02.ipynb, and processdta_03.ipynb).

Training is done in train_encoder_decoder.ipynb:

- the "google/vit-base-patch16-224" model serves as the encoder;
- the "yuanzhoulvpi/gpt2_chinese" model serves as the decoder;
- the two are glued together with VisionEncoderDecoderModel.

Training ran on a single 3090 GPU; the model has roughly 216 million parameters and took over 20 hours, though most of that time was spent on IO (loading images).
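The gluing step can be sketched as follows — here with tiny, randomly initialized stand-in configs so it runs quickly; the notebook loads the real pretrained weights instead, e.g. via `VisionEncoderDecoderModel.from_encoder_decoder_pretrained("google/vit-base-patch16-224", "yuanzhoulvpi/gpt2_chinese")`:

```python
from transformers import (GPT2Config, ViTConfig,
                          VisionEncoderDecoderConfig, VisionEncoderDecoderModel)

# Tiny stand-in configs (hypothetical sizes, just to show the wiring).
encoder_cfg = ViTConfig(hidden_size=32, num_hidden_layers=2,
                        num_attention_heads=2, intermediate_size=64)
decoder_cfg = GPT2Config(n_embd=48, n_layer=2, n_head=2)

# from_encoder_decoder_configs marks the decoder as a decoder with
# cross-attention; because the hidden sizes differ (32 vs 48), the model
# also creates the enc_to_dec_proj linear layer described above.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(encoder_cfg, decoder_cfg)
model = VisionEncoderDecoderModel(config=config)
print(model.enc_to_dec_proj)  # the glue linear layer
```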
For inference, see infer_encoder_decoder.ipynb:
```python
from transformers import (VisionEncoderDecoderModel, AutoTokenizer,
                          ViTImageProcessor)
import torch
from PIL import Image

# Path on the Hub, or a local checkpoint such as
# "vit-gpt2-image-chinese-captioning/checkpoint-3200"
vision_encoder_decoder_model_name_or_path = "yuanzhoulvpi/vit-gpt2-image-chinese-captioning"

processor = ViTImageProcessor.from_pretrained(vision_encoder_decoder_model_name_or_path)
tokenizer = AutoTokenizer.from_pretrained(vision_encoder_decoder_model_name_or_path)
model = VisionEncoderDecoderModel.from_pretrained(vision_encoder_decoder_model_name_or_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

max_length = 16
num_beams = 4
gen_kwargs = {"max_length": max_length, "num_beams": num_beams}


def predict_step(image_paths):
    # Load every image and make sure it is RGB.
    images = []
    for image_path in image_paths:
        i_image = Image.open(image_path)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")
        images.append(i_image)

    # Preprocess to pixel_values and generate caption token ids.
    pixel_values = processor(images=images, return_tensors="pt").pixel_values
    pixel_values = pixel_values.to(device)
    output_ids = model.generate(pixel_values, **gen_kwargs)

    # Decode the token ids back into text.
    preds = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
    preds = [pred.strip() for pred in preds]
    return preds


predict_step(['bigdata/image_data/train-1000200.jpg'])
```