Visual Question Answering (git-base-vqav2)
GIT (short for GenerativeImage2Text) model, base-sized version, fine-tuned on VQAv2.
It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained with "teacher forcing" on a large number of (image, text) pairs.
The model's training objective is simply to predict the next text token, given the image tokens and the previous text tokens.
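The next-token objective with teacher forcing can be illustrated with a minimal sketch. The function name `teacher_forcing_pairs` and the token ids are hypothetical and only show how the ground-truth sequence is shifted by one position; this is not the model's actual training code.

```python
from typing import List, Tuple


def teacher_forcing_pairs(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Split a ground-truth text sequence into decoder inputs and targets.

    With teacher forcing, the decoder is fed the true previous tokens
    (not its own predictions), and at each position it is trained to
    predict the token that comes next.
    """
    inputs = token_ids[:-1]   # tokens the decoder sees
    targets = token_ids[1:]   # token to predict at each position
    return inputs, targets


# Hypothetical token ids for a short text sequence.
inputs, targets = teacher_forcing_pairs([101, 7, 8, 102])
print(inputs)   # [101, 7, 8]
print(targets)  # [7, 8, 102]
```

In the real model the image tokens are also part of the context at every position; they are left out here to keep the shift itself visible.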
The model has full access to (i.e. a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e. a causal attention mask is used for the text tokens) when predicting the next text token.
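The attention pattern described above can be sketched as a small mask builder. The function name `git_attention_mask` is hypothetical, and the sketch assumes image tokens are placed before text tokens and do not attend to text tokens, which the text above does not state explicitly.

```python
from typing import List


def git_attention_mask(num_image_tokens: int, num_text_tokens: int) -> List[List[int]]:
    """Return mask[i][j] == 1 if position i may attend to position j.

    Layout assumed: image tokens first, then text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position sees all image tokens (bidirectional).
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # Text positions see only themselves and earlier text (causal).
                mask[i][j] = 1
            # Assumption: image positions never attend to text positions.
    return mask


for row in git_attention_mask(num_image_tokens=2, num_text_tokens=3):
    print(row)
# [1, 1, 0, 0, 0]
# [1, 1, 0, 0, 0]
# [1, 1, 1, 0, 0]
# [1, 1, 1, 1, 0]
# [1, 1, 1, 1, 1]
```

The upper-left block is fully visible (image to image), while the lower-right block is lower-triangular (text to text).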
This allows the model to be used for tasks like:
- image and video captioning
- visual question answering (VQA) on images and videos
- even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).
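For visual question answering, a minimal sketch using the Hugging Face `transformers` library might look like the following. It assumes `torch`, `Pillow`, and `transformers` are installed and that the `microsoft/git-base-vqav2` checkpoint can be downloaded; the helper names `build_prompt_ids` and `answer_question` are hypothetical and are not part of this repository's service code.

```python
from typing import List


def build_prompt_ids(question_ids: List[int], cls_token_id: int) -> List[int]:
    """Prepend the [CLS] token id to the tokenized question."""
    return [cls_token_id] + list(question_ids)


def answer_question(image_path: str, question: str,
                    checkpoint: str = "microsoft/git-base-vqav2") -> str:
    """Generate text for an (image, question) pair.

    The decoded string contains the question followed by the generated answer.
    """
    # Heavy dependencies are imported lazily so the helper above works without them.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Encode the image into pixel values.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Tokenize the question without special tokens, then prepend [CLS].
    question_ids = processor(text=question, add_special_tokens=False).input_ids
    input_ids = torch.tensor(
        build_prompt_ids(question_ids, processor.tokenizer.cls_token_id)
    ).unsqueeze(0)

    # The model continues the question tokens with the answer tokens.
    with torch.no_grad():
        generated_ids = model.generate(
            pixel_values=pixel_values, input_ids=input_ids, max_length=50
        )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

A call such as `answer_question("bus.jpg", "what does the front of the bus say at the top?")` (with `bus.jpg` standing in for any local image) would return the question text followed by the model's answer.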
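Model source: https://hf-mirror.com/microsoft/git-base-vqav2
Model application development and deployment
Model serving
This model is packaged as a service with the ServiceBoot microservice engine. See: CubeAI Model Development Guide
Running directly from source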
$ sh pip-install-reqs.sh
$ serviceboot start
or
$ python3 run_model_server.py
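Local containerized deployment
One-click local containerized deployment and execution. See: CubeAI Model Standalone Deployment Guide, or CubeAI Docker Builder
Cloud-native network deployment
This model service can be published with one click to the CubeAI (智立方) platform for sharing and deployment. See: CubeAI Model Publishing Guide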