Year | Method | Type | Description | Publication | Code |
--- | --- | --- | --- | --- | --- |
2021 | CLIP | Hard-Prompt | Introduced contrastive language-image pre-training. | ICML 2021 | Link |
2022 | CoOp | Soft-Prompt | Context Optimization for prompt tuning using learnable embeddings. | IJCV 2022 | Link |
2022 | CPT | Hard-Prompt | Cross-modal Prompt Tuning that reformulates visual grounding as a fill-in-the-blank task with color-based markers. | - | Link |
2022 | DenseCLIP | Text Soft-Prompt | Extended CLIP to dense vision tasks with optimized textual prompts. | CVPR 2022 | Link |
2022 | FewVLM | Hard-Prompt | Few-shot learning framework for VLMs using hard prompts. | ACL 2022 | Link |
2022 | ProDA | Text Soft-Prompt | Prompt Distribution Learning: models a distribution over diverse prompts rather than a single context. | CVPR 2022 | Link |
2022 | ProGrad | Text Soft-Prompt | Prompt-aligned gradient updates that keep tuned prompts consistent with the model's general knowledge. | CVPR 2022 | Link |
2022 | PEVL | Hard-Prompt | Position-enhanced pre-training and prompt tuning that expresses object positions as text for cross-modal alignment. | EMNLP 2022 | Link |
2022 | VPT | Visual Soft-Prompt | Visual embeddings as learnable prompts. | ECCV 2022 | Link |
2022 | TPT | Text Soft-Prompt | Test-time Prompt Tuning: adapts the text prompt per test sample for zero-shot generalization. | NeurIPS 2022 | Link |
2023 | ViPT | Visual Soft-Prompt | Visual Prompt multi-modal Tracking: prompt-tunes a frozen RGB tracker for downstream multi-modal tracking tasks. | CVPR 2023 | Link |
2023 | MaPLe | Visual-text & Modal-Prompt | Multi-modal Prompt Learning with coupled prompts in both the vision and language branches. | CVPR 2023 | Link |
2023 | KgCoOp | Text Soft-Prompt | Knowledge-guided Context Optimization. | CVPR 2023 | Link |
2023 | LASP | Text Soft-Prompt | Text-to-Text Optimization for Language-Aware Soft Prompting. | CVPR 2023 | Link |
2023 | DAM-VP | Visual Soft-Prompt | Diversity-Aware Meta Visual Prompting. | CVPR 2023 | Link |
2023 | TaskRes | Text Soft-Prompt | Task Residual for Tuning Vision-Language Models. | CVPR 2023 | Link |
2023 | RPO | Text Hard-Prompt | Read-only Prompt Optimization for Few-shot Learning. | ICCV 2023 | Link |
2023 | PromptSRC | Visual-Text Soft-Prompt | Self-regulating prompt learning that constrains prompts to retain the model's general knowledge. | ICCV 2023 | Link |
2024 | DePT | Visual-Text Soft-Prompt | Decoupled Prompt Tuning. | CVPR 2024 | Link |
2024 | TCP | Text Soft-Prompt | Textual-based Class-aware Prompt tuning. | CVPR 2024 | Link |
2024 | MMA | Visual-Text Soft-Prompt | Multi-Modal Adapter that aligns the vision and text branches. | CVPR 2024 | Link |
2024 | HPT | Visual-Text Soft-Prompt | Hierarchical Prompt Tuning. | AAAI 2024 | Link |
2024 | CoPrompt | Soft-Prompt | Consistency-guided Prompt Learning. | ICLR 2024 | Link |
2024 | CasPL | Visual-Text Soft-Prompt | Cascade Prompt Learning. | ECCV 2024 | Link |
2024 | PromptKD | Visual-Text Soft-Prompt | Unsupervised prompt distillation from a larger teacher VLM. | CVPR 2024 | Link |
2025 | DPC | Visual-Text Soft-Prompt | Dual-Prompt Collaboration for tuning VLMs. | CVPR 2025 | Link |
2025 | 2SFS | Visual-Text Soft-Prompt | Two-Stage Few-Shot adaptation for VLMs. | CVPR 2025 | Link |
2025 | MMRL | Visual-Text Soft-Prompt | Multi-Modal Representation Learning. | CVPR 2025 | Link |
2025 | NLPrompt | Text Soft-Prompt | Noise-Label Prompt Learning. | CVPR 2025 | Link |
2025 | TAC | Text Soft-Prompt | Task-Aware Clustering for Prompting. | CVPR 2025 | Link |
2025 | TextRefiner | Text Soft-Prompt | Internal visual features as prompt refiners. | AAAI 2025 | Link |
2025 | ProText | Text Soft-Prompt | Prompting with text-only supervision. | AAAI 2025 | Link |
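Most of the soft-prompt methods above follow the CoOp recipe: freeze the vision-language model and learn only a small set of context embeddings that are prepended to the class-name tokens. Below is a minimal PyTorch sketch of that idea; the tiny linear "text encoder", the random image features, and all dimensions are placeholders standing in for a real frozen VLM such as CLIP, not any method's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# CoOp-style soft-prompt tuning sketch (illustrative only; toy stand-ins for CLIP).
embed_dim, n_ctx, n_classes = 64, 4, 5
torch.manual_seed(0)

# Frozen "text encoder": here just a linear layer over mean-pooled tokens.
text_encoder = nn.Linear(embed_dim, embed_dim)
for p in text_encoder.parameters():
    p.requires_grad_(False)

# Frozen class-name token embeddings (one token per class, for simplicity).
class_tokens = torch.randn(n_classes, 1, embed_dim)

# Learnable context vectors -- the only trainable parameters (the "soft prompt").
ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

def class_text_features():
    # Prepend the shared learnable context to every class's token sequence.
    prompts = torch.cat([ctx.unsqueeze(0).expand(n_classes, -1, -1), class_tokens], dim=1)
    feats = text_encoder(prompts.mean(dim=1))        # (n_classes, embed_dim)
    return F.normalize(feats, dim=-1)

# Pretend these came from a frozen image encoder.
image_feats = F.normalize(torch.randn(8, embed_dim), dim=-1)
labels = torch.randint(0, n_classes, (8,))

optimizer = torch.optim.SGD([ctx], lr=0.1)
for step in range(20):
    logits = 100.0 * image_feats @ class_text_features().t()   # scaled cosine similarity
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final loss: {loss.item():.3f}")
```

The key design choice all these methods share is visible in the optimizer line: only the prompt parameters are updated, so the cost of adaptation is a few thousand parameters rather than the full backbone.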
Model | Year | Fine-Tuning Strategy | Architecture | Pretraining Objective | Pretrained Backbone Model | Vision Encoder / Tokenizer | Parameters | Training Data | Key Innovations |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
ViLBERT | 2019 | Full fine-tuning | Two-stream Transformer | Image-text alignment with co-attentional modules | BERT | Object-based features (Faster R-CNN) | 110M | COCO, Conceptual Captions | Co-attentional streams for image-text fusion |
VisualBERT | 2019 | Full fine-tuning | Single-stream Transformer | Masked language modeling with visual embeddings | BERT | Object-based features (Faster R-CNN) | 110M | COCO, VQA | Shared encoder for text and image inputs |
CLIP | 2021 | Zero-shot inference | Dual encoder | Contrastive image-text learning | Trained from scratch | ViT/ResNet | 63M–355M | 400M web image-text pairs | Contrastive learning with dual encoders; broad generalization via natural language supervision |
ALIGN | 2021 | Zero-shot inference | Dual encoder (EffNet-L2 + Transformer) | Contrastive image-text alignment | EfficientNet-L2 | EfficientNet | 1.8B | 1B+ noisy web image-text pairs | Large-scale noisy training with CLIP-style contrastive objectives |
SimVLM | 2021 | Full fine-tuning | Unified Transformer encoder-decoder | Prefix language modeling (unified vision-text sequence) | BERT + ResNet | Unified Transformer | 1B+ | Vision-language pairs | Simplified architecture with prefix modeling and no region-level supervision |
Florence | 2022 | Full fine-tuning | Unified Transformer | Unified encoder with multi-task supervision | Swin Transformer | Swin | 892M | Multilingual web-scale dataset | High-performance universal VL encoder |
Flamingo | 2022 | Few-shot in-context learning | Perceiver Resampler + Decoder-only Transformer | Frozen vision-language backbones with trainable cross-attention | Chinchilla | ViT-L/14 + Perceiver | 80B | M3W, ALIGN | In-context few-shot learning with frozen backbones and cross-modal fusion |
BLIP-2 | 2023 | Modular fine-tuning via Q-Former | Vision encoder + frozen LLM + Q-Former | Two-stage: vision-text + vision-to-language generation | ViT-G + OPT/FlanT5 | ViT-G + Q-Former | 223M–400M | WebLI, COCO, CC3M, CC12M | Q-Former for modular downstream tasks |
IDEFICS | 2023 | Parameter-efficient tuning | Unified Transformer with vision encoder | Instruction-tuned vision-language | OPT + ViT | ViT | 80B | COCO, VQAv2, A-OKVQA | Open-source instruction-following VLM |
PaliGemma 2 | 2024 | LoRA, fine-grained adapters | Transformer encoder-decoder | Multilingual + synthetic datasets | Gemma + ViT | ViT | - | Synthetic + real data (DOCCI, LAION, CC12M) | Multilingual generation + grounding |
Gemini 2.0 | 2024 | Modular fine-tuning | PaLM-based encoder-decoder + vision module | Multimodal pretraining with sparse transformers | PaLM 2 + ViT | ViT | - | Multilingual, synthetic corpus | Flexible and efficient multimodal reasoning |
Kosmos-2.5 | 2024 | Selective fine-tuning | Decoder-only Transformer with ViT + resampler | Document text recognition + image-to-Markdown generation | - | ViT-G/14, ViT-L/14 | 1.3B | Document images, OCR, structured markup data | Layout-aware multimodal literacy via visual-text fusion with Markdown generation |
GPT-4V | 2024 | No tuning (chat interface) | Unified Transformer with vision-text fusion | Text + image pretraining | GPT-4 | Custom ViT-like encoder | - | Vision-language aligned corpus | GPT-4 vision support with image-text joint encoding |
Claude 3 Opus | 2024 | Supervised fine-tuning via API | Encoder-decoder transformer | Proprietary (undisclosed) | Proprietary | - | - | Multimodal benchmarks | Safe and high-performance multimodal chat |
LongVILA | 2024 | Efficient parameter tuning | Video-based encoder-decoder transformer | Video-language transformer | Custom video model | Patch + frame tokenizer | - | Long video, image sequences | Long-context video QA and interleaved image-text reasoning |
Molmo | 2024 | Instruction tuning | Encoder-decoder transformer | Transformer-based VLM | - | ViT-L/14 (CLIP) | 72B | Open PixMo data | Open-source transparent training |
Qwen 2.5 VL | 2025 | Instruction tuning | Transformer decoder with visual patch input | Vision transformer + LLM fusion | Qwen 2.5 + ViT | ViT | 3B/7B/72B | Docs, images, audio | OCR + document QA specialization |
DeepSeek Janus | 2025 | Adapter-based fine-tuning | Dual-stream Transformer with MoE | Multimodal instruction-following | DeepSeek + ViT | ViT | 7B | Instruction + synthetic datasets | Efficient MoE-based dual-stream VLM |
MiniCPM-o 2.6 | 2025 | Plug-in modules + instruction tuning | Modular lightweight Transformer | Multimodal instruction-following + OCR | MiniCPM + LLaMA3 | Vision adapter | 8B | Instruction-tuned corpus | GPT-4V-level OCR + real-time video understanding on-device |
Moondream | 2025 | Minimal fine-tuning | Decoder-only Transformer | Multimodal pretraining | - | Compact encoder | 1.86B | Open efficient datasets | Small footprint with privacy focus |
Pixtral | 2025 | Instruction tuning | Dual-stream compact transformer | Mistral-style ViT + LLM | Mistral + ViT | ViT | 12B | Multi-domain open-source corpus | ViT fusion in compact architecture |
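For the contrastive dual-encoder models in this table (CLIP, ALIGN), zero-shot inference simply scores an image against a set of candidate text prompts. A minimal sketch using the Hugging Face `transformers` CLIP checkpoint is shown below; the image URL and label prompts are arbitrary examples, and this is an illustrative recipe rather than the exact pipeline used by any listed model.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Zero-shot image classification with a CLIP-style dual encoder.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any image and candidate labels will do; this COCO sample URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds cosine similarities scaled by the learned temperature.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for prompt, p in zip(prompts, probs.tolist()):
    print(f"{prompt}: {p:.3f}")
```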
Dataset | Size | Modalities | Language(s) | Category Diversity | Known Biases / Limitations |
--- | --- | --- | --- | --- | --- |
MS COCO | 328K images | Image-Text | English | 91 object categories | Western-centric content; limited cultural diversity; object-centric focus |
VQAv2 | 204K images; 1.1M Q&A pairs | Image-QA | English | Everyday scenes with varied Q&A | Language bias; answer priors; question redundancy |
RadGraph | 221K reports; 10.5M annotations | Text (Radiology reports) | English | Radiology findings | Domain-specific; requires medical expertise for annotation; limited to chest X-rays |
GQA | 113K images; 22M questions | Image-QA | English | Compositional reasoning | Synthetic question generation; potential over-reliance on scene graphs |
GeoBench-VLM | 10K+ tasks | Satellite-Text | English | Natural disasters, terrain, infrastructure | Sparse labels; coverage gaps |
SBU Captions | 1M images | Image-Text | English | Web-sourced everyday scenes | Noisy captions; duplicate entries |
MIMIC-CXR | 377K images; 227K studies | Image-Text | English | Chest X-rays | Hospital-centric; privacy restrictions |
EXAMS-V | 20,932 questions | Mixed Multimodal | 11 languages | Exam-style reasoning across disciplines | Regional bias; multilingual challenge |
RS5M | 5M images | Satellite-Text | English | Remote sensing imagery | Sparse labels; class imbalance; varying image quality |
VLM4Bio | 30K instances | Image-Text-QA | English | Biodiversity, taxonomy | Domain-specific; taxonomic bias; limited generalizability |
PMC-OA | ~1.65M image-text pairs | Image-Text-QA | English | High diversity within the biomedical domain; Covers a wide range of diagnostic procedures, disease types, and medical findings | Caption noise; requires medical expertise |
WebLI-100B | 100 Billion image-text pairs | Image-Text | 100+ languages | Global content | Cultural/geographic bias; noisy data |
Dataset Type | Dataset Name | Description | Applications | Link |
--- | --- | --- | --- | --- |
Detection | COCO | 330k images with annotations for detection and segmentation. | Object detection, instance segmentation, image captioning. | Link |
Detection | Open Images | 9M+ annotated images for detection. | Object detection, captioning, visual relationship detection. | Link |
Classification | ImageNet | 14M labeled images across 1K classes. | Image classification, transfer learning. | Link |
Classification | Visual Genome | 108k images with scene graphs and object annotations. | VQA, object detection, scene understanding. | Link |
Segmentation | ADE20K | 20k images labeled across 150 categories. | Semantic segmentation, scene parsing. | Link |
Segmentation | Cityscapes | Urban scenes with pixel-level annotations. | Autonomous driving, semantic segmentation. | Link |
Text-to-Image | Flickr30k | 31k images with 5 captions each. | Image captioning, text-to-image generation. | Link |
Text-to-Image | COCO Captions | Subset of COCO with image captions. | Captioning, text-image synthesis. | Link |
Multimodal Alignment | VQA | 200k+ questions over 100k images. | Visual QA, multimodal reasoning. | Link |
Multimodal Alignment | EndoVis-18-VLQA | QA pairs for surgical/endoscopic videos. | Medical QA, surgical assistance. | Link |
Multimodal Alignment | VLM4Bio | 469k QA pairs, 30k images for biodiversity tasks. | Scientific QA, bio-research. | Link |
Multimodal Alignment | MS-COCO Text-Only | Captions-only version of MS-COCO. | Text-based retrieval, text-image matching. | Link |
Pre-training | Conceptual Captions | 3.3M web-sourced image-caption pairs. | Vision-language pretraining. | Link |
Pre-training | PathQA Bench Public | 456k pathology QA pairs for PathChat. | Pathology education, clinical AI. | Link |
Pre-training | SBU Captions | 1M web-collected image-caption pairs. | Captioning, multimodal learning. | Link |
Multimodal Retrieval | Flickr30k Entities | Object-level annotations on Flickr30k. | Caption alignment, image-text retrieval. | Link |
Multimodal Retrieval | RS5M | 5M satellite images with English descriptions. | Remote sensing, domain-specific VLM tuning. | Link |
Multimodal Retrieval | v-SRL | Visual elements annotated with semantic roles. | Multimodal grounding, semantic role labeling. | Link |
Multimodal Reasoning | CLEVR | Synthetic QA over generated scenes. | Visual reasoning, compositional QA. | Link |
Multimodal Reasoning | GMAI-MMBench | 284 datasets across 38 medical modalities. | Medical QA, clinical AI benchmarking. | Link |
Multimodal Reasoning | NavGPT-Instruct-10k | 10k steps for navigation-based QA. | Navigational reasoning, autonomous systems. | Link |
Multimodal Reasoning | GQA | 22M compositional QA pairs. | Visual reasoning, scene-based QA. | Link |
Multimodal Reasoning | EXAMS-V | 20,932 multilingual questions in 20 subjects. | Multilingual education QA, model benchmarking. | Link |
Multimodal Reasoning | MMVP-VLM | QA pairs for evaluating pattern understanding. | Visual pattern QA, image-text alignment. | Link |
Semantic Segmentation | ADE20K | 20k images across diverse scenes. | Semantic segmentation, scene understanding. | Link |
Semantic Segmentation | Cityscapes | Urban street views with labels. | Road scene analysis, autonomous driving. | Link |
Cross-Modal Transfer | MIMIC-CXR | Chest X-rays + radiology reports. | Clinical VLMs, cross-modal training. | Link |
Cross-Modal Transfer | MedNLI | Medical Natural Language Inference dataset. | Textual reasoning in medicine. | Link |
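Many of the captioning and retrieval datasets above (COCO, Flickr30k, SBU Captions) ship as an image folder plus a JSON annotation file. As a rough illustration, the snippet below loads COCO captions through `torchvision`; the local paths are placeholders for your own download, and `pycocotools` must be installed.

```python
from torchvision import transforms
from torchvision.datasets import CocoCaptions

# Placeholder paths -- point these at a local COCO download.
root = "coco/val2017"
ann_file = "coco/annotations/captions_val2017.json"

dataset = CocoCaptions(
    root=root,
    annFile=ann_file,
    transform=transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ]),
)

# Each item is one image tensor plus its list of reference captions (typically 5).
image, captions = dataset[0]
print(image.shape, len(captions))
print(captions[0])
```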
Dataset Name | Image-Text Pairs | QA Pairs | Description | Application | Link |
--- | --- | --- | --- | --- | --- |
VQA-Med 2020 | ❌ | ✅ | Dataset for VQA and VQG in radiology; includes questions about abnormalities and image-based question generation. | Medical diagnosis, clinical decision support, multimodal QA | Link |
ROCO | ✅ | ❌ | Multimodal dataset from PubMed Central with radiology/non-radiology images, captions, keywords, and UMLS concepts. | Captioning, classification, retrieval, VQA | Link |
VQA-Med 2019 | ❌ | ✅ | 3,200 radiology images with 12,792 QA pairs across 4 categories: Modality, Plane, Organ system, Abnormality. | Medical image analysis, radiology AI, education | Link |
MIMIC-NLE | ✅ | ❌ | 377K chest X-rays with structured labels derived from free-text radiology reports. | Image understanding, NLP for radiology, decision support | Link |
SLAKE | ❌ | ✅ | Bilingual Med-VQA dataset with semantic labels and a medical knowledge base, covering many body parts and modalities. | Annotations, diverse QA, knowledge-based AI | Link |
GEMeX | ✅ | ✅ | Largest chest X-ray VQA dataset with explainability annotations to enhance visual reasoning in healthcare. | Med-VQA, visual reasoning, explainable AI | Link |
MS-CXR | ✅ | ❌ | 1,162 image-sentence pairs with bounding boxes and phrases for 8 findings, supporting semantic modeling. | Radiology annotation, contrastive learning, semantic modeling | Link |
MedICaT | ✅ | ❌ | 217K figures from 131K medical papers with captions, subfigure tags, and inline references. | Captioning, multimodal learning, retrieval | Link |
3D-RAD | ❌ | ✅ | 3D Med-VQA dataset using 4,000+ CT scans and 12,000+ QA pairs, including anomaly and temporal tasks. | 3D VQA, multi-temporal diagnosis, 3D understanding | Link |
ImageCLEFmed-MEDVQA-GI | ✅ | ✅ | 10K+ endoscopy images with 30K+ (synthetic) QA pairs, focused on gastrointestinal diagnosis. | GI image analysis, synthetic data, endoscopy VQA | Link |
BIOMEDICA | ✅ | ❌ | Over 24M image-text pairs from 6M biomedical articles across various disciplines, for generalist VLMs. | Biomedical VLM pretraining, retrieval, generalist AI | Link |
RadGraph | ❌ | ❌ | Annotated chest X-ray reports with clinical entities and relations; structured knowledge from unstructured text. | Info extraction, knowledge graphs, NLP for radiology | Link |
PMC-OA | ✅ | ❌ | 1.6M image-caption pairs from PubMed Central OA articles; used in PMC-CLIP training. | Medical retrieval, classification, multimodal learning | Link |
ReasonMed | ❌ | ✅ | 370K VQA samples for complex reasoning, generated via multi-agent CoT for explainable answers. | Medical reasoning, clinical QA, explainable AI | Link |
Lingshu | ✅ | ✅ | Aggregates 9.3M samples from 60+ datasets for generalist Med-VLMs across QA, reporting, and consultation. | Multimodal QA, report generation, medical dialogue | Link |
GMAI-VL-5.5M | ✅ | ✅ | 5.5M medical image-text pairs merged from multiple datasets, for general AI and clinical decision tasks. | General medical AI, QA, diagnosis, multimodal systems | Link |
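Most of the QA-style medical datasets above are scored with simple answer matching for closed-ended questions, though the exact convention varies per benchmark. As a rough illustration under that assumption, the sketch below normalizes predicted and reference answers and reports exact-match accuracy; the sample answers are made up.

```python
import re

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and articles -- a common normalization step."""
    answer = answer.lower().strip()
    answer = re.sub(r"[^\w\s]", "", answer)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match_accuracy(predictions, references):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Made-up closed-ended Med-VQA style predictions and references.
preds = ["The X-ray", "pneumonia", "no"]
refs = ["x-ray", "Pneumonia.", "No"]
print(f"exact-match accuracy: {exact_match_accuracy(preds, refs):.2f}")
```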
Model | Task | Dataset | Metric | Score |
--- | --- | --- | --- | --- |
Unified VLP [287] | Image Captioning | COCO, Flickr30K | BLEU-4 / CIDEr | 36.5 / 116.9 (COCO), 30.1 / 67.4 (Flickr) |
VinVL [288] | Image Captioning | COCO | BLEU-4 / CIDEr | 40.9 / 140.9 |
SimVLM [289] | Image Captioning | COCO | BLEU-4 / CIDEr | 40.3 / 143.3 |
BLIP [35] | Image Captioning | COCO | BLEU-4 / CIDEr | 41.7 / 143.5 |
RegionCLIP [290] | Image Captioning | COCO | BLEU-4 / CIDEr | 40.5 / 139.2 |
BLIP-2 [236] | Image Captioning | COCO, NoCaps | BLEU-4 / CIDEr | 43.7 / 123.7 (COCO), – (NoCaps) |
FIBER [291] | Image Captioning | COCO | CIDEr | 42.8 |
NLIP [292] | Image Captioning | Flickr30K | CIDEr | 135.2 |
LCL [293] | Image Captioning | COCO | CIDEr | 87.5 |
Unified VLP [287] | VQA | VQA 2.0 | VQA Score | 70.3% |
VinVL [288] | VQA | VQA 2.0 | VQA Score | 76.6% |
FewVLM [123] | VQA | VQA 2.0 | VQA Score | 51.1% |
SimVLM [289] | VQA | VQA 2.0 | VQA Score | 24.1% |
BLIP [35] | VQA | VQA 2.0 | VQA Score | 77.5% |
BLIP-2 [236] | VQA | VQA 2.0 | VQA Score | 79.3% |
VILA [164] | VQA | VQA 2.0, GQA | VQA Score | 80.8% (VQA 2.0), 63.3% (GQA) |
LCL [293] | VQA | VQA 2.0 | VQA Score | 73.4% |
TCL [294] | Image Retrieval | COCO, Flickr30K | R@1 | 62.3% / 88.7% |
CLIP [5] | Image Retrieval | COCO, Flickr30K | R@1 | 58.4% / 88.0% |
NLIP [292] | Image Retrieval | COCO | R@1 | 82.6% |
Cross-Attn [295] | Image Retrieval | COCO, Flickr30K | R@1 | 67.8% / 88.9% |
DreamLIP [296] | Image Retrieval | COCO, Flickr30K | R@1 | 58.3% / 87.2% |
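The retrieval rows above report Recall@1 (R@1): the fraction of queries whose top-ranked match in the gallery is the correct pair. A minimal sketch of computing R@K from an image-text similarity matrix is shown below; the random embeddings stand in for real dual-encoder outputs.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for paired image/text embeddings from a dual encoder
# (row i of each matrix corresponds to the same image-caption pair).
n, d = 100, 64
image_emb = F.normalize(torch.randn(n, d), dim=-1)
text_emb = F.normalize(image_emb + 0.5 * torch.randn(n, d), dim=-1)   # noisy matches

def recall_at_k(query_emb, gallery_emb, k):
    sims = query_emb @ gallery_emb.t()                  # cosine similarity matrix
    topk = sims.topk(k, dim=-1).indices                 # (n, k) ranked gallery indices
    targets = torch.arange(query_emb.size(0)).unsqueeze(-1)
    return (topk == targets).any(dim=-1).float().mean().item()

# Text-to-image retrieval: each caption queries the image gallery.
for k in (1, 5, 10):
    print(f"R@{k}: {recall_at_k(text_emb, image_emb, k):.3f}")
```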
https://www.sciencedirect.com/science/article/pii/S1566253525006955
https://github.com/SufyanDanish/VLM-Survey-?tab=readme-ov-file