CVPR 2026 · Main Conference

GraphVLM: Benchmarking
Vision Language Models
for Multimodal Graph Learning

A systematic benchmark exploring three complementary paradigms for multimodal graph learning and showing that VLMs can serve as powerful backbones.

Jiajin Liu¹,² · Dongzhe Fan¹,² · Chuanhao Ji¹,⁴ · Daochen Zha³ · Qiaoyu Tan¹*

¹NYU Shanghai · ²New York University · ³Rice University · ⁴East China Normal University

📄 arXiv Paper · GitHub Code · 🤗 Hugging Face

Abstract

Harnessing VLMs for Multimodal Graphs

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in aligning and understanding multimodal signals, yet their potential to reason over structured data, where multimodal entities are connected through explicit relational graphs, remains largely underexplored. Unlocking this capability is crucial for real-world applications such as social networks, recommendation systems, and scientific discovery, where multimodal information is inherently structured.

To bridge this gap, we present GraphVLM, a systematic benchmark designed to evaluate and harness the capabilities of VLMs for multimodal graph learning (MMGL). GraphVLM investigates three complementary paradigms for integrating VLMs with graph reasoning: (1) VLM-as-Encoder, which enriches graph neural networks through multimodal feature fusion; (2) VLM-as-Aligner, which bridges modalities in latent or linguistic space to facilitate LLM-based structured reasoning; and (3) VLM-as-Predictor, which directly employs VLMs as multimodal backbones for graph learning tasks.

Extensive experiments across six datasets from diverse domains demonstrate that VLMs enhance multimodal graph learning via all three roles. Among these paradigms, VLM-as-Predictor achieves the most substantial and consistent performance gains, revealing the untapped potential of vision–language models as a new foundation for multimodal graph learning.

Our Framework

Three Complementary VLM Paradigms

GraphVLM unifies GNN-, LLM-, and VLM-based methods under a consistent evaluation protocol, positioning VLMs in three functional roles within the multimodal graph learning landscape.

VLM-as-Encoder

Employs pre-trained VLMs to encode multimodal node features and fine-tunes them under graph learning paradigms. Investigates how structure–modality alignment enhances GNN-based methods.
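As a rough illustration of the encoder role, the sketch below fuses per-node text and image embeddings and runs one neighbor-averaging step. It is a minimal sketch under stated assumptions, not the benchmark's implementation: the embeddings are hand-written stand-ins for VLM outputs, concatenation is only one possible fusion choice, and `fuse_features` and `mean_aggregate` are hypothetical helper names.

```python
# Hypothetical illustration: the embeddings below are hand-written stand-ins
# for features that a pre-trained VLM would produce for each node.

def fuse_features(text_emb, image_emb):
    """Fuse modalities per node by concatenation (one simple choice among many)."""
    return [t + i for t, i in zip(text_emb, image_emb)]

def mean_aggregate(features, edges):
    """One parameter-free message-passing step: average each node with its neighbors."""
    n = len(features)
    neighbors = {v: {v} for v in range(n)}  # include a self-loop
    for u, v in edges:                      # treat edges as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)
    dim = len(features[0])
    out = []
    for v in range(n):
        group = [features[u] for u in sorted(neighbors[v])]
        out.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return out

text_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # assumed text features
image_emb = [[2.0], [4.0], [6.0]]                 # assumed image features
edges = [(0, 1), (1, 2)]                          # a three-node path graph

fused = fuse_features(text_emb, image_emb)
hidden = mean_aggregate(fused, edges)
print(hidden[0])
```

A trained GNN would replace the plain average with learned weights and nonlinearities; the sketch only shows where the fused multimodal features enter the graph computation.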

VLM-as-Aligner

Examines both latent-space alignment (embedding injection) and prompt-level alignment (visual-to-text transformation) to extend LLM-based methods to multimodal graphs.
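Prompt-level alignment can be pictured as turning each node's image into text so that a text-only LLM can consume it. The sketch below assumes the caption has already been produced by a VLM; the prompt template, the `build_prompt` helper, and the example product data are hypothetical and are not taken from the benchmark.

```python
def build_prompt(node_text, image_caption, class_names):
    """Assemble a text-only classification prompt for an LLM.

    image_caption stands in for the output of a VLM captioner (assumed here).
    """
    options = ", ".join(class_names)
    return (
        f"Node text: {node_text}\n"
        f"Image description: {image_caption}\n"
        f"Choose one category from: {options}\n"
        "Answer:"
    )

prompt = build_prompt(
    node_text="Lightweight trail running shoes",
    image_caption="A pair of blue sneakers on a rocky path",
    class_names=["Shoes", "Electronics", "Books"],
)
print(prompt)
```

Latent-space alignment would instead pass image embeddings into the LLM's input space, which requires model-specific projection layers and is not shown here.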

VLM-as-Predictor

Directly applies VLMs as task-specific backbones, augmented with in-context learning or supervised fine-tuning (SFT) using structural signals. This paradigm delivers the strongest and most consistent gains across the benchmark's six datasets.
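One plausible way to feed structural signals into in-context learning is to use labeled graph neighbors as demonstrations for the target node. The sketch below assumes that strategy for illustration only; the selection rule, the `select_demonstrations` and `format_context` helpers, and the toy data are hypothetical. A real VLM prompt would also attach each node's image, which is omitted here.

```python
def select_demonstrations(target, edges, labels, k=2):
    """Return up to k (neighbor_id, label) pairs from labeled neighbors of target."""
    neighbors = set()
    for u, v in edges:  # treat edges as undirected
        if u == target:
            neighbors.add(v)
        elif v == target:
            neighbors.add(u)
    labeled = [(n, labels[n]) for n in sorted(neighbors) if n in labels]
    return labeled[:k]

def format_context(demos, node_texts, target):
    """Render the demonstrations followed by the unanswered query node."""
    lines = [f"Example: {node_texts[n]} -> {label}" for n, label in demos]
    lines.append(f"Query: {node_texts[target]} ->")
    return "\n".join(lines)

edges = [(0, 1), (0, 2), (3, 0), (2, 4)]
labels = {1: "Shoes", 3: "Books", 4: "Shoes"}  # node 2 is unlabeled
node_texts = {
    0: "trail runners",
    1: "hiking boots",
    2: "wool socks",
    3: "trail guide",
    4: "sandals",
}

demos = select_demonstrations(0, edges, labels, k=2)
context = format_context(demos, node_texts, target=0)
print(context)
```

Sorting neighbors by id keeps the example deterministic; a practical selector might rank neighbors by similarity or degree instead.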

Comparison

The GraphVLM Advantage

Benchmark	GNN backbone	LLM backbone	VLM backbone	Encoder (pre-trained)	Encoder (fine-tuned)	Aligner (prompt)	Aligner (latent)	Predictor (zero-shot)	Predictor (SFT)
MM-Bench	✓	✗	✗	✓	✗	✗	✗	✗	✗
MAGB	✓	✗	✓	✓	✗	✗	✗	✓	✗
GraphVLM (Ours)	✓	✓	✓	✓	✓	✓	✓	✓	✓

Citation

Support Our Work

If GraphVLM aids your research, please cite our CVPR 2026 paper.

@misc{liu2026graphvlmbenchmarkingvisionlanguage,
  title         = {GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning},
  author        = {Jiajin Liu and Dongzhe Fan and Chuanhao Ji and Daochen Zha and Qiaoyu Tan},
  year          = {2026},
  eprint        = {2603.13370},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.13370},
}