A systematic benchmark exploring three complementary paradigms for multimodal graph learning and showing that VLMs can serve as powerful backbones.
¹NYU Shanghai · ²New York University · ³Rice University · ⁴East China Normal University
GraphVLM unifies GNN-, LLM-, and VLM-based methods under a consistent evaluation protocol, positioning VLMs in three functional roles within the multimodal graph learning landscape.
Employs pre-trained VLMs to encode multimodal node features and fine-tunes them under graph learning paradigms. Investigates how structure–modality alignment enhances GNN-based methods.
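To illustrate the idea of combining encoded node features with graph structure, here is a minimal sketch of a single mean-aggregation step over a toy graph. It is not GraphVLM's implementation: the function name `mean_aggregate`, the feature values, and the edge list are hypothetical, and the small vectors merely stand in for embeddings that a pre-trained VLM would produce.

```python
from typing import Dict, List, Tuple


def mean_aggregate(
    features: Dict[int, List[float]], edges: List[Tuple[int, int]]
) -> Dict[int, List[float]]:
    """One propagation step: average each node's vector with its neighbors'."""
    # Start every neighborhood with a self-loop so a node keeps its own signal.
    neighbors = {node: {node} for node in features}
    for u, v in edges:  # edges are treated as undirected
        neighbors[u].add(v)
        neighbors[v].add(u)

    dim = len(next(iter(features.values())))
    return {
        node: [sum(features[n][d] for n in nbrs) / len(nbrs) for d in range(dim)]
        for node, nbrs in neighbors.items()
    }


# Hypothetical stand-ins for VLM-produced node embeddings.
features = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
edges = [(0, 1), (1, 2)]
smoothed = mean_aggregate(features, edges)
```

In a real pipeline a trained GNN layer would replace this fixed average, and the encoder itself could be fine-tuned rather than frozen.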
Examines both latent-space alignment (embedding injection) and prompt-level alignment (visual-to-text transformation) to extend LLM-based methods to multimodal graphs.
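As a rough illustration of latent-space alignment, the sketch below projects a visual embedding into a language model's token-embedding space and prepends it as one extra "soft token". Everything here is an assumption made for illustration: the helper names `project` and `inject`, the dimensions, and the hand-written projection matrix (which would be learned in practice) are not taken from GraphVLM.

```python
from typing import List

Vector = List[float]


def project(visual: Vector, weight: List[Vector]) -> Vector:
    """Linear map from visual space to token-embedding space (weight: out_dim x in_dim)."""
    return [sum(w * x for w, x in zip(row, visual)) for row in weight]


def inject(
    visual: Vector, weight: List[Vector], token_embeddings: List[Vector]
) -> List[Vector]:
    """Prepend the projected visual vector ahead of the text token embeddings."""
    return [project(visual, weight)] + token_embeddings


# Hypothetical toy values: a 3-d image embedding and a 2-d token space.
visual = [1.0, 2.0, 3.0]
weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # would be a learned projection
tokens = [[0.1, 0.2], [0.3, 0.4]]
sequence = inject(visual, weight, tokens)
```

Prompt-level alignment takes the opposite route: instead of injecting a vector, the image is first converted to text (for example, a caption) and placed directly in the prompt.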
Directly applies VLM-based methods as task-specific backbones, augmented with in-context learning or supervised fine-tuning using structural signals, delivering the strongest performance.
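One way to picture in-context learning with structural signals is to use a target node's labeled neighbors as few-shot demonstrations. The sketch below builds only the text part of such a prompt; the function `build_icl_prompt`, the prompt wording, and the toy product graph are hypothetical, and a real VLM call would also receive the node images.

```python
from typing import Dict, List, Optional


def build_icl_prompt(
    target: int,
    texts: Dict[int, str],
    labels: Dict[int, Optional[str]],
    adjacency: Dict[int, List[int]],
    max_demos: int = 2,
) -> str:
    """Use labeled neighbors of the target node as in-context demonstrations."""
    # Keep only neighbors with a known label, capped at max_demos.
    demos = [n for n in adjacency.get(target, []) if labels.get(n) is not None]
    demos = demos[:max_demos]

    parts = ["Classify the target node. Labeled neighbors are given as examples."]
    for n in demos:
        parts.append(f"Neighbor: {texts[n]}\nLabel: {labels[n]}")
    parts.append(f"Target: {texts[target]}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical toy product graph.
texts = {0: "Trail running sneakers", 1: "Leather hiking boots", 2: "Mystery item"}
labels = {0: None, 1: "shoes", 2: None}
adjacency = {0: [1, 2]}
prompt = build_icl_prompt(0, texts, labels, adjacency)
```

Supervised fine-tuning would use the same structural context as training input rather than as inference-time demonstrations.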
If GraphVLM aids your research, please cite our CVPR 2026 paper.
@misc{liu2026graphvlmbenchmarkingvisionlanguage,
  title         = {GraphVLM: Benchmarking Vision Language Models for Multimodal Graph Learning},
  author        = {Jiajin Liu and Dongzhe Fan and Chuanhao Ji and Daochen Zha and Qiaoyu Tan},
  year          = {2026},
  eprint        = {2603.13370},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2603.13370},
}