1. Spatial Reasoning
| Work | Description |
| --- | --- |
| SpatialVLM (chen2024spatialvlm) | Trains VLMs on a massive and diverse spatial reasoning dataset |
| Multimodal Visualization-of-Thought (MVoT) (li2025imagine) | Enables visual thinking in MLLMs by generating image visualizations of their reasoning traces |
| Thinking in Space (yang2025thinking) | Video-based visual-spatial intelligence benchmark (VSI-Bench) |
| LEGO-Puzzles (tang2025lego) | A scalable benchmark that evaluates both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks |
| STI-Bench (li2025sti) | Spatial-Temporal Intelligence benchmark |
| vsGRPO-7B (liao2025improved) | Trained with GRPO on the VSI-100k dataset |
| PyVision (zhao2025pyvision) | Enables MLLMs to autonomously generate, execute, and refine Python-based tools |
| VILASR (wu2025reinforcing) | VIsion-LAnguage model that achieves sophisticated Spatial Reasoning through interwoven thinking and visual drawing |
| Thinking with Generated Images (chern2025thinking) | Contrasts "seeing with images", "thinking with images", and "thinking with generated images" |
References
- Chen, B., Xu, Z., Kirmani, S., Ichter, B., Driess, D., Florence, P., Sadigh, D., Guibas, L. J., and Xia, F. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- Chern, E., Hu, Z., Chern, S., Kou, S., Su, J., Ma, Y., Deng, Z., and Liu, P. Thinking with Generated Images, 2025.
- Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., and Wei, F. Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. International Conference on Machine Learning (ICML), 2025.
- Li, Y., Zhang, Y., Lin, T., Liu, X., Cai, W., Liu, Z., and Zhao, B. STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?, 2025.
- Liao, Z., Xie, Q., Zhang, Y., Kong, Z., Lu, H., Yang, Z., and Deng, Z. Improved Visual-Spatial Reasoning via R1-Zero-Like Training, 2025.
- Tang, K., Gao, J., Zeng, Y., Duan, H., Sun, Y., Xing, Z., Liu, W., Lyu, K., and Chen, K. LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?, 2025.
- Wu, J., Guan, J., Feng, K., Liu, Q., Wu, S., Wang, L., Wu, W., and Tan, T. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing, 2025.
- Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., and Xie, S. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- Zhao, S., Zhang, H., Lin, S., Li, M., Wu, Q., Zhang, K., and Wei, C. PyVision: Agentic Vision with Dynamic Tooling, 2025.