Spatial Reasoning

1. Spatial Reasoning

SpatialVLM (chen2024spatialvlm) Trains VLMs on a massive and diverse spatial reasoning dataset
Multimodal Visualization-of-Thought (MVoT) (li2025imagine) Enables visual thinking in MLLMs by generating image visualizations of their reasoning traces
Thinking in Space (yang2025thinking) Video-based visual-spatial intelligence benchmark (VSI-Bench)
LEGO-Puzzles (tang2025lego) Scalable benchmark that evaluates both spatial understanding and sequential reasoning in MLLMs through LEGO-based tasks
STI-Bench (li2025sti) Spatial-Temporal Intelligence benchmark
vsGRPO-7B (liao2025improved) Trained with GRPO on the VSI-100k dataset
PyVision (zhao2025pyvision) Enables MLLMs to autonomously generate, execute, and refine Python-based tools
VILASR (wu2025reinforcing) VIsion-LAnguage model that performs Spatial Reasoning through interwoven thinking and visual drawing
Thinking with Generated Images (chern2025thinking) “seeing with images” vs. “thinking with images” vs. “thinking with generated images”

References

  1. Chen, B., Xu, Z., Kirmani, S., Ichter, B., Driess, D., Florence, P., Sadigh, D., Guibas, L. J., and Xia, F. SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  2. Chern, E., Hu, Z., Chern, S., Kou, S., Su, J., Ma, Y., Deng, Z., and Liu, P. Thinking with Generated Images, 2025.
  3. Li, C., Wu, W., Zhang, H., Xia, Y., Mao, S., Dong, L., Vulić, I., and Wei, F. Imagine while Reasoning in Space: Multimodal Visualization-of-Thought. International Conference on Machine Learning (ICML), 2025.
  4. Li, Y., Zhang, Y., Lin, T., Liu, X., Cai, W., Liu, Z., and Zhao, B. STI-Bench: Are MLLMs Ready for Precise Spatial-Temporal World Understanding?, 2025.
  5. Liao, Z., Xie, Q., Zhang, Y., Kong, Z., Lu, H., Yang, Z., and Deng, Z. Improved Visual-Spatial Reasoning via R1-Zero-Like Training, 2025.
  6. Tang, K., Gao, J., Zeng, Y., Duan, H., Sun, Y., Xing, Z., Liu, W., Lyu, K., and Chen, K. LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning?, 2025.
  7. Wu, J., Guan, J., Feng, K., Liu, Q., Wu, S., Wang, L., Wu, W., and Tan, T. Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing, 2025.
  8. Yang, J., Yang, S., Gupta, A. W., Han, R., Fei-Fei, L., and Xie, S. Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
  9. Zhao, S., Zhang, H., Lin, S., Li, M., Wu, Q., Zhang, K., and Wei, C. PyVision: Agentic Vision with Dynamic Tooling, 2025.