MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
Yuxuan Luo, Yuhui Yuan, Junwen Chen, Haonan Cai, Ziyi Yue, Yuwei Yang, Fatima Zohra Daha, Ji Li, Zhouhui Lian
Paper: arXiv:2506.10963
Project: MMMG
Overview
We introduce knowledge image generation as a new task to systematically evaluate the reasoning capability of modern text-to-image generation models. Knowledge images—such as diagrams, charts, and mind maps—play a fundamental role in human learning, as supported by dual-coding theory and the picture-superiority effect. Generating such images requires models to integrate world knowledge, logical reasoning, and pixel-level grounding into clear and explanatory visuals, posing challenges beyond conventional image synthesis.
To support comprehensive and controlled evaluation, we present MMMG, a Massive, Multi-Discipline, Multi-Tier Knowledge-Image Generation Benchmark.
Benchmark Design
MMMG consists of 4,456 expert-validated knowledge image–prompt pairs, covering:
- 10 disciplines (e.g., science, humanities, engineering)
- 6 educational levels, from elementary to professional
- Diverse knowledge formats, including charts, diagrams, and mind maps
To eliminate confounding visual complexity during evaluation, we adopt a unified Knowledge Graph (KG) representation, where each KG explicitly defines the target image’s core entities and their relational dependencies.
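As a rough illustration of what such a representation might look like in code, the sketch below stores a KG as a set of entities plus a set of directed, labeled relations. The class names, the field names, and the water-cycle example are hypothetical and are not taken from the MMMG release.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A directed, labeled dependency between two entities (hypothetical schema)."""
    head: str
    label: str
    tail: str


@dataclass
class KnowledgeGraph:
    """Core entities of a target image and the relations between them."""
    entities: set[str] = field(default_factory=set)
    relations: set[Relation] = field(default_factory=set)

    def add_relation(self, head: str, label: str, tail: str) -> None:
        # Registering a relation also registers both of its endpoints.
        self.entities.update((head, tail))
        self.relations.add(Relation(head, label, tail))

    def neighbors(self, entity: str) -> set[str]:
        # Entities reachable from `entity` through one outgoing relation.
        return {r.tail for r in self.relations if r.head == entity}


# Illustrative example only: a tiny water-cycle diagram.
water_cycle = KnowledgeGraph()
water_cycle.add_relation("evaporation", "leads to", "condensation")
water_cycle.add_relation("condensation", "leads to", "precipitation")
```

Keeping entities and relations as sets makes comparison between a target KG and a KG extracted from a generated image a matter of set operations, independent of how the image is laid out.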
Evaluation Protocol
We propose MMMG-Score, a dedicated metric for knowledge image generation that jointly measures:
- Factual fidelity, computed via graph edit distance (GED) between predicted and ground-truth KGs
- Visual clarity, assessing the readability and organization of generated images
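To make the two components concrete, here is a minimal sketch under strong simplifying assumptions. It treats entities with identical names as already matched, so the edit distance reduces to counting node and edge insertions and deletions; it normalizes by the total size of both graphs; and it combines fidelity with a clarity value in [0, 1] by multiplication. The function names, the normalization, and the product rule are assumptions made for illustration and should not be read as the official MMMG-Score definition.

```python
def label_matched_edit_distance(pred_nodes, pred_edges, gt_nodes, gt_edges):
    """Unit-cost insertions/deletions needed to turn the predicted KG into the
    ground-truth KG, assuming nodes are matched by exact name (a simplification
    of general graph edit distance)."""
    node_edits = len(pred_nodes ^ gt_nodes)   # symmetric difference of entities
    edge_edits = len(pred_edges ^ gt_edges)   # symmetric difference of relations
    return node_edits + edge_edits


def factual_fidelity(pred_nodes, pred_edges, gt_nodes, gt_edges):
    """1 minus the edit distance normalized to [0, 1] (assumed normalization)."""
    total = len(pred_nodes) + len(pred_edges) + len(gt_nodes) + len(gt_edges)
    if total == 0:
        return 1.0  # two empty graphs are identical
    distance = label_matched_edit_distance(pred_nodes, pred_edges, gt_nodes, gt_edges)
    return 1.0 - distance / total


def combined_score(fidelity, clarity):
    """Assumed combination rule: a cluttered image discounts its fidelity."""
    return fidelity * clarity


# Illustrative data: the generated image omits the moon and its orbit.
gt_nodes = {"sun", "earth", "moon"}
gt_edges = {("earth", "orbits", "sun"), ("moon", "orbits", "earth")}
pred_nodes = {"sun", "earth"}
pred_edges = {("earth", "orbits", "sun")}

distance = label_matched_edit_distance(pred_nodes, pred_edges, gt_nodes, gt_edges)
fidelity = factual_fidelity(pred_nodes, pred_edges, gt_nodes, gt_edges)
score = combined_score(fidelity, clarity=0.5)
```

In this example one entity and one relation are missing, so the distance is 2 out of a total graph size of 8, giving a fidelity of 0.75; with an assumed clarity of 0.5 the combined value is 0.375. A full graph edit distance would also have to search over node correspondences, which this sketch sidesteps by matching on names.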
Results
We evaluate 16 state-of-the-art text-to-image models and find substantial reasoning deficiencies, including low entity fidelity, weak relational modeling, and visual clutter. Notably, GPT-4o achieves an MMMG-Score of only 50.20, highlighting the benchmark’s difficulty.
To foster future research, we also release FLUX-Reason, an open and effective baseline that integrates a reasoning LLM with diffusion models. It is trained on 16,000 curated knowledge image–prompt pairs and achieves an MMMG-Score of 34.45.
Keywords
Text-to-Image Reasoning · Knowledge Image Generation · Multimodal Evaluation · Benchmark · Knowledge Graphs