CV

Education

Peking University
B.S. in Information and Computing Science, School of EECS
Sep. 2023 – Present

GPA: 3.77 / 4.0
Relevant Courses:
Computer Vision, Multimodal Learning, Image Processing,
Visual Computing, Machine Learning, Algorithm Design and Analysis

Research & Internship Experience

Wangxuan Institute of Computer Technology, Peking University
Research Intern
Sep. 2024 – Present

Conduct research on computer vision and generative models
Collaborate on academic projects under the supervision of Zhouhui Lian

Course Projects

From LineArt to Yungang Grottoes: Multi-Structural Conditioned Image Generation
Nov. 2025 – Jan. 2026

Built a structure-guided image generation framework based on Stable Diffusion + ControlNet
Extended single-condition ControlNet to multi-condition control (sketch, depth, surface normals)
Improved geometric consistency and spatial hierarchy in stone-carving generation

PRGAN: GAN-based Reconstruction of Pottery Fragments
Dec. 2024 – Jan. 2025

Modeled pottery restoration as a 3D generative reconstruction task
Adopted voxel representation with surface normal priors for geometric consistency
Enabled automatic reconstruction from sparse archaeological fragments

CVToolkit: C++ Image and Video Processing System
May 2024 – Jun. 2024

Developed a C++ image & video editing toolkit
Implemented image enhancement, filters, GIF generation, and video editing
Integrated OpenCV and FFmpeg for video decoding, filtering, and playback control

Publications

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
CVPR 2026

Proposed GAR-Font, a global-aware autoregressive framework that models font generation beyond patch-level representations.
Designed a global-aware tokenizer (G-Tok) and a lightweight multimodal style encoder, enabling holistic font modeling and flexible text-guided control from few references.
Introduced a post-training refinement pipeline with LoRA-based NFA and GRPO-based SE to further improve global style faithfulness and structural fidelity.

MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
NeurIPS 2025 Datasets and Benchmarks Track

Introduced Knowledge Image Generation as a new evaluation task
Built a large-scale benchmark spanning 10 disciplines and 6 educational levels
Contributed to data collection, annotation protocol design, quality control, and baseline evaluation

Honors & Awards

Award for Academic Excellence, Peking University (2024)
BYD Scholarship, Peking University (2024)
Award for Scientific Research, Peking University (2025)
Tianchuang Scholarship, Peking University (2025)

Technical Skills

Programming: Python (PyTorch), C++, Java
Vision & ML: Computer Vision, Image Generation, Multimodal Learning, Post-Training
Languages: Mandarin (Native), English (Fluent, CET-6)

Haonan Cai