Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation

1 Wangxuan Institute of Computer Technology, Peking University
2 School of Electronics Engineering and Computer Science, Peking University

Accepted to CVPR 2026
*Indicates Equal Contribution, Indicates Corresponding Author

GAR-Font results under visual and multimodal few-shot settings. The poem reflects the key contributions of our model: global-aware tokenization for style fidelity, multimodal style encoding for text-image control, reduced reference requirements, and an autoregressive design that enables controllable high-quality font synthesis.

Abstract

Manual font design is an intricate process that transforms a stylistic visual concept into a coherent glyph set. This challenge persists in automated Few-shot Font Generation (FFG), where models often struggle to preserve both the structural integrity and stylistic fidelity from limited references. While autoregressive (AR) models have demonstrated impressive generative capabilities, their application to FFG is constrained by conventional patch-level tokenization, which neglects global dependencies crucial for coherent font synthesis. Moreover, existing FFG methods remain within the image-to-image paradigm, relying solely on visual references and overlooking the role of language in conveying stylistic intent during font design. To address these limitations, we propose GAR-Font, a novel AR framework for multimodal few-shot font generation. GAR-Font introduces a global-aware tokenizer that effectively captures both local structures and global stylistic patterns, a multimodal style encoder offering flexible style control through a lightweight language-style adapter without requiring intensive multimodal pretraining, and a post-refinement pipeline that further enhances structural fidelity and style coherence. Extensive experiments show that GAR-Font outperforms existing FFG methods, excelling in maintaining global style faithfulness and achieving higher-quality results with textual stylistic guidance.

Overview of GAR-Font

The overall architecture of GAR-Font. It comprises a global-aware tokenizer (G-Tok) and an autoregressive generator (AR Generator), equipped with a multimodal style encoder.

Global-aware Tokenizer (G-Tok)

Architecture of the Global-aware Tokenizer (G-Tok). A hybrid CNN–ViT encoder extracts local stroke details and global style context, followed by vector quantization and a hybrid decoder.

The Global-aware Tokenizer (G-Tok) discretizes glyph images into structured visual tokens while preserving both fine-grained stroke geometry and global stylistic fidelity. A hybrid CNN–ViT encoder first captures local structural details through convolutional layers and then aggregates long-range dependencies through an attention mechanism. The resulting latent representations are quantized against a learnable codebook, yielding compact yet expressive token sequences. A hybrid decoder reconstructs glyphs while modeling sequential dependencies, so that both structural fidelity and overall style faithfulness are maintained. This design provides a strong discrete representation for autoregressive font generation.
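The quantization step described above can be illustrated with a minimal nearest-neighbour codebook lookup. This is a sketch of generic vector quantization, not the G-Tok implementation: the function names, the 2-D toy codebook, and the use of squared Euclidean distance are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(
    latents: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
) -> Tuple[List[int], List[List[float]]]:
    """Map each latent vector to its nearest codebook entry.

    Returns the discrete token ids and the corresponding code vectors.
    (Hypothetical helper; the distance metric is an assumption.)
    """
    token_ids: List[int] = []
    quantized: List[List[float]] = []
    for z in latents:
        idx = min(
            range(len(codebook)),
            key=lambda k: squared_distance(z, codebook[k]),
        )
        token_ids.append(idx)
        quantized.append(list(codebook[idx]))
    return token_ids, quantized


# Toy data: a 4-entry codebook of 2-D codes and three encoder outputs.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]

token_ids, quantized = quantize(latents, codebook)
print(token_ids)  # [0, 3, 2]
```

The resulting list of integer ids is the kind of token sequence an autoregressive generator would be trained to predict.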

AR Generator with Multimodal Style Encoder

Visual Pretraining Stage

(a) Visual Pretraining: learning a stable content–style space from visual references.

Multimodal Adaptation Stage

(b) Vision–Language Adaptation: aligning textual guidance with visual style representations via a lightweight adapter.

Building upon the discrete representations provided by G-Tok, the autoregressive (AR) generator performs conditional sequential token prediction for glyph synthesis. In the first stage, the generator and the content–style aggregator are trained with purely visual conditioning to establish a stable and expressive style–content representation space. In the second stage, a lightweight vision–language adapter is introduced to extend the visual style encoder into a multimodal one. Text embeddings are aligned with visual style features through cross-attention, enabling flexible text-guided style modulation without disrupting the learned visual priors. This two-stage paradigm ensures robust visual generation while supporting controllable multimodal style specification.
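The cross-attention alignment in the second stage can be sketched as follows. The details here are assumptions rather than the paper's design: visual style features act as queries, text embeddings act as keys and values, learned projection matrices are omitted, and a scalar `gate` (a hypothetical parameter) controls how strongly the text modulates the style. With `gate = 0.0` the visual style features pass through unchanged, which is one simple way an adapter can avoid disturbing pretrained visual priors.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Single-head scaled dot-product cross-attention (no projections)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs: List[Vector] = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append(
            [
                sum(w * v[d] for w, v in zip(weights, values))
                for d in range(len(values[0]))
            ]
        )
    return outputs


def adapt_style(
    style_tokens: List[Vector], text_tokens: List[Vector], gate: float
) -> List[Vector]:
    """Residual, gated text-to-style modulation (hypothetical adapter)."""
    attended = cross_attention(style_tokens, text_tokens, text_tokens)
    return [
        [s + gate * a for s, a in zip(s_tok, a_tok)]
        for s_tok, a_tok in zip(style_tokens, attended)
    ]


# Toy inputs: two visual style tokens and two text-embedding tokens.
style_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

print(adapt_style(style_tokens, text_tokens, gate=0.0))  # unchanged style
print(adapt_style(style_tokens, text_tokens, gate=0.5))  # text-modulated style
```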

Post-refinement for GAR-Font

To further improve few-shot style generalization and structural accuracy, GAR-Font introduces a two-stage post-refinement strategy. The first stage, Novel Font Adaptation (NFA), performs lightweight adaptation on unseen font styles by updating only the LoRA layers of the Transformer generator. Using a loss that mixes token-level cross-entropy with pixel-level reconstruction, NFA enables stable few-shot specialization while preserving the pretrained generator’s prior knowledge. This significantly enhances style fidelity when only a few reference glyphs are available.

The second stage, Structural Enhancement (SE), further improves glyph readability and structural precision. The generator is treated as a policy that produces token sequences, and the decoded glyphs are scored with a composite reward that combines OCR confidence and style consistency. Group-relative advantage normalization is applied to stabilize training, and only the LoRA layers are updated, through advantage-weighted likelihood maximization with KL regularization. This reinforcement-based refinement reduces structural distortions while maintaining stylistic fidelity, yielding clearer and more accurate glyphs.
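The reward and advantage computation in the Structural Enhancement stage can be sketched as below. Several details are assumptions for illustration: the composite reward is taken as a convex combination with a hypothetical `style_weight`, advantages are normalized by the group mean and standard deviation, and the objective is written as a negative advantage-weighted log-likelihood plus a KL penalty scaled by a hypothetical `beta`. The page does not specify the actual weighting or objective form.

```python
import statistics
from typing import List


def composite_reward(
    ocr_confidence: float, style_consistency: float, style_weight: float = 0.5
) -> float:
    """Blend the two reward terms (the weighting scheme is an assumption)."""
    return (1.0 - style_weight) * ocr_confidence + style_weight * style_consistency


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def se_loss(
    advantages: List[float],
    seq_log_probs: List[float],
    kl_to_reference: float,
    beta: float = 0.1,
) -> float:
    """Negative advantage-weighted log-likelihood plus a KL penalty.

    Minimizing this loss maximizes the advantage-weighted likelihood.
    (Assumed form; `beta` is a hypothetical regularization weight.)
    """
    weighted = [a * lp for a, lp in zip(advantages, seq_log_probs)]
    return -statistics.fmean(weighted) + beta * kl_to_reference


# Hypothetical scores for four glyphs sampled for the same character and style.
ocr_scores = [0.9, 0.6, 0.3, 0.6]
style_scores = [0.7, 0.8, 0.5, 0.4]

rewards = [composite_reward(o, s) for o, s in zip(ocr_scores, style_scores)]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Samples that score above the group mean receive positive advantages and have their likelihood pushed up, while below-average samples are pushed down. Because advantages are relative to the group, no separate value network is needed.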

Quantitative Results

Vision-only Few-shot Font Generation Results

Vision-only FFG. GAR-Font consistently outperforms existing methods in structural accuracy and style fidelity on both the UFSC (unseen fonts, seen characters) and UFUC (unseen fonts, unseen characters) test sets.

Multimodal Few-shot Font Generation Results

Multimodal FFG. Incorporating textual style descriptions improves performance over vision-only baselines and reduces reliance on visual references.

Generations of GAR-Font

BibTeX

@misc{cai2026patchesglobalawareautoregressivemodel,
      title={Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation}, 
      author={Haonan Cai and Yuxuan Luo and Zhouhui Lian},
      year={2026},
      eprint={2601.01593},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.01593}, 
}