AAAI 2026

TextGround4M: A Prompt-Aligned Dataset
for Layout-Aware Text Rendering

Dongxing Mao1  ·  Yilin Wang2  ·  Linjie Li3  ·  Zhengyuan Yang3  ·  Alex Jinpeng Wang1*

1Central South University    2Zhejiang University    3Microsoft Research    *Corresponding author

4.1M training pairs · 11.7M raw pairs collected · 1K benchmark samples · 7 semantic scenarios · 3 difficulty levels · 6 evaluation metrics

Abstract

Despite recent advances in text-to-image (T2I) generation, models still struggle to accurately render prompt-specified text with correct spatial layout—especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality.

To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes, enabling fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct TextGround-Bench, a benchmark with stratified layout complexity, and introduce two new layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering.
TextGround4M vs existing datasets

Comparison with existing datasets. Prior datasets annotate all visible text without prompt grounding or layout structure. TextGround4M provides prompt-aligned, span-level annotations with precise spatial bounding boxes, enabling faithful and layout-aware text rendering.

Method

📝
Prompt + Image
Each training sample pairs a natural language prompt with an image whose prompt-grounded text spans are annotated with bounding boxes.
🔤
Append Layout Tokens
Span text tokens and bounding box coordinates are appended after the image tokens as autoregressive targets during training (sketched below the pipeline figure).
🖼️
Inference: Prompt Only
Span tokens omitted at test time. Standard one-pass decoding — no layout inputs, no architectural changes.
Method pipeline

Training and inference pipeline. During training, prompt-grounded span tokens and bounding box tokens are appended after visual tokens to provide layout-aware supervision. At inference time, only image tokens are generated — no layout annotation required.
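To make the supervision format concrete, the following is a minimal sketch of how a training sequence could be assembled, with layout-aware span and box tokens appended after the image tokens. The tokenizer interface, special tokens, and coordinate quantization are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch of the layout-aware training sequence (hypothetical
# tokenizer interface and special tokens; not the released implementation).

def quantize_box(box, num_bins=1000):
    """Map a normalized (x1, y1, x2, y2) box to discrete coordinate bins."""
    return [min(int(v * num_bins), num_bins - 1) for v in box]

def coord_token_id(bin_index, coord_offset=50000):
    """Hypothetical mapping of a coordinate bin to a reserved token id."""
    return coord_offset + bin_index

def build_training_sequence(prompt_ids, image_token_ids, spans, tokenizer):
    """Append span text and box tokens after the image tokens.

    spans: list of (span_text, normalized_box) pairs grounded in the prompt.
    The appended suffix is used only as a training target; at inference the
    model decodes image tokens as usual, so there is no extra cost.
    """
    sequence = list(prompt_ids) + list(image_token_ids)
    for text, box in spans:
        sequence += tokenizer.encode("<span>")
        sequence += tokenizer.encode(text)
        sequence += [coord_token_id(b) for b in quantize_box(box)]
        sequence += tokenizer.encode("</span>")
    return sequence
```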

Key Properties
✓ No architecture changes
Plug-and-play for any autoregressive T2I backbone.
✓ Zero inference overhead
Layout tokens are training-time only — standard decoding at test time.
✓ Free-form input
No explicit layout hints needed — the model internalizes prompt-to-layout alignment.

TextGround4M Dataset

📦 Scale & Sources
4.1M samples from 10 public datasets (MARIO-10M, AnyWord-3M, TextScenesHQ, etc.) and CommonCrawl image mining via a GPT-4o-driven hierarchical query pipeline (7 scenarios → 90 topics → 2,900+ subtopics).
🔍 Annotation Pipeline
Qwen2.5-VL generates fine-grained captions with quoted text spans. PaddleOCR extracts word-level boxes. A multi-stage alignment module associates caption spans to OCR boxes via exact match, partial overlap, and fuzzy matching (see the sketch following the pipeline figure below).
🧹 3-Stage Filtering
(1) Image/bbox heuristics (resolution, aspect ratio, OCR confidence). (2) Trivial case pruning (single-bbox downsampling). (3) VLM semantic verification via Qwen2.5-VL.
Data pipeline

Dataset construction pipeline. 11.7M raw pairs from public datasets and CommonCrawl image mining are processed through VLM captioning, OCR extraction, span alignment, and multi-stage filtering to produce 4.1M high-quality samples.
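The span-to-box association step can be pictured with a small amount of code. The sketch below follows the exact-match, partial-overlap, then fuzzy-match cascade described above; the thresholds, normalization, and matching granularity are assumptions, not the released pipeline.

```python
# Sketch of caption-span-to-OCR-box alignment (thresholds and matching
# granularity are assumptions; the released pipeline may differ).
from difflib import SequenceMatcher

def normalize(text):
    return " ".join(text.lower().split())

def align_spans_to_boxes(caption_spans, ocr_results, fuzzy_threshold=0.8):
    """Associate each quoted caption span with OCR word boxes.

    caption_spans: quoted text spans extracted from the VLM caption.
    ocr_results:   list of (word, bbox) pairs from the OCR engine.
    Returns (span, [bbox, ...]) pairs; spans with no match are dropped.
    """
    matches = []
    for span in caption_spans:
        target = normalize(span)
        # 1) exact match against a single OCR word or line
        boxes = [b for w, b in ocr_results if normalize(w) == target]
        # 2) partial overlap: OCR words contained in the span
        if not boxes:
            boxes = [b for w, b in ocr_results
                     if normalize(w) and normalize(w) in target]
        # 3) fuzzy matching as a last resort
        if not boxes:
            boxes = [b for w, b in ocr_results
                     if SequenceMatcher(None, normalize(w), target).ratio() >= fuzzy_threshold]
        if boxes:
            matches.append((span, boxes))
    return matches
```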

Scenario distribution

Semantic scenario distribution of the image-mining subset across 7 major categories.

Semantic scenarios covered:

Signage · Product Packaging · Posters & Banners · Digital Advertising · Book Covers · Educational Content · Social Media

TextGround-Bench

1,000 samples stratified by the number of prompt-grounded bounding boxes and the maximum token length per span; a sketch of the assignment rule follows the difficulty tiers below.

Easy (400 samples): ≤ 2 boxes and ≤ 4 tokens per span

Medium (300 samples): ≥ 3 boxes or ≥ 5 tokens per span

Hard (300 samples): ≥ 3 boxes and ≥ 5 tokens per span
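As a reference, one way to implement this stratification is sketched below; the assignment order (Hard checked before Medium, since its criteria are stricter) is an assumption based on the tier definitions.

```python
# Sketch of the difficulty assignment rule; checking Hard before Medium is an
# assumption inferred from the tier definitions above.

def difficulty(num_boxes: int, max_tokens_per_span: int) -> str:
    if num_boxes >= 3 and max_tokens_per_span >= 5:
        return "hard"
    if num_boxes >= 3 or max_tokens_per_span >= 5:
        return "medium"
    return "easy"  # <= 2 boxes and <= 4 tokens per span

assert difficulty(2, 4) == "easy"
assert difficulty(4, 3) == "medium"
assert difficulty(3, 6) == "hard"
```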

📊 CLIP Score
Global semantic alignment between generated image and prompt (CLIP-ViT-B/32).
🔤 Acc / F1 / CER
OCR-based word-level accuracy, F1, and character error rate against prompt-grounded spans.
📐 Layout IoU (new)
Average IoU between OCR-detected boxes and ground-truth reference boxes.
✅ Prompt Coverage (new)
Fraction of prompt-specified text spans successfully rendered in the generated image (both new metrics are sketched below).
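A minimal sketch of the two layout-aware metrics is given below. How predicted and reference boxes are paired, and how leniently span text is matched against OCR output, are assumptions; the benchmark's exact protocol may differ.

```python
# Sketch of Layout IoU and Prompt Coverage (box pairing and text matching
# rules are assumptions; the benchmark's exact protocol may differ).

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def layout_iou(ocr_boxes, ref_boxes):
    """Average best-match IoU between OCR-detected boxes and reference boxes."""
    if not ref_boxes:
        return 0.0
    best = [max((iou(r, p) for p in ocr_boxes), default=0.0) for r in ref_boxes]
    return sum(best) / len(ref_boxes)

def prompt_coverage(prompt_spans, ocr_words):
    """Fraction of prompt-specified spans found in the OCR'd image text."""
    if not prompt_spans:
        return 0.0
    ocr_text = " ".join(w.lower() for w in ocr_words)
    hits = sum(1 for span in prompt_spans if span.lower() in ocr_text)
    return hits / len(prompt_spans)
```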

Results

Easy

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 29.5 | 30.7 | 34.2 | 88.9  | 4.1  | 26.8 |
| Open   | PixArt-Σ                 | 27.7 | 1.1  | 0.6  | 82.7  | 0.0  | 0.4  |
| Open   | Janus Pro 7B             | 30.3 | 34.9 | 33.8 | 83.4  | 4.8  | 30.4 |
| Open   | SD 3.5 Large             | 31.4 | 80.6 | 49.7 | 74.6  | 8.6  | 72.2 |
| Open   | FLUX.1-dev               | 26.9 | 79.4 | 44.0 | 75.7  | 9.2  | 82.0 |
| Closed | DALL·E 3                 | 26.5 | 58.8 | 38.1 | 88.4  | 6.5  | 49.1 |
| Closed | GPT-4o                   | 26.7 | 84.5 | 50.3 | 73.5  | 18.2 | 84.9 |
| Ours   | Janus Pro 1B (zero-shot) | 28.7 | 10.9 | 10.0 | 89.9  | 0.8  | 7.2  |
| Ours   | + Vanilla FT †           | 27.9 | 20.3 | 20.1 | 87.5  | 3.8  | 15.0 |
| Ours   | + Text Only †            | 27.5 | 20.5 | 20.3 | 86.0  | 3.5  | 16.4 |
| Ours   | + BBox Only †            | 28.9 | 34.3 | 33.8 | 83.8  | 8.0  | 30.6 |
| Ours   | + Pre-Image †            | 29.4 | 38.6 | 37.3 | 82.6  | 10.4 | 34.4 |
| Ours   | Text + BBox (Ours) †     | 29.0 | 34.1 | 33.3 | 83.0  | 8.0  | 30.8 |

Medium

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 27.4 | 15.7 | 22.4 | 90.6  | 0.7  | 8.2  |
| Open   | PixArt-Σ                 | 25.7 | 0.9  | 0.7  | 82.5  | 0.0  | 0.2  |
| Open   | Janus Pro 7B             | 28.3 | 14.9 | 17.6 | 82.5  | 1.6  | 12.6 |
| Open   | SD 3.5 Large             | 29.0 | 76.8 | 58.4 | 64.0  | 6.3  | 65.5 |
| Open   | FLUX.1-dev               | 29.0 | 70.2 | 64.2 | 73.1  | 9.2  | 60.3 |
| Closed | DALL·E 3                 | 28.6 | 54.5 | 40.5 | 88.4  | 4.6  | 46.4 |
| Closed | GPT-4o                   | 29.3 | 86.6 | 79.1 | 68.5  | 15.2 | 82.6 |
| Ours   | Janus Pro 1B (zero-shot) | 26.8 | 4.5  | 5.5  | 88.7  | 0.3  | 3.3  |
| Ours   | + Vanilla FT †           | 26.5 | 10.3 | 13.4 | 86.5  | 1.4  | 7.6  |
| Ours   | + Text Only †            | 26.4 | 11.8 | 15.2 | 85.0  | 1.8  | 7.9  |
| Ours   | + BBox Only †            | 27.0 | 18.0 | 23.0 | 82.8  | 3.7  | 17.0 |
| Ours   | + Pre-Image †            | 27.3 | 20.3 | 26.1 | 82.7  | 5.0  | 19.6 |
| Ours   | Text + BBox (Ours) †     | 27.5 | 21.9 | 27.8 | 81.2  | 4.3  | 19.1 |

Hard

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 20.1 | 4.4  | 7.1  | 94.9  | 0.2  | 1.2  |
| Open   | PixArt-Σ                 | 19.1 | 0.8  | 0.7  | 81.9  | 0.0  | 0.0  |
| Open   | Janus Pro 7B             | 19.2 | 4.9  | 7.2  | 86.9  | 0.2  | 6.0  |
| Open   | SD 3.5 Large             | 17.3 | 58.7 | 45.1 | 68.6  | 1.9  | 38.4 |
| Open   | FLUX.1-dev               | 19.3 | 51.9 | 37.5 | 76.0  | 2.8  | 60.8 |
| Closed | DALL·E 3                 | 19.7 | 48.1 | 37.0 | 81.5  | 1.4  | 42.4 |
| Closed | GPT-4o                   | 19.2 | 77.9 | 65.4 | 64.7  | 8.4  | 83.5 |
| Ours   | Janus Pro 1B (zero-shot) | 19.8 | 1.5  | 2.2  | 90.0  | 0.0  | 1.1  |
| Ours   | + Vanilla FT †           | 19.9 | 2.6  | 4.2  | 91.2  | 0.4  | 2.5  |
| Ours   | + Text Only †            | 20.2 | 2.5  | 4.2  | 91.1  | 0.4  | 2.9  |
| Ours   | + BBox Only †            | 19.9 | 4.4  | 7.1  | 90.5  | 0.6  | 5.3  |
| Ours   | + Pre-Image †            | 19.7 | 4.7  | 7.4  | 89.8  | 0.4  | 6.3  |
| Ours   | Text + BBox (Ours) †     | 19.7 | 4.5  | 7.3  | 90.6  | 0.8  | 6.1  |

† Fine-tuned on TextGround4M at 512×512.   CS = CLIP Score, PC = Prompt Coverage.   CER ↓: lower is better; all other metrics: higher is better.

Qualitative Analysis

Model Comparison on TextGround-Bench

Model comparison on TextGround-Bench

Qualitative comparison of open-source and proprietary models on TextGround-Bench. Our method produces more accurate and spatially consistent text rendering compared to strong baselines.

Before vs. After Fine-tuning

Before vs after fine-tuning

Fine-tuning on TextGround4M yields significant improvements in layout consistency and textual accuracy.

Text Only vs. BBox Only vs. Text + BBox

Supervision strategy comparison

BBox-only can hallucinate text; Text-only misses layout. Joint supervision achieves accurate, well-positioned rendering.

BibTeX

@article{mao2026textground4m,
  title        = {TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering},
  author       = {Mao, Dongxing and Wang, Yilin and Li, Linjie and
                  Yang, Zhengyuan and Wang, Alex Jinpeng},
  journal      = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume       = {40},
  number       = {10},
  pages        = {7918--7926},
  year         = {2026},
  month        = {Mar.},
  doi          = {10.1609/aaai.v40i10.37736},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/37736}
}