AAAI 2026

TextGround4M: A Prompt-Aligned Dataset
for Layout-Aware Text Rendering

Dongxing Mao1  ·  Yilin Wang2  ·  Linjie Li3  ·  Zhengyuan Yang3  ·  Alex Jinpeng Wang1*

1Central South University    2Zhejiang University    3Microsoft Research    *Corresponding author

4.1M training pairs · 11.7M raw pairs collected · 1K benchmark samples · 7 semantic scenarios · 3 difficulty levels · 6 evaluation metrics

Abstract

Despite recent advances in text-to-image (T2I) generation, models still struggle to accurately render prompt-specified text with correct spatial layout—especially in multi-span, structured settings. This challenge is driven not only by the lack of datasets that align prompts with the exact text and layout expected in the image, but also by the absence of effective metrics for evaluating layout quality.

To address these issues, we introduce TextGround4M, a large-scale dataset of over 4 million prompt-image pairs, each annotated with span-level text grounded in the prompt and corresponding bounding boxes, enabling fine-grained supervision for layout-aware, prompt-grounded text rendering. Building on this, we propose a lightweight training strategy for autoregressive T2I models that appends layout-aware span tokens during training, without altering model architecture or inference behavior. We further construct TextGround-Bench, a benchmark with stratified layout complexity, and introduce two new layout-aware metrics to address the long-standing lack of spatial evaluation in text rendering.
TextGround4M vs existing datasets

Comparison with existing datasets. Prior datasets annotate all visible text without prompt grounding or layout structure. TextGround4M provides prompt-aligned, span-level annotations with precise spatial bounding boxes, enabling faithful and layout-aware text rendering.

Method

📝
Prompt + Image
Each training sample pairs a natural language prompt with an image whose prompt-grounded text spans are annotated with bounding boxes.
🔤
Append Layout Tokens
Span text tokens and bounding box coordinates are appended after the image tokens as autoregressive targets during training (sketched below the pipeline figure).
🖼️
Inference: Prompt Only
Span tokens omitted at test time. Standard one-pass decoding — no layout inputs, no architectural changes.
Method pipeline

Training and inference pipeline. During training, prompt-grounded span tokens and bounding box tokens are appended after visual tokens to provide layout-aware supervision. At inference time, only image tokens are generated — no layout annotation required.
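To make the supervision format concrete, the following is a minimal sketch of how a training sequence could be assembled, with layout-aware span and box tokens appended after the image tokens. The tokenizer interface, special tokens, and coordinate quantization are illustrative assumptions rather than the released implementation.

```python
# Illustrative sketch of the layout-aware training sequence (hypothetical
# tokenizer interface and special tokens; not the released implementation).

def quantize_box(box, num_bins=1000):
    """Map a normalized (x1, y1, x2, y2) box to discrete coordinate bins."""
    return [min(int(v * num_bins), num_bins - 1) for v in box]

def coord_token_id(bin_index, coord_offset=50000):
    """Hypothetical mapping of a coordinate bin to a reserved token id."""
    return coord_offset + bin_index

def build_training_sequence(prompt_ids, image_token_ids, spans, tokenizer):
    """Append span text and box tokens after the image tokens.

    spans: list of (span_text, normalized_box) pairs grounded in the prompt.
    The appended suffix is used only as a training target; at inference the
    model decodes image tokens as usual, so there is no extra cost.
    """
    sequence = list(prompt_ids) + list(image_token_ids)
    for text, box in spans:
        sequence += tokenizer.encode("<span>")
        sequence += tokenizer.encode(text)
        sequence += [coord_token_id(b) for b in quantize_box(box)]
        sequence += tokenizer.encode("</span>")
    return sequence
```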

Key Properties
✓ No architecture changes
Plug-and-play for any autoregressive T2I backbone.
✓ Zero inference overhead
Layout tokens are training-time only — standard decoding at test time.
✓ Free-form input
No explicit layout hints needed — the model internalizes prompt-to-layout alignment.

TextGround4M Dataset

📦 Scale & Sources
4.1M samples from 10 public datasets (MARIO-10M, AnyWord-3M, TextScenesHQ, etc.) and CommonCrawl image mining via a GPT-4o-driven hierarchical query pipeline (7 scenarios → 90 topics → 2,900+ subtopics).
🔍 Annotation Pipeline
Qwen2.5-VL generates fine-grained captions with quoted text spans. PaddleOCR extracts word-level boxes. A multi-stage alignment module associates caption spans to OCR boxes via exact match, partial overlap, and fuzzy matching (see the sketch following the pipeline figure below).
🧹 3-Stage Filtering
(1) Image/bbox heuristics (resolution, aspect ratio, OCR confidence). (2) Trivial case pruning (single-bbox downsampling). (3) VLM semantic verification via Qwen2.5-VL.
Data pipeline

Dataset construction pipeline. 11.7M raw pairs from public datasets and CommonCrawl image mining are processed through VLM captioning, OCR extraction, span alignment, and multi-stage filtering to produce 4.1M high-quality samples.
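The span-to-box association step can be pictured with a small amount of code. The sketch below follows the exact-match, partial-overlap, then fuzzy-match cascade described above; the thresholds, normalization, and matching granularity are assumptions, not the released pipeline.

```python
# Sketch of caption-span-to-OCR-box alignment (thresholds and matching
# granularity are assumptions; the released pipeline may differ).
from difflib import SequenceMatcher

def normalize(text):
    return " ".join(text.lower().split())

def align_spans_to_boxes(caption_spans, ocr_results, fuzzy_threshold=0.8):
    """Associate each quoted caption span with OCR word boxes.

    caption_spans: quoted text spans extracted from the VLM caption.
    ocr_results:   list of (word, bbox) pairs from the OCR engine.
    Returns (span, [bbox, ...]) pairs; spans with no match are dropped.
    """
    matches = []
    for span in caption_spans:
        target = normalize(span)
        # 1) exact match against a single OCR word or line
        boxes = [b for w, b in ocr_results if normalize(w) == target]
        # 2) partial overlap: OCR words contained in the span
        if not boxes:
            boxes = [b for w, b in ocr_results
                     if normalize(w) and normalize(w) in target]
        # 3) fuzzy matching as a last resort
        if not boxes:
            boxes = [b for w, b in ocr_results
                     if SequenceMatcher(None, normalize(w), target).ratio() >= fuzzy_threshold]
        if boxes:
            matches.append((span, boxes))
    return matches
```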

Scenario distribution

Semantic scenario distribution of the image-mining subset across 7 major categories.

Semantic scenarios covered:

Signage · Product Packaging · Posters & Banners · Digital Advertising · Book Covers · Educational Content · Social Media

TextGround-Bench

1,000 samples stratified by the number of prompt-grounded bounding boxes and the maximum token length per span; a sketch of the assignment rule follows the difficulty tiers below.

Easy (400 samples): ≤ 2 boxes and ≤ 4 tokens per span

Medium (300 samples): ≥ 3 boxes or ≥ 5 tokens per span

Hard (300 samples): ≥ 3 boxes and ≥ 5 tokens per span
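As a reference, one way to implement this stratification is sketched below; the assignment order (Hard checked before Medium, since its criteria are stricter) is an assumption based on the tier definitions.

```python
# Sketch of the difficulty assignment rule; checking Hard before Medium is an
# assumption inferred from the tier definitions above.

def difficulty(num_boxes: int, max_tokens_per_span: int) -> str:
    if num_boxes >= 3 and max_tokens_per_span >= 5:
        return "hard"
    if num_boxes >= 3 or max_tokens_per_span >= 5:
        return "medium"
    return "easy"  # <= 2 boxes and <= 4 tokens per span

assert difficulty(2, 4) == "easy"
assert difficulty(4, 3) == "medium"
assert difficulty(3, 6) == "hard"
```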

📊 CLIP Score
Global semantic alignment between generated image and prompt (CLIP-ViT-B/32).
🔤 Acc / F1 / CER
OCR-based word-level accuracy, F1, and character error rate against prompt-grounded spans.
📐 Layout IoU (new)
Average IoU between OCR-detected boxes and ground-truth reference boxes.
✅ Prompt Coverage (new)
Fraction of prompt-specified text spans successfully rendered in the generated image (both new metrics are sketched below).
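A minimal sketch of the two layout-aware metrics is given below. How predicted and reference boxes are paired, and how leniently span text is matched against OCR output, are assumptions; the benchmark's exact protocol may differ.

```python
# Sketch of Layout IoU and Prompt Coverage (box pairing and text matching
# rules are assumptions; the benchmark's exact protocol may differ).

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def layout_iou(ocr_boxes, ref_boxes):
    """Average best-match IoU between OCR-detected boxes and reference boxes."""
    if not ref_boxes:
        return 0.0
    best = [max((iou(r, p) for p in ocr_boxes), default=0.0) for r in ref_boxes]
    return sum(best) / len(ref_boxes)

def prompt_coverage(prompt_spans, ocr_words):
    """Fraction of prompt-specified spans found in the OCR'd image text."""
    if not prompt_spans:
        return 0.0
    ocr_text = " ".join(w.lower() for w in ocr_words)
    hits = sum(1 for span in prompt_spans if span.lower() in ocr_text)
    return hits / len(prompt_spans)
```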

Results

Easy

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 29.5 | 30.7 | 34.2 | 88.9  | 4.1  | 26.8 |
| Open   | PixArt-Σ                 | 27.7 | 1.1  | 0.6  | 82.7  | 0.0  | 0.4  |
| Open   | Janus Pro 7B             | 30.3 | 34.9 | 33.8 | 83.4  | 4.8  | 30.4 |
| Open   | SD 3.5 Large             | 31.4 | 80.6 | 49.7 | 74.6  | 8.6  | 72.2 |
| Open   | FLUX.1-dev               | 26.9 | 79.4 | 44.0 | 75.7  | 9.2  | 82.0 |
| Closed | DALL·E 3                 | 26.5 | 58.8 | 38.1 | 88.4  | 6.5  | 49.1 |
| Closed | GPT-4o                   | 26.7 | 84.5 | 50.3 | 73.5  | 18.2 | 84.9 |
| Ours   | Janus Pro 1B (zero-shot) | 28.7 | 10.9 | 10.0 | 89.9  | 0.8  | 7.2  |
| Ours   | + Vanilla FT †           | 27.9 | 20.3 | 20.1 | 87.5  | 3.8  | 15.0 |
| Ours   | + Text Only †            | 27.5 | 20.5 | 20.3 | 86.0  | 3.5  | 16.4 |
| Ours   | + BBox Only †            | 28.9 | 34.3 | 33.8 | 83.8  | 8.0  | 30.6 |
| Ours   | + Pre-Image †            | 29.4 | 38.6 | 37.3 | 82.6  | 10.4 | 34.4 |
| Ours   | Text + BBox (Ours) †     | 29.0 | 34.1 | 33.3 | 83.0  | 8.0  | 30.8 |

Medium

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 27.4 | 15.7 | 22.4 | 90.6  | 0.7  | 8.2  |
| Open   | PixArt-Σ                 | 25.7 | 0.9  | 0.7  | 82.5  | 0.0  | 0.2  |
| Open   | Janus Pro 7B             | 28.3 | 14.9 | 17.6 | 82.5  | 1.6  | 12.6 |
| Open   | SD 3.5 Large             | 29.0 | 76.8 | 58.4 | 64.0  | 6.3  | 65.5 |
| Open   | FLUX.1-dev               | 29.0 | 70.2 | 64.2 | 73.1  | 9.2  | 60.3 |
| Closed | DALL·E 3                 | 28.6 | 54.5 | 40.5 | 88.4  | 4.6  | 46.4 |
| Closed | GPT-4o                   | 29.3 | 86.6 | 79.1 | 68.5  | 15.2 | 82.6 |
| Ours   | Janus Pro 1B (zero-shot) | 26.8 | 4.5  | 5.5  | 88.7  | 0.3  | 3.3  |
| Ours   | + Vanilla FT †           | 26.5 | 10.3 | 13.4 | 86.5  | 1.4  | 7.6  |
| Ours   | + Text Only †            | 26.4 | 11.8 | 15.2 | 85.0  | 1.8  | 7.9  |
| Ours   | + BBox Only †            | 27.0 | 18.0 | 23.0 | 82.8  | 3.7  | 17.0 |
| Ours   | + Pre-Image †            | 27.3 | 20.3 | 26.1 | 82.7  | 5.0  | 19.6 |
| Ours   | Text + BBox (Ours) †     | 27.5 | 21.9 | 27.8 | 81.2  | 4.3  | 19.1 |

Hard

| Group  | Method                   | CS   | Acc  | F1   | CER ↓ | IoU  | PC   |
|--------|--------------------------|------|------|------|-------|------|------|
| Open   | TextDiffuser-2           | 20.1 | 4.4  | 7.1  | 94.9  | 0.2  | 1.2  |
| Open   | PixArt-Σ                 | 19.1 | 0.8  | 0.7  | 81.9  | 0.0  | 0.0  |
| Open   | Janus Pro 7B             | 19.2 | 4.9  | 7.2  | 86.9  | 0.2  | 6.0  |
| Open   | SD 3.5 Large             | 17.3 | 58.7 | 45.1 | 68.6  | 1.9  | 38.4 |
| Open   | FLUX.1-dev               | 19.3 | 51.9 | 37.5 | 76.0  | 2.8  | 60.8 |
| Closed | DALL·E 3                 | 19.7 | 48.1 | 37.0 | 81.5  | 1.4  | 42.4 |
| Closed | GPT-4o                   | 19.2 | 77.9 | 65.4 | 64.7  | 8.4  | 83.5 |
| Ours   | Janus Pro 1B (zero-shot) | 19.8 | 1.5  | 2.2  | 90.0  | 0.0  | 1.1  |
| Ours   | + Vanilla FT †           | 19.9 | 2.6  | 4.2  | 91.2  | 0.4  | 2.5  |
| Ours   | + Text Only †            | 20.2 | 2.5  | 4.2  | 91.1  | 0.4  | 2.9  |
| Ours   | + BBox Only †            | 19.9 | 4.4  | 7.1  | 90.5  | 0.6  | 5.3  |
| Ours   | + Pre-Image †            | 19.7 | 4.7  | 7.4  | 89.8  | 0.4  | 6.3  |
| Ours   | Text + BBox (Ours) †     | 19.7 | 4.5  | 7.3  | 90.6  | 0.8  | 6.1  |

† Fine-tuned on TextGround4M at 512×512.   CS = CLIP Score, PC = Prompt Coverage.   CER ↓: lower is better; all other metrics: higher is better.

Qualitative Analysis

Model Comparison on TextGround-Bench

Model comparison on TextGround-Bench

Qualitative comparison of open-source and proprietary models on TextGround-Bench. Our method produces more accurate and spatially consistent text rendering compared to strong baselines.

Before vs. After Fine-tuning

Before vs after fine-tuning

Fine-tuning on TextGround4M yields significant improvements in layout consistency and textual accuracy.

Text Only vs. BBox Only vs. Text + BBox

Supervision strategy comparison

BBox-only can hallucinate text; Text-only misses layout. Joint supervision achieves accurate, well-positioned rendering.

BibTeX

@article{mao2026textground4m,
  title        = {TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering},
  author       = {Mao, Dongxing and Wang, Yilin and Li, Linjie and
                  Yang, Zhengyuan and Wang, Alex Jinpeng},
  journal      = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume       = {40},
  number       = {10},
  pages        = {7918--7926},
  year         = {2026},
  month        = {Mar.},
  doi          = {10.1609/aaai.v40i10.37736},
  url          = {https://ojs.aaai.org/index.php/AAAI/article/view/37736}
}