¹Central South University · ²Zhejiang University · ³Microsoft Research · *Corresponding author
Comparison with existing datasets. Prior datasets annotate all visible text without prompt grounding or layout structure. TextGround4M provides prompt-aligned, span-level annotations with precise spatial bounding boxes, enabling faithful and layout-aware text rendering.
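For concreteness, a single record in this format might look like the sketch below; the field names and the normalized-box convention are hypothetical illustrations, not the released TextGround4M schema.

```python
# Hypothetical illustration of a prompt-aligned, span-level annotation.
# Field names and the normalized-coordinate convention are assumptions;
# the actual TextGround4M schema may differ.
sample = {
    "prompt": 'A storefront photo with a sign that reads "FRESH BAKERY"',
    "image": "images/000123.jpg",
    "spans": [
        {
            "text": "FRESH BAKERY",            # span grounded in the prompt
            "prompt_char_range": [43, 55],     # character offsets of the span
            "bbox": [0.18, 0.07, 0.82, 0.21],  # normalized [x1, y1, x2, y2]
        }
    ],
}
```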
Training and inference pipeline. During training, prompt-grounded span tokens and bounding box tokens are appended after the visual tokens to provide layout-aware supervision. At inference time, only image tokens are generated; no layout annotations are required.
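A minimal sketch of that training-time sequence layout, assuming discrete image tokens and special markers for span text and quantized box coordinates; the token names (`<span>`, `<box>`, `<loc_k>`) and the 1,000-bin quantization are assumptions, not the paper's actual vocabulary.

```python
# Sketch of the training-time sequence layout described above. All special
# tokens and the coordinate quantization scheme are illustrative assumptions.
def build_training_sequence(image_tokens, spans, tokenize, num_bins=1000):
    seq = list(image_tokens)                # visual tokens come first
    for span in spans:
        seq.append("<span>")
        seq.extend(tokenize(span["text"]))  # prompt-grounded span tokens
        seq.append("<box>")
        # quantize normalized [x1, y1, x2, y2] into discrete location tokens
        seq.extend(f"<loc_{int(c * (num_bins - 1))}>" for c in span["bbox"])
    return seq

# At inference time the model generates image tokens only, so decoding can
# stop at the end of the visual block; no span or box tokens are produced.
```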
Dataset construction pipeline. 11.7M raw pairs from public datasets and CommonCrawl image mining are processed through VLM captioning, OCR extraction, span alignment, and multi-stage filtering to produce 4.1M high-quality samples.
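The span-alignment and filtering stages might look roughly like the sketch below, which matches quoted spans in the VLM caption against OCR detections; exact string matching and the quoting convention are assumptions, and the actual pipeline likely uses fuzzier alignment plus additional quality filters.

```python
# Illustrative sketch of span alignment and filtering. Assumes OCR results
# arrive as (text, bbox) pairs and that captions quote the rendered text.
import re

def align_spans(caption, ocr_results):
    """Match quoted spans in the caption to OCR detections."""
    quoted = re.findall(r'"([^"]+)"', caption)
    aligned = []
    for span in quoted:
        for text, bbox in ocr_results:
            if span.casefold().strip() == text.casefold().strip():
                aligned.append({"text": span, "bbox": bbox})
                break
    return aligned

def keep_sample(caption, ocr_results, min_spans=1):
    """Filter rule: keep a pair only if every quoted span is grounded by OCR."""
    quoted = re.findall(r'"([^"]+)"', caption)
    aligned = align_spans(caption, ocr_results)
    return len(quoted) >= min_spans and len(aligned) == len(quoted)
```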
Semantic scenario distribution of the image-mining subset across 7 major categories.
Semantic scenarios covered: Signage, Product Packaging, Posters & Banners, Digital Advertising, Book Covers, Educational Content, Social Media.

TextGround-Bench: 1,000 samples stratified by number of prompt-grounded bounding boxes and maximum token length per span (the split rule is sketched in code after the list):

- Easy (400 samples): ≤ 2 boxes & ≤ 4 tokens/span
- Medium (300 samples): ≥ 3 boxes or ≥ 5 tokens/span
- Hard (300 samples): ≥ 3 boxes & ≥ 5 tokens/span
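A minimal sketch of that split rule; reading "Medium" as "either condition holds, but not both" is an assumption needed to keep the three buckets disjoint and match the Easy/Medium/Hard columns in the table below.

```python
# Illustrative stratification rule for TextGround-Bench. The disjointness
# of the Medium bucket is an assumption, not stated explicitly above.
def difficulty_tier(num_boxes, max_tokens_per_span):
    many_boxes = num_boxes >= 3
    long_span = max_tokens_per_span >= 5
    if not many_boxes and not long_span:
        return "Easy"    # ≤ 2 boxes & ≤ 4 tokens/span (400 samples)
    if many_boxes and long_span:
        return "Hard"    # ≥ 3 boxes & ≥ 5 tokens/span (300 samples)
    return "Medium"      # ≥ 3 boxes or ≥ 5 tokens/span (300 samples)
```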
| Type | Method | Easy: CS | Acc | F1 | CER↓ | IoU | PC | Medium: CS | Acc | F1 | CER↓ | IoU | PC | Hard: CS | Acc | F1 | CER↓ | IoU | PC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Open | TextDiffuser-2 | 29.5 | 30.7 | 34.2 | 88.9 | 4.1 | 26.8 | 27.4 | 15.7 | 22.4 | 90.6 | 0.7 | 8.2 | 20.1 | 4.4 | 7.1 | 94.9 | 0.2 | 1.2 |
| Open | PixArt-Σ | 27.7 | 1.1 | 0.6 | 82.7 | 0.0 | 0.4 | 25.7 | 0.9 | 0.7 | 82.5 | 0.0 | 0.2 | 19.1 | 0.8 | 0.7 | 81.9 | 0.0 | 0.0 |
| Open | Janus Pro 7B | 30.3 | 34.9 | 33.8 | 83.4 | 4.8 | 30.4 | 28.3 | 14.9 | 17.6 | 82.5 | 1.6 | 12.6 | 19.2 | 4.9 | 7.2 | 86.9 | 0.2 | 6.0 |
| Open | SD 3.5 Large | 31.4 | 80.6 | 49.7 | 74.6 | 8.6 | 72.2 | 29.0 | 76.8 | 58.4 | 64.0 | 6.3 | 65.5 | 17.3 | 58.7 | 45.1 | 68.6 | 1.9 | 38.4 |
| Open | FLUX.1-dev | 26.9 | 79.4 | 44.0 | 75.7 | 9.2 | 82.0 | 29.0 | 70.2 | 64.2 | 73.1 | 9.2 | 60.3 | 19.3 | 51.9 | 37.5 | 76.0 | 2.8 | 60.8 |
| Closed | DALL·E 3 | 26.5 | 58.8 | 38.1 | 88.4 | 6.5 | 49.1 | 28.6 | 54.5 | 40.5 | 88.4 | 4.6 | 46.4 | 19.7 | 48.1 | 37.0 | 81.5 | 1.4 | 42.4 |
| Closed | GPT-4o | 26.7 | 84.5 | 50.3 | 73.5 | 18.2 | 84.9 | 29.3 | 86.6 | 79.1 | 68.5 | 15.2 | 82.6 | 19.2 | 77.9 | 65.4 | 64.7 | 8.4 | 83.5 |
| Ours | Janus Pro 1B (zero-shot) | 28.7 | 10.9 | 10.0 | 89.9 | 0.8 | 7.2 | 26.8 | 4.5 | 5.5 | 88.7 | 0.3 | 3.3 | 19.8 | 1.5 | 2.2 | 90.0 | 0.0 | 1.1 |
| Ours | + Vanilla FT † | 27.9 | 20.3 | 20.1 | 87.5 | 3.8 | 15.0 | 26.5 | 10.3 | 13.4 | 86.5 | 1.4 | 7.6 | 19.9 | 2.6 | 4.2 | 91.2 | 0.4 | 2.5 |
| Ours | + Text Only † | 27.5 | 20.5 | 20.3 | 86.0 | 3.5 | 16.4 | 26.4 | 11.8 | 15.2 | 85.0 | 1.8 | 7.9 | 20.2 | 2.5 | 4.2 | 91.1 | 0.4 | 2.9 |
| Ours | + BBox Only † | 28.9 | 34.3 | 33.8 | 83.8 | 8.0 | 30.6 | 27.0 | 18.0 | 23.0 | 82.8 | 3.7 | 17.0 | 19.9 | 4.4 | 7.1 | 90.5 | 0.6 | 5.3 |
| Ours | + Pre-Image † | 29.4 | 38.6 | 37.3 | 82.6 | 10.4 | 34.4 | 27.3 | 20.3 | 26.1 | 82.7 | 5.0 | 19.6 | 19.7 | 4.7 | 7.4 | 89.8 | 0.4 | 6.3 |
| Ours | Text + BBox (Ours) † | 29.0 | 34.1 | 33.3 | 83.0 | 8.0 | 30.8 | 27.5 | 21.9 | 27.8 | 81.2 | 4.3 | 19.1 | 19.7 | 4.5 | 7.3 | 90.6 | 0.8 | 6.1 |
† Fine-tuned on TextGround4M at 512×512 resolution. Metric columns are grouped left to right into the Easy, Medium, and Hard splits. CER↓: lower is better; all other metrics: higher is better.
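For reference, two of the reported metrics can be sketched as follows, assuming CER is a standard character error rate (Levenshtein distance normalized by reference length) and IoU is the overlap between predicted and ground-truth text boxes; the paper's exact evaluation protocol may differ.

```python
# Illustrative implementations of CER and IoU under the assumptions above.
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    dist = list(range(n + 1))  # DP row for the empty reference prefix
    for i in range(1, m + 1):
        prev, dist[0] = dist[0], i
        for j in range(1, n + 1):
            cur = min(
                dist[j] + 1,      # deletion
                dist[j - 1] + 1,  # insertion
                prev + (reference[i - 1] != hypothesis[j - 1]),  # substitution
            )
            prev, dist[j] = dist[j], cur
    return dist[n] / max(m, 1)

def iou(a, b):
    """Intersection over union for boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0
```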
Qualitative comparison of open-source and proprietary models on TextGround-Bench. Our method produces more accurate and spatially consistent text rendering compared to strong baselines.
Fine-tuning on TextGround4M yields significant improvements in layout consistency and textual accuracy.
BBox-only supervision can hallucinate text, while text-only supervision misses layout; joint supervision achieves accurate, well-positioned rendering.
@article{mao2026textground4m,
title = {TextGround4M: A Prompt-Aligned Dataset for Layout-Aware Text Rendering},
author = {Mao, Dongxing and Wang, Yilin and Li, Linjie and
Yang, Zhengyuan and Wang, Alex Jinpeng},
journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
volume = {40},
number = {10},
pages = {7918--7926},
year = {2026},
month = {Mar.},
doi = {10.1609/aaai.v40i10.37736},
url = {https://ojs.aaai.org/index.php/AAAI/article/view/37736}
}