NVIDIA's New OCR Model Reads the Room—in Six Languages at Once
Training an optical character recognition model to handle multiple languages presents a classic data problem. Manually labeling millions of documents is expensive, while scraping the web yields messy, imprecise results. NVIDIA's latest release, Nemotron OCR v2, sidesteps this by leaning into a synthetic data strategy.
The team generated 12.2 million training images across English, Japanese, Korean, Russian, and Chinese (both Simplified and Traditional). They did this programmatically, rendering text onto images with full control over layouts, fonts, and visual noise. This method guarantees perfect labels—every bounding box and character is known because it was placed there. The key was ensuring the synthetic images felt real enough for the model to work on actual documents later.
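The "perfect labels by construction" idea can be illustrated with a toy sketch. This is not NVIDIA's pipeline (the real one renders text with varied fonts, layouts, and visual noise); it is a minimal stand-in that "renders" strings onto an ASCII canvas so the generator itself knows every bounding box exactly:

```python
import random

def synthesize_page(width=80, height=24,
                    lines=("hello", "world", "synthetic"), seed=0):
    """Toy analogue of synthetic OCR data generation: place text on an
    ASCII canvas at randomized positions and record each string's exact
    bounding box. Labels are perfect by construction, because the
    generator decides where every character lands."""
    rng = random.Random(seed)
    canvas = [[" "] * width for _ in range(height)]
    # One distinct row per line so placements never overwrite each other.
    rows = rng.sample(range(height), len(lines))
    labels = []
    for text, y in zip(lines, rows):
        x = rng.randrange(0, width - len(text))
        for i, ch in enumerate(text):
            canvas[y][x + i] = ch
        # Bounding box as (x_min, y_min, x_max, y_max) in canvas coordinates.
        labels.append({"text": text, "bbox": (x, y, x + len(text) - 1, y)})
    return ["".join(row) for row in canvas], labels

page, labels = synthesize_page()
for lab in labels:
    x0, y0, x1, y1 = lab["bbox"]
    # Reading the canvas back at the recorded box recovers the label exactly.
    assert page[y0][x0:x1 + 1] == lab["text"]
```

A real pipeline would swap the ASCII canvas for an image renderer and add noise, but the property being exploited is the same: the label never has to be inferred, only recorded.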
The results are significant. Where its predecessor struggled with non-English scripts, producing error rates that rendered text nearly unreadable, v2's error rates dropped to near-zero on a synthetic benchmark. It also operates as a single, unified model; you don't need to pre-select a language variant. Architecturally, a shared backbone for text detection, recognition, and layout analysis allows it to process 34.7 pages per second on a single A100 GPU, markedly faster than several alternatives.
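OCR error rates of this kind are typically measured as character error rate (CER): the edit distance between the model's output and the ground truth, normalized by the reference length. The source doesn't specify the exact metric used, so this is a generic illustration, assuming standard Levenshtein-based CER:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein edit distance from the reference
    transcription to the model output, divided by the reference length.
    0.0 means a perfect transcription."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # DP row: distances for an empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
        prev = cur
    return prev[n] / max(m, 1)

print(cer("kitten", "sitting"))  # 3 edits over 6 characters → 0.5
print(cer("near zero", "near zero"))  # perfect output → 0.0
```

A "nearly unreadable" result corresponds to a CER approaching (or exceeding) 1.0, while "near-zero" means almost every character in the output matches the ground truth.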
NVIDIA has open-sourced both the model and the massive synthetic dataset. The approach is designed to scale: adding a new language primarily requires source text and appropriate fonts, suggesting a path toward much broader multilingual support.
Source: Hugging Face Blog