Why AI is so good at translation and how it speeds shipping

Thomas Schiavone

June 27, 2026

AI translation has quietly reached human quality for the kind of content most products ship. Independent reviewers can no longer reliably tell top AI output from professional human translation on standard text. That's a real shift, and it changes more than quality: when translation takes seconds instead of weeks, localization stops being a slow phase and becomes something you iterate on every release.

If you still picture machine translation as a rough draft you wouldn't put your brand name on, this post is for you. The short version: the models got good, we understand why, and the speed is the part that actually changes how you work.

Machine translation isn't what it used to be

The skepticism is earned. If your reference point is the Google Translate of 2015, you remember output that was technically correct and unmistakably robotic: odd word order, idioms translated into nonsense.

So most teams built a workflow around that limit. Translation became a phase: export your strings, send them to a vendor, pay per word, wait days or weeks, re-import, and hope nothing broke. The machine was a rough first pass; humans did the real work.

That workflow is still common. The problem is that it's built for a quality ceiling that no longer exists.

Why is AI so good at translation?

Because translation is close to what these models do internally on every token.

For decades, the goal of machine translation was an interlingua: a language-neutral representation of meaning you could encode any language into and decode any language out of. Hand-built versions mostly failed. Recent interpretability research shows that large models grow one on their own.

MIT's "Semantic Hub Hypothesis" paper found that models map sentences with the same meaning in different languages to nearby points in their middle layers, much as the human brain is thought to route meaning through a single hub. A study from Shanghai Jiao Tong, "Converging to a Lingua Franca," describes the mechanism plainly: the model encodes input into a shared "lingua franca" space, reasons there, then decodes into the target language.

The clearest evidence: when an English-dominant model reads Chinese, its middle-layer representations sit closer to the English word for the concept than to the Chinese one, and the Chinese output only emerges in the final layers. In effect, it thinks in its dominant language and speaks the target language at the end. Translation isn't a feature bolted on. It's part of what the model is already doing.

It's no accident that the transformer, the architecture behind nearly every modern language model, was introduced in 2017 as a translation system. The shared meaning-space is also why quality keeps climbing: the same research shows it gets more language-agnostic as models grow.

Has AI translation reached human quality?

For high-resource languages and standard content, yes.

The most thorough independent assessment, Intento's State of Translation Automation 2025, evaluated 46 systems across 11 language pairs with expert linguists. It found that the gap between human translation and the best automated translation has "virtually disappeared" for most high-resource pairs. Among top performers, the share that were large language models rose from 55% to 89% in a single year.

That climb is recent and steep. Watch one frontier family improve generation over generation and you can see how fast the ground shifted:

Model (release)	MMLU (general knowledge)	Where translation stood
Claude 2.1 (Nov 2023)	78.5%	Dedicated NMT engines still led translation benchmarks
Claude 3 Opus (Mar 2024)	86.8%	LLMs entering the top tier (WMT23: "here but not quite there yet")
Claude 3.5 Sonnet (Jun 2024)	88.7%	Claude 3.5 among the best systems in the WMT24 cluster
Claude 4 family (2025)	~90% (est.)	LLMs make up 89% of top-performing translation systems (Intento 2025)

Sources: Anthropic model announcements (2023 to 2024), WMT23 and WMT24 findings, Intento 2025. MMLU is a general-knowledge benchmark, not a translation score, so read it as a proxy for overall capability; the 4-series figure is an estimate, since Anthropic no longer publishes plain MMLU for those models.

In Lokalise's 2025 blind study, professional translators rated AI output "good" often enough to use it as a first pass and reserve human review for the exceptions. That's the inversion that matters: the human moves from translator to editor.

This is sharpest for product copy. Notification text, UI strings, and onboarding flows are short, standard, and high-frequency, exactly the content where models are at parity.

How we chose the models behind Courier AI Translation

We didn't pick a model off a generic leaderboard. Notification copy has specific demands, so we optimized for those.

Tone and voice. A notification that's accurate but stiff still fails. "You're all set, welcome aboard" has to sound like your brand in every language. Independent evaluations repeatedly rate Anthropic's Claude models highest for stylistic fluency, and at WMT24, the field's main translation competition, Claude ranked first in 9 of 11 language pairs.

Consistency where it counts. On Anthropic's own multilingual benchmarks, Claude holds 96 to 98% of its English-level performance across Spanish, French, German, Italian, Portuguese, Japanese, Chinese, Korean, and Arabic, the languages most products ship first.

The right model for the job. Most translation runs on Claude Sonnet 4.6, which pairs near-top-tier quality with the speed you want when editing templates live. For long, high-stakes templates where terminology has to stay consistent, Claude Opus 4.8 is the heavier option. Both keep a full template in context instead of translating string by string.

Here's what that consistency looks like in numbers. Anthropic publishes zero-shot scores for each language as a percentage of the model's English-level performance, so 98% means the model retains 98% of its English score, about 2% below its English baseline in relative terms:

Language	Opus 4.1	Sonnet 4.5	Haiku 4.5
English (baseline)	100%	100%	100%
Spanish	98.1%	98.2%	96.4%
Portuguese (BR)	97.8%	97.8%	96.1%
Italian	97.7%	97.9%	96.0%
French	97.9%	97.5%	95.7%
German	97.7%	97.0%	94.3%
Arabic	97.1%	97.2%	92.5%
Chinese (Simplified)	97.1%	96.9%	94.2%
Japanese	96.9%	96.8%	93.5%
Korean	96.6%	96.7%	93.3%
Hindi	96.8%	96.7%	92.4%
Swahili	89.8%	91.1%	78.3%
Yoruba	80.3%	79.7%	52.7%

Source: Anthropic multilingual benchmarks. These are the latest models Anthropic publishes numbers for; Courier runs on the newer Sonnet 4.6 and Opus 4.8 in the same family. Two things stand out: the languages most products ship first cluster at 96 to 98%, and the drop-off for low-resource languages like Swahili and Yoruba is exactly where human review still earns its keep.
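One way to act on these numbers is to turn them into a review policy. The sketch below is illustrative only: the scores are the Sonnet 4.5 column from the table above, while the 95% and 85% cutoffs and the tier names are assumptions chosen for the example, not recommendations from Anthropic or a description of how Courier routes review.

```python
# Relative scores (percent of English baseline), copied from the
# Sonnet 4.5 column of the table above.
RELATIVE_SCORE = {
    "Spanish": 98.2,
    "Portuguese (BR)": 97.8,
    "Italian": 97.9,
    "French": 97.5,
    "German": 97.0,
    "Arabic": 97.2,
    "Chinese (Simplified)": 96.9,
    "Japanese": 96.8,
    "Korean": 96.7,
    "Hindi": 96.7,
    "Swahili": 91.1,
    "Yoruba": 79.7,
}

# Hypothetical cutoffs, chosen only for illustration.
SPOT_CHECK_MIN = 95.0
FULL_REVIEW_MIN = 85.0


def review_tier(language: str) -> str:
    """Map a language to a review tier based on its relative score."""
    score = RELATIVE_SCORE.get(language)
    if score is None:
        # No published number: treat it as the highest-risk case.
        return "native-speaker review"
    if score >= SPOT_CHECK_MIN:
        return "spot check"
    if score >= FULL_REVIEW_MIN:
        return "full review"
    return "native-speaker review"


for lang in ("Spanish", "Swahili", "Yoruba"):
    print(f"{lang}: {review_tier(lang)}")
```

With these assumed cutoffs, Spanish lands in "spot check", Swahili in "full review", and Yoruba in "native-speaker review". The cutoffs are the part to tune: a team shipping medical or legal content would likely set them higher.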

No single model is best for every language or domain, and a model can't reliably catch its own mistakes. That's why the model is one half of the product and review is the other.

How AI translation speeds up iteration

This is the part that changes how you work.

The old economics: vendors charge roughly $0.10 to $0.30 per word, a translator handles 1,500 to 2,500 words a day, turnaround runs days to weeks, and rush jobs add 25 to 100%. Fine for a one-time document. Painful for software, because product copy is never finished.

Put the two models side by side and the gap is obvious:

	Traditional vendor workflow	AI translation
Cost per word	$0.10 to $0.30	Fractions of a cent (raw output)
Throughput	1,500 to 2,500 words per translator per day	Thousands of words in seconds
Turnaround	Days to weeks	Seconds
Rush premium	25 to 100% surcharge	None
Fix one string in 12 languages	New ticket, new cycle	Re-translate instantly

Sources: Translife and Alconost 2025/2026 rate guides. The right comparison for product copy isn't AI versus human cost per word, it's "continuous" versus "waits in a queue."
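To make the vendor column concrete, here is a back-of-the-envelope calculation. The rates, throughput, and rush surcharge are the ranges quoted above; the 5,000-word, 12-language release and the assumption of one translator per language working in parallel are hypothetical inputs for illustration.

```python
import math


def vendor_estimate(words, languages, rate_per_word, words_per_day, rush_surcharge=0.0):
    """Estimate vendor cost (USD) and translation days for one release.

    Assumes one translator per language working in parallel, so the
    day count depends on the word count, not the number of languages.
    Export, re-import, and queue time are not included.
    """
    cost = words * languages * rate_per_word * (1 + rush_surcharge)
    days = math.ceil(words / words_per_day)
    return round(cost, 2), days


# Hypothetical release: 5,000 source words into 12 languages.
low = vendor_estimate(5_000, 12, rate_per_word=0.10, words_per_day=2_500)
high = vendor_estimate(5_000, 12, rate_per_word=0.30, words_per_day=1_500)
rushed_cost, _ = vendor_estimate(
    5_000, 12, rate_per_word=0.30, words_per_day=1_500, rush_surcharge=1.0
)

print(low)          # (6000.0, 2)
print(high)         # (18000.0, 4)
print(rushed_cost)  # 36000.0
```

Even the low end is thousands of dollars and multiple working days for a single pass, before any of the strings change again.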

Localization teams call this the sequential bottleneck. You ship in two-week sprints, but translation takes six weeks, so your international product is always at least a sprint behind, and the gap compounds. Translation was the one step you couldn't make continuous, because it waited on a human queue.

When translation takes seconds at near-zero cost, that constraint disappears. You stop rationing which markets get localized copy. You fix a typo across twelve languages without a ticket. You test a localized onboarding flow instead of shipping one version and hoping. The shift isn't "we translate faster," it's "we iterate on localized content like everything else."

When does AI translation still need a human?

When the stakes or the language are outside the safe zone.

The shared meaning-space that makes models great at translation is only partially aligned. Research across dozens of languages finds a well-aligned core plus fragmented regions, worst for low-resource and typologically distant languages. The same benchmarks that put Spanish at 98% put some African languages far lower. And models can favor natural-sounding phrasing over exact terminology, which is fine for a welcome message and risky for a medical instruction.

So the right posture is to let the model translate and put a human on review, focused on the exceptions rather than every string. That's how Courier AI Translation works: add a language, the model translates every field, and you review side by side with your source, override anything, and re-check only what changed when you edit the original. Your content is never used to train AI models.
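The "re-check only what changed" step can be approximated by fingerprinting each source string and comparing it with the fingerprint recorded when its translations were last reviewed. This is a minimal sketch of that idea, not Courier's implementation; the field names and the `reviewed` store are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable hash of a source string, ignoring surrounding whitespace."""
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()


def fields_needing_review(source_fields: dict, reviewed_hashes: dict) -> list:
    """Return field names whose source text changed since the last review.

    source_fields:   field name -> current source-language text
    reviewed_hashes: field name -> fingerprint recorded at last review
    Fields with no recorded fingerprint (new fields) are flagged too.
    """
    return sorted(
        name
        for name, text in source_fields.items()
        if reviewed_hashes.get(name) != fingerprint(text)
    )


# Hypothetical template: the body was edited and a button was added
# after the existing translations were reviewed.
reviewed = {
    "title": fingerprint("Welcome aboard"),
    "body": fingerprint("You're all set."),
}
current = {
    "title": "Welcome aboard",
    "body": "You're all set, welcome aboard.",
    "cta": "Get started",
}

print(fields_needing_review(current, reviewed))  # ['body', 'cta']
```

Only the edited body and the new button text come back for review; the untouched title keeps its approved translations in every language.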

What this means for shipping in multiple languages

A few things follow.

Treat localization as continuous, the way you treat deployment. The reason it had to be a phase, slow and human-gated translation, no longer holds.

Move people from translating to reviewing. The value is in catching the 10% of strings that need judgment, not hand-translating the 90% that don't.

Re-run the math on what's worth localizing. The old calculation assumed a per-word cost and multi-day latency. When both approach zero, content that wasn't worth translating suddenly is.

The teams that win the next few years of global product won't have the biggest translation budgets. They'll be the ones who stopped waiting on translation and started shipping it on every release.

If you want to feel the difference, Courier AI Translation localizes any notification template into any language in seconds, inside the editor you already use. Design once, translate into every language after that, and keep everything in sync as your copy changes.
