Translation quality claims are easy to make and hard to verify, especially now that “general-purpose” large language models (LLMs) can produce fluent text on demand. But for global enterprises—especially in regulated industries—fluency isn’t enough. The real standard is evidence that a model preserves meaning, follows target-language conventions and style guidelines, and performs consistently across languages and domains. Meeting that standard requires structured, comprehensive translation quality evaluation. TowerZen, the latest in a family of LLMs developed by TransPerfect and acquisition partner Unbabel, was subjected to exactly that kind of evaluation. Purpose-built for translation and integrated into TransPerfect’s GlobalLink® technology platform, TowerZen sets a new standard for translation quality.
This blog post details the technical evaluation behind TowerZen: how we validated quality, and how we designed the evaluation and data pipeline to make the results trustworthy.
Quality Begins with Data
Artificial intelligence (AI) models are a product of training data, and the examples they learn from become the patterns they reproduce. For TowerZen, the goal wasn’t “more data”—it was quality data: the kind of examples that reflect what real clients expect in production.
Quality-filtered training data
TowerZen was trained with a custom curriculum combining supervised fine-tuning and reinforcement learning on quality-filtered datasets, enabling more consistent translation quality across a broader range of subject matters, language varieties, and industries. Examples are screened before training, so the model learns from translations that meet quality standards rather than absorbing noise (inconsistent terminology, errors, or misaligned segments), and spends more time focused on challenging examples.
Why that matters technically
Noisy parallel data can inflate apparent fluency while degrading adequacy and consistency. The result often shows up as “drift”: output that reads well but departs from the meaning of the source. The quality-first pipeline we developed helps shift learning toward:
- Adequacy: meaning preservation
- Fluency: grammatical correctness and idiomaticity
- Locale convention: meeting expectations of speakers of different varieties of the same language
- Consistency: terminology stability across translation projects
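To make the idea of screening examples before training concrete, here is a minimal sketch of a threshold filter over segment pairs. It is illustrative only: the `quality_score` field, the 0.8 threshold, and the length-ratio check are assumptions made for this example and do not describe TowerZen's actual filtering pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    source: str
    target: str
    # Hypothetical score in [0, 1], e.g. produced by a quality-estimation model.
    quality_score: float


def keep_pair(pair: SegmentPair, min_score: float = 0.8,
              max_len_ratio: float = 3.0) -> bool:
    """Return True if the pair passes the (assumed) alignment and quality checks."""
    src_len = len(pair.source.split())
    tgt_len = len(pair.target.split())
    if src_len == 0 or tgt_len == 0:
        return False  # an empty side is almost certainly a misaligned segment
    # Crude misalignment check: whitespace token counts should be comparable.
    # This heuristic does not suit languages written without spaces.
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    if ratio > max_len_ratio:
        return False
    return pair.quality_score >= min_score


def filter_corpus(pairs, min_score: float = 0.8, max_len_ratio: float = 3.0):
    """Keep only the pairs that pass keep_pair, preserving input order."""
    return [p for p in pairs if keep_pair(p, min_score, max_len_ratio)]
```

A filter like this would drop empty or badly mismatched segments outright and leave the quality score to decide the remaining cases. A production pipeline would likely need language-aware length checks and terminology checks as well.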
Building a Benchmark That Reflects Production Reality
For evaluation, we used ZBench v3, TransPerfect’s internal benchmark derived from human translations across a broad range of domains. The intent is to stress the model on what it will actually see: mixed subject matter, varied styles, and real-world segment distributions.
Automatic Evaluation at Scale
Automatic metrics are useful when you need broad coverage across languages and thousands of segments. We used COMET with the reference-based model Unbabel/wmt22-comet-da. COMET is widely used because it correlates better with human judgments than older lexical-overlap metrics such as BLEU.
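For readers who want to try this kind of scoring, the sketch below shows how a reference-based COMET model can be called through the open-source `unbabel-comet` Python package. The helper names, batch size, and CPU setting are assumptions for illustration; this is not the evaluation harness used for TowerZen.

```python
def build_comet_inputs(sources, hypotheses, references):
    """Pair up source, machine translation, and reference for each segment."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("sources, hypotheses, and references must align 1:1")
    return [
        {"src": src, "mt": mt, "ref": ref}
        for src, mt, ref in zip(sources, hypotheses, references)
    ]


def score_with_comet(samples, batch_size=8, gpus=0):
    """Score samples with Unbabel/wmt22-comet-da.

    Requires `pip install unbabel-comet` and downloads the checkpoint on
    first use. The import lives inside the function so the helper above
    can be used without the package installed.
    """
    from comet import download_model, load_from_checkpoint

    checkpoint_path = download_model("Unbabel/wmt22-comet-da")
    model = load_from_checkpoint(checkpoint_path)
    output = model.predict(samples, batch_size=batch_size, gpus=gpus)
    # Per-segment scores plus the corpus-level average.
    return output.scores, output.system_score
```

Calling `score_with_comet(build_comet_inputs(sources, hypotheses, references))` once per system and language pair yields the system-level scores that the aggregation step then ranks.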
What we compared
In the automatic evaluation, we compared:
- TowerZen-9B
- TowerZen-2B
- Tower-4 Sugarloaf (previous best Tower iteration)
- DeepL
- TransPerfect internal production models (language-specialized)
How we aggregated across languages: Borda counts
Instead of relying only on raw score averages, we also computed Borda counts, ranking each model per language pair by COMET and then averaging ranks across languages. Because ranks ignore the size of the score gaps, a few language pairs with unusually large COMET differences cannot dominate the conclusion.
Outcome (automatic): TowerZen models scored above Tower-4 Sugarloaf overall, with TowerZen-9B the top-ranked system by COMET and Borda aggregation.
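The rank aggregation described in this section can be sketched in a few lines. The system names and scores below are made-up placeholders, not TowerZen results, and the tie handling (tied systems share the mean of their ranks) is an assumption of this example.

```python
from collections import defaultdict


def ranks_for_pair(scores):
    """Rank systems for one language pair: 1 is best, ties share the mean rank."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of systems tied with the one at position i.
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        mean_rank = ((i + 1) + (j + 1)) / 2
        for name, _ in ordered[i:j + 1]:
            ranks[name] = mean_rank
        i = j + 1
    return ranks


def average_ranks(scores_by_pair):
    """Average each system's rank over the language pairs it appears in."""
    collected = defaultdict(list)
    for pair_scores in scores_by_pair.values():
        for name, rank in ranks_for_pair(pair_scores).items():
            collected[name].append(rank)
    return {name: sum(r) / len(r) for name, r in collected.items()}


# Placeholder scores for illustration only.
example_scores = {
    "en-de": {"system_a": 0.86, "system_b": 0.84, "system_c": 0.85},
    "en-ja": {"system_a": 0.60, "system_b": 0.62, "system_c": 0.58},
    "en-fr": {"system_a": 0.88, "system_b": 0.88, "system_c": 0.80},
}
example_ranks = average_ranks(example_scores)  # lower average rank is better
```

Note that the en-ja scores are lower across the board, yet that pair contributes exactly as much to the final ordering as the others. One caveat: if some systems are missing for some language pairs, ranks from pairs with different numbers of systems are not directly comparable and may need normalizing.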
Human Evaluation: Blinded and Forced-Ranking Design
Automatic metrics have limits, especially when differences between systems are small or subjective, or when errors are subtle (word sense, negation, agreement, register). COMET also isn’t trained to distinguish between language varieties, so it can’t tell us when a French translation appropriate for France would seem unnatural in Canada. To address these limits, we ran an extensive human evaluation of machine translation designed to be comparative, blinded, and auditable.
Design: two complementary tasks
We ran all annotation in DataForce, TransPerfect’s internal evaluation platform. Annotators performed two tasks:
- Absolute rating: 1 (unacceptable) to 5 (excellent)
- Relative ranking: forced choice across systems, best to worst
Each task answers a different question:
- Ratings capture “Is this output good enough?”
- Rankings capture “Which output is better?” even when all outputs are acceptable.
Blinding and shuffling
To reduce bias, annotators never saw which system produced which output, and system labels were shuffled between tasks (e.g., “MT version A” in one task was not the same model in the next).
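A minimal sketch of per-task label shuffling is shown below. The function name, the seeding scheme, and the task identifiers are assumptions for illustration; they do not describe how DataForce implements blinding.

```python
import random


def blind_labels(systems, task_id, seed=0):
    """Map anonymous labels ("MT version A", ...) to systems for one task.

    Seeding the shuffle with the task ID makes the assignment reproducible,
    so the mapping can be regenerated later when results are audited.
    """
    rng = random.Random(f"{seed}:{task_id}")
    shuffled = list(systems)
    rng.shuffle(shuffled)
    return {
        f"MT version {chr(ord('A') + index)}": name
        for index, name in enumerate(shuffled)
    }
```

Annotators would see only the anonymous labels, while the mapping stays with the evaluation team. Because a random shuffle can repeat an assignment by chance, a stricter setup might reshuffle until the mapping differs from the previous task.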
Coverage and competitors
Human evaluation covered 30 language pairs, comparing up to five systems depending on availability:
- TowerZen-9B
- TowerZen-2B
- Tower-4 Sugarloaf
- DeepL
- TransPerfect internal production models (language-specialized)
How We Interpreted Results
We analyzed results along two dimensions:
Ratings: absolute quality level
Most language pairs clustered between 4.0 and 4.5 average ratings, with a few lower-performing cases (e.g., JA→EN and EN→ZHHK, which averaged closer to 3). The ratings help identify where translations are “generally strong” versus “still variable.”
Rankings: head-to-head stability
From ranking data, we computed:
- Win ratios (how often system X beats system Y)
- Average rank/Borda counts (overall preference ordering)
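As an illustration of how win ratios fall out of forced rankings, the sketch below counts, for every pair of systems, how often one was ranked above the other. The system names and rankings are placeholders, and the function assumes no ties, as in a forced-ranking task.

```python
from collections import Counter
from itertools import combinations


def pairwise_win_ratios(rankings):
    """Compute win ratios from rankings ordered best to worst.

    Each ranking covers one segment and may include only the systems
    available for that segment. Returns {(x, y): share of comparisons
    between x and y in which x was ranked above y}.
    """
    wins = Counter()
    comparisons = Counter()
    for ranking in rankings:
        # combinations() preserves list order, so `better` precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1
            comparisons[frozenset((better, worse))] += 1
    ratios = {}
    for pair, total in comparisons.items():
        x, y = sorted(pair)
        ratios[(x, y)] = wins[(x, y)] / total
        ratios[(y, x)] = wins[(y, x)] / total
    return ratios


# Placeholder rankings for illustration only.
example_rankings = [
    ["system_a", "system_b", "system_c"],
    ["system_b", "system_a", "system_c"],
    ["system_a", "system_c", "system_b"],
    ["system_a", "system_b"],  # system_c unavailable for this segment
]
example_ratios = pairwise_win_ratios(example_rankings)
```

Here `system_a` is ranked above `system_b` in three of their four comparisons, giving a win ratio of 0.75. Counting only the segments where both systems were ranked keeps the ratio meaningful when not every system is available for every language pair.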
Outcome (human):
- TowerZen-9B was the top system overall by average rank, closely followed by Tower-4 Sugarloaf.
- Across all head-to-head comparisons, TowerZen-9B won about 54% of the time, with a 57% win rate against TransPerfect internal production models.
What This Means in Practice
Better evaluation outcomes translate into operational impact:
- Less post-editing: higher initial quality reduces time and cost per segment
- Faster turnaround: fewer corrections before content delivery
- More confidence at scale: higher-quality input makes quality estimation and automated QA more effective
- More control than generic MT: TowerZen can be tailored to terminology, tone, and domain needs
Because TowerZen is integrated into GlobalLink, these gains can be applied directly in production localization workflows.
Looking Ahead
TowerZen-9B is our strongest translation model to date, and we’re continuing to invest in:
- Expanded language coverage
- Stronger performance in specialized domains
- Improved compliance with complex style guide instructions
- Improved glossary and terminology adherence
- Document-level translation, which uses broader context across segments
These improvements will be guided by continued automatic and human translation quality evaluation, so that progress is validated across languages and domains and TowerZen remains a high-performing LLM translation model for enterprise use.
To learn more about TowerZen, contact us.