TowerZen: Setting New Standards in Translation Quality Through Rigorous Evaluation

AI translation quality is easy to claim but difficult to validate. This post breaks down the evaluation framework behind TowerZen, including the datasets, benchmarking methods, and human review processes used to assess translation performance across languages, domains, and enterprise use cases.

Crystal Maganzini

Updated on May 18, 2026

Translation quality claims are easy to make and hard to verify, especially now that “general-purpose” large language models (LLMs) can produce fluent text on demand. But for global enterprises—especially in regulated industries—fluency isn’t enough. The real standard is evidence that a model preserves meaning, follows target-language conventions and style guidelines, and performs consistently across languages and domains. Meeting that standard requires structured, comprehensive translation quality evaluation. TowerZen, the latest in a family of LLMs developed by TransPerfect and acquisition partner Unbabel, was subjected to exactly that kind of evaluation. Purpose-built for translation and integrated into TransPerfect’s GlobalLink® technology platform, TowerZen sets a new standard for translation quality.

This post details the technical evaluation behind TowerZen: how we validated quality and how we designed the evaluation and data pipeline to make the results trustworthy.

Quality Begins with Data

Artificial intelligence (AI) models are a product of training data, and the examples they learn from become the patterns they reproduce. For TowerZen, the goal wasn’t “more data”—it was quality data: the kind of examples that reflect what real clients expect in production.

Quality-filtered training data

TowerZen was trained with a custom curriculum combining supervised fine-tuning and reinforcement learning on quality-filtered datasets, enabling more consistent translation quality across a broader range of subject matters, language varieties, and industries. Examples are screened before training, so the model learns from translations that meet quality standards rather than absorbing noise (inconsistent terminology, errors, or misaligned segments), and spends more time focused on challenging examples.
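To make the idea of screening examples before training concrete, here is a minimal sketch. It is an illustration only, not TowerZen's actual filter, which this post does not describe. It assumes each segment pair already carries a precomputed quality score between 0 and 1; the `SegmentPair` type, the `filter_by_quality` function, and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    source: str
    target: str
    quality: float  # hypothetical precomputed quality score in [0, 1]


def filter_by_quality(pairs, min_quality=0.8, max_length_ratio=3.0):
    """Keep pairs that pass a crude alignment check and a score threshold.

    Both thresholds are illustrative defaults, not values from the post.
    """
    kept = []
    for pair in pairs:
        src_len = len(pair.source.split())
        tgt_len = len(pair.target.split())
        if src_len == 0 or tgt_len == 0:
            continue  # an empty side almost certainly means a misaligned segment
        ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
        if ratio > max_length_ratio:
            continue  # wildly different lengths suggest misalignment
        if pair.quality < min_quality:
            continue  # below the quality bar
        kept.append(pair)
    return kept
```

The whitespace-based length check is only reasonable for languages that separate words with spaces; a real pipeline would need a different length measure for languages such as Japanese or Chinese.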

Why that matters technically

Noisy parallel data can inflate apparent fluency while degrading adequacy and consistency. The result often shows up as an apparent “drift” from the source. The quality-first pipeline we developed helps shift learning toward:

  • Adequacy: meaning preservation
  • Fluency: grammatical correctness and idiomaticity
  • Locale convention: meeting expectations of speakers of different varieties of the same language
  • Consistency: terminology stability across translation projects

Building a Benchmark That Reflects Production Reality

For evaluation, we used ZBench v3, TransPerfect’s internal benchmark derived from human translations across a broad range of domains. The intent is to stress the model on what it will actually see: mixed subject matter, varied styles, and real-world segment distributions.

Automatic Evaluation at Scale

Automatic metrics are useful when you need broad coverage across languages and thousands of segments. We used COMET with the reference-based model Unbabel/wmt22-comet-da. COMET is widely used because it correlates better with human judgments than older lexical-overlap metrics such as BLEU.
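For readers who want to try reference-based COMET scoring themselves, the sketch below shows one way to do it. It assumes the open-source unbabel-comet package (version 2.x) is installed; the helper names `build_comet_inputs` and `score_with_comet` are hypothetical, and this is not a description of the evaluation harness used for TowerZen.

```python
def build_comet_inputs(sources, hypotheses, references):
    """Pair up segments in the dict format reference-based COMET models expect."""
    if not (len(sources) == len(hypotheses) == len(references)):
        raise ValueError("sources, hypotheses, and references must be the same length")
    return [
        {"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)
    ]


def score_with_comet(samples, model_name="Unbabel/wmt22-comet-da", batch_size=8, gpus=0):
    """Score samples with COMET; returns (segment_scores, system_score)."""
    # Imported lazily so build_comet_inputs works without the package installed.
    from comet import download_model, load_from_checkpoint

    model = load_from_checkpoint(download_model(model_name))
    output = model.predict(samples, batch_size=batch_size, gpus=gpus)
    return output.scores, output.system_score
```

Calling `score_with_comet(build_comet_inputs(sources, hypotheses, references))` downloads the checkpoint on first use and returns one score per segment plus a system-level average.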

What we compared

In the automatic evaluation, we compared:

  • TowerZen-9B
  • TowerZen-2B
  • Tower-4 Sugarloaf (previous best Tower iteration)
  • DeepL
  • TransPerfect internal production models (language-specialized)

How we aggregated across languages: Borda counts

Instead of relying only on raw score averages, we also computed Borda counts, ranking each model per language pair by COMET and then averaging ranks across languages. Because COMET score ranges can differ from one language pair to another, rank-based aggregation reduces the chance that a few language pairs dominate the conclusion.
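The rank-then-average aggregation described above can be sketched as follows. The function name `average_ranks`, the system names, and every score in `toy_scores` are invented for illustration and are not results from this evaluation. Tie handling (tied systems share the mean of the positions they occupy) is also an assumption.

```python
from collections import defaultdict


def average_ranks(scores_by_pair):
    """Rank systems within each language pair (1 = best), then average the ranks.

    scores_by_pair maps a language pair to {system_name: score}; higher is better.
    Tied systems share the mean of the positions they occupy.
    """
    rank_lists = defaultdict(list)
    for scores in scores_by_pair.values():
        ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        i = 0
        while i < len(ordered):
            j = i
            # Extend j to cover every system tied with the one at position i.
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2  # mean of positions i+1 .. j+1
            for name, _ in ordered[i : j + 1]:
                rank_lists[name].append(shared_rank)
            i = j + 1
    return {name: sum(ranks) / len(ranks) for name, ranks in rank_lists.items()}


# Invented scores, for illustration only.
toy_scores = {
    "en-de": {"system_a": 0.86, "system_b": 0.84, "system_c": 0.85},
    "en-ja": {"system_a": 0.90, "system_b": 0.91, "system_c": 0.88},
    "en-pt": {"system_a": 0.62, "system_b": 0.60, "system_c": 0.60},
}

ranks = average_ranks(toy_scores)
for name in sorted(ranks, key=ranks.get):
    print(f"{name}: average rank {ranks[name]:.2f}")
```

When every system is scored on every language pair, ordering by average rank gives the same ordering as a classic points-based Borda count, since the points are just a linear transformation of the ranks.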

Outcome (automatic): TowerZen models scored above Tower-4 Sugarloaf overall, with TowerZen-9B the top-ranked system by COMET and Borda aggregation.

Human Evaluation: Blinded and Forced-Ranking Design

Automatic metrics have limits, especially when differences are small or subjective, or when errors are subtle (word sense, negation, agreement, register). COMET also isn’t trained to distinguish between language varieties, so it can’t tell us when a French translation appropriate for France would seem unnatural in Canada. To address these limits, we ran an extensive human evaluation of machine translation designed to be comparative, blinded, and auditable.

Design: two complementary tasks

We ran all annotation in DataForce, TransPerfect’s internal evaluation platform. Annotators performed two tasks:

  1. Absolute rating: 1 (unacceptable) to 5 (excellent)
  2. Relative ranking: forced choice across systems, best to worst

Each task answers a different question:

  • Ratings capture “Is this output good enough?”
  • Rankings capture “Which output is better?” even when all outputs are acceptable.

Blinding and shuffling

To reduce bias, annotators never saw which system produced which output, and system labels were shuffled between tasks (e.g., “MT version A” in one task was not the same model in the next).
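One simple way to implement this kind of blinding is to draw a fresh random label assignment for every task. The sketch below is an assumption about how that could work, not a description of DataForce; the function name `blind_outputs` is hypothetical. Note that a fresh random draw makes labels uninformative across tasks but does not guarantee that a given label maps to a different system every time.

```python
import random
import string


def blind_outputs(outputs_by_system, rng):
    """Return (blinded, key) for one annotation task.

    outputs_by_system maps system name -> translation.
    blinded maps a neutral label such as "MT version A" -> translation.
    key maps the neutral label -> system name and is kept away from annotators.
    """
    systems = list(outputs_by_system)
    if len(systems) > len(string.ascii_uppercase):
        raise ValueError("too many systems for single-letter labels")
    rng.shuffle(systems)  # fresh assignment on every call
    letters = string.ascii_uppercase[: len(systems)]
    labels = [f"MT version {letter}" for letter in letters]
    key = dict(zip(labels, systems))
    blinded = {label: outputs_by_system[system] for label, system in key.items()}
    return blinded, key
```

Storing `key` alongside each task is what keeps the evaluation auditable: the ratings can be mapped back to systems after annotation is complete.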

Coverage and competitors

Human evaluation covered 30 language pairs, comparing up to five systems depending on availability:

  • TowerZen-9B
  • TowerZen-2B
  • Tower-4 Sugarloaf
  • DeepL
  • TransPerfect internal production models (language-specialized)

How We Interpreted Results

We analyzed results along two dimensions:

Ratings: absolute quality level

Most language pairs clustered at average ratings between 4.0 and 4.5, with a few lower-performing cases (e.g., JA→EN and EN→ZHHK, which averaged closer to 3). This helps identify where translations are “generally strong” versus “still variable.”

Rankings: head-to-head stability

From ranking data, we computed:

  • Win ratios (how often system X beats system Y)
  • Average rank/Borda counts (overall preference ordering)
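As an illustration of how win ratios can be derived from forced rankings, here is a minimal sketch. The function name `pairwise_win_ratios` and the input layout (one best-to-worst list of system names per annotated segment) are assumptions, not the format used in this evaluation.

```python
from collections import Counter
from itertools import combinations


def pairwise_win_ratios(rankings):
    """Compute how often each system is ranked above each other system.

    rankings is a list of forced rankings, each a list of system names
    ordered best to worst. Returns {(x, y): share of comparisons where
    x was ranked above y}, counting only rankings that include both.
    """
    wins = Counter()
    totals = Counter()
    for ranking in rankings:
        # combinations preserves list order, so `better` always precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1
            totals[(better, worse)] += 1
            totals[(worse, better)] += 1
    return {pair: wins[pair] / total for pair, total in totals.items()}
```

Because totals are counted only when both systems appear in the same ranking, the same function works when some systems are unavailable for some language pairs.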

Outcome (human):

  • TowerZen-9B was the top system overall by average rank, closely followed by Tower-4 Sugarloaf.
  • Across all head-to-head comparisons, TowerZen-9B won ~54% of comparisons, with a 57% win rate against TransPerfect internal production models.

What This Means in Practice

Better evaluation outcomes translate into operational impact:

  • Less post-editing: higher initial quality reduces time and cost per segment 
  • Faster turnaround: fewer corrections before content delivery
  • More confidence at scale: higher-quality input makes quality estimation and automated QA more effective
  • More control than generic MT: TowerZen can be tailored to terminology, tone, and domain needs

Because TowerZen is integrated into GlobalLink, these gains can be applied directly in production localization workflows.

Looking Ahead

TowerZen-9B is our strongest translation model to date, and we’re continuing to invest in:

  • Expanded language coverage 
  • Stronger performance in specialized domains 
  • Improved compliance with complex style guide instructions 
  • Improved glossary and terminology adherence 
  • Document-level translation, which uses broader context across segments

Ongoing improvements will be guided by continued translation quality evaluation, ensuring TowerZen remains a high-performing LLM translation model for enterprise use, and by ongoing human evaluation of machine translation to validate progress across languages and domains.

To learn more about TowerZen, contact us.

About the Author

Crystal Maganzini

Global Director, AI Solutions at TransPerfect
