TowerZen: Setting New Standards in Translation Quality Through Rigorous Evaluation

AI translation quality is easy to claim but difficult to validate. This blog breaks down the evaluation framework behind TowerZen, including the datasets, benchmarking methods, and human review processes used to assess translation performance across languages, domains, and enterprise use cases.

Crystal Maganzini

Updated on May 18, 2026

Translation quality claims are easy to make and hard to verify, especially now that “general-purpose” large language models (LLMs) can produce fluent text on demand. But for global enterprises—especially in regulated industries—fluency isn’t enough. The real standard is evidence that a model preserves meaning, follows target-language conventions and style guidelines, and performs consistently across languages and domains. Meeting that standard requires structured, comprehensive translation quality evaluation. TowerZen, the latest in a family of LLMs developed by TransPerfect and acquisition partner Unbabel, was subjected to exactly that kind of evaluation. Purpose-built for translation and integrated into TransPerfect’s GlobalLink® technology platform, TowerZen sets a new standard for translation quality.

This blog details the technical evaluation behind TowerZen: how we validated quality and how we designed the evaluation and data pipeline to make the results trustworthy.

Quality Begins with Data

Artificial intelligence (AI) models are a product of training data, and the examples they learn from become the patterns they reproduce. For TowerZen, the goal wasn’t “more data”—it was quality data: the kind of examples that reflect what real clients expect in production.

Quality-filtered training data

TowerZen was trained with a custom curriculum combining supervised fine-tuning and reinforcement learning on quality-filtered datasets, enabling more consistent translation quality across a broader range of subject matters, language varieties, and industries. Examples are screened before training, so the model learns from translations that meet quality standards rather than absorbing noise (inconsistent terminology, errors, or misaligned segments), and spends more time focused on challenging examples.
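As a rough illustration of screening examples before training, the sketch below filters parallel segments with a few simple rules. It is not the TowerZen pipeline: the `SegmentPair` type, the `quality` score, and both thresholds are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentPair:
    source: str
    target: str
    quality: float  # hypothetical score in [0, 1], e.g. from a quality-estimation model


def keep_pair(pair: SegmentPair, min_quality: float = 0.8,
              max_length_ratio: float = 3.0) -> bool:
    """Return True if the pair passes the illustrative screening rules."""
    src, tgt = pair.source.strip(), pair.target.strip()
    if not src or not tgt:
        return False  # one side is empty: nothing to learn from
    if src == tgt:
        return False  # target is an untranslated copy (a simplification)
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_length_ratio:
        return False  # extreme length mismatch often signals misalignment
    return pair.quality >= min_quality  # assumed threshold, not a published value


def filter_pairs(pairs):
    """Keep only the pairs that pass every rule in keep_pair."""
    return [p for p in pairs if keep_pair(p)]
```

A real pipeline would likely use richer signals than character lengths, but the shape is the same: each example either passes every check or is excluded before the model ever sees it.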

Why that matters technically

Noisy parallel data can inflate apparent fluency while degrading adequacy and consistency. The result often shows up as an apparent “drift” from the source. The quality-first pipeline we developed helps shift learning toward:

Adequacy: meaning preservation
Fluency: grammatical correctness and idiomaticity
Locale conventions: meeting the expectations of speakers of different varieties of the same language
Consistency: terminology stability across translation projects

Building a Benchmark That Reflects Production Reality

For evaluation, we used ZBench v3, TransPerfect’s internal benchmark derived from human translations across a broad range of domains. The intent is to stress the model on what it will actually see: mixed subject matter, varied styles, and real-world segment distributions.

Automatic Evaluation at Scale

Automatic metrics are useful when you need broad coverage across languages and thousands of segments. We used COMET with the reference-based model Unbabel/wmt22-comet-da. COMET is widely used because it correlates better with human judgments than older lexical-overlap metrics such as BLEU.

What we compared

In the automatic evaluation, we compared:

TowerZen-9B
TowerZen-2B
Tower-4 Sugarloaf (previous best Tower iteration)
DeepL
TransPerfect internal production models (language-specialized)

How we aggregated across languages: Borda counts

Instead of relying only on raw score averages, we also computed Borda counts, ranking each model per language pair by COMET and then averaging ranks across languages. This helps reduce the chance that a few language pairs dominate the conclusion due to scale differences.
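The rank-then-average step can be sketched as follows. This is a minimal illustration, not our evaluation code: the function name, the input layout, and the tie-handling rule (tied systems share the mean of the ranks they span) are assumptions, and the scores in the usage example are invented.

```python
from collections import defaultdict


def average_ranks(scores):
    """scores: {language_pair: {system: comet_score}} -> {system: mean rank}.

    Rank 1 is the best system for a language pair. Tied systems share the
    mean of the ranks they span. A system missing from a language pair is
    simply not ranked for that pair.
    """
    rank_lists = defaultdict(list)
    for pair_scores in scores.values():
        # Highest score first.
        ordered = sorted(pair_scores.items(), key=lambda kv: kv[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j over the run of systems tied with position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            shared_rank = ((i + 1) + (j + 1)) / 2
            for system, _ in ordered[i:j + 1]:
                rank_lists[system].append(shared_rank)
            i = j + 1
    return {system: sum(r) / len(r) for system, r in rank_lists.items()}


# Invented scores for three placeholder systems.
example = {
    "en-de": {"A": 0.88, "B": 0.86, "C": 0.84},
    "en-ja": {"A": 0.90, "B": 0.91, "C": 0.85},
}
print(average_ranks(example))
```

Because only the ordering within each language pair matters, a pair where all systems score unusually high or low contributes exactly as much to the result as any other pair.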

Outcome (automatic): TowerZen models scored above Tower-4 Sugarloaf overall, with TowerZen-9B the top-ranked system by COMET and Borda aggregation.

Human Evaluation: Blinded and Forced-Ranking Design

Automatic metrics have limits, especially when differences are small or subjective, or when errors are subtle (word sense, negation, agreement, register). COMET also isn’t trained to distinguish between language varieties, so it can’t tell us when a French translation appropriate for France would seem unnatural in Canada. To address these limits, we ran an extensive human evaluation of machine translation designed to be comparative, blinded, and auditable.

Design: two complementary tasks

We ran all annotation in DataForce, TransPerfect’s internal evaluation platform. Annotators performed two tasks:

Absolute rating: 1 (unacceptable) to 5 (excellent)
Relative ranking: forced choice across systems, best to worst

Each task answers a different question:

Ratings capture “Is this output good enough?”

Rankings capture “Which output is better?” even when all outputs are acceptable.

Blinding and shuffling

To reduce bias, annotators never saw which system produced which output, and system labels were shuffled between tasks (e.g., “MT version A” in one task was not the same model in the next).
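One way to implement this kind of blinding is to reshuffle the label assignment for every task and keep the mapping away from annotators. The sketch below is an assumption about how that could look, not a description of DataForce; the function name and data layout are hypothetical.

```python
import random


def blind_outputs(outputs, rng):
    """outputs: {system_name: translation} for one annotation task.

    Returns (shown, key). `shown` maps anonymous labels to translations and
    is what annotators see. `key` maps each label back to its system and is
    kept hidden until analysis.
    """
    systems = list(outputs)
    rng.shuffle(systems)  # fresh random order for this task
    labels = [f"MT version {chr(ord('A') + i)}" for i in range(len(systems))]
    key = dict(zip(labels, systems))
    shown = {label: outputs[system] for label, system in key.items()}
    return shown, key


# One call per task; a seeded generator keeps the assignment reproducible.
rng = random.Random(2026)
shown, key = blind_outputs({"system-1": "Hallo Welt", "system-2": "Hallo, Welt!"}, rng)
```

Independent shuffles mean a label carries no information about the underlying system from one task to the next, although a label can still land on the same system twice by chance. Storing `key` alongside each task is what keeps the process auditable.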

Coverage and competitors

Human evaluation covered 30 language pairs, comparing up to five systems depending on availability:

TowerZen-9B
TowerZen-2B
Tower-4 Sugarloaf
DeepL
TransPerfect internal production models (language-specialized)

How We Interpreted Results

We analyzed results along two dimensions:

Ratings: absolute quality level

Most language pairs clustered between 4.0 and 4.5 average ratings, with a few lower-performing cases (e.g., JA→EN and EN→ZHHK closer to 3). The rating distribution helps identify where translations are “generally strong” versus “still variable.”

Rankings: head-to-head stability

From ranking data, we computed:

Win ratios (how often system X beats system Y)
Average rank/Borda counts (overall preference ordering)
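Win ratios can be derived from forced rankings by expanding each ranking into pairwise comparisons. The sketch below assumes each ranking is a list of system names ordered best to worst with no ties; the function name and input format are illustrative, not our analysis code.

```python
from collections import defaultdict
from itertools import combinations


def win_ratios(rankings):
    """rankings: list of lists, each ordered best to worst for one segment.

    Returns {(x, y): fraction of comparisons between x and y in which x
    ranked above y}. Only segments where both systems appear are counted.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for ranking in rankings:
        # combinations() preserves list order, so `better` always precedes `worse`.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1
            totals[(better, worse)] += 1
            totals[(worse, better)] += 1
    return {pair: wins[pair] / totals[pair] for pair in totals}


# Three invented rankings over placeholder systems.
ratios = win_ratios([["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]])
print(ratios[("A", "B")])  # share of comparisons where A ranked above B
```

Counting per pair rather than per ranking lets the same function handle language pairs where fewer than five systems were available.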

Outcome (human):

TowerZen-9B was the top system overall by average rank, closely followed by Tower-4 Sugarloaf.
Across all head-to-head comparisons, TowerZen-9B won ~54% of the time, including a 57% win rate against TransPerfect internal production models.

What This Means in Practice

Better evaluation outcomes translate into operational impact:

Less post-editing: higher initial quality reduces time and cost per segment
Faster turnaround: fewer corrections before content delivery
More confidence at scale: higher-quality input makes quality estimation and automated QA more effective
More control than generic MT: TowerZen can be tailored to terminology, tone, and domain needs

Because TowerZen is integrated into GlobalLink, these gains can be applied directly in production localization workflows.

Looking Ahead

TowerZen-9B is our strongest translation model to date, and we’re continuing to invest in:

Expanded language coverage
Stronger performance in specialized domains
Improved compliance with complex style guide instructions
Improved glossary and terminology adherence
Document-level translation, which uses broader context across segments

Ongoing improvements will be guided by continued automatic and human translation quality evaluation, so that progress is validated across languages and domains and TowerZen remains a high-performing LLM translation model for enterprise use.

To learn more about TowerZen, contact us.

About the Author

Crystal Maganzini

Global Director, AI Solutions at TransPerfect
