Consistently evaluating outputs from large language models (LLMs) across languages remains difficult, particularly at scale and in low-resource languages.

Automated evaluation methods, including LLM-as-a-judge systems, are increasingly being explored as a way to scale evaluation. More structured, multi-step methods such as Agent-as-a-Judge are also emerging, although they remain at an early stage.

Appen has now launched a Multilingual LLM-as-a-Judge (LLMaaJ) Managed Service for evaluating model outputs across languages and locales. The company cites research showing performance gaps of up to 24.3% between high- and low-resource languages, which its approach aims to address.

“Without culturally calibrated evaluation, teams ship models that perform well in English and fail silently everywhere else,” Appen said.

The service provides an Appen-hosted endpoint where clients can submit model outputs and receive structured, rubric-based assessments.

Each endpoint is configured for a specific locale, taking into account cultural nuance, idiomatic expressions, figurative language, and local communication norms. A distinctive feature of the offering is the integration of locale-specific trusted sources, curated by human experts, which the system uses to ground its evaluations in each market.

Delivered as a fully managed service, the offering includes model selection and configuration, prompt design aligned with evaluation rubrics, integration of search tools and data sources, and ongoing performance monitoring. Clients access the system through a single endpoint while Appen manages the underlying infrastructure, so they do not need to maintain prompts, models, or calibration infrastructure internally.

The workflow begins with calibration using a client-provided set of human-annotated examples. Appen refines prompts, tests candidate judge models, and tunes parameters until the system reaches more than 90% agreement with human-annotated ground truth.
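To make the calibration target concrete, the check can be pictured as a comparison between judge labels and human labels on the same examples. The sketch below is illustrative only: Appen has not published how it measures agreement, so simple percent agreement is an assumption here, and the function names and label values are hypothetical. Only the 90% threshold comes from the announcement.

```python
from typing import Sequence

# Threshold quoted in the announcement; everything else in this sketch is assumed.
AGREEMENT_TARGET = 0.90


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Return the fraction of examples where the judge label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("need at least one annotated example")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def meets_target(
    judge_labels: Sequence[str],
    human_labels: Sequence[str],
    target: float = AGREEMENT_TARGET,
) -> bool:
    """True only when agreement is strictly above the target ("more than 90%")."""
    return agreement_rate(judge_labels, human_labels) > target
```

In this reading, a candidate judge that matches human annotators on exactly 9 of 10 examples would not yet clear the bar, since the stated goal is more than 90% agreement. A production setup might prefer a chance-corrected statistic instead of raw agreement; the announcement does not say which is used.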

Once deployed, the service includes ongoing quality checks, combining weekly human review with a system that identifies low-confidence evaluations and routes them to expert reviewers. “Under this managed service model, expert human review is applied in targeted, high-impact areas and edge cases,” Appen said.
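The routing step described above can be sketched as a simple threshold on a per-evaluation confidence value. This is a minimal illustration under stated assumptions, not Appen's implementation: the `Evaluation` fields, the `route_for_review` function, and the 0.7 cutoff are all hypothetical, and the announcement does not describe how confidence is actually computed.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class Evaluation:
    item_id: str
    score: int          # rubric score assigned by the judge (hypothetical scale)
    confidence: float   # assumed confidence estimate between 0.0 and 1.0


def route_for_review(
    evaluations: Iterable[Evaluation],
    threshold: float = 0.7,  # illustrative cutoff, not a published value
) -> Tuple[List[Evaluation], List[Evaluation]]:
    """Split evaluations into (accepted, needs_expert_review) by confidence."""
    accepted: List[Evaluation] = []
    needs_review: List[Evaluation] = []
    for ev in evaluations:
        if ev.confidence < threshold:
            needs_review.append(ev)  # low confidence: send to a human expert
        else:
            accepted.append(ev)      # high confidence: keep the automated result
    return accepted, needs_review
```

The design choice this illustrates is the one Appen describes: human attention is spent on the uncertain cases and edge cases, while confident automated judgments pass through without manual review.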

The launch comes as companies look to scale evaluation of AI systems across languages, where quality can vary significantly and traditional human review does not easily scale.

According to Appen, the service builds on its multilingual data experience, including coverage across more than 500 languages and a contributor network spanning over 100 countries.