
About LLM Evaluation

Evaluating Large Language Models is notoriously difficult. Unlike traditional software, "correctness" is often subjective. This guide explores the three main pillars of evaluation: lexical overlap, semantic embeddings, and model-based judging.

1) Lexical Overlap Metrics

Traditional NLP metrics rely on counting n-gram overlaps between a candidate response and a reference. While they lack semantic understanding, they remain an industry standard because they are fast, cheap, and reproducible.

BLEU (2002)

Bilingual Evaluation Understudy

BLEU measures how much a candidate text overlaps with a reference text. Original Paper

General Questions

  • What was it created for?
    To measure the quality of machine translation.
  • What does it measure?
    It measures how much a candidate (generated) text overlaps with one or more references (there are often several valid ways to translate something); more specifically, n-gram overlap. BLEU uses clipped counts, meaning a word is not counted more times than its maximum occurrence in any single reference (e.g., if "the" appears seven times in the candidate but at most twice in a reference, only two occurrences count).
  • What is an n-gram?
    A sequence of n words. E.g., 2-gram of "I am a cat" is ["I am", "am a", "a cat"].
  • Precision vs Recall?
    BLEU is precision-oriented: it measures how many words in the candidate appear in the reference, not how many words of the reference are covered.

Computation Details

  • How is it computed?
    You need to define a maximum n-gram order, 4 by default. Too low a maximum looks only at individual words and gives too much weight to single filler words, while too high a maximum over-penalizes differences in structure. BLEU uses a geometric mean: it computes 1-gram precision, 2-gram precision, and so on up to the maximum order, then combines the results (see the formula below).
  • What is the Brevity Penalty (BP) and why is it needed:
    If the candidate is very short and its words all appear in the references, precision can be artificially high. The BP penalizes candidates that are shorter than the reference: if the candidate is longer than the reference then BP = 1, otherwise BP < 1.
  • Smoothing:
    Not present in the original paper. If any n-gram precision is zero, the geometric mean (and thus BLEU) collapses to zero; smoothing prevents this.
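
For reference, the standard formulation from the original paper combines these pieces as follows:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad BP = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$

where $p_n$ is the clipped n-gram precision, $w_n = 1/N$ (uniform weights, $N = 4$ by default), $c$ is the candidate length, and $r$ is the (closest) reference length.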

Interpretation

  • Limitations:
    It compares exact words, ignoring semantic similarity or synonyms.
  • Best use-case:
    Evaluating translations. It is less suited to summarization or dialogue, where many valid outputs exist and word overlap with a single reference can be low.

ROUGE (2004)

Recall-Oriented Understudy for Gisting Evaluation

ROUGE measures how much of the reference text is covered by the candidate text. Original Paper

General Questions

  • What was it created for?
    To measure the quality of automatically generated summaries vs human-written summaries.
  • What does it measure?
    It measures how many words within the references appear in the candidate (generated) text. More specifically, it measures the n-gram overlap.
  • Precision vs Recall?
    It is considered a recall metric because it measures how many grams of the reference appear in the candidate, not the other way around. While precision and F1 variants exist, recall is most commonly used.

Computation Details

  • What are ROUGE-1 and ROUGE-2?
    ROUGE-1 measures how many unigrams from the reference appear in the candidate, divided by the total unigrams in the reference. ROUGE-2 follows the same logic but uses bigrams. If multiple references exist, the maximum score is typically taken.
  • What is ROUGE-L?
    It is based on the Longest Common Subsequence (LCS): the longest sequence of words that appears in both the reference and the candidate in the same order, though not necessarily contiguously (formulas below).
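
For a single reference, the standard definitions are:

$$\text{ROUGE-N} = \frac{\sum_{g_n \in \text{ref}} \min\big(\text{Count}_{\text{cand}}(g_n),\ \text{Count}_{\text{ref}}(g_n)\big)}{\sum_{g_n \in \text{ref}} \text{Count}_{\text{ref}}(g_n)}$$

$$P_{lcs} = \frac{\text{LCS}(C, R)}{|C|}, \qquad R_{lcs} = \frac{\text{LCS}(C, R)}{|R|}, \qquad \text{ROUGE-L} = \frac{(1 + \beta^2)\, P_{lcs}\, R_{lcs}}{R_{lcs} + \beta^2 P_{lcs}}$$

where $C$ is the candidate, $R$ the reference, and $\beta$ controls the weight of recall relative to precision (the implementation below uses $\beta = 1$, i.e. plain F1).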

Interpretation

  • What do the scores indicate?
    High ROUGE-1 indicates good content coverage. High ROUGE-2 indicates similar phrasing, and high ROUGE-L indicates a similar overall structure.
  • Best use-case:
    Evaluating summaries. It can also be applied to translation or dialogue systems.
  • Limitations:
    It relies on exact matches, ignoring synonyms and semantic meaning. A high ROUGE score is possible even with poor readability if the candidate strings together many reference words without coherent meaning or logic.

METEOR (2005)

Metric for Evaluation of Translation with Explicit ORdering

METEOR improves upon BLEU by incorporating recall and linguistic flexibility like synonyms and stemming. Original Paper

General Questions

  • What was it created for?
    To measure the quality of machine translations while fixing BLEU’s limitations, specifically its lack of recall and synonym handling.
  • What does it measure?
    It measures the harmonic mean (F-score) of precision and recall. It looks for matches across several levels: exact word matches, stemmed forms, and synonyms.
  • Why is it considered more balanced?
    Unlike BLEU (precision-focused) or ROUGE (recall-focused), METEOR combines both and allows for linguistic flexibility, making it more semantically aware.

Computation Details

  • How is it computed?
    It aligns words between the candidate and reference, computes precision and recall to form an F-score, and then applies a fragmentation penalty. The final score is $F_{mean} \times (1 - \text{Penalty})$ (full formulas below).
  • What is the Fragmentation Penalty?
    It penalizes disordered or scattered matches. If matching words are in a different order than the reference, more "chunks" are formed, which increases the penalty and lowers the score.
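
For reference, with $m$ matched unigrams the original paper defines:

$$P = \frac{m}{\#\text{unigrams in candidate}}, \qquad R = \frac{m}{\#\text{unigrams in reference}}, \qquad F_{mean} = \frac{10\,P\,R}{R + 9\,P}$$

$$\text{Penalty} = 0.5\left(\frac{\#\text{chunks}}{m}\right)^{3}, \qquad \text{METEOR} = F_{mean}\,(1 - \text{Penalty})$$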

Interpretation

  • What does a high score indicate?
    Strong overlap in content and correct word order. It suggests the output uses similar words, stems, or synonyms compared to the reference.
  • Best use-case:
    Primary use in Machine Translation, though it is also effective for summarization tasks.
  • Limitations:
    Requires specific linguistic resources (stemmers, synonym databases), is computationally slower, and can still miss very deep semantic paraphrases.

Implementation and Example

We want to compare the following two sentences:

CANDIDATE

"The quick brown fox leaps over the lazy dog"

REFERENCE

"A fast brown fox jumps over the lazy dog"

Python implementation
            
import math
from collections import Counter
import re

#for meteor (NLTK's implementation relies on WordNet)
import nltk
from nltk.translate.meteor_score import meteor_score
nltk.download('wordnet')
nltk.download('omw-1.4')

#to avoid case sensitive comparison and punctuation differences
def normalize(text):
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)  #strip punctuation
    return text.split()

#n-grams computation
def ngrams(tokens, n):
    return [tuple(tokens[i:i+n]) for i in range(len(tokens)-n+1)]

def bleu_score(candidate, references, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(ngrams(candidate, n))
        
        # max reference count for each n-gram
        max_ref_counts = Counter()
        for ref in references:
            ref_counts = Counter(ngrams(ref, n))
            for ng in ref_counts:
                max_ref_counts[ng] = max(max_ref_counts[ng], ref_counts[ng])
        
        match_count = sum(min(c, max_ref_counts[ng]) for ng, c in cand_ngrams.items())
        total_count = sum(cand_ngrams.values())
        precision = match_count / total_count if total_count > 0 else 0
        
        #smoothing to avoid log(0)
        if precision == 0:
            precision = 1e-9 
            
        precisions.append(precision)
    
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    else:
        geo_mean = 0.0

    # choose reference length closest to candidate length
    cand_len = len(candidate)
    ref_lens = [len(r) for r in references]
    ref_len = min(ref_lens, key=lambda r: abs(r - cand_len))

    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    bleu = brevity_penalty * geo_mean
    return bleu

#rouge
def rouge_n(candidate, references, n=1):
    """Compute ROUGE-N recall."""
    cand_ngrams = Counter(ngrams(candidate, n))
    
    # for multi-reference, take the maximum recall
    best_recall = 0.0
    for ref in references:
        ref_ngrams = Counter(ngrams(ref, n))
        match_count = sum(min(cand_ngrams[ng], ref_ngrams[ng]) for ng in ref_ngrams)
        total_ref_ngrams = sum(ref_ngrams.values())
        recall = match_count / total_ref_ngrams if total_ref_ngrams > 0 else 0
        best_recall = max(best_recall, recall)
    return best_recall

def lcs_length(x, y):
    """Compute length of the Longest Common Subsequence (LCS)."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if x[i] == y[j]:
                dp[i+1][j+1] = dp[i][j] + 1
            else:
                dp[i+1][j+1] = max(dp[i][j+1], dp[i+1][j])
    return dp[m][n]

def rouge_l(candidate, references, beta=1.0):
    """Compute ROUGE-L F1 (based on LCS)."""
    best_f1 = 0.0
    for ref in references:
        lcs = lcs_length(candidate, ref)
        prec = lcs / len(candidate) if candidate else 0
        rec = lcs / len(ref) if ref else 0
        if prec + rec == 0:
            f1 = 0
        else:
            f1 = ((1 + beta**2) * prec * rec) / (rec + beta**2 * prec)
        best_f1 = max(best_f1, f1)
    return best_f1

#--- CODE EXECUTION ---
# Define candidate and reference sentences 
candidate = normalize("The quick brown fox leaps over the lazy dog") 
reference = normalize("A fast brown fox jumps over the lazy dog")

# bleu
bleu = bleu_score(candidate, [reference])
print(f"BLEU: {bleu:.2f}")

# rouge
rouge1 = rouge_n(candidate, [reference], n=1)
rouge2 = rouge_n(candidate, [reference], n=2)
rougel = rouge_l(candidate, [reference])

print(f"ROUGE-1 (recall): {rouge1:.2f}")
print(f"ROUGE-2 (recall): {rouge2:.2f}")
print(f"ROUGE-L (F1): {rougel:.2f}")

# meteor
score = meteor_score([reference], candidate)
print(f"METEOR: {score:.2f}")

            
METRIC RESULTS
BLEU 0.35
ROUGE-1 (Recall) 0.67
ROUGE-2 (Recall) 0.50
ROUGE-L (F1) 0.67
METEOR 0.89

Interpretation

The high METEOR (0.89) score indicates that while the exact wording differs (e.g., "fast" vs "quick"), the semantic meaning is well-preserved. The lower BLEU (0.35) reflects the penalty for n-gram mismatches caused by these synonyms.

2) Semantic Embedding Metrics

Metrics like BERTScore and BLEURT move beyond exact word matches by using contextual embeddings to measure meaning.

BERTScore (2019)

Bidirectional Encoder Representations from Transformers Score

BERTScore captures semantic similarity through contextual embeddings, recognizing paraphrases that lexical metrics often miss. Original Paper

General Questions

  • What was it created to address?
    To overcome the limitations of lexical overlap (like BLEU/ROUGE), which fail to recognize semantic equivalence or synonyms. BERTScore measures the quality of generated text by focusing on semantic meaning using pre-trained models like BERT.
  • What kind of linguistic relationship can BERTScore recognize that BLEU cannot?
    It can recognize semantic equivalence even when surface-level words differ significantly (e.g., matching "feline" to "cat" or "rug" to "mat").
  • What does it measure?
    It computes Precision (how well the generated text tokens are supported by the reference), Recall (how well the reference's information is covered by the generated text), and the F1 score (harmonic mean of precision and recall).

Computation Details

  • How does BERTScore represent the generated and reference texts mathematically?
    Texts are tokenized and converted into high-dimensional vectors (embeddings) using a pre-trained contextual model like BERT or RoBERTa.
  • Similarity Function:
    The similarity between individual token vectors is calculated using cosine similarity.
  • How is it computed?
    BERTScore uses greedy token-wise matching. For precision, it loops through every token in the candidate and finds the reference token with the maximum cosine similarity; these maxima are then averaged. Recall is the mirror image: for each token in the reference, it takes the highest cosine similarity against the candidate tokens and averages those (see the sketch below).
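
To make the greedy matching concrete, here is a minimal sketch using toy 3-dimensional vectors in place of real contextual embeddings (the numbers are purely illustrative; in practice the vectors come from BERT/RoBERTa hidden states and may additionally be IDF-weighted):

import numpy as np

# Toy vectors standing in for contextual token embeddings
cand_emb = np.array([[0.9, 0.1, 0.0],   # e.g. "leaps"
                     [0.2, 0.8, 0.1],   # e.g. "over"
                     [0.1, 0.1, 0.9]])  # e.g. "dog"
ref_emb = np.array([[0.8, 0.2, 0.1],    # e.g. "jumps"
                    [0.2, 0.9, 0.0],    # e.g. "over"
                    [0.0, 0.2, 0.9]])   # e.g. "dog"

def cosine_matrix(a, b):
    """Pairwise cosine similarity: sim[i, j] = cos(a_i, b_j)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

sim = cosine_matrix(cand_emb, ref_emb)

precision = sim.max(axis=1).mean()  # each candidate token -> best reference token
recall = sim.max(axis=0).mean()     # each reference token -> best candidate token
f1 = 2 * precision * recall / (precision + recall)

print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")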

Interpretation

  • Limitations:
    Subject to model biases and lower interpretability. It may also reward hallucinations if they are semantically close to the target, even if factually wrong.
  • Best use-case:
    Evaluating text summaries and complex natural language generation where paraphrasing is encouraged.

BLEURT (2020)

Bilingual Evaluation Understudy with Representations from Transformers

BLEURT is a learned metric that leverages pre-training and fine-tuning to align its scores directly with human qualitative judgment. Original Paper

General Questions

  • What was BLEURT created to address?
    It was created to address the poor correlation between lexical overlap metrics (like BLEU/ROUGE) and human judgments. While BERTScore improved semantic recognition, BLEURT aims to directly learn the complex criteria used by humans to assign quality scores.
  • What is the core technological foundation?
    BLEURT is a learned evaluation metric based on the BERT architecture, trained as a regression model.
  • What does BLEURT output?
    It outputs a single, continuous, real-valued score that directly predicts the level of quality a human evaluator would assign to the candidate sentence relative to the reference.

Computation Details

  • How is BLEURT computed?
    BLEURT is first pre-trained on millions of synthetic sentence pairs, typically Wikipedia sentences with random perturbations (deletions, substitutions, etc.); a toy illustration follows this list. The model is then fine-tuned on human-rated data so that it learns to predict quality scores.
  • What is the benefit of training on synthetic data?
    It exposes the model to small errors, making the model more robust.
  • How is the score computed?
    The candidate and reference are given to the trained model, which outputs a single score representing how close the candidate is to the reference.
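
As a toy illustration only (BLEURT's actual pipeline uses mask-filling, back-translation, and word dropping at much larger scale), synthetic (reference, candidate) pairs can be produced by randomly perturbing a sentence:

import random

random.seed(0)

def perturb(sentence, swap_vocab=("model", "system", "method")):
    """Randomly delete or substitute one word to create a noisy candidate."""
    tokens = sentence.split()
    i = random.randrange(len(tokens))
    if random.random() < 0.5:
        del tokens[i]                          # deletion
    else:
        tokens[i] = random.choice(swap_vocab)  # substitution
    return " ".join(tokens)

original = "The quick brown fox jumps over the lazy dog"
print(original)
print(perturb(original))  # (reference, perturbed) pairs serve as pre-training examples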

Interpretation

  • What does a high score mean?
    A high BLEURT score means that the candidate is close in quality to the reference according to human judgment.
  • Advantage over BERTScore:
    BLEURT is trained on human-labeled data which aligns better with human judgment compared to BERTScore.
  • Limitations:
    BLEURT quality depends on the quality of the underlying data used for training (e.g., human-labeled data can be domain-specific).

Implementation and Example

We take the same two sentences as before:

CANDIDATE

"The quick brown fox leaps over the lazy dog"

REFERENCE

"A fast brown fox jumps over the lazy dog"

Python implementation
            
from bert_score import score
import torch
from bleurt_pytorch import BleurtConfig, BleurtForSequenceClassification, BleurtTokenizer

config = BleurtConfig.from_pretrained('lucadiliello/BLEURT-20-D12')
model = BleurtForSequenceClassification.from_pretrained('lucadiliello/BLEURT-20-D12')
tokenizer = BleurtTokenizer.from_pretrained('lucadiliello/BLEURT-20-D12')

candidate = ["The quick brown fox leaps over the lazy dog"]
reference = ["A fast brown fox jumps over the lazy dog"]

#BERTScore
P, R, F1 = score(candidate, reference, lang="en")
print(f"BERTScore F1: {F1[0]:.4f}")

#BLEURT
model.eval()
with torch.no_grad():
    inputs = tokenizer(reference, candidate, padding='longest', return_tensors='pt')
    res = model(**inputs).logits.flatten().tolist()
print(f'BLEURT: {res[0]}')

            
METRIC RESULTS
BERTScore 0.99
BLEURT 0.70

Interpretation

The high BERTScore (0.99) indicates that the two sentences describe the same event with essentially no loss of information. BLEURT (0.70), which is trained on human judgments, shows that even though the two sentences are very close, lexical variations (likely "quick" vs "fast" or "leaps" vs "jumps") still set them apart.

3) LLM-as-a-judge

The LLM-as-a-Judge Paradigm

Evaluating model outputs traditionally requires a human-written reference, which is both time-consuming and rigid. In many cases, a model's response might be superior to the reference, or multiple "correct" answers may exist.

LLM-as-a-judge shifts this evaluation to a more flexible, model-driven approach. To test this, I compared the summarization capabilities of Gemini 2.0 Flash Lite and Gemini 2.5 Pro by generating summaries for 10 distinct Wikipedia pages.

Data Acquisition: Generating Summaries

from google import genai
from google.genai import types
import vertexai
from vertexai.generative_models import GenerativeModel
from vertexai.generative_models import GenerationConfig, SafetySetting

import pandas as pd
import time

import json
from concurrent.futures import ThreadPoolExecutor, as_completed

# LLM config
system_instructions = """
    You are an expert at summarizing a web page. 
    You will receive a Wikipedia URL and you will summarize the topic in exactly 500 words.
"""

TEMPERATURE = 0

client = genai.Client(vertexai=True, project='datalake--dev', location='europe-west4')


response_schema = {
    "type": "OBJECT",
    "properties": {
        "summary": {"type": "STRING"},
    },
    "required": ["summary"],
}

CONFIG_SUMMARIZE = types.GenerateContentConfig(
    system_instruction=system_instructions,
    temperature=TEMPERATURE,
    response_mime_type="application/json", 
    response_schema=response_schema
)


#Function that summarizes a url
def summarize(url, llm_model_name, config):
    response = client.models.generate_content(
        model=llm_model_name,
        contents=url,
        config=config,
    )
    response_json = json.loads(response.text)

    return response_json.get("summary", "")


#Function that creates a dataframe with a summary from both models for each url
def generate_dataset(model_a, model_b, urls):
    data_list = []
    for url in urls:
        summary_a = summarize(url, model_a, CONFIG_SUMMARIZE)
        summary_b = summarize(url, model_b, CONFIG_SUMMARIZE)
        
        row_data = {
            "url": url,
            "summary_a": summary_a,
            "summary_b": summary_b
        }
        
        data_list.append(row_data)
        
    df = pd.DataFrame(data_list)
    
    return df


#Define models, urls and run the script
MODEL_A = "gemini-2.0-flash-lite"
MODEL_B = "gemini-2.5-pro"


URLS = ['https://en.wikipedia.org/wiki/Machine_learning', 
        'https://en.wikipedia.org/wiki/Generative_artificial_intelligence',
        'https://en.wikipedia.org/wiki/Large_language_model',
        'https://en.wikipedia.org/wiki/Prompt_engineering',
        'https://en.wikipedia.org/wiki/Transformer_(deep_learning)',
        'https://en.wikipedia.org/wiki/Attention_(machine_learning)',
        'https://en.wikipedia.org/wiki/LLM-as-a-Judge',
        'https://en.wikipedia.org/wiki/Reinforcement_learning',
        'https://en.wikipedia.org/wiki/Markov_decision_process',
        'https://en.wikipedia.org/wiki/Generative_adversarial_network'
       ]


df = generate_dataset(MODEL_A, MODEL_B, URLS)

Evaluation Methodology

I deployed five LLM judges to evaluate the summary pairs. Each judge selected a winner based on three core dimensions:

  • 01.

    Factual Accuracy: Does the summary remain truthful to the source material?

  • 02.

    Readability: Is the text coherent, grammatically sound, and naturally paced?

  • 03.

    Conciseness: Does it provide the maximum information density with minimal fluff?

The final "Winner" for each Wikipedia page is determined by a majority vote across all five judges.

Implementation: LLM-as-a-Judge Logic

from enum import Enum
from pydantic import BaseModel, Field

#response schema
class SummaryChoice(str, Enum):
    summary_a = "summary_a"
    summary_b = "summary_b"

class OutputSchema(BaseModel):
    summary_choice: SummaryChoice = Field(description="Best Summary")
    rationale: str = Field(description="Concise rationale")


#evaluate the summaries: series of LLMs that evaluate the best summary
client = genai.Client(vertexai=True, project='datalake--dev', location='europe-west4')

system_instructions = """
    You are a highly efficient Comparative Summarization Analyst. Your sole task is to determine which of two provided texts, **Summary A** or **Summary B**, is superior.
    
    You must compare them based on the following criteria:
        1.  **Factual Accuracy:** Which summary is more accurate? 
        2.  **Readability:** Which summary is grammatically correct, coherent, and flows more naturally?
        3.  **Conciseness:** Which summary is shorter while still retaining all essential information?

    Your output must only contain the chosen winner and a concise rationale for that choice.
    """

TEMPERATURE = 0


CONFIG_JUDGE = types.GenerateContentConfig(
    system_instruction=system_instructions,
    temperature=TEMPERATURE,
    response_mime_type="application/json", 
    response_schema=OutputSchema.model_json_schema()
)

def judge(summary_a, summary_b, model, config):
    
    content = f"""
        Summary A: {summary_a}
        
        Summary B: {summary_b}
    """
    
    response = client.models.generate_content(
        model=model,
        contents=content,
        config=config,
    )
    response_json = json.loads(response.text)

    return response_json.get("summary_choice", ""), response_json.get("rationale", "")

JUDGE_MODELS = ["gemini-2.0-flash-lite", 
                "gemini-2.0-flash", 
                "gemini-2.5-flash-lite", 
                "gemini-2.5-flash", 
                "gemini-2.5-pro"
               ]

data_list = []
for i, row in df.iterrows():
    for judge_model in JUDGE_MODELS:
        sc, r = judge(row['summary_a'], row['summary_b'], judge_model, CONFIG_JUDGE)
    
        row_data = {
            "url": row['url'],
            "judge_model": judge_model,
            "summary_choice": sc,
            "rationale": r
        }

        data_list.append(row_data)

df_judge = pd.DataFrame(data_list)
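
To complete the pipeline, here is a minimal sketch of the majority-vote aggregation described earlier; it assumes the df_judge frame and the MODEL_A / MODEL_B constants defined above:

# Majority vote: for each URL, the summary picked by most of the five judges wins.
winners = (
    df_judge.groupby("url")["summary_choice"]
    .agg(lambda votes: votes.value_counts().idxmax())
    .rename("winner")
    .reset_index()
)

# Map the winning column back to the model that produced it.
winners["winning_model"] = winners["winner"].map(
    {"summary_a": MODEL_A, "summary_b": MODEL_B}
)

print(winners[["url", "winning_model"]])
print(winners["winning_model"].value_counts())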

Experimental Results

Key Finding

Gemini 2.5 Pro consistently outperformed Gemini 2.0 Flash Lite, winning on 9 of the 10 Wikipedia pages and demonstrating a stronger grasp of nuance and structure.

Self-Correction/Bias

A potential "self-preference" bias exists: the judge pool included the same models being tested. This is a common challenge in LLM-based evaluation.

The primary advantage of this approach is its scalability and flexibility. We can define bespoke criteria (e.g., tone, safety, or formatting) that traditional metrics like ROUGE or BLEU cannot capture.

However, the lack of reproducibility remains a hurdle. By using an ensemble of five judges and taking a majority vote, we mitigated individual model variance and reached a more stable consensus.