Technical Metrics for GenAI for Text
Evaluating Text Generation: A Look Beyond the BLEU Metric

While evaluating the quality of machine-generated text is crucial, there is no single perfect metric. This article explores various metrics used for Generative AI (GenAI) for text, going beyond the popular BLEU score.

BLEU
BLEU (BiLingual Evaluation Understudy) assesses n-gram (sequence of n words) overlap between the generated text and a set of reference translations. It works well for comparing machine translation (MT) systems, but it has shortcomings: it rewards only exact surface overlap, so legitimate paraphrases and synonyms are penalized, and a high BLEU score does not by itself guarantee fluency or adequacy.

GLEU
GLEU (Google-BLEU) addresses some of BLEU's limitations when scoring individual sentences: it considers both the precision and the recall of matching n-grams and takes the minimum of the two, which makes it more stable than BLEU at the sentence level.

METEOR
METEOR (Metric for Evaluation of Translation with Explicit ORdering) builds on n-gram matching by also crediting stems (words reduced to their root form) and synonyms, and by applying a word-order (fragmentation) penalty, giving a more nuanced evaluation than BLEU.

Perplexity
Perplexity measures how well a language model predicts the next word in a sequence; it is the exponential of the average negative log-probability the model assigns to each token. Lower perplexity indicates better prediction capability. However, perplexity does not directly assess semantic coherence or fluency.

ROUGE
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) family of metrics (ROUGE-N, ROUGE-L, ROUGE-S, ROUGE-W) focuses on recall: the proportion of n-grams (or, for some variants, longest common subsequences and skip-bigrams) in the reference text that also appear in the generated text.

Other Metrics

CIDEr (Consensus-based Image Description Evaluation): originally designed for image captioning, CIDEr compares TF-IDF-weighted n-grams of the generated description against a set of reference descriptions, rewarding content that the reference annotators agree on.

MoverScore: MoverScore assesses semantic similarity between generated text and reference text using contextualized embeddings from a pre-trained model combined with an Earth Mover's Distance formulation. It focuses on capturing the overall meaning rather than exact word overlap.

BERTScore: BERTScore leverages pre-trained BERT embeddings to match candidate and reference tokens by cosine similarity, yielding precision, recall, and F1 scores that reflect semantic rather than purely lexical overlap.

Choosing the Right Metric

The right metric depends on the use case and task. Here are some considerations (minimal code sketches for the most common metrics follow at the end of this article):

Machine Translation (MT): BLEU, GLEU, METEOR, and ROUGE are commonly used.
Text Summarization: ROUGE metrics are often preferred.
Creative or Open-Ended Text Generation: embedding-based metrics such as MoverScore or BERTScore may be more suitable, as they assess semantic similarity rather than exact word overlap.
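As a minimal, hands-on sketch of the overlap-based metrics above, the snippet below computes sentence-level BLEU and ROUGE for a toy candidate/reference pair. It assumes the nltk and rouge-score Python packages are installed; the example strings and the smoothing choice are illustrative, not prescriptive.

```python
# A minimal sketch of n-gram overlap metrics with commonly used Python packages
# (nltk and rouge-score); exact scores depend on tokenization and settings.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is sitting on the mat"

# BLEU: n-gram precision against one or more references; smoothing keeps the
# score from collapsing to zero when no 4-gram matches exist.
bleu = sentence_bleu(
    [reference.split()],          # list of tokenized references
    candidate.split(),            # tokenized candidate
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap; ROUGE-1 counts unigrams, ROUGE-L uses the
# longest common subsequence between reference and candidate.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 recall: {rouge['rouge1'].recall:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```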
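Perplexity can be illustrated with a few lines of arithmetic. The per-token probabilities below are made-up values standing in for what a real language model would assign; the formula itself, the exponential of the average negative log-probability, is the standard definition.

```python
# A minimal sketch of perplexity from per-token probabilities assigned by a
# language model. The probabilities are hypothetical illustrative values; in
# practice they come from the model's softmax over the vocabulary.
import math

token_probs = [0.20, 0.05, 0.40, 0.10]  # p(token_i | preceding tokens), hypothetical

# PPL = exp(-(1/N) * sum(log p_i)); lower is better.
avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_prob)

print(f"Perplexity: {perplexity:.2f}")  # ~7.1 for these toy probabilities
```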
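For the embedding-based metrics, the bert-score package offers a convenient entry point. The sketch below assumes that package is installed (it downloads a pretrained model on first use) and reports the token-matching F1 between a candidate and a reference.

```python
# A minimal sketch of embedding-based scoring with the bert-score package.
from bert_score import score

candidates = ["the cat is sitting on the mat"]
references = ["the cat sat on the mat"]

# Returns precision, recall, and F1 computed from cosine similarity of
# contextual token embeddings, one value per candidate/reference pair.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")
```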