How to do an NLG Evaluation: Metrics

Ehud Reiter's Blog

A metric-based evaluation gives an NLG system a score by computing how similar its output text is to “gold-standard” reference texts. There are a number of different metrics (including BLEU, METEOR, and ROUGE), each based on a different scoring function.
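To make the idea concrete, here is a minimal sketch of the kind of scoring function such metrics build on: the fraction of n-grams in the system output that also appear in a reference text, with counts clipped so that repeating a word cannot inflate the score. This is an illustration only, not BLEU, METEOR, or ROUGE themselves; the function names, whitespace tokenisation, and lowercasing are my own simplifying assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in a reference.

    Hypothetical helper for illustration. Each n-gram is credited at most
    as many times as it appears in the single reference that contains it
    most often. Tokenisation is a naive lowercase whitespace split.
    """
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    if not cand_counts:
        return 0.0

    # Highest count of each n-gram across all reference texts.
    max_ref = Counter()
    for ref in references:
        ref_counts = Counter(ngrams(ref.lower().split(), n))
        for gram, count in ref_counts.items():
            max_ref[gram] = max(max_ref[gram], count)

    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return matched / sum(cand_counts.values())


if __name__ == "__main__":
    system_output = "the cat sat on the mat"
    gold = ["the cat is on the mat"]
    print(clipped_precision(system_output, gold, n=1))
    print(clipped_precision(system_output, gold, n=2))
```

Note that a score like this only measures surface overlap with the reference texts; it says nothing directly about whether the output is accurate or useful to readers.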

I am not a great fan of metric-based evaluation, for reasons I explain below, and I would be very dubious if, for example, I were asked to review an NLG paper that presented only a metric-based evaluation. Nevertheless, I also give some advice below on best practice for such evaluations.

Why I am Dubious About Metric-Based Evaluation

I have written about this in other blog entries, including “Evaluation in Medicine and NLG/NLP” and “Types of NLG Evaluation: Which is Right for Me?”. But needless to say, I won't pass up an opportunity to express my views once more…

Evaluation is a form of hypothesis testing…

