A metric-based evaluation gives an NLG system a score by computing how similar its output texts are to “gold-standard” reference texts. There are a number of different metrics (including BLEU, METEOR, and ROUGE), each based on a different scoring function.
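To make "computing how similar" concrete, here is a minimal sketch of one of the simplest overlap scores, clipped unigram precision. This is not BLEU, METEOR, or ROUGE themselves; BLEU, for instance, combines clipped precisions over several n-gram lengths and adds a brevity penalty. The function name, the whitespace tokenisation, and the lowercasing are all assumptions made for illustration.

```python
from collections import Counter


def clipped_unigram_precision(candidate, references):
    """Fraction of candidate words that also appear in a reference.

    Hypothetical helper for illustration only. Each candidate word is
    credited at most as many times as it occurs in any single reference
    ("clipping"), so repeating a common word does not inflate the score.
    Tokenisation is naive: lowercase, split on whitespace.
    """
    cand_tokens = candidate.lower().split()
    if not cand_tokens:
        return 0.0

    cand_counts = Counter(cand_tokens)

    # For each word, the largest count seen in any one reference text.
    max_ref_counts = Counter()
    for ref in references:
        for tok, n in Counter(ref.lower().split()).items():
            max_ref_counts[tok] = max(max_ref_counts[tok], n)

    # Credit each candidate word up to its clipped reference count.
    matched = sum(min(n, max_ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / len(cand_tokens)
```

A candidate identical to a reference scores 1.0, while a candidate that just repeats one reference word is penalised by the clipping. Note that the score only measures word overlap with the references; it says nothing directly about whether the text is accurate or useful to a reader.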
I am not a great fan of metric-based evaluation, for reasons I explain below, and would be very dubious if, for example, I were asked to review a paper on NLG which presented only a metric-based evaluation. Nevertheless, I also give some advice below on best practice for such evaluations.
Why I am Dubious About Metric-Based Evaluation
I have written about this in other blog entries, including Evaluation in Medicine and NLG/NLP and Types of NLG Evaluation: Which is Right for Me?. But needless to say, I won't pass up an opportunity to express my views once more…
Evaluation is a form of hypothesis testing…