Seven recommendations for machine translation evaluation
Evaluating machine translation systems is not as straightforward as it seems at first glance.
In MT research, there is a long-standing tradition of using old data sets even when newer versions are available for the same language pair and domain.
In recent blog posts I have described many potential issues with MT evaluation. Surely statistical significance testing should help mitigate some of those problems? That may seem reasonable, but the truth is: it can be laughably easy to arrive at results that are statistically significant according to the most popular test in MT research, bootstrap resampling.
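For context, the test in question is the paired bootstrap resampling procedure of Koehn (2004): resample the test set with replacement many times, score both systems on each resample, and count how often one beats the other. Here is a minimal sketch, assuming sentence-aligned lists `hyps_a`, `hyps_b` and `refs` and using sacrebleu; the names and the toy interpretation at the end are illustrative, not taken from any particular experiment:

```python
import random
from sacrebleu.metrics import BLEU

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=42):
    """Fraction of resampled test sets on which system A beats system B (corpus BLEU)."""
    bleu = BLEU()
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        # draw a test set of the same size by sampling sentence indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = bleu.corpus_score([hyps_a[i] for i in idx], [[refs[i] for i in idx]]).score
        score_b = bleu.corpus_score([hyps_b[i] for i in idx], [[refs[i] for i in idx]]).score
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples

# If this returns e.g. 0.96, system A wins on 96% of resamples, which is
# conventionally read as a significant difference at p < 0.05.
```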
Consider a very simple example of a table reporting BLEU scores:
Low-resource machine translation has been an active area of research for years. At a high level, what many papers on low-resource MT have in common is that they simulate low-resource scenarios.
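One common way to simulate such a scenario, shown here purely as an illustration rather than a recipe from any particular paper, is to subsample a high-resource parallel corpus down to a few thousand sentence pairs (file names and sizes are made up):

```python
import random

def subsample_parallel(src_path, tgt_path, n_sentences, seed=1):
    """Randomly keep n_sentences aligned sentence pairs from a parallel corpus."""
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        pairs = list(zip(f_src.read().splitlines(), f_tgt.read().splitlines()))
    random.Random(seed).shuffle(pairs)  # fixed seed so the subsample is reproducible
    return pairs[:n_sentences]

# e.g. pretend only 10k pairs of a large German-English corpus were available:
# small_corpus = subsample_parallel("train.de", "train.en", n_sentences=10_000)
```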
Every once in a while, there is an MT paper claiming to have achieved human parity (e.g. Hassan et al., 2018; Popel et al., 2020). To be fair, a message like
BLEU scores are ubiquitous in MT research and they usually appear in tables that look like this:
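For numbers like these to be comparable at all, the exact way BLEU was computed matters. As a rough illustration with toy data (not numbers from any actual table), this is how a single cell of such a table is typically produced with sacrebleu, including the metric signature that records the tokenizer, smoothing and number of references:

```python
from sacrebleu.metrics import BLEU

hypotheses = ["the cat sat on the mat", "it rained all day"]
references = [["the cat sat on the mat", "it was raining all day"]]  # one reference stream

bleu = BLEU()
result = bleu.corpus_score(hypotheses, references)
print(f"BLEU = {result.score:.1f}")
print(bleu.get_signature())  # e.g. tokenizer, smoothing, nrefs, sacrebleu version
```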
What many of the questionable practices I blog about have in common is that they are seemingly legitimized by saying