Seven recommendations for machine translation evaluation
Evaluating machine translation systems is not as straightforward as it seems at first glance.
In MT research there is a long-standing tradition of using old data sets when newer versions are available for the same language pair and domain.
In recent blog posts I have described many potential issues with MT evaluation. Surely statistical significance testing should help mitigate some of those problems? That may seem reasonable, but the truth is: it can be laughably easy to arrive at results that are statistically significant according to the most popular test in MT research, bootstrap resampling.
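To make the resampling procedure concrete, here is a minimal sketch of a paired bootstrap test for a BLEU difference between two systems. It is an illustration only: the function and its parameters are my own, it assumes sacrebleu is installed, and implementations differ in exactly how the p-value is estimated.

```python
# Paired bootstrap resampling (in the spirit of Koehn, 2004) for the BLEU
# difference between two systems. hyps_a, hyps_b and refs are parallel lists
# of sentence strings; file loading is left out to keep the sketch short.
import random
import sacrebleu

def paired_bootstrap(hyps_a, hyps_b, refs, n_samples=1000, seed=42):
    rng = random.Random(seed)
    n = len(refs)
    # BLEU difference on the full test set
    delta_orig = (sacrebleu.corpus_bleu(hyps_a, [refs]).score
                  - sacrebleu.corpus_bleu(hyps_b, [refs]).score)
    not_better = 0
    for _ in range(n_samples):
        # resample sentence indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        a = [hyps_a[i] for i in idx]
        b = [hyps_b[i] for i in idx]
        r = [refs[i] for i in idx]
        delta = (sacrebleu.corpus_bleu(a, [r]).score
                 - sacrebleu.corpus_bleu(b, [r]).score)
        if delta <= 0:
            not_better += 1
    # fraction of resamples on which system A does not beat system B:
    # a rough one-sided p-value estimate
    return delta_orig, not_better / n_samples
```

Recent versions of sacrebleu ship their own paired significance tests, so in practice there is little reason to hand-roll this; the sketch is only meant to show how little machinery the test involves.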
Consider a very simple example of a table reporting BLEU scores:
Low-resource machine translation has been an active area of research for years. At a high level, what many papers on low-resource MT have in common is that they simulate low-resource scenarios.
Every once in a while, there is an MT paper claiming to have achieved human parity (e.g. Hassan et al., 2018; Popel et al., 2020). To be fair, a message like …
BLEU scores are ubiquitous in MT research and they usually appear in tables that look like this:
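For context, a single BLEU cell in such a table is typically computed with sacrebleu; a minimal sketch, with placeholder file names of my own:

```python
# Minimal sketch: computing a corpus-level BLEU score with sacrebleu.
# "hypotheses.txt" and "references.txt" are placeholder file names, one
# segment per line, aligned line by line.
import sacrebleu

with open("hypotheses.txt", encoding="utf-8") as f:
    hypotheses = [line.rstrip("\n") for line in f]
with open("references.txt", encoding="utf-8") as f:
    references = [line.rstrip("\n") for line in f]

# corpus_bleu expects the hypotheses plus a list of reference streams
# (here a single reference per segment).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```

sacrebleu can also report a signature string recording tokenization and other settings alongside the score, which is what makes numbers like these comparable across papers in the first place.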
What many of the questionable practices I blog about have in common is that they are seemingly legitimized by saying …