GEM is a benchmark environment for Natural Language Generation with a focus on its Evaluation, both through human annotations and automated Metrics.

GEM aims to:

  • measure NLG progress across many tasks and languages.
  • audit data and models, presenting results via data cards and model robustness reports.
  • develop standards for evaluation of generated text using both automated and human metrics.

We will regularly update GEM and encourage more inclusive evaluation practices by extending existing data or developing datasets for additional languages.
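
As an illustration of how a GEM task might be accessed in practice, the minimal sketch below uses the Hugging Face `datasets` library; the `"gem"` dataset name and the `"common_gen"` configuration are assumptions for this example and are not specified in the overview above.

```python
# Minimal sketch: loading one GEM task through the Hugging Face `datasets` library.
# The "gem" dataset and its "common_gen" configuration are assumed to be available
# on the Hub; recent library versions may additionally require trust_remote_code=True.
from datasets import load_dataset

data = load_dataset("gem", "common_gen")

# Inspect the available splits, then print one validation example to see its fields.
print(data)
for key, value in data["validation"][0].items():
    print(f"{key}: {value}")
```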