We regularly publish papers on aspects of GEM, describing findings and resources we consider worth sharing. Please have a look below:
GEMv1 Overview (GEM Workshop 2021)
This is our first overview paper, introducing GEM and the initial set of 13 tasks and associated baselines.
GEMv2 Overview
This is our second overview paper, expanding GEM to 40 tasks and 51 languages and introducing automatic evaluation on the HuggingFace Hub.
Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text (arXiv)
In this survey paper, we discuss many of the principles underlying GEM and propose a set of best practices to follow for model evaluation. See also the shortened version presented at the MLEval workshop at ICLR 2022.
Data Cards (GEM Workshop 2021)
In "Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards", we describe the approach for data documentation in GEMv1 and the similar approach used by HuggingFace datasets.
Evaluation Suites (NeurIPS 2021)
In the paper "Automatic Construction of Evaluation Suites for Natural Language Generation Datasets", we discuss how to build data collections that test the robustness of models and show that they are much more expressive than typical test splits.
NL-Augmenter 🦎 → 🐍 (GEM Workshop 2021)
This collaborative, participatory workshop collected more than 117 ways to transform text and more than 23 ways to filter out subpopulations of datasets.
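To illustrate the two kinds of artifacts collected (text transformations and subpopulation filters), here is a minimal hypothetical sketch in Python. The function names and logic are our own toy examples, not NL-Augmenter's actual API:

```python
import random


def swap_adjacent_chars(text, seed=0):
    """Toy transformation: perturb text by swapping one pair of
    adjacent inner characters in each word longer than 3 characters,
    simulating a typo-style robustness perturbation."""
    rng = random.Random(seed)  # seeded for reproducibility
    words = []
    for w in text.split():
        if len(w) > 3:
            i = rng.randrange(1, len(w) - 2)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        words.append(w)
    return " ".join(words)


def is_short_input(text, max_tokens=5):
    """Toy filter: select the subpopulation of examples whose
    input has at most `max_tokens` whitespace-separated tokens."""
    return len(text.split()) <= max_tokens


print(swap_adjacent_chars("evaluation of generated text"))
print(is_short_input("a short example"))  # True: only 3 tokens
```

A transformation maps an example to a perturbed variant, while a filter is a predicate used to carve out a subpopulation; evaluating a model on both reveals failure modes that a single aggregate test-set score hides.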