As part of GEM, we are continuously producing resources for the research community. This page provides download links and brief explanations of each.
Our growing collection of millions of outputs and automatic scores for 20+ models across all GEM tasks. This resource is intended for work on model evaluation: it can be used to characterize model shortcomings and to provide baseline outputs for model comparison.
All our datasets can be loaded via this data loader implemented in HuggingFace datasets.
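As a rough illustration of what loading a task through HuggingFace datasets might look like, here is a minimal sketch. The helper names `gem_dataset_id` and `load_gem_task` are hypothetical, and the `GEM/<task>` identifier scheme, the `common_gen` task name, and the `validation` split are assumptions that should be checked against the data loader itself.

```python
def gem_dataset_id(task: str, org: str = "GEM") -> str:
    """Build a Hub-style identifier such as 'GEM/common_gen'.

    The '<org>/<task>' naming scheme is an assumption, not taken from this page.
    """
    task = task.strip()
    if not task:
        raise ValueError("task name must be non-empty")
    return f"{org}/{task}"


def load_gem_task(task: str, split: str = "validation"):
    """Load one split of a GEM task (hypothetical convenience wrapper)."""
    # Imported lazily so gem_dataset_id() can be used without the library installed.
    from datasets import load_dataset

    # Downloads the data on first use and reads from the local cache afterwards.
    return load_dataset(gem_dataset_id(task), split=split)
```

Calling `load_gem_task("common_gen")` would then return a dataset object whose rows can be indexed like a list. Depending on the installed version of the library, script-based datasets may need additional options, so treat this as a starting point rather than a guaranteed recipe.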
All our datasets can be loaded via this data loader implemented in TFDS.
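For TFDS, a comparable sketch is shown below. The helper names `gem_tfds_name` and `load_gem_task_tfds` are hypothetical, and the `gem/<task>` naming, the `common_gen` config, and the `validation` split are assumptions to verify against the TFDS catalog.

```python
def gem_tfds_name(task: str) -> str:
    """Build a TFDS-style name such as 'gem/common_gen' (assumed naming scheme)."""
    task = task.strip().lower()
    if not task:
        raise ValueError("task name must be non-empty")
    return f"gem/{task}"


def load_gem_task_tfds(task: str, split: str = "validation"):
    """Load one split of a GEM task as a tf.data.Dataset (hypothetical wrapper)."""
    # Imported lazily so gem_tfds_name() can be used without TFDS installed.
    import tensorflow_datasets as tfds

    # tfds.load downloads and prepares the data on first use.
    return tfds.load(gem_tfds_name(task), split=split)
```

The returned object can be iterated with `.take(n)` to inspect a few examples before wiring it into a training or evaluation pipeline.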
Our package for model evaluation. If you want to compute our full suite of metrics, with convenience features such as caching and parallelism, add your dataset to the package and follow the instructions in its README.
If you want to run robustness tests on your model and data, NL-Augmenter can help! More information can be found on the dedicated site.