List of Tasks

The list below links to data statements [1, 2] for each of the datasets that are part of GEM tasks. The template used to produce the initial statements and a guide on how to write them can be found here: [download template] [view guide]. We have released an extended version of this template and an interactive collection tool.

  • conversational_weather | Data-to-Text | English
    The purpose of this dataset is to assess how well a model can learn a template-like structure in a very low data setting. The task here is to produce a response to a weather-related query. The reply is further specified through the data attributes and discourse structure in the input. The output contains both the lexicalized text and discourse markers for attributes (e.g., `_ARG_TEMP_ 34`).
  • dart | Data-to-Text | English
    DART is an English dataset that aggregates multiple other data-to-text datasets in a common triple-based format. The new format is completely flat, so a model does not need to learn hierarchical structures, while the full information is still retained.
  • e2e_nlg | Data-to-Text | English
    The E2E NLG dataset is an English benchmark dataset for data-to-text models that verbalize a set of 2-9 key-value attribute pairs in the restaurant domain. The version used for GEM is the cleaned E2E NLG dataset, which filters out examples with hallucinations and outputs that don't fully cover all input attributes.
  • mlb_data_to_text | Data-to-Text | English
    The MLB dataset is an English sport-related data-to-text dataset in the baseball domain. The input is a large table with results of a game and the output is a description of the game.
  • RotoWire_English-German | Data-to-Text | English, German
    This is a data-to-text dataset in the basketball domain. The inputs are tables in a fixed format with statistics about a game (in English), and the target is a German translation of the originally English description. The translations were done by professional translators with basketball experience. The dataset can be used to evaluate the cross-lingual data-to-text capabilities of a model with complex inputs.
  • sportsett_basketball | Data-to-Text | English
    The sportsett dataset is an English data-to-text dataset in the basketball domain. The inputs are statistics summarizing an NBA game and the outputs are high-quality descriptions of the game in natural language.
  • surface_realisation_st_2020 | Data-to-Text | Arabic, Chinese, English, French, Hindi, Indonesian, Japanese, Korean, Portuguese, Russian, Spanish (Castilian)
    This dataset was used as part of the multilingual surface realization shared task, in which a model receives full or partial universal dependency structures and has to reconstruct the natural language. This dataset supports 11 languages.
  • totto | Data-to-Text | English
    ToTTo is a high-quality English table-to-text dataset with more than 100,000 examples in which a table from Wikipedia with highlighted cells is paired with a sentence that describes the highlighted cells. All examples in the dataset were post-edited in multiple steps to ensure that the targets are fully faithful to the input information.
  • turku_hockey_data2text | Data-to-Text | Finnish
    This is a Finnish data-to-text dataset in which the input is structured information about a hockey game and the output is a description of the game.
  • viggo | Data-to-Text | English
    ViGGO is an English data-to-text generation dataset in the video game domain, with target responses being more conversational than information-seeking, yet constrained to the information presented in a meaning representation. The dataset is relatively small, with about 5,000 examples, but very clean, and can thus serve for evaluating transfer learning, low-resource, or few-shot capabilities of neural models.
  • web_nlg | Data-to-Text | Russian, English
    WebNLG is a bilingual dataset (English, Russian) of parallel DBpedia triple sets and short texts that cover about 450 different DBpedia properties. The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short text and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation). The goal of the task is to generate texts starting from 1 to 7 input triples which have entities in common (so the input is actually a connected Knowledge Graph). The dataset contains about 17,000 triple sets and 45,000 crowdsourced texts in English, and 7,000 triple sets and 19,000 crowdsourced texts in Russian. A challenging test set section with entities and/or properties that have not been seen at training time is available.
  • CrossWOZ | Dialog Response Generation | Chinese
    CrossWOZ is a Chinese multi-domain task-oriented dialogue dataset. It contains 6K dialogue sessions and 102K utterances for 5 domains: hotel, restaurant, attraction, metro, and taxi. About 60% of the dialogues have cross-domain user goals that favor inter-domain dependency and encourage natural transition across domains in conversation.
  • cs_restaurants | Dialog Response Generation | Czech
    The Czech Restaurants dataset is a task-oriented dialog dataset in which a model needs to verbalize a response that a service agent could provide; the response is specified through a series of dialog acts. The dataset originated as a translation of an English dataset, created to test the generation capabilities of an NLG system on a morphologically rich language like Czech.
  • dstc10_track2_task2 | Dialog Response Generation | English
    The DSTC10 Track2 Task 2 follows the DSTC9 Track1 task, where participants have to implement knowledge-grounded dialog systems. The training dataset is inherited from the DSTC9 challenge and is in the written domain, while the test set is newly collected and consists of noisy ASR transcripts. Hence, the dataset facilitates building models for grounded dialog response generation.
  • RiSAWOZ | Dialog Response Generation | Mandarin Chinese
    RiSAWOZ is a Chinese dialog dataset. It can be used to study various dialogue tasks, such as Dialogue State Tracking, Dialogue Context-to-Text Generation, Coreference Resolution and Unified Generative Ellipsis and Coreference Resolution.
  • schema_guided_dialog | Dialog Response Generation | English
    The GEM version of this dataset functions as a response generation dataset. The input specifies dialog acts that a model needs to verbalize. The Schema-Guided Dialog dataset is challenging since it comprises multiple domains from hotel and travel to restaurants, and a wide range of dialog acts. The context of each conversation is provided as well.
  • Taskmaster | Dialog Response Generation | English
    This is a large task-oriented dialog dataset in which a model has to produce the response. The input contains the context and a structured representation of what the model is supposed to generate. The input is already pre-formatted as a string, turning this into a pure text-to-text problem.
  • opusparcus | Paraphrasing | German, English, Finnish, French, Russian, Swedish
    Opusparcus is a paraphrase corpus for six European languages: German, English, Finnish, French, Russian, and Swedish. The paraphrases consist of subtitles from movies and TV shows.
  • turku_paraphrase_corpus | Paraphrasing | Finnish
    This is a Finnish paraphrase corpus which consists of pairs of text passages, where a typical passage is about a sentence long. It can be used to either identify or generate paraphrases.
  • FairytaleQA | Question Generation | English
    The FairytaleQA dataset is an English-language dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. The dataset was corrected to support both the tasks of Question Generation and Question Answering.
  • squad_v2 | Question Generation | English
    SQuAD2.0 is a dataset that tests the ability of a system to not only answer reading comprehension questions, but also abstain when presented with a question that cannot be answered based on the provided paragraph. F1 score is used to evaluate models on the leaderboard. In GEM, we use this dataset for the question-generation task, in which a model should generate SQuAD-like questions from an input text.
  • ART | Reasoning | English
    Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation.
  • common_gen | Reasoning | English
    CommonGen is an English text generation task to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts. CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of 30k concept-sets and 50k sentences in total. Note that the CommonGen test set is private and requires submission to the external leaderboard.
  • BiSECT | Simplification | English, German, French, Spanish (Castilian)
    This dataset is composed of 1 million complex sentences; the task is to split and simplify them while retaining the full meaning. Compared to other simplification corpora, BiSECT requires more significant edits. BiSECT offers splits in English, German, French, and Spanish.
  • cochrane-simplification | Simplification | English
    Cochrane is an English dataset for paragraph-level simplification of medical texts. Cochrane is a database of systematic reviews of clinical questions, many of which have summaries in plain English targeting readers without a university education. The dataset comprises about 4,500 such pairs.
  • SIMPITIKI | Simplification | Italian
    SIMPITIKI is an Italian simplification dataset. Its examples were selected from Italian Wikipedia such that their editing tracking descriptions contain any of the words "Simplified"/"Simplify"/"Simplification".
  • wiki_auto_asset_turk | Simplification | English
    WikiAuto is an English simplification dataset that we paired with ASSET and TURK, two very high-quality evaluation datasets, as test sets. The input is an English sentence taken from Wikipedia and the target a simplified sentence. ASSET and TURK contain the same test examples but have references that are simplified in different ways (splitting sentences vs. rewriting and splitting).
  • indonlg | Summarization | Indonesian, Javanese, Sundanese
    IndoNLG is a collection of various Indonesian, Javanese, and Sundanese NLG tasks including summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.
  • mlsum | Summarization | German, Spanish (Castilian)
    MLSum is a multilingual summarization dataset crawled from different news websites. The GEM version supports the German and Spanish subsets alongside specifically collected challenge sets for COVID-related articles to test out-of-domain generalization.
  • OrangeSum | Summarization | French
    OrangeSum is a French summarization dataset inspired by XSum. It features two subtasks: abstract generation and title generation. The data was sourced from "Orange Actu" articles between 2011 and 2020.
  • squality | Summarization | English
    SQuALITY (Summarization-format QUestion Answering with Long Input Texts, Yes!) is a summarization dataset that is (1) abstractive; (2) long-input: the input documents are short stories of 3,000 to 6,000 words; (3) question-focused: each story is associated with multiple question-summary pairs; (4) multi-reference: each question is paired with 4 summaries; and (5) high-quality: the summaries are crowdsourced from skilled and trained writers.
  • wiki_cat_sum | Summarization | English
    WikiCatSum is an English summarization dataset in three domains: animals, companies, and film. It provides multiple paragraphs of text paired with a summary of the paragraphs.
  • wiki_lingua | Summarization | English, Spanish (Castilian), Portuguese, French, German, Russian, Italian, Indonesian, Dutch (Flemish), Arabic, Chinese, Vietnamese, Thai, Japanese, Korean, Hindi, Czech, Turkish
  • xlsum | Summarization | Amharic, Arabic, Azerbaijani, Bengali (Bangla), Burmese, Chinese (family), English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Rundi, Korean, Kirghiz (Kyrgyz), Marathi, Nepali (individual language), Oromo, Pushto (Pashto), Persian, Ghanaian Pidgin English, Portuguese, Panjabi (Punjabi), Russian, Scottish Gaelic (Gaelic), Serbian, Romano-Serbian, Sinhala (Sinhalese), Somali, Spanish (Castilian), Swahili (individual language; Kiswahili), Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yoruba
    XLSum is a highly multilingual summarization dataset supporting 44 languages. The data stems from BBC news articles.
  • xsum | Summarization | English
    XSum is an English news summarization dataset where the task is to predict the first sentence of an article from the rest of it.
  • xwikis | Summarization | German, English, French, Czech
    The XWikis Corpus provides datasets with different language pairs and directions for cross-lingual and multi-lingual abstractive document summarisation.
  • SciDuet | Text-to-Slide | English
    This dataset supports the document-to-slide generation task where a model has to generate presentation slide content from the text of a document.
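
As a rough illustration of the flat, triple-based input that the dart and web_nlg entries above describe, the following Python sketch joins a small set of (subject, predicate, object) triples into a single input string. The function name `linearize_triples`, the separator strings, and the example triples are all assumptions made for illustration; they are not the format used by GEM, DART, or WebNLG, whose data statements should be consulted for the actual field layout.

```python
from typing import Iterable, Tuple

# A triple is (subject, predicate, object); this alias is illustrative only.
Triple = Tuple[str, str, str]


def linearize_triples(
    triples: Iterable[Triple],
    part_sep: str = " | ",    # assumed separator between the parts of a triple
    triple_sep: str = " ; ",  # assumed separator between consecutive triples
) -> str:
    """Flatten a set of triples into one string (hypothetical format)."""
    return triple_sep.join(
        part_sep.join(part.strip() for part in triple) for triple in triples
    )


# Made-up example input in the style of a DBpedia triple set.
example = [
    ("Alan_Bean", "occupation", "Test_pilot"),
    ("Alan_Bean", "nationality", "United_States"),
]
print(linearize_triples(example))
```

Because the result is a plain string, it can be passed to a text-to-text model without the model having to handle any nested structure, which is the property the dart entry highlights.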