
common_gen

CommonGen is an English text generation task to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts, the task is to generate a coherent sentence describing an everyday scenario using these concepts. CommonGen is challenging because it inherently requires 1) relational reasoning using background commonsense knowledge, and 2) compositional generalization ability to work on unseen concept combinations. The dataset, constructed through a combination of crowd-sourcing from AMT and existing caption corpora, consists of 30k concept-sets and 50k sentences in total. Note that the CommonGen test set is private and requires submission to the external leaderboard.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/common_gen')

The data loader can be found here.

website

link

paper

Link

authors

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Quick-Use

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Bill Yuchen Lin

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

English

License

What is the license of the dataset?

mit: MIT License

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Dataset Overview
  • Where to find the Data and its Documentation

  • Languages and Intended Use

  • Credit

  • Dataset Structure

Where to find the Data and its Documentation

Webpage

What is the webpage for the dataset (if it exists)?

link

Download

What is the link to where the original dataset is hosted?

Link

Paper

What is the link to the paper describing the dataset (open access preferred)?

Link

BibTex

Provide the BibTeX-formatted reference for the dataset. Please use the correct published version (ACL Anthology, etc.) instead of a BibTeX entry generated by Google Scholar.

@inproceedings{lin-etal-2020-commongen,
    title = "{C}ommon{G}en: A Constrained Text Generation Challenge for Generative Commonsense Reasoning",
    author = "Lin, Bill Yuchen  and
      Zhou, Wangchunshu  and
      Shen, Ming  and
      Zhou, Pei  and
      Bhagavatula, Chandra  and
      Choi, Yejin  and
      Ren, Xiang",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.findings-emnlp.165",
    pages = "1823--1840",
}

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Bill Yuchen Lin

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

yuchen.lin@usc.edu

Has a Leaderboard?

Does the dataset have an active leaderboard?

yes

Leaderboard Link

Provide a link to the leaderboard.

Link

Leaderboard Details

Briefly describe how the leaderboard evaluates models.

The model outputs are evaluated against the crowdsourced references, and ranked by SPICE score. The leaderboard also reports BLEU-4 and CIDEr scores.

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

no

Covered Dialects

What dialects are covered? Are there multiple dialects per language?

No information is provided on regional restrictions and we thus assume that the covered dialects are those spoken by raters on Mechanical Turk.

Covered Languages

What languages/dialects are covered in the dataset?

English

Whose Language?

Whose language is in the dataset?

The concepts were extracted from multiple English image captioning datasets and the data was collected via Amazon Mechanical Turk. No information on regional restrictions is provided.

License

What is the license of the dataset?

mit: MIT License

Intended Use

What is the intended use of the dataset?

CommonGen is a constrained text generation task, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning.

Primary Task

What primary task does the dataset support?

Reasoning

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Credit

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

academic, independent

Curation Organization(s)

Name the organization(s).

The dataset was curated by a joint team of researchers from the University of Southern California and Allen Institute for Artificial Intelligence.

Dataset Creators

Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).

Bill Yuchen Lin (USC), Wangchunshu Zhou (USC), Ming Shen (USC), Pei Zhou (USC), Chandra Bhagavatula (AllenAI), Yejin Choi (AllenAI + UW), Xiang Ren (USC)

Funding

Who funded the data creation?

The research is based upon work supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), the DARPA MCS program, and NSF SMA 18-29268.

Who added the Dataset to GEM?

Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.

Yacine Jernite created the initial data card. It was later extended by Simon Mille. Sebastian Gehrmann migrated it to the GEMv2 format.

Dataset Structure

Data Fields

List and describe the fields present in the dataset.

A data instance has the following fields:

  • concepts: a list of string values denoting the concepts the system should write about. It has 3 to 5 items and constitutes the input of the task.
  • target: a sentence string mentioning all of the above-mentioned concepts. It constitutes the desired output of the task.

Example Instance

Provide a JSON formatted example of a typical instance in the dataset.

[
  {
    "concepts": ["ski", "mountain", "skier"],
    "target": "Skier skis down the mountain"
  },
  {
    "concepts": ["ski", "mountain", "skier"],
    "target": "Three skiers are skiing on a snowy mountain."
  }
]
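
As a minimal sketch of how the two fields fit together, the snippet below checks that an instance has the shape described above and groups the references that share a concept set. The helper names `validate_instance` and `group_references` are hypothetical and are not part of the GEM data loader.

```python
from collections import defaultdict


def validate_instance(instance):
    """Return True if the instance matches the documented field layout."""
    concepts = instance.get("concepts")
    target = instance.get("target")
    # The input must be a list of 3 to 5 non-empty strings.
    if not isinstance(concepts, list) or not 3 <= len(concepts) <= 5:
        return False
    if not all(isinstance(c, str) and c for c in concepts):
        return False
    # The output must be a non-empty sentence string.
    return isinstance(target, str) and bool(target.strip())


def group_references(instances):
    """Map each concept set (order-insensitive) to its list of target sentences."""
    grouped = defaultdict(list)
    for instance in instances:
        grouped[frozenset(instance["concepts"])].append(instance["target"])
    return dict(grouped)


examples = [
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Skier skis down the mountain"},
    {"concepts": ["ski", "mountain", "skier"],
     "target": "Three skiers are skiing on a snowy mountain."},
]

assert all(validate_instance(e) for e in examples)
references = group_references(examples)
print(len(references[frozenset({"ski", "mountain", "skier"})]))  # 2
```

Keying on a `frozenset` treats the concepts as an unordered set, which matches the task description; whether the loader returns them in a fixed order is not something this sketch relies on.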
Data Splits

Describe and name the splits in the dataset if there are more than one.

Each example in the dataset consists of a set of 3 to 5 concepts, each denoted by a single noun, verb, or adjective (the input), and a sentence using these concepts (the output). The dataset provides several such sentences for each concept set.

                          Train     Dev    Test
Total concept-sets       32,651     993   1,497
Total sentences          67,389   4,018   6,042
Average sentence length   10.54   11.55   13.34

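
The split statistics also show how many references each concept set has on average. The short calculation below divides the sentence counts by the concept-set counts; the numbers are copied from the statistics above, and the function name is only illustrative.

```python
# Split statistics copied from the table in this section.
splits = {
    "train": {"concept_sets": 32651, "sentences": 67389},
    "dev": {"concept_sets": 993, "sentences": 4018},
    "test": {"concept_sets": 1497, "sentences": 6042},
}


def references_per_concept_set(stats):
    """Average number of reference sentences per concept set in one split."""
    return stats["sentences"] / stats["concept_sets"]


ratios = {name: round(references_per_concept_set(s), 2) for name, s in splits.items()}
print(ratios)  # {'train': 2.06, 'dev': 4.05, 'test': 4.04}
```

The training split has roughly two references per concept set, while the dev and test splits have roughly four.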
Splitting Criteria

Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g., if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.

The dev and test sets were created by sampling concept sets of size 4 or 5 (and as many of size 3 for the dev set) that are present in the source captioning datasets, and by having crowd-workers write reference sentences using these concepts.

Conversely, the training set has more concept sets of size 3 than of size 4 and 5, and uses the original captions from the source datasets as references.

The authors also ensured that the training, dev, and test sets contain different combinations of concepts, so that the task tests compositional generalization (details in Table 1).

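
One way to check the last property on your own copy of the data is to compare concept sets across splits. The sketch below is an assumption about how such a check could look, not the authors' procedure; the function name and the toy rows are hypothetical.

```python
def concept_set_overlap(split_a, split_b):
    """Return the concept sets (order-insensitive) that occur in both splits."""
    sets_a = {frozenset(inst["concepts"]) for inst in split_a}
    sets_b = {frozenset(inst["concepts"]) for inst in split_b}
    return sets_a & sets_b


# Toy data for illustration only; these are not real dataset rows.
toy_train = [
    {"concepts": ["dog", "run", "park"]},
    {"concepts": ["ski", "mountain", "skier"]},
]
toy_dev = [
    {"concepts": ["park", "dog", "run", "ball"]},
    {"concepts": ["cook", "pan", "egg", "stove"]},
]

print(concept_set_overlap(toy_train, toy_dev))  # set()
```

Individual concepts such as "dog" may still appear in several splits; only the full combinations are compared here.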

Dataset in GEM
  • Rationale for Inclusion in GEM

  • GEM-Specific Curation

  • Getting Started with the Task

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

What does this dataset contribute toward better generation evaluation and why is it part of GEM?

CommonGen is a medium sized corpus with a unique reasoning challenge and interesting evaluation possibilities.

Similar Datasets

Do other datasets for the high level task exist?

no

Ability that the Dataset measures

What aspect of model ability can be measured with this dataset?

Commonsense reasoning

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

yes

GEM Modifications

What changes have been made to the original dataset?

other

Modification Details

For each of these changes, describe them in more detail and provide the intended purpose of the modification.

Four challenge sets for CommonGen were added to the GEM evaluation suite.

Additional Splits?

Does GEM provide additional splits to the dataset?

yes

Split Information

Describe how the new splits were created

  1. Data Shift

We created subsets of the training and development sets of ~500 randomly selected inputs each.

  2. Transformations

We applied input scrambling on a subset of 500 randomly selected test instances; the order of the concepts was randomly reassigned.

  3. Subpopulations

We created subpopulations based on input length, that is, the number of concepts in each test input. By comparing inputs of different lengths, we can see to what extent systems are able to handle different input sizes.

Concept number   Frequency (English)
4                747
5                750

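
The three kinds of challenge sets described above can be approximated with a few lines of Python. This is a rough sketch under stated assumptions, not the code used to build the GEM challenge sets: the function names, the seed, and the toy instances are all hypothetical.

```python
import random


def sample_subset(instances, k=500, seed=0):
    """Data shift: draw up to k instances uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(instances, min(k, len(instances)))


def scramble_concepts(instance, rng):
    """Transformation: return a copy of the instance with its concepts reordered."""
    shuffled = list(instance["concepts"])
    rng.shuffle(shuffled)
    return {**instance, "concepts": shuffled}


def split_by_concept_count(instances):
    """Subpopulations: bucket instances by the number of input concepts."""
    buckets = {}
    for instance in instances:
        buckets.setdefault(len(instance["concepts"]), []).append(instance)
    return buckets


# Toy instances for illustration only; these are not real dataset rows.
toy_test = [
    {"concepts": ["dog", "ball", "park", "run"]},
    {"concepts": ["cook", "pan", "egg", "stove", "kitchen"]},
    {"concepts": ["ski", "mountain", "skier", "snow"]},
]

rng = random.Random(0)
scrambled = [scramble_concepts(inst, rng) for inst in sample_subset(toy_test, k=2)]
buckets = split_by_concept_count(toy_test)
print({size: len(items) for size, items in buckets.items()})  # {4: 2, 5: 1}
```

Scrambling changes only the order of the concepts, so a system that treats its input as a set should produce output of the same quality on the original and the scrambled version.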
Split Motivation

What aspects of the model's generation capacities were the splits created to test?

Generalization and Robustness

Getting Started with the Task

Pointers to Resources

Getting started with in-depth research on the task. Add relevant pointers to resources that researchers can consult when they want to get started digging deeper into the task.

Previous Results
  • Previous Results

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

Commonsense Reasoning

Metrics

What metrics are typically used for this task?

Other: Other Metrics, BLEU, ROUGE, METEOR

Other Metrics

Definitions of other metrics

  • SPICE: An evaluation metric for image captioning that is defined over scene graphs
  • CIDEr: An n-gram overlap metric based on cosine similarity between the TF-IDF weighted n-gram counts

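
To make the CIDEr description concrete, the following is a simplified sketch of the underlying idea: weight n-gram counts by TF-IDF and compare candidate and reference with cosine similarity. It is not the official CIDEr implementation, which, among other differences, combines several n-gram orders and multiple references per input. The function names and the whitespace tokenization are assumptions of this sketch.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def tfidf_vector(counts, doc_freq, num_docs):
    """Weight relative n-gram frequencies by inverse document frequency."""
    total = sum(counts.values())
    vector = {}
    for gram, count in counts.items():
        # N-grams unseen in the corpus are treated as occurring in one document.
        idf = math.log(num_docs / max(doc_freq.get(gram, 0), 1))
        vector[gram] = (count / total) * idf
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(gram, 0.0) for gram, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def tfidf_ngram_similarity(candidate, reference, corpus, n=1):
    """Cosine similarity of TF-IDF weighted n-gram vectors for one n-gram order."""
    tokenized = [sentence.lower().split() for sentence in corpus]
    doc_freq = Counter()
    for tokens in tokenized:
        # Each n-gram counts once per corpus sentence.
        doc_freq.update(set(ngram_counts(tokens, n)))
    num_docs = len(tokenized)
    cand = tfidf_vector(ngram_counts(candidate.lower().split(), n), doc_freq, num_docs)
    ref = tfidf_vector(ngram_counts(reference.lower().split(), n), doc_freq, num_docs)
    return cosine(cand, ref)


# Toy corpus for illustration only.
corpus = [
    "a skier skis down the mountain",
    "a dog runs in the park",
    "three skiers are skiing on a snowy mountain",
]
score = tfidf_ngram_similarity("a skier on the mountain", corpus[0], corpus)
```

Because words that occur in every corpus sentence get an IDF of zero, frequent function words contribute little to the score, which is the motivation for the TF-IDF weighting.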
Proposed Evaluation

List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

The main metrics are captioning metrics, since the original concept lists were extracted from captioning datasets. In addition, a human subject study was conducted in which five graduate students were asked to rank the "commonsense plausibility" of two models at a time.

Previous results available?

Are previous results available?

yes

Other Evaluation Approaches

What evaluation approaches have others used?

The currently best performing model KFCNet (https://aclanthology.org/2021.findings-emnlp.249/) uses the same automatic evaluation but does not conduct any human evaluation.

Relevant Previous Results

What are the most relevant previous results for this task/dataset?

The most relevant results can be seen on the leaderboard.

Dataset Curation
  • Original Curation

  • Language Data

  • Structured Annotations

  • Consent

  • Private Identifying Information (PII)

  • Maintenance

Original Curation

Original Curation Rationale

Original curation rationale

The dataset creators selected sets of concepts that appeared in image and video captions (as identified by a POS tagger) to ensure that a likely real-world scenario including the set could be imagined and constructed. Section 3.1 of the paper describes a sampling scheme which encourages diversity of sets while selecting common concepts.

Communicative Goal

What was the communicative goal?

The speaker is required to produce a coherent sentence which mentions all of the source concepts, and which describes a likely situation that could be captured in a picture or video.

Sourced from Different Sources

Is the dataset aggregated from different data sources?

yes

Source Details

List the sources (one per line)

Language Data

How was Language Data Obtained?

How was the language data obtained?

Crowdsourced

Where was it crowdsourced?

If crowdsourced, where from?

Amazon Mechanical Turk

Language Producers

What further information do we have on the language producers?

The training data consists of concept sets and captions from the source datasets. The concept sets are the sets of labels of the images or videos, selected with a heuristic to maximize diversity while ensuring that they represent likely scenarios.

The dev and test set sentences were created by Amazon Mechanical Turk crowd workers. The workers were shown an example generation and a set of 4 or 5 concept names along with their part-of-speech and asked to write:

  1. One sentence mentioning all of the concepts
  2. A rationale explaining how the sentence connects the concepts

A screenshot of the interface is provided in Figure 7 of the Appendix.

Topics Covered

Does the language in the dataset focus on specific topics? How would you describe them?

Information was not provided.

Data Validation

Was the text validated by a different worker or a data curator?

validated by data curator

Was Data Filtered?

Were text instances selected or filtered?

algorithmically

Filter Criteria

What were the selection criteria?

During data collection, workers were disqualified from the pool and replaced with newcomers if their rationales were too short, if their sentences failed to cover the input concepts well, or if their output had a high perplexity under a GPT-2 model.

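
The three criteria can be sketched as a simple check on one submission. This is an illustration under strong assumptions, not the authors' filtering code: all thresholds are hypothetical, the prefix-based coverage test is a naive stand-in for proper lemmatization, and the perplexity is expected as a precomputed number rather than being computed with GPT-2 here.

```python
def covers_concept(concept, sentence_tokens):
    """Naive coverage test: some token starts with the concept string.

    A stand-in for lemmatization; it misses irregular forms such as run/ran.
    """
    return any(token.startswith(concept.lower()) for token in sentence_tokens)


def concept_coverage(concepts, sentence):
    """Fraction of the input concepts that appear in the sentence."""
    tokens = [t.strip(".,!?;:").lower() for t in sentence.split()]
    covered = sum(covers_concept(c, tokens) for c in concepts)
    return covered / len(concepts)


def should_flag(rationale, sentence, concepts, perplexity,
                min_rationale_words=5, min_coverage=1.0, max_perplexity=1000.0):
    """Flag a submission on any of the three criteria (thresholds are hypothetical)."""
    return (
        len(rationale.split()) < min_rationale_words
        or concept_coverage(concepts, sentence) < min_coverage
        or perplexity > max_perplexity
    )


concepts = ["ski", "mountain", "skier"]
sentence = "Three skiers are skiing on a snowy mountain."
rationale = "Skiers are people who ski, usually on a mountain."
print(should_flag(rationale, sentence, concepts, perplexity=50.0))  # False
```

The source describes disqualifying workers rather than single submissions, so a check like this would have to be aggregated per worker before anyone is removed from the pool.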

Structured Annotations

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Annotation Service?

Was an annotation service used?

no

Consent

Any Consent Policy?

Was there a consent policy involved when gathering the data?

no

Justification for Using the Data

If not, what is the justification for reusing the data?

The data was sourced from Mechanical Turk, which means that raters were aware that their annotations may be publicly released for research purposes.

Private Identifying Information (PII)

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Justification for no PII

Provide a justification for selecting no PII above.

The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

Maintenance

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

no

Broader Social Context
  • Previous Work on the Social Impact of the Dataset

  • Impact on Under-Served Communities

  • Discussion of Biases

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

The dataset is created using data from image captioning systems and might inherit some of the social biases represented therein (see e.g. Tang et al. 2020).

Another related concern is the exposure bias introduced by the initial selection of pictures and video, which are likely to over-represent situations that are common in the US at the expense of other parts of the world (Flickr, for example, is a US-based company founded in Canada). For more discussion of the potential impacts of exposure bias, see e.g. The Social Impact of Natural Language Processing.

Considerations for Using the Data
  • PII Risks and Liability

  • Licenses

  • Known Technical Limitations

PII Risks and Liability

Potential PII Risk

Considering your answers to the PII part of the Data Curation Section, describe any potential privacy risks to the data subjects and creators when using the dataset.

The concepts are restricted to verbs, adjectives, and common nouns, and no personal information is given in the captions.

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describe the copyright and licensing status of the dataset?

open license - commercial use allowed

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describe the copyright and licensing status of the underlying language data?

open license - commercial use allowed

Known Technical Limitations

Technical Limitations

Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible.

The dataset is in English, a language with an abundance of existing resources.

The use of GPT-2 to validate development and test sentences might raise concerns about biases inherited from the model, but we note that the authors only use the model to discard sequences with very high perplexity, which is less likely to surface those biases.

The language in the development and test sets is crowdsourced, which means that it was written by workers whose main goal was speed. This is likely to impact the quality and variety of the targets. The population of crowdsource workers is also not distributed identically to the base population of the locations the workers come from, which may lead to different representation of situations or underlying expectations of what these situations are.

Unsuited Applications

When using a model trained on this dataset in a setting where users or the public may interact with its predictions, what are some pitfalls to look out for? In particular, describe some applications of the general task featured in this dataset that its curation or properties make it less suitable for.

Due to the overrepresentation of US situations, the system may not work for users across the world. Moreover, only limited information on the dataset quality is provided, and the system may fail as a result of unknown issues.

Discouraged Use Cases

What are some discouraged use cases of a model trained to maximize the proposed metrics on this dataset? In particular, think about settings where decisions made by a model that performs reasonably well on the metric may still have strong negative consequences for users or members of the public.

Any system needs to be evaluated on a broader set of unseen concepts than provided in the dataset. Since the references for the test set are private, it is not known how well findings generalize beyond the collection methodology.