e2e_nlg
The E2E NLG dataset is an English benchmark dataset for data-to-text models that verbalize a set of 2-9 key-value attribute pairs in the restaurant domain. The version used for GEM is the cleaned E2E NLG dataset, which filters examples with hallucinations and outputs that don't fully cover all input attributes.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/e2e_nlg')
The data loader can be found here.
website
authors
Jekaterina Novikova, Ondrej Dusek and Verena Rieser
Quick-Use
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Ondrej Dusek
Multilingual?
Is the dataset multilingual?
no
Covered Languages
What languages/dialects are covered in the dataset?
English
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Producing a text informing/recommending a restaurant, given all and only the attributes specified on the input.
Additional Annotations?
Does the dataset have additional annotations for each instance?
none
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Dataset Overview
Where to find the Data and its Documentation
Webpage
What is the webpage for the dataset (if it exists)?
Download
What is the link to where the original dataset is hosted?
Paper
What is the link to the paper describing the dataset (open access preferred)?
First data release, Detailed E2E Challenge writeup, Cleaned E2E version
BibTex
Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.
@inproceedings{e2e_cleaned,
address = {Tokyo, Japan},
title = {Semantic {Noise} {Matters} for {Neural} {Natural} {Language} {Generation}},
url = {https://www.aclweb.org/anthology/W19-8652/},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Howcroft, David M and Rieser, Verena},
year = {2019},
pages = {421--426},
}
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Ondrej Dusek
Contact Email
If known, provide the email of at least one person the reader can contact for questions about the dataset.
Has a Leaderboard?
Does the dataset have an active leaderboard?
no
Languages and Intended Use
Multilingual?
Is the dataset multilingual?
no
Covered Dialects
What dialects are covered? Are there multiple dialects per language?
Dialect-specific data was not collected and the language is general British English.
Covered Languages
What languages/dialects are covered in the dataset?
English
Whose Language?
Whose language is in the dataset?
The original dataset was collected on the CrowdFlower (now Appen) platform from self-reported native English speakers. No demographic information was provided, but the collection was geographically limited to English-speaking countries.
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use
What is the intended use of the dataset?
The dataset was collected to test neural models on a very well-specified realization task.
Primary Task
What primary task does the dataset support?
Data-to-Text
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Producing a text informing/recommending a restaurant, given all and only the attributes specified on the input.
Credit
Curation Organization Type(s)
In what kind of organization did the dataset curation happen?
academic
Curation Organization(s)
Name the organization(s).
Heriot-Watt University
Dataset Creators
Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).
Jekaterina Novikova, Ondrej Dusek and Verena Rieser
Funding
Who funded the data creation?
This research received funding from the EPSRC projects DILiGENt (EP/M005429/1) and MaDrIgAL (EP/N017536/1).
Who added the Dataset to GEM?
Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.
Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card to the v2 format and moved the data loader to the hub.
Dataset Structure
Data Fields
List and describe the fields present in the dataset.
The data is in a CSV format, with the following fields:
- mr -- the meaning representation (MR, input)
- ref -- reference, i.e. the corresponding natural-language description (output)
There are additional fields (fixed, orig_mr) indicating whether the data was modified in the cleaning process and what was the original MR before cleaning, but these aren't used for NLG.
The MR has a flat structure -- attribute-value pairs are comma separated, with values enclosed in brackets (see the example instance below). There are 8 attributes:
- name -- restaurant name
- near -- a landmark close to the restaurant
- area -- location (riverside, city centre)
- food -- food type / cuisine (e.g. Japanese, Indian, English)
- eatType -- restaurant type (restaurant, coffee shop, pub)
- priceRange -- price range (low, medium, high, <£20, £20-30, >£30)
- rating -- customer rating (low, medium, high, 1/5, 3/5, 5/5)
- familyFriendly -- is the restaurant family-friendly (yes/no)
The same MR is often repeated multiple times with different synonymous references.
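The flat MR format described above can be parsed with a single regular expression. The following is a minimal sketch, not part of the official data loader; the helper name `parse_mr` is hypothetical, and the sketch assumes that attribute values never contain square brackets.

```python
import re
from typing import Dict

# Matches one "attribute[value]" pair, e.g. "area[riverside]".
PAIR_RE = re.compile(r"\s*([^\[\],]+)\[([^\]]*)\]\s*")

def parse_mr(mr: str) -> Dict[str, str]:
    """Parse a flat MR string into a dict that preserves attribute order."""
    pairs = {}
    for match in PAIR_RE.finditer(mr):
        attribute = match.group(1).strip()
        value = match.group(2).strip()
        pairs[attribute] = value
    return pairs

mr = "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]"
parsed = parse_mr(mr)
print(parsed)
```

Because the same MR is paired with several references, a parsed representation like this can also serve as a key for grouping references that verbalize the same input.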
How were labels chosen?
How were the labels chosen?
The source MRs were generated automatically at random from a set of valid attribute values. The labels were crowdsourced and are natural language.
Example Instance
Provide a JSON formatted example of a typical instance in the dataset.
{
"input": "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]",
"target": "Alimentum is a kids friendly place in the riverside area near Burger King."
}
Data Splits
Describe and name the splits in the dataset if there are more than one.
Split | MRs | Distinct MRs | References |
---|---|---|---|
Training | 12,568 | 8,362 | 33,525 |
Development | 1,484 | 1,132 | 4,299 |
Test | 1,847 | 1,358 | 4,693 |
Total | 15,899 | 10,852 | 42,517 |
“Distinct MRs” are MRs that remain distinct even if restaurant/place names (attributes name, near) are delexicalized, i.e., replaced with a placeholder.
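As an illustration of how delexicalization affects the distinct-MR count, here is a small sketch. The function name `delexicalize`, the placeholder format `X-name` / `X-near`, and the three toy MRs are assumptions made for this example only; they are not taken from the dataset tooling.

```python
import re

def delexicalize(mr, slots=("name", "near")):
    """Replace the values of the given slots with a placeholder token."""
    for slot in slots:
        # \b keeps "name[" from matching inside a longer attribute name.
        mr = re.sub(rf"\b{slot}\[[^\]]*\]", f"{slot}[X-{slot}]", mr)
    return mr

# Toy MRs for illustration; the first two differ only in name and near.
mrs = [
    "name[Alimentum], area[riverside], near[Burger King]",
    "name[The Mill], area[riverside], near[Cafe Sicilia]",
    "name[The Mill], area[city centre]",
]
distinct = {delexicalize(mr) for mr in mrs}
print(len(mrs), len(distinct))
```

Here the three MRs collapse to two distinct delexicalized MRs, which is the same kind of reduction that separates the "MRs" and "Distinct MRs" columns in the table.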
Splitting Criteria
Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g., if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.
The data are divided so that MRs in different splits do not overlap.
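A non-overlap property like this can be checked with set intersections. The sketch below assumes exact string comparison of MRs (the card does not say whether overlap was checked on raw or delexicalized MRs); the function name `overlapping_mrs` and the toy split contents are hypothetical.

```python
def overlapping_mrs(splits):
    """Return MRs that occur in more than one split, keyed by split pair."""
    names = list(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = set(splits[first]) & set(splits[second])
            if shared:
                overlaps[(first, second)] = shared
    return overlaps

# Toy data standing in for the MR column of each split.
splits = {
    "train": ["name[A], area[riverside]", "name[B], food[Indian]"],
    "validation": ["name[C], eatType[pub]"],
    "test": ["name[D], priceRange[high]"],
}
print(overlapping_mrs(splits))  # an empty dict means no MR is shared
```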
Dataset in GEM
Rationale for Inclusion in GEM
Why is the Dataset in GEM?
What does this dataset contribute toward better generation evaluation and why is it part of GEM?
The E2E dataset is one of the largest limited-domain NLG datasets and is frequently used as a data-to-text generation benchmark. The E2E Challenge included 20 systems of very different architectures, with system outputs available for download.
Similar Datasets
Do other datasets for the high level task exist?
yes
Unique Language Coverage
Does this dataset cover other languages than other datasets for the same task?
no
Difference from other GEM datasets
What else sets this dataset apart from other similar datasets in GEM?
The dataset is much cleaner than comparable datasets, and it is also a relatively easy task, making for a straightforward evaluation.
Ability that the Dataset measures
What aspect of model ability can be measured with this dataset?
Surface realization.
GEM-Specific Curation
Modified for GEM?
Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?
yes
Additional Splits?
Does GEM provide additional splits to the dataset?
yes
Split Information
Describe how the new splits were created
4 special test sets for E2E were added to the GEM evaluation suite.
- We created subsets of the training and development sets of ~500 randomly selected inputs each.
- We applied input scrambling on a subset of 500 randomly selected test instances; the order of the input properties was randomly reassigned.
- For the input size, we created subpopulations based on the number of restaurant properties in the input.
Input length | Frequency English |
---|---|
2 | 5 |
3 | 120 |
4 | 389 |
5 | 737 |
6 | 1187 |
7 | 1406 |
8 | 774 |
9 | 73 |
10 | 2 |
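The input scrambling and input-size subpopulations described above can be sketched as follows. This is an illustration only, not the GEM implementation: the helper names, the fixed seed, and the second toy MR are assumptions, and splitting on commas assumes that attribute values contain no commas.

```python
import random
from collections import Counter

def split_pairs(mr):
    """Split a flat MR string into its 'attribute[value]' pairs."""
    return [pair.strip() for pair in mr.split(",") if pair.strip()]

def scramble(mr, rng):
    """Return the MR with its attribute-value pairs in a random order."""
    pairs = split_pairs(mr)
    rng.shuffle(pairs)
    return ", ".join(pairs)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
mr = "name[Alimentum], area[riverside], familyFriendly[yes], near[Burger King]"
scrambled = scramble(mr, rng)
print(scrambled)

# Bucket inputs by their number of attribute-value pairs (the "input length").
inputs = [mr, "name[The Mill], eatType[pub]"]
sizes = Counter(len(split_pairs(m)) for m in inputs)
print(sizes)
```

Scrambling changes only the order of the pairs, so a model that is robust to input order should produce an equally valid description for the scrambled MR.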
Split Motivation
What aspects of the model's generation capacities were the splits created to test?
Generalization and robustness
Getting Started with the Task
Previous Results
Measured Model Abilities
What aspect of model ability can be measured with this dataset?
Surface realization.
Metrics
What metrics are typically used for this task?
BLEU, METEOR, ROUGE
Proposed Evaluation
List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.
The official evaluation script combines the MT-Eval and COCO Captioning libraries with the following metrics.
- BLEU
- CIDEr
- NIST
- METEOR
- ROUGE-L
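To give a sense of what one of these metrics measures, here is a simplified ROUGE-L sketch based on the longest common subsequence (LCS). It is not the official evaluation script: it assumes lowercased whitespace tokenization, a single reference, and a balanced F1, so its scores will generally differ from those of the official libraries.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for token_a in a:
        curr = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(candidate, reference):
    """Balanced F1 over LCS-based precision and recall (simplified ROUGE-L)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "Alimentum is a kids friendly place in the riverside area near Burger King."
print(rouge_l_f1("Alimentum is a family friendly place near Burger King.", reference))
```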
Previous results available?
Are previous results available?
yes
Other Evaluation Approaches
What evaluation approaches have others used?
Most previous results, including the shared task results, used the library provided by the dataset creators. The shared task also conducted a human evaluation using the following two criteria:
- Quality: When collecting quality ratings, system outputs were presented to crowd workers together with the corresponding meaning representation, which implies that correctness of the NL utterance relative to the MR should also influence this ranking. The crowd workers were asked: “How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?”
- Naturalness: When collecting naturalness ratings, system outputs were presented to crowd workers without the corresponding meaning representation. The crowd workers were asked: “Could the utterance have been produced by a native speaker?”
Relevant Previous Results
What are the most relevant previous results for this task/dataset?
The shared task writeup has in-depth evaluations of systems (https://www.sciencedirect.com/science/article/pii/S0885230819300919).
Dataset Curation
Original Curation
Original Curation Rationale
Original curation rationale
The dataset was collected to showcase/test neural NLG models. It is larger and contains more lexical richness and syntactic variation than previous closed-domain NLG datasets.
Communicative Goal
What was the communicative goal?
Producing a text informing/recommending a restaurant, given all and only the attributes specified on the input.
Sourced from Different Sources
Is the dataset aggregated from different data sources?
no
Language Data
How was Language Data Obtained?
How was the language data obtained?
Crowdsourced
Where was it crowdsourced?
If crowdsourced, where from?
Other crowdworker platform
Language Producers
What further information do we have on the language producers?
Human references describing the MRs were collected by crowdsourcing on the CrowdFlower (now Appen) platform, with either textual or pictorial MRs as a baseline. The pictorial MRs were used in 20% of cases -- these yield higher lexical variation but introduce noise.
Topics Covered
Does the language in the dataset focus on specific topics? How would you describe them?
The dataset is focused on descriptions of restaurants.
Data Validation
Was the text validated by a different worker or a data curator?
validated by data curator
Data Preprocessing
How was the text data pre-processed? (Enter N/A if the text was not pre-processed)
There were basic checks (length, valid characters, repetition).
Was Data Filtered?
Were text instances selected or filtered?
algorithmically
Filter Criteria
What were the selection criteria?
The cleaned version of the dataset used in GEM was algorithmically filtered. Its authors used regular expressions to match all human-generated references with a more accurate input when attributes were hallucinated or dropped. Additionally, train-test overlap stemming from the transformation was removed. As a result, this data is much cleaner than the original dataset but not perfect (about 20% of instances may have misaligned slots, compared to 40% of the original data).
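The general idea of regex-based slot matching can be sketched as below. This is not the cleaning script used by the dataset authors: the `PATTERNS` table, the function name `slot_errors`, and the handful of surface patterns are hypothetical and cover only three attribute values for illustration.

```python
import re

# Hypothetical surface patterns for a few attribute values; purely illustrative.
PATTERNS = {
    ("area", "riverside"): r"\b(riverside|river)\b",
    ("area", "city centre"): r"\bcity cent(re|er)\b",
    ("familyFriendly", "yes"): r"\b(family|kids?|child(ren)?)[- ]friendly\b",
}

def slot_errors(mr_pairs, reference):
    """Return (dropped, hallucinated) pattern keys for one MR/reference pair."""
    text = reference.lower()
    realized = {key for key, pat in PATTERNS.items() if re.search(pat, text)}
    expected = {key for key in PATTERNS if mr_pairs.get(key[0]) == key[1]}
    return expected - realized, realized - expected

ref = "Alimentum is a kids friendly place in the riverside area near Burger King."

# MR that matches the reference: nothing dropped, nothing hallucinated.
full_mr = {"name": "Alimentum", "area": "riverside", "familyFriendly": "yes"}
print(slot_errors(full_mr, ref))

# MR without the area slot: the reference now mentions an unsupported attribute.
short_mr = {"name": "Alimentum", "familyFriendly": "yes"}
print(slot_errors(short_mr, ref))
```

A cleaning pass in this spirit would use the detected mismatches to correct the MR so that it agrees with what the reference actually says.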
Structured Annotations
Additional Annotations?
Does the dataset have additional annotations for each instance?
none
Annotation Service?
Was an annotation service used?
no
Consent
Any Consent Policy?
Was there a consent policy involved when gathering the data?
yes
Consent Policy Details
What was the consent policy?
Since a crowdsourcing platform was used, the involved raters waived their rights to the data and are aware that the produced annotations can be publicly released.
Private Identifying Information (PII)
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Justification for no PII
Provide a justification for selecting no PII above.
The dataset is artificial and does not contain any description of people.
Maintenance
Any Maintenance Plan?
Does the original dataset have a maintenance plan?
no
Broader Social Context
Previous Work on the Social Impact of the Dataset
Usage of Models based on the Data
Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?
no
Impact on Under-Served Communities
Addresses needs of underserved Communities?
Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).
no
Discussion of Biases
Any Documented Social Biases?
Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.
no
Are the Language Producers Representative of the Language?
Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?
The source data is generated randomly, so it should not contain biases. The human references may be biased by the workers' demographic, but that was not investigated upon data collection.
Considerations for Using the Data
PII Risks and Liability
Licenses
Copyright Restrictions on the Dataset
Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describe the copyright and licensing status of the dataset?
open license - commercial use allowed
Copyright Restrictions on the Language Data
Based on your answers in the Language part of the Data Curation Section, which of the following best describe the copyright and licensing status of the underlying language data?
open license - commercial use allowed
Known Technical Limitations
Technical Limitations
Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible.
The cleaned version still has data points with hallucinated or omitted attributes.
Unsuited Applications
When using a model trained on this dataset in a setting where users or the public may interact with its predictions, what are some pitfalls to look out for? In particular, describe some applications of the general task featured in this dataset that its curation or properties make it less suitable for.
The data only pertains to the restaurant domain and the included attributes. A model cannot be expected to handle other domains or attributes.