cs_restaurants
The Czech Restaurants dataset is a task-oriented dialog dataset in which a model must verbalize a response that a service agent could give, specified through a series of dialog acts. The dataset originated as a translation of an English dataset and was created to test the generation capabilities of NLG systems on a morphologically rich language such as Czech.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/cs_restaurants')
The data loader can be found here.
Quick-Use
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Ondrej Dusek
Multilingual?
Is the dataset multilingual?
no
Covered Languages
What languages/dialects are covered in the dataset?
Czech
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input meaning representation.
Additional Annotations?
Does the dataset have additional annotations for each instance?
none
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Dataset Overview
Where to find the Data and its Documentation
Download
What is the link to where the original dataset is hosted?
Paper
What is the link to the paper describing the dataset (open access preferred)?
BibTex
Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.
@inproceedings{cs_restaurants,
address = {Tokyo, Japan},
title = {Neural {Generation} for {Czech}: {Data} and {Baselines}},
shorttitle = {Neural {Generation} for {Czech}},
url = {https://www.aclweb.org/anthology/W19-8670/},
urldate = {2019-10-18},
booktitle = {Proceedings of the 12th {International} {Conference} on {Natural} {Language} {Generation} ({INLG} 2019)},
author = {Dušek, Ondřej and Jurčíček, Filip},
month = oct,
year = {2019},
pages = {563--574},
}
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Ondrej Dusek
Contact Email
If known, provide the email of at least one person the reader can contact for questions about the dataset.
Has a Leaderboard?
Does the dataset have an active leaderboard?
no
Languages and Intended Use
Multilingual?
Is the dataset multilingual?
no
Covered Dialects
What dialects are covered? Are there multiple dialects per language?
No breakdown of dialects is provided.
Covered Languages
What languages/dialects are covered in the dataset?
Czech
Whose Language?
Whose language is in the dataset?
Six professional translators produced the outputs
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use
What is the intended use of the dataset?
The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.
Primary Task
What primary task does the dataset support?
Dialog Response Generation
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input meaning representation.
Credit
Curation Organization Type(s)
In what kind of organization did the dataset curation happen?
academic
Curation Organization(s)
Name the organization(s).
Charles University, Prague
Dataset Creators
Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).
Ondrej Dusek and Filip Jurcicek
Funding
Who funded the data creation?
This research was supported by the Charles University project PRIMUS/19/SCI/10 and by the Ministry of Education, Youth and Sports of the Czech Republic under the grant agreement LK11221. This work used language resources distributed by the LINDAT/CLARIN project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2015071).
Who added the Dataset to GEM?
Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.
Simon Mille wrote the initial data card and Yacine Jernite the data loader. Sebastian Gehrmann migrated the data card and loader to the v2 format.
Dataset Structure
Data Fields
List and describe the fields present in the dataset.
The data is stored in JSON or CSV format, with identical contents. The data has 4 fields:
- da: the input meaning representation/dialogue act (MR)
- delex_da: the input MR, delexicalized (all slot values are replaced with placeholders, such as X-name)
- text: the corresponding target natural language text (reference)
- delex_text: the target text, delexicalized (delexicalization is applied regardless of inflection)
In addition, the data contains a JSON file with all possible inflected forms for all slot values in the dataset (surface_forms.json). Each slot -> value entry contains a list of inflected forms for the given value, with the base form (lemma), the inflected form, and a morphological tag.
The same MR is often repeated multiple times with different synonymous reference texts.
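To make the relation between da and delex_da concrete, here is a minimal sketch of delexicalizing a dialogue act string. The set of slots to delexicalize (DELEX_SLOTS) and the function name are assumptions made for illustration; the dataset's own preprocessing may treat a different set of slots. Only the X-name placeholder style is taken from the field description above.

```python
import re

# Hypothetical choice of slots to delexicalize; the dataset's own list may differ.
DELEX_SLOTS = frozenset({"name", "near", "area"})


def delexicalize_da(da: str, slots=DELEX_SLOTS) -> str:
    """Replace the values of selected slots with X-<slot> placeholders."""

    def replace_value(match: re.Match) -> str:
        slot = match.group(1)
        # Leave slots outside the chosen set untouched.
        return f"{slot}=X-{slot}" if slot in slots else match.group(0)

    # A value is either a single-quoted string or a bare token up to ',' or ')'.
    return re.sub(r"(\w+)=(?:'[^']*'|[^,)]*)", replace_value, da)


print(delexicalize_da("inform(food=Turkish,name='Švejk Restaurant',near='Charles Bridge')"))
```

Delexicalizing the target text is harder than this, because slot values appear in inflected forms; that is what the surface_forms.json file is for, and it is not covered by this sketch.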
Reason for Structure
How was the dataset structure determined?
The data originated as a translation and localization of Wen et al.'s SF restaurant NLG dataset.
How were labels chosen?
How were the labels chosen?
The input MRs were collected from Wen et al.'s SF restaurant NLG data and localized by randomly replacing slot values (using a list of Prague restaurant names, neighborhoods etc.).
The generated slot values were then automatically replaced in reference texts in the data.
Example Instance
Provide a JSON formatted example of a typical instance in the dataset.
{
"input": "inform_only_match(food=Turkish,name='Švejk Restaurant',near='Charles Bridge',price_range=cheap)",
"target": "Našla jsem pouze jednu levnou restauraci poblíž Karlova mostu , kde podávají tureckou kuchyni , Švejk Restaurant ."
}
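The input string in the example above has the shape intent(slot=value,...). As a rough sketch, assuming that every input follows this shape, that values are either bare tokens or single-quoted strings, and that a slot may appear without a value, it can be split into an intent and a list of slot/value pairs as follows. The function name parse_da is hypothetical and not part of the dataset tooling.

```python
import re
from typing import List, Optional, Tuple

Slot = Tuple[str, Optional[str]]


def parse_da(da: str) -> Tuple[str, List[Slot]]:
    """Split a dialogue act string into (intent, [(slot, value), ...])."""
    match = re.fullmatch(r"\s*(\??\w+)\((.*)\)\s*", da)
    if match is None:
        raise ValueError(f"not a dialogue act: {da!r}")
    intent, body = match.groups()
    slots: List[Slot] = []
    # Each entry is a slot name, optionally followed by '=' and a value.
    pattern = r"(\w+)(=(?:'([^']*)'|([^,]*)))?"
    for slot, assignment, quoted, bare in re.findall(pattern, body):
        value = (quoted or bare) if assignment else None
        slots.append((slot, value))
    return intent, slots


intent, slots = parse_da(
    "inform_only_match(food=Turkish,name='Švejk Restaurant',"
    "near='Charles Bridge',price_range=cheap)"
)
print(intent, slots)
```

A list of pairs is returned rather than a dictionary so that a slot repeated within one dialogue act is not silently overwritten.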
Data Splits
Describe and name the splits in the dataset if there are more than one.
Property | Value |
---|---|
Total instances | 5,192 |
Unique MRs | 2,417 |
Unique delexicalized instances | 2,752 |
Unique delexicalized MRs | 248 |
The data is split in a roughly 3:1:1 proportion into training, development and test sections, making sure no delexicalized MR appears in two different parts. On the other hand, most DA types/intents are represented in all data parts.
Splitting Criteria
Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g., if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.
The creators ensured that after delexicalization of the meaning representation there was no overlap between training and test.
The data is split in a 3:1:1 ratio between training, validation, and test.
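The splitting criterion above amounts to a grouped split: all instances that share a delexicalized MR go to the same part. The following is a minimal sketch of one way to do this, not the creators' actual procedure. The function name, the greedy assignment strategy, and the seed are assumptions; only the delex_da field name and the 3:1:1 target come from this card.

```python
import random
from collections import defaultdict


def split_by_delex_mr(instances, ratios=(3, 1, 1), seed=0):
    """Split instances so that no delex_da value occurs in two parts."""
    groups = defaultdict(list)
    for instance in instances:
        groups[instance["delex_da"]].append(instance)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    parts = [[] for _ in ratios]
    total_ratio = sum(ratios)
    for key in keys:
        # Assign the whole group to the part furthest below its target share.
        sizes = [len(part) for part in parts]
        new_total = sum(sizes) + len(groups[key])
        deficits = [r / total_ratio * new_total - s for r, s in zip(ratios, sizes)]
        parts[deficits.index(max(deficits))].extend(groups[key])
    return parts
```

Because whole groups are moved at once, the resulting proportions are only approximately 3:1:1. This sketch also does nothing to keep every DA type represented in all parts, which the original split is described as mostly achieving.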
Dataset in GEM
Rationale for Inclusion in GEM
Why is the Dataset in GEM?
What does this dataset contribute toward better generation evaluation and why is it part of GEM?
This is one of a few non-English data-to-text datasets, in a well-known domain, but covering a morphologically rich language that is harder to generate since named entities need to be inflected. This makes it harder to apply common techniques such as delexicalization or copy mechanisms.
Similar Datasets
Do other datasets for the high level task exist?
yes
Unique Language Coverage
Does this dataset cover other languages than other datasets for the same task?
yes
Difference from other GEM datasets
What else sets this dataset apart from other similar datasets in GEM?
The dialog acts in this dataset are much more varied than those in the E2E dataset, which is the closest in style.
Ability that the Dataset measures
What aspect of model ability can be measured with this dataset?
surface realization
GEM-Specific Curation
Modified for GEM?
Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?
yes
Additional Splits?
Does GEM provide additional splits to the dataset?
yes
Split Information
Describe how the new splits were created
5 challenge sets for the Czech Restaurants dataset were added to the GEM evaluation suite.
- Data shift: We created subsets of the training and development sets of 500 randomly selected inputs each.
- Scrambling: We applied input scrambling on a subset of 500 randomly selected test instances; the order of the input dialogue acts was randomly reassigned.
- We identified different subsets of the test set that can be compared with each other to give a better understanding of the results. We have currently made two selections:
The first comparison is based on input size: the number of predicates differs between inputs, ranging from 1 to 5. The table below shows the distribution of inputs by length. The distribution is clearly not balanced, so comparisons between items should be made with caution. For input sizes 4 and 5 in particular, there may not be enough data to draw reliable conclusions.
Input length | Number of inputs |
---|---|
1 | 183 |
2 | 267 |
3 | 297 |
4 | 86 |
5 | 9 |
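A distribution like the one in the table can be recomputed from the inputs themselves. The sketch below assumes that "input length" means the number of comma-separated slot entries between the parentheses of a dialogue act string; the card does not define the term precisely, so the counts produced this way may not match the table exactly. Both function names are hypothetical.

```python
import re
from collections import Counter


def input_length(da: str) -> int:
    """Count the slot entries between the parentheses of a dialogue act."""
    body = da[da.index("(") + 1 : da.rindex(")")]
    # Drop quoted values first so that commas inside them are not counted.
    body = re.sub(r"'[^']*'", "", body)
    return len([part for part in body.split(",") if part.strip()])


def length_distribution(das):
    """Map each input length to the number of inputs with that length."""
    return Counter(input_length(da) for da in das)


example = [
    "inform_only_match(food=Turkish,name='Švejk Restaurant',"
    "near='Charles Bridge',price_range=cheap)",
    "?request(area)",
]
print(length_distribution(example))
```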
The second comparison is based on the type of act. Again, we caution against comparing groups that have relatively few items. It is probably OK to compare inform and ?request, but all other acts have low frequency.
Act | Frequency |
---|---|
?request | 149 |
inform | 609 |
?confirm | 22 |
inform_only_match | 16 |
inform_no_match | 34 |
?select | 12 |
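The act-type grouping and the scrambling challenge set described above can both be sketched with a small amount of string handling. This is an illustration under assumptions, not the GEM implementation: it assumes inputs of the form act(slot=value,...) with optional single-quoted values, and it interprets scrambling as shuffling the order of the slot entries inside one input. All function names are hypothetical.

```python
import random
import re
from collections import Counter

DA_PATTERN = re.compile(r"(\??\w+)\((.*)\)")


def act_type(da: str) -> str:
    """Return the act name, e.g. 'inform' or '?request'."""
    match = DA_PATTERN.fullmatch(da.strip())
    return match.group(1) if match else da.strip()


def split_slots(da: str) -> list:
    """Split the slot entries on commas that lie outside quoted values."""
    match = DA_PATTERN.fullmatch(da.strip())
    if match is None:
        return []
    return re.findall(r"(?:'[^']*'|[^,'])+", match.group(2))


def scramble_da(da: str, rng: random.Random) -> str:
    """Return the same dialogue act with its slot entries in random order."""
    parts = split_slots(da)
    rng.shuffle(parts)
    return f"{act_type(da)}({','.join(parts)})"


das = [
    "inform(food=Turkish,name='Švejk Restaurant',near='Charles Bridge')",
    "?request(area)",
]
print(Counter(act_type(da) for da in das))
print(scramble_da(das[0], random.Random(0)))
```

Passing an explicit random.Random instance keeps the scrambled set reproducible across runs.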
Split Motivation
What aspects of the model's generation capacities were the splits created to test?
Generalization and robustness.
Getting Started with the Task
Technical Terms
Technical terms used in this card and the dataset and their definitions
- utterance: something a system or user may say in a turn
- meaning representation: a representation of the meaning that the system output should express. The specific type of MR in this dataset is the dialog act, which describes what a dialog system should do, e.g., inform a user about a value.
Previous Results
Previous Results
Measured Model Abilities
What aspect of model ability can be measured with this dataset?
Surface realization
Metrics
What metrics are typically used for this task?
BLEU, ROUGE, METEOR
Proposed Evaluation
List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.
This dataset uses the suite of word-overlap-based automatic metrics from the E2E NLG Challenge (BLEU, NIST, ROUGE-L, METEOR, and CIDEr). In addition, the slot error rate is measured.
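The slot error rate mentioned above can be illustrated on delexicalized data. The sketch below uses one common definition, (missing + superfluous placeholders) / required placeholders, and should be read as an assumption rather than the creators' evaluation script, whose exact definition may differ. It only considers X-<slot> placeholders, so slots that are not delexicalized are ignored. The function name is hypothetical.

```python
import re
from collections import Counter

PLACEHOLDER = re.compile(r"X-\w+")


def slot_error_rate(delex_da: str, delex_output: str) -> float:
    """(missing + superfluous placeholders) / placeholders required by the MR."""
    required = Counter(PLACEHOLDER.findall(delex_da))
    produced = Counter(PLACEHOLDER.findall(delex_output))
    missing = sum((required - produced).values())
    superfluous = sum((produced - required).values())
    total = sum(required.values())
    # An MR without placeholders cannot have slot errors under this definition.
    return (missing + superfluous) / total if total else 0.0


mr = "inform(name=X-name,near=X-near)"
print(slot_error_rate(mr, "X-name je poblíž X-near ."))
print(slot_error_rate(mr, "X-name je tady ."))
```

Counter subtraction discards negative counts, which is what makes the missing and superfluous tallies independent of each other.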
Previous results available?
Are previous results available?
no
Dataset Curation
Original Curation
Original Curation Rationale
Original curation rationale
The dataset was created to test neural NLG systems in Czech and their ability to deal with rich morphology.
Communicative Goal
What was the communicative goal?
Producing a text expressing the given intent/dialogue act and all and only the attributes specified in the input MR.
Sourced from Different Sources
Is the dataset aggregated from different data sources?
no
Language Data
How was Language Data Obtained?
How was the language data obtained?
Created for the dataset
Creation Process
If created for the dataset, describe the creation process.
Six professional translators translated the underlying dataset with the following instructions:
- Each utterance should be translated by itself
- Fluent spoken-style Czech should be produced
- Facts should be preserved
- If possible, synonyms should be varied to create diverse utterances
- Entity names should be inflected as necessary
- The reader of the generated text should be addressed using the formal form, and self-references should use the female form
The translators did not have access to the meaning representation.
Data Validation
Was the text validated by a different worker or a data curator?
validated by data curator
Was Data Filtered?
Were text instances selected or filtered?
not filtered
Structured Annotations
Additional Annotations?
Does the dataset have additional annotations for each instance?
none
Annotation Service?
Was an annotation service used?
no
Consent
Any Consent Policy?
Was there a consent policy involved when gathering the data?
no
Justification for Using the Data
If not, what is the justification for reusing the data?
It was not explicitly stated but we can safely assume that the translators agreed to this use of their data.
Private Identifying Information (PII)
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Justification for no PII
Provide a justification for selecting no PII above.
This dataset does not include any information about individuals.
Maintenance
Any Maintenance Plan?
Does the original dataset have a maintenance plan?
no
Broader Social Context
Previous Work on the Social Impact of the Dataset
Usage of Models based on the Data
Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?
no
Impact on Under-Served Communities
Addresses needs of underserved Communities?
Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).
yes
Details on how Dataset Addresses the Needs
Describe how this dataset addresses the needs of underserved communities.
The dataset may help improve NLG methods for morphologically rich languages beyond Czech.
Discussion of Biases
Any Documented Social Biases?
Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.
yes
Links and Summaries of Analysis Work
Provide links to and summaries of works analyzing these biases.
To ensure consistency of translation, the data always uses formal/polite address for the user, and uses the female form for first-person self-references (as if the dialogue agent producing the sentences was female). This prevents data sparsity and ensures consistent results for systems trained on the dataset, but does not represent all potential situations arising in Czech.
Considerations for Using the Data
PII Risks and Liability
Licenses
Copyright Restrictions on the Dataset
Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describe the copyright and licensing status of the dataset?
open license - commercial use allowed
Copyright Restrictions on the Language Data
Based on your answers in the Language part of the Data Curation Section, which of the following best describe the copyright and licensing status of the underlying language data?
open license - commercial use allowed
Known Technical Limitations
Technical Limitations
Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible.
The test set may lead users to over-estimate the performance of their NLG systems with respect to their generalisability, because there are no unseen restaurants or addresses in the test set. This is something we will look into for future editions of the GEM shared task.