GEM xlsum

xlsum

XLSum is a highly multilingual summarization dataset supporting 44 languages. The data stems from BBC news articles.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/xlsum')

The data loader can be found here.

website

Github

paper

ACL Anthology

Quick-Use

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Tahmid Hasan

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

Amharic, Arabic, Azerbaijani, Bengali, Bangla, Burmese, Chinese (family), English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Rundi, Korean, Kirghiz, Kyrgyz, Marathi, Nepali (individual language), Oromo, Pushto, Pashto, Persian, Ghanaian Pidgin English, Portuguese, Panjabi, Punjabi, Russian, Scottish Gaelic, Gaelic, Serbian, Romano-Serbian, Sinhala, Sinhalese, Somali, Spanish, Castilian, Swahili (individual language), Kiswahili, Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yoruba

License

What is the license of the dataset?

cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Summarize news-like text in one of 45 languages.

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

likely

Dataset Overview

Where to find the Data and its Documentation

Webpage

What is the webpage for the dataset (if it exists)?

Github

Download

What is the link to where the original dataset is hosted?

Huggingface

Paper

What is the link to the paper describing the dataset (open access preferred)?

ACL Anthology

BibTex

Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.

@inproceedings{hasan-etal-2021-xl,
title = "{XL}-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages",
author = "Hasan, Tahmid  and
Bhattacharjee, Abhik  and
Islam, Md. Saiful  and
Mubasshir, Kazi  and
Li, Yuan-Fang  and
Kang, Yong-Bin  and
Rahman, M. Sohel  and
Shahriyar, Rifat",
booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-acl.413",
pages = "4693--4703",
}

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Tahmid Hasan

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

tahmidhasan@cse.buet.ac.bd

Has a Leaderboard?

Does the dataset have an active leaderboard?

yes

Leaderboard Link

Provide a link to the leaderboard.

Explainaboard

Leaderboard Details

Briefly describe how the leaderboard evaluates models.

The leaderboard ranks models based on ROUGE scores (R1/R2/RL) of the generated summaries.

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

Amharic, Arabic, Azerbaijani, Bengali, Bangla, Burmese, Chinese (family), English, French, Gujarati, Hausa, Hindi, Igbo, Indonesian, Japanese, Rundi, Korean, Kirghiz, Kyrgyz, Marathi, Nepali (individual language), Oromo, Pushto, Pashto, Persian, Ghanaian Pidgin English, Portuguese, Panjabi, Punjabi, Russian, Scottish Gaelic, Gaelic, Serbian, Romano-Serbian, Sinhala, Sinhalese, Somali, Spanish, Castilian, Swahili (individual language), Kiswahili, Tamil, Telugu, Thai, Tigrinya, Turkish, Ukrainian, Urdu, Uzbek, Vietnamese, Welsh, Yoruba

License

What is the license of the dataset?

cc-by-nc-sa-4.0: Creative Commons Attribution Non Commercial Share Alike 4.0 International

Intended Use

What is the intended use of the dataset?

Abstractive summarization has centered around the English language, as most large abstractive summarization datasets are available in English only. Though there have been some recent efforts for curating multilingual abstractive summarization datasets, they are limited in terms of the number of languages covered, the number of training samples, or both. To this end, XL-Sum presents a large-scale abstractive summarization dataset of 1.35 million news articles from 45 languages crawled from the British Broadcasting Corporation website. It is intended to be used for both multilingual and per-language summarization tasks.

Primary Task

What primary task does the dataset support?

Summarization

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Summarize news-like text in one of 45 languages.

Credit

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

academic

Curation Organization(s)

Name the organization(s).

Bangladesh University of Engineering and Technology

Who added the Dataset to GEM?

Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.

Tahmid Hasan (Bangladesh University of Engineering and Technology), Abhik Bhattacharjee (Bangladesh University of Engineering and Technology)

Dataset Structure

Data Fields

List and describe the fields present in the dataset.

gem_id: A string representing the article ID.
url: A string representing the article URL.
title: A string containing the article title.
summary: A string containing the article summary.
text: A string containing the article text.

Example Instance

Provide a JSON formatted example of a typical instance in the dataset.

{
"gem_id": "GEM-xlsum_english-train-1589",
"url": "https://www.bbc.com/news/technology-17657859",
"title": "Yahoo files e-book advert system patent applications",
"summary": "Yahoo has signalled it is investigating e-book adverts as a way to stimulate its earnings.",
"text": "Yahoo's patents suggest users could weigh the type of ads against the sizes of discount before purchase. It says in two US patent applications that ads for digital book readers have been \"less than optimal\" to date. The filings suggest that users could be offered titles at a variety of prices depending on the ads' prominence They add that the products shown could be determined by the type of book being read, or even the contents of a specific chapter, phrase or word. The paperwork was published by the US Patent and Trademark Office late last week and relates to work carried out at the firm's headquarters in Sunnyvale, California. \"Greater levels of advertising, which may be more valuable to an advertiser and potentially more distracting to an e-book reader, may warrant higher discounts,\" it states. Free books It suggests users could be offered ads as hyperlinks based within the book's text, in-laid text or even \"dynamic content\" such as video. Another idea suggests boxes at the bottom of a page could trail later chapters or quotes saying \"brought to you by Company A\". It adds that the more willing the customer is to see the ads, the greater the potential discount. \"Higher frequencies... may even be great enough to allow the e-book to be obtained for free,\" it states. The authors write that the type of ad could influence the value of the discount, with \"lower class advertising... such as teeth whitener advertisements\" offering a cheaper price than \"high\" or \"middle class\" adverts, for things like pizza. The inventors also suggest that ads could be linked to the mood or emotional state the reader is in as a they progress through a title. For example, they say if characters fall in love or show affection during a chapter, then ads for flowers or entertainment could be triggered. The patents also suggest this could applied to children's books - giving the Tom Hanks animated film Polar Express as an example. 
It says a scene showing a waiter giving the protagonists hot drinks \"may be an excellent opportunity to show an advertisement for hot cocoa, or a branded chocolate bar\". Another example states: \"If the setting includes young characters, a Coke advertisement could be provided, inviting the reader to enjoy a glass of Coke with his book, and providing a graphic of a cool glass.\" It adds that such targeting could be further enhanced by taking account of previous titles the owner has bought. 'Advertising-free zone' At present, several Amazon and Kobo e-book readers offer full-screen adverts when the device is switched off and show smaller ads on their menu screens, but the main text of the titles remains free of marketing. Yahoo does not currently provide ads to these devices, and a move into the area could boost its shrinking revenues. However, Philip Jones, deputy editor of the Bookseller magazine, said that the internet firm might struggle to get some of its ideas adopted. \"This has been mooted before and was fairly well decried,\" he said. \"Perhaps in a limited context it could work if the merchandise was strongly related to the title and was kept away from the text. \"But readers - particularly parents - like the fact that reading is an advertising-free zone. Authors would also want something to say about ads interrupting their narrative flow.\""
}
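The gem_id in the example above appears to encode the language, the split, and a running index. The following is a minimal sketch of how such an ID could be taken apart, assuming the pattern "GEM-xlsum_<language>-<split>-<index>" inferred from this single example; the helper name parse_gem_id and the GemId type are hypothetical and not part of the dataset or its loader.

```python
from typing import NamedTuple


class GemId(NamedTuple):
    language: str
    split: str
    index: int


def parse_gem_id(gem_id: str) -> GemId:
    """Split an ID such as 'GEM-xlsum_english-train-1589' into its parts.

    The pattern is an assumption inferred from one example instance.
    """
    prefix = "GEM-xlsum_"
    if not gem_id.startswith(prefix):
        raise ValueError(f"unexpected gem_id: {gem_id!r}")
    # Split from the right so language names that contain underscores
    # (e.g. chinese_simplified) are kept whole.
    language, split, index = gem_id[len(prefix):].rsplit("-", 2)
    return GemId(language, split, int(index))


parsed = parse_gem_id("GEM-xlsum_english-train-1589")
print(parsed.language, parsed.split, parsed.index)
```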

Data Splits

Describe and name the splits in the dataset if there are more than one.

The splits in the dataset are specified by the language names, which are as follows:

amharic
arabic
azerbaijani
bengali
burmese
chinese_simplified
chinese_traditional
english
french
gujarati
hausa
hindi
igbo
indonesian
japanese
kirundi
korean
kyrgyz
marathi
nepali
oromo
pashto
persian
pidgin
portuguese
punjabi
russian
scottish_gaelic
serbian_cyrillic
serbian_latin
sinhala
somali
spanish
swahili
tamil
telugu
thai
tigrinya
turkish
ukrainian
urdu
uzbek
vietnamese
welsh
yoruba
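As a rough sketch, one of the language names listed above can be passed as the configuration name when loading the dataset. The helper load_language is hypothetical; it assumes that the GEM/xlsum loader accepts these names as the second argument of datasets.load_dataset, and it needs network access to actually download data.

```python
# Language names copied from the list above.
LANGUAGE_CONFIGS = [
    "amharic", "arabic", "azerbaijani", "bengali", "burmese",
    "chinese_simplified", "chinese_traditional", "english", "french",
    "gujarati", "hausa", "hindi", "igbo", "indonesian", "japanese",
    "kirundi", "korean", "kyrgyz", "marathi", "nepali", "oromo", "pashto",
    "persian", "pidgin", "portuguese", "punjabi", "russian",
    "scottish_gaelic", "serbian_cyrillic", "serbian_latin", "sinhala",
    "somali", "spanish", "swahili", "tamil", "telugu", "thai", "tigrinya",
    "turkish", "ukrainian", "urdu", "uzbek", "vietnamese", "welsh", "yoruba",
]


def load_language(name, split=None):
    """Load one language of GEM/xlsum (hypothetical convenience wrapper)."""
    if name not in LANGUAGE_CONFIGS:
        raise ValueError(f"unknown language name: {name!r}")
    # Imported lazily so the name check works without the library installed.
    import datasets
    return datasets.load_dataset("GEM/xlsum", name, split=split)


# Example (downloads data, so it is left commented out):
# english = load_language("english")
```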

Splitting Criteria

Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g., if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.

We used an 80%-10%-10% split for all languages, with a few exceptions. English was split 93%-3.5%-3.5% so that the evaluation set size resembles that of CNN/DM and XSum. Scottish Gaelic, Kyrgyz, and Sinhala had relatively few samples, so their evaluation sets were increased to 500 samples for more reliable evaluation. The same articles were used for evaluation in the two variants of Chinese and Serbian to prevent data leakage in multilingual training. Individual dataset download links with train-dev-test example counts are given below:

Language	ISO 639-1 Code	BBC subdomain(s)	Train	Dev	Test	Total
Amharic	am	BBC amharic	5761	719	719	7199
Arabic	ar	BBC arabic	37519	4689	4689	46897
Azerbaijani	az	BBC azeri	6478	809	809	8096
Bengali	bn	BBC bengali	8102	1012	1012	10126
Burmese	my	BBC burmese	4569	570	570	5709
Chinese (Simplified)	zh-CN	BBC ukchina/simp, BBC zhongwen/simp	37362	4670	4670	46702
Chinese (Traditional)	zh-TW	BBC ukchina/trad, BBC zhongwen/trad	37373	4670	4670	46713
English	en	BBC english, BBC sinhala `*`	306522	11535	11535	329592
French	fr	BBC afrique	8697	1086	1086	10869
Gujarati	gu	BBC gujarati	9119	1139	1139	11397
Hausa	ha	BBC hausa	6418	802	802	8022
Hindi	hi	BBC hindi	70778	8847	8847	88472
Igbo	ig	BBC igbo	4183	522	522	5227
Indonesian	id	BBC indonesia	38242	4780	4780	47802
Japanese	ja	BBC japanese	7113	889	889	8891
Kirundi	rn	BBC gahuza	5746	718	718	7182
Korean	ko	BBC korean	4407	550	550	5507
Kyrgyz	ky	BBC kyrgyz	2266	500	500	3266
Marathi	mr	BBC marathi	10903	1362	1362	13627
Nepali	ne	BBC nepali	5808	725	725	7258
Oromo	om	BBC afaanoromoo	6063	757	757	7577
Pashto	ps	BBC pashto	14353	1794	1794	17941
Persian	fa	BBC persian	47251	5906	5906	59063
Pidgin`**`	pcm	BBC pidgin	9208	1151	1151	11510
Portuguese	pt	BBC portuguese	57402	7175	7175	71752
Punjabi	pa	BBC punjabi	8215	1026	1026	10267
Russian	ru	BBC russian, BBC ukrainian `*`	62243	7780	7780	77803
Scottish Gaelic	gd	BBC naidheachdan	1313	500	500	2313
Serbian (Cyrillic)	sr	BBC serbian/cyr	7275	909	909	9093
Serbian (Latin)	sr	BBC serbian/lat	7276	909	909	9094
Sinhala	si	BBC sinhala	3249	500	500	4249
Somali	so	BBC somali	5962	745	745	7452
Spanish	es	BBC mundo	38110	4763	4763	47636
Swahili	sw	BBC swahili	7898	987	987	9872
Tamil	ta	BBC tamil	16222	2027	2027	20276
Telugu	te	BBC telugu	10421	1302	1302	13025
Thai	th	BBC thai	6616	826	826	8268
Tigrinya	ti	BBC tigrinya	5451	681	681	6813
Turkish	tr	BBC turkce	27176	3397	3397	33970
Ukrainian	uk	BBC ukrainian	43201	5399	5399	53999
Urdu	ur	BBC urdu	67665	8458	8458	84581
Uzbek	uz	BBC uzbek	4728	590	590	5908
Vietnamese	vi	BBC vietnamese	32111	4013	4013	40137
Welsh	cy	BBC cymrufyw	9732	1216	1216	12164
Yoruba	yo	BBC yoruba	6350	793	793	7936

* A lot of articles in BBC Sinhala and BBC Ukrainian were written in English and Russian respectively. They were identified using Fasttext and moved accordingly.

** West African Pidgin English
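The split percentages stated under Splitting Criteria can be checked against the counts in the table. The short sketch below recomputes them for three rows copied from the table (English, Arabic, Kyrgyz); the function name split_percentages is only illustrative.

```python
# (train, dev, test) counts copied from the table above.
counts = {
    "english": (306522, 11535, 11535),
    "arabic": (37519, 4689, 4689),
    "kyrgyz": (2266, 500, 500),
}


def split_percentages(train, dev, test):
    """Return the train/dev/test shares in percent, rounded to one decimal."""
    total = train + dev + test
    return tuple(round(100 * n / total, 1) for n in (train, dev, test))


for language, (train, dev, test) in counts.items():
    print(language, split_percentages(train, dev, test))
```

English comes out at roughly 93/3.5/3.5 and Arabic at roughly 80/10/10, while Kyrgyz deviates from 80/10/10 because its evaluation sets were fixed at 500 samples.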

Dataset in GEM

Rationale for Inclusion in GEM

Why is the Dataset in GEM?

What does this dataset contribute toward better generation evaluation and why is it part of GEM?

Traditional abstractive text summarization has been centered around English and other high-resource languages. XL-Sum provides a large collection of high-quality article-summary pairs for 45 languages where the languages range from high-resource to extremely low-resource. This enables the research community to explore the summarization capabilities of different models for multiple languages and languages in isolation. We believe the addition of XL-Sum to GEM makes the domain of abstractive text summarization more diversified and inclusive to the research community. We hope our efforts in this work will encourage the community to push the boundaries of abstractive text summarization beyond the English language, especially for low and mid-resource languages, bringing technological advances to communities of these languages that have been traditionally under-served.

Similar Datasets

Do other datasets for the high level task exist?

yes

Unique Language Coverage

Does this dataset cover other languages than other datasets for the same task?

yes

Difference from other GEM datasets

What else sets this dataset apart from other similar datasets in GEM?

The summaries are highly concise and abstractive.

Ability that the Dataset measures

What aspect of model ability can be measured with this dataset?

Conciseness, abstractiveness, and overall summarization capability.

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

no

Additional Splits?

Does GEM provide additional splits to the dataset?

no

Getting Started with the Task

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

Conciseness, abstractiveness, and overall summarization capability.

Metrics

What metrics are typically used for this task?

ROUGE

Proposed Evaluation

List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

ROUGE is the de facto evaluation metric for text summarization. However, it was designed specifically for evaluating English texts, and its scores depend heavily on text tokenization, stemming, removal of unnecessary characters, and similar preprocessing. Some modifications were therefore made to the original ROUGE evaluation, such as removing only punctuation and using language-specific tokenization and stemming, to enable reliable comparison of source and target summaries across different scripts.
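To illustrate what "removing only punctuation" can mean for non-Latin scripts, here is a minimal sketch that drops characters by Unicode category rather than by an ASCII-only pattern. It is not the authors' implementation: the function names are hypothetical, and whitespace splitting is only a stand-in for the language-specific tokenizers and stemmers mentioned above.

```python
import unicodedata


def strip_punctuation(text):
    """Replace every Unicode punctuation character (category P*) with a space."""
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )


def simple_tokens(text):
    """Lowercase, remove punctuation, and split on whitespace.

    Whitespace splitting is an assumption for illustration only; it does not
    work for scripts written without spaces.
    """
    return strip_punctuation(text).lower().split()


print(simple_tokens("Hello, world!"))
print(simple_tokens("आज मौसम अच्छा है।"))  # the danda (।) is treated as punctuation
```

An ASCII-only filter such as keeping just [a-z0-9] would instead delete the entire second sentence, which is why script-aware preprocessing matters for a multilingual metric.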

Previous results available?

Are previous results available?

no

Dataset Curation

Original Curation

Original Curation Rationale

Original curation rationale

State-of-the-art text summarization models are heavily data-driven, i.e., a large number of article-summary pairs are required to train them effectively. As a result, abstractive summarization has centered around the English language, as most large abstractive summarization datasets are available in English only. Though there have been some recent efforts for curating multilingual abstractive summarization datasets, they are limited in terms of the number of languages covered, the number of training samples, or both. To this end, we curate XL-Sum, a large-scale abstractive summarization dataset of 1.35 million news articles from 45 languages crawled from the British Broadcasting Corporation website.

Communicative Goal

What was the communicative goal?

Introduce new languages in the English-centric domain of abstractive text summarization and enable both multilingual and per-language summarization.

Sourced from Different Sources

Is the dataset aggregated from different data sources?

yes

Source Details

List the sources (one per line)

British Broadcasting Corporation (BBC) news websites.

Language Data

How was Language Data Obtained?

How was the language data obtained?

Found

Where was it found?

If found, where from?

Multiple websites

Language Producers

What further information do we have on the language producers?

The language content was written by professional news editors hired by BBC.

Topics Covered

Does the language in the dataset focus on specific topics? How would you describe them?

News

Data Validation

Was the text validated by a different worker or a data curator?

not validated

Data Preprocessing

How was the text data pre-processed? (Enter N/A if the text was not pre-processed)

We used 'NFKC' normalization on all text instances.
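For readers unfamiliar with NFKC, the sketch below shows what this normalization does using Python's standard unicodedata module. The helper name normalize_text is hypothetical, and the source does not say which tool the curators used.

```python
import unicodedata


def normalize_text(text):
    """Apply Unicode NFKC normalization (compatibility decomposition, then composition)."""
    return unicodedata.normalize("NFKC", text)


# Compatibility characters are folded into their plain equivalents.
print(normalize_text("\ufb01le"))            # ligature "fi" followed by "le"
print(normalize_text("\uff22\uff22\uff23"))  # full-width Latin letters
```

The first call yields "file" and the second "BBC", so visually similar strings from different BBC sites end up with the same code points.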

Was Data Filtered?

Were text instances selected or filtered?

algorithmically

Filter Criteria

What were the selection criteria?

We designed a crawler to recursively crawl pages starting from the homepage by visiting different article links present in each page visited. We were able to take advantage of the fact that all BBC sites have somewhat similar structures, and were able to scrape articles from all sites. We discarded pages with no textual contents (mostly pages consisting of multimedia contents) before further processing. We designed a number of heuristics to make the extraction effective by carefully examining the HTML structures of the crawled pages:

The desired summary must be present within the beginning two paragraphs of an article.
The summary paragraph must have some portion of texts in bold format.
The summary paragraph may contain some hyperlinks that may not be bold. The proportion of bold texts and hyperlinked texts to the total length of the paragraph in consideration must be at least 95%.
All texts except the summary and the headline must be included in the input text (including image captions).
The input text must be at least twice as large as the summary.
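The heuristics above can be sketched as simple checks. This is a hypothetical illustration, not the crawler's code: the Paragraph class, its character counts, and the choice to measure length in characters are all assumptions, and bold and hyperlinked spans are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    bold_chars: int  # number of characters in bold
    link_chars: int  # number of hyperlinked characters that are not bold


def looks_like_summary(paragraph, position):
    """Check the summary heuristics for the paragraph at a 0-based position."""
    if position >= 2:  # must be within the first two paragraphs
        return False
    if not paragraph.text or paragraph.bold_chars == 0:  # needs some bold text
        return False
    covered = paragraph.bold_chars + paragraph.link_chars
    return covered / len(paragraph.text) >= 0.95


def input_long_enough(input_text, summary):
    """The input text must be at least twice as large as the summary."""
    return len(input_text) >= 2 * len(summary)


candidate = Paragraph(text="x" * 100, bold_chars=90, link_chars=6)
print(looks_like_summary(candidate, position=0))
print(input_long_enough("a" * 20, "b" * 10))
```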

Structured Annotations

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Annotation Service?

Was an annotation service used?

no

Consent

Any Consent Policy?

Was there a consent policy involved when gathering the data?

yes

Consent Policy Details

What was the consent policy?

BBC's policy specifies that the text content within its websites can be used for non-commercial research only.

Private Identifying Information (PII)

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

likely

Categories of PII

What categories of PII are present or suspected in the data?

generic PII

Any PII Identification?

Did the curators use any automatic/manual method to identify PII in the dataset?

no identification

Maintenance

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved for example because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

yes

Details on how Dataset Addresses the Needs

Describe how this dataset addresses the needs of underserved communities.

This dataset introduces a summarization corpus for many languages for which no comparable dataset had been curated before.

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

Yes

Considerations for Using the Data

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describe the copyright and licensing status of the dataset?

research use only, non-commercial use only

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describe the copyright and licensing status of the underlying language data?

research use only, non-commercial use only

Known Technical Limitations

Technical Limitations

Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible.

Human evaluation showed that in most languages a high percentage of summaries (in the upper nineties) were good, and almost none contained information conflicting with the article, while about one-third on average contained information that was not directly inferable from the source article. Because multiple articles are generally written about an important event, the training and evaluation data may overlap in terms of content.

Unsuited Applications

When using a model trained on this dataset in a setting where users or the public may interact with its predictions, what are some pitfalls to look out for? In particular, describe some applications of the general task featured in this dataset that its curation or properties make it less suitable for.

The dataset is limited to the news domain. Hence, it would not be advisable to use a model trained on this dataset to summarize texts from a different domain, e.g., literature or scientific text. Another pitfall could be hallucinations in the model-generated summary.

Discouraged Use Cases

What are some discouraged use cases of a model trained to maximize the proposed metrics on this dataset? In particular, think about settings where decisions made by a model that performs reasonably well on the metric may still have strong negative consequences for users or members of the public.

ROUGE evaluates the quality of the summary as a whole by considering up to 4-gram overlaps. Therefore, in an article about India if the word "India" in the generated summary gets replaced by "Pakistan" due to model hallucination, the overall score wouldn't be reduced significantly, but the entire meaning could get changed.
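The effect described above can be made concrete with a simplified n-gram overlap score. The sketch below is not the official ROUGE implementation or the leaderboard's scorer: it computes a plain n-gram F1 on whitespace tokens, and the reference and candidate sentences are invented for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(reference, candidate, n=1):
    """Simplified ROUGE-N F1 on lowercased, whitespace-split text."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example sentences (not taken from the dataset).
reference = "india announces new budget for rural health care"
hallucinated = "pakistan announces new budget for rural health care"

print(rouge_n_f1(reference, hallucinated, n=1))  # 7 of 8 unigrams match
print(rouge_n_f1(reference, hallucinated, n=2))  # 6 of 7 bigrams match
```

A single swapped country name leaves the unigram score at 0.875 and the bigram score near 0.86, even though the summary now states something false.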

xlsum

website

paper

Quick-Use

Contact Name If known, provide the name of at least one person the reader can contact for questions about the dataset.

Multilingual? Is the dataset multilingual?

Covered Languages What languages/dialects are covered in the dataset?

License What is the license of the dataset?

Communicative Goal Provide a short description of the communicative goal of a model trained for this task on this dataset.

Additional Annotations? Does the dataset have additional annotations for each instance?

Contains PII? Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

Dataset Overview Where to find the Data and its Documentation Languages and Intended Use Credit Dataset Structure

Where to find the Data and its Documentation

Languages and Intended Use

Credit

Dataset Structure

Where to find the Data and its Documentation

Webpage What is the webpage for the dataset (if it exists)?

Download What is the link to where the original dataset is hosted?

Paper What is the link to the paper describing the dataset (open access preferred)?

BibTex Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.

Contact Name If known, provide the name of at least one person the reader can contact for questions about the dataset.

Contact Email If known, provide the email of at least one person the reader can contact for questions about the dataset.

Has a Leaderboard? Does the dataset have an active leaderboard?

Leaderboard Link Provide a link to the leaderboard.

Leaderboard Details Briefly describe how the leaderboard evaluates models.

Languages and Intended Use

Multilingual? Is the dataset multilingual?

Covered Languages What languages/dialects are covered in the dataset?

License What is the license of the dataset?

Intended Use What is the intended use of the dataset?

Primary Task What primary task does the dataset support?

Communicative Goal Provide a short description of the communicative goal of a model trained for this task on this dataset.

Credit

Curation Organization Type(s) In what kind of organization did the dataset curation happen?

Curation Organization(s) Name the organization(s).

Who added the Dataset to GEM? Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.

Dataset Structure

Data Fields List and describe the fields present in the dataset.

Example Instance Provide a JSON formatted example of a typical instance in the dataset.

Data Splits Describe and name the splits in the dataset if there are more than one.

Dataset in GEM Rationale for Inclusion in GEM GEM-Specific Curation Getting Started with the Task

Rationale for Inclusion in GEM

GEM-Specific Curation

Getting Started with the Task

Rationale for Inclusion in GEM

Why is the Dataset in GEM? What does this dataset contribute toward better generation evaluation and why is it part of GEM?

Similar Datasets Do other datasets for the high level task exist?

Unique Language Coverage Does this dataset cover other languages than other datasets for the same task?

Difference from other GEM datasets What else sets this dataset apart from other similar datasets in GEM?

Ability that the Dataset measures What aspect of model ability can be measured with this dataset?

GEM-Specific Curation

Modificatied for GEM? Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

Additional Splits? Does GEM provide additional splits to the dataset?

Getting Started with the Task

Previous Results Previous Results

Previous Results

Previous Results

Measured Model Abilities What aspect of model ability can be measured with this dataset?

Metrics What metrics are typically used for this task?

Proposed Evaluation List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

Previous results available? Are previous results available?

Dataset Curation Original Curation Language Data Structured Annotations Consent Private Identifying Information (PII) Maintenance

Original Curation

Language Data

Structured Annotations

Consent

Private Identifying Information (PII)

Maintenance

Original Curation

Original Curation Rationale Original curation rationale

Communicative Goal What was the communicative goal?

Sourced from Different Sources Is the dataset aggregated from different data sources?

Source Details List the sources (one per line)

Language Data

How was Language Data Obtained? How was the language data obtained?

Where was it found? If found, where from?

Language Producers What further information do we have on the language producers?

Topics Covered Does the language in the dataset focus on specific topics? How would you describe them?

Data Validation Was the text validated by a different worker or a data curator?

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Multilingual?

Is the dataset multilingual?

Covered Languages

What languages/dialects are covered in the dataset?

License

What is the license of the dataset?

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Additional Annotations?

Does the dataset have additional annotations for each instance?

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

Dataset Overview

Where to find the Data and its Documentation

Languages and Intended Use

Credit

Dataset Structure

Webpage

What is the webpage for the dataset (if it exists)?

Download

What is the link to where the original dataset is hosted?

Paper

What is the link to the paper describing the dataset (open access preferred)?

BibTex

Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

Has a Leaderboard?

Does the dataset have an active leaderboard?

Leaderboard Link

Provide a link to the leaderboard.

Leaderboard Details

Briefly describe how the leaderboard evaluates models.

Multilingual?

Is the dataset multilingual?

Covered Languages

What languages/dialects are covered in the dataset?

License

What is the license of the dataset?

Intended Use

What is the intended use of the dataset?

Primary Task

What primary task does the dataset support?

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

Curation Organization(s)

Name the organization(s).

Who added the Dataset to GEM?

Who contributed to the data card and to adding the dataset to GEM? List the people and affiliations involved in creating this data card and those who helped integrate this dataset into GEM.

Data Fields

List and describe the fields present in the dataset.

Example Instance

Provide a JSON formatted example of a typical instance in the dataset.

Data Splits

Describe and name the splits in the dataset if there are more than one.

Dataset in GEM

Rationale for Inclusion in GEM

GEM-Specific Curation

Getting Started with the Task

Why is the Dataset in GEM?

What does this dataset contribute toward better generation evaluation and why is it part of GEM?

Similar Datasets

Do other datasets for the high level task exist?

Unique Language Coverage

Does this dataset cover languages that other datasets for the same task do not?

Difference from other GEM datasets

What else sets this dataset apart from other similar datasets in GEM?

Ability that the Dataset measures

What aspect of model ability can be measured with this dataset?

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

Additional Splits?

Does GEM provide additional splits to the dataset?

Previous Results

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

Metrics

What metrics are typically used for this task?

Proposed Evaluation

List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

Previous results available?

Are previous results available?

Dataset Curation

Original Curation

Language Data

Structured Annotations

Consent

Private Identifying Information (PII)

Maintenance

Original Curation Rationale

Original curation rationale

Communicative Goal

What was the communicative goal?

Sourced from Different Sources

Is the dataset aggregated from different data sources?

Source Details

List the sources (one per line)

How was Language Data Obtained?

How was the language data obtained?

Where was it found?

If found, where from?

Language Producers

What further information do we have on the language producers?

Topics Covered

Does the language in the dataset focus on specific topics? How would you describe them?

Data Validation

Was the text validated by a different worker or a data curator?

Data Preprocessing

How was the text data pre-processed? (Enter N/A if the text was not pre-processed)

Was Data Filtered?

Were text instances selected or filtered?

Filter Criteria

What were the selection criteria?

Additional Annotations?

Does the dataset have additional annotations for each instance?

Annotation Service?

Was an annotation service used?

Any Consent Policy?

Was there a consent policy involved when gathering the data?

Consent Policy Details

What was the consent policy?

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

Categories of PII

What categories of PII are present or suspected in the data?

Any PII Identification?

Did the curators use any automatic/manual method to identify PII in the dataset?

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

Broader Social Context

Previous Work on the Social Impact of the Dataset

Impact on Under-Served Communities

Discussion of Biases

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

Details on how Dataset Addresses the Needs

Describe how this dataset addresses the needs of underserved communities.

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

Considerations for Using the Data

PII Risks and Liability

Licenses

Known Technical Limitations

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?

Technical Limitations

Describe any known technical limitations, such as spurious correlations, train/test overlap, annotation biases, or mis-annotations, and cite the works that first identified these limitations when possible.