xsum

Summarization

XSum is an English news summarization dataset where the task is to predict the first sentence of an article from the rest of it.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/xsum')

The data loader can be found here.

website

n/a

authors

Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with University of Edinburgh at the time of dataset creation)

Quick-Use

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Shashi Narayan

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

English

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Given a news article, produce a single sentence summary of the content of the article.

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

yes/very likely

Dataset Overview

  • Where to find the Data and its Documentation
  • Languages and Intended Use
  • Credit
  • Dataset Structure

Where to find the Data and its Documentation

Download

What is the link to where the original dataset is hosted?

GitHub

Paper

What is the link to the paper describing the dataset (open access preferred)?

ACL Anthology

BibTeX

Provide the BibTeX-formatted reference for the dataset. Please use the correct published version (ACL Anthology, etc.) instead of a BibTeX entry generated by Google Scholar.

@InProceedings{xsum-emnlp,
  author =      "Shashi Narayan and Shay B. Cohen and Mirella Lapata",
  title =       "Don't Give Me the Details, Just the Summary! {T}opic-Aware Convolutional Neural Networks for Extreme Summarization",
  booktitle =   "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year =        "2018",
  address =     "Brussels, Belgium",
}

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Shashi Narayan

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

shashinarayan@google.com

Has a Leaderboard?

Does the dataset have an active leaderboard?

no

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

no

Covered Dialects

What dialects are covered? Are there multiple dialects per language?

Since the dataset is sourced from BBC articles, the language is British English as written by journalists.

Covered Languages

What languages/dialects are covered in the dataset?

English

Whose Language?

Whose language is in the dataset?

Professional journalists

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

What is the intended use of the dataset?

The dataset is for the task of abstractive summarization in its extreme form: summarizing a document in a single sentence. The idea is to create a short, one-sentence news summary answering the question "What is the article about?".

Primary Task

What primary task does the dataset support?

Summarization

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Given a news article, produce a single sentence summary of the content of the article.

Credit

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

academic

Curation Organization(s)

Name the organization(s).

University of Edinburgh

Dataset Creators

Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).

Shashi Narayan, Shay B. Cohen, Mirella Lapata (all affiliated with University of Edinburgh at the time of dataset creation)

Funding

Who funded the data creation?

European Research Council (Lapata; award number 681760), the European Union under the Horizon 2020 SUMMA project (Narayan, Cohen; grant agreement 688139), and Huawei Technologies (Cohen).

Who added the Dataset to GEM?

Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.

The original data card was written by Laura Perez-Beltrachini and the data loader by Yacine Jernite. Sebastian Gehrmann migrated the data card to the new format and extended it. The v2 data loader was migrated by Abinaya Mahendiran.

Dataset Structure

Data Fields

List and describe the fields present in the dataset.

  • Document: Input news article.
  • Summary: One sentence summary of the article.
  • Id: BBC ID of the article.

Reason for Structure

How was the dataset structure determined?

The Document/Summary format is standard for summarization datasets.

How were labels chosen?

How were the labels chosen?

The labels are the first sentence of the source article.
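As a rough illustration of how a first-sentence label relates to the rest of an article, the sketch below splits a plain-text article into its lead sentence and the remaining text. The function name `split_lead_sentence` and the regex-based sentence boundary are assumptions made for this example; they are not the curators' extraction code.

```python
import re

# Naive sentence boundary (an assumption for illustration): sentence-final
# punctuation followed by whitespace and an uppercase letter or a quote.
_SENTENCE_END = re.compile(r'(?<=[.!?])\s+(?=[A-Z"\'])')


def split_lead_sentence(article: str) -> tuple[str, str]:
    """Return (first sentence, remaining text) for a plain-text article."""
    parts = _SENTENCE_END.split(article.strip(), maxsplit=1)
    lead = parts[0]
    rest = parts[1] if len(parts) > 1 else ""
    return lead, rest


article = (
    "A team of UK scientists hopes to shed light on bleeding canker. "
    "The researchers have sequenced the genome of a bacterium. "
    "The findings have been published in the journal PLoS One."
)
summary, document = split_lead_sentence(article)
```

A boundary rule this simple will mis-split on abbreviations such as "Mr. Smith", so it should be read as a description of the label definition rather than a robust sentence splitter.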

Example Instance

Provide a JSON formatted example of a typical instance in the dataset.

{
  "document": "The researchers have sequenced the genome of a strain of bacterium that causes the virulent infection.\nA survey in 2007 showed that bleeding canker had spread rapidly, with almost half of the two million horse chestnuts displaying symptoms of the disease.\nThe findings have been published in the journal PLoS One.\nA visible symptom of the disease is a lesion on the bark, which oozes a resin on to the trunk or sometimes the branches.\nThe bark underneath the canker is killed, and if cankers manage to go all the way around the trunk then the horse chestnut (Aesculus hippocastanum) will die because it cuts off the food supply. [...]",
  "target": "A team of UK scientists hopes to shed light on the mysteries of bleeding canker, a disease that is threatening the nation's horse chestnut trees."
}

Data Splits

Describe and name the splits in the dataset if there are more than one.

Section       Number of Documents
Training      204,045
Validation    11,332
Testing       11,334
Total         226,711

Section       Average number of words    Average number of sentences
Documents     431.07                     19.77
Summary       23.26                      1.00

Splitting Criteria

Describe any criteria for splitting the data, if used. If there are differences between the splits (e.g., if the training annotations are machine-generated and the dev and test ones are created by humans, or if different numbers of annotators contributed to each example), describe them here.

The identifiers in the URLs were used to randomly split the dataset into training (90%, 204,045), validation (5%, 11,332), and test (5%, 11,334) sets.
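The card does not spell out the exact procedure used to map identifiers to splits. One common way to obtain a reproducible 90/5/5 split from article identifiers is to hash each identifier into a bucket, as in the sketch below. The function `assign_split` and the choice of SHA-256 are assumptions for illustration only, not the curators' method. The second half of the block checks the proportions implied by the split sizes reported above.

```python
import hashlib


def assign_split(article_id: str, train: float = 0.90, validation: float = 0.05) -> str:
    """Deterministically map an identifier to 'train', 'validation' or 'test'."""
    digest = hashlib.sha256(article_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < train:
        return "train"
    if bucket < train + validation:
        return "validation"
    return "test"


# Proportions implied by the split sizes reported in this card.
sizes = {"train": 204_045, "validation": 11_332, "test": 11_334}
total = sum(sizes.values())  # 226,711 articles in total
fractions = {name: count / total for name, count in sizes.items()}
```

Because the assignment depends only on the identifier, re-running the split on the same articles always yields the same partition, which is the main reason to split on identifiers instead of on list position.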

Dataset Curation
  • Original Curation

  • Language Data

  • Structured Annotations

  • Consent

  • Private Identifying Information (PII)

  • Maintenance

Original Curation

Original Curation Rationale

Original curation rationale

Comparable datasets are often very extractive, a strategy that does not work for one-sentence summaries. The dataset curators thus created this dataset as a way to evaluate truly abstractive models.
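One simple way to make "extractive" versus "abstractive" concrete is to measure the share of summary n-grams that never appear in the source document. The sketch below is a minimal version of such a measure, assuming whitespace tokenization and lowercasing; the function names are hypothetical and the code is not taken from the dataset's release.

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """Return the set of distinct n-grams in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_ratio(document: str, summary: str, n: int = 1) -> float:
    """Fraction of distinct summary n-grams that do not occur in the document."""
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    document_ngrams = ngrams(document.lower().split(), n)
    return len(summary_ngrams - document_ngrams) / len(summary_ngrams)


doc = "the cat sat on the mat"
copied = novel_ngram_ratio(doc, "the cat sat", n=2)       # every bigram is in the document
rewritten = novel_ngram_ratio(doc, "the cat slept", n=2)  # one of two bigrams is new
```

A ratio near 0 means the summary could be assembled by copying spans from the document, while higher values indicate more rewording.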

Communicative Goal

What was the communicative goal?

Same as the communicative goal in GEM: a model should summarize a news article in a single sentence.

Sourced from Different Sources

Is the dataset aggregated from different data sources?

no

Language Data

How was Language Data Obtained?

How was the language data obtained?

Found

Where was it found?

If found, where from?

Single website

Language Producers

What further information do we have on the language producers?

The data was collected from articles published between 2010 and 2017. No further information is available.

Topics Covered

Does the language in the dataset focus on specific topics? How would you describe them?

The collected articles included the following topics: News, Politics, Sports, Weather, Business, Technology, Science, Health, Family, Education, Entertainment and Arts.

The dataset curators also used LDA to gain insight into this question and found that the following were the top keywords associated with each topic:

  • T1: charge, court, murder, police, arrest, guilty, sentence, boy, bail, space, crown, trial
  • T2: church, abuse, bishop, child, catholic, gay, pope, school, christian, priest, cardinal
  • T3: council, people, government, local, housing, home, house, property, city, plan, authority
  • T4: clinton, party, trump, climate, poll, vote, plaid, election, debate, change, candidate, campaign
  • T5: country, growth, report, business, export, fall, bank, security, economy, rise, global, inflation
  • T6: hospital, patient, trust, nhs, people, care, health, service, staff, report, review, system, child

Data Validation

Was the text validated by a different worker or a data curator?

not validated

Data Preprocessing

How was the text data pre-processed? (Enter N/A if the text was not pre-processed)

The text was extracted from the HTML of the webpage. No further processing was done.
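To give a sense of what extracting article text from HTML can look like, the sketch below collects the text of `<p>` elements with Python's standard-library `html.parser`. The class `ParagraphExtractor`, the function `extract_text`, and the assumption that the article body lives in `<p>` tags are illustrative only; the curators' actual extraction scripts may work differently.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the visible text of each <p> element, one string per paragraph."""

    def __init__(self) -> None:
        super().__init__()
        self.paragraphs: list[str] = []
        self._buffer: list[str] = []
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_paragraph:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._in_paragraph = False

    def handle_data(self, data):
        # Text outside <p> (scripts, styles, navigation) is ignored.
        if self._in_paragraph:
            self._buffer.append(data)


def extract_text(html: str) -> str:
    """Return paragraph texts joined by newlines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)


page = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>First <b>bold</b> sentence.</p><script>var x = 1;</script>"
    "<p>Second   paragraph.</p></body></html>"
)
text = extract_text(page)
```

Joining paragraphs with newlines mirrors the `\n`-separated `document` field shown in the example instance.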

Was Data Filtered?

Were text instances selected or filtered?

not filtered

Structured Annotations

Additional Annotations?

Does the dataset have additional annotations for each instance?

none

Annotation Service?

Was an annotation service used?

no

Consent

Any Consent Policy?

Was there a consent policy involved when gathering the data?

no

Justification for Using the Data

If not, what is the justification for reusing the data?

The copyright license of the data allows reusing it for this purpose.

Private Identifying Information (PII)

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

yes/very likely

Categories of PII

What categories of PII are present or suspected in the data?

generic PII

Any PII Identification?

Did the curators use any automatic/manual method to identify PII in the dataset?

no identification

Maintenance

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

no

Broader Social Context

  • Previous Work on the Social Impact of the Dataset
  • Impact on Under-Served Communities
  • Discussion of Biases

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

unsure

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

The language and content of the data focus on UK news and British English, and as such are not representative of English speakers worldwide. The selection biases of the BBC are also present in this dataset.