xwikis

The XWikis Corpus provides datasets with different language pairs and directions for cross-lingual and multi-lingual abstractive document summarisation.

You can load the dataset via:

import datasets

# Fetch the GEM release of XWikis from the Hugging Face Hub (cached after the first call).
data = datasets.load_dataset('GEM/xwikis')

The data loader can be found here.

website

GitHub

authors

Laura Perez-Beltrachini (University of Edinburgh)

Quick-Use

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Laura Perez-Beltrachini

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

German, English, French, Czech

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Entity-descriptive summarisation: generating a summary that conveys the most salient facts of a document related to a given entity.

Additional Annotations?

Does the dataset have additional annotations for each instance?

found

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Dataset Overview
  • Where to find the Data and its Documentation

  • Languages and Intended Use

  • Credit

  • Dataset Structure

Where to find the Data and its Documentation

Webpage

What is the webpage for the dataset (if it exists)?

GitHub

Paper

What is the link to the paper describing the dataset (open access preferred)?

https://arxiv.org/abs/2202.09583

BibTeX

Provide the BibTeX-formatted reference for the dataset. Please use the correct published version (ACL Anthology, etc.) instead of a BibTeX entry generated by Google Scholar.

@InProceedings{clads-emnlp,
  author =    "Laura Perez-Beltrachini and Mirella Lapata",
  title =     "Models and Datasets for Cross-Lingual Summarisation",
  booktitle = "Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing",
  year =      "2021",
  address =   "Punta Cana, Dominican Republic",
}

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Laura Perez-Beltrachini

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

lperez@ed.ac.uk

Has a Leaderboard?

Does the dataset have an active leaderboard?

no

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

German, English, French, Czech

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

What is the intended use of the dataset?

Cross-lingual and multi-lingual abstractive summarisation of single long input documents.

Primary Task

What primary task does the dataset support?

Summarization

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Entity-descriptive summarisation: generating a summary that conveys the most salient facts of a document related to a given entity.

Credit

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

academic

Dataset Creators

Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).

Laura Perez-Beltrachini (University of Edinburgh)

Who added the Dataset to GEM?

Who contributed to the data card and to adding the dataset to GEM? List the people and affiliations involved in creating this data card and in integrating this dataset into GEM.

Laura Perez-Beltrachini (University of Edinburgh) and Ronald Cardenas (University of Edinburgh)

Dataset Structure

Data Splits

Describe and name the splits in the dataset if there are more than one.

Each language pair and direction has its own train/valid/test split. The test split is a sample of 7k titles drawn from the intersection of titles that exist in all four languages (cs, fr, en, de). Train and valid are split randomly.
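
The splitting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function name `make_splits`, the `valid_fraction` parameter, the seed, and the toy titles are all assumptions made for the example, and the sketch is keyed by language rather than by language pair and direction to keep it short.

```python
import random


def make_splits(titles_by_lang, test_size, valid_fraction, seed=0):
    """Hypothetical sketch: shared-title test set, random train/valid split."""
    rng = random.Random(seed)
    # Titles that exist in every language (cs, fr, en, de in the corpus).
    shared = set.intersection(*(set(t) for t in titles_by_lang.values()))
    # Sort before sampling so the result is reproducible for a given seed.
    test = set(rng.sample(sorted(shared), min(test_size, len(shared))))
    splits = {}
    for lang, titles in titles_by_lang.items():
        # Everything outside the test set is shuffled and cut into valid/train.
        rest = sorted(set(titles) - test)
        rng.shuffle(rest)
        n_valid = int(len(rest) * valid_fraction)
        splits[lang] = {
            "test": sorted(test),
            "valid": rest[:n_valid],
            "train": rest[n_valid:],
        }
    return splits


# Toy titles (made up for illustration); only "A" and "B" exist in all four languages.
toy = {
    "en": ["A", "B", "C", "D", "E"],
    "de": ["A", "B", "C", "F"],
    "fr": ["A", "B", "G"],
    "cs": ["A", "B", "C", "H"],
}
splits = make_splits(toy, test_size=1, valid_fraction=0.25)
```

Drawing the test titles from the intersection means the same entities are evaluated in every language, which is what makes results comparable across language pairs.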

Dataset in GEM
  • Rationale for Inclusion in GEM

  • GEM-Specific Curation

  • Getting Started with the Task

Rationale for Inclusion in GEM

Similar Datasets

Do other datasets for the high level task exist?

no

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

no

Additional Splits?

Does GEM provide additional splits to the dataset?

no

Getting Started with the Task

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

  • identification of entity salient information
  • translation
  • multi-linguality
  • cross-lingual transfer, zero-shot, few-shot

Metrics

What metrics are typically used for this task?

ROUGE

Previous results available?

Are previous results available?

yes

Other Evaluation Approaches

What evaluation approaches have others used?

ROUGE-1/2/L
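
For readers unfamiliar with these scores, the following is a simplified sketch of how ROUGE-N and ROUGE-L F1 can be computed. It assumes whitespace tokenisation and no stemming, and it is not the official ROUGE implementation, so its numbers will generally differ from published results; the function names are chosen for this example only.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N F1 based on clipped n-gram overlap."""
    cand = _ngrams(candidate.split(), n)
    ref = _ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # Counter & keeps the minimum counts
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def _lcs_len(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence of tokens."""
    c, r = candidate.split(), reference.split()
    return _f1(_lcs_len(c, r), len(c), len(r))


# Toy sentences, made up for illustration.
cand = "the cat sat on the mat"
ref = "the cat lay on the mat"
scores = {
    "rouge1": rouge_n(cand, ref, 1),
    "rouge2": rouge_n(cand, ref, 2),
    "rougeL": rouge_l(cand, ref),
}
```

Whitespace tokenisation is a rough fit for morphologically rich languages such as Czech and German, so an established ROUGE package with language-appropriate preprocessing is the better choice for reportable numbers.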

Dataset Curation
  • Original Curation

  • Language Data

  • Structured Annotations

  • Consent

  • Private Identifying Information (PII)

  • Maintenance

Original Curation

Sourced from Different Sources

Is the dataset aggregated from different data sources?

no

Language Data

How was Language Data Obtained?

How was the language data obtained?

Found

Where was it found?

If found, where from?

Single website

Data Validation

Was the text validated by a different worker or a data curator?

other

Was Data Filtered?

Were text instances selected or filtered?

not filtered

Structured Annotations

Additional Annotations?

Does the dataset have additional annotations for each instance?

found

Annotation Service?

Was an annotation service used?

no

Annotation Values

Purpose and values for each annotation

The input documents have section structure information.

Any Quality Control?

Quality control measures?

validated by another rater

Quality Control Details

Describe the quality control measures that were taken.

Bilingual annotators assessed the content overlap of source document and target summaries.

Consent

Any Consent Policy?

Was there a consent policy involved when gathering the data?

no

Private Identifying Information (PII)

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Maintenance

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

no

Broader Social Context
  • Previous Work on the Social Impact of the Dataset

  • Impact on Under-Served Communities

  • Discussion of Biases

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Considerations for Using the Data
  • PII Risks and Liability

  • Licenses

  • Known Technical Limitations

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describe the copyright and licensing status of the dataset?

public domain

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describe the copyright and licensing status of the underlying language data?

public domain

Known Technical Limitations