GEM xwikis

The XWikis Corpus provides datasets with different language pairs and directions for cross-lingual and multi-lingual abstractive document summarisation.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/xwikis')

The data loader can be found here.

website

Github

paper

https://arxiv.org/abs/2202.09583

authors

Laura Perez-Beltrachini (University of Edinburgh)

Quick-Use

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Laura Perez-Beltrachini

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

German, English, French, Czech

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Entity descriptive summarisation, that is, generating a summary that conveys the most salient facts of a document related to a given entity.

Additional Annotations?

Does the dataset have additional annotations for each instance?

found

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Dataset Overview

Where to find the Data and its Documentation

Languages and Intended Use

Credit

Dataset Structure

Where to find the Data and its Documentation

Webpage

What is the webpage for the dataset (if it exists)?

Github

Paper

What is the link to the paper describing the dataset (open access preferred)?

https://arxiv.org/abs/2202.09583

BibTex

Provide the BibTeX-formatted reference for the dataset. Please use the correct published version (ACL Anthology, etc.) instead of a BibTeX entry generated by Google Scholar.

@InProceedings{clads-emnlp,
  author =    "Laura Perez-Beltrachini and Mirella Lapata",
  title =     "Models and Datasets for Cross-Lingual Summarisation",
  booktitle = "Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing",
  year =      "2021",
  address =   "Punta Cana, Dominican Republic",
}

Contact Name

If known, provide the name of at least one person the reader can contact for questions about the dataset.

Laura Perez-Beltrachini

Contact Email

If known, provide the email of at least one person the reader can contact for questions about the dataset.

lperez@ed.ac.uk

Has a Leaderboard?

Does the dataset have an active leaderboard?

no

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

yes

Covered Languages

What languages/dialects are covered in the dataset?

German, English, French, Czech

License

What is the license of the dataset?

cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International

Intended Use

What is the intended use of the dataset?

Cross-lingual and multi-lingual abstractive summarisation of a single long input document.

Primary Task

What primary task does the dataset support?

Summarization

Communicative Goal

Provide a short description of the communicative goal of a model trained for this task on this dataset.

Entity descriptive summarisation, that is, generating a summary that conveys the most salient facts of a document related to a given entity.

Credit

Curation Organization Type(s)

In what kind of organization did the dataset curation happen?

academic

Dataset Creators

Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).

Laura Perez-Beltrachini (University of Edinburgh)

Who added the Dataset to GEM?

Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.

Laura Perez-Beltrachini (University of Edinburgh) and Ronald Cardenas (University of Edinburgh)

Dataset Structure

Data Splits

Describe and name the splits in the dataset if there are more than one.

Each language pair and direction has its own train/valid/test split. The test split is a sample of 7k titles drawn from the intersection of the titles that exist in all four languages (cs, fr, en, de). The remaining data is divided at random into train and valid.
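The split scheme described above can be illustrated with a small sketch. This is not the authors' code: the function name `make_splits`, its parameters, the 10% validation fraction, and the choice to exclude test titles from train/valid are all assumptions made for illustration, and the toy titles are invented.

```python
import random


def make_splits(titles_by_lang, pair_titles, test_size, valid_fraction=0.1, seed=0):
    """Toy version of the split scheme (hypothetical helper, not the original pipeline)."""
    rng = random.Random(seed)
    # Titles present in every language are the candidates for the shared test set.
    shared = set.intersection(*(set(titles) for titles in titles_by_lang.values()))
    test = set(rng.sample(sorted(shared), min(test_size, len(shared))))
    # Assumption: the remaining titles of this language pair are shuffled
    # and divided at random into valid and train.
    rest = sorted(set(pair_titles) - test)
    rng.shuffle(rest)
    n_valid = int(len(rest) * valid_fraction)
    return {"train": rest[n_valid:], "valid": rest[:n_valid], "test": sorted(test)}


# Invented toy titles; the real test split samples 7k titles.
titles_by_lang = {
    "cs": ["A", "B", "C", "D"],
    "fr": ["A", "B", "C", "E"],
    "en": ["A", "B", "C", "D", "E", "F"],
    "de": ["A", "B", "C", "F"],
}
fr_en_titles = set(titles_by_lang["fr"]) & set(titles_by_lang["en"])
splits = make_splits(titles_by_lang, fr_en_titles, test_size=2, valid_fraction=0.5)
print({name: len(items) for name, items in splits.items()})
```

Because the test titles come from the four-way intersection, the same titles can serve as test items for every language pair and direction.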

Dataset in GEM

Rationale for Inclusion in GEM

GEM-Specific Curation

Getting Started with the Task

Rationale for Inclusion in GEM

Similar Datasets

Do other datasets for the high level task exist?

no

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

no

Additional Splits?

Does GEM provide additional splits to the dataset?

no

Getting Started with the Task

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

identification of entity salient information
translation
multi-linguality
cross-lingual transfer, zero-shot, few-shot

Metrics

What metrics are typically used for this task?

ROUGE

Previous results available?

Are previous results available?

yes

Other Evaluation Approaches

What evaluation approaches have others used?

ROUGE-1/2/L
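To make the three ROUGE variants concrete, here is a minimal sketch of ROUGE-N and ROUGE-L F1 scores. It is a simplified illustration, assuming lowercased whitespace tokenisation with no stemming, and it is not the official ROUGE scorer or the evaluation script used for the published results. The function names and example sentences are made up for this sketch.

```python
from collections import Counter


def _f1(overlap, n_pred, n_ref):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision, recall = overlap / n_pred, overlap / n_ref
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """F1 over overlapping n-grams (ROUGE-1 for n=1, ROUGE-2 for n=2)."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((pred & ref).values())
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def rouge_l(prediction, reference):
    """F1 based on the longest common subsequence (LCS) of tokens."""
    p, r = prediction.lower().split(), reference.lower().split()
    # Dynamic-programming table: table[i][j] is the LCS length of p[:i] and r[:j].
    table = [[0] * (len(r) + 1) for _ in range(len(p) + 1)]
    for i, p_tok in enumerate(p, 1):
        for j, r_tok in enumerate(r, 1):
            if p_tok == r_tok:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[len(p)][len(r)], len(p), len(r))


prediction = "the cat sat on the mat"
reference = "the cat lay on the mat"
print(round(rouge_n(prediction, reference, 1), 3))  # 0.833
print(round(rouge_n(prediction, reference, 2), 3))  # 0.6
print(round(rouge_l(prediction, reference), 3))     # 0.833
```

For reported numbers, an established ROUGE implementation should be used instead, since tokenisation and stemming choices change the scores, particularly across languages.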

Dataset Curation

Original Curation

Language Data

Structured Annotations

Consent

Private Identifying Information (PII)

Maintenance

Original Curation

Sourced from Different Sources

Is the dataset aggregated from different data sources?

no

Language Data

How was Language Data Obtained?

How was the language data obtained?

Found

Where was it found?

If found, where from?

Single website

Data Validation

Was the text validated by a different worker or a data curator?

other

Was Data Filtered?

Were text instances selected or filtered?

not filtered

Structured Annotations

Additional Annotations?

Does the dataset have additional annotations for each instance?

found

Annotation Service?

Was an annotation service used?

no

Annotation Values

Purpose and values for each annotation

The input documents have section structure information.

Any Quality Control?

Quality control measures?

validated by another rater

Quality Control Details

Describe the quality control measures that were taken.

Bilingual annotators assessed the content overlap of source document and target summaries.

Consent

Any Consent Policy?

Was there a consent policy involved when gathering the data?

no

Private Identifying Information (PII)

Contains PII?

Does the source language data likely contain Personal Identifying Information about the data creators or subjects?

no PII

Maintenance

Any Maintenance Plan?

Does the original dataset have a maintenance plan?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Impact on Under-Served Communities

Discussion of Biases

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Considerations for Using the Data

PII Risks and Liability

Licenses

Known Technical Limitations

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?

public domain

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?

public domain
