xwikis
The XWikis Corpus provides datasets with different language pairs and directions for cross-lingual and multi-lingual abstractive document summarisation.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/xwikis')
The data loader can be found here.
website
authors
Laura Perez-Beltrachini (University of Edinburgh)
Quick-Use
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Laura Perez-Beltrachini
Multilingual?
Is the dataset multilingual?
yes
Covered Languages
What languages/dialects are covered in the dataset?
German, English, French, Czech
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Entity-descriptive summarisation: generating a summary that conveys the most salient facts of a document about a given entity.
Additional Annotations?
Does the dataset have additional annotations for each instance?
found
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Dataset Overview
Where to find the Data and its Documentation
Webpage
What is the webpage for the dataset (if it exists)?
Paper
What is the link to the paper describing the dataset (open access preferred)?
BibTex
Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.
@InProceedings{clads-emnlp,
  author = "Laura Perez-Beltrachini and Mirella Lapata",
  title = "Models and Datasets for Cross-Lingual Summarisation",
  booktitle = "Proceedings of The 2021 Conference on Empirical Methods in Natural Language Processing",
  year = "2021",
  address = "Punta Cana, Dominican Republic",
}
Contact Name
If known, provide the name of at least one person the reader can contact for questions about the dataset.
Laura Perez-Beltrachini
Contact Email
If known, provide the email of at least one person the reader can contact for questions about the dataset.
Has a Leaderboard?
Does the dataset have an active leaderboard?
no
Languages and Intended Use
Multilingual?
Is the dataset multilingual?
yes
Covered Languages
What languages/dialects are covered in the dataset?
German, English, French, Czech
License
What is the license of the dataset?
cc-by-sa-4.0: Creative Commons Attribution Share Alike 4.0 International
Intended Use
What is the intended use of the dataset?
Cross-lingual and multi-lingual abstractive summarisation of single long input documents.
Primary Task
What primary task does the dataset support?
Summarization
Communicative Goal
Provide a short description of the communicative goal of a model trained for this task on this dataset.
Entity-descriptive summarisation: generating a summary that conveys the most salient facts of a document about a given entity.
Credit
Curation Organization Type(s)
In what kind of organization did the dataset curation happen?
academic
Dataset Creators
Who created the original dataset? List the people involved in collecting the dataset and their affiliation(s).
Laura Perez-Beltrachini (University of Edinburgh)
Who added the Dataset to GEM?
Who contributed to the data card and adding the dataset to GEM? List the people+affiliations involved in creating this data card and who helped integrate this dataset into GEM.
Laura Perez-Beltrachini (University of Edinburgh) and Ronald Cardenas (University of Edinburgh)
Dataset Structure
Data Splits
Describe and name the splits in the dataset if there are more than one.
For each language pair and direction there is a train/valid/test split. The test split is a sample of 7k titles drawn from the intersection of titles that exist in all four languages (cs, fr, en, de). The train and valid sets are split randomly.
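The split description above can be illustrated with a small sketch. This is not the authors' script: the function name `make_splits`, the `valid_fraction` and `seed` parameters, and the idea of using a shared title string as the cross-language key are all assumptions made for illustration. It also assumes that train and valid are drawn from the titles left over once the test titles are removed.

```python
import random

def make_splits(titles_by_lang, test_size, valid_fraction=0.05, seed=0):
    """Hypothetical sketch: sample a shared test set, then split the rest.

    titles_by_lang maps a language code to the set of titles in that language.
    Returns (test_titles, splits), where splits[lang] = (train, valid).
    """
    rng = random.Random(seed)
    # Only titles present in every language are candidates for the test set.
    shared = sorted(set.intersection(*titles_by_lang.values()))
    test = set(rng.sample(shared, min(test_size, len(shared))))
    splits = {}
    for lang, titles in titles_by_lang.items():
        # Assumption: train/valid come from whatever is not in the test set.
        rest = sorted(titles - test)
        rng.shuffle(rest)
        n_valid = int(len(rest) * valid_fraction)
        splits[lang] = (rest[n_valid:], rest[:n_valid])
    return sorted(test), splits

# Toy data: only "A", "B" and "C" exist in all four languages.
toy = {
    "cs": {"A", "B", "C", "D"},
    "de": {"A", "B", "C", "E"},
    "en": {"A", "B", "C", "F", "G"},
    "fr": {"A", "B", "C", "H"},
}
test_titles, toy_splits = make_splits(toy, test_size=2, valid_fraction=0.5)
```

With the real corpus, `test_size` would be 7000; the validation fraction is not stated in this card, so the value used here is a placeholder.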
Dataset in GEM
Rationale for Inclusion in GEM
Similar Datasets
Do other datasets for the high level task exist?
no
GEM-Specific Curation
Modified for GEM?
Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?
no
Additional Splits?
Does GEM provide additional splits to the dataset?
no
Getting Started with the Task
Previous Results
Previous Results
Measured Model Abilities
What aspect of model ability can be measured with this dataset?
- identification of entity salient information
- translation
- multi-linguality
- cross-lingual transfer, zero-shot, few-shot
Metrics
What metrics are typically used for this task?
ROUGE
Previous results available?
Are previous results available?
yes
Other Evaluation Approaches
What evaluation approaches have others used?
ROUGE-1/2/L
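For readers unfamiliar with these metrics, the following is a simplified sketch of ROUGE-1/2 (n-gram overlap) and ROUGE-L (longest common subsequence), each reported as an F1 score. It is an illustration only: it assumes lowercasing and whitespace tokenisation, applies no stemming, and is not the scoring code used for the published results. The helper names are hypothetical.

```python
from collections import Counter

def _ngrams(tokens, n):
    # Multiset of n-grams so that repeated n-grams are counted (clipped) correctly.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _f1(overlap, n_cand, n_ref):
    if overlap == 0 or n_cand == 0 or n_ref == 0:
        return 0.0
    precision, recall = overlap / n_cand, overlap / n_ref
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    cand, ref = candidate.lower().split(), reference.lower().split()
    c, r = _ngrams(cand, n), _ngrams(ref, n)
    overlap = sum((c & r).values())  # clipped n-gram matches
    return _f1(overlap, sum(c.values()), sum(r.values()))

def _lcs_len(a, b):
    # Dynamic programme over two rows for the longest common subsequence length.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    cand, ref = candidate.lower().split(), reference.lower().split()
    return _f1(_lcs_len(cand, ref), len(cand), len(ref))

cand = "the cat sat on the mat"
ref = "the cat lay on the mat"
scores = {
    "rouge1": rouge_n(cand, ref, 1),
    "rouge2": rouge_n(cand, ref, 2),
    "rougeL": rouge_l(cand, ref),
}
```

Whitespace tokenisation is a crude choice for German, French and Czech; an established ROUGE package with language-appropriate tokenisation is preferable for reported numbers.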
Dataset Curation
Original Curation
Sourced from Different Sources
Is the dataset aggregated from different data sources?
no
Language Data
How was Language Data Obtained?
How was the language data obtained?
Found
Where was it found?
If found, where from?
Single website
Data Validation
Was the text validated by a different worker or a data curator?
other
Was Data Filtered?
Were text instances selected or filtered?
not filtered
Structured Annotations
Additional Annotations?
Does the dataset have additional annotations for each instance?
found
Annotation Service?
Was an annotation service used?
no
Annotation Values
Purpose and values for each annotation
The input documents have section structure information.
Any Quality Control?
Quality control measures?
validated by another rater
Quality Control Details
Describe the quality control measures that were taken.
Bilingual annotators assessed the content overlap of source document and target summaries.
Consent
Any Consent Policy?
Was there a consent policy involved when gathering the data?
no
Private Identifying Information (PII)
Contains PII?
Does the source language data likely contain Personal Identifying Information about the data creators or subjects?
no PII
Maintenance
Any Maintenance Plan?
Does the original dataset have a maintenance plan?
no
Broader Social Context
Previous Work on the Social Impact of the Dataset
Usage of Models based on the Data
Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?
no
Impact on Under-Served Communities
Addresses needs of underserved Communities?
Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).
no
Discussion of Biases
Any Documented Social Biases?
Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.
no
Considerations for Using the Data
PII Risks and Liability
Licenses
Copyright Restrictions on the Dataset
Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?
public domain
Copyright Restrictions on the Language Data
Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?
public domain