OrangeSum
OrangeSum is a French summarization dataset inspired by XSum. It features two subtasks: abstract generation and title generation. The data was sourced from "Orange Actu" articles between 2011 and 2020.
You can load the dataset via:
import datasets
data = datasets.load_dataset('GEM/OrangeSum')
The data loader can be found here.
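As a quick sanity check after loading, the size of each split can be listed. The helper below is a minimal sketch, not part of the GEM loader: `split_sizes` is a hypothetical name, and it only assumes that the loaded object behaves like a mapping from split names to sized collections. This card does not say whether the two subtasks are exposed as separate configurations, so a configuration name may need to be passed to `load_dataset`.

```python
def split_sizes(dataset_dict):
    """Return {split_name: number_of_examples} for a mapping of splits.

    `split_sizes` is a hypothetical helper, not part of the GEM loader.
    It works on any mapping whose values support len().
    """
    return {name: len(split) for name, split in dataset_dict.items()}


# Intended use (requires network access and the `datasets` package):
#   import datasets
#   data = datasets.load_dataset('GEM/OrangeSum')
#   print(split_sizes(data))
```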
paper
Quick-Use
Multilingual?
Is the dataset multilingual?
no
Covered Languages
What languages/dialects are covered in the dataset?
French
License
What is the license of the dataset?
other: Other license
Dataset Overview
Where to find the Data and its Documentation
Download
What is the link to where the original dataset is hosted?
Paper
What is the link to the paper describing the dataset (open access preferred)?
BibTeX
Provide the BibTeX-formatted reference for the dataset. Please use the correct published version (ACL Anthology, etc.) instead of a BibTeX entry generated by Google Scholar.
@inproceedings{kamal-eddine-etal-2021-barthez,
title = "{BART}hez: a Skilled Pretrained {F}rench Sequence-to-Sequence Model",
author = "Kamal Eddine, Moussa and
Tixier, Antoine and
Vazirgiannis, Michalis",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.740",
doi = "10.18653/v1/2021.emnlp-main.740",
pages = "9369--9390",
abstract = "Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez{'} corpus, and show our resulting model, mBARThez, to significantly boost BARThez{'} generative performance.",
}
Has a Leaderboard?
Does the dataset have an active leaderboard?
no
Languages and Intended Use
Multilingual?
Is the dataset multilingual?
no
Covered Languages
What languages/dialects are covered in the dataset?
French
License
What is the license of the dataset?
other: Other license
Primary Task
What primary task does the dataset support?
Summarization
Credit
Dataset Structure
Dataset in GEM
Rationale for Inclusion in GEM
Similar Datasets
Do other datasets for the high-level task exist?
no
GEM-Specific Curation
Modified for GEM?
Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?
no
Additional Splits?
Does GEM provide additional splits to the dataset?
no
Getting Started with the Task
Pointers to Resources
Getting started with in-depth research on the task. Add relevant pointers to resources that researchers can consult when they want to get started digging deeper into the task.
Papers about abstractive summarization using seq2seq models:
- Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond
- Get To The Point: Summarization with Pointer-Generator Networks
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
- BARThez: a Skilled Pretrained French Sequence-to-Sequence Model
Papers about (pretrained) Transformers:
Technical Terms
Technical terms used in this card and the dataset and their definitions
No unique technical words in this data card.
Previous Results
Measured Model Abilities
What aspect of model ability can be measured with this dataset?
The ability of a model to generate human-like titles and abstracts for given news articles.
Metrics
What metrics are typically used for this task?
ROUGE, BERT-Score
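ROUGE scores a generated summary by its n-gram overlap (ROUGE-1, ROUGE-2) or longest common subsequence (ROUGE-L) with a reference. The following is a simplified sketch for illustration only: it assumes lowercased whitespace tokenization and reports F1, whereas published results are normally computed with an established ROUGE toolkit whose tokenization and stemming may differ. The function names are hypothetical. BERT-Score is not sketched here because it requires a pretrained model.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, pred_total, ref_total):
    """Harmonic mean of precision and recall, 0.0 when there is no overlap."""
    if overlap == 0 or pred_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / pred_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(prediction, reference, n=1):
    """Simplified ROUGE-N F1 on lowercased, whitespace-split tokens."""
    pred = _ngrams(prediction.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    return _f1(overlap, sum(pred.values()), sum(ref.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """Simplified ROUGE-L F1 based on the longest common subsequence."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    return _f1(_lcs_length(pred, ref), len(pred), len(ref))
```

For example, `rouge_n("le chat dort sur le canapé", "le chat dort dans le salon", n=2)` counts two shared bigrams out of five on each side, giving an F1 of 0.4.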
Proposed Evaluation
List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.
Automatic evaluation: ROUGE-1, ROUGE-2, ROUGE-L and BERTScore were used.
Human evaluation: a human evaluation study was conducted with 11 native French speakers. The evaluators were PhD students from the computer science department of the authors' university, working in NLP and other fields of AI. They volunteered after receiving an email announcement. Best-Worst Scaling (Louviere et al., 2015) was used: two summaries from two different systems, along with their input document, were presented to a human annotator, who had to decide which one was better. The evaluators were asked to base their judgments on accuracy (does the summary contain accurate facts?), informativeness (is important information captured?) and fluency (is the summary written in well-formed French?).
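Pairwise judgments of this kind can be aggregated into one score per system. The sketch below assumes the usual Best-Worst Scaling counting score, (times chosen as best minus times chosen as worst) divided by the number of appearances, which ranges from -1 to 1. This card does not state the exact aggregation the authors used, and the function name and input format are hypothetical.

```python
from collections import defaultdict


def best_worst_scores(judgments):
    """Aggregate pairwise judgments into a per-system score in [-1, 1].

    `judgments` is an iterable of (system_a, system_b, winner) tuples,
    where `winner` is whichever of the two systems the annotator preferred.
    This input format is an assumption made for illustration.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    seen = defaultdict(int)
    for system_a, system_b, winner in judgments:
        if winner not in (system_a, system_b):
            raise ValueError(f"winner {winner!r} is not one of the compared systems")
        loser = system_b if winner == system_a else system_a
        best[winner] += 1
        worst[loser] += 1
        seen[system_a] += 1
        seen[system_b] += 1
    # Counting score: (best - worst) / number of comparisons the system took part in.
    return {system: (best[system] - worst[system]) / seen[system] for system in seen}
```

A system that wins every comparison it appears in scores 1.0, and one that loses every comparison scores -1.0.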
Previous results available?
Are previous results available?
no
Broader Social Context
Previous Work on the Social Impact of the Dataset
Usage of Models based on the Data
Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?
no
Impact on Under-Served Communities
Addresses needs of underserved Communities?
Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).
no
Discussion of Biases
Any Documented Social Biases?
Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.
no
Are the Language Producers Representative of the Language?
Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?
The dataset contains news articles written by professional authors.
Considerations for Using the Data
PII Risks and Liability
Licenses
Copyright Restrictions on the Dataset
Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?
open license - commercial use allowed
Copyright Restrictions on the Language Data
Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?
open license - commercial use allowed