GEM OrangeSum

OrangeSum

OrangeSum is a French summarization dataset inspired by XSum. It features two subtasks: abstract generation and title generation. The data was sourced from "Orange Actu" articles between 2011 and 2020.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/OrangeSum')

The data loader can be found here.

paper

ACL Anthology

Quick-Use

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

French

License

What is the license of the dataset?

other: Other license

Dataset Overview

Where to find the Data and its Documentation

Download

What is the link to where the original dataset is hosted?

Github

Paper

What is the link to the paper describing the dataset (open access preferred)?

ACL Anthology

BibTex

Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.

@inproceedings{kamal-eddine-etal-2021-barthez,
title = "{BART}hez: a Skilled Pretrained {F}rench Sequence-to-Sequence Model",
author = "Kamal Eddine, Moussa  and
Tixier, Antoine  and
Vazirgiannis, Michalis",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.740",
doi = "10.18653/v1/2021.emnlp-main.740",
pages = "9369--9390",
abstract = "Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez{'} corpus, and show our resulting model, mBARThez, to significantly boost BARThez{'} generative performance.",
}

Has a Leaderboard?

Does the dataset have an active leaderboard?

no

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

French

License

What is the license of the dataset?

other: Other license

Primary Task

What primary task does the dataset support?

Summarization

Credit

Dataset Structure

Dataset in GEM

Rationale for Inclusion in GEM

Similar Datasets

Do other datasets for the high level task exist?

no

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

no

Additional Splits?

Does GEM provide additional splits to the dataset?

no

Getting Started with the Task

Pointers to Resources

Getting started with in-depth research on the task. Add relevant pointers to resources that researchers can consult when they want to get started digging deeper into the task.

Papers about abstractive summarization using seq2seq models:

Papers about (pretrained) Transformers:

Technical Terms

Technical terms used in this card and the dataset and their definitions

No unique technical words in this data card.

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

The ability of the model to generate human-like titles and abstracts for given news articles.

Metrics

What metrics are typically used for this task?

ROUGE, BERTScore

Proposed Evaluation

List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

Automatic evaluation: ROUGE-1, ROUGE-2, ROUGE-L and BERTScore were used.
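
As a rough illustration of what the ROUGE-N scores above measure, the following minimal sketch computes clipped n-gram overlap between a candidate and a reference. It assumes lowercased whitespace tokenization and is not the official ROUGE implementation, so its numbers will not match published scores. The function names and the French sentence pair are hypothetical and were invented for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter (multiset) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N: precision, recall and F1 over clipped n-gram overlap.

    Assumption: tokens are obtained by lowercasing and splitting on whitespace.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count per n-gram (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical generated title and reference title, for illustration only.
generated = "le chat dort sur le canapé"
reference = "le chat dort dans le salon"

rouge_1 = rouge_n(generated, reference, n=1)
rouge_2 = rouge_n(generated, reference, n=2)
print(rouge_1, rouge_2)
```

ROUGE-L (longest common subsequence) and BERTScore (embedding similarity) need additional machinery and are not covered by this sketch.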

Human evaluation: A human evaluation study was conducted with 11 native French speakers. The evaluators were PhD students from the computer science department of the authors' university, working in NLP and other fields of AI. They volunteered after receiving an email announcement. Best-Worst Scaling (Louviere et al., 2015) was used. Two summaries from two different systems, along with their input document, were presented to a human annotator, who had to decide which one was better. The evaluators were asked to base their judgments on accuracy (does the summary contain accurate facts?), informativeness (is important information captured?) and fluency (is the summary written in well-formed French?).
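
The card does not say how the pairwise judgments were aggregated. One common way to turn such judgments into a per-system score is the percentage of comparisons in which a system was preferred minus the percentage in which it was not. The sketch below illustrates that scheme under this assumption. The system names and judgments are hypothetical and do not come from the study.

```python
from collections import defaultdict


def pairwise_preference_scores(judgments):
    """Score each system as %preferred minus %not preferred (range -100 to 100).

    `judgments` is an iterable of (system_a, system_b, preferred) tuples,
    where `preferred` must be either system_a or system_b.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    for system_a, system_b, preferred in judgments:
        if preferred not in (system_a, system_b):
            raise ValueError(f"{preferred!r} is not one of the compared systems")
        loser = system_b if preferred == system_a else system_a
        wins[preferred] += 1
        losses[loser] += 1

    scores = {}
    for system in set(wins) | set(losses):
        total = wins[system] + losses[system]
        scores[system] = 100.0 * (wins[system] - losses[system]) / total
    return scores


# Hypothetical annotator decisions, invented for illustration only.
judgments = [
    ("system_A", "system_B", "system_A"),
    ("system_A", "system_B", "system_A"),
    ("system_A", "system_C", "system_C"),
    ("system_B", "system_C", "system_C"),
]
scores = pairwise_preference_scores(judgments)
print(scores)
```

A score near 100 means a system was almost always preferred, and a score near -100 means it almost never was.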

Previous results available?

Are previous results available?

no

Broader Social Context

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

The dataset contains news articles written by professional authors.

Considerations for Using the Data

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?

open license - commercial use allowed

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?

open license - commercial use allowed
