OrangeSum

OrangeSum is a French summarization dataset inspired by XSum. It features two subtasks: abstract generation and title generation. The data was sourced from "Orange Actu" articles between 2011 and 2020.

You can load the dataset via:

import datasets
data = datasets.load_dataset('GEM/OrangeSum')

The data loader can be found here.

Quick-Use

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

French

License

What is the license of the dataset?

other: Other license

Dataset Overview
  • Where to find the Data and its Documentation

  • Languages and Intended Use

  • Credit

  • Dataset Structure

Where to find the Data and its Documentation

Download

What is the link to where the original dataset is hosted?

Github

Paper

What is the link to the paper describing the dataset (open access preferred)?

ACL Anthology

BibTex

Provide the BibTex-formatted reference for the dataset. Please use the correct published version (ACL anthology, etc.) instead of google scholar created Bibtex.

@inproceedings{kamal-eddine-etal-2021-barthez,
title = "{BART}hez: a Skilled Pretrained {F}rench Sequence-to-Sequence Model",
author = "Kamal Eddine, Moussa  and
Tixier, Antoine  and
Vazirgiannis, Michalis",
booktitle = "Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2021",
address = "Online and Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.emnlp-main.740",
doi = "10.18653/v1/2021.emnlp-main.740",
pages = "9369--9390",
abstract = "Inductive transfer learning has taken the entire NLP field by storm, with models such as BERT and BART setting new state of the art on countless NLU tasks. However, most of the available models and research have been conducted for English. In this work, we introduce BARThez, the first large-scale pretrained seq2seq model for French. Being based on BART, BARThez is particularly well-suited for generative tasks. We evaluate BARThez on five discriminative tasks from the FLUE benchmark and two generative tasks from a novel summarization dataset, OrangeSum, that we created for this research. We show BARThez to be very competitive with state-of-the-art BERT-based French language models such as CamemBERT and FlauBERT. We also continue the pretraining of a multilingual BART on BARThez{'} corpus, and show our resulting model, mBARThez, to significantly boost BARThez{'} generative performance.",
}

Has a Leaderboard?

Does the dataset have an active leaderboard?

no

Languages and Intended Use

Multilingual?

Is the dataset multilingual?

no

Covered Languages

What languages/dialects are covered in the dataset?

French

License

What is the license of the dataset?

other: Other license

Primary Task

What primary task does the dataset support?

Summarization

Credit

Dataset Structure

Dataset in GEM
  • Rationale for Inclusion in GEM

  • GEM-Specific Curation

  • Getting Started with the Task

Rationale for Inclusion in GEM

Similar Datasets

Do other datasets for the high level task exist?

no

GEM-Specific Curation

Modified for GEM?

Has the GEM version of the dataset been modified in any way (data, processing, splits) from the original curated data?

no

Additional Splits?

Does GEM provide additional splits to the dataset?

no

Getting Started with the Task

Pointers to Resources

Getting started with in-depth research on the task. Add relevant pointers to resources that researchers can consult when they want to get started digging deeper into the task.

Papers about abstractive summarization using seq2seq models:

Papers about (pretrained) Transformers:

Technical Terms

Technical terms used in this card and the dataset and their definitions

No unique technical words in this data card.

Previous Results
  • Previous Results

Previous Results

Measured Model Abilities

What aspect of model ability can be measured with this dataset?

The ability of the model to generate human-like titles and abstracts for given news articles.

Metrics

What metrics are typically used for this task?

ROUGE, BERTScore

Proposed Evaluation

List and describe the purpose of the metrics and evaluation methodology (including human evaluation) that the dataset creators used when introducing this task.

Automatic evaluation: ROUGE-1, ROUGE-2, ROUGE-L and BERTScore were used.
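To make the overlap-based metrics concrete, here is a minimal sketch of ROUGE-1, ROUGE-2 and ROUGE-L F1 in plain Python. It is a simplified illustration, not the official ROUGE implementation and not necessarily the exact setup the dataset creators used: it assumes lowercased whitespace tokenization and applies no stemming. The helper names (`ngrams`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) and the example sentences are made up for this sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 from an overlap count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1: clipped n-gram overlap on whitespace tokens."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # Counter '&' keeps the minimum counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Simplified ROUGE-L F1 based on the longest common subsequence."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


# Hypothetical system output and reference (not taken from the dataset).
candidate = "le chat dort sur le canapé"
reference = "le chat dort sur le tapis"
print(round(rouge_n(candidate, reference, 1), 3))  # 0.833
print(round(rouge_n(candidate, reference, 2), 3))  # 0.8
print(round(rouge_l(candidate, reference), 3))     # 0.833
```

For reported numbers, an established ROUGE package and the BERTScore reference implementation are the safer choice, since tokenization and stemming choices change the scores.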

Human evaluation: a human evaluation study was conducted with 11 native French speakers. The evaluators were PhD students from the computer science department of the authors' university, working in NLP and other fields of AI. They volunteered after receiving an email announcement. Best-Worst Scaling (Louviere et al., 2015) was used: two summaries from two different systems, along with their input document, were presented to a human annotator, who had to decide which one was better. The evaluators were asked to base their judgments on accuracy (does the summary contain accurate facts?), informativeness (is important information captured?) and fluency (is the summary written in well-formed French?).
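The card does not say how the pairwise judgments were aggregated into per-system scores. A common Best-Worst Scaling convention, assumed here, is to score each system as the number of times it was chosen as better minus the number of times it was chosen as worse, divided by the number of times it was shown. The sketch below follows that convention; the function name `best_worst_scores` and the system labels are hypothetical.

```python
from collections import defaultdict


def best_worst_scores(judgments):
    """Score each system as (times best - times worst) / times shown.

    `judgments` is an iterable of (better_system, worse_system) pairs,
    one pair per human comparison. Scores range from -1 to 1.
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    for better, worse in judgments:
        best[better] += 1
        worst[worse] += 1
    systems = set(best) | set(worst)
    return {s: (best[s] - worst[s]) / (best[s] + worst[s]) for s in systems}


# Hypothetical judgments: each tuple is (preferred summary's system, other system).
judgments = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]
scores = best_worst_scores(judgments)
for system in sorted(scores):
    print(system, round(scores[system], 3))
```

Under this convention a score of 1 means a system's summary was always preferred and -1 means it never was.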

Previous results available?

Are previous results available?

no

Broader Social Context
  • Previous Work on the Social Impact of the Dataset

  • Impact on Under-Served Communities

  • Discussion of Biases

Previous Work on the Social Impact of the Dataset

Usage of Models based on the Data

Are you aware of cases where models trained on the task featured in this dataset or related tasks have been used in automated systems?

no

Impact on Under-Served Communities

Addresses needs of underserved Communities?

Does this dataset address the needs of communities that are traditionally underserved in language technology, and particularly language generation technology? Communities may be underserved, for example, because their language, language variety, or social or geographical context is underrepresented in NLP and NLG resources (datasets and models).

no

Discussion of Biases

Any Documented Social Biases?

Are there documented social biases in the dataset? Biases in this context are variations in the ways members of different social categories are represented that can have harmful downstream consequences for members of the more disadvantaged group.

no

Are the Language Producers Representative of the Language?

Does the distribution of language producers in the dataset accurately represent the full distribution of speakers of the language world-wide? If not, how does it differ?

The dataset contains news articles written by professional authors.

Considerations for Using the Data
  • PII Risks and Liability

  • Licenses

  • Known Technical Limitations

PII Risks and Liability

Licenses

Copyright Restrictions on the Dataset

Based on your answers in the Intended Use part of the Data Overview Section, which of the following best describes the copyright and licensing status of the dataset?

open license - commercial use allowed

Copyright Restrictions on the Language Data

Based on your answers in the Language part of the Data Curation Section, which of the following best describes the copyright and licensing status of the underlying language data?

open license - commercial use allowed

Known Technical Limitations