GEM Workshop at ACL 2021

The workshop will be held as part of ACL-IJCNLP 2021, August 1-6, 2021. It will take place on August 6. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).

Note: Our system output submission form remains open indefinitely; please continue participating in our benchmark.

Workshop Overview

Natural language generation is one of the most active research fields in NLP, with generation, summarization, and dialog among the most submitted-to tracks. As a result, the number of available datasets, metrics, models, and evaluation strategies is increasing rapidly. This leads to a situation in which new models are often evaluated on different, Anglo-centric tasks with incompatible evaluation setups. With GEM, we aim to solve this problem by standardizing and improving the corpora on which NLG models are evaluated, and by supporting the development of better evaluation approaches. Submitted papers analyze the state of NLG evaluation and propose better alternatives. Moreover, we organize the living GEM benchmark, which incorporates new advances in data and in human and automatic evaluation to make it easier to evaluate models on challenging tasks with the right tools. In our shared task, models were applied to up to 11 tasks in 18 languages and 80 challenge sets, and their outputs were characterized using a combination of human evaluation and over 50 automatic metrics. Through the presented papers and the shared task, we aim to uncover shortcomings and opportunities for progress.


All times are in UTC; please use a time zone converter to convert to your local time.

We do not distinguish between workshop papers and Findings of the ACL papers being presented - they are all great! Links to all papers are coming soon!

Time (UTC) Session
11:30 - 12:00 Welcome and Explanation of Logistics
12:00 - 13:00 Poster Session
Evaluating the Efficacy of Summarization Evaluation across Languages
Fajri Koto, Jey Han Lau, and Timothy Baldwin
Automatic Text Simplification for Social Good: Progress and Challenges
Sanja Stajner
Flesch-Kincaid is Not a Text Simplification Evaluation Metric
Teerapaun Tanprasert and David Kauchak
Human Perception in Natural Language Generation
Lorenzo De Mattei, Huiyuan Lai, Felice Dell'Orletta, and Malvina Nissim
Semantic Similarity Based Evaluation for Abstractive News Summarization
Figen Beken Fikri, Kemal Oflazer, and Berrin Yanikoglu
Shades of BLEU, Flavours of Success: The Case of MultiWOZ
Tomáš Nekvinda and Ondřej Dušek
13:00 - 13:45 Panel Discussion with Hady Elsahar, Seraphina Goldfarb-Tarrant, He He, and Ehud Reiter
13:45 - 14:00 Break
14:00 - 15:00 Talk Session
Personalized Response Generation with Tensor Factorization
Zhenghui Wang, Lingxiao Luo, and Diyi Yang
A Review of Human Evaluation for Style Transfer
Eleftheria Briakou, Sweta Agrawal, Ke Zhang, Joel Tetreault, and Marine Carpuat
GOT: Testing for Originality in Natural Language Generation
Jennifer Brooks and Abdou Youssef
Evaluating Text Generation from Discourse Representation Structures
Chunliu Wang, Rik van Noord, Arianna Bisazza, and Johan Bos
15:00 - 16:00 Poster Session
Detecting Hallucinated Content in Conditional Neural Sequence Generation
Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Paco Guzman,
Luke Zettlemoyer, and Marjan Ghazvininejad
Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation
Prakhar Gupta, Yulia Tsvetkov, and Jeffrey Bigham
Perceptual Models of Machine-Edited Text
Elizabeth Merkhofer, Monica-Ann Mendoza, Rebecca Marvin, and John Henderson
Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation
Varun Gangal, Harsh Jhamtani, Eduard Hovy, and Taylor Berg-Kirkpatrick
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages
Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li,
Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar
Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers
Mika Hämäläinen and Khalid Alnajjar
16:00 - 17:00 Keynote by Asli Celikyilmaz
17:00 - 17:45 Panel Discussion with Anya Belz, Asli Celikyilmaz, Mike Lewis, Lisa Li, and Wang Lu
17:45 - 18:00 Break
18:00 - 19:00 GEM Overview Session
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics
Everyone listed on the GEM team page
Reusable Templates and Guides For Documenting Datasets and Models for Natural Language
Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards
Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi,
Sebastian Gehrmann, and Yacine Jernite
Preliminary Results of the GEM Shared Task
GEM Organizers
NL-Augmenter: A Collaborative Effort to Transform and Filter Text Datasets
Kaustubh Dhole, Sebastian Gehrmann, Jascha Sohl-Dickstein, Varun Prashant Gangal,
Tongshuang Wu, Simon Mille, Zhenhao Li, Aadesh Gupta, Samson Tan, Saad Mahmood,
Ashish Shrivastava, Ondrej Dusek, and Jinho D. Choi
19:00 - 20:00 GEM System Session
Structure-to-Text Generation with Self-Training, Acceptability Classifiers and Context-Conditioning
for the GEM Shared Task
Shreyan Bakshi, Soumya Batra, Peyman Heidari, Ankit Arun, Shashank Jain, and Michael White
NUIG-DSI’s submission to The GEM Benchmark 2021
Nivranshu Pasricha, Mihael Arcan, and Paul Buitelaar
System Description for the CommonGen task with the POINTER model
Anna Shvets
SimpleNER Sentence Simplification System for GEM 2021
K V Aditya Srivatsa, Monil Gokani, and Manish Shrivastava
20:00 - 21:00 Poster Session
GO FIGURE: A Meta Evaluation of Factuality in Summarization
Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao
TellMeWhy: A Dataset for Answering Why-Questions in Narratives
Yash Kumar Lal, Nathanael Chambers, Raymond Mooney and Niranjan Balasubramanian
Is Human Scoring the Best Criteria for Summary Evaluation?
Oleg Vasilyev and John Bohannon
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level
Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt
Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification
Neha Srikanth and Junyi Jessy Li
Decoding Methods for Neural Narrative Generation
Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc

Important Dates


February 2 First Call for Shared Task Submissions and Papers, Release of the Training Data

May 3 Workshop Paper Due Date (excl. shared tasks) UPDATED

May 28 Notification of Acceptance (excl. shared tasks)

June 7 Camera-ready papers due (excl. shared tasks)

Shared Task Dates


February 2 Release of the Training Data

March 29 Release of the test sets

May 14 Modeling submissions due

June 11 System Descriptions and Analyses due

June 25 Notification of Acceptance (shared task)

July 9 Camera-ready papers and task descriptions due

August 5-6 Workshop Dates


The workshop is organized by

The shared task and the GEM environment are organized by a larger team, which is listed on this page.