GEM Workshop at ACL 2021

The workshop will be held as part of ACL-IJCNLP 2021, August 1-6, 2021. It will take place on August 6. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).

Note: Our system output submission form is perpetually open; please continue contributing to our benchmark. If you want to help improve GEM in the future, join our team.

Workshop Overview

Natural language generation is one of the most active research fields in NLP, with generation, summarization, and dialog among the most submitted-to tracks. As such, the number of available datasets, metrics, models, and evaluation strategies is increasing rapidly. This leads to a situation in which new models are often evaluated on different Anglo-centric tasks with incompatible evaluation setups. With GEM, we aim to solve this problem by standardizing and improving the corpora on which NLG models are evaluated, and by supporting the development of better evaluation approaches. Submitted papers analyze the state of NLG evaluation and propose better alternatives. Moreover, we are organizing the living GEM benchmark, which incorporates new advances in data and in human and automatic evaluation to make it easier to evaluate models on challenging tasks with the correct tools. In our shared task, models were applied to up to 11 tasks in 18 languages and 80 challenge sets, and their outputs were characterized using a combination of human evaluation and over 50 automatic metrics. Through the presented papers and the shared task, we aim to uncover shortcomings and opportunities for progress.


All times are in UTC; please use a converter like this one to convert to your local time.

We do not distinguish between workshop papers and Findings of the ACL papers that are being presented - they are all great!

If you want to suggest questions to the panels, please submit and vote here.

Time (UTC) Session
11:30 - 12:00 Welcome and Explanation of Logistics (Recording)
12:00 - 13:00 Poster Session
Evaluating the Efficacy of Summarization Evaluation across Languages Fajri Koto, Jey Han Lau, and Timothy Baldwin
Automatic Text Simplification for Social Good: Progress and Challenges Sanja Stajner
Flesch-Kincaid is Not a Text Simplification Evaluation Metric Teerapaun Tanprasert and David Kauchak
Human Perception in Natural Language Generation Lorenzo De Mattei, Huiyuan Lai, Felice Dell'Orletta, and Malvina Nissim
Semantic Similarity Based Evaluation for Abstractive News Summarization Figen Beken Fikri, Kemal Oflazer, and Berrin Yanikoglu
Shades of BLEU, Flavours of Success: The Case of MultiWOZ Tomáš Nekvinda and Ondřej Dušek
13:00 - 13:45 Panel Discussion with Hady Elsahar, Seraphina Goldfarb-Tarrant, He He, and Ehud Reiter (Recording). Suggest questions here.
13:45 - 14:00 Break
14:00 - 15:00 Talk Session (Recording)
Personalized Response Generation with Tensor Factorization Zhenghui Wang, Lingxiao Luo, and Diyi Yang
A Review of Human Evaluation for Style Transfer Eleftheria Briakou, Sweta Agrawal, Ke Zhang, Joel Tetreault, and Marine Carpuat
GOT: Testing for Originality in Natural Language Generation Jennifer Brooks and Abdou Youssef
Evaluating Text Generation from Discourse Representation Structures Chunliu Wang, Rik van Noord, Arianna Bisazza, and Johan Bos
15:00 - 16:00 Poster Session
Detecting Hallucinated Content in Conditional Neural Sequence Generation Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, and Marjan Ghazvininejad
Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation Prakhar Gupta, Yulia Tsvetkov, and Jeffrey Bigham
Perceptual Models of Machine-Edited Text Elizabeth Merkhofer, Monica-Ann Mendoza, Rebecca Marvin, and John Henderson
Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation Varun Gangal, Harsh Jhamtani, Eduard Hovy, and Taylor Berg-Kirkpatrick
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar
Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers Mika Hämäläinen and Khalid Alnajjar
16:00 - 17:00 Keynote by Asli Celikyilmaz (Recording)
Are language models enough for narrative coherence?
Abstract: Automatic text generation enables computers to summarize online meetings, write stories or articles about an event, have conversations in customer service, chit-chat with individuals, describe pictures to the visually impaired, and perform similar tasks. In this talk, I will discuss challenges and shortcomings of building such systems with the current neural text generation models, focusing on issues relating to modeling discourse structure and narrative flow. I will present our recent approaches that imbue transformer-based neural generators with structural representations by way of implicit memory architectures and latent structural embeddings. I will conclude the talk by pointing to avenues for future research.
17:00 - 17:45 Panel Discussion with Anya Belz, Asli Celikyilmaz, Mike Lewis, Lisa Li, and Wang Lu (Recording). Suggest questions here.
17:45 - 18:00 Break
18:00 - 19:00 GEM Overview Session
The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics Everyone listed on the GEM team page
Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi, Sebastian Gehrmann, and Yacine Jernite
Preliminary Results of the GEM Shared Task GEM Organizers
NL-Augmenter: A Collaborative Effort to Transform and Filter Text Datasets Kaustubh Dhole, Sebastian Gehrmann, Jascha Sohl-Dickstein, Varun Prashant Gangal, Tongshuang Wu, Simon Mille, Zhenhao Li, Aadesh Gupta, Samson Tan, Saad Mahmood, Ashish Shrivastava, Ondrej Dusek, and Jinho D. Choi
19:00 - 20:00 GEM System Session
Structure-to-Text Generation with Self-Training, Acceptability Classifiers and Context-Conditioning for the GEM Shared Task Shreyan Bakshi, Soumya Batra, Peyman Heidari, Ankit Arun, Shashank Jain, and Michael White
NUIG-DSI’s submission to The GEM Benchmark 2021 Nivranshu Pasricha, Mihael Arcan, and Paul Buitelaar
System Description for the CommonGen task with the POINTER model Anna Shvets
SimpleNER Sentence Simplification System for GEM 2021 K V Aditya Srivatsa, Monil Gokani, and Manish Shrivastava
20:00 - 21:00 Poster Session
GO FIGURE: A Meta Evaluation of Factuality in Summarization Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao
TellMeWhy: A Dataset for Answering Why-Questions in Narratives Yash Kumar Lal, Nathanael Chambers, Raymond Mooney, and Niranjan Balasubramanian
Is Human Scoring the Best Criteria for Summary Evaluation? Oleg Vasilyev and John Bohannon
Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt
Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification Neha Srikanth and Junyi Jessy Li
Decoding Methods for Neural Narrative Generation Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc

Important Dates


February 2 First Call for Shared Task Submissions and Papers, Release of the Training Data

May 3 Workshop Paper Due Date (excl. shared tasks) UPDATED

May 28 Notification of Acceptance (excl. shared tasks)

June 7 Camera-ready papers due (excl. shared tasks)

Shared Task Dates


February 2 Release of the Training Data

March 29 Release of the test sets

May 14 Modeling submissions due

June 11 System Descriptions and Analyses due

June 25 Notification of Acceptance (shared task)

July 9 Camera-ready papers and task descriptions due

August 5-6 Workshop Dates


The workshop is organized by

The shared task and the GEM environment are organized by a larger team, which is listed on this page.