The workshop will be held as part of ACL-IJCNLP 2021, August 1-6, 2021. It will take place on August 6. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).
Note: Our system output submission form is perpetually open; please continue contributing to our benchmark. If you would like to help improve GEM in the future, join our team.
Workshop Overview
Natural language generation is one of the most active research fields in NLP, with generation, summarization, and dialog among the tracks that receive the most submissions. As a result, the number of available datasets, metrics, models, and evaluation strategies is increasing rapidly, leading to a situation in which new models are often evaluated on different, English-centric tasks with incompatible evaluation setups. With GEM, we aim to solve this problem by standardizing and improving the corpora on which NLG models are evaluated, and by supporting the development of better evaluation approaches. Submitted papers analyze the state of NLG evaluation and propose better alternatives. Moreover, we organize the living GEM benchmark, which incorporates new advances in data and in human and automatic evaluation to make it easier to evaluate models on challenging tasks with the appropriate tools. In our shared task, models were applied to up to 11 tasks in 18 languages and 80 challenge sets, and their outputs were characterized using a combination of human evaluation and over 50 automatic metrics. Through the presented papers and the shared task, we aim to uncover shortcomings and opportunities for progress.
Schedule
All times are in UTC; please use a converter like this one to convert to your local time.
We do not distinguish between workshop papers and Findings of the ACL papers that are being presented - they are all great!
If you want to suggest questions to the panels, please submit and vote here.
| Time (UTC) | Session |
|---|---|
| 11:30 - 12:00 | Welcome and Explanation of Logistics (Recording) |
| 12:00 - 13:00 | Poster Session |
| | *Evaluating the Efficacy of Summarization Evaluation across Languages*, Fajri Koto, Jey Han Lau, and Timothy Baldwin |
| | *Automatic Text Simplification for Social Good: Progress and Challenges*, Sanja Stajner |
| | *Flesch-Kincaid is Not a Text Simplification Evaluation Metric*, Teerapaun Tanprasert and David Kauchak |
| | *Human Perception in Natural Language Generation*, Lorenzo De Mattei, Huiyuan Lai, Felice Dell'Orletta, and Malvina Nissim |
| | *Semantic Similarity Based Evaluation for Abstractive News Summarization*, Figen Beken Fikri, Kemal Oflazer, and Berrin Yanikoglu |
| | *Shades of BLEU, Flavours of Success: The Case of MultiWOZ*, Tomáš Nekvinda and Ondřej Dušek |
| 13:00 - 13:45 | Panel Discussion with Hady Elsahar, Seraphina Goldfarb-Tarrant, He He, and Ehud Reiter. Suggest questions here. (Recording) |
| 13:45 - 14:00 | Break |
| 14:00 - 15:00 | Talk Session (Recording) |
| | *Personalized Response Generation with Tensor Factorization*, Zhenghui Wang, Lingxiao Luo, and Diyi Yang |
| | *A Review of Human Evaluation for Style Transfer*, Eleftheria Briakou, Sweta Agrawal, Ke Zhang, Joel Tetreault, and Marine Carpuat |
| | *GOT: Testing for Originality in Natural Language Generation*, Jennifer Brooks and Abdou Youssef |
| | *Evaluating Text Generation from Discourse Representation Structures*, Chunliu Wang, Rik van Noord, Arianna Bisazza, and Johan Bos |
| 15:00 - 16:00 | Poster Session |
| | *Detecting Hallucinated Content in Conditional Neural Sequence Generation*, Chunting Zhou, Graham Neubig, Jiatao Gu, Mona Diab, Paco Guzman, Luke Zettlemoyer, and Marjan Ghazvininejad |
| | *Synthesizing Adversarial Negative Responses for Robust Response Ranking and Evaluation*, Prakhar Gupta, Yulia Tsvetkov, and Jeffrey Bigham |
| | *Perceptual Models of Machine-Edited Text*, Elizabeth Merkhofer, Monica-Ann Mendoza, Rebecca Marvin, and John Henderson |
| | *Improving Automated Evaluation of Open Domain Dialog via Diverse Reference Augmentation*, Varun Gangal, Harsh Jhamtani, Eduard Hovy, and Taylor Berg-Kirkpatrick |
| | *XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages*, Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang, M. Sohel Rahman, and Rifat Shahriyar |
| | *Human Evaluation of Creative NLG Systems: An Interdisciplinary Survey on Recent Papers*, Mika Hämäläinen and Khalid Alnajjar |
| 16:00 - 17:00 | Keynote by Asli Celikyilmaz (Recording): *Are language models enough for narrative coherence?* Abstract: Automatic text generation enables computers to summarize online meetings, write stories or articles about an event, hold customer-service conversations, chit-chat with individuals, describe pictures to the visually impaired, and perform similar tasks. In this talk, I will discuss the challenges and shortcomings of building such systems with current neural text generation models, focusing on issues relating to modeling discourse structure and narrative flow. I will present our recent approaches that imbue transformer-based neural generators with structural representations by way of implicit memory architectures and latent structural embeddings. I will conclude the talk by pointing to avenues for future research. |
| 17:00 - 17:45 | Panel Discussion with Anya Belz, Asli Celikyilmaz, Mike Lewis, Lisa Li, and Wang Lu. Suggest questions here. (Recording) |
| 17:45 - 18:00 | Break |
| 18:00 - 19:00 | GEM Overview Session |
| | *The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics*, everyone listed on the GEM team page |
| | *Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards*, Angelina McMillan-Major, Salomey Osei, Juan Diego Rodriguez, Pawan Sasanka Ammanamanchi, Sebastian Gehrmann, and Yacine Jernite |
| | *Preliminary Results of the GEM Shared Task*, GEM Organizers |
| | *NL-Augmenter: A Collaborative Effort to Transform and Filter Text Datasets*, Kaustubh Dhole, Sebastian Gehrmann, Jascha Sohl-Dickstein, Varun Prashant Gangal, Tongshuang Wu, Simon Mille, Zhenhao Li, Aadesh Gupta, Samson Tan, Saad Mahmood, Ashish Shrivastava, Ondrej Dusek, and Jinho D. Choi |
| 19:00 - 20:00 | GEM System Session |
| | *Structure-to-Text Generation with Self-Training, Acceptability Classifiers and Context-Conditioning for the GEM Shared Task*, Shreyan Bakshi, Soumya Batra, Peyman Heidari, Ankit Arun, Shashank Jain, and Michael White |
| | *NUIG-DSI’s submission to The GEM Benchmark 2021*, Nivranshu Pasricha, Mihael Arcan, and Paul Buitelaar |
| | *System Description for the CommonGen task with the POINTER model*, Anna Shvets |
| | *SimpleNER Sentence Simplification System for GEM 2021*, K V Aditya Srivatsa, Monil Gokani, and Manish Shrivastava |
| 20:00 - 21:00 | Poster Session |
| | *GO FIGURE: A Meta Evaluation of Factuality in Summarization*, Saadia Gabriel, Asli Celikyilmaz, Rahul Jha, Yejin Choi, and Jianfeng Gao |
| | *TellMeWhy: A Dataset for Answering Why-Questions in Narratives*, Yash Kumar Lal, Nathanael Chambers, Raymond Mooney, and Niranjan Balasubramanian |
| | *Is Human Scoring the Best Criteria for Summary Evaluation?*, Oleg Vasilyev and John Bohannon |
| | *Are Larger Pretrained Language Models Uniformly Better? Comparing Performance at the Instance Level*, Ruiqi Zhong, Dhruba Ghosh, Dan Klein, and Jacob Steinhardt |
| | *Elaborative Simplification: Content Addition and Explanation Generation in Text Simplification*, Neha Srikanth and Junyi Jessy Li |
| | *Decoding Methods for Neural Narrative Generation*, Alexandra DeLucia, Aaron Mueller, Xiang Lisa Li, and João Sedoc |
Important Dates
Workshop
- February 2: First Call for Shared Task Submissions and Papers, Release of the Training Data
- May 3: Workshop Paper Due Date (excl. shared tasks) UPDATED
- May 28: Notification of Acceptance (excl. shared tasks)
- June 7: Camera-ready papers due (excl. shared tasks)
Shared Task Dates
Modeling
- February 2: Release of the Training Data
- March 29: Release of the Test Sets
- May 14: Modeling submissions due
- June 11: System Descriptions and Analyses due
- June 25: Notification of Acceptance (shared task)
- July 9: Camera-ready papers and task descriptions due
- August 5-6: Workshop Dates
Organization
The workshop is organized by
- Antoine Bosselut (Stanford University)
- Esin Durmus (Cornell University)
- Varun Prashant Gangal (Carnegie Mellon University)
- Sebastian Gehrmann (Google Research)
- Yacine Jernite (Hugging Face)
- Laura Perez-Beltrachini (University of Edinburgh)
- Samira Shaikh (UNC Charlotte)
- Wei Xu (Georgia Tech)
The shared task and the GEM environment are organized by a larger team, which is listed on this page.