GEM 💎 Workshop at EMNLP 2022

The second edition of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of EMNLP on December 7, 2022. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).

The workshop will be held in hybrid mode with sessions in-person and via the conference portal.


All times are in Gulf Standard Time (UTC+4); please use a time zone converter to convert them to your local time. To accommodate attendees from as many time zones as possible, we will have a virtual-only part in the evening.

In-Person and Virtual
Start End Session
9:00 10:30 Opening Remarks and Keynote 1 (Sean Welleck)
10:30 11:00 Coffee Break
11:00 12:30 Talk Session 1
12:30 14:00 Lunch
14:00 15:30 Poster Session
15:30 16:00 Coffee Break
16:00 17:00 Keynote 2 (Timo Schick)
17:00 18:30 Talk Session 2
Virtual-Only Part
20:00 21:00 Keynote 3 (Emily Dinan)
21:00 22:30 Poster Session


Keynote 1 - Sean Welleck

Reflections on Trusting Untrustworthy Language Generators


In his 1984 Turing Award Lecture “Reflections on Trusting Trust”, Ken Thompson famously said “You can’t trust code that you did not totally create yourself”. These words are especially relevant today, as powerful and flexible language models generate natural language and code that is increasingly human-like. However, these same systems challenge our trust, exhibiting odd degeneracies, amplifying biases, and producing flawed reasoning. In this talk, I will introduce two directions for harnessing the potential of these language models while mitigating the risks. First, I will discuss unlearning: removing undesirable behaviors by integrating feedback and learning. Second, I will discuss how integrating language models with trustworthy symbolic systems can open the door to tackling challenging mathematical reasoning tasks. Join me as we explore the path towards trusting untrustworthy language generators.


Sean Welleck is a Postdoctoral Scholar at the University of Washington and the Allen Institute for Artificial Intelligence, working with Yejin Choi. His research focuses on algorithms for natural language generation and machine reasoning, with the aim of minimizing the effort needed to trust the output of AI systems. He has developed unlearning, decoding, and evaluation algorithms for controllable neural language generation, and methods for integrating language models with symbolic systems, with a particular focus on mathematical reasoning. He received his Ph.D. from New York University, where he was advised by Kyunghyun Cho. Outside of his research activities, he hosts the Thesis Review Podcast and enjoys running long distances.

Keynote 2 - Timo Schick

Instructable and Collaborative Language Models


Textual content is often the output of a collaborative writing process (which includes writing text, making comments and changes, finding references, and asking others for help), but today’s NLP models are only trained to generate the final output of this process. In this talk, we will discuss an alternative approach where models are trained to imitate the entire writing process. We will look at examples of how this enables models to plan and explain their actions, to correct their own mistakes, and to better collaborate with humans. We will also discuss how to make such models better at following human-written instructions.


Timo Schick is a research scientist at FAIR working on few-shot learning in NLP. Previously, he did his PhD at the Center for Information and Language Processing (CIS) in Munich and worked in industry as a data scientist for several years. Timo's current research focuses on instruction-based learning and teaching language models to collaborate with other entities.

Keynote 3 - Emily Dinan

Challenges in evaluating safety for LLMs


While research on large language models (LLMs) continues to accelerate, much recent work has called attention to anticipated risks and harms from their use in society. We will discuss challenges in evaluating the relative safety of these models as well as current approaches for doing so. Finally, we will highlight avenues for future research into evaluating and mitigating these harms.


Emily Dinan is a Research Engineer at FAIR (Meta AI) in New York. Her research interests include conversational AI, natural language processing, and safety and responsibility in these fields. Recently she has focused on methods for preventing conversational agents from reproducing biased, toxic, or otherwise harmful language. Prior to joining FAIR, she received her master's degree in Mathematics from the University of Washington.

Sessions and Papers

Talk Session 1

Title Authors Mode
DEMETR: Diagnosing Evaluation Metrics for Translation Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta and Mohit Iyyer In Person
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization Wojciech Kryscinski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong and Dragomir Radev In Person
A Survey of Recent Error Annotation Schemes for Automatically Generated Text Rudali Huidrom and Anya Belz In Person
Truncation Sampling as Language Model Desmoothing John Hewitt, Christopher Manning and Percy Liang In Person
Error Analysis of ToTTo Table-to-Text Neural NLG Models Barkavi Sundararajan, Somayajulu Sripada and Ehud Reiter Virtual
Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup Sijia Liu, Patrick Lange, Behnam Hedayatnia, Alexandros Papangelis, Di Jin, Andrew Wirth, Yang Liu and Dilek Hakkani-Tur Virtual

Talk Session 2

Title Authors Mode
Improving abstractive summarization with energy-based re-ranking Diogo Pernes, Afonso Mendes and André F. T. Martins In Person
Revisiting text decomposition methods for NLI-based factuality scoring of summaries John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley and Thomas Schaaf Virtual
A Corpus and Evaluation for Predicting Semi-Structured Human Annotations Andreas Marfurt, Ashley Thornton, David Sylvan, Lonneke van der Plas and James Henderson In Person
Answerability: A custom metric for evaluating chatbot performance Pranav Gupta, Anand A. Rajasekar, Amisha Patel, Mandar Kulkarni, Alexander Sunell, Kyung Kim and Anusua Trivedi Virtual
Control Prefixes for Parameter-Efficient Text Generation Jordan Clive, Kris Cao and Marek Rei Virtual
Assessing Inter-metric Correlation for Multi-document Summarization Evaluation Michael Ridenour, Ameeta Agrawal and Olubusayo Olabisi Virtual

Poster Session - In-Person

Title Authors
Task-driven augmented data evaluation Olga Golovneva, Pan Wei, Khadige Abboud, Charith Peris, Lizhen Tan and Haiyang Yu
Weakly Supervised Context-based Interview Question Generation Samiran Pal, Kaamraan Khan, Avinash Kumar Singh, Subhasish Ghosh, Tapas Nayak, Girish Palshikar and Indrajit Bhattacharya
Analyzing Multi-Task Learning for Abstractive Text Summarization Frederic Thomas Kirstein, Jan Philip Wahle, Terry Ruas and Bela Gipp
CLSE: Corpus of Linguistically Significant Entities Aleksandr Chuklin, Justin Zhao and Mihir Kale
Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations Zixiu Wu, Simone Balloccu, Rim Helaoui, Diego Reforgiato Recupero and Daniele Riboni
Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable? Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella and Giuseppe Riccardi
Enhancing and Evaluating the Grammatical Framework Approach to Logic-to-Text Generation Eduardo Calò, Elze van der Werf, Albert Gatt and Kees van Deemter
Transfer learning for multilingual vacancy text generation Anna Lorincz, David Graus, Dor Lavi and Joao Lebre Magalhaes Pereira
EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start Jonathan Mallinson, Jakub Adamek, Eric Malmi and Aliaksei Severyn
Unsupervised Token-level Hallucination Detection from Summary Generation By-products Andreas Marfurt and James Henderson
T5QL: Taming language models for SQL generation Samuel David Arcadinho, David Aparicio, Hugo Veiga and Antonio Alegria
Human perceiving behavior modeling in evaluation of code generation models Sergey V. Kovalchuk, Vadim Lomshakov and Artem Aliev
GiCCS: A German in-Context Conversational Similarity Benchmark Shima Asaadi, Zahra Kolagar, Alina Liebel and Alessandra Zarcone
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora George Kour, Samuel Ackerman, Eitan Daniel Farchi, Orna Raz, Boaz Carmeli and Ateret Anaby Tavor
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Arman Cohan, Pradeep Dasigi and Kyle Lo
Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Moira Inghilleri, John Wieting and Mohit Iyyer
Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation Nikolai Ilinykh and Simon Dobnik
20Q: Overlap-Free World Knowledge Benchmark for Language Models Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans
Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model Nico Daheim, David Thulke, Christian Dugast and Hermann Ney
Learning to Model Editing Processes Machel Reid and Graham Neubig
On the Effectiveness of Automated Metrics for Text Generation Systems Pius von Däniken, Jan Deriu, Don Tuggener and Mark Cieliebak
Residual Learning of Neural Text Generation with n-gram Language Model Huayang Li, Deng Cai, Jin Xu and Taro Watanabe
He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues Amanda Bertsch, Graham Neubig and Matthew R. Gormley
EtriCA: Event-Triggered Context-Aware Story Generation Augmented by Cross Attention Chen Tang, Chenghua Lin, Henglin Huang, Frank Guerin and Zhihao Zhang
Knowledge Graph Generation From Text Igor Melnyk, Pierre Dognin and Payel Das
Learning When and What to Quote: A Quotation Recommender System with Mutual Promotion of Recommendation and Generation Lingzhi Wang, Xingshan Zeng and Kam-Fai Wong
Discord Questions: A Computational Approach To Diversity Analysis in News Coverage Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Xiang Chen and Caiming Xiong
CONSISTENT: Open-Ended Question Generation From News Articles Tuhin Chakrabarty, Justin Lewis and Smaranda Muresan
Table-To-Text generation and pre-training with TabT5 Ewa Andrejczuk, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene and Yasemin Altun

Poster Session - Virtual

Presenters can choose which of the sessions they want to attend for their posters.

Title Authors
Generating Coherent Narratives with Subtopic Planning to Answer How-to Questions Pengshan Cai, Mo Yu, Fei Liu and Hong Yu
Semantic Similarity as a Window into Vector- and Graph-Based Metrics Wai Ching Leung, Shira Wein and Nathan Schneider
WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia Dina Pisarevskaya and Tatiana Shavrina
Model Criticism for Long-Form Text Generation (Non-Archival) Yuntian Deng, Volodymyr Kuleshov and Alexander Rush
Controllable Text Generation for All Ages: Evaluating a Plug-and-Play Approach to Age-Adapted Dialogue Lennert Jansen, Štěpán Lars Laichter, Arabella Sinclair, Margot van der Goot, Raquel Fernandez and Sandro Pezzelle
Template-based Contact Email Generation for Job Recommendation Qiuchi Li and Christina Lioma
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu
Are Abstractive Summarization Models truly 'Abstractive'? An Empirical Study to Compare the two Forms of Summarization Vinayshekhar Bannihatti Kumar and Rashmi Gangadharaiah
Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation Shulin Huang, Shirong Ma, Yinghui Li, Li Yangning, Shiyang Lin, Haitao Zheng and Ying Shen
Nearest Neighbor Language Models for Stylistic Controllable Generation Severino Trotta, Lucie Flek and Charles Welch
On reporting scores and agreement for error annotation tasks Maja Popović and Anya Belz
Improved Evaluation of Automatic Source Code Summarisation Jesse Phillips, David Bowes, Mahmoud El-Haj and Tracy Hall
Most NLG is Low-Resource: here's what we can do about it David M. Howcroft and Dimitra Gkatzia
What's in a (dataset's) name? The case of BigPatent Silvia Casola, Alberto Lavelli and Horacio Saggion
Multilingual Social Media Text Generation and Evaluation with Few-Shot Prompting Mack Blackburn
Factual Error Correction for Abstractive Summaries Using Entity Retrieval Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim and Kyomin Jung
Coherent Long Text Generation by Contrastive Soft Prompt Guandan Chen, Jiashu Pu, Yadong Xi and Rongsheng Zhang
Improving Dialogue Act Recognition with Augmented Data Khyati Mahajan, Soham Parikh, Quaizar Vohra, Mitul Tiwari and Samira Shaikh
What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann and Walter Daelemans
Narrative Why-Question Answering: A Review of Challenges and Datasets Emil Kalbaliyev and Kairit Sirts
Exploring a POS-based Two-stage Approach for Improving Low-Resource AMR-to-Text Generation Marco Antonio Sobrevilla Cabezudo and Thiago Pardo
What Makes Data-to-Text Generation Hard for Pretrained Language Models? Moniba Keymanesh, Adrian Benton and Mark Dredze
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search Daniel King, Zejiang Shen, Nishant Subramani, Daniel S Weld, Iz Beltagy and Doug Downey
Representation Learning for Resource-Constrained Keyphrase Generation Di Wu, Wasi U. Ahmad, Sunipa Dev and Kai-Wei Chang
Efficient (Soft) Q-Learning for Text Generation with Limited Good Data Han Guo, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu
Wish I Can Feel What You Feel: A Neural Approach for Empathetic Response Generation Yangbin Chen and Chunfeng Liang
Text Editing as Imitation Game Ning Shi, Bin Tang, Bo Yuan, Longtao Huang, Yewen Pu, Jie Fu and Zhouhan Lin
Audience-Centric Natural Language Generation via Style Infusion Samraj Moorjani, Adit Krishnan, Hari Sundaram, Ewa Maslowska and Aravind Sankar
Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation Faeze Brahman, Baolin Peng, Michel Galley, Sudha Rao, Bill Dolan, Snigdha Chaturvedi and Jianfeng Gao
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection Lanrui Wang, Jiangnan Li, Zheng Lin, Fandong Meng, Chenxu Yang, Weiping Wang and Jie Zhou
HeLo: Learning-Free Lookahead Decoding for Conversation Infilling Ivan Lee and Taylor Berg-Kirkpatrick
Data-Efficient Concept Extraction from Pre-trained Language Models for Commonsense Explanation Generation Yanbo Fang and Yongfeng Zhang
MCPG: A Flexible Multi-Level Controllable Framework for Unsupervised Paraphrase Generation Yi Chen, Haiyun Jiang, Lemao Liu, Rui Wang, Shuming Shi and Ruifeng Xu
ParaMac: A General Unsupervised Paraphrase Generation Framework Leveraging Semantic Constraints and Diversifying Mechanisms Jinxin Liu, Jiaxin Shi, Ji Qi, Lei Hou, Juanzi Li and Qi Tian
Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun and Xing Xie
Consecutive Question Generation via Dynamic Multitask Learning Yunji Li, Sujian Li and Xing Shi
Sequentially Controlled Text Generation Alexander Spangher, Yao Ming, Xinyu Hua and Nanyun Peng
Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning Xiangyu Peng, Siyan Li, Sarah Wiegreffe and Mark Riedl
Guiding Neural Story Generation with Reader Models Xiangyu Peng, Kaige Xie, Amal Alabdulkarim, Harshith Kayam, Samihan Dani and Mark Riedl
Temporal Prompts for Conditional Text Generation Shuyang Cao and Lu Wang
A Framework for Automatic Generation of Spoken Question-Answering Data Merve Ünlü Menevşe, Yusufcan Manav, Ebru Arisoy and Arzucan Özgür
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael S. Saxon, Lei Li and William Yang Wang
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation Alisa Liu, Swabha Swayamdipta, Noah A. Smith and Yejin Choi
Plug-and-Play Recipe Generation with Content Planning Yinhong Liu, Yixuan Su, Ehsan Shareghi and Nigel Collier

Important Dates

December 7 Workshop Date


Organizers

  • Antoine Bosselut (EPFL)
  • Khyathi Chandu (Carnegie Mellon University)
  • Kaustubh Dhole (Emory University)
  • Varun Gangal (Carnegie Mellon University)
  • Sebastian Gehrmann (Google Research)
  • Yacine Jernite (Hugging Face)
  • Jekaterina Novikova (NoOverfitting Lab)
  • Laura Perez-Beltrachini (University of Edinburgh)

Steering Committee

  • Wei Xu (Georgia Tech)
  • Esin Durmus (Stanford University)
  • Samira Shaikh (UNC Charlotte)