GEM Workshop 2022

GEM 💎 Workshop at EMNLP 2022

The second edition of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of EMNLP on December 7, 2022. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).

The workshop will be held in hybrid mode with sessions in-person and via the conference portal.

Schedule

All times are in Gulf Standard Time; please use a time zone converter to convert them to your local time. To accommodate attendees from as many time zones as possible, we will have a virtual-only part in the evening.

In-Person and Virtual	Start	End	Event
	9:00	10:30	Opening Remarks and Keynote 1 (Sean Welleck)
	10:30	11:00	Coffee Break
	11:00	12:30	Talk Session 1
	12:30	14:00	Lunch
	14:00	15:30	Poster Session
	15:30	16:00	Coffee Break
	16:00	17:00	Keynote 2 (Timo Schick)
	17:00	18:30	Talk Session 2
Virtual-Only Part
	20:00	21:00	Keynote 3 (Emily Dinan)
	21:00	22:30	Poster Session

Keynotes

Keynote 1 - Sean Welleck

Reflections on Trusting Untrustworthy Language Generators

ABSTRACT

In his 1984 Turing Award Lecture “Reflections on Trusting Trust”, Ken Thompson famously said “You can’t trust code that you did not totally create yourself”. These words are especially relevant today, as powerful and flexible language models generate natural language and code that is increasingly human-like. However, these same systems challenge our trust, exhibiting odd degeneracies, amplifying biases, and producing flawed reasoning. In this talk, I will introduce two directions for harnessing the potential of these language models while mitigating the risks. First, I will discuss unlearning: removing undesirable behaviors by integrating feedback and learning. Second, I will discuss how integrating language models with trustworthy symbolic systems can open the door to tackling challenging mathematical reasoning tasks. Join me as we explore the path towards trusting untrustworthy language generators.

BIO

Sean Welleck is a Postdoctoral Scholar at the University of Washington and the Allen Institute for Artificial Intelligence, working with Yejin Choi. His research focuses on algorithms for natural language generation and machine reasoning, with the aim of minimizing the effort needed to trust the output of AI systems. He has developed unlearning, decoding, and evaluation algorithms for controllable neural language generation, and methods for integrating language models with symbolic systems, with a particular focus on mathematical reasoning. He received his Ph.D. from New York University, where he was advised by Kyunghyun Cho. Outside of his research activities, he hosts the Thesis Review Podcast and enjoys running long distances.

Keynote 2 - Timo Schick

Instructable and Collaborative Language Models

ABSTRACT

Textual content is often the output of a collaborative writing process (which includes writing text, making comments and changes, finding references, and asking others for help), but today’s NLP models are only trained to generate the final output of this process. In this talk, we will discuss an alternative approach where models are trained to imitate the entire writing process. We will look at examples of how this enables models to plan and explain their actions, to correct their own mistakes, and to better collaborate with humans. We will also discuss how to make such models better at following human-written instructions.

BIO

Timo Schick is a research scientist at FAIR working on few-shot learning in NLP. Previously, he did his PhD at the Center for Information and Language Processing (CIS) in Munich and worked in industry as a data scientist for several years. Timo's current research focuses on instruction-based learning and teaching language models to collaborate with other entities.

Keynote 3 - Emily Dinan

Challenges in evaluating safety for LLMs

ABSTRACT

While research on large language models (LLMs) continues to accelerate, much recent work has called attention to anticipated risks and harms from their use in society. We will discuss challenges in evaluating the relative safety of these models as well as current approaches for doing so. Finally, we will highlight avenues for future research into evaluating and mitigating these harms.

BIO

Emily Dinan is a Research Engineer at FAIR (Meta AI) in New York. Her research interests include conversational AI, natural language processing, and safety and responsibility in these fields. Recently she has focused on methods for preventing conversational agents from reproducing biased, toxic, or otherwise harmful language. Prior to joining FAIR, she received her master's degree in Mathematics from the University of Washington.

Sessions and Papers

Talk Session 1

Title	Authors	Mode
DEMETR: Diagnosing Evaluation Metrics for Translation	Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta and Mohit Iyyer	In Person
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization	Wojciech Kryscinski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong and Dragomir Radev	In Person
A Survey of Recent Error Annotation Schemes for Automatically Generated Text	Rudali Huidrom and Anya Belz	In Person
Truncation Sampling as Language Model Desmoothing	John Hewitt, Christopher Manning and Percy Liang	In Person
Error Analysis of ToTTo Table-to-Text Neural NLG Models	Barkavi Sundararajan, Somayajulu Sripada and Ehud Reiter	Virtual
Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup	Sijia Liu, Patrick Lange, Behnam Hedayatnia, Alexandros Papangelis, Di Jin, Andrew Wirth, Yang Liu and Dilek Hakkani-Tur	Virtual

Talk Session 2

Title	Authors	Mode
Improving abstractive summarization with energy-based re-ranking	Diogo Pernes, Afonso Mendes and André F. T. Martins	In Person
Revisiting text decomposition methods for NLI-based factuality scoring of summaries	John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley and Thomas Schaaf	Virtual
A Corpus and Evaluation for Predicting Semi-Structured Human Annotations	Andreas Marfurt, Ashley Thornton, David Sylvan, Lonneke van der Plas and James Henderson	In Person
Answerability: A custom metric for evaluating chatbot performance	Pranav Gupta, Anand A. Rajasekar, Amisha Patel, Mandar Kulkarni, Alexander Sunell, Kyung Kim and Anusua Trivedi	Virtual
Control Prefixes for Parameter-Efficient Text Generation	Jordan Clive, Kris Cao and Marek Rei	Virtual
Assessing Inter-metric Correlation for Multi-document Summarization Evaluation	Michael Ridenour, Ameeta Agrawal and Olubusayo Olabisi	Virtual

Poster Session - In-Person

Title	Authors
Task-driven augmented data evaluation	Olga Golovneva, Pan Wei, Khadige Abboud, Charith Peris, Lizhen Tan and Haiyang Yu
Weakly Supervised Context-based Interview Question Generation	Samiran Pal, Kaamraan Khan, Avinash Kumar Singh, Subhasish Ghosh, Tapas Nayak, Girish Palshikar and Indrajit Bhattacharya
Analyzing Multi-Task Learning for Abstractive Text Summarization	Frederic Thomas Kirstein, Jan Philip Wahle, Terry Ruas and Bela Gipp
CLSE: Corpus of Linguistically Significant Entities	Aleksandr Chuklin, Justin Zhao and Mihir Kale
Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations	Zixiu Wu, Simone Balloccu, Rim Helaoui, Diego Reforgiato Recupero and Daniele Riboni
Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable?	Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella and Giuseppe Riccardi
Enhancing and Evaluating the Grammatical Framework Approach to Logic-to-Text Generation	Eduardo Calò, Elze van der Werf, Albert Gatt and Kees van Deemter
Transfer learning for multilingual vacancy text generation	Anna Lorincz, David Graus, Dor Lavi and Joao Lebre Magalhaes Pereira
EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start	Jonathan Mallinson, Jakub Adamek, Eric Malmi and Aliaksei Severyn
Unsupervised Token-level Hallucination Detection from Summary Generation By-products	Andreas Marfurt and James Henderson
T5QL: Taming language models for SQL generation	Samuel David Arcadinho, David Aparicio, Hugo Veiga and Antonio Alegria
Human perceiving behavior modeling in evaluation of code generation models	Sergey V. Kovalchuk, Vadim Lomshakov and Artem Aliev
GiCCS: A German in-Context Conversational Similarity Benchmark	Shima Asaadi, Zahra Kolagar, Alina Liebel and Alessandra Zarcone
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora	George Kour, Samuel Ackerman, Eitan Daniel Farchi, Orna Raz, Boaz Carmeli and Ateret Anaby Tavor
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization	Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Arman Cohan, Pradeep Dasigi and Kyle Lo
Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature	Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Moira Inghilleri, John Wieting and Mohit Iyyer
Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation	Nikolai Ilinykh and Simon Dobnik
20Q: Overlap-Free World Knowledge Benchmark for Language Models	Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans
Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model	Nico Daheim, David Thulke, Christian Dugast and Hermann Ney
Learning to Model Editing Processes	Machel Reid and Graham Neubig
On the Effectiveness of Automated Metrics for Text Generation Systems	Pius von Däniken, Jan Deriu, Don Tuggener and Mark Cieliebak
Residual Learning of Neural Text Generation with n-gram Language Model	Huayang Li, Deng Cai, Jin Xu and Taro Watanabe
He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues	Amanda Bertsch, Graham Neubig and Matthew R. Gormley
EtriCA: Event-Triggered Context-Aware Story Generation Augmented by Cross Attention	Chen Tang, Chenghua Lin, Henglin Huang, Frank Guerin and Zhihao Zhang
Knowledge Graph Generation From Text	Igor Melnyk, Pierre Dognin and Payel Das
Learning When and What to Quote: A Quotation Recommender System with Mutual Promotion of Recommendation and Generation	Lingzhi Wang, Xingshan Zeng and Kam-Fai Wong
Discord Questions: A Computational Approach To Diversity Analysis in News Coverage	Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Xiang Chen and Caiming Xiong
CONSISTENT: Open-Ended Question Generation From News Articles	Tuhin Chakrabarty, Justin Lewis and Smaranda Muresan
Table-To-Text generation and pre-training with TabT5	Ewa Andrejczuk, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene and Yasemin Altun

Poster Session - Virtual

Presenters can choose which of the poster sessions they want to attend to present their posters.

Title	Authors
Generating Coherent Narratives with Subtopic Planning to Answer How-to Questions	Pengshan Cai, Mo Yu, Fei Liu and Hong Yu
Semantic Similarity as a Window into Vector- and Graph-Based Metrics	Wai Ching Leung, Shira Wein and Nathan Schneider
WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia	Dina Pisarevskaya and Tatiana Shavrina
Model Criticism for Long-Form Text Generation (Non-Archival)	Yuntian Deng, Volodymyr Kuleshov and Alexander Rush
Controllable Text Generation for All Ages: Evaluating a Plug-and-Play Approach to Age-Adapted Dialogue	Lennert Jansen, Štěpán Lars Laichter, Arabella Sinclair, Margot van der Goot, Raquel Fernandez and Sandro Pezzelle
Template-based Contact Email Generation for Job Recommendation	Qiuchi Li and Christina Lioma
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation	Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu
Are Abstractive Summarization Models truly 'Abstractive'? An Empirical Study to Compare the two Forms of Summarization	Vinayshekhar Bannihatti Kumar and Rashmi Gangadharaiah
Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation	Shulin Huang, Shirong Ma, Yinghui Li, Li Yangning, Shiyang Lin, Haitao Zheng and Ying Shen
Nearest Neighbor Language Models for Stylistic Controllable Generation	Severino Trotta, Lucie Flek and Charles Welch
On reporting scores and agreement for error annotation tasks	Maja Popović and Anya Belz
Improved Evaluation of Automatic Source Code Summarisation	Jesse Phillips, David Bowes, Mahmoud El-Haj and Tracy Hall
Most NLG is Low-Resource: here's what we can do about it	David M. Howcroft and Dimitra Gkatzia
What's in a (dataset's) name? The case of BigPatent	Silvia Casola, Alberto Lavelli and Horacio Saggion
Multilingual Social Media Text Generation and Evaluation with Few-Shot Prompting	Mack Blackburn
Factual Error Correction for Abstractive Summaries Using Entity Retrieval	Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim and Kyomin Jung
Coherent Long Text Generation by Contrastive Soft Prompt	Guandan Chen, Jiashu Pu, Yadong Xi and Rongsheng Zhang
Improving Dialogue Act Recognition with Augmented Data	Khyati Mahajan, Soham Parikh, Quaizar Vohra, Mitul Tiwari and Samira Shaikh
What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation	Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann and Walter Daelemans
Narrative Why-Question Answering: A Review of Challenges and Datasets	Emil Kalbaliyev and Kairit Sirts
Exploring a POS-based Two-stage Approach for Improving Low-Resource AMR-to-Text Generation	Marco Antonio Sobrevilla Cabezudo and Thiago Pardo
What Makes Data-to-Text Generation Hard for Pretrained Language Models?	Moniba Keymanesh, Adrian Benton and Mark Dredze
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search	Daniel King, Zejiang Shen, Nishant Subramani, Daniel S Weld, Iz Beltagy and Doug Downey
Representation Learning for Resource-Constrained Keyphrase Generation	Di Wu, Wasi U. Ahmad, Sunipa Dev and Kai-Wei Chang
Efficient (Soft) Q-Learning for Text Generation with Limited Good Data	Han Guo, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu
Wish I Can Feel What You Feel: A Neural Approach for Empathetic Response Generation	Yangbin Chen and Chunfeng Liang
Text Editing as Imitation Game	Ning Shi, Bin Tang, Bo Yuan, Longtao Huang, Yewen Pu, Jie Fu and Zhouhan Lin
Audience-Centric Natural Language Generation via Style Infusion	Samraj Moorjani, Adit Krishnan, Hari Sundaram, Ewa Maslowska and Aravind Sankar
Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation	Faeze Brahman, Baolin Peng, Michel Galley, Sudha Rao, Bill Dolan, Snigdha Chaturvedi and Jianfeng Gao
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection	Lanrui Wang, Jiangnan Li, Zheng Lin, Fandong Meng, Chenxu Yang, Weiping Wang and Jie Zhou
HeLo: Learning-Free Lookahead Decoding for Conversation Infilling	Ivan Lee and Taylor Berg-Kirkpatrick
Data-Efficient Concept Extraction from Pre-trained Language Models for Commonsense Explanation Generation	Yanbo Fang and Yongfeng Zhang
MCPG: A Flexible Multi-Level Controllable Framework for Unsupervised Paraphrase Generation	Yi Chen, Haiyun Jiang, Lemao Liu, Rui Wang, Shuming Shi and Ruifeng Xu
ParaMac: A General Unsupervised Paraphrase Generation Framework Leveraging Semantic Constraints and Diversifying Mechanisms	Jinxin Liu, Jiaxin Shi, Ji Qi, Lei Hou, Juanzi Li and Qi Tian
Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation	Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun and Xing Xie
Consecutive Question Generation via Dynamic Multitask Learning	Yunji Li, Sujian Li and Xing Shi
Sequentially Controlled Text Generation	Alexander Spangher, Yao Ming, Xinyu Hua and Nanyun Peng
Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning	Xiangyu Peng, Siyan Li, Sarah Wiegreffe and Mark Riedl
Guiding Neural Story Generation with Reader Models	Xiangyu Peng, Kaige Xie, Amal Alabdulkarim, Harshith Kayam, Samihan Dani and Mark Riedl
Temporal Prompts for Conditional Text Generation	Shuyang Cao and Lu Wang
A Framework for Automatic Generation of Spoken Question-Answering Data	Merve Ünlü Menevşe, Yusufcan Manav, Ebru Arisoy and Arzucan Özgür
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis	Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael S. Saxon, Lei Li and William Yang Wang
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation	Alisa Liu, Swabha Swayamdipta, Noah A. Smith and Yejin Choi
Plug-and-Play Recipe Generation with Content Planning	Yinhong Liu, Yixuan Su, Ehsan Shareghi and Nigel Collier

Important Dates

December 7, 2022: Workshop Date

Organization

Antoine Bosselut (EPFL)
Khyathi Chandu (Carnegie Mellon University)
Kaustubh Dhole (Emory University)
Varun Gangal (Carnegie Mellon University)
Sebastian Gehrmann (Google Research)
Yacine Jernite (Hugging Face)
Jekaterina Novikova (NoOverfitting Lab)
Laura Perez-Beltrachini (University of Edinburgh)

Steering Committee

Wei Xu (Georgia Tech)
Esin Durmus (Stanford University)
Samira Shaikh (UNC Charlotte)