The second edition of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of EMNLP on December 7, 2022. It is endorsed by the ACL Special Interest Group on Natural Language Generation (SIGGEN).
The workshop will be held in hybrid mode with sessions in-person and via the conference portal.
Schedule
All times are in Gulf Standard Time; please use a time zone converter to convert to your local time. To accommodate attendees from as many time zones as possible, we will have a virtual-only part in the evening.
In-Person and Virtual
Start | End | Session |
---|---|---|
9:00 | 10:30 | Opening Remarks and Keynote 1 (Sean Welleck) |
10:30 | 11:00 | Coffee Break |
11:00 | 12:30 | Talk Session 1 |
12:30 | 14:00 | Lunch |
14:00 | 15:30 | Poster Session |
15:30 | 16:00 | Coffee Break |
16:00 | 17:00 | Keynote 2 (Timo Schick) |
17:00 | 18:30 | Talk Session 2 |
Virtual-Only Part
Start | End | Session |
---|---|---|
20:00 | 21:00 | Keynote 3 (Emily Dinan) |
21:00 | 22:30 | Poster Session |
Keynotes
Keynote 1 - Sean Welleck
Reflections on Trusting Untrustworthy Language Generators
ABSTRACT
In his 1984 Turing Award Lecture "Reflections on Trusting Trust", Ken Thompson famously said "You can't trust code that you did not totally create yourself". These words are especially relevant today, as powerful and flexible language models generate natural language and code that is increasingly human-like. However, these same systems challenge our trust, exhibiting odd degeneracies, amplifying biases, and producing flawed reasoning. In this talk, I will introduce two directions for harnessing the potential of these language models while mitigating the risks. First, I will discuss unlearning: removing undesirable behaviors by integrating feedback and learning. Second, I will discuss how integrating language models with trustworthy symbolic systems can open the door to tackling challenging mathematical reasoning tasks. Join me as we explore the path towards trusting untrustworthy language generators.
BIO
Sean Welleck is a Postdoctoral Scholar at the University of Washington and the Allen Institute for Artificial Intelligence, working with Yejin Choi. His research focuses on algorithms for natural language generation and machine reasoning, with the aim of minimizing the effort needed to trust the output of AI systems. He has developed unlearning, decoding, and evaluation algorithms for controllable neural language generation, and methods for integrating language models with symbolic systems, with a particular focus on mathematical reasoning. He received his Ph.D. from New York University, where he was advised by Kyunghyun Cho. Outside of his research activities, he hosts the Thesis Review Podcast and enjoys running long distances.
Keynote 2 - Timo Schick
Instructable and Collaborative Language Models
ABSTRACT
Textual content is often the output of a collaborative writing process (which includes writing text, making comments and changes, finding references, and asking others for help), but today's NLP models are only trained to generate the final output of this process. In this talk, we will discuss an alternative approach where models are trained to imitate the entire writing process. We will look at examples of how this enables models to plan and explain their actions, to correct their own mistakes, and to better collaborate with humans. We will also discuss how to make such models better at following human-written instructions.
BIO
Timo Schick is a research scientist at FAIR working on few-shot learning in NLP. Previously, he did his PhD at the Center for Information and Language Processing (CIS) in Munich and worked in industry as a data scientist for several years. Timo's current research focuses on instruction-based learning and teaching language models to collaborate with other entities.
Keynote 3 - Emily Dinan
Challenges in evaluating safety for LLMs
ABSTRACT
While research on large language models (LLMs) continues to accelerate, much recent work has called attention to anticipated risks and harms from their use in society. We will discuss challenges in evaluating the relative safety of these models as well as current approaches for doing so. Finally, we will highlight avenues for future research into evaluating and mitigating these harms.
BIO
Emily Dinan is a Research Engineer at FAIR (Meta AI) in New York. Her research interests include conversational AI, natural language processing, and safety and responsibility in these fields. Recently she has focused on methods for preventing conversational agents from reproducing biased, toxic, or otherwise harmful language. Prior to joining FAIR, she received her master's degree in Mathematics from the University of Washington.
Sessions and Papers
Talk Session 1
Title | Authors | Mode |
---|---|---|
DEMETR: Diagnosing Evaluation Metrics for Translation | Marzena Karpinska, Nishant Raj, Katherine Thai, Yixiao Song, Ankita Gupta and Mohit Iyyer | In Person |
BOOKSUM: A Collection of Datasets for Long-form Narrative Summarization | Wojciech Kryscinski, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong and Dragomir Radev | In Person |
A Survey of Recent Error Annotation Schemes for Automatically Generated Text | Rudali Huidrom and Anya Belz | In Person |
Truncation Sampling as Language Model Desmoothing | John Hewitt, Christopher Manning and Percy Liang | In Person |
Error Analysis of ToTTo Table-to-Text Neural NLG Models | Barkavi Sundararajan, Somayajulu Sripada and Ehud Reiter | Virtual |
Towards Credible Human Evaluation of Open-Domain Dialog Systems Using Interactive Setup | Sijia Liu, Patrick Lange, Behnam Hedayatnia, Alexandros Papangelis, Di Jin, Andrew Wirth, Yang Liu and Dilek Hakkani-Tur | Virtual |
Talk Session 2
Title | Authors | Mode |
---|---|---|
Improving abstractive summarization with energy-based re-ranking | Diogo Pernes, Afonso Mendes and André F. T. Martins | In Person |
Revisiting text decomposition methods for NLI-based factuality scoring of summaries | John Glover, Federico Fancellu, Vasudevan Jagannathan, Matthew R. Gormley and Thomas Schaaf | Virtual |
A Corpus and Evaluation for Predicting Semi-Structured Human Annotations | Andreas Marfurt, Ashley Thornton, David Sylvan, Lonneke van der Plas and James Henderson | In Person |
Answerability: A custom metric for evaluating chatbot performance | Pranav Gupta, Anand A. Rajasekar, Amisha Patel, Mandar Kulkarni, Alexander Sunell, Kyung Kim and Anusua Trivedi | Virtual |
Control Prefixes for Parameter-Efficient Text Generation | Jordan Clive, Kris Cao and Marek Rei | Virtual |
Assessing Inter-metric Correlation for Multi-document Summarization Evaluation | Michael Ridenour, Ameeta Agrawal and Olubusayo Olabisi | Virtual |
Poster Session - In-Person
Title | Authors |
---|---|
Task-driven augmented data evaluation | Olga Golovneva, Pan Wei, Khadige Abboud, Charith Peris, Lizhen Tan and Haiyang Yu |
Weakly Supervised Context-based Interview Question Generation | Samiran Pal, Kaamraan Khan, Avinash Kumar Singh, Subhasish Ghosh, Tapas Nayak, Girish Palshikar and Indrajit Bhattacharya |
Analyzing Multi-Task Learning for Abstractive Text Summarization | Frederic Thomas Kirstein, Jan Philip Wahle, Terry Ruas and Bela Gipp |
CLSE: Corpus of Linguistically Significant Entities | Aleksandr Chuklin, Justin Zhao and Mihir Kale |
Towards In-Context Non-Expert Evaluation of Reflection Generation for Counselling Conversations | Zixiu Wu, Simone Balloccu, Rim Helaoui, Diego Reforgiato Recupero and Daniele Riboni |
Evaluation of Response Generation Models: Shouldn't It Be Shareable and Replicable? | Seyed Mahed Mousavi, Gabriel Roccabruna, Michela Lorandi, Simone Caldarella and Giuseppe Riccardi |
Enhancing and Evaluating the Grammatical Framework Approach to Logic-to-Text Generation | Eduardo Calò, Elze van der Werf, Albert Gatt and Kees van Deemter |
Transfer learning for multilingual vacancy text generation | Anna Lorincz, David Graus, Dor Lavi and Joao Lebre Magalhaes Pereira |
EdiT5: Semi-Autoregressive Text Editing with T5 Warm-Start | Jonathan Mallinson, Jakub Adamek, Eric Malmi and Aliaksei Severyn |
Unsupervised Token-level Hallucination Detection from Summary Generation By-products | Andreas Marfurt and James Henderson |
T5QL: Taming language models for SQL generation | Samuel David Arcadinho, David Aparicio, Hugo Veiga and Antonio Alegria |
Human perceiving behavior modeling in evaluation of code generation models | Sergey V. Kovalchuk, Vadim Lomshakov and Artem Aliev |
GiCCS: A German in-Context Conversational Similarity Benchmark | Shima Asaadi, Zahra Kolagar, Alina Liebel and Alessandra Zarcone |
Measuring the Measuring Tools: An Automatic Evaluation of Semantic Metrics for Text Corpora | George Kour, Samuel Ackerman, Eitan Daniel Farchi, Orna Raz, Boaz Carmeli and Ateret Anaby Tavor |
LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization | Kalpesh Krishna, Erin Bransom, Bailey Kuehl, Mohit Iyyer, Arman Cohan, Pradeep Dasigi and Kyle Lo |
Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature | Katherine Thai, Marzena Karpinska, Kalpesh Krishna, Moira Inghilleri, John Wieting and Mohit Iyyer |
Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation | Nikolai Ilinykh and Simon Dobnik |
20Q: Overlap-Free World Knowledge Benchmark for Language Models | Maxime De Bruyn, Ehsan Lotfi, Jeska Buhmann and Walter Daelemans |
Controllable Factuality in Document-Grounded Dialog Systems Using a Noisy Channel Model | Nico Daheim, David Thulke, Christian Dugast and Hermann Ney |
Learning to Model Editing Processes | Machel Reid and Graham Neubig |
On the Effectiveness of Automated Metrics for Text Generation Systems | Pius von Däniken, Jan Deriu, Don Tuggener and Mark Cieliebak |
Residual Learning of Neural Text Generation with n-gram Language Model | Huayang Li, Deng Cai, Jin Xu and Taro Watanabe |
He Said, She Said: Style Transfer for Shifting the Perspective of Dialogues | Amanda Bertsch, Graham Neubig and Matthew R. Gormley |
EtriCA: Event-Triggered Context-Aware Story Generation Augmented by Cross Attention | Chen Tang, Chenghua Lin, Henglin Huang, Frank Guerin and Zhihao Zhang |
Knowledge Graph Generation From Text | Igor Melnyk, Pierre Dognin and Payel Das |
Learning When and What to Quote: A Quotation Recommender System with Mutual Promotion of Recommendation and Generation | Lingzhi Wang, Xingshan Zeng and Kam-Fai Wong |
Discord Questions: A Computational Approach To Diversity Analysis in News Coverage | Philippe Laban, Chien-Sheng Wu, Lidiya Murakhovs'ka, Xiang Chen and Caiming Xiong |
CONSISTENT: Open-Ended Question Generation From News Articles | Tuhin Chakrabarty, Justin Lewis and Smaranda Muresan |
Table-To-Text generation and pre-training with TabT5 | Ewa Andrejczuk, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene and Yasemin Altun |
Poster Session - Virtual
Presenters can choose which of the sessions they want to attend for their posters.
Title | Authors |
---|---|
Generating Coherent Narratives with Subtopic Planning to Answer How-to Questions | Pengshan Cai, Mo Yu, Fei Liu and Hong Yu |
Semantic Similarity as a Window into Vector- and Graph-Based Metrics | Wai Ching Leung, Shira Wein and Nathan Schneider |
WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia | Dina Pisarevskaya and Tatiana Shavrina |
Model Criticism for Long-Form Text Generation (Non-Archival) | Yuntian Deng, Volodymyr Kuleshov and Alexander Rush |
Controllable Text Generation for All Ages: Evaluating a Plug-and-Play Approach to Age-Adapted Dialogue | Lennert Jansen, Štěpán Lars Laichter, Arabella Sinclair, Margot van der Goot, Raquel Fernandez and Sandro Pezzelle |
Template-based Contact Email Generation for Job Recommendation | Qiuchi Li and Christina Lioma |
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation | Mingkai Deng, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu |
Are Abstractive Summarization Models truly 'Abstractive'? An Empirical Study to Compare the two Forms of Summarization | Vinayshekhar Bannihatti Kumar and Rashmi Gangadharaiah |
Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation | Shulin Huang, Shirong Ma, Yinghui Li, Li Yangning, Shiyang Lin, Haitao Zheng and Ying Shen |
Nearest Neighbor Language Models for Stylistic Controllable Generation | Severino Trotta, Lucie Flek and Charles Welch |
On reporting scores and agreement for error annotation tasks | Maja Popović and Anya Belz |
Improved Evaluation of Automatic Source Code Summarisation | Jesse Phillips, David Bowes, Mahmoud El-Haj and Tracy Hall |
Most NLG is Low-Resource: here's what we can do about it | David M. Howcroft and Dimitra Gkatzia |
What's in a (dataset's) name? The case of BigPatent | Silvia Casola, Alberto Lavelli and Horacio Saggion |
Multilingual Social Media Text Generation and Evaluation with Few-Shot Prompting | Mack Blackburn |
Factual Error Correction for Abstractive Summaries Using Entity Retrieval | Hwanhee Lee, Cheoneum Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Juae Kim and Kyomin Jung |
Coherent Long Text Generation by Contrastive Soft Prompt | Guandan Chen, Jiashu Pu, Yadong Xi and Rongsheng Zhang |
Improving Dialogue Act Recognition with Augmented Data | Khyati Mahajan, Soham Parikh, Quaizar Vohra, Mitul Tiwari and Samira Shaikh |
What Was Your Name Again? Interrogating Generative Conversational Models For Factual Consistency Evaluation | Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann and Walter Daelemans |
Narrative Why-Question Answering: A Review of Challenges and Datasets | Emil Kalbaliyev and Kairit Sirts |
Exploring a POS-based Two-stage Approach for Improving Low-Resource AMR-to-Text Generation | Marco Antonio Sobrevilla Cabezudo and Thiago Pardo |
What Makes Data-to-Text Generation Hard for Pretrained Language Models? | Moniba Keymanesh, Adrian Benton and Mark Dredze |
Don't Say What You Don't Know: Improving the Consistency of Abstractive Summarization by Constraining Beam Search | Daniel King, Zejiang Shen, Nishant Subramani, Daniel S Weld, Iz Beltagy and Doug Downey |
Representation Learning for Resource-Constrained Keyphrase Generation | Di Wu, Wasi U. Ahmad, Sunipa Dev and Kai-Wei Chang |
Efficient (Soft) Q-Learning for Text Generation with Limited Good Data | Han Guo, Bowen Tan, Zhengzhong Liu, Eric Xing and Zhiting Hu |
Wish I Can Feel What You Feel: A Neural Approach for Empathetic Response Generation | Yangbin Chen and Chunfeng Liang |
Text Editing as Imitation Game | Ning Shi, Bin Tang, Bo Yuan, Longtao Huang, Yewen Pu, Jie Fu and Zhouhan Lin |
Audience-Centric Natural Language Generation via Style Infusion | Samraj Moorjani, Adit Krishnan, Hari Sundaram, Ewa Maslowska and Aravind Sankar |
Grounded Keys-to-Text Generation: Towards Factual Open-Ended Generation | Faeze Brahman, Baolin Peng, Michel Galley, Sudha Rao, Bill Dolan, Snigdha Chaturvedi and Jianfeng Gao |
Empathetic Dialogue Generation via Sensitive Emotion Recognition and Sensible Knowledge Selection | Lanrui Wang, Jiangnan Li, Zheng Lin, Fandong Meng, Chenxu Yang, Weiping Wang and Jie Zhou |
HeLo: Learning-Free Lookahead Decoding for Conversation Infilling | Ivan Lee and Taylor Berg-Kirkpatrick |
Data-Efficient Concept Extraction from Pre-trained Language Models for Commonsense Explanation Generation | Yanbo Fang and Yongfeng Zhang |
MCPG: A Flexible Multi-Level Controllable Framework for Unsupervised Paraphrase Generation | Yi Chen, Haiyun Jiang, Lemao Liu, Rui Wang, Shuming Shi and Ruifeng Xu |
ParaMac: A General Unsupervised Paraphrase Generation Framework Leveraging Semantic Constraints and Diversifying Mechanisms | Jinxin Liu, Jiaxin Shi, Ji Qi, Lei Hou, Juanzi Li and Qi Tian |
Recurrence Boosts Diversity! Revisiting Recurrent Latent Variable in Transformer-Based Variational AutoEncoder for Diverse Text Generation | Jinyi Hu, Xiaoyuan Yi, Wenhao Li, Maosong Sun and Xing Xie |
Consecutive Question Generation via Dynamic Multitask Learning | Yunji Li, Sujian Li and Xing Shi |
Sequentially Controlled Text Generation | Alexander Spangher, Yao Ming, Xinyu Hua and Nanyun Peng |
Inferring the Reader: Guiding Automated Story Generation with Commonsense Reasoning | Xiangyu Peng, Siyan Li, Sarah Wiegreffe and Mark Riedl |
Guiding Neural Story Generation with Reader Models | Xiangyu Peng, Kaige Xie, Amal Alabdulkarim, Harshith Kayam, Samihan Dani and Mark Riedl |
Temporal Prompts for Conditional Text Generation | Shuyang Cao and Lu Wang |
A Framework for Automatic Generation of Spoken Question-Answering Data | Merve Ünlü Menevşe, Yusufcan Manav, Ebru Arisoy and Arzucan Özgür |
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis | Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael S. Saxon, Lei Li and William Yang Wang |
WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation | Alisa Liu, Swabha Swayamdipta, Noah A. Smith and Yejin Choi |
Plug-and-Play Recipe Generation with Content Planning | Yinhong Liu, Yixuan Su, Ehsan Shareghi and Nigel Collier |
Important Dates
- December 7, 2022: Workshop date
Organization
- Antoine Bosselut (EPFL)
- Khyathi Chandu (Carnegie Mellon University)
- Kaustubh Dhole (Emory University)
- Varun Gangal (Carnegie Mellon University)
- Sebastian Gehrmann (Google Research)
- Yacine Jernite (Hugging Face)
- Jekaterina Novikova (NoOverfitting Lab)
- Laura Perez-Beltrachini (University of Edinburgh)
Steering Committee
- Wei Xu (Georgia Tech)
- Esin Durmus (Stanford University)
- Samira Shaikh (UNC Charlotte)