GEM 💎 Workshop at EMNLP 2023

The third edition of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of EMNLP 2023, 📅 December 6, 2023.

Overview

Many new NLP applications are framed as natural language generation tasks. These new approaches open up many opportunities: generation in previously understudied languages, new evaluation paradigms, methods for corpus creation, more efficient architectures, and strategies for safe deployment, among others. At the same time, we can draw on the rich history of NLG research to further improve generation methods. These developments require robust and sound NLG evaluation processes. To that end, the GEM workshop aims to encourage the development of model auditing and human evaluation strategies, and to popularize model evaluation in languages beyond English.

If you are interested, you can check out the previous workshops' websites from ACL 2021 and EMNLP 2022. Our call for papers can be found here.

Schedule

All times are in local Singapore time; please use a converter like this one if you are in a different time zone.

Start End Session
9:00 10:30 Opening Remarks + 6 x 10+2-minute talks
10:30 11:00 Coffee Break
11:00 12:30 Poster Session I
12:30 14:00 Lunch Break
14:00 15:30 7 x 10+2-minute talks
15:30 16:00 Coffee Break
16:00 17:30 Poster Session II

Sessions and Papers

Talk Session I (9:00-10:30)

ID Type Title Authors
271 Findings Vector-Quantized Prompt Learning for Paraphrase Generation Haotian Luo, Yixin Liu, Peidong Liu, Xianggen Liu
1154 Findings Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang
1562 Findings Geographical Erasure in Language Generation Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cedric Archambeau, Danish Pruthi
1834 Findings A Comprehensive Evaluation of Tool-Assisted Generation Strategies Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva
5166 Findings “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng
11 Main Track FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer and Hannaneh Hajishirzi

Poster Session I (11:00-12:30)

ID Type Title Authors
3 Main Track Contextualizing the Limits of Model & Evaluation Dataset Curation on Semantic Similarity Classification Tasks Daniel Theron
4 Main Track Dialogue Quality and Emotion Annotations for Customer Support Conversations John Mendonca, Patrícia Pereira, Miguel Menezes, Vera Cabarrão, Ana C Farinha, Helena Moniz, Alon Lavie and Isabel Trancoso
7 Main Track Formalizing content creation and evaluation methods for AI-generated social media content Christian Jensen and Axel Højmark
9 Main Track Automatic Evaluation of Generative Models with Instruction Tuning Shuhaib Mehri and Vered Shwartz
12 Main Track Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel Perry, Prashant Shiralkar, Zhengzheng Xing and Aaron Colak
14 Main Track Automatic Reflection Generation for Peer-to-Peer Counseling Emma O'Neil, João Sedoc, Diyi Yang, Haiyi Zhu and Lyle Ungar
16 Main Track One-Shot and Few-Shot Exemplification Modeling John Harvill, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson and Chang Yoo
21 Main Track QAMPARI: A Benchmark for Open-domain Questions with Many Answers Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig and Jonathan Berant
23 Main Track Unveiling Safety Vulnerabilities of Large Language Models George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby Tavor, Orna Raz and Eitan Farchi
24 Main Track Adapting Pre-trained Generative Models for Extractive Question Answering Prabir Mallick, Tapas Nayak and Indrajit Bhattacharya
25 Main Track Predicting Question-Answering Performance of Large Language Models through Semantic Consistency Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi and Ateret Anaby Tavor
28 Main Track Towards Effective Long-Form QA with Evidence Augmentation Mengxia Yu, Sara Rosenthal, Mihaela Bornea and Avi Sil
30 Main Track Harnessing the Plug-and-Play Controller by Prompting Hao Wang and Lei Sha
32 Main Track Context and Literacy Aware Learnable Metric for Text Simplification Jeongwon Kwak, Hyeryun Park, Kyungmo Kim and Jinwook Choi
33 Main Track Synthetic Dialogue Dataset Generation using LLM Agents Yelaman Abdullin, Diego Molla, Bahadorreza Ofoghi, John Yearwood and Qingyang Li
34 Main Track An Empirical Bayes Framework for Open-Domain Dialogue Generation Jing Yang Lee, Kong Aik Lee and Woon Seng Gan
38 Main Track ChatGPT as a Java Decompiler Bradley McDanel and Zhanhao Liu
43 Main Track Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness Valentin Barriere, Felipe del Rio, Andres Carvallo, Carlos Aspillaga, Eugenio Herrera-Berg and Cristian Buc
45 Main Track Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses Xenia Ohmer, Elia Bruni and Dieuwke Hupkes
46 Main Track Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity Joseph Gatto, Omar Sharif, Parker Seegmiller, Philip Bohlman and Sarah Masud Preum
51 Main Track To Burst or Not to Burst: Generating and Quantifying Improbable Text Kuleen Sasse, Efsun Sarioglu Kayi, Samuel Barham and Edward Staley
52 Main Track Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen and Shashi Bhushan TN
54 Main Track RankAug: Augmented data ranking for text classification Tiasa Singha Roy and Priyam Basu
67 Main Track Post Turing: Mapping the landscape of LLM Evaluation Alexey Tikhonov and Ivan Yamshchikov
62 Main Track PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann and Walter Daelemans
63 Main Track How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction MohanRaj Chanthran, Lay-Ki Soon, Ong Huey Fang and Bhawani Selvaretnam
57 Extended Abstract Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk Kyle Richardson, Ian Magnusson, Oyvind Tafjord, Akshita Bhagia, Iz Beltagy, Arman Cohan, Pradeep Dasigi, Jesse Dodge, Dirk Groeneveld, Yuling Gu, Ananya Harsh Jha, Tushar Khot and Nishant Subramani
58 Extended Abstract GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization Yang Janet Liu and Amir Zeldes
60 Extended Abstract NewsMet: A ‘Do It All’ dataset of contemporary Metaphors in News headlines Rohan Joseph, Timothy Liu, Aik Beng Ng, Simon See and Sunny Rai
20 Extended Abstract On the State of German (Abstractive) Text Summarization Dennis Aumiller, Jing Fan and Michael Gertz
31 Extended Abstract Measuring misogyny in natural language generation: preliminary results from a case study on two Reddit communities Aaron Snoswell, Lucinda Nelson, Hao Xue, Flora Salim, Nicolas Suzor and Jean Burgess
35 Extended Abstract On the Learnability of Watermarks for Language Models Chenchen Gu, Xiang Lisa Li, Percy Liang and Tatsunori Hashimoto
47 Extended Abstract Does Writing with Language Models Reduce Content Diversity? Vishakh Padmakumar and He He

Talk Session II (14:00-15:30)

ID Type Title Authors
36 Main Track Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models Joseph Marvin Imperial and Harish Tayyar Madabushi
41 Main Track Multi-domain Summarization from Leaderboards to Practice: Re-examining Automatic and Human Evaluation David Demeter, Oshin Agarwal, Simon Ben Igeri, Marko Sterbentz, Neil Molino, John Conroy and Ani Nenkova
56 Main Track Elo Uncovered: Robustness and Best Practices in Language Model Evaluation Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker and Marzieh Fadaee
39 Extended Abstract Generative language models exhibit social identity biases Tiancheng Hu, Yara Kyrychenko, Jon Roozenbeek and Nigel Collier
70 Industry Track A Simple yet Efficient Ensemble Approach for AI-generated Text Detection Harika Abburi, Kalyani Roy, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen and Sanmitra Bhattacharya
17 Industry Track Leveraging Large Language Models for Enhanced Product Descriptions in eCommerce Jianghong Zhou, Bo Liu, Jhalak Nilesh Acharya, Yao Hong, Kuang-chih Lee and Musen Wen
55 Industry Track Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text Isaac Caswell, Lisa Wang and Isabel Papadimitriou

Poster Session II (16:00-17:30)

ID Type Title Authors
223 Findings MacLaSa: Multi-Aspect Controllable Text Generation via Efficient Sampling from Compact Latent Space Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng, Tat-Seng Chua
300 Findings DeltaScore: Story Evaluation with Perturbations Zhuohan Xie, Miao Li, Trevor Cohn, Jey Han Lau
469 Findings Show, Write, and Retrieve: Entity-aware Article Generation and Retrieval Zhongping Zhang, Yiwen Gu, Bryan A. Plummer
575 Findings Adversarial Text Generation by Search and Learning Guoyi Li, Bingkang Shi, Zongzhen Liu, Dehan Kong, Yulei Wu, Xiaodan Zhang, Longtao Huang, Honglei Lyu
651 Findings On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study Polina Zablotskaia, Du Phan, Joshua Maynez, Shashi Narayan, Jie Ren, Jeremiah Zhe Liu
731 Findings GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence Zhihua Wen, Zhiliang Tian, Wei Wu, Yuxin Yang, Yanqi Shi, Zhen Huang, Dongsheng Li
963 Findings A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing Carlos Gómez-Rodríguez, Paul Williams
1470 Findings Uniform Complexity for Text Generation Joseph Marvin Imperial, Harish Tayyar Madabushi
1548 Findings Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, Tanja Käser
1807 Findings Miracle: Towards Personalized Dialogue Generation with Latent-Space Multiple Personal Attribute Control Zhenyi Lu, Wei Wei, Xiaoye Qu, Xian-Ling Mao, Dangyang Chen, Jixiong Chen
1897 Findings Stylized Dialogue Generation with Feature-Guided Knowledge Augmentation Jinpeng Li, Zekai Zhang, Xiuying Chen, Dongyan Zhao, Rui Yan
1992 Findings Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation Zijian Ding, Alison Smith-Renner, Wenjuan Zhang, Joel R. Tetreault, Alejandro Jaimes
1993 Findings InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation Renzhi Wang, Jing Li, Piji Li
2053 Findings The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation Tyler Loakman, Aaron Maladry, Chenghua Lin
2490 Findings Ask To The Point: Open-Domain Entity-Centric Question Generation Yuxiang Liu, Jie Huang, Kevin Chang
2493 Findings Frugal Prompting for Dialog Models Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal
2716 Findings Towards Informative Open-ended Text Generation with Dynamic Knowledge Triples Zixuan Ren, Yang Zhao, Chengqing Zong
2876 Findings Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements Yushan Qian, Weinan Zhang, Ting Liu
3010 Findings T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics Yiwei Qin, Weizhe Yuan, Graham Neubig, Pengfei Liu
3019 Findings NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre
3386 Findings Narrative Order Aware Story Generation via Bidirectional Pretraining Model with Optimal Transport Reward Zhicong Lu, Li Jin, Guangluan Xu, Linmei Hu, Nayu Liu, Xiaoyu Li, Xian Sun, Zequn Zhang, Kaiwen Wei
3613 Findings Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models Luiza Amador Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker
3726 Findings Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan
3802 Findings Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo
4841 Findings A Closer Look into Using Large Language Models for Automatic Evaluation Cheng-Han Chiang, Hung-yi Lee
4954 Findings Pseudointelligence: A Unifying Lens on Language Model Evaluation Shikhar Murty, Orr Paradise, Pratyusha Sharma
5156 Findings Improving Pacing in Long-Form Story Planning Yichen Wang, Kevin Yang, Xiaoming Liu, Dan Klein
5563 Findings Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong
5603 Findings Exploring Context-Aware Evaluation Metrics for Machine Translation Xinyu Hu, Xunjian Yin, Xiaojun Wan

Organization

Contact: gem-benchmark-chairs@googlegroups.com

General Chairs

Khyathi Raghavi Chandu (AI2)

Elizabeth Clark (Google DeepMind)

Kaustubh Dhole (Emory University)

Sebastian Gehrmann (Bloomberg)

João Sedoc (NYU)

Alex Wang (Cohere)

Industry Track Chairs

Enrico Santus (Bloomberg)

Hooman Sedghamiz (Bayer)