The third edition of the Generation, Evaluation & Metrics (GEM) Workshop will be held as part of EMNLP, 📅 December 6, 2023.
Overview
Many new NLP applications are cast through the lens of natural language generation. With the advent of these new approaches, many opportunities arise: generation in previously less studied languages, new evaluation paradigms, methods for corpus creation, more efficient architectures, strategies for safe deployments, among many others. At the same time, we can learn from the rich history of NLG research to further improve generation methods. These developments require robust and sound NLG evaluation processes. To that end, the GEM workshop aims to encourage the development of model auditing and human evaluation strategies, and to popularize model evaluations in languages beyond English.
If you are interested, you can check out the previous workshop websites from ACL 2021 and EMNLP 2022. Our call for this workshop can be found here.
Schedule
All times are in local Singapore time; please use a converter like this one if you are in a different time zone.
Start | End | Session |
---|---|---|
9:00 | 10:30 | Opening Remarks + 6 talks (10 + 2 minutes each) |
10:30 | 11:00 | Coffee Break |
11:00 | 12:30 | Poster Session I |
12:30 | 14:00 | Lunch Break |
14:00 | 15:30 | 7 talks (10 + 2 minutes each) |
15:30 | 16:00 | Coffee Break |
16:00 | 17:30 | Poster Session II |
Sessions and Papers
Talk Session I (9:00-10:30am)
ID | Type | Title | Authors |
---|---|---|---|
271 | Findings | Vector-Quantized Prompt Learning for Paraphrase Generation | Haotian Luo, Yixin Liu, Peidong Liu, Xianggen Liu |
1154 | Findings | Can Large Language Models Fix Data Annotation Errors? An Empirical Study Using Debatepedia for Query-Focused Text Summarization | Md Tahmid Rahman Laskar, Mizanur Rahman, Israt Jahan, Enamul Hoque, Jimmy Huang |
1562 | Findings | Geographical Erasure in Language Generation | Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cedric Archambeau, Danish Pruthi |
1834 | Findings | A Comprehensive Evaluation of Tool-Assisted Generation Strategies | Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva |
5166 | Findings | “Kelly is a Warm Person, Joseph is a Role Model”: Gender Biases in LLM-Generated Reference Letters | Yixin Wan, George Pu, Jiao Sun, Aparna Garimella, Kai-Wei Chang, Nanyun Peng |
11 | Main Track | FACTSCORE: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation | Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer and Hannaneh Hajishirzi |
Poster Session I (11:00-12:30pm)
ID | Type | Title | Authors |
---|---|---|---|
3 | Main Track | Contextualizing the Limits of Model & Evaluation Dataset Curation on Semantic Similarity Classification Tasks | Daniel Theron |
4 | Main Track | Dialogue Quality and Emotion Annotations for Customer Support Conversations | John Mendonca, Patrícia Pereira, Miguel Menezes, Vera Cabarrão, Ana C Farinha, Helena Moniz, Alon Lavie and Isabel Trancoso |
7 | Main Track | Formalizing content creation and evaluation methods for AI-generated social media content | Christian Jensen and Axel Højmark |
9 | Main Track | Automatic Evaluation of Generative Models with Instruction Tuning | Shuhaib Mehri and Vered Shwartz |
12 | Main Track | Effective Proxy for Human Labeling: Ensemble Disagreement Scores in Large Language Models for Industrial NLP | Wei Du, Laksh Advani, Yashmeet Gambhir, Daniel Perry, Prashant Shiralkar, Zhengzheng Xing and Aaron Colak |
14 | Main Track | Automatic Reflection Generation for Peer-to-Peer Counseling | Emma O'Neil, João Sedoc, Diyi Yang, Haiyi Zhu and Lyle Ungar |
16 | Main Track | One-Shot and Few-Shot Exemplification Modeling | John Harvill, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson and Chang Yoo |
21 | Main Track | QAMPARI: A Benchmark for Open-domain Questions with Many Answers | Samuel Amouyal, Tomer Wolfson, Ohad Rubin, Ori Yoran, Jonathan Herzig and Jonathan Berant |
23 | Main Track | Unveiling Safety Vulnerabilities of Large Language Models | George Kour, Marcel Zalmanovici, Naama Zwerdling, Esther Goldbraich, Ora Nova Fandina, Ateret Anaby Tavor, Orna Raz and Eitan Farchi |
24 | Main Track | Adapting Pre-trained Generative Models for Extractive Question Answering | Prabir Mallick, Tapas Nayak and Indrajit Bhattacharya |
25 | Main Track | Predicting Question-Answering Performance of Large Language Models through Semantic Consistency | Ella Rabinovich, Samuel Ackerman, Orna Raz, Eitan Farchi and Ateret Anaby Tavor |
28 | Main Track | Towards Effective Long-Form QA with Evidence Augmentation | Mengxia Yu, Sara Rosenthal, Mihaela Bornea and Avi Sil |
30 | Main Track | Harnessing the Plug-and-Play Controller by Prompting | Hao Wang and Lei Sha |
32 | Main Track | Context and Literacy Aware Learnable Metric for Text Simplification | Jeongwon Kwak, Hyeryun Park, Kyungmo Kim and Jinwook Choi |
33 | Main Track | Synthetic Dialogue Dataset Generation using LLM Agents | Yelaman Abdullin, Diego Molla, Bahadorreza Ofoghi, John Yearwood and Qingyang Li |
34 | Main Track | An Empirical Bayes Framework for Open-Domain Dialogue Generation | Jing Yang Lee, Kong Aik Lee and Woon Seng Gan |
38 | Main Track | ChatGPT as a Java Decompiler | Bradley McDanel and Zhanhao Liu |
43 | Main Track | Targeted Image Data Augmentation Increases Basic Skills Captioning Robustness | Valentin Barriere, Felipe del Rio, Andres Carvallo, Carlos Aspillaga, Eugenio Herrera-Berg and Cristian Buc |
45 | Main Track | Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses | Xenia Ohmer, Elia Bruni and Dieuwke Hupkes |
46 | Main Track | Text Encoders Lack Knowledge: Leveraging Generative LLMs for Domain-Specific Semantic Textual Similarity | Joseph Gatto, Omar Sharif, Parker Seegmiller, Philip Bohlman and Sarah Masud Preum |
51 | Main Track | To Burst or Not to Burst: Generating and Quantifying Improbable Text | Kuleen Sasse, Efsun Sarioglu Kayi, Samuel Barham and Edward Staley |
52 | Main Track | Are Large Language Models Reliable Judges? A Study on the Factuality Evaluation Capabilities of LLMs | Xue-Yong Fu, Md Tahmid Rahman Laskar, Cheng Chen and Shashi Bhushan TN |
54 | Main Track | RankAug: Augmented data ranking for text classification | Tiasa Singha Roy and Priyam Basu |
67 | Main Track | Post Turing: Mapping the landscape of LLM Evaluation | Alexey Tikhonov and Ivan Yamshchikov |
62 | Main Track | PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits | Ehsan Lotfi, Maxime De Bruyn, Jeska Buhmann and Walter Daelemans |
63 | Main Track | How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction | MohanRaj Chanthran, Lay-Ki Soon, Ong Huey Fang and Bhawani Selvaretnam |
57 | Extended Abstract | Robust Tooling and New Resources for Large Language Model Evaluation via Catwalk | Kyle Richardson, Ian Magnusson, Oyvind Tafjord, Akshita Bhagia, Iz Beltagy, Arman Cohan, Pradeep Dasigi, Jesse Dodge, Dirk Groeneveld, Yuling Gu, Ananya Harsh Jha, Tushar Khot and Nishant Subramani |
58 | Extended Abstract | GUMSum: Multi-Genre Data and Evaluation for English Abstractive Summarization | Yang Janet Liu and Amir Zeldes |
60 | Extended Abstract | NewsMet: A ‘Do It All’ dataset of contemporary Metaphors in News headlines | Rohan Joseph, Timothy Liu, Aik Beng Ng, Simon See and Sunny Rai |
20 | Extended Abstract | On the State of German (Abstractive) Text Summarization | Dennis Aumiller, Jing Fan and Michael Gertz |
31 | Extended Abstract | Measuring misogyny in natural language generation: preliminary results from a case study on two Reddit communities | Aaron Snoswell, Lucinda Nelson, Hao Xue, Flora Salim, Nicolas Suzor and Jean Burgess |
35 | Extended Abstract | On the Learnability of Watermarks for Language Models | Chenchen Gu, Xiang Lisa Li, Percy Liang and Tatsunori Hashimoto |
47 | Extended Abstract | Does Writing with Language Models Reduce Content Diversity? | Vishakh Padmakumar and He He |
Talk Session II (14:00-15:30)
ID | Type | Title | Authors |
---|---|---|---|
36 | Main Track | Flesch or Fumble? Evaluating Readability Standard Alignment of Instruction-Tuned Language Models | Joseph Marvin Imperial and Harish Tayyar Madabushi |
41 | Main Track | Multi-domain Summarization from Leaderboards to Practice: Re-examining Automatic and Human Evaluation | David Demeter, Oshin Agarwal, Simon Ben Igeri, Marko Sterbentz, Neil Molino, John Conroy and Ani Nenkova |
56 | Main Track | Elo Uncovered: Robustness and Best Practices in Language Model Evaluation | Meriem Boubdir, Edward Kim, Beyza Ermis, Sara Hooker and Marzieh Fadaee |
39 | Extended Abstract | Generative language models exhibit social identity biases | Tiancheng Hu, Yara Kyrychenko, Jon Roozenbeek and Nigel Collier |
70 | Industry Track | A Simple yet Efficient Ensemble Approach for AI-generated Text Detection | Harika Abburi, Kalyani Roy, Michael Suesserman, Nirmala Pudota, Balaji Veeramani, Edward Bowen and Sanmitra Bhattacharya |
17 | Industry Track | Leveraging Large Language Models for Enhanced Product Descriptions in eCommerce | Jianghong Zhou, Bo Liu, Jhalak Nilesh Acharya, Yao Hong, Kuang-chih Lee and Musen Wen |
55 | Industry Track | Separating the Wheat from the Chaff with BREAD: An open-source benchmark and metrics to detect redundancy in text | Isaac Caswell, Lisa Wang and Isabel Papadimitriou |
Poster Session II (16:00-17:30)
ID | Type | Title | Authors |
---|---|---|---|
223 | Findings | MacLaSa: Multi-Aspect Controllable Text Generation via Efficient Sampling from Compact Latent Space | Hanxing Ding, Liang Pang, Zihao Wei, Huawei Shen, Xueqi Cheng, Tat-Seng Chua |
300 | Findings | DeltaScore: Story Evaluation with Perturbations | Zhuohan Xie, Miao Li, Trevor Cohn, Jey Han Lau |
469 | Findings | Show, Write, and Retrieve: Entity-aware Article Generation and Retrieval | Zhongping Zhang, Yiwen Gu, Bryan A. Plummer |
575 | Findings | Adversarial Text Generation by Search and Learning | Guoyi Li, Bingkang Shi, Zongzhen Liu, Dehan Kong, Yulei Wu, Xiaodan Zhang, Longtao Huang, Honglei Lyu |
651 | Findings | On Uncertainty Calibration and Selective Generation in Probabilistic Neural Summarization: A Benchmark Study | Polina Zablotskaia, Du Phan, Joshua Maynez, Shashi Narayan, Jie Ren, Jeremiah Zhe Liu |
731 | Findings | GROVE: A Retrieval-augmented Complex Story Generation Framework with A Forest of Evidence | Zhihua Wen, Zhiliang Tian, Wei Wu, Yuxin Yang, Yanqi Shi, Zhen Huang, Dongsheng Li |
963 | Findings | A Confederacy of Models: a Comprehensive Evaluation of LLMs on Creative Writing | Carlos Gómez-Rodríguez, Paul Williams |
1470 | Findings | Uniform Complexity for Text Generation | Joseph Marvin Imperial, Harish Tayyar Madabushi |
1548 | Findings | Unraveling Downstream Gender Bias from Large Language Models: A Study on AI Educational Writing Assistance | Thiemo Wambsganss, Xiaotian Su, Vinitra Swamy, Seyed Parsa Neshaei, Roman Rietsche, Tanja Käser |
1807 | Findings | Miracle: Towards Personalized Dialogue Generation with Latent-Space Multiple Personal Attribute Control | Zhenyi Lu, Wei Wei, Xiaoye Qu, Xian-Ling Mao, Dangyang Chen, Jixiong Chen |
1897 | Findings | Stylized Dialogue Generation with Feature-Guided Knowledge Augmentation | Jinpeng Li, Zekai Zhang, Xiuying Chen, Dongyan Zhao, Rui Yan |
1992 | Findings | Harnessing the power of LLMs: Evaluating human-AI text co-creation through the lens of news headline generation | Zijian Ding, Alison Smith-Renner, Wenjuan Zhang, Joel R. Tetreault, Alejandro Jaimes |
1993 | Findings | InfoDiffusion: Information Entropy Aware Diffusion Process for Non-Autoregressive Text Generation | Renzhi Wang, Jing Li, Piji Li |
2053 | Findings | The Iron(ic) Melting Pot: Reviewing Human Evaluation in Humour, Irony and Sarcasm Generation | Tyler Loakman, Aaron Maladry, Chenghua Lin |
2490 | Findings | Ask To The Point: Open-Domain Entity-Centric Question Generation | Yuxiang Liu, Jie Huang, Kevin Chang |
2493 | Findings | Frugal Prompting for Dialog Models | Bishal Santra, Sakya Basak, Abhinandan De, Manish Gupta, Pawan Goyal |
2716 | Findings | Towards Informative Open-ended Text Generation with Dynamic Knowledge Triples | Zixuan Ren, Yang Zhao, Chengqing Zong |
2876 | Findings | Harnessing the Power of Large Language Models for Empathetic Response Generation: Empirical Investigations and Improvements | Yushan Qian, Weinan Zhang, Ting Liu |
3010 | Findings | T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics | Yiwei Qin, Weizhe Yuan, Graham Neubig, Pengfei Liu |
3019 | Findings | NLP Evaluation in trouble: On the Need to Measure LLM Data Contamination for each Benchmark | Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, Julen Etxaniz, Oier Lopez de Lacalle, Eneko Agirre |
3386 | Findings | Narrative Order Aware Story Generation via Bidirectional Pretraining Model with Optimal Transport Reward | Zhicong Lu, Li Jin, Guangluan Xu, Linmei Hu, Nayu Liu, Xiaoyu Li, Xian Sun, Zequn Zhang, Kaiwen Wei |
3613 | Findings | Goodtriever: Adaptive Toxicity Mitigation with Retrieval-augmented Models | Luiza Amador Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker |
3726 | Findings | Don’t Add, don’t Miss: Effective Content Preserving Generation from Pre-Selected Text Spans | Aviv Slobodkin, Avi Caciularu, Eran Hirsch, Ido Dagan |
3802 | Findings | Ensemble-Instruct: Instruction Tuning Data Generation with a Heterogeneous Mixture of LMs | Young-Suk Lee, Md Arafat Sultan, Yousef El-Kurdi, Tahira Naseem, Asim Munawar, Radu Florian, Salim Roukos, Ramón Fernandez Astudillo |
4841 | Findings | A Closer Look into Using Large Language Models for Automatic Evaluation | Cheng-Han Chiang, Hung-yi Lee |
4954 | Findings | Pseudointelligence: A Unifying Lens on Language Model Evaluation | Shikhar Murty, Orr Paradise, Pratyusha Sharma |
5156 | Findings | Improving Pacing in Long-Form Story Planning | Yichen Wang, Kevin Yang, Xiaoming Liu, Dan Klein |
5563 | Findings | Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models | Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, Lingpeng Kong |
5603 | Findings | Exploring Context-Aware Evaluation Metrics for Machine Translation | Xinyu Hu, Xunjian Yin, Xiaojun Wan |
Organization
Contact: gem-benchmark-chairs@googlegroups.com
General Chairs
Khyathi Raghavi Chandu (AI2)
Elizabeth Clark (Google DeepMind)
Kaustubh Dhole (Emory University)
Sebastian Gehrmann (Bloomberg)
João Sedoc (NYU)
Alex Wang (Cohere)
Industry Track Chairs
Enrico Santus (Bloomberg)
Hooman Sedghamiz (Bayer)