Talks and presentations

Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities

January 20, 2025

Talk, COLING 2025 Workshop W22: Writing Aids at the Crossroads of AI, Cognitive Science and NLP, Abu Dhabi National Exhibition Centre (ADNEC), Abu Dhabi, UAE

In this paper, we introduce the concept of Semantic Masking, where semantically coherent surrounding text (the haystack) interferes with the retrieval and comprehension of specific information (the needle) embedded within it. We propose the Needle-in-a-Haystack-QA Test, an evaluation pipeline that assesses LLMs’ long-text capabilities through question answering, explicitly accounting for the Semantic Masking effect. Our experiments demonstrate that Semantic Masking affects LLM performance significantly more than text length does. By accounting for Semantic Masking, we provide a more accurate assessment of LLMs’ true proficiency in utilizing extended contexts, paving the way for future research to develop models that are not only capable of handling longer inputs but are also adept at navigating complex semantic landscapes.

Voice assistants in vehicles: A case study in mixing traditional linguistic knowledge representations with neural language models

December 02, 2022

Talk, University of Toronto, Department of Computer Science, Odette Hall, Toronto, ON, CA

Voice assistants in vehicles have been a popular application of dialogue systems, and many different approaches have been taken to this task. This talk will briefly present an evaluation of three models on a dialogue classification task, using an unseen but in-domain corpus of user queries: a domain-specific model based upon typed feature structures, a neural language model, and a mixture of the two. The finding opens the door to a potentially new application of neural language models. The study has changed our perspective on the potential role of structured representations in the future of dialogue systems, and suggests that formal research in this area may have a new role to play in validating and coordinating ad hoc dialogue system development.