Siva Reddy

gaurav.kamath@mila.quebec

Gaurav Kamath

PhD - McGill

Zdeněk Kasner

PhD

zdenek.kasner@mila.quebec

Amirhossein Kazemnejad

Research Master's - McGill

amirhossein.kazemnejad@mila.quebec

Benno Krojer

PhD - McGill

benno.krojer@mila.quebec

Sarath Chandar Anbil Parthipan

Zichao Li

PhD - McGill

Co-supervisor:

Jackie Cheung

zichao.li@mila.quebec

Xing Han Lu

PhD - McGill

PhD - Polytechnique

Principal supervisor:

andreas.madsen@mila.quebec

PhD - McGill

nicholas.meade@mila.quebec

Research Master's - McGill

Principal supervisor:

aristides.milios@mila.quebec

marius.mosbach@mila.quebec

Marius Mosbach

Postdoc - McGill

Arkil Patel

PhD - McGill

Principal supervisor:

arkil.patel@mila.quebec

karolina.stanczak@mila.quebec

Karolina Ewa Stańczak

Postdoc - McGill

Ada Tur

Research intern - McGill

Ivan Vulić

Independent visiting researcher - Cambridge University

ivan.vulic@mila.quebec

Publications

Faithfulness Measurable Masked Language Models

Andreas Madsen

Sarath Chandar Anbil Parthipan

2024-05-01

ICML.cc/2024/Conference (spotlight)

Scope Ambiguities in Large Language Models

Gaurav Kamath

Sebastian Schuster

Sowmya Vajjala

2024-04-05

ArXiv (preprint)

arxiv.org

WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

Xing Han Lu

Zdeněk Kasner

We propose the problem of conversational web navigation, where a digital agent controls a web browser and follows user instructions to solve real-world tasks in a multi-turn dialogue fashion. To support this problem, we introduce WebLINX, a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. Our benchmark covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios. Due to the magnitude of information present, Large Language Models (LLMs) cannot process entire web pages in real-time. To solve this bottleneck, we design a retrieval-inspired model that efficiently prunes HTML pages by ranking relevant elements. We use the selected elements, along with screenshots and action history, to assess a variety of models for their ability to replicate human behavior when navigating the web. Our experiments span from small text-only to proprietary multimodal LLMs. We find that smaller finetuned decoders surpass not only the best zero-shot LLMs (including GPT-4V) but also larger finetuned multimodal models that were explicitly pretrained on screenshots. However, all finetuned models struggle to generalize to unseen websites. Our findings highlight the need for large multimodal models that can generalize to novel settings. Our code, data and models are available for research: https://mcgill-nlp.github.io/weblinx

2024-03-11

ICLR.cc/2024/Workshop/LLMAgents (poster)
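
The element-pruning idea in the WebLINX abstract above (rank HTML elements by relevance, keep only the top ones) can be illustrated with a minimal sketch. It assumes a bag-of-words cosine score as a stand-in for the paper's learned ranking model, which the abstract does not specify; `tokenize`, `cosine`, `prune_elements`, and the sample HTML strings are hypothetical names and data introduced only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_elements(instruction, elements, k=2):
    """Return the k elements most similar to the instruction, best first.

    Assumption: a simple lexical score replaces the learned ranker.
    """
    query = Counter(tokenize(instruction))
    scored = [(cosine(query, Counter(tokenize(e))), e) for e in elements]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [e for _, e in scored[:k]]


# Hypothetical page content: four candidate elements.
elements = [
    '<button id="search">Search flights</button>',
    '<a href="/about">About us</a>',
    '<input name="destination" placeholder="Flight destination">',
    '<footer>Copyright notice</footer>',
]
top = prune_elements("search for flights to my destination", elements, k=2)
```

Only the elements in `top` would then be passed to the downstream model, which keeps the input short regardless of the page size.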

When does word order matter and when doesn't it?

Xuanda Chen

Timothy John O'Donnell

Language models (LMs) may appear insensitive to word order changes in natural language understanding (NLU) tasks. In this paper, we propose that linguistic redundancy can explain this phenomenon, whereby word order and other linguistic cues such as case markers provide overlapping and thus redundant information. Our hypothesis is that models exhibit insensitivity to word order when the order provides redundant information, and the degree of insensitivity varies across tasks. We quantify how informative word order is using mutual information (MI) between unscrambled and scrambled sentences. Our results show that the less informative word order is, the more consistent the model's predictions are between unscrambled and scrambled sentences. We also find that the effect varies across tasks: for some tasks, like SST-2, LMs' prediction is almost always consistent with the original one even if the Pointwise-MI (PMI) changes, while for others, like RTE, the consistency is near random when the PMI gets lower, i.e., word order is really important.

2024-02-29

ArXiv (preprint)

arxiv.org

Data science opportunities of large language models for neuroscience and biomedicine

Danilo Bzdok

Andrew Thieme

Oleksiy Levkovskyy

Paul Wren

Thomas Ray

2024-02-01

Neuron (published)

StarCoder: may the source be with you!

Raymond Li

Loubna Ben Allal

Yangtian Zi

Niklas Muennighoff

Denis Kocetkov

Chenghao Mou

Marc Marone

Christopher Akiki

Jia Li

Jenny Chim

Qian Liu

Evgenii Zheltonozhskii

Terry Yue Zhuo

Thomas Wang

Olivier Dehaene

Mishig Davaadorj

Joel Lamy-Poirier

Joao Monteiro

Oleh Shliazhko

Nicolas Gontier

Nicholas Meade

Armel Zebaze

Ming-Ho Yee

Logesh Kumar Umapathi

Jian Zhu

Ben Lipkin

Muhtasham Oblokulov

Zhiruo Wang

Rudra Murthy

Jason T Stillerman

Siva Sankalp Patel

Dmitry Abulkhanov

Marco Zocca

Manan Dey

Zhihan Zhang

N. Fahmy

Urvashi Bhattacharyya

Wenhao Yu

Swayam Singh

Sasha Luccioni

Paulo Villegas

Jan Ebert

M. Kunakov

Fedor Zhdanov

Manuel Romero

Tony Lee

Nadav Timor

Jennifer Ding

Claire S Schlesinger

Hailey Schoelkopf

Jana Ebert

Tri Dao

Mayank Mishra

Alex Gu

Jennifer Robinson

Sean Hughes

Carolyn Jane Anderson

Brendan Dolan-Gavitt

Danish Contractor

Daniel Fried

Yacine Jernite

Carlos Muñoz Ferrandis

Sean M. Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

Harm de Vries

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

2023-12-17

TMLR (accepted)

Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

Parishad BehnamGhader

Santiago Miret

Augmenting pretrained language models with retrievers to select the supporting documents has shown promise in effectively solving common NLP problems, including language modeling and question answering, in an interpretable way. In this paper, we first study the strengths and weaknesses of different retriever-augmented language models (REALM,

2023-12-01

Findings of the Association for Computational Linguistics: EMNLP 2023 (published)

Evaluating In-Context Learning of Libraries for Code Generation

Arkil Patel

Pradeep Dasigi

2023-11-16

ArXiv (preprint)

arxiv.org

Using In-Context Learning to Improve Dialogue Safety

Nicholas Meade

Spandana Gella

Devamanyu Hazarika

Prakhar Gupta

Di Jin

Yang Liu

Dilek Hakkani-Tur

2023-10-07

EMNLP/2023/Conference (published)

Are Diffusion Models Vision-And-Language Reasoners?

Benno Krojer

Elinor Poole-Dayan

Vikram Voleti

Chris Pal

Text-conditioned image generation models have recently shown immense qualitative success using denoising diffusion processes. However, unlike discriminative vision-and-language models, it is a non-trivial task to subject these diffusion-based generative models to automatic fine-grained quantitative evaluation of high-level phenomena such as compositionality. Towards this goal, we perform two innovations. First, we transform diffusion-based models (in our case, Stable Diffusion) for any image-text matching (ITM) task using a novel method called DiffusionITM. Second, we introduce the Generative-Discriminative Evaluation Benchmark (GDBench) with 7 complex vision-and-language tasks, bias evaluation and detailed analysis. We find that Stable Diffusion + DiffusionITM is competitive on many tasks and outperforms CLIP on compositional tasks like CLEVR and Winoground. We further boost its compositional performance with a transfer setup by fine-tuning on MS-COCO while retaining generative capabilities. We also measure the stereotypical bias in diffusion models, and find that Stable Diffusion 2.1 is, for the most part, less biased than Stable Diffusion 1.5. Overall, our results point in an exciting direction bringing discriminative and generative model evaluation closer. We will release code and benchmark setup soon.

The Impact of Positional Encoding on Length Generalization in Transformers

Amirhossein Kazemnejad

Inkit Padhi

Karthikeyan Natesan Ramamurthy

Payel Das

Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
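
To make the contrast between NoPE and an explicit scheme concrete, here is a minimal sketch of one causal attention row with and without an ALiBi-style linear bias. It is not the paper's code: the function names and the toy scores are hypothetical, and the slope schedule follows the geometric sequence commonly described for ALiBi with a power-of-two number of heads.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def alibi_slopes(num_heads):
    """Per-head slopes: a geometric sequence starting at 2^(-8/num_heads).

    Assumes num_heads is a power of two.
    """
    start = 2 ** (-8 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def causal_attention_row(scores, query_pos, slope=0.0):
    """Attention weights of one query over keys 0..query_pos.

    slope=0.0 corresponds to NoPE: only the causal mask is applied.
    slope>0 subtracts slope * (query_pos - key_pos), an ALiBi-style bias.
    """
    biased = [scores[j] - slope * (query_pos - j) for j in range(query_pos + 1)]
    return softmax(biased)


# Toy content scores: all keys look equally relevant to the query.
scores = [0.0, 0.0, 0.0, 0.0]
nope = causal_attention_row(scores, query_pos=3)
alibi = causal_attention_row(scores, query_pos=3, slope=alibi_slopes(8)[0])
```

With identical content scores, the NoPE row is uniform over the visible keys, while the biased row puts more weight on recent keys. Any positional signal in a NoPE model therefore has to come from the causal mask and the learned weights rather than from an added term.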