Valeriia Cherepanova’s Post

Valeriia Cherepanova, Postdoctoral Scientist at Amazon AWS Responsible AI

How do language models comprehend gibberish inputs? Our recent work with James Zou examines the mechanisms by which LLMs can be manipulated into responding with coherent target text to seemingly gibberish inputs. Paper: https://lnkd.in/gA9Mjqc4

A few takeaways:

• We show the prevalence of nonsensical prompts that induce LLMs to generate specific, coherent responses, a phenomenon we call LM Babel.
• Examining the structure of Babel prompts, we find that despite their high perplexity, these prompts often contain nontrivial trigger tokens, maintain lower entropy than random token strings, and cluster together in the model's representation space (see the sketch below for the two metrics involved).
• The effectiveness of these prompts depends largely on the prompt length as well as the target text's length and perplexity.
• Reproducing harmful texts with aligned models is not only feasible but in some cases easier than reproducing benign texts, while fine-tuning language models to forget specific information makes it harder to direct them toward the unlearned content.
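For readers who want to play with the two diagnostics mentioned above, here is a minimal sketch (not the paper's code) of how one might compute a prompt's perplexity and its mean next-token entropy under a causal LM. The model name ("gpt2") and the gibberish prompt are placeholder assumptions; only the standard Hugging Face API is used.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: a small stand-in for the models studied
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def prompt_stats(text: str):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab_size)
    # Perplexity: exp of the mean negative log-likelihood of each token
    # given its prefix (align predictions at position t with token t+1).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    nll = -log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    ppl = nll.mean().exp().item()
    # Entropy: average Shannon entropy (in nats) of the model's next-token
    # distribution; Babel prompts score lower here than random token strings.
    probs = log_probs.exp()
    entropy = -(probs * log_probs).sum(-1).mean().item()
    return ppl, entropy

# Hypothetical gibberish prompt, purely for illustration.
ppl, ent = prompt_stats("fjord quix blorp zented mak")
print(f"perplexity={ppl:.1f}, mean next-token entropy={ent:.2f} nats")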
