The field of natural language processing has recently unlocked a wide range of new capabilities through the use of large language models, such as GPT-4. The growing application of these models motivates developing a more thorough understanding of how and why they work, as well as further improvements in both quality and efficiency.
In this talk, I will present my work on analyzing and improving the Transformer architecture underlying today’s language models through the study of how information is routed between multiple words in an input. I will show that such models can predict the syntactic structure of text in a variety of languages, and discuss how syntax can inform our understanding of how the networks operate. I will also present my work on structuring information flow to build radically more efficient models, including models that can process text of up to one million words, which enables new possibilities for NLP with book-length text.