An interview with Alexander (Sasha) Rush, Associate Professor at Cornell University

Sayak Paul
Jun 26, 2020

For today’s interview, we have Alexander M. Rush with us. Alexander is currently an Associate Professor at Cornell University where his research group studies several areas of NLP such as text generation, document-level understanding, and so on. They also work on open-source developments such as OpenNMT.

Alexander’s research has been groundbreaking, particularly in the area of text generation, and one of his papers is an absolute favorite of mine: A Neural Attention Model for Abstractive Sentence Summarization. Alexander is also with Hugging Face 🤗, helping the company develop state-of-the-art NLP models. They recently published an amazing paper on Movement Pruning that discusses several challenges of using pruning for transfer learning tasks and proposes Movement Pruning as an alternative to magnitude-based pruning, especially for transfer learning regimes.

To learn more about Alexander and his work, you can follow him on Twitter: @srush_nlp.

Sayak: Hi Alexander! Thank you for doing this interview. It’s a pleasure to have you here today.

Alexander: Hey Sayak! You can actually call me Sasha in person.

Sayak: Sure, Sasha. Maybe you could start by introducing yourself — what are your current research interests, the kind of open-source stuff you are focusing on, etc.?

Sasha: Sure. So I am a professor at Cornell University, and I work at the Cornell Tech campus in New York (the campus is on Roosevelt Island across from the UN Building in the East River). I have been a professor for about 5 years, originally at Harvard before moving to Cornell. As you mentioned, I also work part-time at Hugging Face 🤗, after being a fan for several years. There we study how to make more efficient, more robust, and less data-hungry models.

Research-wise my group’s interest is “generative models and text generation” which are two different but related topics. In terms of generative models, we are interested in variational inference for complex discrete latent variables. At a high level, this means tasks like unsupervised parsing or template extraction where we are given raw data and tasked with discovering hidden structure. We believe these approaches are an important missing piece for making deep learning models controllable in applications beyond prediction. One important application for us is conditional text generation, e.g., translation, summarization, and data-to-text generation. The major goal is to build systems that generate text in a structurally transparent way.

I also enjoy building open-source software. These days I’m working on a couple of rather different projects. Most recently I worked on ICLR 2020, and we built a toolkit, Mini-Conf, that let us run the conference remotely. We have been helping other conferences like AKBC and ACL utilize the same software. More related to my research is Torch-Struct, a library for building efficient structured models for NLP. One way to think about this is porting a lot of the key ideas of statistical NLP to run fast in PyTorch. Finally, I have some random interests; one is building tools to make deep learning easier, for instance, Parallax, a torch frontend for Jax, and Named Tensors.

Sayak: Thank you for being so detailed about it, Sasha. Among the open-source tools you mentioned Mini-Conf is probably my favorite. I have been intrigued by the idea of Named Tensors since the day it was presented at PyTorch Dev Con 2019. What motivated you to step into the world of machine learning and specifically natural language processing?

Sasha: My interest was always first in the language aspects, particularly syntax and translation, which are both so difficult. I fell in love with the challenges of these areas as an undergraduate, specifically concepts of formal grammars and automata. While I studied machine learning, even in graduate school, the learning aspect was really just one component of a toolbox of interesting approaches for modeling language and building systems. My dissertation is about complex methods for inference using combinatorial optimization. It wasn’t really until around 2015 that it became clear to me that model fitting was going to be the dominant aspect in the NLP toolbox.

Sayak: Interesting! Just as a side note, Automata was one of my favorite subjects too, back in my undergraduate days. When you were starting out, what kind of challenges did you face? How did you overcome them?

Sasha: So I have certainly faced some challenges — piles of rejections, difficult courses, desires to leave grad school — but as a white American guy who speaks fluent English I’m privileged in the current system. I’ve heard piles of stories about others dealing with sexist colleagues, absent advisors, career-threatening visa issues, or just systemic inequities in the whole way that graduate school is set up. The fact that I just haven’t had to deal with these things when others have been pushed out by them makes me realize I have not really overcome that much, and that there is a lot that needs to be done. It puts things like a few rejections in perspective.

Sayak: I would take this in the spirit of honest acceptance, Sasha. Let’s switch gears to your research now. What’s the typical process that you follow for conducting research? Specifically, is there a systematic way that you use to come up with novel ideas?

Sasha: Idea generation is something that I struggle with. I am not someone with a million novel different things to try each day, and honestly, when I do have an idea, it has a pretty low probability of panning out. My system is to try to foster a group of interesting independent researchers with a culture of deep reading and direct discussion. My role in the process is to act as an editor to try to trim down different suggestions and question why specific choices were made. The goal of the process is to arrive at one specific new idea that can be tested and quantified.

Sayak: That’s quite a challenging role, Sasha. I think having the culture you mentioned is also important for a research group to shine at both the collective and individual levels. My follow-up question would be: how do you typically design the experiments for your research? I understand that it can be extremely specific to what the research is all about. But I still wanted to know if there’s anything generic that you follow.

Sasha: Experiment design, like dataset design, has become an incredibly interesting and important area of modern NLP. It used to be that many experiments were just “did I increase the BLEU score on the dataset”, but now in sub-areas like Model Interpretability, there are these major questions about what an experiment really shows, e.g., does this experiment prove that a model knows X?

One thing I am a fan of is smaller, synthetic experiments to supplement full dataset experiments. For instance, if you are studying whether an architecture learns to model a specific phenomenon, you can make mostly noiseless datasets that express this aspect directly. However, this technique can also fail or be misleading. For instance, pretraining and, in general, large-scale transformer models have shown that certain critical language properties really do only start to emerge at large scales, and seemingly cannot be easily simulated or built into models directly.

Sayak: Performing smaller experiments before going with a full-blown dataset is a part of my day-to-day job and I absolutely love doing it. Could you tell us a bit about which areas of NLP you think will get more focus in the future?

Sasha: In the medium term, I am really interested in model efficiency and energy efficiency. There is also starting to be some early work on adversarial attacks in NLP. Finally, questions of full-document NLP and discourse-level phenomena are also critical. In the long term, we are waiting a bit for the dust to settle on pretraining, and the full extent to which it works. We are starting to see some diminishing returns on scale, and it seems like we need better datasets and tasks to quantify exactly how far these methods get us.

Sayak: Model pre-training is such an important element to consider, not only in NLP but also in Computer Vision. As a practitioner, one thing that I often find myself struggling with is learning a new concept. Would you like to share how you approach that process?

Sasha: One piece of advice I give students is that reading a paper doesn’t mean “reading” the paper. If you are trying to read a paper on a new subject, it should take a whole week. ML/NLP papers are so condensed to fit in 8 pages. Something looks simple in 1 sentence, but that is hiding the fact that there are 5 critical preliminary papers the author needed to read to get there. You really need to print it out, get a coffee, highlight each important line, read those papers, get more coffee, start implementing the method, realize you missed something, highlight that line, get more coffee, get it implemented, get terrible first results, go to sleep. I don’t know if there is a way to rush that process? I mean I still do it that way. I wrote a blog post a couple of years ago, The Annotated Transformer, that was basically me doing that live.

Sayak: I am in agreement here. Even when a full, line-by-line implementation of a paper is out of scope, I try to at least minimally implement it, and that pursuit has always helped me to understand the subject in general. Could we expect a course on NLP from your group anytime soon?

Sasha: I think that there are tons of good NLP courses online these days :). The next course I am teaching is called Machine Learning Engineering. It’s a Master’s course about the systems-level questions of training, tuning, debugging, visualizing, and deploying ML systems (and everything that can go wrong). I do hope to do a specialized tutorial on NLP in the future as well.

Sayak: The course sounds great, Sasha. Any advice for the beginners?

Sasha: Here’s one. Every paper has an incentive to convince you that each thing they did is somehow new, whereas in practice there just are not that many major ideas in the field. A good exercise is to be able to both absorb the low-level specifics that made one particular method work well, while also contextualizing it within other approaches at a high level. A good author will be honest about this context, but many times the reader needs to be actively skeptical to build this up.

Sayak: Thank you so much, Sasha, for doing this interview and for sharing your valuable insights. I hope they will be immensely helpful for the community.

Sasha: This was fun! Thanks for having me.

I hope you enjoyed reading this interview. Watch this space for the next one and I hope to see you soon. This is where you can find all the interviews done so far.

If you want to know more about me, check out my website.