Soumith Chintala

Decisions and Pivots

on PyTorch

a tweet-thread at the 5-year mark

January 19, 2022

It’s been 5 years since we launched PyTorch, and it’s gotten much bigger than we expected, in usage, contributors and funding. We’re blessed with success, though we haven’t been perfect. We talk about our impact and positive stuff all the time. Thread about some of the interesting/hard decisions and pivots we’ve had to make 👇

  • In the first year, we started small, focused on researchers — and we stayed focused and did it well. We had a good product, a strong support structure that we carefully curated, and a Twitter account that we set on fire. It worked a little too well :)
  • We were extremely blessed to have folks like @ptrblck, @ezyang, @t-vi join early on, along with countless others who joined the PyTorch party and have completely changed the way we build things!
  • I remember that, at the time, we still had hope that we could get our open GitHub issues down to 0. We were fixing and closing issues as rapidly as we could. When we hit the 150th open issue, I remember sitting with Richard Zou and Simon Wang, talking about how, if we worked faster and smarter, we could get back to Inbox Zero. Oh, how naive we were!
  • After the 0.3 release, we knew that, at the rate hardware was getting faster, we absolutely needed a compiler to drive that hardware optimally. Fun fact: @colesbury (who built the Python nogil work) opposed that view :-)
  • Building an ML compiler was and is a research problem, for two reasons:
    • We didn’t know how to codegen efficient code for dynamic shapes
    • We didn’t know how to slice Python in the right way to make it small and strongly typed, yet still offer all the flexibility that users expect from Python / PyTorch
  • So, we bet on TorchScript (more specifically, jit.script). This has been a rough ride, because Python is a large language and people like using most or all of it. That wasn’t obvious then, though it is somewhat obvious in retrospect. We’re unbundling it.
  • We could’ve made TorchScript more appealing if we had focused it on performance and shown 10x better perf than eager mode – a Numba-like approach: a limited but powerful subset. We tried that minimally by optimizing small RNNs, but the world had moved on by the time we got there. That would’ve given people strong incentives to port to TorchScript. Instead, we didn’t focus much on performance, and focused on exporting PyTorch programs to C++ for productionisation. Here’s why:
  • We were bold and competent, but very underfunded. For example, the first version of PyTorch-Distributed was built by Adam Paszke and three of his friends as an undergrad class project. Additionally, the biggest criticism (and demand) at that time was that PyTorch was a toy, not ready for production.
  • So, around the time we were building our compiler, we got the opportunity to merge with the Caffe2 project, gaining a significantly larger team and much more sustained, growing funding. We took it, and I made the call. This is also where we baked in the “commits have to be landed by FB engineers” requirement, which was a huge trade-off. I made the call knowing that the downside was increased friction in open source for a few years, until we could streamline that aspect. Life is not perfect.
  • For about two years, we pivoted our compiler stack to handle production, while silently watching XLA and TVM breeze past. We also had to integrate the two teams (PyTorch and Caffe2) – more of a social-engineering problem than a technical one. It took time – from 2018 to 2020.
  • We also cleaned up our internals, which were a bubble-gum-wrapped house of cards. That enabled us to build PyTorch Mobile, let hardware vendors build out-of-tree backends, helped start our strong collaboration with Google on TPU support, and enabled many other ongoing projects (fx, functorch, dynamo, dispatch, lazy).
  • Also, our CPU perf was horrendous (though researchers didn’t notice), and prod workloads cared a lot about CPU – we fixed it.
  • All of this had a massive impact: libtorch enabled us to enter many new markets – self-driving, mobile apps, recommendation systems. We are running in prod at many of the top companies across the world.
  • Our super-strong emphasis on backward compatibility is universally appreciated to this day, and I’m proud of that.
  • Anyways, this big production push didn’t help researchers in a significant way, so many jokingly say that not much has changed since PyTorch 0.4 – and in a simplistic, squinty view, they are somewhat right.
  • While we have seen massive success in research, it is carried largely by our core initial product plus strong backward-compatibility guarantees.
  • With the sustained and additional funding from the production pivot, we made great improvements in distributed training and performance, and added complex-number support and various new layers and functions (some of them with bad design, sorry!). This took a massive amount of time and effort, and it does make people’s day-to-day usage easier and better.
  • But what researchers mean when they say “not much has changed” is that there hasn’t been a step-change in their day-to-day user-experience.
  • Product exploration and product design are generally handled the same way research is – we explore, and we exploit what we find. With the various threads in PyTorch coming together, I am pretty bullish that, as of today, we have the right infrastructure, the right set of leaders and maintainers, and the right attitude and priorities for PyTorch to lead significant product disruption again (unless all our bets are uniformly bad by luck). We are blessed to have the trust of the research community, even though I made certain calls that were not ideal from their perspective (at the time).
  • I am really excited to see prototypes like dynamo, lazy, fx, functorch, nvfuser, etc., and I’m pretty confident that they will consolidate into a disruptive next set of experiences. I’m proud that we built prototypes like named-tensor, even though they didn’t work out.

Other posts

Previous: Growing open-source