Note: I didn’t cover some of this talk outline at JuliaCon and improvised on other parts, so the video doesn’t match or cover everything written here.
As I’ve been told, the Julia community invites keynote speakers from adjacent open-source communities in order to learn from them. So, in my keynote, I tried to trace a small thread of my journey through Torch and PyTorch.
Most open-source projects don’t just get started with “we need to have 10k users”. That just doesn’t make sense, and the journey through open-source is a lot more organic.
What we start with – especially in open-source – is doing something of personal interest. Often, we try to intersect it with something of broader interest or need. The market-making here is a bit of a self-fulfilling prophecy: ideas and projects mostly grow organically only if they interest many people who want to commit time together. There are rare exceptions, like a mega-corp launching a top-down product with a huge marketing plan, but that’s not really what I want to talk about here.
Most small open-source projects, after enough effort and involvement, start thinking about growth. At that point, they’ve nailed down the core interests and philosophies that form the foundation of their technical and cultural stack. Next, they wonder whether they are doing the best they can to sell, market and grow the project.
In this note, I talk about four aspects that are useful to deliberately outline as you grow a project from this stage. I talk through these aspects via stories from my journey from Torch to PyTorch:
- Philosophy / Principles
- Scope & Risk
- Measurement / Metrics
- Scaling of the project
Philosophy / Principles
When we think about a project, it could be a technology-centric project, like Tensor Comprehensions, which was trying to propagate the ideas of polyhedral optimization, or a user-centric project like Torch-7, which propagates the idea of ease-of-use without caring about which technology or ideas get you that ease of use.
I started working with Torch in 2010/2011. As I made friends in the Torch community over time, I came to understand the implicit principles that they stood for as a whole. Open-source, like politics, can be fairly ill-defined in relationships and principles – not everyone stands for the same thing.
So, over the years, I absorbed and appreciated that Torch was a user-centric product, which stood for immediate-mode, easy-to-debug, stay-out-of-the-way explicitness. It was targeted at people somewhat familiar with programming matters, and who could reason about things like performance, and if needed, write a C function and bind it in quickly.
When we were writing PyTorch, I came to the realization that in an organic open-source community, not everyone stands for the same principles. We had a couple of really important members in the Torch community who were against Python, even though our user-centric view allowed us to go in that direction. Then, we had to make a decision on bringing them along or leaving them behind. These are difficult decisions because there is no right answer, only subjective calls that you have to make rapidly as a leader.
In such scenarios, it is often worth thinking about when to stay strong, and when to compromise. My view is that you have to be absolutely stubborn on the principles / philosophy that you are driving, but everything else is changeable.
This view was really useful in bringing people along. Over time, PyTorch brought along and integrated the Caffe2 community and the Chainer community and stayed friendly with Jax and Swift4TF since their inception. Bringing others along has huge advantages. The community gets bigger, you arguably get wider perspectives that make the project better and broader over time. And if you are stubborn on your core principles, you aren’t really compromising on your original vision, only strictly making it better.
Scope & Risk
Apart from the challenge of bringing the Torch community along, the second thing we faced was that our most formidable competitor at the time – TensorFlow – was rumoured to have 10x to 30x more developers. The really good thing for us, though, was that TensorFlow was trying to be everything for everyone. From our visibility, it was a project planned top-down, with a huge amount of resources and breadth.
So, we naturally took the exact opposite approach, mostly to survive and compete on realistic terms. We decided that we would concentrate on no one except ML researchers, and make their lives really good. This way, we could stay focused and deliver with fewer resources. We intentionally scoped down, taking on more vertical risk but less horizontal risk. We just wanted to nail our addressable market.
However, once we were successful in that market with PyTorch, we got incrementally more ambitious, slowly expanding our scope and ambition as we grew and matured. This approach scaled well.
Here, I also want to talk about the deliberate nature of risks you take and how it could shape your destiny.
We were making a deliberate bet on the ML researcher market, and that:
- the modeling they do over the next few years would need more flexibility and debuggability
- the ML researcher market would continue innovating with crazier model architectures, and those would become the mainstream future
So, with this bet, we needed a very wide API combined with a user experience that made it really easy to use and extend that API. This bet could have failed for a million reasons, depending on how the ML community shaped its future. For example, the field could’ve hypothetically stopped innovating at ResNets for a decade, and that would’ve made the need for a wide and large API obsolete – that future would’ve needed a library vertically focused only on ResNets.
You can listen to more of my views on this topic, and how I think about the future of ML frameworks, in this talk that I gave.
Measurement / Metrics
Apart from our core principles and scope, we also wanted a feedback loop with our customers, which is a standard operating need in product development. So, we asked ourselves how we wanted to track the various dimensions of PyTorch:
- Are they measurable?
- Can they be measured well?
- Should you measure them?
- How do you deal with unmeasurable areas – at scale?
In our Torch days, we learned a lot about how people love to measure things, and how others love to read comparisons of those measurements as gospel: micro-benchmarks, github star growth, tables of feature comparisons, and so on.
After people published a few such measurements and comparisons in the community, we felt wronged by some of them, and we were annoyed. But the bigger realization we took from our Torch experience was to ask ourselves how prematurely measuring something shapes the product in a bad way. Even though we didn’t write the blog posts comparing Torch to a competitor, we were constantly expending significant energy optimizing for those measurements and reacting to them, instead of focusing on other, more user-important priorities.
So, when we wrote PyTorch, we were clear on two things. First, our core competency was not something measurable like speed or some other stat; we needed to march towards a buttery-smooth user experience that put flexibility, API design and debuggability first. There is no good way to measure that, so we had to get really comfortable with that ambiguity, and to constantly, subjectively reassess our internal signals on whether we were doing a good job. Second, we believed that if we didn’t react to external measurements of PyTorch, we could stay focused on what we cared about – even if that created short-term churn.
So, in PyTorch’s history, we never responded to speed benchmarks or irrelevant measures such as github stars. We never submitted to industry benchmarks such as MLPerf ourselves as the PyTorch authors. This was very deliberate, and we are really, really comfortable and happy with our approach. When I gave PyTorch talks, someone would commonly ask: “how fast are you compared to X?”. Even if I knew we were as fast or faster on a given use-case, I would simply side-step the question – “we are more flexible, and we are probably within 10% on performance elsewhere; try us out”. This gave us an incredible superpower: we could focus on our core competency without the pressure of getting dragged into what we saw as a reductive view of our product – a view in which our strengths weren’t valued at all.
The metrics that we softly relied on were whether people were using PyTorch, and its use relative to our competition – not metrics that measured bookmarking (like github stars) or performance on microbenchmarks, but people actually writing code in it. So, we used signals like Github’s global code search (for `import torch` and the like) and arXiv citations, which more accurately portrayed whether someone actually used us, with no ambiguity.
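For illustration, a usage signal like this can be scripted against public APIs. The sketch below is hypothetical and not part of PyTorch’s actual tooling: it extracts a hit count from an Atom feed shaped like the ones arXiv’s export API returns, using a hard-coded mock feed so the snippet runs offline.

```python
# Hypothetical sketch (not PyTorch's actual tooling): pulling a hit
# count out of an Atom feed shaped like the arXiv export API's
# responses. The feed below is a hard-coded mock so this runs offline.
import re

MOCK_ATOM_FEED = """<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <opensearch:totalResults>12345</opensearch:totalResults>
</feed>"""

def total_results(atom_xml: str) -> int:
    """Extract the opensearch:totalResults count from an Atom feed."""
    match = re.search(
        r"<opensearch:totalResults>(\d+)</opensearch:totalResults>", atom_xml
    )
    if match is None:
        raise ValueError("no totalResults element in feed")
    return int(match.group(1))

print(total_results(MOCK_ATOM_FEED))  # → 12345
```

In a real pipeline, the mock feed would be replaced by an HTTP request to the search endpoint, and the count logged over time to watch the trend rather than the absolute number.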
However, the problem was that these are lagging metrics. We couldn’t rely on them to understand the immediate needs of our community at all, because they had a huge lead time – six months or so.
We also didn’t use metrics to approximate how users felt about their overall experience, or about aspects such as debuggability and API ease. But we did measure them subjectively…
At a smaller scale, what that meant was me basically reading the entire volume of information our community produced – github issues, forum posts, slack messages, twitter posts, reddit and hackernews comments. It was incredibly useful signal – yes, there was a lot of noise, but there was a lot of signal as well. It helped us prioritize really well, and I think it was a great way to shape our product through subjective territory. Apart from me, almost all the core devs spent a lot of time interacting with our users, so we built a rich shared understanding of these very ambiguous and subjective aspects. However, this approach did not scale beyond a point.
As we scaled – I think within two years of PyTorch – I was hitting the physical / human limits of doing this every day. I was going through ~500 github notifications, 50+ forum posts, tons of slack activity and many engagements on twitter/reddit/HN. I was working 15-hour days and was just exhausted all the time, without actually doing much else. My immediate thought was obviously to pawn this off on someone else, i.e. they work harder and better, and I can nurse my burnout.
This was obviously not going to work. My colleague Edward Yang, who has superpowers that I don’t, took over the process with the intent of first observing it, and then building a better process to scale it. He wrote a fantastic blog post summarizing what he did here. The nice realization I came to from watching him do this is that once you’re at a certain scale, you can’t aim to do everything; you have to prioritize heavily, there’s no way around it, and that’s okay. It doesn’t make you cruel for not being able to close every github issue.
The other thing to think about at scale is whether you integrate vertically or horizontally. In 2009, AMD spun off their fab division into a separate company. At the time, I found this incredibly hard to comprehend. Several years later, I read an article theorizing that AMD did this because the fab (backend) wasn’t working well with the designers (frontend), and that vertical integration there was hurting more than helping. In contrast, Apple’s M1 processor and its magically fast practical speeds are attributed to unbelievably good vertical integration, where the software team can measure the bottlenecks in the Apple software ecosystem and find critical low-level ops that need to be sped up, with that signal translating all the way down to hardware design. I don’t know if either of these theories is true, but I do believe that vertical integration done poorly is a huge overhead and done well is a huge force multiplier – so choose wisely which way you go. On PyTorch, we vertically integrated packages like quantization, which need deeper vertical integration because they intersect heavily with frontend design. We branched off packages such as torchserve into their own github repos, because they didn’t need as much end-to-end thought. Here, I think the decisions around vertical vs. horizontal integration also heavily rely on the effective bandwidth between the people building these things – whether they are within the same company, in the same timezone or physical space, or primarily talking via long-form async communication – all of which defines whether vertical integration can be effectively executed.
Another topic I wanted to cover on scaling is growing not just yourself but your ecosystem. The right kind of incentives matter – a lot. Since the beginning of PyTorch, we wanted to grow the community based on whether people were interested in using and contributing to PyTorch because they liked it as a product. We worked really hard to remove other kinds of incentives. Hence, for a very long time, we didn’t offer any prizes, bounties or other economic incentives to get people to use PyTorch. Our view was that once you introduce economic incentives, they shape your community’s culture in an irreversible way. Even now, outside of a hackathon or two a year, we strive not to push this button much, even though we have a bigger budget as a project and we could. Another incentive we care about a lot is making sure we give others the space to grow, and not scope-creep everything for ourselves. We care about helping the community grow and fill voids first, and only if no one is filling the needs do we deliberately go in and make those investments top-down.
My goal was to tell you a few anecdotes and stories from the journey of the PyTorch team, the project, and my personal journey. I hoped to tie them to four useful dimensions to think about when building and growing an open-source project, whether in Julia land or elsewhere. I hope this was useful, and please email me if you have any feedback.