All My Good Habits Are Bad
September 03, 2020
During my non-coding time at Recurse Center, I’ve been digging into videos about high-performance C++ (or high-performance software in general) to build good habits before I start working at Facebook. In the process I’ve been reflecting on the code I’d write by default and what I can learn from these talks. So we’ll pair resources with habits.
Make life easy for the hardware
Resources:
- Mike Acton’s talk about Data Oriented Design
- Scott Meyers’s talk about CPU Caches
- Carl Cook at CPPCon
CPU memory access isn’t as simple as “compute an address, then go read that spot in main memory for the data”. It turns out CPUs are a lot smarter than that. A CPU has various mechanisms to make accessing data as fast as possible, including:
- various levels of caches before hitting main memory
- access patterns akin to grabbing a contiguous blob of data (called a cache line) and stuffing it into cache
- attempting to make predictions on your access patterns and grab data greedily to improve future access
A cache line is effectively a contiguous blob of memory that the CPU will grab when you access an address. An analogy might be: you’re making apple pie and you want to find the 10 best apples at the grocery store. Rather than picking them out one by one, you scoop an armful into your cart. Then you start looking at the apples in your cart. If they’re all good, yay! If not, you throw away the ones you don’t want and grab another armful of apples.
If you treat your cart as the L1 cache and a trip back to the apple bin as an access to main memory, the analogy says each trip back to the bin takes roughly 200x as long as reaching into your cart (see Peter Norvig’s numbers).
It turns out the people who really care about performance - game programmers - figured this out a long time ago. Bob Nystrom’s written a great chapter in Game Programming Patterns about data locality.
So what can we do?
- Be judicious with OOP patterns. Making everything an instance of a class and wiring up relationships between them is great for modeling, but following those references is effectively chasing a pointer, which all but guarantees a cache miss
- Use monomorphic arrays when it makes sense. If you know you’re going to access all the elements of some list often (e.g. all the values of a map), just use an array and make it cache friendly (see the sketch after this list).
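To make that concrete, here’s a minimal JS sketch (mine, not from any of the resources above) contrasting pointer-chasing through separately allocated objects with a scan over one contiguous typed array:

// Object-per-entity: every element is its own heap allocation, so the loop
// hops between scattered addresses (pointer chasing).
const entities = Array.from({ length: 10000 }, () => ({ health: 100, armor: 5 }));
let objTotal = 0;
for (const e of entities) {
  objTotal += e.health; // one dereference per element
}

// Cache-friendlier: keep the hot field in a single contiguous typed array.
const healths = new Float64Array(10000).fill(100);
let flatTotal = 0;
for (let i = 0; i < healths.length; i++) {
  flatTotal += healths[i]; // sequential reads over one block of memory
}

Whether the difference shows up in your particular JS engine is exactly the kind of thing to benchmark, but the second shape of code is the one the game programming folks keep reaching for.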
Big O isn’t everything - benchmark!
- Eric Lippert’s blog post on a very neat assembly instruction
- The chapter of Crafting Interpreters on writing a hash table
I expect this is particularly emphasized by the interview process at the FAANG companies, and Big O analysis is more or less second nature to me by now. But it’s worth noting that sometimes your Big O assumptions miss some truths about real-world numbers.
The idea that a single assembly instruction can do the work much faster than the “clever” code I’d write is still kind of wild. The same goes for hash tables: of course access is O(1)! But with a caveat.
In the case of a hash collision, your hash table implementation matters a LOT. Are your buckets linked lists, with each entry pointing at the next? If so, you’re chasing pointers again. Which means cache misses. Which means ~200x as long per access. Could a simple linear scan over an array of data, stored in a cache line and held in L1, beat resolving that hash collision? Well, maybe.
In practice, I’m not sure just HOW useful this is. But a key takeaway for me: when measuring performance, don’t assume your “optimal” algorithm is as fast as it could be.
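As a rough illustration (my own micro-benchmark sketch with made-up sizes, so treat the setup as an assumption and expect results to vary by engine), you can pit a Map lookup against a linear scan over a small array and let the clock decide:

const N = 16; // small enough that the whole array stays hot in cache
const keys = Array.from({ length: N }, (_, i) => i * 7);
const map = new Map(keys.map(k => [k, k * 2]));
const flat = keys.map(k => ({ key: k, value: k * 2 }));

function scanLookup(key) {
  for (let i = 0; i < flat.length; i++) {
    if (flat[i].key === key) return flat[i].value;
  }
  return undefined;
}

let mapSum = 0;
console.time("Map.get");
for (let i = 0; i < 1000000; i++) mapSum += map.get(keys[i % N]);
console.timeEnd("Map.get");

let scanSum = 0;
console.time("linear scan");
for (let i = 0; i < 1000000; i++) scanSum += scanLookup(keys[i % N]);
console.timeEnd("linear scan");

Either one could win depending on the sizes and the engine; the point is just to measure instead of trusting the asymptotics.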
Higher Order Functions aren’t all that fancy
- Let’s learn about monomorphism from this V8 engineer
- Some thoughts from Ben Lesh
I think this is fallout from the ES6 excitement that led to a ton of “functional-lite” code in the JS ecosystem, but I’m starting to feel a little bit of burnout from it. Consider the following:
const arrayOfNums = [1, 2, 3, 4, 5];
const result = arrayOfNums
  .map(x => x * 2)
  .filter(x => x % 3 === 0)
  // return the accumulator, not the assignment expression
  .reduce((obj, num) => ((obj[num] = String(num)), obj), {});
Comfortably we can call this linear in time and space. But in practice:
- we pay a callback invocation for every element at each step of the chain
- we create intermediate arrays in the middle that are immediately thrown away
- we end up creating a ton of “shapes” while building the object in our reduce, unless the compiler inlines it all away
I fully get that the code is arbitrary, but at the same time…does this really convey intent better? Maybe: map tells us there’s a new array created, potentially of a new type; filter tells us we’ll get an array of the same type but with fewer elements; and reduce tells us…well, not much. But maybe, particularly given that we’re running this in a browser where a user might have a dozen or more tabs open (on a phone!), we should be more judicious with our use of higher order functions.
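For comparison, here’s what a single-pass version of the chain above could look like (my sketch, same arbitrary logic): one loop, one result object, no intermediate arrays.

const looped = {};
for (const x of arrayOfNums) {
  const doubled = x * 2;               // what map was doing
  if (doubled % 3 === 0) {             // what filter was doing
    looped[doubled] = String(doubled); // what reduce was doing
  }
}

It’s a few more lines, but there’s exactly one pass and one allocation, and it still reads fine.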
In particular, I’ve poked around the codebases of React and Rome and noticed that both Sebastian Markbåge (and team) and Sebastian McKenzie seem to keep it simple: while loops, for…of iterators, basic data structures (queues, stacks, heaps), and some well-typed classes.
In @sebmarkbage’s case in particular, despite his OCaml influence, he’s not writing functional-lite JS. Which makes sense! There’s no tail call optimization and a very different engine underneath, so he’s playing to the strengths of the language. That’s something I can take away.