Learning from Giants #49
Compression algorithms from first principles and in DuckDB, collecting stories as a powerful user interviewing tool, and 15 rules for communicating (mostly async) at GitHub.
👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies. Guaranteed 100% OpenAI-free content.
Did a friend send this to you? Subscribe to get these weekly drops directly in your inbox. Read the archive for even more great content. Also: I share these articles daily on LinkedIn.
Compression algorithms from first principles
"When working with large amounts of data, compression is critical for reducing storage size and egress costs. [...] Compression algorithms typically reduce data set size by 75-95%, depending on how compressible the data is."
A 95% reduction means storing twenty times less data. The kind of number that makes a lot of complexity worth it.
In a database, that can lead to huge performance improvements because less data has to be read from disk or transferred over the network.
How does compression work?
"At its core, compression algorithms try to find patterns in a data set in order to store it more cleverly. Compressibility of a data set is therefore dependent on whether or not such patterns can be found, and whether they exist in the first place."
Most of your interactions with compression are with zip or tar archives. These are called general-purpose algorithms because they can work on any piece of data. But like most generic things, they're more expensive because they have to handle everything.
"Another option for achieving compression is to use specialized lightweight compression algorithms. [...] However, unlike general purpose compression, they do not attempt to find generic patterns in bitstreams. Instead, they operate by finding specific patterns in data sets."
There are dozens of such lightweight algorithms, each surprisingly simple.
"Constant encoding is used when every single value in a column segment is the same value. In that case, we store only that single value." So simple it's hardly an "algorithm"!
"Run-length encoding (RLE) is a compression algorithm that takes advantage of repeated values in a dataset. Rather than storing individual values, the data set is decomposed into a pair of (value, count) tuples, where the count represents how often the value is repeated."
"Dictionary encoding works by extracting common values into a separate dictionary, and then replacing the original values with references to said dictionary."
And so many more!
📗 Mark Raasveldt's Lightweight Compression in DuckDB defines compression before explaining how DuckDB, an in-process columnar database, leverages these lightweight compression algorithms for huge performance boosts.
Collect stories, not generalizations
User interviewing is a skill we all grow throughout our careers. The best interviewers can extract invaluable insights quickly because they've learned to avoid pitfalls. Asking generic questions is one of them.
"People feel much more self-aware than they are. If you ask for generalizations, you'll get confabulations.
When asked to describe themselves, people will tell you the vision they want to project, or even believe in, which can be far from the truth. Similarly, you can't just ask what they think of your product or new idea.
"They probably don't even know, and they'll lean heavily toward saying 'Uh sure yeah, I agree that your idea would be cool.'"
But there is a way around these biases: actual stories. Not people talking about themselves in general, but about a specific moment, why it happened, and how it came to be.
"[Asking] "Tell me about a recent time you struggled with X" makes things more concrete. It plants seeds for "why" follow-up questions and leads to a deeper understanding."
📗 Allen Pike's Collect Stories, Not Generalizations is a good reminder of (or first encounter with) this crucial principle. And it's not limited to product management: asking for stories will get you to the truth in hiring, news, or personal interviews.
15 rules for communicating at GitHub
GitHub is a well-known pioneer of the async-first culture. This culture is powered by a thought-out system that relies on GitHub 😉 as an asynchronous communication platform.
For many of us, running the entire company on GitHub pushes it too far. Still, GitHub's culture is a treasure trove of good async practices for modern companies.
At its core is a trade-off of synchronous vs. asynchronous communication heavily biased towards async.
Rule #1: "Prefer asynchronous communication"
"When knowledge work is interrupted, intentional or not, whether a popup, a meeting, or a "hey, you got a sec?" drive by, there's a significant switching cost to get back to where you were."
"Whenever possible, prefer issues and chat, to just in time communications."
Most companies are sync-first and occasionally async. GitHub reverses that: async-first, with sync exceptions. Hence the second rule:
Rule #2 "Don't underestimate synchronous mediums"
In practice, some activities are more efficient when done synchronously. Ben lists three: brainstorming, feedback, and small talk.
Beyond the trade-off, GitHub has thirteen other rules to make the async culture work. Some interesting ones are:
#5 "Be mindful of noise". Async quickly reaches many people, which can create large echo chambers. So every message, like a sync meeting, must be weighed for its cost.
#8 "Master the gentle bump". It's easy to miss things in async-first cultures because so much is happening—many issues and threads. A gentle bump can unlock and speed up many decisions.
#9 "Keep discussions logically distinct". Async threads are uni-dimensional. Discussions are trees.
📗 Ben Balter's 15 rules for communicating at GitHub introduces the company's async-first culture and how they've made it work. While it relies heavily on GitHub, tooling shouldn't be a blocker nowadays: as long as you can drop emails, there are hundreds of good alternatives for async communication.
"Asynchronous communication [...] eliminate the endemic "you had to be there" aspect of most corporate workflows, and reduces the need for a dedicated management class to capture, collect, and shuttle information back and forth between business units."