Learning from Giants #62

The how and why of Engineering Strategies, Scaling distributed data problems at Mux, and a primer on Naming things in software.

Dec 02, 2023

👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies. Guaranteed 100% GPT-free content.

Did a friend send this to you? Subscribe to get these weekly drops directly in your inbox. Read the archive for even more great content. Also: I share these articles daily on LinkedIn.

Solving the Engineering Strategy Crisis

Your Engineering team deserves a Strategy, too.

While absent from most roadmaps, engineering work is core to a tech company's team and product health. And if it's not explicit, your company probably has an unspoken engineering strategy.

What's Engineering Strategy? Taking inspiration from Richard Rumelt's Good Strategy, Bad Strategy, Will Larson says this:

"I believe that Engineering strategy comes down to two core components:
Honest diagnosis that engages with the reality of your organization's current needs and challenges
Practical approach to move forward while addressing the circumstances raised in the diagnosis"

Far from the fancy missions and visions, strategy must be rooted in the current situation and truth. The more honest it is about the present, the more it can be used to guide the future.

Why is it important?

Explicit strategy aligns people around a common approach, creates consistency within the company, focuses resources on improvements to that common approach that benefit everyone, and facilitates hard decisions.

I know what you're thinking: these aren't specific to engineering. Think about scaling your org, onboarding new people, and decentralizing decision-making. Having a written strategy will make these processes smoother. So write that strategy down!

"You can solve half the engineering strategy crisis by just writing stuff down."

Stripe’s engineering strategy, according to Will Larson.

📗 Will Larson's Solving the Engineering strategy crisis then explains how to define (or find) your strategy. It's often already there, but you need to put the right words on it and align people around it. If you don't know where to start, Will gives examples of engineering strategies from his previous companies: Stripe, Uber, Calm. You'll see how radically different they can be!

Read the full article on Will's blog

Scaling data processing at Mux with an embedded key-value store

How Mux, a video infrastructure company, moved their view management system to a "Shared Nothing" architecture.

The Mux Data team owns a system that does viewer analytics and stores finished views to allow resuming views. Because the requirements of analytics (large-scale aggregations) and view resumption (row-level queries) are opposite, they split it into two sub-systems:

Analytics are powered by an OLAP system, ClickHouse, which ingests all view events from Kafka.
Row-level access was powered by Riak, a distributed key-value store used as a 6-hour cache, also ingesting events from Kafka.
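To make the contrast between the two access patterns concrete, here is a minimal sketch in plain Python. The field names (view_id, video_id, watched_seconds) and both helper functions are hypothetical, not Mux's schema or code, and in-memory lists and dicts merely stand in for ClickHouse and Riak.

```python
from collections import Counter

# Hypothetical finished views; the field names are illustrative only.
finished_views = [
    {"view_id": "v1", "video_id": "intro", "watched_seconds": 120},
    {"view_id": "v2", "video_id": "intro", "watched_seconds": 30},
    {"view_id": "v3", "video_id": "keynote", "watched_seconds": 300},
]


def total_watch_time_by_video(views):
    """Analytics pattern: scan every row and aggregate (what an OLAP store is built for)."""
    totals = Counter()
    for view in views:
        totals[view["video_id"]] += view["watched_seconds"]
    return dict(totals)


def index_views_by_id(views):
    """Row-level pattern: index by key so a single view can be fetched directly."""
    return {view["view_id"]: view for view in views}


watch_time = total_watch_time_by_video(finished_views)
view_to_resume = index_views_by_id(finished_views)["v2"]
```

The first function touches every row to answer one question, while the second answers by key without scanning. That difference is why a single store rarely serves both workloads well.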

"Riak became increasingly difficult to operate as Mux Data usage grew, compounded by a growing number of live video events. Unlike on-demand video consumption, live events have fundamentally different viewership patterns."

Traffic became much spikier, with requests in the hundreds of thousands per second. Riak is a distributed database whose capacity isn't elastic: scaling it up or down requires human intervention. So, the team had to provision it for peak usage, which is very inefficient the rest of the time.

Even then, Riak would stall when provisioned to its maximum capacity, and data ingestion would fall behind.

"Riak often became a write-throughput bottleneck when viewers left a stream en masse. Degraded Riak write performance could cause stream-processing lag, possibly leading to delayed data in our Mux Data Monitoring dashboard."

At this point, the team knew they couldn't continue scaling Riak. They went back to the drawing board. And they realized something. They didn't need a unified store because Kafka already partitioned their events into independent streams.

"Rather than running a networked, distributed KV store (Riak, Redis, etc.), we could instead have the event processors manage many independent partition-scoped embedded KV stores on local disk."

"We leveraged the keyed, partitioned nature of the data in Kafka to scope each cache to a single Kafka topic partition, leading to a cache architecture following the Shared Nothing pattern [from Martin Kleppmann's Designing Data-Intensive Applications book]."

Mux’s move to a “shared nothing” distributed processing architecture. Source: Mux.

The team removed Riak's network requests and coordination bottlenecks by writing directly into an embedded (that is, local) key-value store inside the processing Pod. The trade-off is having to manually manage the persistence of that data during rebalancing and upgrades.
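The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Mux's implementation: PartitionCache, EventProcessor, and partition_for are hypothetical names, a dict stands in for the embedded on-disk store, and a CRC32 hash stands in for Kafka's key partitioner.

```python
import zlib
from typing import Dict, Optional


def partition_for(view_id: str, partition_count: int) -> int:
    """Hypothetical stand-in for Kafka's key partitioner: same key, same partition."""
    return zlib.crc32(view_id.encode("utf-8")) % partition_count


class PartitionCache:
    """Stand-in for one embedded key-value store, scoped to a single partition.

    A real implementation would write to local disk; a dict keeps the sketch short.
    """

    def __init__(self, partition: int) -> None:
        self.partition = partition
        self._views: Dict[str, dict] = {}

    def upsert(self, view_id: str, event: dict) -> None:
        # Merge the newest event fields into the stored view.
        self._views.setdefault(view_id, {}).update(event)

    def get(self, view_id: str) -> Optional[dict]:
        return self._views.get(view_id)


class EventProcessor:
    """Owns one cache per assigned partition and shares nothing with other processors."""

    def __init__(self) -> None:
        self.caches: Dict[int, PartitionCache] = {}

    def assign(self, partition: int) -> None:
        self.caches.setdefault(partition, PartitionCache(partition))

    def revoke(self, partition: int) -> Optional[PartitionCache]:
        # On rebalance the cache leaves with the partition. Persisting it and
        # handing it over is the part that has to be managed manually.
        return self.caches.pop(partition, None)

    def handle(self, partition: int, view_id: str, event: dict) -> None:
        # A local write: no network call, no cross-partition coordination.
        self.caches[partition].upsert(view_id, event)


PARTITION_COUNT = 4
processor = EventProcessor()
for partition in range(PARTITION_COUNT):
    processor.assign(partition)

for view_id, event in [
    ("view-a", {"state": "playing"}),
    ("view-b", {"state": "playing"}),
    ("view-a", {"state": "ended"}),
]:
    processor.handle(partition_for(view_id, PARTITION_COUNT), view_id, event)
```

Because every event for a given view hashes to the same partition, exactly one cache ever holds that view, so no coordination between caches is needed. The revoke method marks where the hard part lives: whatever the cache holds must survive the partition moving to another processor.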

📗 Scott Kidder's Scaling data processing with an embedded key-value store tells the story of a team that started with a pragmatic choice and successfully re-evaluated their options once it began to show scale issues. Scott also clearly explains that there is no silver bullet; this complex solution works for Mux because of its architecture and spiky, write-heavy workload. And last, it's not simple to set up! Even the world-class Mux team struggled to make the persistence work at all times, so think twice before building your own distributed system!

Read the full article on Mux's blog

Your Codebase deserves Better Names

Poor naming is probably the fastest and dumbest way to accumulate tech debt. But it also is the most common, particularly when English isn't your team's primary language. Luckily, 1) naming is easy to improve at, 2) names are trivial to refactor, and 3) better names can even improve the code itself!

Beyond these apparent reasons, putting effort into naming has a valuable side effect. It helps detect spaghetti code:

"When it feels impossible to name something, step back and figure out what you really want your code to do before you write any more of it."

So, what's a great name? It should be:

"Descriptive. [...] Names should describe what the code does, in detail."
"Specific. [...] making sure your words are as precise and specific as possible is essential for a great name."

Get rid of the doXYZ and getXYZ methods and the foo and bar variables. Using specific and consistent nouns and verbs to describe concepts makes a massive difference to the readability of your code.

Good and bad examples. Source: Emma Ferguson, Samsara.
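As a separate illustration (my own, not taken from Emma's article), here is the same function before and after a rename. The function names, the started_at field, and the sample data are all hypothetical, and the names use snake_case to stay in line with Python conventions.

```python
# Vague: "get", "data", "foo", and "bar" say nothing about what is selected or why.
def get_data(foo, bar):
    return [x for x in foo if x["started_at"] >= bar]


# Specific: the verb names the action, the nouns name the things involved.
def select_views_started_since(views, cutoff_timestamp):
    return [view for view in views if view["started_at"] >= cutoff_timestamp]


sample_views = [
    {"view_id": "v1", "started_at": 100},
    {"view_id": "v2", "started_at": 250},
]
recent_views = select_views_started_since(sample_views, cutoff_timestamp=200)
```

Both functions behave identically, which is what makes the rename a safe refactor. Only the second one can be understood at the call site without opening its body.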

To find these specific words, first scan the codebase for similar occurrences, and if you can't find any, use a thesaurus (I use Power Thesaurus) to generate ideas. Start with a concept and explore synonyms until you discover the perfect one.

"In line with codebase conventions". Consistency is underrated.
"Grammatical."

For non-English speaking teams, grammar is usually the biggest hurdle. That's OK. Accept it and use Grammarly or translation tools to ensure your names are correct; it's the only way to learn.

Grammar-wise, there are a few general rules that you can always apply:

The most important noun should be last, with adjectives before it: ConnectedOrderBuilder, NetworkConnectivityInspector.
Use verbs for actions and place them first: renderBlueLogo, openDatabaseConnection.
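These two rules are mechanical enough to check in code. The sketch below is only an illustration, assuming a deliberately tiny, hypothetical verb list (ACTION_VERBS) and three helper functions of my own; it is not a tool from the article, and a real team would curate its own vocabulary.

```python
import re
from typing import List

# Hypothetical and deliberately small; extend it with your team's vocabulary.
ACTION_VERBS = {"render", "open", "close", "build", "inspect", "send", "parse"}


def split_identifier(name: str) -> List[str]:
    """Split a camelCase, PascalCase, or snake_case identifier into lowercase words."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [word.lower() for word in words]


def starts_with_action_verb(function_name: str) -> bool:
    """Rule: names of actions begin with a verb."""
    words = split_identifier(function_name)
    return bool(words) and words[0] in ACTION_VERBS


def main_noun(class_name: str) -> str:
    """Rule: the most important noun comes last, so the final word names the thing."""
    return split_identifier(class_name)[-1]


verb_first = starts_with_action_verb("renderBlueLogo")
verb_last = starts_with_action_verb("blueLogoRender")
builder_noun = main_noun("ConnectedOrderBuilder")
```

A check like this could run in code review to flag function names that do not start with a known verb. It cannot judge whether the verb is the right one, so it complements the thesaurus step rather than replacing it.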

📗 Emma Ferguson's Your Codebase Deserves Better Names is a rare actionable article on naming. While the famous saying is that "cache invalidation and naming things" are the most complex problems in computer science, there are many more articles on the former. So, thank you, Emma, for clarifying this simple but impactful topic!

Read the full article on Samsara's blog

Standing on the Shoulders of Giants
