Learning from Giants #20
Google's in-memory time series database, a resilient card transaction system at Robinhood, sizing engineering teams, an introduction to CSS Variables, and quotes for writing well.
👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies.
Did a friend send this to you? Subscribe to get these weekly drops directly in your inbox. Read the archive for even more great content. Also: I share these articles daily on LinkedIn.
Paper summary: Monarch: Google's Planet-Scale In-Memory Time Series Database
If you've read the Google SRE Book chapter, you've noticed how much it relies on observability data: logs, traces, and metrics. Without metrics, engineers would be operating blind in large distributed systems. So the system that stores them must be even more resilient and available than the systems it monitors.
That's why Google designed Monarch.
"Monarch is a globally-distributed in-memory time series database system in Google. Monarch runs as a multi-tenant service and is used mostly to monitor the availability, correctness, performance, load, and other aspects of billion-user-scale applications and systems at Google"
A beautiful global, multi-zone, highly available distributed system.
"Dividing Monarch into Global and Zone components enables scaling and availability of the system. In the presence of availability issues with global components, zones can still operate independently."
But that's not all! The Monarch team describes many other challenges in metric ingestion, storage, and querying, all at a scale very few systems ever reach.
"Monarch's internal deployment ingested around 2.2 terabytes of data per second in July 2019."
📗 Micah Lerner's review of the Monarch: Google's Planet-Scale In-Memory Time Series Database paper describes its main takeaways. I've had feedback that papers were long reads, so I'll try to share summaries from now on. Still, if you're curious, Micah's article should make you want to read the full paper to know more.
Building a Resilient Card Transaction System at Robinhood
When Robinhood launched its debit card, it came with a hard SLA: the webhook that authorizes charges has a two-second deadline. One microsecond beyond that, and the charge is denied.
Yet behind the scenes, authorizing a debit card transaction involves a complex orchestration of many different services: auth, balance check, fraud detection, and more. So how did the team build such a resilient system?
"Our solution is to build a second, lightweight backup system that can serve as a stand-in service to handle all traffic when the primary system is degraded or down."
They decided that horizontal scaling (replicating the authorization server) was not enough. So, in a very unusual move, they chose to decorrelate failures as much as possible by building their system twice, with two different architectures and tech stacks.
"The most significant difference in these two systems is in the architecture. The core service has a “pull” based architecture, where the most up-to-date information is queried on demand each time an authorization request comes in. [...] On the other hand, the backup service is a “push” based architecture, caching the latest state of each cardholder’s account by subscribing to asynchronous updates broadcast over Kafka streams."
📗 Robinhood's Building a Resilient Card Transaction System details this unusual solution to a typical SLA and resiliency problem. This huge backup cache Stephen Chang and their team built is quite intriguing, and so is the layer that routes traffic to the primary or backup.
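To make the pull versus push split more concrete, here is a minimal TypeScript sketch. It is not Robinhood's code: every name in it (AuthRequest, PullAuthorizer, PushAuthorizer, route) is hypothetical, the balance check stands in for the full auth, balance, and fraud pipeline, and the router simply falls back when the primary throws, whereas a real one would also have to respect the two-second deadline.

```typescript
// Hypothetical types for illustration; none of these names come from the article.
type AuthRequest = { accountId: string; amountCents: number };
type Decision = "approve" | "deny";

interface Authorizer {
  authorize(req: AuthRequest): Decision;
}

// "Pull": query the most up-to-date balance on every authorization request.
class PullAuthorizer implements Authorizer {
  private fetchBalanceCents: (accountId: string) => number;

  constructor(fetchBalanceCents: (accountId: string) => number) {
    this.fetchBalanceCents = fetchBalanceCents;
  }

  authorize(req: AuthRequest): Decision {
    // This call may throw when the upstream balance service is degraded or down.
    const balance = this.fetchBalanceCents(req.accountId);
    return balance >= req.amountCents ? "approve" : "deny";
  }
}

// "Push": keep a local cache that asynchronous updates fill in ahead of time.
class PushAuthorizer implements Authorizer {
  private balances = new Map<string, number>();

  // Would be called by a stream consumer (Kafka in the article) for each update.
  onBalanceUpdate(accountId: string, balanceCents: number): void {
    this.balances.set(accountId, balanceCents);
  }

  authorize(req: AuthRequest): Decision {
    const balance = this.balances.get(req.accountId);
    // Assumption: unknown accounts are denied; a real policy could differ.
    if (balance === undefined) return "deny";
    return balance >= req.amountCents ? "approve" : "deny";
  }
}

// Router: try the primary first, fall back to the backup if it fails.
function route(primary: Authorizer, backup: Authorizer, req: AuthRequest): Decision {
  try {
    return primary.authorize(req);
  } catch {
    return backup.authorize(req);
  }
}
```

The backup never calls another service at request time, so an outage in the primary's dependencies does not affect it. The trade-off is that its cached state may lag slightly behind the latest update.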
Sizing engineering teams
Organizational design is an area where companies and individuals learn by failing. You probably already have, and so have I. But there is always time to learn from giants!
"How many teams should we have? Should we create a new team for this initiative or ask an existing team to take it on? What is the boundary between these two teams?"
"I've come to believe the fundamental challenge of organizational design is sizing teams."
📗 Will Larson has been in tech management roles for a decade at companies like Uber and Stripe. After facing the organizational design challenge many times, he has come up with a playbook, a set of principles that apply most of the time. That's what he describes in Sizing engineering teams, a short and definitely actionable article.
The article goes into more detail about why the author believes in these principles, which I recommend reading. But if you only have a few seconds, here's what you should remember:
"Teams should be six to eight during steady state."
"To create a new team, grow an existing team to eight to ten, and then bud into two teams of four or five."
"Never create empty teams."
"Never leave managers supporting more than eight folks."
Introduction to CSS Variables
Browser-level standards are like Apple device features. They're not the first solution to a problem, but they're much more ergonomic and performant. CSS Custom Properties, aka CSS Variables, definitely fall into that category.
Most people have been doing dynamic styles with JavaScript, like CSS-in-JS solutions (styled-components, emotion), for a long time and will probably continue doing so. Styling became this giant pile of JavaScript code, sometimes a non-negligible part of your total bundle.
"There are two reasons to switch to CSS variables in your React app:
The ergonomics are nice.
It unlocks new possibilities! You can do things with CSS variables that are not possible with JS."
And while you could use CSS Variables to replace CSS-in-JS, you can also use both in tandem: CSS-in-JS styles can read values from variables that live in the CSS itself rather than in JavaScript.
📗 Josh Comeau's CSS Variables for React Devs is another well-described and interactive article from Josh on using this relatively new CSS feature. CSS Variables are a real quality-of-life improvement in many situations, and beyond that, they open up new possibilities that are worth exploring.
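As a rough illustration of the idea, and not code from Josh's article, the TypeScript sketch below turns a theme object into CSS custom property declarations and builds a stylesheet that reads them with var(). The Theme type, the toCssVariables helper, and the variable names are all assumptions made for this example.

```typescript
// Hypothetical theme shape; the keys and values are illustrative only.
type Theme = Record<string, string>;

// Turn { colorPrimary: "#663399" } into the line "--color-primary: #663399;".
function toCssVariables(theme: Theme): string {
  return Object.keys(theme)
    .map((name) => {
      // camelCase to kebab-case, since custom properties are usually written that way.
      const kebab = name.replace(/[A-Z]/g, (c) => "-" + c.toLowerCase());
      return `--${kebab}: ${theme[name]};`;
    })
    .join("\n");
}

const theme: Theme = { colorPrimary: "#663399", spacingUnit: "8px" };

// Declared once on :root, then read anywhere in the cascade with var().
const stylesheet = `
:root {
${toCssVariables(theme)}
}
.button {
  background: var(--color-primary);
  padding: calc(var(--spacing-unit) * 2);
}
`;
```

In a browser, a single variable can also be changed at runtime with element.style.setProperty("--color-primary", value). Every rule that reads it updates without any JavaScript regenerating the styles.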
Writing Well
Writing is power. Writing well is a superpower.
"Writing something that other people will read forces you to think well." Paul Graham
It's a superpower that takes time and training to learn. But as with all learned skills, you can speed it up with advice from mentors. In tech, we always refer to the same writing guru: Paul Graham. But by expanding our horizons, we can learn from many other writers.
📗 Slava Akhmechet's Writing well notes are a collection of quotes and links on the topic of writing, and doing it well. Beyond the Paul Graham quotes, Slava also found compelling quotes from George Orwell and Paul Roberts. You should read, bookmark, and re-read that article every month until you feel these bullet points are your default mode.