Learning from Giants #18
How Facebook optimized their main React app, Focusing on product problems not solutions, Monitoring distributed systems by Google SRE, Questions for your first 1:1, and Amazon's Aurora database.
👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies.
Did a friend send this to you? Subscribe to get these weekly drops directly in your inbox. Read the archive for even more great content. Also: I share these articles daily on LinkedIn.
How Facebook moved the facebook.com app away from server-rendered PHP
For over 15 years, the Facebook team added many layers to the original PHP code powering the main facebook.com page.
"Over time, we've added layer upon layer of new technology to deliver more interactive features. Each of these new features and technologies incrementally slowed the site down and made it harder to maintain. [...] We needed to take a step back to rethink our architecture."
Re-architecting a web application with billions of monthly active users means supporting wildly different usage patterns, languages, and devices. The team settled early on its core principles:
"1. As little as possible, as early as possible. We should deliver only the resources we need, and we should strive to have them arrive right before we need them."
"2. Engineering experience in service of user experience. As we think about the UX challenges on our site, we can adapt the experience to guide engineers to do the right thing by default."
By choosing the client-side route with React, the team knew the main problem to solve: page load performance. When the browser has to download a multi-kilobyte (or multi-megabyte) application before the first render, you've already lost many milliseconds.
What followed is history, but one with many CSS, React, and tiered lazy loading optimizations!
📗 Facebook's Rebuilding our tech stack for the new Facebook.com is a fascinating peek into the team's work to re-architect the most crucial pages of Facebook. Besides the CSS optimizations, what's interesting is the amount of lazy loading added to their React application. When the goal is a fast first render, there is no other choice but to 1) ship as little code as possible and 2) stream code and data in prioritized order so that an initial render can start as soon as possible. But that's not all: Ashley Watkins and Royi Hagigi explain that the team went even further and coupled data fetching with component loading, so only the components needed to display the Feed corresponding to the fetched data are downloaded.
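The tiered-loading idea can be sketched outside of React. Code is grouped into tiers, and a tier only starts loading once the previous one has finished, so the first paint is never blocked by interaction or after-display code. A minimal sketch — the tier names follow the article's description, but the loader shape and `register` helper are illustrative, not Facebook's actual API:

```typescript
// Sketch of tiered code loading: tiers load strictly in sequence,
// modules within a tier load in parallel. Each loader here stands in
// for a dynamic `import()` call.
type Loader = () => Promise<void>;

const tiers: Record<string, Loader[]> = {
  skeleton: [],     // Tier 1: just enough to render the loading skeleton
  firstPaint: [],   // Tier 2: everything needed for the full first render
  afterDisplay: [], // Tier 3: code that only runs after display (logging, etc.)
};

const loadOrder: string[] = [];

function register(tier: string, name: string): void {
  tiers[tier].push(async () => {
    // In a real app this would be `await import("./" + name)`.
    loadOrder.push(name);
  });
}

async function loadInTiers(): Promise<void> {
  for (const tier of ["skeleton", "firstPaint", "afterDisplay"]) {
    await Promise.all(tiers[tier].map((load) => load()));
  }
}
```

The point of the sequencing is that nothing in `afterDisplay` can ever compete for bandwidth with the code needed for first paint.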
Great PMs don't spend their time on solutions
PM is one of the most challenging jobs in the world. Why is that? One of the reasons is that the most impactful part of the work is the least visible: it's thinking and digging deep into problems.
"Imagine all the time you use on a project was contained in 100 units. So all the PM work, the designers, researchers, analysts, engineers, etc. How would it break down?"
Most companies will spend 10x more time building than researching and defining problems. But Intercom does it differently.
"[At Intercom], 40% of our 100 units are spent before we've even started designing anything. We obsess about problem prioritisation and problem definition. I mean obsess."
"So why do we do this? We do it because a solution can only be as good as your understanding of the problem you're addressing."
So obvious, right? Yet most PMs don't work this way, because they never have, and often because that's not what their leadership expects.
📗 Paul Adams is Intercom's CPO–he wrote Great PMs don't spend their time on solutions to raise awareness of this common trap for product teams. If that's you, this is your opportunity to reset and sell the idea of spending more time on problems to your leadership!
Google SRE: Monitoring Distributed Systems
As production systems grow complex, monitoring them can grow even more complex. And more often than not, you'll find yourself under-equipped, stuck in reactive mode, or drowning in email alert fatigue. The Google SRE team's main takeaway is that monitoring systems must be kept simple.
"it's important that monitoring systems—especially the critical path from the onset of a production problem, through a page to a human, through basic triage and deep debugging—be kept simple and comprehensible by everyone on the team."
You don't want a paged on-call engineer to spend more time understanding what an alert really means than troubleshooting the system itself. That's easier said than done. Where do you start?
"The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four."
Then, when creating alerting rules on top of these numbers, vet them by asking the right questions: does the alert reflect user impact, is it actionable, and does it require urgent investigation?
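The four golden signals lend themselves to simple threshold rules. A sketch of what vetting them in code might look like — the metric shape, names, and thresholds below are made up for illustration, not values from the SRE book:

```typescript
// Sketch: checking the four golden signals against alert thresholds.
// Every threshold here is an illustrative placeholder; real values
// depend on your service's SLOs.
interface GoldenSignals {
  latencyP99Ms: number;   // latency: time to serve a request
  requestsPerSec: number; // traffic: demand placed on the system
  errorRate: number;      // errors: fraction of failing requests (0..1)
  saturation: number;     // saturation: fraction of capacity in use (0..1)
}

function pageWorthyAlerts(s: GoldenSignals): string[] {
  const alerts: string[] = [];
  // Each rule should be actionable and reflect real user impact.
  if (s.latencyP99Ms > 500) alerts.push("p99 latency above 500ms");
  if (s.errorRate > 0.01) alerts.push("error rate above 1%");
  if (s.saturation > 0.9) alerts.push("saturation above 90% of capacity");
  // Traffic alone rarely warrants a page: a spike only matters if it
  // degrades latency, errors, or saturation.
  return alerts;
}
```

Keeping the rule set this small is the point: an engineer paged at 3 a.m. should be able to hold the entire alerting logic in their head.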
📗 Google SRE's Monitoring Distributed Systems chapter of the SRE book is an excellent introduction to the practice. It'll take you through definitions, basic principles all monitoring systems must follow, and questions to ask yourself while setting up alerts. It's the output of multiple years of learning by failing in the world's best teams.
Questions for our first 1:1
The first 1:1 is always special because there is much to do to kickstart a relationship. As a manager, it's easy to let an hour pass by just giving context.
But how much do you know about your managee? How will you support them in a personalized way? Ask them! The best way to kickstart that relationship is by learning how you can adapt your support to what they need, want, or prefer.
Take feedback and recognition. How do they like them delivered?
"Have the answers to these questions WAY before you need them. Few things are harder than trying to give someone feedback; doing it in a way that you think they'll be most able to hear it is invaluable."
📗 Lara Hogan's Questions for our first 1:1 is a quick read on questions managers should ask during their initial 1:1 to get to know their reports and personalize their support. While Lara mentions simple points like giving feedback, she also covers topics that I'm pretty sure you've never asked about in a 1:1!
Amazon Aurora: Amazon's Cloud Native Relational Database
Traditional relational databases were not designed to take full advantage of the power of the cloud. And when you're the database service provider, this old design becomes a very costly operational burden.
So it's not surprising to see cloud providers rebuild "cloud-native" relational databases–evolutions that can leverage cloud primitives and scale across many VMs. That's what AWS did with Aurora.
"We designed Aurora as a high throughput OLTP database that compromises neither availability nor durability in a cloud-scale environment. The big idea was to move away from the monolithic architecture of traditional databases and decouple storage from compute."
Decoupling storage from compute and moving both to the cloud: it's probably not the first time you've read this idea. There is a "dumb" way of doing it with network-replicated volumes, but it does not scale. It just moves the problem from compute and storage to the network, which quickly becomes a huge bottleneck. That was Aurora's primary design challenge.
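Aurora's answer, per the paper, is to ship only redo log records (not full pages) to six storage copies spread across three availability zones, acknowledging a write once a quorum agrees. The quorum arithmetic from the paper can be checked in a few lines — the numbers are Aurora's, the code itself is just an illustration:

```typescript
// Aurora's quorum parameters from the paper: V = 6 copies across 3 AZs,
// write quorum Vw = 4, read quorum Vr = 3. Two conditions must hold:
//   Vr + Vw > V   (a read quorum always overlaps the latest write)
//   Vw > V / 2    (two concurrent writes cannot both succeed)
function isValidQuorum(v: number, vw: number, vr: number): boolean {
  return vr + vw > v && vw > v / 2;
}

// With 6 copies, losing a whole AZ (2 copies) still leaves 4 for writes,
// and losing an AZ plus one more node (3 copies) still leaves 3 for reads.
const writesSurviveAzLoss = 6 - 2 >= 4;      // true
const readsSurviveAzPlusOne = 6 - 3 >= 3;    // true
```

That 4-of-6 / 3-of-6 split is what lets Aurora keep writing through an AZ failure and keep reading through an AZ failure plus one extra node.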
📗 Amazon's Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases paper describes a few key architecture solutions to this challenge. Initially based on open-source MySQL, with an InnoDB storage engine patched to work on cloud primitives, Aurora evolved a lot in recent years to reach the impressive performance and scalability that the paper describes.