Learning from Giants #10
Webhook security, sharding Postgres at Notion, common A/B testing mistakes, system handovers at SoundCloud, and the server-side SQLite trend.
👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies.
Did a friend send this to you? Subscribe to get these weekly drops directly in your inbox. Read the archive for even more great content. Also: I share these articles daily on LinkedIn.
The Complete Guide to Webhook Security
Webhooks are probably the second most popular server-to-server interface after direct HTTP calls.
But there is one significant difference between the two: security.
"Webhooks were not built to be secure out-of-the-box, and the entire security burden falls on the developer."
1. Unlike with APIs, webhook security is the responsibility of the consumer. And since the producer is usually larger than the consumer, that burden falls on the smaller actor.
2. Webhooks still work if you don't implement most of the security requirements. No data validation? Works fine. No replay attack protection? Fine again.
And since there is no standard, some companies make pretty wild attempts at securing webhooks, layering JWTs and whatnot.
📗 Hookdeck's Complete Guide to Webhook Security is a good starting point for learning about webhook vulnerabilities and common mitigation strategies. While you may be tempted to run and fix all your unsecured webhook endpoints, resist that urge and try to grasp each vulnerability first. The level of security a webhook endpoint requires depends on the sensitivity of its payload and its use case. But if an endpoint handles even mildly sensitive data, by all means, fix it!
Lessons Learned from Sharding Postgres at Notion
At a start-up, the architecture choices you make early on are focused on optimizing for pivots and simplicity.
Of these choices, the hardest to move away from will be the database. And if you're lucky enough to see product-market fit and hypergrowth, you will have no time to revisit that decision.
"By mid-2020, it was clear that product usage would surpass the abilities of our trusty Postgres monolith, which had served us dutifully through five years and four orders of magnitude of growth."
These four orders of magnitude of growth on a single instance are the reason Postgres and MySQL are still the default options for most projects. But when you start hitting limits, it's time to scale out.
Time to shard the database.
"If you've never sharded a database before, here's the idea: instead of vertically scaling a database with progressively heftier instances, horizontally scale by partitioning data across multiple databases."
📗 Notion's Herding elephants: Lessons learned from sharding Postgres at Notion tells the story of how the company split its main database to enable more growth and increase the overall performance of its app. Garrett Fidalgo does a great job telling and documenting the story, talking about trade-offs and counter-arguments. What makes it even more worth reading is the "lessons learned" section that will benefit all engineers working on scaling a product. The combined primary key argument is something everyone should consider when starting a new project.
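As a rough illustration of the horizontal partitioning idea quoted above, the sketch below routes each row to a database by hashing a partition key. It is not Notion's implementation: the partition key (a workspace ID), the shard counts, and the function names are all hypothetical.

```python
import hashlib

# Hypothetical sizes: many small logical shards spread over a few databases,
# so logical shards can later be moved without rehashing every key.
NUM_LOGICAL_SHARDS = 16
NUM_DATABASES = 4


def logical_shard(partition_key: str) -> int:
    """Map a partition key to a logical shard with a stable hash."""
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def database_for(partition_key: str) -> int:
    """Map a key to a physical database via contiguous logical shard ranges."""
    return logical_shard(partition_key) * NUM_DATABASES // NUM_LOGICAL_SHARDS


if __name__ == "__main__":
    for workspace_id in ("workspace-1", "workspace-2", "workspace-3"):
        print(workspace_id, "->", "db", database_for(workspace_id))
```

A stable hash such as SHA-256 is used instead of Python's built-in hash(), which is randomized per process for strings and would send the same key to different shards across restarts.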
How Not to Run an A/B Test
Your A/B test dashboard has a critical flaw.
It's called the repeated significance testing error.
The solution: lock the dashboard, and put the password in a safe that only opens after a fixed time. But why?
"If you run A/B tests on your website and regularly check ongoing experiments for significant results, you might be falling prey to what statisticians call repeated significance testing errors. As a result, even though your dashboard says a result is statistically significant, there's a good chance that it's actually insignificant."
The simple fact that you're checking your dashboard daily can be a big problem.
"For example, if you peek at an ongoing experiment ten times, then what you think is 1% significance is actually just 5% significance."
📗 Evan Miller's How Not to Run an A/B Test is a classic 2010 article that every product and data leader should read. It explains the statistics behind the most common A/B testing mistake. Thankfully, the author also lists solutions to this problem, namely 1) No peeking, 2) Sequential experiments, and 3) Bayesian experiments.
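The effect of peeking is easy to reproduce with a small simulation. The sketch below runs A/A tests, where both variants share the same true conversion rate, so every "significant" result is a false positive. It compares checking once at the end against flagging the test as soon as any of ten interim looks crosses the 5% threshold. The conversion rate, batch size, and number of experiments are arbitrary assumptions, and the test used is a plain two-proportion z-test.

```python
import math
import random

Z_CRIT = 1.96  # two-sided 5% significance threshold


def is_significant(conv_a: int, conv_b: int, n: int) -> bool:
    """Two-proportion z-test with n visitors in each arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return False
    return abs(conv_a / n - conv_b / n) / se > Z_CRIT


def simulate(experiments=1000, peeks=10, batch=100, rate=0.2, seed=42):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _experiment in range(experiments):
        conv_a = conv_b = n = 0
        flagged = False
        significant = False
        for _peek in range(peeks):
            # Both arms draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _visit in range(batch))
            conv_b += sum(rng.random() < rate for _visit in range(batch))
            n += batch
            significant = is_significant(conv_a, conv_b, n)
            flagged = flagged or significant
        peeking_hits += flagged      # significant at any of the looks
        final_hits += significant    # significant at the last look only
    return peeking_hits / experiments, final_hits / experiments


if __name__ == "__main__":
    peeking_rate, final_rate = simulate()
    print(f"false positives with peeking: {peeking_rate:.1%}")
    print(f"false positives with one final look: {final_rate:.1%}")
```

The single final look should land near the nominal 5%, while the peeking rate comes out noticeably higher, which is the inflation the article describes.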
How SoundCloud Does System Handovers
Following my post about Architecture Decision Records, let's zoom out and look at a significant problem they solve: system handovers.
If an individual or team cannot continue owning a system, how can you ensure the new owners can do so with continuity and serenity?
"Having experienced not-so-successful handovers — some of which took place over the course of a one-hour meeting — I was inspired to create a guideline that will help other teams do handovers differently."
So what should a handover focus on?
"The goal is to help the new team understand the what, why, and how of the system, and to empower them to maintain, change, and improve it."
📗 SoundCloud's How to Successfully Hand Over Systems gives us a great peek into how SoundCloud tackled this problem internally. Aleksandra Gavrilovska lists the main questions handover processes should answer and document for the new team. Interestingly, she explains that the new team should own the process so they can decide whether they're comfortable with the handover.
PS: Such documentation would be helpful in other situations. I'll throw out an idea: just as some product teams do "pre-mortems", let's do "pre-handovers" to document production systems!
Ben Johnson is All-In on Server-Side SQLite
The world's most used database is back in the spotlight.
Workloads are moving to the edge to feel snappier to users, but databases aren't following. Today's distributed databases aren't built to run as small deployments in hundreds of regions. So naturally, a new space of opportunities has opened: edge datastores.
The problem: distributing the database at the edge while keeping the developer experience on par with what developers currently have, e.g. SQL.
That's where SQLite comes in.
"SQLite is an embedded database. It doesn't live in a conventional architectural tier; it's just a library, linked into your application server's process. It's the standard bearer of the "single process application": the server that runs on its own, without relying on nine other sidecar servers to function."
It's the easiest way to keep the SQL developer experience while replicating the database.
But it's not that simple.
"There are two big reasons everyone doesn't default to SQLite. The first is resilience to storage failures, and the second is concurrency at scale."
📗 Ben Johnson's I'm All-In on Server-Side SQLite is an excellent introduction to that new trend and one of its most popular projects: Litestream. After a quick database history lesson, Ben defends this new model of embedding SQLite within the application. The idea is to go back to a master and read-replica model using embedded databases. On top of that, he argues that deploying databases within the application could improve the developer experience. His point is that combining ultra-low latency with current hardware could effectively remove the need for query optimization for most workloads, as even bad queries should resolve extremely fast. It would thus enable engineers to build fast-by-default applications while spending zero time on optimization.
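To see what "just a library, linked into your application server's process" means in practice, here is a minimal sketch using Python's built-in sqlite3 module. There is no server to start and no network hop: the database runs inside the application process. The table and its contents are made up for illustration, and an in-memory database is used so the example leaves nothing on disk.

```python
import sqlite3

# ":memory:" keeps the database in RAM; a file path would persist it to disk.
conn = sqlite3.connect(":memory:")

conn.execute(
    "CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [("Webhook Security",), ("Sharding Postgres",)],
)
conn.commit()

# Queries are plain function calls into the embedded library.
titles = [row[0] for row in conn.execute("SELECT title FROM posts ORDER BY id")]
print(titles)

conn.close()
```

Replication tools such as Litestream work on top of a file-backed database, so a real deployment would pass a file path instead of ":memory:".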
On a similar note, Cloudflare recently announced D1, their SQLite-based database that will run close to the code (and will have a feature to run code close to the database).
See, SQLite is exciting (again)!