Learning from Giants #64
The software architecture behind Cloudflare's 1.1.1.1, a framework to beat larger competitors, and the full theory behind graceful behavior when systems reach full capacity.
🎅 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies. Guaranteed 100% GPT-free content.
How Rust and Wasm power Cloudflare's 1.1.1.1
1.1.1.1, Cloudflare's DNS resolver, has grown exponentially since its 2018 release.
Unsurprisingly, Cloudflare started with an off-the-shelf solution called Knot Resolver but quickly outgrew it and needed more and safer customization.
"We forged a platform that we are happy with, and we call it BigPineapple."
BigPineapple is Cloudflare's DNS resolution system, hosted in hundreds of edge data centers. It was built with the following goals in mind:
High performance and throughput. Total latency should be in the milliseconds.
Correctness. DNS resolution is old and has many edge cases and non-RFC-compliant servers that must be supported.
Extensibility. Cloudflare's value-add beyond a simple DNS resolver is in the customization and new use cases it can enable.
The article describes BigPineapple's system architecture and design choices. It's built in Rust on the Tokio async runtime and composed of separate components deployed together. Here are the most interesting ones:
The server speaks many protocols, because DNS is queried over many of them, and turns these heterogeneous requests into a common DNS query format that Cloudflare calls "Frames." It offloads query resolution to a set of workers.
The cache is central to the overall server performance. DNS resolution can be slow because it involves multiple network requests, which is why caching matters. The DNS space is too large to fit in a single node's cache, so Cloudflare uses consistent hashing to route queries to specific nodes within the same data center, splitting the cache across them.
Conductor manages outbound connections smartly. Every non-cached DNS query leads to network calls to upstream DNS nameservers with different performance, capacity, and availability. "The conductor is able to make these decisions by tracking the upstream server's metrics, such as RTT, QoS, etc."
The Sandbox is the extensibility component. It runs WebAssembly modules as "callbacks" that are registered at different points of the BigPineapple request lifecycle.
"BigPineapple's Wasm runtime is currently powered by Wasmer. The runtime allows each module to run in its own instance, with an isolated memory and a signal trap, which naturally solved the module isolation problem we described before."
📗 Anbang Wen's How Rust and Wasm power Cloudflare's 1.1.1.1 describes the architecture of one of the largest DNS resolvers on the planet. It goes into sufficient detail to be insightful, calling out protocols and techniques the team used to build and operate the system at such a scale.
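To make the cache-sharding idea concrete, here is a minimal Python sketch of consistent hashing. It is not Cloudflare's implementation (BigPineapple is written in Rust, and this summary does not detail its hashing scheme); the HashRing class, the node names, the SHA-256 hash, and the 64 virtual nodes per server are all assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (hypothetical helper)."""

    def __init__(self, nodes, vnodes=64):
        self._vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Each node appears at several positions to spread load more evenly.
        for i in range(self._vnodes):
            point = ring_position(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node):
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, qname):
        # DNS names are case-insensitive, so normalize before hashing.
        position = ring_position(qname.lower())
        # Pick the first ring position after the key, wrapping around.
        idx = bisect.bisect_right(self._points, position) % len(self._points)
        return self._owner[self._points[idx]]


ring = HashRing(["node-a", "node-b", "node-c"])
cache_node = ring.node_for("example.com")  # always the same node for this name
```

With a ring like this, removing a node only remaps the names that node owned; every other name keeps landing on the same cache, which is the property that makes the technique attractive for splitting a cache across machines.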
How to Beat Larger Competitors
"It doesn't matter how big your company is, there is always someone bigger."
Start-ups often shrug off larger companies when talking about competition. We build our two-dimensional competitive charts without them and fill them with smaller companies that look just like us. That's wrong in many ways. Even if large companies aren't your main risk or focus, their size can threaten your business.
"Big companies can outlast you, so they don't need to win, they just need to wait for you to fail."
So, you need a strategy to beat and protect yourself against them. You need to take the initiative.
The author came up with a framework to think about your fight against these larger companies:
Deposition
"Strategy 1: Deposition. You can [...] frame their product as "last generation" or "table stakes". This is called depositioning."
Depositioning is a powerful marketing message, but it is even stronger when you build your entire company and product around that message. One example is all the new ERP-like companies whose central message is "Old ERPs are old and unusable, ERPs are dead, we're the future".
Commoditize
"Strategy 2: Commoditize. Once a business is big, it cannot easily adjust to a new business model [...] If you can offer a similar service for 1/10th of the price it's unlikely they can find a way to compete."
Driving price down is a risky strategy for your image and not always the best way to build a healthy business. Still, if it's powered by a fundamental change in how the business is done, it can be the best competitive advantage.
Disrupt
"Strategy 3: Disrupt. If you do something that is radically different then you have a chance to disrupt their business."
Delivering similar value in a better way is also a recipe for explosive growth because, depending on switching costs, the existing market can switch to your solution in a landslide. Think Uber vs. Taxis. While the word is overused, disruption is probably the most common way start-ups win. It's also the most common way they lose to smaller competitors once they're giants. A never-ending cycle!
"Leaders in any given category are rarely leaders for more than a decade or two, as smaller competitors eventually find ways to chip away at their advantages."
📗 How to Beat Larger Competitors gives an actionable framework to think about big companies in your space. In some markets, beating larger competitors can also give you access to large existing markets without needing to create a new one, which can tremendously accelerate growth.
Graceful behavior at Capacity
🚒 "The database is stuck at 100%; all requests are timing out!"
vs.
😎 "We're currently experiencing issues; clients may see elevated error rates."
This shows the difference between graceful degradation and total outage from the outside. Yet from the inside, the same thing may be happening: high contention for some shared resource, like a database. Without proper protection in place, high contention can spiral into congestion collapse.
"Congestion collapse is when a system is receiving a high request rate but achieving throughput much lower than its capacity."
Once you've identified the bottleneck (every system becomes one at some point, but the first to saturate is the limiting one), there are different ways to protect it, all with the same goal:
"The basic strategy to resolve contention is to deliberately limit concurrency to levels which don't cause undue contention."
Adding a request queue only absorbs short traffic spikes. Under sustained heavy load, queues fill until either latency becomes unacceptable or they are full.
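A quick back-of-the-envelope calculation shows why a queue only buys time. The sketch below is illustrative: the function names and the example rates are assumptions, not figures from the article.

```python
def seconds_until_full(arrival_rate, service_rate, capacity):
    """Time for an empty queue to fill when requests arrive faster than served.

    Rates are in requests per second; capacity is in requests.
    Returns None when the server keeps up and the queue never fills.
    """
    if arrival_rate <= service_rate:
        return None
    return capacity / (arrival_rate - service_rate)


def queueing_delay(queue_depth, service_rate):
    """Seconds a new request waits behind `queue_depth` queued requests."""
    return queue_depth / service_rate


# Hypothetical numbers: 120 req/s arriving, 100 req/s served, room for 1000.
time_to_full = seconds_until_full(120, 100, 1000)  # 50.0 seconds
delay_when_full = queueing_delay(1000, 100)        # 10.0 seconds of waiting
```

Under these assumed numbers, a 20% overload fills the queue in under a minute, and by then every request waits ten seconds before any work starts. A bigger queue delays the first moment but makes the second one worse.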
"There's only one solution: somehow have less work to do. We refer to either or both of these strategies as backpressure."
There are two general ways to apply backpressure:
Flow control is a mechanism through which the server asks clients to slow down, so that it only receives what it can process. But that requires controlling most clients.
Load shedding is simple: drop some of the requests entering your system and accept only what it can process. Multiple strategies exist to select which requests to drop, from random selection to request tiering.
📗 Graceful behavior at capacity explains that because all systems have limits, it's essential to plan for when such limits are reached. The author then describes the main concepts and techniques commonly employed to protect systems under heavy load and ensure they don't degrade beyond their own capacity.