Learning from Giants #43
How NAT traversal works in Tailscale, Slack's global messaging architecture, and a slow product development investigation framework.
👋 Hi, this is Mathias with your weekly drop of the 1% best, most actionable, and timeless resources to grow as an engineering or product leader. Handpicked from the best authors and companies.
How NAT traversal works
Software engineering is a world of abstracted complexity. And when you need to go deeper, you can get away with asking ChatGPT.
...Most of the time.
Networking is the best example. How much do you know about UDP, TCP, or NAT when all you see is HTTP requests?
Tailscale is a WireGuard-based, peer-to-peer VPN. Part of their magic is how they can set up these peer-to-peer connections in very tough network environments, like between two devices both behind firewalls and enterprise network routers. The key? NAT (Network Address Translation) traversal: figuring out the IP to call that peer on even though they're behind a router, and opening firewalls to talk to that IP.
First: getting through a firewall that is most often closed to inbound connections.
"Stateful firewalls are the simpler of our two problems. [...] Stateful firewalls remember what packets they’ve seen in the past and can use that knowledge when deciding what to do with new packets that show up."
So from these first principles, you understand that if each device makes an outbound connection to the other, then both firewalls will switch to an open state for that destination. When retrying, the devices will be able to talk to each other.
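To make that reasoning concrete, here is a minimal simulation of two stateful firewalls, written in Python. It is only a toy model under loud assumptions: the `StatefulFirewall` class, the `deliver` helper, and the example addresses are all hypothetical, and a real firewall also tracks protocols, timeouts, and more. It only shows why the first packet is dropped and the retry gets through.

```python
from dataclasses import dataclass, field

Endpoint = tuple[str, int]  # (ip, port); hypothetical alias for this sketch


@dataclass
class StatefulFirewall:
    """Toy firewall: inbound is allowed only for flows already seen outbound."""
    name: str
    open_flows: set[tuple[Endpoint, Endpoint]] = field(default_factory=set)

    def send(self, local: Endpoint, remote: Endpoint) -> None:
        # Outbound packets always pass and leave a flow entry behind.
        self.open_flows.add((local, remote))

    def accepts(self, remote: Endpoint, local: Endpoint) -> bool:
        # Inbound passes only if this side already sent to that remote.
        return (local, remote) in self.open_flows


def deliver(src_fw: StatefulFirewall, src: Endpoint,
            dst_fw: StatefulFirewall, dst: Endpoint) -> bool:
    """Send one packet from src to dst; return True if dst's firewall lets it in."""
    src_fw.send(src, dst)
    return dst_fw.accepts(src, dst)


# Example addresses from the documentation ranges, chosen arbitrarily.
alice = ("198.51.100.7", 41641)
bob = ("203.0.113.9", 41641)
fw_a = StatefulFirewall("alice-fw")
fw_b = StatefulFirewall("bob-fw")

first_a_to_b = deliver(fw_a, alice, fw_b, bob)  # dropped: Bob's firewall has no state yet
first_b_to_a = deliver(fw_b, bob, fw_a, alice)  # passes: Alice already sent to Bob
retry_a_to_b = deliver(fw_a, alice, fw_b, bob)  # passes: Bob has now sent to Alice

print(first_a_to_b, first_b_to_a, retry_a_to_b)
```

Note that the sketch assumes each peer already knows the other's ip:port, which is exactly the part NATs complicate.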
Simple, right?
"Well, not quite. For this to work, our peers need to know in advance what ip:port to use for their counterparts. This is where NATs come into play, and ruin our fun."
📗 David Anderson's How NAT traversal works is a reference article on the concepts and tricks required to do a best-effort NAT traversal to establish a direct connection between two machines. It starts from very clear first principles, and builds on them to discuss even the trickiest details.
Slack’s messaging architecture
"Our servers serve tens of millions of channels per host, tens of millions of connected clients, and our system delivers messages across the world in 500ms."
Only four services power Slack's messaging experience: Channel Servers, Gateway Servers, Admin Servers, and Presence Servers. All written in Java.
"Channel Servers (CS) are stateful and in-memory, holding some amount of history of channels. Every CS is mapped to a subset of channels based on consistent hashing."
"Gateway Servers (GS) are stateful and in-memory. They hold users' information and websocket channel subscriptions. This service is the interface between Slack clients and CSs."
"Admin Servers (AS) are stateless and in-memory. They interface between our Webapp backend and CSs." Webapp is Slack's API backend; Admin handles the business logic behind it.
"Presence Servers (PS) are in-memory and keep track of which users are online."
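The consistent hashing mentioned for Channel Servers can be sketched as a small hash ring. This is a generic illustration in Python, not Slack's implementation: the `HashRing` class, the virtual-node count, the SHA-256 hash, and the host and channel names are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable 64-bit hash; SHA-256 is an arbitrary choice for this sketch.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring mapping channel IDs to hosts."""

    def __init__(self, hosts, vnodes=100):
        self.vnodes = vnodes
        self._hosts = set(hosts)
        self._rebuild()

    def _rebuild(self):
        # Each host owns several points ("virtual nodes") on the ring.
        self._points = sorted(
            (_hash(f"{host}#{i}"), host)
            for host in self._hosts
            for i in range(self.vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def add(self, host):
        self._hosts.add(host)
        self._rebuild()

    def remove(self, host):
        self._hosts.discard(host)
        self._rebuild()

    def lookup(self, channel_id: str) -> str:
        # The owner is the first point clockwise from the channel's hash.
        i = bisect.bisect(self._hashes, _hash(channel_id)) % len(self._points)
        return self._points[i][1]


# Hypothetical host and channel names.
ring = HashRing(["cs-1", "cs-2", "cs-3"])
channels = [f"C{n}" for n in range(1000)]
before = {c: ring.lookup(c) for c in channels}

ring.add("cs-4")
moved = [c for c in channels if ring.lookup(c) != before[c]]
print(f"{len(moved)} of {len(channels)} channels moved after adding cs-4")
```

The point of the design is that adding a host only moves the channels the new host takes over; every other channel keeps its owner, which matters when each host holds channel state in memory.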
📗 Sameera Thangudu's Real-time messaging at Slack is a very dense write-up on the company's high-level architecture. It also shows that sharding and keeping most information in memory can take a company quite far. And for ephemeral data like Presence, sometimes you don't even need to persist it. The short article leaves many questions unanswered, like how Channel Servers handle persistence or how search works on top of messages. Sameera, we would love to see a follow-up post!
Slow product development investigation framework
“Why is the development of new features always so slow!?!?”
As your company grows and your product matures, delivery speed will inevitably slow. There are many good reasons, like having customers that rely on your product to run their business.
But there are also many bad ones. Some organizational or people challenges can slow product development to a halt. Jevin Maltais sorts them into two categories: issues that lead to projects never starting, or never ending.
Some symptoms you may have seen:
"Lots of meetings around a particular feature/project over weeks (or months!!)"
"No clear decision maker or commitment to decision"
"Projects that are 90% done but never seem to get out the door."
📗 Jevin Maltais's Slow Product Development Investigation Framework gives tips to figure out the reasons behind these symptoms and recommendations to solve them. Whatever your problem is and the solutions you're trying to set up, remember that a great impact-driven culture takes time and repetition to build. As Jevin puts it, however you decide to work as a team, "The key is picking one and STICK TO IT then iterate."