System Design Basics
You’re gonna need a bigger boat.
Designing a system that supports millions of users is challenging, and it is a journey that requires continuous refinement and endless improvement
A distributed system is a system with many components that are located on different machines. These machines communicate with one another to complete tasks such as adding a new user to the database and returning data from API requests. Despite being a distributed system with many moving parts, the user sees one cohesive process.
System design interview questions are the most difficult to tackle among all the technical interviews. The questions require the interviewees to design an architecture for a software system, which could be a news feed, Google search, chat system, etc. These questions are intimidating, and there is no certain pattern to follow. The questions are usually very big scoped and vague.
Systems design interviews are more relevant for senior positions but you may get asked general questions at the mid-level position as well. This article is to give you a taste of what concepts are involved in learning how to build a distributed system.
The more you research and learn about this huge topic, the more you will feel that you’re gonna need a bigger boat. 🛥️
Purpose of System Design
How do we architect a system that supports the functionality and requirements of a system in the best way possible? The system can be “best” across several different dimensions in system-level design. These dimensions include:
- Scalability: a system is scalable if it is designed so that it can handle additional load and will still operate efficiently.
- Reliability: a system is reliable if it can perform the function as expected, it can tolerate user mistakes, is good enough for the required use case, and it also prevents unauthorized access or abuse.
- Availability: a system is available if it is able to perform its functionality (uptime/total time). Note reliability and availability are related but not the same. Reliability implies availability but availability does not imply reliability.
- Efficiency: a system is efficient if it is able to perform its functionality quickly. Latency, response time and bandwidth are all relevant metrics to measuring system efficiency.
- Maintainability: a system is maintainable if it easy to make operate smoothly, simple for new engineers to understand, and easy to modify for unanticipated use cases.
Single server setup
To start with something simple, everything is running on a single server.
- Users access websites through domain names.
- Internet Protocol (IP) address is returned to the browser or mobile app.
- Once the IP address is obtained, Hypertext Transfer Protocol (HTTP) requests are sent directly to your web server.
- The web server returns HTML pages or JSON response for rendering.
Database
With the growth of the user base, one server is not enough, and we need multiple servers: one for web/mobile traffic, the other for the database. Separating web/mobile traffic (web tier) and database (data tier) servers allows them to be scaled independently.
You can choose between a traditional relational database and a non-relational database.
Relational databases are also called a relational database management system (RDBMS) or SQL database. The most popular ones are MySQL, Oracle database, PostgreSQL, etc. Relational databases represent and store data in tables and rows. You can perform join operations using SQL across different database tables.
Non-Relational databases are also called NoSQL databases. Popular ones are MongoDB, CouchDB, Cassandra, HBase, Amazon DynamoDB, etc. These databases are grouped into four categories: key-value stores, graph stores, column stores, and document stores. Join operations are generally not supported in non-relational databases.
Non-relational databases might be the right choice if:
- Your application requires super-low latency.
- Your data are unstructured, or you do not have any relational data.
- You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
- You need to store a massive amount of data.
Vertical scaling, referred to as “scale up”, means the process of adding more power (CPU, RAM, etc.) to your servers. Horizontal scaling, referred to as “scale-out”, allows you to scale by adding more servers into your pool of resources.
When traffic is low, vertical scaling is a great option, and the simplicity of vertical scaling is its main advantage. Unfortunately, it comes with serious limitations:
- Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
- Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.
Horizontal scaling is more desirable for large scale applications due to the limitations of vertical scaling.
In the previous design, users are connected to the web server directly. Users will unable to access the website if the web server is offline.
In another scenario, if many users access the web server simultaneously and it reaches the web server’s load limit, users generally experience slower response or fail to connect to the server.
Load Balancer
A load balancer is the best technique to address these problems. A load balancer evenly distributes incoming traffic among web servers that are defined in a load-balanced set.
Users connect to the public IP of the load balancer directly. With this setup, web servers are unreachable directly by clients anymore.
For better security, private IPs are used for communication between servers. A private IP is an IP address reachable only between servers in the same network; however, it is unreachable over the internet. The load balancer communicates with web servers through private IPs.
After a load balancer and a second web server are added, we successfully solved no failover issue and improved the availability of the web tier.
- Scenario 1: If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We will also add a new healthy web server to the server pool to balance the load.
- Scenario 2: If the website traffic grows rapidly, and two servers are not enough to handle the traffic, the load balancer can handle this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.
Now the web tier looks good, what about the data tier? The current design has one database, so it does not support failover and redundancy. Database replication is a common technique to address those problems.
Database Replication
A master database generally only supports write operations. A slave database gets copies of the data from the master database and only supports read operations. All the data-modifying commands like insert, delete, or update must be sent to the master database.
Most applications require a much higher ratio of reads to writes; thus, the number of slave databases in a system is usually larger than the number of master databases.
The following advantages are achieved in database replication:
- Better performance: In the master-slave model, all writes and updates happen in master nodes; whereas, read operations are distributed across slave nodes. This model improves performance because it allows more queries to be processed in parallel.
- Reliability: If one of your database servers is destroyed by a natural disaster, data is still preserved. You do not need to worry about data loss because data is replicated across multiple locations.
- High availability: By replicating data across different locations, your website remains in operation even if a database is offline as you can access data stored in another database server.
The illustration below shows a master database with multiple slave databases:
- Scenario 1: If only one slave database is available and it goes offline, read operations will be directed to the master database temporarily. As soon as the issue is found, a new slave database will replace the old one. In case multiple slave databases are available, read operations are redirected to other healthy slave databases. A new database server will replace the old one.
- Scenario 2: If the master database goes offline, a slave database will be promoted to be the new master. All the database operations will be temporarily executed on the new master database. A new slave database will replace the old one for data replication immediately. In production systems, promoting a new master is more complicated as the data in a slave database might not be up to date. The missing data needs to be updated by running data recovery scripts.
Let us take a look at the design:
- A user gets the IP address of the load balancer from DNS.
- A user connects the load balancer with this IP address.
- The HTTP request is routed to either Server 1 or Server 2.
- A web server reads user data from a slave database.
- A web server routes any data-modifying operations to the master database. This includes write, update, and delete operations.
Now, you have a solid understanding of the web and data tiers, it is time to improve the load/response time. This can be done by adding a cache layer and shifting static content (JavaScript/CSS/image/video files) to the content delivery network (CDN).
Caching
A cache is a temporary storage area that stores the result of expensive responses or frequently accessed data in memory so that subsequent requests are served more quickly.
In our latest illustration, every time a new web page loads, one or more database calls are executed to fetch data. The application performance is greatly affected by calling the database repeatedly. The cache can mitigate this problem.
The cache tier is a temporary data store layer, much faster than the database. The benefits of having a separate cache tier include better system performance, ability to reduce database workloads, and the ability to scale the cache tier independently.
After receiving a request, a web server first checks if the cache has the available response. If it has, it sends data back to the client. If not, it queries the database, stores the response in cache, and sends it back to the client. This caching strategy is called a read-through cache.
Other caching strategies are available depending on the data type, size, and access patterns.
Here are a few considerations for using a cache system:
- Consider using cache when data is read frequently but modified infrequently. Since cached data is stored in volatile memory, a cache server is not ideal for persisting data.
- It is a good practice to implement an expiration policy. Once cached data is expired, it is removed from the cache.
- Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction. When scaling across multiple regions, maintaining consistency between the data store and cache is challenging.
- A single cache server represents a potential single point of failure defined as a part of a system that, if it fails, will stop the entire system from working. As a result, multiple cache servers across different data centers are recommended to avoid SPOF.
CDN
A CDN is a network of geographically dispersed servers used to deliver static content. CDN servers cache static content like images, videos, CSS, JavaScript files, etc. When a user visits a website, a CDN server closest to the user will deliver static content. Intuitively, the further users are from CDN servers, the slower the website loads.
After adding a CDN and caching to our scaling system example:
- Static assets (JS, CSS, images, etc.,) are no longer served by web servers. They are fetched from the CDN for better performance.
- The database load is lightened by caching data.
System Design Interview Questions
Design system questions are essentially brainstorming sessions. Your final design is not as important as the thinking process behind your design choices.
- Design a URL shortening service like TinyURL
- Design Instagram
- Design Twitter
- Design Youtube or Netflix
Most design-system questions involve fundamental knowledge of computer science.
- Networking: HTTP, IPC, TCP/IP, throughput, latency, how internet works and others
- Database basics: SQL vs No-SQL, types of databases, hashing, indexing, sharding, caching, and others
- Real-world performance: relevant performance of RAM, disk, SSD, and your network
- Basics of web architecture: proxy, load balancers, database servers, caching servers, logging and others
Some general steps you can take as you go through the interview process:
- Talk out loud — Communication is the most important part of interviewing including system design questions, so explain every decision you make out loud. Interviewers cannot read your mind (as far as I know), so you need to show how well you communicate.
- Identify requirements — Ask the interviewer questions to clarify everything you need to know before designing the system. Think about the use cases that are expected to occur. Find the exact scope that the interviewer has in mind. Such as how many users it will have, how much storage and server capacity you need, specific questions on the system functionality, etc.
- Capacity estimation — To define the capacity you need to build the system and the capacity you need to scale the system. Think about the read-to-write ratio, the number of concurrent requests, and various data limitations. Often, you will need to define three aspects: traffic estimates, storage estimates, and memory estimates. Depending on the level of detail required for the SDI, this part maybe not needed.
- Design the high-level system — The goal of this abstract design is to define all the important components that your architecture needs. This often involves defining APIs, database schema, data flow, servers, architecture and other basic components of the system. Start with the entry-points and work your way up to the database.
- Identify and try to resolve bottlenecks — Your high-level design is likely to have one or more bottlenecks when you finish it. It’s okay. You are not expected to design a system from scratch that can handle millions of users, in just one hour. Look for possible bottlenecks that can slow down or hinder the functions of the system. Maybe your system can’t scale and it needs a load balancer. Or maybe it has a problem with security with the current database schema.
Resources 🏆
The first video is a presentation given by Brett Beekley @ Google that covers these topics and much more. The other two are common system design interview questions of designing TikTok and Instagram. The last one is a collection of system design videos by Gaurav Sen.
1-2022 Update: Roberto Vitillo’s Distributed Systems book is great too!