Introduction
First, I would like to talk about the scale of Twitter, which will help us understand the requirements of a system like Twitter.
Twitter handles about 600 new tweets per second. In contrast, approximately 60,000 tweets are read each second. This clearly shows that Twitter is a read-heavy system.
There are approximately 300 million users of Twitter. Some of them are active users (users who frequently visit Twitter), while most of them are passive users (i.e. users who only visit once in a while, say once a month).
Features (i.e. Functional requirements)
Home timeline
- every user has a home timeline
- a home timeline contains tweets from the users you follow (i.e. your followees), or tweets recommended by the Twitter recommendation engine
For the sake of simplicity, we will not include the tweets from the recommendation engine in the home timeline. Therefore, the home timeline will only include the tweets from your followees.
User timeline
- each user has a user timeline
- the user timeline shows the tweets made by that user
Search timeline
- suppose you search "world cup" on Twitter search; the search results will show you a number of tweets related to "world cup"
- this list of tweets is called the search timeline
Non-functional requirements
It is clear from the description that we need to optimize our design for the following features:
- scalability: It is clear that we need a scalable system that can handle 300 million users
- availability: If a user tries to view a tweet, they should be able to see it. You typically do not want to show "Something went wrong" error screens, which make for a poor user experience
- eventual consistency: If a tweet is delivered to your followers with a few second delay, it should not be a huge problem (typically)
- causal consistency: If I make a new tweet, it should be shown to me immediately, or I might make a duplicate new tweet thinking that the previously created tweet was not successfully posted
Low level design
Considering the non-functional requirements, it should be clear that we would require a NoSQL database, particularly MongoDB or Cassandra.
This is because NoSQL databases are easier to scale, and both MongoDB and Cassandra are built to favour availability over consistency.
I wanted to talk about the database schema, but I would like to emphasize that NoSQL databases do not have a schema in the same sense as a relational (SQL) database does.
Therefore, let's instead talk about API endpoints (request/response shapes), since they are database agnostic.
- User
We should be able to fetch a user by a user id or email. A typical response would look like:
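A sketch of such a user object, written as a Python dict (all field names here are illustrative assumptions, not Twitter's actual API):

```python
# Hypothetical shape of a user object returned by a user-lookup endpoint.
user = {
    "id": "u_123",
    "email": "alice@example.com",
    "name": "Alice",
    "followees": ["u_456", "u_789"],  # user ids this user follows
    "created_at": "2021-05-01T10:00:00Z",
}
```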
- Tweet
Each tweet has a limit of 140 characters, and each tweet can contain images/videos. The basic structure of a tweet object might look like the following:
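A sketch of a tweet object as a Python dict (field names are illustrative assumptions):

```python
# Hypothetical shape of a tweet object.
tweet = {
    "id": "t_001",
    "author_id": "u_123",
    "text": "Hello world!",  # limited to 140 characters
    "media": ["https://cdn.example.com/img1.jpg"],  # optional images/videos
    "created_at": "2021-05-01T10:05:00Z",
}

# The 140-character limit would be enforced when the tweet is written.
assert len(tweet["text"]) <= 140
```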
- User timeline
We should be able to retrieve all the posts made by a particular user. Since there may be a large number of tweets/posts, we can also implement pagination. A sample REST API request and response might look like:
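A sketch of the paginated request and response, as Python dicts (parameter and field names are illustrative assumptions; cursor-based pagination is one common choice):

```python
# Hypothetical request parameters for fetching a user's timeline page.
request = {
    "user_id": "u_123",
    "limit": 20,     # page size
    "cursor": None,  # opaque cursor from a previous page, if any
}

# Hypothetical response: a page of tweets plus a cursor for the next page.
response = {
    "tweets": [
        {"id": "t_002", "text": "second tweet", "created_at": "2021-05-02T09:00:00Z"},
        {"id": "t_001", "text": "first tweet", "created_at": "2021-05-01T10:05:00Z"},
    ],
    "next_cursor": "t_001",  # pass back as `cursor` to fetch the next page
}
```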
How to generate home page timeline/feed
Suppose we need to generate the feed for alice, and suppose alice follows bob and charlie. Therefore, in our simplified Twitter, alice should see the most recent (say 20) posts from bob and charlie.
When we need to generate the home feed for alice, the simplest option is to use the following algorithm:
- Find the followees of alice, using the "followees" field. This will give us the user ids of bob and charlie.
- Fetch all the tweets from bob and charlie using their user timelines respectively (this will involve a database call per followee).
- Sort the aggregated tweets from bob and charlie and return the top 20 tweets.
In fact, I previously solved a similar problem on LeetCode, and this was the solution I came up with (which follows the above algorithm):
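A minimal sketch of the same merge-and-sort approach (not the original LeetCode solution; the in-memory data model and names are illustrative stand-ins for database tables):

```python
import heapq

# In-memory stand-ins for the database (all names and data are illustrative).
followees = {"alice": ["bob", "charlie"]}
user_timelines = {
    # each entry: (timestamp, tweet text), newest first
    "bob": [(105, "bob's latest"), (101, "bob's first")],
    "charlie": [(103, "charlie's latest"), (100, "charlie's first")],
}

def home_feed(user, k=20):
    """Merge the followees' timelines and return the k most recent tweets."""
    merged = []
    for followee in followees.get(user, []):
        # In production, each of these lookups is a database call.
        merged.extend(user_timelines.get(followee, []))
    # Rank by timestamp, newest first, keeping only the top k.
    return heapq.nlargest(k, merged)
```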
The problems with this approach are:
- it is expensive: for a user with 30 followees, we make 30 expensive database calls, and then we also need to sort the tweets
- it generates the home feed on the fly (an expensive operation that makes the user wait)
To improve the performance of the above approach, the idea is to keep each user's timeline (both home timeline and user timeline) in Redis.
Whenever bob or charlie posts a new tweet, a fanout service picks up the new tweet and pushes it into alice's home feed.
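The fanout step can be sketched as follows. Here a plain dict of lists stands in for Redis (in production these would be Redis list operations such as LPUSH and LTRIM), and all names are illustrative:

```python
# Dict-of-lists stand-in for Redis: key = user id, value = cached home timeline.
home_timelines = {}

followers = {"bob": ["alice"], "charlie": ["alice"]}  # who follows whom

MAX_TIMELINE = 800  # cap each cached timeline, since cache memory is finite

def fanout(author, tweet_id):
    """Push a new tweet id onto every follower's cached home timeline."""
    for follower in followers.get(author, []):
        timeline = home_timelines.setdefault(follower, [])
        timeline.insert(0, tweet_id)  # newest first (like Redis LPUSH)
        del timeline[MAX_TIMELINE:]   # trim the tail (like Redis LTRIM)

fanout("bob", "t_105")
fanout("charlie", "t_106")
```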
But even with this approach, there is one problem. Consider a famous user who has 100 million followers. When he tweets, the fanout service needs to update the timelines of 100 million users. This is an incredibly expensive operation.
The solution is to segregate users into regular users and famous users.
The fanout method should only run for regular users. To collate the tweets by famous users, we should fetch them on the fly. Therefore, the home timeline generation service would typically keep a separate list of famous followees for each user. The tweets from famous users are fetched on the fly and combined with the precomputed home feed from Redis (which only contains tweets from regular users).
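The hybrid read path can be sketched like this (all data and names are illustrative): the precomputed feed covers regular followees, while tweets from famous followees are fetched at read time and merged in.

```python
import heapq

# Precomputed home feed from the cache (regular followees only): (ts, text).
cached_feed = {"alice": [(105, "bob's tweet"), (101, "bob's older tweet")]}

# Famous followees are tracked separately per user.
famous_followees = {"alice": ["celebrity"]}

# Recent tweets by famous users, fetched on the fly at read time.
famous_timelines = {"celebrity": [(104, "celebrity's tweet")]}

def hybrid_home_feed(user, k=20):
    """Merge the cached feed with on-the-fly tweets from famous followees."""
    tweets = list(cached_feed.get(user, []))
    for famous in famous_followees.get(user, []):
        tweets.extend(famous_timelines.get(famous, []))
    return heapq.nlargest(k, tweets)  # newest first
```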
This approach also enables us to distribute the tweets of famous users using a more optimal method, for example through a CDN.
Lastly, I would like to note that the publisher/subscriber model of communication is probably a more natural approach to solve these kinds of problems.
- each user is a publisher
- and his followers are his subscribers
- when alice follows bob, she is subscribing to bob. Therefore:
- bob -> publisher
- alice -> subscriber
- when bob makes a post, all subscribers to bob's posts immediately take some action
- For example, when bob makes a post, alice can store the post in her timeline
Since we will be handling 300 million users, we potentially have 300 million publisher streams and 300 million subscriber streams. Hence, a huge pub/sub system is required.
We can think of using something like Apache Kafka, which is a production-grade pub/sub system. But even with Kafka, this might be a challenge, given the number of publisher streams we need to maintain. So as homework, read more about Kafka.
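A toy illustration of the publisher/subscriber relationship described above (this is just the model, not Kafka; all names are illustrative):

```python
class User:
    def __init__(self, name):
        self.name = name
        self.subscribers = []  # followers, i.e. subscribers to this user
        self.timeline = []     # posts delivered to this user

    def follow(self, other):
        # Following a user means subscribing to their posts.
        other.subscribers.append(self)

    def tweet(self, text):
        # Publishing: every subscriber is notified immediately and
        # stores the post in its own timeline.
        for subscriber in self.subscribers:
            subscriber.timeline.append((self.name, text))

alice, bob = User("alice"), User("bob")
alice.follow(bob)  # bob -> publisher, alice -> subscriber
bob.tweet("hello world")
```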
How to generate search timeline/feed
Twitter uses something called inverted full text search. Here’s how it works:
When you make a tweet, Twitter does the following:
- It identifies the terms within your tweet that can be used to index it. For example, if you tweet "I am supporting manchester united this season.", the Twitter parser may identify "manchester united" as the main search key for this tweet. Note that there may be multiple search keys.
- It stores the search_key : tweet mapping in a key-value store. For each search_key there is a list of tweets that contain that search key; your tweet will be appended to the list of tweets under the search key "manchester united".
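A minimal sketch of such an inverted index as a key-value mapping. Search-key extraction is simplified here to matching against a hard-coded phrase list; a real parser would tokenize and analyze the text:

```python
from collections import defaultdict

# search_key -> list of tweet ids containing that key.
inverted_index = defaultdict(list)

def index_tweet(tweet_id, text, known_keys=("manchester united",)):
    """Append the tweet under every search key found in its text."""
    for key in known_keys:
        if key in text.lower():
            inverted_index[key].append(tweet_id)

def search(query):
    """Look up the list of tweet ids for a search key."""
    return inverted_index.get(query.lower(), [])

index_tweet("t_1", "I am supporting Manchester United this season.")
index_tweet("t_2", "Nice weather today.")
```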
Now whenever a user searches for "manchester united", Twitter can just fetch the list of tweets from the above key-value store, and your tweet will be displayed in the results (based on some ranking).
You might have already observed that such a key-value store needs to be of massive scale, considering we are dealing with 300 million users. That is why this key-value store is not located in a single place; it is probably spread across multiple nodes and data centres.
Therefore, Twitter uses a scatter-gather approach when writing to and reading from this distributed key-value store.
For a given query, each data centre is asked to provide a list of relevant tweets from its databases. These are merged together and finally shown to the user after being ranked.
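The scatter-gather read can be sketched as follows (data centres are simulated as local functions that ignore the query, and ranking is simplified to recency; all names and data are illustrative):

```python
import heapq

# Each "data centre" holds its own shard of (timestamp, tweet) results.
def dc_east(query):
    return [(105, "tweet A"), (100, "tweet C")]

def dc_west(query):
    return [(103, "tweet B")]

DATA_CENTRES = [dc_east, dc_west]

def scatter_gather(query, k=10):
    # Scatter: ask every data centre for its relevant tweets...
    partial_results = [dc(query) for dc in DATA_CENTRES]
    # ...gather: merge the partial lists and rank (here, by recency).
    merged = [t for part in partial_results for t in part]
    return heapq.nlargest(k, merged)
```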