I’m thinking about how Twitter or Twitter-like systems are modeled. @seanconnollydev linked me to a project by FaunaDB devs that discussed their take on it using FaunaDB as the back-end (they’re former Twitter devs). I also analyzed the Twitter v2 API docs to get a sense of what they might be doing for data modeling.
Approaching Fwitter first because it was very straight-forwardly laid out in their docs. They’ve broken Accounts apart from Users
because that allows them to handle multiple authentication methods to authenticate to the same user, so the Account model has the login data (email, password, user reference). The User model has the details to build the user’s profile (alias aka handle, icon, name).
The next core piece they have is the fweet itself. The fields on the fweet (alphabetized, not in importance order) are an asset object (id, url, type: [image|video], cloudName), author (user), comments (count), created_at (timestamp), hashtags (presumably an array of ids into the hashtags table), likes (count), message, original (if a refweet this is a reference to refweeted fweet), refweets (count). It doesn’t look like they’re counting quote fweets, or if they are they are probably putting them into the refweet count.
They also have included a Comment model, which is an odd choice to me. The Comment model holds the message, author, and a reference to the commented fweet. If they would have used a fweet itself as a comment and referenced the fweet they’re commenting on they could have supported all the fields (like assets and hashtags) that any fweet does, but they’re intentionally limiting commenters down to just messages here. Not sure why they made this choice.
They have a hashtag table that holds the name of the hashtag and gets indexed into by fweets using hashtags. This seems like somewhere I would have added a join table between Hashtags and Fweets, but maybe Fauna does something special here that makes that obsolete or slow.
They’ve implemented a table to hold fweetstats (user, fweet, like, refweet, comment where the first two are references and the last 3 are booleans). This allows them to decide how to display a fweet when a user is viewing it based on how that user has interacted with the fweet in the past. They also have followerstats that hold postlikes and postrefweets counts, and author and follower references. Interesting to count how much interaction 2 users have had with each other.
It also seems like they might have a rate-limiting table that identifies which user is trying to perform which action and to throttle that user if they’re doing it too much. Neat.
And that’s pretty much it for Fwitter. It’s a subset of the functionality of Twitter but it’s clean and approachable.
I’m guessing a little (actually a lot) on Twitter since their API v2 aggregates a lot of fields into Tweets and Users that I suspect don’t actually live on the Tweet or User models and I didn’t spend much time researching because this is more about intuiting what’s going on under the hood. I’m going to post this publicly so if anyone ever sees it I’m sure they’ll tell me all the places I’m wrong.
Twitter likely has the same sort of account/user model split but I didn’t really focus on it in their API. Know that it’s probably there but I’m not going to talk about it beyond this. The user model itself seems like it probably holds the id, name, username, created_at, description, location, pinned_tweet_id, profile_image_url, protected, url, verified, and withheld_id fields. They put some user metrics directly on the user in the User response call, but in a relational database those would normally be stored in a separate model and joined in. Using a NoSQL solution they might store that data directly on the model and just update it when they need to, which makes reads a lot faster since you’re not doing joins, and this is not an unreasonable approach when running at Twitter scale. You need to decide how you want to approach row-locking/transactions here, and you’ll probably want to be writing to a primary database and reading off 1 or more replicas. These metrics can update often, so they’ll need to be updated asynchronously and then the replicas updated often. The metrics they’re holding on the user are the followers_count, following_count, tweet_count, listed_count, and probably a reference to their withheld table (which has fields for copyright, country_codes, and scope: tweet|user). I don’t understand how the withheld object is being used so I’m not sure whether it belongs in its own table or if the fields should be duplicated on the tweet and user tables.
Under the hood they likely have join tables to tie together followers (follower_id, followed_id), blocks (blocker_id, blocked_id), and mutes (muter_id, muted_id). You could add more fields around any of this if you wanted to capture the context of when the action occurred, tying it to a tweet or a point in time or something.
Moving into the core of the app: let’s talk about Tweets. The tweet model likely has these fields at minimum: id, text, created_at, author_id (user reference), conversation_id (this _might be_ for quote tweets), in_reply_to_user_id, possibly_sensitive, lang, reply_settings, source, and a withheld_id or the withheld fields from that table. There are some other pieces that could be pulled into their own models or they could be part of the tweet model: referenced_tweets (type: retweeted, quoted, replied_to, id), attachments (media_keys, poll_ids), geo (coordinates type, coordinates themselves, place), context_annotations (domain_id, entity_id), and entities (id, name, description). The referenced tweets and attachments and geo are unlikely to be reusable, they provide direct context to the tweet itself, so it would make sense to associate those directly on the tweet model. I don’t know how context annotations differ from entity annotations, but it’s likely those are each pulled out to their own models.
If the context annotations and entity annotations have the same shape then I would create an Annotation table with those fields (id, start, end probability, type, normalized_text) and reference the id of that field from both the tweet and entities models. The context annotations also reference a domain, so there’s likely a Domain model with an id, name, and description.
Since there is a conversation_id field in the tweet there’s probably a Conversation model to hold details about the conversation. It would also make sense to have Places pulled out to their own table with an id and a name. It’s likely these would be referenced from many geo objects on the tweets, seems reusable because there is only one museum with a certain name but many geo points that are still within the museum and if a lot of people are at the museum and tweeting about the event or exhibit happening there you would want to tie all those together to the same place so if someone searches for the place you can pull in all the tweets related to it.
Some of the other data pulled out of tweets like urls (id, start, end, url, expanded_url, display_url, unwould_url), hashtags (id, start, end, tag), mentions (id, start, end, username), cashtags (id, start, end, tag) likely each have their own Models. In the database I would reference these into entities with join tables. You can have a join table for each of them, something like Entities_Annotations (entity_id, annotation_id), Entities_Urls (entity_id, url_id), etc. but I think you can actually simplify the implementation with a polymorphic join, something like Entities_PolymorphicJoins (entity_id, join_type, join_id) where join_type could be one of Annotation, Url, Hashtag, Mention, Cashtag.
Tweets can also have Includes on them which can be other tweets, users, places, media, or polls, so I would use the same sort of polymorphic join there, something like Includes_PolymorphicJoins (tweet_id, join_type, join_id).
Tweets have metrics so I would break those into their own tables for the same reason as I would do it for the user metrics explained above: Public Metrics (tweet_id, retweet_count, reply_count, like_count, quote_count), Non-Public Metrics (tweet_id, impression_count, url_link_clicks, user_profile_clicks), Organic Metrics (tweet_id, impression_count, url_link_clicks, user_profile_clicks, retweet_count, reply_count, like_count), Promoted Metrics (tweet_id, impression_count, url_link_clicks, user_profile_clicks, retweet_count, reply_count, like_count).
And then the last 2 tables are join tables for likes and retweets. Likes (tweet_id, user_id, created_at) and Retweets (tweet_id, user_id, created_at).
I think that covers one approach to how you might model the data behind Twitter. Is it right? Is it the best? Is it the way Twitter actually does it? No idea. But it’s how I might approach the data modeling problem given the same domain. They made some necessary trade-offs for scale, so they probably co-locate as much data as possible for raw speed which would change how the data is stored and modeled in the application. Reach out to me on Twitter (@davemcg3) if you want to share thoughts or let me know how right or wrong my approach here is. Thanks for joining me on this thought experiment!