Old is Gold : Slack believes it

Asutosh Panda
5 min read · Mar 27, 2022

Slack started as a gaming company, and today we know how big it is in the corporate world. Here we will look at Slack's persistent group messaging, which lets you catch up on messages even when you open Slack hours later, and at the interactivity of group chat: how it works and how the architecture is designed for it.

Working Mechanism :

Slack has not yet reached the billion-user mark and serves a few million users daily, yet it handles among the largest numbers of open WebSocket connections in the world. On average, its users are continuously active for 2–3 hours per day and use it for about 10 hours in total over a working day. Slack has a conservative taste in tech stack: most of the things they use are 10–15 years old. The logic is to keep the system stable and to fix newly discovered issues as quickly as possible. They also keep it minimal: when a new feature is introduced, they prefer to build it with a tool that is already in use rather than add a new one.

simple architecture of slack

From the above diagram: when Slack clients (mobile, laptop) access Slack, they hit the WebApp (a large RESTful app). The other components are the Message Server, an essential part of every chat app, MySQL as the database, and the Job Queue.

  • WebApp: When someone accesses “slack.com”, the request first hits the RTM (Real Time Messaging) API and then the WebApp. The WebApp is a monolithic app written in PHP. It is a scaled-out LAMP stack application in which Memcache is wrapped around a sharded MySQL database. PHP is an old language, but it is impressive that such a feature-rich app still uses it successfully.
login & receive messages
  • For Slack, teams, members, and chats are all relational data, held in normalized form in tables. After the request hits the RTM API, it is routed to the sharded database based on the user's team ID. The lookup data there tells us which domain name maps to which team, which team maps to which shard, and so on. The payload is then built from the RTM start response (blob data).
  • The MySQL shards hold the following:
    - source of truth for all customer data: teams, users, channels, messages etc.
    - replication across 2 DCs: data stays available in case one DC fails
    - sharded by team: for performance, fault isolation and scalability
    Although it is logically a single database spread across many machines, and every head holds the data of every user on its shard, the heads can still sustain the required traffic. This is done mainly for fault tolerance.
    - But how do they handle conflicts?
    1. Choosing A in CAP terms
    2. Conflicts are possible
    ➞ Most resolved automatically
    ➞ Some manually, by operator action
    3. INSERT ON DUPLICATE KEY UPDATE …
    4. Partitioning by team saves us
    ➞ Team writes cannot overlap
    ➞ Even teams use “left” head, odd teams use “right” head
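
The routing described above can be sketched roughly as follows. This is only an illustration, not Slack's actual code: the lookup tables, domain names, team IDs and shard numbers are all made up, and the only rule taken from the text is that even teams prefer the “left” head and odd teams the “right” head.

```python
# Hypothetical lookup tables standing in for the mapping data described above.
# All domains, team IDs and shard numbers are invented for illustration.
DOMAIN_TO_TEAM = {"acme.slack.com": 1001, "globex.slack.com": 1002}
TEAM_TO_SHARD = {1001: 7, 1002: 3}


def preferred_head(team_id: int) -> str:
    """Even team IDs write to the "left" head, odd team IDs to the "right" head."""
    return "left" if team_id % 2 == 0 else "right"


def route(domain: str) -> tuple[int, int, str]:
    """Resolve a domain to (team_id, shard, head) using the lookup tables."""
    team_id = DOMAIN_TO_TEAM[domain]   # domain name -> team
    shard = TEAM_TO_SHARD[team_id]     # team -> shard
    return team_id, shard, preferred_head(team_id)


if __name__ == "__main__":
    for domain in DOMAIN_TO_TEAM:
        print(domain, route(domain))
```

Because all writes for a given team go to one head of one shard, two teams never write the same rows, which is why partitioning by team keeps most conflicts from happening in the first place.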
RTM response
  • RTM Payload: it contains many small details, but the crucial ones are whether the socket was started and the socket URL. In short, these are the things the RTM.start payload covers:
    1. Returns an image of the whole team
    2. Architecture of clients
    ➞ Eventually consistent snapshot of whole team
    ➞ Updates trickle in through the web socket
    3. Guarantees responsive clients
    4. Ensures the connection is established
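
To make the "snapshot first, updates later" idea concrete, here is a minimal client-side sketch. The payload and event shapes are simplified assumptions made for this example (the fields `ok`, `url`, `channels` and the event types are not taken from this article), and the socket itself is left out: events are simply passed to `apply` in the order they would arrive.

```python
from dataclasses import dataclass, field


@dataclass
class ClientState:
    """Local model of a team: seeded from a snapshot, then patched by events."""
    socket_url: str = ""
    channels: dict = field(default_factory=dict)   # channel id -> channel name
    messages: dict = field(default_factory=dict)   # channel id -> list of texts

    @classmethod
    def from_snapshot(cls, payload: dict) -> "ClientState":
        """Build the initial state from a (hypothetical) start payload."""
        if not payload.get("ok"):
            raise ValueError("start call failed, no socket to connect to")
        state = cls(socket_url=payload["url"])
        for channel in payload["channels"]:
            state.channels[channel["id"]] = channel["name"]
            state.messages[channel["id"]] = []
        return state

    def apply(self, event: dict) -> None:
        """Apply one update that trickled in over the web socket."""
        if event["type"] == "message":
            self.messages.setdefault(event["channel"], []).append(event["text"])
        elif event["type"] == "channel_created":
            channel = event["channel"]
            self.channels[channel["id"]] = channel["name"]
            self.messages.setdefault(channel["id"], [])
        # Unknown event types are ignored so that older clients keep working.
```

The client can render from the snapshot right away, which is what keeps it responsive, and it converges to the current state as events are applied.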
message delivery architecture
  • Message Delivery: this is handled by the message server, whose job is to keep all messages live, or real-time, so that everyone stays connected with minimal latency. It is written entirely in Java. Once RTM starts, a race begins to keep everything up to date. The message server keeps a buffer of recent events that lasts up to 30 seconds, and before that window runs out RTM has to update the messages on the client side and in the cloud too. To avoid glitches and delays, they use many in-memory queues to hold data locally for a while.
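
The 30-second buffer of recent events can be pictured with a small sketch like the one below. It is written in Python for readability even though the real message server is in Java, and the class name, method names and injectable clock are assumptions made for this example.

```python
import time
from collections import deque


class RecentEventBuffer:
    """Keeps events for a limited time so a late client can catch up."""

    def __init__(self, ttl_seconds: float = 30.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so the buffer can be tested
        self._events = deque()   # (timestamp, event) pairs, oldest first

    def _evict(self) -> None:
        """Drop events that are older than the time-to-live."""
        cutoff = self._clock() - self._ttl
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def add(self, event) -> None:
        """Record a new event with the current timestamp."""
        self._evict()
        self._events.append((self._clock(), event))

    def since(self, timestamp: float) -> list:
        """Return the buffered events recorded at or after `timestamp`."""
        self._evict()
        return [event for ts, event in self._events if ts >= timestamp]
```

A client that connects its socket within the window can ask for everything since its snapshot was taken. A client that takes longer than the window has missed events that are already gone, which is the race mentioned above.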
deferring work
  • The Job Queue: this is handled with Redis. They use multiple queues, each with its own pool of workers, so that the queues are isolated from one another. Once a task finishes at one level, it signals the next level to start. Some neat things happen here: for example, when you post the URL of someone's Twitter profile, a small preview of that profile is shown as a popup.
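
Here is a rough sketch of that pattern: several queues, a separate worker pool per queue, and one stage handing work to the next. Python's in-memory `queue.Queue` stands in for the Redis-backed queues, and the queue names, job functions and pool sizes are invented for this example.

```python
import queue
import threading

# In-memory stand-ins for the Redis-backed queues; the names are hypothetical.
QUEUES = {"unfurl": queue.Queue(), "notify": queue.Queue()}
results = []
results_lock = threading.Lock()


def unfurl_link(url):
    """First stage: pretend to build a preview, then hand off to the next stage."""
    preview = f"preview of {url}"
    QUEUES["notify"].put((post_preview, preview))


def post_preview(preview):
    """Second stage: record the preview that would be shown to the user."""
    with results_lock:
        results.append(preview)


def worker(q):
    """Run jobs from one queue until a None sentinel arrives."""
    while True:
        job = q.get()
        try:
            if job is None:
                return
            func, arg = job
            func(arg)
        finally:
            q.task_done()


def start_pool(name, size):
    """Start `size` workers that serve only the queue called `name`."""
    threads = [threading.Thread(target=worker, args=(QUEUES[name],)) for _ in range(size)]
    for thread in threads:
        thread.start()
    return threads


def run_demo(urls):
    """Push URLs through both stages and return the sorted previews."""
    results.clear()
    pools = {"unfurl": start_pool("unfurl", 2), "notify": start_pool("notify", 1)}
    for url in urls:
        QUEUES["unfurl"].put((unfurl_link, url))
    QUEUES["unfurl"].join()   # every first-stage job has enqueued its follow-up
    QUEUES["notify"].join()   # every follow-up job has finished
    for name, threads in pools.items():
        for _ in threads:
            QUEUES[name].put(None)   # one stop sentinel per worker
        for thread in threads:
            thread.join()
    return sorted(results)
```

Giving each queue its own workers means a slow job type, such as fetching a remote page for a link preview, cannot starve the other queues.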

Thanks for reading up to here 😊

To learn more about it, check out this link:

https://youtu.be/WE9c9AZe-DY

Asutosh Panda

I am a DevOps Engineer interested in the SRE and DevOps world. Apart from tech, I am into cinematography, poetry, and dance.