Breakdown: GitHub Outage (5th May 2020)

Asutosh Panda
Mar 24, 2022

Outages have become common in the last few months, and here I will be breaking down the mighty GitHub's outage. The outage occurred on 5th May 2020 and lasted for around 2 hours and 24 minutes. It affected internal apps such as GitHub Pages, GitHub Actions and Dependabot. The thing is, with so many outages happening over the last few months, a few companies just gave up and said "it's the cloud, we can't do anything about it", whereas others found a way to work around the situation (e.g. Spotify). For GitHub it was not a cloud-related issue, and they managed to handle the incident after some time.

The Root Cause

I have seen a shared license across different applications cause issues before, but in GitHub's case the root cause of the outage was a shared database. An id column in one of their MySQL tables was a signed 32-bit integer, so when the database tried to insert a new row past the column's existing capacity of 2147483647 (2³¹ - 1), the value was refused and Ruby on Rails raised an ActiveModel::RangeError.
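To make the limit concrete, here is a minimal sketch against MySQL 8 using a hypothetical table called events (an illustration, not GitHub's actual schema): a plain signed INT id tops out at 2147483647, and in strict mode an explicit value beyond that is rejected outright.

    -- Hypothetical table: a signed 32-bit INT primary key caps at 2147483647 (2^31 - 1).
    CREATE TABLE events (
      id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      msg VARCHAR(255)
    );

    -- Fine: this is the largest value a signed INT can hold.
    INSERT INTO events (id, msg) VALUES (2147483647, 'last possible id');

    -- Rejected by MySQL 8 in strict mode:
    -- ERROR 1264 (22003): Out of range value for column 'id' at row 1
    INSERT INTO events (id, msg) VALUES (2147483648, 'one too many');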

What happens when we try to mimic the error?

In MySQL 8, if you try to replicate a similar scenario on your local system, you get a "duplicate entry" error rather than an "out of range" error. It behaves like an exception, and the value never even gets inserted into the table. The range of a signed integer in MySQL is -2³¹ to 2³¹ - 1, and the auto-increment counter stops once that maximum value is reached.

The interesting part is that in most languages like C, once a signed 32-bit counter touches the limit (2³¹ - 1), it typically wraps around and starts again from the negative end (-2³¹). In MySQL, however, once the limit is touched the auto-increment effectively behaves as 2147483647 + 1 = 2147483647, so the next insert fails with a "duplicate entry" error, as the sketch below shows.
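Here is a minimal reproduction sketch on a local MySQL 8 instance (the table name counter_demo is made up for the example): once the auto-increment counter is pinned at 2147483647, the next generated id collides with the existing row.

    CREATE TABLE counter_demo (
      id  INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
      msg VARCHAR(255)
    );

    -- Jump the auto-increment counter straight to the maximum signed INT value.
    INSERT INTO counter_demo (id, msg) VALUES (2147483647, 'at the limit');

    -- The counter cannot go past 2147483647, so MySQL generates the same id again:
    -- ERROR 1062 (23000): Duplicate entry '2147483647' for key 'counter_demo.PRIMARY'
    INSERT INTO counter_demo (msg) VALUES ('this insert fails');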

What could GitHub have done to handle the incident?

They haven’t specifically mentioned how they handled it? But they have said that their monitoring team had set an alert that will trigger when the limit of the integer injection limit of the database touches 70%. The basics of incident handling in SRE concepts says that you better try to mitigate an incident once it appears. Generally how someone should try to mitigate this kind of issue? Here are 2 methods -

1. Make the id UNSIGNED:

  • Making the column UNSIGNED (or simply a larger type) enlarges the usable range
  • SET the transaction isolation level to READ UNCOMMITTED
  • SET FOREIGN_KEY_CHECKS = 0
  • ALTER the TABLE; this will cause an outage, but a planned one, and it takes some time to run fully (MySQL copies the table with the newly defined column, so it is not a good idea to do it in place); see the sketch right after this list
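A minimal sketch of Method 1, assuming a hypothetical table named events whose primary key id is a signed INT AUTO_INCREMENT (an illustration, not GitHub's schema):

    SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SET FOREIGN_KEY_CHECKS = 0;

    -- Widening the column copies the whole table, so this is done in a planned window.
    ALTER TABLE events
      MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;

    SET FOREIGN_KEY_CHECKS = 1;

BIGINT UNSIGNED is used here so that one change covers both suggestions at once: an unsigned column and a larger type.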

2. Swap the TABLE:

  • In a day-to-day SRE job, engineers mostly follow this method when some sort of migration is going on
  • As the name suggests, we create a new table with the same structure but a larger id type, and then move the old table's data into the new one
  • SET the transaction isolation level to READ UNCOMMITTED
  • SET FOREIGN_KEY_CHECKS = 0
  • CREATE an empty TABLE with a larger id capacity (up to 2⁶⁴ - 1)
  • ALTER the TABLE
  • Swap the two tables using the same logic as swapping two variables; a sketch of these steps follows below
(Image: how the larger id gives more space for ids that will be assigned later.)
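A minimal sketch of Method 2, again using the hypothetical events table (an illustration under assumed names, not GitHub's actual procedure):

    SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SET FOREIGN_KEY_CHECKS = 0;

    -- Create an empty copy of the table, then widen its id column
    -- (BIGINT UNSIGNED gives a range of 0 to 2^64 - 1).
    CREATE TABLE events_new LIKE events;
    ALTER TABLE events_new
      MODIFY COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT;

    -- Backfill the data; in practice this is batched or done with an online
    -- migration tool, shown here as a single statement for brevity.
    INSERT INTO events_new SELECT * FROM events;

    -- Swap the tables atomically, the same way two variables are swapped
    -- through a temporary name.
    RENAME TABLE events TO events_old, events_new TO events;

    SET FOREIGN_KEY_CHECKS = 1;

The RENAME TABLE at the end is what makes the cut-over quick; the slow part (copying the rows) happens while the old table is still serving traffic.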

This article was inspired by Arpit Bhayani's dissection of the GitHub outage on YouTube, which you can find here: https://youtu.be/ZFRAFTn0cQ0


Thank you for reading this far.

