  • GitHub reveals reason behind last week’s string of outages


    Karlston


    Mike Hanley, GitHub's Chief Security Officer and SVP of Engineering, shared more details today on a string of outages that hit the code hosting platform last week.

     

    While these incidents had unrelated root causes, they affected most of GitHub's primary services from May 9 to May 11, causing widespread database connection and authentication failures for up to ten hours.

     

    "Last week, GitHub experienced several availability incidents, both long running and shorter duration. We have since mitigated these incidents and all systems are now operating normally," Hanley said.

     

    "The root causes for these incidents were unrelated but in aggregate, they negatively impacted the services that organizations and developers trust GitHub to deliver. This is not acceptable nor the standard we hold ourselves to."

     

    On May 9, eight main services were hit by a major outage caused by a configuration change to GitHub's internal service serving Git data.

     

    The second outage, on May 10, disrupted the issuance of authentication tokens for GitHub Apps. It resulted from high load combined with an inefficient implementation of the API that manages GitHub App permissions.

     

    "On May 10, the database cluster serving GitHub App auth tokens saw a 7x increase in write latency for GitHub App permissions (status yellow)," Hanley explained.

     

    "The failure rate of these auth token requests was 8-15% for the majority of this incident, but did peak at 76% for a short time."

     

    The third outage, on May 11, was caused by a loss of read replicas after a database cluster serving Git data crashed and triggered an automated failover.

     

    Figure: Incident history (GitHub)

     

    "We are addressing the Git database crash that has caused more than one incident at this point. This work was already in progress and we will continue to prioritize it," Hanley said.

     

    "We are addressing the database failover issues to ensure that failovers always recover fully without intervention."

     

    GitHub will share more detailed information on these outages and what it's doing to address the issues that caused them in its May Availability Report.

     

    "The May report will include these incidents and any further detail we have on them, along with a general update on progress towards increasing the availability of GitHub," Hanley said.

     

    GitHub was also hit by multiple outages within a single week in March 2022. The company attributed those incidents to resource contention in the platform's primary database cluster.

     

    Another major outage hit GitHub in February 2022, when the platform went down worldwide, preventing access to the website and blocking commits, clones, and pull requests.

     

     
