  • AT&T failed to test disastrous update that kicked all devices off network


    Karlston

    • 238 views
    • 9 minutes

    AT&T caused outage that blocked 92 million calls, 25,000 attempts to reach 911.

    A government investigation has revealed more detail on the impact and causes of a recent AT&T outage that happened immediately after a botched network update. The nationwide outage on February 22, 2024, blocked over 92 million phone calls, including over 25,000 attempts to reach 911.

     

    As described in more detail later in this article, the FCC criticized AT&T for not following best practices, which dictate "that network changes must be thoroughly tested, reviewed, and approved" before implementation. It took over 12 hours for AT&T to fully restore service.

     

    "All voice and 5G data services for AT&T wireless customers were unavailable, affecting more than 125 million devices, blocking more than 92 million voice calls, and preventing more than 25,000 calls to 911 call centers," the Federal Communications Commission said yesterday. The outage affected all 50 states as well as Washington, DC, Puerto Rico, and the US Virgin Islands.

     

    The outage also cut off service to public safety users on the First Responder Network Authority (FirstNet), the FCC report said. "Voice and 5G data services were also unavailable to users from mobile virtual network operators (MVNOs) and other wireless customers who were roaming on AT&T Mobility's network," the FCC said.

    An incorrect process

    AT&T previously acknowledged that the mobile outage was caused by a botched update related to a network expansion. The "outage was caused by the application and execution of an incorrect process used as we were expanding our network, not a cyber attack," AT&T said.

     

    The FCC report said the nationwide outage began three minutes after "AT&T Mobility implemented a network change with an equipment configuration error." This configuration error caused the AT&T network "to enter 'protect mode' to prevent impact to other services, disconnecting all devices from the network, and prompting a loss of voice and 5G data service for all wireless users."

     

    While the network change was rolled back within two hours, full service restoration "took at least 12 hours because AT&T Mobility's device registration systems were overwhelmed with the high volume of requests for re-registration onto the network," the FCC found.

    Outage reveals deeper problems at AT&T

    Although a configuration error was the immediate cause of the outage, the FCC investigation revealed various problems in AT&T's processes that increased the likelihood of an outage and made recovery more difficult than it should have been. The FCC Public Safety and Homeland Security Bureau analyzed network outage reports and written responses submitted by AT&T and interviewed AT&T employees. The bureau's report said:

     

    "The Bureau finds that the extensive scope and duration of this outage was the result of several factors, all attributable to AT&T Mobility, including a configuration error, a lack of adherence to AT&T Mobility's internal procedures, a lack of peer review, a failure to adequately test after installation, inadequate laboratory testing, insufficient safeguards and controls to ensure approval of changes affecting the core network, a lack of controls to mitigate the effects of the outage once it began, and a variety of system issues that prolonged the outage once the configuration error had been remedied."

    At 2:42 am CST on February 22, an AT&T "employee placed a new network element into its production network during a routine night maintenance window in order to expand network functionality and capacity," the FCC said. The configuration "did not conform to AT&T's established network element design and installation procedures, which require peer review."

     

    An adequate peer review should have prevented the network change from being approved and from being loaded onto the network, but this peer review did not take place, the FCC said. The configuration error was made by one employee, and the misconfigured network element was loaded onto the network by a second employee.

     

    "The fact that the network change was loaded onto the AT&T Mobility network indicates that AT&T Mobility had insufficient oversight and controls in place to ensure that approval had occurred prior to loading," the FCC said.

    AT&T faces possible punishment

    AT&T issued a statement saying it has "implemented changes to prevent what happened in February from occurring again. We fell short of the standards that we hold ourselves to, and we regret that we failed to meet the expectations of our customers and the public safety community."

     

    AT&T could eventually face penalties. The Public Safety and Homeland Security Bureau referred the matter to the FCC Enforcement Bureau for potential violations of FCC rules.

     

    Verizon Wireless last month agreed to pay a $1,050,000 fine and implement a compliance plan because of a December 2022 outage in six states that lasted one hour and 44 minutes. The Verizon outage was similarly caused by a botched update, and the FCC investigation revealed systemic problems that made the company prone to such outages.

    All 911 attempts failed

    Once the AT&T configuration error was introduced, "downstream network elements propagated the error further into the network," the FCC said. "This triggered an automated response that shut down all network connections to prevent the traffic from propagating further into the network. The shutdown isolated all voice and 5G data processing elements from the wireless towers and switching elements, preventing these services from being available."

     

    The AT&T network disconnected all devices from voice and 5G data services "at 2:45 am, just three minutes after the misconfigured network element was placed into production." When voice services were disconnected, no 911 calls from AT&T devices could be routed to Public Safety Answering Points (PSAPs), the FCC said:

     

    "All such attempted 911 calls therefore failed preventing more than 25,000 calls to PSAPs or 911 call centers. This includes devices that were in SOS mode while attached to AT&T towers. When a device is in SOS mode, it cannot register to the network, but in most instances the device can still reach 911. However, all voice services on the AT&T Mobility network were unavailable during the outage, including calls to 911 made by phones in SOS mode that were attempted over the AT&T Mobility network. AT&T customers whose devices in SOS mode attached to other carriers could complete 911 calls through those networks."

    AT&T prioritized the restoration of FirstNet service over commercial and residential users, and FirstNet infrastructure was restored by 5 am. "Restoring service to commercial and residential users took several more hours as AT&T Mobility continued to observe congestion as high volumes of AT&T Mobility user devices attempted to register on the AT&T Mobility network. This forced some devices to revert back to SOS mode," the FCC said.

    Other underlying problems

    The lack of peer review mentioned earlier was accompanied by a failure to conduct adequate lab testing. The FCC said AT&T's lab testing "either failed to effectively emulate the live environment or failed to test the impact of this misconfiguration on the wider network. Any such testing should have identified the issue prior to the occurrence of the outage."

     

    AT&T also failed to adequately test after implementation of the network change, the FCC said. "An effective post-installation test may have helped detect the misconfigured network element more quickly, thereby allowing AT&T Mobility to initiate corrective action more expeditiously," the FCC said. "AT&T Mobility either lacked sufficient oversight and controls in place to ensure these test processes were followed, or if they were, then the processes themselves were insufficient."
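
To illustrate what a post-installation test can look like in general, here is a minimal sketch of an apply, verify, and roll back wrapper. It is not taken from the FCC report or from AT&T's tooling; the function `apply_with_verification` and its three callbacks are hypothetical names chosen for this example.

```python
def apply_with_verification(apply, health_check, rollback):
    """Apply a change, verify it right away, and roll back on failure.

    All three arguments are caller-supplied callables (hypothetical):
      apply()        -> performs the change
      health_check() -> returns True if the system looks healthy
      rollback()     -> undoes the change
    Returns True if the change was kept, False if it was rolled back.
    """
    apply()
    try:
        healthy = bool(health_check())
    except Exception:
        # A health check that crashes is treated as a failed check.
        healthy = False
    if not healthy:
        rollback()
        return False
    return True


if __name__ == "__main__":
    log = []
    kept = apply_with_verification(
        apply=lambda: log.append("apply"),
        health_check=lambda: False,  # simulate a failing post-install test
        rollback=lambda: log.append("rollback"),
    )
    print(kept, log)
```

A check like this only shortens the time to detection. As the FCC noted, the rollback in this outage was completed within two hours, yet full restoration took far longer because of registration congestion.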

     

    Additionally, a "downstream network element lacked controls specific to mitigating this error and therefore was unable to mitigate the effects created by the misconfigured network element," the FCC said. "Because the network element was lacking these controls, it passed traffic further into the network."

     

    AT&T was unprepared for the congestion caused by user devices attempting to reconnect to the network en masse. "Despite configuring its network to enter Protection Mode to prevent propagating errors to other parts of the network, AT&T failed to prepare for the registration congestion associated with the network recovering from Protection Mode, or to sufficiently mitigate that congestion after the fact... More robust registration systems with greater capacity would have enabled AT&T Mobility to more quickly and efficiently recover after the network entered into Protection Mode," the FCC said.
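
The congestion described here resembles what is often called a "thundering herd": many clients retrying at the same moment. One common, general mitigation is randomized exponential backoff on the client side. The sketch below is purely illustrative and assumes nothing about how AT&T's devices or registration systems actually behave; `backoff_delay`, `peak_load`, and all parameter values are hypothetical.

```python
import random


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Full-jitter exponential backoff.

    Returns a random delay in seconds between 0 and
    min(cap, base * 2**attempt). Values are illustrative only.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def peak_load(num_devices, attempt, bucket_seconds=1.0, seed=0):
    """Count how many retries fall into the busiest time bucket.

    Simulates num_devices clients that all failed at the same instant
    and each wait backoff_delay(attempt) before retrying.
    """
    rng = random.Random(seed)
    buckets = {}
    for _ in range(num_devices):
        slot = int(backoff_delay(attempt, rng=rng) // bucket_seconds)
        buckets[slot] = buckets.get(slot, 0) + 1
    return max(buckets.values())


if __name__ == "__main__":
    for attempt in (0, 3, 6):
        print(attempt, peak_load(10_000, attempt))
```

In this toy simulation, a larger backoff window spreads the same number of retries over more time, so the busiest second sees far fewer requests. Real mobile networks have their own standardized retry timers, so this is a conceptual sketch only.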

    Fixes

    AT&T has been working on changes to prevent future outages. Within two days of the February outage, it implemented new technical controls, the FCC said.

     

    "This included scanning the network for any network elements lacking the controls that would have prevented the outage, and promptly putting those controls in place. AT&T has engaged in ongoing forensic work and implemented additional enhancements to promote network robustness and resilience," the FCC said. AT&T also "implemented additional steps for peer review and adopted procedures to ensure that maintenance work cannot take place without confirmation that required peer reviews have been completed."

     

    The FCC said it will issue a public notice to service providers reminding them of the importance of following best practices. The public notice will be based on analysis of the AT&T outage and other recent outages.

     

    "Sound network management practices of critical infrastructure and AT&T Mobility's own processes demand that only approved network changes that are developed pursuant to internal procedures and industry best practices, should be loaded onto the production network. It should not be possible to load changes that fail to meet those criteria," the FCC said.
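
The principle in that last sentence, that unapproved changes should be impossible to load rather than merely discouraged, can be sketched as a gate in front of the deployment step. The example below is a hypothetical illustration, not a description of AT&T's systems; `ChangeRequest`, `ChangeRejected`, `missing_requirements`, and `load_to_production` are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical record of a proposed network change."""
    change_id: str
    author: str
    reviewer: Optional[str] = None  # who peer-reviewed it, if anyone
    lab_tested: bool = False
    approved: bool = False


class ChangeRejected(Exception):
    """Raised when a change does not meet the loading criteria."""


def missing_requirements(change: ChangeRequest) -> List[str]:
    """Return a list of unmet criteria (empty means OK to load)."""
    problems = []
    if change.reviewer is None:
        problems.append("no peer review")
    elif change.reviewer == change.author:
        problems.append("reviewer is the author")
    if not change.lab_tested:
        problems.append("not lab tested")
    if not change.approved:
        problems.append("not approved")
    return problems


def load_to_production(change: ChangeRequest,
                       loader: Callable[[ChangeRequest], None]) -> None:
    """Call loader(change) only if every criterion is met."""
    problems = missing_requirements(change)
    if problems:
        raise ChangeRejected(f"{change.change_id}: " + ", ".join(problems))
    loader(change)
```

The design point is that the check lives inside the only code path that can load a change, so a second employee cannot load a configuration that skipped review simply by not noticing that the review is missing.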

     

    Source

     


