Definitely not Windows 95: What operating systems keep things running in space?

October 2, 2020

The updates don't come every spring and fall, but space operating systems keep evolving.

ESA's Solar Orbiter mission will face the Sun from within the orbit of Mercury at its closest approach.

ESA/ATG medialab

The ESA’s recently launched Solar Orbiter will spend years in one of the most unwelcoming places in the Solar System: the Sun. During its mission, Solar Orbiter will get 10 million kilometers closer to the Sun than Mercury. And, mind you, Mercury is close enough to have sustained temperatures reaching 450°C on its Sun-facing surface.

To withstand such temperatures, Solar Orbiter is going to rely on an intricately designed heat shield. This heat shield, however, will protect the spacecraft only when it is pointed directly at the Sun; there is not enough protection on the sides or in the back of the probe. Accordingly, ESA developed a real-time operating system (RTOS) for Solar Orbiter that can act under very strict requirements. The maximum allowed off-pointing from the Sun is only 6.5 degrees. Any off-pointing exceeding 2.3 degrees is acceptable only for a very brief period of time. When something goes wrong and dangerous off-pointing is detected, Solar Orbiter is going to have only 50 seconds to react.
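As a rough illustration of those pointing limits, here is a minimal Python sketch. The two thresholds and the 50-second budget come from the figures above; the function name and the three status labels are made up for this example, and the length of the "very brief period" is left out because the article does not give it.

```python
MAX_OFF_POINTING_DEG = 6.5   # hard limit quoted in the text
BRIEF_ONLY_ABOVE_DEG = 2.3   # beyond this, off-pointing is tolerated only briefly
RECOVERY_BUDGET_S = 50       # time available to detect, isolate, and recover

def classify_off_pointing(angle_deg: float) -> str:
    """Map an off-pointing angle to a hypothetical status label."""
    angle = abs(angle_deg)
    if angle > MAX_OFF_POINTING_DEG:
        return "violation"    # outside the maximum allowed off-pointing
    if angle > BRIEF_ONLY_ABOVE_DEG:
        return "brief_only"   # acceptable only for a very brief period
    return "nominal"

print(classify_off_pointing(3.0))
```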

"We’ve got extremely demanding requirements for this mission," says Maria Hernek, head of flight software systems section at ESA. "Typically, rebooting the platform such as this takes roughly 40 seconds. Here, we’ve had 50 seconds total to find the issue, have it isolated, have the system operational again, and take recovery action.”

To reiterate: this operating system, located far away in space, needs to remotely reboot and recover in 50 seconds. Otherwise, the Solar Orbiter is getting fried.

Billiard ball OS

To deal with such unforgiving deadlines, spacecraft like Solar Orbiter are almost always run by real-time operating systems that work in an entirely different way than the ones you and I know from the average laptop. The criteria by which we judge Windows or macOS are fairly simple. They perform a computation, and if the result of this computation is correct, then a task is considered to be done correctly. Operating systems used in space add at least one more central criterion: a computation needs to be done correctly within a strictly specified deadline. When a deadline is not met, the task is considered failed and terminated. And in spaceflight, a missed deadline quite often means your spacecraft has already turned into a fireball or strayed into an incorrect orbit. There’s no point in processing such tasks any further; things must adhere to a very precise clock.
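To make the extra criterion concrete, here is a small Python sketch. All names are hypothetical and the tick values are invented; it only contrasts the two ways of judging the same task.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    value_correct: bool  # did the computation produce the right answer?
    finished_at: int     # tick at which the task finished
    deadline: int        # tick by which it had to finish

def desktop_verdict(result: TaskResult) -> bool:
    """Desktop-style judgment: only correctness matters."""
    return result.value_correct

def realtime_verdict(result: TaskResult) -> bool:
    """Real-time judgment: correct and on time, otherwise the task has failed."""
    return result.value_correct and result.finished_at <= result.deadline

# A correct answer that arrives two ticks late.
late_but_right = TaskResult(value_correct=True, finished_at=12, deadline=10)
print(desktop_verdict(late_but_right), realtime_verdict(late_but_right))
```

The same result passes on a laptop and fails on a spacecraft, which is the whole difference in a nutshell.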

The time, as measured by the clock, is divided into discrete ticks. To simplify, space operating systems are typically designed in such a way that each task is performed within a set number of allocated ticks. It can take three ticks to upload data from sensors; four further ticks are devoted to firing up engines, and so on. Each possible task is assigned a specific priority, so a higher-priority task can take precedence over a lower-priority task. This way, a software designer knows exactly which task is going to be performed in any given scenario and how much time it is going to take to get it done.
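A toy version of such a tick budget might look like the Python sketch below. The three-tick and four-tick figures are the ones from the paragraph above; the task names, the third task, and the priority numbers are assumptions made for illustration, and no real flight software is this simple.

```python
# (name, ticks allocated, priority); a lower number means more urgent here.
TASKS = [
    ("read_sensors", 3, 1),
    ("fire_engines", 4, 0),
    ("log_telemetry", 2, 2),   # hypothetical extra task
]

def build_timeline(tasks):
    """Run each task to completion in priority order.

    Returns a list of (name, start_tick, end_tick) rows.
    """
    timeline, tick = [], 0
    for name, ticks, _priority in sorted(tasks, key=lambda t: t[2]):
        timeline.append((name, tick, tick + ticks))
        tick += ticks
    return timeline

timeline = build_timeline(TASKS)
total_ticks = timeline[-1][2]
for row in timeline:
    print(row)
```

Because every budget is fixed in advance, the designer can read the finish time of each task straight off the table before anything ever runs.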

To compare this to operating systems we all know, just watch any given speed comparison between modern smartphones. In this one made by EverythingApplePro, the iPhone XS Max and Samsung S10 Plus go head to head opening some popular apps. Before the test, both phones are restarted, and the cache is cleared in them. Samsung opens all the apps in 2 minutes 30 seconds, and the iPhone clocks in at 2 minutes 54 seconds. In the second round, all the apps are closed and opened again without restarting or clearing the RAM. Because the apps are still in RAM, Samsung finishes the opening in 46 seconds, and the iPhone does it in 42 seconds. That’s a whopping two-minute time difference between the first try and the second. But if the phones had to run the kind of real-time operating systems used for spaceflight, opening those apps would take exactly the same amount of time no matter how many times you tried it—down to a millisecond.

Beyond time, space operating systems have more tricks up their sleeves. Real-time operation is one thing, and determinism is another. If you somehow convinced Craig Federighi to take part in one of those speed comparisons, gave him full access to the iPhone about to be tested, and asked him to predict exactly how much time it would take for this iPhone to complete the test, he would likely have no idea. Sure, he’d probably say something like "fast," or "fast enough," or even "blazingly fast," but nothing more specific than that. Neither iOS nor Android is a deterministic system. The number of factors that could potentially affect speed results is so huge that making such exact predictions is practically impossible. But if the phone was running a space-grade OS, an engineer with access to the system would know exactly what causes what in a given sequence and could calculate the exact time necessary for any given task. Space-grade software has to be fully predictable and perform within super specific deadlines.

Shooting at the Moon (and beyond) with VxWorks

Back in the Apollo days, operating systems were custom-built for each mission. Sure, some of the code got reused—parts of the software made for the Apollo program made their way to Skylab and the Shuttle program, for instance. But for the most part, things had to be done from scratch.

One small reboot

During their famous descent, Buzz Aldrin and Neil Armstrong left the rendezvous radar antenna on and pointed at the Apollo Command Module orbiting the Moon. This was a safety measure for the lander to know where the CM was in case it needed to abort the landing. But it turned out the radar was flooding the computer with data, which caused the Apollo Guidance Computer (AGC) to quickly run out of memory. The infamous 1202 and 1201 errors simply meant there were no free core sets and no free vector accumulator areas, respectively. The lack of memory made it impossible for the landing programs to complete on time, and this in turn caused repeated restarts of the computer. Still, due to safety measures built into the OS, no critical navigation data was lost during those reboots, and the landing could proceed as planned.

The OS simply ran its scheduled tasks, picking up exactly where it had left off.

Eventually, NASA’s preferred OS solution came from Wind River, a company based in Alameda, California. Wind River released a fully operational commercial off-the-shelf real-time operating system called VxWorks back in 1987. While VxWorks wasn’t the first system of this kind, it quickly became the most widely deployed of them all, and it soon caught the eye of NASA mission designers.

The first mission to fly VxWorks was the Clementine Moon probe, otherwise known as the Deep Space Program Science Experiment. Back in the early 1990s, Clementine marked NASA’s shift away from behemoth, Apollo-like programs. Everything was supposed to be lean, developed quickly, and on a tight budget. As such, one of the design choices made for the Clementine probe was to use VxWorks, and the system made a good enough impression to get a second date. VxWorks was the choice for the Mars Pathfinder mission.

Not everything was rosy for this RTOS, though. A bug, the priority inversion problem, caused a lot of trouble for NASA’s ground control team. Shortly after landing, Pathfinder’s system started to reboot for no apparent reason, which delayed transmitting the collected data back to Earth. It took three weeks to find the problem and another 18 hours to fix it; the issue turned out to be buried deep down in the VxWorks mechanics.

Anatomy of VxWorks

At the heart of VxWorks lies the wind microkernel. Its job is to manage all the interactions between applications operating in the system and hardware. In VxWorks, the microkernel is responsible for scheduling tasks across the 256 priority levels a task can be assigned. Both preemptive priority-based and round-robin scheduling are supported, along with all communications between tasks.
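The scheduling policy can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the wind kernel's code: every task is ready at tick 0, the round-robin time slice is one tick, each task needs at least one tick, and the function and task names are invented. Priority 0 is treated as the most urgent level, as it is in VxWorks.

```python
from collections import deque

def run(tasks, levels=256):
    """tasks: list of (name, priority, ticks_needed), with ticks_needed >= 1.

    Returns the name of the task that ran on each tick.
    """
    queues = {}
    for name, priority, ticks in tasks:
        if not 0 <= priority < levels:
            raise ValueError(f"priority {priority} outside 0..{levels - 1}")
        queues.setdefault(priority, deque()).append([name, ticks])

    trace = []
    while queues:
        priority = min(queues)          # most urgent non-empty level wins
        queue = queues[priority]
        task = queue.popleft()
        trace.append(task[0])
        task[1] -= 1
        if task[1] > 0:
            queue.append(task)          # round-robin among equal priorities
        if not queue:
            del queues[priority]
    return trace

trace = run([("A", 10, 2), ("B", 10, 2), ("C", 5, 1)])
print(trace)
```

Task C runs first because it has the most urgent priority; A and B then alternate tick by tick because they share a level.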

Tasks in the system can be in one of four states. The "ready" state is the state of a task when it is started. From there, it can either run till it’s done or can be assigned a specific amount of time for running. A task enters a “blocked” state when it gets preempted by another task with a higher priority or when its allotted number of ticks has run out. The third option is a "delayed" state. A task is delayed while it waits for resources necessary for it to do its job (maybe data samples from a sensor). A delay is always measured by a timer running independently of processing, typically a tick counter at all times maintained by the kernel. When such delays exceed some set values, the system assumes something probably went really wrong and starts rebooting. Finally, there is also the fourth, “suspended” state, where the task’s context registers are saved while it is stopped for debugging.
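The four states read naturally as a small state machine. The Python sketch below follows the description in the paragraph above rather than Wind River's documentation; the event names and the `delay_exceeded` helper are hypothetical.

```python
from enum import Enum, auto

class State(Enum):
    READY = auto()
    BLOCKED = auto()
    DELAYED = auto()
    SUSPENDED = auto()

# (current state, event) -> next state, as described in the text.
TRANSITIONS = {
    (State.READY, "preempted"): State.BLOCKED,
    (State.READY, "slice_expired"): State.BLOCKED,
    (State.READY, "wait_for_resource"): State.DELAYED,
    (State.READY, "debug_stop"): State.SUSPENDED,
    (State.BLOCKED, "rescheduled"): State.READY,
    (State.DELAYED, "resource_available"): State.READY,
    (State.SUSPENDED, "debug_resume"): State.READY,
}

def step(state: State, event: str) -> State:
    """Return the next state, or raise if the event makes no sense here."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}")

def delay_exceeded(delayed_since_tick: int, now_tick: int, limit_ticks: int) -> bool:
    """True when a task has been delayed longer than the allowed limit,
    which is the condition the text says leads to a reboot."""
    return now_tick - delayed_since_tick > limit_ticks

print(step(State.READY, "wait_for_resource"))
```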

Inter-task communication in VxWorks can be done either through a messaging service that allows tasks to exchange data or through semaphores, a variable that exists to make sure tasks are interlocked or synchronized when needed. There are two types of semaphores in VxWorks. The first are binary semaphores, which can assume two values: "full" or "empty." Full semaphores are available for tasks, and empty ones are unavailable. When a task starts, it takes an available semaphore, making it "empty" or unavailable for other tasks. When the task is finishing its execution, it relinquishes the semaphore, thus rendering it available for other tasks.

Such binary semaphores are used for synchronizing or interlocking different tasks. The name "semaphore" has railroad connotations, so let’s stick with that for an analogy: imagine two trains that need to meet at some point to exchange cargo. In the VxWorks reality, the train that needs to pick the cargo up would create an empty semaphore and hand it over to the train that is carrying this cargo at the moment. Once the cargo-carrying train has unloaded it at the exchange point, this train would release the semaphore, leaving it up for grabs again. The first train (the one that created the semaphore) would then get notified that the semaphore is available, take it, and come in to pick up the cargo.

In addition to binary semaphores, VxWorks includes a second type known as mutual exclusion, or "mutex," semaphores. These allow a task to have the exclusive use of a resource. The main difference with this method is how the semaphore is initialized. Binary semaphores are always created empty. Mutex semaphores are always created full. A task simply creates a full semaphore and takes it immediately, thus making it unavailable to all other tasks until it’s through with whatever it is doing. Such semaphores are often used to access communications hardware. A task needs to use such equipment, say, an information bus, until its data transfer is over. Cutting the transmission before it's done would be pointless, hence the need for mutex semaphores.
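Both patterns can be imitated with Python's standard threading primitives. This is only an analogy: it does not use the VxWorks API, and `threading.Semaphore` is technically a counting semaphore that is used here in a binary way. The train and cargo names are borrowed from the analogy above and are otherwise made up.

```python
import threading

cargo = []
log = []

# Created "empty": nothing can be taken until someone gives it.
cargo_ready = threading.Semaphore(0)
# Created "full": the first task to ask gets exclusive use of the resource.
bus_lock = threading.Lock()

def carrier_train():
    cargo.append("crate")     # unload at the exchange point
    cargo_ready.release()     # give the semaphore: it is now available

def collector_train():
    cargo_ready.acquire()     # waits here until the carrier gives it
    with bus_lock:            # exclusive access while the transfer runs
        log.append(cargo.pop())

collector = threading.Thread(target=collector_train)
carrier = threading.Thread(target=carrier_train)
collector.start()
carrier.start()
collector.join()
carrier.join()
print(log)
```

The empty semaphore handles the "wait until the cargo is there" part, while the lock plays the role of the mutex guarding a shared resource until the transfer is finished.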

If this sounds clever, it’s because it is. The semaphore system is proprietary, and it became one of VxWorks’ selling points. But during those first few weeks Mars Pathfinder spent on the Red Planet, the RTOS still went beautifully downhill.

A Martian bug

The "information bus" working onboard the Mars Pathfinder was a shared memory used for passing the data between different components of the lander. Predictably, this area was a resource locked with a mutex semaphore. As it turned out, there were three tasks involved in causing the mysterious reboots. The first was a high priority task whose job was to manage the information bus operations. The second was a low priority task, which once in a while would take the information bus mutex to share meteorological data. The third culprit involved was a medium priority communications task.

Here’s how this system was supposed to work: the meteorological data-gathering task was supposed to infrequently seize the information bus mutex. On rare occasions when the information bus management task was scheduled to run while the meteorological data-gathering task was running, the higher-priority task would try to get ahold of the same mutex—and therefore it ended up blocked until the lower-priority meteorological data was written to the bus. So far, so good, as data transfers should go from start to finish. But the third medium-priority communications task entered the scene and caused trouble.

The trouble was that there was an unlikely sequence of events that could schedule the medium-priority task to run when the low-priority meteorology task was running after it caused the high-priority bus-management task to block on the mutex. There was only a split-second window of opportunity for this to happen, but when it did occur, the medium-priority task preempted the low-priority task. One of the many things the halted meteorological data gathering couldn’t do on such occasions was release the mutex semaphore to the high-priority bus management task. In consequence, the medium-priority task indirectly blocked the higher-priority task from running, hence the priority inversion. Of course, this caused the bus management task to enter the delayed state. And once the independent timer working in the kernel figured out that the important thread was not running as planned, it assumed something went really wrong and initiated a total reboot.

Such reboots happened roughly half a dozen times in two weeks, but ultimately VxWorks and its design were not to blame. The system could deal with such issues with a trick called “priority inheritance,” which causes a low-priority task to temporarily assume the higher priority of a task it is blocking on a mutex. If priority inheritance had been working on the Mars Pathfinder, the meteorological data-gathering task would have simply assumed the high priority of the bus management task for the time the bus management task was waiting on the semaphore. This, in turn, would have prevented the medium-priority communications task from preempting it. All that had to be done was to turn on the priority inheritance option before launch.
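The whole episode, inversion and fix, fits in a short tick-by-tick simulation. The Python sketch below is a toy model, not Pathfinder's flight code: the task names, tick counts, release times, and watchdog limit are all invented, a task is assumed to hold the mutex for its entire run, and a larger number means a more urgent priority (the opposite of VxWorks numbering) to keep the code short.

```python
def simulate(inheritance, watchdog_limit=6):
    """Return (trace of which task ran each tick, 'ok' or 'reboot')."""
    tasks = {
        "weather": {"prio": 1, "release": 0, "work": 3, "needs_mutex": True},
        "comms":   {"prio": 2, "release": 2, "work": 6, "needs_mutex": False},
        "bus":     {"prio": 3, "release": 1, "work": 1, "needs_mutex": True},
    }
    owner = None   # which task currently holds the information bus mutex
    trace = []
    for tick in range(20):
        ready = [n for n, t in tasks.items()
                 if t["release"] <= tick and t["work"] > 0]
        if not ready:
            break
        # Tasks waiting for a mutex that someone else holds cannot run.
        blocked = [n for n in ready
                   if tasks[n]["needs_mutex"] and owner not in (None, n)]

        def effective_priority(name):
            prio = tasks[name]["prio"]
            if inheritance and name == owner:
                # The owner borrows the priority of the tasks it is blocking.
                prio = max([prio] + [tasks[b]["prio"] for b in blocked])
            return prio

        runnable = [n for n in ready if n not in blocked]
        current = max(runnable, key=effective_priority)
        if tasks[current]["needs_mutex"]:
            owner = current
        tasks[current]["work"] -= 1
        trace.append(current)
        if tasks[current]["work"] == 0 and owner == current:
            owner = None   # release the mutex when the task completes
        # Watchdog: the bus task must finish soon after it becomes ready.
        if tasks["bus"]["work"] > 0 and tick - tasks["bus"]["release"] >= watchdog_limit:
            return trace, "reboot"
    return trace, "ok"

print(simulate(inheritance=False))
print(simulate(inheritance=True))
```

Without inheritance, the communications task cuts in while the weather task still holds the mutex, the bus task starves, and the watchdog fires. With inheritance switched on, the weather task finishes first, the bus task runs right after it, and the communications task simply waits its turn.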

Therefore, at the end of the day, Pathfinder’s issues stemmed from a human error. VxWorks, thus found not guilty, has gone on to fly on pretty much every rover that has landed on Mars since. Just a few decades after becoming the most widely deployed RTOS on Earth, it managed to become the most popular operating system on the Red Planet, too.

From 2015: An artist's rendering of the BepiColombo mission, a joint ESA/JAXA project, which will take two spacecraft to the harsh environment of Mercury.

ESA

ESA Falls for RTEMS

For the last decade, the space operating systems landscape seemed stable. In the US, NASA was mostly happy with using proprietary VxWorks for its most high-profile missions. But in Europe, the ESA had its own workhorse. The space agency was heavily invested in developing the open source RTEMS, which, according to the ESA’s Maria Hernek, is just as capable but comes without expensive licensing fees.

RTEMS was not initially created to fly European spaceships; its original purpose was actually flying US missiles. The history of this RTOS began with a study performed at the Research Development and Engineering Center of the US Army Missile Command back in 1988. Army researchers concluded that using proprietary real-time operating systems caused a number of problems. Most notably, the government did not own the code, so it couldn’t modify it in any way. Moreover, the study claimed the responsibility for software failures looked a bit unclear, and RTOSes of that era were too slow for missile systems. For all those reasons, the Army decided to build its own RTOS called Real-Time Executive for Missile Systems. The goal was to make an RTOS that was fast enough for guiding missiles, government-owned, easy to run on different processor families, and license-free.

As RTEMS was taking shape, the US military started to realize that its possible applications reached far beyond firing rockets. Hence, the name of the system quickly evolved into the more general Real-Time Executive for Military Systems. And since May 4, 1995, when RTEMS was released as open source and no longer bound to wear a uniform, it has been known as the Real-Time Executive for Multiprocessor Systems.

The European Space Agency has fallen in love with it for two main reasons. The first is that RTEMS was designed from the ground up to be effortlessly ported to new processor families. So, making it work on SPARC LEON radiation-hardened chips developed in Europe for ESA’s space missions could be done with relative ease. The second reason was that the system was highly customizable. Based on the same working principles as VxWorks, RTEMS allowed programmers more freedom since virtually everything in the system could be changed. ESA was totally free to fiddle with the code.

Scheduling is one of the customizable areas where RTEMS differs from VxWorks. In VxWorks, a programmer is stuck with a preemptive priority-based scheduler for tasks with differing priorities and a round-robin scheduler when multiple tasks have the same priority. It can’t be changed. Wind River built it this way: take it or leave it. RTEMS offers a completely different approach.

Of course, RTEMS has a priority-based scheduler with 256 levels of priority just as in VxWorks. There is also a round-robin scheduling method available. Both are used as default schedulers for single-processor platforms. But in RTEMS, you can dispense with each option and go for one of the numerous other scheduling mechanisms instead. There is the Simple Priority Scheduler, a leaner version of the default schedulers that can work under severe memory constraints. The same low-memory scheduler is also available in a variant designed for symmetric multiprocessing systems with multiple processors running in parallel. Another option entirely is the Earliest Deadline First Scheduler, which, as its name suggests, prioritizes tasks with the earliest deadlines. And if you are not happy with any of RTEMS’ options, you are free to throw them all out the window and write your own scheduling algorithm; RTEMS will work with that as well.
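The earliest-deadline-first idea is easy to sketch in Python. This is a generic, textbook-style preemptive EDF loop on a single processor, not the RTEMS implementation; the job names and numbers are invented for the example.

```python
import heapq

def edf(jobs):
    """jobs: list of (name, release_tick, deadline_tick, ticks_needed).

    Returns (trace of which job ran each tick, list of jobs that finished late).
    Job names are assumed to be unique.
    """
    pending = sorted(jobs, key=lambda j: j[1])   # by release time
    heap = []                                    # entries: [deadline, name, remaining]
    trace, missed = [], []
    tick, i = 0, 0
    while i < len(pending) or heap:
        # Admit every job that has been released by now.
        while i < len(pending) and pending[i][1] <= tick:
            name, _release, deadline, ticks = pending[i]
            heapq.heappush(heap, [deadline, name, ticks])
            i += 1
        if not heap:
            trace.append(None)                   # idle tick
            tick += 1
            continue
        job = heap[0]                            # earliest deadline runs
        job[2] -= 1
        trace.append(job[1])
        tick += 1
        if job[2] == 0:
            heapq.heappop(heap)
            if tick > job[0]:
                missed.append(job[1])
    return trace, missed

trace, missed = edf([("telemetry", 0, 10, 3), ("attitude", 1, 4, 2)])
print(trace, missed)
```

The attitude job arrives one tick later but has the tighter deadline, so it preempts telemetry, finishes, and hands the processor back.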

Since opting for this RTOS, ESA has invested lots of time and effort into qualifying RTEMS to software criticality Level B, which is the second-highest level of software reliability recognized by the agency. The ESA uses Level B status to denote software whose failure would cause “critical” consequences. To achieve that, ESA testers had to execute every single line and every single decision point in the RTEMS code. The only higher criticality—Level A—is where the consequences of failure are “catastrophic.” (Sadly, ESA documents do not specify what “critical” or “catastrophic” mean exactly, but you can easily imagine the ISS crashing down on Brussels.)

“I recall the last time we used VxWorks was in one of the instruments on Sentinel 1 spacecraft,” says Hernek. All other modern European space missions, including the most recent Solar Orbiter, flew with RTEMS onboard.

RTOS on a mission

At this point, VxWorks and RTEMS have been used for decades and are astonishingly good at what they do. In an email exchange discussing real-time operating systems in 2004, Gregory Menke, NASA’s software engineer, wrote that in terms of performance, RTEMS and VxWorks were so close that it was impossible to even tell the difference between the two. So, as you might expect, ESA used VxWorks at times, and NASA went for RTEMS on more than one occasion. The two major flight operating systems have even run in parallel on the same spacecraft managing different instruments.

But that doesn’t mean the last decade has been all VxWorks and RTEMS in the world of space operating systems. And sometimes, new challengers came from the most unexpected places—like a bitcoin forum post.

Back in 2013, bitcoin core developer Jeff Garzik posted a humble idea to the Bitcoin Talk Forum: what about building some bitcoin resiliency in space?

"I was researching how to sort of make the bitcoin network even more resilient,” Garzik says. “And I had an amateur space background—my father took me to Space Shuttle launches; he worked at the White Sands Missile Range." Garzik saw two potential paths: the first, according to him, was to rent a bandwidth on an existing satellite and use it to broadcast the blockchain data. "But from the blockchain community standpoint, there were too many single points of failure, there was a significant shutdown risk,” he admits.

The second path, however, was to tap into the ongoing nanosatellite revolution. Garzik envisaged putting a ring of microsatellites in Earth orbit, connected via inter-satellite links, that would store and broadcast the blockchain data.

"I called that the SpaceChain," says Garzik. “The SpaceChain was designed as a self-healing mesh network of multiple satellites that could route around hardware and spacecraft failures. We were looking at a sort of cloud computing model, many plus cheap, where you could make up for failures with software."

Garzik’s idea was quickly forged into the SpaceChain foundation, and the SpaceChain foundation quickly got busy developing the SpaceChain OS, which was to run on all of those satellites.
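What "routing around failures" means in a ring of satellites can be shown with a short Python sketch. The five-node ring, the node numbers, and the breadth-first search are assumptions chosen for illustration; the article does not describe SpaceChain's actual routing protocol.

```python
from collections import deque

def route(links, src, dst, failed=frozenset()):
    """Shortest path in hops from src to dst that avoids failed nodes.

    Returns a list of node ids, or None if no route exists.
    """
    if src in failed or dst in failed:
        return None
    previous = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in links[node]:
            if neighbor not in failed and neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None

# Hypothetical ring of five satellites, each linked to its two neighbors.
RING = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}

print(route(RING, 0, 2))                # normal route
print(route(RING, 0, 2, failed={1}))    # satellite 1 is down
```

When satellite 1 drops out, traffic from 0 to 2 goes the long way around the ring instead; only a second failure on that side would cut the link entirely.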

SpaceChain OS's long-term mission got a lift on a Falcon 9 once upon a time.

SpaceX

Crowd-funded spaceships

SpaceChain OS is made of two key components. First, there is a typical RTOS based on an open source Sylix kernel. According to Garzik, this Sylix OS has been extensively used in China in multiple military and space applications. In a way, this makes it similar to RTEMS. Remember, the acronym currently stands for Real-Time Executive for Multiprocessor Systems, but the letter "M" previously stood for "Missile." Sylix has the same kind of backstory; the only difference is its country of origin.

"It’s been, no pun intended, really battle-hardened," says Garzik, who adds that "the system is reliable, tested, and able to support the most popular embedded space processors. And it is lean, which makes it easy to maintain and keep out of bugs. The Linux kernel has about five million lines of code. Sylix is five times smaller than that."

The second key component for SpaceChain OS is a blockchain technology included to run the constellation network. This blockchain component is also what makes it possible to launch crowd-funded satellites. The idea is that multiple companies, institutions, or even individuals can chip in to fund a satellite running SpaceChain OS. While the Sylix part is responsible for running the spacecraft in the same way RTEMS or VxWorks runs the hardware of modern spaceships, the blockchain component is there to share the spacecraft resources among multiple stakeholders.

"It works like this: to add a node to the SpaceChain network, first you need to go through a qualification process within the SpaceChain organization and, once qualified, you are added to a white list," explains Garzik. "Then you can either pay SpaceChain to build your satellite or do it on your own based on open source hardware specifications, public standards, and public protocols—just like the Internet.” From there, Garzik says the organization needs to launch the satellite or leave the launching to SpaceChain. And once the spacecraft is successfully placed in orbit, the owner needs to pay a registry fee in order to receive a blockchain smart contract that allows the new node to authenticate with other satellites in the network.

It may sound grandiose or even fantastical, but Garzik’s vision is at least slowly in progress. SpaceChain has placed multiple bitcoin nodes in space: in 2018 with the help of a CZ-4B Y34 rocket and in 2019 on the back of a SpaceX Falcon 9 rocket. And their work has been intriguing enough to garner investments from the likes of the ESA itself.

So, no, space does not work on poorly photoshopped versions of Windows 95.

Lee Hutchinson (original image)

The Old Guard responds

The development of new space operating systems is certainly an exciting concept. But so far, the people who’ve actually flown something to the asteroid belt and beyond seem somewhat skeptical about SpaceChain’s ideas.

"SpaceChain OS is not going to replace RTEMS as long as it doesn’t provide any new functionality that we really need and that RTEMS doesn’t have," says ESA’s Hernek.

According to her, any new software product that ESA wants to implement on its spacecraft needs to go through a pretty long period of testing and certification. That’s why people at ESA, NASA, and other space agencies generally try to reuse what they already have as much as possible, since this shortens the overall development process.

"Look, we don’t play with new space software because we think it’s fun. We always have good reasons to do it,” Hernek said. “It’s always either that the software we have available does not solve our problems, that it causes some problems, or something like that.”

On top of that, slowly but surely, the vintage space operating systems are getting closer to the far-reaching goals Garzik talks about. They just do it through incremental changes. In the latest VxWorks 7 release, Wind River introduced multiple new features aimed mainly at making the development process easier and faster. The entire system became highly modular, and the developers have the capability to try multiple versions of critical components like the file system and see what solves their problem. There is no need to wait for the next update of the entire system as is the case with macOS for example. The company also added support for advanced graphical user interfaces so that VxWorks can now run touchscreens and other user-friendly displays likely to appear in future spaceships. Similar functionalities have already been implemented in RTEMS as well.

Still, Hernek admits SpaceChain’s idea of satellites available to multiple users at the same time is quite enticing. "They say the multiple user functionality is something we don’t have, and they are correct," Hernek says. "A dual user system is the best we can manage at the moment." According to Hernek, though, the ESA is currently trying to solve this problem through partitioning flight software. "We have been experimenting with parts of IMA, the Integrated Modular Avionics system, that enables multiple aircraft to work in a distributed network supporting different applications. That’s the system used in the 4th-gen jet fighters," she explains.

So as the list of reasons to launch to space continues to expand these days, so, too, do the abilities of modern space operating systems, where even industry stalwarts can be pushed to learn new tricks by upstarts born out of Internet threads. Hernek and her team at the ESA remain a great example; they’re currently developing various software products that would allow implementing this multi-user kind of architecture in spacecraft.

"Sure, it would be interesting to look at SpaceChain OS and see how they are addressing this problem. For now, though, we’ve got the IMA that allows partitioning. We’re getting there."

Jacek Krywko is a science and technology writer based in Warsaw, Poland. He covers space exploration and artificial intelligence research, and he has previously written for Ars about facial-recognition screening, teaching AI assistants new languages, and communications, CPUs, and AI in space.
