
Programmer on Mars: Shutdown Dammit Until

- Houston, we have problems.
- No, Mark, this is your problem.

How do the programmers and testers of a $400 million rover sleep at night? Especially when the vehicle has gone a whole sol without making contact.


Mark Adler is an American software developer who works in space exploration. He is best known for his work in data compression: he is the author of the Adler-32 checksum and a co-author of the zlib compression library and the gzip utility. He also contributed to Info-ZIP and to the development of the Portable Network Graphics (PNG) image format. In addition, Adler was a mission manager for the Spirit rover in the Mars Exploration Rover program.

Together with Edison (a company specializing in protective relay systems and in applications for simulating experiments), we tell the story of the rover's software developer and of what happened when the team tried to fix a software bug from 225 million kilometers away.

Biography


Adler was born in Miami, Florida, the only child of David and Bertha Adler. He earned a Bachelor of Science in mathematics and a Master of Science in electrical engineering from the University of Florida in 1981 and 1985, respectively. In 1990 he received a Ph.D. in physics from the California Institute of Technology. He lives in La Cañada, California, with Diana Saint-James, and they have two children, Joshua and Zachary. Diana works at the California Institute of Technology and is involved in staging and performing in theatrical productions.



For his work with Jean-loup Gailly on gzip, he received the 2009 USENIX STUG Award for contributions to FLOSS data compression algorithms. [Source - STUG Award]

Career


After completing his Ph.D., Adler worked at Hughes Aircraft in the Space and Communications division on a variety of projects, including analyzing the effects of X-ray bursts on satellite channels, developing error-correcting codes, designing an anti-theft car key, and researching video and image compression (wavelets and MPEG-2). [Source - Caltech: About Mark Adler]

Mars exploration


From 1992 to 1995, Adler was the lead mission engineer for the Cassini-Huygens mission. He then served as the Mars Exploration Program architect at the Jet Propulsion Laboratory (JPL) from 1996 to 1998. In that role he was responsible for planning the Mars exploration missions from 2001 onward.

In 1999 and 2000 he worked on the Mars Sample Return project, which planned three missions to Mars, launching in 2003 and 2005, that would bring samples back to Earth in 2008. However, the project was cancelled after the failure of the Mars Polar Lander mission.

Mars exploration rover




Adler initiated the Mars Exploration Rover project (the Spirit and Opportunity rovers) and was actively involved in carrying it out.

Adler is currently the project manager for the Low Density Supersonic Decelerator (LDSD), a technology for landing payloads of two to three tons on Mars.

Mark's reflections on Mars and on his work (in English).



"Spirit". 18th sol. Anomaly

Told by Mark Adler, September 22, 2006. From the archives of The Planetary Society.

In the previous post, I promised to talk about what happened to Spirit a week after the President announced the national Vision for Space Exploration.

So here is a quick inside look at what it is like to run a priceless national asset.

It was January 21, 2004. Eighteen Martian days (sols) had passed since Spirit successfully landed on Mars, and about a week since it had successfully driven off the lander onto the Martian surface. Everything was going so incredibly well that we could hardly believe it. It was actually strange to watch: the rover was working much better on Mars than it ever had in testing. Spirit was doing real geological exploration on the surface of another world! We felt like the luckiest people on Earth.

Well, our luck soon ran out.

Jennifer Trosper and I took turns as mission manager for Spirit's tactical operations. On sol 18 Jennifer was on duty and I had the day off. I still came in to JPL around noon to give an interview for a documentary. On my way out I ran into Steve Squyres, the principal investigator of the MER (Mars Exploration Rovers) program, who was on his way in to give his own interview. When he saw me, Steve said, "Have you heard about Spirit?" The question and Steve's serious tone instantly snapped me out of my drowsiness. "What do you mean?" I asked, staring at him. Steve said that we had not received any signal from Spirit at the scheduled times, neither through the high-gain antenna directly to Earth nor through the Mars Odyssey orbiter.

Oh my God.

If only the direct link had failed, it could easily have been blamed on bad weather, problems with the Deep Space Network, any number of things. Communicating across hundreds of millions of kilometers is not simple, and problems are common. But in our experience up to then, the link through Odyssey, which was in orbit only a few hundred kilometers away and had been working perfectly, should not have failed.

Everything pointed to a problem on the rover itself, and a very serious one.

Space missions are risky. We are used to that. We think carefully about the most dangerous moments. For MER the greatest risks were, in order: entry, descent, and landing, or as we called it, the six minutes of terror; then the always risky launch from Earth; and then the post-landing operations of deploying the rover, cutting its cables, and driving it off the lander.

Launch, landing, egress. Any one of them can turn your hair gray. With Spirit we had made it through all of them. Everything risky was behind us, we thought. From here on, with due attention and care, no major dangers were in sight. Smooth sailing ahead.

All of which only added to the alarm. What the hell had happened?

I went straight to the operations area, where I and many others spent the next three days almost without leaving. There are no days off in a situation like this. For the next three sols Jennifer was in charge of planning and I was in charge of execution. Planning is done during the Martian night, and tactical operations are carried out during the day, while the rover is awake. So Jennifer and her team worked out what to do, and my team and I did it. Or at least tried to.

On sol 19 we simply tried to contact Spirit and get some response from it. Before we started, as was my tradition, I played a song in mission control that fit the events of the day. For sol 19 I chose "S.O.S." by ABBA. That pretty much summed up Spirit's day. Our communication attempts failed; all we received was a bare signal from the rover's transmitter. Even when there is no data to send, the transmitter still turns on, and that signal told us the important thing: Spirit was still there. It was not completely lost. Although we got no other information, we counted it a good day. I ended my status report that day on an optimistic note: "In the long term, we plan to recover the vehicle, diagnose and correct what happened, and return to the normal work schedule." I ended the reports for sols 20 and 21 with the same words.

On sol 20 we tried even harder to get information out of the rover. Without it, we had no idea what to do or how to restore the rover to working order. After many attempts, we managed to get some current data from the rover over the radio link. Most of it was repeating gibberish, which was a mystery in itself, but the packet still contained enough to understand Spirit's condition. Getting it was excellent news; what it told us was not.

We saw that the rover's internal temperature was much higher than normal and that the battery charge was much lower than expected. Together these two facts clearly indicated that the rover was not going to sleep as it should. Normally its computer runs for only five or six hours a day. That conserves the precious solar energy stored in the battery and keeps the vehicle from overheating. Spirit had not slept in a long time, and perhaps could not fall asleep at all.

We had one sick rover on our hands. Spirit had insomnia and a fever, was growing steadily weaker, was babbling incoherently, and had long since stopped obeying commands.

This was bad. We had one rover dying on Mars, and in two days another, Opportunity, had to go through entry, descent, and landing. Within a week we could easily be left with no rovers at all.

On sol 20 our main task was to put the rover to sleep. We hoped that a shutdown command would reach the vehicle during a communication session. If it did, we would see the session end early, which would confirm that the command had been received. So we sent the command that shuts the vehicle down until a specified time, SHUTDWN_DMT_TIL ("Shutdown Dammit Until"), which overrides whatever the vehicle is doing at the moment (the command names were spiced with a little humor).

We were sure it had worked. Hooray, we had put Spirit to sleep. To check, we sent the vehicle a request that was supposed to go unanswered: a sleeping rover can neither receive signals nor respond to them.

And then ... we received a response signal.

What the...? Spirit was supposed to be asleep! But no. It had decided to work late.

The position of the Earth forced us to postpone the next attempt until the following day. In the meantime the rover kept draining its battery and its electronics kept overheating. We were rapidly running out of time.

Sol 21. We had a plan. The leading theory of the fault, or at least the one that left us some room to maneuver, was that the rover's computer was stuck in a reboot loop. The software's answer to a problem it cannot solve is to reboot. It is the same thing you do when your computer freezes. But since there was nobody there to press the reset button, the rover did it automatically. If the software hits an error during the reboot itself, however, it is doomed to reboot forever.

The developers had wisely programmed a delay between reboots, during which it was possible to talk to the rover. That could explain why commands were executed only intermittently: a command would succeed only if the vehicle received it in the window between reboots.

The idea behind a reboot is simple: whatever caused the problem in the previous session will be gone in the new one. But in this case the problem persisted. That meant Spirit was remembering something across reboots, and that something was causing the failure. This pointed to the flash memory (the same kind as in your digital camera), to a small block of EEPROM, or to a hardware fault. The rover uses flash memory the way your computer uses a hard disk: the file system is stored there.

Once again, the ingenious developers had built in a back door for us. There was a way to force the rover to reboot without touching the flash file system. The radio hardware that receives the signal from Earth can decode a few commands on its own, the so-called hardware commands. Recognizing and executing them does not require the computer. One of these commands tells the computer not to use the file system when booting. Another forces the computer to reboot.
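The reboot loop and the hardware-command back door described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the `Rover` class, the `NO_FLASH` command name, and the model of one command per window between reboots are invented for illustration and are not taken from the actual flight software.

```python
from dataclasses import dataclass


@dataclass
class Rover:
    """Toy model of a computer whose fault survives reboots (hypothetical)."""
    flash_corrupt: bool = True   # persistent state: a reboot does not clear it
    skip_flash: bool = False     # set by the hypothetical NO_FLASH hardware command
    boots: int = 0

    def boot(self) -> bool:
        """Attempt one boot. Return True if it completes, False if it faults."""
        self.boots += 1
        if self.skip_flash:
            return True              # degraded mode: file system is never mounted
        return not self.flash_corrupt


def run(rover: Rover, commands: dict, max_boots: int = 10) -> str:
    """Boot repeatedly until success or until max_boots attempts have been made.

    `commands` maps a boot number to the hardware command that arrives in the
    delay window right after that (failed) boot attempt.
    """
    while rover.boots < max_boots:
        if rover.boot():
            return "up"
        # Delay window between reboots: the only moment a command can land.
        if commands.get(rover.boots) == "NO_FLASH":
            rover.skip_flash = True
    return "still looping"
```

With no commands, a rover with a corrupted flash never gets out of the loop. If `NO_FLASH` lands in the window after the third failed boot (`run(Rover(), {3: "NO_FLASH"})`), the fourth boot skips the file system and succeeds.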

So that is what we did on sol 21. After several attempts we finally managed to boot Spirit into a more or less healthy mode in which it responded to commands and did not babble nonsense. What a relief! We requested the power consumption history for the past few days, postponed the next relay session with the orbiter, and finally let Spirit get its well-deserved and badly needed sleep. This time everything worked.

Now we had a secret weapon for making Spirit work. The rover would still wake up every morning in the reboot loop, but we could quickly send the commands needed to boot it without the file system. That is what we did for the next few days. We had won the race against time, and now we could carefully and methodically find out what had happened, fix it, and continue the mission.

So I went home and fell asleep instantly. My alarm went off five hours later. Why? So that I could get back to JPL and not miss Opportunity's landing that night. A few hours after we regained control of Spirit, Opportunity slammed into the Martian atmosphere at 12,000 miles per hour. The landing was a success, and once again we felt confident: we had two rovers on Mars, both safe. Wow, what a ride.

The end of sol 21 marked the turning point of the recovery. It still took another two weeks to finish the diagnosis, fix the problems (we had to reformat the vehicle's "hard disk", its flash memory), and restore Spirit to full working order.

As we recovered the data accumulated before the failure, we pulled out this beautiful color photograph of the US flag on the Rock Abrasion Tool (RAT) on the rover's arm. The flag was on a protective shield made from wreckage of the twin towers of the World Trade Center. The RAT was designed and built in Manhattan, a couple of blocks from where the towers stood. We put this picture of the Stars and Stripes up on the big monitors in our control center, and I played our national anthem. Everyone stood with a hand over their heart. It was a good moment.

Since then Spirit has worked just fine, apart from a few signs of age such as a noisier motor. As I write this, Spirit has passed the 967-sol mark. Nine hundred and sixty-seven?! Wait, that must be a mistake. Let me check... No, that is right. Incredible.

You are probably wondering what the root cause of the sol 18 failure was. In the end, we found that it was simply a bug in the rover software that we had not caught in testing. As we collected data, the amount of memory in use kept growing. On sol 18 the memory filled up, and the boot process failed because it could not read the file system. We had, in fact, thought about problems that might appear only after many sols of operation. To put those suspicions to rest, we ran a 10-sol test before landing. But we never ran an 18-sol test. Not until Spirit ran it for us on Mars, that is.

We will surely run into serious software errors on other spacecraft more than once. But I guarantee that this particular bug will never catch us by surprise again.

Thanks to Sergey Danshin for help with the translation.
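The gap between a 10-sol test and an 18-sol failure can be shown with a few lines of arithmetic. This is a deliberately simplified sketch: it assumes memory use grows by a fixed amount every sol, and the capacity and growth figures are made-up numbers chosen only so that the memory fills on sol 18. The real mechanism and the real sizes are described in the JPL report, not here.

```python
def first_full_sol(capacity: int, growth_per_sol: int) -> int:
    """Return the first sol on which memory is full, assuming linear growth."""
    if growth_per_sol <= 0:
        raise ValueError("growth_per_sol must be positive")
    used = 0
    sol = 0
    while used < capacity:
        sol += 1
        used += growth_per_sol
    return sol


def soak_test_catches_bug(test_sols: int, capacity: int, growth_per_sol: int) -> bool:
    """A test finds the bug only if it runs at least until memory fills up."""
    return test_sols >= first_full_sol(capacity, growth_per_sol)


# Hypothetical units: 180 units of memory, 10 units consumed per sol.
CAPACITY = 180
GROWTH = 10
```

Under these assumed numbers, `first_full_sol(CAPACITY, GROWTH)` is 18, so a 10-sol test passes cleanly while an 18-sol test would have hit the failure on the ground.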


Sunset on Mars, photographed by the Spirit rover.

Postmortem


For those who want the specifics, there is a report that describes the sequence of events in detail, identifies the root cause, spells out the lessons learned, and lists the changes made in response to the shortcomings that were found.

The Mars Rover Spirit FLASH Anomaly (Glenn Reeves, Tracy Neilson, Jet Propulsion Laboratory)



Another example of how NASA developers work through their bugs, with recommendations: MER Spirit Flash Memory Anomaly (2004)

Interesting Facts



[Source - c2.com/cgi/wiki?MarsSpiritSoftwareProblem ]


How Edison tests software for a power grid monitoring and event visualization system, or for an X-ray tomograph, is shown in the video:

Source: https://habr.com/ru/post/310312/

