As many of you have probably heard, last week the Curiosity rover, which was busy analyzing the samples it had drilled, ran into problems with its main on-board computer. Let's see what exactly happened and how JPL experts plan to solve the problem.
According to NASA, the memory damage on board Curiosity may have been caused by cosmic radiation. As a reminder, last Thursday engineers had to switch Curiosity to its spare computer, for reasons that experts have linked to damage in an area of the rover's memory.

The rover team is now checking the telemetry data and running diagnostic tests to understand what went wrong and how to return the system to a working state.
“We were in a rather strange situation: our software was working, but only partially, so we decided to switch to a ‘clean’ version of the on-board software that also runs on ‘clean’ hardware,” said Curiosity project manager Richard Cook. “The easiest way to do this is to just start using the spare computer.”
Curiosity is equipped with two computers, plainly named A and B, either of which can be used to control the rover. Computer B was used during the flight to Mars; after landing, the rover switched to computer A and has been using it ever since.
[Curiosity's onboard computers are RAD750s: radiation-hardened single-board computers built around the processor of the same name. The processors are manufactured on a 250 nm or 150 nm process, can withstand radiation doses of up to 1,000,000 rad, work in the temperature range from -55 to 125 degrees Celsius, and consume about 5 watts. A system consisting of the processor and the motherboard can withstand up to 100,000 rad and temperatures from -55 to 70 degrees. The computers have 256 kilobytes of EEPROM, 256 megabytes of RAM, and 2 gigabytes of flash memory. Of course, this is not very impressive in 2013, but compared with the hardware of the previous generation of rovers the performance gain is very large. Translator's note.]
The switch from the primary to the backup computer occurred at about 5:30 pm EST (GMT-5) last Thursday. After that, the rover went into the so-called “safe mode”. Over the next few days, engineers will continue connecting computer B to all onboard systems and restoring normal operation of the rover.
This is the most serious problem Curiosity has faced since landing.
“Most likely, we will soon return to normal operation,” said Cook. “Still, this is not the most pleasant experience. You see, the rover is an extremely complex device. There are plenty of things that can go wrong, and we have to take this into account all the time.”
The problem first appeared on Wednesday morning, when control center staff noticed data that seemed to indicate damage to the rover's flash memory. The on-board software was not recording any new data to memory and refused to transmit previously recorded data. The only information that could be obtained from the rover was real-time telemetry.
On the same day, during a communication session via the MRO satellite, telemetry showed that the memory damage was still not fixed. In addition, it turned out that the computer had skipped some pre-programmed actions: it was supposed to go into sleep mode for an hour and then wake up for the next communication window with the Odyssey satellite.
[Figure: the MRO (left) and Odyssey (right) satellites]
“During the second pass, we received some information, which briefly boiled down to the following: ‘Hey, guys, the memory is still damaged, and besides, I didn't go to sleep when I was supposed to, I was awake all this time!’” said Cook.
The next communication window was between 10:30 pm and midnight on the same day (JPL control center time zone). The rover computer was still working, and the engineers decided to switch to system B.
At the same time, Cook noted that the rover's memory was designed from the start to be resistant to errors caused by cosmic rays or other radiation. However, everything indicated that the most sensitive area of memory had been damaged: the directory that records where each piece of data is located.
“Without going into details, we have several levels of protection. The memory itself is self-correcting, and the software is designed to be tolerant of data corruption. We believe that we were extremely unlucky: we received errors in precisely those memory areas that were most sensitive to them.”
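To give a feel for what “self-correcting” memory means, here is a minimal sketch of a classic single-error-correcting code, Hamming(7,4). The article does not say which error-correcting scheme Curiosity's memory uses, so this is an illustration of the general idea only, and the function names are made up for the example.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (a list of 0/1) into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(codeword):
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # A non-zero syndrome is the 1-based position of the flipped bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


# Simulate a single bit flipped by a particle hit and recover the data.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1
recovered = hamming74_decode(word)
```

A code like this repairs any single flipped bit in a word, but two flips in the same word defeat it, which is one reason protection is usually layered.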
[Let me remind you that the rover's software has several levels of response to an emergency. In the event of particularly serious problems, the rover usually goes into “safe mode”: it stops all its activities and waits for the next communication window to send information about the problem to the control center and to receive further instructions.]
“Thus, we simply lost the information about where the data is located. I repeat: in theory, the rover's software should be tolerant of errors of this kind, but we ended up in a situation where some of the software worked as expected, while some began to fail while waiting for data in memory to change. The software simply could not work out where the data was.”
Cook noted that the chances of cosmic rays causing such a problem are extremely low, but this has happened before.
“Imagine an address book full of entries. Instead of damaging one of these entries, cosmic radiation damages the table of contents. This is extremely rare, but, alas, such things sometimes happen.”
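The address book analogy can be sketched in a few lines. This is a toy model, not the rover's file system: the `ToyStore` class, the record names, and the payloads are all invented for illustration. It shows that a hit inside one record damages only that record, while a hit in the directory makes a record unreadable even though its bytes are still intact.

```python
class ToyStore:
    """A flat byte buffer plus a directory mapping names to (offset, length)."""

    def __init__(self):
        self.data = bytearray()
        self.directory = {}

    def write(self, name, payload):
        self.directory[name] = (len(self.data), len(payload))
        self.data.extend(payload)

    def read(self, name):
        offset, length = self.directory[name]
        return bytes(self.data[offset:offset + length])


def make_store():
    # Hypothetical records, made up for this example.
    store = ToyStore()
    store.write("drill_log", b"sample 1 ok")
    store.write("mast_image", b"pixels...")
    store.write("telemetry", b"temp=-20C")
    return store


# Case 1: one bit flips inside a single record. Only that record is affected.
entry_hit = make_store()
entry_hit.data[0] ^= 0x01

# Case 2: one bit flips in the directory. The telemetry bytes are untouched,
# but the store now looks for them in the wrong place.
toc_hit = make_store()
offset, length = toc_hit.directory["telemetry"]
toc_hit.directory["telemetry"] = (offset ^ 0x10, length)
```

In the second case the data is still in the buffer; what has been lost is the knowledge of where it lives, which matches Cook's description of the failure.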
If this guess is correct, restarting the main computer should solve the problem. However, the engineers are in no hurry: they are conducting a detailed analysis of the situation in order to be sure of the cause before taking any action.
“Of course, we can use computer B, and it is every bit as capable as the main one. So in the coming week we will be setting up the software of the second computer to make sure that all the systems are working as they should.”
“In the end, we plan to return to the main computer. If the problem really is memory corruption, it will disappear by itself during boot, since the onboard software will rewrite the partition table from scratch.”
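The article does not explain how the flight software rebuilds this table, so the following is only a sketch of one common approach, under the assumption that each record carries its own small header. The header layout (`HEADER`) and the function names are hypothetical. With self-describing records, a directory can be thrown away and recreated by scanning the storage from the start.

```python
import struct

# Assumed record header: 8-byte name (NUL padded) and 2-byte payload length.
HEADER = struct.Struct("<8sH")


def append_record(buf, name, payload):
    """Append a header followed by the payload to the storage buffer."""
    buf.extend(HEADER.pack(name.encode("ascii"), len(payload)))
    buf.extend(payload)


def rebuild_directory(buf):
    """Scan the buffer from the start and rebuild name -> (offset, length)."""
    directory = {}
    pos = 0
    while pos + HEADER.size <= len(buf):
        raw_name, length = HEADER.unpack_from(buf, pos)
        start = pos + HEADER.size
        name = raw_name.rstrip(b"\0").decode("ascii")
        directory[name] = (start, length)
        pos = start + length  # jump to the next record header
    return directory


storage = bytearray()
append_record(storage, "log", b"abc")
append_record(storage, "image", b"12345")
fresh_directory = rebuild_directory(storage)
```

The rebuilt directory depends only on what is actually stored, so any corruption in the old copy of the table is simply discarded.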
NASA experts expect that Curiosity will be able to continue its scientific research in the next few days.
Update
Today (March 4, 2013) NASA announced that Curiosity is again in “active” mode. According to estimates, it should fully recover and resume its scientific work next week.
The rover left safe mode on Saturday, and on Sunday it again began to use the HGA (high-gain antenna) to communicate with Earth.
“The recovery process is going well,” said Richard Cook, whom we have already met. “It consists of two parts. First, we want to understand exactly what happened to computer A, and second, we need to carry out a series of operations with computer B, for example telling it about the state of the rover: the current position of the arm, the mast, and so on.”
However, the exact cause of the memory failure is still being investigated.
Please report all errors and typos in PM!
As usual, many thanks to Zelenyikot for finding the material.