
Phobos-Grunt. Lessons for those left on Earth



Let me recall the background. On November 9 last year, after nearly 15 years of development, several interruptions of the project and postponed launches, a Zenit-2SB launch vehicle carrying the new Russian spacecraft Phobos-Grunt lifted off from Baikonur. The goals were very ambitious: to send an automatic station to Mars, reach its moon Phobos, and take soil samples there for return to Earth. These would have been the first samples of extraterrestrial material physically delivered into researchers' hands since the lunar exploration of "the last century" (a pedant would object that, thanks to the constant delays of Phobos, the Japanese Hayabusa had already delivered a few microscopic particles of interplanetary dust before our spacecraft). And given that, according to the current theory, Phobos is an asteroid captured by Mars, that is, a sample of the same primordial material from which all the planets of the Solar System formed (the Moon, after all, is a "piece" split off from the Earth and not a real "planet"), the expedition had unprecedented scientific significance. It would also have been the first return of a spacecraft from Mars and its moon.
Also important was the matter of prestige and of Russia's open return to "deep space", to interplanetary research, which had ceased back in Soviet times.

Alas, the whole expedition ended quite soon. Immediately after launch it turned out that the spacecraft was "stuck" in a near-Earth parking orbit, was not responding to commands and was not executing its program. On November 24 attempts to restore it to working order were officially discontinued, and in January of this year the spacecraft made an uncontrolled entry into the dense layers of the atmosphere and fell into the ocean, fortunately without hitting anyone on the way down.

A brief official report was published in February on the Roscosmos website. Here is what it says in essence:
The main provisions of the Conclusions of the Interdepartmental Commission for the Analysis of the Causes of an Emergency Situation that arose during the flight tests of the Phobos-Grunt spacecraft, formed in accordance with the order of the head of Roscosmos of December 9, 2011 No. 206

Source http://www.roscosmos.ru/main.php?id=2&nid=18647

[lists the various possible causes discussed and their sources]
An analysis by the commission's experts of possible failures of these systems and units showed (taking into account their condition and the TMI [telemetry]) that at the time the NShS [emergency situation] occurred they could not have been its root cause.
2.2. The cause of the NShS was the restart of both half-sets of the TsVM22 BVK [On-Board Computing Complex] device (a double “restart”), which was controlling the Phobos-Grunt spacecraft on this leg of the flight. After that, in accordance with the operating logic of the Phobos-Grunt BKU [onboard control complex], the spacecraft switched to maintaining a constant solar orientation and waiting for commands from Earth in the X-band, which the design solutions provided for on the interplanetary transfer trajectory. [...]
2.3. The most likely factor that could have been the primary cause of the double “restart” is a local impact of heavy charged particles (TZCh) of outer space, which led to a failure in the RAM of the computing modules of the TsVM22 sets during the second orbit of the Phobos-Grunt spacecraft.
The RAM failure could have been caused by short-term inoperability of the ERE [electronic components] due to the impact of TZCh on the cells of the TsVM22 computing modules, which contain two chips of the same type, WS512K32V20G24M (the cells of the computing modules are located in a single housing, parallel to each other). The impact led to distortion of the program code and triggered the “watchdog” timer, which caused the “restart” of both half-sets of the TsVM22. The model of such an interaction of TZCh with EKB [electronic component base] is not regulated by regulatory and technical documents. The Commission considers it necessary to develop and introduce in the organizations of the rocket and space industry regulatory and technical documents containing modern models of the ionizing radiation of outer space and guidelines for their use.

From the scattered and fragmentary information about how the onboard computers of Russian spacecraft are built, it was possible to gather that Phobos-Grunt was to use the new onboard computing complex BVK TsVM22, produced by Tekhkom, a division of Argon Design Bureau. It was the transition to the TsVM22 that explained the last delay and the move of the launch from the previous launch window to the current one. For about two years Phobos was (among other things) being refitted for the new, compact BVK, built on modern microelectronics and weighing not 30 kg as before but only 1.5 kg. And in space every gram, let alone every kilogram, is worth its weight in gold (the approximate cost of delivering a kilogram of cargo to the lowest near-Earth orbit is about 3000-4000 USD)! And a flight to Mars is more than just reaching near-Earth orbit. Every kilogram of "hardware" saved lets you put a kilogram of useful scientific instruments on the spacecraft.
No wonder that taking advantage of such savings was very tempting.
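
To put the figures quoted above side by side, here is a back-of-the-envelope sketch in Python. It uses only the numbers from the text (30 kg, 1.5 kg, 3000-4000 USD per kilogram); the function name is made up for illustration, and the result covers delivery to low near-Earth orbit only, not the rest of the way to Mars.

```python
# Figures taken from the text; only the multiplication is added here.
OLD_MASS_KG = 30.0
NEW_MASS_KG = 1.5
COST_PER_KG_USD = (3000, 4000)  # quoted range for the lowest near-Earth orbit


def launch_cost_saving(old_kg, new_kg, cost_range):
    """Return (mass saved, low cost estimate, high cost estimate)."""
    saved = old_kg - new_kg
    low, high = cost_range
    return saved, saved * low, saved * high


saved_kg, low_usd, high_usd = launch_cost_saving(
    OLD_MASS_KG, NEW_MASS_KG, COST_PER_KG_USD
)
print(f"{saved_kg} kg saved, roughly {low_usd:,.0f} to {high_usd:,.0f} USD to low orbit")
```

The 28.5 kg saved correspond to roughly 85,500 to 114,000 USD at the quoted prices, and, more importantly for the mission, to 28.5 kg of payload freed up for instruments.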

On board Phobos there were two TsVM22 modules operating in parallel and independently, providing hot redundancy in case either module of the pair failed. Such duplication is common practice in aviation and space technology.

In the wake of the general annoyance caused by the regular failures in the Russian space program of late, quite irritating and ridiculous rumors have even circulated that Phobos supposedly used ordinary Chinese electronics, and that this is what let it down. In fact, this is not so.
Here is what James Hamilton writes about this chip in his blog, in an article about the effect of memory failures on server hardware:

“This SRAM is manufactured by White Electronic Designs: "W" is the manufacturer, "S" stands for static RAM, "512K32" means 512K words of 32 bits, "V" is the improvement mark, "20" is the 20 ns memory access time, "G24" is the package type, and "M" indicates a military-grade part.”

(SRAM, or static RAM, is a memory chip whose cells, unlike those of the DRAM, or dynamic RAM, traditional in personal computers, retain their state when not being accessed and need no "refresh". It is widely used in industrial electronics.)

Source: http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx
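
The naming scheme described in the quote can be written down as a small decoder. This is a sketch in Python: the field meanings come from the quote, while the regular expression and the function name are assumptions made for illustration and do not reproduce any manufacturer's official part-numbering grammar.

```python
import re

# Hypothetical pattern: it encodes only the fields explained in the quote above.
PART_PATTERN = re.compile(
    r"^W"                    # manufacturer mark
    r"S"                     # static RAM
    r"(?P<kwords>\d+)K"      # number of words, in units of 1024
    r"(?P<width>\d+)"        # word width in bits
    r"V"                     # improvement mark
    r"(?P<access_ns>\d+)"    # access time in nanoseconds
    r"(?P<package>G\d+)"     # package type
    r"M$"                    # military grade
)


def decode_part_number(part):
    """Split a part number of the quoted form into its numeric fields."""
    match = PART_PATTERN.match(part)
    if match is None:
        raise ValueError(f"unrecognized part number: {part!r}")
    words = int(match["kwords"]) * 1024
    width = int(match["width"])
    return {
        "words": words,
        "word_bits": width,
        "access_ns": int(match["access_ns"]),
        "package": match["package"],
        "capacity_bytes": words * width // 8,
    }


info = decode_part_number("WS512K32V20G24M")
print(info)
```

For the chip named in the report this gives 524,288 words of 32 bits, that is, 2 MiB per chip, with a 20 ns access time.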

However, alas, even the use of genuine “white” American military-grade microelectronics was not enough.

This is the classic problem of insufficient design work and, more broadly, apparently, of low engineering competence in general. Arranging the two BVK boards so that their memory chips sat close enough for a single particle to pierce both and knock out both redundant computers simultaneously is an obvious top-level design flaw.
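
Why a shared weak spot defeats duplication can be shown with a few lines of arithmetic. The sketch below uses the textbook beta-factor model of common-cause failures, which neither the report nor the sources quoted here mention; the probability `p` and the shared fraction `beta` are invented illustrative values, not data about the TsVM22.

```python
def both_fail_independent(p):
    """Probability that two fully independent units fail in the same interval."""
    return p * p


def both_fail_common_cause(p, beta):
    """Beta-factor model: a fraction beta of each unit's failures hits both units at once."""
    shared = beta * p                    # one event takes out both units
    separate = ((1.0 - beta) * p) ** 2   # two unrelated failures coincide
    return shared + separate


p = 1e-3  # assumed per-unit failure probability, illustrative only
for beta in (0.0, 0.1, 1.0):
    print(beta, both_fail_common_cause(p, beta))
```

With `beta = 0` the pair behaves like two independent units and fails together with probability `p * p`; with `beta = 1` every failure is shared and the pair is no more reliable than a single unit.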

This, apparently, is the classic problem of “who made the costume?” from the famous Zhvanetsky-Raikin monologue. “Any complaints about the buttons?” The computing complex may well be beautiful in itself. It simply occurred to no one that placing two chips side by side increases the likelihood of a simultaneous destructive radiation effect on both. Nobody looked at the assembly from that angle. Or, as the official report dryly puts it: “The model of such an interaction of TZCh [heavy charged particles (“cosmic rays”)] with EKB [electronic component base] is not regulated by regulatory and technical documents.”

But, alas, that is not all. Matters appear to be even worse with design competence.
Surprising but true: back in 2005, the proceedings of the “Radiation Effects Data Workshop”, published by the IEEE and devoted to radiation exposure and the effects of heavy charged particles on electronic components, stated plainly:

“Recent testing of 1M and 4M monolithic SRAMs at Brookhaven National Laboratories has shown them to be extremely sensitive to single-event latchup (SEL). We have observed SEL at the minimum heavy-ion LET available at Brookhaven, 0.375 MeV-cm2/mg.”
Source http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?reload=true&arnumber=1532657

But these are the very chips that Tekhkom chose for building the TsVM22! And this behavior has been known since at least 2005.

Apparently Phobos was doomed from the very beginning. Sooner or later, in the fairly harsh radiation conditions of open interplanetary space, this effect would have struck. But while latchup can in principle, with luck, be cured by a cold reboot of the unit from the backup system, the simultaneous failure of both units (caused, first of all, as the report indicates, by their design) proved fatal. According to the report, the failure was so “unanimous” that Phobos did not even transmit a failure message, and the control center, with foreign assistance, managed to obtain only rather fragmentary telemetry (apparently from the “dumb” automation), which indicated only the almost complete inactivity of the onboard computers and the failure of the entire complex.

A few explanatory words about the “latchup effect” mentioned above. It is a specific effect that causes a kind of “freeze” of an SRAM memory cell (as shown above, it occurs when a heavy charged cosmic-ray particle passes through). As a rule it can only be cleared by powering the SRAM off completely, and sometimes it leads to irreversible failure.

In the article “Did Bad Memory Chips Down Russia's Mars Probe? Moscow blames radiation damage to an SRAM chip, but does it add up?”

Source http://spectrum.ieee.org/aerospace/space-flight/did-bad-memory-chips-down-russias-mars-probe

published in the online magazine IEEE Spectrum, Steven McClure, a NASA specialist at the Jet Propulsion Laboratory (JPL, NASA's oldest space engineering division) and head of its Radiation Effects Group, states explicitly that SRAM chips of this kind are not considered for use in space equipment because of their low radiation resistance, which is well known to specialists.

“The WS512K32 chip is well known and widely used in military and aviation equipment, but not in space technology,” says McClure. “Neither its manufacturer nor the commercial vendors using this chip have carried out radiation testing or published standards and specifications for such effects on this chip. It might perhaps be used in space technology for small-scale tasks, in orbiters, and in non-critical roles, but not as a component of the main control computer of an interplanetary station that has to work in outer space for several years,” McClure told the author of the article.

The article also noted that, for some strange reason, the Phobos algorithms did not allow for a failure like the one that happened occurring in near-Earth orbit, which is exactly where the accident took place. In the event of such a failure the spacecraft goes into so-called safe mode: using “dumb”, non-computer means of simple automation, it orients its solar panels toward the Sun and switches on the command radio link to receive commands from Earth (“gives a console”), through which the system can be brought back to work.

The automation worked: the spacecraft oriented itself correctly and switched on radio reception on the emergency channel. However, the algorithm did not provide for a failure (and, accordingly, for receiving commands over the emergency channel) during the orbit-insertion stage. A failure and the resulting intervention from Earth were provided for only from the moment of entering the “departure trajectory”.
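
The gap described above can be sketched as a small model. Everything in it is hypothetical: the phase names, the `SafeModePolicy` class and its fields are invented for illustration and say nothing about the real onboard software. The only thing taken from the text is the logic: the simple automation always orients the spacecraft and switches on the receiver, but recovery by ground command was foreseen only from the departure trajectory onward.

```python
from enum import Enum, auto


class FlightPhase(Enum):
    PARKING_ORBIT = auto()         # near-Earth orbit right after launch
    DEPARTURE_TRAJECTORY = auto()  # after leaving near-Earth orbit


class SafeModePolicy:
    """Hypothetical description of what happens after a double restart."""

    def __init__(self, recovery_phases):
        # Phases in which the designers foresaw recovery by ground command.
        self.recovery_phases = frozenset(recovery_phases)

    def react(self, phase):
        return {
            "sun_pointing": True,   # simple automation, works in any phase
            "receiver_on": True,    # emergency channel is listening
            "ground_recovery": phase in self.recovery_phases,
        }


as_described = SafeModePolicy({FlightPhase.DEPARTURE_TRAJECTORY})
all_phases = SafeModePolicy(set(FlightPhase))

for policy in (as_described, all_phases):
    print(policy.react(FlightPhase.PARKING_ORBIT))
```

Under the policy as the article describes it, a failure in the parking orbit leaves the spacecraft pointed at the Sun and listening, yet with no foreseen way to recover it. A policy covering every flight phase is shown only for contrast.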

The article cited above puts it quite bluntly:
Source http://spectrum.ieee.org/aerospace/space-flight/did-bad-memory-chips-down-russias-mars-probe

“The official report released on February 3 only feeds further rumors of fundamental errors in hardware and software, as well as gross violations of safety standards (during development).”

James Hamilton commented on this, by and large, ridiculous design error:

“This mistake is astounding. It would seem that reasonable people could never have allowed it; the error is obvious and lies on the surface. Nevertheless, errors of this kind are made in large systems here and there, again and again. The experts, each in their own field, do a good job, but the interactions between such “vertical” segments are complex (separately the construction of the computing complex, separately its placement in the spacecraft, separately its programming, separately the development of the “cyclogram”, the sequence of operations and actions at launch and during the flight. Author's note), and if the general understanding of the product and of the “cross-vertical” relationships is not deep enough, such design flaws may go undetected (see above on the problem of “who made the costume?”. Author's note). Good specialists create good components, but when all the components are combined into a complete system, we see problems here and there between the components and in their interaction.

Often, good “vertical” specialists do not see the product being created as a whole and know only their own component well. There are two solutions: 1) well-defined and well-documented interfaces (in a broad sense) between the components, whether hardware or software, and 2) dedicated, experienced and knowledgeable engineers who deal specifically with the interaction of components and the operation of the system as a whole. Assigning this role to a technically unqualified manager, as sometimes happens, is often ineffective.

The problems and errors caused by “complexity blindness” are often very serious and, at the same time, depressingly obvious in hindsight, as in the example discussed above.”

P.S. A few years ago I happened to talk with a graduate of the Moscow Aviation Institute who had done his pre-diploma internship at the Tupolev Design Bureau. He spoke enthusiastically about the specialists he had met there. “The old-timers are real bison, with an incredible amount of experience, walking reference books and encyclopedias, but they are all past retirement age, and they are simply dying out. The average age in the bureau is close to 60. Everyone is either working out the years until retirement or is a working pensioner. The few younger people are rare enthusiasts, yesterday's students, who last two or three years and then flee those salaries and that hopelessness for business or management. And as for the students MAI is turning out now... There is no one in the ‘middle’.”
I think the situation in the space industry is not much different. Stories like this one are the result.

P.P.S. I thought for a long time about whether Habr needs such an article and where to post it at all. But then, yet again, the story of Phobos was being discussed here, with rumors retold from TV and the usual cursing of “wretched Rashka”, and it seemed to me that someone might be interested in how the matter ended and what really happened.

P.P.P.S. I deliberately wanted to confine myself to facts and to do without the usual Habr hysteria about “sawing”, “throwing down”, “Skolkovo” and “the enemies of Russia are knocking out our Martian stations with their deadly radars”. Only facts and the direct speech of experts.

UPD: In the comments someone gave a link to an open letter by a former leading specialist of NPO Lavochkin, our main space engineering center, which, among other things, designed and built Phobos.
open-letter.ru/letter/26645
Everything is much as expected and in line with what was said in the article above:
“I want to mention the artificial division of the bureau's structure. It is split into Centers, each with its own director, his deputies, planning departments, and so on. And this organization has led to real disunity in a once-unified design bureau.
[...]
As a result, on the one hand, services are duplicated (for example, several units work on gearboxes, each in its own way); on the other hand, one and the same drive is designed in three Centers: the control unit in one, the electrical part in another, and the mechanical part in a third. And none of these parts wants to understand the others.”

The letter was addressed to Deputy Prime Minister Sergei Ivanov in March 2011.

Source: https://habr.com/ru/post/139819/

