📜 ⬆️ ⬇️

PLC error search sequence

Introduction


Quite often in the literature I came across descriptions of errors and even their classification by type.
Although, I confess, I can’t really remember a single case where I would be helped to know which particular type a particular error belongs to. Is that already after finding out the reasons for explaining them to others.
But how people figured out a place and got to the bottom of the error — it was always interesting to me.

System and Error Information

')
Settings (times, mode flags) and commands to the device are sent from the computer to the PLC.
From the PLC to the computer, signals of the device status and time to the end of the command to this device are issued. Signals are packed in words to minimize the amount of reception and transmission.
Commands are issued from the PLC to the device.
The device gives its status to the PLC.

image

Initially, everything worked, but after some time when giving commands, the statuses on the computer in SCADA began to blink out of business and generally behave extremely unfriendly. And only in one place, on one object.

But “dances with sabers” appeared steadily, with each team, which was very pleased.

Search

The numbers in parentheses indicate the locations of the checks corresponding to the numbers in the diagram below.

The principles and dependencies of blinking cannot be understood at once. Perhaps because of fatigue, perhaps due to the lack thereof.

It was found that the error is present not only in SCADA (1) (where it was actually detected), but also in the OPC server (2) .

Further analysis showed that the error is also present in the PLC, at least in the word formed for the computer (3) .

Checking for an error in the status coming from the device - marks the device as a possible source of the problem.

Manual status changes from the device do not change anything. If you change the status from the device by force (forced, permanent), the error is still saved.
Accordingly, these are not impulses that are not visible during monitoring (4) .

Comparison with the codes of other objects on which this error is not present - does not reveal differences. Complete identity. The probability of an error in the PLC program is reduced (5) .

The computer is disconnected as a possible source of error, recording something in this memory area. The error persists. The probability of error due to the computer tends to zero. Accordingly, no matter what, the problem is still in the PLC (6) .

Remark: It was also possible, by turning off the PLC, with hands to change statuses in OPC. But this option was more difficult technically, but in general, these two tests are almost equivalent.

Transferring a status word transmitted from the PLC to a computer in another area of ​​the PLC memory does not give anything. The error persists.

Transferring a piece of code with an error to another block (conditionally function) also does not affect the error. Accordingly, the likelihood that this is influenced by some other extraneous team, writing something different there, is insignificant. The point is this piece.

The codes are gradually removed, and there remains almost a minimum, at which the error persists (7)
  1. Command submission (without it, it is not clear how to check).
  2. The timer from which the time is taken to reset the command.
  3. Formation of the word on the computer from the status and time to reset the command.


image

Removed timer. The command is not reset, but the error disappears and the statuses stop jumping.

Instead, a new timer is inserted. Thoroughly looks around for absurdities. The timer is the most ordinary, nothing unusual. There are 200 more such programs in the program. But the error appears (8)

The formation of a signal from the PLC, as the most likely candidate for the source of error, is considered. The signal is packed for one word compactness, status bits in the high byte, time in the low byte. Three teams:
  1. Write device status to low byte word.
  2. Replacing the high and low bytes with the SWAP_WORD command (statuses are transferred to the high byte)
  3. By AND time entry in low byte word


It seems nothing unusual, and a completely identical system works with dozens of identical devices around. The brain creaks and refuses to help.

Replaces the sequence of packing commands in the word on the working identically, but consisting of other operators (9):
  1. Record device time in low byte words.
  2. In the intermediate variable, the statuses are multiplied by 256, moving to the high byte of the word.
  3. OR is used to record statuses in the high byte of a word


It all worked.

image

After analysis - the situation becomes finally clear.

Cause of error:

Operators have increased the standard timeout time from 1.5 to 10 minutes.
And if 1.5 minutes is 90 seconds, then 10 minutes is 600 seconds.
600 seconds did not fit into the low byte (maximum 256), and part of the time was written to the high.

The essence of the last check:

When recording the time first, and only then the status, the status clogged the overflow bits that came from the time value. And with the reverse sequence of commands, the time, on the contrary, blocked the status with its bits.

Decision:

Time and status were divided into 2 words. Local engineers were asked to carry out maintenance or replace the device with a timeout greater than 5 times the standard time.

And they worked happily ever after, and broke down in one day.

findings


The described error is not particularly complicated, but in my opinion it is quite nice in terms of exponentialness.

In fact, it is not very important where exactly the error is sought - in electronics, in a PLC, on a computer, or somewhere else. General principles are always about the same:

  1. Maximum collect information about the problem - where and how it manifests itself. Oscilloscopes, sniffers, utilities from Rusinovich, logs, thermometers, in general, everything that can be used in this case. Whether it depends on the time of year, the arrival of a cleaner, or barometric pressure.
  2. Remove from suspicion as much as possible. Cutting tracks on printed circuit boards, disabling tags, removing individual computers from the system. Worse, if there are any feedbacks and other handshakes. Then you can either try to organize a check taking into account the absence of a part of the system, or try to emulate a part, for example, artificially giving a signal to the feedback input. In general, to think.
  3. If possible, bring each test to the end, even if a “thought!” Suddenly appeared. Because “thought!” May not work (and often it doesn’t work before the offensive), and you lose your test results.
  4. In the remaining piece - to change everything that causes suspicion. If this software - try to reinstall or replace with the same. In principle, there is an option to start from this point. I personally saw an engineer repairing a 40-50 chip of the K155 series, who had bitten them all off and soldered new ones. But this, in my opinion, is rather an example of how not to do it. Because even if everything works, you won’t get the reality. Moreover, in the described case, this option did not pass and the fault persisted. In general - I did not say that.


Although of course recipes for some situations may be completely useless in relation to others.
But all errors have a specific reason, and this reason can always be clarified.
For example, somehow the reason was a tractor driver with a harrow that drove over a cable. And although there were no difficulties in its elimination, they were enough in the process of writing recommendations, to avoid repetitions ...

I apologize for the likely difficulty of reading.
The topic is not very artistic, plus I tried to shorten it, skipping not very significant moments, such as the forced refusal of BreakPoint, due to the cyclical nature of the program execution in the PLC and the presence of a timer.

Source: https://habr.com/ru/post/157533/


All Articles