How it all began
So, in our wonderful device Bercut-MMT (based on PXA320 and GNU / Linux) there is a no less remarkable OTDR module (based on STM32 and NutOS), which is a pulsed optical reflectometer. This bundle works as follows: the user clicks on the screen on various elements of the UI, a little bit of magic happens in the device, and the user's desires are transformed into commands of the type “duration 300” that go into the measuring module. Specifically, this command sets the measurement duration to 300 seconds. The module is connected to the device via USB, CDC-ACM is raised to transfer commands over USB.
Briefly - CDC-ACM allows you to emulate a serial port via USB. So for the top level, our measurement module in the system is available as / dev / ttyACM0. CDC-ACM is used to transfer commands to the module or read the current settings / status of the module. For transmission of the trace itself, the USB Bulk interface served, which devoted all its time to only one thing — the transfer of the trace data from the module to the instrument, as a binary data stream. At some point we noticed that the reflectogram does not come to us completely. So we discovered that USB could lose data.
Schematically, it looked like this:

')
b5-cardifaced is a daemon that accepts D-Bus commands and sends them to the card via the CDC-ACM interface. The result of the execution is sent back via D-BUS.
usbgather is a small program that runs on libusb and is engaged in picking up a trace from the module via the USB Bulk and outputting it to stdout.
Crutches and bicycles
We sat down and we thought - we need to understand whether the whole reflectogram came to us to be able to skip incomplete reflectograms. We began to invent various tricky headers, checksums, and so on. Then they realized that we are inventing TCP. And then a volition was made instead of USB Bulk to start TCP / IP over CDC-EEM. Why CDC-EEM? Because CDC-EEM makes it easy to use USB as a transport for transferring network traffic. On the device itself, there is support for CDC-ECM in the core, and we use NutOS as an operating system as modules, and support for CDC-EEM and TCP / IP stack in NutOS was.
Fix life 3 months
It would seem that nothing foreshadowed trouble. Raised CDC-EEM, configured IP addresses. Ping? Have a ping! Hooray. Changed the data transfer mechanism from USB Bulk to data transfer through a TCP socket. Happiness was just about to come, but suddenly when testing, the network fell, screaming into the dmesg about its uneasy life, our crooked hands, and the turn that came up for sending to our network interface. Like that:
[ 118.289339] ------------[ cut here ]------------ [ 118.293978] WARNING: at net/sched/sch_generic.c:258 dev_watchdog+0x184/0x298() [ 118.301163] NETDEV WATCHDOG: usb2 (cdc_eem): transmit queue 0 timed out [ 118.307726] Modules linked in: cdc_eem usbnet cdc_acm wm97xx_ts ucb1400_ts ipv6 cards button pmmct [ 118.318671] [<c002f750>] (unwind_backtrace+0x0/0xec) from [<c003e5a4>] (warn_slowpath_common+0x4c) [ 118.328017] [<c003e5a4>] (warn_slowpath_common+0x4c/0x7c) from [<c003e668>] (warn_slowpath_fmt+0x) [ 118.337536] [<c003e668>] (warn_slowpath_fmt+0x30/0x40) from [<c02ab738>] (dev_watchdog+0x184/0x29) [ 118.346552] [<c02ab738>] (dev_watchdog+0x184/0x298) from [<c0049938>] (run_timer_softirq+0x18c/0x) [ 118.355731] [<c0049938>] (run_timer_softirq+0x18c/0x26c) from [<c0043f78>] (__do_softirq+0x84/0x1) [ 118.364819] [<c0043f78>] (__do_softirq+0x84/0x114) from [<c004404c>] (irq_exit+0x44/0x64) [ 118.372959] [<c004404c>] (irq_exit+0x44/0x64) from [<c0029074>] (asm_do_IRQ+0x74/0x94) [ 118.380843] [<c0029074>] (asm_do_IRQ+0x74/0x94) from [<c0029b04>] (__irq_svc+0x44/0xcc) [ 118.388792] Exception stack(0xc0425f78 to 0xc0425fc0) [ 118.393819] 5f60: 00000001 c606b300 [ 118.401958] 5f80: 00000000 60000013 c0424000 c0428334 c0454bac c0428328 a0022438 69056827 [ 118.410100] 5fa0: a0022368 00000000 c0425f98 c0425fc0 c002b04c c002b058 60000013 ffffffff [ 118.418232] [<c0029b04>] (__irq_svc+0x44/0xcc) from [<c002b058>] (default_idle+0x34/0x40) [ 118.426367] [<c002b058>] (default_idle+0x34/0x40) from [<c002b5cc>] (cpu_idle+0x54/0xb0) [ 118.434425] [<c002b5cc>] (cpu_idle+0x54/0xb0) from [<c00089f0>] (start_kernel+0x28c/0x2f8) [ 118.442653] [<c00089f0>] (start_kernel+0x28c/0x2f8) from [<a0008034>] (0xa0008034) [ 118.450180] ---[ end trace d7e298087ff4c373 ]---
If briefly, literally in two words - each network interface has a timer that records the time from each last data send, and if it exceeded a certain interval - we see this message. Then it all depends on the driver - some then work fine, some - no.
Root of evil
All of the above was compounded by messages in the dmesg about unknown link cmd. We added more debug and found out that we received a response to echo request on the USB host, which we did not send.
When nothing works, it's time to read the documentation. So we got the CDC-EEM dock, but not from anywhere, but directly from usb.org. It turns out the first EEM packet is not only a handful of data, but also an EEM header, which contains the packet type (control or data) and data length. And yes, CDC-EEM has its echo request / echo response.
But our happiness would not be complete if it were not for one more detail - when receiving service packages, the module hung. Tightly. Together with CDC-ACM.
Our USB module was configured so that the transmission went packets of 64 bytes. Accordingly, one Ethernet packet was beaten for N packets of 64 each and transmitted via USB. Like this:

After a very long study of the situation, we came to the conclusion that what is happening is this: we lose part of the EEM package (yes, USB does not guarantee delivery). But we read the length from the title and rely on it. Accordingly, we read N lost bytes from the next packet, and perceive the following data as the beginning of a new EEM packet and interpret the first 2 bytes as a header. And there may be anything. Up to 1 bit coded, which indicates that this is a service packet. In absolutely bad cases, we catch data that, when interpreted as an EEM header, gives us an Echo Response of immense length. Much more than our operational memory. So we realized that our implementation of usbnet in NutOS requires major improvements.
More checks of good and different
In the process of picking usbnet in NutOS, it was found that the current version is not at all ready to receive service packages. From the word at all. We made a new version that was able to correctly process service packages, namely: we looked at the type of package, because we are obliged to respond to echo according to the standard; checked the length - if it is more than MTU - then we obviously caught garbage. We also found an oddity in the function that triggers the transfer of data at the endpoint: we checked if the zero endpoint was not busy now, and if it was busy, we just left it. The caller of this function always believed that the data transfer was started, and often it turned out that it was not. As a result, we lost data, and in both directions.
There were wars with a TCP-socket - sometimes the data was not transmitted and we did not see why. I don’t know what was leading the NutOS developers, but a lot of functions that have the return type int in any confusing situation returned "-1". Some of them recorded the actual error code in the socket information, some did not. So we had to work out dragging return codes from the bottom, like the function of sending data from the network card, to the very top - functions like NutTcpDeviceWrite? (). After that, we were able to see where the plug had happened.
Then there were all sorts of doping and timeout settings in the socket, adding static entries to the ARP tables on the module and on the device itself: there are only 2 devices in our network: the device and the module, there is no point in obsolete entries in the ARP table.
Results
Our map now has a small TCP server that serves to transfer data from the card to the device. Before starting the measurements, the TCP server is started in the card and the card starts to wait for incoming connections. After the client connects to the TCP server, the card starts the measurements and sends the results to the device via TCP server.
Now, schematically, the operation of the device with the module looks like this:

Characters:
b5-cardifaced - the same as before - transmits commands from the D-Bus to the card and sends the result back to the D-Bus;
nc - netcat itself, reads the trace data from the socket and sends it to stdout for further processing.
After all these adventures, we now have a network reflectometer. Networking, however, is not 100% complete - management occurs through CDC-ACM, and data collection from the module via TCP / IP via CDC-EEM. We still have a small data loss, but due to the use of TCP / IP at the output, we always get a complete trace. We learned a lot about USB in general and CDC-EEM in particular, and I began to love USB a little less than before.
The load test showed that our module based on the STM32F103 can pump 220 kilobytes of data per second over TCP / IP over CDC-EEM, while the module is doing useful work at this time and the USB works without using DMA.