nlited
July 18 2016
My goal has been to develop a better method to monitor the performance of the BLIS sensors, particularly in regard to detecting and diagnosing "Video Capture Timeout" (VCT) errors. The VCT errors present a real challenge, as they are extremely rare, random, and sometimes leave the sensor in an uncommunicative state. Some VCT errors have been traced back to faulty hardware, especially related to the connectors holding the ribbon cable between the MT9M032 video sensor and the Gumstix COM board. These hardware faults are generally irrecoverable and result in the sensor being completely non-functional. There are indications of another class of VCT errors that are spurious and sporadic. These errors are the focus of my investigation.
The VCT error is really an umbrella term for any hardware or software fault that disrupts the flow of video frames into the Blister process for more than 5 seconds, after which Blister reports a "Video Capture Timeout" error to the console and log. The VCT error is actually a symptom triggered by any of a wide variety of causes, some of which have nothing at all to do with the video hardware. Some of the causes I have identified so far:
The video data flows from the MT9M032 video sensor into the OMAP-ISP video subsystem through a ribbon cable with a 27-pin 0.3mm pitch latch connector on either end. If the cable shifts or becomes dislodged from the connectors, the video signals are no longer received by the OMAP-ISP and video frames are no longer acquired. This is a common problem with sensors that are used for benchtop testing, especially if the connectors are subjected to multiple clamp-and-release cycles, as they have a very limited lifetime (in terms of clamp cycles) before they fail to clamp properly. This should not be a problem with production sensors where the ribbon cable is actuated only once during assembly.
The first generation sensors suffered from thermal stress that would dislodge the video cable over time, eventually resulting in permanent VCT errors. This problem seems to be fixed with the BLIS 1.5 generation.
These VCT errors typically start as rare and sporadic, become more frequent and persistent over time, and eventually become permanent, at which point no frames are acquired at all.
VCT errors due to the video cable require hardware maintenance. From a software perspective, the only information available is that no frames are acquired.
The BLIS cpu communicates with the MT9M032 video sensor and the MSP430 FPS generator using a common I2C bus. It was found that these two peripherals would drive the bus simultaneously, blocking the cpu from sending data. This was addressed by using two GPIO pins to enable and disable the peripherals as needed, controlled by the omap3-isp.ko driver. Access to these GPIO pins is controlled by the system's gpio driver, which makes sure each gpio pin is controlled by only one driver at a time.
If the omap3-isp.ko driver is unexpectedly unloaded and reloaded, it may not release the gpio pins properly using the gpio driver. The gpio driver will then deny access to the pins when the omap3-isp driver is reloaded, believing that the pins are still controlled by another driver. The omap3-isp driver will then be unable to set the I2C gates appropriately.
This only affects the start-up configuration of the MT9M032 video sensor. If the MT9M032 has already been configured since the most recent power cycle, this situation can be ignored. If the MT9M032 has not been properly configured previously, the I2C gate will block Blister from configuring the sensor properly and video will never start, resulting in a VCT error.
This is generally an ugly but benign situation that only occurs during workbench experiments; I do not believe this is a problem in the production environment.
I have observed cases where the omap3-isp fails to initiate video capture properly after restarting the Blister process. I believe, but have not positively confirmed, that this happens only when the OMAP-ISP video subsystem is not completely reset by unloading and reloading the driver. I have traced this problem to the point where I believe it is related to configuring the video buffers for DMA transfers.
The cause, severity, and possible corrections for this condition are unknown and would require further investigation into the omap3-isp driver. The standard operating procedure for restarting production sensors includes reloading the omap3-isp driver, so I'm not sure if this problem is occurring in the production environment.
I have long suspected, but have not had the tools to properly prove, that the Blister process can be "held off" for long periods during prolonged operation. No video frames will be processed during these periods, and if the period is longer than 5 seconds it will be reported as a spurious VCT error.
I have had only anecdotal evidence of this and do not have any data to indicate the frequency, duration, or cause. Simple text logs are not very useful, as I need to log the exact timing of a tremendous number of events over a prolonged period to properly understand what is happening.
Diagnosing the VCT errors is made especially challenging by the fact that my benchtop sensor does not exhibit any VCT errors at all. It is difficult to fix a problem that I can't observe. Therefore, my first goal is to develop a logging system that can be used on production sensors, introduce minimal performance overhead, run in a production environment, and collect information that would allow me to diagnose the problem on sensors where this rare and random event actually occurs.
My approach has been to create a logging system that can trace events at an extremely high time resolution, at very high frequencies, with minimal performance overhead, and avoid use of the file system. I also want to be able to log events that occur inside the real-time interrupt handlers, inside device drivers, in the linux kernel, and in normal processes. The traces then need to be presented in a visual format that makes it easy to recognize patterns and anomalies without writing specific code or grepping through gigabytes of text logs.
Fortunately, I had dealt with similar scenarios before and had developed DbgOut, a logging tool for a previous project to handle a very similar situation in a Windows driver project. I had already adapted it to cross-compile for the BLIS project, so most of the hard work of porting the project from x86/Windows to ARM/Linux had already been accomplished. Logging from a real-time context was a new challenge and required new work resulting in DbgOutRT.
My goal was to log every single interrupt from the MT9M032 and every frame processed. This should tell me if the VCT errors were due to the Blister process failing to keep up with the video frames, or the video frames failing to arrive. Logging these events over a prolonged period should also confirm or disprove the existence of the system "Dead Zones" and also tell me whether they affected the incoming video interrupts as well.
The majority of the development work was spent figuring out how to handle real-time logging. The real-time context (including linux kernel "atomic time") places severe restrictions on what actions can be performed: calls into the kernel scheduler are not allowed, only previously locked kernel memory can be accessed, and most system services are unavailable. Any violation typically results in an immediate and irrecoverable kernel panic and system hang. This task was further complicated by the lack of (and errors in) documentation for the Xenomai environment, working with a linux kernel that is nearing its scheduled EOL, a complete lack of debugging tools beyond printf, the complete unsuitability of printk for high-frequency/low-latency event logging, and the added challenge of RT context restrictions. Fortunately, BLIS reboots quickly.
Another significant challenge was to develop a logging mechanism that accurately revealed a lack of activity. Every logging system can be overwhelmed by being asked to log more data than the cpu can process. There are two approaches to this dilemma:
Write the logging system so that every byte is guaranteed to be transmitted (or displayed) before returning. Implicit in this approach is the assumption that data is never lost or mangled during transmission or display. This is the approach taken by the linux printk logger.
When this system is swamped, it steadfastly refuses to drop data, denying cpu cycles to the rest of the system. This is extremely disruptive to the system as a whole, and is unacceptable for use in a production environment.
This approach designs the logging system to minimize the effect on the overall system, returning from the logging call before the data has been transmitted, then transmitting the data from a normally scheduled task. This makes the logging call deterministic and is far less disruptive, especially for RT and kernel logging. The downside is that it requires writing the events into pre-allocated buffers for transmission later, opening the door to overrunning the buffers and dropping data. The system needs to make it obvious when data is dropped, so that a lack of activity can be confidently attributed to either data loss or an actual lack of activity on the target system. This approach has the added benefit of detecting data loss or corruption during transmission.
DbgOutRT takes the second approach, "Fast and obvious." It is also designed to handle very large data sets, with the goal of tracing everything on the target system for extended periods and offloading the data to servers with practically unlimited storage.
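The "fast and obvious" approach can be sketched as a fixed-size ring buffer whose writer never blocks: when the buffer is full, events are dropped and counted, so the reader can flag the gap instead of silently losing it. This is a minimal single-threaded illustration with hypothetical names (TraceRing, trace_put, trace_get), not the actual DbgOutRT implementation:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a non-blocking trace ring: the writer returns
 * immediately whether or not the event fit, and overruns are counted so
 * the reader can mark dropped data in the log. */
#define RING_SIZE 8                 /* small for illustration */

typedef struct {
    uint64_t timestamp;
    uint32_t event_id;
} TraceEvent;

typedef struct {
    TraceEvent events[RING_SIZE];
    uint32_t head;                  /* next write slot */
    uint32_t tail;                  /* next read slot */
    uint32_t dropped;               /* events lost to overrun since last read */
} TraceRing;

/* Writer side: never waits. On overflow the event is discarded but the
 * loss is recorded, keeping the log honest about missing activity. */
static void trace_put(TraceRing *r, uint64_t ts, uint32_t id)
{
    uint32_t next = (r->head + 1) % RING_SIZE;
    if (next == r->tail) {          /* ring full: drop, but make it obvious */
        r->dropped++;
        return;
    }
    r->events[r->head].timestamp = ts;
    r->events[r->head].event_id = id;
    r->head = next;
}

/* Reader side (normally scheduled task): drains one event and reports
 * any accumulated drop count so the gap can be flagged. */
static int trace_get(TraceRing *r, TraceEvent *out, uint32_t *drops)
{
    *drops = r->dropped;
    r->dropped = 0;
    if (r->tail == r->head)
        return 0;                   /* empty */
    *out = r->events[r->tail];
    r->tail = (r->tail + 1) % RING_SIZE;
    return 1;
}
```

On the real system the writer runs in RT or interrupt context and the reader in a normally scheduled task, with the indices updated atomically; the sketch above only shows the drop accounting.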
I was eventually able to create a system that allows me to trace (event timestamps) and log (text) from all contexts (RT, NRT kernel, user space) using a common interface, and thread all the data into a single coherent log. These logs are viewed using DbgScope, a Windows program that presents the traces in a natural visual display.
TL;DR: The long "dead zones" (6 seconds) displayed by an early development version of the DbgOutRT logger were caused by a race condition when the 32-bit cycle counter rolled over. There are smaller dead zones on the order of 500ms that bear further investigation.
Below are views of a trace generated while Blister ran on my bench system for 10 minutes. There are 2.3 million trace events being displayed. The trace data can be dumped to a (very large) CSV text file and imported into Excel for statistical analysis.
There were three data drops (or data corruption) regions detected, marked by the red dots above the traces. Any inactivity immediately after these marks should be ignored as dropped data. The three data drops are 100ms or less. This leaves several areas of no activity that are immediately obvious. This includes one that is nearly 6 seconds long, enough to trigger a VCT error report.
The DbgOutRT logging system is still a work in progress and there is still the possibility that these Dead Zones are being falsely reported. Given this caveat, I believe the traces confirm something I have suspected for a while and that the effect may be more severe and frequent than I thought.
However, I think it is unlikely that the production units are experiencing these dead zones to this extent. A 6-second dead zone every 3 minutes would be hard to overlook. It may be that a difference in configuration between my bench system and the production systems causes different behavior, or it may be a symptom of the aging of the flash file system on my bench unit.
It is too early to draw any definite conclusions from this data, other than this: There is something interesting happening that needs to be investigated. If possible, I would like to generate traces from production units to see if these same dead zones are present.
This is a typical event sequence showing the accumulated latencies from the RT interrupt handler, the NRT task, the video IOCTL calls, and the start of frame processing.
|Event||Accumulated latency|
|RT interrupt||Unknown, can only be measured externally.|
|Frame processing start||81.929us|
|Frame processing complete||7.890ms|
DbgOutRT has become quite capable: I can log events, text, and data from user-mode processes, kernel-mode drivers, and now even real-time context. The data collection module compiles for x86 and ARM, Windows and Linux. DbgOutRT uses the cpu cycle counter as its tick source. The BLIS sensor runs at 720MHz, so each tick represents 1.4ns. On a 4GHz Intel Core system, each tick would be 250 picoseconds. Data collection is limited by network bandwidth, not storage.
I have concentrated on the data collection side, the DbgOutRT infrastructure. The DbgScope UI is at a usable level, but there are many missing features and UI bugs. DbgScope is at a development level that makes it a great tool for my own use, but I would be reluctant to give it out to the general public in its current state.
There is an intermediate component, DbgOutRelay. An instance of this program runs on the target system to relay data from DbgOutRT to either local storage or across the network to another instance of DbgOutRelay running on a remote computer. This component is also at a usable, but very "beta" level.
Wednesday, July 20 2016
My hypothesis is that there are periods, the so-called dead zones, when the system is running with interrupts disabled. These periods are rare but persist for up to 6 seconds. However, there is doubt about whether this is actually happening or a bug in the logging system. I don't know which since I am not sure my system is behaving properly.
To confirm or refute this hypothesis, I needed to run this same logging test on a known good system. Heather has observed the performance of the sensors on her bench for extended periods; they are behaving normally and can be treated as known-good benchmarks. I would expect traces generated on her system to show no, or at the most very small and rare, dead zones. If there are no dead zones shown, my confidence that the dead zones on my system are real improves. If the dead zones are present, especially if they are large and frequent, my confidence in the traces drops and I need to focus on looking for bugs in the logging system.
I put together a test pack containing the DbgOutRT binaries to be downloaded to Heather's laptop and sensor.
Traces from the BLIS sensor on Heather's bench (0509A at 192.168.100.239) show the same "dead zones", each about 6 seconds long, occurring at 94, 432, 487, 584, and 672 seconds. The test ran for 738 seconds (about 12.3 minutes).
Five dead zones of 6 seconds each in the space of 12.3 minutes would be immediately obvious. The logging results do not match reality, so I need to look for a bug in the system.
The first clue is the number 6. Or more precisely, 5.96. Each gap is exactly the same width of 5.96 seconds. At 720MHz, this is 4.3 billion ticks, and 4.3 billion is very close to one of those magic numbers: 4,294,967,296 or 0x1'0000'0000. Put another way, 0x1'0000'0000 ticks at 720MHz is exactly 5.96523 seconds. This isn't a coincidence; it is a roll over bug.
The cpu's cycle counter is only 32 bits long, with a maximum value of 2^32-1 (4,294,967,295 or 0xFFFF'FFFF). It is incremented 720 million times a second, so it counts from 0 to 0xFFFF'FFFF every 5.96523 seconds and then rolls over to 0.
A maximum range of less than six seconds is far too short for any useful logging, so I extended the cycle counter from 32 to 64 bits by keeping a second 32-bit roll over counter. This extends the maximum range from six seconds to roughly 812 years, which should be long enough to cover the operational lifetime of any of the BLIS sensors.
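The arithmetic is easy to sanity-check. These helper functions are mine, not part of DbgOutRT; by this math the 32-bit roll over period at 720MHz is about 5.96523 seconds, and the extended 64-bit range works out to roughly 812 years:

```c
#include <assert.h>

/* Seconds between roll overs of a 32-bit counter: 2^32 ticks / clock rate. */
static double RolloverSeconds(double hz)
{
    return 4294967296.0 / hz;                       /* ~5.96523s at 720MHz */
}

/* Range of an extended 64-bit counter in years: 2^64 ticks / clock rate. */
static double RangeYears64(double hz)
{
    double seconds = 18446744073709551616.0 / hz;   /* 2^64 ticks */
    return seconds / (365.25 * 86400.0);            /* ~812 years at 720MHz */
}
```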
Counting the roll overs is simple. Each time I read the cycle counter I check whether the new value is less than the value read previously. Since the counter always increments, the only time this is true is when the counter has rolled over from 0xFFFF'FFFF back to 0.
Accurately counting roll overs depends on reading the cycle counter at least once every 5.96 seconds; otherwise I will miss a roll over count. This would result in an apparent compaction of time in the log, which is the opposite of what I am seeing.
The code to read the cycle counter is arcane due to the inline assembly code required to access the cycle counter register, but the logic is simple. I have omitted the initialization code that sets the initial value of PrevReadCt.
PrevReadCt and RollCt are global variables that hold the previously read value of the cycle counter and the current roll over count (which is used as the upper 32 bits of the 64 bit timestamp). These global variables are also system-wide critical resources. All the threads in the system, user-mode, kernel-mode, real-time, and interrupts, need to read and update these two variables.
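Stripped of the inline assembly, the unguarded logic looks roughly like this sketch. ReadCycleCounterHW() is a hypothetical stand-in for the register read, simulated here so the code runs anywhere; this is the version containing the race described below:

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware: ReadCycleCounterHW() stands in for the
 * inline-assembly read of the 32-bit cycle counter register. */
static uint32_t SimCounter;
static uint32_t ReadCycleCounterHW(void) { return SimCounter; }

static uint32_t PrevReadCt;         /* last raw counter value seen */
static uint32_t RollCt;             /* roll over count = upper 32 bits */

/* Unguarded version: correct single-threaded, but racy if an interrupt
 * handler runs this same code between the read and the global update. */
static uint64_t ReadTimestamp(void)
{
    uint32_t now = ReadCycleCounterHW();
    if (now < PrevReadCt)           /* counter wrapped past 2^32 */
        RollCt++;
    PrevReadCt = now;
    return ((uint64_t)RollCt << 32) | now;
}
```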
The "dead zones" in the traces were being generated when RollCt was being incremented twice for a single roll over of the cycle counter. This had the effect of expanding the timeline by 5.96 seconds (0x1'0000'0000 ticks) with no activity. There is a very narrow window when a thread is reading the cycle counter during which it could be interrupted by a hardware event, run the interrupt handler code, read the cycle counter, detect the roll over again, increment RollCt, resume the original thread, and increment RollCt a second time for the same roll over event.
For this to occur, two things need to happen at the same time. A thread needs to be accessing the cycle counter to open the window, then an omap-isp hardware interrupt needs to fire before the window is closed.
The window is open for an extremely brief period, on the order of a microsecond or two. However, it is opened frequently -- on the order of 4,000 times a second. The hardware interrupt fires about 40 times each second. 10 minutes is enough for these two events to line up two or three times. It happens more often on my system, which is running on a 720MHz clock, compared to Heather's system running at 500MHz.
The solution is simple: disable interrupts while accessing the cycle counter and updating PrevReadCt and RollCt. Disabling interrupts does not miss any hardware events; it simply defers them until interrupts are enabled again. In this case, interrupts are disabled for a very brief and deterministic period, on the order of a microsecond or two. The Xenomai RT environment complicates this a bit by requiring a unique API to really disable interrupts (which is something to be avoided in an RT system).
OsRtCriticalEnter() is a RT spin-lock used to guarantee synchronous execution of the code. The lock is held for a very brief (1 or 2 microseconds) and deterministic time and should have negligible effect on system performance. The code can be improved, but it is good enough to prove the fix.
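With the critical section in place, the sequence looks roughly like this. The OsRtCriticalEnter()/OsRtCriticalExit() bodies are empty stand-ins here (the real implementations are Xenomai-specific), and ReadCycleCounterHW() again simulates the hardware register:

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in stubs: on the target these take the RT spin-lock with
 * interrupts disabled, and release it again. */
static void OsRtCriticalEnter(void) { /* spin-lock + IRQs off on target */ }
static void OsRtCriticalExit(void)  { /* IRQs on + unlock on target */ }

static uint32_t SimCounter;         /* simulated cycle counter register */
static uint32_t ReadCycleCounterHW(void) { return SimCounter; }

static uint32_t PrevReadCt;
static uint32_t RollCt;

/* Fixed version: the read-check-update sequence is atomic, so RollCt
 * can only be incremented once per actual roll over. */
static uint64_t ReadTimestamp(void)
{
    OsRtCriticalEnter();            /* close the race window */
    uint32_t now = ReadCycleCounterHW();
    if (now < PrevReadCt)
        RollCt++;
    PrevReadCt = now;
    uint64_t ts = ((uint64_t)RollCt << 32) | now;
    OsRtCriticalExit();
    return ts;
}
```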
Testing the fix was a bit tedious as I had to experiment with various Xenomai functions to find the proper one. I needed to let the logging run for at least 15 minutes each time to be sure to include these rare events.
Eventually I had a long test in which the six-second gaps disappeared. There are some gaps, but these are marked as data drops and have more to do with my benchtop network than anything on the BLIS sensor and should be ignored. I saw at least one mini dead zone of 500ms, but this would not be enough to trigger any alarms. This trace contains 2.5 million events, and I have not examined it thoroughly except to confirm that the 5.96-second gaps are no longer present.
In the end, I consider this to be a very good result. I found and fixed a bug in the logging system and my confidence in the results has improved. I was able to run the logger for extended periods without any problems and with no apparent degradation in the performance of the BLIS sensor itself.
Monday, July 25 2016
DbgOutRT is now able to collect large amounts of real-time data, and I am letting it run for longer periods. The current configuration of Blister and omap3-isp.ko generates something on the order of 4000 records per second, and a 15-minute run can create a data log of 25MB with over 2.5M trace events. This is needed to capture events that are both rare and of short duration. With the data collection working, I need to figure out how to visualize these large data sets in a way that lets me quickly identify both the trends and outliers in the data without spending days examining it all.
I am reminded of working on the FLAME project with Bruce. FLAME was a relatively simple data collection module that used the BLIS platform to collect data from a FPGA peripheral that generated samples every 40ms. The Linux driver needed to respond to the interrupt and assign a timestamp with extremely low latency, and later store the data to disk. This was the genesis of the Xenomai RT migration. An external oscilloscope was used to measure the delay between the actual hardware event (FPGA data ready) and the RT interrupt handler, with CHA connected to the interrupt line from the FPGA and CHB connected to a gpio pin that was asserted when the Linux driver's interrupt handler executed. By triggering on CHA and overlaying CHB, the scope would paint a live picture of the operational latency. The picture was repainted every 40ms so that the variations in the CHB trace appeared as a fuzzy line that the human eye automatically translated into a statistical probability graph. Outliers were instantly recognizable, the average was prominently visible, and the deviation was shown as the "fuzz" around the center.
DbgOutRT provides all the information needed to let DbgScope replicate this same sort of "waterfall" display with the advantage of working with static data, not transient voltages. I just need a UI to specify the trigger, which is the condition that begins the display region. Then I redraw the current trace range, resetting the drawing cursor to 0 each time the trigger condition is met, and overpaint the traces. The redraw operation is simple. Specifying the trigger can start simple (rising or falling edge of a channel) and increase in complexity over time (patterns of multiple channels, prerequisites, minimum or maximum times, etc.)
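The re-timing step behind the overlay can be sketched as a single pass over time-sorted data: each event timestamp is re-based against the most recent trigger at or before it, so every sweep starts at 0 when overpainted. The function and names here are hypothetical, not DbgScope's actual code, and the trigger list is assumed to be already extracted from the trace:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Re-base each event against the latest trigger at or before it, so
 * overlaid sweeps all start at t=0. Events and triggers are assumed
 * sorted by time. Returns the number of offsets written; events that
 * precede the first trigger are skipped. */
static size_t RebaseToTriggers(const uint64_t *events, size_t n_events,
                               const uint64_t *triggers, size_t n_triggers,
                               uint64_t *offsets /* out, size n_events */)
{
    size_t t = 0, out = 0;
    for (size_t i = 0; i < n_events; i++) {
        /* advance to the latest trigger at or before this event */
        while (t + 1 < n_triggers && triggers[t + 1] <= events[i])
            t++;
        if (n_triggers == 0 || events[i] < triggers[0])
            continue;               /* before the first sweep begins */
        offsets[out++] = events[i] - triggers[t];
    }
    return out;
}
```

The painter would then draw each offset at the same x-origin with a semi-transparent brush, letting repeated hits brighten into the "fuzz" described above.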
I used the Windows DirectDraw API to paint the traces with semi-transparent brushes so that successive overpaints would become brighter. The average delays are easily seen as the bright green regions while the outliers are still visible as faint lines. The screenshot below was created using the "NRT Task" rising edge as the trigger with a 1s hold time. There are over 6000 trigger points being displayed.
I believe this display approach has a lot of potential, but it needs to be combined with a way to define sophisticated trigger conditions and a richer analysis function.
The current fallback is to identify interesting regions using the Waterfall display, export the data as text to Excel, crop the data by timestamp, then import the data into dedicated data analysis tools.
Tuesday, July 26 2016
I now have a new set of sophisticated tools that can be used to investigate timing-sensitive events at extremely high resolution and frequency with minimal drag on the target system. The next step is to apply these tools in long-term testing on misbehaving systems in a bench setting (if possible), or on production systems during non-production test sessions. I believe the goal should be to have these tools installed on the production system in a stand-by state, ready to be enabled if the system begins misbehaving.
However, due to budget constraints, the BLIS investigation is being suspended and further work will need to wait.
August 10, 2016
I spent some time enhancing DbgOutRelay to handle multiple incoming datastreams. The goal was to deploy DbgOut/DbgOutRelay on multiple BLIS sensors and relay the datastreams to a single workstation for aggregation and analysis. This turned out to be a significant modification of the existing DbgOutRelay, but a great improvement in usability for this scenario. See Running DbgOut on BLIS