By Paul Trautrim | Jan 16, 2014
Low latency is often a key requirement for high-performance video designs. Nuvation’s engineering team recently completed a low-latency video streaming design that handles HD video from capture to display, using the H.264 codec for compression. To profile the latency in our system we built a custom measurement approach, which contributed to the delivery of a successful product.
The H.264 codec is designed with low-latency applications in mind, and it provides the ability to partition video frames into regions called slices. This allows the encoder to start encoding video data before a full frame has been captured, which decreases the amount of time it takes to send out an encoded frame. Although we had already measured the total latency in our system, we wanted an in-depth look at slices moving through the video path to see which parts of the system contributed the most latency, since that profiling information would guide further optimizations. We created a tool that recorded a timestamp whenever an H.264 slice reached a key location in the video path. This way we could see how much time each slice spent at each stage of the path, all the way from capture to display.
Our system used two separate pieces of hardware:
- A circuit board to capture video from a camera, encode it with the H.264 codec, and stream it over a network connection
- A circuit board to receive, decode, and display the video stream on a monitor
Our key locations were spread out across these two boards.
We started by connecting a USB logic analyzer to the GPIO pins on the encoder and decoder boards. When a slice entered one of our key locations, the board toggled a corresponding GPIO pin. The logic analyzer recorded these toggles to a file, which we then processed into a spreadsheet with a set of timestamps for each slice’s arrival at the key locations.
This worked well some of the time, but there was a problem: if a slice was dropped at some point in the chain, we couldn’t see when it happened. We were assuming that the nth toggle on a pin was caused by the nth slice. So if a slice was dropped, we would end up with an unequal number of toggles between locations, and we wouldn’t know which slice was lost. This threw off the timing data: if slice n was dropped between locations B and C, it would appear in the data as if slice n took an unusually long time to get from B to C, when really it never arrived at location C and the next toggle was caused by slice n+1.
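The failure mode is easy to reproduce in a few lines. The following Python sketch uses made-up toggle times (the values and the `pair_by_index` helper are illustrative assumptions, not our actual tooling) to show how index-based pairing misattributes latency once a slice is dropped.

```python
def pair_by_index(toggles_b, toggles_c):
    """Pair the nth toggle at B with the nth toggle at C (the naive assumption)."""
    return [c - b for b, c in zip(toggles_b, toggles_c)]


# Hypothetical toggle times in milliseconds: slices reach B every 4 ms
# and normally take 1 ms to travel from B to C.
toggles_b = [0.0, 4.0, 8.0, 12.0]

# Slice 1 (counting from 0) is dropped between B and C,
# so location C only sees three toggles.
toggles_c = [1.0, 9.0, 13.0]

naive_latency = pair_by_index(toggles_b, toggles_c)
print(naive_latency)  # [1.0, 5.0, 5.0]
```

The dropped slice shows up as a 5 ms transit instead of a missing entry. In this toy data the error also persists for every slice after the drop, because each later B toggle gets paired with the following slice's C toggle.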
In order to be confident in our data, we needed an association between slice numbers and timestamps. Then we could say for sure which slices arrived at which locations and when, with no guesswork required.
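As a rough illustration of what that association buys us, here is a minimal Python sketch that keys every timestamp by slice number. The record layout, the function names, and the sample values are assumptions made for this example; only the location names B and C come from the scenario described above.

```python
from collections import defaultdict


def build_slice_table(records):
    """Group (slice_number, location, timestamp) records by slice number."""
    table = defaultdict(dict)
    for slice_no, location, timestamp in records:
        table[slice_no][location] = timestamp
    return dict(table)


def stage_latency(table, src, dst):
    """Map each slice seen at src to its src->dst latency, or None if it never reached dst."""
    result = {}
    for slice_no, stamps in sorted(table.items()):
        if src not in stamps:
            continue
        result[slice_no] = stamps[dst] - stamps[src] if dst in stamps else None
    return result


# Hypothetical records in milliseconds; slice 1 never reaches location C.
records = [
    (0, "B", 0.0), (0, "C", 1.0),
    (1, "B", 4.0),
    (2, "B", 8.0), (2, "C", 9.0),
    (3, "B", 12.0), (3, "C", 13.0),
]

table = build_slice_table(records)
latency_b_to_c = stage_latency(table, "B", "C")
print(latency_b_to_c)  # {0: 1.0, 1: None, 2: 1.0, 3: 1.0}
```

With the slice number attached to every timestamp, a dropped slice appears as an explicit gap rather than as a bogus latency value, and the slices around it keep their correct timings.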
Producing Timestamps
We wanted to generate slice numbers in software on the encoder board and associate them with the video data, but there was no easy way to send these to the logic analyzer and get both slice numbers and timing information together. We needed a way to produce timestamps from within the boards’ software and save them to memory for later processing. The difficulty was that our key locations were spread across different processors on different boards, all running on different clocks. With a little bit of kernel code, we could access a timer on each board that was shared between that board’s processors. This gave us consistent timestamps for all the key locations on the encoder board, and consistent timestamps for all the key locations on the decoder board. But how could we synchronize the two sets of timestamps from separate boards?
Due to the product’s schedule, we opted to use the logic analyzer again for a quick solution. The logic analyzer and GPIO toggling could be used as a bridge between the two boards. Timestamps were taken at the key locations on the encoder board, and the logic analyzer captured the GPIO toggle that occurred when a slice was sent (the moment it left the encoder board). Another GPIO pin toggled when the slice arrived at the decoder board, so from the logic analyzer data we could calculate the time difference between the send and receive stages, which is the network travel time. In our post-processing script, we then took the time differences between the timestamps generated on the decoder board and added them to the network travel time to get a set of timestamps consistent across the whole system.
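The arithmetic behind that bridge can be sketched as follows. This is a simplified Python illustration for a single slice, not our actual post-processing script: the `unify` function, the location names ("capture", "encode", "send", "receive", "decode", "display"), and all of the numbers are hypothetical.

```python
def unify(encoder_stamps, decoder_stamps, analyzer_send, analyzer_recv):
    """Express one slice's timestamps on the encoder board's timeline.

    encoder_stamps: {location: time} from the encoder timer, including "send".
    decoder_stamps: {location: time} from the decoder timer, including "receive".
    analyzer_send, analyzer_recv: logic analyzer times of the two GPIO toggles.
    """
    # Network travel time comes only from the logic analyzer's single clock.
    network_time = analyzer_recv - analyzer_send

    # Arrival at the decoder board, placed on the encoder's timeline.
    receive_time = encoder_stamps["send"] + network_time

    unified = dict(encoder_stamps)
    for location, t in decoder_stamps.items():
        # Only differences between decoder timestamps are used, so the
        # decoder timer's unknown offset cancels out.
        unified[location] = receive_time + (t - decoder_stamps["receive"])
    return unified


# Hypothetical values in milliseconds for one slice.
encoder_stamps = {"capture": 100, "encode": 108, "send": 110}
decoder_stamps = {"receive": 5000, "decode": 5006, "display": 5015}

timeline = unify(encoder_stamps, decoder_stamps, analyzer_send=20, analyzer_recv=23)
print(timeline["display"] - timeline["capture"])  # 28
```

Each clock is only ever compared with itself: encoder timestamps with encoder timestamps, decoder with decoder, and the two GPIO toggles on the logic analyzer. That is why the three unrelated time bases can be stitched together without a shared clock.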
After collecting all this data from each processor’s memory and running this post-processing logic (plus a little extra to remove some significant clock drift we ran into), we were able to get detailed timing information for each slice as it travelled through the video path.
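The drift correction itself is not detailed here. One common way to handle this kind of problem, shown purely as an assumption and not necessarily what our script did, is to fit a linear mapping between a board timer and a reference clock using events that were observed on both. In the Python sketch below, the function names and sample values are hypothetical.

```python
def fit_clock(board_times, reference_times):
    """Least-squares fit of reference = scale * board + offset."""
    n = len(board_times)
    mean_b = sum(board_times) / n
    mean_r = sum(reference_times) / n
    cov = sum((b - mean_b) * (r - mean_r)
              for b, r in zip(board_times, reference_times))
    var = sum((b - mean_b) ** 2 for b in board_times)
    scale = cov / var
    offset = mean_r - scale * mean_b
    return scale, offset


def to_reference(board_time, scale, offset):
    """Convert a board timestamp to the reference clock's timeline."""
    return scale * board_time + offset


# Hypothetical paired observations of the same events (milliseconds):
# the board timer runs slightly slow relative to the reference clock.
board_times = [0.0, 1000.0, 2000.0, 3000.0]
reference_times = [50.0, 1050.1, 2050.2, 3050.3]

scale, offset = fit_clock(board_times, reference_times)
corrected = to_reference(1500.0, scale, offset)
```

A scale factor different from 1 captures the steady rate difference between the two clocks, while the offset absorbs their different starting points. This only models constant drift; a rate that varies over the capture would need a piecewise fit or a shorter capture window.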