When “TOPS” are Misleading

Neural accelerators are often characterized by a single performance figure: "TOPS" (trillions of operations per second). But that number alone is not enough. It is important to know how these accelerators work and what else should be considered when making a comparison.

Jan Werth
Towards Data Science

--

Photo by SpaceX on Unsplash

List of Content

Introduction
What are TOPS?
Universal or Specialized? A Comparison
–Measurements
–Comparison
When do I Need Many TOPS?
Summary
Addendum

Introduction

Hardware accelerators for artificial intelligence go by many names: Neural Accelerator, AI Accelerator, Deep Learning Accelerator, Neural Processing Unit (NPU), Tensor Processing Unit (TPU), Neural Compute Unit (NCU), etc. All denote the same thing: electronics optimized for the matrix operations needed to compute artificial neural networks particularly efficiently. In this blog, we will stick to NXP's designation, "NPU".

In previous years, it was primarily Nvidia's GPUs that dominated the artificial intelligence (AI) field. With the increasing expansion of AI to the edge (mobile devices, industrial end hardware, etc.), dedicated hardware components for edge products have been developed for several years now. These focus primarily on low-precision computation (mostly integer), modern dataflow architectures, and optimal memory connectivity.

What are TOPS?

To compare the different NPU architectures in a simple way, the metric "Trillion Operations per Second" (TOPS) was created. Among experts it is not considered the optimal metric, but it condenses a complex question into an easy-to-understand, comparable number: how many mathematical operations can my chip deliver in one second? This number can be used to quickly compare different chips. The quality of the operations, or which operations are involved in detail, is not taken into consideration. In many cases, chips are also tuned for one specific task, on which they can then deliver their maximum performance. A direct comparison is therefore not always justified.

In most cases, TOPS are measured with the classic "ResNet50" architecture and sometimes with "MobileNet". In real applications, ResNet50 has often been replaced by more modern networks. Nevertheless, it provides a good basis for comparison.

Universal or Specialized? A Comparison

In this section, we compare two NPUs against each other to showcase what to look out for: one with a broad focus on industrial use and "low TOPS", the other focused on high-speed image analysis with "high TOPS". First, we show some measurements and then discuss the pros and cons.

Fig. 1 phyBOARD-Pollux with the i.MX 8M Plus [image by author]

— Measurements

The phyBOARD-Pollux single-board computer (Fig. 1) is an industrial board based on the NXP i.MX 8M Plus Quad processor. The NPU of the i.MX 8M Plus is specified by NXP at 2.3 TOPS. However, this alone says little about the inference times that can be achieved. An image processing test of this NPU with a ResNet50 resulted in over 60 frames per second (fps), with an average inference time of 16 ms per image (224 × 224 pxl), and 159 fps at an average of 6 ms per image with MobileNetV1 at the same resolution (Fig. 2). Tests by NXP itself showed an inference time of around 3 ms for the MobileNet architecture.

Fig. 2 Inference with MobileNetV1 (224x224) on the i.MX 8M Plus NPU [image by author]
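
As a rough sanity check, the nominal TOPS figure can be related to the measured frame rate. The sketch below assumes ResNet50 costs roughly 4 GMACs per 224 × 224 image, i.e., about 8 billion operations when counting multiply and add separately; the exact figure depends on the implementation:

```python
# Back-of-envelope: how much of the nominal 2.3 TOPS does the
# measured ResNet50 throughput actually use?
RESNET50_GOPS = 8.0   # ~4 GMACs ~= 8 GOPs per 224x224 image (assumption)
NOMINAL_TOPS = 2.3    # NXP's figure for the i.MX 8M Plus NPU
MEASURED_FPS = 60     # our measurement from above (60+ fps)

theoretical_fps = NOMINAL_TOPS * 1e12 / (RESNET50_GOPS * 1e9)
utilization = MEASURED_FPS / theoretical_fps

print(f"theoretical peak: {theoretical_fps:.0f} fps")  # ~288 fps
print(f"measured:         {MEASURED_FPS} fps")
print(f"utilization:      {utilization:.0%}")           # ~21%
```

The large gap between theoretical peak and measured throughput is exactly why TOPS alone can mislead: memory bandwidth, quantization overhead, and layers the NPU cannot accelerate all eat into the nominal figure.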

There are other neural accelerators with nominally higher TOPS. We will look at Gyrfalcon's Lightspeeur 2803s (Fig. 3), which can deliver up to 16.8 TOPS peak. According to Gyrfalcon's website, this yields a rate of over 100 fps with a MobileNet at an input resolution of 448 × 448 pxl. If we assume inference time scales linearly with input size (200,704 pxl vs. 50,176 pxl), we get a direct comparison of around 40 fps for the i.MX 8M Plus vs. 100 fps for the Gyrfalcon, a delta of approximately 60 fps. So a difference of a factor of 7.3 in TOPS leads here to only a factor of 2.5 in fps.
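
This normalization is easy to reproduce. A minimal sketch of the linear-scaling assumption (inference time proportional to pixel count), using the figures from above:

```python
# Normalize both chips to the same input resolution (448x448),
# assuming inference time scales linearly with pixel count.
imx8mp_fps_224 = 159                    # MobileNetV1 at 224x224 (our measurement)
gyrfalcon_fps_448 = 100                 # MobileNet at 448x448 (vendor figure)

scale = (448 * 448) / (224 * 224)       # 200,704 / 50,176 = 4x the pixels
imx8mp_fps_448 = imx8mp_fps_224 / scale # ~40 fps at 448x448

tops_ratio = 16.8 / 2.3                 # ~7.3x nominal TOPS
fps_ratio = gyrfalcon_fps_448 / imx8mp_fps_448  # ~2.5x frames per second

print(f"{imx8mp_fps_448:.0f} fps vs {gyrfalcon_fps_448} fps "
      f"(TOPS ratio {tops_ratio:.1f}x, fps ratio {fps_ratio:.1f}x)")
```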

For the Gyrfalcon, our own measurements were not possible at the time of writing (*see addendum).

In the following, we discuss the differences between the two chips.

— Comparison

Comparing NXP’s i.MX 8M Plus with Gyrfalcon’s Lightspeeur 2803s, the Lightspeeur seems clearly superior in TOPS. However, looking at the details, it becomes clear that the chips should not be compared on TOPS alone and that both have their justified fields of application.

Fig. 3 Gyrfalcon Lightspeeur 2803s [CC license]

TOPS to fps
The first thing to notice is that the inference speed is only 2.5 times higher despite 7.3 times the TOPS. While we cannot say for certain, this seems to come down to the chip's integration into the module and/or the measurement application code. A short connection between chip and embedded memory is important to avoid data-transfer bottlenecks.

NPU-Model Integration
Generally, a big plus of the i.MX 8M Plus is the eIQ library provided by NXP. eIQ delivers a seamless connection between the model and the NPU. The library supports TensorFlow Lite and ONNX models, ensuring a smooth, direct implementation of the models on the embedded hardware. For eIQ, a simple conversion and quantization in, e.g., TensorFlow Lite is enough to provide the model to the NPU.
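
To illustrate how little is involved on the model side, here is a minimal sketch of a standard TensorFlow Lite post-training quantization flow, which produces a fully integer-quantized .tflite file of the kind eIQ deploys to the NPU. The model and the calibration data below are placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder model; in practice, load your own trained network.
model = tf.keras.applications.MobileNet(input_shape=(224, 224, 3))

def representative_dataset():
    # A few hundred typical input samples drive the quantization
    # calibration; random data is only a stand-in here.
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization, as required for most NPUs.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())
```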

For model implementation, Gyrfalcon uses an MDK and SDK (Model and Software Development Kit), which is a bit more challenging (*see addendum). Simple deployment should not be underestimated: an easy adaptation of the model leaves more time for application and model development, enabling faster iterations in an agile mode.
A problem that arises with vendor-specific SDKs/MDKs for model conversion is that the customer depends on the availability and dedication of the providing software developers. Every update cycle depends on them; every troubleshooting effort must be handled by them. Unfortunately, abandoned SDKs appear again and again. The closer a piece of software stays to an open-source/community solution, the smaller the dependency on individual developers. This is an advantage of NXP's eIQ library, which adheres very closely to Google's NNAPI and uses TF-Lite and ONNX model formats directly.

Standard or Own Model Architecture?
Another difference is that the i.MX 8M Plus supports almost all current model types and architectures. The Lightspeeur 2803s, on the other hand, focuses on convolutional neural networks (CNNs) used for image processing, and supports only the three most important base architectures: VGG, ResNet, and MobileNet. These are commonly used, but adapted models, modern architectures, and other types of networks (e.g., recurrent neural networks) are neglected. This focus on CNNs is what enables the Lightspeeur 2803s to achieve its high number of TOPS.

Integrated or PCIe?
Another strength of the i.MX 8M Plus is its integration as a system on chip (SoC), creating an "all-in-one" solution. All hardware components share the memory directly (direct memory access), so no extra data transfer between, e.g., CPU and NPU is needed. This is especially interesting as the i.MX 8M Plus SoC integrates several other highly specialized hardware components, such as hardware image preprocessing, image and video de-/compression, and raw sensor image processing. These can be integrated directly into the data pipeline without any additional data transfer.
The Gyrfalcon can of course also be designed into an SoC, but this requires a lot of work and specialized knowledge. Classically, the Gyrfalcon is added via PCIe as an external chip. This requires integration during the development stage and additional data transfer between the hardware components during inference.
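
The cost of that extra hop can be estimated with a back-of-envelope calculation. The sketch below assumes a single PCIe 2.0 lane with roughly 500 MB/s of usable bandwidth; the actual lane configuration of a given module may differ:

```python
# Rough cost of shipping one frame to an external accelerator.
frame_bytes = 448 * 448 * 3   # one RGB frame, ~0.6 MB
pcie_bw = 500e6               # ~500 MB/s usable (PCIe 2.0 x1, assumption)

transfer_ms = frame_bytes / pcie_bw * 1e3
print(f"~{transfer_ms:.1f} ms per frame and direction")  # ~1.2 ms

# At 100 fps (10 ms per frame), ~2.4 ms of round-trip transfer is a
# significant share of the budget; on-SoC DMA avoids it entirely.
```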

What is the Target Market?
A decisive factor for many industrial applications is the long-term availability of the i.MX 8M Plus. Edge devices in industrial plants are often expected to run for years with low maintenance and under special conditions. The Lightspeeur 2803s, in contrast, is more suitable for applications in the consumer sector, where device runtimes of two to three years can be assumed under normal commercial use. Availability beyond two years is currently not guaranteed by Gyrfalcon, whereas long-term availability of more than 10 years is assumed for the new i.MX 8M Plus.

Fig. 4 Comparison between NXP and Gyrfalcon NPU [image by author]

Power Consumption
The power consumption of both chips is comparable: around 700 mW for the Lightspeeur 2803s and around 900 mW for the i.MX 8M Plus (NPU only). Here, power consumption does not favor one chip over the other; however, it is generally an important feature when comparing chips for embedded devices.

With the points described above, you can see that the two chips have different focuses and target groups. A pure comparison by TOPS would be misleading here. The i.MX 8M Plus is more of a general-purpose chip that can be used in many ways, whereas the Gyrfalcon suits narrowly defined applications with high performance requirements.

When do I Need Many TOPS?

This raises the question: when is the Lightspeeur 2803s superior to the i.MX 8M Plus? High fps quickly comes to mind, where more TOPS are preferred over fewer. However, one must be aware that most cameras only deliver 30 to 60 fps. Here the i.MX 8M Plus, even with a somewhat heavy ResNet50, would still be perfectly adequate.

Batch processing, where an image stream is processed at once, is also often associated with TOPS. However, batch processing is more commonly used during the training of neural networks. In classical inference, one image is processed at a time, i.e., a batch of one.
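
To illustrate, a typical inference loop really does process a batch of exactly one. A minimal TensorFlow Lite sketch; the model file and the frame source are placeholders, and on an NPU-equipped target a hardware delegate would additionally be loaded:

```python
import numpy as np
import tensorflow as tf

# Load the quantized model from the conversion step above.
interpreter = tf.lite.Interpreter(model_path="model_quant.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def infer(frame: np.ndarray) -> np.ndarray:
    # One frame in, one result out: a batch of exactly one.
    interpreter.set_tensor(inp["index"], frame[np.newaxis, ...])
    interpreter.invoke()
    return interpreter.get_tensor(out["index"])[0]
```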

Where a lot of TOPS are really needed is in "real-time critical" applications, especially in the field of autonomous driving. Massive computing power in the form of TOPS is required if, for example, you want to detect within a millisecond that a child is running out onto the road, using a 360-degree multistream input with pixel-precise detection. For this task, a high-TOPS chip like the Lightspeeur 2803s is the right choice.

It is important to remember that pure inference is not the only task in an AI application. In many cases, one or more data processing steps take place before and after the computation, and the results are used to steer further processes. Excess TOPS have no impact if the NPU is not supplied with enough data. A well-programmed, lean application is often worth more than a highly tuned NPU.
Of course, the two combined will remove any technical performance barrier.
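
To make this concrete, the sketch below shows how pre- and post-processing cap the effective frame rate of a serial pipeline regardless of NPU speed; the pre- and post-processing times are made-up, illustrative numbers:

```python
# Effective throughput of a serial pipeline is set by its total
# latency, not by inference alone.
preprocess_ms = 8.0   # decode, resize, normalize (illustrative)
inference_ms = 6.0    # MobileNetV1 on the NPU (our measurement)
postprocess_ms = 4.0  # thresholding, business logic (illustrative)

total_ms = preprocess_ms + inference_ms + postprocess_ms
print(f"effective: {1000 / total_ms:.0f} fps")  # ~56 fps

# Halving inference time with a faster NPU only reaches ~67 fps;
# halving pre/post-processing instead reaches ~83 fps.
```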

Summary — Different Task, Different Hardware!

Both chips we compared are very good in their specific fields. However, TOPS do not give a complete picture of whether an NPU is suitable for the planned application. The support of different models and, above all, the smooth implementation of the models on the chip are at least as important as raw performance.

It must also be considered whether the surrounding hardware and software application can actually utilize, or keep up with, the NPU's performance. An unoptimized software application cannot be compensated for by more TOPS. Fundamentally, the NPU must fit into the overall concept; TOPS is only one of many parameters.

The following questions can be asked to get a clearer picture:

  • What is my input data type and size?
    - Sensor data, video stream, mixed data stream, …?
  • How fast do I need the results?
    - Time-critical or performance-critical?
  • What is the computational difficulty?
    - “Classic” ML, deep learning, classification, segmentation, …
  • What is my target market?
    - Industrial and consumer markets have different requirements.
  • Is power consumption critical?
    - Handheld device, many devices accumulating, or one plugged-in device, …

Do I Need an NPU?
Whether an NPU is needed can be decided along the following line: if the task can be handled by "classic machine learning" (support vector machines, tree learners, …), a CPU is sufficient in many cases. As soon as deep learning architectures are used, we recommend an NPU.

Table 1 TOPS needed for different input and specifications [table by author]

The question of how many TOPS are required can be roughly answered with the following guideline: if we have a large amount of input data (e.g., an image processing application with HD images as input) and need results in under 1 ms, we should choose a chip with 5+ TOPS. If, e.g., 10 ms is enough, a chip with 2 TOPS is sufficient. Similarly important is the exact task at hand, e.g., classification vs. segmentation, which require different computational effort, as well as how complex the network/model is. Table 1 is intended to give a rough overview of how many TOPS are needed under different considerations. Keep in mind the other aspects mentioned in this article, which influence each chip's usefulness per use case.
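
This guideline can be turned into a rough estimate. A minimal sketch, assuming the model's operation count is known and that a realistic NPU utilization is around 25% of the nominal TOPS (consistent with our ResNet50 measurement above):

```python
def required_tops(model_gops: float, deadline_ms: float,
                  utilization: float = 0.25) -> float:
    """Rough nominal TOPS needed to hit a latency deadline."""
    ops_per_second = model_gops * 1e9 / (deadline_ms / 1e3)
    return ops_per_second / utilization / 1e12

# ResNet50-class model (~8 GOPs per image, assumption):
print(f"{required_tops(8, 1):.1f} TOPS for a 1 ms deadline")    # ~32 TOPS
print(f"{required_tops(8, 10):.1f} TOPS for a 10 ms deadline")  # ~3.2 TOPS
```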

As always with data science, every problem needs its tailored approach.

The investigations are based on the PHYTEC AI kit and Gyrfalcon's Lightspeeur 2803s.

Addendum

It would have been stronger to show our own measurement results for the Gyrfalcon chip. However, although we own a Plai Plug 2803 from Gyrfalcon, its commissioning was riddled with difficulties, leaving us unable to generate our own measurements.

The point we make in this article about the customer's dependency on the vendor developers' time, dedication, and project focus was nicely illustrated by our effort to use the Gyrfalcon chip. After finishing the article's draft, we tried to update the Gyrfalcon results with our own data. When we hit an error we could not solve on our own, we contacted the Gyrfalcon developers via their SDK forum. At the time of writing, 12 days (8 workdays) later, we have not received any answer.
This is not intended to bash Gyrfalcon, as there are many possible justifications for the delay; however, it illustrates the described phenomenon quite nicely. Imagine you are on a tight project schedule and now have to rely on this answer: 8 workdays with no indication of whether it will be answered at all is not reassuring. As described, the problem is that there is little to no community we could ask, as the SDK is not widely spread. We also asked on Stack Overflow a couple of days later and likewise received no answer, due to the question's niche nature.
In contrast, we also had issues with the i.MX 8M Plus when trying to run our TensorFlow Lite models; however, the TFLite community is huge, and we received our solution, not from NXP but from the community, almost instantly.

Please check for yourselves whether our question has been answered by now and how long it took:

A topic with a very similar question was posted in January 2020 and had not been answered at the time of writing (26.04.2021):

--

Hi, I am a carpenter and electrical engineer with over 10 years of experience in signal processing, machine learning, and deep learning. linkedin.com/in/jan-werth