Quantize Your Deep Learning Model to Run on an NPU

Preparing a TensorFlow Model to Run Inference on an i.MX 8M Plus NPU integrated in a phyBOARD-Pollux

Jan Werth
Towards Data Science

--

image by Liam Huang and mikemacmarketing, CC

Table of contents

  • Introduction
    - Why does the NPU utilize int8 when most ANNs are trained in float32?
    - Prerequisite
  • Post-training quantization with TensorFlow Version 2.x
    - First Method — Quantizing a Trained Model Directly
    - Second and Third Method — Quantize a Saved Model from *.h5 or *.pb Files
  • Converting with TensorFlow Versions below 2.0

Introduction

In this article, we explain which steps you have to take to transform and quantize your model with different TensorFlow versions. We only look into post-training quantization.

We are using the phyBOARD-Pollux to run our model. The phyBOARD-Pollux incorporates an i.MX 8M Plus which features a dedicated neural network accelerator IP from VeriSilicon (Vivante VIP8000).

phyBOARD pollux [Image via phytec.de under license to Jan Werth]
i.mx8MPlus block diagram from NXP [Image via phytec.de, with permission from NXP]

As the neural processing unit (NPU) from NXP needs a fully int8-quantized model, we have to look into full int8 quantization of a TensorFlow Lite or PyTorch model. Both libraries are supported by NXP's eIQ library. Here, we only look into the TensorFlow variant.

The general overview on how to do post training quantization can be found on the TensorFlow website.

Why does the NPU utilize int8 when most ANNs are trained in float32?

Floating-point operations are more complex than integer operations (arithmetic, overflow handling). Using int8 therefore allows the chip to rely on much simpler and smaller arithmetic units instead of the larger floating-point units.

The physical space needed for float32 operations is much larger than for int8. This results in:

  • Lower power consumption,
  • Less heat generation,
  • The ability to fit more calculation units on the chip, decreasing inference time.

Prerequisite

To develop the code on your PC first, we recommend using Anaconda with a virtual environment running Python 3.6, TensorFlow 2.x, numpy, opencv-python, and pandas.

The environment file to clone the environment can be found here.

Post-training quantization with TensorFlow Version 2.x

If you created and trained a model via tf.keras, there are three similar ways of quantizing the model.

First Method — Quantizing a Trained Model Directly

The trained TensorFlow model has to be converted into a TFLite model and can be quantized directly, as described in the following code. As an example for the trained model, we use the updated tf.keras_vggface model based on the work of rcmalli.

After loading/training your model, you first have to create a representative data set. The converter uses the representative data set to obtain the minimum and maximum values it needs to estimate the scaling factor. This limits the error introduced by the quantization from float32 to intX. The error comes from the different number spaces of float and int: converting from float to int8 restricts the number space to integer values between -128 and 127. Calibrating the model on the dynamic range of the input limits this error.

Here, you can simply loop through your images or create a generator, as in our example. We used tf.keras.preprocessing.image.ImageDataGenerator() to yield images and do the necessary preprocessing on them. As a generator, you can of course also use tf.data.Dataset.from_tensors() or from_tensor_slices(). Just keep in mind to apply the same preprocessing to your data here as you did to the data you trained your network with (normalization, resizing, de-noising, …). This can all be packed into the preprocessing_function argument of the generator, as shown in the sketch below.
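A minimal sketch of such a generator, assuming a hypothetical folder of calibration images and a placeholder preprocessing function (replace both with your own data and your training-time preprocessing), could look like this:

    import numpy as np
    import tensorflow as tf

    # Hypothetical calibration-image folder; flow_from_directory expects the
    # images to sit in at least one subfolder of this directory.
    DATA_DIR = "calibration_images"
    IMG_SIZE = (224, 224)

    def preprocess(img):
        # Use the same preprocessing as during training; scaling to [0, 1]
        # is only a placeholder here.
        return img / 255.0

    datagen = tf.keras.preprocessing.image.ImageDataGenerator(
        preprocessing_function=preprocess)
    data_flow = datagen.flow_from_directory(
        DATA_DIR, target_size=IMG_SIZE, batch_size=1, class_mode=None)

    def representative_dataset():
        # A few hundred samples are usually enough for the converter to
        # estimate the dynamic range of inputs and activations.
        for _ in range(100):
            batch = next(data_flow)
            yield [batch.astype(np.float32)]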

A simple TensorFlow Lite conversion, without any quantization, would look like this:
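Assuming the trained tf.keras model is available as a Python object called model (a hypothetical name), the plain float32 conversion is essentially:

    import tensorflow as tf

    # Plain conversion of a trained tf.keras model, no quantization yet.
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()

    with open("model_float32.tflite", "wb") as f:
        f.write(tflite_model)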

The quantization part sits in between and sets the following converter options (see the sketch after this list):

  • optimizations: options other than the default are deprecated; no other options are available at the moment (year 2020).
  • representative_dataset: here we set the representative data set.
  • target_spec.supported_ops: here we make sure that we get a full conversion to int8. Without this option, only the weights and biases would be converted, but not the activations. That is fine when we only want to reduce the model size; however, our NPU needs full int8 quantization. Activations left in floating point would result in an overall floating-point model that could not run on the NPU.
  • experimental_new_converter: enables MLIR-based conversion instead of TOCO conversion, which adds RNN support, easier error tracking, and more.
  • target_spec.supported_types: sets the internal constant type to int8. This corresponds with the TFLITE_BUILTINS_INT8 ops set above.
  • inference_input_type and inference_output_type: also set the model's input and output to int8. This is fully available from TF 2.3.
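Putting these options together, a sketch of the full int8 conversion, reusing the hypothetical model and representative_dataset from the sketches above, could look like this:

    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Default optimization; all other settings are deprecated.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Representative data set for calibrating the dynamic ranges.
    converter.representative_dataset = representative_dataset
    # Force full int8 quantization of weights, biases, and activations.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # Internal constant type.
    converter.target_spec.supported_types = [tf.int8]
    # MLIR-based conversion instead of TOCO.
    converter.experimental_new_converter = True
    # From TF 2.3 on: also quantize the model's input and output tensors.
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

    tflite_quant_model = converter.convert()
    with open("model_int8.tflite", "wb") as f:
        f.write(tflite_quant_model)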

Now, if we convert the model using TF 2.3 with

  • experimental_new_converter=True
  • inference_input_type=tf.int8
  • inference_output_type=tf.int8

we receive the following model:

image by author

However, if we do not set the inference_input_type and inference_output_type, we receive the following model:

image by author

So the effect is that you can determine which data type the model accepts and returns. This can be important if you work with an embedded camera, as included with the phyBOARD-Pollux. The MIPI camera returns 8-bit values, so if you want to avoid a conversion to float32, an int8 input can be handy. But be aware: if you use a model without prediction layers to obtain, e.g., embeddings, an int8 output will result in very poor performance. Here, a float32 output is recommended. This shows that each problem needs a specific solution.

Second and Third Method — Quantize a Saved Model from *.h5 or *.pb Files

If you already have your model, you most likely have it saved somewhere, either as a Keras h5 file or a TensorFlow protocol buffer (pb). We quickly save our model using TF 2.3:
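Assuming the trained model object from above and hypothetical file names, saving in both formats could look like this:

    # Keras h5 file
    model.save("my_model.h5")

    # TensorFlow SavedModel (creates a directory containing a saved_model.pb)
    model.save("my_saved_model")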

The following conversion and quantization is very similar to Method One. The only difference is how we load the model into the converter. Either load the model and continue as in Method One:
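A sketch of this first variant, using the hypothetical file name from above:

    import tensorflow as tf

    model = tf.keras.models.load_model("my_model.h5")
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    # ...then set the quantization options exactly as in Method One.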

Or load the h5 model directly. When using TensorFlow version 2 or above, you have to use the compatibility converter:
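For example, with the v1 compatibility converter that ships with TF 2.x (file name again hypothetical):

    import tensorflow as tf

    # The v1 compatibility converter reads a Keras .h5 file directly.
    converter = tf.compat.v1.lite.TFLiteConverter.from_keras_model_file("my_model.h5")
    # ...then quantize as before.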

If you load from a TensorFlow pb file, use:
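For example, pointing the converter at the SavedModel directory created above:

    import tensorflow as tf

    # The directory contains the saved_model.pb and the variables folder.
    converter = tf.lite.TFLiteConverter.from_saved_model("my_saved_model")
    # ...then quantize as before.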

Converting with TensorFlow Versions below 2.0

If you want to convert a model written in a TensorFlow version < 1.15.3 using Keras, not all options are available for TFLite conversion and quantization. The best way is to save the model with the TensorFlow version it was created in (e.g., rcmalli's keras-vggface was trained in TF 1.13.2). I would suggest not using the “saving and freeze graph” method to create a pb file, as the pb files differ between TF1 and TF2; TFLiteConverter.from_saved_model then does not work, creating quite a hassle to achieve quantization. I would suggest using the above-mentioned method with Keras:
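A sketch of that workflow, assuming the old environment still has the trained Keras model loaded as model:

    # Inside the old environment (e.g., TF 1.13.2 with Keras):
    model.save("my_model.h5")

    # Then switch to an environment with TF >= 1.15.3 (or 2.x), load the h5
    # file again, and convert and quantize it as shown above.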

Then convert and quantize your model with a TensorFlow version from 1.15.3 onward. From this version on, a lot of functions were added in preparation for TF2. I suggest using the latest version. This will result in the same models presented earlier.

Good luck and have fun.


--

Hi, I am a carpenter and electrical engineer with over 10 years of experience in signal processing, machine learning, and deep learning. linkedin.com/in/jan-werth