How to create a celebrity-look-alike demo and run inference on an NPU

Full detailed explanation on how to create, transform, and infer a celebrity-look-alike demo with TensorFlow on an i.MX 8M Plus NPU integrated in a phyBOARD-Pollux

Jan Werth
Towards Data Science

--

Photo by NASA on Unsplash

Info:

In this blog we will talk in detail about how to create a celebrity-look-alike demo and how to prepare it for use on an embedded NPU. If you are only interested in how to make a model run on the i.MX 8M Plus NPU, please visit this article to save yourself some time and get more details on post-training quantization with different TensorFlow versions.

To watch the associated video, click this youtube link or watch below

  • Overview
  • Prerequisites
  • The Celebrity-Look-Alike Demo
    Preparation
    — What are Embeddings
    — How are Embeddings Created
    Implementation in Code
    — Quantize Your Model to int8
    — Short Sidestep on How to Use Your Already Existing Model
    — Create a Database
    — Prepare the Data Set Further and Create an “Only Faces” Data Set
    — Create Embeddings of Each Image in Your Database
    — Load the Embeddings from json
    — Live Stream Analysis
    — Loading the Model
    — Point to the Cascade Classifier
    — Set Pre-Processing Function
    Split Your Data
    — Setting the Video Pipeline
    Start the Live Stream
    — Call the Video Pipeline
    — Find Faces in the Live Stream
    After Button Press
    — Finding and Cropping Middle Face
    — Further Pre-Process Found Face
    — Compare the Embeddings
    — Plot Results
  • Port This to Embedded Hardware
  • Does the NPU Give Any Benefit?
  • Conclusions

Overview

In this article we want to show you how to create a celebrity-look-alike demo based on a pre-trained neural network and how to integrate it on an embedded device.

First, we will explain how to build a celebrity-look-alike demo with a pre-trained model. Then we will show how to prepare the model for running on an embedded device, and finally how to optimize it so it can run on the embedded neural processing unit (NPU).

As hardware, we will use the phyBOARD-Pollux with the i.MX 8M Plus and the VM016 MIPI camera.

phyBoard Pollux. Image by author

Prerequisites

To create the code on your PC first, we recommend using Anaconda with a virtual environment running Python 3.6, TensorFlow 2.x, NumPy, opencv-python, and, for the preparation, pandas.

  • The Anaconda environment-file to clone the environment can be found here (latest: TF2.3envfile.yml).
  • You can find the model and installation instructions here.
  • The demo as described here can be found here.
  • The demo running optimized in a GUI can be found here.

The Celebrity-Look-Alike Demo

Image by author

The idea of the demo is to find images of celebrities that look similar to you, based on facial features. If you look at the block diagram below, you see the three blocks the demo is made of:

  • Live stream
  • Preparation
  • After button press
Image by author

As the names suggest, they have different functions. The Live stream and After button press blocks are part of the running demo, while the Preparation block is needed beforehand.

Generally, we follow the very good explanations from Jason Brownlee, which I recommend reading. Nevertheless, as we deviate in a few places, we will explain everything here in full detail.

First, we will start with the preparation part.

Preparation

Image by author

As can be seen in the first part of the block diagram, we are using a pre-trained network. We do so because facial recognition has already been accomplished very well by different research groups. We used the network from Refik Can Malli et al. (rcmalli), which was originally trained by Q. Cao et al. on the VGGFace2 dataset. As rcmalli’s model was written with TensorFlow 1.14.0 and Keras 2.2.4, we updated it to TensorFlow version 2.2.0. You can find the updated model here. However, we still use the weights from the original model.

What are Embeddings

To identify a human face, we need to identify its specific facial features. Examples would be the length of the nose, the distance between the eyes, the angle between nose and mouth, and so on.
We humans have perfected recognizing these features and can subconsciously identify millions of them. However, programming that by hand would be impossible. Therefore, we use rcmalli’s ResNet50 network to find a good representation of those facial features and their combinations and to project them into a lower-dimensional space. This lower-dimensional space, which incorporates the high-dimensional facial feature information, is called an embedding.
If you want to know more about embeddings, I suggest reading this nice article.

How do we actually obtain those embeddings?

Create Embeddings

The pre-trained network we are using here outputs values for 8631 classes, as it was trained on that many identities.

Output of the used ResNet50 network. Image by author

If we ran this model on an image of our face, we would get predictions for which of the 8631 classes the image fits best. However, our goal here is to find the celebrity image that looks most like a specific face. Therefore, we will use a truncated network. The idea is that the network has learned, on over 3.3 million faces, how a face is composed. Generally, each layer of the network captures more detailed information. Below you see the input image we used with our network and the outputs of layers [1, 10, 50, 101, 150, 170]. The output is organized in blocks of 3x3 images per layer, showing different filter/neuron outputs per image.

Input image to rcmalli’s ResNet50. Image by author
Output of layers [1, 10, 50, 101, 150, 170], each block of six images showing the outputs of the 1st to 6th filter/neuron. Image by author

As you can see, the information gets more detailed the deeper you go into the network. Each layer’s input is composed of combinations of the previous layer’s outputs. In the last layers (170 of 176) you can see that the information reaches almost pixel-level detail.

When cutting the network, we remove the last part, which is responsible for the final prediction. We do not want a prediction, though, but low-dimensional information about a face. The truncated model now has 2048 outputs instead of 8631 classes. These outputs are our embeddings: for every face image we put into the network, we get 2048 values describing that face.

With this truncated network we can now create a library of embeddings of, e.g., celebrity faces or faces of our employees, and compare them later to a new face.

Implementation

For all the code, please visit this github.

Photo by Irvan Smith on Unsplash

You can install the rcmalli model as described on their git. It works the same way as the following method; however, you would have to work with a TensorFlow version < 1.15.3 and install Keras v2.2.4. We will continue with our updated version.

Install the updated version via

pip install git+https://github.com/JanderHungrige/tf.keras-vggface

The model and pre-processing libraries can then be imported and loaded as shown below.
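A minimal sketch, assuming the updated package keeps the module layout of the original keras-vggface (a VGGFace model class plus a utils module with the pre-processing helpers):

    from keras_vggface.vggface import VGGFace
    from keras_vggface import utils  # pre-processing helpers (mean subtraction)

    # include_top=False drops the 8631-class prediction head; pooling='avg'
    # turns the last convolutional block into a 2048-value embedding output.
    model = VGGFace(model='resnet50', include_top=False,
                    input_shape=(224, 224, 3), pooling='avg')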

Quantize Your Model to int8

If you are not planning to run your model on an embedded device, you can skip this part and continue with the section on how to create your database below.

But we want to use the NPU of NXP’s i.MX 8M Plus.
The NXP NPU requires the model to be a TFLite or PyTorch model, and it has to be fully quantized to int8. As our model is a TensorFlow model, we have to convert it to a TensorFlow Lite model and, at the same time, do a full int8 quantization. The full int8 quantization is necessary because, normally, the weights, biases, and activation values of a trained network are in float32 precision. Most NPUs, however, work with int8 precision to be able to use simpler logic units. Therefore, we must convert and fit the float32 values into the range of int8, which covers exactly 256 values. However, we cannot just cut the floating-point values (e.g., 3.1415) down to integer values (that would now be 3) using, e.g., a numpy_array.astype(np.int8) call, as this would distort the whole network. We have to use an int8 converter.

If you load the model including the weights you can directly quantize the model as follows:
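The sketch below assumes the truncated tf.keras model from above is loaded as model and uses the folder of 224x224 face crops (created further down, here called faces_only) for calibration; file and folder names are illustrative.

    import glob
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras.preprocessing import image

    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    def representative_dataset():
        # Yield a few hundred calibration images; ideally they receive the
        # same pre-processing (mean subtraction) as at inference time.
        for path in glob.glob('faces_only/**/*.jpg', recursive=True)[:300]:
            img = image.load_img(path, target_size=(224, 224))
            img = image.img_to_array(img)
            yield [np.expand_dims(img, axis=0).astype(np.float32)]

    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]

    tflite_model = converter.convert()
    with open('vggface_resnet50_int8.tflite', 'wb') as f:
        f.write(tflite_model)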

We first create the converter directly from the tf.keras model. Then we define a generator function that loads images from our data set (which we will create in a second) and hand it to the converter as the representative dataset, so the converter knows what the input images will look like. This calibrates the model to the (min, max) activation ranges and reduces the error introduced during quantization. Finally, we restrict the supported ops to int8 and convert the model.

Short Sidestep on How to Use Your Already Existing Model

If you want to quantize your own model, you will probably have an h5 or pb file of your model and weights. Let us quickly create one from our model. Below you can see both methods. If you have a TensorFlow version below 2.0, use the h5 method (for more info please visit this article).
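A minimal sketch of both save methods, assuming model is the loaded tf.keras model (file names are illustrative):

    # TF >= 2.0: a path without an extension saves the SavedModel (pb) format.
    model.save('saved_model_dir')
    # A .h5 path saves the HDF5 format; use this for TensorFlow versions < 2.0.
    model.save('vggface_resnet50.h5')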

If you use the original model from rcmalli, you can only save the model with TensorFlow version 1.13.1 or 1.13.2; this is one reason we updated the Keras model to a tf.keras model. In that case you must use the h5 version.

Now you can quantize your model as before, with one small change: set the converter depending on your model format and TensorFlow version.
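As a sketch, the two converter variants could look like this; which one applies depends on how you saved your model and on your TensorFlow version:

    import tensorflow as tf

    # TensorFlow 2.x, SavedModel directory:
    converter = tf.lite.TFLiteConverter.from_saved_model('saved_model_dir')

    # TensorFlow 1.x or an h5 file: use the Keras-file converter instead.
    # converter = tf.lite.TFLiteConverter.from_keras_model_file('vggface_resnet50.h5')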

Sidestep Over

As said above, we need a representative data set. More generally, we need a data set to create our embeddings for the later comparison. Let us look at that.

If you want more details on the quantization, use different TensorFlow versions, or also need the in- and outputs to be quantized, please visit this article.

Create a Database

We will collect around 10k images for the top 1k celebrities from IMDb. Here you could also collect, e.g., employee images for facial comparison.

A simple Google scraper gets us the database. To get a csv file with the top names, go to this IMDb page and export the list via the export option on the site. Then you can run your Google or Bing crawler.

Be aware that you should only scrape images labeled for reuse or free of any license if you plan commercial use. Use the filter setting license=’commercial, modify’.
If you want to make sure you have a good, license-free dataset, please visit this article.
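A minimal sketch of such a crawler, using the icrawler package as one possible choice and assuming the exported IMDb list is stored as top1000.csv with the celebrity names in a 'Name' column:

    import csv
    from icrawler.builtin import BingImageCrawler

    with open('top1000.csv', newline='', encoding='utf-8') as f:
        names = [row['Name'] for row in csv.DictReader(f)]

    for name in names:
        crawler = BingImageCrawler(storage={'root_dir': f'database/{name}'})
        # Only fetch images labeled for commercial reuse with modification.
        crawler.crawl(keyword=name, max_num=10,
                      filters={'license': 'commercial,modify'})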

Now we have 10 images per celebrity summing up to 10k images.

Prepare the Data Set Further and Create an “Only Faces” Data Set

The model expects 224x224-sized images. Also, we are looking for facial embeddings, so we first extract the faces from our data set and resize them to 224x224. You can use any facial detection algorithm; a good one would be MTCNN. As we later use OpenCV for face detection, we also use the OpenCV variant here, with a haarcascade classifier. You can find and download the classifier here.
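A minimal sketch of the cropping step, assuming the downloaded images live under database/ and the classifier xml sits next to the script (file names are illustrative):

    import glob
    import os
    import cv2

    face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')

    for path in glob.glob('database/**/*.jpg', recursive=True):
        img = cv2.imread(path)
        if img is None:
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        if len(faces) == 0:
            continue
        x, y, w, h = faces[0]                    # keep the first detected face
        crop = cv2.resize(img[y:y + h, x:x + w], (224, 224))
        out_path = path.replace('database', 'faces_only', 1)
        os.makedirs(os.path.dirname(out_path), exist_ok=True)
        cv2.imwrite(out_path, crop)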

Now we have a data set of cropped faces with dimensions 224x224. This data set can also be used as the representative_dataset() for the quantization mentioned earlier.

Finally, for the preparation, we can create embeddings for each of the 10k face images. Here we use, of course, the quantized model created earlier (calibrated with the just-created data set), so the embeddings are computed with the same model we will later use for the live stream analysis.

Create Embeddings of Each Image in Your Database

The following code block creates embeddings for each image in our data set and saves them, together with the celebrity name and filename, to a csv and a json file. Later, we load the embeddings with Python’s built-in readers, as most embedded systems do not include pandas in their BSP. However, if you want to work with the embeddings further, using pandas on the json file is quite a delight.
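A minimal sketch of this step, assuming the quantized model and the faces_only folder from above; the mean values are the VGGFace2 training means used by the original keras-vggface pre-processing (version 2), applied here directly to OpenCV’s BGR channel order:

    import glob
    import json
    import os
    import cv2
    import numpy as np
    import tensorflow as tf

    interpreter = tf.lite.Interpreter(model_path='vggface_resnet50_int8.tflite')
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]

    database = []
    for path in glob.glob('faces_only/**/*.jpg', recursive=True):
        img = cv2.imread(path).astype(np.float32)            # BGR, 224x224
        img -= np.array([91.4953, 103.8827, 131.0912], dtype=np.float32)
        interpreter.set_tensor(inp['index'], np.expand_dims(img, axis=0))
        interpreter.invoke()
        embedding = interpreter.get_tensor(out['index'])[0]
        name = os.path.basename(os.path.dirname(path))        # folder = celebrity
        database.append({'name': name, 'file': path,
                         'embedding': embedding.tolist()})

    with open('embeddings.json', 'w') as f:
        json.dump(database, f)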

Now we are done with the preparations.

To run the demo, we will copy the embeddings, the quantized model, the cascade classifier file, the face image data set, and the demo code (which we will create now) onto the embedded device.

If you are not familiar with this, boot your device and connect a screen, mouse, and keyboard to it. Then open a console and get its IP address, e.g., with the Linux command ip a. Now you can use ssh to log into your device or copy data to it. Go into the folder containing your files, then use scp like this to copy them to the device: scp -r <your_folder> user@<ip_you_just_got>:~/
This will copy the whole folder (due to -r) to the device’s home folder.

Live Stream Analysis

Image by author

Finally, after a lot of preparation, we can combine everything into the look-alike demo.

The steps here are as follows:

  • Read in your live stream
  • Detect any face in the live stream
  • When the trigger is hit, check whether a face was detected and, if there is more than one, find the most middle face
  • Crop the face and create a 224x224 image
  • Use our model to create embeddings of the detected and cropped face
  • Compute the Euclidean distance between these embeddings and the previously saved embeddings
  • Find the minimum distance and return that value, the corresponding celebrity image index, and the celebrity name
  • Plot everything

We first check whether we are on an x86 (PC) setup or on an ARM system (embedded device), to set variables like data paths or video devices accordingly.
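A minimal sketch of that check (variable names and paths are illustrative):

    import platform

    if platform.machine() in ('x86_64', 'AMD64'):      # PC
        ON_EMBEDDED = False
        VIDEO_DEVICE = 0                               # default webcam
        DATA_PATH = './data'
    else:                                              # aarch64/armv7l -> board
        ON_EMBEDDED = True
        VIDEO_DEVICE = '/dev/video0'                   # MIPI camera node (assumption)
        DATA_PATH = '/home/root/data'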

Before we analyze the live stream, we

  • Load the Embeddings from json,
  • Load the model,
  • Load the cascade classifier,
  • Define the pre-processing function,
  • Split your data and
  • Set the video pipeline.

Load the Embeddings From json

We could use pandas to read the data; however, pandas is more difficult to integrate via Yocto Linux on the embedded system. The json library is already included in Python 3.
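A minimal sketch, assuming the embeddings were written to embeddings.json in the format shown above:

    import json
    import numpy as np

    with open('embeddings.json', 'r') as f:
        database = json.load(f)

    celeb_names = [entry['name'] for entry in database]
    celeb_files = [entry['file'] for entry in database]
    celeb_embeddings = np.array([entry['embedding'] for entry in database])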

Loading the Model:
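A minimal sketch; on the board, the lightweight tflite_runtime interpreter can be used instead of the full TensorFlow package:

    import tensorflow as tf
    # from tflite_runtime.interpreter import Interpreter  # alternative on the board

    interpreter = tf.lite.Interpreter(model_path='vggface_resnet50_int8.tflite')
    interpreter.allocate_tensors()
    input_details = interpreter.get_input_details()
    output_details = interpreter.get_output_details()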

Point to the Cascade Classifier:

Here we use the local binary pattern (LBP) classifier. If you want to stick with OpenCV, you can also choose the Haar classifier, which performs better at detection but demands more resources. For an embedded device we recommend the LBP version; however, with some optimization the Haar cascade also runs smoothly on our system.
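A minimal sketch; the exact xml file name depends on which cascade you downloaded:

    import cv2

    face_cascade = cv2.CascadeClassifier('lbpcascade_frontalface_improved.xml')
    # Alternative: cv2.CascadeClassifier('haarcascade_frontalface_default.xml')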

Set Pre-Processing Function

The pre-processing function used by rcmalli applies a per-channel pixel-mean subtraction. On your PC you can simply import this function; however, if you are planning to run this on an embedded device, I recommend writing the function out yourself to have less trouble including it in your board support package (BSP).

The function simply subtracts the mean of the training data from the new input image.
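A minimal sketch of this function, assuming the per-channel (B, G, R) means of the VGGFace2 training data from the original keras-vggface pre-processing (version 2) and OpenCV’s BGR channel order:

    import numpy as np

    def preprocess_input(face_bgr):
        # Subtract the per-channel training means from the BGR input image.
        face = face_bgr.astype(np.float32)
        face -= np.array([91.4953, 103.8827, 131.0912], dtype=np.float32)
        return face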

Split Your Data

We want to speed up the comparison of our embedding against the celebrity embeddings. Therefore, we split the data into chunks, one per thread, and later collect the results from those threads.
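A minimal sketch, splitting the database into one chunk per CPU core (four on the i.MX 8M Plus):

    import numpy as np

    NUM_THREADS = 4
    embedding_chunks = np.array_split(celeb_embeddings, NUM_THREADS)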

Setting the Video Pipeline

If you are not on an embedded system, you can simply read your webcam stream via OpenCV.

Otherwise, on an ARM system with a MIPI camera, as included in the phyBOARD-Pollux, you have to read the stream via OpenCV through a GStreamer pipeline, which delivers the camera’s Bayer image and sets the image size.

We check for GStreamer support, set the camera state, and create a pipeline.
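A minimal sketch of the capture setup, using the ON_EMBEDDED flag from the platform check above; the exact GStreamer caps (device, Bayer pattern, resolution) depend on the camera configuration and are assumptions here:

    import cv2

    # Look for "GStreamer: YES" in the build information to verify support.
    print(cv2.getBuildInformation())

    if not ON_EMBEDDED:
        cap = cv2.VideoCapture(VIDEO_DEVICE)            # plain webcam on the PC
    else:
        pipeline = ('v4l2src device=/dev/video0 ! '
                    'video/x-bayer,format=grbg,width=1280,height=800,framerate=30/1 ! '
                    'appsink')
        cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)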

Start the Live Stream

Call the Video Pipeline

Now we can call the video pipeline with OpenCV and get a constant video stream. As the video stream from the MIPI camera is in Bayer format, it must be converted to RGB. You could also do that in the GStreamer pipeline; however, we noticed that it is much faster this way.
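A minimal sketch of the capture loop; the Bayer-pattern constant is an assumption and must match the actual sensor configuration, and the face detection and button handling from the next steps run inside this loop:

    while True:
        ret, raw = cap.read()
        if not ret:
            break
        # Debayer in OpenCV rather than in the GStreamer pipeline (faster here).
        frame = cv2.cvtColor(raw, cv2.COLOR_BayerGB2BGR) if ON_EMBEDDED else raw
        cv2.imshow('live', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break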

Find Faces in the Live Stream

Each frame is then analyzed for faces.
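A minimal sketch of the detection step, which runs inside the capture loop on each frame:

    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                          minNeighbors=5, minSize=(60, 60))
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)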

After Button Press

Finding and Cropping Middle Face

As soon as a button is pressed, we find the face closest to the image center and use only this one, cropping it to 224x224.
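A minimal sketch, assuming frame and faces from the detection step above:

    import numpy as np

    frame_center = np.array([frame.shape[1] / 2, frame.shape[0] / 2])
    centers = np.array([[x + w / 2, y + h / 2] for (x, y, w, h) in faces])
    idx = int(np.argmin(np.linalg.norm(centers - frame_center, axis=1)))
    x, y, w, h = faces[idx]
    face_crop = cv2.resize(frame[y:y + h, x:x + w], (224, 224))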

Further Pre-Process Found Face

As we now have the cropped face, we must apply the same pre-processing as was done on the training data, and can then feed it to the model to obtain the embeddings.
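A minimal sketch, using the preprocess_input function defined above and the already loaded TFLite interpreter:

    face = preprocess_input(face_crop)                 # mean subtraction, BGR
    interpreter.set_tensor(input_details[0]['index'],
                           np.expand_dims(face, axis=0))
    interpreter.invoke()
    live_embedding = interpreter.get_tensor(output_details[0]['index'])[0]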

Compare Embeddings

With the resulting embeddings we can now compare them to the stored embeddings. We do this using threading, to use all four cores of our device and speed things up. This is actually one of the most demanding tasks of the demo.
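A minimal sketch of the threaded comparison, using the embedding chunks created earlier; each thread computes the Euclidean distances for its chunk and reports its best (smallest) match:

    import threading
    import numpy as np

    results = [None] * NUM_THREADS

    def compare(chunk_id, chunk, offset):
        distances = np.linalg.norm(chunk - live_embedding, axis=1)
        best = int(np.argmin(distances))
        results[chunk_id] = (float(distances[best]), offset + best)

    threads, offset = [], 0
    for i, chunk in enumerate(embedding_chunks):
        t = threading.Thread(target=compare, args=(i, chunk, offset))
        offset += len(chunk)
        t.start()
        threads.append(t)
    for t in threads:
        t.join()

    best_distance, best_index = min(results)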

Plot Results

Finally, we only have to stitch our face together with the best match and plot it. Here you could of course also implement a GUI. For simplicity and better understanding we kept it more “raw”.
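A minimal “raw” plotting sketch, using OpenCV to show the cropped live face next to the best-matching database image:

    match_img = cv2.resize(cv2.imread(celeb_files[best_index]), (224, 224))
    side_by_side = cv2.hconcat([face_crop, match_img])
    cv2.putText(side_by_side, celeb_names[best_index], (234, 20),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
    cv2.imshow('Your celebrity look-alike', side_by_side)
    cv2.waitKey(0)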

This concludes the demo description. On the platform itself, this demo is implemented with a GUI and object-oriented programming; however, the basics are the same. To check out the GUI version, visit this Git.

Again, if you want all of the code described above, please visit this github.

Port This to Embedded Hardware

If you use a phyBOARD-Pollux kit, the needed software from NXP is already included in the BSP. NXP created eIQ, which facilitates the connection between the onboard NPU and the peripheral components. At its core this is done with a tuned Google NNAPI, which is capable of understanding TensorFlow Lite and PyTorch models. Therefore, after converting your model to TFLite and quantizing it, eIQ takes over. If you create your own application and want to use it with the existing BSP, you only have to copy the model files and other required files onto the board; here we suggest using ssh.
If you want to include your application or add specific libraries to the existing BSP using Yocto Linux, that is of course also possible. Please continue reading here for further information.

Does the NPU Give Any Benefit?

We are using the i.MX 8M Plus NPU, but does that really matter? Or rather, how much improvement do we see using the NPU over the GPU with regard to inference time? We measured the pure inference over a set of 10 input images, 1200 times for the NPU and 600 times for the GPU (we know, it should be the same number of runs, but the result is quite clear).

For the NPU, we got a mean pure inference time of 0.016 s over 1200 runs.

Image by author

We compared that to inference on the embedded GPU and got a mean of 1.5s over 600 runs.

Image by author

Conclusions

In this article we showed you how to create and run a fun little demo. But it also demonstrates where we need an NPU and when we can rely on the embedded CPU or GPU. If we take another look at the demo block diagram, we can determine which parts mainly use the CPU/GPU and which parts use, and actually need, the NPU.

Image by author

Knowing this allows us to better estimate what hardware we need. Are the 2.3 TOPS of the phyBOARD-Pollux sufficient? We have seen that, at 60+ fps, we are well equipped for this application. Beyond that, we can estimate which deep learning applications we can run on this embedded hardware. In general, we see that deep learning is only one part of the whole application. What about my next project? Does it have more or fewer deep learning parts? Based on this, will my hardware performance still be sufficient?

I am sure that for most projects we can answer that with a yes.

Take care,
Yours Jan


Hi, I am a carpenter and electrical engineer with over 10 years of experience in signal processing, machine learning, and deep learning. linkedin.com/in/jan-werth