How to Create and Tune Your Own Data Set for Facial Recognition using Neural Networks

Creating a Dataset for Celebrity Comparison with Creative Commons License Images and Tuning the Dataset to Perform Better

Jan Werth
Towards Data Science

--

Photo by Matthew Henry on Unsplash

Table of Contents

  • Introduction
    - Prerequisite
  • The Crawler
  • Using a Deep Neural Network to Identify Mismatched Images
    - The Idea
    - The Results
    - The Code
  • Further Improvement

Introduction

When you want to create a data set to compare your face to celebrity faces, for example to run it on a phyBoard Pollux neural processing unit as we did here, or for any other purpose that uses images of, e.g., celebrities, you will find that the good images are mostly not under a Creative Commons license. We used a Bing image crawler to look for celebrity faces and ran into trouble when using the filter set to 'commercial, reuse'. However, we found a way to use a deep neural network to separate the good images from the bad.

All code can be found here

Prerequisite

To use the code described here, you will need:

  • a Python 3.6+ environment (I recommend Anaconda with virtual environments)
  • icrawler
  • TensorFlow 2.x
  • tflite_runtime
  • pandas
  • numpy
  • matplotlib
  • scipy
  • opencv-python
  • the tf.keras-vggface model

To install the tflite_runtime, download this wheel file and install via pip install path_to_file

The conda environment-file to clone the environment can be found here (latest: TF2.3envfile.yml).

The Crawler

When you want to gather, e.g., faces of celebrities, the simplest way is to use a Python image crawler library such as icrawler. Here you can use a search term in combination with filters and other settings such as size, type of image, … We downloaded a CSV file from IMDb to get the names of the top 1k Hollywood celebrities and used that as the crawler input. We decided to use Bing, as it is sometimes better for image search.

The crawler tries to get 10 images per name, which would result in around 10k images (the crawler aborts after 10 tries, whether or not it was successful). You can see that we set the license in line 10 to 'commercial, modify'. The problem is that the results are just bad: for lesser-known actors we mostly get one true hit, and the rest are just random images. The images might be closely connected context-wise, but identifying the correct image requires manual checks.

Crawled Andrea Riseborough images [all images cc licensed]

For better-known names, one or two images can be off. In this example of Amber Heard, we get one image that is correct context-wise, but does not show Amber Heard but her husband Johnny Depp.

Crawled Amber Heard images [all images cc licensed]

With 10k images, it is impossible (if you want to keep your sanity) to check all images by hand.

Using a Deep Neural Network to Identify Mismatched Images

The Idea

To automate the checkup, we can use the same technique used for facial identification. In this blog we described in detail how to set up facial identification to compare your face with celebrity faces and run inference on an embedded NPU.

The idea is that we use a truncated network and receive a lower-dimensional description of the facial features from the output layer. This output is called embeddings.
With the tf.keras-vggface model, we adapted a ResNet50 architecture from rcmalli, which was first described by Qiong Cao et al. The model was previously trained on over 3 million faces, making it excellent for facial identification. We then truncated the model, cutting the fully connected layers to obtain an output layer with over 2k filters, meaning 2k+ facial embeddings per input image.

Using those embeddings, we can describe faces and compare them to each other. When gathering facial embeddings, the embeddings per input image come in the form of an n×1-dimensional vector (n = number of embeddings). With the Euclidean distance, we can now compare the embedding vectors of different face images and get a value for their similarity.

Image by Kmhkmh, Wikimedia Commons

With the Bing scraper, we got one folder per celebrity, containing all of their images. Now we use the described method to compare the embeddings of each image to all other embeddings in the same folder. This results in n Euclidean distance values, for which we can calculate the mean, standard deviation, or mean standard error.

So what we want to achieve is to find the outliers in each folder, or to determine whether all images are just wildly mixed up.
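To make this concrete, here is a small numpy sketch with made-up 3-dimensional "embeddings" (instead of the real 2k+ ones) showing how per-image mean pairwise distances expose an outlier in a folder:

```python
import numpy as np

# Toy "embeddings": four images of the same person cluster together,
# one mismatched image sits far away.
embeddings = np.array([
    [1.0, 1.0, 1.0],
    [1.1, 0.9, 1.0],
    [0.9, 1.0, 1.1],
    [1.0, 1.1, 0.9],
    [9.0, 9.0, 9.0],   # the outlier
])

# Pairwise Euclidean distance matrix (n x n, zero on the diagonal).
diff = embeddings[:, None, :] - embeddings[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))

# Drop each image's zero self-distance, keeping its n-1 distances
# to the other images in the folder.
n = len(embeddings)
off = dist[~np.eye(n, dtype=bool)].reshape(n, n - 1)

mean_dist = off.mean(axis=1)               # mean distance per image
sem = off.std(axis=1) / np.sqrt(n - 1)     # mean standard error per image

print(mean_dist)  # the last image stands out with a far larger mean
```

The real pipeline does exactly this per celebrity folder, just with the 2k+-dimensional embeddings from the truncated network.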

The Results

Before we look into the code, let us take a look at the results of comparing the mean and mean standard error values. In the two images below, you can see the mean values plotted for each image with the mean standard error values as error bars.

Image by author
Image by author

You can see that in the first plot the values are much more spread out than in the second plot, and also larger in mean Euclidean distance. In the second plot, we can also see a clear outlier for image 000004.jpg. Both plots nicely summarize our findings: the mean Euclidean distance of each image compared to all others in the folder is a good indicator of quality. After examining several cases, we noticed that a mean Euclidean distance of 100 is a good cutoff value. For your own dataset, you will have to find your specific threshold.

The Code

After importing and setting variables (find the full code here [V1]), we create a function that computes the Euclidean distance between two embeddings, and a pandas DataFrame to save all the embeddings with name, path, and values.
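The full code is linked above; a hedged reconstruction of these two helpers could look like this (the column names are my assumption, not necessarily the author's):

```python
import numpy as np
import pandas as pd

def euclidean_distance(emb1, emb2):
    """Euclidean distance between two embedding vectors."""
    emb1 = np.asarray(emb1, dtype=np.float32)
    emb2 = np.asarray(emb2, dtype=np.float32)
    return float(np.linalg.norm(emb1 - emb2))

# One row per image: the celebrity name, the image path, and the
# embedding vector itself.
embeddings_df = pd.DataFrame(columns=["name", "path", "embedding"])
```

For example, `euclidean_distance([0, 0], [3, 4])` returns 5.0.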

Now we load the tflite model, which you can find here: ftp://ftp.phytec.de/pub/Software/Linux/Applications/demo-celebrity-face-match-data-1.0.tar.gz

In the embeddings file we now store the embeddings of each file, but also the mean error and standard deviation against all other images in the folder, or against the ground truth. Then we get each image of each folder (line 3). We do our preprocessing in the same way as during the training of the model and create the embeddings (more on embeddings and why to use them here) (lines 7–9). After we have created the embeddings for all images in one folder, we compute the Euclidean distance (line 18), using the previously created functions, to get the distance between each embedding in that folder and all others. From this, we can calculate a mean distance (std, mean error, …) for each embedding (of each image) towards all other embeddings (images) (lines 20–22).
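A sketch of the per-image embedding step could look like the following. The model path is hypothetical, and the channel means are the ones I believe keras-vggface's version-2 preprocessing uses; check them against your own training setup:

```python
import numpy as np

def preprocess(img_bgr):
    """VGGFace-style preprocessing: cast to float32 and subtract the
    per-channel (B, G, R) means used by keras-vggface's version-2
    preprocessing. img_bgr is an HxWx3 BGR array already resized to the
    model's input size (e.g. 224x224)."""
    x = np.asarray(img_bgr, dtype=np.float32)
    x -= np.array([91.4953, 103.8827, 131.0912], dtype=np.float32)
    return x[None, ...]  # add the batch dimension

def get_embedding(interpreter, img_bgr):
    """Run one preprocessed image through the truncated tflite model and
    return the flattened embedding vector."""
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], preprocess(img_bgr))
    interpreter.invoke()
    return interpreter.get_tensor(out["index"]).flatten()

if __name__ == "__main__":
    # Imported here so preprocess() can be used without the runtime installed.
    from tflite_runtime.interpreter import Interpreter
    interpreter = Interpreter(model_path="vggface_truncated.tflite")  # hypothetical path
    interpreter.allocate_tensors()
```

Looping `get_embedding` over every image in a folder and feeding the results into the pairwise-distance step above reproduces the flow the text describes.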

In line 25, we save all embeddings to JSON.

As we figured out that a value of 100 is a good separator, we can use this information to delete all images with a mean Euclidean distance greater than 100, as in the following code block:
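A minimal sketch of such a cleanup step, assuming the per-image mean distances were loaded from the JSON file saved above (the mapping of image path to mean distance is my assumed layout):

```python
import os

def remove_outliers(mean_distances, threshold=100.0):
    """Delete every image whose mean Euclidean distance to the other
    images in its folder exceeds the threshold.

    mean_distances maps an image path to its mean distance value.
    Returns the list of deleted paths."""
    deleted = []
    for path, mean_dist in mean_distances.items():
        if mean_dist > threshold:
            os.remove(path)
            deleted.append(path)
    return deleted
```

Running it with your folder's distances leaves only the images whose embeddings agree with the rest of the folder.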

Further Improvement

find full code here [V2]

To further improve the results, you can use a scraper for regular celebrity images without a free license, create the embeddings of those images, and then compare your license-free image embeddings to them. If you compare those embeddings, the difference between good and bad fits gets even larger, making them easier to separate.

To create the embeddings, crawl again for images, but do not use the filter (commercial, reuse) this time. This way, you will get much better images of, e.g., celebrities. Now create embeddings using the model we use here (much more info on how to create embeddings here and code here). Once you have the embeddings, we just have to alter the first function a bit, so that it no longer compares an image to the other images in its folder, but against the ground-truth embeddings you just created.

Load the embeddings (line 2) and change the faceembedding function as follows in lines 4–9. As the function has changed, the call to the function has to be adapted as well.
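The altered comparison could be sketched like this (function and variable names are mine): instead of averaging distances within the folder, each image embedding is measured against the ground-truth embedding(s) of the same celebrity.

```python
import numpy as np

def distance_to_ground_truth(embedding, gt_embeddings):
    """Mean Euclidean distance of one image embedding to the ground-truth
    embedding(s) of the same celebrity.

    gt_embeddings may be a single vector or an array with one reference
    embedding per crawled non-free image."""
    gt = np.atleast_2d(np.asarray(gt_embeddings, dtype=np.float32))
    emb = np.asarray(embedding, dtype=np.float32)
    dists = np.linalg.norm(gt - emb, axis=1)
    return float(dists.mean())
```

An image of the wrong person now scores a large distance against every reference embedding, so good and bad fits separate even more cleanly than in the within-folder comparison.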

Good Luck


Hi, I am a carpenter and electrical engineer with over 10 years of experience in signal processing, machine learning, and deep learning. linkedin.com/in/jan-werth