How to use Keras2 flow_from_directory() with Azure blob storage

Jan Werth
Jun 17, 2019 · 9 min read

Preamble

This tutorial explains every step in detail. I know the struggle of pre-assumed knowledge too well.

For the more knowledgeable ones, here is the fast version:

  • Install blobfuse on your DSVM with sudo apt-get install blobfuse
  • Create a folder for the blob on the DSVM with: mkdir ~/mycontainer
  • Mount blob into DSVM with:
sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp --config-file=./fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o allow_other
  • Connect your Jupyter Notebook to the DSVM
  • Use Keras flow_from_directory() with the path pointing to the mounted blob
  • Save models into the same path via callback function

Introduction

In this tutorial I want to show how to connect your Azure Blob storage to your Azure DSVM to use it with your Jupyter notebook and, for example, Keras2.
I struggled to set up this connection to be able to use my uploaded images with the Keras ImageDataGenerator() function. Therefore, I want to show step by step how I achieved this. There might be better and more elegant ways. If you have one, please let me know.

The approach described here makes the blob available by mounting it onto the DSVM via blobfuse.

Prerequisite

  • Azure account. If you do not have one, follow these steps
  • An Azure storage account. If you do not have one, follow these steps
  • Images on your blob. I like the Microsoft Azure Storage Explorer to add and distribute files on your Azure storage accounts. The MS Azure Storage Explorer is a visual file explorer linked to your Azure account. Later we want to use the ImageDataGenerator() function. Therefore, please add the files in the supported folder structure. Check this Medium post. EDIT: Be aware that you will find the Azure storage account (image: futter) in the Microsoft Azure Storage Explorer, but you still have to create a Blob container (image: haende) into which the data will be uploaded. A Blob container can be created with a simple right click on the storage account. (In the pictures below the names differ from the description, as this is an edit.)
  • An Azure Notebook running on Python3. To set this up, follow this guide.
  • An Azure Data Science Virtual Machine (DSVM). If you do not have one, follow these steps. Once you have created the DSVM, you have to set a Network Security Group (NSG). Follow this guide to create a new one. Then you have to set the NSG in your DSVM; otherwise you will not be able to connect to the DSVM with a Jupyter Notebook. See the image below:

Why Blob or Standard V2?

As we are aiming to use the ImageDataGenerator() function, we will use images. Blob/V2 are generic cloud storages that can hold your images. You cannot “do” anything with this data, such as sorting or altering it, but it can store a large number of images close to your DSVM. This is very useful when you are working on an image classification problem with more data than you could load into RAM, or with continuously changing data. Using the Blob/V2 instead of your own HDD or cloud storage reduces latency by removing the bottleneck of uploading via your provider, as the Blob and the DSVM are in the same network.

If you are working with JSON or CSV files, Blob would not be the recommended storage type in Azure. Then I would rather choose the table storage type, which is also included in the V2 storage.

Connect to your DSVM

IMPORTANT: A RUNNING DSVM INCURS COSTS. Just quitting e.g. your Jupyter Notebook does not stop your DSVM. It has to be stopped in the Azure Portal!

When your DSVM is set up, access it via the Virtual machines tab in your Azure Portal. Choose the DSVM you want to use. In our case it is called Verdun (as it feels like endless fighting). On the top bar you can press the Connect button, which brings you to the connect settings. The DSVM has to be running for you to connect.

In the connect settings, copy the marked link to your clipboard (Ctrl+C).

Now open your terminal or Windows cmd (Windows key, type cmd, press Enter) and use the copied line to access the DSVM. You will be asked for your DSVM password; use the password you set when creating the DSVM. The following message should appear:

Install Blobfuse

Now we will follow this guide from Microsoft to install blobfuse.

At this time, blobfuse supports Ubuntu 14.04, 16.04, and 18.04. Type the following in the terminal/cmd to show your version of Ubuntu:

lsb_release -a 

Then install blobfuse

sudo apt-get install blobfuse

Blobfuse needs temporary space on the VM's disk. Therefore, grant it access to some space on your VM's HDD/SSD:

sudo mkdir -p /mnt/resource/blobfusetmp
sudo chown <youruser> /mnt/resource/blobfusetmp

To use faster RAM instead of HDD/SSD look into the guide under Prepare for mounting.

Now create a text file on your DSVM to store the credentials of the storage with: touch ~/fuse_connection.cfg. To add your credentials, open the file with the Linux text editor vim: vim fuse_connection.cfg. You now see a new, empty text file. Press the Insert key or i to enter insert mode, where you can actually write into the file. Add your credentials in the following form:

accountName myaccount 
accountKey storageaccesskey
containerName mycontainer

(Where to find this information is described a few steps further down)

Your command window should look something like this:

Then press the escape key (ESC) to leave insert mode. Afterwards, press : to enter command mode. With wq you save and exit the file.

The command chmod 600 fuse_connection.cfg restricts access to the file.

Now create a directory to mount the storage drive into with

mkdir ~/mycontainer

Now we can mount blobfuse with

sudo blobfuse ~/mycontainer --tmp-path=/mnt/resource/blobfusetmp --config-file=./fuse_connection.cfg -o attr_timeout=240 -o entry_timeout=240 -o negative_timeout=120 -o allow_other

If you created the file under another path, you have to change --config-file=./fuse_connection.cfg to the full path of the fuse_connection.cfg file.

Now you can check with the ls command: you will see mycontainer with the blob's contents inside.

Each time you stop/start your DSVM in your Azure portal, you will have to mount the blob again. 

Where to find account credentials

The account name is the one you find in the Azure portal. Go to Storage accounts, choose the storage you want to mount, and go to the Access keys tab, where you find the storage account name and the key.

If you scroll further down, you will find the Blob service tab. There you will see all the blobs on the storage. Choose the one your files are in (by remembering the name :-) ).

Now use Jupyter Notebook with the mounted blob drive

Open your Azure Notebooks account and configure on which machine the notebook should run. The default is the free tier, but the free tier is not configured for e.g. machine/deep learning. A DSVM is more powerful, can have a GPU, and has most machine/deep learning libraries pre-installed.

If you choose direct compute, a window pops up to enter the DSVM credentials:

The name is the DSVM name, the IP is the public IP, and the user name and password are the ones you chose when creating the DSVM. They can be found on the virtual machine overview page (here the public IP is hidden):

If validation succeeds, connect to the DSVM via Run. Of course, the DSVM has to be started in the Azure Portal.

If you get an error, most likely the Network Security Group is not active anymore. Go into the DSVM Networking tab again, choose Application security groups, pick the one you created earlier, and press Save. This error happens often when restarting the DSVM later, as the NSG is only valid for a couple of hours.

IMPORTANT: A RUNNING DSVM INCURS COSTS. Just quitting e.g. your Jupyter Notebook does not stop your DSVM. It has to be stopped in the Azure Portal!

Now let's use the blob in the Jupyter notebook with Keras2

The following code can be found on Git:

First, classic importing the libraries:
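In my case that looked roughly like this (a minimal sketch, assuming Keras 2 with the TensorFlow backend; adjust the imports to the versions installed on your DSVM):

import os
from keras.applications.mobilenet_v2 import MobileNetV2
from keras.models import Model
from keras.layers import GlobalAveragePooling2D, Dense
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ModelCheckpoint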

Then define your model. We use MobileNetV2 as base:
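A sketch of such a model (the classification head and num_classes are placeholders, not necessarily the layers from my original code; adapt them to your data):

Tsize = 224  # MobileNetV2 input size, see the note on image sizes below
num_classes = 2  # placeholder: set this to your number of class folders

# load MobileNetV2 pre-trained on ImageNet, without its classification top
base_model = MobileNetV2(weights='imagenet', include_top=False,
                         input_shape=(Tsize, Tsize, 3))

# put a small classification head on top of the base
x = GlobalAveragePooling2D()(base_model.output)
x = Dense(128, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)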

Freeze the base layers (this differs when you use a different base model, e.g. ResNet50):
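For MobileNetV2 this comes down to:

# keep the pre-trained weights fixed; only the new head will be trained
for layer in base_model.layers:
    layer.trainable = False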

Next, compile the model
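For example (optimizer and loss are typical choices for a multi-class setup, not necessarily the ones I used):

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])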

Define the ImageDataGenerator. Here we scale the RGB images to 0–1 and add some image augmentation. Change the settings as you wish:
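For example (the augmentation values are illustrative; note the validation_split, which we need later for the subset parameter):

datagen = ImageDataGenerator(
    rescale=1./255,          # scale RGB values from 0-255 to 0-1
    rotation_range=20,       # random rotations in degrees
    width_shift_range=0.1,   # random horizontal shifts
    height_shift_range=0.1,  # random vertical shifts
    horizontal_flip=True,    # random horizontal flips
    validation_split=0.2)    # hold out 20% of the images for validation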

Now comes the important part: using flow_from_directory() with the blob storage. As we mounted the blob into the DSVM, we can simply treat it as a path. First, let's check that we can find the images:
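A quick check could look like this (replace <username> with your user on the DSVM, as in the mount step above):

import os
print(os.listdir('/home/<username>/mycontainer'))
# e.g. ['input']
print(os.listdir('/home/<username>/mycontainer/input'))
# e.g. one folder per class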

If we check into mycontainer, we see that it contains the folder structure necessary for the ImageDataGenerator. We used the folder input as the first element because Kaggle has adopted this structure. In many cases, the folder structure is split into Train and Test, in which you would find the folders for each class. As we use the validation_split parameter, this is not necessary, and all class folders sit directly in the input folder.

Now we can use the flow_from_directory() function with the subset parameter. But first, we point to the blob storage:

The mycontainer folder was created in /home/<username>/, so we just use: datadirectory='/home/<username>/mycontainer/input/'. Using ~/mycontainer, which points to the user's home directory in Linux, does not work here. Therefore, use the full path.

Our image were sized to 224x224 (here: Tsize) as one of the standard formats of the MobileNet. Other Networks are pre-trained on different formats. Check before you resize your images for the particular network. If you use a pre-trained network, the input size should be the same. Even though, CNN layers do not care, the fixed weights do. If you retrain the entire network without fixing the weights, you do not have to care (that much) about the input dimensions.

As we use validation_split, we also need to use the subset parameter.
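Putting it together, a sketch (batch size and class_mode are placeholders; subset selects which part of the validation_split you get):

datadirectory = '/home/<username>/mycontainer/input/'  # full path to the mounted blob

train_generator = datagen.flow_from_directory(
    datadirectory,
    target_size=(Tsize, Tsize),  # resize to the MobileNetV2 input size
    batch_size=32,
    class_mode='categorical',
    subset='training')           # the part not held out by validation_split

validation_generator = datagen.flow_from_directory(
    datadirectory,
    target_size=(Tsize, Tsize),
    batch_size=32,
    class_mode='categorical',
    subset='validation')         # the held-out part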

Just to be complete, create a callback, for example for saving the best model.

First, create a folder on the DSVM to save the model into. Go again into the terminal/cmd:

mkdir ~/models

Then create the callback, again with the full path to the folder on the DSVM:
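For example (the file name is a placeholder):

# save the model with the best validation loss into ~/models
checkpoint = ModelCheckpoint(
    '/home/<username>/models/best_model.h5',  # again the full path, not ~/models
    monitor='val_loss',
    save_best_only=True,
    verbose=1)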

And fit the model on the ImageDataGenerator:
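A sketch (the number of epochs is a placeholder):

model.fit_generator(
    train_generator,
    steps_per_epoch=train_generator.samples // train_generator.batch_size,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // validation_generator.batch_size,
    epochs=10,
    callbacks=[checkpoint])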

If you check the models folder now, you will find the saved model. This can of course also be saved directly into your mounted blob, or copied to it later, so that you keep access to it.

Thanks for reading through this wall of text. If you have any questions, please send them along.

Best Regards,

Jan

