LSTM for Predictive Maintenance on Pump Sensor Data

In this article, we walk through a time series analysis to explain the thought process in a predictive maintenance case.

Jan Werth
Towards Data Science


Photo by Khamkéo Vilaysing on Unsplash

Table of Contents

Prerequisite
Introduction
The Data

— Overall first glance
— Target data (y)
Cleaning Data
— Removing NaNs
Further Options
— Outlier removal
— Feature engineering
— Feature selection
Preparing Data for LSTM
— Create Timeseries
— Splitting Data into Train Validation Test Sets
— Normalizing/Standardizing Data
— Reshape and One-Hot-Encode
Using a Simple LSTM
— Question of layers and units
— A simple model
Results
Further Options
Summary of Steps

Prerequisite

All code can be found in this Git repo.
To recreate this article, you can find the dataset here.

I suggest using Anaconda to create a Python 3.6 environment and installing the following Python packages:

TensorFlow (pip install tensorflow)
Pandas (pip install pandas)
NumPy (pip install numpy)
Scikit-learn (pip install scikit-learn)
Matplotlib (pip install matplotlib)

Introduction

In this article, we are looking into predictive maintenance for pump sensor data. Our approach is quite generic for time-series analysis, even though each step might look slightly different in your own project. The aim is to give you an idea of the general thought process and how you would attack such a problem. For a complete overview of the steps, please see the figure at the end of the article (maybe open it in parallel while reading). It should help to organize all steps in a logical order.

If you have a similar project and are simply searching for a walkthrough, you can check out the code on GitHub and adapt it to your needs.

To limit the article length, a lot of points are only mentioned and not explored in depth. However, we point to relevant articles for the details. We hope the given information helps you understand the general approach to a time series analysis.

The Data

First overall glance

We first check the general look of the data. What is in front of us? How much data do we have? How is the data organized? What type of data does each part contain?

So we read the data with one line in pandas and get some overview with simple print statements.
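
As a rough sketch of this first glance (the file name is an assumption based on the Kaggle pump sensor dataset; adjust it to your download), the overview could look like this:

```python
import pandas as pd

# Load the raw CSV (file name assumed; adjust the path to your download)
df = pd.read_csv("sensor.csv")

# Quick overview: shape, column names, data types, and the first rows
print(df.shape)
print(df.columns)
print(df.dtypes)
print(df.head())
```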


From the prints we can derive the following info:

The CSV file contains 55 columns, each with over 200K entries.


The data is separated into 52 sensor columns, one machine status (target/results) column, a timestamp column, and one Unnamed column, which is just the original index column.


From the timestamp, we can see that the data is recorded in 1-minute steps. A quick look shows us that the sensor data is in float32 with varying amplitude and the timestamp is in the yyyy-MM-dd HH:mm:ss format.


Target Data (y)

After the first glance, the second most important thing for me is to take a look at the target data (y data / results). This will give some indication of the solution strategy.

Now, after the first glimpse at the data, we have to check what results (called the target) are provided, which we can later use to identify/define our predictive maintenance goal. Some questions I ask myself while typing the code to extract the target data are:

  • Do we have target information, or will it be an unsupervised task?
  • Is the target data continuous or boolean?
  • What data type is the target in? Values, text, …
  • In what intervals are the targets recorded? Does every sensor reading get its own target entry, or are they grouped?
  • Does each sensor have its own target (one machine — one sensor), or do we have one target for all sensors (one machine — many sensors)?
  • Is the target descriptive, or do we have to identify what counts as broken and what as normal?
  • Is the target information complete and useful?

To get some answers, we extract the unique classes and see how many values we have for each class. Finding that the labels are in text format, we know that we will have to convert them at some point into integer values to be usable for our later ML algorithms. To convert them into corresponding integer values, you can use a mapper function from scikit-learn. Once we have the mapped target, we should plot it to get an overview of what happens when.

Get info about target data [from author]
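
A minimal sketch of this target inspection, assuming the target column is called machine_status (as in the Kaggle dataset) and using scikit-learn's LabelEncoder as the mapper function:

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

# Unique classes and how often each occurs
print(df["machine_status"].unique())
print(df["machine_status"].value_counts())

# Map the text labels to integers for the later ML algorithms
encoder = LabelEncoder()
df["target"] = encoder.fit_transform(df["machine_status"])

# Plot the mapped target to see when the faulty episodes occur
df["target"].plot(figsize=(12, 3), title="Machine status over time")
plt.show()
```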

So we see that there are three classes in the dataset. As mentioned, they are in text form and already give us a good indication of what they mean. In other cases, you might just receive a status [A, B, C] or similar.

Classes available [from author]

After counting the values in each class, we see that "Normal" is the majority class, which is to be expected, as the machine should run normally most of the time. The classes "Recovering" and "Broken" are the minority classes. We see directly that we most likely cannot work with the Broken class, as seven values are not enough to learn any pattern.

Amount of values per class [by author]

We also see that summing the counts of all classes gives the same number as the number of rows. This means there are no missing values (NaN) in the target data (of course, that could also be checked with an isna() call).

The plot of the target shows us that the faulty sections are not grouped at, e.g., the end, but are rather dispersed over the entire data length. This is interesting for the later segmentation into train and test sets. We also see that the Recovering class always follows the Broken class. This means that having only 7 entries for the Broken class is not a problem, as we just have to predict the recovering phase to also catch the Broken class.

Target data [by author]

To answer some of the earlier questions: we have good and complete target data, with one entry for each row, resulting in a one-machine — many-sensors supervised learning task.

Cleaning Data

Now that we understand the data, we have to look at the quality of the sensor data and most likely manipulate/repair/augment the data to be useful for later training.

Removing NaNs

The first step in data cleaning is to check for NaN values, i.e., episodes where a sensor did not send any data. These are not to be confused with zero values, which could actually mean that the value is zero (lots of zeros in a dataset make it a sparse dataset). Here the questions are:

  • How much of the data is NaN?
  • Is it only a few sensors, or do most of them have NaNs?
  • Do some sensors have much more NaNs than others?
  • Are the NaNs clustered together or spread?
  • Can we fill the NaNs or do we have to delete them?

So we first check, print, and plot the NaNs as they are. We notice right away that Sensor_15 is completely empty, so we remove it to scale the data better.

Showing NaNs, with dropping the sensor_15 [by author]
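
A short sketch of this check, assuming the column naming of the Kaggle dataset:

```python
# Count NaNs per column to see which sensors are affected the most
nan_counts = df.isna().sum().sort_values(ascending=False)
print(nan_counts.head(10))

# sensor_15 is completely empty, so we drop it
df = df.drop(columns=["sensor_15"])
```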

We could now simply remove all NaNs from the data; however, we would lose about 77,000 timesteps, which would be roughly 35%. Therefore, we try to handle the NaNs sensor by sensor instead.

If we look at the next largest holders of NaNs, we find sensors 50 and 51, and we can see several things. First, sensor 50 just gives out at some point, so we will remove it. The other option here would be to remove all data from all sensors from time step ~140000 onwards… so not really an option (although in other cases it might actually be the only option).

Comparing sensor_50 and sensor_51 [by author]

Second, both sensors have a very similar amplitude and range of values. Moreover, the yellow-marked parts are actually very similar. We saw in other sensors (see below) that the drop at mark 140000 occurs in almost all sensors, and sensor 51 seems to show this drop right after its data gap as well. Therefore, we decided to use sensor 50 to repair sensor 51. Not the cleanest way, but here definitely possible.

After dropping sensor_50, we see that sensor_00 and the sensors 06–09 now show the most NaNs. This is a good time to check the variance, which shows how much a signal deviates from its own mean. In other words, does this signal move in any way? If we want to detect a trend or a change in class, it is good if the signal shows variance.

Sensors and variance of sensor data [by author]

As we see in the marked part, the sensors 00 and 06–09 do not show high variance. So we allow ourselves to delete those, as we think they will not add enough valuable information compared to the amount of overall signal we would otherwise lose.

After the drop, the remaining sensors still show some NaNs. Now we try to fill them via the fillna() function with a limit set to 30, meaning a maximum of 30 consecutive NaN values is filled. This leaves two sensors with around 200 NaNs each. Finally, we just remove those rows, as they represent only 0.09% of the data.

Sidenote: Here is a nice Stackoverflow discussion on how to fill NaNs most efficiently.

Filling consecutive NaNs [by author]
Removing NaNs [from author]
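
As a sketch of these two steps (forward-filling up to 30 consecutive NaNs, then dropping what remains), assuming the remaining sensor columns still follow the sensor_xx naming:

```python
# All remaining sensor columns (naming assumed)
sensor_cols = [c for c in df.columns if c.startswith("sensor_")]

# Fill at most 30 consecutive NaNs per sensor by carrying the last value forward
df[sensor_cols] = df[sensor_cols].ffill(limit=30)

# Drop the few rows that still contain NaNs (~0.09% of the data)
df = df.dropna(subset=sensor_cols)
print(df.isna().sum().sum())  # should now print 0
```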

Further Options

You can always do more in preparation. For a proof of concept, I suggest going ahead with the NaN-free data and coming back to further data preprocessing after you have established a baseline model performance.

Now, the first step towards machine learning is done. In this rudimentary state, the data could already be read and used to train a machine learning network. However, beyond the required splitting and later scaling of the data, there are more options to optimize it for a well-running solution.

The most common steps for further optimization are:

  • Outlier removal / Noise reduction
  • Feature engineering
  • Feature selection

After a short dive into the topic, we will nevertheless continue with the signals in their current state. The first goal should be to get a running proof of concept (POC). Optimization should come afterward, when continued signal cleaning, feature creation, etc. are in order.

Outlier removal / Noise reduction

Outlier removal means identifying and removing parts of the signal which do not contribute to the pattern recognition or even disrupt the algorithm. This can be done via different, already established algorithms. A good overview can be found in the following article:

One thing to keep in mind with outlier removal, as well as noise reduction, is that those parts of the signal can actually carry valuable information. So blindly disregarding them and automatically removing outliers and noise might not ruin your algorithm, but it could degrade its performance. This goes especially for noise reduction: hastily executed noise reduction can threaten your analysis.

General ways to increase the signal-to-noise ratio are moving averages or Kalman filter strategies.
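
As an illustration only (we continue with the raw signals in this article), a moving average could look like this; the window of 10 minutes is an arbitrary example:

```python
# Simple moving-average smoothing as a baseline noise-reduction step
df_smooth = df[sensor_cols].rolling(window=10, min_periods=1).mean()
```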

Feature engineering

In our case, we can use the sensor signals directly as feature inputs. However, if your problem is more complex, with unsteady, noisy, and highly volatile signals, you want to bring in your knowledge of the problem. This knowledge, or generally well-engineered features, often leads to better model performance. Put differently, good features reduce the need for a perfect model.

This topic would fill another article, therefore we are not diving into it but point you to a great article by Jason Brownlee:

Feature selection

This topic, too, would need an article of its own, so we will rather explain a few points briefly regarding our dataset and point to more literature by Jason Brownlee for your deep dive.

Feature selection is done because redundant, bad, or uninformative data will hinder rather than boost your machine learning performance. We always think the more data, the better, but more precisely it should say: the more data from good features, the better.
By feature selection, we help the algorithm by sorting out beforehand which features will contribute to solving the problem.

A rule of thumb is: the simpler the algorithm (e.g., k-NN), the more it benefits from feature selection, as it would otherwise have difficulties separating the meaningful from the redundant features (drowning in features). More complex algorithms like random forests and ANNs are able to find the important features by themselves; here, prior feature selection "just" reduces the computational effort.

Just to have it mentioned, regularization can replace feature selection in some cases. An interesting discussion about L1 regularization and feature selection can be found here. A good description can be read here.

When looking at all our sensors, we can see that there are sets of sensors that look very similar to each other (grouped by color in the image below). This means that reducing the number of input sensors via a correlation method or principal component analysis (PCA) would work well.

Grouped sensor signals [by author]
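
As a sketch of such a correlation-based reduction (again for illustration only; we keep all remaining sensors in this article), with an arbitrary threshold of 0.95:

```python
import numpy as np

# Absolute correlation matrix of the sensors
corr = df[sensor_cols].corr().abs()

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Candidate sensors to drop: highly correlated with another sensor
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)
```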

Training a Model

As the data is base-prepared, we can now choose an algorithm type to solve the prediction problem. Depending on the amount of data, the complexity of the problem, the final hardware to run on, and other considerations, you can choose between different algorithms.

As we did no feature selection, we should use either a random forest classifier/predictor or a type of ANN, as both work as integrated feature selectors.

We choose an LSTM so that we can later test its support on embedded hardware.

Preparing Data for LSTM

After choosing a prediction method, the data has to be prepared again to fit that specific method. Data preparation for classic ML and for ANNs is basically similar, but differs in small details (e.g., the input dimensions).

Create Timeseries

In its current form, the data could be used to train a classifier on the current class. However, we want to predict future classes based on the current values. Therefore, we need to shift the data against the target to create a time gap. Here, thanks again to Jason Brownlee for a great article on that topic.

Shifting x [by author]

The code below block by block:

  • n_in: How many timesteps to forecast. n_out: How many targets to forecast with the n_in. n_out >1 would mean to forecast a series of target values. Both can be used to look into the future. We only use n_out=1 here.
  • line 6–7, a shifted input is generated based on the n_in amount. So if, e.g., n_in=3, then for each of the remaining 46 sensors, 3 shifted signals and corresponding names are created in a DataFrame.
  • line 10–15, shift forward to create a target sequence.
  • line 17–18, combine everything in one DataFrame
  • line 20–22, drop the NaNs created by the shift
Creating shifted signals for time series prediction [by author]
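
A sketch of such a helper, closely following Jason Brownlee's widely used series_to_supervised() function (the exact line numbers referenced above belong to the original gist):

```python
def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a time series as a supervised learning DataFrame."""
    df_in = pd.DataFrame(data)
    cols, names = [], []
    # input sequence (t-n_in, ..., t-1)
    for i in range(n_in, 0, -1):
        cols.append(df_in.shift(i))
        names += [f"{col}(t-{i})" for col in df_in.columns]
    # forecast sequence (t, t+1, ..., t+n_out-1)
    for i in range(0, n_out):
        cols.append(df_in.shift(-i))
        names += [f"{col}(t)" if i == 0 else f"{col}(t+{i})" for col in df_in.columns]
    # combine everything in one DataFrame
    agg = pd.concat(cols, axis=1)
    agg.columns = names
    # drop the rows with NaNs created by the shifting
    if dropnan:
        agg.dropna(inplace=True)
    return agg
```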

This function returns a DataFrame with the original and shifted values (image below).

DataFrame with x(t) and x(t-1) [by author]

We now have to remove the values we do not want. In our case, that would be:

  • the sensors at x(t)
  • the sensors at x(t-n), where n is every shift except the desired one,
  • and the sensor_44 (target) at x(t-n).

So we will be left with the desired shifted data and the unshifted target. The code below removes the unwanted values:

  • Line 1, set how many steps to look into the future
  • Line 2, first call series_to_supervised() to create the shifted data.
  • Line 3, get all the names of the unshifted values
  • Line 4, get all the names of the shifted values except the timestep distance we want to use for prediction (future value). This is actually not necessary; leaving all shifted values in might improve performance. Test the difference on your own project.
  • Line 6–8, get the target data out and remove the collected names.
Remove unwanted data [by author]
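
A hypothetical sketch of this removal step, assuming the column names produced by the helper above and the integer target column created earlier (the exact line numbers in the bullets belong to the original gist):

```python
n_future = 10  # how many minutes ahead we want to predict
framed = series_to_supervised(df[sensor_cols + ["target"]], n_in=n_future, n_out=1)

# Keep only the sensor inputs shifted by exactly n_future steps ...
x_cols = [c for c in framed.columns
          if c.endswith(f"(t-{n_future})") and not c.startswith("target")]
X = framed[x_cols].values

# ... and the unshifted target as y
y = framed["target(t)"].values
```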

Now we have the shifted X data and the unshifted Y data.

Split Data into Train Validation Test Sets

The next step is to split the data into train, validation, and test sets. Normally, this is done with, e.g., the sklearn.model_selection.train_test_split() or sklearn.model_selection.StratifiedKFold() function. However, for our proof of concept, we split by hand to have a bit more control. We choose:

Manually separated data [by author]
Splitting the data [by author]
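
A sketch of such a manual split; the cut points below are illustrative ratios, not the author's exact choice (which depends on where the faulty episodes lie):

```python
# Illustrative manual train/validation/test split
n = len(X)
train_end = int(n * 0.6)
val_end = int(n * 0.8)

X_train, y_train = X[:train_end], y[:train_end]
X_val, y_val = X[train_end:val_end], y[train_end:val_end]
X_test, y_test = X[val_end:], y[val_end:]
```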

Now we have our sets ready. Again, for a proper k-fold analysis, this should not be done manually, but rather with the sklearn.model_selection.KFold() or sklearn.model_selection.StratifiedKFold() function (stratified means that each split preserves the class proportions).

Normalize / Standardize

The next step is to normalize the data. Normalization is necessary if the different input features (here, the sensors) have different amplitude ranges; otherwise, values are misrepresented by machine learning methods. Machine learning methods mainly use multiplications within their functions, and multiplying large values leads to even larger values, which are misinterpreted as "importance". Scaling all values to, e.g., [0, 1] is the basis for a neutral comparison.

Scaling data [from author]
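
A minimal sketch of the scaling, fitting the scaler on the training data only and reusing it for validation and test:

```python
from sklearn.preprocessing import MinMaxScaler

# Scale all features to [0, 1]; fit on the training split only
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_s = scaler.fit_transform(X_train)
X_val_s = scaler.transform(X_val)
X_test_s = scaler.transform(X_test)
```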

Reshape and One-Hot-Encode

In the final two steps, we have to reshape the data, as the LSTM expects its input in the form [samples, timesteps, features].

And, as we are looking at a classification task, we have to one-hot-encode our target before training, so that the softmax activation interprets the classes correctly without misinterpreting the class order (0, 1, 2) as a ranking or importance. This is easily done with the sklearn.preprocessing.OneHotEncoder() function.

Both combined in one function:
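
A sketch of such a combined function, assuming one timestep per sample and the three integer classes (0, 1, 2) created by the label mapping earlier:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encoder for the three classes, fit once on the training targets
ohe = OneHotEncoder(categories=[[0, 1, 2]])
ohe.fit(np.asarray(y_train).reshape(-1, 1))

def reshape_and_encode(X, y, timesteps=1):
    """Reshape X to [samples, timesteps, features] and one-hot-encode y."""
    X_3d = X.reshape((X.shape[0], timesteps, X.shape[1] // timesteps))
    y_ohe = ohe.transform(np.asarray(y).reshape(-1, 1)).toarray()
    return X_3d, y_ohe

X_train_3d, y_train_ohe = reshape_and_encode(X_train_s, y_train)
X_val_3d, y_val_ohe = reshape_and_encode(X_val_s, y_val)
X_test_3d, y_test_ohe = reshape_and_encode(X_test_s, y_test)
```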

Now we are actually ready to train an algorithm on the different classes. We will choose the LSTM.

Using a Simple LSTM

Question of Layers and Units

So we first build ourselves a simple LSTM model. With deep neural networks, there are mainly two questions: how many hidden units (neurons, filters, …) per layer, and how many layers. On the question of how deep a model should be, the rule of thumb is:

If you build your own model (not pretrained), start with one or two layers and gradually add more while comparing performance.

With LSTMs in general, you could ask whether more layers mean more long-term memory. Here is an interesting article from Jason Brownlee about the topic. To summarize, it seems that more LSTM layers provide different scaling in time and hence better time resolution / better performance. This is equivalent to more resolution in a CNN with more layers. Attention: more layers mean more complexity and often lead to overfitting.
The number of hidden units (neurons, filters, …), on the other hand, is handled differently than in CNNs. In CNNs, the rule of thumb is to increase the number per layer, as more detailed information has to be analyzed. In LSTMs, the number of memory cells is less important than the number of layers when it comes to analyzing time series. The rule of thumb here is: fewer hidden units (memory cells) than input features. Start small, e.g., 2, and slowly increase. So just try different values, but put less emphasis on them. Here we choose 42 ;-).

A Simple Model

The model we try for our proof of concept looks like this:

The model [by author]

Two LSTM layers with 42 hidden units each, and two output layers. You could use a simple sequential model with one output here; we use two because we want to show different use cases. The signal_out is a Dense layer with one unit, giving us a predicted signal, while the class_out is a Dense layer with 3 units and softmax activation, giving us the predicted classes "Normal", "Recovering", and "Broken".

The model [by author]
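
A sketch of such a two-output model with the Keras functional API; the layer sizes and the names signal_out and class_out follow the text, while everything else (optimizer, losses, metrics) is an assumption:

```python
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

n_features = X_train_3d.shape[2]

# Two stacked LSTM layers with 42 units each
inputs = Input(shape=(1, n_features))
x = LSTM(42, return_sequences=True)(inputs)
x = LSTM(42)(x)

# Two heads: a continuous signal and a 3-class softmax
signal_out = Dense(1, name="signal_out")(x)
class_out = Dense(3, activation="softmax", name="class_out")(x)

model = Model(inputs=inputs, outputs=[signal_out, class_out])
model.compile(
    optimizer="adam",
    loss={"signal_out": "mse", "class_out": "categorical_crossentropy"},
    metrics={"class_out": "accuracy"},
)
model.summary()
```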

The signal_out does not seem to make much sense here, but we want to show how you could use this approach to improve the prediction (see the results) and predict a signal instead of a class.

Now let's train the model:

Train model [by author]
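
A sketch of the training call. Here, the integer-encoded class is used as the regression target for signal_out (an assumption on our side; you could also feed a sensor signal, as described in the results), and the number of epochs and batch size are illustrative:

```python
history = model.fit(
    X_train_3d,
    {"signal_out": y_train, "class_out": y_train_ohe},
    validation_data=(X_val_3d, {"signal_out": y_val, "class_out": y_val_ohe}),
    epochs=20,
    batch_size=64,
)
```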

Results

The training metrics look promising. Validation (here labeled test) loss and accuracy look good.

Accuracy and loss [by author]

But we see one thing: even though the predicted signal you see below is not perfect, the validation accuracy shows 99%. This is a nice example of why not to use accuracy for an imbalanced dataset. Just by classifying everything as the majority class, we already reach a high accuracy. Therefore, it is better to use metrics like the Kappa statistic, ROC, F1, ANOVA, … for imbalanced datasets. And to be scientifically watertight, use more than one of those metrics and do k-fold cross-validation.

Now run inference.

Run inference [by author]
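
A minimal sketch of the inference step for the two-output model:

```python
# Predict on the test set; the model returns one array per output head
signal_pred, class_pred = model.predict(X_test_3d)

# Back from one-hot probabilities to a class index per timestep
predicted_class = class_pred.argmax(axis=1)
```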

Finally, let's plot the target and the predicted target over each other:

Predicted target vs original target [by author]

You can see that the faulty episode is not directly classified (due to only 7 samples to work with); however, the recovering phase is detected quite well and, in this case, 10 min in advance. As an engineering problem, that is enough, as we positively detect the start and end of the faulty event and have time to react (switch off, transfer to another pump, lower the speed, …). If you need more time in advance, increase the n_in variable of the series_to_supervised() function. However, the shifted DataFrame gets quite large, and my laptop was not able to handle it anymore. Time to move to Azure.

If we plot the predicted class (the signal_out output of the model) as a continuous signal instead of the arg-max class, we can see that this might result in a more stable classification if we set a simple threshold on, e.g., the output value, or if we threshold its calculated variance.

Predicted target as continuous signal vs original target as binary [by author]

Generally, with this output method, you can predict a signal. That means you could use some of the sensor data to predict another sensor's upcoming data. The only thing you have to change for this approach is to use the signal you want to predict as the target instead of the classes (e.g., the data of Sensor_42 is now the target).

Further Options

If the results are not satisfying in your project, the following options are available (and I probably missed a ton):

  • Play with the hyperparameters
  • Add more layers
  • Use the above-mentioned signal processing/ cleaning steps
  • Use lightweight “classic” machine learning
  • Use novel Transformer models
  • Synthesize more minority states with GAN, SMOTE, ADASYN, …
  • Weight minority class penalty

Summary of Steps

This was quite long and maybe confusing. Therefore, we summarize the thoughts behind each chapter once more below. This will allow you to easily reconstruct the train of thought.
Please reuse this image, but don’t forget to cite it.

Train of thought [by author]


Hi, I am a carpenter and electrical engineer with over 10 years of experience in signal processing, machine learning, and deep learning. linkedin.com/in/jan-werth