Monday, 22 July 2019

Spark+AI Summit Europe 2019: I'll be there!

I am glad to have been selected to speak at Spark+AI Summit Europe 2019, which will take place in Amsterdam, Netherlands, on October 15th-17th 2019. I am going to present a follow-up to one of the core topics of my book: memory management in distributed Deep Learning with DL4J on Apache Spark. More details about my talk will follow in the coming weeks.
As usual, the summit will have an impressive line-up of speakers, such as Matei Zaharia, Ali Ghodsi, Holden Karau, Luca Canali, Gael Varoquaux, Christopher Crosbie, Michael Armbrust and many others. I hope you will attend this event.

Sunday, 23 June 2019

Time Series & Deep Learning (Part 3 of N): Finalizing the Data Preparation for Training and Evaluation of a LSTM

In the third part of this series I am going to complete the description, started in part 2, of the data preparation process for training and evaluating an LSTM model for time series forecasting. The data set is the same one used in part 1 and part 2. As in all the posts of this series, I am referring to Python 3.
In part 2 we learned how to frame the time series as a supervised learning problem. That isn't enough yet to feed our LSTM model: two more transformations are needed.
The first thing to do is to transform the time series data to make it stationary. The input time series used for these posts presents values that depend on time. Looking at its plot in part 1, we can notice an increasing trend up to January 2006, then a decreasing trend up to February 2010, and finally a second increasing trend from there to date. Making this data stationary makes life easier: we can remove the trends from the observed values before training and then add them back to the forecasts in order to return the predictions to the original scale. Trends can be removed by differencing the data: we subtract the value at time t-1 from the current value (at time t). The pandas DataFrame provides a function, diff(), for this purpose. Here we implement two functions of our own, one which returns the differenced time series:

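from pandas import Series

def difference(timeseries, interval=1):
    # Subtract from each value the observation `interval` steps earlier
    diff_list = list()
    for idx in range(interval, len(timeseries)):
        diff_value = timeseries[idx] - timeseries[idx - interval]
        diff_list.append(diff_value)
    return Series(diff_list)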

and another one to invert the process before making forecasts:

def invert_difference(historical, inverted_scale_ts, interval=1):
    return inverted_scale_ts + historical[-interval]
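
To see that differencing loses nothing, here is a minimal plain-Python sketch of the round trip. It uses lists instead of a pandas Series, and the sample values are the first five observations shown in part 1. The helper bodies are simplified stand-ins for the two functions above, not a replacement for them.

```python
def difference(values, interval=1):
    # values[t] - values[t - interval] for every t where both exist
    return [values[i] - values[i - interval] for i in range(interval, len(values))]

def invert_difference(history, diff_value, interval=1):
    # Add back the observation that was subtracted `interval` steps earlier
    return diff_value + history[-interval]

observed = [10845, 10904, 10986, 10738, 10777]
diffs = difference(observed)
print(diffs)  # [59, 82, -248, 39]

# Rebuild every observation from the previous one plus its difference
rebuilt = [observed[0]]
for diff_value in diffs:
    rebuilt.append(invert_difference(rebuilt, diff_value))
print(rebuilt == observed)  # True
```

Note that the differenced series is one element shorter than the original: the first observation has no predecessor to subtract.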

The last thing to do is to rescale the input time series observations to a specific range. The neural network model we are going to use is an LSTM. The default activation function for LSTMs is the hyperbolic tangent, whose output values are in the range between -1 and 1, so the best range for the time series used in this example is the same one. The min and max scaling coefficients need to be calculated on the training data set and then used to scale the test data set as well. The scikit-learn package comes with a specific class for this, MinMaxScaler. We need to implement two functions, one to calculate the scaling coefficients and apply them:

from sklearn.preprocessing import MinMaxScaler

def scale_data_set(train_data, test_data):
    # Calculate the scaling coefficients on the training data only
    min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
    min_max_scaler = min_max_scaler.fit(train_data)
    # Apply the same coefficients to both the training and the test data
    train_data = train_data.values.reshape(train_data.shape[0], train_data.shape[1])
    scaled_train_data = min_max_scaler.transform(train_data)
    test_data = test_data.values.reshape(test_data.shape[0], test_data.shape[1])
    scaled_test_data = min_max_scaler.transform(test_data)
    return min_max_scaler, scaled_train_data, scaled_test_data

and a second one to invert the scaling for the forecasted values:

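import numpy

def invert_scale(min_max_scaler, scaled_array, value):
    # Rebuild a full row (inputs plus forecast) so that its width matches what the scaler was fitted on
    row = [elem for elem in scaled_array] + [value]
    array = numpy.array(row)
    array = array.reshape(1, len(array))
    inverted_array = min_max_scaler.inverse_transform(array)
    # The forecast is the last element of the inverted row
    return inverted_array[0, -1]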
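
To make the arithmetic behind the (-1, 1) scaling explicit, here is a minimal plain-Python sketch of the min-max formula, with the coefficients taken from the training values only. The function names and the sample numbers are illustrative assumptions; in practice MinMaxScaler does this work for you.

```python
def fit_min_max(train_values):
    # The coefficients come from the training data only
    return min(train_values), max(train_values)

def scale_value(value, lo, hi):
    # Map [lo, hi] linearly onto [-1, 1]
    return 2.0 * (value - lo) / (hi - lo) - 1.0

def invert_value(scaled, lo, hi):
    # Inverse of scale_value
    return (scaled + 1.0) / 2.0 * (hi - lo) + lo

train = [59.0, 82.0, -248.0, 39.0]   # made-up differenced values
test = [100.0]                       # lies outside the training range

lo, hi = fit_min_max(train)
scaled_train = [scale_value(v, lo, hi) for v in train]
scaled_test = [scale_value(v, lo, hi) for v in test]

print(min(scaled_train), max(scaled_train))  # -1.0 1.0
print(scaled_test[0] > 1.0)                  # True
```

Because the coefficients are fitted on the training set, test values beyond the training minimum or maximum end up slightly outside the (-1, 1) range. This is expected and is the price of not leaking information from the test set.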

Now we can put it all together. We first make the input data stationary:

raw_values = series.values
diff_values = difference(raw_values, 1)

then transform the data to make the problem like a supervised learning case:

supervised = tsToSupervised(diff_values, 1)

then split the data for training (years from 1992 to 2010) and test (years from 2011 to date):

train, test = supervised[0:-98], supervised[-98:]

and finally scale both sets, using coefficients calculated on the training data only:

scaler, train_scaled, test_scaled = scale_data_set(train, test)
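
Once a model produces forecasts on the scaled, differenced data, the two transformations have to be undone in reverse order: first the scaling, then the differencing. The following dependency-free sketch illustrates that order on a single value. The `scaled_forecast` value is made up for illustration, and the min-max arithmetic is written by hand rather than through MinMaxScaler.

```python
raw = [10845, 10904, 10986, 10738, 10777]           # first observations from part 1
diffs = [raw[i] - raw[i - 1] for i in range(1, len(raw))]
lo, hi = min(diffs), max(diffs)                     # scaling coefficients

scaled_forecast = 0.5                               # hypothetical model output in (-1, 1)

# Step 1: undo the scaling to get a forecast of the difference
diff_forecast = (scaled_forecast + 1.0) / 2.0 * (hi - lo) + lo

# Step 2: undo the differencing by adding the last known observation
forecast = raw[-1] + diff_forecast
print(forecast)  # 10776.5
```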

At this stage we can now build and train our LSTM. But this will be the core topic of the next post of this series.
The complete example will be released as a Jupyter notebook at the end of the first part of this series.

Friday, 7 June 2019

Time Series & Deep Learning (Part 2 of N): Data Preparation for Training and Evaluation of a LSTM

In the second post of this series we are going to learn how to prepare data for training and evaluating an LSTM neural network for time series forecasting. As in every other post of this series, I am referring to Python 3. The data set used is the same as in part 1.
LSTM (Long Short-Term Memory) neural networks are a specialization of RNNs (Recurrent Neural Networks) introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 to solve the vanishing gradient problem affecting RNNs. LSTMs are used in real-world applications such as language translation, text generation, image captioning, music generation and time series forecasting. You can find more info about LSTMs in my book, or wait for one of the next posts of this series. This post focuses mostly on one of the best practices for preparing a data set before using it to train and evaluate an LSTM on a time series forecasting problem with the Keras library.
Let's load the data set first:

from datetime import datetime
from pandas import read_csv

def parser(x):
    return datetime.strptime(x, '%Y-%m-%d')

features = ['date', 'value']
# squeeze('columns') turns the single remaining column into a Series (recent pandas releases no longer accept squeeze=True)
series = read_csv('./advance-retail-sales-building-materials-garden-equipment-and-supplies-dealers.csv', usecols=features, header=0, parse_dates=[1], index_col=1, date_parser=parser).squeeze('columns')

The first thing to do is to transform the time series so that the forecasting can be treated as a supervised learning problem. In supervised learning a data set is typically divided into input (containing the independent variables) and output (containing the target variable). We are going to use the observation from the previous time step (identified as t-1) as input and the observation at the current time step (identified as t) as output. There is no need to implement this transformation logic from scratch, as we can use the shift function available for pandas DataFrames. The input variables can be built by shifting all the values of the original time series down by one place. The output is the original time series. Finally we concatenate both series in a DataFrame. Because we need to apply this process to the values of the original data set, it is good practice to implement a function for it:

from pandas import DataFrame
from pandas import concat

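def tsToSupervised(series, lag=1):
    seriesDf = DataFrame(series)
    # Input columns: the series shifted down by 1..lag time steps
    columns = [seriesDf.shift(idx) for idx in range(1, lag+1)]
    # Output column: the original series
    columns.append(seriesDf)
    seriesDf = concat(columns, axis=1)
    seriesDf.fillna(0, inplace=True)
    return seriesDf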

supervisedDf = tsToSupervised(series, 1)
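
To illustrate the framing without pandas, here is a small plain-Python sketch that builds the same (t-1, t) pairs from a list. The `to_supervised` name and the `fill` parameter are assumptions made for this example. The sample values are the first observations of the data set from part 1.

```python
def to_supervised(values, lag=1, fill=0):
    rows = []
    for t, current in enumerate(values):
        # Inputs: the observations at t-1 .. t-lag, or `fill` where they don't exist
        lagged = [values[t - k] if t - k >= 0 else fill for k in range(1, lag + 1)]
        # Output: the observation at t
        rows.append(lagged + [current])
    return rows

rows = to_supervised([10845, 10904, 10986, 10738])
for row in rows:
    print(row)
# [0, 10845]
# [10845, 10904]
# [10904, 10986]
# [10986, 10738]
```

The first row has no previous observation, so its input is filled with 0, which mirrors the fillna(0) call in the pandas version.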

Calling supervisedDf.head() shows a sample of what the supervised DataFrame looks like.


Is the data set now ready to be used to train and validate the network? Not yet. Other transformations are needed, but they will be the topic of the next post.
The complete example will be released as a Jupyter notebook at the end of the first part of this series.

Wednesday, 5 June 2019

Time Series & Deep Learning (Part 1 of N): Basic Stuff

During the latest part of my career I have had the chance to work with Data Scientists with strong Python skills. My tech background, after a start with C/C++, is mostly in JVM programming languages (although I have had to touch several others during my career), so it was a great opportunity for me to learn more about practical Python, at least in the Machine Learning and Deep Learning spaces. I haven't been a full-time developer for a few years, as I moved to other roles, but this didn't stop me from following my passion for programming languages and staying hands-on with technology in general, not only because I enjoy it, but also because I believe that, whatever your role is, it is always important to understand the possibilities and the limits of technologies before making any kind of decision.
In this new long series I am first going to share some of the things I have learned about time series forecasting through Deep Learning using Python (with Keras and TensorFlow), and finally I will present a follow-up to my book, showing how to do the same with DL4J and/or Keras (with or without Spark). I am going to start today with the basics and then add more complexity with each post.
For all of the posts in this series I am going to refer to Python 3. I am going to focus on time series matters and assume that you have a working Python 3 environment and know how to install libraries in it.
In this first part, before moving to a DL model implementation, we are going to get familiar with some basic things. The Python libraries involved are pandas, matplotlib and scikit-learn.
What is a time series? It is a sequence of numerical data points indexed in time order. I have dealt (and still deal) with many time series in the last few years (you couldn't expect otherwise when working in businesses such as healthcare, cyber security or manufacturing), so time series analysis and forecasting have been really useful to me. Time series can be univariate (when a single variable is observed over time) or multivariate (when multiple variables are observed over time). In the first posts of this series I am going to refer to univariate time series only, in order to make the concepts as easy to understand as possible. At a later stage we are going to cover multivariate time series too.
Of course I can't share production data, so I have picked a public data set available on Kaggle for these first posts. The data set I am going to use is part of the Advance Retail Sales Time Series Collection, which is provided and maintained by the United States Census Bureau. There are different data sets available as part of the Advance Monthly Retail Trade Survey, which provides an indication of the sales of retail and food service companies. The one I am using for this post contains the retail sales related to building materials and garden equipment suppliers and dealers.
The first thing to do is to load the data set into a pandas DataFrame. The data set contains four features: value, date, realtime_end and realtime_start. The date feature comes as a string containing dates in the format YYYY-MM-DD. We need to convert those values to dates, so we define a function for this purpose, to be applied to the date values at loading time:

from datetime import datetime

def parser(x):
    return datetime.strptime(x, '%Y-%m-%d')
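
As a quick sanity check of the format string, here is a minimal sketch that applies the same conversion to a single date in the YYYY-MM-DD format described above. It relies on the standard library only, and the sample string is the first date of the data set.

```python
from datetime import datetime

def parser(x):
    # '%Y-%m-%d' matches a four-digit year, a two-digit month and a two-digit day
    return datetime.strptime(x, '%Y-%m-%d')

parsed = parser('1992-01-01')
print(parsed.year, parsed.month, parsed.day)  # 1992 1 1
```

A string in any other layout (for example '01/01/1992') would raise a ValueError, which is a convenient way to catch unexpected rows at loading time.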

We really need only the first two features (value and date), so when loading from the CSV file we are going to discard the other two. We also need to specify the column which contains the date values and the parser function to apply:

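from pandas import read_csv
features = ['date', 'value']
# squeeze('columns') turns the single remaining column into a Series (recent pandas releases no longer accept squeeze=True)
series = read_csv('./advance-retail-sales-building-materials-garden-equipment-and-supplies-dealers.csv', usecols=features, header=0, parse_dates=[1], index_col=1, date_parser=parser).squeeze('columns')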

We can now have a look at a sample of the loaded rows, for example by printing series.head().


The output should be something like this:

1992-01-01    10845
1992-02-01    10904
1992-03-01    10986
1992-04-01    10738
1992-05-01    10777
Name: value, dtype: int64

We can also plot the time series through matplotlib, for example with series.plot() followed by pyplot.show() (pyplot being imported from matplotlib).


The data set contains the monthly retail data from January 1992 to date. We need to split the data set into train and test data. We are going to use 70% of the data (years from 1992 to 2010) for training and the remaining 30% (years from 2011 to date) for validation:

X = series.values
train, test = X[0:-98], X[-98:]
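
The negative slice above keeps the last 98 observations for testing. As an illustration only, the same chronological split can be expressed as a percentage. The toy list and the `train_percent` name below are assumptions, not part of the original example. The important point is that the data is never shuffled, so the test set always follows the training set in time.

```python
# Toy stand-in for series.values (ten consecutive observations)
values = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
train_percent = 70  # hypothetical parameter

# Integer arithmetic avoids floating point rounding surprises
split_index = len(values) * train_percent // 100
train, test = values[:split_index], values[split_index:]

print(train)  # [10, 11, 12, 13, 14, 15, 16]
print(test)   # [17, 18, 19]
```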

One way to evaluate time series forecasting models using a test data set is the so-called walk-forward validation. No model training is required here, because we get each prediction simply by moving through the test data set one time step at a time. Translated into quick and dirty Python code, this is:

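history = [train_value for train_value in train]
predictions = list()
for idx in range(len(test)):
    # Loop body assumed: forecast the last observed value, then record the actual one
    predictions.append(history[-1])
    history.append(test[idx])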

At the end of the walk-forward validation, if we plot in a single chart both the observed and the predicted values:

import numpy as np
from matplotlib import pyplot
x_axis_values = np.arange(1, 99)
pyplot.plot(x_axis_values, test, label = 'Observed')
pyplot.plot(x_axis_values, predictions, label = 'Predicted')
pyplot.legend()
pyplot.show()

at a glance the situation seems good. But if we do a performance check:

from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(test, predictions))
print('RMSE: %.3f' % rmse)

we can see that the resulting RMSE (root-mean-square error) isn't so good:

 RMSE: 518.697
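
To show what goes into that number, here is a self-contained sketch of a walk-forward pass with a naive "repeat the last observation" forecast, followed by the RMSE computed by hand. The naive forecast and the helper names are assumptions made for illustration, and the toy values are the first five observations of the data set, so the resulting error is unrelated to the figure above.

```python
from math import sqrt

def walk_forward_persistence(train, test):
    history = list(train)
    predictions = []
    for observed in test:
        predictions.append(history[-1])  # forecast: the last known observation
        history.append(observed)         # then reveal the actual value
    return predictions

def rmse(actual, predicted):
    # Square root of the mean of the squared errors
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))

train = [10845, 10904, 10986]
test = [10738, 10777]

predictions = walk_forward_persistence(train, test)
print(predictions)  # [10986, 10738]

error = rmse(test, predictions)
print('RMSE: %.3f' % error)
```

The RMSE is expressed in the same unit as the observations, which makes it easy to compare against the typical month-to-month change in the series.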

In the next posts of this series you will learn how to build more robust and efficient models through LSTMs.

Thursday, 30 May 2019

Skills Matter Infiniteconf 2019 in July: I hope to see you there!

I am going to speak at the Skills Matter Infiniteconf 2019, which will take place in London (UK) on July 4th and 5th 2019. It is a very interesting conference with lots of unusual and fantastic talks. The agenda is still a work in progress, but it looks promising. My talk will be on Friday 5th at 1 PM. I am going to share some of my past experiences in applying Machine Learning and Deep Learning to DevOps, in particular to managing Apache Spark applications and clusters.
All the details about this conference are on the official website. Early bird tickets are available before June 11th. I hope to meet you there!