
Time Series & Deep Learning (Part 3 of N): Finalizing the Data Preparation for Training and Evaluation of an LSTM

In the 3rd part of this series I am going to complete the description, started in part 2, of the data preparation process for training and evaluating an LSTM model for time series forecasting. The data set used is the same as in part 1 and part 2. As in all the posts of this series, I am referring to Python 3.
In part 2 we learned how to transform the time series into a supervised learning problem. That isn't enough yet to feed our LSTM model: two more transformations are needed.
The first thing to do is to transform the time series data in order to make it stationary. The input time series used for these posts presents values that are dependent on time. Looking at its plot in part 1, we can notice an increasing trend up to January 2006, then a decreasing trend up to February 2010 and finally a second increasing trend from there to date. Making the data stationary makes life easier: we can remove the trends from the observed values before training and then add them back to the forecasts in order to return the predictions to the original scale. Trends can be removed by differencing the data: we subtract the value at time t-1 from the current value (at time t). The pandas DataFrame provides a method, diff(), for this purpose. Here we implement two functions by hand, one which returns the differenced time series:

from pandas import Series

def difference(timeseries, interval=1):
    # Subtract the observation `interval` steps back from each observation
    diff_list = list()
    for idx in range(interval, len(timeseries)):
        diff_value = timeseries[idx] - timeseries[idx - interval]
        diff_list.append(diff_value)
    return Series(diff_list)


and another one to invert the process on the forecasted values:

def invert_difference(historical, inverted_scale_ts, interval=1):
    return inverted_scale_ts + historical[-interval]
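
As a quick sanity check, differencing and its inversion can be tried on a small hand-made series. This is only a minimal sketch: the values below are hypothetical and are not taken from the data set used in this series, and it relies on the pandas diff() method instead of the functions above.

```python
import pandas as pd

# Hypothetical toy series standing in for the real data set
series = pd.Series([10.0, 12.0, 15.0, 14.0, 18.0])

# First-order differencing: value at t minus value at t-1.
# diff() leaves a NaN in the first position, which is dropped here.
diffed = series.diff(periods=1).dropna()

# Inversion: add each difference back to the preceding original observation
restored = diffed.values + series.values[:-1]

print(diffed.tolist())    # [2.0, 3.0, -1.0, 4.0]
print(restored.tolist())  # [12.0, 15.0, 14.0, 18.0]
```

Note that the differenced series is one element shorter than the original (for an interval of 1), and that the inversion always needs the original observation that precedes the value being restored.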


The last thing to do is to rescale the input time series observations to a specific range. The neural network model we are going to use is an LSTM. The default activation function for LSTMs is the hyperbolic tangent, whose output values lie between -1 and 1, so it makes sense to scale the time series used for this example to that same range. The min and max scaling coefficients need to be calculated on the training data set only and then reused to scale the test data set. The scikit-learn package comes with a specific class for this, MinMaxScaler. We need to implement two functions, one to calculate the scaling coefficients and apply them:

from sklearn.preprocessing import MinMaxScaler

def scale_data_set(train_data, test_data):
    min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
    # Fit the scaling coefficients on the training data only
    train_data = train_data.values.reshape(train_data.shape[0], train_data.shape[1])
    min_max_scaler = min_max_scaler.fit(train_data)
    scaled_train_data = min_max_scaler.transform(train_data)
    # Reuse the same coefficients for the test data
    test_data = test_data.values.reshape(test_data.shape[0], test_data.shape[1])
    scaled_test_data = min_max_scaler.transform(test_data)
    return min_max_scaler, scaled_train_data, scaled_test_data

and a second one to invert the scaling for the forecasted values:

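import numpy

def invert_scale(min_max_scaler, scaled_array, value):
    # Rebuild a full row (input features + forecasted value) so that its
    # shape matches what the scaler was fitted on
    row = [elem for elem in scaled_array] + [value]
    array = numpy.array(row)
    array = array.reshape(1, len(array))
    inverted_array = min_max_scaler.inverse_transform(array)
    # The last column holds the forecast, now back on the unscaled range
    return inverted_array[0, -1]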
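
To see what fitting on the training data only implies, here is a minimal sketch on hypothetical two-column arrays (input and target); the numbers are made up for illustration and are not taken from the data set of this series.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical (input, target) rows, made up for illustration
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[12.0, 25.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
scaler.fit(train)  # min and max are learned from the training rows only

train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Training values always land inside [-1, 1]
print(train_scaled.min(), train_scaled.max())

# A test value above the training maximum (12 > 10) maps above 1
print(test_scaled)

# inverse_transform brings the scaled rows back to the original range
print(scaler.inverse_transform(test_scaled))
```

Because the coefficients come from the training set, scaled test values can fall slightly outside the -1 to 1 range whenever the test data contains values that were never seen during training.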


Now we can put it all together. We first make the input data stationary:

raw_values = series.values
diff_values = difference(raw_values, 1)


then reframe the data as a supervised learning problem:

supervised = tsToSupervised(diff_values, 1)

then split the data into training (years from 1992 to 2010) and test (years from 2011 to date) sets:

train, test = supervised[0:-98], supervised[-98:]

and scale both sets, using coefficients computed on the training data:

scaler, train_scaled, test_scaled = scale_data_set(train, test)
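
The whole chain, including the way back from a scaled value to the original scale, can be sketched end to end on a toy series. Everything here is an assumption made for illustration: the values are hypothetical, the two-column DataFrame is only a stand-in for the output of tsToSupervised from part 2, the split size is arbitrary, and the true scaled target is used in place of a model forecast.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy series standing in for the real data set
raw_values = np.array([10.0, 12.0, 15.0, 14.0, 18.0, 21.0, 20.0, 25.0])

# 1. Make the data stationary with first-order differencing
diff_values = np.diff(raw_values)

# 2. Frame as supervised learning: input = difference at t-1, target = difference at t
#    (illustrative stand-in for tsToSupervised from part 2)
supervised = pd.DataFrame({"x": diff_values[:-1], "y": diff_values[1:]})

# 3. Split: the last 2 rows play the role of the test set
train, test = supervised[0:-2], supervised[-2:]

# 4. Fit the scaler on the training rows only, then scale both sets
scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train.values)
train_scaled = scaler.transform(train.values)
test_scaled = scaler.transform(test.values)

# 5. Round trip for the last test row, using the true scaled target
#    as if it were the model's forecast
x_scaled, y_scaled = test_scaled[-1, :-1], test_scaled[-1, -1]
row = np.append(x_scaled, y_scaled).reshape(1, -1)
y_diff = scaler.inverse_transform(row)[0, -1]  # back to the differenced scale
y_value = y_diff + raw_values[-2]              # add back the previous observation

# y_value matches the last raw observation, up to floating point error
print(y_value, raw_values[-1])
```

The order of the inverse steps matters: the scaling is undone first and the differencing second, which is the reverse of the order in which the two transformations were applied.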

At this stage we can build and train our LSTM, but that will be the core topic of the next post of this series.
The complete example will be released as a Jupyter notebook at the end of the first part of this series.
