
Time Series & Deep Learning (Part 3 of N): Finalizing the Data Preparation for Training and Evaluation of an LSTM

In the 3rd part of this series I am going to complete the description, started in part 2, of the data preparation process for training and evaluating an LSTM model for time series forecasting. The data set used is the same as for part 1 and part 2. As for all of the posts in this series, I am referring to Python 3.
In part 2 we learned how to transform the time series into a supervised learning problem. That isn't enough yet to feed our LSTM model: two more transformations are needed.
The first thing to do is to transform the time series data in order to make it stationary. The input time series used for these posts has values that depend on time: looking at its plot in part 1, we can notice an increasing trend up to January 2006, then a decreasing trend up to February 2010 and finally a second increasing trend from there to date. Making the data stationary makes life easier: we can remove the trends from the observed values before training and then add them back to the forecasts in order to return the predictions to the original scale. Trends can be removed by differencing the data: we subtract the value at time t-1 from the current value (at time t). The pandas DataFrame provides a function, diff(), for this purpose, but here the transformation is implemented explicitly. We need two functions, one which returns the differenced time series:

from pandas import Series

def difference(timeseries, interval=1):
    # Difference the series: value at time t minus the value at time t - interval
    diff_list = list()
    for idx in range(interval, len(timeseries)):
        diff_value = timeseries[idx] - timeseries[idx - interval]
        diff_list.append(diff_value)
    return Series(diff_list)


and another one to invert the process before making forecasts:

def invert_difference(historical, inverted_scale_ts, interval=1):
    # Add back the observation that sits interval steps before the end of the history
    return inverted_scale_ts + historical[-interval]
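
To make this concrete, here is a minimal sketch (using a small made-up list of values, not the real data set) of how the two functions work together:

raw = [10.0, 12.0, 15.0, 14.0, 18.0]
diff = difference(raw, interval=1)
# diff now contains [2.0, 3.0, -1.0, 4.0]
# reconstruct the last observation from the last differenced value and the history
restored = invert_difference(raw[:-1], diff.iloc[-1], interval=1)
print(restored)  # 18.0, the original last observation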


The last thing to do is to transform the input time series observations to a specific scale. The neural network model we are going to use is an LSTM. The default activation function for LSTMs is the hyperbolic tangent, whose output values lie in the range between -1 and 1, so it is best to rescale the time series values to that same range. The min and max scaling coefficients need to be calculated on the training data set and then used to scale the test data set. The scikit-learn package comes with a specific class for this, MinMaxScaler. We need to implement two functions, one to calculate the scaling coefficients and scale both data sets:

from sklearn.preprocessing import MinMaxScaler

def scale_data_set(train_data, test_data):
    # Fit the scaler on the training data only, then apply it to both data sets
    min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
    min_max_scaler = min_max_scaler.fit(train_data)
    train_data = train_data.values.reshape(train_data.shape[0], train_data.shape[1])
    scaled_train_data = min_max_scaler.transform(train_data)
    test_data = test_data.values.reshape(test_data.shape[0], test_data.shape[1])
    scaled_test_data = min_max_scaler.transform(test_data)
    return min_max_scaler, scaled_train_data, scaled_test_data

and a second one to invert the scaling for the forecasted values:

import numpy

def invert_scale(min_max_scaler, scaled_array, value):
    # Append the forecasted value to the scaled input row and invert the whole row
    row = [elem for elem in scaled_array] + [value]
    array = numpy.array(row)
    array = array.reshape(1, len(array))
    inverted_array = min_max_scaler.inverse_transform(array)
    # The forecast is the last element of the inverted row
    return inverted_array[0, -1]
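
Here is a minimal sketch (with tiny made-up train and test DataFrames, not the real data set) of how scale_data_set and invert_scale are meant to be used together once a forecast is available:

from pandas import DataFrame

train_df = DataFrame({'X': [1.0, 2.0, 3.0], 'y': [2.0, 3.0, 4.0]})
test_df = DataFrame({'X': [4.0], 'y': [5.0]})
scaler, train_scaled, test_scaled = scale_data_set(train_df, test_df)

# pretend the model predicted this value for the (scaled) test row
predicted_scaled = 0.5
# rebuild a full row (scaled inputs plus the prediction) and map it back to the original scale
prediction = invert_scale(scaler, test_scaled[0, 0:-1], predicted_scaled)
print(prediction)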


Now we can put it all together. We first make the input data stationary:

raw_values = series.values
diff_values = difference(raw_values, 1)


then transform the data to frame the problem as a supervised learning case, using the tsToSupervised function from part 2:

supervised = tsToSupervised(diff_values, 1)

then split the data into training (years from 1992 to 2010) and test (years from 2011 to date) sets:

train, test = supervised[0:-98], supervised[-98:]
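
The value 98 is specific to this data set: assuming the monthly series from part 1, the last 98 rows roughly cover the observations from 2011 onwards. A quick sanity check on the split sizes (just a sketch, assuming supervised is the DataFrame produced by tsToSupervised in part 2) helps avoid off-by-one surprises when using a different series:

# the two splits must cover the whole supervised data set
print(len(supervised), len(train), len(test))
assert len(train) + len(test) == len(supervised)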

and finally scale the training and test data:

scaler, train_scaled, test_scaled = scale_data_set(train, test)
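
Because the scaler is fit on the training split only, the scaled training values are guaranteed to lie within [-1, 1], while the test values can fall slightly outside that range if they exceed the minimum or maximum seen during training. A quick check (just a sketch) makes this visible:

# training values are bounded by the fitted range; test values may not be
print(train_scaled.min(), train_scaled.max())
print(test_scaled.min(), test_scaled.max())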

At this stage we can build and train our LSTM, but this will be the core topic of the next post of this series.
As mentioned in part 1, the complete example will be released as a Jupyter notebook at the end of this series.
