

Showing posts from 2019

See you at the Big Things Conference 2019 in Madrid on Wednesday or Thursday!

I hope you are going to attend the 8th edition of Big Things 2019 , the data and AI conference taking place in Madrid, Spain, on November 20th and 21st. The conference started in 2012 and this year changed its name from Big Data Spain to Big Things, as its focus has broadened from Big Data to everything related to AI. Among this year's speakers there are big names such as Cassie Kozyrkov, Alberto Cairo, Jesse Anderson, Michael Armbrust, Suneel Marthi, Paco Nathan and many others. My talk will be on Thursday 21st at 1:55 PM local time: I am going to give an update on importing and re-training Keras / TensorFlow models in DL4J and Apache Spark . It is a follow-up to some of the topics covered in my book , considering the changes brought by new releases of DL4J, Keras and TensorFlow since it was published in January this year. Please stop by if you are going to attend my talk and the conference. I really appreciated the conversations about Deep Learning

How to install Anaconda or Miniconda in Colab

Colab is Google's platform for working with Python notebooks and practicing Deep Learning with different frameworks. It is a powerful platform: GPUs and TPUs are available, you can use your Google Drive space for notebooks and data, and it has a friendly user interface and lots of useful features. However, for installing or updating Python packages it ships by default only with pip, with no conda available. If you need a Python package that is available in Anaconda , but not in PyPI, you have to install Anaconda or Miniconda yourself from a notebook. In this post I explain the simple steps to do it. Anaconda installation Create your notebook and, from a code cell, download the Anaconda installer: !wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh This is the version that works fine for me. I have also tried the latest release, 2019.10, but its configuration would add extra complexity. Now you need to make the do
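The excerpt above is cut off, but as a rough sketch of how the remaining steps usually look in Colab notebook cells (the batch-mode flags, the /usr/local install prefix and the site-packages path are assumptions that depend on the runtime's Python version), they could be:

# Make the downloaded installer executable and run it in silent/batch mode
!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local

# Make the packages installed by Anaconda importable from the notebook
import sys
sys.path.append('/usr/local/lib/python3.6/site-packages/')

After that, conda is available from the notebook cells and can be used alongside pip.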

Using Rapids cuDF in a Colab notebook

During the last Spark+AI Summit Europe 2019 I had a chance to attend a talk by Miguel Martinez , who presented Rapids , the new Open Source framework from NVIDIA for GPU-accelerated end-to-end Data Science and Analytics. Fig. 1 - Overview of the Rapids eco-system Rapids is a suite of Open Source libraries: cuDF cuML cuGraph cuXFilter I enjoyed the presentation and liked the idea behind this initiative, so I wanted to start playing with the Rapids libraries in Python on Colab , starting from cuDF, but the first attempt ran into an issue that I eventually solved. So in this post I am going to share how I fixed it, hoping it will be useful to someone else hitting the same blocker. I am assuming here you are already familiar with Google Colab. I am using Python 3.x, as Python 2 isn't supported by Rapids. Once you have created a new notebook in Colab, you need to check that its runtime is set to use Python 3 and a GPU as hardware accelerator. You
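The excerpt stops right at the runtime check; as a minimal sketch of that check and of a first cuDF step (assuming the Rapids libraries are already installed in the runtime; the column names and values below are purely illustrative):

# Verify that the Colab runtime actually has an NVIDIA GPU attached
!nvidia-smi

# A first cuDF DataFrame: the API deliberately mirrors pandas, but runs on the GPU
import cudf

gdf = cudf.DataFrame({'item': ['a', 'b', 'c'], 'value': [10, 20, 30]})
print(gdf['value'].mean())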

See you at the Data Science Meetup on November 6th!

Following my recent talk on day 2 of the Big Data Days in Moscow, it seems that training Deep Learning models on Apache Spark is a hot topic. If you are in Dublin on Wednesday November 6th and want to hear more about it, please join the next Data Science Ireland Meetup , which will be hosted by Mason, Hayes & Curran . The event will start at 6 PM, it will be moderated by Mark Kelly from Alldus, and there will be talks from Brian McElligott, Partner at Mason Hayes & Curran, and me. You can find all the details on the official Meetup page. At the time this post is written the full capacity (120 places) has been reached and there is one person on the waiting list, but if you're interested I suggest you join the waiting list anyway and monitor it, as some people may cancel, even at the last minute. I hope to meet you there.

See you tomorrow at the DSF Meetup!

If you are interested in hearing more about ways to predict the performance of Apache Spark applications, please join tomorrow's Dublin Data Science Festival Meetup , which will start at 6 PM local time at the Walmart Labs office . Two talks are on the agenda: the first one from Mirko Arnold (Walmart Labs) about Computer Vision, and the second one from me. It will also be another great opportunity for networking.

Data Unlocked Bonanza at Packt Publishing!

You're still in time to get some interesting Machine Learning and AI eBooks and videos for $10 each at Packt Publishing . This promotion also covers my book " Hands-on Deep Learning with Apache Spark ". What are you waiting for? Go and check out the titles before the offer expires! By the way, if you want to hear follow-ups on topics covered by my book and get in touch with me in person, I am going to give talks at the following events in October and November: Big Data Days , Moscow, Russian Federation, October 8th-10th; Spark+AI Summit Europe , Amsterdam, Netherlands, October 16th-17th; Big Things Conference , Madrid, Spain, November 20th-21st

Spark+AI Summit Europe 2019: I'll be there!

I am glad to have been selected to speak at the next Spark+AI Summit Europe 2019 , which will take place in Amsterdam, Netherlands, on October 15th-17th 2019. I am going to present a follow-up to one of the core topics of my book: memory management in distributed Deep Learning with DL4J on Apache Spark . More details about my talk will follow in the next weeks. As usual, the summit will have an impressive line-up of speakers, such as Matei Zaharia, Ali Ghodsi, Holden Karau, Luca Canali, Gael Varoquaux, Christopher Crosbie, Michael Armbrust and many others. I hope you will attend this event.

Time Series & Deep Learning (Part 3 of N): Finalizing the Data Preparation for Training and Evaluation of a LSTM

In the 3rd part of this series I am going to complete the description, started in part 2, of the data preparation process for training and evaluating an LSTM model for time series forecasting. The data set used is the same as in part 1 and part 2. As for all the posts of this series, I am referring to Python 3. In part 2 we learned how to transform the time series into a supervised learning problem. That isn't enough yet to feed our LSTM model: two other transformations are needed. The first thing to do is to transform the time series data to make it stationary. The input time series used for these posts has values that are dependent on time. Looking at its plot in part 1, we can notice an increasing trend up to January 2006, then a decreasing trend up to February 2010 and finally a second increasing trend from there to date. Transforming this data to stationary makes life easier: we can remove the trends from the observed values before training and
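As a minimal sketch of the differencing step described above (assuming the observations are held in a pandas Series called series; the function names are just illustrative, not the exact code from the post):

import pandas as pd

# Remove the trend by differencing: each value becomes the change
# with respect to the previous observation (interval=1 by default).
def difference(series, interval=1):
    return pd.Series(
        [series[i] - series[i - interval] for i in range(interval, len(series))]
    )

# Invert the transform at forecast time to get back to the original scale.
def inverse_difference(last_observation, diff_value):
    return diff_value + last_observation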

Time Series & Deep Learning (Part 2 of N): Data Preparation for Training and Evaluation of a LSTM

In the second post of this series we are going to learn how to prepare data for training and evaluation of an LSTM neural network for time series forecasting. As for any other post of this series, I am referring to Python 3. The data set used is the same as for part 1 . LSTMs (Long Short-Term Memory networks) are a specialization of RNNs (Recurrent Neural Networks) introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 to solve the vanishing gradient problem affecting RNNs. LSTMs are used in real-world applications such as language translation, text generation, image captioning, music generation and time series forecasting. You can find more info about LSTMs in my book , or wait for one of my next posts in this series. This post focuses mostly on one of the best practices for data preparation before using a data set for training and evaluation of an LSTM in a time series forecasting problem with the Keras library. Let's load the data set first: from pandas im
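The excerpt is cut off right after the pandas import, but a minimal sketch of the shift-based reframing into a supervised problem that the post builds towards (the file name dataset.csv is a placeholder for the data set from part 1, and the helper name is illustrative) could look like this:

from pandas import read_csv, DataFrame, concat

# Reframe a univariate series as a supervised learning problem:
# the value at time t-1 becomes the input, the value at t the target.
def timeseries_to_supervised(values, lag=1):
    df = DataFrame(values)
    columns = [df.shift(i) for i in range(1, lag + 1)]
    columns.append(df)
    supervised = concat(columns, axis=1)
    supervised.fillna(0, inplace=True)
    return supervised

# 'dataset.csv' is a placeholder: use the same data set as in part 1.
series = read_csv('dataset.csv', header=0, index_col=0, squeeze=True)
supervised = timeseries_to_supervised(series.values, lag=1)
print(supervised.head())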

Time Series & Deep Learning (Part 1 of N): Basic Stuff

In the most recent part of my career I have had a chance to work with Data Scientists with strong skills in Python. My tech background, after a start with C/C++, is mostly in JVM programming languages (though I have had to touch several others over my career), so it was a great chance for me to learn more about practical Python, at least in the Machine Learning and Deep Learning spaces. I have not been a full-time developer for a few years now, as I moved to other roles, but this hasn't stopped me from following my passion for programming languages and staying hands-on with technology in general, not only because I enjoy it, but also because I believe that, whatever your role is, it is always important to understand the possibilities and the limits of technologies before making any kind of decision. In this new long series I am first going to share some of the things I have learned about doing time series forecasting through Deep Learning using Python (with  Keras and TensorFlow ), and finally I will present a

Skills Matter Infiniteconf 2019 in July: I hope to see you there!

I am going to speak at the Skills Matter Infiniteconf 2019 , which will take place in London (UK) on July 4th and 5th 2019. It is a very interesting conference with lots of unusual and fantastic talks. The agenda is still a work in progress, but it looks promising. My talk will be on Friday 5th at 1 PM: I am going to share some of my past experiences of applying Machine Learning and Deep Learning to DevOps, in particular to managing Apache Spark applications and clusters. All the details about this conference are on the official website. Early bird tickets are available until June 11th. I hope to meet you there!

Voxxed Days Milan 2019 review

I finally found a few minutes to share my impressions after attending the first Voxxed Days event in Italy, which took place in Milan on April 13th 2019 . I was one of the speakers there: my talk was about Deep Learning on Apache Spark with DeepLearning4J (a follow-up of some topics from my book ). There were 3 sessions in parallel. The level of the talks was really high, and it was hard for me and every other attendee to choose which one to follow in a given time slot. The good news is that all of the sessions have been recorded, and yesterday the first videos (those from the main session) were published on YouTube . Once they are online, I suggest you watch all of the videos you can, but here are some suggestions among those I had a chance to attend in person at the event. I keep my comments to a minimum to reduce spoilers ;) Opening keynote by Mario Fusco : he was the main organizer of the event. In the opening keynote he presented the agenda. He recently wrote a

The Kubernetes Spark operator in OpenShift Origin (Part 1)

This series is about the Kubernetes Spark operator by Radanalytics.io on OpenShift Origin . It is an Open Source operator to manage Apache Spark clusters and applications. In order to deploy the operator on OpenShift Origin, you first need to clone its GitHub repository: git clone https://github.com/radanalyticsio/spark-operator.git Then log in to the cluster using the OpenShift command line oc : oc login -u <username>:<password> Assuming, as in the OpenShift Origin environments my teams and I used to work in, that developers don't have permissions to create CRDs, you need to use Config Maps, so you have to create the operator using the operator-cm.yaml file provided in the cloned repo: oc apply -f manifest/operator-cm.yaml The output of the command above should be like the following: serviceaccount/spark-operator created role.rbac.authorization.k8s.io/edit-resources created rolebinding.rbac.authorization.k8s.io/spark-operator-edit-reso

Installing Minishift on Windows 10 Home

Minishift is a tool to run OpenShift Origin locally as a single-node cluster inside a Virtual Machine. It is a good choice for development or for doing PoCs locally before deploying things to a real OpenShift cluster. In this post I am going to explain how to install and run it on a Windows 10 Home machine, where no Hyper-V support is available. The only available alternative to Hyper-V is Oracle VirtualBox . You need to install it before going on with the Minishift installation: follow the instructions on the official website or use the Chocolatey package manager. If you don't have Chocolatey on the destination machine, you can install it by opening an Admin PowerShell and first checking which execution policy is set, by running the Get-ExecutionPolicy command. If it returns Restricted , then install by executing: Set-ExecutionPolicy Bypass -Scope Process -Force; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')) The fastest an

See you tonight at the ODSC Dublin Meetup!

I hope to see you tonight at the April ODSC Dublin Meetup @ Jet.com at 40 Molesworth St . I am going to be the second speaker of the night: I am going to talk about importing pre-trained Keras and TensorFlow models into DL4J and the possibility of re-training them on Apache Spark . The first speaker will be John Kane from Cogito .

The book is finally available on Packt!

My book "Hands-on Deep Learning with Apache Spark" is finally available on Packt. Here's the final cover: This is the book content: 1: THE APACHE SPARK ECOSYSTEM 2: DEEP LEARNING BASICS 3: EXTRACT, TRANSFORM, LOAD 4: STREAMING 5: CONVOLUTIONAL NEURAL NETWORKS 6: RECURRENT NEURAL NETWORKS 7: TRAINING NEURAL NETWORKS WITH SPARK 8: MONITORING AND DEBUGGING NEURAL NETWORK TRAINING 9: INTERPRETING NEURAL NETWORK OUTPUT 10: DEPLOYING ON A DISTRIBUTED SYSTEM 11: NLP BASICS 12: TEXTUAL ANALYSIS AND DEEP LEARNING 13: CONVOLUTION 14: IMAGE CLASSIFICATION 15: WHAT'S NEXT FOR DEEP LEARNING? DeepLearning4J (Scala), but also Keras and TensorFlow (Python) are the reference frameworks. More topics on Deep Learning on the JVM and Spark would be covered in the next months in this blog.

Sparklens: a tool for Spark applications optimization

Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications and how efficiently a given Spark application uses the compute resources provided to it. It has been implemented and is maintained at Qubole . It is Open Source ( Apache License 2.0 ) and written in Scala. One interesting characteristic of Sparklens is its ability to generate estimates from a single run of a Spark application . It reports info such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline which shows how the parallel stages were scheduled within a job, and lots of interesting per-stage metrics. There are four ways to use Sparklens: live mode, offline mode, run on an event-history file, and notebooks. In this post I am focusing on live and offline modes only. Live mode Sparklens can run at application execution

Hands-On Deep Learning with Apache Spark: almost there!

We are almost there: my "Hands-On Deep Learning with Apache Spark" book, from Packt Publishing , is going to be available by the end of this month: https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-apache-spark In this book I try to address the sheer complexity of the technical and analytical parts, and the speed at which Deep Learning solutions can be implemented on Apache Spark . The book starts by explaining the fundamentals of Apache Spark and Deep Learning. It then details how to set up Spark for performing DL, the principles of distributed modelling, and different types of neural networks. Examples of implementations of DL models like CNNs, RNNs and LSTMs on Spark are presented. Readers should get hands-on experience of what it takes and a general feeling of the complexity they will deal with. During the course of the book, some popular DL frameworks such as DL4J , Keras and TensorFlow are used to train distributed models. The main goal of