Posts

Showing posts from January, 2019

Sparklens: a tool for Spark applications optimization

Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications and how efficiently a given application uses the compute resources provided to it. It was implemented and is maintained at Qubole. It is Open Source (Apache License 2.0) and written in Scala. One interesting characteristic of Sparklens is its ability to generate estimates from a single run of a Spark application. It reports information such as the estimated completion time and the estimated cluster utilization with different numbers of executors, a Job/Stage timeline showing how the parallel stages were scheduled within a job, and many interesting per-stage metrics. There are four ways to use Sparklens:

- Live mode
- Offline mode
- Run on an event-history file
- Notebooks

In this post I am focusing on the live and offline modes only.

Live mode

Sparklens can run at application execution
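As a rough sketch of what these two modes look like in practice (the package version, repository URL, application class, and JAR name below are illustrative placeholders, not taken from the post), Sparklens is typically attached to a `spark-submit` invocation:

```shell
# Live mode (sketch): attach the Sparklens listener so metrics are
# collected and the report is printed while the application runs.
# Version, app class and JAR name are placeholders.
spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --repositories https://repos.spark-packages.org \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --class com.example.MyApp \
  my-spark-app.jar

# Offline mode (sketch): run the application while only saving the
# Sparklens data file, then generate the report later from that file.
spark-submit \
  --packages qubole:sparklens:0.3.2-s_2.11 \
  --repositories https://repos.spark-packages.org \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --conf spark.sparklens.reporting.disabled=true \
  --class com.example.MyApp \
  my-spark-app.jar
```

The advantage of the offline variant is that the (cheap) data collection happens during the production run, while the (repeatable) report generation can be done afterwards without re-running the application.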

Hands-On Deep Learning with Apache Spark: almost there!

We are almost there: my "Hands-On Deep Learning with Apache Spark" book, published by Packt Publishing, is going to be available by the end of this month: https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-apache-spark In this book I try to address the sheer complexity of the technical and analytical parts, and the speed at which Deep Learning solutions can be implemented on Apache Spark. The book starts by explaining the fundamentals of Apache Spark and Deep Learning. It then details how to set up Spark for performing DL, the principles of distributed modelling, and the different types of neural nets. Examples of implementations of DL models such as CNNs, RNNs, and LSTMs on Spark are presented. Readers should get hands-on experience of what it takes, and a general feeling for the complexity they would deal with. Over the course of the book, some popular DL frameworks such as DL4J, Keras, and TensorFlow are used to train distributed models. The main goal of