Posts

Sparklens: a tool for Spark applications optimization

Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications. It helps in understanding how efficiently a given Spark application uses the compute resources provided to it. It has been implemented and is maintained at Qubole. It is Open Source (Apache License 2.0) and has been implemented in Scala. One interesting characteristic of Sparklens is its ability to generate estimates with a single run of a Spark application. It reports info such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline that shows how the parallel stages were scheduled within a job, and many per-stage metrics. There are four ways to use Sparklens: live mode, offline mode, run on an event-history file, and notebooks. In this post I am focusing on live and offline modes only. Live mode: Sparklens can run at application execution...

Hands-On Deep Learning with Apache Spark: almost there!

We are almost there: my "Hands-On Deep Learning with Apache Spark" book, from Packt Publishing, is going to be available by the end of this month: https://www.packtpub.com/big-data-and-business-intelligence/hands-deep-learning-apache-spark In this book I try to address the sheer complexity of the technical and analytical parts, and the speed at which Deep Learning solutions can be implemented on Apache Spark. The book starts by explaining the fundamentals of Apache Spark and Deep Learning. Then it details how to set up Spark for performing DL, the principles of distributed modelling, and the different types of neural nets. Examples of implementations of DL models such as CNN, RNN and LSTM on Spark are presented. A reader should get hands-on experience of what it takes and a general feeling for the complexity they would deal with. During the course of the book, some popular DL frameworks such as DL4J, Keras and TensorFlow are used to train distributed models. The main goal of ...

Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 2)

In part 1 we learned how to test data lineage info collection with Spline from a Spark shell. The same can be done in any Scala or Java Spark application. The same dependencies used for the Spark shell need to be registered in your build tool of choice (Maven, Gradle or sbt):

groupId: za.co.absa.spline, artifactId: spline-core, version: 0.3.5
groupId: za.co.absa.spline, artifactId: spline-persistence-mongo, version: 0.3.5
groupId: za.co.absa.spline, artifactId: spline-core-spark-adapter-2.3, version: 0.3.5

With reference to Scala and Spark 2.3.x, a Spark job like this:

// Create the Spark session
val sparkSession = SparkSession
  .builder()
  .appName("Spline Tester")
  .getOrCreate()

// Init Spline
System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
System.setProperty("spline.mongodb.url", args(0))
System.setProperty("spline.mongodb.name", args(1))
imp...
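For sbt users, the three Spline coordinates listed in this excerpt could be declared roughly as follows. This is a minimal sketch rather than the build file from the original post: it assumes Spark itself is already on the classpath, and it uses a plain `%` because the artifact IDs above carry no Scala-version suffix.

```scala
// build.sbt (sketch): Spline 0.3.5 dependencies, using the coordinates listed above.
// Assumption: Spark 2.3.x is provided separately (for example by the cluster).
libraryDependencies ++= Seq(
  "za.co.absa.spline" % "spline-core"                   % "0.3.5",
  "za.co.absa.spline" % "spline-persistence-mongo"      % "0.3.5",
  "za.co.absa.spline" % "spline-core-spark-adapter-2.3" % "0.3.5"
)
```

The same group, artifact and version triples map directly onto Maven `<dependency>` entries or Gradle dependency strings.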

Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 1)

One interesting and promising Open Source project that caught my attention lately is Spline, a data lineage tracking and visualization tool for Apache Spark, maintained at Absa. This project consists of two parts: a Scala library that works on the drivers and captures the data lineages by analyzing the Spark execution plans, and a web application that provides a UI to visualize them. Spline supports MongoDB and HDFS as storage systems for the data lineages, which are stored in JSON format. In this post I am referring to MongoDB. You can start playing with Spline through the Spark shell. Just add the required dependencies to the shell classpath as follows (with reference to the latest 0.3.5 release of this project):

spark-shell --packages "za.co.absa.spline:spline-core:0.3.5,za.co.absa.spline:spline-persistence-mongo:0.3.5,za.co.absa.spline:spline-core-spark-adapter-2.3:0.3.5"

Running the Spark shell with the command above on Ubuntu and some other Linux distro, whether some issue on...

Black Friday @Packt Publishing!

This Friday, November 23rd 2018, will be Black Friday at Packt Publishing too! Each book or video, including the latest releases, can be purchased for only US$ 10. It will also be possible to pre-order my upcoming book "Hands-on Deep Learning with Apache Spark" for US$ 10. Please remember that this price is valid on Friday 23rd only. Enjoy it!

Ultralight Data Movement for IoT with SDC Edge @ Predict 2018

On October 2nd 2018 I am going to give a talk at the Predict 2018 conference, RDS, Dublin, Ireland. My talk will start at 2:20 PM. It is part of the Technology IoT & Manufacturing 4.0 section. Please feel free to get in touch during the conference day to discuss IIoT, Open Source adoption, data streaming, edge analytics, Deep Learning and more.

AI with the Best 2018 conference

I am proud to share that I will give a talk at the AI With the Best 2018 conference. The title of my talk is "Why Scala for Data Science?" and it is part of the "AI in Action" track. There I am going to cover some topics from my upcoming book. The conference will take place on Friday, September 14th 2018. It is an online event. Buying a ticket for this event will also give you access to the recordings of all the talks for 2 months after the conference ends (just in case you miss some during the live streaming). You will also have a chance to interact with the speakers and book 1:1 time with some of them. Please have a look at the list of speakers and talk topics: it is very impressive. I hope you will attend!