
Sparklens: a tool for Spark application optimization

Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications and how efficiently a given application uses the compute resources provided to it. It is implemented in Scala, maintained at Qubole, and Open Source (Apache License 2.0).
One interesting characteristic of Sparklens is its ability to generate estimates from a single run of a Spark application. It reports information such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline showing how the parallel stages were scheduled within a job, and many useful per-stage metrics.
There are four ways to use Sparklens:
  • Live mode
  • Offline mode
  • Run on event-history file
  • Notebooks
In this post I am focusing on live and offline modes only.

Live mode

Sparklens can run at application execution time by passing the following options to spark-submit or spark-shell:
--packages qubole:sparklens:0.2.1-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
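For example, a complete live-mode launch could look like the following minimal sketch (the main class and application jar are placeholders for your own application):
$SPARK_HOME/bin/spark-submit \
  --packages qubole:sparklens:0.2.1-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --class com.example.MySparkApp \
  my-spark-app.jar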
Alternatively, Sparklens can be enabled programmatically by adding the Sparklens dependency to a Java/Scala project (here is an example for Maven):
<repositories>
   <!-- Mandatory: the Sparklens artifacts aren't in Maven Central -->
   <repository>
    <id>qubole-maven-repo</id>
    <name>Qubole Maven Repo</name>
    <url>http://dl.bintray.com/spark-packages/maven/</url>
   </repository>
   
   <repository>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <layout>default</layout>
      <url>http://repo1.maven.org/maven2</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
</repositories> 
...
<dependency>
  <groupId>qubole</groupId>
  <artifactId>sparklens</artifactId>
  <version>0.2.1-s_2.11</version>
</dependency> 
and then by configuring its listener as follows (for Spark 1.x):
SparkConf conf = new SparkConf();
conf.setMaster(master);
conf.setAppName("Spark app name");
conf.set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener");
JavaSparkContext jsc = new JavaSparkContext(conf); 
or as follows (for Spark 2.x)
SparkSession spark = SparkSession
  .builder()
  .appName("Spark app name")
  .master(master)
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .getOrCreate(); 

Offline mode

Sparklens can also be run later, rather than at application execution time. This mode is enabled by adding the following property to the Spark application configuration:
conf.set("spark.sparklens.reporting.disabled", "true"); 
With this setting, only a JSON data file is generated at the end of the application execution (no report is printed). The default save directory is /tmp/sparklens, but the destination can be changed through the following configuration property:
conf.set("spark.sparklens.data.dir", "/home/guglielmo/sparklens");
The generated JSON data file can then be used to run Sparklens independently through the spark-submit command as follows:
$SPARK_HOME/bin/spark-submit --packages qubole:sparklens:0.2.1-s_2.11 \
  --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg <datafile_path>
The command above, starting from the JSON data file, produces a report with the same layout and content as the one generated in live mode.

The report

This is the information available in the final report:

  • Efficiency statistics (driver vs executor time, critical and ideal application time, core compute hours wasted by driver and executors).
  • Predicted wall clock time and cluster utilization with different executor counts.
  • Per-stage metrics.
  • Executors available and executors required over time.
  • Task-based aggregate metrics.

Conclusion

My team and I have recently started adopting this tool, and so far we have found it really useful for understanding the scalability limits of Spark applications that are developed by other teams but then need to be executed on our infrastructure. The final report gives a comprehensive set of information that points in the right direction when spotting potential scalability issues and areas for improvement.
The generated reports come in a text format which contains all of the metrics and information mentioned above. Qubole provides an online service that generates a user-friendly and elegant report with interactive charts and tables from an uploaded JSON data file. If your organization doesn't allow you to share the JSON data generated by running Sparklens on Spark applications executing in your corporate infrastructure, you need to stick with the text reports. To address situations like these, I am thinking of implementing and releasing an Open Source Java library to generate user-friendly reports in an on-prem environment starting from Sparklens JSON data files or the text reports. Please register your interest in this library by commenting on this post. Thank you.

Comments

  1. Hi Googlielmo - were you able to create an open-source library to generate user-friendly reports? I too am interested in using it. I could provide some help if needed.

