Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications and how efficiently a given Spark application is using the compute resources provided to it. It was implemented and is maintained at Qubole. It is Open Source (Apache License 2.0) and written in Scala.
One interesting characteristic of Sparklens is its ability to generate estimates from a single run of a Spark application. It reports information such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline showing how the parallel stages were scheduled within a job, and lots of interesting per-stage metrics.
There are four ways to use Sparklens:
- Live mode
- Offline mode
- Run on event-history file
- Notebooks
Live mode
Sparklens can run at application execution time by using the following options (for spark-submit and spark-shell):
--packages qubole:sparklens:0.2.1-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
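For example, a complete spark-submit invocation embedding these two options could look like the following sketch (the application JAR and main class names here are hypothetical placeholders):
$SPARK_HOME/bin/spark-submit \
--packages qubole:sparklens:0.2.1-s_2.11 \
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
--class com.example.MySparkApp my-spark-app.jar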
Alternatively, Sparklens can be enabled programmatically by adding its dependency to the Java/Scala project (here is an example for Maven):
<repositories>
    <!-- Mandatory: the Sparklens artifacts aren't in Maven Central -->
    <repository>
        <id>qubole-maven-repo</id>
        <name>Qubole Maven Repo</name>
        <url>http://dl.bintray.com/spark-packages/maven/</url>
    </repository>
    <repository>
        <id>central</id>
        <name>Maven Repository Switchboard</name>
        <layout>default</layout>
        <url>http://repo1.maven.org/maven2</url>
        <snapshots>
            <enabled>false</enabled>
        </snapshots>
    </repository>
</repositories>
...
<dependency>
    <groupId>qubole</groupId>
    <artifactId>sparklens</artifactId>
    <version>0.2.1-s_2.11</version>
</dependency>

and then by configuring its listener as follows (for Spark 1.x):
SparkConf conf = new SparkConf();
conf.setMaster(master);
conf.setAppName("Spark app name");
conf.set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener");
JavaSparkContext jsc = new JavaSparkContext(conf);

or as follows (for Spark 2.x):
SparkSession spark = SparkSession
        .builder()
        .appName("Spark app name")
        .master(master)
        .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
        .getOrCreate();
Offline mode
Sparklens can also run later, not necessarily at application execution time. This can be configured by adding the following property to the Spark app configuration:
conf.set("spark.sparklens.reporting.disabled", "true");At the end of a Spark application execution, just a JSON data file is generated. The default save directory is /tmp/sparklens, but it is possible to change destination through the following configuration property:
conf.set("spark.sparklens.data.dir", "/home/guglielmo/sparklens");This file can then be used to run Sparklens independently through the spark-submit command as follows:
$SPARK_HOME/bin/spark-submit --packages qubole:sparklens:0.2.1-s_2.11 \
--class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg <datafile_path>
Starting from the JSON data file, the command above produces a report with the same layout and the same results as one generated in live mode.
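Sparklens can also generate the same report starting from a Spark event-history file (the third mode listed at the beginning of this post). Based on the Sparklens README, the invocation should look like the following sketch, with a source=history argument appended (the file path is a placeholder):
$SPARK_HOME/bin/spark-submit --packages qubole:sparklens:0.2.1-s_2.11 \
--class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg <event_history_file_path> source=history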
The report
This is the information available in the final report:
- Efficiency statistics (driver vs executor time, critical and ideal application time, core compute hours wasted by driver and executors).
- Predicted wall clock time and cluster utilization with different executor counts.
- Per-stage metrics.
- Executors available and executors required over time.
- Task-based aggregate metrics.
Conclusion
My team and I have recently started adopting this tool, and so far we have found it really useful for understanding the scalability limits of Spark applications that are developed by other teams but then need to be executed on our infrastructure. The final report generated by this tool gives a comprehensive set of information that definitely helps point in the right direction when spotting potential scalability issues and areas for improvement.
The generated reports come in a text format which contains all of the metrics and information mentioned above. Qubole provides an online service which generates a user-friendly and elegant report, with interactive charts and tables, starting from an uploaded JSON data file. If your organization doesn't allow you to share the JSON data generated by running Sparklens on Spark applications executing in your corporate infrastructure, you need to stick with the text reports. To address situations like these, I am thinking of implementing and releasing an Open Source Java library to generate user-friendly reports in an on-prem environment, starting from Sparklens JSON data files or the text reports. Please register your interest in this library by commenting on this post. Thank you.