
Sparklens: a tool for Spark application optimization

Sparklens is a profiling tool for Spark with a built-in Spark Scheduler simulator: it makes it easier to understand the scalability limits of Spark applications and helps in understanding how efficiently a given Spark application uses the compute resources provided to it. It was implemented and is maintained at Qubole, is Open Source (Apache License 2.0), and is written in Scala.
One interesting characteristic of Sparklens is its ability to generate estimates with a single run of a Spark application. It reports information such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline which shows how the parallel stages were scheduled within a job, and lots of interesting per-stage metrics.
There are four ways to use Sparklens:
  • Live mode
  • Offline mode
  • Run on event-history file
  • Notebooks
In this post I am focusing on live and offline modes only.

Live mode

Sparklens can run at application execution time by using the following options (for spark-submit and spark-shell):
--packages qubole:sparklens:0.2.1-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
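For example, a complete spark-submit invocation with these options might look like the following sketch (the application class and jar names here are placeholders, not part of the original instructions):
$SPARK_HOME/bin/spark-submit --packages qubole:sparklens:0.2.1-s_2.11 \
  --conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener \
  --class com.example.MyApp my-app.jar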
or programmatically, by adding the Sparklens dependency to your Java/Scala project (here is an example for Maven):
<repositories>
   <!-- Mandatory: the Sparklens artifacts aren't in Maven Central -->
   <repository>
    <id>qubole-maven-repo</id>
    <name>Qubole Maven Repo</name>
    <url>http://dl.bintray.com/spark-packages/maven/</url>
   </repository>
   
   <repository>
      <id>central</id>
      <name>Maven Repository Switchboard</name>
      <layout>default</layout>
      <url>http://repo1.maven.org/maven2</url>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
</repositories> 
...
<dependency>
  <groupId>qubole</groupId>
  <artifactId>sparklens</artifactId>
  <version>0.2.1-s_2.11</version>
</dependency> 
and then configuring its listener as follows (for Spark 1.x):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf conf = new SparkConf();
conf.setMaster(master);
conf.setAppName("Spark app name");
// Register the Sparklens listener so it collects metrics at runtime
conf.set("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener");
JavaSparkContext jsc = new JavaSparkContext(conf);
or as follows (for Spark 2.x):
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Spark app name")
  .master(master)
  // Register the Sparklens listener so it collects metrics at runtime
  .config("spark.extraListeners", "com.qubole.sparklens.QuboleJobListener")
  .getOrCreate();

Offline mode

Sparklens can also run later, not necessarily at application execution time. This behavior is enabled by adding the following property to the Spark application configuration:
conf.set("spark.sparklens.reporting.disabled", "true"); 
With reporting disabled, at the end of the application execution only a JSON data file is generated. The default save directory is /tmp/sparklens, but it is possible to change the destination through the following configuration property:
conf.set("spark.sparklens.data.dir", "/home/guglielmo/sparklens");
This file can then be used to run Sparklens independently through the spark-submit command as follows:
$SPARK_HOME/bin/spark-submit --packages qubole:sparklens:0.2.1-s_2.11 \
  --class com.qubole.sparklens.app.ReporterApp qubole-dummy-arg <datafile_path>
Starting from the JSON data file, the command above produces a report with the same layout and the same results as the one generated in live mode.

The report

This is the information available in a final report:

  • Efficiency statistics (driver vs. executor time, critical and ideal application time, core compute hours wasted by the driver and executors).
  • Predicted wall clock time and cluster utilization with different executor counts.
  • Per stage metrics.
  • Executors available and executors required over time.
  • Task-based aggregate metrics.

Conclusion

My team and I started adopting this tool recently, and so far we have found it really useful for understanding the scalability limits of Spark applications that are developed by other teams but then need to be executed on our infrastructure. The final report generated by this tool gives a comprehensive set of information that definitely helps point in the right direction when looking for potential scalability issues and areas for improvement.
The generated reports come in a text format which contains all of the metrics and information mentioned above. Qubole provides an online service which generates a user-friendly and elegant report, with interactive charts and tables, starting from an uploaded JSON data file. If your organization doesn't allow you to share the JSON data generated by running Sparklens on Spark applications executing in your corporate infrastructure, you need to stay with the text reports. To address situations like these, I am thinking of implementing and releasing an Open Source Java library to generate user-friendly reports in an on-prem environment, starting from Sparklens JSON data files or the text reports. Please register your interest in this library by commenting on this post. Thank you.
