Sparklens is a profiling tool for Spark with a built-in Spark scheduler simulator: it makes it easier to understand the scalability limits of Spark applications. It helps in understanding how efficiently a given Spark application uses the compute resources provided to it. It was implemented and is maintained at Qubole. It is open source (Apache License 2.0) and written in Scala.

One interesting characteristic of Sparklens is its ability to generate estimates from a single run of a Spark application. It reports information such as the estimated completion time and estimated cluster utilization with different numbers of executors, a Job/Stage timeline that shows how the parallel stages were scheduled within a job, and many per-stage metrics.

There are four ways to use Sparklens: live mode, offline mode, running on an event-history file, and notebooks. In this post I am focusing on live and offline modes only.

Live mode

Sparklens can run at application execution...
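To make live mode a bit more concrete, here is a minimal sketch of how the Sparklens listener might be attached at submit time. It only builds and prints a `spark-submit` command line. The package coordinates and listener class are taken from the Sparklens README as I recall it; the version and Scala suffix are assumptions to check against your cluster, and `my_app.py` is a hypothetical application name.

```python
import shlex

# Package coordinates and listener class as listed in the Sparklens README.
# The version / Scala suffix below is an assumption: adjust it to your setup.
SPARKLENS_PACKAGE = "qubole:sparklens:0.3.2-s_2.11"
SPARKLENS_LISTENER = "com.qubole.sparklens.QuboleJobListener"


def live_mode_command(app, app_args=()):
    """Build a spark-submit argument list that attaches the Sparklens listener."""
    return [
        "spark-submit",
        "--packages", SPARKLENS_PACKAGE,
        "--conf", f"spark.extraListeners={SPARKLENS_LISTENER}",
        app,
        *app_args,
    ]


# my_app.py is a hypothetical application used only for illustration.
cmd = live_mode_command("my_app.py")
print(shlex.join(cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so it can be passed straight to `subprocess.run` without any shell quoting concerns.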