One interesting and promising Open Source project that caught my attention lately is Spline , a data lineage tracking and visualization tool for Apache Spark , maintained at Absa . This project consists of 2 parts: a Scala library that works on the drivers which, by analyzing the Spark execution plans, captures the data lineages and a web application which provides a UI to visualize them. Spline supports MongoDB and HDFS as storage systems for the data lineages in JSON format. In this post I am referring to MongoDB. You can start playing with Spline through the Spark shell. Just add the required dependencies to the shell classpath as follows (with reference to the latest 0.3.5 release of this project): spark-shell --packages "za.co.absa.spline:spline-core:0.3.5,za.co.absa.spline:spline-persistence-mongo:0.3.5,za.co.absa.spline:spline-core-spark-adapter-2.3:0.3.5" Running the Spark shell with the command above on Ubuntu and some other Linux distro, whether some issue on...
Sharing thoughts and tips on Python, Java, Scala, Open Source, DevOps, Data Science, ML/DL/AI.