
Posts

Showing posts from 2018

Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 2)

In part 1 we learned how to test data lineage info collection with Spline from a Spark shell. The same can be done in any Scala or Java Spark application. The dependencies used for the Spark shell need to be registered in your build tool of choice (Maven, Gradle or sbt):

    groupId: za.co.absa.spline   artifactId: spline-core                     version: 0.3.5
    groupId: za.co.absa.spline   artifactId: spline-persistence-mongo        version: 0.3.5
    groupId: za.co.absa.spline   artifactId: spline-core-spark-adapter-2.3   version: 0.3.5

With reference to Scala and Spark 2.3.x, a Spark job like this:

    // Create the Spark session
    val sparkSession = SparkSession
      .builder()
      .appName("Spline Tester")
      .getOrCreate()

    // Init Spline
    System.setProperty("spline.persistence.factory", "za.co.absa.spline.persistence.mongo.MongoPersistenceFactory")
    System.setProperty("spline.mongodb.url", args(0))
    System.setProperty("spline.mongodb.name", args(1))
    imp
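As a minimal sketch, assuming sbt is the build tool of choice, the three Spline dependencies listed in this post could be declared roughly as follows. The coordinates and the version are the ones given above. The `splineVersion` value name is only illustrative, and the single `%` is an assumption based on the coordinates carrying no Scala version suffix.

```scala
// build.sbt fragment (sketch): Spline 0.3.5 dependencies for a Spark 2.3.x job.
// Assumption: plain "%" is used because the published coordinates above
// do not include a Scala version suffix.
val splineVersion = "0.3.5" // illustrative name, not part of Spline

libraryDependencies ++= Seq(
  "za.co.absa.spline" % "spline-core" % splineVersion,
  "za.co.absa.spline" % "spline-persistence-mongo" % splineVersion,
  "za.co.absa.spline" % "spline-core-spark-adapter-2.3" % splineVersion
)
```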

Exploring the Spline Data Tracker and Visualization tool for Apache Spark (Part 1)

One interesting and promising Open Source project that caught my attention lately is Spline, a data lineage tracking and visualization tool for Apache Spark, maintained at Absa. This project consists of 2 parts: a Scala library that runs on the drivers and captures the data lineages by analyzing the Spark execution plans, and a web application that provides a UI to visualize them. Spline supports MongoDB and HDFS as storage systems for the data lineages in JSON format. In this post I am referring to MongoDB. You can start playing with Spline through the Spark shell. Just add the required dependencies to the shell classpath as follows (with reference to the latest 0.3.5 release of this project):

    spark-shell --packages "za.co.absa.spline:spline-core:0.3.5,za.co.absa.spline:spline-persistence-mongo:0.3.5,za.co.absa.spline:spline-core-spark-adapter-2.3:0.3.5"

Running the Spark shell with the command above on Ubuntu and some other Linux distro, whether some issue on

Black Friday @Packt Publishing!

This Friday, November 23rd 2018, will be Black Friday at Packt Publishing too! Each book or video, including the latest releases, can be purchased for only US$10. It will also be possible to pre-order my upcoming book "Hands-on Deep Learning with Apache Spark" for US$10. Please remember that this price is valid on Friday 23rd only. Enjoy it!

Ultralight Data Movement for IoT with SDC Edge @ Predict 2018

On October 2nd 2018 I am going to give a talk at the Predict 2018 conference, RDS, Dublin, Ireland. My talk will start at 2:20 PM. It is part of the Technology IoT & Manufacturing 4.0 section. Please feel free to get in touch during the conference day to discuss IIoT, Open Source adoption, data streaming, edge analytics, Deep Learning and more.

AI with the Best 2018 conference

I am proud to share that I will give a talk at the AI With the Best 2018 conference. The title of my talk is "Why Scala for Data Science?" and it is part of the "AI in Action" track. There I am going to cover some topics from my upcoming book. The conference will take place on Friday, September 14th 2018. It is an online event. Buying a ticket for this event will also give you access to the recordings of all the talks for 2 months after the conference ends (just in case you miss some during the live streaming). You will also have a chance to interact with the speakers and book 1:1 time with some of them. Please have a look at the list of speakers and talk topics: it is very impressive. I hope you will attend!

Deploying and scaling an Oracle database on a multi-node Kubernetes cluster

In this post I am going to explain how to deploy and scale an Oracle Express database on a multi-node Kubernetes cluster. I am going to use this Docker container by Maxym Bylenko. I am referring to the container for Oracle XE 11g because of an issue with the one for Oracle XE 12c that was still open at the time I went through the process described below. I am assuming the readers have at least basic or intermediate knowledge of the Kubernetes concepts. The first thing to do is to create a Pod. We can do this (and the other operations described in this post) declaratively through a YAML file:

    apiVersion: v1
    kind: Pod
    metadata:
      name: "oradb"
      labels:
        name: "oradb"
    spec:
      containers:
        - image: "sath89/oracle-xe-11g:latest"
          name: "oradb"
          ports:
            - containerPort: 1521
      restartPolicy: Always

Once the Pod has been successfully created, we need to create a Service for it:

    apiVersion: v1
    kind: Service
    metadata:
      name: &q

Live Webinar on SDC Edge @ Streamsets

I have the pleasure of being invited this week to do a live webinar along with Pat Patterson at Streamsets. The title of the webinar is "Ultralight Data Movement for IoT with SDC Edge". Here's the link: https://go.streamsets.com/webinar-2018-06-27-ultralight-data-movement-for-iot-with-sdc-edge.html?es_p=7008945 I hope you will join us on Wednesday June 27th at 10 AM PT.

Data Driven Innovation Open Summit 2018

Thank you for attending my talk at the Data Driven Innovation Open Summit 2018 in Rome. I have prepared videos of the 2 SDC and SDC Edge demos originally planned for this talk. I will post them to YouTube as soon as I complete the audio commentary and some minor editing.

Google I/O Extended 2018 Dublin

Thanks to all the people who attended my talk "The Journey to TensorFlow on the JVM Stack" at the Google I/O Extended 2018 in Dublin. The slide deck is available on my SlideShare space. More code examples will be available on this blog in the coming months.

DataWorks Summit 2018, Berlin Edition: come to attend my talk.

AI, Machine Learning and Deep Learning are getting a lot of hype nowadays, even though most of the algorithms and models at their core have been around for a long time:

 - 1805 Least Squares
 - 1812 Bayes' Theorem
 - 1913 Markov Chains
 - 1950 Turing's Learning Machine
 - 1957 Perceptron
 - 1967 Nearest Neighbor
 - 1970 Automatic Differentiation
 - 1972 TF-IDF
 - 1980 Neocognitron
 - 1981 Explanation Based Learning
 - 1982 Recurrent Neural Network
 - 1986 Back Propagation
 - 1989 Reinforcement Learning
 - 1995 Random Forest Algorithm
 - 1995 Support Vector Machines
 - 1997 LSTM

So what has accelerated implementation and made it possible today for the theory to become reality? There are several factors:

 - Cheaper computation: in the past hardware was a constraining factor for AI/ML/DL. Recent advances in hardware (coupled with improved tools and software frameworks) and new computational models (in particular around GPUs) have accelerated AI/ML/DL adoption.
 - Cheaper storage: the increased number of availabl

Java with The Best Conference

Do you believe that Java is dead? Please join the Java with the Best online conference on April 17th-18th and you will probably change your mind. Of course, if you are passionate about Java technology you should attend as well ;) 2 days, 50 talks, 3 parallel sessions (Core Java, Big Data, Machine Learning and Cloud Development with Java, Java Frameworks, Libraries and Languages). I hope you will get a chance to attend my talk:

ScalaUA Conference 2018

An interesting conference on Scala is going to take place in Kiev (Ukraine) on April 20th-21st: https://www.scalaua.com/ Some early bird tickets should still be available. Here's a list of the speakers already confirmed for this event. Have a look at this short video to get some insight from the 2017 edition.