Googlielmo's blog

Posts

Showing posts from August, 2016

JDBC Producer destination setting in the StreamSets Data Collector

One of the most used managed destinations in the SDC is the JDBC Producer. It writes data to a relational database table through a JDBC connection. The SDC release I am referring to in this post is 1.5.1.2, running on a Java 8 JVM.

Installing a specific JDBC driver

In order to insert data into a database table, SDC requires the JDBC driver for the database you need to use. This applies to the JDBC Consumer origin as well. The first time you plan to add a JDBC Producer destination to a pipeline, you need to create a local directory on the SDC host machine, outside the SDC installation directory. Example:

/home/sdc-user/sdc-extras

Then create the following sub-directory structure for all of the JDBC drivers:

/home/sdc-user/sdc-extras/streamsets-datacollector-jdbc-lib/lib/

Finally, copy the JDBC driver into that folder. Now it is time to make SDC aware of this directory. First you have to add the STREAMSETS_LIBRARIES_EXTRA_DIR environment variable and make it poin
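The directory preparation described in this excerpt can be sketched in Java. This is only an illustration under stated assumptions: the class name `ExtraLibDir`, its method names, and the command-line arguments are hypothetical and not part of SDC. Only the `streamsets-datacollector-jdbc-lib/lib` layout comes from the post. The sketch does not set the environment variable, which still has to be configured for the SDC process itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: prepares the external library directory for JDBC drivers.
public class ExtraLibDir {

    // Sub-directory name taken from the layout described in the post.
    static final String JDBC_STAGE_LIB = "streamsets-datacollector-jdbc-lib";

    // Returns <extrasRoot>/streamsets-datacollector-jdbc-lib/lib
    static Path driverDir(Path extrasRoot) {
        return extrasRoot.resolve(JDBC_STAGE_LIB).resolve("lib");
    }

    // Creates the directory structure if needed and copies the driver jar into it.
    static Path installDriver(Path extrasRoot, Path driverJar) throws IOException {
        Path target = driverDir(extrasRoot);
        Files.createDirectories(target);
        return Files.copy(driverJar,
                          target.resolve(driverJar.getFileName()),
                          StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: ExtraLibDir <extras-root> <driver-jar>");
            return;
        }
        Path installed = installDriver(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Driver copied to " + installed);
    }
}
```

For the example path used in the post, the first argument would be /home/sdc-user/sdc-extras and the second the path to the driver jar of your database.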

Java vs Scala in Spark development

When developing for Spark, sometimes you have no choice of language (this is the case for the GraphX APIs, where Scala is the only option at the moment), but in other cases you can choose between two JVM languages, Java or Scala. Coming from a long background in Java and a shorter experience with Scala, I can say that two clear advantages of using Scala for Spark programming are more compact and more readable code. Have a look at the following simple Java code, taken from one of the examples bundled with the Spark distribution:

import java.io.Serializable;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SQLContext;

public class JavaSparkSQL {
  public static class Person implem