
Posts

Showing posts from 2016

Integrating Kafka, Spark Streaming and Cassandra: the basics

Spark Streaming brings Apache Spark's language-integrated APIs to stream processing, letting you write streaming jobs the same way you write batch jobs. It allows you to build fault-tolerant applications and to reuse the same code for batch and interactive queries. Kafka is an Open Source message broker written in Scala. It is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant and wicked fast. Cassandra is an Open Source distributed database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. This post walks through the basics of implementing a simple streaming application that integrates these three technologies. The code example is written in Scala. The releases I am referring to in this post are the following:  Scala 2.11.8  Spark 1.6.2  Kafka Client APIs 0.8.2.11  Cassandra 3.9  Datastax Spark-Cassandra Connector compatible with Spark 1
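To give an idea of how the three pieces fit together, here is a minimal sketch (not the full example from the post) of a Spark 1.6 streaming job that reads from a Kafka topic with the direct approach and saves the records to a Cassandra table; the topic, keyspace, table and column names are made up for illustration.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils
    import kafka.serializer.StringDecoder
    import com.datastax.spark.connector._
    import com.datastax.spark.connector.streaming._

    object KafkaToCassandra {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("KafkaToCassandra")
          .set("spark.cassandra.connection.host", "127.0.0.1")
        val ssc = new StreamingContext(conf, Seconds(5))

        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
        val topics = Set("kafkatesting")

        // Direct (receiver-less) stream: one RDD partition per Kafka partition
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        // Save each message value with an ingestion timestamp to a pre-created Cassandra table
        stream.map { case (_, value) => (value, System.currentTimeMillis()) }
          .saveToCassandra("test_keyspace", "messages", SomeColumns("body", "ts"))

        ssc.start()
        ssc.awaitTermination()
      }
    }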

HUG Ireland October Meetup: Machine Learning and the Serving Layer. Successful Big Data Architecture

Another interesting Hadoop User Group (HUG) Ireland Meetup next week (Monday October 3rd 2016) in Dublin at the Bank of Ireland premises in Grand Canal Square: http://www.meetup.com/it-IT/hadoop-user-group-ireland/events/234240469/?eventId=234240469&chapter_analytics_code=UA-55809013-2 If you are in the Dublin area on Monday and interested in Machine Learning, please attend this event to learn more and to start networking with other Big Data professionals. Hope to meet you there!

Adding Maven behavior to non-Maven projects

Non-Maven projects built through Jenkins cannot be automatically uploaded to a build artifact repository (such as Apache Archiva or JFrog Artifactory) at the end of a successful build. To enable this level of automation when you won't (or can't) transform a standard Java project into a Maven project, but still need a uniform build management process, you simply need to add a pom.xml file to the root of your project and commit it to your source code repository along with all the other project files. You can create a basic template following the official documentation and then add or update the specific sections as described in this post. This way developers don't need to change anything else in the current project structure, and they need neither Maven on their local development machines nor a Maven plugin in the IDE they use. Here's a description of all the sections of the pom.xml file. The values to overwrite are highlighted in bold. Ma
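The excerpt is cut off here, but as a reference point, a pom.xml added to a non-Maven project can be as small as the following sketch; the coordinates, repository URLs and source directory are placeholders to replace with your own values (the source directory override is an assumption for projects that don't follow the standard Maven layout).

    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <!-- Placeholder coordinates: use your own groupId/artifactId/version -->
      <groupId>com.mycompany</groupId>
      <artifactId>my-legacy-project</artifactId>
      <version>1.0.0-SNAPSHOT</version>
      <packaging>jar</packaging>

      <build>
        <!-- Assumption: the legacy project keeps its sources in a non-standard folder -->
        <sourceDirectory>src</sourceDirectory>
      </build>

      <!-- Where Jenkins will deploy the built artifacts (placeholder URLs) -->
      <distributionManagement>
        <repository>
          <id>releases</id>
          <url>http://archiva.mycompany.com/repository/internal</url>
        </repository>
        <snapshotRepository>
          <id>snapshots</id>
          <url>http://archiva.mycompany.com/repository/snapshots</url>
        </snapshotRepository>
      </distributionManagement>
    </project>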

Streamsets Data Collector 1.6.0.0 has been released!

Release 1.6.0.0 of the StreamSets Data Collector became available on September 1st. This release comes with a remarkable number of new features. Here are some of the most interesting: JDBC Lookup processor: it can perform lookups in a database table through a JDBC connection, and the returned values can then be used to enrich records. JDBC Tee processor: it can write data to a database table through a JDBC connection and then pass generated database column values to fields.  Support for reading data from paginated web pages through the HTTP origin. Support for Apache Kafka 0.10 and Elasticsearch 2.3.5. Enterprise security in the MongoDB origin and destination, including SSL and login credentials. Whole File data format: to move entire files from an origin system (Amazon S3 or Directory) to a destination system (Amazon S3, HDFS, Local File System or MapR FS). Using the whole file data format, you can transfer any type of file.  And many more. Furthermo

JDBC Producer destination setup in the StreamSets Data Collector

One of the most used destinations in SDC is the JDBC Producer. It allows writing data to a relational database table using a JDBC connection. The SDC release I am referring to in this post is 1.5.1.2, running on JVM 8. Installing a specific JDBC driver. In order to insert data into a database table, SDC requires the JDBC driver specific to the database you need to use. This applies to the JDBC Consumer origin as well. The first time you plan to add a JDBC Producer destination to a pipeline, you need to create a local directory on the SDC host machine, external to the SDC installation directory. Example: /home/sdc-user/sdc-extras Then create the following sub-directory structure for all of the JDBC drivers: /home/sdc-user/sdc-extras/streamsets-datacollector-jdbc-lib/lib/ Finally, copy the JDBC driver into that folder. Now it is time to make SDC aware of this directory. First you have to add the STREAMSETS_LIBRARIES_EXTRA_DIR environment variable and make it poin
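The excerpt stops here, but the gist of the directory setup can be sketched with a few shell commands; the MySQL driver JAR name is just an illustration, and the exact file where the environment variable is exported depends on your installation type (for a tarball install, $SDC_DIST/libexec/sdc-env.sh is an assumption to verify against the documentation).

    # Create the extra libraries directory structure described above
    mkdir -p /home/sdc-user/sdc-extras/streamsets-datacollector-jdbc-lib/lib/

    # Copy the JDBC driver for your database there (example JAR name)
    cp mysql-connector-java-5.1.40-bin.jar /home/sdc-user/sdc-extras/streamsets-datacollector-jdbc-lib/lib/

    # Make SDC aware of the directory by exporting the variable in the SDC environment script
    export STREAMSETS_LIBRARIES_EXTRA_DIR="/home/sdc-user/sdc-extras/"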

Java vs Scala in Spark development

When developing for Spark, sometimes you have no choice in terms of language (this is the case for the GraphX APIs, where Scala is the only choice at the moment), but in other cases you can choose between two different JVM languages (Java or Scala). Coming from a long background in Java and a shorter experience in Scala, I can say that clear advantages of using Scala for Spark programming are the better compactness and readability of the code. Have a look at the following simple Java code taken from one of the examples bundled with the Spark distribution: import java.io.Serializable; import java.util.List; import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; import org.apache.spark.api.java.function.Function; import org.apache.spark.sql.DataFrame; import org.apache.spark.sql.Row; import org.apache.spark.sql.SQLContext; public class JavaSparkSQL {     public static class Person implem
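The Java excerpt above is cut off, but for comparison here is a sketch of how the same kind of Spark SQL code reads in Scala (Spark 1.6 style); the input path is the people.txt file shipped with the Spark examples, and the class and object names are mine.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // A case class replaces the JavaBean with its getters and setters
    case class Person(name: String, age: Int)

    object ScalaSparkSQL {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("ScalaSparkSQL"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Load a text file, convert each line to a Person and turn the result into a DataFrame
        val people = sc.textFile("examples/src/main/resources/people.txt")
          .map(_.split(","))
          .map(p => Person(p(0), p(1).trim.toInt))
          .toDF()
        people.registerTempTable("people")

        // Run a SQL query and print the resulting names
        val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
        teenagers.map(row => "Name: " + row.getString(0)).collect().foreach(println)
      }
    }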

Starting a new Scala project in the Scala Eclipse IDE

A few weeks ago we needed to move to Scala in order to do some things in Spark using APIs available for that language only. To speed things up while starting to learn it, and to minimize the impact on the existing components and the ongoing CI process, we came up with the quick-and-dirty way discussed in this post. We started to use the Scala IDE for Eclipse and did the following actions to create new Scala projects:  - Create a new Scala project ( File -> New -> Scala Project ).  - Add the Maven nature to it ( Configure -> Convert to Maven Project ). All of our existing projects in the same area are built through Maven, and our Jenkins CI servers use Maven to build after any code change and to perform a lot of other actions (unit test execution, static analysis of the code, code coverage and many others) through it. That's the main reason we are not using sbt. The m2eclipse-scala and the m2e plugins are bundled with the Scala IDE, so there is no need to install them.  - Rem
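The excerpt is cut off mid-list, but for a project converted this way Maven also has to be told how to compile Scala sources. A common way to do that (an assumption here, not necessarily the exact approach of the post) is the scala-maven-plugin, added to the pom.xml along with the Scala library dependency; version numbers are indicative.

    <dependencies>
      <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>2.11.8</version>
      </dependency>
    </dependencies>

    <build>
      <plugins>
        <!-- Compile Scala main and test sources during the standard Maven lifecycle -->
        <plugin>
          <groupId>net.alchim31.maven</groupId>
          <artifactId>scala-maven-plugin</artifactId>
          <version>3.2.2</version>
          <executions>
            <execution>
              <goals>
                <goal>compile</goal>
                <goal>testCompile</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
      </plugins>
    </build>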

Publish to Kafka Jenkins plugin release 0.8 available

A new release of the Jenkins post-build action plugin to publish build job execution data to a Kafka topic is now available on my GitHub space ( https://github.com/virtualramblas/publishtokafka-plugin ). All the details are in the project README file. Please send me your feedback by commenting on this post or by adding a new issue (todo, feature request or bug) to the GitHub plugin repository if you have a chance to use it. Thanks a lot.

Streamsets Data Collector authentication through LDAP

StreamSets Data Collector (SDC) allows user authentication based on files or LDAP. By default, Data Collector uses file authentication. This post gives you the details on how to switch to your company's LDAP. To enable LDAP authentication you need to perform the following tasks: - Configure the LDAP properties for the Data Collector configuration by editing the $SDC_CONF/sdc.properties file:      - set the value of the http.authentication.login.module property to ldap      - configure the value of the http.authentication.ldap.role.mapping property to map your LDAP groups to Data Collector roles following this syntax:             <LDAP_group>:<SDC_role>,<additional_SDC_role>,<additional_SDC_role>         Multiple roles can be mapped to the same group or vice versa. You need to use semicolons to separate LDAP groups and commas to separate Data Collector roles. Here's an example:             http.authentication.ldap.role.mapping=LDAP000:admin;LDAP
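The example in the excerpt is truncated, so here is a complete illustration of the two properties in $SDC_CONF/sdc.properties following that syntax; the LDAP group names are made up.

    http.authentication.login.module=ldap
    http.authentication.ldap.role.mapping=engineering:admin,creator;support:manager;marketing:guest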

Shipping and analysing MongoDB logs using the Streamsets Data Collector, ElasticSearch and Kibana

In order to show that the considerations made in my last post hold for any log shipping purpose, let's now see how the same process applies to a more realistic use case scenario: shipping and analysing the logs of a MongoDB database. MongoDB logs pattern Starting from release 3.0 (I am considering release 3.2 for this post), the MongoDB logs come with the following pattern: <timestamp> <severity> <component> [<context>] <message> where:     timestamp is in iso8601-local format.     severity is the level associated with each log message. It is a single-character field. Possible values are F (Fatal), E (Error), W (Warning), I (Informational) and D (Debug).     component gives a functional categorization of the log message. Please refer to the specific release of MongoDB you're using for the full list of possible values.     context is the specific context for a message.     message : don't think you need some
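As a quick side illustration of that pattern (independent of the SDC pipeline described in the post), the five fields of a MongoDB 3.2 log line can be extracted with a simple regular expression; the sample line below is made up.

    object MongoLogLineParser {
      // <timestamp> <severity> <component> [<context>] <message>
      private val LogLine = """^(\S+)\s+([FEWID])\s+(\S+)\s+\[([^\]]+)\]\s+(.*)$""".r

      def main(args: Array[String]): Unit = {
        val sample = "2016-05-10T10:21:44.340+0100 I NETWORK  [initandlisten] waiting for connections on port 27017"
        sample match {
          case LogLine(timestamp, severity, component, context, message) =>
            println(s"timestamp=$timestamp severity=$severity component=$component context=$context message=$message")
          case _ =>
            println("line does not match the MongoDB log pattern")
        }
      }
    }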

Streamsets Data Collector log shipping and analysis using ElasticSearch, Kibana and... the Streamsets Data Collector

One common use case scenario for the StreamSets Data Collector (SDC) is shipping logs to some other system, like Elasticsearch, for real-time analysis. Building a pipeline for this particular purpose in SDC is really simple and fast and doesn't require any coding at all. For this quick tutorial I will use the SDC logs themselves as an example. The log data will be shipped to Elasticsearch and then visualized through a Kibana dashboard. Basic knowledge of SDC, Elasticsearch and Kibana is required for a better understanding of this post. These are the releases I am referring to for each system involved in this tutorial: JDK 8 Streamsets Data Collector 1.4.0 Elasticsearch 2.3.3 Kibana 4.5.1 Elasticsearch and Kibana installation You should have your Elasticsearch cluster installed and configured and a Kibana instance pointing to that cluster in order to go on with this tutorial. Please refer to the official documentation for these two products in order to complete their installation (if you do

Building Rapid Ingestion Data Pipelines talk @ Hadoop User Group Ireland Meetup on June 13th

If you're involved in Big Data and are in the Dublin area on Monday June 13th 2016, I hope you have a chance to attend the monthly meetup event of the Hadoop User Group Ireland: http://www.meetup.com/hadoop-user-group-ireland/events/231165491/ The event will be hosted by Bank of Ireland in their premises in Grand Canal Square ( https://goo.gl/maps/bbh1XghukfJ2 ). I will give the second talk, " Building a data pipeline to ingest data into Hadoop in minutes using Streamsets Data Collector ". The event will start at 6 PM Irish time. I hope to meet you there.

Publish to Kafka Jenkins plugin release 0.7 available

The first release of a Jenkins plugin to publish build job execution data to a Kafka topic is now available on my GitHub space ( https://github.com/virtualramblas/publishtokafka-plugin ). All the details are in the project README file. Please send me your feedback by commenting on this post or by adding a new issue (todo, feature request or bug) to the GitHub plugin repository if you have a chance to use it. Thanks a lot.

Discovering Streamsets Data Collector (Part 2)

Before moving on to the other planned posts for this series, an important update about the Data Collector. Two new versions (1.3.0 and 1.3.1) that solve some critical bugs and introduce new features have been released since the first post was published. This is a short list of the most significant benefits you get by moving to the new releases: No more issues running the Data Collector as described in https://issues.streamsets.com/browse/SDC-2657 (the workaround for this issue was to downgrade to release 1.2.1.0, thus missing an important new feature like the Groovy Evaluator processor). The Hadoop FS and Local FS destinations can now write files larger than 2 GB. A MongoDB destination is now available (up to release 1.2.2.0 a MongoDB database could be set as an origin only). Two new processors, Base64 Field Decoder and Base64 Field Encoder, have been implemented to work with Base64 binary data encoding/decoding. Enjoy it!

The Kafka Series (part 5): deleting topics

Before going further, a quick post about topic deletion in Kafka (someone asked me about this). In part 2 of this series we created a topic called kafkatesting for testing purposes and to get familiar with the Java APIs to implement producers and consumers. When you're done with testing you will need to delete it. This can be done by running the following command from a shell: $KAFKA_HOME/bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic kafkatesting Then, if you check the list of existing topics for that cluster, you could still see the topic there with the label "marked for deletion". This happens when you use the default properties file for Kafka, or when you didn't explicitly set the delete.topic.enable property to true (its default value is false) in your custom copy of that file. In order to make this configuration change effective you have to restart both Kafka and ZooKeeper.
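A sketch of the change and restart sequence, assuming the $KAFKA_HOME layout used earlier in this series:

    # Enable topic deletion in the broker configuration (its default is false)
    echo "delete.topic.enable=true" >> $KAFKA_HOME/config/server.properties

    # Restart the services so the change takes effect
    $KAFKA_HOME/bin/kafka-server-stop.sh
    $KAFKA_HOME/bin/zookeeper-server-stop.sh
    $KAFKA_HOME/bin/zookeeper-server-start.sh -daemon $KAFKA_HOME/config/zookeeper.properties
    $KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties

    # Now the topic can actually be deleted
    $KAFKA_HOME/bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic kafkatesting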

The Kafka Series (part 4): implementing a consumer in Java

In part 3 of this series we implemented a simple Kafka producer using the provided Java APIs. Now let's learn how to implement a consumer for the produced messages. The Kafka release I am referring to is still 0.9.0.1. As for the producer, I suggest using Maven: this way you only have to add one direct dependency through the POM file: <dependency>     <groupId>org.apache.kafka</groupId>     <artifactId>kafka-clients</artifactId>     <version>0.9.0.1</version> </dependency> Then create a new class called, for example, SimpleKafkaConsumer : public class SimpleKafkaConsumer implements Runnable {     KafkaConsumer<String, String> consumer; It implements the java.lang.Runnable interface and has a single instance variable, consumer , whose type is org.apache.kafka.clients.consumer.KafkaConsumer<K,V> . It is the client that consumes records from a Kafka cluster. It is not thread-safe. In the initia

Discovering Streamsets Data Collector (Part 1)

StreamSets Data Collector ( https://streamsets.com/product/ ) is an Open Source lightweight and powerful engine that streams data in real time. It allows you to configure data flows as pipelines through a web UI in a few minutes. Among its many features, it makes it possible to view real-time statistics and inspect data as it passes through the pipeline. In the first part of this series I am going to show the installation steps to run the Data Collector manually. I am referring to release 1.2.1.0. The latest one (1.2.2.0) comes with a bug that prevents it from starting (I have opened a ticket in the official Jira for this product ( https://issues.streamsets.com/browse/SDC-2657 ), but it is still unresolved at the time this post is written). The prerequisites for the installation are: OS: RedHat Enterprise Linux 6 or 7, CentOS 6 or 7, Ubuntu 14.04 or Mac OS X.  Java: Oracle or IBM JDK 7+. And now the installation steps:  - Download the full StreamSets Data Collector tarball:   
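The excerpt stops at the download step; as a rough sketch, the manual installation boils down to the commands below. The download URL is a placeholder (take the real one from the official download page) and the extracted directory name may differ slightly.

    # Download the full tarball (placeholder URL) and extract it
    wget https://<official-download-location>/streamsets-datacollector-all-1.2.1.0.tgz
    tar xzf streamsets-datacollector-all-1.2.1.0.tgz
    cd streamsets-datacollector-1.2.1.0

    # Start the Data Collector manually; the web UI listens on port 18630 by default
    bin/streamsets dc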

The Kafka Series (part 3): implementing a producer in Java

In part 2 of this series we learned how to set up a Kafka single-node single-broker cluster. Before moving to more complex cluster configurations, let's understand how to implement a producer using the Kafka Java APIs. The Kafka release I am referring to is still 0.9.0.1. I suggest using Maven for any producer you need to implement: this way you only have to add one direct dependency through the POM file: <dependency>     <groupId>org.apache.kafka</groupId>     <artifactId>kafka_2.11</artifactId>     <version>0.9.0.1</version> </dependency>  Create a new class called SimpleKakfaProducer : public class SimpleKakfaProducer {     private ProducerConfig config;     private KafkaProducer<String, String> producer; and add the two instance variables above. org.apache.kafka.clients.producer.ProducerConfig is the configuration class for a Kafka producer. org.apache.kafka.clients.producer.KafkaProducer<K,

The Kafka Series (part 2): single-node single-broker cluster installation

In the second part of this series I will describe the steps to install a Kafka single-node single-broker cluster on a Linux machine. Here I am referring to the latest stable Kafka version (at the time of writing this post), 0.9.0.1, for Scala 2.11. Prerequisites The only prerequisite needed is a JDK 7+. Installation - Move to the opt folder of your system    cd /opt   and then download the binaries of the latest release there:     wget http://www.us.apache.org/dist/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz - Extract the archive content:     tar xzf kafka_2.11-0.9.0.1.tgz - Create the KAFKA_HOME variable:     echo -e "export KAFKA_HOME=/opt/kafka_2.11-0.9.0.1" >> /root/.bash_profile - Add the Kafka bin folder to the PATH:     echo -e "export PATH=$PATH:$KAFKA_HOME/bin" >> /root/.bash_profile - Reload the bash profile for the user:     source /root/.bash_profile Starting the server  - In order for the Kafka ser
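The excerpt is cut off right at the server startup; for reference, a minimal sketch of the remaining steps with a standard Kafka 0.9 distribution (ZooKeeper first, then the broker, then a test topic), assuming the $KAFKA_HOME/bin folder is on the PATH as set up above:

    # Start the ZooKeeper instance bundled with Kafka, then the Kafka broker
    zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties &
    kafka-server-start.sh $KAFKA_HOME/config/server.properties &

    # Create a test topic and list the existing topics to verify the setup
    kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic kafkatesting
    kafka-topics.sh --list --zookeeper localhost:2181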

The Kafka series (Part 1): what's Kafka?

I am starting today a new series of posts about Apache Kafka ( http://kafka.apache.org/ ). Kafka is an Open Source message broker written in Scala ( http://www.scala-lang.org/ ). It was originally developed by LinkedIn ( https://ie.linkedin.com/ ), then released as Open Source in 2011, and it is currently maintained by the Apache Software Foundation ( http://www.apache.org/ ). Why should one prefer Kafka to a traditional JMS message broker? Here's a short list of convincing reasons: It's fast: a single Kafka broker running on commodity hardware can handle hundreds of megabytes of reads and writes per second from thousands of clients. Great scalability: it can be easily and transparently expanded without downtime.  Durability and Replication: messages are persisted on disk and replicated within the cluster to prevent data loss (with a proper configuration, using the large number of available configuration parameters, you can achieve zero data loss). Pe

Load testing MongoDB using JMeter

Apache JMeter ( http://jmeter.apache.org/ ) has supported MongoDB since its 2.10 release. In this post I am referring to the latest JMeter release (2.13). A preliminary JMeter setup is needed before starting your first test plan for MongoDB. JMeter uses Groovy as the reference scripting language, so Groovy needs to be set up for our favorite load testing tool. Follow these steps to complete the setup: Download Groovy from the official website ( http://www.groovy-lang.org/download.html ). In this post I am referring to Groovy release 2.4.4, but using later versions is fine. Copy the groovy-all-2.4.4.jar to the $JMETER_HOME/lib folder. Restart JMeter if it was running while adding the Groovy JAR file. Now you can start creating a test plan for MongoDB load testing. From the UI select the MongoDB template ( File -> Templates... ). The new test plan has a MongoDB Source Config element. Here you have to set up the connection details for the database to be tested: The Threa