
Posts

Showing posts from March, 2016

Discovering Streamsets Data Collector (Part 1)

StreamSets Data Collector ( https://streamsets.com/product/ ) is an Open Source, lightweight and powerful engine that streams data in real time. It lets you configure data flows as pipelines through a web UI in a few minutes. Among its many features, it makes it possible to view real-time statistics and inspect data as it passes through the pipeline. In the first part of this series I am going to show the installation steps to run the Data Collector manually. I am referring to release 1.2.1.0. The latest one (1.2.2.0) comes with a bug that prevents it from starting (I have opened a ticket in the official Jira for this product ( https://issues.streamsets.com/browse/SDC-2657 ), but it is still unresolved at the time of writing). The prerequisites for the installation are: OS: RedHat Enterprise Linux 6 or 7, CentOS 6 or 7, Ubuntu 14.04, or Mac OS X.  Java: Oracle or IBM JDK 7+. And now the installation steps:  - Download the full StreamSets Data Collector tarball:   
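A manual install along the lines described above could be sketched as follows. The download URL and archive name are assumptions based on release 1.2.1.0 and may differ from the official distribution site, so check there for the actual tarball location:

```shell
# Sketch of a manual StreamSets Data Collector 1.2.1.0 installation.
# The URL and archive name below are assumptions; verify them on the
# official StreamSets download page before running.
cd /opt
wget https://archives.streamsets.com/datacollector/1.2.1.0/tarball/streamsets-datacollector-1.2.1.0.tgz
tar xzf streamsets-datacollector-1.2.1.0.tgz
cd streamsets-datacollector-1.2.1.0
# Start the Data Collector in the foreground; the web UI listens
# on port 18630 by default.
bin/streamsets dc
```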

The Kafka Series (part 3): implementing a producer in Java

In part 2 of this series we learned how to set up a Kafka single-node single-broker cluster. Before moving to more complex cluster configurations, let's understand how to implement a producer using the Kafka Java APIs. The Kafka release I am referring to is still 0.9.0.1. I suggest using Maven for any producer you need to implement: this way you only have to add one direct dependency to the POM file: <dependency>     <groupId>org.apache.kafka</groupId>     <artifactId>kafka_2.11</artifactId>     <version>0.9.0.1</version> </dependency>  Create a new class called SimpleKafkaProducer : public class SimpleKafkaProducer {     private ProducerConfig config;     private KafkaProducer<String, String> producer; and add the two instance variables above. org.apache.kafka.clients.producer.ProducerConfig is the configuration class for a Kafka producer. org.apache.kafka.clients.producer.KafkaProducer<K,
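A minimal sketch of how such a producer could look with the 0.9 Java API; the broker address, topic name ("test-topic") and message contents are placeholders, not values from the post:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Minimal sketch of a Kafka 0.9.0.1 producer. Broker address and
// topic name are placeholders chosen for illustration.
public class SimpleKafkaProducer {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Broker to bootstrap from: assumes the local single-broker
        // cluster set up in part 2 of the series.
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keys and values are plain strings in this example.
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        try {
            for (int i = 0; i < 10; i++) {
                // send() is asynchronous; it returns a Future<RecordMetadata>.
                producer.send(new ProducerRecord<>("test-topic",
                        Integer.toString(i), "message " + i));
            }
        } finally {
            // Flushes buffered records and releases resources.
            producer.close();
        }
    }
}
```

Note that send() does not block: records are batched and sent in the background, which is one of the reasons for Kafka's throughput.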

The Kafka Series (part 2): single node-single broker cluster installation

In the second part of this series I will describe the steps to install a Kafka single-node single-broker cluster on a Linux machine. Here I am referring to the latest Kafka stable version (at the time of writing this post), 0.9.0.1, Scala 2.11. Prerequisites The only prerequisite needed is a JDK 7+. Installation - Move to the opt folder of your system    cd /opt   and then download the binaries of the latest release there:     wget http://www.us.apache.org/dist/kafka/0.9.0.1/kafka_2.11-0.9.0.1.tgz - Extract the archive content:     tar xzf kafka_2.11-0.9.0.1.tgz - Create the KAFKA_HOME variable:     echo -e "export KAFKA_HOME=/opt/kafka_2.11-0.9.0.1" >> /root/.bash_profile - Add the Kafka bin folder to the PATH:     echo -e "export PATH=\$PATH:\$KAFKA_HOME/bin" >> /root/.bash_profile - Reload the bash profile for the user:     source /root/.bash_profile Starting the server  - In order for the Kafka ser
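One detail worth noting in the profile setup above: inside double quotes the shell expands $PATH and $KAFKA_HOME immediately, before the line is written to the profile, so the references must be escaped or single-quoted to be stored literally. A small sketch of the pattern, writing to a temporary file instead of /root/.bash_profile so it can be tried safely:

```shell
# Sketch of the profile setup above, using a temporary file instead
# of /root/.bash_profile. Single quotes keep $PATH and $KAFKA_HOME
# literal, so they are only expanded when the profile is sourced,
# not when the echo runs.
PROFILE=$(mktemp)
echo 'export KAFKA_HOME=/opt/kafka_2.11-0.9.0.1' >> "$PROFILE"
echo 'export PATH=$PATH:$KAFKA_HOME/bin' >> "$PROFILE"
cat "$PROFILE"
```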

The Kafka series (Part 1): what's Kafka?

I am starting today a new series of posts about Apache Kafka ( http://kafka.apache.org/ ). Kafka is an Open Source message broker written in Scala ( http://www.scala-lang.org/ ). It was originally developed by LinkedIn ( https://ie.linkedin.com/ ), released as Open Source in 2011, and is currently maintained by the Apache Software Foundation ( http://www.apache.org/ ). Why should one prefer Kafka to a traditional JMS message broker? Here's a short list of convincing reasons: It's fast: a single Kafka broker running on commodity hardware can handle hundreds of megabytes of reads and writes per second from thousands of clients. Great scalability: it can be easily and transparently expanded without downtime.  Durability and Replication: messages are persisted on disk and replicated within the cluster to prevent data loss (with a proper configuration, tuned through the many available configuration parameters, you can achieve zero data loss). Pe