StreamSets Data Collector (https://streamsets.com/product/) is an open source, lightweight, and powerful engine that streams data in real time. It lets you configure data flows as pipelines through a web UI in a few minutes. Among its many features, it lets you view real-time statistics and inspect data as it passes through the pipeline.
In this first part of the series I am going to show the installation steps to run the Data Collector manually. I am referring to release 1.2.1.0. The latest one (1.2.2.0) comes with a bug that prevents it from starting. I have opened a ticket in the official Jira for this product (https://issues.streamsets.com/browse/SDC-2657), but it is still unresolved at the time of writing.
The prerequisites for the installation are:
- OS: Red Hat Enterprise Linux 6 or 7, CentOS 6 or 7, Ubuntu 14.04, or Mac OS X.
- Java: Oracle or IBM JDK 7 or later.
The installation steps are:
- Download the full StreamSets Data Collector tarball:
wget https://archives.streamsets.com/datacollector/1.2.1.0/tarball/streamsets-datacollector-all-1.2.1.0.tgz
- Extract its contents into any desired location:
tar xvf streamsets-datacollector-all-1.2.1.0.tgz
- Check the maximum number of open file descriptors on the host machine:
ulimit -n
If it is lower than 4096 (1024 is a common default), increase it to at least 4096 by adding the following entry to the /etc/security/limits.conf file:
<user_running_dc> soft nofile 4096
and then log out and log in again for the change to take effect.
- Run the Data Collector ($DATA_COLLECTOR_HOME is the directory created when you extracted the tarball):
$DATA_COLLECTOR_HOME/bin/streamsets dc
- Access the UI through a web browser at the following URL:
http://<hostname>:18630/
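The file descriptor check described in the steps above can be scripted so that it runs before every start. The following is only a minimal sketch: the function name nofile_ok and the variable REQUIRED_NOFILE are made up for this example and are not part of the Data Collector distribution.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with the Data Collector): verifies that the
# open file descriptors limit meets the 4096 minimum mentioned in this post.
REQUIRED_NOFILE=4096

nofile_ok() {
    # $1: a limit as printed by `ulimit -n` (a number or "unlimited")
    limit="$1"
    if [ "$limit" = "unlimited" ]; then
        return 0
    fi
    [ "$limit" -ge "$REQUIRED_NOFILE" ]
}

# Check the soft limit of the current shell, which is what `ulimit -n` reports.
current=$(ulimit -n)
if nofile_ok "$current"; then
    echo "open files limit OK: $current"
else
    echo "open files limit too low: $current (need at least $REQUIRED_NOFILE)"
fi
```

Run it as the same user that will start the Data Collector, since limits set in /etc/security/limits.conf apply per user.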
Now you are ready to create your first pipeline.
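The UI may take a little while to come up after the start command. A possible way to wait for it from a second shell is to poll the URL with curl. This is just a sketch under some assumptions: the function names ui_url and wait_for_ui are invented for this example, curl must be installed, and any HTTP response on the port is treated as a sign that the UI is up.

```shell
#!/bin/sh
# Hypothetical helpers (not part of the Data Collector): poll the UI until it answers.

ui_url() {
    # $1: hostname, $2: port (18630 is the port used in this post)
    printf 'http://%s:%s/\n' "$1" "$2"
}

wait_for_ui() {
    # $1: URL to poll, $2: maximum number of attempts
    url="$1"
    attempts="$2"
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # curl exits with 0 as soon as the server returns any HTTP response
        if curl -s -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep 2
    done
    return 1
}

# Example usage, assuming the Data Collector runs on the local machine:
# wait_for_ui "$(ui_url localhost 18630)" 30 && echo "Data Collector UI is up"
```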
To stop the Data Collector, press Ctrl + C in the same shell from which you started it.
What's next?
In part 2 we will walk through the process of creating a new pipeline from scratch.