
Discovering StreamSets Data Collector (Part 1)

StreamSets Data Collector (https://streamsets.com/product/) is an Open Source, lightweight, and powerful engine that streams data in real time. It allows you to configure data flows as pipelines through a web UI in a few minutes. Among its many features, it makes it possible to view real-time statistics and inspect data as it passes through the pipeline.



In the first part of this series I am going to show the installation steps to run the Data Collector manually. I am referring to release 1.2.1.0. The latest one (1.2.2.0) comes with a bug that prevents it from starting (I have opened a ticket in the official Jira for this product (https://issues.streamsets.com/browse/SDC-2657), but it is still unresolved at the time of writing).

The prerequisites for the installation are:
  • OS: Red Hat Enterprise Linux 6 or 7, CentOS 6 or 7, Ubuntu 14.04, or Mac OS X.
  • Java: Oracle or IBM JDK 7+.
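You can quickly verify the Java prerequisite from a shell. A minimal check, assuming the JDK is already on your PATH:
    # Print the installed Java version: it should report 1.7 or higher
    java -version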
And now the installation steps (the full sequence is consolidated into a single script after the list):
 - Download the full StreamSets Data Collector tarball:
    wget https://archives.streamsets.com/datacollector/1.2.1.0/tarball/streamsets-datacollector-all-1.2.1.0.tgz
 - and then extract its content into any desired location:
    tar xvf streamsets-datacollector-all-1.2.1.0.tgz
 - Check the maximum number of open file descriptors for the hosting machine:
    ulimit -n
   If it is set to 1024 you need to increase it to at least 4096 by updating the /etc/security/limits.conf file with the following entry:
     <user_running_dc>    soft    nofile    4096
   and then log off and log in again (run ulimit -n once more to verify the new value).
 - Run the Data Collector:
    $DATA_COLLECTOR_HOME/bin/streamsets dc
 - Access the UI through a web browser at the following URL:
    http://<hostname>:18630/
   Now you are ready to start creating your first pipeline.
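For reference, here is the whole sequence as a single shell sketch. The installation directory and the name of the extracted folder are assumptions; adjust them to your environment:
    # Download release 1.2.1.0 and extract it into /opt (example location)
    wget https://archives.streamsets.com/datacollector/1.2.1.0/tarball/streamsets-datacollector-all-1.2.1.0.tgz
    tar xvf streamsets-datacollector-all-1.2.1.0.tgz -C /opt

    # Point DATA_COLLECTOR_HOME at the extracted directory (name assumed) and start the engine
    export DATA_COLLECTOR_HOME=/opt/streamsets-datacollector-1.2.1.0
    $DATA_COLLECTOR_HOME/bin/streamsets dc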

To stop the Data Collector, just type Ctrl+C in the same shell from which you started it.
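If you need the Data Collector to keep running after you close the shell, here is a minimal sketch using standard Linux tools (the sdc.log and sdc.pid file names are just examples):
    # Start in the background, detached from the terminal, logging to a file
    nohup $DATA_COLLECTOR_HOME/bin/streamsets dc > sdc.log 2>&1 &
    echo $! > sdc.pid

    # Later, stop it using the saved process id
    kill $(cat sdc.pid)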

What's next?
In part 2 we will walk through the process of creating a new pipeline from scratch.
