
Setting up a quick dev environment for Kafka, CSR and SDC

A few days ago Pat Patterson published an excellent article on DZone, Evolving Avro Schemas With Apache Kafka and StreamSets Data Collector, which I recommend reading. I followed the tutorial, and today I want to share how I quickly set up the environment for it, in case you are interested in doing the same. I did it on a Red Hat Enterprise Linux Server 7 machine (the steps are the same for any other Linux distro), using only images available on Docker Hub.
First, start a ZooKeeper node (which is required by Kafka):

sudo docker run --name zookeeper --restart always -d zookeeper

and then a Kafka broker, linking its container to the ZooKeeper one:

sudo docker run -d --name kafka --link zookeeper:zookeeper ches/kafka

Then start the Confluent Schema Registry (CSR), linking it to ZooKeeper and Kafka:

sudo docker run -d --name schema-registry -p 8081:8081 --link zookeeper:zookeeper --link kafka:kafka confluent/schema-registry

and the REST proxy for it:

sudo docker run -d --name rest-proxy -p 8082:8082 --link zookeeper:zookeeper --link kafka:kafka --link schema-registry:schema-registry confluent/rest-proxy

Start an instance of the StreamSets Data Collector (SDC):

sudo docker run --restart on-failure -p 18630:18630 -d --name streamsets-dc streamsets/datacollector
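The services can take a few seconds to start accepting requests after their containers come up. A minimal sketch of a retry helper is shown below. The function name `wait_until` is hypothetical, and the example calls assume that the port mappings used above are reachable on `localhost` and that `curl` is installed. `GET /subjects` and `GET /topics` are the listing endpoints of the Schema Registry and the REST proxy respectively.

```shell
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Discard the command's output; only its exit status matters here.
    if "$@" > /dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example calls (assumptions: run on the Docker host, ports mapped as above):
# wait_until 30 2 curl -sf http://localhost:8081/subjects   # Schema Registry
# wait_until 30 2 curl -sf http://localhost:8082/topics     # REST proxy
```

The helper returns a non-zero status when the service never answers, so it can gate the next step of a setup script.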

Finally, an optional step makes registering and updating Avro schemas in CSR more user friendly than calling the CSR APIs directly. Start the open source Schema Registry UI provided by Landoop:

sudo docker run -d --name schema-registry-ui -p 8000:8000 -e "SCHEMAREGISTRY_URL=http://<csr_host>:8081" -e "PROXY=true" landoop/schema-registry-ui

This connects the UI to your CSR instance; replace <csr_host> with its hostname or IP address.
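If you prefer the CSR REST API to the UI, a schema can be registered with a POST to `/subjects/<subject>/versions`, sending the Avro schema as an escaped string inside a JSON envelope. The sketch below is only an illustration: the function names are hypothetical, the escaping handles single-line schemas only, and the subject name `csrtest-value` in the example call is an assumption that may differ from the one used in the tutorial.

```shell
# Hypothetical helper: wrap a single-line Avro schema in the JSON envelope
# expected by the Schema Registry ({"schema": "<escaped schema>"}).
schema_payload() {
  # Escape backslashes first, then double quotes.
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"schema": "%s"}' "$escaped"
}

# Hypothetical helper: register a schema under a subject.
# Usage: register_schema <registry_url> <subject> <schema_json>
register_schema() {
  curl -s -X POST \
    -H "Content-Type: application/vnd.schemaregistry.v1+json" \
    --data "$(schema_payload "$3")" \
    "$1/subjects/$2/versions"
}

# Example call (assumption: CSR reachable on localhost:8081):
# register_schema http://localhost:8081 csrtest-value '{"type":"string"}'
```

On success the registry replies with a small JSON document containing the id assigned to the schema.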
You can create a topic in Kafka by executing the following commands:

export ZOOKEEPER_IP=$(sudo docker inspect --format '{{ .NetworkSettings.IPAddress }}' zookeeper) 
sudo docker run --rm ches/kafka kafka-topics.sh --create --zookeeper $ZOOKEEPER_IP:2181 --replication-factor 1 --partitions 1 --topic csrtest
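To check that the topic was created, the same `kafka-topics.sh` script can be run with `--describe`. The sketch below wraps that call in a function; the name `describe_topic` and the `DOCKER` override variable are hypothetical conveniences, not part of any of the images used here.

```shell
# Set DOCKER=echo for a dry run that prints the arguments instead of running Docker.
DOCKER=${DOCKER:-sudo docker}

# Hypothetical helper: describe a topic through a throwaway ches/kafka container.
# Usage: describe_topic <zookeeper_ip> <topic>
describe_topic() {
  # $DOCKER is intentionally unquoted so that "sudo docker" splits into two words.
  $DOCKER run --rm ches/kafka kafka-topics.sh --describe \
    --zookeeper "$1:2181" --topic "$2"
}

# Example call, reusing the variable exported above:
# describe_topic "$ZOOKEEPER_IP" csrtest
```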

The environment is now ready to play with and to follow Pat's tutorial.
