
Posts

Showing posts from 2017

Quick start with Apache Livy (part 2): the REST APIs

The second post of this series focuses on how to run a Livy server instance and start playing with its REST APIs. The steps below are meant for a Linux environment (any distribution). Prerequisites The prerequisites to start a Livy server are the following: The JAVA_HOME env variable set to a JDK/JRE 8 installation. A running Spark cluster. Starting the Livy server Download the latest version (0.4.0-incubating at the time of writing) from the official website and extract the archive content (it is a ZIP file). Then set the SPARK_HOME env variable to the Spark location on the server (for simplicity, in this post I am assuming that the cluster is on the same machine as the Livy server, but in the next post I will go through the customization of the configuration files, including the connection to a remote Spark cluster, wherever it is). By default Livy writes its logs into the $LIVY_HOME/logs location: you need to manually create this directory. Finally
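Once the server is up, sessions can be created and managed entirely over HTTP. Below is a minimal sketch in Scala, assuming Livy is listening on its default port 8998 on localhost, that creates an interactive Spark session by POSTing to the /sessions endpoint; the session id returned in the JSON response can then be used to submit statements.

import java.io.OutputStreamWriter
import java.net.{HttpURLConnection, URL}
import scala.io.Source

object LivySessionExample {
  def main(args: Array[String]): Unit = {
    // Livy listens on port 8998 by default; adjust host/port to your setup.
    val conn = new URL("http://localhost:8998/sessions")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)

    // Request an interactive Spark (Scala) session.
    val writer = new OutputStreamWriter(conn.getOutputStream)
    writer.write("""{"kind": "spark"}""")
    writer.flush()
    writer.close()

    // The response is a JSON document describing the new session (id, state, ...).
    println(s"HTTP ${conn.getResponseCode}")
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}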

Quick start with Apache Livy (part 1)

I have started evaluating Livy for potential scenarios where this technology could help and I'd like to share some findings with others who would like to approach this interesting Open Source project. It was started by Cloudera and Microsoft and it is currently in the process of being incubated by the Apache Software Foundation. The official documentation isn't comprehensive at the moment, so I hope my posts on this topic can help someone else. Apache Livy is a service to interact with Apache Spark through a REST interface. It enables the submission of both Spark jobs and snippets of Spark code. The following features are supported: Jobs can be submitted as pre-compiled jars, snippets of code or via the Java/Scala client API. Interactive Scala, Python, and R shells. Support for Spark 2.x and Spark 1.x, Scala 2.10 and 2.11. It doesn't require any change to Spark code. It allows long-running Spark contexts that can be used for multiple Spark jobs, by
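As an aside on the Java/Scala client API mentioned above, here is a minimal, hedged sketch using Livy's org.apache.livy client classes, with the server assumed at http://localhost:8998, that submits a trivial job counting the elements of a small dataset. In a real setup the jar containing the job class also has to be uploaded to the server (for instance via uploadJar) so it can be deserialized there.

import java.net.URI
import org.apache.livy.{Job, JobContext, LivyClientBuilder}

object LivyClientSketch {
  def main(args: Array[String]): Unit = {
    // Build a client pointing at the Livy server (default port 8998 assumed).
    val client = new LivyClientBuilder()
      .setURI(new URI("http://localhost:8998"))
      .build()
    try {
      // A trivial job: count 5 numbers on the remote Spark context.
      val handle = client.submit(new Job[java.lang.Long] {
        override def call(jc: JobContext): java.lang.Long =
          jc.sc().parallelize(java.util.Arrays.asList[Integer](1, 2, 3, 4, 5)).count()
      })
      println(s"Count computed by Livy: ${handle.get()}")
    } finally {
      client.stop(true)
    }
  }
}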

How to Get Metrics From a Java Application Inside a Docker Container Using Telegraf

My latest article on DZone is online. There you can learn how to configure Telegraf to pull metrics from a Java application running inside a Docker container. The Telegraf Jolokia plugin configuration presented as an example in the article is set up to collect metrics about heap memory usage, thread count and class count, but these aren't the only metrics you can collect this way. When running a container hosting the Java app with the Jolokia agent, you can get the full list of available metrics through the following GET request: curl -X GET http://<jolokia_host>:<jolokia_port>/jolokia/list and pick up their names and attributes to be added to the plugin configuration.
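Under the hood, the plugin simply polls Jolokia's read endpoint for each configured MBean attribute. The following sketch in Scala, assuming the Jolokia agent is exposed on its default port 8778 on localhost, issues the same kind of request by hand for the heap memory usage; this is handy for checking names and attributes before adding them to the Telegraf configuration.

import java.net.{HttpURLConnection, URL}
import scala.io.Source

object JolokiaHeapCheck {
  def main(args: Array[String]): Unit = {
    // Read the HeapMemoryUsage attribute of the java.lang:type=Memory MBean,
    // i.e. the same data the Telegraf Jolokia plugin collects for heap metrics.
    val url = new URL("http://localhost:8778/jolokia/read/java.lang:type=Memory/HeapMemoryUsage")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("GET")
    println(Source.fromInputStream(conn.getInputStream).mkString)
    conn.disconnect()
  }
}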

Setting up a quick dev environment for Kafka, CSR and SDC

A few days ago Pat Patterson published an excellent article on DZone about Evolving Avro Schemas With Apache Kafka and StreamSets Data Collector. I recommend reading this interesting article. I followed this tutorial and today I want to share the details of how I quickly set up the environment for this purpose, in case you are interested in doing the same. I did it on a Red Hat Linux Server 7 machine (but the steps are the same for any other Linux distro) and using only images available in the Docker Hub. First start a Zookeeper node (which is required by Kafka): sudo docker run --name some-zookeeper --restart always -d zookeeper and then a Kafka broker, linking the container to the Zookeeper one: sudo docker run -d --name kafka --link some-zookeeper:zookeeper ches/kafka Then start the Confluent Schema Registry (linking it to Zookeeper and Kafka): sudo docker run -d --name schema-registry -p 8081:8081 --link some-zookeeper:zookeeper --link kafka:kafka confluent/schema-registry
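With those containers running, a producer can register schemas with the registry simply by using Confluent's Avro serializer. The sketch below is in Scala; the topic name, record schema and localhost addresses are placeholders of mine, and it needs the kafka-clients, avro and kafka-avro-serializer dependencies. It publishes a single Avro record, letting the serializer register the schema with the registry at http://localhost:8081.

import java.util.Properties
import org.apache.avro.Schema
import org.apache.avro.generic.GenericData
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object AvroProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    // The Confluent serializer registers/fetches schemas with the Schema Registry.
    props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer")
    props.put("schema.registry.url", "http://localhost:8081")

    // A minimal example schema and record (placeholder names).
    val schema = new Schema.Parser().parse(
      """{"type":"record","name":"Person","fields":[{"name":"name","type":"string"}]}""")
    val record = new GenericData.Record(schema)
    record.put("name", "Pat")

    val producer = new KafkaProducer[String, AnyRef](props)
    producer.send(new ProducerRecord[String, AnyRef]("people", "key-1", record))
    producer.close()
  }
}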

Exporting InfluxDB data to a CSV file

Sometimes you need to export a sample of the data from an InfluxDB measurement to a CSV file (for example to allow a data scientist to do some offline analysis using a tool like Jupyter, Zeppelin or Spark Notebook). It is possible to perform this operation through the influx command line client. This is the general syntax: sudo /usr/bin/influx -database '<database_name>' -host '<hostname>' -username '<username>' -password '<password>' -execute 'select_statement' -format '<format>' > <file_path>/<file_name>.csv where the format can be csv, json or column. Example: sudo /usr/bin/influx -database 'telegraf' -host 'localhost' -username 'admin' -password '123456789' -execute 'select * from mem' -format 'csv' > /home/googlielmo/influxdb-export/mem-export.csv
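When the CLI isn't available (for example from an application), the same export can be done through InfluxDB's HTTP /query endpoint, which returns CSV when asked for it via the Accept header. Here is a minimal sketch in Scala, assuming an InfluxDB 1.x instance on its default port 8086 and the same database, credentials and measurement as above; the output path is just an example.

import java.io.PrintWriter
import java.net.{HttpURLConnection, URL, URLEncoder}
import scala.io.Source

object InfluxCsvExport {
  def main(args: Array[String]): Unit = {
    // Build the query string; u/p carry the credentials, db selects the database.
    val q = URLEncoder.encode("select * from mem", "UTF-8")
    val url = new URL(s"http://localhost:8086/query?db=telegraf&u=admin&p=123456789&q=$q")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    // InfluxDB 1.x returns CSV instead of JSON when this header is set.
    conn.setRequestProperty("Accept", "application/csv")
    val csv = Source.fromInputStream(conn.getInputStream).mkString
    val out = new PrintWriter("/tmp/mem-export.csv")
    try out.write(csv) finally out.close()
    conn.disconnect()
  }
}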

All Day DevOps 2017 is coming!

The All Day DevOps 2017 online conference is coming on October 24th, 2017. Last year I had a chance to attend it and I can confirm that the overall quality of the talks was excellent. Looking at this year's agenda, it seems even more promising than 2016's. I heartily recommend attending to everyone interested in DevOps matters.

Publish to Kafka Jenkins plugin talk @ Dublin Jenkins Meetup

I am going to present a preview of the upcoming release 1.5, and some insights about the future release 2.0, of the publishtokafka Jenkins plugin at the Dublin Jenkins meetup on August 31st. Please come along if you're in the Dublin area to learn more about this plugin of mine and to have a chat about Jenkins stuff. Here's the map for the event: https://goo.gl/maps/BGT7MRig1Tv

Streamsets Data Collector pipeline execution scheduling through the SDC REST APIs

A hot topic in the sdc-user group during the past weeks has been how to schedule the start and stop of SDC pipelines. Usage of the SDC REST APIs has been suggested in some threads, but my general impression is that the audience doesn't have a clear picture of them, so I decided to write an article on DZone to help clarify once and for all how to do it. Enjoy it!
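For a flavour of what such a scheduled call can look like, here is a hedged sketch in Scala that asks a Data Collector instance to start a pipeline. The endpoint path, the default port 18630 and the admin:admin credentials are assumptions based on SDC 2.x; check the RESTful API reference bundled with your Data Collector version, and note that the pipeline id is a placeholder.

import java.net.{HttpURLConnection, URL}
import java.util.Base64

object SdcPipelineStarter {
  def main(args: Array[String]): Unit = {
    // Assumed endpoint: POST /rest/v1/pipeline/<pipelineId>/start (verify for your SDC version).
    val pipelineId = "MyPipeline"
    val url = new URL(s"http://localhost:18630/rest/v1/pipeline/$pipelineId/start")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    // Basic auth with the default admin/admin credentials (assumption).
    val auth = Base64.getEncoder.encodeToString("admin:admin".getBytes("UTF-8"))
    conn.setRequestProperty("Authorization", s"Basic $auth")
    // SDC rejects state-changing calls that lack this CSRF-protection header.
    conn.setRequestProperty("X-Requested-By", "sdc")
    println(s"Start request returned HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}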

Unit testing Spark applications in Scala (Part 2): Intro to spark-testing-base

In the first part of this series we became familiar with ScalaTest. When it comes to unit testing Scala Spark applications, ScalaTest alone isn't enough: you need to add spark-testing-base to the roster. It is an Open Source framework which provides base classes for the main Spark abstractions like SparkContext, RDD, DataFrame, Dataset and Streaming. Let's start exploring the facilities provided by this framework and how it works alongside ScalaTest through some simple examples. Let's consider the following Scala word count example found on the web: import org.apache.spark.{SparkConf, SparkContext} object SparkWordCount { def main(args: Array[String]) { val inputFile = args(0) val outputFile = args(1) val conf = new SparkConf().setAppName("SparkWordCount") // Create a Scala Spark Context. val sc = new SparkContext(conf) // Load our input data. val input = sc.textFile(inputFile) // Split up into words. val wo
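To give a taste of what a test looks like before dissecting the word count, here is a minimal sketch of a spark-testing-base suite (the suite name and data are invented for illustration, and it needs the scalatest and spark-testing-base test dependencies): the SharedSparkContext trait supplies a SparkContext named sc to the ScalaTest suite, so the word counting logic can be exercised on a small in-memory RDD.

import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSuite extends FunSuite with SharedSparkContext {
  // SharedSparkContext creates the `sc` SparkContext once for the whole suite.
  test("words are counted correctly") {
    val lines = sc.parallelize(Seq("hello spark", "hello scala"))
    val counts = lines.flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .collectAsMap()
    assert(counts("hello") === 2)
    assert(counts("spark") === 1)
  }
}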

Hubot & SDC

My first Open Source Hubot script has been released and is available in my GitHub space. It provides support for checking the status of pipelines in a StreamSets Data Collector server. It is still an alpha release, but development is ongoing, so new features and improvements will be added regularly. Enjoy it!

Started playing with Hubot

In the past weeks, in order to explore new ways to improve the daily job of DevOps people by introducing chatbots, I had a chance to evaluate and play with Hubot. It is an Open Source chat robot implemented by GitHub Inc. which is easy to program using simple scripts written in CoffeeScript and runs on Node.js. I started almost from scratch, this being my first production experience with Node.js and my first experience at all with CoffeeScript. In this post I am sharing just the basics to start implementing a personal Hubot. The prerequisites to follow this tutorial are Node.js and the npm package manager for JavaScript. Download and install the latest versions for your OS. In this post I am going to refer to Node.js 6.10.3 and npm 4.6.1. First of all you need to install the Hubot generator: npm install -g yo generator-hubot Then create the directory for your first Hubot: mkdir firstbot and generate the bot instance through the Yeoman generator: cd firstbot  yo hubot At creation

Handling URL redirection with JGit

JGit is a Java library to programmatically perform actions against a local or remote Git repository. It is quite powerful, but it comes with an issue: when trying to connect to a remote repository it doesn't handle URL redirection. When you try to connect to a remote repo like in the example below (the connection attempt is to one of my repos on GitHub): String uri = "https://github.com/virtualramblas/publishtokafka-plugin.git"; String userName = "XXXXXXX"; String password = "YYYYYYY"; LsRemoteCommand remoteCommand = Git.lsRemoteRepository(); try { Collection<Ref> refs = remoteCommand.setCredentialsProvider(new UsernamePasswordCredentialsProvider(userName, password)) .setHeads(true) .setRemote(uri) .call(); } catch(GitAPIException e) { e.printStackTrace(); } everything is fine because there is no redirection. But trying to connect to another
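One possible workaround, not necessarily the one this post goes on to describe, is to resolve any HTTP redirects yourself and then hand the final URL to JGit's setRemote(...). A small sketch of that idea, written in Scala and using only java.net from the standard library:

import java.net.{HttpURLConnection, URL}
import scala.annotation.tailrec

object RedirectResolver {
  // Follow up to maxHops HTTP redirects and return the final URL.
  @tailrec
  def resolve(uri: String, maxHops: Int = 5): String = {
    val conn = new URL(uri).openConnection().asInstanceOf[HttpURLConnection]
    conn.setInstanceFollowRedirects(false)
    conn.setRequestMethod("HEAD")
    val code = conn.getResponseCode
    val location = conn.getHeaderField("Location")
    conn.disconnect()
    if (maxHops > 0 && code >= 300 && code < 400 && location != null)
      resolve(location, maxHops - 1)
    else uri
  }

  def main(args: Array[String]): Unit = {
    // Resolve the repository URL before passing it to JGit.
    println(resolve("https://github.com/virtualramblas/publishtokafka-plugin.git"))
  }
}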

Unit testing Spark applications in Scala (Part 1): Intro to ScalaTest

This new series is about exploring useful frameworks and practices for unit testing Spark applications implemented in Scala. This first post is a quick introduction to ScalaTest, the most popular unit testing framework for Scala applications. ScalaTest can also be used to do unit testing with other languages like Scala.js and Java, but this post will focus on Scala only. I am going to refer to the latest stable version (3.0.1) at the time this post is being written. Let's see how ScalaTest works. Suppose we have a very simple Scala application with 2 classes, one called Basket: package ie.googlielmo.scalatestexample import scala.collection.mutable.ArrayBuffer class Basket { private val fruits = new ArrayBuffer[Fruit] def addFruit(fruit: Fruit) { fruits += fruit } def removeFruit(fruit: Fruit) { fruits -= fruit } def getFruits = fruits.toList } which has a single attribute, an ArrayBuffer of Fruit, a simple case class: package ie.googlielmo.scalatestexample
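As a preview of where this is going, here is a minimal ScalaTest suite for Basket written in FunSuite style. Since the excerpt above is cut off before the Fruit definition, the sketch assumes, purely for illustration, that Fruit is a case class with a single name field.

import org.scalatest.{BeforeAndAfter, FunSuite}

class BasketSuite extends FunSuite with BeforeAndAfter {
  var basket: Basket = _

  // A fresh basket before each test keeps the tests independent.
  before {
    basket = new Basket
  }

  test("an added fruit ends up in the basket") {
    basket.addFruit(Fruit("apple"))   // Fruit's single-field constructor is assumed here
    assert(basket.getFruits === List(Fruit("apple")))
  }

  test("a removed fruit no longer appears in the basket") {
    basket.addFruit(Fruit("orange"))
    basket.removeFruit(Fruit("orange"))
    assert(basket.getFruits.isEmpty)
  }
}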

Publish to Kafka Jenkins plugin release 1.0 available

I am glad to announce that the first major release of the Jenkins post-build action plugin to publish build job execution data to a Kafka topic is now available in my GitHub space. You can find all the details in the project README file. Please send me your feedback by commenting on this post or by adding a new issue (enhancement, change request or bug) to the GitHub plugin repository if you have a chance to use it. Thanks a lot.

Evaluating Pinpoint APM (Part 3)

Having completed all of the steps described in the first two posts of this series, you should be able to start and use Pinpoint. To test that everything is working fine you can use the testapp web application which is part of its quickstart bundle. For this purpose you can start the collector and the web UI from the quickstart as well: %PINPOINT_HOME%\quickstart\bin\start-collector.cmd %PINPOINT_HOME%\quickstart\bin\start-web.cmd Then start the testapp application: %PINPOINT_HOME%\quickstart\bin\start-testapp.cmd Check that everything is fine by connecting to the web UIs:     Pinpoint Web - http://localhost:28080     TestApp - http://localhost:28081 Perform some actions in the testapp application and see through the web UI which information is sent to Pinpoint. Now you can profile any Java web or standalone application of yours. You need to download the agent jar to any location on the machine hosting the application. Then, for standalone applications, you need to run th

Evaluating Pinpoint APM (Part 2)

This second post of the Pinpoint series covers the configuration of the HBase database where the monitoring data are written by the collector and from which they are read by the web UI. I did the first evaluation of Pinpoint on an MS Windows machine, so here I am going to cover some specific installation details for this OS family. For initial evaluation purposes a standalone HBase server (which runs all daemons within a single JVM) is enough. Database installation Here I am referring to the latest stable release (1.2.4) of HBase available at the time this post is being written. This release supports both Java 7 and Java 8: I am referring to Java 8 here. Cygwin isn't going to be used for this installation. You start by downloading the tarball with the HBase binaries and then unpacking its content. Rename the hbase-1.2.4 directory to hbase. Set up the JAVA_HOME variable to the JRE to use (if you haven't already done it on this machine). E

Evaluating Pinpoint APM (Part 1)

I started a journey evaluating Open Source alternatives to the commercial New Relic and AppDynamics tools to check whether any of them is really ready to be used in a production environment. One cross-platform Application Performance Management (APM) tool that particularly caught my attention is Pinpoint. The current release mostly supports Java applications and JEE application servers, and it also provides support for the most popular Open Source and commercial relational databases. APIs are available to implement new plugins to support specific systems. Pinpoint has been modeled after Google Dapper and promises to install agents without changing a single line of code and with minimal impact (about a 3% increase in resource usage) on application performance. Pinpoint is licensed under the Apache License, Version 2.0. Architecture Pinpoint has three main components:  - The collector: it receives monitoring data from the profiled applications and stores that information in HBase.  - The web UI: the front-end