
Unit testing Spark applications in Scala (Part 1): Intro to ScalaTest

This new series is about exploring useful frameworks and practices for unit testing Spark applications implemented in Scala. This first post is a quick introduction to ScalaTest, the most popular unit testing framework for Scala applications.
ScalaTest can also be used to test code written for other targets and languages, such as Scala.js and Java, but this post will focus on Scala only. I am going to refer to the latest stable version (3.0.1) at the time this post is being written.
Let's see how ScalaTest works. Suppose we have a very simple Scala application with 2 classes, one called Basket:

package ie.googlielmo.scalatestexample

import scala.collection.mutable.ArrayBuffer

class Basket  {
  private val fruits = new ArrayBuffer[Fruit]

  def addFruit (fruit: Fruit) { fruits += fruit}
  def removeFruit (fruit: Fruit) { fruits -= fruit}
  def getFruits = fruits.toList
}
 
which has a single attribute, an ArrayBuffer of Fruit, where Fruit is a simple case class:

package ie.googlielmo.scalatestexample

case class Fruit(name:String)

and methods to add a fruit to the basket, remove one from it, and get the list of the fruits currently in it.
In order to write unit tests for the Basket class we need to add the ScalaTest dependency in the sbt build file for the project as follows:

libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"
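
For context, here is a minimal build.sbt sketch in which that line could live (the project name and versions are illustrative only; the Scala version just has to match the one used by your project, 2.11.x in the Maven snippet below):

name := "scalatest-example"

version := "0.1.0"

scalaVersion := "2.11.8"

libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.1" % "test"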

Or, if you use Maven, add the dependency to the project pom.xml file as follows:

<dependency>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest_2.11</artifactId>
  <version>3.0.1</version>
  <scope>test</scope>
</dependency>

The central unit of composition in ScalaTest is Suite, which represents a suite of tests. ScalaTest provides several style traits that extend Suite and override its lifecycle methods to support different testing styles. If you come from TDD and/or are familiar with JUnit, the most natural starting point is probably the FunSuite style trait: it is essentially a JUnit-like style that also brings some of the benefits of BDD. With it, a test suite for the Basket class could be implemented this way:

package ie.googlielmo.scalatestexample

import org.scalatest.{BeforeAndAfter, FunSuite}

class BasketSuite extends FunSuite with BeforeAndAfter {
  var basket : Basket = _

  before {
    basket = new Basket()
  }

  test("An empty Basket should have 0 fruits") {
    assert(basket.getFruits.size == 0)
  }

  test("Adding one fruit") {
    basket.addFruit(new Fruit("Pear"))
    assert(basket.getFruits.size == 1)
  }

  test("Removing one fruit") {
    basket.addFruit(new Fruit("Banana"))
    basket.removeFruit(new Fruit("Pear"))
    assert(basket.getFruits.size == 1)
  }
}

The BeforeAndAfter trait provides the before and after methods, which run before and after each test and can be used to initialize and clean up the objects shared by the unit tests of a suite class. The description given to each test is the label you will see in the reports generated at the end of the test run.
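
The suite above only needs before, but, for completeness, here is a minimal sketch of how an after block could be added as well (the class name and the clean-up shown are purely illustrative):

package ie.googlielmo.scalatestexample

import org.scalatest.{BeforeAndAfter, FunSuite}

// Hypothetical variant of BasketSuite showing both hooks
class BasketSuiteWithCleanup extends FunSuite with BeforeAndAfter {
  var basket: Basket = _

  // Runs before each test: start from a fresh, empty basket
  before {
    basket = new Basket()
  }

  // Runs after each test: clean-up goes here (dropping the reference is just an example)
  after {
    basket = null
  }

  test("An empty Basket should have 0 fruits") {
    assert(basket.getFruits.isEmpty)
  }
}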

If you are used to BDD, probably a better choice for you would be the FlatSpec style trait. The implementation of the unit tests for the Basket class would then be something like this:

package ie.googlielmo.scalatestexample

import org.scalatest.FlatSpec

class BasketSpec extends FlatSpec {
  var basket : Basket = new Basket()

  "An empty Basket" should "have 0 fruits" in {
    assert(basket.getFruits.size == 0)
  }

  it should "contain 1 fruit when adding one" in {
    basket.addFruit(new Fruit("Pear"))
    assert(basket.getFruits.size == 1)
  }

  it should "contain 1 fruit when removing one" in {
    basket.addFruit(new Fruit("Banana"))
    basket.removeFruit(new Fruit("Pear"))
    assert(basket.getFruits.size == 1)
  }
}

Again, the description chosen for each test here is the label shown in the reports generated at the end of the test run.
There are eight style traits extending Suite available in ScalaTest. I am not going to describe all of them in this post; you can find all of the details in the official framework documentation. I will probably add more details in the next posts of this series when describing specific test cases for Spark applications.
It is also possible, in case you need to, to mix different testing styles within the same Scala project. The full battery of unit tests will be executed anyway, as shown in the screenshot below.
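
For reference, with sbt (assuming the standard src/test/scala layout) a single run picks up every suite regardless of its style, and a specific suite can be targeted by its class name, as in the examples above:

sbt test
sbt "testOnly ie.googlielmo.scalatestexample.BasketSpec"

The second form is handy while iterating on a single suite.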

Note for Maven users only: you need to disable Surefire and enable ScalaTest in order to execute the ScalaTest unit tests. This can be done by registering the following plugin in the project pom.xml file:
 
<plugin>
  <groupId>org.scalatest</groupId>
  <artifactId>scalatest-maven-plugin</artifactId>
  <version>1.0</version>
  <configuration>
    <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
    <junitxml>.</junitxml>
    <filereports>WDF TestSuite.txt</filereports>
  </configuration>
  <executions>
    <execution>
      <id>test</id>
      <goals>
        <goal>test</goal>
      </goals>
    </execution>
  </executions>
</plugin>
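
The snippet above only enables ScalaTest; disabling Surefire, as mentioned, is usually done with a separate plugin entry along these lines (the version number is indicative, check the ScalaTest Maven plugin documentation for your setup):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.7</version>
  <configuration>
    <skipTests>true</skipTests>
  </configuration>
</plugin>

With both plugins registered, a plain mvn test run executes the ScalaTest suites and writes the reports to the configured reportsDirectory.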

In the next post we will learn how to start unit testing simple Spark jobs and why ScalaTest alone isn't enough for this purpose. Stay tuned!
