MRUnit Tutorial

Apache MRUnit (https://mrunit.apache.org/) is an open source library for unit-testing Hadoop Mappers, Reducers, and complete MapReduce programs. It integrates MapReduce with standard testing libraries such as JUnit and Mockito, and provides a set of interfaces and test harnesses that bridge the gap between MapReduce programs and those libraries. It doesn't replace JUnit; it works on top of it.
This post assumes you are already familiar with Hadoop MapReduce and JUnit.
The three core classes of MRUnit are the following:
MapDriver: the driver class responsible for calling the Mapper’s map() method.
ReduceDriver: the driver class responsible for calling the Reducer’s reduce() method.
MapReduceDriver: the combined MapReduce driver responsible for calling the Mapper’s map() method first, followed by an in-memory shuffle phase. At the end of this phase the Reducer’s reduce() method is invoked.
Each of the classes above has methods for providing the inputs and the expected outputs of a test. The JUnit setUp() method (annotated with @Before) is responsible for creating new instances of the Mapper, the Reducer, and the MRUnit drivers needed for each test.

To add MRUnit to a Hadoop MapReduce project, declare it as a test dependency in the project POM file (of course the project is a Maven project; I am sure you're not planning to skip Maven for this kind of project):

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>

and of course JUnit should be present as well:

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>

I am not referring to any particular IDE in this discussion. The example project can be created and managed through Maven from a shell. You can then import it into your favourite IDE.

The MapReduce application under test is the popular word counter (the Hello World of MapReduce). It processes text files and counts how often words occur. You can find hundreds of versions of this example on the web, but just in case, here's the code for the Mapper:
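import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Emits each input value as a single word with a count of 1.
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String w = value.toString();
        context.write(new Text(w), new IntWritable(1));
    }
}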


and the Reducer

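import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums all the counts received for a key and emits (key, total).
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}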
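
Before writing the tests, it can help to see which results the word counter should produce. What follows is a minimal plain-Java sketch, not Hadoop or MRUnit code: the class WordCountFlowSketch and its methods are hypothetical names introduced here only for illustration. It mimics the Mapper (one pair per input value), an in-memory shuffle that groups the values by key, and the Reducer (one sum per key), under the assumption that keys reach the reducer in sorted order.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop or MRUnit: it mimics the word
// count flow (map, shuffle, reduce) on plain Java collections.
class WordCountFlowSketch {

    // Map step: like WordCountMapper, treat each input value as one word
    // and pair it with a count of 1.
    static List<Map.Entry<String, Integer>> map(List<String> values) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String value : values) {
            pairs.add(new AbstractMap.SimpleEntry<>(value, 1));
        }
        return pairs;
    }

    // Shuffle step: group the counts by key. The TreeMap keeps the keys
    // sorted (an assumption about the order seen by the reducer).
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: like WordCountReducer, sum the counts of each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int count : entry.getValue()) {
                sum += count;
            }
            totals.put(entry.getKey(), sum);
        }
        return totals;
    }

    // Full flow: map, then shuffle, then reduce.
    static Map<String, Integer> run(List<String> values) {
        return reduce(shuffle(map(values)));
    }

    public static void main(String[] args) {
        // prints {hadoop=2, mrunit=1}
        System.out.println(run(Arrays.asList("hadoop", "hadoop", "mrunit")));
    }
}
```

For the sample inputs hadoop, hadoop, mrunit the sketch yields hadoop=2 and mrunit=1. These are the kind of expected pairs you pass to withOutput() when testing the full MapReduce flow, listed in key order.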


Now let's start implementing an MRUnit test case. It has the same structure as a JUnit test case. First, declare the instance variables needed by the tests, including the drivers provided by the MRUnit framework:
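private Mapper<LongWritable, Text, Text, IntWritable> mapper;
private Reducer<Text, IntWritable, Text, IntWritable> reducer;
private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
private ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
private MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;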


Then create instances of them in the JUnit setUp() method:

@Before
public void setUp() throws Exception {
    mapper = new WordCountMapper();
    reducer = new WordCountReducer();
    // One driver per test scope: Mapper only, Reducer only, full flow.
    mapDriver = new MapDriver<>(mapper);
    reduceDriver = new ReduceDriver<>(reducer);
    mapReduceDriver = new MapReduceDriver<>(mapper, reducer);
}

To test the Mapper, you just need to provide the inputs and the expected outputs to the MapDriver instance and then call its runTest() method:

@Test
public void testWordCountMapper() throws IOException {
    // The Mapper emits (value, 1) for each input, so every input word
    // is expected back with a count of 1.
    mapDriver.withInput(new LongWritable(1), new Text(firstTestKey))
        .withInput(new LongWritable(2), new Text(secondTestKey))
        .withOutput(new Text(firstTestKey), new IntWritable(1))
        .withOutput(new Text(secondTestKey), new IntWritable(1))
        .runTest();
}


As you can see from the code above, MRUnit supports multiple inputs. firstTestKey and secondTestKey are String variables that you can also initialize in the setUp() method.
The process for testing the Reducer is the same:

@Test
 public void testWordCountReducer() throws IOException {
        Text firstMapKey = new Text(firstTestKey);
        List<IntWritable> firstMapValues = new ArrayList<IntWritable>();
        firstMapValues.add(new IntWritable(1));
        firstMapValues.add(new IntWritable(1));
       
        Text secondMapKey = new Text(secondTestKey);
        List<IntWritable> secondMapValues = new ArrayList<IntWritable>();
        secondMapValues.add(new IntWritable(1));
        secondMapValues.add(new IntWritable(1));
        secondMapValues.add(new IntWritable(1));
       
        reduceDriver.withInput(firstMapKey, firstMapValues)
        .withInput(secondMapKey, secondMapValues)
        .withOutput(firstMapKey, new IntWritable(2))
        .withOutput(secondMapKey, new IntWritable(3))
        .runTest();
 }


The same applies to the overall MapReduce flow:

@Test
 public void testWordCountMapReducer() throws IOException {
        mapReduceDriver.withInput(new LongWritable(1), new Text(firstTestKey))
        .withInput(new LongWritable(2), new Text(firstTestKey))
        .withInput(new LongWritable(3), new Text(secondTestKey))
        .withOutput(new Text(firstTestKey), new IntWritable(2))
        .withOutput(new Text(secondTestKey), new IntWritable(1))
        .runTest();
 }


Only the driver changes between the three tests. MRUnit test cases are executed the same way as any other JUnit test case, so you can run them all together for your application. When you execute the command

mvn test

Maven will run all of the unit tests (both JUnit and MRUnit) available for the given MapReduce application and generate their execution reports.
