
MRUnit Tutorial

Apache MRUnit (https://mrunit.apache.org/) is an Open Source library for unit testing Hadoop Mappers, Reducers, and complete MapReduce programs. Through a set of interfaces and test harnesses, it provides a convenient integration between MapReduce code and standard testing libraries such as JUnit and Mockito, bridging the gap between MapReduce programs and traditional unit tests. It doesn't replace JUnit; it works on top of it.
Before reading further, please be aware that knowledge of Hadoop MapReduce and JUnit is required for a better understanding of this post.
The three core classes of MRUnit are the following:
MapDriver: the driver class responsible for calling the Mapper's map() method.
ReduceDriver: the driver class responsible for calling the Reducer's reduce() method.
MapReduceDriver: the combined MapReduce driver, which first calls the Mapper's map() method, then performs an in-memory Shuffle phase, and finally invokes the Reducer's reduce() method.
Each of the classes above provides methods for supplying inputs and expected outputs to the tests. The JUnit setUp() method (annotated with @Before) is responsible for creating new instances of the Mapper, the Reducer, and the appropriate MRUnit drivers needed for each specific test.
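To see what that in-memory Shuffle phase does, here is a minimal, Hadoop-free sketch in plain Java (the words used are just illustrative data, not from the example project): the (key, value) pairs emitted by the Mapper are grouped by key, and each key with its grouped values is what a single reduce() invocation receives.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Mapper output: (word, 1) pairs, in emission order
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("hadoop", 1),
                Map.entry("mrunit", 1),
                Map.entry("hadoop", 1));

        // Shuffle: group the values by key (keys end up sorted, as in Hadoop)
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : mapOutput) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        // Each entry corresponds to one reduce() call: (key, [values])
        System.out.println(grouped); // {hadoop=[1, 1], mrunit=[1]}
    }
}
```

This is exactly the grouping the MapReduceDriver simulates in memory between the map and reduce calls.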

To add MRUnit to a Hadoop MapReduce project, add it as a test dependency in the project POM file (this assumes the project is a Maven project; I am sure you're not planning to skip Maven for this kind of project):

<dependency>
    <groupId>org.apache.mrunit</groupId>
    <artifactId>mrunit</artifactId>
    <version>1.1.0</version>
    <classifier>hadoop2</classifier>
    <scope>test</scope>
</dependency>

and of course JUnit should be present as well:

<dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
</dependency>

I am not referring to any particular IDE in this discussion. The example project can be created and managed through Maven from a shell; you can then import it into your favourite IDE.
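For example, a project skeleton can be generated with the standard Maven quickstart archetype (the groupId and artifactId below are placeholders; pick your own):

```shell
mvn archetype:generate \
    -DgroupId=com.example.hadoop \
    -DartifactId=wordcount \
    -DarchetypeArtifactId=maven-archetype-quickstart \
    -DinteractiveMode=false

# then build and run the tests from the project root
cd wordcount
mvn test
```

After adding the MRUnit and JUnit dependencies shown above to the generated POM, the project is ready for the code that follows.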

The MapReduce application under test is the popular word counter (the "Hello World" of MapReduce): it processes text files and counts how often words occur. Browsing the web you can find hundreds of links to this example, but just in case, here's the code for the Mapper (note that, for simplicity, it treats each input line as a single word):
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Simplified: each input line is treated as a single word
        String word = value.toString();
        context.write(new Text(word), new IntWritable(1));
    }
}


and the Reducer

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum all the counts emitted for this word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}


Now let's implement an MRUnit test case. It has the same structure as a JUnit test case. First of all, declare the instance variables needed for the tests. They include the drivers provided by the MRUnit framework:
private Mapper mapper;
private Reducer reducer;
private MapDriver mapDriver;
private ReduceDriver reduceDriver;
private MapReduceDriver mapReduceDriver;


Then create instances of them in the JUnit setUp() method (the two test-key values are arbitrary examples, initialized here so the tests below can use them):

@Before
public void setUp() throws Exception {
    mapper = new WordCountMapper();
    reducer = new WordCountReducer();
    mapDriver = new MapDriver(mapper);
    reduceDriver = new ReduceDriver(reducer);
    mapReduceDriver = new MapReduceDriver(mapper, reducer);
    firstTestKey = "mrunit";     // example value; any word will do
    secondTestKey = "blogspot";  // example value, matching the tests below
}

To test the Mapper, you just provide the inputs and the expected outputs to the MapDriver instance and then execute its runTest() method:

@Test
public void testWordCountMapper() throws IOException {
    mapDriver.withInput(new LongWritable(1), new Text(firstTestKey))
            .withInput(new LongWritable(2), new Text(secondTestKey))
            .withOutput(new Text(firstTestKey), new IntWritable(1))
            .withOutput(new Text(secondTestKey), new IntWritable(1))
            .runTest();
}


As you can see from the code above, MRUnit supports multiple inputs. firstTestKey and secondTestKey are String instance variables that can be initialized in the setUp() method. Testing the Reducer follows the same process:

@Test
public void testWordCountReducer() throws IOException {
    Text firstMapKey = new Text(firstTestKey);
    List<IntWritable> firstMapValues = new ArrayList<IntWritable>();
    firstMapValues.add(new IntWritable(1));
    firstMapValues.add(new IntWritable(1));

    Text secondMapKey = new Text(secondTestKey);
    List<IntWritable> secondMapValues = new ArrayList<IntWritable>();
    secondMapValues.add(new IntWritable(1));
    secondMapValues.add(new IntWritable(1));
    secondMapValues.add(new IntWritable(1));

    reduceDriver.withInput(firstMapKey, firstMapValues)
            .withInput(secondMapKey, secondMapValues)
            .withOutput(firstMapKey, new IntWritable(2))
            .withOutput(secondMapKey, new IntWritable(3))
            .runTest();
}


and the overall MapReduce flow

@Test
public void testWordCountMapReducer() throws IOException {
    mapReduceDriver.withInput(new LongWritable(1), new Text(firstTestKey))
            .withInput(new LongWritable(2), new Text(firstTestKey))
            .withInput(new LongWritable(3), new Text(secondTestKey))
            .withOutput(new Text(firstTestKey), new IntWritable(2))
            .withOutput(new Text(secondTestKey), new IntWritable(1))
            .runTest();
}


Only the specific driver to use changes. Any MRUnit test case can be executed the same way as a JUnit test case, so you can run all of them together for your applications. When you execute the command

mvn test

Maven will run all of the unit tests (both JUnit and MRUnit) available for the given MapReduce application and generate their execution reports.
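During development it is often handy to run a single test class rather than the whole suite; the Maven Surefire plugin supports this through the test property (the class and method names below are examples, use whatever you called your MRUnit test class):

```shell
# run only one test class
mvn -Dtest=WordCountTest test

# run a single test method within that class
mvn -Dtest=WordCountTest#testWordCountMapper test
```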
