
MapReduce: Design Patterns

  • Writer: Bhupendra Patni
  • May 17, 2015
  • 3 min read

Thank you to O’Reilly, Donald Miner & Adam Shook for the book MapReduce Design Patterns, which I have referred to throughout this post.

This blog post is intended for new Hadoop developers with a basic understanding of MapReduce, and provides a high-level overview of MapReduce development and its design patterns.

MapReduce Framework

MapReduce is a framework for processing data that resides on hundreds of servers in a distributed cluster. The solution you are developing has to fit into the framework of Map and Reduce, which in some situations can be challenging.

Design Patterns

Design patterns are tools for developers to solve complex problems in a generic, reusable way. A good pattern sits at the right level of abstraction: not so specific that there are too many of them to remember and they are too hard to tailor to a problem, and not so generic that it takes too much work to turn one into a solution.

MapReduce Design Patterns

In this section, we provide a high-level overview of the various MapReduce design patterns:

Summarization Patterns

Summarization patterns are primarily focused on numerical aggregation, index summarization, and similar computations.

Numerical summarization is a pattern for calculating aggregate statistical values over a given dataset, e.g. minimum, maximum, sum, average, median, and standard deviation.

SQL Resemblance:

SELECT colx, MIN(numericCol1), MAX(numericCol1), COUNT(1) FROM table

GROUP BY colx;
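
A minimal sketch of this pattern in Java follows; the tab-delimited layout, with colx in the first field and numericCol1 in the second, is an assumption for illustration. The mapper emits the numeric value keyed by the grouping column, and the reducer computes MIN, MAX, and COUNT per key, mirroring the SQL above.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class MinMaxCount {

    // Mapper: emits (colx, numericCol1) for each input record.
    public static class MinMaxCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private final Text outKey = new Text();
        private final LongWritable outValue = new LongWritable();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t"); // assumed layout
            outKey.set(fields[0]);
            outValue.set(Long.parseLong(fields[1]));
            context.write(outKey, outValue);
        }
    }

    // Reducer: computes MIN, MAX, and COUNT for each group key.
    public static class MinMaxCountReducer
            extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long min = Long.MAX_VALUE, max = Long.MIN_VALUE, count = 0;
            for (LongWritable v : values) {
                min = Math.min(min, v.get());
                max = Math.max(max, v.get());
                count++;
            }
            context.write(key, new Text(min + "\t" + max + "\t" + count));
        }
    }
}

Because min, max, and count are associative, a combiner with the mapper’s output types could also be added to cut shuffle traffic.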

Index summarization is used to enable fast searches over large datasets.
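
For example, a simple inverted index maps each term to the documents containing it, so a search only has to look up the term. A minimal sketch, assuming a “docId<TAB>document text” input layout:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {

    // Mapper: emits (term, docId) for every term in the document body.
    public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text term = new Text();
        private final Text docId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t", 2); // assumed layout
            if (parts.length != 2) return;
            docId.set(parts[0]);
            for (String token : parts[1].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                term.set(token);
                context.write(term, docId);
            }
        }
    }

    // Reducer: concatenates the unique document IDs containing each term.
    public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            Set<String> docs = new HashSet<>();
            for (Text v : values) docs.add(v.toString());
            context.write(key, new Text(String.join(",", docs)));
        }
    }
}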

Filtering Patterns

Filtering patterns do not change the actual data; they are primarily used to find the subset of records matching given criteria. MapReduce filtering is performed on data distributed across the cluster.

SQL Resemblance:

SELECT col1, col2, col3 FROM table

WHERE col1 < 100;

There are many filtering scenarios, such as random sampling, matching records against given criteria, and extracting the top n records. They are primarily used for data cleansing, taking a closer look at a subset of the data, and removing low-scoring records.
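
A minimal sketch of the basic filter in Java, mirroring the WHERE clause above (the tab-delimited layout with col1 as the first field is an assumption): records that fail the predicate are simply never emitted. Filtering is typically map-only, so the driver would also call job.setNumReduceTasks(0).

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only filter: emits a record only when it satisfies the predicate.
public class FilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t"); // assumed layout
        if (Long.parseLong(fields[0]) < 100) {          // WHERE col1 < 100
            context.write(NullWritable.get(), value);   // keep the record unchanged
        }
    }
}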

Data Organization Patterns

Data organization patterns are primarily used to reorganize data for optimal retrieval and processing. In distributed systems, data processing performance can be optimized with appropriate sorting, partitioning, and sharding.

The data in these patterns can be transformed through several sub-patterns:

  • From structured to hierarchical pattern

  • Partitioning and bucketing pattern (see the partitioner sketch after this list)

  • Sorting and shuffling pattern
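
As an illustration of the partitioning pattern, a custom Partitioner routes each record to a reducer, and therefore to an output shard, by a chosen attribute. A minimal sketch, assuming a hypothetical key of the form “region-id” (e.g. “US-1234”):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes records to reducers (and therefore output shards) by the
// hypothetical "region" prefix of the key, e.g. "US-1234" -> the "US" shard.
public class RegionPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String region = key.toString().split("-", 2)[0];
        // Mask the sign bit so the partition index is never negative.
        return (region.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

The driver registers it with job.setPartitionerClass(RegionPartitioner.class); records with the same region then land in the same output file.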

Join Patterns

In SQL, joins are performed with simple commands; with MapReduce, however, joins are not as straightforward. SQL supports many types of joins, such as INNER JOIN, OUTER JOIN, and CROSS JOIN.

With MapReduce, joins are performed on either the map side or the reduce side, depending on data locality. Map-side joins are performed when the partial datasets are available in the same location, while reduce-side joins are performed on the datasets after the shuffle and sort phase, on the reducer.

A replicated join is a special type of map-side join between one large and one small dataset, where the small dataset is replicated to every mapper.
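
A minimal sketch of a replicated join; both datasets are assumed tab-delimited with the join key in the first field, and the driver is assumed to have shipped the small file with job.addCacheFile(...). Each mapper loads the small dataset into memory during setup, then streams the large dataset through map():

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Replicated (map-side) join: the small dataset is copied to every mapper
// and held in memory; the large dataset streams through map().
public class ReplicatedJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles == null) return;
        for (URI cacheFile : cacheFiles) {
            // Cache files are symlinked into the task's working directory.
            String localName = new Path(cacheFile.getPath()).getName();
            try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] kv = line.split("\t", 2); // assumed: joinKey<TAB>value
                    if (kv.length == 2) smallTable.put(kv[0], kv[1]);
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t", 2); // assumed layout
        if (fields.length < 2) return;
        String matched = smallTable.get(fields[0]);
        if (matched != null) { // inner-join semantics: emit only matching keys
            context.write(new Text(fields[0]), new Text(fields[1] + "\t" + matched));
        }
    }
}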

Meta Patterns

The term meta patterns translates directly to “patterns about patterns”.

The first meta pattern is job chaining: combining several patterns to solve a complex, multistage problem. A master driver application is used to chain multiple MapReduce jobs into one meta job.
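
A hedged sketch of such a driver; the identity Mapper and Reducer classes stand in for the real stages, and the three path arguments are placeholders. The second job reads the first job’s output directory and runs only if the first job succeeds:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedJobDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);
        Path output = new Path(args[2]);

        // Stage one: identity Mapper/Reducer stand in for the first pattern.
        Job first = Job.getInstance(conf, "stage one");
        first.setJarByClass(ChainedJobDriver.class);
        first.setMapperClass(Mapper.class);    // placeholder for the real mapper
        first.setReducerClass(Reducer.class);  // placeholder for the real reducer
        first.setOutputKeyClass(LongWritable.class);
        first.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1); // stop if stage one fails

        // Stage two consumes stage one's output directory.
        Job second = Job.getInstance(conf, "stage two");
        second.setJarByClass(ChainedJobDriver.class);
        second.setMapperClass(Mapper.class);
        second.setReducerClass(Reducer.class);
        second.setOutputKeyClass(LongWritable.class);
        second.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
    }
}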

The ChainMapper and ChainReducer classes allow you to run multiple map phases within the mapper and additional map phases after the reducer, effectively expanding the traditional map-then-reduce paradigm into several map phases, followed by a reduce phase, followed by several more map phases.
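
A minimal driver sketch using these classes from org.apache.hadoop.mapreduce.lib.chain, with identity Mapper and Reducer classes standing in for real implementations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainReducer;

// Sketch of a chained pipeline: map+ / reduce / map* within a single job.
public class ChainExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "chain example");
        job.setJarByClass(ChainExample.class);

        // Two map phases run back to back inside the map task.
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainMapper.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));

        // One reduce phase, then another map phase applied to its output.
        ChainReducer.setReducer(job, Reducer.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        ChainReducer.addMapper(job, Mapper.class, LongWritable.class, Text.class,
                LongWritable.class, Text.class, new Configuration(false));
        // ... set input/output paths, then job.waitForCompletion(true)
    }
}

Each call appends one phase, but all phases still execute within a single MapReduce job, avoiding the I/O cost of writing intermediate results between separate jobs.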

The second method is job merging, an optimization for performing several analytics in the same MapReduce job. Job merging is a process that allows two unrelated jobs that load the same data to share the MapReduce pipeline.
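
One way to sketch job merging (an illustration, not the only recipe) is with MultipleOutputs: a single pass over the shared input evaluates both analyses, and each writes to a named output registered in the driver via MultipleOutputs.addNamedOutput(...). The “error” and “timeout” predicates below are hypothetical stand-ins for the two jobs’ map logic.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// One pass over the shared input feeds two unrelated analyses; each writes
// to its own named output ("jobA" / "jobB", registered in the driver).
public class MergedJobMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical predicates standing in for the two analyses' map logic.
        if (value.toString().contains("error")) {
            mos.write("jobA", NullWritable.get(), value);
        }
        if (value.toString().contains("timeout")) {
            mos.write("jobB", NullWritable.get(), value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close(); // flush both named outputs
    }
}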

Input & Output Patterns

In these patterns, we do not load or store data the way standard Hadoop MapReduce does out of the box. In many cases, we may want to skip the expensive step of storing data in Hadoop and instead read directly from the original source, or feed the output directly to some process that uses it after MapReduce finishes.

In these scenarios, we implement a custom RecordReader and RecordWriter that read from the source and write data to the target.
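
A skeleton of the RecordReader half, with method bodies trimmed to comments; the external source is hypothetical, and a matching custom InputFormat would return this reader from createRecordReader(...):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Skeleton RecordReader that would pull records from an external source
// (e.g. a REST endpoint or a database cursor) instead of HDFS.
public class ExternalSourceRecordReader extends RecordReader<LongWritable, Text> {
    private long recordsRead = 0;
    private final LongWritable currentKey = new LongWritable();
    private final Text currentValue = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // Open the connection to the external source described by the split.
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        // Fetch the next record; set currentKey/currentValue and increment
        // recordsRead; return false when the source is exhausted.
        return false; // placeholder: no external source wired up in this sketch
    }

    @Override
    public LongWritable getCurrentKey() { return currentKey; }

    @Override
    public Text getCurrentValue() { return currentValue; }

    @Override
    public float getProgress() { return 0.0f; } // unknown for a streaming source

    @Override
    public void close() throws IOException {
        // Release the external connection.
    }
}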

This has been a high-level summary of several design patterns that are very useful when developing your custom MapReduce jobs.

Thank you for reading. I hope the information in this post is helpful to you. Please provide your feedback and share the post with others.

