Chances are that when you started learning MapReduce, the first example you encountered was counting how many times a word appears in a given text or set of texts. This example is sometimes referred to as the “Hello, world!” of MapReduce. It is straightforward enough, and it illustrates nicely how MapReduce works.
But once you’ve got “Hello, world!” out of the way, what next? How do you become comfortable applying the principles of MapReduce in real-world situations? For me, the book MapReduce Design Patterns by Donald Miner and Adam Shook was the next step.
The authors of the book state that “…motivation for us to write this book was to fill a missing gap we saw in a lot of new MapReduce developers. They had learned how to use the system, got comfortable with writing MapReduce, but were lacking the experience to understand how to do things right or well.” They further explain that the intent of the book is to show readers how experts have worked out solutions to common problems with MapReduce. As a result, readers benefit from that expertise rather than having to learn from their own mistakes.
MapReduce is a framework in the sense that you have to fit your solution into the structure of map and reduce, which in some situations can be challenging, especially when you are just starting out. Some problems fit this structure naturally and some do not, which limits the options you have at your disposal. At the same time, figuring out how to solve a problem within the constraints imposed by the MapReduce framework requires cleverness and, above all, a change in thinking compared to coding in conventional programming languages.
The sets of design patterns that are covered in this book include:
- Summarization patterns, for example counting, finding the minimum or maximum value, summarization and grouping, calculating the average
- Filtering patterns, for example viewing subsets of data, sometimes aided by applying Bloom filters, which are explained in more detail in the appendix
- Data organization patterns, geared towards reorganizing data to make MapReduce analysis easier, for example sorting
- Join patterns used to analyze different datasets together to discover interesting relationships, which is somewhat analogous to SQL joins
- Metapatterns to combine more than one pattern to solve multi-stage problems, or to perform several analytics in the same job
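To make the flavor of these patterns concrete, here is a minimal sketch of a summarization pattern in plain Python (my own illustration, not code from the book): the mapper emits key–value pairs, a simulated shuffle groups values by key, and the reducer summarizes each group into a minimum, maximum, and average.

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs, e.g. (user, purchase amount).
def map_phase(records):
    for user, amount in records:
        yield user, amount

# Shuffle: group values by key, as the framework would between map and reduce.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: summarize one key's values into (min, max, average).
def reduce_phase(key, values):
    return key, (min(values), max(values), sum(values) / len(values))

records = [("alice", 10), ("bob", 4), ("alice", 6), ("bob", 8)]
summary = dict(reduce_phase(k, v) for k, v in shuffle(map_phase(records)).items())
# summary["alice"] == (6, 10, 8.0)
```

The other pattern families follow the same shape; what changes is the logic inside the map and reduce functions.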
Example of a design pattern: Sorting
The most enlightening part of the book for me was the sorting design pattern. The first piece of advice I came across was that this is one of the more complicated patterns, and the authors warn that it should be used sparingly. While in the relational world of SQL we sort all the time, how frequently do we really need sorting in the world of big data? What does sorting even mean when massive amounts of data are constantly flowing in? It is not surprising that sorting is less common in such scenarios.
The reason that sorting in MapReduce is so complicated is that it is not easily parallelizable: you cannot simply apply the typical sorting algorithms, which most often rely on recursion. In MapReduce you first have to determine a set of partitions, divided by ranges of values, that will produce roughly equal-sized subsets of the data. Each reducer then sorts one range: the lowest range of data goes to the first reducer, the next range goes to the second reducer, and so on. Finally, all of these ranges are concatenated in order to produce the sorted result.
This pattern has two phases: an analyze phase that determines the ranges, and an order phase that actually sorts the data. The analyze phase is optional in a sense: you need to run it only once if the distribution of your data does not change quickly over time, because the value ranges it produces will continue to perform well. Also, in some cases you may be able to guess the partitions yourself, especially if the data is evenly distributed. But if you cannot avoid the analyze phase, you may be able to speed it up with random sampling of the data. The principle is that partitions that evenly split a random sample should also split the full data set evenly.
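The two phases can be sketched in plain Python (my own simplification, assuming numeric keys and a fixed number of reducers): the analyze phase samples the data to pick partition boundaries, and the order phase routes each value to its range, sorts each range, and concatenates the results.

```python
import random

def analyze_phase(data, num_reducers, sample_size=100):
    """Sample the data and pick boundaries that split the sample evenly."""
    sample = sorted(random.sample(data, min(sample_size, len(data))))
    step = len(sample) // num_reducers
    # num_reducers - 1 cut points between the ranges.
    return [sample[i * step] for i in range(1, num_reducers)]

def partition(value, boundaries):
    """Route a value to the 'reducer' responsible for its range."""
    for i, bound in enumerate(boundaries):
        if value < bound:
            return i
    return len(boundaries)

def order_phase(data, boundaries, num_reducers):
    """Each 'reducer' sorts its own range; concatenation gives a total order."""
    buckets = [[] for _ in range(num_reducers)]
    for value in data:
        buckets[partition(value, boundaries)].append(value)
    return [v for bucket in buckets for v in sorted(bucket)]

data = [random.randint(0, 1000) for _ in range(10_000)]
boundaries = analyze_phase(data, num_reducers=4)
assert order_phase(data, boundaries, 4) == sorted(data)
```

In real Hadoop the buckets would be separate reducers running in parallel; here they are just lists, which is enough to see why pre-computed boundaries make the concatenated output globally sorted.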
Separating the framework from the code
The examples in the book assume that you are familiar with Hadoop, and they only work within the Hadoop environment. They are written in Java, so for someone who prefers a different programming language the code is not directly applicable. While I appreciate that developing MapReduce jobs only makes sense when they are executed in a distributed environment, for learning purposes this is not necessarily required. That is why I didn’t apply the code examples in the book exactly as they are written.
When I learn a new framework such as MapReduce, I prefer to separate the implementation from the concepts themselves. Not only is it more difficult to learn both at the same time; sometimes it even blurs the distinction to the point where you are not sure whether you are learning a new piece of technology or a new concept. I want to learn new concepts using a programming environment that I already understand, so that I can just write the algorithms. Figuring out what went wrong is much simpler with technology I am already familiar with.
Actually, you can write map and reduce code without the hassle of installing and configuring the complete Hadoop stack. You can easily write the procedures in a programming language such as Python and test them in your own development environment. Of course, this is not the same as learning how to run the code in a distributed environment, but it lets you learn the concepts before applying them in a real-life scenario. This is not dissimilar to learning data science algorithms using just Excel: you would never run the actual algorithms in Excel, but it simplifies the learning process by not requiring you to learn a new tool at the same time.
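For example, the canonical word count can be written and tested locally with nothing but the standard library (a sketch of this approach, with the shuffle-and-sort step simulated by sorting and grouping the mapper output):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word in a line of input.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Sum the counts for one word.
    return word, sum(counts)

def run_job(lines):
    # Map, then simulate the framework's shuffle-and-sort, then reduce.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = run_job(["hello world", "hello MapReduce"])
# counts == {"hello": 2, "mapreduce": 1, "world": 1}
```

The mapper and reducer are exactly the functions you would later port to a real Hadoop job; only the plumbing around them changes.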
I really appreciated reading this book, because the design patterns are clearly defined and explained. The patterns cover a broad spectrum of possible use cases and deliver a clear understanding of what types of problems can be implemented using MapReduce.