MapReduce Word Count Program in Java With an Example

Parmanand
5 min read · Feb 18, 2021

As Big Data developers, we all know how important MapReduce is. But many of us skip the concept because it seems complex or unnecessary. Let me tell you, it is very important for interview purposes. So, please try to understand MapReduce programs before starting with Spark or other frameworks.

In this article, I’ll explain a simple MapReduce word count program.

MapReduce program: it takes key-value pairs as input and produces key-value pairs as output.

Let’s consider a file which contains a few lines.

i like mapreduce program
mapreduce is very simple
it is very important as well

Now we will create a MapReduce program to count the words.

Step 1: Create a Map1 class that extends the Mapper class

class Map1 extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable lineOffset, Text lineText, Context context)
            throws IOException, InterruptedException {
        String line = lineText.toString();
        StringTokenizer s = new StringTokenizer(line, " ");
        while (s.hasMoreTokens()) {
            context.write(new Text(s.nextToken()), new IntWritable(1));
        }
    }
}
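A quick design note: StringTokenizer is used here for simplicity. Splitting with line.split(" ") works just as well, and splitting on the regex "\\s+" would also handle tabs and repeated spaces.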

The map method takes a key-value pair as input and produces key-value pairs as output.

Mapper<LongWritable, Text, Text, IntWritable>

Here, the first two type parameters are the input key and value to the map function: a long value (the byte offset of the line) and a string (the line’s text), respectively. The last two are the intermediate output key and value from the map function: a string (the word) and an int (its count), respectively.

But how can the input to the map method be a Long and a String? Shouldn’t it be only a String, since the file contains nothing but lines of text?

Well, for the line below, the input to the map function would be (0, ”i like mapreduce program”), which is nothing but a Long and a String.

Why 0?

This is nothing but the byte offset at which the line starts in the file, i.e. the position of its first character, “i”, which is 0.

i like mapreduce program

Again, for the line below it would be (25, ”mapreduce is very simple”).

Why 25?

As before, 25 is the byte offset at which this line starts: the first line has 24 characters (counting spaces), plus one newline character, so the “m” sits at position 25.

mapreduce is very simple
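To verify those offsets yourself, here is a minimal plain-Java sketch (no Hadoop involved; the OffsetDemo class name is just for illustration, and it assumes a one-byte newline) that prints the (offset, line) pairs the framework would generate for our sample file:

public class OffsetDemo {
    public static void main(String[] args) {
        String[] lines = {
            "i like mapreduce program",
            "mapreduce is very simple",
            "it is very important as well"
        };
        long offset = 0;
        for (String line : lines) {
            System.out.println("(" + offset + ", \"" + line + "\")");
            offset += line.length() + 1; // +1 for the newline character
        }
    }
}

Running it prints (0, ”i like mapreduce program”), (25, ”mapreduce is very simple”) and (50, ”it is very important as well”).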

Here, instead of long we write LongWritable, and instead of String we use Text.

Below is a list of a few Java data types along with their Hadoop equivalents (a short usage sketch follows the list):

  1. Integer –> IntWritable: the Hadoop variant of Integer, used to pass integer numbers as a key or value.
  2. Float –> FloatWritable: the Hadoop variant of Float, used to pass floating-point numbers as a key or value.
  3. Long –> LongWritable: the Hadoop variant of Long, used to store long values.
  4. String –> Text: the Hadoop variant of String, used to pass string characters as a key or value.
  5. null –> NullWritable: the Hadoop variant of null, used to pass null as a key or value. NullWritable is usually used as the data type of the reducer’s output key when that key is not important in the final result.
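Here is a minimal sketch of the Writable wrappers in isolation (assuming the hadoop-common jar is on the classpath; the WritableDemo class name is just for illustration). Each one wraps a plain Java value and exposes it via get() or toString():

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;

public class WritableDemo {
    public static void main(String[] args) {
        IntWritable count = new IntWritable(1);      // wraps an int
        LongWritable offset = new LongWritable(25L); // wraps a long
        Text word = new Text("mapreduce");           // wraps a String
        System.out.println(word + " -> " + count.get() + " (offset " + offset.get() + ")");
    }
}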

Now we will consider one line and see how it works. The same thing happens for the other lines as well.

(0, "i like mapreduce program")   ---- input to the mapper

Now, in the map method, we write code to split the whole string by spaces and assign a count of 1 to each word.

// Storing the line into the line variable
String line = lineText.toString();
// Splitting on spaces
StringTokenizer s = new StringTokenizer(line, " ");
// Accessing one word at a time and assigning a count of 1 to each word
while (s.hasMoreTokens()) {
    context.write(new Text(s.nextToken()), new IntWritable(1));
}
// We use context to write the intermediate results,
// which become the input for the reducer

So the output of the mapper for this line would be:

i - 1
like - 1
mapreduce - 1
program - 1

Note: we are not using the key (0) anywhere, because we don’t need it here. For the reducer, the word itself becomes the key and the count becomes the value.

Step 2: Create a Reduce1 class that extends the Reducer class

class Reduce1 extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            int t = val.get();
            sum = sum + t;
        }
        context.write(key, new IntWritable(sum));
    }
}
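A side note on design: because this reducer simply sums its input values, the same class could also be registered as a combiner with job.setCombinerClass(Reduce1.class) to pre-aggregate counts on the map side. That is an optional optimization and not required for this example.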

Here, the reduce method takes a key-value pair as input and produces a key-value pair as output.

Reducer<Text, IntWritable, Text, IntWritable>

Here, the first two type parameters are the input key and value to the reduce function, and they must match the mapper’s intermediate key and value types. The last two are the output key and value from the reduce function, which form the final result of the MapReduce program.

The reducer receives the output of all mappers.

Mapper output:

i - 1
like - 1
mapreduce - 1
program - 1
mapreduce - 1
is - 1
very - 1
simple - 1
it - 1
is - 1
very - 1
important - 1
as - 1
well - 1

Shuffle and sort phase: The above output is not passed to the reducer directly. First, the output of all the mappers is collected, then it is sorted and grouped by key.

Below is the output after the shuffle and sort phase. You don’t need to write a single line of code for this phase.

i - [1]
like - [1]
mapreduce - [1, 1]
program - [1]
is - [1, 1]
very - [1, 1]
simple - [1]
it - [1]
important - [1]
as - [1]
well - [1]
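To build intuition for what the framework does here, below is a minimal plain-Java sketch (not Hadoop code, just an illustration; the ShuffleSortDemo class name is mine) that groups the mapper output by key and sorts the keys using a TreeMap:

import java.util.*;

public class ShuffleSortDemo {
    public static void main(String[] args) {
        // The mapper output from above: each word was emitted with a count of 1
        String[] words = {"i", "like", "mapreduce", "program", "mapreduce",
                "is", "very", "simple", "it", "is", "very", "important",
                "as", "well"};
        // TreeMap keeps keys sorted; each key collects its list of 1s
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String w : words) {
            grouped.computeIfAbsent(w, k -> new ArrayList<>()).add(1);
        }
        grouped.forEach((k, v) -> System.out.println(k + " - " + v));
    }
}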

These grouped pairs are the input to the reducer. The reduce method is called once per key, so a single call handles all the values for one key at a time.

      
// Reduce logic
for (IntWritable val : values) {
    int t = val.get();
    sum = sum + t;
}

The code above adds up all the values for a key and stores the total in sum.

For the key mapreduce, the output would be:

mapreduce - 2

The same thing happens for all the keys, and the output is written to a file.

Final output from the reducer:

as  1
i 1
important 1
is 2
it 1
like 1
mapreduce 2
program 1
simple 1
very 2
well 1
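Note that the final output is sorted alphabetically by key. With a single reducer, that ordering falls out of the shuffle and sort phase; it does not come from any code we wrote.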

Complete Java program!

File: Count.java

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.*;
import java.io.IOException;
import java.util.Date;
import java.util.StringTokenizer;

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Map method
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer s = new StringTokenizer(line, " ");
        System.out.println("Mapper.. ");
        while (s.hasMoreTokens()) {
            context.write(new Text(s.nextToken()), new IntWritable(1));
        }
    }
}

class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reduce method
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        System.out.println("Reducer.. ");
        for (IntWritable val : values) {
            int t = val.get();
            sum = sum + t;
            System.out.println(key + " " + t);
        }
        context.write(key, new IntWritable(sum));
    }
}

public class Count extends Configured implements Tool {
    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("Word count");
        job.setJarByClass(Count.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setReducerClass(Reduce.class);
        job.setMapperClass(Map.class);
        FileInputFormat.addInputPath(job, new Path("/Users/Param/Extra/data/input/count.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/Users/Param/Extra/data/output/" + new Date().getTime()));
        // Exit code 0 on success, 1 on failure
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int exitCode = ToolRunner.run(new Count(), args);
        System.exit(exitCode);
    }
}
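To try it out (assuming Hadoop is installed and the hadoop command is on your PATH), compile Count.java against the Hadoop libraries, package the classes into a jar, and submit it with hadoop jar, for example: hadoop jar count.jar Count. The input and output paths are hardcoded in run(), so adjust them for your machine before running.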

Thanks for reading!

Please do share the article if you liked it. Any comments or suggestions are welcome.
