How to perform group by in MapReduce program | Java

Grouping by key in a MapReduce program

Parmanand
3 min read · Sep 11, 2022

As Big Data developers, we all know how important MapReduce is. Yet many of us skip over this concept because it seems complex or unnecessary, even though it comes up again and again in interviews.

In this article, I'll write a program that performs a group by on a column.
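Conceptually, this is the same operation as SQL's `SELECT id, SUM(marks) FROM students GROUP BY id` (table and column names here are just for illustration): the mapper emits key-value pairs, the shuffle phase groups all values by key, and the reducer aggregates each group.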


Let's consider a file that contains a few lines:

ID,Name,DOB,Marks
1,Parmanand,10-02-1994,70
1,Parmanand,10-02-1994,80
1,Parmanand,10-02-1994,90
2,Rahul,10-02-1994,70
2,Rahul,10-02-1994,56
2,Rahul,10-02-1994,66
3,Shyam,10-02-1994,55
3,Shyam,10-02-1994,77
3,Shyam,10-02-1994,99

Step 1: Create a Map class that extends the Mapper class

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Map method: called once per input line
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String[] line = value.toString().split(",");
        // Skip the header row ("ID,Name,DOB,Marks"), which has no numeric marks
        if (line[0].equals("ID")) {
            return;
        }
        // Emit (ID, Marks) so the shuffle phase groups marks by ID
        context.write(new Text(line[0]), new IntWritable(Integer.parseInt(line[3])));
    }
}

The map method takes a key-value pair as input and produces key-value pairs as output. Here the input key is the byte offset of the line and the value is the line itself; we select the ID and Marks columns and send them to the reducer.
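To see what the mapper emits for a single record, here is a minimal standalone sketch of the same split-and-select logic (plain Java, no Hadoop involved; the class name and sample line are made up for illustration):

public class MapLogicDemo {
    public static void main(String[] args) {
        // One input line, as the mapper would receive it in the value parameter
        String record = "1,Parmanand,10-02-1994,70";
        String[] line = record.split(",");
        // The mapper emits (ID, Marks): here (1, 70)
        System.out.println(line[0] + " -> " + Integer.parseInt(line[3]));
    }
}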

Step 2: Create a Reduce class that extends the Reducer class

class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce method: called once per key, with all of that key's grouped values
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        System.out.println("reduce output : " + key + " " + sum);
        context.write(key, new IntWritable(sum));
    }
}

The reduce method takes a key and the iterable of all values grouped under that key, and produces a key-value pair as output. In the code above we add up all the values (marks) for each key (ID).
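As a rough mental model of what shuffle plus reduce do together, here is a plain-Java sketch (again no Hadoop; the class name is made up) that groups the sample records by ID and sums the marks:

import java.util.LinkedHashMap;

public class GroupByDemo {
    public static void main(String[] args) {
        String[] records = {
            "1,Parmanand,10-02-1994,70", "1,Parmanand,10-02-1994,80", "1,Parmanand,10-02-1994,90",
            "2,Rahul,10-02-1994,70",     "2,Rahul,10-02-1994,56",     "2,Rahul,10-02-1994,66",
            "3,Shyam,10-02-1994,55",     "3,Shyam,10-02-1994,77",     "3,Shyam,10-02-1994,99"
        };
        // The shuffle phase effectively builds this structure: ID -> running sum of marks
        LinkedHashMap<String, Integer> sums = new LinkedHashMap<>();
        for (String record : records) {
            String[] line = record.split(",");
            sums.merge(line[0], Integer.parseInt(line[3]), Integer::sum);
        }
        // The reducer then emits one (ID, sum) pair per key
        sums.forEach((id, sum) -> System.out.println(id + "\t" + sum));
    }
}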

Output:

1	240
2	192
3	231

In the output (tab-separated key and sum, the default TextOutputFormat layout) we can see the sum of marks for each of the three students.

Complete Java Code:

package com.target.mapreduce.groupby;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.*;
import java.io.IOException;
import java.util.Date;

class Map extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Map method: emits (ID, Marks) for every data line
    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {

        String[] line = value.toString().split(",");
        // Skip the header row ("ID,Name,DOB,Marks")
        if (line[0].equals("ID")) {
            return;
        }
        context.write(new Text(line[0]), new IntWritable(Integer.parseInt(line[3])));
    }
}


class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    // Reduce method: sums all marks grouped under one ID
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {

        int sum = 0;
        for (IntWritable val : values) {
            sum = sum + val.get();
        }
        System.out.println("reduce output : " + key + " " + sum);
        context.write(key, new IntWritable(sum));
    }
}


public class GroupBy extends Configured implements Tool {

    @Override
    public int run(String[] strings) throws Exception {
        Job job = Job.getInstance(getConf());
        job.setJobName("Map Reduce");
        job.setJarByClass(GroupBy.class);
        job.setOutputKeyClass(Text.class);
        // job.setNumReduceTasks(2);
        job.setOutputValueClass(IntWritable.class);

        job.setReducerClass(Reduce.class);
        job.setMapperClass(Map.class);

        FileInputFormat.addInputPath(job, new Path("/Users/pamkin/IdeaProjects/WordCount/src/main/resources/input/groupBy.txt"));
        // Timestamped output directory, since the output path must not already exist
        FileOutputFormat.setOutputPath(job, new Path("/Users/pamkin/IdeaProjects/WordCount/src/main/resources/output" + new Date().getTime()));
        // Return 0 on success, 1 on failure (the convention ToolRunner expects)
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new GroupBy(), args));
    }
}
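To run this on a Hadoop cluster instead of locally, the usual route is to package the project as a jar and submit it with `hadoop jar`, for example `hadoop jar groupby.jar com.target.mapreduce.groupby.GroupBy` (the jar name here is just whatever your build produces, and the hardcoded paths would need to point at HDFS). Run directly from the IDE, the job uses the local paths shown above.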

Thanks for reading!

Please share the article if you liked it. Any comments or suggestions are welcome.
