
301.1.6-Basics of MapReduce

MapReduce is not new

  • If we want to achieve a large task, we need not do it in one go: we can divide the whole problem into pieces. We start from the raw input data, produce an intermediate output from each piece (the map step), and then combine those intermediate results (the reduce step).

  • Thus, the idea itself is old; with modern networking and network computing, we can now achieve MapReduce-style distributed computing.
  • To handle big data, we need to write MapReduce programs; simple sequential programs will not suffice.

MapReduce programs

  • The conventional program to count the number of records in a data file:
count = count + 1
  • The MapReduce program to count the number of records in a big-data file:
Map:    count = count + 1
Reduce: cum_sum = cum_sum + count
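The record-counting logic above can be sketched in Python. This is a hypothetical simulation, not a real cluster: each "block" is just a Python list standing in for data stored on a separate machine.

```python
# Simulate counting records across data blocks with map and reduce.
# Each "block" is a list of records, standing in for data on one machine.

def map_count(block):
    """Map step: count the records in one local block (count = count + 1)."""
    count = 0
    for _record in block:
        count = count + 1
    return count

def reduce_counts(partial_counts):
    """Reduce step: accumulate the partial counts (cum_sum = cum_sum + count)."""
    cum_sum = 0
    for count in partial_counts:
        cum_sum = cum_sum + count
    return cum_sum

blocks = [["r1", "r2"], ["r3"], ["r4", "r5", "r6"]]
partials = [map_count(b) for b in blocks]  # in a cluster, these run in parallel
total = reduce_counts(partials)            # total number of records: 6
```

In a real cluster the map calls run in parallel on the machines that hold the blocks, and only the small partial counts travel over the network to the reducer.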

More than just a MapReduce program

  • The MapReduce program to count the number of records in a big-data file:
Map:    count = count + 1
Reduce: cum_sum = cum_sum + count
  • Who will set up the network of machines and store the data locally?
  • Who will divide and send the map program to the local machines, and invoke the reduce program on top of the map outputs?
  • What if one machine in the cluster is very slow?
  • What if there is a hardware failure on one of the machines?
    • So it is not just MapReduce, and it is not that simple: running it at scale involves much more than the MapReduce code itself.

Additional scripts for work distribution

  1. We need to first set up a cluster of machines, then divide the whole dataset into blocks and store them on the local machines.
  2. We also need to assign a master node that takes charge of all the metadata: which block of data is on which machine.
  3. We need to write a script that takes care of work scheduling, task distribution, and job orchestration.
  4. We also need to assign worker slots to execute the map and reduce functions.
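Steps 1 and 2 above can be sketched as a toy master-node bookkeeping exercise. The block size and machine names below are made up for illustration; a real system like HDFS does this with large blocks and replication.

```python
# Toy sketch of steps 1-2: split a dataset into fixed-size blocks and
# keep master-node metadata mapping each block to a machine.
# BLOCK_SIZE and MACHINES are illustrative values, not real defaults.

BLOCK_SIZE = 3
MACHINES = ["node1", "node2", "node3"]

def split_into_blocks(records, block_size=BLOCK_SIZE):
    """Divide the whole dataset into blocks of at most block_size records."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def assign_blocks(blocks, machines=MACHINES):
    """Master metadata: which block lives on which machine (round robin)."""
    return {block_id: machines[block_id % len(machines)]
            for block_id in range(len(blocks))}

records = list(range(8))
blocks = split_into_blocks(records)   # [[0, 1, 2], [3, 4, 5], [6, 7]]
metadata = assign_blocks(blocks)      # block id -> machine name
```

The master node consults this metadata to send each map task to the machine that already holds the block, so the computation moves to the data rather than the other way around.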

Additional scripts for efficiency

  • We need to write scripts for load balancing (what if one machine in the cluster is very slow?).
  • We also need to write scripts for data backup, replication, and fault tolerance (what if the intermediate data is only partially read?).
  • Finally, we write the MapReduce code that solves our actual problem.
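The fault-tolerance point above can be illustrated with a toy scheduler that re-runs a failed map task on another machine instead of failing the whole job. The machine names and the simulated failure are hypothetical.

```python
# Toy sketch of fault tolerance: if a map task fails on one machine,
# reschedule it on another. Failures are simulated via a set of "down" nodes.

def run_map_task(block, machine, failed_machines):
    """Run one map task (count records) on a machine; fail if the node is down."""
    if machine in failed_machines:
        raise RuntimeError(machine + " is down")
    return len(block)

def run_with_retry(block, machines, failed_machines):
    """Scheduler: try each machine in turn until one completes the task."""
    for machine in machines:
        try:
            return run_map_task(block, machine, failed_machines)
        except RuntimeError:
            continue  # that node failed; reschedule on the next one
    raise RuntimeError("all machines failed")

machines = ["node1", "node2"]
# node1 is simulated as failed, so the task is rerun on node2
result = run_with_retry(["r1", "r2"], machines, failed_machines={"node1"})
```

Real frameworks generalize this idea: they also speculatively re-run slow ("straggler") tasks on spare machines and take whichever copy finishes first.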

Implementation of MapReduce is difficult

  • Analysis of big data can give us awesome insights.
  • But the datasets are huge, complex, and difficult to process.
  • The solution is distributed computing, i.e., MapReduce.
  • But the data storage, parallel processing, job orchestration, and network setup all look complicated.
  • What is the solution? Is there a ready-made tool or platform that can take care of all these tasks?
    • Yes: Hadoop.

 
