31. Explain the Reducer’s reduce phase?

Ans: In this phase the reduce(MapOutKeyType, Iterable<MapOutValType>, Context) method is called for each <key, (collection of values)> pair in the grouped inputs. The output of the reduce task is typically written to the FileSystem via Context.write(ReduceOutKeyType, ReduceOutValType). Applications can use the Context to report progress, set application-level status messages, update Counters, or simply indicate that they are alive. The output of the Reducer is not sorted.
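A minimal sketch of such a Reducer, using the standard Hadoop API. The class name and the Text/IntWritable types are illustrative choices, not something the question mandates:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative word-count style Reducer; types are assumptions for the sketch.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();                        // aggregate the grouped values
        }
        context.write(key, new IntWritable(sum));  // write to the FileSystem via Context
        context.progress();                        // report progress / signal liveness
    }
}
```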

32. How many Reducers should be configured?

Ans: The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapreduce.tasktracker.reduce.tasks.maximum).

With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes will finish their first round of reduces and launch a second wave of reduces, doing a much better job of load balancing. Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
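For example, on a hypothetical 10-node cluster with 2 reduce slots per node, a sketch of applying the 0.95 rule when configuring a job (the cluster figures here are assumptions, not values the framework supplies):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical cluster figures; in practice obtain them from the cluster.
        int nodes = 10;
        int reduceSlotsPerNode = 2;  // mapreduce.tasktracker.reduce.tasks.maximum

        Job job = Job.getInstance(new Configuration(), "reduce-count-example");
        // 0.95 rule: all reduces launch in a single wave -> 0.95 * 10 * 2 = 19.
        job.setNumReduceTasks((int) (0.95 * nodes * reduceSlotsPerNode));
        // The 1.75 rule would give 35, trading a second wave for better balancing.
    }
}
```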

33. Can a Job have 0 reducers?

Ans: It is legal to set the number of reduce-tasks to zero if no reduction is desired.

34. What happens if the number of reducers is 0?

Ans: In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). The framework does not sort the map-outputs before writing them out to the FileSystem.
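A sketch of configuring such a map-only job; the class name and input/output paths are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only");
        job.setJarByClass(MapOnlyJob.class);
        // Zero reducers: map outputs go straight to the FileSystem, unsorted.
        job.setNumReduceTasks(0);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // hypothetical paths
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```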

35. How many instances of Job Tracker can run on a Hadoop Cluster?

Ans: Only one.

36. What is the Job Tracker and what does it do in a Hadoop Cluster?

Ans: Job Tracker is a daemon service that submits and tracks MapReduce tasks on the Hadoop cluster. It runs in its own JVM process, usually on a separate machine, and each slave node is configured with the Job Tracker node's location. The Job Tracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted.

Job Tracker in Hadoop performs the following actions:

• Client applications submit jobs to the Job Tracker (see the client-side sketch after this list).

• The Job Tracker talks to the Name Node to determine the location of the data.

• The Job Tracker locates Task Tracker nodes with available slots at or near the data.

• The Job Tracker submits the work to the chosen Task Tracker nodes.

• The Task Tracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different Task Tracker.

• A Task Tracker will notify the Job Tracker when a task fails. The Job Tracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the Task Tracker as unreliable.

• When the work is completed, the Job Tracker updates its status.

• Client applications can poll the Job Tracker for information.
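A minimal client-side sketch of the first and last steps, submitting a job and polling for its status. The class name is illustrative; only standard Job API calls are used:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SubmitAndPoll {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "submit-and-poll");
        job.setJarByClass(SubmitAndPoll.class);
        // ... set mapper/reducer/input/output here (omitted for brevity) ...

        job.submit();  // hands the job to the Job Tracker for scheduling

        // Poll the tracker for progress, as client applications are free to do.
        while (!job.isComplete()) {
            System.out.printf("map %.0f%%  reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        System.out.println(job.isSuccessful() ? "done" : "failed");
    }
}
```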

37. How is a task scheduled by the Job Tracker?

Ans: The Task Trackers send out heartbeat messages to the Job Tracker, usually every few seconds, to reassure the Job Tracker that they are still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated. When the Job Tracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the Data Node containing the data; failing that, it looks for an empty slot on a machine in the same rack.
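That preference order can be sketched as follows. This is illustrative pseudologic only, not the Job Tracker's actual scheduler code; the Node type and pickNode method are hypothetical:

```java
import java.util.List;
import java.util.Optional;

// Hypothetical types to illustrate the data-locality preference order.
class Node { String host; String rack; boolean hasFreeSlot; }

class LocalityScheduler {
    /** Pick a node for a task whose input data lives on dataNode. */
    static Optional<Node> pickNode(Node dataNode, List<Node> cluster) {
        // 1. Prefer a free slot on the server hosting the Data Node itself.
        for (Node n : cluster)
            if (n.hasFreeSlot && n.host.equals(dataNode.host)) return Optional.of(n);
        // 2. Otherwise, a free slot on a machine in the same rack.
        for (Node n : cluster)
            if (n.hasFreeSlot && n.rack.equals(dataNode.rack)) return Optional.of(n);
        // 3. Otherwise, any free slot anywhere (off-rack).
        for (Node n : cluster)
            if (n.hasFreeSlot) return Optional.of(n);
        return Optional.empty();
    }
}
```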

38. How many instances of Task tracker run on a Hadoop cluster?

Ans: There is one Task Tracker daemon process for each slave node in the Hadoop cluster.

39. What are the two main parts of the Hadoop framework?

Ans: Hadoop consists of two main parts:

• Hadoop Distributed File System (HDFS), a distributed file system with high throughput,

• Hadoop MapReduce, a software framework for processing large data sets.

40. Explain the use of Task Tracker in the Hadoop cluster?

Ans: A Task Tracker is a slave node in the cluster that accepts tasks from the Job Tracker, such as Map, Reduce, or shuffle operations. The Task Tracker also runs in its own JVM process.

Every Task Tracker is configured with a set of slots, which indicate the number of tasks it can accept. The Task Tracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the Task Tracker itself.
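Slot counts come from each Task Tracker's configuration. A hedged sketch of reading them, assuming the classic (MRv1) property names, which are normally set in mapred-site.xml on each slave:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfig {
    public static void main(String[] args) {
        // MRv1 slot properties; 2 is the shipped default for both.
        Configuration conf = new Configuration();
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("map slots: " + mapSlots + ", reduce slots: " + reduceSlots);
    }
}
```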

The Task Tracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the Task Tracker notifies the Job Tracker.

The Task Trackers also send out heartbeat messages to the Job Tracker, usually every few seconds, to reassure the Job Tracker that they are still alive. These messages also inform the Job Tracker of the number of available slots, so the Job Tracker can stay up to date with where in the cluster work can be delegated.