41. What do you mean by Task Instance?

Ans: Task instances are the actual MapReduce tasks that run on each slave node. The Task Tracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the entire Task Tracker. Each task instance runs in its own JVM process, and multiple task-instance processes can run on a slave node, depending on the number of slots configured on the Task Tracker. By default, a new task-instance JVM process is spawned for each task.
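As a minimal sketch (assuming the classic Hadoop 1.x MapReduce API and property names), the default one-JVM-per-task behaviour can be relaxed via the mapred.job.reuse.jvm.num.tasks property:

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Default is 1: a fresh task-instance JVM is spawned for every task.
        // A value of -1 lets the TaskTracker reuse one JVM for all tasks of the job.
        conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
        System.out.println("JVM reuse setting: "
                + conf.getInt("mapred.job.reuse.jvm.num.tasks", 1));
    }
}
```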

42. How many daemon processes run on a Hadoop cluster?

Ans: Hadoop comprises five separate daemons, each of which runs in its own JVM.

The following three daemons run on the master nodes:

◦ NameNode – stores and maintains the metadata for HDFS.
◦ Secondary NameNode – performs housekeeping functions for the NameNode.
◦ JobTracker – manages MapReduce jobs and distributes individual tasks to the machines running the TaskTracker.

The following two daemons run on each slave node:

◦ DataNode – stores the actual HDFS data blocks.
◦ TaskTracker – responsible for instantiating and monitoring individual Map and Reduce tasks.

43. What is the maximum number of JVMs that can run on a slave node?

Ans: One or multiple task instances can run on each slave node. Each task instance runs as a separate JVM process. The number of task instances can be controlled through configuration; typically, a high-end machine is configured to run more task instances.
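As a rough sketch (assuming the Hadoop 1.x property names, normally set in mapred-site.xml on each slave), the per-node slot counts that bound the number of concurrent task-instance JVMs can be read like this:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfigExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Both properties default to 2, i.e. at most 2 map and 2 reduce
        // task-instance JVMs run concurrently on an unconfigured node.
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("Maximum concurrent task-instance JVMs on this node: "
                + (mapSlots + reduceSlots));
    }
}
```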

44. What is NAS?

Ans: NAS (Network-Attached Storage) is a type of file system in which data resides on one centralised machine and all cluster members read and write data from that shared storage, which is not as efficient as HDFS for distributed processing.

45. How does HDFS differ from NAS?

Ans: The following are the differences between HDFS and NAS:

  1. In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
  2. HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce because data is stored separately from the computation.
  3. HDFS runs on a cluster of machines and provides redundancy through replication, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

46. How does the NameNode handle the failure of DataNodes?

Ans: HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients.

In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes that they run on.

The NameNode and DataNode are pieces of software designed to run on commodity machines. The NameNode periodically receives a Heartbeat and a Block report from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly, while a Block report contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, that DataNode is marked as dead. Since its blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another; the replication data transfer happens directly between DataNodes, and the data never passes through the NameNode.
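The following is a purely illustrative sketch of the dead-node detection described above, not actual NameNode code; the class, the expiry constant, and the method names are hypothetical (in Hadoop the real timeout is derived from the heartbeat and recheck-interval settings in hdfs-site.xml):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: mimics the heartbeat-timeout logic described above.
public class HeartbeatMonitorSketch {
    // Hypothetical expiry window for this sketch.
    private static final long EXPIRY_MILLIS = 10 * 60 * 1000L;

    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a DataNode reports in.
    void onHeartbeat(String dataNodeId) {
        lastHeartbeat.put(dataNodeId, System.currentTimeMillis());
    }

    // Periodically scan for DataNodes whose heartbeat has expired.
    void checkForDeadNodes() {
        long now = System.currentTimeMillis();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (now - entry.getValue() > EXPIRY_MILLIS) {
                System.out.println(entry.getKey() + " marked dead; its blocks are now"
                        + " under-replicated and would be copied DataNode-to-DataNode");
            }
        }
    }
}
```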

47. Can Reducers talk to each other?

Ans: No, Reducers run in isolation.

48. Where is the Mapper’s intermediate data stored?

Ans: The mapper output (intermediate data) is stored on the local file system (NOT HDFS) of each individual mapper node. This is typically a temporary directory location that can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.
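As a small sketch (assuming the Hadoop 1.x property name mapred.local.dir), the local directories used for this intermediate data can be inspected from the configuration:

```java
import org.apache.hadoop.conf.Configuration;

public class IntermediateDataDirExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // mapred.local.dir points at the local-disk directories (comma-separated)
        // where map output is spilled before the shuffle; it is NOT an HDFS path.
        String localDirs = conf.get("mapred.local.dir", "${hadoop.tmp.dir}/mapred/local");
        System.out.println("Intermediate (map-side) data directories: " + localDirs);
    }
}
```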

49. What is the use of Combiners in the Hadoop framework?

Ans: Combiners are used to increase the efficiency of a MapReduce program. They aggregate intermediate map output locally on the individual mapper nodes, which helps reduce the amount of data that needs to be transferred across the network to the reducers.

You can use your reducer code as a combiner if the operation performed is commutative and associative.

The execution of the combiner is not guaranteed; Hadoop may or may not execute it, and if required it may execute it more than once. Therefore, your MapReduce jobs should not depend on the combiner’s execution.
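A minimal sketch of this idea, assuming a word-count-style job built with the org.apache.hadoop.mapreduce API: the sum operation is commutative and associative, so the same reducer class can also be registered as the combiner.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for a word; safe to use both as combiner and reducer.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```

In the driver, the same class would then be set for both roles, e.g. job.setCombinerClass(SumReducer.class) and job.setReducerClass(SumReducer.class); whether the combiner actually runs, or runs more than once, is still up to Hadoop.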

50. What is the Hadoop MapReduce API contract for a key and value Class?

Ans:

◦ The Key must implement the org.apache.hadoop.io.WritableComparable interface.

◦ The value must implement the org.apache.hadoop.io.Writable interface.
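For illustration, here is a minimal sketch of a custom key class honouring this contract (the class name YearKey is hypothetical); a value class would implement only org.apache.hadoop.io.Writable, i.e. the same write/readFields methods without compareTo:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// A key must be serialisable (write/readFields) AND sortable (compareTo),
// because keys are sorted during the shuffle phase.
public class YearKey implements WritableComparable<YearKey> {
    private int year;

    public YearKey() { }                      // no-arg constructor needed for deserialisation
    public YearKey(int year) { this.year = year; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);                   // serialise
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();                  // deserialise
    }

    @Override
    public int compareTo(YearKey other) {     // defines the sort order
        return Integer.compare(this.year, other.year);
    }
}
```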