Data Locality Concept in Hadoop




Data Locality in Hadoop:


Data locality is a core concept of Hadoop, based on several
assumptions around the use of MapReduce. In short: keep the data
on disks that are close to the RAM and CPU that will be used
to process it.


Introduction:


Hadoop's optimization around data locality rests on the observation
that moving data to the compute is more expensive than moving the
compute to the data. Hadoop is therefore able to schedule jobs on
nodes that are local to the input data, which produces high-performance
results. This blog explains a couple of data locality issues that we
identified and fixed.


Why is Data Locality important?


A dataset stored in HDFS is divided into blocks and stored
across the Data Nodes in the Hadoop cluster. When a
MapReduce job is executed against the dataset, the individual
Mappers process those blocks. When the data is not available
to a Mapper on the same node where it is being executed, the
data needs to be copied over the network from the Data Node which
has the data to the Data Node which is executing the Mapper task.
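

To see this block placement for yourself, the sketch below asks the
NameNode which Data Nodes hold each block of a file. This is a minimal
example, assuming a reachable HDFS cluster and the standard Hadoop
FileSystem client API; the file path is a hypothetical placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        // Hypothetical path; point this at any file in your HDFS cluster.
        Path file = new Path("/data/input/logs.txt");

        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(file);

        // Ask the NameNode which Data Nodes hold each block of the file.
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

For example, a 1 GB file with the default 128 MB block size would print
eight blocks, each listing the hosts that hold its replicas.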


Imagine a MapReduce job with over 70 Mappers, where each Mapper
tries to copy data from another Data Node in the cluster
at the same time: the network would be jammed as all the
Mappers copy data simultaneously, which is not ideal.
So it is always more effective and cheaper to move the computation
closer to the data than to move the data closer to the computation.


How is data proximity defined?


When the Application Master receives a request to run a job, it looks at which
nodes in the cluster have enough resources to execute the Mappers and
Reducers for the job. At this point, serious consideration is given to deciding
which nodes the individual Mappers will be executed on, based on where the
data for each Mapper is located.
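

The proximity hints come from the input splits themselves. As a rough
sketch, assuming the standard mapreduce client API (the input path is a
hypothetical placeholder), this lists the hosts each split reports, which
is the information the Application Master consults when placing Mappers:

import java.util.Arrays;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Hypothetical input path; use any dataset in your cluster.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // Each split carries the list of hosts that store its data;
        // the scheduler tries to place the Mapper on one of them.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            System.out.println(split + " -> "
                    + Arrays.toString(split.getLocations()));
        }
    }
}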


Data Local:

When the data is located on the same node as the Mapper working on it,
this is referred to as Data Local. Since this places the data as close
to the computation as possible, the Application Master prefers to execute
a Mapper on the node which holds the data that Mapper needs.


Rack Local:

Even though Data Local is the ideal choice, it is not always possible to execute
the Mapper on the same node as the data, due to resource constraints on a busy
cluster. In such instances, it is preferred to run the Mapper on another node
but on the same rack as the node which has the data. In this case, the data will
be moved between nodes, from the node with the data to the node executing
the Mapper, within the same rack. In a busy cluster, sometimes even Rack Local is
not possible. In that case, a node on a different rack is chosen to execute the
Mapper, and the data is copied across racks from the node which has it to the node
executing the Mapper.
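
After a job completes, you can see how the scheduler did on each of these
three levels from the job's built-in counters. A minimal sketch, assuming
the standard mapreduce client API (obtaining the completed Job object is
elided):

import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // Prints the locality breakdown of a completed MapReduce job.
    static void printLocality(Job job) throws Exception {
        Counters counters = job.getCounters();
        System.out.println("Data local maps:  "
                + counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue());
        System.out.println("Rack local maps:  "
                + counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue());
        System.out.println("Other local maps: "
                + counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue());
    }
}

On a healthy cluster the Data Local count should dominate; a large
Other Local count is a sign of the cross-rack copying described above.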
