Data Locality in Hadoop
Data locality is a core concept of Hadoop, based on several assumptions
about how MapReduce is used. In short: keep data on disks that are close to
the RAM and CPU that will process it.
Introduction:
Hadoop's optimization around data locality rests on the idea that moving
compute to data is cheaper than moving data to compute. By scheduling tasks
on nodes that are local to their input data, Hadoop produces high-performance
results out of the box. This blog explains a couple of data locality issues
that we identified and fixed.
Why is Data Locality important?
A dataset stored in HDFS is divided into blocks and stored
across the DataNodes in the Hadoop cluster. When a MapReduce job is executed
against the dataset, individual Mappers process the blocks. When the
data is not available to a Mapper on the node where it is being
executed, the data needs to be copied over the network from the DataNode that
holds it to the DataNode executing the Mapper task.
Imagine a MapReduce job with over 70 Mappers, each
trying to copy its data from another DataNode in the cluster at the
same time. This would jam the network, as all the Mappers would try to
copy data simultaneously, which is far from ideal. So it is always cheaper
and more effective to move the computation closer to the data than the other
way around.
How is data proximity defined?
When the Application Master receives a request to run a job, it
looks at which nodes in the cluster have enough resources to execute the
Mappers and Reducers for the job. At this point, serious consideration is
given to which nodes the individual Mappers should run on, based on where the
data for each Mapper is located.
Data Local:
When the data is located on the same node as the Mapper working
on it, this is referred to as Data Local. In this case, the data is as close
to the computation as possible, so the Application Master prefers the node
that holds the data the Mapper needs.
Rack Local:
Even though Data Local is the ideal choice, it is not always
possible to execute the Mapper on the same node as the data, due to
resource constraints on a busy cluster. In such instances, it is preferred to
run the Mapper on another node on the same rack as the node that holds the
data. In this case, the data moves between nodes, from the node holding the
data to the node executing the Mapper, within the same rack. On a busy
cluster, sometimes even Rack Local placement is not possible. In that case, a
node on a different rack is chosen to execute the Mapper, and the data is
copied between racks from the node that holds it to the node executing the
Mapper.
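The three placement levels described above can be sketched as a simple classification function. This is illustrative only: the function and its names are mine, not Hadoop's scheduler code, but they mirror the data-local / rack-local / off-rack preference order the Application Master follows.

```python
# Illustrative classification of a task's data proximity.
# Mirrors the three levels described above; not actual Hadoop scheduler code.

def locality_level(task_node: str, data_nodes: set, rack_of: dict) -> str:
    """Return the locality level for a task scheduled on task_node,
    given the set of nodes holding a replica of its input block and a
    mapping from node name to rack name."""
    if task_node in data_nodes:
        return "DATA_LOCAL"   # a block replica is on the same node
    if any(rack_of[task_node] == rack_of[n] for n in data_nodes):
        return "RACK_LOCAL"   # a replica is on another node in the same rack
    return "OFF_RACK"         # replicas live only on other racks

# Hypothetical cluster: two racks, the input block stored only on node1.
racks = {"node1": "rack1", "node2": "rack1", "node3": "rack2"}
replicas = {"node1"}

print(locality_level("node1", replicas, racks))  # DATA_LOCAL
print(locality_level("node2", replicas, racks))  # RACK_LOCAL
print(locality_level("node3", replicas, racks))  # OFF_RACK
```

The scheduler tries each level in this order, falling back to the next only when the cluster is too busy for the better placement.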