Now let us see exactly how the data is stored in the HDFS cluster. Suppose HadoopData.txt is the file that contains the data to be processed. Once this file is loaded into the HDFS cluster, it is divided into blocks, and the Name Node keeps track of where each block is stored. All of these blocks are of the same fixed size (except possibly the last one); by default each block is 64 MB.
Let me explain this clearly. Suppose our HadoopData.txt file is 400 MB in size (note: the input to HDFS is generally in terabytes or petabytes; 400 MB is taken here just for our understanding). This file is divided into 7 blocks: six full blocks of 64 MB hold the first 384 MB, and the remaining 16 MB goes into a seventh, smaller block. The following figure shows this clearly...
If we want to increase the block size, we can do so, but the size should be a multiple of 64 MB (128, 256, 512, and so on). We can configure this in the configuration file hdfs-site.xml, which is present in the /hadoop/conf directory.
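As a rough sketch, the block-size override in hdfs-site.xml could look like the snippet below. The property name dfs.block.size (value given in bytes) is the Hadoop 1.x name assumed here; later releases call it dfs.blocksize.

<configuration>
  <property>
    <!-- block size in bytes: 128 MB = 128 * 1024 * 1024 -->
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>
</configuration>

Note that after changing this value, only newly written files are split using the new block size; existing files keep the block size they were written with.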
All of these blocks are stored on different nodes in the cluster. If any data node in the cluster goes down, the blocks stored on it are lost. To overcome this problem, Hadoop introduced a concept called replication. With replication, even if a data node goes down we can continue processing using the replicas (duplicate copies of the data) stored on other nodes. By default HDFS maintains 3 replicas of each block. We can even configure this value to be more than 3 in hdfs-site.xml.