How exactly is data stored on an HDFS cluster?




      Now let us see how exactly the data gets stored in the HDFS cluster. Suppose HadoopData.txt is the file containing the data to be processed. When this file is loaded into the HDFS cluster, it is divided into blocks, and the NameNode keeps track of these blocks and where they are stored. All blocks are of equal size; by default each block is 64 MB.

      To make this concrete, suppose our HadoopData.txt file is 400 MB in size (note: in practice, input to HDFS is usually in the terabyte/petabyte range; 400 MB is taken here just for understanding). With the default 64 MB block size, the file is divided into 7 blocks: six full blocks of 64 MB and a seventh block holding the remaining 16 MB. The following figure shows this clearly...

[Figure: HadoopData.txt (400 MB) divided into blocks]
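
      If you want to verify this on your own cluster, Hadoop ships with an fsck utility that lists the blocks (and the nodes holding them) for a given file. The path below is only an example; substitute the location where you loaded the file:

      hadoop fsck /user/hadoop/HadoopData.txt -files -blocks -locations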



      If we want to increase the block size, we can do so; it is usually set to a larger value such as 128 MB, 256 MB or 512 MB. We can configure this in the configuration file (hdfs-site.xml) which is present in the Hadoop conf directory (e.g. /hadoop/conf).
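
      For example, to raise the block size to 128 MB, a property like the following can be added to hdfs-site.xml (a minimal sketch for Hadoop 1.x, where the property is named dfs.block.size and the value is given in bytes):

      <property>
        <name>dfs.block.size</name>
        <value>134217728</value>   <!-- 128 MB expressed in bytes -->
      </property>

      Note that the new block size only applies to files written after the change; existing files keep the block size they were written with.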



      All of these blocks are stored on different nodes in the cluster. If any DataNode in the cluster goes down, the blocks stored on it would be lost. To overcome this problem, Hadoop introduced a concept called Replication: duplicate copies of each block (replicas) are kept on other nodes, so even if a DataNode is down we can continue processing using the replicas. By default HDFS maintains 3 replicas. We can configure this value to more than 3 in hdfs-site.xml.
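
      As a sketch, setting the default replication factor to 4 would look like this in hdfs-site.xml (dfs.replication controls the number of replicas for newly written files; 4 here is just an example value):

      <property>
        <name>dfs.replication</name>
        <value>4</value>   <!-- default is 3 -->
      </property>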
