List the characteristics of HDFS and explain HDFS operations.
Features of Hadoop HDFS
Fault Tolerance
Fault tolerance in HDFS refers to the ability of the system to keep working under unfavorable conditions. HDFS is highly fault-tolerant: data is divided into blocks, and multiple copies of each block are created on different machines in the cluster (the number of replicas is configurable). So if any machine in the cluster goes down, a client can still access the data from another machine that holds a copy of the same blocks. HDFS also maintains the replication factor by placing replicas of blocks on another rack, so even if an entire machine or rack suddenly fails, a user can access the data from slaves on a different rack.
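As a rough illustration, the minimal sketch below lists the hosts that hold replicas of each block of a file, which is what allows reads to continue when one of those machines fails. The NameNode URI hdfs://namenode:8020 and the path /user/demo/sample.txt are placeholders, not values from the text.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReplicaLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI and file path -- adjust for a real cluster.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/user/demo/sample.txt");

        FileStatus status = fs.getFileStatus(file);
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

        for (BlockLocation block : blocks) {
            // Each block is held by several machines, so losing one host is not fatal.
            System.out.println("offset " + block.getOffset()
                    + " -> replicas on " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```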
High Availability
HDFS is a highly available file system. Data is replicated among the nodes in the HDFS cluster by creating copies of the blocks on other slaves, so whenever users want to access the data, they can read it from whichever node holding those blocks is nearest to them in the cluster. During unfavorable situations, such as the failure of a node, users can still access their data from the other nodes, because duplicate copies of the blocks containing the user data exist elsewhere in the HDFS cluster.
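A minimal read sketch is shown below, assuming the same placeholder NameNode URI and file path as above. The client simply opens the file; HDFS decides which replica to serve the blocks from, so a failed DataNode is transparent to this code.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadFromAnyReplica {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // open() does not name a DataNode; HDFS picks an available (ideally nearby) replica.
        try (FSDataInputStream in = fs.open(new Path("/user/demo/sample.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```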
Data Reliability
HDFS is a distributed file system that provides reliable data storage and can store data in the range of hundreds of petabytes. HDFS divides the data into blocks and stores these blocks on the nodes of the HDFS cluster. It stores data reliably by creating a replica of each block on other nodes in the cluster, which also provides fault tolerance: if a node containing the data goes down, a user can still access that data from the other nodes that hold a copy of it. By default, HDFS creates three copies of each block across the nodes of the cluster, so data remains quickly available and users do not face the problem of data loss. Hence HDFS is highly reliable.
Replication
Data replication is one of the most important and unique features of Hadoop HDFS. Replication is done to solve the problem of data loss in unfavorable conditions such as the crash of a node or a hardware failure: each block of data is copied to a number of machines across the cluster. HDFS maintains the replication factor continuously, re-creating replicas of user data on different machines in the cluster whenever needed. Hence, whenever any machine in the cluster crashes, the user can access the data from other machines that contain the blocks of that data, and there is no possibility of losing user data.
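For illustration, the hedged sketch below shows two common ways a client can control the replication factor: setting dfs.replication before creating a file, and changing the replication of an existing file with setReplication(). The URI, paths, and the target factor of 4 are arbitrary example values.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationFactorExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");   // replication applied to files created by this client
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // A new file inherits the replication factor from the configuration above.
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/replicated.txt"))) {
            out.writeUTF("each block of this file is stored on several machines");
        }

        // Raise the replication of an existing file; HDFS re-replicates in the background.
        fs.setReplication(new Path("/user/demo/replicated.txt"), (short) 4);
        fs.close();
    }
}
```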
Scalability
As HDFS stores data on multiple nodes in the cluster, the cluster can be scaled when requirements grow. Two scalability mechanisms are available: vertical scalability, which adds more resources (CPU, memory, disk) to the existing nodes of the cluster, and horizontal scalability, which adds more machines to the cluster. The horizontal way is preferred, since the cluster can grow from tens of nodes to hundreds of nodes on the fly, without any downtime.
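As a small illustration of horizontal scaling from the client's point of view, the sketch below (assuming the same placeholder NameNode URI) lists the DataNodes the NameNode currently reports; newly added machines simply appear in this report once their DataNode daemons join the cluster.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        if (fs instanceof DistributedFileSystem) {
            DistributedFileSystem dfs = (DistributedFileSystem) fs;
            // Every DataNode currently known to the NameNode, including freshly added ones.
            for (DatanodeInfo node : dfs.getDataNodeStats()) {
                System.out.println(node.getHostName() + " capacity=" + node.getCapacity());
            }
        }
        fs.close();
    }
}
```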
Distributed Storage
All of these HDFS features are achieved through distributed storage and replication. Data in HDFS is stored in a distributed manner across the nodes of the cluster: it is divided into blocks, the blocks are stored on the nodes of the HDFS cluster, and replicas of each block are created and stored on other nodes in the cluster. So if a single machine in the cluster crashes, we can still access our data from the other nodes that hold its replicas.
HDFS Write Operation
i. Interaction of Client with NameNode
If a client has to create a file inside HDFS, it needs to interact with the NameNode, as the NameNode is the centerpiece of the cluster and holds the metadata. The NameNode provides the addresses of the slaves to which the client can write its data. The client also gets a security token from the NameNode, which it must present to the slaves for authentication before writing a block. The steps the client performs in order to write data into HDFS are as follows:
To create a file, the client calls the create() method on DistributedFileSystem. DistributedFileSystem then makes an RPC call to the NameNode to create a new file, with no blocks associated with it, in the filesystem's namespace. The NameNode performs various checks to make sure that no such file already exists and that the client is authorized to create a new file.
If these checks pass, the NameNode makes a record of the new file; otherwise, file creation fails and an IOException is thrown to the client. DistributedFileSystem returns an FSDataOutputStream for the client to start writing data to the DataNodes. Communication with the DataNodes and the client is handled by DFSOutputStream, which is wrapped by FSDataOutputStream.
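A minimal client-side sketch of this step is given below (placeholder URI and path). The create() call is where the RPC to the NameNode happens; if the file already exists and overwrite is not requested, the call fails with an IOException before any data is written.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On a real cluster this resolves to DistributedFileSystem.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        Path file = new Path("/user/demo/newfile.txt");

        try {
            // RPC to the NameNode: record a new, empty file in the namespace.
            // 'false' means do not overwrite if the file already exists.
            FSDataOutputStream out = fs.create(file, false);
            // At this point no blocks exist yet; the writes below allocate them.
            out.writeUTF("hello hdfs");
            out.close();
        } catch (java.io.IOException e) {
            // Thrown if the file already exists or the client is not authorized.
            System.err.println("File creation failed: " + e.getMessage());
        }
        fs.close();
    }
}
```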
ii. Interaction of Client with DataNodes
After the client is authenticated to create a new file in the filesystem namespace, the NameNode provides the locations at which to write the blocks. The client then goes directly to the DataNodes and starts writing the data blocks there. Since HDFS creates replicas of blocks on different nodes, once the client finishes writing a block to the first slave, that slave starts replicating the block to the other slaves, and in this way multiple replicas of the block are created on different slaves. By default, three copies of each block are created on different slaves, and after the required replicas are created an acknowledgment is sent back to the client. In this manner, a pipeline is created while writing a data block, and data is replicated to the desired replication factor in the cluster.
Let's understand the procedure in greater detail. As the client writes data, DFSOutputStream splits it into packets, which are written to an internal queue called the data queue. The data queue is consumed by the DataStreamer, whose main responsibility is to ask the NameNode to allocate new blocks on suitable DataNodes to store the replicas. The list of DataNodes forms a pipeline; assuming the default replication factor of three, there are three nodes in the pipeline. The DataStreamer streams the packets to the first DataNode in the pipeline, which stores each packet and forwards it to the second DataNode in the pipeline.
In the same way, the second DataNode stores the packet and forwards it to the third (and last) DataNode in the pipeline.
An internal queue of packets waiting to be acknowledged by the DataNodes, known as the "ack queue", is also maintained. A packet is removed from the ack queue only when it has been acknowledged by all the DataNodes in the pipeline. When the client has finished writing data, it calls the close() method on the stream.
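The two queues are internal to DFSOutputStream, but the toy sketch below mimics the idea with plain Java collections: packets move from a data queue to an ack queue when sent, and are dropped from the ack queue only after every node in the pipeline has acknowledged them. Class and method names here are illustrative and are not HDFS internals.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy model of the DataStreamer's two queues; not the real HDFS implementation.
public class PipelineQueuesSketch {
    static final int PIPELINE_SIZE = 3;           // default replication factor

    public static void main(String[] args) {
        Queue<String> dataQueue = new ArrayDeque<>();
        Queue<String> ackQueue = new ArrayDeque<>();

        // Client side: written data is split into packets and queued.
        for (int i = 1; i <= 4; i++) {
            dataQueue.add("packet-" + i);
        }

        // Streamer side: send each packet down the pipeline, then wait for acks.
        while (!dataQueue.isEmpty()) {
            String packet = dataQueue.poll();
            ackQueue.add(packet);                 // outstanding until all nodes confirm

            int acks = 0;
            for (int node = 1; node <= PIPELINE_SIZE; node++) {
                // In real HDFS the packet is stored and forwarded node by node.
                acks++;
            }
            if (acks == PIPELINE_SIZE) {
                ackQueue.remove(packet);          // acknowledged by the whole pipeline
                System.out.println(packet + " acknowledged by all "
                        + PIPELINE_SIZE + " DataNodes");
            }
        }
    }
}
```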
Calling close() flushes all the remaining packets to the DataNode pipeline and waits for acknowledgments before contacting the NameNode to signal that the file is complete. The NameNode already knows which blocks the file is made up of, so it only has to wait for the blocks to be minimally replicated before returning successfully.
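End to end, the write path described above is hidden behind a few lines of client code. A minimal sketch, assuming the same placeholder URI and an example path, is shown below; close() is the call that flushes the remaining packets, waits for pipeline acknowledgments, and then tells the NameNode the file is complete.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteAndCloseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        FSDataOutputStream out = fs.create(new Path("/user/demo/report.txt"));
        // Writes go to DFSOutputStream, which packetizes them onto the data queue.
        out.write("data streamed through the DataNode pipeline".getBytes(StandardCharsets.UTF_8));

        // Optional: hflush() pushes buffered packets to the pipeline right away.
        out.hflush();

        // close() flushes the remaining packets, waits for acks, and completes the file
        // with the NameNode once its blocks are minimally replicated.
        out.close();
        fs.close();
    }
}
```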