What is Hadoop Architecture and Storage?

4 years ago

Data Mining and Data Warehousing

What is Hadoop Architecture?

Hadoop, developed in 2005 and now an open source platform managed under the Apache Software Foundation, uses a concept known as MapReduce that is composed of two separate functions.
The Map step inputs data and breaks it down for processing across nodes within a Hadoop instance. These “worker” nodes may in turn break the data down further for processing. In the Reduce step, the processed data is then collected back together and assembled into a format based on the original query being performed.
To cope with truly massive-scale data analysis, Hadoop’s developers implemented a scale-out architecture, based on many low-cost physical servers with distributed processing of data queries during the Map operation.
Their logic was to enable a Hadoop system capable of processing many parts of a query in parallel to reduce execution times as much as possible.
This can be contrasted with legacy-structured database design that looks to scale up within a single server by using faster processors, more memory and fast shared storage.
Looking at the storage layer, the design aim for Hadoop is to execute the distributed processing with the minimum latency possible. This is achieved by executing Map processing on the node that stores the data, a concept known as data locality.
As a result, Hadoop implementations can use SATA drives directly connected to the server, thereby keeping the overall cost of the system as low as possible.
To implement the data storage layer, Hadoop uses a feature known as HDFS or the Hadoop Distributed File System. HDFS is not a file system in the traditional sense and isn’t usually directly mounted for a user to view (although there are some tools available to achieve this), which can sometimes make the concept difficult to understand; it’s perhaps better to think of it simply as a Hadoop data store.
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant.
HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.

Name Node and Data Nodes

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by
In addition, there are a number of DataNodes, usually one per node in the cluster, which manage storage attached to the nodes that they run on. HDFS exposes a file system namespace and allows user data to be stored in
Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations like opening, closing, and renaming files and It also determines the mapping of blocks to DataNodes.
The DataNodes are responsible for serving read and write requests from the file system’s clients. The DataNodes also perform block creation, deletion, and replication upon instruction from the
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata. The system is designed in such a way that user data never flows through the NameNode.