Google File System (GFS) 

The Google File System (GFS) is designed to manage relatively large files using a very large distributed cluster of commodity servers connected by a high-speed network. It is therefore designed to (a) expect and tolerate hardware failures, even during the reading or writing of an individual file (since files are expected to be very large) and (b) support parallel reads, writes, and appends by multiple client programs. A common use case that is efficiently supported is that of many ‘producers’ appending to the same file in parallel, which is also being simultaneously read by many parallel ‘consumers’.

As a result, they also do not scale as well as data organizations built on GFS-like platforms such as Google Datastore. The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS architecture that is also available on the Amazon EC2 cloud platform; we refer to both GFS and HDFS as ‘cloud file systems.’

The architecture of cloud file systems is illustrated in Figure Large files are broken up into ‘chunks’ (GFS) or ‘blocks’ (HDFS), which are themselves large (64MB being typical). These chunks are stored on commodity (Linux) servers called Chunk Servers (GFS) or Data Nodes (HDFS); further, each chunk is replicated at least three times, both on a different physical rack as well as a different network segment in anticipation of possible failures of these components apart from server failures.

When a client program (‘cloud application’) needs to read/write a file, it sends the full path and offset to the Master (GFS) which sends back meta-data


                                                                    FIGURE 4.8. Cloud file systems

for one (in the case of reading) or all (in the case of writing) of the replicas of the chunk where this data is to be found. The client caches such meta-data so that it does need not to contact the Master each time. Thereafter the client directly reads data from the designated chunk server; this data is not cached since most reads are large and caching would only complicate writes.

In the case of a write, in particular, an append, the client sends only the data to be appended to all the chunk servers; when they all acknowledge receiving this data it informs a designated ‘primary’ chunk server, whose identity it receives (and also caches) from the Master. The primary chunk server appends its copy of data into the chunk at an offset of its choice; note that this may be beyond the EOF to account for multiple writers who may be appended to this file simultaneously. The primary then

forwards the request to all other replicas which in turn write the data at the same offset if possible or return a failure. In case of a failure the primary rewrites the data at possibly another offset and retries the process.

The Master maintains regular contact with each chunk server through heartbeat messages and in case it detects a failure its meta-data is updated to reflect this, and if required assigns a new primary for the chunks being served by a failed chunk server. Since clients cache meta-data, occasionally they will try to connect to failed chunk servers, in which case they update their meta-data from the master and retry.

it is shown that this architecture efficiently supports multiple parallel readers and writers. It also supports writing (appending) and reading the same file by parallel sets of writers and readers while maintaining a consistent view, i.e. each reader always sees the same data regardless of the replica it happens to read from. Finally, note that computational processes (the ‘client’ applications above) run on the same set of servers that files are stored on. As a result, distributed programming systems, such as MapReduce, can often schedule tasks so that their data is found locally as far as possible, as illustrated by the Cluster system.