Google File System (GFS)
The Google File System (GFS) is designed to manage relatively large files using a very large distributed cluster of commodity servers connected by a high-speed network. It is therefore designed to (a) expect and tolerate hardware failures, even during the reading or writing of an individual file (since files are expected to be very large), and (b) support parallel reads, writes, and appends by multiple client programs. A common use case that is efficiently supported is that of many ‘producers’ appending to the same file in parallel, which is also being simultaneously read by many parallel ‘consumers’.
As a result, they also do not scale as well as data organizations built on GFS-like platforms, such as Google Datastore. The Hadoop Distributed File System (HDFS) is an open-source implementation of the GFS architecture that is also available on the Amazon EC2 cloud platform; we refer to both GFS and HDFS as ‘cloud file systems.’
The architecture of cloud file systems is illustrated in Figure 4.8. Large files are broken up into ‘chunks’ (GFS) or ‘blocks’ (HDFS), which are themselves large (64 MB being typical). These chunks are stored on commodity (Linux) servers called Chunk Servers (GFS) or Data Nodes (HDFS); further, each chunk is replicated at least three times, on a different physical rack as well as a different network segment, in anticipation of failures of these components in addition to server failures.
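The chunking and replica-placement scheme above can be sketched in a few lines. This is an illustrative sketch, not the actual GFS or HDFS placement policy: the function names, the rack map, and the deterministic placement rule are all assumptions made for the example; real systems also weigh free space, load, and network topology.

```python
# Illustrative sketch of fixed-size chunking and rack-aware replica
# placement, in the spirit of GFS/HDFS. All names are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical chunk/block size


def chunk_index(offset):
    """Map a byte offset within a file to the index of its chunk."""
    return offset // CHUNK_SIZE


def place_replicas(chunk_id, racks, replication=3):
    """Pick `replication` servers for a chunk, spreading them across racks.

    `racks` maps rack name -> list of chunk servers on that rack. The
    round-robin rule here is a toy stand-in for a real placement policy.
    """
    rack_names = sorted(racks)
    placement = []
    for i in range(replication):
        rack = racks[rack_names[(chunk_id + i) % len(rack_names)]]
        placement.append(rack[chunk_id % len(rack)])
    return placement


racks = {"rack-a": ["cs1", "cs2"], "rack-b": ["cs3", "cs4"], "rack-c": ["cs5"]}
print(chunk_index(200 * 1024 * 1024))  # offset at 200 MB falls in chunk 3
print(place_replicas(42, racks))       # ['cs1', 'cs3', 'cs5'], one per rack
```

Placing each replica on a different rack (and, in GFS, a different network segment) means that the loss of a whole rack or switch still leaves at least one copy of every chunk reachable.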
FIGURE 4.8. Cloud file systems

When a client program (‘cloud application’) needs to read/write a file, it sends the full path and offset to the Master (GFS), which sends back meta-data for one (in the case of a read) or all (in the case of a write) of the replicas of the chunk where this data is to be found. The client caches such meta-data so that it does not need to contact the Master each time. Thereafter the client reads data directly from the designated chunk server; this data is not cached, since most reads are large and caching would only complicate writes.
In the case of a write, in particular an append, the client sends the data to be appended to all the chunk servers; when they have all acknowledged receiving this data, it informs a designated ‘primary’ chunk server, whose identity it receives (and also caches) from the Master. The primary chunk server appends its copy of the data to the chunk at an offset of its choice; note that this may be beyond the EOF, to account for multiple writers that may be appending to this file simultaneously. The primary then forwards the request to all the other replicas, which in turn write the data at the same offset if possible, or return a failure. In case of a failure the primary rewrites the data, possibly at another offset, and retries the process.
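The append protocol just described can be sketched as below. This is a simplified model under assumed names (`ChunkReplica`, `record_append`), not the actual GFS record-append implementation; in particular, the failure-retry path is only indicated in comments.

```python
# Illustrative sketch of record append: data is first pushed to every
# replica, then the primary picks the offset and the secondaries write
# at that same offset.

class ChunkReplica:
    def __init__(self, name):
        self.name = name
        self.chunk = bytearray()
        self.staged = None  # data pushed by the client, not yet committed

    def push(self, data):
        self.staged = data  # step 1: client sends data to every replica

    def write_at(self, offset, data):
        # Step 3: a secondary tries to write at the primary's offset.
        if offset < len(self.chunk):
            return False  # that offset is already occupied here: fail
        self.chunk.extend(b"\x00" * (offset - len(self.chunk)))  # pad to offset
        self.chunk.extend(data)
        return True


def record_append(primary, secondaries, data):
    for replica in [primary] + secondaries:
        replica.push(data)
    # Step 2: the primary appends at an offset of its own choosing.
    offset = len(primary.chunk)
    primary.chunk.extend(data)
    for replica in secondaries:
        ok = replica.write_at(offset, data)
        # On failure, a real primary would rewrite the record at a new
        # offset and retry the whole process; omitted in this sketch.
        assert ok
    return offset


primary = ChunkReplica("cs1")
secondaries = [ChunkReplica("cs3"), ChunkReplica("cs5")]
off = record_append(primary, secondaries, b"record-1")
print(off)  # 0: the first record lands at the start of the chunk
```

Because the primary alone chooses the offset, concurrent appenders never overwrite each other: each record is assigned its own region of the chunk, at the cost of possible padding or duplicates when retries occur.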
The Master maintains regular contact with each chunk server through heartbeat messages; if it detects a failure, it updates its meta-data to reflect this and, if required, assigns a new primary for the chunks that were being served by the failed chunk server. Since clients cache meta-data, they will occasionally try to connect to failed chunk servers, in which case they refresh their meta-data from the Master and retry.
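The heartbeat-based failure handling can be sketched as follows. The timeout value, class layout, and method names are assumptions for illustration; a real Master tracks far more state (chunk versions, leases, re-replication queues).

```python
# Illustrative sketch of heartbeat-based failure detection: chunk
# servers report in periodically; servers silent past a timeout are
# presumed dead and their primary roles are reassigned.

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a server is presumed dead


class Master:
    def __init__(self):
        self.last_seen = {}  # server -> time of last heartbeat
        self.primaries = {}  # chunk -> current primary server

    def heartbeat(self, server, now):
        self.last_seen[server] = now

    def check_failures(self, now, replicas):
        """Mark silent servers dead; `replicas` maps chunk -> its servers."""
        dead = [s for s, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]
        for server in dead:
            del self.last_seen[server]
            for chunk, primary in self.primaries.items():
                if primary == server:
                    # Promote a surviving replica of this chunk to primary.
                    alive = [s for s in replicas[chunk] if s in self.last_seen]
                    self.primaries[chunk] = alive[0] if alive else None
        return dead


master = Master()
master.primaries = {"chunk-42": "cs1"}
replicas = {"chunk-42": ["cs1", "cs3", "cs5"]}
for s in replicas["chunk-42"]:
    master.heartbeat(s, now=0.0)
master.heartbeat("cs3", now=12.0)
master.heartbeat("cs5", now=12.0)
dead = master.check_failures(now=12.0, replicas=replicas)
print(dead)                          # ['cs1'] missed its heartbeat
print(master.primaries["chunk-42"])  # cs3, a surviving replica
```

A client holding stale cached meta-data that points at `cs1` would fail to connect, refresh its meta-data from the Master, and retry against the new primary.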
This architecture has been shown to efficiently support multiple parallel readers and writers. It also supports writing (appending) and reading the same file by parallel sets of writers and readers while maintaining a consistent view, i.e., each reader always sees the same data regardless of the replica it happens to read from. Finally, note that computational processes (the ‘client’ applications above) run on the same set of servers on which files are stored. As a result, distributed programming systems, such as MapReduce, can often schedule tasks so that their data is found locally as far as possible, as illustrated by the Clustera system.
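The locality-aware scheduling idea can be sketched with the same chunk-location map the Master maintains. The function, the slot model, and the fallback rule are all illustrative assumptions, not the MapReduce or Clustera scheduler.

```python
# Illustrative sketch of locality-aware scheduling: prefer running each
# task on a server that already holds a replica of its input chunk.

def schedule(tasks, chunk_locations, free_slots):
    """Assign each task (identified by its input chunk) to a server.

    Prefers a server holding a local replica with a free task slot;
    otherwise falls back to any free server, which must then read the
    chunk over the network.
    """
    assignment = {}
    for chunk in tasks:
        local = [s for s in chunk_locations[chunk] if free_slots.get(s, 0) > 0]
        candidates = local or [s for s, n in sorted(free_slots.items()) if n > 0]
        server = candidates[0]
        free_slots[server] -= 1
        assignment[chunk] = server
    return assignment


chunk_locations = {"c1": ["cs1", "cs3"], "c2": ["cs3", "cs5"]}
free_slots = {"cs1": 1, "cs3": 1, "cs5": 0}
assignment = schedule(["c1", "c2"], chunk_locations, free_slots)
print(assignment)  # {'c1': 'cs1', 'c2': 'cs3'} -- both tasks run local to their data
```

Because chunks are replicated three times, the scheduler usually has several candidate servers per task, which makes a local placement likely even on a busy cluster.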