As discussed in the previous chapter, cloud data stores exhibit non-relational storage models. Furthermore, each of these data stores is built to be multi-tenant from scratch, since a single instance of such a large-scale distributed data store effectively caters to multiple applications created by cloud users. For example, each user of the Google App Engine can create a fixed number of applications, and each of these appears to have a separate data store.
Here we focus on a different problem: As a user (application developer) of a cloud platform, how does one create one’s own multi-tenant application? In the case of Amazon EC2 the answer is straightforward: since this is an infrastructure cloud, it gives users direct access to (virtual) servers, where one can recreate exactly the same multi-tenant architectures discussed earlier using standard application servers and database systems.
However, the situation is different on a PaaS platform such as Google’s App Engine with its Datastore, Amazon’s SimpleDB, or even Azure’s data services. For example, a single App Engine application has one data store namespace, or schema (so, if we create one ‘Customer’ model, we cannot have another by the same name in the same application). Thus, it appears at first that we are constrained to use the inefficient single-schema approach.
However, an interesting feature of the Google Datastore is that entities are essentially schema-less. Thus, it is up to the language API provided to define how the data store is used. In particular, the Model class in the Python API to App Engine is object-oriented as well as dynamic. Further, since Python is an interpreted language, fresh classes can be defined at runtime, along with their corresponding data store ‘kinds.’
Figure 4.4 shows one possible implementation of multi-tenancy using multiple schemas with Google App Engine, in Python. Separate classes are instantiated for each schema, at runtime. This approach is similar to simulating multiple schemas in a relational database by having table names that are schema dependent.
FIGURE 4.4. Multi-tenancy using Google Datastore

A similar strategy can be used with Amazon’s SimpleDB, where domains, which play the role of tables in relational parlance and are the equivalent of a ‘kind’ in the Google Datastore, can be created dynamically from any of the provided language APIs.
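To make this concrete, the following is a minimal sketch of the dynamic-class approach of Figure 4.4, using the classic Python db API of App Engine; the helper model_for_tenant and the Customer fields shown here are hypothetical illustrations, not part of any standard API.

```python
from google.appengine.ext import db

_model_cache = {}

def model_for_tenant(tenant_id):
    """Return (creating on first use) a Customer model class whose Datastore
    'kind' is specific to the given tenant, e.g. 'Customer_acme'."""
    kind_name = 'Customer_%s' % tenant_id
    if kind_name not in _model_cache:
        # type() builds a fresh class at runtime; its name becomes the kind,
        # so each tenant's data lives under its own 'kind' in the Datastore.
        _model_cache[kind_name] = type(kind_name, (db.Model,), {
            'name': db.StringProperty(),
            'address': db.StringProperty(),
        })
    return _model_cache[kind_name]

# Usage: entities of each tenant are stored and queried under that tenant's kind.
AcmeCustomer = model_for_tenant('acme')
AcmeCustomer(name='Alice', address='1 Main St').put()
```

Because classes can be created at runtime in this manner, new tenant ‘schemas’ can be added without redeploying the application.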
Data Access Control for Enterprise Applications
So far we have covered the typical strategies used to achieve multi-tenancy from the perspective of enabling a single application code base, running in a single instance, to work with data of multiple customers, thereby bringing down costs of management across a potentially large number of customers.
For the most part, multi-tenancy as discussed above appears to be of use primarily in a software as a service model. There are also certain cases where multi-tenancy can be useful within the enterprise as
well. We have already seen that supporting multiple entities, such as bank branches, is essentially a multi-tenancy requirement. Similar needs can arise if a workgroup-level application needs to be rolled out to many independent teams, who usually do not need to share data. Customizations of the application schema may also be needed in such scenarios, to support variations in business processes. Similar requirements also arise in supporting multiple legal entities, each of which could be operating in different regulatory environments.
As we mentioned earlier, in a multi-entity scenario a subset of users may need to be given access to data from all branches, or a subset of branches, depending on their position in an organizational unit hierarchy. More generally, access to data may need to be controlled based on the values of any field of a table, such as high-value transactions being visible only to some users, or special customer names being invisible without explicit permission. Such requirements are referred to as data access control needs, which, while common, are less often handled in a reusable and generic manner. Data access control (or DAC) is a generalization of multi-tenancy in that the latter can often be implemented using DAC. In Figure 4.5 we illustrate how data access control can be implemented in a generic manner within a single schema to support fairly general rules for controlling access to records based on field values.
Each application table, such as Customer, is augmented with an additional field DAC_ID. The DAC Rules table lists patterns based on value ranges of arbitrary fields, using which the values of the DAC_ID in each Customer record are filled through a batch process. Users are assigned privileges to access records satisfying one or more such DAC rules, as specified in the User DAC Roles table. This information is expanded, via a batch process, to data in the User table, where there are records for each value of DAC_ID that a user can access.

FIGURE 4.5. Data access control
For example, user 101 has access to three DAC rules, which translate to five records in the User table. This calculation involves computing the complete set of mutually exclusive and unique DAC range combinations based on the DAC Rules, and thereafter determining which subset of these a particular user has access to, based on the User DAC Roles information; note that this computation is independent of the actual DAC_ID values in the Customer or other application tables.
It is straightforward to limit access to records of the Customer table to only those that a particular user is permitted to see, as specified in the User table, using a join. In the illustration of Figure 4.5, we introduce an additional complication where users are also given access to the DAC permissions of all their direct reports, as specified in the Org table.
In a traditional relational database, SQL queries on the Customer database can be modified to support data access control by introducing a generic join, including a self-join on the Org table to find all direct reports of a user, which is then joined to the User table and the Customer table. However, in a cloud database, such as Google Datastore or Amazon’s SimpleDB, joins are not supported. Therefore the same functionality must be implemented in code as shown in the figure: The self-join on Org is done in memory, giving a list of reporters including the user; this is used as a filter to get the permissible DAC_IDs from the User table. Finally, this list is used to filter the application query on the Customer table.
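A rough sketch of this query-time logic, again in Python against the classic App Engine db API, is given below; the model classes Org, UserDAC, and Customer and their properties are hypothetical stand-ins for the tables of Figure 4.5.

```python
from google.appengine.ext import db

class Org(db.Model):            # Org table: reporting relationships
    user_id = db.IntegerProperty()
    manager_id = db.IntegerProperty()

class UserDAC(db.Model):        # expanded User table: (user_id, DAC_ID) pairs
    user_id = db.IntegerProperty()
    dac_id = db.IntegerProperty()

class Customer(db.Model):       # application table augmented with DAC_ID
    name = db.StringProperty()
    dac_id = db.IntegerProperty()

def customers_visible_to(user_id):
    # Step 1: the 'self-join' on Org, done in memory -- the user plus direct reports.
    reporters = [user_id]
    for org in Org.all().filter('manager_id =', user_id):
        reporters.append(org.user_id)

    # Step 2: permissible DAC_IDs for these users, from the (expanded) User table.
    # (The db 'IN' filter only handles small lists; larger lists need batching.)
    permitted_dac_ids = set()
    for rec in UserDAC.all().filter('user_id IN', reporters):
        permitted_dac_ids.add(rec.dac_id)

    # Step 3: filter the application query on Customer by the permitted DAC_IDs.
    results = []
    for dac_id in permitted_dac_ids:
        results.extend(Customer.all().filter('dac_id =', dac_id).fetch(100))
    return results
```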
It is important to note that when adding or modifying Customer records the DAC_ID needs to be recomputed based on the DAC Rules; this computation also needs to be optimized, especially if there are a large number of DAC Rules. Adding new DAC Rules or modifying existing ones will also require re-computation and updates to the DAC_ID values in the Customer table. Care also needs to be taken when filling the DAC Rules table to ensure that DAC ranges on the same field are always non-overlapping.
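A write-time routine along these lines might recompute the DAC_ID as sketched below; the rule representation (one non-overlapping value range per rule, on a single hypothetical field) is a deliberate simplification for illustration.

```python
# Hypothetical in-memory form of the DAC Rules table: each rule names a field,
# a non-overlapping value range on that field, and the DAC_ID it maps to.
DAC_RULES = [
    {'field': 'txn_value', 'low': 0,      'high': 100000, 'dac_id': 1},
    {'field': 'txn_value', 'low': 100000, 'high': 10**9,  'dac_id': 2},
]

def assign_dac_id(entity):
    """Recompute DAC_ID whenever a Customer record is added or modified.
    If the DAC Rules themselves change, all existing records must be revisited."""
    for rule in DAC_RULES:
        value = getattr(entity, rule['field'], None)
        if value is not None and rule['low'] <= value < rule['high']:
            return rule['dac_id']
    return None  # no rule matched; such records need a default policy
```

A record would then be saved only after setting its dac_id via such a routine, so that query-time filtering remains consistent.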
We have chosen to cover data access control here, as part of our treatment of multi-tenancy, first because the two requirements are closely related, and also to bring out the complexities that arise in real enterprise applications even when incorporating a generic requirement such as data access control. Beyond the example itself, the lesson to be learned is that migrating applications to a multi-tenancy model, especially using cloud databases, is not a trivial task.
The emergence of cloud platforms has given rise to new paradigms for dealing with distributed data in the cloud, for parallel computing using huge computing clusters, and for rapid application development in specialized domains. Cloud-based data stores differ significantly from traditional relational databases, with different query and consistency semantics as well as performance behavior. The MapReduce programming paradigm makes large-scale analytics tasks easy to define, and MapReduce implementations allow massive computing clusters to be used while tolerating faults that are inevitable at such scales. Similarly, but in a very different context, Dev 2.0 platforms allow simple business applications to be developed by end-users using always-on hosted platforms in the cloud, obviating the need for traditional development and thereby increasing business agility.
Data in the cloud
Since the 1980s, relational database technology has been the ‘default’ data storage and retrieval mechanism used in the vast majority of enterprise applications. The origins of relational databases, beginning with System R [5] and Ingres [60] in the 1970s, lay in introducing this new paradigm as a general-purpose replacement for hierarchical and network databases for the most common business computing task at the time, viz. transaction processing.
In the process of creating a planetary scale web search service, Google in particular has developed a massively parallel and fault tolerant distributed file system (GFS) along with a data organization (BigTable) and programming paradigm (MapReduce) that is markedly different from the traditional relational model. Such ‘cloud data strategies’ are particularly well suited for large-volume massively parallel text processing, as well as possibly other tasks, such as enterprise analytics. The public cloud computing offerings from Google (i.e. App Engine), as well as those from different vendors, have made similar data models (Google’s Datastore, Amazon’s SimpleDB) and programming paradigms (Hadoop on Amazon’s EC2) available to users as part of their cloud platforms.
At the same time, there have been new advances in building specialized database organizations optimized for analytical data processing, in particular column-oriented databases such as Vertica. It is instructive to note that the BigTable-based data organization underlying cloud databases exhibits some similarities to column-oriented databases. These concurrent trends, along with the ease of access to cloud platforms, have led to a resurgence of interest in non-relational data organizations and an exploration of how these can best be leveraged for enterprise applications.
In this chapter we examine the structure of Google App Engine’s Datastore and its underlying technologies, Google’s distributed file system (GFS) and BigTable abstraction, as well as the open source project Hadoop’s HBase and HDFS (clones of BigTable and GFS respectively).
Relational Databases in Cloud Computing
Before we delve into cloud data structures we first review traditional relational database systems and how they store data. Users (including application programs) interact with an RDBMS via SQL; the database ‘front-end,’ or parser, transforms queries into memory- and disk-level operations so as to optimize execution time. Data records are stored on pages of contiguous disk blocks, which are managed by the disk-space-management layer.
Pages are fetched from disk into memory buffers as they are requested, in many ways similar to the file and buffer management functions of the operating system, using pre-fetching and page replacement policies. However, database systems usually do not rely on the file system layer of the OS and instead manage disk space themselves. This is primarily so that the database can have full control over when to retain a page in memory and when to release it. The database also needs to be able to adjust its page replacement policy when required and to pre-fetch pages from disk based on expected access patterns, which can be very different from those of ordinary file operations. Finally, the operating system files used by databases need to be able to span multiple disks so as to handle the large storage requirements of a database, by efficiently exploiting parallel I/O systems such as RAID disk arrays or multi-processor clusters.
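As a purely illustrative aside, the kind of control described above can be pictured with a toy buffer pool using an LRU replacement policy; this is not how any particular database engine implements its buffer manager.

```python
from collections import OrderedDict

class BufferPool(object):
    """Toy buffer pool: keeps at most `capacity` pages in memory and evicts the
    least recently used page when a new page must be brought in from disk."""

    def __init__(self, capacity, read_page_from_disk):
        self.capacity = capacity
        self.read_page_from_disk = read_page_from_disk  # callable: page_id -> page data
        self.pages = OrderedDict()                      # page_id -> page data, in LRU order

    def get_page(self, page_id):
        if page_id in self.pages:
            # Re-insert to mark the page as most recently used.
            page = self.pages.pop(page_id)
            self.pages[page_id] = page
            return page
        page = self.read_page_from_disk(page_id)
        if len(self.pages) >= self.capacity:
            self.pages.popitem(last=False)              # evict the least recently used page
        self.pages[page_id] = page
        return page
```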
The storage indexing layer of the database system is responsible for locating records and for their organization on disk pages. Relational records (tabular rows) are stored on disk pages and accessed through indexes on specified columns, which can be B+-tree indexes, hash indexes, or bitmap indexes [46]. Usually, rows are stored contiguously on pages, also called a ‘row-store,’ and indexed using B+-trees. An index can be primary, in which case the rows of the table are physically stored as close as possible to sorted order based on the column specified by the index. Clearly, only one such primary index is possible per table; the remaining indexes are secondary and maintain pointers to the actual row locations on disk.
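The distinction can be pictured with a toy example in which a sorted list stands in for the primary (clustered) organization and a dictionary of row positions stands in for a secondary index; the Customer rows and columns here are invented for illustration.

```python
import bisect

# Rows of a Customer table, physically kept in sorted order of the primary
# index column (id) -- a 'clustered' or primary organization.
rows = [
    {'id': 1, 'city': 'Pune',  'name': 'Asha'},
    {'id': 4, 'city': 'Delhi', 'name': 'Ravi'},
    {'id': 7, 'city': 'Pune',  'name': 'Meena'},
]
primary_keys = [r['id'] for r in rows]   # stands in for the B+-tree on id

def lookup_by_id(cid):
    # Primary index: binary search locates the row's physical position directly.
    pos = bisect.bisect_left(primary_keys, cid)
    if pos < len(rows) and primary_keys[pos] == cid:
        return rows[pos]
    return None

# Secondary index on city: maps values to *pointers* (here, list positions),
# since the rows themselves are physically ordered by id, not by city.
secondary_city = {}
for pos, r in enumerate(rows):
    secondary_city.setdefault(r['city'], []).append(pos)

def lookup_by_city(city):
    return [rows[pos] for pos in secondary_city.get(city, [])]
```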
While B+-tree indexes on a row-store are optimal for write-oriented workloads, as is the case in transaction-processing applications, they are not the best choice for applications where reads dominate; in the latter case bitmap indexes, cross-table indexes, and materialized views are used for more efficient access to records and their attributes. Further, row-oriented storage of records on disk may also not be optimal for
read-dominated workloads, especially analytical applications. Recently, column-oriented storage [61] has been proposed as a more efficient mechanism suited to analytical workloads, where aggregations of measure columns (e.g. Sales) need to be performed based on values of dimension columns (e.g. Month). Figure 4.6 illustrates the difference between row-oriented and column-oriented storage. Notice that in a column store, projections of the table are stored sorted by dimension values, which are themselves compressed (as bitmaps, for example) for ease of comparison as well as reduced storage. Notice also that the column store needs additional ‘join indexes’ that map the sort orders of different projections so as to be able to recover the original row when required. When the cardinality of dimension values is small, these join indexes can also be efficiently compressed (we have not shown the join indexes in the figure).

FIGURE 4.6. Row vs. column storage

FIGURE 4.7. Parallel database architectures
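The aggregation advantage of the column layout of Figure 4.6 can be seen in a toy comparison such as the one below; the table, its columns, and the flat Python structures are invented, and a real column store would additionally keep each projection sorted and compressed, with join indexes relating them, which we ignore here.

```python
# Row store: each record is kept together, so an aggregation over two columns
# still reads entire rows.
rows = [
    {'Month': 'Jan', 'Region': 'East', 'Sales': 100},
    {'Month': 'Jan', 'Region': 'West', 'Sales': 150},
    {'Month': 'Feb', 'Region': 'East', 'Sales': 120},
]

# Column store: each column is stored separately; here all columns are kept in
# the same (row) order so that positions line up without a join index.
columns = {
    'Month': ['Jan', 'Jan', 'Feb'],
    'Sales': [100, 150, 120],
}

def sales_by_month_rowstore(rows):
    totals = {}
    for r in rows:                      # whole records are scanned
        totals[r['Month']] = totals.get(r['Month'], 0) + r['Sales']
    return totals

def sales_by_month_columnstore(columns):
    totals = {}
    for month, sales in zip(columns['Month'], columns['Sales']):
        totals[month] = totals.get(month, 0) + sales   # only two columns touched
    return totals
```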
When we cover cloud data stores later in this chapter we will see some of the similarities between their meta-data indexes and B+-tree indexes, as well as between their data organization structures and column-oriented databases.
Over the years database systems have evolved toward exploiting the parallel computing capabilities of
multi-processor servers as well as harnessing the aggregate computing power of clusters of servers connected by a high-speed network. Figure 4.7 illustrates three parallel/distributed database architectures: The shared-memory architecture is for machines with many CPUs (each possibly having many processing ‘cores’), where the memory address space is shared and managed by a symmetric multi-processing operating system that schedules processes in parallel, exploiting all the processors. The shared-nothing architecture assumes a cluster of independent servers, each with its own disk, connected by a network. A shared-disk architecture is somewhere in between, with a cluster of servers sharing storage through high-speed network storage, such as a NAS (network attached storage) or a SAN (storage area network), interconnected via standard Ethernet, or faster Fibre Channel or InfiniBand connections. Parallel database systems capable of exploiting any of these parallel architectures have existed since the 80s. These systems parallelize SQL queries so that they execute efficiently, exploiting multiple processors [46]. In the case of shared-nothing architectures, tables are partitioned and distributed across processing nodes, with the SQL optimizer handling distributed joins as best possible. Each of the traditional transaction-processing databases, Oracle, DB2, and SQL Server, supports parallelism in various ways, as do specialized systems designed for data warehousing, such as Vertica, Netezza, and Teradata.
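As a toy sketch of the shared-nothing approach, the code below hash-partitions a table across hypothetical nodes and computes a partial aggregate on each partition before merging the results, roughly the way a parallel SQL engine parallelizes a GROUP BY; the node count and data layout are invented.

```python
from collections import defaultdict

NUM_NODES = 3   # hypothetical shared-nothing cluster size

def partition(rows, key):
    """Hash-partition rows across NUM_NODES by the given key column."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[hash(row[key]) % NUM_NODES].append(row)
    return nodes

def local_sum(node_rows, group_col, measure_col):
    """Partial aggregate computed independently on each node."""
    totals = defaultdict(int)
    for row in node_rows:
        totals[row[group_col]] += row[measure_col]
    return totals

def parallel_group_by_sum(rows, group_col, measure_col):
    partials = [local_sum(node_rows, group_col, measure_col)
                for node_rows in partition(rows, group_col)]
    # Because rows were partitioned on the grouping column, each group lives
    # entirely on one node, so the partial results can simply be merged.
    merged = {}
    for p in partials:
        merged.update(p)
    return merged
```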
Traditional relational databases are designed to support high-volume transaction processing involving many, possibly concurrent, record-level insertions and updates. Supporting concurrent access while ensuring that conflicting actions by simultaneous users do not result in inconsistency is the responsibility of the transaction management layer of the database system, which ensures ‘isolation’ between different transactions through a variety of locking strategies. In the case of parallel and distributed architectures, locking strategies are further complicated since they involve communication between processors via the well-known ‘two-phase commit’ protocol [46].
It is important to note that the parallel database systems developed as extensions to traditional relational databases were designed either for specially constructed parallel architectures, such as Netezza, or for closely coupled clusters of at most a few dozen processors. At this scale, the chance of system failure due to faults in any of the components could be sufficiently compensated for by transferring control to a ‘hot-standby’ system in the case of transaction processing, or by restarting the computations in the case of data warehousing applications. As we shall see below, a different approach is required to exploit parallelism at a significantly larger scale.