Cloud Storage

Using Google App Engine and Google distributed BigTable storage
Amazon S3
S3 scaling, performance, and retries
hadoop
HBase Hadoop Database - BigTable
Cascading
Google BigTable Webservice
Terracotta
Terracotta EhCache example
Nirvanix Storage Delivery Network
Mosso Rackspace

Sherpa - Yahoo Cloud Storage - CAP theorem
Yahoo Developer Network Blog

Cloudies Awards


apache-cassandra

Notes on Distributed Keystores
MySQL
Project Voldemort
LightCloud
CouchDB
Tokyo Cabinet


Google BigTable:

One big difference between BigTable and relational databases is how transactions are handled. Row-level transactions are possible in BigTable, but not any other type of transactions.

The unsung hero of the day that really makes BigTable shine, however, is Chubby. Chubby is a lock-providing service that is highly available and allows many difficult design decisions to be abstracted away from BigTable. When a BigTable server (master or slave) comes up, it reserves a lock through Chubby with its name and path, and renews the lock periodically. This makes it trivial for the master to determine which nodes are up and down, as it can just query Chubby to see who has locks (thus these nodes are up). Since the BigTable master also knows the locations of the slaves, it can trivially determine if Chubby is down (it would be unable to see who has locks in Chubby but still be able to contact them).


Building a database on S3:


All protocols proposed in this work were designed to support a large number of concurrent clients; possibly in the thousands or even millions. It is assumed that the utility provider (i.e., S3) can support such a large number of clients and guarantees (pretty much) constant response times, independent of the number of concurrent clients. Furthermore, the utility provider guarantees a high degree of availability and durability; that is, clients can read and write pages from/to S3 at any moment in time without being blocked and updates are never lost by the utility provider. All these properties must be preserved in the application stack at the client. As a result, the protocols must be designed in such a way that any client can fail at any time without blocking any other client. That is, clients are never allowed to hold any locks that would block the execution at other clients. Furthermore, clients are stateless. They may cache data from S3, but the worst thing that can happen if a client fails is that all the work of that client is lost. Obviously, fulfilling all these requirements comes at a cost: eventual consistency [19]. That is, it might take a while before the updates of one client become visible at other clients. Furthermore, ANSI SQL-style isolation and serialization [3] are impossible to achieve under these requirements. Following [21], we believe that strict consistency and ACID transactions are not needed for most Web-based applications, whereas scalability and availability are a must.


S3 Integrity, Locking, and Transactions


S3 provides reliability and availability by replicating data to many servers that could be geographically dispersed. Although updates on a single key are atomic, only eventual consistency is guaranteed. Fetching a key’s value might return stale data. Additionally, S3 does not support neither locking nor multi-key atomic updates or transactions. These shortcomings could present obstacles for multi-user settings where key-value pairs on S3 might be constantly edited. Furthermore, S3 provides little tangible guarantees that data retrieved on a particular read faithfully corresponds to the last put operation on that key (data integrity).

To exemplify these problems, imagine a scenario where a company uses S3 to store internal files that are shared between multiple employees. Employees may retrieve and edit any number of files, some of which might also be editable by others. In order to maintain a serializable view
of the file system, users must ensure that they are dealing with the latest version of a file when modifying it. Serializability also requires the use of locks to prevent simultaneous conflicting updates to a single file. Furthermore, as for many modern file systems, transactions are necessary to group multiple updates and be able to rollback undesired outcomes. Multi-key locks and transactions are also necessary since a single file operation might span multiple key-value pairs for data and meta-data. Finally, file system users need the integrity of their data to be preserved.

Yahoo Sherpa - CAP Theorem

There is something called CAP theorem that imposes a "restriction" in all distributed systems. Using the simpler explanation offered by Mr. Negrin, the CAP theorem restrictions can be explained as follows.

"unavoidable trade-offs between consistency (all records are the same in all replicas), availability (all replicas can accept updates or inserts), and tolerance of network partitions (the system still functions when distributed replicas cannot talk to each other)."

In order to take care of the limitations imposed by the CAP theorem, Yahoo team has chosen the tradeoff to be between consistency and availability. They offer the users various options to choose between consistency and availability and then offer tools to minimize the impact of their choice on the other parameter.

--------------------------------------

With so-called NoSQL technologies MongoDB, CouchDB, Riak, Cassandra, Neo4J, and others gaining attention as viable approaches for "big datasets," there are now opportunities to try something different. Here are several Manning books that can help you stay current:

* Hadoop in Action--In final production
* Lucene in Action, Second Edition--Print or ebook
* MongoDB in Action--Coming soon!
* Mahout in Action--OSCON Speaker!--Manning Author Robin Anil is giving a talk on "Mammoth Scale Machine Learning," where he covers how Mahout's machine learning algorithms help developers classify, cluster, and recommend in big data systems.

Comments

Popular posts from this blog

Oracle JDBC ReadTimeout QueryTimeout

Sites, Newsletters, and Blogs

Locks held on Oracle for hours after sessions abnormally terminated by node failure