当前位置：文档库 › 谷歌bigtable原文+中文注解

谷歌bigtable原文+中文注解

Bigtable:A Distributed Storage System for Structured Data

Fay Chang,Jeffrey Dean,Sanjay Ghemawat,Wilson C.Hsieh,Deborah A.Wallach

Mike Burrows,Tushar Chandra,Andrew Fikes,Robert E.Gruber

{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,?kes,gruber }@https://www.wendangku.net/doc/404075027.html,

Google,Inc.

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size:petabytes of data across thousands of commodity servers.Many projects at Google store data in Bigtable,including web indexing,Google Earth,and Google Fi-nance.These applications place very different demands on Bigtable,both in terms of data size (from URLs

to web pages to satellite imagery)and latency requirements (from backend bulk processing to real-time data serving).Despite these varied demands,Bigtable has successfully provided a ?exible,high-performance solution for all of these Google products.In this paper we describe the sim-ple data model provided by Bigtable,which gives clients dynamic control over data layout and format,and we de-scribe the design and implementation of Bigtable.

1Introduction

Over the last two and a half years we have designed,implemented,and deployed a distributed storage system for managing structured data at Google called Bigtable.Bigtable is designed to reliably scale to petabytes of data and thousands of machines.Bigtable has achieved several goals:wide applicability,scalability,high per-formance,and high availability.Bigtable is used by more than sixty Google products and projects,includ-ing Google Analytics,Google Finance,Orkut,Person-alized Search,Writely,and Google Earth.These prod-ucts use Bigtable for a variety of demanding workloads,which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.The Bigtable clusters used by these products span a wide range of con?gurations,from a handful to thousands of servers,and store up to several hundred terabytes of data.In many ways,Bigtable resembles a database:it shares many implementation strategies with databases.Paral-lel databases [14]and main-memory databases [13]have achieved scalability and high performance,but Bigtable provides a different interface than such systems.Bigtable does not support a full relational data model;instead,it provides clients with a simple data model that supports dynamic control over data layout and format,and al-lows clients to reason about the locality properties of the data represented in the underlying storage.Data is in-dexed using row and column names that can be arbitrary strings.Bigtable also treats data as uninterpreted strings,although clients often serialize various forms of struc-tured and semi-structured data into these strings.Clients can control the locality of their data through careful choices in their schemas.Finally,Bigtable schema pa-rameters let clients dynamically control whether to serve data out of memory or from disk.

Section 2describes the data model in more detail,and Section 3provides an overview of the client API.Sec-tion 4brie?y describes the underlying Google infrastruc-ture on which Bigtable depends.Section 5describes the fundamentals of the Bigtable implementation,and Sec-tion 6describes some of the re?nements that we made to improve Bigtable’s performance.Section 7provides measurements of Bigtable’s performance.We describe several examples of how Bigtable is used at Google in Section 8,and discuss some lessons we learned in designing and supporting Bigtable in Section 9.Fi-nally,Section 10describes related work,and Section 11presents our conclusions.

2Data Model

A Bigtable is a sparse,distributed,persistent multi-dimensional sorted map.The map is indexed by a row key,column key,and a timestamp;each value in the map is an uninterpreted array of bytes.

(row:string,column:string,time:int64)→string

"https://www.wendangku.net/doc/404075027.html,n.www"

Figure1:A slice of an example table that stores Web pages.The row name is a reversed URL.The contents column family con-tains the page contents,and the anchor column family contains the text of any anchors that reference the https://www.wendangku.net/doc/404075027.html,N’s home page is referenced by both the Sports Illustrated and the MY-look home pages,so the row contains columns named anchor:https://www.wendangku.net/doc/404075027.html, and anchor:my.look.ca.Each anchor cell has one version;the contents column has three versions,at timestamps t3,t5,and t6.

We settled on this data model after examining a variety of potential uses of a Bigtable-like system.As one con-crete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects;let us call this particular table the Webtable.In Webtable,we would use URLs as row keys,various aspects of web pages as column names,and store the contents of the web pages in the contents:col-umn under the timestamps when they were fetched,as illustrated in Figure1.

Rows

The row keys in a table are arbitrary strings(currently up to64KB in size,although10-100bytes is a typical size for most of our users).Every read or write of data under a single row key is atomic(regardless of the number of different columns being read or written in the row),a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.

Bigtable maintains data in lexicographic order by row key.The row range for a table is dynamically partitioned. Each row range is called a tablet,which is the unit of dis-tribution and load balancing.As a result,reads of short row ranges are ef?cient and typically require communi-cation with only a small number of machines.Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses.For example,in Webtable,pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs.For example,we store data for https://www.wendangku.net/doc/404075027.html,/index.html under the key com.google.maps/index.html.Storing pages from the same domain near each other makes some host and domain analyses more ef?cient.Column Families

Column keys are grouped into sets called column fami-lies,which form the basic unit of access control.All data stored in a column family is usually of the same type(we compress data in the same column family together).A column family must be created before data can be stored under any column key in that family;after a family has been created,any column key within the family can be used.It is our intent that the number of distinct column families in a table be small(in the hundreds at most),and that families rarely change during operation.In contrast, a table may have an unbounded number of columns.

A column key is named using the following syntax: family:quali?er.Column family names must be print-able,but quali?ers may be arbitrary strings.An exam-ple column family for the Webtable is language,which stores the language in which a web page was written.We use only one column key in the language family,and it stores each web page’s language ID.Another useful col-umn family for this table is anchor;each column key in this family represents a single anchor,as shown in Fig-ure1.The quali?er is the name of the referring site;the cell contents is the link text.

Access control and both disk and memory account-ing are performed at the column-family level.In our Webtable example,these controls allow us to manage several different types of applications:some that add new base data,some that read the base data and create derived column families,and some that are only allowed to view existing data(and possibly not even to view all of the existing families for privacy reasons).

Timestamps

Each cell in a Bigtable can contain multiple versions of the same data;these versions are indexed by timestamp. Bigtable timestamps are64-bit integers.They can be as-signed by Bigtable,in which case they represent“real time”in microseconds,or be explicitly assigned by client

//Open the table

Table*T=OpenOrDie("/bigtable/web/webtable"); //Write a new anchor and delete an old anchor RowMutation r1(T,"https://www.wendangku.net/doc/404075027.html,n.www");

r1.Set("anchor:https://www.wendangku.net/doc/404075027.html,","CNN");

r1.Delete("anchor:https://www.wendangku.net/doc/404075027.html,");

Operation op;

Apply(&op,&r1);

Figure2:Writing to Bigtable. applications.Applications that need to avoid collisions must generate unique timestamps themselves.Different versions of a cell are stored in decreasing timestamp or-der,so that the most recent versions can be read?rst. To make the management of versioned data less oner-ous,we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept,or that only new-enough versions be kept(e.g.,only keep values that were written in the last seven days).

In our Webtable example,we set the timestamps of the crawled pages stored in the contents:column to the times at which these page versions were actually crawled.The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

3API

The Bigtable API provides functions for creating and deleting tables and column families.It also provides functions for changing cluster,table,and column family metadata,such as access control rights.

Client applications can write or delete values in Bigtable,look up values from individual rows,or iter-ate over a subset of the data in a table.Figure2shows C++code that uses a RowMutation abstraction to per-form a series of updates.(Irrelevant details were elided to keep the example short.)The call to Apply performs an atomic mutation to the Webtable:it adds one anchor to https://www.wendangku.net/doc/404075027.html, and deletes a different anchor. Figure3shows C++code that uses a Scanner ab-straction to iterate over all anchors in a particular row. Clients can iterate over multiple column families,and there are several mechanisms for limiting the rows, columns,and timestamps produced by a scan.For ex-ample,we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*https://www.wendangku.net/doc/404075027.html,,or to only produce anchors whose timestamps fall within ten days of the current time.Scanner scanner(T);

ScanStream*stream;

stream=scanner.FetchColumnFamily("anchor"); stream->SetReturnAllVersions();

scanner.Lookup("https://www.wendangku.net/doc/404075027.html,n.www");

for(;!stream->Done();stream->Next()){ printf("%s%s%lld%s\n",

scanner.RowName(),

stream->ColumnName(),

stream->MicroTimestamp(),

stream->Value());

}

Figure3:Reading from Bigtable.

Bigtable supports several other features that allow the user to manipulate data in more complex ways.First, Bigtable supports single-row transactions,which can be used to perform atomic read-modify-write sequences on data stored under a single row key.Bigtable does not cur-rently support general transactions across row keys,al-though it provides an interface for batching writes across row keys at the clients.Second,Bigtable allows cells to be used as integer counters.Finally,Bigtable sup-ports the execution of client-supplied scripts in the ad-dress spaces of the servers.The scripts are written in a language developed at Google for processing data called Sawzall[28].At the moment,our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation,?ltering based on arbitrary expressions,and summariza-tion via a variety of operators.

Bigtable can be used with MapReduce[12],a frame-work for running large-scale parallel computations de-veloped at Google.We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

4Building Blocks

Bigtable is built on several other pieces of Google in-frastructure.Bigtable uses the distributed Google File System(GFS)[17]to store log and data?les.A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications.Bigtable de-pends on a cluster management system for scheduling jobs,managing resources on shared machines,dealing with machine failures,and monitoring machine status. The Google SSTable?le format is used internally to store Bigtable data.An SSTable provides a persistent, ordered immutable map from keys to values,where both keys and values are arbitrary byte strings.Operations are provided to look up the value associated with a speci?ed

key,and to iterate over all key/value pairs in a speci?ed key range.Internally,each SSTable contains a sequence of blocks (typically each block is 64KB in size,but this is con?gurable).A block index (stored at the end of the SSTable)is used to locate blocks;the index is loaded into memory when the SSTable is opened.A lookup can be performed with a single disk seek:we ?rst ?nd the appropriate block by performing a binary search in the in-memory index,and then reading the appropriate block from disk.Optionally,an SSTable can be com-pletely mapped into memory,which allows us to perform lookups and scans without touching disk.

Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8].A Chubby service consists of ?ve active replicas,one of which is elected to be the master and actively serve requests.The service is live when a majority of the replicas are running and can communicate with each other.Chubby uses the Paxos algorithm [9,23]to keep its replicas consistent in the face of failure.Chubby provides a namespace that consists of directories and small ?les.Each directory or ?le can be used as a lock,and reads and writes to a ?le are atomic.The Chubby client library provides consis-tent caching of Chubby ?les.Each Chubby client main-tains a session with a Chubby service.A client’s session expires if it is unable to renew its session lease within the lease expiration time.When a client’s session expires,it loses any locks and open handles.Chubby clients can also register callbacks on Chubby ?les and directories for noti?cation of changes or session expiration.

Bigtable uses Chubby for a variety of tasks:to ensure that there is at most one active master at any time;to store the bootstrap location of Bigtable data (see Sec-tion 5.1);to discover tablet servers and ?nalize tablet server deaths (see Section 5.2);to store Bigtable schema information (the column family information for each ta-ble);and to store access control lists.If Chubby becomes unavailable for an extended period of time,Bigtable be-comes unavailable.We recently measured this effect in 14Bigtable clusters spanning 11Chubby instances.The average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby out-ages or network issues)was 0.0047%.The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%.

5Implementation

The Bigtable implementation has three major compo-nents:a library that is linked into every client,one mas-ter server,and many tablet servers.Tablet servers can be dynamically added (or removed)from a cluster to acco-modate changes in workloads.

The master is responsible for assigning tablets to tablet servers,detecting the addition and expiration of tablet servers,balancing tablet-server load,and garbage col-lection of ?les in GFS.In addition,it handles schema changes such as table and column family creations.

Each tablet server manages a set of tablets (typically we have somewhere between ten to a thousand tablets per tablet server).The tablet server handles read and write requests to the tablets that it has loaded,and also splits tablets that have grown too large.

As with many single-master distributed storage sys-tems [17,21],client data does not move through the mas-ter:clients communicate directly with tablet servers for reads and writes.Because Bigtable clients do not rely on the master for tablet location information,most clients never communicate with the master.As a result,the mas-ter is lightly loaded in practice.

A Bigtable cluster stores a number of tables.Each ta-ble consists of a set of tablets,and each tablet contains all data associated with a row range.Initially,each table consists of just one tablet.As a table grows,it is auto-matically split into multiple tablets,each approximately 100-200M

B in size by default.

5.1Tablet Location

We use a three-level hierarchy analogous to that of a B +-tree [10]to store tablet location information (Figure 4).

Figure 4:Tablet location hierarchy.

The ?rst level is a ?le stored in Chubby that contains

the location of the root tablet .The root tablet contains the location of all tablets in a special METADATA table.Each METADATA tablet contains the location of a set of user tablets.The root tablet is just the ?rst tablet in the METADATA table,but is treated specially—it is never split—to ensure that the tablet location hierarchy has no more than three levels.

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table

identi?er and its end row.Each METADATA row stores approximately1KB of data in memory.With a modest limit of128MB METADATA tablets,our three-level lo-cation scheme is suf?cient to address234tablets(or261 bytes in128MB tablets).

The client library caches tablet locations.If the client does not know the location of a tablet,or if it discov-ers that cached location information is incorrect,then it recursively moves up the tablet location hierarchy. If the client’s cache is empty,the location algorithm requires three network round-trips,including one read from Chubby.If the client’s cache is stale,the location algorithm could take up to six round-trips,because stale cache entries are only discovered upon misses(assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory,so no GFS accesses are required,we further reduce this cost in the common case by having the client library prefetch tablet locations:it reads the metadata for more than one tablet whenever it reads the METADATA table.

We also store secondary information in the METADATA table,including a log of all events per-taining to each tablet(such as when a server begins serving it).This information is helpful for debugging and performance analysis.

5.2Tablet Assignment

Each tablet is assigned to one tablet server at a time.The master keeps track of the set of live tablet servers,and the current assignment of tablets to tablet servers,in-cluding which tablets are unassigned.When a tablet is unassigned,and a tablet server with suf?cient room for the tablet is available,the master assigns the tablet by sending a tablet load request to the tablet server. Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts,it creates,and acquires an exclusive lock on,a uniquely-named?le in a speci?c Chubby directory.The master monitors this directory (the servers directory)to discover tablet servers.A tablet server stops serving its tablets if it loses its exclusive lock: e.g.,due to a network partition that caused the server to lose its Chubby session.(Chubby provides an ef?cient mechanism that allows a tablet server to check whether it still holds its lock without incurring network traf?c.)A tablet server will attempt to reacquire an ex-clusive lock on its?le as long as the?le still exists.If the ?le no longer exists,then the tablet server will never be able to serve again,so it kills itself.Whenever a tablet server terminates(e.g.,because the cluster management system is removing the tablet server’s machine from the cluster),it attempts to release its lock so that the master will reassign its tablets more quickly.

The master is responsible for detecting when a tablet server is no longer serving its tablets,and for reassign-ing those tablets as soon as possible.To detect when a tablet server is no longer serving its tablets,the master periodically asks each tablet server for the status of its lock.If a tablet server reports that it has lost its lock, or if the master was unable to reach a server during its last several attempts,the master attempts to acquire an exclusive lock on the server’s?le.If the master is able to acquire the lock,then Chubby is live and the tablet server is either dead or having trouble reaching Chubby,so the master ensures that the tablet server can never serve again by deleting its server?le.Once a server’s?le has been deleted,the master can move all the tablets that were pre-viously assigned to that server into the set of unassigned tablets.To ensure that a Bigtable cluster is not vulnera-ble to networking issues between the master and Chubby, the master kills itself if its Chubby session expires.How-ever,as described above,master failures do not change the assignment of tablets to tablet servers.

When a master is started by the cluster management system,it needs to discover the current tablet assign-ments before it can change them.The master executes the following steps at startup.(1)The master grabs a unique master lock in Chubby,which prevents con-current master instantiations.(2)The master scans the servers directory in Chubby to?nd the live servers.

(3)The master communicates with every live tablet server to discover what tablets are already assigned to each server.(4)The master scans the METADATA table to learn the set of tablets.Whenever this scan encounters a tablet that is not already assigned,the master adds the tablet to the set of unassigned tablets,which makes the tablet eligible for tablet assignment.

One complication is that the scan of the METADATA table cannot happen until the METADATA tablets have been assigned.Therefore,before starting this scan(step 4),the master adds the root tablet to the set of unassigned tablets if an assignment for the root tablet was not dis-covered during step3.This addition ensures that the root tablet will be assigned.Because the root tablet contains the names of all METADATA tablets,the master knows about all of them after it has scanned the root tablet. The set of existing tablets only changes when a ta-ble is created or deleted,two existing tablets are merged to form one larger tablet,or an existing tablet is split into two smaller tablets.The master is able to keep track of these changes because it initiates all but the last. Tablet splits are treated specially since they are initi-ated by a tablet server.The tablet server commits the split by recording information for the new tablet in the METADATA table.When the split has committed,it noti-?es the master.In case the split noti?cation is lost(either

because the tablet server or the master died),the master detects the new tablet when it asks a tablet server to load the tablet that has now split.The tablet server will notify the master of the split,because the tablet entry it ?nds in the METADATA table will specify only a portion of the tablet that the master asked it to load.

5.3Tablet Serving

The persistent state of a tablet is stored in GFS,as illus-trated in Figure 5.Updates are committed to a commit log that stores redo records.Of these updates,the re-cently committed ones are stored in memory in a sorted buffer called a memtable ;the older updates are stored in a sequence of SSTables.To recover a tablet,a tablet server

tablet log GFS

Memory Write Op

SSTable Files

memtable

Read Op

Figure 5:Tablet Representation

reads its metadata from the METADATA table.This meta-data contains the list of SSTables that comprise a tablet and a set of a redo points,which are pointers into any commit logs that may contain data for the tablet.The server reads the indices of the SSTables into memory and reconstructs the memtable by applying all of the updates that have committed since the redo points.

When a write operation arrives at a tablet server,the server checks that it is well-formed,and that the sender is authorized to perform the mutation.Authorization is performed by reading the list of permitted writers from a Chubby ?le (which is almost always a hit in the Chubby client cache).A valid mutation is written to the commit log.Group commit is used to improve the throughput of lots of small mutations [13,16].After the write has been committed,its contents are inserted into the memtable.When a read operation arrives at a tablet server,it is similarly checked for well-formedness and proper autho-rization.A valid read operation is executed on a merged view of the sequence of SSTables and the memtable.Since the SSTables and the memtable are lexicograph-ically sorted data structures,the merged view can be formed ef?ciently.

Incoming read and write operations can continue while tablets are split and merged. 5.4Compactions

As write operations execute,the size of the memtable in-creases.When the memtable size reaches a threshold,the memtable is frozen,a new memtable is created,and the frozen memtable is converted to an SSTable and written to GFS.This minor compaction process has two goals:it shrinks the memory usage of the tablet server,and it reduces the amount of data that has to be read from the commit log during recovery if this server dies.Incom-ing read and write operations can continue while com-pactions occur.

Every minor compaction creates a new SSTable.If this behavior continued unchecked,read operations might need to merge updates from an arbitrary number of SSTables.Instead,we bound the number of such ?les by periodically executing a merging compaction in the background.A merging compaction reads the contents of a few SSTables and the memtable,and writes out a new SSTable.The input SSTables and memtable can be discarded as soon as the compaction has ?nished.

A merging compaction that rewrites all SSTables into exactly one SSTable is called a major compaction .SSTables produced by non-major compactions can con-tain special deletion entries that suppress deleted data in older SSTables that are still live.A major compaction,on the other hand,produces an SSTable that contains no deletion information or deleted data.Bigtable cy-cles through all of its tablets and regularly applies major compactions to them.These major compactions allow Bigtable to reclaim resources used by deleted data,and also allow it to ensure that deleted data disappears from the system in a timely fashion,which is important for services that store sensitive data.

6Re?nements

The implementation described in the previous section required a number of re?nements to achieve the high performance,availability,and reliability required by our users.This section describes portions of the implementa-tion in more detail in order to highlight these re?nements.Locality groups

Clients can group multiple column families together into a locality group .A separate SSTable is generated for each locality group in each tablet.Segregating column families that are not typically accessed together into sep-arate locality groups enables more ef?cient reads.For example,page metadata in Webtable (such as language and checksums)can be in one locality group,and the contents of the page can be in a different group:an ap-

plication that wants to read the metadata does not need to read through all of the page contents.

In addition,some useful tuning parameters can be speci?ed on a per-locality group basis.For example,a lo-cality group can be declared to be in-memory.SSTables for in-memory locality groups are loaded lazily into the memory of the tablet server.Once loaded,column fam-ilies that belong to such locality groups can be read without accessing the disk.This feature is useful for small pieces of data that are accessed frequently:we use it internally for the location column family in the METADATA table.

Compression

Clients can control whether or not the SSTables for a locality group are compressed,and if so,which com-pression format is used.The user-speci?ed compres-sion format is applied to each SSTable block(whose size is controllable via a locality group speci?c tuning pa-rameter).Although we lose some space by compress-ing each block separately,we bene?t in that small por-tions of an SSTable can be read without decompress-ing the entire?le.Many clients use a two-pass custom compression scheme.The?rst pass uses Bentley and McIlroy’s scheme[6],which compresses long common strings across a large window.The second pass uses a fast compression algorithm that looks for repetitions in a small16KB window of the data.Both compression passes are very fast—they encode at100–200MB/s,and decode at400–1000MB/s on modern machines.

Even though we emphasized speed instead of space re-duction when choosing our compression algorithms,this two-pass compression scheme does surprisingly well. For example,in Webtable,we use this compression scheme to store Web page contents.In one experiment, we stored a large number of documents in a compressed locality group.For the purposes of the experiment,we limited ourselves to one version of each document in-stead of storing all versions available to us.The scheme achieved a10-to-1reduction in space.This is much better than typical Gzip reductions of3-to-1or4-to-1 on HTML pages because of the way Webtable rows are laid out:all pages from a single host are stored close to each other.This allows the Bentley-McIlroy algo-rithm to identify large amounts of shared boilerplate in pages from the same host.Many applications,not just Webtable,choose their row names so that similar data ends up clustered,and therefore achieve very good com-pression https://www.wendangku.net/doc/404075027.html,pression ratios get even better when we store multiple versions of the same value in Bigtable.Caching for read performance

To improve read performance,tablet servers use two lev-els of caching.The Scan Cache is a higher-level cache that caches the key-value pairs returned by the SSTable interface to the tablet server code.The Block Cache is a lower-level cache that caches SSTables blocks that were read from GFS.The Scan Cache is most useful for appli-cations that tend to read the same data repeatedly.The Block Cache is useful for applications that tend to read data that is close to the data they recently read(e.g.,se-quential reads,or random reads of different columns in the same locality group within a hot row).

Bloom?lters

As described in Section5.3,a read operation has to read from all SSTables that make up the state of a tablet. If these SSTables are not in memory,we may end up doing many disk accesses.We reduce the number of accesses by allowing clients to specify that Bloom?l-ters[7]should be created for SSTables in a particu-lar locality group.A Bloom?lter allows us to ask whether an SSTable might contain any data for a spec-i?ed row/column pair.For certain applications,a small amount of tablet server memory used for storing Bloom ?lters drastically reduces the number of disk seeks re-quired for read operations.Our use of Bloom?lters also implies that most lookups for non-existent rows or columns do not need to touch disk.

Commit-log implementation

If we kept the commit log for each tablet in a separate log?le,a very large number of?les would be written concurrently in GFS.Depending on the underlying?le system implementation on each GFS server,these writes could cause a large number of disk seeks to write to the different physical log?les.In addition,having separate log?les per tablet also reduces the effectiveness of the group commit optimization,since groups would tend to be smaller.To?x these issues,we append mutations to a single commit log per tablet server,co-mingling mutations for different tablets in the same physical log ?le[18,20].

Using one log provides signi?cant performance ben-e?ts during normal operation,but it complicates recov-ery.When a tablet server dies,the tablets that it served will be moved to a large number of other tablet servers: each server typically loads a small number of the orig-inal server’s tablets.To recover the state for a tablet, the new tablet server needs to reapply the mutations for that tablet from the commit log written by the original tablet server.However,the mutations for these tablets

were co-mingled in the same physical log?le.One ap-proach would be for each new tablet server to read this full commit log?le and apply just the entries needed for the tablets it needs to recover.However,under such a scheme,if100machines were each assigned a single tablet from a failed tablet server,then the log?le would be read100times(once by each server).

We avoid duplicating log reads by?rst sort-ing the commit log entries in order of the keys table,row name,log sequence number .In the sorted output,all mutations for a particular tablet are contiguous and can therefore be read ef?ciently with one disk seek followed by a sequential read.To parallelize the sorting,we partition the log?le into64MB seg-ments,and sort each segment in parallel on different tablet servers.This sorting process is coordinated by the master and is initiated when a tablet server indicates that it needs to recover mutations from some commit log?le. Writing commit logs to GFS sometimes causes perfor-mance hiccups for a variety of reasons(e.g.,a GFS server machine involved in the write crashes,or the network paths traversed to reach the particular set of three GFS servers is suffering network congestion,or is heavily loaded).To protect mutations from GFS latency spikes, each tablet server actually has two log writing threads, each writing to its own log?le;only one of these two threads is actively in use at a time.If writes to the ac-tive log?le are performing poorly,the log?le writing is switched to the other thread,and mutations that are in the commit log queue are written by the newly active log writing thread.Log entries contain sequence numbers to allow the recovery process to elide duplicated entries resulting from this log switching process.

Speeding up tablet recovery

If the master moves a tablet from one tablet server to another,the source tablet server?rst does a minor com-paction on that tablet.This compaction reduces recov-ery time by reducing the amount of uncompacted state in the tablet server’s commit log.After?nishing this com-paction,the tablet server stops serving the tablet.Before it actually unloads the tablet,the tablet server does an-other(usually very fast)minor compaction to eliminate any remaining uncompacted state in the tablet server’s log that arrived while the?rst minor compaction was being performed.After this second minor compaction is complete,the tablet can be loaded on another tablet server without requiring any recovery of log entries.

Exploiting immutability

Besides the SSTable caches,various other parts of the Bigtable system have been simpli?ed by the fact that all of the SSTables that we generate are immutable.For ex-ample,we do not need any synchronization of accesses to the?le system when reading from SSTables.As a re-sult,concurrency control over rows can be implemented very ef?ciently.The only mutable data structure that is accessed by both reads and writes is the memtable.To re-duce contention during reads of the memtable,we make each memtable row copy-on-write and allow reads and writes to proceed in parallel.

Since SSTables are immutable,the problem of perma-nently removing deleted data is transformed to garbage collecting obsolete SSTables.Each tablet’s SSTables are registered in the METADATA table.The master removes obsolete SSTables as a mark-and-sweep garbage collec-tion[25]over the set of SSTables,where the METADATA table contains the set of roots.

Finally,the immutability of SSTables enables us to split tablets quickly.Instead of generating a new set of SSTables for each child tablet,we let the child tablets share the SSTables of the parent tablet.

7Performance Evaluation

We set up a Bigtable cluster with N tablet servers to measure the performance and scalability of Bigtable as N is varied.The tablet servers were con?gured to use1 GB of memory and to write to a GFS cell consisting of 1786machines with two400GB IDE hard drives each. N client machines generated the Bigtable load used for these tests.(We used the same number of clients as tablet servers to ensure that clients were never a bottleneck.) Each machine had two dual-core Opteron2GHz chips, enough physical memory to hold the working set of all running processes,and a single gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately100-200Gbps of aggregate bandwidth available at the root.All of the ma-chines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.

The tablet servers and master,test clients,and GFS servers all ran on the same set of machines.Every ma-chine ran a GFS server.Some of the machines also ran either a tablet server,or a client process,or processes from other jobs that were using the pool at the same time as these experiments.

R is the distinct number of Bigtable row keys involved in the test.R was chosen so that each benchmark read or wrote approximately1GB of data per tablet server. The sequential write benchmark used row keys with names0to R?1.This space of row keys was parti-tioned into10N equal-sized ranges.These ranges were assigned to the N clients by a central scheduler that as-

Experiment 50500random reads

593241random reads (mem)85116250random writes 37452000sequential reads 24632469sequential writes 36231905scans 105267843

Number of tablet servers

2M 3M 4M V a l u e s r e a d /w r i t t e n p e r s e c o n d

Figure 6:Number of 1000-byte values read/written per second.The table shows the rate per tablet server;the graph shows the

aggregate rate.

signed the next available range to a client as soon as the client ?nished processing the previous range assigned to it.This dynamic assignment helped mitigate the effects of performance variations caused by other processes run-ning on the client machines.We wrote a single string un-der each row key.Each string was generated randomly and was therefore uncompressible.In addition,strings under different row key were distinct,so no cross-row compression was possible.The random write benchmark was similar except that the row key was hashed modulo R immediately before writing so that the write load was spread roughly uniformly across the entire row space for the entire duration of the benchmark.

The sequential read benchmark generated row keys in exactly the same way as the sequential write benchmark,but instead of writing under the row key,it read the string stored under the row key (which was written by an earlier invocation of the sequential write benchmark).Similarly,the random read benchmark shadowed the operation of the random write benchmark.

The scan benchmark is similar to the sequential read benchmark,but uses support provided by the Bigtable API for scanning over all values in a row https://www.wendangku.net/doc/404075027.html,-ing a scan reduces the number of RPCs executed by the benchmark since a single RPC fetches a large sequence of values from a tablet server.

The random reads (mem)benchmark is similar to the random read benchmark,but the locality group that con-tains the benchmark data is marked as in-memory ,and therefore the reads are satis?ed from the tablet server’s memory instead of requiring a GFS read.For just this benchmark,we reduced the amount of data per tablet server from 1GB to 100MB so that it would ?t com-fortably in the memory available to the tablet server.Figure 6shows two views on the performance of our benchmarks when reading and writing 1000-byte values to Bigtable.The table shows the number of operations per second per tablet server;the graph shows the aggre-gate number of operations per second.Single tablet-server performance

Let us ?rst consider performance with just one tablet server.Random reads are slower than all other operations by an order of magnitude or more.Each random read in-volves the transfer of a 64KB SSTable block over the network from GFS to a tablet server,out of which only a single 1000-byte value is used.The tablet server executes approximately 1200reads per second,which translates into approximately 75MB/s of data read from GFS.This bandwidth is enough to saturate the tablet server CPUs because of overheads in our networking stack,SSTable parsing,and Bigtable code,and is also almost enough to saturate the network links used in our system.Most Bigtable applications with this type of an access pattern reduce the block size to a smaller value,typically 8KB.Random reads from memory are much faster since each 1000-byte read is satis?ed from the tablet server’s local memory without fetching a large 64KB block from GFS.

Random and sequential writes perform better than ran-dom reads since each tablet server appends all incoming writes to a single commit log and uses group commit to stream these writes ef?ciently to GFS.There is no sig-ni?cant difference between the performance of random writes and sequential writes;in both cases,all writes to the tablet server are recorded in the same commit log.Sequential reads perform better than random reads since every 64KB SSTable block that is fetched from GFS is stored into our block cache,where it is used to serve the next 64read requests.

Scans are even faster since the tablet server can return a large number of values in response to a single client RPC,and therefore RPC overhead is amortized over a large number of values.Scaling

Aggregate throughput increases dramatically,by over a factor of a hundred,as we increase the number of tablet servers in the system from 1to 500.For example,the

#of tablet servers

259

20..49

100..499

Table1:Distribution of number of tablet servers in Bigtable clusters.

performance of random reads from memory increases by almost a factor of300as the number of tablet server in-creases by a factor of500.This behavior occurs because the bottleneck on performance for this benchmark is the individual tablet server CPU.

However,performance does not increase linearly.For most benchmarks,there is a signi?cant drop in per-server throughput when going from1to50tablet servers.This drop is caused by imbalance in load in multiple server con?gurations,often due to other processes contending for CPU and network.Our load balancing algorithm at-tempts to deal with this imbalance,but cannot do a per-fect job for two main reasons:rebalancing is throttled to reduce the number of tablet movements(a tablet is un-available for a short time,typically less than one second, when it is moved),and the load generated by our bench-marks shifts around as the benchmark progresses.

The random read benchmark shows the worst scaling (an increase in aggregate throughput by only a factor of 100for a500-fold increase in number of servers).This behavior occurs because(as explained above)we transfer one large64KB block over the network for every1000-byte read.This transfer saturates various shared1Gi-gabit links in our network and as a result,the per-server throughput drops signi?cantly as we increase the number of machines.

8Real Applications

As of August2006,there are388non-test Bigtable clus-ters running in various Google machine clusters,with a combined total of about24,500tablet servers.Table1 shows a rough distribution of tablet servers per cluster. Many of these clusters are used for development pur-poses and therefore are idle for signi?cant periods.One group of14busy clusters with8069total tablet servers saw an aggregate volume of more than1.2million re-quests per second,with incoming RPC traf?c of about 741MB/s and outgoing RPC traf?c of about16GB/s. Table2provides some data about a few of the tables currently in use.Some tables store data that is served to users,whereas others store data for batch processing; the tables range widely in total size,average cell size,percentage of data served from memory,and complexity of the table schema.In the rest of this section,we brie?y describe how three product teams use Bigtable.

8.1Google Analytics

Google Analytics(https://www.wendangku.net/doc/404075027.html,)is a service that helps webmasters analyze traf?c patterns at their web sites.It provides aggregate statistics,such as the number of unique visitors per day and the page views per URL per day,as well as site-tracking reports,such as the percentage of users that made a purchase,given that they earlier viewed a speci?c page.

To enable the service,webmasters embed a small JavaScript program in their web pages.This program is invoked whenever a page is visited.It records various information about the request in Google Analytics,such as a user identi?er and information about the page be-ing fetched.Google Analytics summarizes this data and makes it available to webmasters.

We brie?y describe two of the tables used by Google Analytics.The raw click table(?200TB)maintains a row for each end-user session.The row name is a tuple containing the website’s name and the time at which the session was created.This schema ensures that sessions that visit the same web site are contiguous,and that they are sorted chronologically.This table compresses to14% of its original size.

The summary table(?20TB)contains various prede-?ned summaries for each website.This table is gener-ated from the raw click table by periodically scheduled MapReduce jobs.Each MapReduce job extracts recent session data from the raw click table.The overall sys-tem’s throughput is limited by the throughput of GFS. This table compresses to29%of its original size.

8.2Google Earth

Google operates a collection of services that provide users with access to high-resolution satellite imagery of the world’s surface,both through the web-based Google Maps interface(https://www.wendangku.net/doc/404075027.html,)and through the Google Earth(https://www.wendangku.net/doc/404075027.html,)custom client soft-ware.These products allow users to navigate across the world’s surface:they can pan,view,and annotate satel-lite imagery at many different levels of resolution.This system uses one table to preprocess data,and a different set of tables for serving client data.

The preprocessing pipeline uses one table to store raw imagery.During preprocessing,the imagery is cleaned and consolidated into?nal serving data.This table con-tains approximately70terabytes of data and therefore is served from disk.The images are ef?ciently compressed already,so Bigtable compression is disabled.

Project Compression#Column%in

(TB)(billions)Groups sensitive?

Crawl11%160%

Crawl33%20%

Google Analytics29%10%

Google Analytics14%10%

Google Base31%2915%

Google Earth64%733%

Google Earth–80% Orkut–81% Personalized Search47%935%

some problems by removing assumptions made by one part of the system about another part.For example,we stopped assuming a given Chubby operation could return only one of a?xed set of errors.

Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used.For example,we initially planned to support general-purpose transactions in our API.Be-cause we did not have an immediate use for them,how-ever,we did not implement them.Now that we have many real applications running on Bigtable,we have been able to examine their actual needs,and have discov-ered that most applications require only single-row trans-actions.Where people have requested distributed trans-actions,the most important use is for maintaining sec-ondary indices,and we plan to add a specialized mech-anism to satisfy this need.The new mechanism will be less general than distributed transactions,but will be more ef?cient(especially for updates that span hundreds of rows or more)and will also interact better with our scheme for optimistic cross-data-center replication.

A practical lesson that we learned from supporting Bigtable is the importance of proper system-level mon-itoring(i.e.,monitoring both Bigtable itself,as well as the client processes using Bigtable).For example,we ex-tended our RPC system so that for a sample of the RPCs, it keeps a detailed trace of the important actions done on behalf of that RPC.This feature has allowed us to de-tect and?x many problems such as lock contention on tablet data structures,slow writes to GFS while com-mitting Bigtable mutations,and stuck accesses to the METADATA table when METADATA tablets are unavail-able.Another example of useful monitoring is that ev-ery Bigtable cluster is registered in Chubby.This allows us to track down all clusters,discover how big they are, see which versions of our software they are running,how much traf?c they are receiving,and whether or not there are any problems such as unexpectedly large latencies. The most important lesson we learned is the value of simple designs.Given both the size of our system (about100,000lines of non-test code),as well as the fact that code evolves over time in unexpected ways,we have found that code and design clarity are of immense help in code maintenance and debugging.One exam-ple of this is our tablet-server membership protocol.Our ?rst protocol was simple:the master periodically issued leases to tablet servers,and tablet servers killed them-selves if their lease expired.Unfortunately,this proto-col reduced availability signi?cantly in the presence of network problems,and was also sensitive to master re-covery time.We redesigned the protocol several times until we had a protocol that performed well.However, the resulting protocol was too complex and depended on the behavior of Chubby features that were seldom exer-cised by other applications.We discovered that we were spending an inordinate amount of time debugging ob-scure corner cases,not only in Bigtable code,but also in Chubby code.Eventually,we scrapped this protocol and moved to a newer simpler protocol that depends solely on widely-used Chubby features.

10Related Work

The Boxwood project[24]has components that overlap in some ways with Chubby,GFS,and Bigtable,since it provides for distributed agreement,locking,distributed chunk storage,and distributed B-tree storage.In each case where there is overlap,it appears that the Box-wood’s component is targeted at a somewhat lower level than the corresponding Google service.The Boxwood project’s goal is to provide infrastructure for building higher-level services such as?le systems or databases, while the goal of Bigtable is to directly support client applications that wish to store data.

Many recent projects have tackled the problem of pro-viding distributed storage or higher-level services over wide area networks,often at“Internet scale.”This in-cludes work on distributed hash tables that began with projects such as CAN[29],Chord[32],Tapestry[37], and Pastry[30].These systems address concerns that do not arise for Bigtable,such as highly variable bandwidth, untrusted participants,or frequent recon?guration;de-centralized control and Byzantine fault tolerance are not Bigtable goals.

In terms of the distributed data storage model that one might provide to application developers,we believe the key-value pair model provided by distributed B-trees or distributed hash tables is too limiting.Key-value pairs are a useful building block,but they should not be the only building block one provides to developers.The model we chose is richer than simple key-value pairs, and supports sparse semi-structured data.Nonetheless, it is still simple enough that it lends itself to a very ef?-cient?at-?le representation,and it is transparent enough (via locality groups)to allow our users to tune important behaviors of the system.

Several database vendors have developed parallel databases that can store large volumes of data.Oracle’s Real Application Cluster database[27]uses shared disks to store data(Bigtable uses GFS)and a distributed lock manager(Bigtable uses Chubby).IBM’s DB2Parallel Edition[4]is based on a shared-nothing[33]architecture similar to Bigtable.Each DB2server is responsible for a subset of the rows in a table which it stores in a local relational database.Both products provide a complete relational model with transactions.

Bigtable locality groups realize similar compression and disk read performance bene?ts observed for other systems that organize data on disk using column-based rather than row-based storage,including C-Store[1,34] and commercial products such as Sybase IQ[15,36], SenSage[31],KDB+[22],and the ColumnBM storage layer in MonetDB/X100[38].Another system that does vertical and horizontal data partioning into?at?les and achieves good data compression ratios is AT&T’s Day-tona database[19].Locality groups do not support CPU-cache-level optimizations,such as those described by Ailamaki[2].

The manner in which Bigtable uses memtables and SSTables to store updates to tablets is analogous to the way that the Log-Structured Merge Tree[26]stores up-dates to index data.In both systems,sorted data is buffered in memory before being written to disk,and reads must merge data from memory and disk.

C-Store and Bigtable share many characteristics:both systems use a shared-nothing architecture and have two different data structures,one for recent writes,and one for storing long-lived data,with a mechanism for mov-ing data from one form to the other.The systems dif-fer signi?cantly in their API:C-Store behaves like a relational database,whereas Bigtable provides a lower level read and write interface and is designed to support many thousands of such operations per second per server. C-Store is also a“read-optimized relational DBMS”, whereas Bigtable provides good performance on both read-intensive and write-intensive applications. Bigtable’s load balancer has to solve some of the same kinds of load and memory balancing problems faced by shared-nothing databases(e.g.,[11,35]).Our problem is somewhat simpler:(1)we do not consider the possibility of multiple copies of the same data,possibly in alternate forms due to views or indices;(2)we let the user tell us what data belongs in memory and what data should stay on disk,rather than trying to determine this dynamically;

(3)we have no complex queries to execute or optimize. 11Conclusions

We have described Bigtable,a distributed system for storing structured data at Google.Bigtable clusters have been in production use since April2005,and we spent roughly seven person-years on design and implementa-tion before that date.As of August2006,more than sixty projects are using Bigtable.Our users like the perfor-mance and high availability provided by the Bigtable im-plementation,and that they can scale the capacity of their clusters by simply adding more machines to the system as their resource demands change over time.

Given the unusual interface to Bigtable,an interest-ing question is how dif?cult it has been for our users to adapt to using it.New users are sometimes uncertain of how to best use the Bigtable interface,particularly if they are accustomed to using relational databases that support general-purpose transactions.Nevertheless,the fact that many Google products successfully use Bigtable demon-strates that our design works well in practice.

We are in the process of implementing several addi-tional Bigtable features,such as support for secondary indices and infrastructure for building cross-data-center replicated Bigtables with multiple master replicas.We have also begun deploying Bigtable as a service to prod-uct groups,so that individual groups do not need to main-tain their own clusters.As our service clusters scale, we will need to deal with more resource-sharing issues within Bigtable itself[3,5].

Finally,we have found that there are signi?cant ad-vantages to building our own storage solution at Google. We have gotten a substantial amount of?exibility from designing our own data model for Bigtable.In addi-tion,our control over Bigtable’s implementation,and the other Google infrastructure upon which Bigtable de-pends,means that we can remove bottlenecks and inef?-ciencies as they arise.

Acknowledgements

We thank the anonymous reviewers,David Nagle,and our shepherd Brad Calder,for their feedback on this pa-per.The Bigtable system has bene?ted greatly from the feedback of our many users within Google.In addition, we thank the following people for their contributions to Bigtable:Dan Aguayo,Sameer Ajmani,Zhifeng Chen, Bill Coughran,Mike Epstein,Healfdene Goguen,Robert Griesemer,Jeremy Hylton,Josh Hyman,Alex Khesin, Joanna Kulik,Alberto Lerner,Sherry Listgarten,Mike Maloney,Eduardo Pinheiro,Kathy Polizzi,Frank Yellin, and Arthur Zwiegincew.

References

[1]A BADI, D.J.,M ADDEN,S.R.,AND F ERREIRA,

M.C.Integrating compression and execution in column-oriented database systems.Proc.of SIGMOD(2006). [2]A ILAMAKI,A.,D E W ITT,D.J.,H ILL,M.D.,AND S K-

OUNAKIS,M.Weaving relations for cache performance.

In The VLDB Journal(2001),pp.169–180.

[3]B ANGA,G.,D RUSCHEL,P.,AND M OGUL,J.C.Re-

source containers:A new facility for resource manage-ment in server systems.In Proc.of the3rd OSDI(Feb.

1999),pp.45–58.

[4]B ARU, C.K.,F ECTEAU,G.,G OYAL, A.,H SIAO,

H.,J HINGRAN,A.,P ADMANABHAN,S.,C OPELAND,

G.P.,AND W ILSON,W.G.DB2parallel edition.IBM

Systems Journal34,2(1995),292–322.

[5]B AVIER,A.,B OWMAN,M.,C HUN,B.,C ULLER,D.,

K ARLIN,S.,P ETERSON,L.,R OSCOE,T.,S PALINK,T., AND W AWRZONIAK,M.Operating system support for planetary-scale network services.In Proc.of the1st NSDI (Mar.2004),pp.253–266.

[6]B ENTLEY,J.L.,AND M C I LROY,M.D.Data compres-

sion using long common strings.In Data Compression Conference(1999),pp.287–295.

[7]B LOOM,B.H.Space/time trade-offs in hash coding with

allowable errors.CACM13,7(1970),422–426.

[8]B URROWS,M.The Chubby lock service for loosely-

coupled distributed systems.In Proc.of the7th OSDI (Nov.2006).

[9]C HANDRA,T.,G RIESEMER,R.,AND R EDSTONE,J.

Paxos made live—An engineering perspective.In Proc.

of PODC(2007).

[10]C OMER,D.Ubiquitous https://www.wendangku.net/doc/404075027.html,puting Surveys11,2

(June1979),121–137.

[11]C OPELAND,G.P.,A LEXANDER,W.,B OUGHTER,

E.E.,AND K ELLER,T.W.Data placement in Bubba.In

Proc.of SIGMOD(1988),pp.99–108.

[12]D EAN,J.,AND G HEMAWAT,S.MapReduce:Simpli?ed

data processing on large clusters.In Proc.of the6th OSDI (Dec.2004),pp.137–150.

[13]D E W ITT,D.,K ATZ,R.,O LKEN,F.,S HAPIRO,L.,

S TONEBRAKER,M.,AND W OOD,D.Implementation techniques for main memory database systems.In Proc.

of SIGMOD(June1984),pp.1–8.

[14]D E W ITT,D.J.,AND G RAY,J.Parallel database sys-

tems:The future of high performance database systems.

CACM35,6(June1992),85–98.

[15]F RENCH,C.D.One size?ts all database architectures

do not work for DSS.In Proc.of SIGMOD(May1995), pp.449–450.

[16]G AWLICK,D.,AND K INKADE,D.Varieties of concur-

rency control in IMS/VS fast path.Database Engineering Bulletin8,2(1985),3–10.

[17]G HEMAWAT,S.,G OBIOFF,H.,AND L EUNG,S.-T.The

Google?le system.In Proc.of the19th ACM SOSP(Dec.

2003),pp.29–43.

[18]G RAY,J.Notes on database operating systems.In Oper-

ating Systems—An Advanced Course,vol.60of Lecture Notes in Computer Science.Springer-Verlag,1978. [19]G REER,R.Daytona and the fourth-generation language

Cymbal.In Proc.of SIGMOD(1999),pp.525–526. [20]H AGMANN,R.Reimplementing the Cedar?le system

using logging and group commit.In Proc.of the11th SOSP(Dec.1987),pp.155–162.

[21]H ARTMAN,J.H.,AND O USTERHOUT,J.K.The Zebra

striped network?le system.In Proc.of the14th SOSP (Asheville,NC,1993),pp.29–43.[22]https://www.wendangku.net/doc/404075027.html,/products/database.php.Product page.

[23]L AMPORT,L.The part-time parliament.ACM TOCS16,

2(1998),133–169.

[24]M AC C ORMICK,J.,M URPHY,N.,N AJORK,M.,

T HEKKATH,C.A.,AND Z HOU,L.Boxwood:Abstrac-tions as the foundation for storage infrastructure.In Proc.

of the6th OSDI(Dec.2004),pp.105–120.

[25]M C C ARTHY,J.Recursive functions of symbolic expres-

sions and their computation by machine.CACM3,4(Apr.

1960),184–195.

[26]O’N EIL,P.,C HENG,E.,G AWLICK,D.,AND O’N EIL,

E.The log-structured merge-tree(LSM-tree).Acta Inf.

33,4(1996),351–385.

[27]https://www.wendangku.net/doc/404075027.html,/technology/products/-

database/clustering/index.html.Product page.

[28]P IKE,R.,D ORWARD,S.,G RIESEMER,R.,AND Q UIN-

LAN,S.Interpreting the data:Parallel analysis with Sawzall.Scienti?c Programming Journal13,4(2005), 227–298.

[29]R ATNASAMY,S.,F RANCIS,P.,H ANDLEY,M.,K ARP,

R.,AND S HENKER,S.A scalable content-addressable network.In Proc.of SIGCOMM(Aug.2001),pp.161–172.

[30]R OWSTRON,A.,AND D RUSCHEL,P.Pastry:Scal-

able,distributed object location and routing for large-scale peer-to-peer systems.In Proc.of Middleware2001 (Nov.2001),pp.329–350.

[31]https://www.wendangku.net/doc/404075027.html,/products-sensage.htm.

Product page.

[32]S TOICA,I.,M ORRIS,R.,K ARGER,D.,K AASHOEK,

M.F.,AND B ALAKRISHNAN,H.Chord:A scalable peer-to-peer lookup service for Internet applications.In Proc.of SIGCOMM(Aug.2001),pp.149–160. [33]S TONEBRAKER,M.The case for shared nothing.

Database Engineering Bulletin9,1(Mar.1986),4–9. [34]S TONEBRAKER,M.,A BADI,D.J.,B ATKIN,A.,C HEN,

X.,C HERNIACK,M.,F ERREIRA,M.,L AU,E.,L IN,

A.,M ADDEN,S.,O’N EIL,E.,O’N EIL,P.,R ASIN,

A.,T RAN,N.,AND Z DONIK,S.C-Store:A column-

oriented DBMS.In Proc.of VLDB(Aug.2005),pp.553–564.

[35]S TONEBRAKER,M.,A OKI,P.M.,D EVINE,R.,

L ITWIN,W.,AND O LSON,M.A.Mariposa:A new ar-chitecture for distributed data.In Proc.of the Tenth ICDE (1994),IEEE Computer Society,pp.54–65.

[36]https://www.wendangku.net/doc/404075027.html,/products/database-

servers/sybaseiq.Product page.

[37]Z HAO,B.Y.,K UBIATOWICZ,J.,AND J OSEPH,A.D.

Tapestry:An infrastructure for fault-tolerant wide-area location and routing.Tech.Rep.UCB/CSD-01-1141,CS Division,UC Berkeley,Apr.2001.

[38]Z UKOWSKI,M.,B ONCZ,P.A.,N ES,N.,AND H EMAN,

S.MonetDB/X100—A DBMS in the CPU cache.IEEE Data Eng.Bull.28,2(2005),17–22.