Google Platform
From Wikipedia, the free encyclopedia, by MultiMedia
Google, as one of the most popular Internet search engines, requires large
computational resources in order to provide its service. This article
describes Google's technological infrastructure, as presented in the
company's public announcements.
![Google Company Logo](./modules/Google_Guide-MM/images/Google44.jpg)
Network topology
Google has several clusters in various locations across the world. When
a user attempts to connect to Google, Google's DNS servers perform load
balancing so that the user can reach Google's content as quickly as possible.
They do this by returning the IP address of a cluster that is not under heavy
load and is geographically close to the user. Each cluster has a few
thousand servers, and upon connection to a cluster further load balancing is
performed by hardware in the cluster, in order to send the queries to the
least loaded Google Web Server.
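The cluster selection described above can be illustrated with a minimal sketch. It is not Google's actual algorithm: the `Cluster` record, the `max_load` threshold, the use of great-circle distance as the measure of proximity, and the example IP addresses (taken from the reserved documentation ranges) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt


@dataclass
class Cluster:
    # Hypothetical record describing one cluster; not a real Google structure.
    name: str
    ip: str
    lat: float
    lon: float
    load: float  # fraction of capacity in use, 0.0 to 1.0


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points (haversine formula)."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def pick_cluster(clusters, user_lat, user_lon, max_load=0.8):
    """Return the nearest cluster whose load is below max_load.

    If every cluster is above the threshold, fall back to the least loaded one.
    """
    candidates = [c for c in clusters if c.load < max_load]
    if not candidates:
        return min(clusters, key=lambda c: c.load)
    return min(candidates, key=lambda c: distance_km(user_lat, user_lon, c.lat, c.lon))


clusters = [
    Cluster("california", "192.0.2.1", 37.4, -122.1, load=0.3),
    Cluster("virginia", "198.51.100.1", 38.9, -77.4, load=0.5),
]
# A user near New York is sent to the Virginia cluster while it has spare capacity.
chosen = pick_cluster(clusters, 40.7, -74.0)
```

A DNS server following this idea would answer the lookup with `chosen.ip`. Filtering by load first and then minimising distance is only one way to combine the two criteria named in the text.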
Racks are custom-made and contain 40 to 80 servers (20 to 40 1U servers on
either side); newer servers are 2U rackmount systems. Each rack has a hub.
Servers are connected via a 100 Mbit/s Ethernet link to the local hub. Hubs
are connected to a core gigabit hub using one or two gigabit uplinks.
Main Index
Since queries are composed of words, an inverted index of documents is
required. Such an index maps each query word to the list of documents that
contain it. The index itself is quite large due to the number of documents
stored in the servers, so it is split up into "index shards". Each
shard is hosted by a set of index servers. The load balancer decides which
index server to query based on the availability of each server.
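As a rough illustration of an inverted index split into shards, the sketch below builds one small index per shard and answers a one-word lookup by asking every shard. The partitioning rule (`docid % num_shards`), the whitespace tokenizer, and the function names are assumptions; the article does not say how Google assigns documents to shards.

```python
from collections import defaultdict


def build_shards(documents, num_shards=4):
    """Build one inverted index (word -> set of docids) per shard.

    documents maps an integer docid to its text. Assigning a document to
    shard `docid % num_shards` is an illustrative choice only.
    """
    shards = [defaultdict(set) for _ in range(num_shards)]
    for docid, text in documents.items():
        index = shards[docid % num_shards]
        for word in text.lower().split():
            index[word].add(docid)
    return shards


def lookup(shards, word):
    """Ask every shard for the word and merge the docid lists."""
    hits = set()
    for index in shards:
        hits |= index.get(word.lower(), set())
    return sorted(hits)


documents = {
    1: "google web search",
    2: "web server cluster",
    3: "index shard server",
}
shards = build_shards(documents)
web_docs = lookup(shards, "web")        # docids of documents containing "web"
server_docs = lookup(shards, "server")  # docids of documents containing "server"
```

Because each shard covers a disjoint subset of documents, a lookup has to consult all of them, but each shard stays small enough to be hosted and replicated independently.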
Server types
Google's server infrastructure is divided into several types, each assigned
to a different purpose:
- Google Web Servers coordinate the execution of queries sent
by users, then format the result into an HTML page. The execution
consists of sending queries to index servers, merging the results,
computing their rank, retrieving a summary for each hit (using the
document server), asking for suggestions from the spelling servers, and
finally getting a list of advertisements from the ad server.
- Data-gathering servers are permanently dedicated to spidering
the Web. They update the index and document databases and apply Google's
algorithms to assign ranks to pages.
- Index servers each contain a set of index shards. They return
a list of document IDs ("docid"), such that documents corresponding to a
certain docid contain the query word. These servers need less disk
space, but suffer the greatest CPU workload.
- Document servers store documents. Each document is stored on
dozens of document servers. When performing a search, a document server
returns a summary for the document based on query words. They can also
fetch the complete document when asked. These servers need more disk
space.
- Ad servers manage advertisements offered by services like AdWords and
AdSense.
- Spelling servers make suggestions about the spelling of
queries.
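The division of labour in the list above can be sketched as a single coordinating function that plays the role of a Google Web Server. Everything here is hypothetical: the other server types are passed in as plain Python callables, multi-word queries are merged by intersecting per-word hits (an assumption the article does not state), and the HTML layout is invented for the example.

```python
import html


def handle_query(query, index_servers, doc_server, spelling_server, ad_server, rank):
    """Coordinate one query and format the result as a small HTML fragment."""
    words = query.lower().split()

    # 1. Send each word to every index server and merge the docids.
    per_word = []
    for word in words:
        docids = set()
        for lookup in index_servers:
            docids |= lookup(word)
        per_word.append(docids)
    # Assumption: a hit must contain every query word.
    hits = set.intersection(*per_word) if per_word else set()

    # 2. Order the hits by rank, best first.
    ordered = sorted(hits, key=rank, reverse=True)

    # 3. Retrieve a summary for each hit from the document server.
    parts = ["<ol>"]
    parts += [f"<li>{html.escape(doc_server(d, words))}</li>" for d in ordered]
    parts.append("</ol>")

    # 4. Ask the spelling server for a suggestion and the ad server for ads.
    suggestion = spelling_server(query)
    if suggestion:
        parts.append(f"<p>Did you mean: {html.escape(suggestion)}</p>")
    parts += [f"<aside>{html.escape(ad)}</aside>" for ad in ad_server(words)]
    return "\n".join(parts)


# Stub servers standing in for the real ones.
index_a = {"google": {1}, "cluster": {1, 2}}
index_b = {"google": {4}, "cluster": {4}}
summaries = {1: "Google cluster paper", 4: "Cluster notes"}

page = handle_query(
    "google cluster",
    index_servers=[lambda w: index_a.get(w, set()), lambda w: index_b.get(w, set())],
    doc_server=lambda docid, words: summaries[docid],
    spelling_server=lambda q: None,
    ad_server=lambda words: ["Sample ad"],
    rank=lambda docid: -docid,  # toy ranking: lower docid ranks higher
)
```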
Server hardware and software
Servers are commodity-class x86 PCs running customized versions of Linux.
The goal is to purchase CPU generations that offer the best
performance per unit of power, not the best absolute performance. The biggest
cost that Google faces is power consumption, given the huge amount of
computing power required. For this reason, the Pentium II has been the most
favoured processor, but this could change in the future as processor
manufacturers are increasingly limited by the power output of their devices.
Published specifications:
- 100,000 servers ranging from 533 MHz Intel Celeron to dual 1.4 GHz
Intel Pentium III (as of 2005)
- One or more 80 GB hard disks per server (2003)
- 2–4 GiB memory per machine (2004)
The exact size and whereabouts of the data centers Google uses are
unknown, and official figures remain intentionally vague. According to John
Hennessy and David Patterson's Computer Architecture: A Quantitative
Approach, Google's server farm in the year 2000 consisted
of approximately 6,000 processors and 12,000 common IDE disks (one processor
and two disks per machine) at four sites: two in Silicon Valley, California,
and two in Virginia. Each site had an OC-48 (2,488 Mbit/s) Internet
connection and an OC-12 (622 Mbit/s) connection to other Google sites. The
connections were eventually routed down to 4 x 1 Gbit/s lines connecting up
to 64 racks, each rack holding 80 machines and two Ethernet switches. Google
has almost certainly changed and enlarged its network
architecture dramatically since then.
Based on the Google IPO S-1 form released in April 2004, Tristan Louis
estimated the current server farm to contain something like the following:
- 719 racks
- 63,272 machines
- 126,544 CPUs
- 253 THz of processing power
- 126,544 GB (approx. 123.58 TB) of RAM
- 5,062 TB (approx. 4.77 PB) of hard drive space
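The totals in this estimate are consistent with a simple multiplication from per-machine figures. The sketch below reproduces them. The per-rack and per-machine constants are back-calculated from the totals listed above; they are assumptions for the arithmetic, not figures quoted from the source.

```python
RACKS = 719
MACHINES_PER_RACK = 88     # assumption: 63,272 machines / 719 racks
CPUS_PER_MACHINE = 2       # assumption: 126,544 CPUs / 63,272 machines
GHZ_PER_CPU = 2.0          # assumption: 253 THz / 126,544 CPUs
RAM_GB_PER_MACHINE = 2     # assumption: 126,544 GB / 63,272 machines
DISK_GB_PER_MACHINE = 80   # assumption: 5,062 TB / 63,272 machines

machines = RACKS * MACHINES_PER_RACK           # 63,272
cpus = machines * CPUS_PER_MACHINE             # 126,544
total_thz = cpus * GHZ_PER_CPU / 1000          # about 253 THz
ram_gb = machines * RAM_GB_PER_MACHINE         # 126,544 GB
ram_binary_tb = ram_gb / 1024                  # about 123.58 when dividing by 1024
disk_tb = machines * DISK_GB_PER_MACHINE / 1000  # about 5,062 TB

print(machines, cpus, round(total_thz), ram_gb, round(ram_binary_tb, 2), round(disk_tb))
```

Summing clock rates into "THz" is only a rough indication of capacity, since it ignores differences in how much work each processor does per cycle.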
According to this estimate, the Google server farm constitutes one of the
most powerful supercomputers in the world. At 126–316 teraflops, it can
perform at over one third the speed of the Blue Gene supercomputer, which is
(as of 2005) the top entry in the TOP500 list of most powerful unclassified
computing machines available to humanity.
Server operation
Most operations are read-only. When an update is required, queries are
redirected to other servers, so as to simplify consistency issues. Queries
are divided into sub-queries, which may be sent to
different servers in parallel, thus reducing latency.
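Sending sub-queries to several servers at once is a scatter-gather pattern. The following minimal sketch uses a thread pool to stand in for network requests. The `make_server` helper and the in-memory indexes are hypothetical, and a real system would issue remote calls rather than local function calls.

```python
from concurrent.futures import ThreadPoolExecutor


def make_server(index):
    """Return a callable that mimics one shard server (hypothetical helper)."""
    return lambda word: index.get(word, set())


def scatter_gather(word, shard_servers):
    """Send the same sub-query to every shard server in parallel and merge."""
    workers = max(1, len(shard_servers))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_results = list(pool.map(lambda server: server(word), shard_servers))
    merged = set()
    for docids in partial_results:
        merged |= docids
    return sorted(merged)


servers = [
    make_server({"web": {1, 5}}),
    make_server({"web": {2}}),
    make_server({"search": {3}}),
]
web_hits = scatter_gather("web", servers)
```

With this arrangement the total response time is governed by the slowest shard rather than by the sum of all shard lookups, which is the latency benefit the paragraph describes.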
To limit the effects of unavoidable hardware failures, data stored
in the servers may be mirrored using hardware RAID. Software is also
designed to be fault tolerant. Thus when a system goes down, data is still
available on other servers; this replication also increases throughput.
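One simple way software can tolerate a failed server is to try the replicas of a piece of data in turn. The sketch below shows that idea only. The `ReplicaDown` exception, the replica callables, and the retry order are assumptions for illustration and do not describe Google's actual fault-tolerance mechanism.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request (hypothetical)."""


def fetch_with_failover(docid, replicas):
    """Return the document from the first replica that responds."""
    for replica in replicas:
        try:
            return replica(docid)
        except ReplicaDown:
            continue  # this replica is down, try the next one
    raise ReplicaDown(f"no replica could serve document {docid}")


def dead_replica(docid):
    raise ReplicaDown("server offline")


def healthy_replica(docid):
    return f"contents of document {docid}"


document = fetch_with_failover(42, [dead_replica, healthy_replica])
```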
References
- Google Research Publications. A list of papers on Google's platform. Last
accessed October 2, 2005.
- Luiz André Barroso, Jeffrey Dean, Urs Hölzle (2003). Web Search for a
Planet: The Google Cluster Architecture. IEEE Micro 23 (2): 22–28.
- Tristan Louis. How many Google machines.
This guide is licensed under the GNU Free Documentation License. It uses
material from Wikipedia.