Slide1: Google Cluster Innards Martin Dvorak UltraDvorka@gmail.com
http://www.e-mental.com/dvorka
Agenda: Agenda Inventing Google The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin and Lawrence Page (founders) / 1998
Cluster Anatomy Web Search for a Planet: The Google Cluster Architecture Luiz André Barroso, Jeffrey Dean and Urs Hoelzle / 2003 Google's secret of success? Dealing with failure Urs Hoelzle (Vice President of Engineering and Operations) / 2004
Programming for Google Cluster MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat (Google staff) / 2004
Slide3: Inventing Google
Inventing Google: Inventing Google Sergey & Larry - Ph.D. students at Stanford University
Prototype (1998)
http://google.stanford.edu
24,000,000 pages (8,058,044,651 today)
Google
“We chose our system name, Google, because it is a common spelling of googol, or 10100 and fits well with our goal of building very large-scale search engines.”
Page Rank
An objective measure of its citation importance that corresponds well with people’s subjective idea of importance.
Inventing Google: Foundation: Inventing Google: Foundation PageRank*:
We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) *) Larry Page
Inventing Google: Foundation: Inventing Google: Foundation Page Rank formula informally
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
PageRank can be thought of as a model of user behavior.
We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page.
The probability that the random surfer visits a page is its PageRank.
High PR has a page if…
there are many pages that point to it
or if there are some pages that point to it and have a high PR
Note recursive weight propagation through web link structure.
Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages’ PageRanks will be one.
Damping factor d is the probability at each page the "random surfer" will get bored and request another random page.
Personalization
Inventing Google: Foundation: Inventing Google: Foundation PageRank relevancy tuning
Page title
Anchor text
Meta
Font
Size
Weight
Capitalization
…
Inventing Google: Anatomy: Inventing Google: Anatomy
Inventing Google: Anatomy: Inventing Google: Anatomy URL Server
Providers list of URLs to be fetched to crawlers
Google Crawlers (GoogleBot)
Multiple distributed crawlers
Own DNS cache
300 connections open at once
Send fetched pages to Store Server
Originally written in Python
Store Server
Compresses and stores files to repository.
DOCID is created for each page.
Repository
Stores fetched pages for further processing by Indexer
Inventing Google: Anatomy: Inventing Google: Anatomy Indexer
Reads pages from Repository (uncompress)
Parses each document (Flex on top of own stack):
Page converted to set of Hits (position, font, capitalization, title/achor/meta) / 2B
Added to Document Index
Hits are distributed to Barrels (i.e. one document to multiple barrels)
Every link found in page is stored to Anchors file
Forward and Inverted Barrels (2*64)
Forward Index
Barrel keeps range of Hits sorted by DOCIDs
(DOCID, (WORDID, word’s Hit reference+)+)
Processed by Sorter:
Generates inverted index from forward index – sorts Hits by WORDIDs
Creates (WORDID, offsets) used by Lexicon
Inverted Index (short/full)
(WORDID, (DOCID reference, Hit list reference)+))
Short: DOCIDs sorted by/contains just quality Hits (word in title, anchor,...); optimal single word search
Full: DOCIDs sorted by DOCID; optimal Hit lists merging i.e. multi-word search
Anchors file
Anchor (from, to, text)
URL Resolver
Reads anchors file:
Relation 2 absolute URL conversion + DOCID assignment
Creates links file
Links file
(url, target: DOCID)
Inventing Google: Anatomy: Inventing Google: Anatomy Searcher uses…
Lexicon
Keeps map saying which Barrel to use.
Originally kept in memory (256MB).
IMHO now must be used something like Multi-level VM Page Table
It is is/was of fixed size (14,000,000 words)
Barrels
Each barrel keeps range of WORDIDs
WORID 2 DOCID map
PageRank pool
Keeps counted page rank for each DOCID
Doc Index
DOCID ordered information about each document
(DOCID, status, repository pointer, checksum, stat, URL, title)
Slide12: Cluster Innards
Cluster Innards: Global Google: Cluster Innards: Global Google Over 30 Google clusters around the world.
DNS based & geo location driven load-balancing:
Domain Name: GOOGLE.COM Registrar: ALLDOMAINS.COM INC. Whois Server: whois.alldomains.com Referral URL: http://www.alldomains.com Name Server: NS2.GOOGLE.COM Name Server: NS1.GOOGLE.COM Name Server: NS3.GOOGLE.COM Name Server: NS4.GOOGLE.COM Status: REGISTRAR-LOCK Updated Date: 03-oct-2002 Creation Date: 15-sep-1997 Expiration Date: 14-sep-2011
2005, May 7: Google DNS hack speculations
Total PCs
> 5,000 in 2000
>15,000 in 2003
>79,000* in 2004
*) I’m not sure about this number, it was taken from an external resource.
Cluster Innards: HW: Cluster Innards: HW Basics cluster design insights
Reliability in SW rather then server-class HW.
Commodity PCs used to build high-end computing cluster at a low end prices.
Example:
$287,000 – 176x 2GHz Xeon, 176GB RAM, 7TB HDD
$758,000 – 8x 2GHZ Xeon, 64GB RAM, 8TB HDD
Design is tailored for best aggregate request throughput, not peak server response time – individual request parallelization.
Google has inexpensively built out its computing infrastructure by using thousands of "commodity" servers
<2,000 servers in single cluster.
Dual-processor x86 servers (starting at 533MHz Celeron) with 2-4 GB of memory per machine, 1+ 80GB IDE drive.
Rack: 40-80 of x86-based servers.
Cluster Innards: HW: Cluster Innards: HW Optimistically, a consumer PC might crash once in three years from a software glitch or hardware problem.
"At Google scale...if you have thousands of PCs, you can expect one (failure) a day,…"
1,000,000s not 1,000,000,000s of dollars.
“The trick is to make these racks of hardware work together and to ensure that the failure of one machine doesn't derail an operation.”
Switched Ethernet
Commodity networking hardware is used - typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.
Locality optimizations (GFS)
Cluster Innards: SW: Cluster Innards: SW Stripped-down version of Linux, which is based on the Red Hat distribution but is really just the operating system kernel modified for Google.
Google File System is optimized for handling large blocks of data.
64MB block
The file system was designed to assume that a failure, such as a failed disk or unplugged network cable, can happen at any time.
Data is replicated in three places, and there is a "master" machine that can locate copies of a piece of data, such as a keyword index, if the original is out of commission.
Google has created "batch" job scheduling software that acts as a sort of taskmaster for millions of operations called the Global Work Queue.
Another important engineering feat done by Google is to make writing programs that run across thousands of servers very straightforward…
Slide17: Programming for Cluster
Programming For Cluster: Programming For Cluster Google's MapReduce is a programming model and an associated implementation for processing and generating large data sets.
Automates the task of recovering a program in case of a failure.
It is critical to keeping the company's costs down.
MR in brief:
Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.
Features
Functional style programming.
Automatic parallelization.
Programming For Cluster: Programming For Cluster Map Reduce runtime…
takes care of the details of partitioning the input data
scheduling the program's execution across a set of machines, handling machine failures
managing the required inter-machine communication.
…and more.
MR hides machines the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
Therefore even programmers without any experience with parallel and distributed systems can easily utilize the resources of a large distributed system.
Numbers…
TB of data processed (>20)
On 1,000s of machines
100s of MR programs in place
Programming For Cluster: Programming For Cluster Special purpose computations examples:
Inputs:
crawled documents
web request logs
…
Outputs:
inverted indices
various representations of the graph structure of web documents
summaries of the number of pages crawled per host (dumper)
the set of most frequent queries in a given day, etc.
Programming For Cluster: Programming For Cluster LISP roots
Remind map and reduce primitives:
map(func, list, ...)
Creates new list from the results of applying func to each element of each list. There must be one list per argument to the function.
map(lambda x, y: x+y, [1,2],[3,4]) --> [4,6]
map(None, [1,2],[3,4]) --> [[1,3],[2,4]]
reduce(func, list {,init})
Applies func to each pair of items in turn. The results are accumulated.
reduce(lambda x, y: x+y, [1,2,3,4],5) --> 15
reduce(lambda x, y: x&y, [1,0,1]) --> 0
reduce(None, [], 1) --> 1
Programming model
Key/values 2 key/values
Map & reduce functions written by user are linked with MR library.
map (k1,v1) list(k2,v2)
reduce (k2,list(v2)) list(v2)
Input/output file, tuning parameters, …
Programming For Cluster: Programming For Cluster Example (pseudocode):
map(String key, String value): // key: document name // value: document contents for each word w in value: EmitIntermediate(w, "1");
reduce(String key, Iterator values): // key: a word // values: a list of counts int result = 0; for each v in values: result += ParseInt(v); Emit(AsString(result));
Programming For Cluster: Programming For Cluster Example solves the problem of counting the number of occurrences of each word in a large collection of documents.
More examples:
Distributed Grep
Map: (URL, true); Reduce: id
Count of URL Access Frequency
Access page log is used as input for map.
Map: (URL,1); Reduce: (URL; total count)
Reverse Web-Link Graph
Link database is processed by map.
Map: (target; source); Reduce: (target; list(source))
Term-Vector per Host
A term vector summarizes the most important words that occur in a document or a set of documents as a list of (word; frequency) pairs.
Hostname is determined for each document by map and term vector created (so there is multiple entries); reduce function then merges all entries associated with particular host and throws away infrequent terms.
Map: (hostname; term vector); Reduce: (hostname; term vector)
Programming For Cluster: Programming For Cluster
Programming For Cluster: Programming For Cluster Also…
Multiple tasks performed by single worker (load-balancing)
Master
idle, in-progress, completed (workers)
worker failure (re-execution)
master failure (rare, restart)
MR locality optimization
Save network bandwidth
GFS 64MB block replication
MR scheduler takes GFS replicas location into account
Stragglers
Overload, HW problems, etc.
Backup task – fork twin execution for straggler
Task granularity
Number of workers driven by M&R pairs number
For example: M=200,000; R=5,000 using 2,000 workers
Slide26: Putting Things Together
I’m Feeling Lucky: I’m Feeling Lucky Pre-phase:
Browser requests e.g. http://www.google.com/search?q=edu+session
DNS-based load-balancing selects cluster according to the geographical location of the user & actual cluster utilization
The rest of the evaluation is entirely local to the that cluster
Phase 1: Index servers...
Parse the query
Perform spell-check and fork Ad task
Convert words into WORDIDs
Choose inverted Barrel(s) using Lexicon
Barrel index is formed by number of servers whose data are randomly distributed and replicated (full index/index shards) so search is highly parallelizable
Inverted barrel maps each query word to a matching list of documents (Hit list)
Seek to the start of the document list in the short barrel for every word (multiple tasks)
Scan through document list until there is document that matches all search terms
If we are in the short barrels and at the end of any document list, seek to the start of the document list to the full barrel for every word and go to the step 1
If we are not at the end of any document list, go to the step 1
Sort the DOCIDs that have matched
Phase 2: Document servers...
For each DOCID compute actual title, URL and query-specific document summary (matched words context).
Document servers are used to dispatch this completion – also documents are randomly distributed and replicated, so the completion is highly parallelizable
Slide28: Bonus
Stanford lab (around 1996): Stanford lab (around 1996)
The Original Google Storage: 10x4GB (1996): The Original Google Storage: 10x4GB (1996)
Google San Francisco (2004): Google San Francisco (2004)
A cluster of coolness Google History : A cluster of coolness Google History
Google Results Page Per Day : Google Results Page Per Day
References: References Sergey Brin, Lawrence Page; The Anatomy of a Large-Scale Hypertextual Web Search Engine; 1998
Luiz André Barroso, Jeffrey Dean and Urs Hoelzle:Web Search for a Planet: The Google Cluster Architecture; 2003
Urs Hoelzle:Google's secret of success? Dealing with failure; 2004
Jeffrey Dean and Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters; 2004
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: The Google File System; 2003
http://www.google.com
GoogleBot http://www.google.com/bot.html
http://www.googleguide.com
http://www.searchengineposition.com
http://www.google-watch.org
Uniquely Google ™ http://www-db.stanford.edu/pub/voy/museum/google.htm
Stanford Gadgets http://www-db.stanford.edu/pub/voy/museum/pictures/
Google hacked? http://www.gigaom.com/2005/05/07/google-hacked/