Google: Organizing the World’s Information… and Loving It!


The Google recruitment talk was given by John Abd-El-Malek
(jam@google.com, abdelMAHLik). Other engineers were also around for
the question and answer: Amit Agarwal, Tim James, Jon McAlister, Peter
Szulczewski, Joel Zacaharias. There were two women from HR whose names
I didn’t catch.

Google’s mission statement is to organize the world’s information and
make it universally accessible and useful. The presentation covered
the following points:

– build systems for scalability
– harness the power of data
– innovate new applications
– manage fast-paced growth
– look ahead to the future

Google works on a massive scale

Google faces the following challenges:

– hardware and networking: build a basic computing platform with low cost
– big distributed systems: create reliable systems from many individual machines
– algorithms, data structures: process data efficiently and flexibly
– machine learning, information retrieval: improve quality of search by analyzing lots of data
– user interfaces: design effective UI for search and other products

Large data set, simple structure. Key insight: Google works
with large data sets with simple structure. For example, web page
repositories, query logs, status records from thousands of machines,
source code control and software build records, etc. These aren’t
stored in SQL databases because they’re too large for DBMSes
(terabytes of data!) and they don’t need the full complexity of a
DBMS.

Simple statistical analysis. Often, analyses of data tend to be
simple. General statistical analysis often only requires computing a
small number of statistics, then performing more complex operations
using only these statistics. For example, if we’re trying to find the
most popular query, we don’t need to check all the queries.

Data as a sequence of records. For commutative operations,
record order is irrelevant (example: addition). For associative
operations, aggregation order is irrelevant (example: finding the
maximum). This allows you to write parallel programs to take advantage
of Google’s distributed computing power. For example, consider a week
of code submissions. This short program calculates the minute for one
entry and emits an instruction to add one to the record for that
minute. The emit statements are delivered to an aggregator, which then
combines the results into a graph. (As you can see, we do have weekends.)
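The program shown in the talk isn't in these notes, so here is a minimal sketch of the same emit-and-aggregate idea in Python. The function names, the timestamp format, and the sample records are all assumptions made for illustration, not Google's actual code. Because addition is commutative and associative, each shard of records can be counted separately and the partial counts merged in any order.

```python
from collections import Counter
from datetime import datetime


def map_record(timestamp: str) -> tuple[int, int]:
    """Emit (minute_of_week, 1) for a single submission record."""
    t = datetime.fromisoformat(timestamp)
    # weekday() is 0 for Monday, so minute 0 is Monday 00:00.
    minute_of_week = (t.weekday() * 24 + t.hour) * 60 + t.minute
    return minute_of_week, 1


def aggregate(emitted) -> Counter:
    """Sum the emitted values per key; the order of arrival does not matter."""
    totals = Counter()
    for key, value in emitted:
        totals[key] += value
    return totals


def merge(partials) -> Counter:
    """Combine per-shard totals; any grouping gives the same result."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)
    return combined


# Hypothetical sample data: three code submissions.
records = [
    "2005-10-03T09:15:00",  # Monday
    "2005-10-03T09:15:40",  # Monday, same minute
    "2005-10-04T14:02:10",  # Tuesday
]

# Pretend the records live on two machines.
shards = [records[:1], records[1:]]
partials = [aggregate(map(map_record, shard)) for shard in shards]
totals = merge(partials)
```

Here `totals` ends up the same as counting all the records on one machine, which is what makes the work easy to spread across many machines.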

(Demo followed by a totally awesome video of query traffic represented
as points of light on a map of the Earth.)

Harnessing the power of data

The conventional wisdom is that given an order of magnitude increase
in computational power, you can solve previously impractical problems.

Google’s insight: Given an order of magnitude increase in data, you
can solve previously unsolvable problems!

It’s not just about getting a more robust solution. Some methods that
appear to fail with limited data work with much larger data sets.

Consider spelling correction. The old way was to use a
lexicon/dictionary – 100k words. This allows you to suggest correction
words that have a short edit distance from unrecognized words. What’s
the challenge? Proper names, which are rarely in lexicons. Example:
Kofi Annan.
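To make the "old way" concrete, here is a minimal Python sketch of lexicon-based correction using Levenshtein edit distance. The tiny lexicon and the distance threshold are assumptions for illustration only. It also shows the proper-name problem: a name missing from the lexicon gets "corrected" to ordinary words.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def suggest(word: str, lexicon, max_distance: int = 2) -> list[str]:
    """Return lexicon words within max_distance edits, closest first."""
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for distance, w in scored if distance <= max_distance]


# Hypothetical five-word lexicon; a real one would have ~100k entries.
LEXICON = ["annual", "brine", "cannon", "coffee", "spears"]

typo_suggestions = suggest("speers", LEXICON)  # a plain typo
name_suggestions = suggest("annan", LEXICON)   # a proper name not in the lexicon
```

The typo gets a sensible suggestion, while the name only gets unrelated dictionary words such as "annual", which is the weakness the talk pointed out.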

The set of terms on the web is much larger than standard lexicons and
changes regularly. People misspell queries, even popular ones such as
“britney spears”. Dictionary-based spelling correction has problems
with context.

(Points out funny contrast between britney spears and briney spears (asparagus).)

Solution? Use the web as a contextual lexicon. Find misspellings based
on contextual usage on the web. Build a probabilistic model of term
spellings. Context is key.
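The talk didn't go into how the model is built, so the following Python sketch is only a rough illustration of the idea. It assumes a toy "web" corpus, generates candidates one edit away from the typed word, and picks the candidate seen most often before the next word. The corpus, the function names, and the use of raw bigram counts as a stand-in for probabilities are all assumptions.

```python
import string
from collections import Counter

# Hypothetical stand-in for text crawled from the web.
TOY_WEB_TEXT = """
britney spears concert tickets
britney spears new album
britney spears video
briny pickles recipe
briny pickles and olives
"""

unigrams = Counter()
bigrams = Counter()
for line in TOY_WEB_TEXT.strip().splitlines():
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))


def edits1(word: str) -> set[str]:
    """All strings one delete, transpose, replace, or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    inserts = [l + c + r for l, r in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def correct(word: str, next_word: str) -> str:
    """Pick the known term that most often precedes next_word in the corpus."""
    candidates = [w for w in edits1(word) | {word} if w in unigrams]
    if not candidates:
        return word  # nothing known nearby; leave the word alone
    # Context (bigram count) decides first; overall frequency breaks ties.
    return max(candidates, key=lambda w: (bigrams[(w, next_word)], unigrams[w]))
```

With this toy corpus, `correct("britny", "spears")` picks "britney" while `correct("britny", "pickles")` picks "briny": the same misspelling is resolved differently depending on its context.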

You can also find interesting patterns in data. For example, here are
the most popular queries from the past few Januarys. (Points out
Superbowl, points out one year when Janet Jackson and “superbowl
halftime” topped the Google queries.)

Innovations

Google Maps

Example: Google Maps. A revolutionary update because of its dynamic, clean rendering. Open API for developers.

Making it all work

– Plenty of crazy hacks to make it work across browsers
– Mozilla/Safari/Opera don’t support vector markup. Draw driving directions on server in a PNG image and overlay it
– IE does not support alpha transparency in PNGs. Use a little known ActiveX control that’s enabled by default
– Safari and Opera don’t support parsing XML strings, so we wrote an XML parser in JavaScript (no joke)

The benefit of DHTML: Simple API

– Putting map on page requires only two lines of JavaScript:
– Initially designed to integrate
– Developers figured this out before we published API

http://www.scipionus.com/katrina.html . Wow. Leaving messages overlaid on a map. Good idea.

Automatic machine translation

Goal: Provide automatic high-quality translations of text between
different languages. This enables all text data on the web to be
accessible in any language, no matter what the language of the
original text.

Approach: statistical machine translation. Build a statistical model
of translation. Use decision theory to make optimal decisions.
Sentence-by-sentence level.

Pre-translated pairs of text to learn parameters of log-linear model.
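For readers who haven't met log-linear models: each candidate translation gets a weighted sum of feature scores, the sums are exponentiated and normalized into probabilities, and the decision rule picks the most probable candidate. The Python sketch below illustrates only that general form. The feature names, feature values, and weights are made up; in a real system the weights would be learned from the pre-translated text pairs.

```python
import math

# Hypothetical feature values (log-scores) for three candidate translations
# of one source sentence.
CANDIDATES = {
    "the house is small": {"translation": -1.0, "language": -2.0},
    "small is the house": {"translation": -1.0, "language": -5.0},
    "the home is little": {"translation": -2.5, "language": -3.0},
}

# Hypothetical weights; these are the parameters that training would set.
WEIGHTS = {"translation": 1.0, "language": 0.5}


def log_linear_probs(candidates, weights):
    """P(candidate) is proportional to exp(sum of weight * feature value)."""
    scores = {
        text: sum(weights[name] * value for name, value in feats.items())
        for text, feats in candidates.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {text: math.exp(score - top) for text, score in scores.items()}
    z = sum(exps.values())
    return {text: e / z for text, e in exps.items()}


probs = log_linear_probs(CANDIDATES, WEIGHTS)
best = max(probs, key=probs.get)  # the decision: most probable candidate
```

With these made-up numbers the first candidate wins because it has the best combined translation and language-model score.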

Throw statistics at the problem!
BLEU% score: how closely a machine translation matches a human translation
Outperformed other systems on Chinese-English and Arabic-English translation. Why Chinese and Arabic? They’re very different from English. If we can do these languages, then it would be easier to do Spanish and French.
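As a rough illustration of what a BLEU score measures, here is a simplified sentence-level version in Python. It is a sketch under several assumptions: a single reference translation, n-grams only up to length 2, and no smoothing. The standard metric uses up to 4-grams and is computed over a whole test corpus, so these numbers are not comparable to published BLEU scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 2) -> float:
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Counter & Counter keeps the minimum count, which clips repeats.
        overlap = sum((cand_counts & ref_counts).values())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(overlap / total) / max_n
    # Penalize candidates that are shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


reference = "the cat sat on the mat"
perfect = simple_bleu("the cat sat on the mat", reference)
partial = simple_bleu("the cat is on the mat", reference)
```

An exact match scores 1.0, and the candidate with one wrong word scores lower because it loses one unigram match and two bigram matches.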

BLEU% for Arabic-English translation as the amount of data for the
language model is increased. weblm: language model trained on 219B
words of web data! Seems roughly logarithmic.

Google Desktop

Google Desktop APIs: Indexer, Query, Sidebar, Event API. More info at
http://desktop.google.com/developerguide.html . (Oooh. Event API. What
is the user doing? They’ve done the grunt-work of hooking into the
different applications! Sweeeeeeet! Awesome! Awesome! They have an
event stream already going!)

Sidebar

Show useful information, not distracting, make efficient use of space.
Write a cool panel, and depending on the number of users: limited
edition Google Desktop t-shirt, adwords, iPod nano, internship!

Some ideas: local traffic, calendar, eBay, iTunes, sports scores, quicklaunch, TV guide, random Google Video, webcam, SMS…

Google work environment

Small teams of 3-5 people, problems that matter, with freedom to
explore their ideas. Access to enormous computational resources. 20%
time to explore your own ideas.

Froogle, orkut, news, desktop: all 20% products.

Not just about search

– hardware, mechanical engineering
– networking, distributed sys, fault tolerance
– compilers, programming language
– data structures, algos
– machine learning, statistics, IR , AI
– user interfaces
– product design

Not just about engineering

– product management
– product marketing
– finance
– technical sales
– tech program management
– staffing
– online sales and operations

Hiring all over the world

Great benefits

– flexible work environment
– fun atmosphere
– free gourmet meals
– on-site massage, doctor, concierge, and dry-cleaning
– and all those “standard” things

(still need an apartment to sleep in, though.)

Looking at the future

Sampling of Google’s product suite: Google Web Search, Adsense for
Search, Google News, Blogger, Froogle, Gmail, Google Earth, Google
Search Appliance, Google Toolbar…

help users organize information

Google Labs: personalized search, video, suggest, sets

looking ahead: always room for improvement

better systems: improving scalability and performance, providing new infrastructure to build services on
better relevance: improving which pages are presented to the user, giving user access to more/new information
better products/services: new product directions to pursue

Come join the fun! http://google.com/jobs , collegejobs@google.com

Questions and answers

– How does Google make money off Orkut? We never worry about profit for a product. We make it first, and then we see if we can make money off it.
– Is there any truth to the rumors of a Google online office? Can’t comment on any rumors.
– How many people are you looking to hire? No specific number in mind. As many great, talented people as are out there.
– Server count? Can’t answer that.
– Does the majority of Google revenue come from licensing technologies? Revenue statements are largely open now that Google is a public company. Most of it comes from Adsense. Some revenue from Google Earth and Google Search Appliance.
– Only some publications from Google Labs. Is that something encouraged within Google, or does it just happen? Very fine line betwe… we want competitive advantage also. We have opened up software. Historically we haven’t been a huge research company.
– Where do you stand on privacy? “Don’t be evil.” You need to get special permission to go through query logs, for example.
– What about Linux and Mac versions of things like Google Desktop? We want to focus on what will give us the most impact. Cross-platform thing is 20%-time stuff. Most Googlers use Linux, so it’s frustrating having to borrow someone else’s computer to try things out.
– What about linkspam? 50-100 people working on linkspam. Matt Cutts is one of the Googlers working on this.
– What about corporate structure? I’ve heard Google’s supposed to be very democratic. — Teams themselves figure out what features should be added. We just meet and figure out what to do. Engineers have a lot of power. More motivation to work on things.
– How many engineers do you have? 3000+ engineers.
– Why do you help out Firefox? What do you have planned? Sometimes Google just does things to help make the Web a better place. Part of philosophy of not being evil.
– What about UI design? UI designers really help us a lot. For example, sidebar. UI designers helped us do that.
– Software engineering? We have design documents and we review them. Testing. 20% projects are an exception; rules are looser. For most projects, there are design documents, all the code is reviewed before it’s submitted, unit tests are encouraged…
– What are you looking for? Well-rounded bright individuals. We want to be able to learn something from you. We want to make sure you’re a solid recruit for Google. We want to make sure we keep learning something. Something that wows us. “Wow, this guy is sharp.”

Update: Also blogged by Alvin Chin. Also: http://www.the-gadgetman.com/files/Google%20tech%20talk.mp3

You can comment with Disqus or you can e-mail me at sacha@sachachua.com.