Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses : Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses Mark Davies
Linguistics and English Language
http://davies-linguistics.byu.edu
More information : More information http://corpus.byu.edu
http://davies-linguistics.byu.edu
mark_davies@byu.edu
LING 485 “Corpus Linguistics” (Winter 2008)
Outline : Outline Duct tape
Some frequency problems with Google
Google / text archives vs real corpora
Semantics
Syntax
Morphology
Lexicon
Upcoming projects
Slide5 : Google: they’re doing something right …
Google: frequency data : Google: frequency data Number of “hits” OK for single words
For multiple word strings, more or less a guess
Slide7 : Google: best guess for string of words: wanted to google
Slide8 : Google: often the real frequency statistics very different from first estimate
Syntax: “ends up V-ing” : Syntax: “ends up V-ing” Interesting semantic and pragmatic space
Which verbs does it occur with?
Increasing or decreasing?
What styles of speech / text types?
Where: US, UK, other world Englishes?
(Similar constructions in other languages?)
Slide10 : First generation corpora: Brown Corpus (1 million words), US, 1960s
Slide11 : First generation corpora: Brown Corpus (1 million words), US, 1960s
Slide12 : ends up going
ended up watching
end up paying
. . . Google: have to search by exact word forms (problem for syntax)
Slide13 : Google: frequency problems (again)
Slide14 : Google: frequency problems (again, today)
Slide15 : Google: frequency problems (again)
Slide16 : Google: trying to get historical data
Slide17 : .uk, .ca, .au
.com, .us, .edu
Google: trying to limit by dialect or register
Google results: “ends up V-ing” : Google results: “ends up V-ing” Problematic frequency results
Can only search for thousands of individual forms
No way to know if increasing or decreasing
No way to know what styles of speech / text types
No way to know where: US, UK, other world Englishes
Slide20 : EBSCO: Academic Search Premier: 1,850 full-text journals, 1985-present
Slide21 : ProQuest: Research Library: 1,800 magazines, 1980s-present
Slide22 : Lexis-Nexis Academic: 1000s of newspapers, transcripts of news programs, etc
Why architecture matters : Why architecture matters Textual corpus (words, sentences)
Architecture (annotation, indexing, search engine)
Questions
Slide24 : Real Academia Española (CREA); can’t do syntax
Slide25 : Real Academia Española (CREA); can’t do syntax
Slide26 : O Publico (200 million words); just newspapers
Features of useful corpora : Features of useful corpora Size
cf. million-word Brown Corpus – 2 tokens
Annotation
cf. Google and Spanish corpus; no part of speech
Representativity
cf. Portuguese corpora; all newspapers
Slide28 : British National Corpus: 100 million words, UK, 1980s-90s; end up Ving
Slide29 : British National Corpus: end up Ving by register / genre
Slide30 : BNC: end up Ving by “micro”-register
Slide31 : Oxford English Dictionary (OED): 37m words, 2.2 million quotations
Slide32 : Oxford English Dictionary (OED): 37m words, 2.2 million quotations
Slide33 : TIME magazine; complete archives 1923-present; 100m+ words
Slide34 : SCIENCE
Order in the Zoo (NUCLEAR PHYSICS) Miracles at Rehovot (RESEARCH) The Missing Ammosaurus (PALEONTOLOGY)
SOCIETY
CALIFORNIA: A State of Excitement (Modern Living) CANDIDE CAMERA: IN SEARCH OF THE SOUL (Modern Living) LABORATORY IN THE SUN: THE PAST AS FUTURE (Modern Living) The Battering Parent (Behavior / CHILDREN) Stay Single (Behavior / THE FAMILY)
PRESS
Letting Go of a Legacy (The Press / NEWSPAPERS) Penthouse v. Playboy (The Press / MAGAZINES)
SPORT
The Rise of Roman's Empire (FOOTBALL)
BUSINESS
Nixon's Rookie of the Year Toward a Just Marketplace (CONSUMERS) A License to Print Money (CORPORATIONS) Bargain Season (AIRLINES)
NATION
Good Guys All (The Nation / AMERICAN NOTES) Fair Play for Bears (The Nation) Of Peace and Politics (The Nation / THE PRESIDENCY)
WORLD
LEBANON: ALONG THE ARAFAT TRAIL (The World) Voting Under Fire (The World / ISRAEL)
EDUCATION
Between Moratoriums (CAMPUS COMMUNIQUE) Bugging the Bargainers (TEACHERS) M.I.T. and the Pentagon (UNIVERSITIES)
LAW
A Brother's Sacrifice (The Law / EQUITY) Threat to the Ombudsmen (The Law / POVERTY LAW)
ARTS & ENTERTAINMENT
Read the story (Time Listings / TELEVISION) Two for the Season (Dance / BALLET) Art Deco (Art / STYLES) The Very Expensive Coco (Show Business) Marshmallow Moratorium (Cinema / NEW MOVIES) Old Master (Cinema) Prosciutto and Melancholy (Cinema) The Shrinking Shrink (Cinema) Imminent Victorians (Books) The Dying of the Light (Books) One-Man Circus (Books) Privileged Heirlooms (Books)
Nov 7, 1969
Slide35 : TIME Magazine: US 1923-present, 100 million words: end up Ving
Slide36 : TIME Magazine: US 1923-current, 100m words: end up Ving
Slide37 : LDS General Conference: 23 million words, 1851-present
Slide38 : Corpus do Português: 45 million words, 1200s-1900s: terminar Vndo
Slide39 : Corpus do Português: 45 million words, 1200s-1900s: terminar.* [vg*]
Slide40 : Corpus del Español: 100 million words, 1200s-1900s: terminar Vndo
Slide41 : Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vvp]
Slide42 : Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vpp]
Slide43 : Corpus del Español: More complex: acabar / terminar (por)
Syntax : Syntax Large (>25m words) tagged corpora with many different genres. Only corpora:
English: British National Corpus
Historical English: OED (BYU interface)
Historical American English (1900s): TIME
Spanish: Corpus del Español
Portuguese: Corpus do Português
Semantics: collocates : Semantics: collocates “You can tell a lot about a word by the other words that it hangs out with”
How would you do it with Google or text archives?
Sort through the chaff – mutual information
Comparison between words (small/little, men/women)
Comparison between registers (chair, chain)
Comparison over time – semantic change (web, engine)
Comparison over time – cultural shifts (woman; 1800s vs 1900s)
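A rough sketch of how "sorting through the chaff" with mutual information might look. The function name, the toy sentence, and the span size are all assumptions made for illustration; the formula is one common formulation of the MI score (observed co-occurrences divided by the number expected by chance within the span), not necessarily the exact one used by any particular corpus interface.

```python
import math
from collections import Counter

def collocates_by_mi(tokens, node, span=4, min_freq=2):
    """Rank words found within `span` tokens of `node` by mutual information."""
    n = len(tokens)
    word_freq = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - span), min(n, i + span + 1)
        for j in range(lo, hi):
            if j != i:
                near[tokens[j]] += 1
    scored = []
    for word, joint in near.items():
        if joint < min_freq:
            continue  # very rare pairs give unstable scores
        # Co-occurrences expected by chance in a window of 2 * span slots
        expected = word_freq[node] * word_freq[word] * (2 * span) / n
        scored.append((word, joint, math.log2(joint / expected)))
    return sorted(scored, key=lambda row: row[2], reverse=True)

# Toy data (invented): very frequent words such as "the" sink toward the bottom
toy = "the road sign said stop and the neon sign said open and the dog sat on the mat".split()
for word, joint, mi in collocates_by_mi(toy, "sign", span=2):
    print(f"{word:6} {joint:3} {mi:5.2f}")
```

Raw frequency would rank "the" as high as any other neighbor of the node word; the MI score pushes it down because "the" is frequent everywhere, which is the point of using a relevancy measure rather than counts alone.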
Slide46 : Google: collocates: go through examples one by one, looking for nearby words
Slide47 : BNC: Collocates ( up to 10 words left / right ): sign.[n*]
Slide48 : BNC: collocates: sorted by “relevancy”: sign.[n*]
Slide49 : BNC: Comparing collocates by register: chair in FICT, ACAD
Slide50 : BNC: Comparing related words: sheer / utter / absolute [n*]
Slide51 : BNC: Comparing collocates of synonyms: [=evil] [nn*]
Evil play ?
Foul damage ?
Severe thing ?
Nick Ellis
“The psychological reality of collocation and semantic prosody”
Tomorrow, 10-11:30, 115 MCKB
Slide52 : Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’
Slide53 : Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’
Slide54 : BNC: Culture: woman / man + ADJ
Slide55 : Corpus del Español: Culture: hombre/mujer + ADJ
Slide56 : TIME: Semantic change: Collocates (of engine) by decade
Slide57 : TIME: Semantic change: Collocates (of chip.[nn*]) by decade
Slide58 : TIME: Collocates: Comparisons: strike.[nn*] + ADJ (1980s-2000s vs 1920s-1940s)
Slide59 : TIME: Collocates: Comparisons: wife + ADJ (1980s-2000s vs 1920s-1940s)
Slide60 : Corpus del Español: Collocates of mujeres ‘women’: 1800s vs 1900s
Slide61 : Corpus do Português: Collocates of mulheres ‘women’: 1800s vs 1900s
Semantics (collocates) : Semantics (collocates) Google, text archives, and simple corpora (e.g. Real Academia) can’t do collocates (natively)
Easiest with:
BNC
OED (BYU interface)
TIME
Corpus del Español
Corpus do Português
Morphology: word formation : Morphology: word formation Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
Easiest with:
BNC
OED (BYU interface)
TIME
Corpus del Español
Corpus do Português
Slide64 : Google: can’t do substrings: re*ion (retention, recision)
Slide65 : CREA (Real Academia): can’t do substrings: re*ión
Slide66 : BNC: easily does substrings: re*ion
Slide67 : BNC: easily does substrings: charts: re*ion
Slide68 : OED: easily does substrings: de*ion
Slide69 : OED: easily does substrings: de*ion
Slide70 : Corpus del Español: substrings: tables: de*i?n.[n*]
Slide71 : Corpus del Español: substrings: charts: de*i?n.[n*]
Slide72 : TIME: Morphology: *gate (1990s vs 1980s)
Slide73 : TIME: Tables: Comparisons of two different time periods (*heart* 1920s-1940s)
Slide74 : TIME: Tables: Comparisons of two different time periods (*heart* 1980s-present)
Morphology (substrings) : Morphology (substrings) Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
Easiest with:
BNC
OED (BYU interface)
TIME
Corpus del Español
Corpus do Português
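A minimal sketch of why substring queries such as re*ion are cheap once a corpus has its own word list with frequencies. The helper names and the toy frequencies are invented, and the wildcard semantics are an assumption (* = any run of letters, ? = exactly one letter); a real interface may define them differently.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard query into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")   # any run of letters, possibly empty
        elif ch == "?":
            parts.append(r"\w")    # exactly one letter (accented letters included)
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def match_wordlist(pattern, freq_by_word):
    """Return (word, frequency) pairs matching the whole pattern, most frequent first."""
    rx = wildcard_to_regex(pattern)
    hits = [(w, f) for w, f in freq_by_word.items() if rx.fullmatch(w)]
    return sorted(hits, key=lambda wf: wf[1], reverse=True)

# Invented frequencies, for illustration only
toy_lexicon = {"region": 50, "religion": 30, "retention": 5, "reason": 40, "nation": 20}
print(match_wordlist("re*ion", toy_lexicon))
```

The search runs over the list of distinct word forms rather than over the running text, so even a very large corpus only needs to scan its lexicon once per query.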
Lexis: word frequency : Lexis: word frequency Google, text archives, and simple corpora (e.g. Real Academia) can’t compare frequencies
Corpora with relational database architecture can do this easily:
BNC
OED (BYU interface)
TIME
Corpus del Español
Corpus do Português
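To illustrate what a relational database architecture buys here, the following sketch uses Python's built-in sqlite3 module. The table layout, the column names, and every number in it are assumptions made up for the example; they are not the schema or the counts of any of the corpora listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lexicon (word_id INTEGER PRIMARY KEY, word TEXT, pos TEXT);
CREATE TABLE freq (word_id INTEGER, register TEXT, hits INTEGER);
CREATE TABLE register_size (register TEXT PRIMARY KEY, tokens INTEGER);
""")

# Invented toy values, not real corpus counts
conn.execute("INSERT INTO lexicon VALUES (1, 'shiny', 'ADJ')")
conn.executemany("INSERT INTO freq VALUES (?, ?, ?)",
                 [(1, "FICTION", 300), (1, "ACADEMIC", 15)])
conn.executemany("INSERT INTO register_size VALUES (?, ?)",
                 [("FICTION", 16_000_000), ("ACADEMIC", 15_000_000)])

def per_million(word):
    """Frequency of `word` per million tokens in each register, highest first."""
    return conn.execute("""
        SELECT f.register, 1000000.0 * f.hits / r.tokens AS pm
        FROM freq AS f
        JOIN lexicon AS l ON l.word_id = f.word_id
        JOIN register_size AS r ON r.register = f.register
        WHERE l.word = ?
        ORDER BY pm DESC
    """, (word,)).fetchall()

print(per_million("shiny"))
```

Because register sizes sit in their own table, one JOIN normalizes raw hits to a per-million rate, which is what makes frequencies comparable across registers or decades of different sizes.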
Slide77 : .uk, .ca, .au
.com, .us, .edu
Google: trying to limit by dialect or register
Slide78 : BNC: Differences across registers / genres: shiny
Slide79 : TIME: Cultural: reds
Slide80 : TIME: Cultural: reds
Slide81 : Corpus del Español: Cultural: [soldado]
Slide82 : BNC: Limit by register: [vvi] (infinitive) in LEGAL vs ACADEMIC
Slide83 : BNC: Frequency: by part of speech: phrasal verbs: comparing registers
Slide84 : OED: Lexical bundles: * * (+1900s -1800s)
Nick Ellis
“The processing of formulas in native and second-language speakers: Psycholinguistic and corpus determinants”
Today, 3-4:30, B104 JFSB
Slide85 : TIME: Syntax/lexical: to [vvi] up: (2000s vs. 1930s)
Slide86 : TIME: Lexical: [vvi] (1930s vs. 1940s-1960s)
Slide87 : Corpus del Español: Lexical: [r] (adverbs): 1900s vs 1800s
Slide88 : Corpus del Español and Corpus do Português: Frequency dictionaries (Routledge)
The BYU Corpus of Contemporary American English : The BYU Corpus of Contemporary American English 360 million words, 1990-2007
20 million words each year
Equally divided into:
Spoken (Oprah, Today, NPR; unscripted)
Fiction (Short stories, first chapters of first editions)
Popular magazines (90+; selected by subject area)
Newspapers (10; by sub-sections as well)
Academic (90+; selected by subject area)
Monitor corpus: will be updated every month
Similar interface to BNC interface (corpus.byu.edu/bnc)
Material all collected; online by Feb 2008
Slide90 : The BYU Corpus of Historical American English
Architecture : Architecture Finding architecture that allows for:
Size
Speed
Annotation
Relational databases
Architecture: Linear search : Architecture: Linear search
100 million words = 600 MB of RAM
360 million words = 2.2 GB of RAM
Regular expressions (“pattern matching”)
Semantics (from thesaurus, WordNet, etc)
Increasingly difficult the more annotation you add
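The two RAM figures above imply roughly 6 bytes per word (600 MB / 100 million words). The sketch below checks that arithmetic and shows what a linear regular-expression search over a tagged text could look like. The word_TAG encoding and the tag names are assumptions chosen for the example, not the format of any actual corpus.

```python
import re

# Implied by the figures above: 600 MB for 100 million words
BYTES_PER_WORD = 600e6 / 100e6

def ram_needed_gb(n_words):
    """Estimated RAM (in GB) to hold a corpus of n_words for linear search."""
    return n_words * BYTES_PER_WORD / 1e9

# Toy tagged text, encoded as word_TAG (hypothetical tags)
text = ("he_PRON will_MD end_VB up_RP paying_VBG more_ADV "
        "she_PRON ended_VBD up_RP watching_VBG tv_NN")

# "end up V-ing": any form of END, then "up", then a word tagged as an -ing verb
END_UP_VING = re.compile(r"\b(end|ends|ended|ending)_[A-Z]+ up_[A-Z]+ ([a-z]+ing)_VBG\b")

matches = [(m.group(1), m.group(2)) for m in END_UP_VING.finditer(text)]
print(ram_needed_gb(360e6), matches)
```

Every extra annotation layer (lemma, semantic class, and so on) has to be packed into the same string, which lengthens the text to scan and complicates the patterns; that is one way to read the last bullet above.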
Architecture: Huge “hashes” : Architecture: Huge “hashes” Number all words in corpus
[1] he [2] will [3] end [4] up [5] paying [6] more
For “end up Ving”:
Three files with large sets of numbers: [end] +1 [up] +1 [Ving]
Problem: really starts bogging down around 50 million words
How does Google do it?
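The numbering scheme above can be sketched as a positional index: every word form and every tag maps to the list of positions where it occurs, and "end up Ving" becomes a check that three position sets line up at offsets +1 and +2. The function names and the tags in the toy sentence are assumptions made for illustration only.

```python
from collections import defaultdict

def build_index(tagged_tokens):
    """Map each word form and each tag to its positions (numbered from 1)."""
    by_word, by_tag = defaultdict(list), defaultdict(list)
    for pos, (word, tag) in enumerate(tagged_tokens, start=1):
        by_word[word.lower()].append(pos)
        by_tag[tag].append(pos)
    return by_word, by_tag

def find_end_up_ving(by_word, by_tag,
                     end_forms=("end", "ends", "ended", "ending"),
                     ving_tag="VVG"):
    """Positions p where a form of END is at p, 'up' at p+1, and a V-ing at p+2."""
    ends = {p for form in end_forms for p in by_word.get(form, [])}
    ups = set(by_word.get("up", []))
    vings = set(by_tag.get(ving_tag, []))
    return sorted(p for p in ends if p + 1 in ups and p + 2 in vings)

# [1] he [2] will [3] end [4] up [5] paying [6] more  (tags are illustrative)
sentence = [("he", "PNP"), ("will", "VM0"), ("end", "VVI"),
            ("up", "AVP"), ("paying", "VVG"), ("more", "AV0")]
by_word, by_tag = build_index(sentence)
print(find_end_up_ving(by_word, by_tag))
```

The position lists for frequent items such as "up" or a whole part-of-speech tag grow with the corpus, so the three sets being compared become very large, consistent with the slowdown described above.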
Architecture: Collocates : Architecture: Collocates Find “pointers” to all occurrences of word
Look for collocates at each position
Problem: if you hit the hard drive to read each occurrence, even at 3,000 hits per second, then 15-20 seconds for 50,000 occurrences
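A small sketch of the cost estimate and of the pointer-based lookup itself. The function names are hypothetical, and the token list is held in memory here purely to show the logic; the slide's point is that each lookup becomes expensive when it means a separate disk read.

```python
from collections import Counter

def seek_time_seconds(occurrences, reads_per_second=3000):
    """Time to fetch every occurrence if each one costs a separate disk read."""
    return occurrences / reads_per_second

def collocates_from_pointers(tokens, pointers, span=4):
    """Count the words within `span` tokens of each pointed-to position."""
    counts = Counter()
    for p in pointers:
        lo, hi = max(0, p - span), min(len(tokens), p + span + 1)
        counts.update(tokens[lo:p])        # left context
        counts.update(tokens[p + 1:hi])    # right context
    return counts

# 50,000 occurrences at 3,000 reads per second
print(round(seek_time_seconds(50_000), 1))

# Toy example: pointers to the two occurrences of "chair"
toy = "a big chair and a small chair".split()
print(collocates_from_pointers(toy, [2, 6], span=1))
```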
Architecture: “window” of words : Architecture: “window” of words