
davies barker 2007

Added: January 23, 2008
Presentation Transcript

Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses :
  Mark Davies
  Linguistics and English Language
  http://davies-linguistics.byu.edu


More information :
  http://corpus.byu.edu
  http://davies-linguistics.byu.edu
  mark_davies@byu.edu
  LING 485 “Corpus Linguistics” (Winter 2008)


Outline :
  Duct tape
  Some frequency problems with Google
  Google / text archives vs real corpora
  Semantics
  Syntax
  Morphology
  Lexicon
  Upcoming projects


Slide5 : Google: they’re doing something right …


Google: frequency data :
  Number of “hits” OK for single words
  For multiple word strings, more or less a guess


Slide7 : Google: best guess for string of words: wanted to google


Slide8 : Google: the real frequency statistics are often very different from the first estimate


Syntax: “ends up V-ing” :
  Interesting semantic and pragmatic space
  Which verbs does it occur with?
  Increasing or decreasing?
  What styles of speech / text types?
  Where: US, UK, other world Englishes?
  (Similar constructions in other languages?)


Slide10 : First generation corpora: Brown Corpus (1 million words), US, 1960s


Slide11 : First generation corpora: Brown Corpus (1 million words), US, 1960s


Slide12 : Google: have to search by exact word forms (problem for syntax): ends up going, ended up watching, end up paying, . . .


Slide13 : Google: frequency problems (again)


Slide14 : Google: frequency problems (again, today)


Slide15 : Google: frequency problems (again)


Slide16 : Google: trying to get historical data


Slide17 : .uk, .ca, .au  .com, .us, .edu Google: trying to limit by dialect or register


Google results: “ends up V-ing” :
  Problematic frequency results
  Can only search for thousands of individual forms
  No way to know if increasing or decreasing
  No way to know what styles of speech / text types
  No way to know where: US, UK, other world Englishes


Slide20 : EBSCO: Academic Search Premier: 1,850 full-text journals, 1985-present


Slide21 : ProQuest: Research Library: 1,800 magazines, 1980s-present


Slide22 : Lexis-Nexis Academic: 1000s of newspapers, transcripts of news programs, etc


Why architecture matters :
  Textual corpus (words, sentences)
  Architecture (annotation, indexing, search engine)
  Questions


Slide24 : Real Academia Española (CREA); can’t do syntax


Slide25 : Real Academia Española (CREA); can’t do syntax


Slide26 : O Publico (200 million words); just newspapers


Features of useful corpora :
  Size (cf. million-word Brown Corpus: 2 tokens)
  Annotation (cf. Google and Spanish corpus: no part of speech)
  Representativity (cf. Portuguese corpora: all newspapers)


Slide28 : British National Corpus: 100 million words, UK, 1980s-90s; end up Ving


Slide29 : British National Corpus: end up Ving by register / genre


Slide30 : BNC: end up Ving by “micro”-register


Slide31 : Oxford English Dictionary (OED): 37m words, 2.2 million quotations


Slide32 : Oxford English Dictionary (OED): 37m words, 2.2 million quotations


Slide33 : TIME magazine; complete archives 1923-present; 100m+ words


Slide34 : Nov 7, 1969
  SCIENCE: Order in the Zoo (NUCLEAR PHYSICS); Miracles at Rehovot (RESEARCH); The Missing Ammosaurus (PALEONTOLOGY)
  SOCIETY: CALIFORNIA: A State of Excitement (Modern Living); CANDIDE CAMERA: IN SEARCH OF THE SOUL (Modern Living); LABORATORY IN THE SUN: THE PAST AS FUTURE (Modern Living); The Battering Parent (Behavior / CHILDREN); Stay Single (Behavior / THE FAMILY)
  PRESS: Letting Go of a Legacy (The Press / NEWSPAPERS); Penthouse v. Playboy (The Press / MAGAZINES)
  SPORT: The Rise of Roman's Empire (FOOTBALL)
  BUSINESS: Nixon's Rookie of the Year Toward a Just Marketplace (CONSUMERS); A License to Print Money (CORPORATIONS); Bargain Season (AIRLINES)
  NATION: Good Guys All (The Nation / AMERICAN NOTES); Fair Play for Bears (The Nation); Of Peace and Politics (The Nation / THE PRESIDENCY)
  WORLD: LEBANON: ALONG THE ARAFAT TRAIL (The World); Voting Under Fire (The World / ISRAEL)
  EDUCATION: Between Moratoriums (CAMPUS COMMUNIQUE); Bugging the Bargainers (TEACHERS); M.I.T. and the Pentagon (UNIVERSITIES)
  LAW: A Brother's Sacrifice (The Law / EQUITY); Threat to the Ombudsmen (The Law / POVERTY LAW)
  ARTS & ENTERTAINMENT: Read the story (Time Listings / TELEVISION); Two for the Season (Dance / BALLET); Art Deco (Art / STYLES); The Very Expensive Coco (Show Business); Marshmallow Moratorium (Cinema / NEW MOVIES); Old Master (Cinema); Prosciutto and Melancholy (Cinema); The Shrinking Shrink (Cinema); Imminent Victorians (Books); The Dying of the Light (Books); One-Man Circus (Books); Privileged Heirlooms (Books)


Slide35 : TIME Magazine: US 1923-present, 100 million words: end up Ving


Slide36 : TIME Magazine: US 1923-current, 100m words: end up Ving


Slide37 : LDS General Conference: 23 million words, 1851-present


Slide38 : Corpus do Português: 45 million words, 1200s-1900s: terminar Vndo


Slide39 : Corpus do Português: 45 million words, 1200s-1900s: terminar.* [vg*]


Slide40 : Corpus del Español: 100 million words, 1200s-1900s: terminar Vndo


Slide41 : Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vvp]


Slide42 : Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vpp]


Slide43 : Corpus del Español: More complex: acabar / terminar (por)


Syntax :
  Large (>25m words) tagged corpora with many different genres. Only corpora:
  English: British National Corpus
  Historical English: OED (BYU interface)
  Historical American English (1900s): TIME
  Spanish: Corpus del Español
  Portuguese: Corpus do Português
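A minimal sketch of why tagging matters for a query such as “end up V-ing”, assuming a toy corpus stored as (word, lemma, tag) triples. The data, the function name `end_up_ving`, and the tag labels are illustrative assumptions ("vvg" is used here to mark an -ing verb form, in the style of the BNC tags shown on later slides); this is not the interface or engine behind the corpora listed above.

```python
from collections import Counter

# Toy tagged corpus: (word form, lemma, part-of-speech tag).
# Hypothetical data; "vvg" marks an -ing verb form.
TOKENS = [
    ("he", "he", "pnp"), ("will", "will", "vm0"), ("end", "end", "vvi"),
    ("up", "up", "avp"), ("paying", "pay", "vvg"), ("more", "more", "av0"),
    ("she", "she", "pnp"), ("ended", "end", "vvd"), ("up", "up", "avp"),
    ("watching", "watch", "vvg"), ("it", "it", "pnp"),
    ("it", "it", "pnp"), ("ends", "end", "vvz"), ("up", "up", "avp"),
    ("here", "here", "av0"),
]

def end_up_ving(tokens):
    """Count the verbs (by lemma) that follow any form of 'end' + 'up'."""
    hits = Counter()
    # Slide a three-token window over the corpus.
    for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
        lemma1 = first[1]
        word2 = second[0]
        lemma3, tag3 = third[1], third[2]
        if lemma1 == "end" and word2 == "up" and tag3 == "vvg":
            hits[lemma3] += 1
    return hits

print(end_up_ving(TOKENS))
```

Because the match is on lemma and tag rather than on exact strings, one query covers end / ends / ended and every -ing verb, which is the step an exact-word-form search cannot take.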


Semantics: collocates :
  “You can tell a lot about a word by the other words that it hangs out with”
  How would you do it with Google or text archives?
  Sort through the chaff: mutual information
  Comparison between words (small/little, men/women)
  Comparison between registers (chair, chain)
  Comparison over time: semantic change (web, engine)
  Comparison over time: cultural shifts (woman; 1800s vs 1900s)
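One way to “sort through the chaff” is to rank nearby words by mutual information. The sketch below assumes the plain pointwise form, MI = log2(joint × N / (f(node) × f(collocate))), with no correction for window size; the function name `collocates_by_mi`, the `span` parameter, and the sample sentence are assumptions for illustration, and the formula used by any particular corpus interface may differ.

```python
import math
from collections import Counter

def collocates_by_mi(tokens, node, span=4):
    """Rank words found within `span` tokens of `node` by pointwise MI."""
    n = len(tokens)
    freq = Counter(tokens)   # overall frequency of every word
    near = Counter()         # frequency of each word near the node
    for i, word in enumerate(tokens):
        if word == node:
            near.update(tokens[max(0, i - span):i])   # left context
            near.update(tokens[i + 1:i + 1 + span])   # right context
    scores = {
        coll: math.log2(joint * n / (freq[node] * freq[coll]))
        for coll, joint in near.items()
    }
    # Highest MI first: words that are rare overall but frequent near the node.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sample = "the cat sat the dog sat the cat ran".split()
for collocate, score in collocates_by_mi(sample, "cat", span=1):
    print(collocate, round(score, 2))
```

Raw co-occurrence counts would put "the" at the top of this list; the MI score demotes it because it is frequent everywhere, not just next to the node word.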


Slide46 : Google: collocates: go through examples one by one, looking for nearby words


Slide47 : BNC: Collocates (up to 10 words left / right): sign.[n*]


Slide48 : BNC: collocates: sorted by “relevancy”: sign.[n*]


Slide49 : BNC: Comparing collocates by register: chair in FICT, ACAD


Slide50 : BNC: Comparing related words: sheer / utter / absolute [n*]


Slide51 : BNC: Comparing collocates of synonyms: [=evil] [nn*]
  Evil play ? Foul damage ? Severe thing ?
  Nick Ellis, “The psychological reality of collocation and semantic prosody”, Tomorrow, 10-11:30, 115 MCKB


Slide52 : Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’


Slide53 : Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’


Slide54 : BNC: Culture: woman / man + ADJ


Slide55 : Corpus del Español: Culture: hombre/mujer + ADJ


Slide56 : TIME: Semantic change: Collocates (of engine) by decade


Slide57 : TIME: Semantic change: Collocates (of chip.[nn*]) by decade


Slide58 : TIME: Collocates: Comparisons: strike.[nn*] + ADJ (1980s-2000s vs 1920s-1940s)


Slide59 : TIME: Collocates: Comparisons: wife + ADJ (1980s-2000s vs 1920s-1940s)


Slide60 : Corpus del Español: Collocates of mujeres ‘women’: 1800s vs 1900s


Slide61 : Corpus do Português: Collocates of mulheres ‘women’: 1800s vs 1900s


Semantics (collocates) :
  Google, text archives, and simple corpora (e.g. Real Academia) can’t do collocates (natively)
  Easiest with: BNC, OED (BYU interface), TIME, Corpus del Español, Corpus do Português


Morphology: word formation :
  Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
  Easiest with: BNC, OED (BYU interface), TIME, Corpus del Español, Corpus do Português


Slide64 : Google: can’t do substrings: re*ion (retention, recision)


Slide65 : CREA (Real Academia): can’t do substrings: re*ión


Slide66 : BNC: easily does substrings: re*ion


Slide67 : BNC: easily does substrings: charts: re*ion


Slide68 : OED: easily does substrings: de*ion


Slide69 : OED: easily does substrings: de*ion


Slide70 : Corpus del Español: substrings: tables: de*i?n.[n*]


Slide71 : Corpus del Español: substrings: charts: de*i?n.[n*]


Slide72 : TIME: Morphology: *gate (1990s vs 1980s)


Slide73 : TIME: Tables: Comparisons of two different time periods (*heart* 1920s-1940s)


Slide74 : TIME: Tables: Comparisons of two different time periods (*heart* 1980s-present)


Morphology (substrings) :
  Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
  Easiest with: BNC, OED (BYU interface), TIME, Corpus del Español, Corpus do Português
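Substring queries such as re*ion or *gate become straightforward once the corpus has a list of distinct word forms with frequencies. The sketch below assumes that `*` stands for any run of letters and `?` for exactly one letter; the helper names and the frequency figures are invented for illustration and do not come from any of the corpora above.

```python
import re
from collections import Counter

def wildcard_to_regex(pattern):
    """Translate * (any run of letters) and ? (one letter) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def match_word_forms(freq_list, pattern):
    """Return (word, frequency) pairs matching the pattern, most frequent first."""
    rx = wildcard_to_regex(pattern)
    hits = [(w, f) for w, f in freq_list.items() if rx.fullmatch(w)]
    return sorted(hits, key=lambda wf: wf[1], reverse=True)

# Hypothetical frequency list (word form -> count).
FREQ = Counter({"region": 120, "religion": 95, "retention": 12,
                "reason": 300, "watergate": 40, "irangate": 7})

print(match_word_forms(FREQ, "re*ion"))
print(match_word_forms(FREQ, "*gate"))
```

The pattern is matched against the list of distinct word forms, not against the running text, so the cost depends on vocabulary size rather than corpus size.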


Lexis: word frequency :
  Google, text archives, and simple corpora (e.g. Real Academia) can’t compare frequencies
  Corpora with relational database architecture can do this easily: BNC, OED (BYU interface), TIME, Corpus del Español, Corpus do Português
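To illustrate why a relational layout makes frequency comparison easy, here is a small sketch using Python's built-in sqlite3 module. The table names, column names, register sizes, and counts are all hypothetical; the slides do not describe the actual schema, so this only shows the general idea of joining a frequency table to a table of register sizes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE register_size (register TEXT PRIMARY KEY, words INTEGER);
    CREATE TABLE word_freq (word TEXT, register TEXT, freq INTEGER);
""")
# Hypothetical sizes and counts, chosen only to make the arithmetic visible.
conn.executemany("INSERT INTO register_size VALUES (?, ?)",
                 [("FICTION", 16_000_000), ("ACADEMIC", 15_000_000)])
conn.executemany("INSERT INTO word_freq VALUES (?, ?, ?)",
                 [("shiny", "FICTION", 640), ("shiny", "ACADEMIC", 30)])

def per_million(word):
    """Frequency per million words in each register, highest first."""
    return conn.execute("""
        SELECT f.register, 1000000.0 * f.freq / s.words AS per_mil
        FROM word_freq AS f
        JOIN register_size AS s ON s.register = f.register
        WHERE f.word = ?
        ORDER BY per_mil DESC
    """, (word,)).fetchall()

print(per_million("shiny"))
```

Normalizing by register size in the query itself is what lets registers of different sizes be compared directly.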


Slide77 : .uk, .ca, .au  .com, .us, .edu Google: trying to limit by dialect or register


Slide78 : BNC: Differences across registers / genres: shiny


Slide79 : TIME: Cultural: reds


Slide80 : TIME: Cultural: reds


Slide81 : Corpus del Español: Cultural: [soldado]


Slide82 : BNC: Limit by register: [vvi] (infinitive) in LEGAL vs ACADEMIC


Slide83 : BNC: Frequency: by part of speech: phrasal verbs: comparing registers


Slide84 : OED: Lexical bundles: * * (+1900s -1800s)
  Nick Ellis, “The processing of formulas in native and second-language speakers: Psycholinguistic and corpus determinants”, Today, 3-4:30, B104 JFSB


Slide85 : TIME: Syntax/lexical: to [vvi] up: (2000s vs. 1930s)


Slide86 : TIME: Lexical: [vvi] (1930s vs. 1940s-1960s)


Slide87 : Corpus del Español: Lexical: [r] (adverbs): 1900s vs 1800s


Slide88 : Corpus del Español and Corpus do Português: Frequency dictionaries (Routledge)


The BYU Corpus of Contemporary American English :
  360 million words, 1990-2007
  20 million words each year
  Equally divided into:
    Spoken (Oprah, Today, NPR; unscripted)
    Fiction (Short stories, first chapters of first editions)
    Popular magazines (90+; selected by subject area)
    Newspapers (10; by sub-sections as well)
    Academic (90+; selected by subject area)
  Monitor corpus: will be updated every month
  Similar interface to BNC interface (corpus.byu.edu/bnc)
  Material all collected; online by Feb 2008


Slide90 : The BYU Corpus of Historical American English


Architecture :
  Finding architecture that allows for:
    Size
    Speed
    Annotation
  Relational databases


Architecture: Linear search :
  100 million words = 600 MB of RAM
  360 million words = 2.2 GB of RAM
  Regular expressions (“pattern matching”)
  Semantics (from thesaurus, WordNet, etc)
  Increasingly difficult the more annotation you add
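The two RAM figures on this slide imply roughly 6 bytes per word (600 MB / 100 million words), and 360 million words at that rate comes to about 2.16 GB, which rounds to the 2.2 GB quoted. The sketch below reproduces that arithmetic and shows what a linear regular-expression scan looks like; it assumes decimal megabytes, and the sample text, pattern, and function names are illustrative only.

```python
import re

# Bytes per word implied by the slide's first figure (assumes 1 MB = 10**6 bytes).
BYTES_PER_WORD = 600_000_000 / 100_000_000

def ram_needed_gb(n_words):
    """Estimated RAM to hold the raw text of a corpus of n_words, in GB."""
    return n_words * BYTES_PER_WORD / 1e9

def linear_search(text, pattern):
    """Scan the whole text from start to finish with a regular expression."""
    return [m.group(0) for m in re.finditer(pattern, text)]

SAMPLE = "he will end up paying more and she ended up watching it"
# Crude stand-in for "end up V-ing": spells out the forms of 'end' by hand
# and treats any word ending in -ing as a verb.
END_UP_VING = r"\bend(?:s|ed)? up \w+ing\b"

print(ram_needed_gb(360_000_000))
print(linear_search(SAMPLE, END_UP_VING))
```

Every query reads the entire text, so time grows with corpus size, and the pattern has to encode by hand whatever annotation (lemma, part of speech, synonyms) the query needs.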


Architecture: Huge “hashes” :
  Number all words in corpus: [1] he [2] will [3] end [4] up [5] paying [6] more
  For “end up Ving”: three files with large sets of numbers: [end] +1 [up] +1 [Ving]
  Problem: really starts bogging down around 50 million words
  How does Google do it?
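The numbering scheme on this slide amounts to a positional index: each word maps to the set of positions where it occurs, and a sequence is found by checking positions at offsets +1, +2, and so on. The sketch below is a small in-memory version under that reading; the function names are assumptions, and treating any word ending in -ing as “Ving” is a deliberate simplification.

```python
from collections import defaultdict

def build_index(tokens):
    """Map each word to the set of positions where it occurs (numbered from 1)."""
    index = defaultdict(set)
    for pos, word in enumerate(tokens, start=1):
        index[word].add(pos)
    return index

def find_sequence(index, slots):
    """Find start positions where slot k is filled at position start + k.

    `slots` is a list of sets of acceptable word forms, one set per slot.
    """
    def positions(forms):
        found = set()
        for form in forms:
            found |= index.get(form, set())
        return found

    starts = positions(slots[0])
    for offset, forms in enumerate(slots[1:], start=1):
        candidates = positions(forms)
        starts = {p for p in starts if p + offset in candidates}
    return sorted(starts)

tokens = "he will end up paying more".split()
index = build_index(tokens)
ving = {w for w in index if w.endswith("ing")}   # crude stand-in for [Ving]
print(find_sequence(index, [{"end", "ends", "ended"}, {"up"}, ving]))
```

The [Ving] slot is the expensive one: in a large corpus it expands to a very large set of positions, which fits the slide's remark that this approach bogs down as the corpus grows.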


Architecture: Collocates :
  Find “pointers” to all occurrences of word
  Look for collocates at each position
  Problem: if you hit the hard drive to read each occurrence, even at 3,000 hits per second, it takes 15-20 seconds for 50,000 occurrences
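The timing on this slide follows from simple division: 50,000 occurrences at 3,000 reads per second is about 16.7 seconds, inside the 15-20 second range quoted. The sketch below pairs that estimate with the pointer-based lookup the slide describes; the function names and sample sentence are assumptions, and the token list is held in memory here, whereas the slide's problem arises when each pointer costs a separate disk read.

```python
from collections import Counter

def estimated_seconds(occurrences, reads_per_second=3000):
    """Time to fetch every occurrence when each one costs a separate read."""
    return occurrences / reads_per_second

def collocates_via_pointers(tokens, pointers, span=4):
    """Visit each pointer (token position) and count the words around it."""
    counts = Counter()
    for p in pointers:
        counts.update(tokens[max(0, p - span):p])     # words to the left
        counts.update(tokens[p + 1:p + 1 + span])     # words to the right
    return counts

tokens = "he will end up paying more than she will end up saving".split()
pointers = [i for i, w in enumerate(tokens) if w == "up"]

print(round(estimated_seconds(50_000), 1))
print(collocates_via_pointers(tokens, pointers, span=1))
```

The work per occurrence is tiny; what dominates is the number of separate reads, so the cost scales with the frequency of the node word.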


Architecture: “window” of words

