davies barker 2007


Presentation Transcript

Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses: 

Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses
Mark Davies
Linguistics and English Language
http://davies-linguistics.byu.edu

More information: 

More information
http://corpus.byu.edu
http://davies-linguistics.byu.edu
mark_davies@byu.edu
LING 485 “Corpus Linguistics” (Winter 2008)

Outline: 

Outline
Duct tape
Some frequency problems with Google
Google / text archives vs real corpora
Semantics
Syntax
Morphology
Lexicon
Upcoming projects

Slide5: 

Google: they’re doing something right …

Google: frequency data: 

Google: frequency data
Number of “hits”
    OK for single words
    For multiple word strings, more or less a guess

Slide7: 

Google: best guess for string of words: wanted to google

Slide8: 

Google: the real frequency statistics are often very different from the first estimate

Syntax: “ends up V-ing”: 

Syntax: “ends up V-ing”
Interesting semantic and pragmatic space
Which verbs does it occur with?
Increasing or decreasing?
What styles of speech / text types?
Where: US, UK, other world Englishes?
(Similar constructions in other languages?)

Slide10: 

First generation corpora: Brown Corpus (1 million words), US, 1960s

Slide11: 

First generation corpora: Brown Corpus (1 million words), US, 1960s

Slide12: 

ends up going
ended up watching
end up paying
. . .
Google: have to search by exact word forms (problem for syntax)

Slide13: 

Google: frequency problems (again)

Slide14: 

Google: frequency problems (again, today)

Slide15: 

Google: frequency problems (again)

Slide16: 

Google: trying to get historical data

Slide17: 

.uk, .ca, .au vs. .com, .us, .edu
Google: trying to limit by dialect or register

Google results: “ends up V-ing”: 

Google results: “ends up V-ing”
Problematic frequency results
Can only search for thousands of individual forms
No way to know if increasing or decreasing
No way to know what styles of speech / text types
No way to know where: US, UK, other world Englishes

Slide20: 

EBSCO: Academic Search Premier: 1,850 full-text journals, 1985-present

Slide21: 

ProQuest: Research Library: 1,800 magazines, 1980s-present

Slide22: 

Lexis-Nexis Academic: 1000s of newspapers, transcripts of news programs, etc

Why architecture matters: 

Why architecture matters
Textual corpus (Words, sentences)
Architecture (Annotation, indexing, search engine)
Questions

Slide24: 

Real Academia Española (CREA); can’t do syntax

Slide25: 

Real Academia Española (CREA); can’t do syntax

Slide26: 

O Publico (200 million words); just newspapers

Features of useful corpora: 

Features of useful corpora
Size: cf. million-word Brown Corpus – 2 tokens
Annotation: cf. Google and Spanish corpus; no part of speech
Representativity: cf. Portuguese corpora; all newspapers

Slide28: 

British National Corpus: 100 million words, UK, 1980s-90s; end up Ving

Slide29: 

British National Corpus: end up Ving by register / genre

Slide30: 

BNC: end up Ving by “micro”-register

Slide31: 

Oxford English Dictionary (OED): 37m words, 2.2 million quotations

Slide32: 

Oxford English Dictionary (OED): 37m words, 2.2 million quotations

Slide33: 

TIME magazine; complete archives 1923-present; 100m+ words

Slide34: 

SCIENCE
    Order in the Zoo (NUCLEAR PHYSICS)
    Miracles at Rehovot (RESEARCH)
    The Missing Ammosaurus (PALEONTOLOGY)
SOCIETY
    CALIFORNIA: A State of Excitement (Modern Living)
    CANDIDE CAMERA: IN SEARCH OF THE SOUL (Modern Living)
    LABORATORY IN THE SUN: THE PAST AS FUTURE (Modern Living)
    The Battering Parent (Behavior / CHILDREN)
    Stay Single (Behavior / THE FAMILY)
PRESS
    Letting Go of a Legacy (The Press / NEWSPAPERS)
    Penthouse v. Playboy (The Press / MAGAZINES)
SPORT
    The Rise of Roman's Empire (FOOTBALL)
BUSINESS
    Nixon's Rookie of the Year
    Toward a Just Marketplace (CONSUMERS)
    A License to Print Money (CORPORATIONS)
    Bargain Season (AIRLINES)
NATION
    Good Guys All (The Nation / AMERICAN NOTES)
    Fair Play for Bears (The Nation)
    Of Peace and Politics (The Nation / THE PRESIDENCY)
WORLD
    LEBANON: ALONG THE ARAFAT TRAIL (The World)
    Voting Under Fire (The World / ISRAEL)
EDUCATION
    Between Moratoriums (CAMPUS COMMUNIQUE)
    Bugging the Bargainers (TEACHERS)
    M.I.T. and the Pentagon (UNIVERSITIES)
LAW
    A Brother's Sacrifice (The Law / EQUITY)
    Threat to the Ombudsmen (The Law / POVERTY LAW)
ARTS & ENTERTAINMENT
    Read the story (Time Listings / TELEVISION)
    Two for the Season (Dance / BALLET)
    Art Deco (Art / STYLES)
    The Very Expensive Coco (Show Business)
    Marshmallow Moratorium (Cinema / NEW MOVIES)
    Old Master (Cinema)
    Prosciutto and Melancholy (Cinema)
    The Shrinking Shrink (Cinema)
    Imminent Victorians (Books)
    The Dying of the Light (Books)
    One-Man Circus (Books)
    Privileged Heirlooms (Books)
Nov 7, 1969

Slide35: 

TIME Magazine: US 1923-present, 100 million words: end up Ving

Slide36: 

TIME Magazine: US 1923-current, 100m words: end up Ving

Slide37: 

LDS General Conference: 23 million words, 1851-present

Slide38: 

Corpus do Português: 45 million words, 1200s-1900s: terminar Vndo

Slide39: 

Corpus do Português: 45 million words, 1200s-1900s: terminar.* [vg*]

Slide40: 

Corpus del Español: 100 million words, 1200s-1900s: terminar Vndo

Slide41: 

Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vvp]

Slide42: 

Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vpp]

Slide43: 

Corpus del Español: More complex: acabar / terminar (por)

Syntax: 

Syntax
Large (>25m words) tagged corpora with many different genres.
Only corpora:
    English: British National Corpus
    Historical English: OED (BYU interface)
    Historical American English (1900s): TIME
    Spanish: Corpus del Español
    Portuguese: Corpus do Português

Semantics: collocates: 

Semantics: collocates
“You can tell a lot about a word by the other words that it hangs out with”
How would you do it with Google or text archives?
Sort through the chaff – mutual information
Comparison between words (small/little, men/women)
Comparison between registers (chair, chain)
Comparison over time – semantic change (web, engine)
Comparison over time – cultural shifts (woman; 1800s vs 1900s)
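A minimal sketch of how a mutual information score can "sort through the chaff" when ranking collocates. It assumes a plain list of tokens and a common form of the score (observed co-occurrences against the count expected by chance within the window); the function name, the window size, and the `min_count` cutoff are illustrative assumptions, not the formula or settings used by any particular corpus interface.

```python
import math
from collections import Counter


def collocates_by_mi(tokens, node, span=4, min_count=2):
    """Rank words found within `span` tokens of `node` by a mutual information score.

    Assumed scoring: log2(observed / expected), where expected is the number of
    co-occurrences predicted by chance from the two word frequencies, the
    window size (2 * span slots) and the corpus size.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    near = Counter()

    # Count every word that appears in the window around each occurrence of the node.
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - span), min(n, i + span + 1)
        for j in range(lo, hi):
            if j != i:
                near[tokens[j]] += 1

    scores = {}
    for word, joint in near.items():
        if joint < min_count:
            continue  # very rare pairs give unstable scores
        expected = word_freq[node] * word_freq[word] * (2 * span) / n
        scores[word] = math.log2(joint / expected)

    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Toy token list, invented for illustration only.
    toy = ("the sheer luck of the draw " * 3 + "the cat sat on the mat " * 20).split()
    for word, score in collocates_by_mi(toy, "sheer", span=1):
        print(f"{word}\t{score:.2f}")
```

In the toy data both "the" and "luck" occur next to "sheer" equally often, but "luck" scores higher because "the" is frequent everywhere. That is the sense in which the score filters out high-frequency words that are not informative about the node word.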

Slide46: 

Google: collocates: go through examples one by one, looking for nearby words

Slide47: 

BNC: Collocates (up to 10 words left / right): sign.[n*]

Slide48: 

BNC: collocates: sorted by “relevancy”: sign.[n*]

Slide49: 

BNC: Comparing collocates by register: chair in FICT, ACAD

Slide50: 

BNC: Comparing related words: sheer / utter / absolute [n*]

Slide51: 

BNC: Comparing collocates of synonyms: [=evil] [nn*]
Evil play ?
Foul damage ?
Severe thing ?
Nick Ellis “The psychological reality of collocation and semantic prosody”
Tomorrow, 10-11:30 115 MCKB

Slide52: 

Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’

Slide53: 

Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’

Slide54: 

BNC: Culture: woman / man + ADJ

Slide55: 

Corpus del Español: Culture: hombre/mujer + ADJ

Slide56: 

TIME: Semantic change: Collocates (of engine) by decade

Slide57: 

TIME: Semantic change: Collocates (of chip.[nn*]) by decade

Slide58: 

TIME: Collocates: Comparisons: strike.[nn*] + ADJ (1980s-2000s vs 1920s-1940s)

Slide59: 

TIME: Collocates: Comparisons: wife + ADJ (1980s-2000s vs 1920s-1940s)

Slide60: 

Corpus del Español: Collocates of mujeres ‘women’: 1800s vs 1900s

Slide61: 

Corpus do Português: Collocates of mulheres ‘women’: 1800s vs 1900s

Semantics (collocates): 

Semantics (collocates)
Google, text archives, and simple corpora (e.g. Real Academia) can’t do collocates (natively)
Easiest with:
    BNC
    OED (BYU interface)
    TIME
    Corpus del Español
    Corpus do Português

Morphology: word formation: 

Morphology: word formation
Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
Easiest with:
    BNC
    OED (BYU interface)
    TIME
    Corpus del Español
    Corpus do Português

Slide64: 

Google: can’t do substrings: re*ion (retention, recision)

Slide65: 

CREA (Real Academia): can’t do substrings: re*ión

Slide66: 

BNC: easily does substrings: re*ion
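A minimal sketch of what a substring query such as re*ion amounts to once a corpus exposes its list of word types with frequencies. The wildcard conventions assumed here (`*` for any run of letters, `?` for exactly one) follow the queries shown on these slides; the function names and the toy word list are hypothetical and do not reproduce any corpus interface.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Translate a wildcard query (assumed: * = any letters, ? = one letter) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\w*")
        elif ch == "?":
            parts.append(r"\w")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def match_word_types(freq, pattern):
    """Return (word, count) pairs whose whole form matches the query, most frequent first."""
    rx = wildcard_to_regex(pattern)
    hits = [(w, c) for w, c in freq.items() if rx.fullmatch(w)]
    return sorted(hits, key=lambda wc: (-wc[1], wc[0]))


if __name__ == "__main__":
    # Toy frequency list, invented for illustration only.
    freq = Counter("retention region religion retention reason recision nation".split())
    for word, count in match_word_types(freq, "re*ion"):
        print(word, count)
```

The query runs over word types rather than over running text, which is why it stays cheap even for a large corpus: the type list is far shorter than the token stream.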

Slide67: 

BNC: easily does substrings: charts: re*ion

Slide68: 

OED: easily does substrings: de*ion

Slide69: 

OED: easily does substrings: de*ion

Slide70: 

Corpus del Español: substrings: tables: de*i?n.[n*]

Slide71: 

Corpus del Español: substrings: charts: de*i?n.[n*]

Slide72: 

TIME: Morphology: *gate (1990s vs 1980s)

Slide73: 

TIME: Tables: Comparisons of two different time periods (*heart* 1920s-1940s)

Slide74: 

TIME: Tables: Comparisons of two different time periods (*heart* 1980s-present)

Morphology (substrings): 

Morphology (substrings)
Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings
Easiest with:
    BNC
    OED (BYU interface)
    TIME
    Corpus del Español
    Corpus do Português

Lexis: word frequency: 

Lexis: word frequency
Google, text archives, and simple corpora (e.g. Real Academia) can’t compare frequencies
Corpora with relational database architecture can do this easily:
    BNC
    OED (BYU interface)
    TIME
    Corpus del Español
    Corpus do Português
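A minimal sketch of why a relational layout makes frequency comparison easy: once every token is a row tagged with its register, a per-register comparison is a single GROUP BY query. The table name, column names, and toy sentences are assumptions made up for illustration; the actual schema behind the corpora listed above is not described in these slides.

```python
import sqlite3


def build_toy_corpus():
    """Create an in-memory database with one row per token (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tokens (word TEXT, register TEXT)")
    # Toy sentences, invented for illustration only.
    samples = {
        "FICT": "shiny shoes and a shiny coin",
        "ACAD": "the data show a clear trend in the shiny sample",
    }
    rows = [(w, reg) for reg, text in samples.items() for w in text.split()]
    conn.executemany("INSERT INTO tokens VALUES (?, ?)", rows)
    return conn


def freq_by_register(conn, word):
    """Return (register, hits, register size, hits per million words) for one word."""
    sql = """
        SELECT register,
               SUM(word = ?) AS hits,   -- comparison yields 1 or 0 in SQLite
               COUNT(*)      AS size
        FROM tokens
        GROUP BY register
        ORDER BY register
    """
    return [
        (reg, hits, size, 1_000_000 * hits / size)
        for reg, hits, size in conn.execute(sql, (word,))
    ]


if __name__ == "__main__":
    for row in freq_by_register(build_toy_corpus(), "shiny"):
        print(row)
```

Normalizing to hits per million words is what lets registers (or decades, or dialects) of different sizes be compared directly.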

Slide77: 

.uk, .ca, .au vs. .com, .us, .edu
Google: trying to limit by dialect or register

Slide78: 

BNC: Differences across registers / genres: shiny

Slide79: 

TIME: Cultural: reds

Slide80: 

TIME: Cultural: reds

Slide81: 

Corpus del Español: Cultural: [soldado]

Slide82: 

BNC: Limit by register: [vvi] (infinitive) in LEGAL vs ACADEMIC

Slide83: 

BNC: Frequency: by part of speech: phrasal verbs: comparing registers

Slide84: 

OED: Lexical bundles: * * (+1900s -1800s)
Nick Ellis “The processing of formulas in native and second-language speakers: Psycholinguistic and corpus determinants”
Today, 3-4:30, B104 JFSB

Slide85: 

TIME: Syntax/lexical: to [vvi] up: (2000s vs. 1930s)

Slide86: 

TIME: Lexical: [vvi] (1930s vs. 1940s-1960s)

Slide87: 

Corpus del Español: Lexical: [r] (adverbs): 1900s vs 1800s

Slide88: 

Corpus del Español and Corpus do Português: Frequency dictionaries (Routledge)

The BYU Corpus of Contemporary American English: 

The BYU Corpus of Contemporary American English
360 million words, 1990-2007
20 million words each year
Equally divided into:
    Spoken (Oprah, Today, NPR; unscripted)
    Fiction (Short stories, first chapters of first editions)
    Popular magazines (90+; selected by subject area)
    Newspapers (10; by sub-sections as well)
    Academic (90+; selected by subject area)
Monitor corpus: will be updated every month
Similar interface to BNC interface (corpus.byu.edu/bnc)
Material all collected; online by Feb 2008

Slide90: 

The BYU Corpus of Historical American English

Architecture: 

Architecture
Finding architecture that allows for:
    Size
    Speed
    Annotation
Relational databases

Architecture: Linear search: 

Architecture: Linear search
100 million words = 600 MB of RAM
360 million words = 2.2 GB of RAM
Regular expressions (“pattern matching”)
Semantics (from thesaurus, WordNet, etc)
Increasingly difficult the more annotation you add
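A minimal sketch of the linear-search approach, assuming the whole corpus is held in memory as one string and scanned with a regular expression. The RAM figures on the slide work out to roughly 6 bytes per word (600 MB / 100 million words), and that ratio is used for the estimate below. The function names and the hand-listed forms of "end" are illustrative assumptions.

```python
import re


def estimated_ram_gb(n_words, bytes_per_word=6.0):
    """Estimate RAM for an in-memory corpus at the ~6 bytes/word implied by the slide."""
    return n_words * bytes_per_word / 1e9


# Every form of "end" has to be listed by hand, and "\w+ing" is only a rough
# stand-in for "V-ing": without part-of-speech tags it also matches non-verbs
# such as "nothing".
END_UP_VING = re.compile(r"\b(?:end|ends|ended|ending) up \w+ing\b", re.IGNORECASE)


def linear_search(text):
    """Scan the full text once and return every matching string."""
    return [m.group(0) for m in END_UP_VING.finditer(text)]


if __name__ == "__main__":
    print(round(estimated_ram_gb(360_000_000), 2), "GB for 360 million words")
    sample = "He will end up paying more. She ended up watching TV. It ends up here."
    print(linear_search(sample))
```

Each query rescans the whole text, and each added annotation layer (tags, lemmas, synonyms) has to be encoded into both the text and the pattern, which is one way to read "increasingly difficult the more annotation you add".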

Architecture: Huge “hashes”: 

Architecture: Huge “hashes”
Number all words in corpus
    [1] he [2] will [3] end [4] up [5] paying [6] more
For “end up Ving”:
    Three files with large sets of numbers: [end] +1 [up] +1 [Ving]
Problem: really starts bogging down around 50 million words
How does Google do it?
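A minimal sketch of the numbered-word approach, assuming each word form maps to the list of positions where it occurs and a phrase is found by checking positions p, p+1, p+2 across three sets. Here "Ving" is approximated by an -ing suffix because the toy data has no part-of-speech tags; that shortcut, the hand-listed forms of "end", and the function names are assumptions for illustration.

```python
from collections import defaultdict


def build_position_index(tokens):
    """Map each lowercased word form to its 1-based positions in the corpus."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens, start=1):
        index[tok.lower()].append(pos)
    return index


def find_end_up_ving(tokens):
    """Return start positions p where [end] is at p, [up] at p+1 and [Ving] at p+2."""
    index = build_position_index(tokens)

    end_forms = ("end", "ends", "ended", "ending")
    end_pos = {p for form in end_forms for p in index.get(form, [])}
    up_pos = set(index.get("up", []))
    # Rough stand-in for a verb tag: any form ending in -ing.
    ving_pos = {p for word, ps in index.items() if word.endswith("ing") for p in ps}

    return sorted(p for p in end_pos if p + 1 in up_pos and p + 2 in ving_pos)


if __name__ == "__main__":
    # The example sentence from the slide; "end" is word number 3.
    print(find_end_up_ving("he will end up paying more".split()))
```

The position sets for frequent slots such as [up] or [Ving] grow with the corpus, so the intersection step is where this design would be expected to slow down as size increases.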

Architecture: Collocates: 

Architecture: Collocates
Find “pointers” to all occurrences of word
Look for collocates at each position
Problem: if you hit the hard drive to read each occurrence, even at 3,000 hits per second, then 15-20 seconds for 50,000 occurrences
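A minimal sketch of the pointer-based collocate lookup and of the timing arithmetic on this slide: at 3,000 reads per second, 50,000 occurrences take about 16.7 seconds, which falls in the quoted 15-20 second range. The in-memory list stands in for the disk reads, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def build_pointers(tokens):
    """Map each word to the list of (0-based) positions where it occurs."""
    pointers = defaultdict(list)
    for pos, tok in enumerate(tokens):
        pointers[tok].append(pos)
    return pointers


def collocates_at_offset(tokens, pointers, node, offset):
    """Count the words found `offset` slots away from each occurrence of `node`.

    Each loop iteration corresponds to one read per occurrence, which is the
    step that becomes slow if every read goes to the hard drive.
    """
    counts = Counter()
    for pos in pointers.get(node, []):
        j = pos + offset
        if 0 <= j < len(tokens):
            counts[tokens[j]] += 1
    return counts


def estimated_seconds(occurrences, reads_per_second=3000):
    """Time to visit every occurrence at a fixed read rate."""
    return occurrences / reads_per_second


if __name__ == "__main__":
    # Toy token list, invented for illustration only.
    toy = "a sheer drop and a sheer cliff and sheer luck".split()
    ptrs = build_pointers(toy)
    print(collocates_at_offset(toy, ptrs, "sheer", +1))
    print(round(estimated_seconds(50_000), 1), "seconds for 50,000 occurrences")
```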

Architecture: “window” of words: 

Architecture: “window” of words
