davies barker 2007
Category: Education. License: All Rights Reserved. Added: January 23, 2008.

Presentation Transcript

Why Google isn't good enough: Useful corpora, meaningful data, and insightful analyses: Mark Davies, Linguistics and English Language, http://davies-linguistics.byu.edu
More information: http://corpus.byu.edu; http://davies-linguistics.byu.edu; mark_davies@byu.edu; LING 485 “Corpus Linguistics” (Winter 2008)
Outline: Duct tape; Some frequency problems with Google; Google / text archives vs real corpora; Semantics; Syntax; Morphology; Lexicon; Upcoming projects
Slide5: Google: they’re doing something right …
Google: frequency data: Number of “hits” OK for single words; for multiple word strings, more or less a guess
Slide7: Google: best guess for string of words: wanted to google
Slide8: Google: often the real frequency statistics very different from first estimate
Syntax: “ends up V-ing”: Interesting semantic and pragmatic space. Which verbs does it occur with? Increasing or decreasing? What styles of speech / text types? Where: US, UK, other world Englishes? (Similar constructions in other languages?)
Slide10: First generation corpora: Brown Corpus (1 million words), US, 1960s
Slide11: First generation corpora: Brown Corpus (1 million words), US, 1960s
Slide12: ends up going; ended up watching; end up paying . . . Google: have to search by exact word forms (problem for syntax)
Slide13: Google: frequency problems (again)
Slide14: Google: frequency problems (again, today)
Slide15: Google: frequency problems (again)
Slide16: Google: trying to get historical data
Slide17: .uk, .ca, .au; .com, .us, .edu. Google: trying to limit by dialect or register
Google results: “ends up V-ing”: Problematic frequency results; Can only search for thousands of individual forms; No way to know if increasing or decreasing; No way to know what styles of speech / text types; No way to know where: US, UK, other world Englishes
Slide20: EBSCO: Academic Search Premier: 1,850 full-text journals, 1985-present
Slide21: ProQuest: Research Library: 1,800 magazines, 1980s-present
Slide22: Lexis-Nexis Academic: 1000s of newspapers, transcripts of news programs, etc.
Why architecture matters: Textual corpus (Words, sentences); Architecture; Annotation, indexing, search engine; Questions
Slide24: Real Academia Española (CREA); can’t do syntax
Slide25: Real Academia Española (CREA); can’t do syntax
Slide26: O Publico (200 million words); just newspapers
Features of useful corpora: Size (cf. million-word Brown Corpus: 2 tokens); Annotation (cf. Google and Spanish corpus; no part of speech); Representativity (cf. Portuguese corpora; all newspapers)
Slide28: British National Corpus: 100 million words, UK, 1980s-90s; end up Ving
Slide29: British National Corpus: end up Ving by register / genre
Slide30: BNC: end up Ving by “micro”-register
Slide31: Oxford English Dictionary (OED): 37m words, 2.2 million quotations
Slide32: Oxford English Dictionary (OED): 37m words, 2.2 million quotations
Slide33: TIME magazine; complete archives 1923-present; 100m+ words
Slide34: Nov 7, 1969
SCIENCE: Order in the Zoo (NUCLEAR PHYSICS); Miracles at Rehovot (RESEARCH); The Missing Ammosaurus (PALEONTOLOGY)
SOCIETY: CALIFORNIA: A State of Excitement (Modern Living); CANDIDE CAMERA: IN SEARCH OF THE SOUL (Modern Living); LABORATORY IN THE SUN: THE PAST AS FUTURE (Modern Living); The Battering Parent (Behavior / CHILDREN); Stay Single (Behavior / THE FAMILY)
PRESS: Letting Go of a Legacy (The Press / NEWSPAPERS); Penthouse v. Playboy (The Press / MAGAZINES)
SPORT: The Rise of Roman's Empire (FOOTBALL)
BUSINESS: Nixon's Rookie of the Year; Toward a Just Marketplace (CONSUMERS); A License to Print Money (CORPORATIONS); Bargain Season (AIRLINES)
NATION: Good Guys All (The Nation / AMERICAN NOTES); Fair Play for Bears (The Nation); Of Peace and Politics (The Nation / THE PRESIDENCY)
WORLD: LEBANON: ALONG THE ARAFAT TRAIL (The World); Voting Under Fire (The World / ISRAEL)
EDUCATION: Between Moratoriums (CAMPUS COMMUNIQUE); Bugging the Bargainers (TEACHERS); M.I.T. and the Pentagon (UNIVERSITIES)
LAW: A Brother's Sacrifice (The Law / EQUITY); Threat to the Ombudsmen (The Law / POVERTY LAW)
ARTS & ENTERTAINMENT: Read the story (Time Listings / TELEVISION); Two for the Season (Dance / BALLET); Art Deco (Art / STYLES); The Very Expensive Coco (Show Business); Marshmallow Moratorium (Cinema / NEW MOVIES); Old Master (Cinema); Prosciutto and Melancholy (Cinema); The Shrinking Shrink (Cinema); Imminent Victorians (Books); The Dying of the Light (Books); One-Man Circus (Books); Privileged Heirlooms (Books)
Slide35: TIME Magazine: US 1923-present, 100 million words: end up Ving
Slide36: TIME Magazine: US 1923-current, 100m words: end up Ving
Slide37: LDS General Conference: 23 million words, 1851-present
Slide38: Corpus do Português: 45 million words, 1200s-1900s: terminar Vndo
Slide39: Corpus do Português: 45 million words, 1200s-1900s: terminar.* [vg*]
Slide40: Corpus del Español: 100 million words, 1200s-1900s: terminar Vndo
Slide41: Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vvp]
Slide42: Corpus del Español: 100 million words, 1200s-1900s: [terminar] [vpp]
Slide43: Corpus del Español: More complex: acabar / terminar (por)
Syntax: Large (>25m words) tagged corpora with many different genres. Only corpora: English: British National Corpus; Historical English: OED (BYU interface); Historical American English (1900s): TIME; Spanish: Corpus del Español; Portuguese: Corpus do Português
Semantics: collocates: “You can tell a lot about a word by the other words that it hangs out with”. How would you do it with Google or text archives? Sort through the chaff: mutual information; Comparison between words (small/little, men/women); Comparison between registers (chair, chain); Comparison over time: semantic change (web, engine); Comparison over time: cultural shifts (woman; 1800s vs 1900s)
Slide46: Google: collocates: go through examples one by one, looking for nearby words
Slide47: BNC: Collocates (up to 10 words left / right): sign.[n*]
Slide48: BNC: collocates: sorted by “relevancy”: sign.[n*]
Slide49: BNC: Comparing collocates by register: chair in FICT, ACAD
Slide50: BNC: Comparing related words: sheer / utter / absolute [n*]
Slide51: BNC: Comparing collocates of synonyms: [=evil] [nn*]. Evil play? Foul damage? Severe thing? Nick Ellis, “The psychological reality of collocation and semantic prosody”, Tomorrow, 10-11:30, 115 MCKB
Slide52: Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’
Slide53: Corpus del Español: Comparing related words: gozar / disfrutar ‘enjoy’
Slide54: BNC: Culture: woman / man + ADJ
Slide55: Corpus del Español: Culture: hombre/mujer + ADJ
Slide56: TIME: Semantic change: Collocates (of engine) by decade
Slide57: TIME: Semantic change: Collocates (of chip.[nn*]) by decade
Slide58: TIME: Collocates: Comparisons: strike.[nn*] + ADJ (1980s-2000s vs 1920s-1940s)
Slide59: TIME: Collocates: Comparisons: wife + ADJ (1980s-2000s vs 1920s-1940s)
Slide60: Corpus del Español: Collocates of mujeres ‘women’: 1800s vs 1900s
Slide61: Corpus do Português: Collocates of mulheres ‘women’: 1800s vs 1900s
Semantics (collocates): Google, text archives, and simple corpora (e.g. Real Academia) can’t do collocates (natively). Easiest with: BNC; OED (BYU interface); TIME; Corpus del Español; Corpus do Português
Morphology: word formation: Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings. Easiest with: BNC; OED (BYU interface); TIME; Corpus del Español; Corpus do Português
Slide64: Google: can’t do substrings: re*ion (retention, recision)
Slide65: CREA (Real Academia): can’t do substrings: re*ión
Slide66: BNC: easily does substrings: re*ion
Slide67: BNC: easily does substrings: charts: re*ion
Slide68: OED: easily does substrings: de*ion
Slide69: OED: easily does substrings: de*ion
Slide70: Corpus del Español: substrings: tables: de*i?n.[n*]
Slide71: Corpus del Español: substrings: charts: de*i?n.[n*]
Slide72: TIME: Morphology: *gate (1990s vs 1980s)
Slide73: TIME: Tables: Comparisons of two different time periods (*heart* 1920s-1940s)
Slide74: TIME: Tables: Comparisons of two different time periods (*heart* 1980s-present)
Morphology (substrings): Google, text archives, and simple corpora (e.g. Real Academia) can’t do substrings. Easiest with: BNC; OED (BYU interface); TIME; Corpus del Español; Corpus do Português
Lexis: word frequency: Google, text archives, and simple corpora (e.g. Real Academia) can’t compare frequencies. Corpora with relational database architecture can do this easily: BNC; OED (BYU interface); TIME; Corpus del Español; Corpus do Português
Slide77: .uk, .ca, .au; .com, .us, .edu. Google: trying to limit by dialect or register
Slide78: BNC: Differences across registers / genres: shiny
Slide79: TIME: Cultural: reds
Slide80: TIME: Cultural: reds
Slide81: Corpus del Español: Cultural: [soldado]
Slide82: BNC: Limit by register: [vvi] (infinitive) in LEGAL vs ACADEMIC
Slide83: BNC: Frequency: by part of speech: phrasal verbs: comparing registers
Slide84: OED: Lexical bundles: * * (+1900s -1800s). Nick Ellis, “The processing of formulas in native and second-language speakers: Psycholinguistic and corpus determinants”, Today, 3-4:30, B104 JFSB
Slide85: TIME: Syntax/lexical: to [vvi] up: (2000s vs. 1930s)
Slide86: TIME: Lexical: [vvi] (1930s vs. 1940s-1960s)
Slide87: Corpus del Español: Lexical: [r] (adverbs): 1900s vs 1800s
Slide88: Corpus del Español and Corpus do Português: Frequency dictionaries (Routledge)
The BYU Corpus of Contemporary American English: 360 million words, 1990-2007; 20 million words each year; Equally divided into: Spoken (Oprah, Today, NPR; unscripted), Fiction (Short stories, first chapters of first editions), Popular magazines (90+; selected by subject area), Newspapers (10; by sub-sections as well), Academic (90+; selected by subject area); Monitor corpus: will be updated every month; Similar interface to BNC interface (corpus.byu.edu/bnc); Material all collected; online by Feb 2008
Slide90: The BYU Corpus of Historical American English
Architecture: Finding architecture that allows for: Size; Speed; Annotation. Relational databases
Architecture: Linear search: 100 million words = 600 MB of RAM; 360 million words = 2.2 GB of RAM; Regular expressions (“pattern matching”); Semantics (from thesaurus, WordNet, etc.); Increasingly difficult the more annotation you add
Architecture: Huge “hashes”: Number all words in corpus: [1] he [2] will [3] end [4] up [5] paying [6] more. For “end up Ving”: Three files with large sets of numbers: [end] +1 [up] +1 [Ving]. Problem: really starts bogging down around 50 million words. How does Google do it?
Architecture: Collocates: Find “pointers” to all occurrences of word; Look for collocates at each position. Problem: if you hit the hard drive to read each occurrence, even at 3,000 hits per second, then 15-20 seconds for 50,000 occurrences
Architecture: “window” of words
More information: http://corpus.byu.edu; http://davies-linguistics.byu.edu; mark_davies@byu.edu; LING 485 “Corpus Linguistics” (Winter 2008)
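Editor's note on the slide Architecture: Huge “hashes”. The slide describes numbering every word in the corpus and then looking for [end] +1 [up] +1 [Ving] across sets of position numbers. The following is a minimal Python sketch of that idea, not the implementation behind the BYU interfaces. The function names are hypothetical, the index lives in memory rather than in three files, and "any word ending in -ing" is a crude stand-in for a real part-of-speech tag.

```python
from collections import defaultdict


def build_positional_index(tokens):
    """Map each lowercased word form to the list of positions where it occurs.

    Positions start at 1 to match the numbering on the slide.
    """
    index = defaultdict(list)
    for position, token in enumerate(tokens, start=1):
        index[token.lower()].append(position)
    return index


def find_end_up_ving(tokens):
    """Return the start positions of every 'end up V-ing' sequence."""
    index = build_positional_index(tokens)
    up_positions = set(index.get("up", []))
    # Assumption: a word ending in -ing counts as V-ing (no real tagger here).
    ing_positions = {
        p
        for word, positions in index.items()
        if word.endswith("ing")
        for p in positions
    }
    hits = []
    # Assumption: the lemma [end] is approximated by listing its forms.
    for form in ("end", "ends", "ended", "ending"):
        for p in index.get(form, []):
            # [end] +1 [up] +1 [Ving]
            if p + 1 in up_positions and p + 2 in ing_positions:
                hits.append(p)
    return sorted(hits)


if __name__ == "__main__":
    sentence = "he will end up paying more".split()
    print(find_end_up_ving(sentence))  # [3]
```

Each lookup touches only the position lists of the words involved, which is why the approach beats a linear scan on small corpora. The slide's caveat still applies: the position sets for frequent words grow with the corpus, so the intersections become expensive at tens of millions of words.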
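Editor's note on the slides Semantics: collocates and Architecture: Collocates. The slides describe finding every occurrence of a word, looking for collocates in a window around each position, and using mutual information to "sort through the chaff". The sketch below illustrates those two steps in Python under stated assumptions: the function names are hypothetical, the corpus is a plain in-memory token list, and the score is one common pointwise mutual information formulation. The slides do not say which formula the corpus interfaces actually use.

```python
import math
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words within `window` tokens left or right of each `node` occurrence."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def mutual_information(tokens, node, window=4, min_count=1):
    """Score each collocate of `node` with a simple mutual information measure.

    Assumed formula: log2((cooccurrences * corpus_size) /
                          (freq(node) * freq(collocate) * span)),
    where span is the total window width (left plus right).
    """
    freq = Counter(tokens)
    corpus_size = len(tokens)
    span = 2 * window
    scores = {}
    for word, cooc in collocates(tokens, node, window).items():
        if cooc >= min_count:
            scores[word] = math.log2(
                (cooc * corpus_size) / (freq[node] * freq[word] * span)
            )
    return scores


if __name__ == "__main__":
    text = "the chair of the committee sat in the chair".split()
    ranked = sorted(
        mutual_information(text, "chair", window=1).items(),
        key=lambda item: item[1],
        reverse=True,
    )
    for word, score in ranked:
        print(f"{word}\t{score:.2f}")
```

Raw counts favor very frequent words such as "the"; dividing by the collocate's overall frequency is what pushes rarer, more informative neighbors up the list. On the timing problem in the slide: 50,000 occurrences at 3,000 disk reads per second is 50,000 / 3,000 ≈ 16.7 seconds, which matches the quoted 15-20 seconds.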