Presentation Transcript
Discovering and testing linguistic generalizations using interactive concordances : Larry Hayashi
SIL Language Software Development
larry_hayashi@sil.org
7500 W Camp Wisdom Road
Dallas, TX 75236 972-708-7400
www.sil.org
Empirical Linguistics : “If it happens once, you don't know anything. If it happens twice, it suggests further investigation. If it happens three or more times, then you have something to write about!”
concordance: n. 1. an alphabetical index of all the words in a text or corpus of texts, showing every contextual occurrence of a word. 2. an index of any recurring object or analysis in a text corpus, showing every contextual occurrence of that object.
part-of-speech
phone
lemma gloss
morph
syntactic phrase
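To make the first sense of the definition concrete, here is a minimal sketch of a keyword-in-context concordance. The function name build_concordance, the width parameter, and the bracket notation are assumptions made for illustration; they are not taken from any SIL tool.

```python
import re
from collections import defaultdict

def build_concordance(lines, width=20):
    """Map each word to a list of (line_number, context) pairs."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        for match in re.finditer(r"\w+", line):
            # Keep up to `width` characters on either side of the hit.
            left = line[max(0, match.start() - width):match.start()]
            right = line[match.end():match.end() + width]
            context = f"{left}[{match.group()}]{right}"
            index[match.group().lower()].append((line_no, context))
    return index

# A tiny corpus; any list of text lines would do.
corpus = ["mailha reta", "mailha rapa", "mija rapa", "mija retle"]
concordance = build_concordance(corpus)

# Alphabetical index, every occurrence shown with its context.
for word in sorted(concordance):
    for line_no, context in concordance[word]:
        print(f"{word:8} {line_no:>3}  {context}")
```

The second sense of the definition would replace the word key with any other analysed object, such as a part of speech or a morph.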
Traditional concordances : Printed as a separate volume
Requires looking up the reference in the corpus
Traditional computational concordance tools : User defines a query
Software searches through corpus and generates a separate static text file
Using object-oriented and relational database technologies for concordances : Concordances are a view of the data instances themselves rather than copies in a separate file
Advantages of using relational or object-oriented databases for concordances : The capability to jump to the broader context of each data instance in the concordance
Immediate update of concordances when data is edited or analyses changed
Concordances can be interactive, allowing the user to apply the repercussions of a hypothesis across a collection of data
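The difference between a view and a copy can be sketched in a few lines. This is an illustration under assumptions, not the LinguaLinks data model: the classes Word, Sentence and Corpus and the gloss "laugh-FUT" are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    form: str
    gloss: str = ""

@dataclass
class Sentence:
    words: list
    translation: str

@dataclass
class Corpus:
    sentences: list = field(default_factory=list)

    def concordance(self, predicate):
        """Yield (sentence, word) pairs that reference the stored objects, not copies."""
        for sentence in self.sentences:
            for word in sentence.words:
                if predicate(word):
                    yield sentence, word

corpus = Corpus([
    Sentence([Word("mija"), Word("rapa")], "The child cried."),
    Sentence([Word("mija"), Word("retle")], "The child will laugh."),
])

# Each hit carries its sentence, so the broader context is one step away.
for sentence, word in corpus.concordance(lambda w: w.form.endswith("le")):
    print(word.form, "|", sentence.translation)
    word.gloss = "laugh-FUT"  # edit the data through the concordance hit

# The next query sees the edit at once, because nothing was copied.
print([w.form for _, w in corpus.concordance(lambda w: "FUT" in w.gloss)])
```

A static concordance file would need to be regenerated after the edit; here the second query simply reads the changed object.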
Examples from SIL software : Morphological analysis and text interlinearization in LinguaLinks
Phonetic analysis in Speech Analysis Tools
Textbook morphology problem : 1 mailha reta 'Maila laughed.'
2 mailha rapa 'Maila cried.'
3 mija rapa 'The child cried.'
4 mija retle 'The child will laugh.'
5 arlam birhile 'The girl will be afraid.'
6 guma hoya 'Auntie shook.'
7 sila bhisa 'The jackal escaped.'
8 renjha retle 'The boy will laugh.'
9 sila birhia 'The jackal was afraid.'
10 mija lhomle 'The child will grow up.'
11 mija imang rahle 'The child will come home.'
… more data
Hypothesis positing and testing : Posit a hypothesis: There is a separate morpheme for the concept of future. Find all occurrences of “will” in the free translations of the corpus.
LinguaLinks Morphology Explorer – concordance on concept of FUTURE : will
Discovering further generalizations : Posit a hypothesis: There is a separate morpheme for the concept of future. Find all occurrences of “will” in the free translations of the corpus. There is a morpheme with a sense of ‘future’ …
Testing new hypothesis : Posit a hypothesis: -le is a separate morpheme for the concept of future. Find all occurrences of “le” in the wordforms of the corpus.
Slide14 : From the LinguaLinks Morphology Explorer
Slide15 : From the LinguaLinks Morphology Explorer le
Verifying that -le is the morpheme for FUTURE
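The two searches can be replayed on the eleven example sentences of the textbook problem. The sketch below is not how the Morphology Explorer works internally; the function names are hypothetical, and treating “le” as a word-final match is an assumption that follows from the suffix hypothesis.

```python
# The eleven example sentences from the textbook morphology problem.
DATA = [
    (1, "mailha reta", "Maila laughed."),
    (2, "mailha rapa", "Maila cried."),
    (3, "mija rapa", "The child cried."),
    (4, "mija retle", "The child will laugh."),
    (5, "arlam birhile", "The girl will be afraid."),
    (6, "guma hoya", "Auntie shook."),
    (7, "sila bhisa", "The jackal escaped."),
    (8, "renjha retle", "The boy will laugh."),
    (9, "sila birhia", "The jackal was afraid."),
    (10, "mija lhomle", "The child will grow up."),
    (11, "mija imang rahle", "The child will come home."),
]

def translations_with(word):
    """Sentence numbers whose free translation contains the given word."""
    return {n for n, _, free in DATA if word in free.lower().split()}

def wordforms_ending_in(suffix):
    """Sentence numbers containing a wordform that ends in the given suffix."""
    return {n for n, text, _ in DATA
            if any(w.endswith(suffix) for w in text.split())}

future = translations_with("will")
le_forms = wordforms_ending_in("le")

print("'will' in translation:", sorted(future))
print("wordform ends in 'le':", sorted(le_forms))
print("same sentences:", future == le_forms)
```

On this sample the two sets coincide, which is the pattern the hypothesis predicts. A larger corpus could still turn up counterexamples.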
Interactive concordances : Posit a hypothesis: -le is a separate morpheme for the concept of future. Test hypothesis: Find all occurrences of “le” in the wordforms of the corpus. Verified! Apply hypothesis using interactive concordance.
Adding relevant lexical entries from concordance (prototype)
Parsing wordforms from concordance
Interlinearizing text examples from concordance
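Applying a verified hypothesis across the concordance amounts to running one parsing rule over every matching wordform. The following is a rough sketch only: parse_future, the gloss label "FUT", and the "?" placeholder for stems that have not been glossed yet are assumptions, not LinguaLinks behaviour.

```python
def parse_future(wordform, suffix="le", gloss="FUT"):
    """Split a wordform into (morph, gloss) pairs if it ends with the suffix."""
    if wordform.endswith(suffix) and len(wordform) > len(suffix):
        stem = wordform[:-len(suffix)]
        return [(stem, "?"), ("-" + suffix, gloss)]
    # Wordforms without the suffix are left unparsed.
    return [(wordform, "?")]

# Wordforms taken from the textbook problem.
for form in ["retle", "birhile", "lhomle", "rahle", "rapa"]:
    print(form, "->", parse_future(form))
```

In an interactive concordance the user would confirm or reject each proposed parse, rather than accept the rule blindly.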
Object-oriented modeling : Data model reflects “real” linguistic objects and the relationships between those objects
Using the relationships between objects, concordances are available for FREE!
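One way to read “for free” is that every relationship is stored in both directions, so the list of occurrences already exists by the time a concordance is wanted. The sketch below illustrates that idea with hypothetical Category and Occurrence classes; it does not reproduce the CELLAR object model, and the category label "verb" is an assumed analysis.

```python
class Category:
    def __init__(self, name):
        self.name = name
        self.occurrences = []  # back-references, maintained automatically

class Occurrence:
    def __init__(self, form, sentence, position):
        self.form = form
        self.sentence = sentence
        self.position = position
        self._category = None

    @property
    def category(self):
        return self._category

    @category.setter
    def category(self, new_category):
        # Keep the back-reference list in step with the forward reference.
        if self._category is not None:
            self._category.occurrences.remove(self)
        self._category = new_category
        if new_category is not None:
            new_category.occurrences.append(self)

sentence = "mija retle".split()
verb = Category("verb")
occurrence = Occurrence("retle", sentence, 1)
occurrence.category = verb

# The concordance for the category is just its back-reference list.
for hit in verb.occurrences:
    print(hit.form, "in:", " ".join(hit.sentence))
```

No search over the corpus is needed here: tagging a word is what builds the concordance.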
CELLAR : Computing Environment for Linguistic, Literary and Anthropological Research
Text object model
Text and Word analysis object model
Questions about Tuwali linkers : What are the syntactic characteristics of linkers?
Are there any other words that I have identified as linkers?
Following backreferences from word category to occurrences in text : di
Tuwali “linkers” concordance
Lexicon to Wordform to Text object model
LinguaLinks Lexical Entry with concordance of corpus examples
Double-click on Att.Example goes to broader context
SIL Speech Analysis Tools
Speech Manager database view
Consonant Chart: list of consonant phones
Concordance on phone [s]
Launch into sound file
Look at more context in sound file
Change a transcription
View updated data in concordance
Development process for FieldWorks : Model the linguistic objects
Create an XML representation for each object class
Run the XML files through a code generator to generate the database
Build apps on top of the database
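The XML-to-database step can be sketched in miniature as follows. Everything specific here is an assumption made for illustration: the element and attribute names in the XML, the Wordform class, the type mapping, and the use of SQLite are all hypothetical and do not describe the actual FieldWorks schema or generator.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML description of one object class.
MODEL_XML = """
<class name="Wordform">
  <attribute name="form" type="text"/>
  <attribute name="gloss" type="text"/>
  <attribute name="sentence_id" type="integer"/>
</class>
"""

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"text": "TEXT", "integer": "INTEGER"}

def generate_create_table(xml_source):
    """Generate a CREATE TABLE statement from one class description."""
    cls = ET.fromstring(xml_source)
    columns = ["id INTEGER PRIMARY KEY"]
    for attr in cls.findall("attribute"):
        columns.append(f'{attr.get("name")} {SQL_TYPES[attr.get("type")]}')
    return f'CREATE TABLE {cls.get("name")} ({", ".join(columns)})'

# Generate the table, then use it the way an application would.
ddl = generate_create_table(MODEL_XML)
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
conn.execute(
    "INSERT INTO Wordform (form, gloss, sentence_id) VALUES (?, ?, ?)",
    ("retle", "", 4),
)
print(ddl)
print(conn.execute("SELECT form, sentence_id FROM Wordform").fetchall())
```

The point of generating the schema is that the object model stays the single source of truth, and the database follows from it.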
Interactive concordances in the linguistics classroom : Students learn good empirical methodology
Students empirically test their hypotheses against the corpus data rather than their intuition
Use of the data model in field methods class reinforces the linguistic concepts students are learning
Future SIL Software : FieldWorks – implements CELLAR 2. Much faster, with an easier-to-use interface.
Stealth-to-wealth analysis tools
Speech Manager 2
Bibliography : Barlow, Michael. Web site: Corpus Linguistics. http://www.ruf.rice.edu/~barlow/corpus.html. Includes a list of various text corpora available for research as well as a list of concordance tools.
Simons, Gary F. 1998. The nature of linguistic data and the requirements of a computing environment for linguistic research. In Using Computers in Linguistics: a practical guide, John M. Lawler and Helen Aristar Dry (eds.). London and New York: Routledge. Pages 10-25.
Simons, Gary F. 1994. Conceptual modeling versus visual modeling: a technological key to building consensus. SIL. http://www.sil.org/cellar/ach94/ach94.html.
Web resources : SIL Computing: http://www.sil.org/computing/
SIL Speech Tools: http://www.sil.org/computing/speechtools/
LinguaLinks: http://www.sil.org/lingualinks/
Further resources : Object-oriented modeling tutorial
FieldWorks object models
LinguaLinks demonstration movies (Quicktime format)
Speech Tools (Speech Analyzer and Speech Manager software)