Dictionaries : See
Patrick Hanks “Lexicography” chapter 3 of Mitkov, R. (ed.) The Oxford Handbook of Computational Linguistics, Oxford: OUP, 2004.
Dictionaries/Lexicons :
Lexicography and the computer
Corpus-based lexicography
MRDs
Dictionaries for NLP
Thesauri: structured lexicons
Computational lexicography : Restructuring and exploiting human dictionaries for use by computer programs
Using computational techniques to compile (new) dictionaries
Focus on English (and other well established languages)
Significantly different issues for other languages, especially
Alphabetization and arrangement
Compilation from scratch for previously unstudied languages
Human dictionaries : Traditional view of what a “dictionary” is
List of words, arranged (usually) alphabetically
Inclusion in dictionary lends authority, even proscriptively
Entry typically gives
spelling ... alternate spellings
POS, morphology (if irregular)
core definition (using defining vocab?)
pronunciation (using own transcription)
etymology
examples of usage
as justification for inclusion
as illustration of use (esp. learner’s dictionaries)
Entry typically doesn’t give
help with spelling
morphology (if regular), especially derivational
subcategorization information
contrastive examples of use
indications of possible metaphorical extensions to meaning
Human dictionaries : Historically
bilingual dictionaries for translators
monolingual dictionary as (pre/proscriptive) definition of language, often polemical
OED (1884-1928) first dictionary on purely descriptive principle, relying on citations
Deficiencies and difficulties
What to include? (neologisms, slang)
Inclusion of names
Differentiating senses
Differentiating word senses : Dictionaries disagree widely
Probably no right answer
General principles (look for excuse to split vs look for reason to lump)
Keep related words of different POS together?
Etymology can be misleading (eg crane, pupil)
Metaphorical extension of original meaning – how far do you go? (eg rose, bar)
Purpose of dictionary may help decide, eg translation
Citations : Senses and uses identified by collecting examples of use
Sent in on “slips” by informants
Lexicographer’s job is to collate these
Criteria for a new word (or new meaning)
Number of citations
Source of citations
Veracity of use
Corpus-based dictionaries : A corpus is a collection of texts, usually collected with a specific purpose in mind
British National Corpus: an attempt to capture a synchronic picture of BrE of the late 1980s (100m words)
COBUILD “Bank of English”: a dynamic “monitor” corpus used to help lexicographers identify/define usage
Machine-readable dictionaries : “Machine” means “computer”
Dictionary stored in a format which makes it manipulable on a computer
Originally, derived from MR version of print dictionary (from type-setter’s tapes)
Now the other way round: data stored as a database from which hard copy can be printed (inter alia)
MRDs - advantages : Flexibility of access and presentation
Not bound to alphabetical listing
Information presented can be filtered
Can be searched as a database
Different versions (for different users, serving different purposes) can be produced
Increased storage capacity
More information can be stored, especially
Implicit information can be made explicit
More examples, including “negative data”
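A minimal sketch of what "searched as a database" and "filtered" presentation could look like once entries are stored as records rather than as typeset text. The `Entry` fields, the toy entries, and the function names are all invented for illustration and are not taken from any real MRD.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    headword: str
    pos: str
    definition: str
    labels: set = field(default_factory=set)


# Toy entries, invented for illustration only.
ENTRIES = [
    Entry("crane", "noun", "a large long-necked wading bird"),
    Entry("crane", "noun", "a machine for lifting heavy loads"),
    Entry("reckon", "verb", "to think or suppose", {"informal"}),
    Entry("bar", "noun", "a long rigid piece of metal or wood"),
]


def search(entries, pos=None, definition_contains=None, label=None):
    """Return the entries matching every criterion that is not None."""
    hits = []
    for e in entries:
        if pos is not None and e.pos != pos:
            continue
        if definition_contains is not None and definition_contains not in e.definition:
            continue
        if label is not None and label not in e.labels:
            continue
        hits.append(e)
    return hits


def learner_view(entry):
    """One possible filtered presentation: headword and definition only."""
    return f"{entry.headword}: {entry.definition}"
```

Here access is by any field, not by alphabetical position: `search(ENTRIES, definition_contains="lifting")` finds the second sense of crane without knowing the headword in advance.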
Lexicons for NLP : Have to state everything we need to know about the word
Phonology: stress pattern, possible weak forms
Orthography: spelling alternatives, hyphenation
Morphology: inflectional paradigms, even if regular
Information about derivations
Syntax: Explicit information about subcategorization and
eg syntactic/semantic features of arguments
Any special interpretation of tenses
Lexical combinatorics: compounds, idioms
Semantics: definition, semantic features, semantic relations
Pragmatics: register, collocation, connotation
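As a rough illustration of "stating everything", an NLP lexicon entry might be modelled as a record with one slot per level of description listed above. The field names, the feature labels, and the sample entry for eat are assumptions made for this sketch, not a standard lexicon format.

```python
from dataclasses import dataclass, field


@dataclass
class NLPEntry:
    lemma: str
    pos: str
    stress: str = ""                                      # phonology
    spellings: list = field(default_factory=list)         # orthography
    inflections: dict = field(default_factory=dict)       # morphology, stated even if regular
    subcat_frames: list = field(default_factory=list)     # syntax
    object_features: set = field(default_factory=set)     # semantics of the object argument
    register: str = "neutral"                             # pragmatics


# Hypothetical entry for "eat"; the slot labels are invented for the example.
EAT = NLPEntry(
    lemma="eat",
    pos="verb",
    spellings=["eat"],
    inflections={"3sg": "eats", "past": "ate", "pastpart": "eaten", "prespart": "eating"},
    subcat_frames=[("subj",), ("subj", "obj")],
    object_features={"edible"},
)


def missing_fields(entry):
    """List the slots that are still empty, i.e. not yet stated explicitly."""
    return [name for name, value in vars(entry).items() if not value]
```

A check like `missing_fields(EAT)` makes the contrast with a human dictionary visible: anything a program needs but the entry leaves blank shows up as a gap.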
Lexicons for NLP - example : Information about derivations
Agentive derivation (-er) is very productive
Usually means the actor doing the action of a verb, e.g. swimmer, dancer, killer
Not available for some verbs, e.g. *knower, *cycler, *hoper, *sayer (though cf soothsayer)
May have a specialised meaning instead of or as well as the derived meaning, e.g. revolver, computer, washer, hitter
In some cases can mean the object undergoing the action (via ergative use of verb), e.g. taster
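The -er facts above could be encoded as a productive rule plus lexical exception lists, roughly as sketched below. The three lists contain only the verbs behind the examples on this slide, and the spelling rule is a crude heuristic assumed for the sketch (it would wrongly double consonants in many longer verbs).

```python
VOWELS = "aeiou"

# Exception lists built only from the slide's own examples.
BLOCKED = {"know", "cycle", "say", "hope"}            # *knower, *cycler, *sayer, *hoper
SPECIALISED = {"revolve", "compute", "wash", "hit"}   # revolver, computer, washer, hitter
PATIENT_READING = {"taste"}                           # taster


def er_form(verb):
    """Spell the -er form (rough heuristic, fine for short verbs only)."""
    if verb.endswith("e"):
        return verb + "r"                      # dance -> dancer
    if (len(verb) >= 3 and verb[-1] not in VOWELS + "wxy"
            and verb[-2] in VOWELS and verb[-3] not in VOWELS):
        return verb + verb[-1] + "er"          # swim -> swimmer
    return verb + "er"                         # kill -> killer


def agentive(verb):
    """Return (form, notes), or None if the derivation is listed as blocked."""
    if verb in BLOCKED:
        return None
    notes = ["actor"]
    if verb in SPECIALISED:
        notes.append("specialised meaning listed separately")
    if verb in PATIENT_READING:
        notes.append("may denote the thing undergoing the action")
    return er_form(verb), notes
```

The point of the design is that the general rule is stated once, and the lexicon only has to record where a verb departs from it.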
Subcategorization : Words are assigned to categories (ie parts of speech, POS), eg noun, verb
on basis of form, meaning, use
Syntactic behaviour is predictable from (or determined by) category
Within a category there are subcategories with specific patterns of behaviour, both syntactic and semantic, e.g.
transitive/intransitive verb: direct object? passivize?
Subcategorization : Subcat frames indicate complement patterns and preferences, e.g.
subj, obj, double obj, prep-obj, infinitival complement, that complement etc
semantic features of complements, eg obj of eat normally edible
Subcat information can help to disambiguate
cf He told [the man] [where the body was buried].
He found [the place [where the body was buried]].
Much of this info can be captured in general rules
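A small sketch of how subcat frames might drive the told/found disambiguation and the "obj of eat normally edible" preference. The frame labels (`NP`, `S-wh`, `S-that`) and the tiny tables are assumptions for illustration; real lexicons list many more frames per verb.

```python
# Toy subcategorization table: each verb maps to its possible complement frames.
SUBCAT = {
    "tell": [("NP", "S-wh"), ("NP", "S-that")],
    "find": [("NP",)],
    "eat": [(), ("NP",)],
}

# Semantic feature normally expected of the object.
OBJECT_FEATURE = {"eat": "edible"}


def wh_clause_role(verb):
    """Decide how to attach a wh-clause that follows the verb's object NP.

    If the verb has a frame with a wh-clause complement, prefer attaching the
    clause to the verb; otherwise treat it as a relative clause on the NP.
    """
    for frame in SUBCAT.get(verb, []):
        if "S-wh" in frame:
            return "complement"
    return "relative"


def object_is_typical(verb, noun_features):
    """Check the object's features against the verb's normal expectation."""
    required = OBJECT_FEATURE.get(verb)
    return required is None or required in noun_features
```

With these tables, `wh_clause_role("tell")` gives the two-complement reading and `wh_clause_role("find")` the relative-clause reading, matching the two example sentences.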
Lexicons for NLP : Have to state everything we need to know about the word, though not necessarily explicitly
There can be rules to capture inheritance of properties, e.g.
accomplishment + prog tense implies incompletion
cf She was baking a cake when she dropped dead → no cake
She was stroking the cat when she dropped dead
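One way to picture such a rule: the property is stated once for a class of predicates, and each entry only records which class it belongs to. In this sketch the label "activity" for stroke the cat is an assumption (the slide only names accomplishments), and the function name is hypothetical.

```python
# Each predicate only records its class; nothing about the progressive is
# stated in the individual entries.
ASPECT_CLASS = {
    "bake a cake": "accomplishment",
    "stroke the cat": "activity",   # assumed class label, not from the slide
}

# Stated once per class: does "was V-ing" entail "V-ed"?
PROGRESSIVE_ENTAILS_SIMPLE_PAST = {
    "accomplishment": False,   # was baking a cake: no finished cake implied
    "activity": True,          # was stroking the cat: the cat did get stroked
}


def progressive_entails_simple_past(predicate):
    """Look up the class-level rule inherited by this predicate."""
    return PROGRESSIVE_ENTAILS_SIMPLE_PAST[ASPECT_CLASS[predicate]]
```

Adding a new accomplishment predicate then needs only one line in `ASPECT_CLASS`; the behaviour under the progressive follows from the class.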
Exploiting human dictionaries in NLP : In all NLP applications, lexicon is major bottleneck
Availability of MRD versions of human dictionaries provided possible solution
Obviously, MRD gives list of words, and some information
Extract further information about verb frames by analysing the examples
Identify semantic features from definitions
eg a plant which..., a person who...
Identify hidden arguments
eg to lock = to close sthg using a key
cf He locked the door. The key was heavy.
He emptied his pockets. *The key was heavy.
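The two extraction ideas above can be approximated with simple pattern matching over definition text. This is only a sketch: the regular expressions are assumptions that cover the slide's patterns ("a plant which...", "a person who...", "using a key") and would miss most real definitions, and the sample definitions in the usage note are invented apart from the lock example.

```python
import re

# "a plant which ...", "a person who ...": the noun before the relative
# pronoun is taken as the semantic class of the headword.
GENUS = re.compile(r"^(?:an?|the)\s+([\w-]+)\s+(?:which|who|that)\b")

# "... using a key": the noun after "using" is taken as a hidden instrument.
INSTRUMENT = re.compile(r"\busing\s+(?:an?|the)\s+([\w-]+)")


def genus(definition):
    """Return the class noun of a noun definition, or None if no match."""
    m = GENUS.match(definition)
    return m.group(1) if m else None


def hidden_instrument(definition):
    """Return an instrument mentioned in a verb definition, or None."""
    m = INSTRUMENT.search(definition)
    return m.group(1) if m else None
```

For example, `hidden_instrument("to close sthg using a key")` yields `key`, which is what licenses "The key was heavy" after "He locked the door".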
Exploiting human dictionaries in NLP : Generic information about a word and its usage can be derived from definitions in which it occurs:
Wine: alcoholic drink made from fermented juices, especially of grapes
Vintage: a season’s yield of wine from a vineyard
Red wine: wine having a red colour derived from the skins of the grapes used ...
Vineyard: an orchard where grapes are grown for the purpose of wine making
Pinot noir: a dry red Californian table wine
Sake: Japanese rice wine
Claret: a dry red Bordeaux or Bordeaux-like wine
Sherry: a sweet white wine from the Jerez region of Spain
Riesling: a dessert wine made from white grapes grown historically in Germany ...
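A minimal sketch of this idea, using the definitions listed above as data: collect every headword whose definition mentions a target word, and the words that directly modify it. The function names and the small stopword set are assumptions made for the example.

```python
import re

# Definitions copied from the list above (elisions kept as "...").
DEFS = {
    "wine": "alcoholic drink made from fermented juices, especially of grapes",
    "vintage": "a season's yield of wine from a vineyard",
    "red wine": "wine having a red colour derived from the skins of the grapes used ...",
    "vineyard": "an orchard where grapes are grown for the purpose of wine making",
    "pinot noir": "a dry red Californian table wine",
    "sake": "Japanese rice wine",
    "claret": "a dry red Bordeaux or Bordeaux-like wine",
    "sherry": "a sweet white wine from the Jerez region of Spain",
    "riesling": "a dessert wine made from white grapes grown historically in Germany ...",
}

STOPWORDS = {"of", "a", "the"}


def mentions(word):
    """Headwords (other than the word itself) whose definition contains the word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return sorted(h for h, d in DEFS.items() if h != word and pattern.search(d))


def modifiers(word):
    """Words found immediately before the target word in any definition."""
    pattern = re.compile(rf"([\w-]+)\s+{re.escape(word)}\b")
    found = set()
    for definition in DEFS.values():
        found.update(m for m in pattern.findall(definition) if m not in STOPWORDS)
    return found
```

Even on this handful of entries, `modifiers("wine")` recovers typical attributes of wine (table, rice, white, dessert) that the entry for wine itself never states.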
Corpus-based lexicography revisited : Similarly, analysis of real examples can reveal patterns of usage
Identify primary meaning: not always what you’d expect (example of reckon)
Identify possible complementation patterns, and their relative frequency
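To make "complementation patterns and their relative frequency" concrete, here is a toy count over a handful of sentences. The sentences are invented for the sketch (they are not corpus data), the classifier only looks at the single word after the verb, and only the base form of the verb is matched.

```python
from collections import Counter

# Invented example sentences standing in for corpus lines.
CORPUS = [
    "I reckon that he will come",
    "I reckon he is right",
    "they reckon on good weather",
    "we reckon she knows",
    "you must reckon with the cost",
]

PREPOSITIONS = {"on", "with"}


def pattern_after(verb, sentence):
    """Classify the complement by the token right after the verb (very crude)."""
    tokens = sentence.lower().split()
    if verb not in tokens:
        return None
    i = tokens.index(verb)
    if i + 1 >= len(tokens):
        return "no complement"
    nxt = tokens[i + 1]
    if nxt == "that":
        return "that-clause"
    if nxt in PREPOSITIONS:
        return f"prep-obj ({nxt})"
    return "other"


def pattern_counts(verb, corpus):
    """Count complement patterns for the verb across the corpus."""
    patterns = (pattern_after(verb, s) for s in corpus)
    return Counter(p for p in patterns if p is not None)
```

Calling `pattern_counts("reckon", CORPUS)` gives a frequency table per pattern; on a real corpus the same kind of table is what shows which use is actually the most common.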
Structured dictionaries : Special type of dictionary in which words are grouped together according to their meaning: thesaurus
Classic example Roget’s Thesaurus (1852)
Structured vocabulary much used in field of terminology
Also now a valuable resource for NLP: Miller’s (Princeton) WordNet (1985)
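A structured lexicon can be pictured as a set of meaning links rather than an alphabetical list. The sketch below uses a toy "is a kind of" table built from the wine definitions quoted earlier; it is not WordNet's data or API, and the function names are hypothetical.

```python
# Toy "is a kind of" links, read off the wine definitions quoted earlier.
HYPERNYM = {
    "claret": "wine",
    "sherry": "wine",
    "sake": "wine",
    "wine": "drink",
}


def hypernym_chain(word):
    """Follow the links upwards: claret -> wine -> drink."""
    chain = []
    while word in HYPERNYM:
        word = HYPERNYM[word]
        chain.append(word)
    return chain


def co_hyponyms(word):
    """Other words sharing the same immediate parent, i.e. its 'neighbours'."""
    parent = HYPERNYM.get(word)
    if parent is None:
        return []
    return sorted(w for w, p in HYPERNYM.items() if p == parent and w != word)
```

Grouping by meaning is what lets a program move from a word to related words, for instance from sake to `co_hyponyms("sake")`, which an alphabetical dictionary cannot do directly.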