Slide1 : Infrastructural Language Resources & Standards for Multilingual Computational Lexicons
Nicoletta Calzolari
… with many others
Istituto di Linguistica Computazionale - CNR - Pisa
glottolo@ilc.cnr.it
The ENABLER Mission : Language Resources (LRs) & Evaluation: central component of the “linguistic infrastructure”
LRs supported by national funding in National Projects
Availability of LRs is also a “sensitive” issue, touching the sphere of linguistic and cultural identity, but also with economic and political implications
The ENABLER Network of National initiatives aims at “enabling” the realisation of a cooperative framework
formulate a common agenda of medium- & long-term research priorities
contribute to the definition of an overall framework for the provision of LRs
towards …. : Only
Combining the strengths of different initiatives & communities
Making the best use of the ‘modus operandi’ of the national funding authorities in different national situations
Responding to/anticipating needs and priorities of R&D & industrial communities
Promoting the adoption of [de facto] standards, best practices
With a clear distinction of tasks & roles for different actors
We can produce the
synergies, economy of scale, convergence & critical mass
necessary to provide the infrastructural LRs needed to realise the full potential of a multilingual global information society
Lexicon and Corpus: a multi-faceted interaction : L → C tagging
C → L frequencies (of different linguistic “objects”)
C → L proper nouns, acronyms, …
L → C parsing, chunking, …
C → L training of parsers
C → L lexicon updating
C → L “collocational” data (MWE, idioms, gram. patterns ...)
C → L “nuances” of meanings & semantic clustering
C → L acquisition of lexical (syntactic/semantic) knowledge
L → C semantic tagging/word-sense disambiguation (e.g. in Senseval)
C → L more semantic information on LE
C → L corpus based computational lexicography
C → L validation of lexical models
C → L …
L → C ...
...Language as a “Continuum” : Interesting - and intriguing - aspects of corpus use:
impossibility of descriptions based on a clear-cut boundary between what is admitted and what is not
in actual usage, language displays a large number of properties behaving as a continuum, and not as properties of “yes/no” type
the same is true for the so-called “rules”, where we find more a “tendency” towards rules than precise rules in corpus evidence
difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
BUT Lexicon & Corpus
as two viewpoints on the same linguistic object
…. even more in a multilingual context
Extraction from texts vs. formal representation in lexicons : It is difficult to constrain word meaning within a rigorously defined organisation: by its very nature it tends to evade any strict boundary
The rigour and lack of flexibility of formal representation languages cause difficulties when mapping NL word meaning, ambiguous and flexible by its own nature, into them
No clear-cut boundary when analysing many phenomena: it’s more a continuum
The same impression if one looks at examples of types of alternations:
no clear-cut classes across languages
or within one language
Correlation between different levels of linguistic description in the design of a lexical entry : To understand word-meaning:
Focus on the correlation between syntactic and semantic aspects
But other linguistic levels - such as morphology, morphosyntax, lexical cooccurrence, collocational data, etc. - are closely interrelated/involved
These relations must be captured when accounting for meaning discrimination
The complexity of these interrelationships makes semantic disambiguation such a hard task in NLP
Textual corpora as a device to discover and reveal the intricacy of these relationships
Frame/SIMPLE semantics as a device to unravel and disentangle the complex situation into elementary and computationally manageable pieces
towards Corpus based Semantic Lexicons … at least in principle : both in the design of the model, &
in the building of the lexicon (at least partially)
with (semi-)automatic means
Design of the lexical entry with a combined approach:
theoretical: e.g. Fillmore Frame Semantics/
Pustejovsky Generative Lexicon, …
empirical: Corpus evidence
even if sound and explicit criteria for classification according to “frame elements”/qualia relations/... are not always available
Slide9 : Infrastructure of Language Resources ... static
Semantic networks: Euro-/ItalWordNet
Lexicons: PAROLE/SIMPLE/CLIPS
TreeBanks
...
But … they will never be “complete”
Infrastructure of tools … dynamic
Lexical acquisition systems (syntactic & semantic) from corpora
Robust morphosyntactic & syntactic analysers
Word-sense disambiguation systems
Sense classifiers
...
International Standards
Slide10 : ItalWordNet
Semantic Network
[Italian module of EuroWordNet] ~ 50,000 lemmas organized in synonym groups (synsets), structured in hierarchies & linked by ~ 130,000 semantic relations
~ 50,000 hyperonymy/hyponymy relations
~ 16,000 relations among different POS (role, cause, derivation, etc.)
~ 2,000 part-whole relations
~ 1,500 antonymy relations, …etc.
Synsets linked to the InterLingual Index (ILI=Princeton WordNet),
Through the ILI link to all the European WordNets (de-facto standard)
& to the common Top Ontology
Possibility of plug-in with domain terminological lexicons
(legal, maritime)
Usable in IR, CLIR, IE, QA, ...
Slide11 : EuroWordNet Multilingual Data Structure
Slide12 : Synsets linked by Semantic Relations in ItalWordNet
{casa, abitazione, dimora}
Hyperonym: {edificio, ..}
Hyponym: {villetta} {catapecchia, bicocca, ..} {cottage} {bungalow}
Role_location: {stare, abitare, ...}
Role_target_direction: {rincasare}
Role_patient: {affitto, locazione}
Mero_part: {vestibolo} {stanza}
Holo_part: {casale} {frazione} {caseggiato}
home, domicile, ..; house
TOP Concepts: Object, Artifact, Building
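The synset and its relations can be pictured as a small record. The sketch below is only an illustration in Python: the `Synset` class, its field names and the lower-case relation keys are assumptions, not the ItalWordNet database format, and members elided on the slide ("..") are simply left out.

```python
# Hypothetical in-memory sketch of one ItalWordNet synset (not the real format).
from dataclasses import dataclass, field


@dataclass
class Synset:
    words: tuple                                   # synonymous lemmas
    relations: dict = field(default_factory=dict)  # relation name -> target synsets

    def related(self, relation):
        """Return the target synsets for a relation, or an empty list."""
        return self.relations.get(relation, [])


casa = Synset(
    words=("casa", "abitazione", "dimora"),
    relations={
        "hyperonym": [("edificio",)],
        "hyponym": [("villetta",), ("catapecchia", "bicocca"), ("cottage",), ("bungalow",)],
        "role_location": [("stare", "abitare")],
        "role_target_direction": [("rincasare",)],
        "role_patient": [("affitto", "locazione")],
        "mero_part": [("vestibolo",), ("stanza",)],
        "holo_part": [("casale",), ("frazione",), ("caseggiato",)],
    },
)

print(casa.related("hyponym"))
```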
Jur-WordNet : With ITTG-CNR (Istituto di Teoria e Tecniche dell’informazione Giuridica)
Jur-WordNet → Extension for the juridical domain of ItalWordNet
Knowledge base for multilingual access to sources of legal information
Source of metadata for semantic mark-up of legal texts
To be used, together with the generic ItalWordNet, in applications of Information Extraction, Question Answering, Automatic Tagging, Knowledge Sharing, Norm Comparison, etc.
Terminological Lexicon of Navigation & Sea Transportation : → Nolo
Synsets → 1,614
Lemmas → 2,116
Senses → 2,232
Nouns → 1,621
Verbs → 205
Adjectives → 35
Proper Nouns → 236
Slide15 : PAROLE/SIMPLE: 12 harmonised computational lexicons
PAROLE, Ital. Synt. Lex. (’96-’98, SGML): morphology: 20,000 entries; syntax: 20,000 words
SIMPLE, Ital. Sem. Lex. (’98-2000, SGML): semantics: 10,000 senses
CLIPS (2000-2004, XML): phonology; morphology: 55,000 words; syntax; semantics: 55,000 senses
http://www.ilc.cnr.it/clips/
Slide16 : machine language learning
Slide17 : machine language learning development of conceptual networks linguistic learning adaptive classification systems information extraction bootstrapping of grammars linguistic change models language usage models bootstrapping of lexical information
Slide18 : Architecture for linguistic knowledge acquisition ... LKG …. towards “dynamic” lexicons, able to auto-enrich
[Diagram components: unstructured text data; annotation tools; annotated data; machine learning for linguistic knowledge acquisition; lexica; lexicon model; user needs; terminology; cross-lingual information retrieval; multi-lingual information extraction; multi-lingual text mining]
Slide19 : Harmonisation: More & more Need of a Global View for Global Interoperability Integration/sharing of data & software/tools
Need of compatibility among various components
An “exemplary cycle”:
Formalisms
Grammars
Software: Taggers,
Chunkers, Parsers, …
Representation Annotation
Lexicon Corpora
Terminology
Software:
Acquisition Systems
I/O Interfaces Languages
A short guide to ISLE/EAGLES http://www.ilc.cnr.it/EAGLES96/isle/ISLE_Home_Page.htm : Multilingual Computational Lexicon Working Group
Target: … the Multilingual ISLE Lexical Entry (MILE) : General methodological principles (from EAGLES):
high granularity: factor out the (maximal) set of primitive units of lexical info (basic notions) with the highest degree of inter-theoretical agreement
modular and layered: various degrees of specification possible
explicit representation of info
allow for underspecification (& hierarchical structure)
leading principle: edited union of existing lexicons/models (redundancy is not a problem)
open to different paradigms of multilinguality
oriented to the creation of large-scale & distributed lexicons
Paths to Discover the Basic Notions of MILE : clues in dictionaries to decide on target equivalent
guidelines for lexicographers
clues (to disambiguate/translate) in corpus concordances
lexical requirements from various types of transfer conditions & actions in MT systems
lexical requirements from interlingua-based systems
…
Designing MILE : Steps towards MILE:
Creating entries (Bertagna, Reeves, Bouillon)
Identifying the MILE Basic Notions (Bertagna, Monachini, Atkins, Bouillon)
Defining the MILE Lexical Model (Lenci, Calzolari, etc.)
Formalising MILE (Ide)
Development of the ISLE Lexical Tool (Bel)
ISLE & spoken language & multimodality (Gibbon)
Metadata for the lexicon (Peters, Wittenburg)
A case-study: MWEs in MILE (Quochi, Lenci, Calzolari)
the MILE Basic Notions
the MILE Lexical Model
The MILE Basic Notions (the EAGLES/ISLE CLWG) : Basic lexical dimensions & info-types relevant to establish multilingual links
Typology of lexical multilingual correspondences (relevant conditions & actions)
Identified by:
creating sample multilingual lexical entries (Bertagna, Reeves)
investigating the use of sense indicators in traditional bilingual dictionaries (Atkins, Bouillon)
….
The MILE Lexical Classes – Data Categories for Content Interoperability : Francesca Bertagna*, Alessandro Lenci°, Monica Monachini*, Nicoletta Calzolari*
*ILC–CNR – Pisa
°Pisa University
Overview : MILE Lexical Model with Lexical Objects and Data Categories
Mapping of existing lexicons onto MILE
RDF schema and DC Registry for some pre-instantiated lexical objects together with a sample entry from the PAROLE-SIMPLE lexicons in MILE
Future …
The MILE Lexical Model : Guidelines syntactic semantic lexicons … where after?
The MILE Main Features : A general architecture devised as a common representational layer for multilingual Computational Lexicons
both for hand-coded and corpus-driven lexical data
Key features:
Modularity
Granularity
Extensibility and “openness” - User-adaptability
Resource Sharing
Content Interoperability
Reusability
Semantic Web technologies & standards
applied to Lexicon modelling
The MILE Lexical Model (MLM) : The MLM core is the Multilingual ISLE Lexical Entry (MILE)
a general schema for multilingual lexical resources
a lexical meta-entry as a common representational layer for multilingual lexicons
Computational lexicons can be viewed as different instances of the MILE schema
[Diagram: MILE Lexical Model; lexicon#1, lexicon#2, lexicon#3]
MILE: the building-block model : The MILE architecture is designed according to the building-block model:
Lexical entries are obtained by combining various types of lexical objects (atomic and complex)
Users design their lexicon by:
selecting and/or specifying the relevant lexical objects
combining the lexical objects into lexical entries
Lexical objects may be shared:
within the same lexicon (intra-lexicon reusability)
among different lexicons (inter-lexicon reusability)
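A minimal Python sketch of the building-block idea, under stated assumptions: the identifiers (`frame:transitive`, `feature:HUMAN`), the registry layout and the helper `make_entry` are all hypothetical and only illustrate entries that reference shared lexical objects instead of copying them.

```python
# Hypothetical shared lexical objects, addressed by identifier.
shared_objects = {
    "frame:transitive": {"slots": ["subj_NP", "obj_NP"]},
    "feature:HUMAN": {"type": "semantic_feature"},
}


def make_entry(lemma, object_ids, registry):
    """Build an entry that references shared objects by id instead of copying them."""
    missing = [oid for oid in object_ids if oid not in registry]
    if missing:
        raise KeyError(f"unknown lexical objects: {missing}")
    return {"lemma": lemma, "objects": list(object_ids)}


# Intra-lexicon reuse: two entries of the same lexicon share one frame object.
lexicon_1 = [
    make_entry("eat", ["frame:transitive"], shared_objects),
    make_entry("read", ["frame:transitive"], shared_objects),
]
# Inter-lexicon reuse: a different lexicon points at the same object.
lexicon_2 = [make_entry("mangiare", ["frame:transitive"], shared_objects)]
```

Because entries hold only identifiers, a change to a shared object is visible to every entry that references it.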
MILE: the building-block model : MILE: the building-block model
Modularity in MILE : multiple levels of modularity
Horizontal organization, where independent, but interlinked, modules allow different dimensions of lexical entries to be expressed
[Diagram label: multilingual correspondence conditions]
The Mono-MILE : Each monolingual layer within Mono-MILE identifies a basic unit of lexical description
morphological layer: MU, basic unit to describe the inflectional and derivational morphological properties of the word
syntactic layer: SynU, basic unit to describe the syntactic behaviour of the MU
semantic layer: SemU, basic unit to describe the semantic properties of the MU
The Mono-MILE : MU Within each layer, a basic linguistic information unit is identified
Granularity in MILE : Concerns the vertical dimension. Within a given lexical layer, varying degrees of depth of lexical descriptions are allowed, both shallow and deep lexical representations
Defining the MLM : The MLM is designed as an E-R model (MILE Entry Schema)
defines the lexical objects and the ways they can be combined into a lexical entry
The MLM includes 3 types of lexical objects:
MILE Lexical Classes (MLC)
MILE Lexical Data Categories (MDC)
MILE Lexical Operations (MLO)
The MILE Lexical Objects : Within each layer, basic lexical notions are represented by lexical objects:
MILE Lexical Classes (MLC)
MILE Data Categories (MDC)
Lexical operations
They form an ontology of lexical objects, an abstraction over different lexical models and architectures
The MILE E/R diagrams : The lexical objects are described with E-R diagrams which define them and the ways they can be combined into a lexical entry
MILE Lexical Objects: Syntactic Layer : hasSyntacticFrame hasFrameSet composedby correspondTo 1..* * * *
Slide40 : … expanding one node. … …
Slide41 : MILE Lexical Objects: Semantic Layer belongsToSynset hasSemFrame hasSemFeature hasCollocation semanticRelation * 0..1 * * *
Slide42 : MILE Lexical Objects: Synt-Sem Linking hasSourceSynU hasTargetSemU hasPredicativeCorresp includesSlotArgCorresp 1 1 1 0..*
Syntax-Semantics Linking : Slot0:Arg1
Slot1:Arg0
Syntax-Semantics Linking : John gave the book to Mary / John gave Mary the book
SynU#1: subj_NP obj_NP obl_PP_to
SynU#2: subj_NP obj_NP obj_NP
SemU#1 Semantic_Frame:GIVE
Arg1 Agent
Arg2 Theme
Arg3 Goal
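The give example can be sketched in Python as two syntactic frames linked to one semantic frame through slot-to-argument correspondences. The slot indices and the exact pairings in `LINK_1` and `LINK_2` are assumptions read off the example sentences, and the `SlotArgCorresp` dataclass only borrows the name of the MILE object.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotArgCorresp:
    slot: int   # index of the syntactic slot in the SynU frame
    arg: str    # name of the semantic argument


SYNU_1 = ["subj_NP", "obj_NP", "obl_PP_to"]   # John gave the book to Mary
SYNU_2 = ["subj_NP", "obj_NP", "obj_NP"]      # John gave Mary the book
GIVE_ARGS = {"Arg1": "Agent", "Arg2": "Theme", "Arg3": "Goal"}

# Assumed linkings: which slot realises which argument in each frame.
LINK_1 = [SlotArgCorresp(0, "Arg1"), SlotArgCorresp(1, "Arg2"), SlotArgCorresp(2, "Arg3")]
LINK_2 = [SlotArgCorresp(0, "Arg1"), SlotArgCorresp(1, "Arg3"), SlotArgCorresp(2, "Arg2")]


def roles_by_slot(frame, link):
    """Map each slot (position, category) to the semantic role it realises."""
    return {(c.slot, frame[c.slot]): GIVE_ARGS[c.arg] for c in link}


print(roles_by_slot(SYNU_1, LINK_1))
print(roles_by_slot(SYNU_2, LINK_2))
```

Both SynUs point at the same SemU; only the correspondence list differs, which is what lets the two alternations share one semantic description.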
Slide45 : CorrespSynUSemU Syntax-Semantic Linking in SIMPLE Transitive structure
Slot0 Slot1 SemU1_migliorare SemU2_migliorare CHANGE_OF_STATE CAUSE_CHANGE_OF_STATE PRED_ migliorare ARG0:Agent ARG1:Patient isomorphic non-isomorphic Frameset Intransitive structure
Slot0 Ø CorrespSynUSemU SlotArgCorresp SlotArgCorresp
Slide46 : The Multilingual layer hasMUMUCorr hasSynUSynuCorr hasSemUSemUCorr hasSynsetMultCorr hasSemFrameCorr 1..0 1..0 1..0 1..0 1..0
MILE approach to multilinguality : Open to various approaches
transfer-based
monolingual descriptions are used to state correspondences (tests and actions) between source and target entries
interlingua-based
monolingual entries linked to language-independent lexical objects (e.g. semantic frames, “primitive predicates”, etc.)
The Multi-MILE : Multi-MILE specifies a formal environment to express multilingual correspondences between lexical items
Source and target lexical entries can be linked by exploiting (possibly combined) aspects of their monolingual descriptions
monolingual lexicons act as pivot lexical repositories, on top of which language-to-language multilingual modules can be defined
The Multi-MILE : Multi-MILE may include:
Multilingual operations to establish transfer links between source and target mono-MILE
Multilingual lexical objects
enrich the source and target lexical descriptions, but
do not belong to the monolingual lexicons
Language-independent lexical objects:
Primitive semantic frames, “interlingual synsets”, etc.
Relevant for interlingua approaches to multilinguality
Multi-MILE : IT_SemU_2 En_SemU_1
IT_SynU_2 En_SynU_1
IT_Slot_0 EN_Slot_1
IT_Slot_1 EN_Slot_0 AddFeature to source SemU
+HUMAN AddSlot to target SynU
MODIF [PP_with]
Multi-MILE : multilingual conditions between IT Lexicon and EN Lexicon
dito ↔ finger [modif(mano)]
dito ↔ toe [modif(piede)]
entrare “to enter” + PP_di_corsa ↔ run + PP_into
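The dito example can be read as a conditional transfer table. The following Python sketch assumes one particular encoding of the conditions as (feature, value) pairs; the `TRANSFER` table and `translate` function are hypothetical illustrations, not part of MILE.

```python
# Hypothetical encoding of the multilingual conditions in the dito example.
TRANSFER = {
    "dito": [
        {"condition": ("modif", "mano"), "target": "finger"},
        {"condition": ("modif", "piede"), "target": "toe"},
    ],
}


def translate(lemma, context):
    """Return the targets whose condition is satisfied by the context.

    `context` is a set of (feature, value) pairs observed for the source word.
    """
    return [rule["target"] for rule in TRANSFER.get(lemma, [])
            if rule["condition"] in context]


print(translate("dito", {("modif", "piede")}))
```

With no matching condition the function returns an empty list, leaving the choice between the candidate equivalents open.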
MILE Lexical Classes : Represent the main building blocks of lexical entries
Formalize the MILE Basic Notions
Define an ontology of lexical objects
represent lexical notions such as semantic unit, syntactic feature, syntactic frame, semantic predicate, semantic relation, synset, etc.
Similar to class definitions in OO languages
specify the relevant attributes
define the relations with other classes
hierarchically structured
MILE Lexical Classes: an ontology of lexical objects : MILE Lexical Classes: an ontology of lexical objects
MILE Lexical Data Categories : MDC are instances of the MILE Lexical Classes
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
Enable modular specification of lexical entities using all or parts of the lexical information in the repository
Each MDC represents a resource
uniquely identified by a URI
Two types of MDC:
Core MDC
belong to shared repositories (Lexical Data Category Registry)
lexical objects and linguistic notions with wide consensus
User Defined MDC
user-specific or language specific lexical objects
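A small Python sketch of the two kinds of data category, each identified by a URI. The `example.org` URIs, the `CategoryStore` class and the rule that core categories cannot be redefined are all assumptions made for illustration; the real registry addresses are not given here.

```python
# Hypothetical core registry: URI -> MILE Lexical Class the category instantiates.
CORE_REGISTRY = {
    "http://example.org/mile/core#HUMAN": "SemanticFeature",
    "http://example.org/mile/core#ANIMAL": "SemanticFeature",
}


class CategoryStore:
    """Core categories are shared and read-only; user categories are local."""

    def __init__(self, core):
        self.core = dict(core)
        self.user = {}

    def define(self, uri, lexical_class):
        # Assumed policy: a user may extend the registry but not override it.
        if uri in self.core:
            raise ValueError("core categories cannot be redefined")
        self.user[uri] = lexical_class

    def lookup(self, uri):
        """Return (lexical class, origin) for a URI; raises KeyError if unknown."""
        if uri in self.core:
            return self.core[uri], "core"
        return self.user[uri], "user-defined"


store = CategoryStore(CORE_REGISTRY)
store.define("http://example.org/mylexicon#MAMMAL", "SemanticFeature")
print(store.lookup("http://example.org/mylexicon#MAMMAL"))
```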
The MILE Data Categories : Instances of the MILE Lexical Classes are Data Categories
MDC can belong to a shared repository or be user-defined
Core MDC
User-defined MDC
The MILE Data Categories User-adaptability and extensibility : HUMAN
ARTIFACT
EVENT
ANIMAL
GROUP AGE
MAMMAL instance_of Core UserDefined
MILE Lexical Data Categories : MLM:Feature MLM:GrammaticalFunction
MILE Lexical Operations : They are used to state conditions and perform operations over lexical entries
Link syntactic slots and semantic arguments
Constrain the syntax-semantic link
Express tests and actions in the transfer conditions in the multi-MILE
…
They provide the “glue” to link various independent intra-lexical and inter-lexical components
Multilingual Operations : Source-to-target language transfer conditions can be expressed by combining multilingual operations
Three types of multilingual operations:
Multilingual correspondences
Link a source lexical object (MU, SemU, SynU, semantic argument, syntactic slot) and a target lexical object (MU, SemU, SynU, semantic argument, syntactic slot)
Add-operations
Add lexical information relevant for the cross-lingual link, but not present in the source or target mono-MILE
Constrain-operations
Constrain the transfer link to some portions of source and target mono-MILE
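As a rough Python sketch, a transfer condition can be written as a list of operations, with add-operations applied to a copy so that the monolingual description stays untouched. The dictionary layout, the key names (`op`, `slot`, `value`) and the function `apply_add_operations` are assumptions; only the unit identifiers and the `+HUMAN` feature come from the Multi-MILE example.

```python
import copy


def apply_add_operations(entry, operations):
    """Return a copy of a mono-MILE description enriched by the add-operations."""
    enriched = copy.deepcopy(entry)
    for op in operations:
        if op["op"] == "add":
            enriched.setdefault(op["slot"], []).append(op["value"])
    return enriched


it_semu = {"id": "IT_SemU_2", "features": []}

# One transfer condition: a correspondence plus an add-operation on the source.
transfer = [
    {"op": "correspondence", "source": "IT_SemU_2", "target": "En_SemU_1"},
    {"op": "add", "slot": "features", "value": "+HUMAN"},
]

enriched = apply_add_operations(it_semu, transfer)
print(enriched["features"], it_semu["features"])
```

Working on a copy mirrors the point that the added information belongs to the multilingual layer, not to the monolingual lexicon.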
Defining the MLM : MILE Entry Schema, MILE Lexical Classes, RDF/S Descriptions
RDF Instantiation of the MLM : RDF Instantiation of the MLM Lexicon#1 Lexicon#2 Lexicon#3 Resources Lexical
Objects Lexical
Classes Lexical
Data Categories Resources Metadata
MILE Lexical Model : Ideal structure for rendering in RDF:
hierarchy of lexical objects built up by combining atomic data categories via clearly defined relations
Proof of concept:
Create an RDF schema for the MILE Lexical Model
version 1.2
Instantiate MILE Lexical Data Categories
User-Adaptability and Resource Sharing in MILE : Compatible with different models of lexical analysis:
Relational semantic models (e.g. WordNet)
Syntactic and semantic frames
Ontology-based lexicons
Compatible with different degrees of specification:
Deep lexical representations (e.g. PAROLE-SIMPLE)
Terminological lexicons
Compatible with different paradigms of multilinguality
Lexicons for Transfer Based MT
Interlingua-based lexicons
…
The MILE Lexical Model : The MILE Lexical Model MILE
Lexical Model
RDF Instantiation of the MLM : Enable universal access to sophisticated linguistic info
Provide means for inferencing over lexical info
Incorporate lexical information into the Semantic Web
W3C standards:
Resource Description Framework (RDF)
Web Ontology Language (OWL)
Built on the XML web infrastructure to enable the creation of a Semantic Web
web objects are classified according to their properties
semantics of relations (links) to other web objects precisely defined
The RDF Schema : Defines classes of objects (MLC) and their relations to other objects
Like a class definition in Java, etc.
Classes and properties in the schema correspond to the E-R model
Can specify sub-classes/sub-properties and inheritance
Goals : Lexical information will form a central component of semantic information
Need a standardized, machine processable format so that information can be used, merged with others
Main task: get the data model right See
Semantic Web
Advantages of RDF : Modularity
Can create “instances” of bits of lexical information for re-use in a single lexicon or across lexicons
Instances can be stored in a central repository for use by others
Can use partial information or all of it
Building block approach to lexicon creation
Web-compatible
RDF instantiation will integrate into Semantic Web
Inferencing capabilities
Example : Three parts:
RDF Schema for lexical entries
Defines classes and properties, sub-classes, etc.
Sample repository of RDF-instantiated lexical objects
Three levels of granularity
Sample lexicon entries
Use repository information at different levels
Sample Repositories : repository of enumerated classes for lexical objects at the lowest level of granularity
definition of sets of possible values for various lexical objects
repository of phrases for common phrase types, e.g., NP, VP, etc.
repository of constructions for common syntactic constructions
Slide71 : Enumerated classes
Subj
Obj
Comp
Arg
Iobj
tense
gender
control
person
aux
have
be
subject_control
object_control
masculine
feminine
Sample LDCR for a Phrase Object : Sample LDCR for a Phrase Object
Sample LDCR entry for a Construction object : Sample LDCR entry for a Construction object
Full entry : Full entry
John ate the cake
Continued…
Slide75 : Continued from previous slide…
Entry Using Phrase : Entry Using Phrase
John ate the cake
Entry Using Construction : Entry Using Construction
John ate the cake
Semantic Representation : The data model underlying RDF/UML, etc. is universal, abstract enough to capture all types of info
Semantic representations:
Registry of basic data categories
“meta”-categories: addressee, utterance, etc.
Information categories: eyebrow movement, gestures, pitch, …
Supporting ONTOLOGY of information categories
Interpretative procedures yield another level of meaning representation
Registry of categories….
UNINTERPRETED REPRESENTATION → INTERPRETATION PROCESS → INTERPRETED REPRESENTATION
MILE Lexical Data Category Registry (MDC) : Instantiation of pre-defined lexical objects
Extension of the shared class schema with lexicon-specific sub-classes and sub-properties
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
Enables modular specification of lexical entities
eliminate redundancy
identify lexical entries or sub-entries with shared properties
MLC in RDF/S features : features are properties of lexical objects
mlm:LexObject mlm:Values mlm:feature mlm:SemValues mlm:SynValues rdfs:subClassOf mlm:semFeature rdfs:subClassOf mlm:synFeature rdfs:subPropertyOf
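Sub-class and sub-property inheritance can be illustrated without any RDF library by following triples transitively. The Python sketch below is an assumption-laden reading of the diagram labels: which term is the sub-class or sub-property of which is inferred from the names, and `my:caseFeature` is an invented user extension.

```python
# Triples reconstructed from the diagram labels (the pairings are assumptions).
TRIPLES = {
    ("mlm:SynValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:SemValues", "rdfs:subClassOf", "mlm:Values"),
    ("mlm:synFeature", "rdfs:subPropertyOf", "mlm:feature"),
    ("mlm:semFeature", "rdfs:subPropertyOf", "mlm:feature"),
    # Invented lexicon-specific extension, for illustration only.
    ("my:caseFeature", "rdfs:subPropertyOf", "mlm:synFeature"),
}


def ancestors(term, predicate, triples):
    """All terms reachable from `term` by following `predicate` transitively."""
    found, frontier = set(), [term]
    while frontier:
        current = frontier.pop()
        for subj, pred, obj in triples:
            if subj == current and pred == predicate and obj not in found:
                found.add(obj)
                frontier.append(obj)
    return found


print(ancestors("my:caseFeature", "rdfs:subPropertyOf", TRIPLES))
```

A lexicon-specific property defined this way is still recognised as an `mlm:feature` by any tool that follows the sub-property chain.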
MLC in RDF/S syntactic features : MLC in RDF/S syntactic features
...
feature values
MLC in RDF/S semantic features : MLC in RDF/S semantic features
...
“domain ontology”
Synsets in RDF/S : mlm:Synset rdfs:literal mlm:word mlm:Synset mlm:synsetRelation mlm:Values rdfs:literal mlm:gloss mlm:feature cf. also http://www.semanticweb.org/library/wordnet/wordnet-20000620.rdfs
Synsets in RDF/S : Synset
This class formalizes the notion of synset as defined in WordNet (Fellbaum 1998).
relation between synsets; different types of synset relations:
The WordNet hypernym relation
The WordNet meronym relation
WordNet 1.7 Synsets : A member of the genus Canis
dog
domestic dog
Canis familiaris
features hypernym
Foundations of the Mapping Experiment : Foundations of the Mapping Experiment
1. The MILE building-block model : The MILE Lexical Classes and the MILE Lexical Data Categories are the main building blocks of the MILE lexical architecture
Building blocks allow two kinds of reusability:
intra-lexicon reusability (within the same lexicon)
inter-lexicon reusability (among different lexicons)
How do building blocks work? : How do building blocks work?
2. MILE: a meta-entry : MILE is
a general schema for multilingual lexical resources
a lexical meta-entry, a common representational layer for multilingual lexicons
Computational lexicons can be viewed as different instances of the MILE schema
MILE
lexicon#1 lexicon#3 lexicon#2
MILE and Content Interoperability : This common shared compatible representation of lexical objects is particularly suited to
manipulate objects available in different lexical resources
understand their deep semantics
apply the same operations to lexical objects of the same type
key elements of Content Interoperability
The Mapping Experiment: Why? : It is a concrete experiment aimed at testing the expressive potential and capabilities of the MILE
The idea is that if the MILE atomic notions, combined in different ways, suit the different “visions” underlying two lexicons such as FrameNet and NOMLEX,
the MILE will come out strengthened
its adoption as an interface between differently conceived lexical architectures can be promoted further
key issues for content interoperability between resources can be addressed
The mapping scenarios : High level mapping of the objects of a lexicon into the objects of the abstract model
the native structure is maintained and no format conversion is performed
Translate instances of lexical entries directly in MILE
acts as a true interchange format
FrameNet to MILE : FrameNet to MILE
FrameNet-MILE: Observations : The mapping is promising
Frame ↔ Predicate (primitive)
Frame Elements ↔ Argument (enlarge the set of possible values)
Lexical_Unit ↔ SemU
Link SemU-Predicate (obligatory) should become underspecified
But …
Lack of inheritance mechanism in the Predicate does not allow the hierarchical organization of Frames and Sub-frames, temporal ordering among Frames, or subsumption relations among Frames to be represented
We could add a new object PredicateRelation to allow for the description of relations occurring between predicates and sub-predicates
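The correspondences and the proposed PredicateRelation object can be sketched in Python as follows. The record field names and the helper functions are hypothetical; the Giving frame with its Donor, Recipient and Theme elements is used only as an illustrative input, and `Frame_A` and `Frame_B` are placeholders rather than real FrameNet frames.

```python
# Correspondences taken from the observations above.
FRAMENET_TO_MILE = {"Frame": "Predicate", "FrameElement": "Argument", "Lexical_Unit": "SemU"}


def lexical_unit_to_mile(lexical_unit):
    """Recast one FrameNet-style lexical unit as a MILE-style record (field names assumed)."""
    return {
        "SemU": lexical_unit["name"],
        "Predicate": lexical_unit["frame"],
        "Arguments": list(lexical_unit["frame_elements"]),
    }


# Proposed extension: relations between predicates, stored as plain triples.
predicate_relations = [("Frame_B", "subframe_of", "Frame_A")]


def parents(predicate, relation):
    """Predicates that `predicate` is linked to by `relation`."""
    return [parent for child, rel, parent in predicate_relations
            if child == predicate and rel == relation]


give = lexical_unit_to_mile(
    {"name": "give.v", "frame": "Giving", "frame_elements": ["Donor", "Recipient", "Theme"]}
)
print(give)
```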
Slide95 : MLC:SynU MLC:SemU MLC:SemanticFrame
TypeOfLinkAgentnom
IncludedArg 0
MLC:Predicate MLC:Argument MLC:Argument MLC:CorrespSynUSemU :nom-type ((subject))
NOMLEX-MILE: Observations : The mapping is promising
Notions represented in NOMLEX have a correspondent in MILE
But ..
are expressed with two opposite lexical structures
In NOMLEX,
lexical information is expressed in a very compact way
no clear cut boundaries between the levels of linguistic description
In MILE
compressed info should be decompressed and spread over different MILE lexical layers and objects: SynU, SemU, SemanticFrame with its Predicate and relevant Arguments to account for the incorporation of the Agent.
Lesson Learned from the mapping : The results of the experiments are promising
FrameNet offers the possibility to be confronted with two similar lexical models, but not perfectly overlapping lexical objects → test the adequacy of the linguistic objects
NOMLEX gives the opportunity to work with two lexicons where linguistic notions correspond but are expressed with an opposite lexicon structure → test the adequacy of the architectural model
The high granularity and modularity of MILE
allow the compatibility with differently packaged linguistic objects
allow the addition of new objects and relations without distorting the general architecture
RDF and MILE: Why? : Some reasons (from Nancy Ide et al. 2003)
MILE as a hierarchy of lexical objects built up by combining data categories via clearly defined relations is an ideal structure for rendering in RDF
RDF mechanism, with the capacity of expressing named relations between objects, offers a web-based means to represent the MILE architecture
RDF representation of linguistic information is an invaluable resource for language processing applications in the Semantic Web
RDF description and instantiation is in line with the goal of ISO TC37 SC4
RDF Representation of MILE : MILE was already supplied with
an RDF schema for the MILE Syntactic Layer
an instantiation of pre-defined syntactic objects
We increased the repository of shared lexical objects with the RDF description and (partial!) instantiations of the objects of the semantic and linking layers
This has been carried out with the intent to
be submitted within the ISO TC37/SC4
foster the adoption of MILE, by offering a library of RDF objects ready-to-use
An RDF Schema for the synt-sem linking : Classes
CorrespSynUSemU
This class links a SynU to a SemU
PredicativeCorresp
This class contains the associations between the syntactic slots and semantic arguments
SlotArgCorresp
This class links a syntactic slot to a semantic argument
An RDF Schema for the synt-sem linking : Properties
hasSourceSynU
hasTargetSemU
hasPredicativeCorresp
includesSlotArgCorresp
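One way to picture these classes and properties is as a schema with cardinality bounds, checked over plain dictionaries. In the Python sketch below, the assignment of properties to classes and the bounds (1, 1, 1, 0..*) follow the Synt-Sem Linking diagram as far as it can be read; the `SCHEMA` layout and the `violations` helper are assumptions.

```python
# Property -> (minimum, maximum) number of values; None means unbounded.
SCHEMA = {
    "CorrespSynUSemU": {
        "hasSourceSynU": (1, 1),
        "hasTargetSemU": (1, 1),
        "hasPredicativeCorresp": (1, 1),
    },
    "PredicativeCorresp": {"includesSlotArgCorresp": (0, None)},
}


def violations(class_name, instance):
    """List the properties whose number of values breaks the cardinality bounds."""
    problems = []
    for prop, (low, high) in SCHEMA[class_name].items():
        count = len(instance.get(prop, []))
        if count < low or (high is not None and count > high):
            problems.append(prop)
    return problems


incomplete = {"hasSourceSynU": ["SynU#1"], "hasTargetSemU": ["SemU#1"]}
print(violations("CorrespSynUSemU", incomplete))
```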
The library of Pre-instantiated objects : Enable modular specification of lexical entities
eliminate redundancy
identify lexical entries or sub-entries with shared properties
create ready-to-use packages that can be combined in different ways
Can be used “off the shelf” or as a departure point for the definition of new or modified categories
MDCR for some objects : MDCR for some objects
Pre-instantiated PredicativeCorresp Pre-instantiated SlotArgCorresp
A Sample Entry in MILE : The entry is shown in two alternative forms:
the full specification of a lexical object PredicativeCorresp
an already instantiated object PredicativeCorresp
The advantage is that
the object does not need to be specified in the entry
and can be used and reused in other entries
explore the potential of MILE for representation of lexical data
Sample full entry for amareV : Sample full entry for amareV
The “full” object PredicativeCorresp
… the abbreviated entry : … the abbreviated entry
Instantiated object PredicativeCorresp
Slide107 : The RDF Schema, the DCR for MILE objects and the entries are available at
www.ilc.cnr.it/clips/rdf/
and INTERA? … : INTERA Multilingual Terminological Lexica will follow and merge the two frameworks
The MILE and
ISO TMF (Terminological Markup Framework)
Beyond MILE: future work : Beyond MILE: future work
MILE Lexical Model oriented towards an
Open Distributed Lexical Infrastructure:
Lexical Information Servers for multiple access to lexical information repositories
Enhance
user-adaptivity
resource sharing
cooperative creation
Develop integration and interchange tools
Broadening MILE: ... other languages : Broadening MILE: ... other languages Ongoing enlargement to Asian languages (Chinese, Japanese, Korean, Thai, Hindi ...)
promote common initiatives between Asia & Europe (e.g. within the EU 6th FP)
The creation of an Open Distributed Lexical Infrastructure, also supported by Asian Institutions:
AFNLP
University of Tokyo (Dept. of Computer Science)
Korean KAIST and KORTERM
Academia Sinica (Taiwan)
…
To valorise results & increase the visibility of LR & standardisation initiatives in a world-wide context,
while concretely promoting the launch of a new common platform for multilingual LR creation & management
Using semantically tagged corpora to … acquire semantic info and enhance Lexicons : Using semantically tagged corpora to … acquire semantic info and enhance Lexicons evaluate the disambiguating power of the semantic types of the lexicon
assess the need for integrating lexicons with attested senses and/or phraseology
identify the inadequacy of sense distinctions in lexicons
check actual frequency of known senses in different text types
have a more precise and complete view on the semantics of a lemma
identify the most general senses
capture the most specific shifts of meaning
Capture just the core, basic distinctions in a core lexicon
Corpus analysis must not lead to excessive granularity of sense distinctions, but should draw a distinction between
sense discrimination, to be kept “under control” through clustering (manual or automatic)
additional, more granular information (often of a collocational nature), which can/must be acquired/encoded within the broader senses, e.g. to help translation
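Two of the checks listed above, the frequency of known senses per text type and the detection of attested senses missing from the lexicon, can be sketched in Python. The corpus tokens, sense labels and the functions `sense_frequencies` and `senses_missing_from_lexicon` are invented toy material, not data or tools from the project.

```python
from collections import Counter, defaultdict

# Toy semantically tagged corpus: (text_type, lemma, sense_id). Invented data.
tagged_tokens = [
    ("news", "bank", "bank_institution"),
    ("news", "bank", "bank_institution"),
    ("news", "bank", "bank_river"),
    ("fiction", "bank", "bank_river"),
    ("fiction", "bank", "bank_river"),
    ("fiction", "bank", "bank_unlisted"),  # attested in corpus only
]

# Toy lexicon: the senses it records for each lemma. Invented data.
lexicon_senses = {"bank": {"bank_institution", "bank_river"}}


def sense_frequencies(tokens):
    """Count senses per (lemma, text_type) pair."""
    freqs = defaultdict(Counter)
    for text_type, lemma, sense in tokens:
        freqs[(lemma, text_type)][sense] += 1
    return freqs


def senses_missing_from_lexicon(tokens, lexicon):
    """Return (lemma, sense) pairs attested in the corpus but not in the lexicon."""
    return {
        (lemma, sense)
        for _, lemma, sense in tokens
        if sense not in lexicon.get(lemma, set())
    }


freqs = sense_frequencies(tagged_tokens)
missing = senses_missing_from_lexicon(tagged_tokens, lexicon_senses)
```

Counting per text type rather than globally is what makes it possible to see that a sense dominant in one genre may be marginal in another.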
… Dynamic lexicon : … Dynamic lexicon Current computational lexicons (even WordNets) are static objects, still modelled on traditional dictionaries
and suffering from the limitations imposed by the paper medium
Thinking about the complex relationships between lexicon and corpus
towards a flexible model of dynamic lexicon
extending the expressiveness of a core static lexicon, adapting to the requirements of language in use as attested in corpora
with semantic clustering techniques, etc.
Convert the extreme flexibility & multidimensionality of meaning into large-scale and exploitable (VIRTUAL?) resources: Lexicon and Corpus together
What to annotate? : What to annotate? Mix of:
Word-sense annotation (implicit semantic markup)
Semantic/conceptual markup
…
Syntagmatic relations
Dependency relations
Semantic roles
…
Need for a common Encoding Policy ? : Need for a common Encoding Policy ? Agree on common policy issues?
is it feasible?
desirable?
to what extent?
This would imply, among other things:
analysis of needs, including application-oriented/industrial ones, before any large development initiative
base semantic tagging on commonly accepted standards/guidelines ??
up to which level?
Common semantic tagset: Gold Standard??
build a core set of semantically tagged corpora, encoded in a harmonised way, for a number of languages??
make annotated corpora available to the community at large
involve the community, collect and analyse existing semantically tagged corpora
devise common set of parameters for analysis
A few Issues for discussion: MILE & lexicon standards More standardisation initiatives? : A few Issues for discussion: MILE & lexicon standards More standardisation initiatives? MILE - a general schema for encoding multilingual lexical info, as a meta-entry, as a common representational layer
Short- & medium-term requirements with respect to standards for multilingual lexicons and content encoding, including industrial requirements
Relation with Spoken language community (see ELRA)
Semantic Web standards & the needs of content processing technologies: importance of reaching consensus on (linguistic & non-linguistic) “content”, in addition to agreement on formats & encoding issues (…words convey content & knowledge)
Define further steps necessary to converge on common priorities
Broadening MILE: ... other communities : Broadening MILE: ... other communities
NLP, lexicons, terminologies, ontologies, Semantic Web:
a continuum?
Knowledge management is critical.
For “content” interoperability, need to converge around agreed standards also for the semantic/conceptual level
is the field ‘mature’ enough to converge around agreed standards also for the semantic/conceptual level (e.g. to automatically establish links among different languages)?
Is the field of multilingual lexical resources ready to tackle the challenges set by the Semantic Web development?
Foster better integration with
corpus-driven data
terminology/ontology/semantic web communities
multimodal & multimedial aspects
Oriented towards open, distributed lexical resources:
Lexical Information Servers for multiple access to lexical information repositories
A few Issues for discussion: NLP, lexicons, content, ontologies, Semantic Web: … a continuum? : A few Issues for discussion: NLP, lexicons, content, ontologies, Semantic Web: … a continuum? Need for robust systems, able to acquire/tune multilingual lexical/linguistic/conceptual knowledge, to auto-enrich static basic resources
Relation between lexical standards, acquisition & text annotation protocols
Target….. Multilingual Knowledge Management Technical Feasibility: : Target….. Multilingual Knowledge Management Technical Feasibility:
Prerequisite: is a commonly agreed text/lexicon annotation protocol, also for the semantic/conceptual level, an achievable goal (so as to be able to automatically establish links among different languages)?
Yes, at the lexical level
More complex, for corpus annotation?
EAGLES/ISLE
To make the Semantic Web a reality ... : To make the Semantic Web a reality ...
…need to tackle the twofold challenge of
content availability &
multilinguality
Natural convergence with HLT:
multilingual semantic processing
ontologies
semantic-syntactic computational lexicons
… enables a new role of Multilingual Lexicons: to become essential component for the Semantic Web : … enables a new role of Multilingual Lexicons: to become essential component for the Semantic Web Language - & lexicons - are the gateway to knowledge
Semantic Web developers need repositories of words & terms - & knowledge of their relations in language use & ontological classification
The cost of adding this structured and machine-understandable lexical information can be one of the factors that delay its full deployment
The effort of making available millions of ‘words’ for dozens of languages is something that no small group is able to afford
A radical shift in the lexical paradigm,
whereby many participants add linguistic content descriptions in an open distributed lexical framework,
is required to make the Web usable
Beyond MILE: next steps... …. towards an Open Distributed Lexical Infrastructure : Beyond MILE: next steps... …. towards an Open Distributed Lexical Infrastructure
Create a first repository of shared lexical entries “extracted” from different lexical resources & mapped to MILE (choosing e.g. lexical entries in areas related to the Olympic Games)
to test mapping different lexicon models to MILE
provide a grid with all the ISLE Basic Notions, short descriptions, attributes and sub-elements, to be filled with the corresponding “notions”
Create a list (Open Lexicon Interest Group)
...
Enhance user-adaptivity, resource sharing, cooperative creation & management
Lexical Information Servers for multiple access to lexical information repositories
(Diagram labels: Language, Knowledge)
A new paradigm for a “new generation” of LR? : A new paradigm for a “new generation” of LR?
New Strategic Vision
towards a Distributed Open Lexical Infrastructure
Focus on cooperation,
also between different communities for distributed & cooperative creation, management, etc. of Lexical Resources
MILE as a common platform
technical & organisational requirements
Beyond MILE: towards open & distributed lexicons : Beyond MILE: towards open & distributed lexicons
Semantic Lexicon
URI = http://www.xxx…
Syntactic Constructions
URI = http://www.yyy…
Ontology
URI = http://www.zzz…
Monolingual/Multilingual Lexicon
Lex_object: semFeature
URI = http://www.xxx…#HUMAN
Lex_object: syntagmaNT
URI = http://www.zzz…#NP
corpora
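The idea of lexical objects identified by a resource URI plus a fragment can be sketched in Python as follows. The `example.org` addresses are placeholders invented for the illustration (the URIs on the slide are elided and are not reconstructed here), and `group_by_resource` is a hypothetical helper.

```python
from collections import defaultdict
from urllib.parse import urldefrag

# Placeholder references to shared lexical objects; the hosts are invented.
references = [
    "http://semlex.example.org/features#HUMAN",
    "http://semlex.example.org/features#ANIMATE",
    "http://onto.example.org/categories#NP",
]


def group_by_resource(uris):
    """Group object names (URI fragments) by the resource that hosts them."""
    grouped = defaultdict(list)
    for uri in uris:
        resource, fragment = urldefrag(uri)
        grouped[resource].append(fragment)
    return dict(grouped)


grouped = group_by_resource(references)
```

Grouping by resource shows which distributed repository would have to be consulted for each object that a local entry refers to.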
A few issues for the future... : A few issues for the future... Integration b