Slide1 : Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial
Diana Maynard (University of Sheffield)
Julien Nioche (University of Sheffield)
Marta Sabou (Vrije Universiteit Amsterdam)
Johanna Völker (AIFB)
Atanas Kiryakov (Ontotext Lab, Sirma AI)
EKAW 2006
[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]
Slide2 : Structure of the Tutorial
Motivation, background
GATE overview
Information Extraction
GATE’s HLT components
IE and the Semantic Web
Ontology learning with Text2Onto
Focused ontology learning
Massive Semantic Annotation
Aims of this tutorial : Aims of this tutorial Investigates some technical aspects of HLT for the SW and brings this methodology closer to non-HLT experts
Provides an introduction to an HLT toolkit (GATE)
Demonstrates using HLT for automating SW-specific knowledge acquisition tasks such as:
Semantic annotation
Ontology learning
Ontology population
Some Terminology : Some Terminology Semantic annotation – annotate in the texts all mentions of instances relating to concepts in the ontology
Ontology learning – automatically derive an ontology from texts
Ontology population – given an ontology, populate the concepts with instances derived automatically from a text
Semantic Annotation: Motivation : Semantic Annotation: Motivation Semantic metadata extraction and annotation is the glue that ties ontologies into document spaces
Metadata is the link between knowledge and its management
Manual metadata production cost is too high
State-of-the-art in automatic annotation needs extending to target ontologies and scale to industrial document stores and the web
Challenge of the Semantic Web : Challenge of the Semantic Web The Semantic Web requires machine processable, repurposable data to complement hypertext
Once metadata is attached to documents, they become much more useful and more easily processable, e.g. for categorising, finding relevant information, and monitoring
Such metadata can be divided into two types of information: explicit and implicit.
Metadata extraction : Metadata extraction
Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
Implicit metadata extraction involves semantic information deduced from the text, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
Ontology Learning and Population: Motivation : Ontology Learning and Population: Motivation Creating and populating ontologies manually is a very time-consuming and labour-intensive task
It requires both domain and ontology experts
Manually created ontologies are generally not compatible with other ontologies, so reduce interoperability and reuse
Manual methods are impossible with very large amounts of data
Semantic Annotation vs Ontology Population : Semantic Annotation vs Ontology Population Semantic Annotation
Mentions of instances in the text are annotated wrt concepts (classes) in the ontology.
Requires that instances are disambiguated.
It is the text which is modified.
Ontology Population
Generates new instances in an ontology from a text.
Links unique mentions of instances in the text to instances of concepts in the ontology.
Instances must be not only disambiguated but also co-reference between them must be established.
It is the ontology which is modified.
GATE : an open source framework for HLT : GATE : an open source framework for HLT GATE (General Architecture for Text Engineering) is a framework for language processing (http://gate.ac.uk)
Open Source (LGPL licence)
Hosted on SourceForge http://sourceforge.net/projects/gate
Ten years old (!), with 1000s of users at 100s of sites
Current version 3.1
4 sides to the story : 4 sides to the story An architecture: A macro-level organisational picture for HLT software systems.
A framework: For programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: For language engineers, computational linguists et al, a graphical development environment.
A community of users and contributors
Slide13 :
Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API and GUI
All the world’s a Java Bean.... : All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for Language Engineering:
GATE components: modified Java Beans with XML configuration
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
Allows the system to load arbitrary language processing components
Slide15 : NOTES
everything is a replaceable bean
all communication via fixed APIs
low coupling, high modularity, high extensibility
[Diagram: GATE APIs above a Language Resource Layer (LRs) containing ontologies, Protégé ontologies, WordNet, gazetteers, ...]
In short… : In short… GATE includes:
plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages...
tools for visualising and manipulating ontologies
ontology-based information extraction tools
evaluation and benchmarking tools
GATE Users : GATE Users American National Corpus project
Perseus Digital Library project, Tufts University, US
Longman Pearson publishing, UK
Merck KgAa, Germany
Canon Europe, UK
Knight Ridder, US
BBN (leading HLT research lab), US
SMEs: Melandra, SG-MediaStyle, ...
a large number of other UK, US and EU Universities
UK and EU projects inc. SEKT, PrestoSpace, KnowledgeWeb, MyGrid, CLEF, Dot.Kom, AMITIES, CubReporter, …
Past Projects using GATE : Past Projects using GATE MUMIS: conceptual indexing: automatic semantic indices for sports video
MUSE: multi-genre multilingual IE
HSL: IE in domain of health and safety
Old Bailey: IE on 17th century court reports
Multiflora: plant taxonomy text analysis for biodiversity research in e-science
EMILLE: creation of S. Asian language corpus
ACE / TIDES: IE competitions and collaborations in English, Chinese, Arabic, Hindi
h-TechSight: ontology-based IE and text mining
Current projects using GATE : Current projects using GATE ETCSL: Language tools for Sumerian digital library
SEKT: Semantic Knowledge Technologies
PrestoSpace: Preservation of audiovisual data
KnowledgeWeb: Semantic Web network of excellence
MEDIACAMPAIGN: Discovering, inter-relating and navigating cross-media campaign knowledge
TAO : Transitioning Applications to Ontologies
MUSING : SW-based business intelligence tools
NEON : Networked Ontologies
GATE : GATE
IE is not IR : IE is not IR IE pulls facts and structured information from the content of large text collections. You analyse the facts.
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.
IE for Document Access : IE for Document Access With traditional query engines, getting the facts can be hard and slow
Where has the Queen visited in the last year?
Which places on the East Coast of the US have had cases of West Nile Virus?
Which search terms would you use to get this kind of information?
How can you specify you want someone’s home page?
IE returns information in a structured way
IR returns documents containing the relevant information somewhere (if you’re lucky)
HaSIE: an example application : HaSIE: an example application Application developed by University of Sheffield, which aims to find out how companies report about health and safety information
Answers questions such as:
“How many members of staff died or had accidents in the last year?”
“Is there anyone responsible for health and safety?”
“What measures have been put in place to improve health and safety in the workplace?”
HaSIE : HaSIE Identification of such information is too time-consuming and arduous to be done manually.
Each company report may be hundreds of pages long.
IR systems can’t help because they return whole documents
System identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information
This can then be analysed by an expert
HASIE : HASIE
Named Entity Recognition: the cornerstone of IE : Named Entity Recognition: the cornerstone of IE Identification of proper names in texts, and their classification into a set of predefined categories of interest
Persons
Organisations (companies, government organisations, committees, etc)
Locations (cities, countries, rivers, etc)
Date and time expressions
Various other types as appropriate
Why is NE important? : Why is NE important? NE provides a foundation from which to build more complex IE systems
Relations between NEs can provide tracking, ontological information and scenario building
Tracking (co-reference) “Dr Smith”, “John Smith”, “John”, “he”
Ontologies “Athens, Georgia” vs “Athens, Greece”
Two kinds of approaches : Two kinds of approaches Knowledge Engineering
rule based
developed by experienced language engineers
make use of human intuition
require only small amount of training data
development can be very time consuming
some changes may be hard to accommodate
Learning Systems
use statistics or other machine learning
developers do not need LE expertise
require large amounts of annotated training data
some changes may require re-annotation of the entire training corpus
Typical NE pipeline : Typical NE pipeline Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
Entity finding (gazetteer lookup, NE grammars)
Coreference (alias finding, orthographic coreference etc.)
Export to database / XML
An Example : An Example Ryanair announced yesterday that it will make Shannon its next European base, expanding its route network to 14 in an investment worth around €180m. The airline says it will deliver 1.3 million passengers in the first year of the agreement, rising to two million by the fifth year.
Entities: Ryanair, Shannon
Descriptions: European base
Relations: Shannon base_of Ryanair
Events: investment(€180m)
Mentions: it=Ryanair, The airline=Ryanair, it=the airline
System development cycle : System development cycle Collect corpus of texts
Manually annotate gold standard
Develop system
Evaluate performance against gold standard
Return to step 3, until desired performance is reached
Performance Evaluation : Performance Evaluation 2 main requirements:
Evaluation metric: mathematically defines how to measure the system’s performance against human-annotated gold standard
Scoring program: implements the metric and provides performance measures
For each document and over the entire corpus
For each type of NE
Evaluation Metrics : Evaluation Metrics Most common are Precision and Recall
Precision = correct answers/answers produced
Recall = correct answers/total possible correct answers
Trade-off between precision and recall
F1 (balanced) Measure = 2PR / (P + R)
Some tasks sometimes use other metrics, e.g. cost-based (good for application-specific adjustment)
Ontology-based IE requires measures sensitive to the ontology
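A minimal sketch of how a scoring program could compute these metrics from raw counts. The function name and the example counts are illustrative assumptions and are not taken from GATE.

```python
def precision_recall_f1(correct, produced, possible):
    """Return (precision, recall, F1) from raw answer counts.

    correct  -- answers the system got right
    produced -- all answers the system produced
    possible -- all correct answers in the gold standard
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    total = precision + recall
    # Balanced F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 correct answers out of 10 produced, 16 possible.
p, r, f1 = precision_recall_f1(correct=8, produced=10, possible=16)
print(f"P={p:.2f} R={r:.2f} F1={f1:.3f}")
```

The example illustrates the trade-off: the system is precise (0.80) but misses half of the possible answers (recall 0.50), and F1 falls between the two.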
GATE AnnotationDiff Tool : GATE AnnotationDiff Tool
Corpus-level Regression Testing : Corpus-level Regression Testing Need to track system’s performance over time
When a change is made we want to know implications over whole corpus
Why: because an improvement in one case can lead to problems in others
GATE offers corpus benchmark tool, which can compare different versions of the same system against a gold standard
This operates on a whole corpus rather than a single document
Corpus Benchmark Tool : Corpus Benchmark Tool
GATE’s Rule-based System - ANNIE : GATE’s Rule-based System - ANNIE ANNIE – A Nearly-New IE system
A version distributed as part of GATE
GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
GATE has a finite-state pattern-action rule language - JAPE, used by ANNIE
A reusable and easily extendable set of components
What is ANNIE? : What is ANNIE? ANNIE is a vanilla information extraction system comprising a set of core PRs:
Tokeniser
Gazetteers
Sentence Splitter
POS tagger
Semantic tagger (JAPE transducer)
Orthomatcher (orthographic coreference)
Slide41 : Core ANNIE Components
Re-using ANNIE : Re-using ANNIE Typically a new application will use most of the core components from ANNIE
The tokeniser, sentence splitter and orthomatcher are basically language, domain and application-independent
The POS tagger is language dependent but domain and application-independent
The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified
You may also require additional PRs (either existing or new ones)
DEMO of ANNIE and GATE GUI : DEMO of ANNIE and GATE GUI Loading ANNIE
Creating a corpus
Loading documents
Running ANNIE on corpus
Demo
Gazetteers : Gazetteers Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, …)
Information used by JAPE rules
Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language)
Lists can be modified either internally using Gaze, or externally in your favourite editor
Gazetteers can also be mapped to ontologies
Generates Lookup results of the given kind
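The idea of list lookup can be sketched in a few lines of Python. This is not GATE's gazetteer implementation: the in-memory dictionary stands in for the list files and their index, the list contents are made up, and matching is plain substring search with no tokenisation.

```python
# Hypothetical stand-in for gazetteer list files plus their index entry
# (majorType / minorType per list).
GAZETTEER = {
    "city": {"majorType": "location", "minorType": "city",
             "entries": ["London", "Sheffield", "Amsterdam"]},
    "company_designator": {"majorType": "organisation", "minorType": "designator",
                           "entries": ["Ltd", "Inc", "GmbH"]},
}


def lookup(text):
    """Return Lookup-style annotations as (start, end, features) tuples."""
    annotations = []
    for spec in GAZETTEER.values():
        features = {"majorType": spec["majorType"], "minorType": spec["minorType"]}
        for entry in spec["entries"]:
            start = text.find(entry)
            while start != -1:
                annotations.append((start, start + len(entry), features))
                start = text.find(entry, start + 1)
    # Sort by document position so later rules see annotations in order.
    return sorted(annotations, key=lambda a: (a[0], a[1]))


text = "Digital Pebble Ltd opened an office in Sheffield."
for start, end, features in lookup(text):
    print(text[start:end], features)
```

Rules can then test the features of each Lookup instead of the raw string, which is what makes the lists reusable across grammars.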
JAPE grammars : JAPE grammars JAPE is a pattern-matching language
The LHS of each rule contains patterns to be matched
The RHS contains details of annotations (and optionally features) to be created
The patterns in the corpus are identified using ANNIC
Input specifications : Input specifications The head of each grammar phase needs to contain certain information
Phase name
Inputs
Matching style
e.g.
Phase: location
Input: Token Lookup Number
Control: appelt
Slide50 : NE Rule in JAPE
Rule: Company1
Priority: 25
(
( {Token.orthography == upperInitial} )+ //from tokeniser
{Lookup.kind == companyDesignator} //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
=> will match “Digital Pebble Ltd”
LHS of the rule : LHS of the rule LHS is expressed in terms of existing annotations, and optionally features and their values
Any annotation to be used must be included in the input header
Any annotation not included in the input header will be ignored (e.g. whitespace)
Each annotation is enclosed in curly braces
Each pattern to be matched is enclosed in round brackets and has a label attached
Macros : Macros Macros look like the LHS of a rule but have no label
Macro: NUMBER
(({Digit})+)
They are used in rules by enclosing the macro name in round brackets
( (NUMBER)+):match
Conventional to name macros in uppercase letters
Macros hold across an entire set of grammar phases
Contextual information : Contextual information Contextual information can be specified in the same way, but has no label
Contextual information will be consumed by the rule
({Annotation1})
({Annotation2}):match
({Annotation3})
RHS of the rule : RHS of the rule LHS and RHS are separated by -->
Label matches that on the LHS
Annotation to be created follows the label
({Annotation1}):match
-->
:match.NE = {feature1 = value1, feature2 = value2}
Example Rule for Dates : Example Rule for Dates Macro: ONE_DIGIT
({Token.kind == number, Token.length == "1"})
Macro: TWO_DIGIT
({Token.kind == number, Token.length == "2"})
Rule: TimeDigital1
// 20:14:25
(
(ONE_DIGIT|TWO_DIGIT){Token.string == ":"} TWO_DIGIT
({Token.string == ":"} TWO_DIGIT)?
(TIME_AMPM)?
(TIME_DIFF)?
(TIME_ZONE)?
)
:time
-->
:time.TempTime = {kind = "positive", rule = "TimeDigital1"}
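For readers more familiar with regular expressions, the core of this rule can be approximated in Python. This is only an analogy: JAPE matches over annotations, not characters, and since the TIME_AMPM, TIME_DIFF and TIME_ZONE macros are not shown here, the sketch assumes a simple am/pm suffix and omits the other two.

```python
import re

# One or two digits, ":", two digits, optional ":ss", optional am/pm.
TIME_DIGITAL = re.compile(
    r"\b\d{1,2}:\d{2}(?::\d{2})?(?:\s?(?:am|pm))?\b",
    re.IGNORECASE,
)


def find_times(text):
    """Return one TempTime-like record per digital time found in the text."""
    return [
        {"string": m.group(0), "start": m.start(), "end": m.end(),
         "kind": "positive", "rule": "TimeDigital1"}
        for m in TIME_DIGITAL.finditer(text)
    ]


for t in find_times("The train left at 20:14:25 and arrived at 9:05 pm."):
    print(t["string"], t["rule"])
```

As in the JAPE rule, the output records carry the name of the rule that produced them, which helps when debugging a grammar with many rules.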
Identifying patterns in corpora : Identifying patterns in corpora ANNIC – ANNotations In Context
Provides a keyword-in-context-like interface for identifying annotation patterns in corpora
Uses JAPE LHS syntax, except that + and * need to be quantified
e.g. {Person}{Token}*3{Organisation} – find all Person and Organisation annotations within up to 3 tokens of each other
To use, pre-process the corpus with ANNIE or your own components, then query it via the GUI
ANNIC Demo : ANNIC Demo Formulating queries
Finding matches in the corpus
Analysing the contexts
Refining the queries
Demo
Using phases : Using phases Grammars usually consist of several phases, run sequentially
A definition phase (conventionally called main.jape) lists the phases to be used, in order
Only the definition phase needs to be loaded
Temporary annotations may be created in early phases and used as input for later phases
Annotations from earlier phases may need to be combined or modified
Matching algorithms and Rule Priority : Matching algorithms and Rule Priority Rules compete within a single phase!
3 styles of matching:
Brill (fire every rule that applies)
First (shortest rule fires)
Appelt (use of priorities)
Appelt priority is applied in the following order
Starting point of a pattern
Longest pattern
Explicit priority (default = -1)
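The Appelt ordering above can be sketched as a sort key over competing matches. The Match type and the example rules are hypothetical, and any further tie-breaking that a real JAPE engine may apply is not modelled.

```python
from typing import NamedTuple


class Match(NamedTuple):
    rule: str
    start: int
    end: int
    priority: int = -1  # default priority, as stated on the slide


def appelt_choice(candidates):
    """Pick the winning match: earliest start, then longest, then highest priority."""
    return min(candidates, key=lambda m: (m.start, -(m.end - m.start), -m.priority))


candidates = [
    Match("ShortHighPriority", start=0, end=2, priority=50),
    Match("LongLowPriority", start=0, end=3, priority=10),
    Match("LongHighPriority", start=0, end=3, priority=25),
    Match("LaterStart", start=1, end=9, priority=100),
]
print(appelt_choice(candidates).rule)  # LongHighPriority
```

Note that the explicit priority only matters between matches with the same start and the same length, which is why the rule with priority 100 loses here.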
Slide61 : Named Entities in GATE
Using co-reference : Using co-reference Orthographic co-reference module matches proper names in a document
Improves results by assigning entity type to previously unclassified names, based on relations with classified entities
May not reclassify already classified entities
Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
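A toy sketch of this idea, covering only the two cases mentioned above (a surname matching a full name, and an acronym matching a company name). The matching rules and the stop list are simplifying assumptions, not the actual orthomatcher logic.

```python
# Assumed list of company designators to ignore when building an acronym.
DESIGNATORS = {"Ltd", "Ltd.", "Inc", "Inc.", "plc"}


def acronym(name):
    """Build an acronym from the capitalised words of a name."""
    return "".join(w[0] for w in name.split()
                   if w not in DESIGNATORS and w[0].isupper())


def ortho_match(short, full):
    """True if `short` is the last word of `full` or its acronym."""
    tokens = full.split()
    if not tokens:
        return False
    return short == tokens[-1] or short == acronym(full)


def propagate_types(classified, unclassified):
    """Assign entity types to unclassified names that match a classified one."""
    result = {}
    for name in unclassified:
        for full, entity_type in classified.items():
            if ortho_match(name, full):
                result[name] = entity_type
                break
    return result


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organisation",
}
print(propagate_types(classified, ["Bonfield", "IBM", "Shannon"]))
```

Names that match nothing, such as "Shannon" here, simply stay unclassified.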
Named Entity Coreference : Named Entity Coreference
GATE 4.0 : GATE 4.0 Due before the end of 2006
Faster and leaner!
Nicer GUI
ANNIC included
Improved Machine Learning API (based on YALE)
and more…
Information Extraction for the Semantic Web : Information Extraction for the Semantic Web Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.
For the Semantic Web, we need information in a hierarchical structure
Idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
Richer NE Tagging : Richer NE Tagging Attachment of instances in the text to concepts in the domain ontology
Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
Magpie: an example : Magpie: an example Developed by the Open University
Plugin for standard web browser
Automatically associates an ontology-based semantic layer to web resources, allowing relevant services to be linked
Provides means for a structured and informed exploration of the web resources
e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.
MAGPIE in action : MAGPIE in action
MAGPIE in action : MAGPIE in action
GATE and the Semantic Web : GATE and the Semantic Web Supports ontologies as part of IE applications - Ontology-Based IE (OBIE)
Supports semantic annotation and ontology population
Can combine learning and rule-based methods
Allows combination of IE and IR
Enables use of large-scale linguistic resources for IE, such as WordNet
Ontology Management in GATE : Ontology Management in GATE
Linking the Text to the Ontology : Linking the Text to the Ontology
Exported Database : Exported Database
Evaluation for OBIE : Evaluation for OBIE Traditional IE is evaluated in terms of Precision, Recall and F-measure.
But these are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
Similarity metrics need to be integrated so that items closer together in the hierarchy are given a higher score, if wrong
Augmented Precision and Recall : Augmented Precision and Recall Development of a new BDM (Balanced Distance Metric) which compares key and response concepts wrt a given ontology
In the case of ontological mismatch, provides an indication of how serious the error is, and weights it accordingly
BDM provides a score between 0 and 1 for each key/response match instead of a binary measure
Augmented Precision and Recall : Augmented Precision and Recall
BDM is integrated with traditional Precision and
Recall in the following way to produce a score
at the corpus level:
Examples of misclassification : Examples of misclassification
Ontology Learning with Text2Onto : Ontology Learning with Text2Onto http://ontoware.org/projects/text2onto/ Johanna Völker
voelker@aifb.uni-karlsruhe.de
Institute AIFB
University of Karlsruhe
Agenda : Agenda Ontology Learning
Tasks
Problems
Text2Onto
Overview
Architecture
Linguistic preprocessing
Ontology learning approaches
Summary
Ontology Learning : Ontology Learning Extraction of (domain) ontologies from natural language text
Machine learning
Natural language processing
Tools: OntoLearn, OntoLT, ASIUM, Mo’K Workbench, JATKE, TextToOnto, …
Ontology Learning – Tasks : Ontology Learning – Tasks
Slide84 : instance-of( Hewlett Packard, organization ) subclass-of( research, activity )
Slide85 : reach( information, people ) address_in( issue, article ) subclass-of( resource, knowledge )
Ontology Learning – ProblemsText Understanding : Ontology Learning – Problems Text Understanding Words are ambiguous
‘A bank is a financial institution. A bank is a piece of furniture.’
subclass-of( bank, financial institution ) ?
Natural Language is informal
‘The sea is water.’
subclass-of( sea, water ) ?
Sentences may be underspecified
‘Mary started the book.’
read( Mary, book_1 ) ?
Anaphora
‘Peter lives in Munich. This is a city in Bavaria.’
instance-of( Munich, city ) ?
Metaphors, …
Ontology Learning – Problems Knowledge Modeling : What is an instance / concept?
‘The koala is an animal living in Australia.’
instance-of( koala, animal )
subclass-of( koala, animal ) ?
How to deal with opinions and quoted speech?
‘Tom thinks that Peter loves Mary.’
love( Peter, Mary ) ?
Knowledge is changing
instance-of( Pluto, planet ) ?
Conclusion:
Ontology learning is difficult.
What we can learn is fuzzy and uncertain.
Ontology maintenance is important.
Text2Onto : Text2Onto Support for (semi-)automatic ontology extraction from natural language text
Support for ontology maintenance and data-driven ontology evolution by incremental ontology learning
Model of Possible Ontologies (POM)
Confidence / relevance values attached to all concepts, instances and relations
Enhanced user interaction
Maintenance of multiple modeling alternatives in parallel
Independence of any particular ontology language
Slide89 : subclass-of( user, human ) / confidence 1.0 subclass-of( document, communication ) / confidence 0.75
Text2Onto – Evidence, Reference and Change Management : Explicit modeling of evidence
Algorithms provide different types of evidence
Explanation component
References for annotation and change detection
Explicit modeling of changes
Corpus, evidence, reference and ontology changes
Future work: ontology change strategies
Text2Onto – Workflow : Text2Onto – Workflow Workflow composition
Complex algorithms
Different types of algorithms for each ontology learning task
Flexible combination of results
Combination strategies
minimum, maximum, average, linear,
classifier, …
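A sketch of how such combination strategies could merge the confidences that several algorithms assign to the same ontology element. The function is illustrative and not Text2Onto's API; in particular, reading "linear" as a normalised weighted sum is an assumption, and the classifier strategy is left out.

```python
def combine(confidences, strategy="average", weights=None):
    """Merge confidence values for one ontology element into a single score."""
    if not confidences:
        raise ValueError("at least one confidence value is required")
    if strategy == "minimum":
        return min(confidences)
    if strategy == "maximum":
        return max(confidences)
    if strategy == "average":
        return sum(confidences) / len(confidences)
    if strategy == "linear":
        # Assumption: weighted sum with weights normalised to sum to 1.
        if weights is None or len(weights) != len(confidences):
            raise ValueError("linear combination needs one weight per confidence")
        return sum(w * c for w, c in zip(weights, confidences)) / sum(weights)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical scores from three algorithms for one subclass-of relation.
scores = [1.0, 0.5, 0.75]
for name in ("minimum", "maximum", "average"):
    print(name, combine(scores, name))
print("linear", combine(scores, "linear", weights=[2, 1, 1]))
```

The choice of strategy decides how cautious the result is: minimum requires all algorithms to agree, maximum trusts the most confident one.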
Slide92 : [Architecture diagram of Text2Onto. Labelled components: POM Visualization, Workflow Manager, API, GATE, Corpus, Algorithm Controller, OWL Writer, RDFS Writer, F-Logic Writer, POM, Evidence Store, Reference Store, Ontology]
Linguistic Preprocessing GATE : Linguistic Preprocessing GATE Standard ANNIE components for
Tokenization
Sentence splitting
POS tagging
Stemming / lemmatizing
Self-defined JAPE patterns and processing resources for
Stop word detection
Shallow parsing
GATE applications for English, German and Spanish
Ontology Learning Approaches Concept Classification : Ontology Learning Approaches Concept Classification Heuristics
‘image processing software’
subclass-of( image processing software, software )
Patterns
‘animals such as dogs’
‘dogs and other animals’
‘a dog is an animal’
subclass-of( dog, animal )
JAPE Patterns for Ontology Learning : JAPE Patterns for Ontology Learning rule: Hearst_1
(
(NounPhrase):superconcept
{SpaceToken.kind == space}
{Token.string=="such"}
{SpaceToken.kind == space}
{Token.string=="as"}
{SpaceToken.kind == space}
(NounPhrasesAlternatives):subconcept
):hearst1
-->
:hearst1.SubclassOfRelation = { rule = "Hearst1" },
:subconcept.Domain = { rule = "Hearst1" },
:superconcept.Range = { rule = "Hearst1" }
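The same "such as" pattern can be approximated with a regular expression. This sketch assumes single-word noun phrases and works on raw text, whereas the JAPE rule relies on NounPhrase annotations from earlier processing; no singularisation or stemming is applied.

```python
import re

# "<super> such as <sub>[, <sub>]* [and|or <sub>]"
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:(?:, | and | or )\w+)*)")


def hearst_such_as(text):
    """Return (subconcept, superconcept) pairs suggested by 'such as' patterns."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        superconcept = match.group(1)
        for subconcept in re.split(r", | and | or ", match.group(2)):
            pairs.append((subconcept, superconcept))
    return pairs


for sub, sup in hearst_such_as("We saw animals such as dogs, cats and horses."):
    print(f"subclass-of( {sub}, {sup} )")
```

Each pair is only a candidate relation; in a setting like the one described here it would be kept with a confidence value rather than asserted directly.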
Ontology Learning Approaches Instance Classification : Ontology Learning Approaches Instance Classification Context similarity
‘Columbus is the capital of the state of Ohio.
Columbus has a population of about 700.000
inhabitants.’
Columbus ( capital (1), state (1), Ohio (1), population (1), inhabitant (1) )
city ( country (2), state (1), inhabitant (2), mayor (1), attraction (1) )
explorer( ship (1), sailor (2), discovery (1) )
instance-of( Columbus, city )
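The context vectors above can be compared directly. The slide does not name the similarity measure, so cosine similarity is assumed here; the vectors are copied from the example.

```python
import math

columbus = {"capital": 1, "state": 1, "Ohio": 1, "population": 1, "inhabitant": 1}
concepts = {
    "city": {"country": 2, "state": 1, "inhabitant": 2, "mayor": 1, "attraction": 1},
    "explorer": {"ship": 1, "sailor": 2, "discovery": 1},
}


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


scores = {name: cosine(columbus, vector) for name, vector in concepts.items()}
best = max(scores, key=scores.get)
print(scores)
print(f"instance-of( Columbus, {best} )")
```

"Columbus" shares the context words "state" and "inhabitant" with "city" and nothing with "explorer", so "city" wins with a similarity of about 0.40.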
Ontology Learning Approaches Relation Extraction : Ontology Learning Approaches Relation Extraction Subcategorization frames
‘Tina drives a Ford.’
instance-of( Tina, person )
instance-of( Ford, vehicle )
‘Her father drives a bus.’
subclass-of( father, person )
subclass-of( bus, vehicle )
subcat: drive( subj: person, obj: vehicle )
drive( person, vehicle )
Slide98 : incluyen( ontologiás, definiciones ) / confidence 1.0
Other Ontology Learning Approaches : Other Ontology Learning Approaches WordNet
Hyponym( ‘bank’, ‘institution’ )
subclass-of( bank, institution ) ?
Google
‘cities such as London’, ‘persons such as London’ …
‘such as London’
instance-of( London, city ) ?
Instance clustering
Hierarchical clustering of context vectors
Formal Concept Analysis (FCA)
breathe( animal )
breathe( human ), speak( human )
subclass-of( human, animal ) ?
Summary : Summary Ontology Learning is difficult, because
Language is fuzzy
Knowledge is changing
Text2Onto targets these problems
Model of Possible Ontologies
Heterogeneous sources of evidence
Incremental ontology learning
Thanks! : Thanks! http://www.aifb.de/WBS/jvo/ontology-learning
http://www.ontoware.org/projects/text2onto
Focused Ontology Learning with GATE : Focused Ontology Learning with GATE Marta Sabou A Practical Report on Learning Web Service Ontologies
Slide104 : Goal of the Talk
The goal of this talk is:
To describe a Semantic Web relevant task: Focused Ontology Learning.
To exemplify this task in the context of Web Services.
To show how focused ontology learning can be implemented in GATE.
The focus of the talk is NOT ontology learning but the elements of GATE that helped to perform this task.
Slide105 : Outline 1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
Slide106 : Ontology Learning in Restricted Domains
Previous talk’s conclusion:
Generic Ontology Learning is important but difficult because:
Language is fuzzy
Knowledge is changing
However...
The Semantic Web is increasingly used in specialized domains, where:
Language exhibits (strong) domain characteristics
e.g., mathematics, medicine
The Knowledge to be extracted is defined by the task for which the ontology will be used
e.g., searching patient records, accessing drug related articles
Focused Ontology Learning:
is Ontology Learning in a restricted domain, for a well-defined task
therefore, simpler than Ontology Learning in general
more and more frequent with the growth of the Semantic Web
Slide107 : Focused Ontology Learning Focused Ontology Learning characteristics:
1. (Small) corpus with special (domain/context) characteristics;
2. Well defined ontological knowledge to be extracted;
3. An easy to detect correspondence between text characteristics
and ontology elements;
4. Usually an easy solution (adaptation of OL techniques);
5. Implemented/adapted by a non-NLP expert.
What is needed to support domain experts?
libraries of basic NLP tools/data structures;
tools to easily adapt/combine these NLP elements;
intuitive way to create and debug own applications;
usability plays an important role;
generic methodologies of ontology learning rather than hard-coded algorithms.
Slide108 : Outline 1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
Slide109 : Context - Semantic Web Services
* Semantic WS - semantically annotated WS
* to automate discovery, composition, execution
[Diagram residue: an RDF fragment with rdf:ID="WS1"]
=> broad domain coverage
But
…increasing nr. of web services
Slide110 : A real life story…
Semantic Grid middleware to support in silico experiments in biology
Bioinformatics programs are exposed as semantic web services
550 concepts, but only 125 (23%) used for SWS tasks
600 services
Our GOAL:
Support Expert to learn:
From more services
In less time
A “Better” ontology (for SWS descriptions)
Slide111 : FOL Characteristics - 1
1. (Small) corpus with special (domain/context) characteristics
* Data Source:
* short descriptions of service functionalities
* characteristics:
* small corpora (100/200 documents)
* employ specific style (sublanguage)
Examples:
Replace or delete sequence sections.
Find antigenic sites in proteins.
Cai codon usage statistic.
Slide112 : FOL Characteristics - 2
2. Well defined ontology structure to be extracted
Web Service Ontologies contain:
A Data Structure hierarchy
A Functionality hierarchy
Slide113 : FOL Characteristics - 3
3. An easy to detect correspondence between text characteristics and ontology elements
Replace or delete sequence sections.
Slide114 : FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques).
E.g. POS Tagging
[Figure labels: Generic Solution, Implementation]
Slide115 : FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques).
E.g. Dependency Parsing
Slide116 : Outline 1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
Slide117 : GATE Implementation
* Easy to follow extraction (step by step)
* Easy to adapt for domain engineers
Slide118 : Pattern based rules – Example
(
(DET)*:det
( (ADJ)|(NOUN))*:mods
(NOUN):hn
):np
-->
:np.NP={}
A noun phrase consists of:
zero or more determiners;
zero or more modifiers which can be adjectives or nouns;
one noun which is the head-noun.
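The noun-phrase rule can also be mimicked outside GATE. The following is a minimal sketch, not GATE's API: it assumes the tokens already carry coarse POS tags named DET, ADJ and NOUN, and the function name `find_noun_phrases` is hypothetical. It encodes the tag sequence as a string and applies the equivalent regular expression.

```python
import re

# One letter per coarse POS tag so the rule becomes an ordinary regex.
TAG_CODES = {"DET": "D", "ADJ": "A", "NOUN": "N"}

# (DET)* ((ADJ)|(NOUN))* (NOUN): determiners, modifiers, then the head noun.
NP_PATTERN = re.compile(r"(?P<det>D*)(?P<mods>[AN]*)(?P<hn>N)")


def find_noun_phrases(tagged_tokens):
    """Return one dict per noun phrase with its det, mods and hn parts.

    tagged_tokens is a list of (word, tag) pairs; tags outside TAG_CODES
    are encoded as 'x' and therefore break a phrase.
    """
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged_tokens)
    words = [word for word, _ in tagged_tokens]
    phrases = []
    for match in NP_PATTERN.finditer(codes):
        phrases.append({
            "det": words[match.start("det"):match.end("det")],
            "mods": words[match.start("mods"):match.end("mods")],
            "hn": words[match.start("hn")],
        })
    return phrases


# Hand-tagged service description (the tags are an assumption of this sketch).
sentence = [("Find", "VERB"), ("antigenic", "ADJ"), ("sites", "NOUN"),
            ("in", "PREP"), ("proteins", "NOUN")]
for phrase in find_noun_phrases(sentence):
    print(phrase)
```

As in the rule, the last noun of a run is taken as the head noun and any preceding adjectives or nouns become modifiers.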
Slide119 : Outline 1) Generic Problem:
* Focused Ontology Learning
(definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies
(Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
Slide120 : Performance Evaluation
Steps: Linguistic Analysis, Extraction Patterns, Ontology Building, Ontology Pruning
A set of important terms is extracted.
Terms are indicated by annotations of type: NP, Funct.
* The correctness of these terms has a direct influence on the correctness of the Ontology Building (OB) step => evaluating them is important.
The Corpus Benchmark Tool of GATE compares annotation types in 2 corpora, usually:
the manually annotated Gold Standard corpus and
the automatically annotated corpus.
It identifies correct, missed and spurious annotations of a certain type and computes Precision and Recall for each document and for the whole corpus.
Slide121 : Example 1: Performance Evaluation
105_profit.xml; Keys : 2 Resp : 3
Scan a sequence or database with a matrix or profile.
Gold Standard Annotations:
Funct(scan_sequence)
Funct(scan_database)
Automatic Annotations:
Funct(scan_sequence)
Funct(scan_database)
Funct(scan_profile)
Correct = correctly identified annotations (true positives)
Spurious = incorrect annotations (false positives)
Slide122 : Example 2: Performance Evaluation
104_printsextract.xml; Keys : 1 Resp : 0
Preprocess the prints database for use with the program pscan.
Gold Standard Annotations:
Funct(preprocess_prints database)
Automatic Annotations:
(none)
Missed = unidentified annotations (false negatives)
Slide123 : Performance Evaluation
Statistics: GoldStandard_Terms, Extracted_Terms, correct, missed, spurious
Precision = correct / All_Extr
Recall = correct / All_GS
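The counts and ratios can be reproduced for the two example documents. This is a minimal sketch using exact-match comparison only. It is not the Corpus Benchmark Tool itself, the function name `evaluate` is hypothetical, and reporting 0.0 when a denominator is empty is a convention assumed here.

```python
def evaluate(gold, extracted):
    """Compare gold-standard and extracted terms by exact match."""
    gold, extracted = set(gold), set(extracted)
    correct = gold & extracted      # true positives
    missed = gold - extracted       # false negatives
    spurious = extracted - gold     # false positives
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return {
        "correct": len(correct),
        "missed": len(missed),
        "spurious": len(spurious),
        "precision": precision,
        "recall": recall,
    }


# Example 1 (105_profit.xml): 2 gold terms, 3 extracted terms.
example1 = evaluate(
    gold=["scan_sequence", "scan_database"],
    extracted=["scan_sequence", "scan_database", "scan_profile"],
)
# Example 2 (104_printsextract.xml): 1 gold term, nothing extracted.
example2 = evaluate(gold=["preprocess_prints database"], extracted=[])

print(example1)
print(example2)
```

For Example 1 this gives 2 correct and 1 spurious term, so Precision is 2/3 and Recall is 1. For Example 2 the single gold term is missed, so Recall is 0.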
Slide124 : Performance Evaluation
PROS:
It is very important when developing term extraction.
It allows evaluating:
1) the performance of the linguistic analyses
2) the coverage of the patterns
Allows comparing the performance of different tools:
E.g. two different POS taggers
Easy to use (both from GUI and command line)
Possible improvement:
* The current textual output does not give direct access to all spurious or all missing annotations (these are important when fine-tuning the extraction).
* We try to improve this usability issue through visualisation.
Slide125 : Summary
Focused Ontology Learning = OL in a restricted domain.
Example FOL = OL for Web Services.
GATE supports the development of FOL in many ways:
allows easy reuse and combination of basic NLP modules;
offers software libraries for fundamental NLP data structures (Documents, Corpora, Annotations);
incorporates evaluation mechanisms;
easy to debug and use for non-NLP experts.
Slide126 : Structure of the Tutorial
Motivation, background
GATE overview
Information Extraction
GATE’s HLT components
IE and the Semantic Web
Ontology learning with Text2Onto
Focused ontology learning
Massive Semantic Annotation
KIM Platform An Overview : KIM Platform An Overview
Atanas Kiryakov
Ontotext Lab, Sirma AI
naso@sirma.bg
http://www.ontotext.com/kim/
Semantic Annotation: An example : Semantic Annotation: An example
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
[Figure: Ontology & KB graph; node labels: Company, City, Country, Location, “03/11/1978”; edge labels: type, HQ, establOn, partOf]
Semantic Annotation of NEs : Semantic Annotation of NEs A Semantic Annotation of the named entities (NEs) in a text includes:
a recognition of the type of the entities in the text
out of a rich taxonomy of classes (not a flat set of 10 types);
an identification of the entities, which is also a reference to their semantic description.
The traditional (IE-style) NE recognition approach results in:
Lama Ole Nydahl
The Semantic Annotation of NEs results in:
Lama Ole Nydahl
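The difference between the two results can be sketched as a data structure. This is a minimal illustration under stated assumptions: the `Annotation` class, the class name `ReligiousPerson` and both `example.org` URIs are hypothetical and are not taken from KIM or PROTON.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    start: int                          # character offset where the mention starts
    end: int                            # character offset where the mention ends
    entity_type: str                    # flat NE type or ontology class
    instance_uri: Optional[str] = None  # reference to the entity description


def is_semantic(annotation):
    """A semantic annotation also identifies the entity, not just its type."""
    return annotation.instance_uri is not None


text = "Lama Ole Nydahl"

# Traditional (IE-style) NE recognition: one type out of a small flat set.
flat = Annotation(0, len(text), "Person")

# Semantic annotation: a class from a rich taxonomy plus an entity identifier
# (both URIs below are made up for this sketch).
semantic = Annotation(
    0,
    len(text),
    "http://example.org/ontology#ReligiousPerson",
    "http://example.org/kb#Lama_Ole_Nydahl",
)

print(is_semantic(flat), is_semantic(semantic))
```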
Platforms for Large-Scale Semantic Annotation : Platforms for Large-Scale Semantic Annotation Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
Automated alias discovery
Generate SemWeb output (RDF, OWL)
Stand-off storage and indexing of metadata
Use large instance bases to disambiguate to
Ontology servers for reasoning and access
Architecture elements:
Crawler, onto storage, doc indexing, query, annotators
Apps: sem browsers, authoring tools, etc.
The KIM Platform : The KIM Platform A platform offering services and infrastructure for:
(semi-) automatic semantic annotation and
ontology population
semantic indexing and retrieval of content
query and navigation over the formal knowledge
Based on an Information Extraction technology
Aim: to arm Semantic Web applications
by providing a metadata generation technology
in a standard, consistent, and scalable framework
KIM Architecture : KIM Architecture
[Figure: architecture diagram; component labels: KIM Server, Semantic Repository API, Semantic Annotation API, Query API, Index API, Document Persistence API, RMI, KIM Web UI, Annotation Server, News Collector, Any Web Browser, Browser Plug-in, Custom Applications, Custom Back-end, Custom IE, Entity Ranking]
PROTON Ontology : PROTON Ontology a light-weight upper-level ontology;
250 NE classes;
100 relations and attributes;
200,000 entity descriptions;
covers mostly NE classes, and ignores general concepts;
includes classes representing lexical resources.
proton.semanticweb.org
KIM Scaling on Data : KIM Scaling on Data The Semantic Repository is based on Sesame.
Our practical tests demonstrate good performance on top of:
1.2M entity descriptions:
about 15M explicit statements;
over 30M statements after forward chaining.
Document and annotation storage and indexing with Lucene:
0.5M docs, processed on a $1000-worth machine;
retrieval in milliseconds.
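Forward chaining is why the statement count grows beyond the explicit statements: inferred statements are materialised at load time. The following is a minimal sketch with two illustrative rules and a made-up handful of triples. It does not reproduce the rule set or API of Sesame or KIM.

```python
def forward_chain(explicit):
    """Materialise all statements derivable from two illustrative rules.

    Rule 1: (a partOf b), (b partOf c)    => (a partOf c)
    Rule 2: (x type C), (C subClassOf D)  => (x type D)
    """
    statements = set(explicit)
    while True:
        inferred = set()
        for (s, p, o) in statements:
            for (s2, p2, o2) in statements:
                if o != s2:
                    continue  # the two statements do not chain
                if p == "partOf" and p2 == "partOf":
                    inferred.add((s, "partOf", o2))
                elif p == "type" and p2 == "subClassOf":
                    inferred.add((s, "type", o2))
        new = inferred - statements
        if not new:
            return statements  # fixpoint reached, nothing more to derive
        statements |= new


# Hypothetical explicit statements (names chosen for illustration only).
explicit = {
    ("London", "partOf", "UK"),
    ("UK", "partOf", "Europe"),
    ("Vodafone", "type", "MobileOperator"),
    ("MobileOperator", "subClassOf", "TelecomCompany"),
    ("TelecomCompany", "subClassOf", "Company"),
}
closure = forward_chain(explicit)
print(len(explicit), len(closure))  # → 5 8
```

The trade-off is more stored statements in exchange for queries that need no reasoning at retrieval time.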
Simple Usage: Highlight, Hyperlink, and … : Simple Usage: Highlight, Hyperlink, and …
Simple Usage: … Explore and Navigate : Simple Usage: … Explore and Navigate
How KIM Searches Better : How KIM Searches Better KIM can match a Query:
Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002.
With a document containing:
“At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”
Classical IR could not match:
Vodafone with a “telecom in Europe”, because:
Vodafone is a mobile operator, which is a sort of a telecom;
Vodafone is in the UK, which is a part of Europe.
10th of May with a “date in first half of 2002”;
“John G. Smith” with “John Smith”.
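The three matches can be sketched with a tiny hand-built knowledge base. This is an illustration under stated assumptions, not KIM's query API: all names and helper functions are hypothetical, and the sketch assumes the date of the board meeting has already been resolved to the year 2002.

```python
from datetime import date

# Hypothetical mini knowledge base: child -> parent links.
SUBCLASS_OF = {"MobileOperator": "TelecomCompany"}
PART_OF = {"UK": "Europe"}
ENTITIES = {"Vodafone": {"type": "MobileOperator", "locatedIn": "UK"}}


def ancestors(start, parent_of):
    """Yield start and everything reachable by following parent_of links."""
    node = start
    while node is not None:
        yield node
        node = parent_of.get(node)


def is_a(entity, cls):
    """True if the entity's class is cls or a subclass of it."""
    return cls in ancestors(ENTITIES[entity]["type"], SUBCLASS_OF)


def located_in(entity, region):
    """True if the entity's location is region or a part of it."""
    return region in ancestors(ENTITIES[entity]["locatedIn"], PART_OF)


def in_first_half_of_2002(day):
    return date(2002, 1, 1) <= day <= date(2002, 6, 30)


def same_person(mention, query_name):
    """Loose alias check: first and last name agree, middle initials ignored."""
    m, q = mention.split(), query_name.split()
    return (m[0], m[-1]) == (q[0], q[-1])


matches = (
    is_a("Vodafone", "TelecomCompany")
    and located_in("Vodafone", "Europe")
    and in_first_half_of_2002(date(2002, 5, 10))  # year assumed resolved
    and same_person("John G. Smith", "John Smith")
)
print(matches)
```

A keyword index alone has none of the class, location, date or alias knowledge used here, which is why classical IR misses the document.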
Entity Pattern Search : Entity Pattern Search
Pattern Search: Entity Results : Pattern Search: Entity Results
Entity Pattern Search: KIM Explorer : Entity Pattern Search: KIM Explorer
Pattern Search, Referring Documents : Pattern Search, Referring Documents
Document Details : Document Details
Summary : Summary KIM is a platform for:
semantic annotation and ontology population,
semantic indexing and retrieval,
providing an API for remote access and integration,
based on Information Extraction (IE) using GATE.
KIM is:
Robust
Scalable
General-purpose, off-the-shelf platform!
THANK YOU! : THANK YOU! (for not snoring)
The slides: http://www.gate.ac.uk/sale/talks/ekaw2006/ekaw2006-tutorial.ppt
[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]