Slide1: After OWL: de facto standards
for semantic technologies
(or: what do you get for €40m
EU research money?)
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Valentin Tablan, Diana Maynard, Wim Peters, Niraj Aswani, Milena Yankova, Yaoyong Li, Akshay Java, Michael Dowman
ILASH workshop, March 2004
Structure of the talk: Context:
increasing use of “semantic” technology in IT
the role(s) of human language technology
substantial investment in the next phase of semantic web research
Semantic Web: moving on from formal standards
Acronym soup:
GATE: HLT API 4 SDK SW & KT
An application: Ontology-Based IE in KIM
Issues in API design, next steps
The Knowledge Economy and Human Language: Gartner, December 2002:
taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
through 2012 more than 95% of human-to-computer information input will involve textual language
A contradiction:
to deal with the information deluge we need formal knowledge in semantics-based systems
our information spaces are in informal and ambiguous natural language
The challenge: to reconcile these two phenomena
Slide4: HLT: Closing the Loop
[Figure: Human Language and Formal Knowledge (ontologies and instance bases), linked by (A)IE, CLIE, OIE, (M)NLG and Controlled Language; application areas: Semantic Web; Semantic Grid; Semantic Web Services]
KEY
MNLG: Multilingual Natural Language Generation
OIE: Ontology-aware Information Extraction
AIE: Adaptive IE
CLIE: Controlled Language IE
SEKT: Semantic Knowledge Technology: 6th framework IP project
Duration: 36 months from 1/1/2004, €12.5m
http://sekt.semanticweb.org/
Improve automation of ontology and metadata generation
Develop highly-scalable solutions
Research sound inferencing despite inconsistent models
Develop semantic knowledge access tools
Develop methodology for deployment
PrestoSpace (20th Century Rot): 20th Century audio-visual media is rapidly disappearing
Preservation and restoration are high cost
The costs must be justified by increased access
“Metadata”: descriptive information about content
PrestoSpace (€9m IP, 40 months from 02/04):
rich metadata and semantic access
cross-lingual access
syndicated delivery
repurposeable content
The “SDK” research cluster:
“Building the European Research Area” in KM through collaboration with related IP and NoE projects in this area for a coordinated impact strategy
SEKT, DIP, KnowledgeWeb – SDK cluster: http://sdk.semanticweb.org/
Other related projects:
AceMedia IP (semantic knowledge systems)
PrestoSpace IP (cultural heritage / digital libraries)
BRICKS IP (cultural heritage / digital libraries)
Total EU/6FP investment in semantic tech. research: €40m, with the potential to influence the emergence of de facto standards
Next step for Semantics tech: from formal to de facto standards?
Computer scientists love standards, so we have many
For any given problem there are usually 3 “standards”
OWL is no exception: Lite, DL, Full
There are good reasons, but cf. RDF(S) implementation history: applications will of necessity mix and match
If we can achieve standard practice and libraries in applications we will have made a next step and will promote takeup
(Pathological) example: TCP/IP vs. OSI
HLT API 4 SDK SW & KT: What sorts of software do we need?
Ontology and metadata management: storage; versioning; caching; inferencing; etc. (below)
Human language technology components and services (not monolithic systems, not unproven research prototypes)
The role of measurement in scaling and robustness: in HLT this means MUC, TREC, ACE, TIDES, ...
Here’s one we baked earlier....
GATE (the Volkswagen Beetle of Language Processing) is: Eight years old, with the largest user constituency of its type
An architecture: a macro-level organisational picture for LE software systems.
A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
A development environment: for language engineers, computational linguists et al, a graphical development environment.
Some free components... ...and wrappers for other people's components
Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
Free software (LGPL). Download at http://gate.ac.uk/download/
Critical mass: 000s people 00s sites: GATE team projects. Past:
Conceptual indexing: MUMIS: automatic semantic indices for sports video
MUSE, cross-genre entity finder
HSL, Health-and-safety IE
Old Bailey: collaboration with HRI on 17th century court reports
Multiflora: plant taxonomy text analysis for biodiversity research e-science
EMILLE: S. Asian language corpus
ACE / TIDES: Arabic, Chinese NE
JHU summer w/s on semtagging
Present:
Advanced Knowledge Technologies: €12m UK five site collaborative project
ETCSL: Sumerian digital library
MiAKT: medical informatics / AKT
SEKT: Semantic Knowledge Tech
PrestoSpace: AV Preservation
KnowledgeWeb; h-TechSight
GATE users = significant proportion of community. A small sample:
the American National Corpus project
the Perseus Digital Library project, Tufts University, US
Longman Pearson publishing, UK
Merck KGaA, Germany
Canon Europe, UK
Knight Ridder, US
BBN (leading HLT research lab), US
SMEs: Melandra, SG-MediaStyle, ...
Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, CubReporter, Poesia...
Slide12:
Architectural principles
Non-prescriptive, theory neutral (strength and weakness)
Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Weka, interoperation with SCHUG in MUMIS)
(Almost) everything is a component, and component sets are user-extendable
(Almost) all operations are available both from API and GUI
Why does this matter? It means that GATE works well with other tools, embeds easily, and achieves robustness through focus (API requirements)
All the world’s a Java Bean....:
CREOLE: a Collection of REusable Objects for Language Engineering:
GATE components: modified Java Beans with XML configuration
The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
Allows the system to load arbitrary language processing components
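The idea of loading arbitrary components from a small amount of code plus a configuration entry can be sketched as follows. This is a minimal illustration in Python for brevity, not GATE's CREOLE API (GATE itself is Java with XML configuration); `ComponentSpec`, `Registry` and `WhitespaceTokeniser` are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComponentSpec:
    """Hypothetical stand-in for one component's configuration entry."""
    name: str
    factory: Callable[..., object]
    defaults: Dict[str, object] = field(default_factory=dict)


class Registry:
    """Maps component names to specs so callers never import concrete classes."""

    def __init__(self) -> None:
        self._specs: Dict[str, ComponentSpec] = {}

    def register(self, spec: ComponentSpec) -> None:
        self._specs[spec.name] = spec

    def create(self, name: str, **params: object) -> object:
        spec = self._specs[name]
        # Explicit parameters override the defaults from the configuration.
        return spec.factory(**{**spec.defaults, **params})


class WhitespaceTokeniser:
    """Toy language processing component with one configurable parameter."""

    def __init__(self, lowercase: bool = False) -> None:
        self.lowercase = lowercase

    def execute(self, text: str) -> List[str]:
        tokens = text.split()
        return [t.lower() for t in tokens] if self.lowercase else tokens


registry = Registry()
registry.register(
    ComponentSpec("tokeniser", WhitespaceTokeniser, {"lowercase": True})
)
tokeniser = registry.create("tokeniser")
tokens = tokeniser.execute("GATE loads Components")
```

The host system only ever sees a name and a set of parameters, so a component can be swapped for another with the same `execute` method without changing the caller.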
Slide14: NOTES
everything is a replaceable bean
all communication via fixed APIs
low coupling, high modularity, high extensibility
NOTES (2)
e.g. the Protégé Language Resource (LR) and Visual Resource (VR) are both wrapped in the Resource (bean) API
ontology repositories and inference are the same: KAON + Sesame + Orenge + ?
[Figure: GATE APIs over a Language Resource Layer (LRs) containing Ontology, Protégé Ontology, Wordnet, Gazetteers, ...; Web Services]
Issues (1): a common HLT API
OGSA, WSMO in the web services layer?
Eclipse: less code for us, more services for users? (A free OWL/UML drawing tool, for example)
ISO TC37/SC4: JNLE special; LIRICS consortium
API Application: Ontology-based IE: Example text: “XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”
[Figure: Ontology & KB graph with nodes Company, City, Country, Location and “03/11/1978”, and edges type, HQ, establOn, partOf]
Slide17: “Gordon Brown met George Bush during his two day visit.”
[Figure: Classes, instances & metadata; panels: Classes+instances before; Classes+instances after]
Document: http://… 1.html
Offsets 0–12: Gordon Brown; class …#Person; instance …#Person12345
Offsets 18–32: George Bush; class …#Person; instance …#Person67890
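The metadata on this slide is stand-off annotation: each mention is recorded as a character span plus a link to an ontology class and a KB instance. A minimal sketch of that data structure follows, assuming a hypothetical `Annotation` record; the class and instance identifiers keep the elided URI prefixes from the slide, and the offsets are computed from the sentence string, so they need not match the numbers shown on the slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a span plus ontology links."""
    start: int      # offset of the first character of the mention
    end: int        # offset one past the last character
    text: str       # the mention string itself
    cls: str        # ontology class (URI prefix elided, as on the slide)
    instance: str   # KB instance (URI prefix elided, as on the slide)


TEXT = "Gordon Brown met George Bush during his two day visit."


def annotate(text: str, mention: str, cls: str, instance: str) -> Annotation:
    """Locate the first occurrence of a mention and wrap it as an annotation."""
    start = text.index(mention)
    return Annotation(start, start + len(mention), mention, cls, instance)


annotations: List[Annotation] = [
    annotate(TEXT, "Gordon Brown", "…#Person", "…#Person12345"),
    annotate(TEXT, "George Bush", "…#Person", "…#Person67890"),
]

for a in annotations:
    print(a.start, a.end, a.text, a.cls, a.instance)
```

Because the annotations live outside the text, the document itself is never modified and several annotation sets can refer to the same span.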
OBIE in KIM:
Popov et al. KIM. ISWC’03
An ontology (KIMO) and a KB of 200K instances
High ambiguity of instances with the same label – uses disambiguation step
Lookup phase marks mentions from the ontology
Combined with GATE-based IE system to recognise new instances of concepts and relations
KB enrichment stage where some of these new instances are added to the KB
Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
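The priority ordering described above can be sketched as a frequency ranking over candidate instances that share a label. This is only an illustration of the general idea, not the algorithm used in KIM; the instance names and counts below are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical corpus statistics: how often each candidate instance was the
# intended referent of a label. All names and figures are invented.
CORPUS_COUNTS: Dict[str, Counter] = {
    "Paris": Counter(
        {"Paris_France": 950, "Paris_Texas": 40, "Paris_Person": 10}
    ),
}


def rank_candidates(
    label: str, counts: Dict[str, Counter] = CORPUS_COUNTS
) -> List[str]:
    """Return the candidate instances for a label, most frequent first."""
    return [inst for inst, _ in counts.get(label, Counter()).most_common()]


def disambiguate(
    label: str, counts: Dict[str, Counter] = CORPUS_COUNTS
) -> Optional[str]:
    """Pick the top-ranked candidate, or None if the label is unknown."""
    ranked = rank_candidates(label, counts)
    return ranked[0] if ranked else None


best = disambiguate("Paris")
```

A ranking like this gives a sensible default when context is uninformative; a fuller system would combine it with evidence from the surrounding text.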
OBIE in KIM (2):
Popov et al. KIM. ISWC’03
KIM demo...: Continue to exploit the pluggability and community effects of GATE (and Sesame, Lucene, ...)
SWAN: Semantic Web Annotator at DERI/Galway
Syndication
Social networking
Evaluation (below)
Next steps in OBIE
(The “P” in OLP) Challenge: Evaluating Richer NE Tagging: Need for new metrics when evaluating hierarchy/ontology-based NE tagging
Need to take into account distance in the hierarchy
Tagging a company as a charity is less wrong than tagging it as a person
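One way to make "less wrong" measurable is to score a predicted class by its distance from the gold class in the hierarchy. The sketch below is an assumption for illustration, not a metric proposed in the talk: the toy class tree and the `1 / (1 + distance)` scoring function are both invented.

```python
from typing import Dict, List, Optional

# Hypothetical toy hierarchy: each class maps to its parent (None for the root).
PARENT: Dict[str, Optional[str]] = {
    "Entity": None,
    "Agent": "Entity",
    "Person": "Agent",
    "Organisation": "Agent",
    "Company": "Organisation",
    "Charity": "Organisation",
}


def ancestors(cls: Optional[str]) -> List[str]:
    """Path from a class up to the root, including the class itself."""
    path: List[str] = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def distance(a: str, b: str) -> int:
    """Number of edges on the tree path between two classes."""
    path_a, path_b = ancestors(a), ancestors(b)
    # The first class on a's path that also lies on b's path is the
    # lowest common ancestor.
    common = next(c for c in path_a if c in path_b)
    return path_a.index(common) + path_b.index(common)


def score(predicted: str, gold: str) -> float:
    """1.0 for an exact match, decaying with distance in the hierarchy."""
    return 1.0 / (1.0 + distance(predicted, gold))


charity_score = score("Charity", "Company")
person_score = score("Person", "Company")
```

With this tree, a company tagged as a charity is two edges from the gold class and a company tagged as a person is three, so the charity error receives the higher partial score.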
SW IE Evaluation tasks: Detection of entities and events, given a target ontology of the domain.
Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated “Cambridge” in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
Deciding when a new instance needs to be added to the ontology because the text contains an instance that does not already exist there.
Issues (2): a common OMM API
Two design approaches:
A. the “richest set of features” approach: pool experience, cover all the bases, be relevant to very many users (“top-down”)
B. the “highest common factors” approach: analyse software, pick common features, create pluggability layer (“bottom-up”)
Both useful; can be combined
Approach B has some key advantages:
leads to quicker version 1.0
minimises arguments (criterion: the feature exists in several systems, not whether it is “good”)
Problems:
features present several places but not all – “operation not supported”?
new work not prefigured in version 1.0 – roadmaps, placeholders
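The bottom-up approach and its "operation not supported" problem can be sketched as follows. This is a hypothetical illustration: the store names, feature sets, `Adapter` class and the "at least two systems" threshold are all assumptions made for the example, not a description of any existing API.

```python
from collections import Counter
from typing import Dict, Set

# Hypothetical feature inventories for three ontology stores.
SYSTEM_FEATURES: Dict[str, Set[str]] = {
    "store_a": {"load", "save", "query", "versioning"},
    "store_b": {"load", "save", "query", "inference"},
    "store_c": {"load", "save", "inference"},
}


def common_features(systems: Dict[str, Set[str]], min_systems: int = 2) -> Set[str]:
    """Keep a feature if it exists in at least min_systems of the systems."""
    counts = Counter(f for feats in systems.values() for f in feats)
    return {f for f, n in counts.items() if n >= min_systems}


class OperationNotSupported(Exception):
    """Raised when the common API includes a feature this backend lacks."""


class Adapter:
    """Exposes the common API on top of one concrete backend."""

    def __init__(self, name: str, api: Set[str]) -> None:
        self.name = name
        self.api = api
        self.supported = SYSTEM_FEATURES[name]

    def call(self, operation: str) -> str:
        if operation not in self.api:
            raise AttributeError(f"{operation} is not part of the common API")
        if operation not in self.supported:
            raise OperationNotSupported(f"{self.name} cannot {operation}")
        # A real adapter would delegate to the backend here.
        return f"{self.name}:{operation}"


api = common_features(SYSTEM_FEATURES)
result = Adapter("store_a", api).call("query")
```

Here `query` and `inference` enter the common API because two of the three stores offer them, which is exactly the case where one adapter has to signal that an operation is unsupported.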
The end: Tutorial on HLT for the Semantic Web at European Semantic Web Symposium: http://www.esws2004.org/
These slides: http://gate.ac.uk/sale/talks/ilash-semweb-mar2004.ppt
More information: http://gate.ac.uk/ http://nlp.shef.ac.uk/
What’s the difference between Tony Blair and Mother Teresa?: There’s good news and bad news...
The good news: the Semantic Web is now a major focus of some of the world leaders in AI research
The bad news: AI always fails
(Or: what succeeds doesn’t get called AI any more)
How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (It doesn’t: it has no sense of irony!)
Needed: clever applications of simple semantics (contrast the success of RSS or DC with more complex schemes)
De facto standards when we do the simple stuff robustly and in the large