Virtuoso Sponger: Extracting RDF Structured Data from Non-RDF Sources
Growing the Semantic Web:
A classic “chicken ‘n’ egg” problem has impeded the growth of the Semantic Web
Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data.
A critical mass of RDF data won’t be achieved without adequate Semantic Web applications and tools.
A new class of tools is emerging in response to this need: “RDFizers”
Transform non-RDF data into RDF
Virtuoso Sponger is one such RDFizer
Virtuoso Sponger:
An RDFizer introduced in Virtuoso 5.0
Provides built-in RDF middleware for transforming non-RDF data into RDF “on the fly”.
You can use non-RDF data sources as Semantic Web data sources.
Inputs: A wide variety of non-RDF Web data sources, e.g.:
(X)HTML Web Pages (including hosted microformats)
Web services (Google, Del.icio.us, Flickr etc.)
Binary files (MS Office, PDF, OpenDocument etc.)
Output: RDF structured data
Inputs: Supported Data Sources:
RDF (incl. N3, Turtle)
SIOC, SKOS, FOAF, AtomOWL, Annotea …
(X)HTML pages
HTML header metadata: Dublin Core
Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk …
Syndication formats
RSS 2.0, Atom, OPML, OCS, XBEL
GRDDL
Web service APIs: Google Base, Flickr, Del.icio.us, Ning …
Files:
Binary files: MS Office, OpenOffice, images, audio, video …
Data exchange formats: iCalendar, vCard
3rd party metadata extractors: Aperture, Spotlight, SIMILE RDFizers
or add your own!
Output: Structured Data:
In the context of the Semantic Data Web:
“Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes” (Michael Bergman, http://www.mkbergman.com, in the article “More Structure, More Terminology and (hopefully) More Clarity”)
Sponger Benefits:
The majority of the world's data currently resides in non-RDF form
Sponger provides a “Swiss army knife” for RDF structured data generation from non-RDF sources
Extracting data from non-RDF Web sources and converting it to RDF
helps “bootstrap” the Semantic Web
helps drive the transition of the traditional Document-Web into the emerging Semantic Data-Web
exposes the data in a canonical form for querying and inference
Sponger Inputs & Outputs
Sponger Architecture:
Sponger is composed of Sponger Cartridges
The default cartridge collection is bundled as a Virtuoso VAD
Cartridge = Metadata Extractor + Ontology Mapper
Metadata extracted from non-RDF resources is mapped to a suitable ontology by the Ontology Mapper to produce Structured Data
Sponger is highly customizable
Custom cartridges can be developed
Using any language (e.g. Virtuoso PL, C/C++, Java) supported by the Virtuoso Server Extensions API
Using the Sponger:
The Sponger can be invoked in several ways, via:
Virtuoso SPARQL query processor
Virtuoso RDF Proxy Service (/proxy)
E.g. http://localhost:8890/proxy
OpenLink RDF client applications
ODS-Briefcase (Virtuoso WebDAV)
Directly through Virtuoso PL
Using the Sponger: SPARQL Query Processor:
Virtuoso extends SPARQL with IRI/URI dereferencing
The highly distributed nature of the Semantic Web makes it unlikely that all referenced resources/IRIs will be in the local quad store
During query execution:
From a given IRI, remote RDF resources can be downloaded and parsed, and the resulting triples stored in the local quad store
IRI dereferencing of FROM clauses
Downloads & stores triples from named graphs
IRI dereferencing of SPARQL variables
Downloads & stores triples based on proximity search from a starting IRI to a given depth (# of hops) via specified predicates
SPARQL Extensions: IRI Dereferencing of FROM Clauses:
Enabled through ‘define get:…’ pragmas
DEFINE get:method "GET"
DEFINE get:soft "soft"
SELECT ?id
FROM NAMED
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } };
get:soft – retrieval mode: “soft” / “replace”
get:uri – IRI to retrieve if not equal to IRI of FROM clause
get:method – HTTP “GET” or URIQA “MGET”
get:refresh – max allowed age (seconds) of cached resource
can reduce expiry time specified in HTTP headers
get:proxy – proxy server address if direct download not possible
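As a rough illustration of how these pragmas combine, the Python sketch below prepends ‘define get:…’ lines to a query string before it is sent to a SPARQL endpoint. The helper name with_get_pragmas is hypothetical, and writing get:refresh as an unquoted integer is an assumption modelled on the integer-valued input:grab pragmas; check the Reference Manual for the exact value syntax.

```python
def with_get_pragmas(query, soft="soft", method="GET",
                     uri=None, refresh=None, proxy=None):
    """Return `query` prefixed with 'define get:...' pragma lines.

    Hypothetical helper: only the pragma names come from the slides.
    """
    if soft not in ("soft", "replace"):
        raise ValueError("get:soft must be 'soft' or 'replace'")
    if method not in ("GET", "MGET"):
        raise ValueError("get:method must be 'GET' or 'MGET'")

    lines = [f'define get:soft "{soft}"', f'define get:method "{method}"']
    if uri is not None:
        # IRI to retrieve when it differs from the FROM clause IRI
        lines.append(f'define get:uri "{uri}"')
    if refresh is not None:
        # Maximum allowed age of the cached copy, in seconds (assumed unquoted)
        lines.append(f"define get:refresh {int(refresh)}")
    if proxy is not None:
        lines.append(f'define get:proxy "{proxy}"')
    return "\n".join(lines + [query])


if __name__ == "__main__":
    print(with_get_pragmas(
        "SELECT ?id WHERE { GRAPH ?g { ?id a ?o } }", refresh=600))
```

Validating the two enumerated values up front keeps a typo such as "sofft" from reaching the server as an unrecognised retrieval mode.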
SPARQL Extensions: IRI Dereferencing of Variables:
Enabled through ‘define input:grab-…’ pragmas
DEFINE input:grab-var "?more"
DEFINE input:grab-depth 10
DEFINE input:grab-limit 100
DEFINE input:grab-base "http://myhost/"
SELECT ?id ?fullname ?email
WHERE { GRAPH ?g {
?id a ; ?fullname ; ?email .
OPTIONAL { ?id ?more }
} } ;
input:grab-var – SPARQL variable identifying IRIs to be downloaded
input:grab-depth – max # of links (predicates) between nodes in graph
input:grab-limit – max # of resources (subject/object nodes) to retrieve
input:grab-base – base IRI for converting relative IRIs to absolute
plus others (grab-seealso, grab-destination …); see the Reference Manual.
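To make the roles of grab-depth and grab-limit concrete, here is a toy Python model of a bounded, breadth-first walk from a starting IRI. It is a sketch of the idea only, not Virtuoso's algorithm: the function grab, the in-memory links dictionary standing in for real dereferencing, and the choice to count the starting IRI against the limit are all assumptions.

```python
from collections import deque


def grab(start, links, depth, limit):
    """Walk outward from `start`, at most `depth` hops, at most `limit` IRIs.

    `links` maps an IRI to the IRIs it references; in a real system each
    lookup would be a download of the resource (assumption for this sketch).
    """
    fetched = [start]          # IRIs "downloaded" so far, in fetch order
    seen = {start}
    queue = deque([(start, 0)])
    while queue and len(fetched) < limit:
        iri, hops = queue.popleft()
        if hops == depth:      # do not follow links beyond the depth bound
            continue
        for nxt in links.get(iri, ()):
            if nxt in seen:
                continue
            if len(fetched) >= limit:
                break
            seen.add(nxt)
            fetched.append(nxt)
            queue.append((nxt, hops + 1))
    return fetched


if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["d"], "d": ["e"]}
    print(grab("a", toy, depth=1, limit=100))
    print(grab("a", toy, depth=10, limit=2))
```

With the toy graph above, a depth of 1 stops at the direct neighbours of the start, while a limit of 2 cuts the walk short regardless of depth.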
Using the Sponger: RDF Proxy Service:
Sponger functionality is also exposed by the Virtuoso “/proxy” endpoint
A built-in REST-style Web service
Takes a target URL and returns its content “as is” or tries to transform it (by sponging) to RDF
http://demo.openlinksw.com/proxy?url=http://www.w3c.org/People/Connelly/&force=rdf
Provides a “pipe” for RDF browsers to browse non-RDF sources
Caches to temporary Virtuoso storage
Cache invalidation is similar to that of a traditional Web browser, based on the HTTP ‘Expires’ header
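The expiry rule can be sketched in a few lines of Python. The function below treats a cached copy as stale once the HTTP Expires time has passed, and lets a maximum age in seconds (the role get:refresh plays in the SPARQL pragmas) shorten that time but never extend it. The function name, its parameters, and the decision to treat a response with no freshness information as stale are assumptions for illustration; the slides do not specify Virtuoso's internal behaviour.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_stale(fetched_at, expires_header, now, max_age_seconds=None):
    """Decide whether a cached resource should be fetched again.

    fetched_at, now: timezone-aware datetimes.
    expires_header:  value of the HTTP Expires header, or None.
    max_age_seconds: optional cap on the cached copy's age.
    """
    expiry = parsedate_to_datetime(expires_header) if expires_header else None
    if max_age_seconds is not None:
        cap = fetched_at + timedelta(seconds=max_age_seconds)
        # The cap can only bring the expiry forward, never push it back.
        expiry = cap if expiry is None else min(expiry, cap)
    if expiry is None:
        return True  # assumption: nothing says the copy is fresh, so refetch
    return now >= expiry


if __name__ == "__main__":
    utc = timezone.utc
    fetched = datetime(2015, 10, 21, 7, 0, tzinfo=utc)
    header = "Wed, 21 Oct 2015 07:28:00 GMT"
    ten_past = datetime(2015, 10, 21, 7, 10, tzinfo=utc)
    print(is_stale(fetched, header, ten_past))                       # within Expires
    print(is_stale(fetched, header, ten_past, max_age_seconds=300))  # cap exceeded
```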
RDF Proxy Service:
Parameters:
url: the URL of the target
force: if ‘rdf’ is specified, will try to extract RDF data from the target and return it
header: HTTP headers to be sent to the target
output-format: output MIME type of the RDF data
‘rdf+xml’ (default) / ‘n3’ / ‘turtle’ / ‘ttl’
if not specified, proxy service uses content negotiation
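A client can assemble a /proxy request from these parameters with ordinary URL encoding. The Python sketch below does so with the standard library; the function name proxy_url and its defaults are hypothetical, and only the parameter names and format values come from the list above. The header parameter is left out to keep the example short.

```python
from urllib.parse import urlencode

# Output formats listed on the slide
FORMATS = ("rdf+xml", "n3", "turtle", "ttl")


def proxy_url(endpoint, target, force_rdf=True, output_format=None):
    """Build a request URL for a Sponger /proxy endpoint.

    endpoint: e.g. "http://localhost:8890/proxy"
    target:   URL of the resource to fetch (percent-encoded automatically)
    """
    params = [("url", target)]
    if force_rdf:
        params.append(("force", "rdf"))
    if output_format is not None:
        if output_format not in FORMATS:
            raise ValueError(f"unsupported output-format: {output_format}")
        params.append(("output-format", output_format))
    return endpoint + "?" + urlencode(params)


if __name__ == "__main__":
    print(proxy_url("http://localhost:8890/proxy",
                    "http://example.org/page?x=1&y=2",
                    output_format="n3"))
```

Encoding the target matters whenever it carries its own query string; otherwise its ‘&’ characters would be read as separators between the proxy's own parameters.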
Using the Sponger: OpenLink RDF Client Applications:
Bundled as part of the OpenLink AJAX Toolkit (OAT)
RDF Browser
Uses /proxy service by default
iSPARQL – Interactive SPARQL query builder
Uses /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on server)
Get Local Data Only
Get Remote Data When Missing Locally
Get All Remote Data
Get All Remote Data & Related Data
Get Everything
Using the Sponger: ODS-Briefcase (Virtuoso WebDAV):
Briefcase = A component of OpenLink Data Spaces
Includes a high-level interface to the Virtuoso WebDAV repository
Web browser based interaction
Web services support (direct use of WebDAV protocol)
SPARQL queryable (WebDAV location acts as RDF graph URI)
Metadata automatically extracted at file upload time
Wide variety of file formats supported
All WebDAV resources are exposed as SIOC instance data
Extracted metadata available in two forms
Pure WebDAV
RDF (RDF/XML, N3, Turtle) optionally synchronized with Quad Store
Virtuoso Content Crawler / RDF_Sink folder help automate uploading
SIOC as a Data Space “Glue” Ontology:
ODS has its own built-in cartridges for mapping to SIOC
All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data
SIOC
provides a generic data model of containers, items and associations between items
Classes include: User, UserGroup, Role, Site, Forum, Post
SIOC Types Module (sioc-t) defines further types.
Classes include: AddressBook, BookmarkFolder, ImageGallery etc.
permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities
provides a generic wrapper (“glue” ontology) for describing RDF structured data derived from OpenLink Data Spaces
All ODS-related SIOC data can be queried through SPARQL
Using the Sponger: Directly via Virtuoso PL:
Sponger cartridges are invoked through a cartridge hook
Provides a Virtuoso PL entry point to the packaged functionality
Can be called directly from your own Virtuoso PL procedures
Sponger Cartridges
Sponger Architecture:
Sponger is composed of cartridges
Cartridge = metadata extractor + ontology mapper
A cartridge is invoked through its cartridge hook (a Virtuoso PL entry point)
Metadata extractor
Performs initial data extraction
Ontology mapper
Generates RDF instance data from extracted (non-RDF) metadata
Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type
Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL
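The extractor/mapper split can be mimicked in a few lines of Python. This is a toy stand-in, not how Virtuoso's cartridges are written (those use XSLT or Virtuoso PL, as noted above): the extractor collects `<meta>` name/content pairs from an HTML page, and the mapper turns Dublin Core entries such as DC.title into triples. All function and class names are hypothetical, and mapping DC.* names onto the Dublin Core elements namespace is an assumption made for this example.

```python
from html.parser import HTMLParser

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace


class _MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            a = dict(attrs)
            name, content = a.get("name"), a.get("content")
            if name and content is not None:
                self.meta[name] = content


def extract(html):
    """Metadata extractor: raw, non-RDF key/value metadata."""
    collector = _MetaCollector()
    collector.feed(html)
    return collector.meta


def map_to_rdf(subject_iri, meta):
    """Ontology mapper: keep DC.* entries and emit (s, p, o) triples."""
    triples = []
    for name, value in meta.items():
        prefix, _, local = name.partition(".")
        if prefix.upper() == "DC" and local:
            triples.append((subject_iri, DC + local.lower(), value))
    return triples


def cartridge(subject_iri, html):
    """Cartridge = metadata extractor + ontology mapper."""
    return map_to_rdf(subject_iri, extract(html))


if __name__ == "__main__":
    page = ('<html><head><meta name="DC.title" content="Hello">'
            '<meta name="viewport" content="width=device-width"></head></html>')
    for triple in cartridge("http://example.org/doc", page):
        print(triple)
```

Keeping the two stages separate means the same extractor could feed a different mapper if the target ontology changes.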
Sponger Cartridge Invocation
Sponger Configuration using Conductor UI:
Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks
including managing Sponger Cartridges and VADs
VAD = Virtuoso Application Distribution
Packaging & distribution system for Virtuoso extensions
RDF Cartridges VAD
Bundles a variety of pre-built cartridges for popular Web resources and file types
Installed as part of default Virtuoso installation
Sponger Configuration using Conductor UI: RDF Cartridges Pane
Sponger Configuration using Conductor UI: GRDDL Filters
Sponger Configuration using Conductor UI: XSLT Templates
Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies
Custom Cartridges:
Sponger is extensible via a pluggable cartridge architecture
Sponge new data formats by creating your own cartridges
Use Virtuoso PL or any language supported by the Virtuoso Server Extensions API (incl. C/C++, Java)
Before use, register your cartridge in the Cartridge Registry (the SYS_RDF_CARTRIDGES table) using Conductor or DML
Custom Cartridges:
Cartridge Hook: Virtuoso PL Prototype
in graph_iri varchar: IRI of graph being retrieved
in new_origin_uri varchar: URI of the document being retrieved
in destination varchar: destination graph IRI
inout content any: the document content
inout async_queue any: preallocated asynchronous queue used to call the configured ping service
inout ping_service any: URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument could be used to notify the PingTheSemanticWeb RDF document repository & notification service
inout api_key any: unique string providing cartridge specific data taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table
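How a registry of cartridges and their hooks might fit together can be modelled in Python. This is a conceptual sketch only: the Cartridge record, the regular-expression match against the document URI, and the convention that a hook returning 1 ends the search are assumptions for illustration, and the async_queue and ping_service arguments are omitted for brevity. Only the remaining argument names mirror the prototype above.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cartridge:
    """One registry row (hypothetical shape, loosely modelled on the slides)."""
    pattern: str                 # regex tested against the document URI
    hook: Callable[..., int]     # entry point; returns 1 when it handled the document
    api_key: Optional[str] = None
    enabled: bool = True


def sponge(registry, graph_iri, new_origin_uri, destination, content):
    """Try each enabled, matching cartridge in order; return the one that handled it."""
    for cartridge in registry:
        if not cartridge.enabled:
            continue
        if not re.search(cartridge.pattern, new_origin_uri):
            continue
        result = cartridge.hook(graph_iri, new_origin_uri, destination,
                                content, cartridge.api_key)
        if result == 1:
            return cartridge
    return None


if __name__ == "__main__":
    def flickr_hook(graph_iri, new_origin_uri, destination, content, api_key):
        print("flickr cartridge handling", new_origin_uri)
        return 1

    registry = [Cartridge(r"flickr\.com/photos/", flickr_hook, api_key="KEY")]
    sponge(registry, "http://example.org/g",
           "http://www.flickr.com/photos/someone/1/", None, "")
```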
Flickr Cartridge Extracts:
procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
  in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
  inout _ret_body any, inout aq any, inout ps any, inout _key any)
{
  declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
  ...
  url := sprintf ('http://api.flickr.com/services/rest/?method=flickr.photos.getInfo&photo_id=%s&api_key=%s', img_id, api_key);
  tmp := http_get (url, hdr);
  ...
  xd := xtree_doc (tmp);
  ...
  xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl',
              xd, vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
  xd := serialize_to_UTF8_xml (xt);
  DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
  return 1;
}
Custom Resolvers:
Sponger supports pluggable “Custom Resolver” cartridges
Support dereferencing of other forms of URIs besides HTTP URLs, e.g.:
URN schemes (LSIDs) and handle schemes (DOIs)
Greatly extends the range of data sources that can be linked into the Semantic Web
http://demo.openlinksw.com/sparql?default-graph-uri=urn:lsid:ubio.org:namebank:11815&should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html
Proxy service also recognizes URNs
http://demo.openlinksw.com/proxy?url=urn:lsid:ubio.org:namebank:11815&force=rdf
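When such a /sparql request is generated by a program rather than typed by hand, the URN and the query text should be percent-encoded. The Python sketch below builds the request from the four parameters seen in the example URL; the function name and its default values are hypothetical.

```python
from urllib.parse import urlencode


def sponge_query_url(endpoint, graph,
                     query="SELECT * WHERE {?s ?p ?o}",
                     should_sponge="soft", fmt="text/html"):
    """Build a /sparql request that sponges `graph` (an HTTP URL or a URN)."""
    return endpoint + "?" + urlencode({
        "default-graph-uri": graph,
        "should-sponge": should_sponge,
        "query": query,
        "format": fmt,
    })


if __name__ == "__main__":
    print(sponge_query_url("http://demo.openlinksw.com/sparql",
                           "urn:lsid:ubio.org:namebank:11815"))
```

The colons in the LSID and the braces in the query come out as %3A, %7B and %7D, which avoids relying on the server or browser to tolerate them unescaped.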