Virtuoso Sponger RDFizer Middleware


Presentation Transcript

Virtuoso Sponger: 

Virtuoso Sponger Extracting RDF Structured Data from Non-RDF Sources

Growing the Semantic Web: 

- A classic “chicken ‘n’ egg” problem has impeded the growth of the Semantic Web
  - Development of applications for the Semantic Web will remain small-scale without a critical mass of RDF data.
  - A critical mass of RDF data won’t be achieved without adequate Semantic Web applications and tools.
- A new class of tools is emerging in response to this need: “RDFizers”
  - Transform non-RDF data into RDF
  - Virtuoso Sponger is one such RDFizer

Virtuoso Sponger: 

- An RDFizer introduced in Virtuoso 5.0
- Provides built-in RDF middleware for transforming non-RDF data into RDF “on the fly”
- Lets you use non-RDF data sources as Semantic Web data sources
- Inputs: a wide variety of non-RDF Web data sources, e.g.:
  - (X)HTML Web pages (including hosted microformats)
  - Web services (Google, Flickr etc.)
  - Binary files (MS Office, PDF, OpenDocument etc.)
- Output: RDF structured data

Inputs: Supported Data Sources: 

- RDF (inc. N3, Turtle)
  - SIOC, SKOS, FOAF, AtomOWL, Annotea …
- (X)HTML pages
  - HTML header metadata: Dublin Core
  - Microformats: eRDF, RDFa, hCard, hCalendar, XFN, xFolk …
- Syndication formats
  - RSS 2.0, Atom, OPML, OCS, XBEL
- GRDDL
- Web service APIs: Google Base, Flickr, Ning …
- Files
  - Binary files: MS Office, OpenOffice, images, audio, video …
  - Data exchange formats: iCalendar, vCard
- 3rd party metadata extractors: Aperture, Spotlight, SIMILE RDFizers
- … or add your own!

Output: Structured Data: 

In the context of the Semantic Data Web, structured data is:
“Data organized into semantic chunks or entities, with similar entities grouped together into relations or classes”
Michael Bergman, article: “More Structure, More Terminology and (hopefully) More Clarity”

Sponger Benefits: 

- The majority of the world's data currently resides in non-RDF form
- Sponger provides a “Swiss army knife” for generating RDF structured data from non-RDF sources
- Extracting data from non-RDF Web sources and converting it to RDF:
  - helps “bootstrap” the Semantic Web
  - helps drive the transition of the traditional Document Web into the emerging Semantic Data Web
  - exposes the data in a canonical form for querying and inference

Sponger Inputs & Outputs: 

Sponger Inputs & Outputs

Sponger Architecture: 

- Sponger is composed of Sponger Cartridges
  - The default cartridge collection is bundled as a Virtuoso VAD
  - Cartridge = Metadata Extractor + Ontology Mapper
- Metadata extracted from non-RDF resources is mapped to a suitable ontology by the Ontology Mapper to produce structured data
- Sponger is highly customizable
  - Custom cartridges can be developed
  - Using any language (e.g. Virtuoso PL, C/C++, Java) supported by the Virtuoso Server Extensions API

Using The Sponger: 

The Sponger can be invoked in several ways, via:
- Virtuoso SPARQL query processor
- Virtuoso RDF Proxy Service (/proxy), e.g. http://localhost:8890/proxy
- OpenLink RDF client applications
- ODS-Briefcase (Virtuoso WebDAV)
- Directly through Virtuoso PL

Using the Sponger: SPARQL Query Processor: 

- Virtuoso extends SPARQL with IRI/URI dereferencing
  - The highly distributed nature of the Semantic Web makes it unlikely that all the referenced resources/IRIs will be in the local quad store
- During query execution, remote RDF resources can be downloaded from a given IRI and parsed, and the resulting triples stored in the local quad store
- IRI dereferencing of FROM clauses
  - Downloads & stores triples from named graphs
- IRI dereferencing of SPARQL variables
  - Downloads & stores triples based on a proximity search from a starting IRI to a given depth (# of hops) via specified predicates

SPARQL Extensions: IRI Dereferencing of FROM Clauses: 

Enabled through ‘define get:…’ pragmas:

    DEFINE get:method "GET"
    DEFINE get:soft "soft"
    SELECT ?id
    FROM NAMED <http://myhost/user1.ttl>
    FROM NAMED <http://myhost/user2.ttl>
    WHERE { GRAPH ?g { ?id a ?o } };

- get:soft – retrieval mode: “soft” / “replace”
- get:uri – IRI to retrieve, if not equal to the IRI of the FROM clause
- get:method – HTTP “GET” or URIQA “MGET”
- get:refresh – max allowed age (seconds) of the cached resource; can reduce the expiry time specified in HTTP headers
- get:proxy – proxy server address, if direct download is not possible
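A client that sends such queries has to put the pragmas in front of the query text. The following is a minimal Python sketch of that step only; `with_pragmas` is a hypothetical helper written for this illustration, not part of Virtuoso or any client library, and it assumes string values are double-quoted while integer values are written bare, as in the slide examples.

```python
def with_pragmas(query, pragmas):
    """Prefix a SPARQL query with one DEFINE line per (name, value) pair.

    Hypothetical helper: string values are double-quoted (with backslashes
    and quotes escaped), integer values are written as-is.
    """
    lines = []
    for name, value in pragmas:
        if isinstance(value, int):
            rendered = str(value)
        else:
            escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
            rendered = '"' + escaped + '"'
        lines.append(f"DEFINE {name} {rendered}")
    return "\n".join(lines + [query])


# The query from the slide, without its pragmas.
QUERY = """SELECT ?id
FROM NAMED <http://myhost/user1.ttl>
FROM NAMED <http://myhost/user2.ttl>
WHERE { GRAPH ?g { ?id a ?o } }"""

text = with_pragmas(QUERY, [("get:method", "GET"), ("get:soft", "soft")])
print(text)
```

Keeping the pragmas as a list of pairs preserves their order and lets the same helper emit other pragma families without changes.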

SPARQL Extensions: IRI Dereferencing of Variables: 

Enabled through ‘define input:grab-…’ pragmas:

    DEFINE input:grab-var "?more"
    DEFINE input:grab-depth 10
    DEFINE input:grab-limit 100
    DEFINE input:grab-base "http://myhost/"
    SELECT ?id ?fullname ?email
    WHERE {
      GRAPH ?g {
        ?id a <Person> ;
            <FullName> ?fullname ;
            <Email> ?email .
        OPTIONAL { ?id <SeeAlso> ?more }
      }
    };

- input:grab-var – SPARQL variable identifying IRIs to be downloaded
- input:grab-depth – max # of links (predicates) between nodes in the graph
- input:grab-limit – max # of resources (subject/object nodes) to retrieve
- input:grab-base – base IRI for converting relative IRIs to absolute
- plus others (grab-seealso, grab-destination …); see the Reference Manual
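The depth and limit settings can be read as bounds on a breadth-first traversal that starts at one IRI and follows only the chosen predicates. The Python sketch below illustrates that idea on an in-memory dictionary standing in for the Web; `WEB`, `crawl` and the predicate names are hypothetical, and the code is not a description of how Virtuoso implements the feature.

```python
from collections import deque

# Hypothetical stand-in for the Web: IRI -> list of (predicate, object IRI).
WEB = {
    "http://myhost/a": [("seeAlso", "http://myhost/b"), ("knows", "http://myhost/x")],
    "http://myhost/b": [("seeAlso", "http://myhost/c")],
    "http://myhost/c": [("seeAlso", "http://myhost/d")],
    "http://myhost/d": [],
}


def crawl(start, predicates, max_depth, max_resources):
    """Return the IRIs reached from `start` by following only `predicates`,
    at most `max_depth` hops away and at most `max_resources` in total."""
    fetched = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        iri, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth bound reached: do not follow links from here
        for predicate, target in WEB.get(iri, []):
            if predicate not in predicates or target in seen:
                continue
            if len(fetched) >= max_resources:
                return fetched  # resource limit reached
            seen.add(target)
            fetched.append(target)
            queue.append((target, depth + 1))
    return fetched
```

With a depth of 2, `crawl("http://myhost/a", {"seeAlso"}, 2, 100)` stops at `http://myhost/c`; with a limit of 2 it stops after `http://myhost/b`, whatever the depth.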

Using the Sponger: RDF Proxy Service: 

- Sponger functionality is also exposed by the Virtuoso “/proxy” endpoint
  - An in-built REST-style Web service
  - Takes a target URL & returns its content “as is” or tries to transform it (by sponging) to RDF
- Provides a “pipe” for RDF browsers to browse non-RDF sources
- Caches to temporary Virtuoso storage
  - Cache invalidation is similar to that of a traditional Web browser, based on the HTTP ‘expires’ header
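Browser-style invalidation of this kind comes down to comparing the current time with the date in the HTTP Expires header. The Python sketch below shows that check in isolation; `is_fresh` is a hypothetical name, and treating a missing or unparseable header as already expired is an assumption of this sketch rather than documented Virtuoso behaviour.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now):
    """Return True if a cached copy with this Expires value is still usable.

    Assumption of this sketch: a missing or unparseable header means the
    cached copy is treated as expired.
    """
    if not expires_header:
        return False
    try:
        expires = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False
    if expires.tzinfo is None:
        # Dates without a usable zone are interpreted as UTC here.
        expires = expires.replace(tzinfo=timezone.utc)
    return now < expires
```

The caller passes `now` explicitly (for example `datetime.now(timezone.utc)`), which keeps the check deterministic and easy to test.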

RDF Proxy Service: 

Parameters:
- url: the URL of the target
- force: if ‘rdf’ is specified, the service will try to extract RDF data from the target and return it
- header: HTTP headers to be sent to the target
- output-format: output MIME type of the RDF data
  - ‘rdf+xml’ (default) / ‘n3’ / ‘turtle’ / ‘ttl’
  - if not specified, the proxy service uses content negotiation
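A request to the proxy is an ordinary URL with these parameters in the query string. The Python sketch below builds such a URL with the standard library; `proxy_url` is a hypothetical helper, the default endpoint is the localhost example given earlier in the deck, and the target address in the usage note is made up.

```python
from urllib.parse import urlencode


def proxy_url(target, proxy="http://localhost:8890/proxy",
              force_rdf=True, output_format=None):
    """Build a /proxy request URL for `target` (hypothetical helper).

    Only the url, force and output-format parameters are handled.
    """
    params = [("url", target)]
    if force_rdf:
        params.append(("force", "rdf"))
    if output_format is not None:
        params.append(("output-format", output_format))
    # urlencode percent-encodes the target so it survives as one parameter.
    return proxy + "?" + urlencode(params)
```

For example, `proxy_url("http://example.com/page.html", output_format="n3")` yields `http://localhost:8890/proxy?url=http%3A%2F%2Fexample.com%2Fpage.html&force=rdf&output-format=n3`.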

Using the Sponger: OpenLink RDF Client Applications: 

Bundled as part of the OpenLink AJAX Toolkit (OAT):
- RDF Browser
  - Uses the /proxy service by default
- iSPARQL – interactive SPARQL query builder
  - Uses the /sparql service & 5 sponging settings (translated to IRI dereferencing pragmas on the server):
    - Get Local Data Only
    - Get Remote Data When Missing Locally
    - Get All Remote Data
    - Get All Remote Data & Related Data
    - Get Everything

Using the Sponger: ODS-Briefcase (Virtuoso WebDAV): 

- Briefcase = a component of OpenLink Data Spaces
- Includes a high-level interface to the Virtuoso WebDAV repository
  - Web browser based interaction
  - Web services support (direct use of the WebDAV protocol)
  - SPARQL queryable (WebDAV location acts as RDF graph URI)
- Metadata is automatically extracted at file upload time
  - Wide variety of file formats supported
- All WebDAV resources are exposed as SIOC instance data
- Extracted metadata is available in two forms
  - Pure WebDAV
  - RDF (RDF/XML, N3, Turtle), optionally synchronized with the Quad Store
- Virtuoso Content Crawler / RDF_Sink folder help automate uploading

SIOC as a Data Space “Glue” Ontology: 

- ODS has its own built-in cartridges for mapping to SIOC
- All ODS data containers (ODS-Briefcase, ODS-Weblog, ODS-Wiki, ODS-FeedManager etc.) expose their data as SIOC instance data
- SIOC:
  - provides a generic data model of containers, items and associations between items
    - Classes include: User, UserGroup, Role, Site, Forum, Post
    - SIOC Types Module (sioc-t) defines further types; classes include: AddressBook, BookmarkFolder, ImageGallery etc.
  - permits the use of other ontologies (e.g. FOAF) when describing attributes of SIOC entities
  - provides a generic wrapper (“glue” ontology) for describing RDF structured data derived from OpenLink Data Spaces
- All ODS-related SIOC data can be queried through SPARQL

Using the Sponger: Directly via Virtuoso PL: 

- Sponger cartridges are invoked through a cartridge hook
  - Provides a Virtuoso PL entry point to the packaged functionality
  - Can be called directly from your own Virtuoso PL procedures

Sponger Cartridges: 

Sponger Cartridges

Sponger Architecture: 

- Sponger is composed of cartridges
  - Cartridge = metadata extractor + ontology mapper
  - A cartridge is invoked through its cartridge hook (Virtuoso PL entry point)
- Metadata extractor
  - Performs initial data extraction
- Ontology mapper
  - Generates RDF instance data from extracted (non-RDF) metadata
  - Extracted metadata is mapped to an ontology associated (via an internal mapping table) with the data source type
  - Typically uses XSLT (GRDDL or in-built Virtuoso mapping scheme) or Virtuoso PL
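The extractor/mapper split can be illustrated outside Virtuoso with a toy cartridge. In the Python sketch below, the extractor pulls key/value pairs out of plain “key: value” text and the mapper turns them into triples through a small mapping table. All names (`extract_metadata`, `map_to_triples`, `cartridge`, `MAPPING`) and the input format are hypothetical; real cartridges are Virtuoso PL or Server Extensions code and typically use XSLT for the mapping step.

```python
def extract_metadata(content):
    """Metadata extractor: collect raw fields from 'key: value' lines."""
    metadata = {}
    for line in content.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            metadata[key.strip().lower()] = value.strip()
    return metadata


# Toy mapping table: extracted field -> predicate IRI (Dublin Core elements).
MAPPING = {
    "title": "http://purl.org/dc/elements/1.1/title",
    "creator": "http://purl.org/dc/elements/1.1/creator",
}


def map_to_triples(subject, metadata):
    """Ontology mapper: emit (subject, predicate, value) for mapped fields."""
    return [(subject, MAPPING[field], value)
            for field, value in sorted(metadata.items())
            if field in MAPPING]


def cartridge(subject, content):
    """Cartridge = metadata extractor + ontology mapper."""
    return map_to_triples(subject, extract_metadata(content))
```

Fields with no entry in the mapping table are dropped, so supporting a new field only requires extending the table, not the extractor.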

Sponger Cartridge Invocation: 

Sponger Cartridge Invocation

Sponger Configuration using Conductor UI: 

- Virtuoso Conductor provides a browser-based graphical UI for most Virtuoso administration tasks, including managing Sponger Cartridges and VADs
- VAD = Virtuoso Application Distribution
  - Packaging & distribution system for Virtuoso extensions
- RDF Cartridges VAD
  - Bundles a variety of pre-built cartridges for popular Web resources and file types
  - Installed as part of the default Virtuoso installation

Sponger Configuration using Conductor UI: RDF Cartridges Pane: 

Sponger Configuration using Conductor UI: RDF Cartridges Pane

Sponger Configuration using Conductor UI: GRDDL Filters: 

Sponger Configuration using Conductor UI: GRDDL Filters

Sponger Configuration using Conductor UI: XSLT Templates: 

Sponger Configuration using Conductor UI: XSLT Templates

Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies: 

Sponger Configuration using Conductor UI: Schema Files / Supported Ontologies

Custom Cartridges: 

- Sponger is extensible via a pluggable cartridge architecture
- Sponge new data formats by creating your own cartridges
  - Use Virtuoso PL or any language supported by the Virtuoso Server Extensions API (incl. C/C++, Java)
- Before use, register your cartridge in the Cartridge Registry (SYS_RDF_CARTRIDGES table), using Conductor or DML

Custom Cartridges: 

Cartridge Hook – Virtuoso PL Prototype
- in graph_iri varchar: IRI of the graph being retrieved
- in new_origin_uri varchar: URI of the document being retrieved
- in destination varchar: destination graph IRI
- inout content any: the document content
- inout async_queue any: preallocated asynchronous queue used to call the configured ping service
- inout ping_service any: URL of the ping service, as assigned to the PingService parameter in the [SPARQL] section of the virtuoso.ini file. This argument can be used to notify the PingTheSemanticWeb RDF document repository & notification service
- inout api_key any: unique string providing cartridge-specific data, taken from the RC_KEY column of the DB.DBA.SYS_RDF_CARTRIDGES table

Flickr Cartridge Extracts: 

    procedure DB.DBA.RDF_LOAD_FLICKR_IMG (
        in graph_iri varchar, in new_origin_uri varchar, in dest varchar,
        inout _ret_body any, inout aq any, inout ps any, inout _key any)
    {
      declare xd, xt, url, tmp, api_key, img_id, hdr, exif any;
      ...
      url := sprintf (' id=%s&api_key=%s', img_id, api_key);
      tmp := http_get (url, hdr);
      ...
      xd := xtree_doc (tmp);
      ...
      xt := xslt (registry_get ('_rdf_cartridges_path_') || 'xslt/flickr2rdf.xsl', xd,
                  vector ('baseUri', coalesce (dest, graph_iri), 'exif', exif));
      xd := serialize_to_UTF8_xml (xt);
      DB.DBA.RDF_LOAD_RDFXML (xd, new_origin_uri, coalesce (dest, graph_iri));
      return 1;
    }

Custom Resolvers: 

- Sponger supports pluggable “Custom Resolver” cartridges
- Support dereferencing of other forms of URIs besides HTTP URLs, e.g. URN schemes (LSIDs) and handle schemes (DOIs)
- Greatly extends the range of data sources which can be linked into the Semantic Web
  - &should-sponge=soft&query=SELECT+*+WHERE+{?s+?p+?o}&format=text/html
- Proxy service also recognizes URNs
