Faceted Views over Large-Scale Linked Data

Presentation Transcript

OpenLink Virtuoso - Faceted Views : 

Faceted Views over Large-Scale Linked Data
Orri Erling, Program Manager, Virtuoso Development Team, OpenLink Software
© 2007 OpenLink Software, All rights reserved

Dimensions of Web Usage : 

Web 1.0: Publishing for All (Citizen Publisher), via Web sites.
Web 2.0: Commentary for All (Citizen Journalist), via blogging and social networks, with user-generated content spread across data silos.
Web 3.0: Analysis for All (Citizen Analyst), via Linked Data, which enables data mobility and meshing: your data is your statement, and applications float on a cloud of data across a federation of HTTP-accessible Data Spaces.
Meanwhile, in the DBMS world, ad hoc data access and manipulation has consistently won over hard-coded alternatives, e.g. SQL over CODASYL.
Today we see Linked Data delivering the "ad hoc" factor in "best of both worlds" fashion, relative to alternatives (including RDBMS), across the Web and within intranets and extranets.

The Challenges : 

Scale of Instance Data: 10^9 to 10^11 Triples
Scale of Ontology: Hundreds of Thousands of Classes
Faceted Browsing, Text and Structure
Deployment and Provisioning

It Is Not Only About The Warehouse : 

Up until now, you designed the warehouse for the application, loaded the data, and made a data island.
With Linked Data, the warehouse is self-filling, based on published data that uses terms from commonly shared vocabularies.
Virtuoso facilitates this through integrated RDF-ization middleware: you populate the warehouse as you query, and the system simply gets smarter in line with your natural work patterns.

It Is Not Only About Publishing Your Data : 

Having secrets does not mean using a secret language.
Private environments still benefit from common vocabularies and terms.
People and organizations publish anyway. Now it is about publishing for use in applications and integration: internet, extranet, intranet.
Linked Data and Virtuoso deliver on the Data Spaces concept: express any statement for which there is a vocabulary, and the data exposed by the statement can be found, joined, and processed (e.g. Meshups).
Basically, the network is the database.

Solutions : 

Virtuoso 6, Single Server and Cluster Editions
SPARQL and SQL with the Right Extensions for Serious BI-Style Analytics
Integrated Web Services Platform, Suite of RDF-izers (Extractor & LOD Cloud Lookup variants)
Server-Hosted Facet Browsing Service (via REST API), Entity Ranking, Other Building Blocks for Web 2.0 Style Development

The lod.openlinksw.com Demo : 

4.2 GTriples on 2 Commodity Servers
Full Text and Structured Querying
SPARQL Endpoint
Faceted Browsing Interface for Quick Discovery and Simple Report Composition
Usage Statistics across Source & Reference Graph IRIs, plus IFP and owl:sameAs Usage Stats
VoiD Graph Providing Rich Description of Hosted Data Sets
If OpenLink does not host it with enough capacity or the right data, you can procure your own infrastructure and get the software from us.
From now on, anybody who chooses can be a search and analytics player.

Technology : 

SPARQL Augmented with Run Time Inferencing
Entity Ranks for Better Search
Anytime Query Answering for Quick Approximate Results
A User Interface Combining Discovery and Query Building
Easy Web Services APIs and SPARQL for Developing Applications

Run Time Taxonomies : 

No Materialization, Select Taxonomy at Query Time
Query Optimization Knows About Class and Property Hierarchies
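
To make the idea of selecting a taxonomy at query time concrete, here is a minimal sketch in Python. It is a toy in-memory model, not Virtuoso's implementation: the function names, the sample data, and the representation of a taxonomy as (subclass, superclass) pairs are all assumptions made for illustration. The stored type assertions are never expanded; the subclass closure is computed only when a query asks for it, so a different taxonomy can be chosen for each query.

```python
from collections import defaultdict

def subclass_closure(taxonomy, cls):
    """Return cls plus all of its transitive subclasses in the given taxonomy."""
    children = defaultdict(set)
    for sub, sup in taxonomy:
        children[sup].add(sub)
    seen, stack = {cls}, [cls]
    while stack:
        for sub in children[stack.pop()]:
            if sub not in seen:
                seen.add(sub)
                stack.append(sub)
    return seen

def instances_of(type_triples, taxonomy, cls):
    """Find instances of cls under a taxonomy chosen per query.

    type_triples holds only the asserted (subject, class) pairs;
    nothing inferred is ever written back to it.
    """
    wanted = subclass_closure(taxonomy, cls)
    return sorted(s for s, c in type_triples if c in wanted)

# Hypothetical sample data: asserted types and two alternative taxonomies.
TYPES = [("rex", "Dog"), ("tom", "Cat"), ("oak", "Tree")]
TAXONOMY_A = [("Dog", "Mammal"), ("Cat", "Mammal"), ("Mammal", "Animal")]
TAXONOMY_B = [("Dog", "Pet")]

animals = instances_of(TYPES, TAXONOMY_A, "Animal")
pets = instances_of(TYPES, TAXONOMY_B, "Pet")
```

The same stored data answers both queries differently because the hierarchy is an input to the query rather than part of the loaded data.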

Run Time Identity : 

Optionally Follow owl:sameAs Links
Optionally Consider Any Two Subjects Sharing an IFP (Inverse Functional Property) Value to Be the Same
No Materialization: Control sameAs and IFP Following at Query Time, at the Triple Pattern Level
For Ad Hoc, Do Identity at Run Time
For Deep Analytics and Batch Processing, Normalize Identities at Load Time
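
The run time identity options can be sketched as follows. This is an illustrative Python model under stated assumptions, not Virtuoso's mechanism: it uses a union-find structure built per query, the function names and sample triples are hypothetical, and the switches apply to a whole lookup rather than to individual triple patterns as the slide describes.

```python
def identity_groups(triples, follow_same_as=False, ifps=()):
    """Build a find() function that maps each node to its identity group.

    Groups are formed only from the options switched on for this query;
    the stored triples are left untouched (no materialization).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_with_value = {}
    for s, p, o in triples:
        if follow_same_as and p == "owl:sameAs":
            union(s, o)
        elif p in ifps:
            # Two subjects sharing a value of an inverse functional
            # property are treated as the same entity.
            key = (p, o)
            if key in first_with_value:
                union(first_with_value[key], s)
            else:
                first_with_value[key] = s
    return find

def values_of(triples, subject, predicate, **identity_options):
    """Values of predicate for subject and everything identified with it."""
    find = identity_groups(triples, **identity_options)
    target = find(subject)
    return sorted({o for s, p, o in triples
                   if p == predicate and find(s) == target})

# Hypothetical sample data.
TRIPLES = [
    ("ex:a", "owl:sameAs", "ex:b"),
    ("ex:a", "foaf:name", "Alice"),
    ("ex:b", "foaf:nick", "al"),
    ("ex:c", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:a", "foaf:mbox", "mailto:alice@example.org"),
    ("ex:c", "foaf:nick", "ally"),
]

plain = values_of(TRIPLES, "ex:a", "foaf:nick")
merged = values_of(TRIPLES, "ex:a", "foaf:nick",
                   follow_same_as=True, ifps=("foaf:mbox",))
```

Normalizing identities at load time would instead rewrite every subject to its group representative once, which is the trade-off the slide draws between ad hoc use and batch analytics.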

Entity Ranking : 

References and the Rank of the Referrer Contribute to Rank, as in Web Search
Can Customize Weight by Graph, Predicate
Can Run Ranks on Selected Subsets
Ranks Are Calculated in a Batch Run
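
The slide does not give the ranking formula. As a rough illustration of "references and the rank of the referrer contribute to rank", the sketch below swaps in a simplified, PageRank-style batch iteration with per-predicate weights. The function name, the damping factor, the iteration count, and the sample links are all assumptions, and rank held by entities with no outgoing links is simply dropped rather than redistributed.

```python
def entity_ranks(links, predicate_weight=None, damping=0.85, iterations=30):
    """Batch-compute ranks from (referrer, predicate, referenced) links.

    predicate_weight maps a predicate to a relative weight (default 1.0);
    a weight of 0 makes links with that predicate contribute nothing.
    """
    predicate_weight = predicate_weight or {}
    nodes = sorted({n for s, _, o in links for n in (s, o)})
    out_weight = {n: 0.0 for n in nodes}
    edges = []
    for s, p, o in links:
        w = predicate_weight.get(p, 1.0)
        if w > 0:
            edges.append((s, o, w))
            out_weight[s] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for s, o, w in edges:
            # Each referrer passes on a share of its own rank,
            # split across its outgoing links by weight.
            nxt[o] += damping * rank[s] * w / out_weight[s]
        rank = nxt
    return rank

# Hypothetical sample links.
LINKS = [("a", "cites", "c"), ("b", "cites", "c"), ("c", "cites", "a")]
ranks = entity_ranks(LINKS, iterations=50)

WEIGHTED = [("a", "cites", "x"), ("a", "mentions", "y")]
weighted_ranks = entity_ranks(WEIGHTED, predicate_weight={"mentions": 0.1})
```

Running the same function over a filtered list of links corresponds to ranking a selected subset.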

Entity Name Service : 

Autocompletion of URIs
Autocompletion of Label-Like Properties
Ranked List of Synonyms
Statistics on Where a URI Is Defined and Where It Is Referenced

Virtuoso Anytime Query Feature : 

Partial Results in Fixed Time
Useful for Interactive Browsing and Query Development over Large Data Sets (e.g. LOD Cloud)
On Public SPARQL Endpoints, Protects Against DoS While Still Giving Samples of the Answers
Metering of Query Resource Utilization
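
The "partial results in fixed time" behaviour can be illustrated with a small Python sketch. This shows the general pattern only, under the assumption that results arrive as a stream of rows; it says nothing about how Virtuoso implements the feature internally. The function name and the injectable clock parameter are hypothetical, the latter added so the behaviour can be checked deterministically.

```python
import time

def anytime(rows, budget_seconds, clock=time.monotonic):
    """Collect rows until the time budget runs out.

    Returns (results, complete). complete is False when the budget
    expired first, in which case results is a sample of the full answer.
    """
    deadline = clock() + budget_seconds
    results = []
    for row in rows:
        if clock() >= deadline:
            return results, False
        results.append(row)
    return results, True

# A generator stands in for a long-running query over a large data set.
def slow_rows(n):
    for i in range(n):
        yield i

sample, complete = anytime(slow_rows(5), budget_seconds=60.0)
```

A caller such as a faceted browser can show the sample immediately and mark it as approximate whenever the second value is False.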

The LOD Cloud Faceted Search, Find, and Lookup Services : 

Access via Web Services, SPARQL
Developed in Virtuoso Using SQL, SPARQL, and Stored Procedures
Part of Virtuoso 6.x Open Source Edition (Single Server Edition only)

Experience : 

If Data Is in Memory, Interactive Response Time and Linear Scale
RDF-Aware Query Optimizer Is Key
Parallel Execution Engine: 1 Thread per Query per Partition
For Generic Linked Data, RDF Representation with 4 Indices, plus Full Text Indexing on Literal Objects
For Specialized Tasks, SQL + Stored Procedures with Parallel Programming Model (of course output will be Linked Data)
Unlimited Cross-Partition Joining and Near-Full Platform Utilization, Not a Problem with the Right Message Flow

Some Performance Data : 

Current Live Instance Setup: 2 Linux boxes, each with 2 x 4-core Xeons and 32G RAM, for a data set in excess of 4.2 billion triples
3.2 Million Single Triple Lookups per Second
Load Rates over 100K Triples/sec
Entity Ranks for 4.2 GTriples in 30 Minutes per Iteration
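
A little arithmetic puts these figures in perspective. The sketch below is a back-of-the-envelope calculation, not a measurement. It assumes the setup means 2 boxes with 2 sockets of 4 cores each (16 cores in total), and it treats the quoted load rate as a floor, so the derived load time is an upper bound.

```python
TRIPLES = 4_200_000_000          # data set size quoted on the slide
LOOKUPS_PER_SEC = 3_200_000      # single triple lookups per second
LOAD_TRIPLES_PER_SEC = 100_000   # quoted as "over 100K", so a lower bound
CORES = 2 * 2 * 4                # assumption: 2 boxes x 2 sockets x 4 cores

def lookups_per_core(lookups_per_sec, cores):
    """Average single triple lookups per second per core."""
    return lookups_per_sec / cores

def load_hours(triples, triples_per_sec):
    """Hours needed to load the data set at a constant rate."""
    return triples / triples_per_sec / 3600

per_core = lookups_per_core(LOOKUPS_PER_SEC, CORES)
hours = load_hours(TRIPLES, LOAD_TRIPLES_PER_SEC)
print(f"{per_core:,.0f} lookups/sec/core, full load in at most {hours:.1f} hours")
```

Under these assumptions each core handles about 200,000 lookups per second, and reloading the whole data set takes a little under 12 hours.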

Deployment : 

Rent or Buy? To Handle 10 GTriples 100% in RAM, or 50 GTriples with a Decent Working Set:
For Intermittent Use, 1TB of RAM and 256 Virtual Cores at EC2 Is $1228 per Day
For Purchase, a Cluster with 1TB of RAM and 120 Nehalem Cores Lists Around $75K
* April 2009 US retail prices
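
The rent-or-buy question comes down to a break-even calculation on the two prices quoted above. The sketch below is deliberately simplified: it ignores power, hosting, staff, and depreciation for the purchased cluster, and it assumes the two configurations are interchangeable for the workload even though their core counts differ. The function names are hypothetical.

```python
import math

EC2_PER_DAY_USD = 1228       # 1TB RAM, 256 virtual cores, per day
CLUSTER_LIST_USD = 75_000    # 1TB RAM, 120 Nehalem cores, list price

def break_even_days(purchase_usd, rent_per_day_usd):
    """First whole day on which cumulative rent reaches the purchase price."""
    return math.ceil(purchase_usd / rent_per_day_usd)

def cheaper_option(days_of_use,
                   purchase_usd=CLUSTER_LIST_USD,
                   rent_per_day_usd=EC2_PER_DAY_USD):
    """Return "rent" or "buy" for a given number of days of actual use."""
    return "rent" if days_of_use * rent_per_day_usd < purchase_usd else "buy"

days = break_even_days(CLUSTER_LIST_USD, EC2_PER_DAY_USD)
```

At these prices renting stays cheaper for roughly the first two months of cumulative use, which is why the slide ties renting to intermittent use.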

Conclusions : 

Applications exploiting open data access across heterogeneous data sources at Web scale are now within anyone's reach
Usable for public Web sites or for in-house business analytics
Web-enabled Open Data Access & Analysis for All!
