Redeploying the Semantic Web

Wouter Beek

w.g.j.beek@vu.nl

Knowledge Representation & Reasoning Group
VU University Amsterdam

Slides at http://wouterbeek.github.io

The first deployment of the Semantic Web has failed

    Proof that the SW failed

  • After the first 15 years of SW deployment, most data cannot be automatically:
    • found
    • read
    • queried
    • reasoned over
  • Clash between semantics and pragmatics
    • Theory of identity vs practice of linking
    • Formal meaning vs social meaning
    • Universal statements vs trivia

Problem 1

Most data cannot be found

  • The state of the art is comparable to the Yahoo! index anno 1995: a hierarchy of links / catalogues (CKAN, LOV, VoID Store)
  • Most SW datasets are not available online.
  • Most online datasets are not registered in any catalogue.
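To make the catalogue-centric state of the art concrete, here is a minimal sketch of hunting for RDF datasets through a CKAN catalogue's search API. The catalogue URL is an assumption; any CKAN instance exposes the same Action API:

    # Minimal sketch: searching a CKAN catalogue via its Action API.
    # The catalogue URL is an assumption; substitute any CKAN instance.
    import requests

    CKAN = "https://demo.ckan.org"  # hypothetical catalogue

    resp = requests.get(
        f"{CKAN}/api/3/action/package_search",
        params={"q": "rdf", "rows": 10},
        timeout=30,
    )
    resp.raise_for_status()
    for pkg in resp.json()["result"]["results"]:
        # Each dataset may expose several downloadable resources.
        for res in pkg.get("resources", []):
            print(pkg["name"], res.get("format"), res.get("url"))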

Problem 2

Most data cannot be read

  • Most online data files are not fully standards-compliant.


This is probably comparable to HTML on the WWW, but the WWW has:

  1. tools optimized for common errors
  2. consumers with human-level intelligence
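A minimal sketch of what point 1 means in practice, using the rdflib Python library (the document URL is hypothetical): a standards-compliant RDF parser rejects a whole document on the first error, with none of the error recovery HTML tooling provides:

    # Minimal sketch: a standards-compliant parser fails hard on the
    # first syntax error instead of recovering. The URL is hypothetical.
    from rdflib import Graph

    g = Graph()
    try:
        g.parse("http://example.org/data.ttl", format="turtle")
        print(f"Parsed {len(g)} triples")
    except Exception as e:  # e.g. BadSyntax from the Turtle parser
        # One malformed token loses the entire document.
        print(f"Document rejected: {e}")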

Problem 3

Most data cannot be queried

  • Data dumps are the most popular deployment strategy
  • Many live queryable datasets have a custom API
  • Most custom APIs are not self-describing
  • Many SPARQL endpoints enforce restrictions
  • Most SPARQL endpoints that do not enforce restrictions have low availability
  • Different SPARQL endpoints enforce different restrictions
  • Different SPARQL endpoints implement different subsets of different versions of the SPARQL standard
  • Web-scale federated querying has not even been considered
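To make the restrictions concrete, a minimal client-side sketch using SPARQLWrapper (DBpedia's endpoint is used purely as an example; the 10,000-row cap is a commonly observed restriction, not a standard): paging around a server-enforced result-size limit, one of the endpoint-specific restrictions listed above:

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
    endpoint.setReturnFormat(JSON)

    offset, page_size, more = 0, 10_000, True
    while more:
        endpoint.setQuery(f"""
            PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
            SELECT ?s ?label WHERE {{ ?s rdfs:label ?label }}
            LIMIT {page_size} OFFSET {offset}
        """)
        rows = endpoint.query().convert()["results"]["bindings"]
        more = len(rows) == page_size  # a short page means we are done
        offset += page_size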

Problem 3

Most data cannot be queried

There are at least millions of data documents but only hundreds of live query endpoints with reasonable availability.

Existing deployment techniques are unable to close the gap between downloadable data dumps and live queryable data: data is growing faster than SPARQL deployment uptake

Query endpoint availability according to LODStats

Problem 4

Most data cannot be reasoned over

  • No standardized way of selecting the entailment regime in SPARQL
  • Some entailment results cannot be expressed in RDF
  • Most triple stores only implement subsets of RDF(S) and OWL entailment
  • Different triple stores implement different subsets of different versions of RDF(S) and OWL
  • Web-scale reasoning has only been performed in the lab
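Since no entailment regime can be selected reliably on a remote endpoint, entailments are often materialized client-side instead. A minimal sketch with rdflib and the owlrl library (the namespaces are illustrative):

    # Minimal sketch: materializing the RDFS closure client-side with
    # the owlrl library. Namespaces are illustrative.
    from rdflib import Graph, Namespace, RDF, RDFS
    import owlrl

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Tent, RDFS.subClassOf, EX.Product))
    g.add((EX.item1024, RDF.type, EX.Tent))

    # Forward-chain the graph under RDFS semantics.
    owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

    assert (EX.item1024, RDF.type, EX.Product) in g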

How to redeploy the Semantic Web?

  1. Data collection
  2. Data cleaning
  3. Data publishing
  4. Web-scale BGP answering
  5. Web-scale backwards chaining

[1] Data collection

  • Scrape catalogues
    • Custom API (CKAN)
    • HTML (VoID Store, LOV)
  • Interpret metadata vocabularies (VoID, DCAT)
  • Scrape the WWW for RDFa, Schema.org (Mika2012)
  • Crawl dereferenceable IRIs (see the sketch after this list)
  • Hand-craft a seed list
  • Crowd-source the seed list
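A minimal sketch of the crawling step: dereference an IRI with content negotiation and harvest the triples it serves. The IRI is only an example:

    import requests
    from rdflib import Graph

    iri = "http://dbpedia.org/resource/Amsterdam"
    resp = requests.get(
        iri,
        headers={"Accept": "text/turtle"},  # ask for RDF, not HTML
        timeout=30,                         # 303 redirects are followed
    )
    g = Graph()
    g.parse(data=resp.text, format="turtle")
    print(f"{len(g)} triples harvested from {iri}")
    # IRIs mentioned in g feed the crawl frontier for the next round.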

[2] Data cleaning

Example: Freebase 'Monkey' and LinkedBrainz (Linked MusicBrainz) labels: fewer than 10% syntactically correct

Why is data dirty?

  • Character encoding issues
  • Socket errors
  • Protocol errors
  • Corrupted archives
  • Authentication problems
  • Syntax errors
  • Wrong metadata
  • Lexical form ↛ value
  • Logically inconsistent
  • ...
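A toy sketch of the kind of automated cleaning this list motivates, exploiting the fact that N-Triples is line-based, so each statement can be validated in isolation and a syntax error costs one triple rather than the whole document (rdflib is used for per-line validation):

    from rdflib import Graph

    def clean_ntriples(in_path: str, out_path: str) -> tuple[int, int]:
        kept = dropped = 0
        with open(in_path, encoding="utf-8", errors="replace") as src, \
             open(out_path, "w", encoding="utf-8") as dst:
            for line in src:
                if not line.strip():
                    continue
                try:
                    Graph().parse(data=line, format="nt")  # validate one line
                    dst.write(line)
                    kept += 1
                except Exception:
                    dropped += 1  # encoding and syntax errors end up here
        return kept, dropped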

Previous solutions for data cleaning

  • Standards
  • Guidelines
  • Best practices
  • Tools

Targeted towards human data publishers

Inherently slow approaches

Mixed results after the first 15 years of deployment.

New solution for data cleaning

(1) Automate conformity to standards

"Days not decades"



(2) Tools → Web Service

Compare: email, where standalone tools gave way to web services

LOD Laundromat

The Clean Linked Data programming platform



http://lodlaundromat.org
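A minimal sketch of consuming LOD Laundromat output: cleaned documents were published as gzipped N-Triples/N-Quads under a content hash at download.lodlaundromat.org. The hash below is a placeholder, and the service's current availability is not guaranteed:

    import gzip
    import requests

    doc_hash = "0123456789abcdef0123456789abcdef"  # placeholder
    url = f"http://download.lodlaundromat.org/{doc_hash}"

    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with gzip.open(resp.raw, "rt", encoding="utf-8") as lines:
            for line in lines:
                ...  # each line is a cleaned, syntactically valid statement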

[3] Data publishing

The cost/benefit model of SW data publishing is broken

The incentive model for data publishing is the wrong way around:

  • Publishing large volumes of high-quality data is penalized.
  • Consuming large volumes of data / asking DDoS-like queries is free.

Fix the cost/benefit model of SW publishing

(1) Publishing data should be (nearly) free

(2a) Asking more questions should cost more (CPU)

(2b) Asking more complex questions should cost more (CPU)

HDT + SSD + LDF


HDT: Disk-based, efficient yet queryable storage



SSD: Disks become faster and cheaper



LDF: BGP queries require client-side joins
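A minimal sketch of the client side of LDF: fetch a single Triple Pattern Fragment page; the server only ever answers one triple pattern, and joins happen in the client. The fragment URL and parameter names are illustrative; a real client discovers them from the server's hydra metadata:

    import requests
    from rdflib import Graph

    FRAGMENTS = "https://fragments.dbpedia.org/2016-04/en"  # example server

    resp = requests.get(
        FRAGMENTS,
        params={"predicate": "http://www.w3.org/2000/01/rdf-schema#label"},
        headers={"Accept": "text/turtle"},
        timeout=30,
    )
    page = Graph()
    page.parse(data=resp.text, format="turtle")
    # `page` holds one page of matches plus hydra paging/count metadata;
    # the server only ever answered a single triple pattern.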

[4] Streamed BGP answering

  • LOD Cloud-wide triple pattern selectivity estimates
  • Client-side query reordering
  • Client-side symmetric hash joins
  • Culprit 1: Identity closure (800M)
  • Culprit 2: Natural language index (1.5-2B)
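The symmetric hash join mentioned above can be sketched in a few lines: both input streams are consumed incrementally, each binding is inserted into its own hash table and probed against the other, so answers stream out as soon as they exist (all names are illustrative):

    from collections import defaultdict
    from itertools import zip_longest

    def symmetric_hash_join(left, right, join_vars):
        key = lambda b: tuple(b[v] for v in join_vars)
        tables = (defaultdict(list), defaultdict(list))
        for l, r in zip_longest(left, right):
            for binding, own, other in ((l, 0, 1), (r, 1, 0)):
                if binding is None:
                    continue
                k = key(binding)
                tables[own][k].append(binding)  # insert into own table
                for match in tables[other][k]:  # probe the other table
                    yield {**binding, **match}

    # Example: join two binding streams on ?s
    lhs = [{"s": "item1024", "type": "Tent"}]
    rhs = [{"s": "item1024", "soldAt": "shop72"}]
    print(list(symmetric_hash_join(lhs, rhs, ["s"])))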

LOTUS

Linked Open Text Unleashed

LOTUS in numbers

Metric                      Number
# literals encountered      12,018,939,378
# integers and dates        6,699,148,542
# indexed lexical strings   5,319,790,836
# distinct sources          508,244
# distinct language tags    713
# hours to create index     56
disk space use              484.77 GB
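A minimal sketch of a LOTUS text-index lookup over those 5.3B indexed strings. The endpoint, parameters, and response shape follow the LOTUS documentation of the time and should be treated as assumptions about a service that may have changed:

    import requests

    resp = requests.get(
        "http://lotus.lodlaundromat.org/retrieve",
        params={"string": "tent", "match": "phrase", "rank": "psf", "size": 10},
        timeout=30,
    )
    for hit in resp.json().get("hits", []):
        print(hit)  # each hit links a literal back to subject and source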

[5] Streamed backwards chaining
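As a toy illustration of the technique itself: backward chaining answers a goal on demand instead of materializing a closure up front. A sketch for the single RDFS subclass rule, in contrast to the forward-chaining closure shown under Problem 4 (names are illustrative):

    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.Tent, RDFS.subClassOf, EX.Product))
    g.add((EX.item1024, RDF.type, EX.Tent))

    def instances_of(cls, seen=frozenset()):
        """Stream all x with `x rdf:type cls`, derived on demand."""
        yield from g.subjects(RDF.type, cls)          # asserted facts
        for sub in g.subjects(RDFS.subClassOf, cls):  # rule: look below
            if sub not in seen:
                yield from instances_of(sub, seen | {cls})

    print(list(instances_of(EX.Product)))  # -> [EX.item1024]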

Redeploying the SW opens up a new research agenda

[I] Redefinition of incentive model

Redeploying the SW opens up a new research agenda

[II] Better evaluations

Semantic Web research
=
optimizing algorithms for DBpedia

Datasets used in ISWC 2014 research track papers

17 datasets are used in total

1-6 datasets per article

2 datasets per article on average

Rerunning RDF Vault (1/2)

Rerunning RDF Vault (2/2)

Rerunning Fernandez 2013 (1/2)

              Original                          Rerun
Triples (M)   Docs  Size (MB)   Compr. rate    Docs  Size (MB)   Compr. rate
1             1     89.07       3.73%          179   183.31      11.23%
5             1     444.71      3.48%          74    799.98      4.99%
10            1     893.39      3.27%          50    1,642.60    5.43%
20            1     1,790.41    3.31%          17    3,328.57    4.15%
30            1     2,680.51    3.27%          19    4,880.26    5.09%
40            1     3,574.59    3.26%          8     6,586.95    7.25%

Rerunning Fernandez 2013 (2/2)

Relate HDT compression rate to average degree

Avg. degree   Docs   Compr. rate
1-5           92     21.68%
5-10          80     6.67%
10-∞          99     4.85%

Redeploying the SW opens up a new research agenda

[III] Study the SW as a complex system

Formal meaning vs social meaning

	            
    abox:item1024 rdf:type    tbox:Tent   .
    abox:item1024 tbox:soldAt abox:shop72 .
    abox:shop72   rdf:type    tbox:Store  .

    fy:jufn1024 pe:ko9sap_    fyufnt:Ufou  .
    fy:jufn1024 fyufnt:tmffqt fy:aHup      .
    fy:aHup     pe:ko9sap_    fyufnt:70342 .

These graphs denote the same models.

According to model theory, IRIs are individual constants or predicate letters whose names are chosen arbitrarily and thus carry no meaning.


Try to refute the hypothesis that names and meanings are independent.
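One way to operationalize the hypothesis: apply an arbitrary injective renaming to every IRI in a graph (a toy character shift below, echoing the obfuscated graph above). The renamed graph is model-theoretically indistinguishable from the original, so whatever the names communicate is social rather than formal meaning:

    from rdflib import Graph, URIRef

    def rename(g: Graph, shift: int = 1) -> Graph:
        """Caesar-shift the lowercase letters of every IRI."""
        def f(term):
            if isinstance(term, URIRef):
                return URIRef("".join(
                    chr((ord(c) - 97 + shift) % 26 + 97) if c.islower() else c
                    for c in str(term)))
            return term
        out = Graph()
        for s, p, o in g:
            out.add((f(s), f(p), f(o)))
        return out

    g = Graph()
    g.parse(data="""
        @prefix ex: <http://example.org/> .
        ex:item1024 a ex:Tent ; ex:soldAt ex:shop72 .
    """, format="turtle")

    assert len(rename(g)) == len(g)  # same structure, scrambled names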

Hypothesis testing over 544K datasets

Applications

Linked Data Journalism

  1. Big Data platform
  2. Entity Detection: the automatic identification of concepts, people, places, ...
  3. Linking: between concepts/people/places/etc. and data about them
  4. Aggregation: Data → Information
  5. Contextualization: Information → Knowledge