nlGis: A Use Case in Linked Historic Geodata


Wouter Beek wouter@triply.cc, VU University Amsterdam, Triply

Richard Zijdeman richard.zijdeman@iisg.nl, International Institute for Social History

June 3rd, 2018

Problem statement

Example of an artifact with poor geodata: link

Geodata in today's LOD Cloud…

nlGis datasets

https://druid.datalegend.net/nlgis

Dataset              | № statements | Main concepts             | № geometries | Timeframe
CShapes              |        6,120 | countries, cities         |          510 | 1920-present
Mint Authorities     |        6,987 | authorities, houses       |          950 | 565-present
Gemeentegeschiedenis |       46,929 | municipalities, provinces |        3,219 | 1813-present
nlGis (total)        |       60,036 | features, geometries      |        4,679 |

Lessons learned

  1. Combine what belongs together
  2. Do not use ambiguous ‘null’ values
  3. No perfect tool for data transformation
  4. No perfect triple store for geo
  5. Direct feedback helps a lot
  6. Use interoperable representations

[1] Combine what belongs together

Date/time


:Greece iisg:cowStartYear "1946"^^xsd:gYear ;
        iisg:cowStartMonth "1"^^xsd:gMonth ;
        iisg:cowStartDay "1"^^xsd:gDay .
            

:Greece iisg:cowStart "1946-01-01"^^xsd:date .
            

(In CShapes, ‘cow’ stands for Correlates of War.)
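
A minimal sketch of the merge step, assuming the three parts arrive as plain strings. The helper names `combine_date` and `to_turtle` are hypothetical. Python's `date` only covers years 1 to 9999, so this would not work as-is for BCE dates.

```python
from datetime import date


def combine_date(year: str, month: str, day: str) -> str:
    """Merge separate year/month/day values into one ISO 8601 date string."""
    # date() validates the combination as a whole, so an impossible date
    # such as 31 February is rejected instead of being stored in three parts.
    return date(int(year), int(month), int(day)).isoformat()


def to_turtle(subject: str, predicate: str, iso_date: str) -> str:
    """Render a single xsd:date statement in Turtle syntax."""
    return f'{subject} {predicate} "{iso_date}"^^xsd:date .'


combined = combine_date("1946", "1", "1")
print(to_turtle(":Greece", "iisg:cowStart", combined))
```

Validating the date as one value is where the bug prevention comes from: three independent literals can each be valid while their combination is not.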

Combining what belongs together prevents bugs

Longitude/latitude


:somewhere wgs84:lat "..." ;
           wgs84:lat "..." ;
           wgs84:long "..." ;
           wgs84:long "..." .
            

There is also wgs84:lat_long, but it is almost never used.

There are many instances of this!
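
The ambiguity can be made concrete with a small sketch. The coordinate values below are hypothetical, since the slide elides them. With two latitudes and two longitudes on one resource, a client cannot tell which latitude belongs to which longitude.

```python
from itertools import product

# Hypothetical values standing in for the elided "..." literals.
lats = ["52.37", "37.98"]
longs = ["4.90", "23.73"]

# Separate properties: every latitude may pair with every longitude.
candidates = list(product(lats, longs))
print(len(candidates), "possible points:", candidates)

# Combined representation: each point keeps its own latitude and longitude.
points = [("52.37", "4.90"), ("37.98", "23.73")]
print(len(points), "actual points:", points)
```

Only two of the four candidate pairs are real locations; the other two are artifacts of storing the parts separately.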

OCLC VIAF example


:EmmaGoldman schema:givenName "Ema" ;
             schema:givenName "Ėmma" ;
             schema:familyName "Gol'dman" ;
             schema:familyName "Gōrudoman" .
            

There are many other instances of this problem, e.g., foaf:firstName and foaf:lastName.

[2] Do not use ambiguous ‘null’ values

CShapes uses -1 to denote an unknown year.

In the context of CShapes (countries after 1920) this makes sense.

But on the web we can query CShapes and Pleiades together, and Pleiades describes ancient places, where a negative year can be a real date.
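
One way to avoid the ambiguity is to drop the sentinel during transformation and leave the statement out when the year is unknown. This is a sketch under that assumption, not the procedure used for nlGis; the function names are hypothetical.

```python
from typing import Optional

UNKNOWN_YEAR_SENTINEL = -1  # CShapes convention for "unknown year"


def parse_cshapes_year(raw: int) -> Optional[int]:
    """Map the CShapes sentinel to None; pass real years through unchanged."""
    return None if raw == UNKNOWN_YEAR_SENTINEL else raw


def year_triple(subject: str, predicate: str, raw: int) -> Optional[str]:
    """Return a Turtle statement, or None when the year is unknown."""
    year = parse_cshapes_year(raw)
    if year is None:
        # Emit nothing rather than assert a year that other datasets
        # could read as a genuine date.
        return None
    return f'{subject} {predicate} "{year:04d}"^^xsd:gYear .'


print(year_triple(":Greece", "iisg:cowStartYear", 1946))
print(year_triple(":Greece", "iisg:cowStartYear", -1))
```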

[3] No perfect tool for data transformation

Requirements:

  1. Support multiple source formats
  2. Scale to datasets of arbitrary size

No currently available data transformation tool implements these two core requirements.

Support multiple source formats

  • CSV
  • (Geo)JSON
  • XML (GML/MARCXML/EAD)
  • relational DB
  • RDF

Proprietary formats, e.g., ESRI Shapefile, can sometimes be transformed into open formats.

Scale to datasets of arbitrary size

Be able to stream through the data at the required granularity level.


                # , name      , population , shape
                1 , Amsterdam , 1.3M       , MultiPolygon((...))
                2 , Athens    , 3.1M       , MultiPolygon((...))
                ...
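
For a table like the one above, the required granularity is one row at a time. A minimal streaming sketch with Python's standard `csv` module follows. The function name `stream_rows` is hypothetical, and the in-memory sample stands in for a file of arbitrary size.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def stream_rows(source: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one record at a time so memory use does not grow with file size."""
    reader = csv.DictReader(source, skipinitialspace=True)
    for row in reader:
        # Strip the padding used to align the columns in the sample.
        yield {key.strip(): value.strip() for key, value in row.items()}


# The shapes stay elided as in the slide. A real WKT value contains commas
# and would have to be quoted in the CSV source.
sample = io.StringIO(
    "# , name      , population , shape\n"
    "1 , Amsterdam , 1.3M       , MultiPolygon((...))\n"
    "2 , Athens    , 3.1M       , MultiPolygon((...))\n"
)

names = [row["name"] for row in stream_rows(sample)]
print(names)
```

Because `stream_rows` is a generator, each record can be transformed and written out before the next one is read.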
            

[4] No perfect triple store for geo

GeoSPARQL support is either absent, not standards-compliant, or not performant.

  • Most stores do not implement GeoSPARQL syntax, but some do.
  • Most stores have miserable/unusable performance, but some have good performance.
  • Some stores change the data merely by loading it.
  • Some stores cannot load larger shapes.
  • Commercial stores are not necessarily better than FOSS (in fact, they are very often worse).

[5] Direct feedback helps a lot

When writing GeoSPARQL queries, a table of results is not enough.
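
One way to get visual feedback is to turn result rows into GeoJSON, which most web map widgets can draw. The sketch below handles only simple `POINT(x y)` literals. The helper name `row_to_feature` and the sample row are assumptions for illustration, not part of any query tool.

```python
import json
import re

# Matches simple WKT points such as "POINT(4.9 52.37)"; nothing else.
POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$",
    re.IGNORECASE,
)


def row_to_feature(label: str, wkt: str) -> dict:
    """Convert one (label, WKT point) result row into a GeoJSON Feature."""
    match = POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"unsupported WKT in this sketch: {wkt!r}")
    x, y = (float(group) for group in match.groups())
    return {
        "type": "Feature",
        "properties": {"label": label},
        "geometry": {"type": "Point", "coordinates": [x, y]},
    }


# Hypothetical result row from a GeoSPARQL query.
feature = row_to_feature("Amsterdam", "POINT(4.9 52.37)")
collection = {"type": "FeatureCollection", "features": [feature]}
print(json.dumps(collection))
```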

[6] Use interoperable representations

Options for representing geodata in LOD

  • WGS84 Geo Positioning Vocabulary (W3C)
  • GeoSPARQL (OGC)
    • Well-Known Text (WKT)
    • Geography Markup Language (GML)
  • Make up your own vocabulary
  • GeoJSON + JSON-LD
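
To compare the first two options, the sketch below renders one hypothetical point both ways. It assumes the GeoSPARQL default coordinate reference system (CRS84), in which WKT lists longitude before latitude. The `:somewhere` subject and the coordinate values are illustrative only.

```python
def to_wgs84_turtle(subject: str, lat: float, long: float) -> str:
    """Render a point with the W3C WGS84 vocabulary (two separate properties)."""
    return f'{subject} wgs84:lat "{lat}" ;\n  wgs84:long "{long}" .'


def to_wkt_point(lat: float, long: float) -> str:
    """Render a WKT point; note the longitude-first order under CRS84."""
    return f"POINT({long} {lat})"


def to_geosparql_turtle(subject: str, lat: float, long: float) -> str:
    """Render the same point as a GeoSPARQL geometry with a WKT literal."""
    wkt = to_wkt_point(lat, long)
    return (
        f"{subject} geo:hasGeometry [\n"
        f'  geo:asWKT "{wkt}"^^geo:wktLiteral\n'
        f"] ."
    )


print(to_wgs84_turtle(":somewhere", 52.37, 4.9))
print(to_geosparql_turtle(":somewhere", 52.37, 4.9))
```

Here `geo:` is assumed to be bound to the OGC GeoSPARQL namespace `http://www.opengis.net/ont/geosparql#`.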

Pleiades


prefix geo: <http://data.ordnancesurvey.co.uk/ontology/geometry/>
place:Athens a lawd:Place ;
  geo:hasGeometry [
    geo:asWKT "LineString(5.16 52.05,…)" ;
  ] ;
            

Without interoperable representations:

  • clients do not know what to do with your data
  • triple store cannot index your geometries
  • reasoners arrive at contradictions

After applying standards…

LOD-a-lot







Large-scale empirical analyses
lod-a-lot.lod.labs.vu.nl

Linked Geodata vocabulary use

Property        | № statements | № documents
wgs84:alt       |    2,349,607 |       9,843
wgs84:lat       |   42,883,363 |      11,134
wgs84:lat_long  |          283 |         173
wgs84:location  |   14,688,561 |         117
wgs84:long      |   42,916,785 |      11,134
geo:asGML       |            0 |           1
geo:asWKT       |  188,427,329 |          50
geo:hasGeometry |   28,366,268 |           7

Based on the LOD-a-lot data collection (Fernández et al. 2017).

GeoJSON + JSON-LD

Unfortunately, these two popular formats are incompatible:

  • GeoJSON uses square brackets to denote (nested) lists of geographic coordinates.
  • JSON-LD uses square brackets as shorthand for asserting the same property multiple times.

This may be fixed in a future version of the JSON-LD standard.
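
The clash can be illustrated without a JSON-LD processor. The sketch below models the "repeated property" reading of an array as an unordered set, which is an assumption made purely for illustration, and shows that a GeoJSON coordinate pair loses its meaning under that reading.

```python
import json

# A GeoJSON point: the array is an ordered list (first x, then y).
geojson_point = json.loads('{"type": "Point", "coordinates": [4.9, 52.37]}')


def as_repeated_property(values: list) -> frozenset:
    """Model an array as unordered repeated values of one property.

    This only imitates the reading described above; it is not a JSON-LD parser.
    """
    return frozenset(values)


original = as_repeated_property(geojson_point["coordinates"])
swapped = as_repeated_property([52.37, 4.9])

# Under the set reading, the two orders are indistinguishable.
print(original == swapped)

# A point whose two coordinates are equal even loses a value.
print(len(as_repeated_property([5.0, 5.0])))
```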

nlGis datasets

https://druid.datalegend.net/nlgis


Dutch Cultural Heritage institutes already use these datasets to annotate their collections (example).

Future work

  • Cover more places & times.
  • Annotate Cultural Heritage objects with detailed geographic information.
  • Create a standardized vocabulary for how geolocations change through time.
  • Improve GeoSPARQL support in triple stores.
  • Explore new ways of displaying Cultural Heritage objects in space and time (example).

Thank you!

nlGis: https://druid.datalegend.net/nlgis/

