Using the LOD2 stack: linking datasets
The LOD2 project aims to make publishing Linked Data in RDF format easier. In this blog post,
I guide you through the process of interlinking two datasets using the LOD2 stack.
The tutorial is based on our work for the LOD2 project and is also available on
the wiki of the LOD2 stack.
The Digital Agenda Scoreboard publishes information about the penetration of ICT in the
daily life of Europeans. One such indicator is the percentage of households with access to the
internet at home. This indicator shows how internet usage at home evolves over time, but it does
not explain why the values are what they are.
One possible reason is the average income of a household. To investigate the correlation,
information about household income is required. Eurostat makes such information available,
for example in the dataset "Mean and median income by household type".
The goal is to create a table where we see the value of the above DA Scoreboard indicator
side by side with the mean income for each country and year.
By selecting Upload RDF file in the Extraction & Loading menu of the LOD2 demonstrator,
you can upload a graph into the local RDF store (Virtuoso). Just provide the Resource URL
http://scoreboard.lod2.eu/data/scoreboardDatacube.ttl and a Named Graph URI.
In this example we use "http://localhost/scoreboard".
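For readers who prefer a query over the menu, the same upload can be expressed as a SPARQL 1.1
Update LOAD operation. The snippet below is only a sketch: the helper name build_load_update is
made up for this post, and whether your local store accepts LOAD depends on how its SPARQL
endpoint is configured.

```python
def build_load_update(resource_url: str, graph_uri: str) -> str:
    """Return a SPARQL 1.1 Update LOAD statement for one remote RDF file."""
    for iri in (resource_url, graph_uri):
        # IRIs are wrapped in <...>, so these characters would break the statement.
        if any(ch in iri for ch in '<>" \n'):
            raise ValueError(f"not a usable IRI: {iri!r}")
    return f"LOAD <{resource_url}> INTO GRAPH <{graph_uri}>"


# The two values used in this tutorial.
statement = build_load_update(
    "http://scoreboard.lod2.eu/data/scoreboardDatacube.ttl",
    "http://localhost/scoreboard",
)
print(statement)
```

The printed statement can be pasted into a SPARQL editor that has update rights on the store.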
Make http://localhost/scoreboard the current graph in the LOD2 demonstrator. You can then
browse the content by selecting OntoWiki in the Authoring menu.
There you can select, e.g., an observation to see information about, for instance, the usage of
internet in households in the Netherlands in 2005 (see table below).
|Observation||Country||Year||Unit||Measure households with access to the Internet at home|
Eurostat does not publish its data as RDF itself. The LATC project has made an effort
to publish a number of Eurostat datasets as RDF.
From http://eurostat.linked-statistics.org/ one can download the necessary datasets:
- the dataset with observations: http://eurostat.linked-statistics.org/data/indic_di04.rdf
- the dimension information about the geo locations: http://eurostat.linked-statistics.org/dic/geo.rdf
- the time dimension: http://eurostat.linked-statistics.org/dic/time.rdf
The above dataset repository is a work in progress, so the format and the availability
of the data may change.
In the dataset about the household statistics one finds observations such as:
|Observation||Country||Year||Measure Average net income - total|
To be able to upload those files into the local Virtuoso store, one has to reformat them
manually and expand all relative names with the right prefix. We used rapper, a command-line
tool, for this. You can easily install it on your local Ubuntu system.
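As a rough sketch of that conversion step, the Python wrapper below builds a rapper call that
reads RDF/XML and writes N-Triples, with a base URI so that relative names get expanded. The
local file name geo.rdf and the choice of the download URL as base URI are assumptions made for
this example, not necessarily the exact invocation we used. On Ubuntu, rapper is typically
packaged as raptor2-utils (raptor-utils on older releases).

```python
import shlex
import subprocess


def rapper_command(source, base_uri, input_format="rdfxml"):
    """Build the argv for converting one RDF file to N-Triples with rapper."""
    # -i selects the input syntax, -o the output syntax; the trailing
    # argument is the base URI used to resolve relative names.
    return ["rapper", "-i", input_format, "-o", "ntriples", source, base_uri]


def convert(source, base_uri, target):
    """Run rapper and write its N-Triples output to `target`."""
    with open(target, "wb") as out:
        subprocess.run(rapper_command(source, base_uri), stdout=out, check=True)


# Assumed: geo.rdf was downloaded to the current directory beforehand.
cmd = rapper_command("geo.rdf", "http://eurostat.linked-statistics.org/dic/geo.rdf")
print(shlex.join(cmd))
```

Calling convert("geo.rdf", "http://eurostat.linked-statistics.org/dic/geo.rdf", "geo.nt") would
then run the tool, provided rapper is installed and on your PATH.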
In addition, to make the dataset linkable, the time dimension needs an identifier for each year
under consideration, and these identifiers are missing. This is probably a gap in the
time-dimension configuration. We have added the necessary RDF statements manually. All corrected
files resulting from this operation are found below:
- the dataset with observations (a reduced-size version): ilc_di04.reduced.nt
- the dimension information about the geo locations: geo.nt
- the time dimension: time.nt
Now all Eurostat datasets are ready for upload into the local LOD2 stack. We upload them the
same way we uploaded the DA Scoreboard data, but this time all three files go into one
single RDF graph, "http://localhost/eurostat".
For creating links between datasets one can choose between two major methodologies:
- updating one of the datasets, or
- creating a separate link dataset.
Each option has its pros and cons. For the example here we have chosen the second option, as
we wanted to show that links can be established even without changing the original datasets.
To link the observations from both sources we use the linking tool Silk, available
in the Linking menu. The linking rules we have to express are:
- a Scoreboard Observation is linked with the Eurostat year identifier
(as expressed in time.nt) if the labels are exactly the same.
- a Scoreboard Observation is linked with the Eurostat country identifier
(as expressed in geo.nt) if there exist labels which are exactly the same.
(Note that the labels can be in more than one language.)
The attached Silk project scoreboard-eurostat-linking.xml contains those specifications.
Import it into your local Silk workbench. When executing the linking process, choose the option
to export the link results to the RDF store.
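To make the two linking rules more tangible, here is a minimal Python sketch of exact label
matching. It is not how Silk works internally; it only mirrors the rule "link when at least one
label is exactly the same". All URIs and the function name link_by_label are hypothetical.

```python
def link_by_label(observations, identifiers):
    """Link each observation to every identifier that shares a label with it.

    observations: dict mapping observation URI -> set of labels
    identifiers:  dict mapping identifier URI  -> set of labels
    Returns a sorted list of (observation URI, identifier URI) pairs.
    """
    # Index identifiers by label so each observation label is one lookup.
    by_label = {}
    for uri, labels in identifiers.items():
        for label in labels:
            by_label.setdefault(label, set()).add(uri)

    links = set()
    for obs_uri, labels in observations.items():
        for label in labels:
            # Exact, case-sensitive string equality, as in the rules above.
            for target in by_label.get(label, ()):
                links.add((obs_uri, target))
    return sorted(links)


# Hypothetical URIs, for illustration only.
observations = {
    "ex:obs/NL-2005": {"Netherlands"},
    "ex:obs/BE-2005": {"Belgium"},
}
countries = {
    "ex:geo/NL": {"Netherlands", "Nederland", "Pays-Bas"},
    "ex:geo/BE": {"Belgium", "België", "Belgique"},
}
for link in link_by_label(observations, countries):
    print(link)
```

Because the identifiers carry labels in several languages, one matching label is enough to
create the link. The same function works for the year rule when given year labels.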
The goal of the exercise is the creation of a table such as:
|Country||Year||Observation DG INFSO||Measure % Households with internet at home||Observation Eurostat||Measure Average net income - total|
To get the table, open a SPARQL editor page in the LOD2 demonstrator and issue a query that
joins the observations of both graphs through the generated links.
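The join behind that table can be illustrated in plain Python. This is a sketch of the logic
only, not the SPARQL query itself: the function name build_table, the data layout, the URIs and
the numeric values are all placeholders invented for the example, not real statistics.

```python
from collections import defaultdict


def build_table(scoreboard, eurostat, country_links, year_links):
    """Join two observation sets on their linked country and year identifiers.

    scoreboard:    dict observation URI -> measure value
    eurostat:      dict observation URI -> (country id, year id, measure value)
    country_links: set of (scoreboard observation URI, country id)
    year_links:    set of (scoreboard observation URI, year id)
    """
    countries = defaultdict(set)
    for obs, country in country_links:
        countries[obs].add(country)
    years = defaultdict(set)
    for obs, year in year_links:
        years[obs].add(year)

    # Index the Eurostat observations by their (country, year) pair.
    by_key = defaultdict(list)
    for uri, (country, year, value) in eurostat.items():
        by_key[(country, year)].append((uri, value))

    rows = []
    for obs, value in scoreboard.items():
        for country in countries.get(obs, ()):
            for year in years.get(obs, ()):
                for es_uri, es_value in by_key.get((country, year), ()):
                    rows.append((country, year, obs, value, es_uri, es_value))
    return sorted(rows)


# Placeholder URIs and values, not real data.
scoreboard = {"ex:sb/obs1": 10.0}
eurostat = {"ex:es/obs9": ("ex:geo/NL", "ex:time/2005", 20000.0)}
country_links = {("ex:sb/obs1", "ex:geo/NL")}
year_links = {("ex:sb/obs1", "ex:time/2005")}

for row in build_table(scoreboard, eurostat, country_links, year_links):
    print(row)
```

A Scoreboard observation only appears in the result when it has both a country link and a year
link and Eurostat has an observation for that same pair.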
The above steps have guided you through the linking process of two datasets.
We wish you a happy interlinking experience.