Linked data – currently a bumpy road
So, in its smarter government white paper, the UK Government has made a firm commitment, through its new “public data principles”, to publish any “raw” dataset in linked data form. What exactly does this mean?
I tried to find a clear definition of “raw” data. In almost all the definitions you will find on the Internet, it means “unedited” or “unprocessed” data. Is that really what is intended by those who use the term? If so, would all of our statistical outputs be excluded? The term “raw data” is not at all helpful, especially if, in this context, it is not intended to mean the same thing as the generally accepted definition, which comes from measurements taken by scientific instruments.
I suspect that the intention is that it means the underlying datasets that were used to reach policy recommendations. These are almost always not “raw” datasets – they are statistical outputs that most certainly have been processed in order to ensure that they really are meaningful.
So, perhaps we really mean we will publish meaningful processed data (which is almost always not “raw” data) in linked data form. How easy is it to ensure that the provenance of linked data can be well understood, and taken into account by those using it? Not very, judging by an example from Brian Kelly of the University of Bath. He set some students the challenge of using Linked Data to find out which UK city has the highest proportion of students. Within a few hours, a student produced the answer: Cambridge, whose students come to roughly 3,225 times its population (about 322,467%). According to the web of linked data, there are 38,696 students living in Cambridge, which has a total population (again according to the web of linked data) of 12.
Oh dear! The wonderfully elegant query created by the student looks fine, but the underlying quality of the data in some of the sources would seem to be a bit dodgy. Provenance remains a serious issue for Linked Data, and if we are to start publishing official statistics in Linked Data form it is not at all clear how Brian’s student would know how to write the query to pick up the right population of Cambridge (which I think must be more than 12).
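To make the point concrete, here is a minimal Python sketch of the kind of sanity check someone consuming these figures might run before trusting the result. It is not the student's query, and the function names are my own invention for illustration; the only real inputs are the two Cambridge figures quoted above. The assumption is simply that a share of a population has to lie between 0 and 1.

```python
def student_share(students: int, population: int) -> float:
    """Return students as a fraction of the total population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return students / population


def is_plausible(share: float) -> bool:
    """A share of a population cannot be below 0 or above 1."""
    return 0.0 <= share <= 1.0


# Figures quoted above for Cambridge, as found in the web of linked data.
share = student_share(38_696, 12)
print(f"{share:.2f} students per resident, plausible: {is_plausible(share)}")
# prints: 3224.67 students per resident, plausible: False
```

A check like this catches the nonsense, but it cannot say which of the two figures is wrong or where it came from. That is the provenance problem.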
This does not mean we should ignore this world. It is early days, and the concepts make good sense. But equally, we should not rush into thinking that there is a quick fix. Those who are interested in finding a consistent way to represent statistical data in the Linked Data world are working together in a Google group. If you are interested in this, please join us.