SOCIETY | Dec 20, 2016

Open Data: everybody is talking about them, but…

What are open data, and how feasible is it to analyse them to obtain information?

The term Open Data comes and goes. Like any trend, it fills the newspapers in some periods and disappears in others. Yet however much we talk about them, Open Data in Italy remain scarce and, above all, not very useful.

The latest reference to open data comes from the new team for digital transformation led by Diego Piacentini, who in a press release states: The Digital Task Force is committed to aligning standards for conducting a procedure via Internet and for sharing open data: “No more private silos of this or that administration.”

So Open Data are back. And yet, when we ask what Open Data are (not necessarily of “ordinary citizens”, but of professionals with a potential interest in the disclosure of public administration (PA) data, such as journalists), very few raise their hands to answer. Some have heard of them but have never used them to write an article. Others have written about a start-up that wanted to do business with open data but perhaps failed because it found only a few datasets, and those out of date or wrong. Some confuse them with Big Data and say: “Of course, data are crucial. Amazon, for example, uses them a lot”. But those are not really Open Data.

Let’s start from the definition of open data. We can refer to the one given by the Open Knowledge Foundation:

“A content or a datum is defined open if anyone is free to use, reuse, redistribute it, being subject, at most, to the requirement of attribution and/or of sharing it in the same way”.

If you wanted to play “Guess Who” for Open Data you could say:

  • they are accessible through ICT and are suitable to be used automatically by software
  • they are available with a license which allows them to be used by anyone
  • they are in an open format (therefore they cannot be saved in a proprietary format such as xls)
  • they are free or have a marginal cost.
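As a rough illustration, the four traits above can be written as a checklist that software could run over a dataset's metadata. This is only a sketch: the record fields, the set of open formats, the licence identifiers and the cost threshold are all assumptions made for the example, not an official standard.

```python
from dataclasses import dataclass

# Illustrative assumptions, not official lists.
OPEN_FORMATS = {"csv", "json", "xml"}
OPEN_LICENSES = {"CC0-1.0", "CC-BY-4.0", "CC-BY-SA-4.0"}
MAX_MARGINAL_COST_EUR = 5.0  # arbitrary threshold chosen for the example


@dataclass
class Dataset:
    """Hypothetical metadata record for a published dataset."""
    url: str          # where the file can be downloaded
    file_format: str  # e.g. "csv", "xls", "pdf"
    license_id: str   # licence identifier declared by the publisher
    price_eur: float  # cost of obtaining the data


def failed_checks(ds: Dataset) -> list[str]:
    """Return the criteria that the dataset does not meet."""
    problems = []
    if not ds.url.startswith(("http://", "https://")):
        problems.append("not accessible online")
    if ds.file_format.lower() not in OPEN_FORMATS:
        problems.append("not in an open, machine-readable format")
    if ds.license_id not in OPEN_LICENSES:
        problems.append("no licence allowing reuse by anyone")
    if ds.price_eur > MAX_MARGINAL_COST_EUR:
        problems.append("costs more than a marginal amount")
    return problems
```

Under these assumptions, a resolution published as a pdf with no declared licence would fail the format check and the licence check, even though it is online and free.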

With these traits in mind, it becomes clear that PA resolutions published as pdf files are not open data (even less so when the pdf is just an image scanned from a paper document); tables published on PA websites are not Open Data; figures reprocessed and published as statistical reports are not Open Data. The datum, in fact, should be “raw”, unrefined, unprocessed and published in its simplest and clearest form so that anyone (including a piece of software or a service) can use it and reprocess it as they see fit.

That being said, everything seems simple: a PA only needs to stick to this definition to have a good datum to publish. Yet if we look at it from a journalist’s point of view, for example, working with data is tiring. Or exhausting, depending on the data we are looking for.

What constitutes the Open Data ‘Via Crucis’ for data journalists (or for anyone interested in analysing an open datum)?

The journalist’s first station of the cross is searching for open data to write an article and not knowing where to find them. There is a national portal and there are also individual administration portals. They almost always differ from each other and are not always accessible to non-technicians. In these cases help may arrive from Saint Google, who intercedes to point out a more direct route and allow progress to the next stations.

At the second station, the reporter locates the datum, but it is of no interest. This is because the easiest data for PAs to publish are the “harmless” ones (the list of available chemists, for example), which do not tell us a great deal about the performance of a body or a territory (and even less about its economy). These data are useless to anyone but the PA, which places them on its Open Data portal to increase the number of published datasets.

At the third station the journalist locates his datum, but it is open only in the intentions of the politician who announced it, as it is only a pdf image. And how are you supposed to use an image datum except by printing the document and transcribing it by hand into a table, with all the consequent waste of time and risk of error?

At the fourth station, the reporter locates the datum, but it is old, incomplete, too aggregated or not updated. For example, if I need to write an article about tourism and I find only figures that are two or three years old, what can I do with that datum? If I want to write a piece about PA expenses and I find a dataset so huge that a normal analysis tool like a spreadsheet cannot read it, what is the point of that datum being made available? And if I find air pollution measurements for one part of town but not for the others, how can I provide complete information?
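For the oversized dataset, one possible workaround is to read the file one row at a time and keep only running totals in memory, rather than loading everything into a spreadsheet. The sketch below assumes a CSV file with two hypothetical columns, `office` and `amount`, and amounts already written with a decimal point; a real PA expenses file will almost certainly be laid out differently.

```python
import csv
from collections import defaultdict
from typing import Iterable


def total_by_office(lines: Iterable[str]) -> dict[str, float]:
    """Sum the 'amount' column per 'office', reading one row at a time."""
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row["office"]] += float(row["amount"])
    return dict(totals)


# Usage with a file on disk (the file name is hypothetical):
# with open("pa_expenses.csv", newline="", encoding="utf-8") as f:
#     print(total_by_office(f))
```

This helps only with size. It does nothing for data that are old, incomplete or too aggregated, which remain the publisher's responsibility.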

At the fifth station the journalist judges the datum interesting, but it is neither well described nor properly structured, or it is difficult to reprocess because a column has the wrong format. For example, if the journalist finds a dataset with a column called CAP_COST, how can he correctly interpret the meaning of that column if the datum is not published together with a description file? What if he wishes to reprocess the information and a number is written as text? Or a date is written in letters and cannot be sorted? Or columns are duplicated, with different information in different places?
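Two of these format problems can sometimes be patched by the person reusing the data. A minimal sketch, assuming numbers written as text in the Italian style (dot for thousands, comma for decimals) and dates written as day, month name and year; both function names are invented for the example, and real datasets may follow other conventions. No code can recover the meaning of an undocumented column such as CAP_COST: that still needs a description file.

```python
from datetime import date

# Italian month names mapped to month numbers.
ITALIAN_MONTHS = {
    "gennaio": 1, "febbraio": 2, "marzo": 3, "aprile": 4,
    "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}


def parse_italian_number(text: str) -> float:
    """Convert a number written as text, e.g. '1.234,56', to a float."""
    return float(text.strip().replace(".", "").replace(",", "."))


def parse_italian_date(text: str) -> date:
    """Convert a date in letters, e.g. '3 marzo 2016', to a date object."""
    day, month_name, year = text.strip().lower().split()
    return date(int(year), ITALIAN_MONTHS[month_name], int(day))


# Once parsed, dates written in letters can be sorted chronologically.
raw_dates = ["10 gennaio 2016", "3 marzo 2015"]
in_order = sorted(raw_dates, key=parse_italian_date)
```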

At the sixth station, the journalist decides that, despite the difficulties, he will download the data, reprocess them in a spreadsheet and interpret them in order to write an article. He will then be crucified if he mistyped something while transcribing at the third station, if he believed at the fourth station that the data were up to date when they were not, or if he misinterpreted the name of an undescribed column at the fifth. Glimpsing the risk of crucifixion, he decides that data journalism using open data is not his cup of tea; that useful Open Data do not exist and are just one of the fashions of the moment, which nobody deals with here (because apparently, if you put your nose outside this country, it does seem to be possible).

If this ‘Via Crucis’ plays out as we have described it, reporters will not ask for data. Nor will citizens and associations (the famous Civil Society), as they will not know what Open Data are and will not see their value as information or as the basis of new services offered by companies. The PAs will then say that it is not worth investing in opening up data (since the publication process is not trivial and requires an expenditure of energy). So the fashion will pass. There will be no Open Data except in a few well-intentioned speeches destined to remain on paper (or on a blog).

And so the phrase which opens Piacentini’s press release, “Only play those notes that are necessary. Try not to play the other ones”, will in this case sound like a bad omen. Because innovation certainly does not occur by always playing the same music, consisting only of those few notes which someone decides are necessary.

Sonia Montegiove