SOCIETY | Nov 8, 2016

Data Science in three dimensions

In the age of "data as the new oil", is Data Science the refinery?

Data Science is a term that has been in use in both science and industry for a few years now, alongside Big Data and Data Scientist. In October 2012, an article by T.H. Davenport and D.J. Patil in the Harvard Business Review described the Data Scientist as the sexiest job of the 21st century; at the time, the second of the two authors was Senior Data Scientist for LinkedIn. The reasoning is quite simple: in a world launched into the digital age, where everything (air tickets, mobile network traffic, purchases, books and news, and sensors in homes, on the streets and in cars) is a source of digital data, the professionals able to acquire, manage, analyse and ultimately create value from those data will be the most sought after in the business world. Already today, in companies that thrive on digital business, the Data Scientist sits near the top of the management pyramid and reports directly to the CEO.

The ability to extract value from data is not only a necessity in industry. In government too, it is now well known that Obama's victory in the 2012 U.S. elections, which won him a second term, was owed in large part to the targeted use of voter data analysis (from the web, social networks, etc.) applied to campaign marketing by a then unknown analyst, Dan Wagner, who managed to identify voters in a new way and provide useful input for targeting campaign messaging as effectively as possible. In 2015, Obama himself called on D.J. Patil to take on, for the first time in the history of the United States, the role of Chief Data Scientist, in order to provide new information on the state of the Union through the data.gov website, both for the administration and for citizens.

The term Big Data tends to be used specifically for the engineering problems linked to the size, variety and speed of production of digital data, for which established technologies are no longer sufficient. Since 2015, the idea has grown that the 21st century digital economy will also, and above all, require a review and restructuring of basic knowledge and professional skills. This has given rise to the idea that an independent discipline is needed, Data Science, distinct from Computer Science, which encompasses not only analytics, data management and data modelling, but also engineering, economics and allied skills (the so-called soft skills).

Although still far from having a universally accepted definition, the term Data Science tends to indicate a field of interdisciplinary studies focused on both the processes and the technological systems necessary for extracting knowledge and, ultimately, value from data of different kinds, forms and sizes. Data Science thus inherits elements of Statistics, Data Mining, Machine Learning, Operations Research, Information Theory, Programming and Big Data.

Europe has not stood idly by in these years. In line with the initiatives for the Digital Economy and the Digital Single Market, a series of supranational initiatives has been launched to encourage the collection, sharing and production of digital data in various forms.

Three dimensions of Data Science have taken shape, all of them fundamental and requiring attention in the coming years: the availability of data, including aspects related to format, interoperability and rules for exploitation; the technological dimension, with both open source and proprietary solutions for managing these data; and the educational dimension, with initiatives to identify the competencies the labour market expects, so that universities and training centres can prepare workers able to cope with the change of pace expected in the economy of the 21st century.

Within the first dimension, the availability of data, there are various initiatives of the European Commission, such as the INSPIRE directive for structuring georeferenced public administration data in open and interchangeable formats, the Open Data portal, and the Copernicus programme, run in collaboration with the European Space Agency (ESA) for the exploitation of Earth observation data.

In the second dimension, technology, we find the tools developed by the international community for managing big data, primarily as open source. The best known is the Hadoop framework, around which a series of tools has grown over time that are now mature enough to handle every aspect and problem related to Big Data in the enterprise context. Apache Spark, for its part, has reached greater maturity and efficiency than Hadoop and aims to become an autonomous platform. Finally, on the proprietary front, the trend to note is towards all-inclusive offerings from major vendors (Microsoft, Google, Amazon Web Services) that combine the Internet of Things, Big Data and Cloud Computing; this is a sign that the technological convergence of these three buzzwords has begun. In the same vein, we also see the birth of platforms specific to Data Science, such as Data Science Experience, recently promoted by IBM.

The third dimension is education. Here we distinguish between initiatives that aim to raise the population's general level of digital knowledge (understood in the broadest sense, from using a browser and surfing the Internet to knowing how to program) and more specific initiatives that define university and postgraduate curricula dedicated to Data Science or promote the communities linked to it. The educational aspect will soon merge with the normative one: communities and courses shaped by the specific interests and sensitivities of individual teachers and schools will be replaced by content and skills required by certification and/or by the guidelines of professional associations, as has happened in Project Management.

Data Science will not be just another buzzword that disappears with the next technological change or shift in marketing needs. The signals we have noted are destined to consolidate in the near future and to last over time. If data is the new oil of the 21st century, Data Science is its refinery.

Andrea Manieri and Francesco Saverio Nucci