The fact that data is the new oil and that information is now digital is evident also from the media coverage generated by attempts at fraudulent misappropriation, and from the increasingly close attention paid to digital privacy and identity (and memory).
Refining this raw material into value-added information calls for advanced analysis techniques (data/text mining, machine learning, deep learning) and a non-linear method, which is not particularly reassuring for those used to dealing with traditional computer systems, where a problem can always be translated into a fairly definite and very manageable estimate of cost and result.
The skills needed to obtain information from data
Speaking of skills takes the discussion beyond the bounds of descriptive statistics and points to the figure of the data scientist, now emerging as the profession of the near future. This is a profile that is not new, but one revitalised by the big data trend, centred on the physical/mathematical/linguistic sciences rather than on computer skills. What is required is an aptitude for working with data: a way of thinking less conditioned by direct cause-effect relationships, less deterministic, capable of developing a plausible theory without the urgency of quantifying, but with attention to qualification and evaluation.
The domain in which the data scientist operates is not invariant, and the type of phenomenon being studied must be considered: a mechanical system has structural characteristics that differ from those of a system of organic molecules or a detector of frequency and power signals. The format of the data also has its own specificity: dealing with structured alphanumeric values rather than images, short or long texts, or video/audio streams involves the use of specific techniques. Specialist knowledge is certainly required, but a single specialisation is almost never sufficient. For this reason it is crucial to be able to draw on different skills, with a capacity for dialogue mediated by a precise but not overly sectoral language.
While computer skills are not what qualifies the profile of the data scientist, the context in which he/she works must nevertheless be considered. If he/she works with graphical tools, which offer the great advantage (but also the limitation) of ready-to-use functions, then that assertion certainly holds, and this is a field much frequented by statisticians. The same is true when working on traditional architectures, where the data scientist's work can be applied to data sets that a single machine can process with reasonable performance. In a big data architecture, however, where data preparation and the execution of algorithms require the power of a cluster to distribute loads and processes, computer skills are no longer negligible. Indeed, while a data scientist is certainly capable of hypothesising a good theoretical model and implementing it in a programming language, it does not follow that he/she can immediately produce good code, especially when distributed computing, resource savings and performance efficiency are fundamental.
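The shift in concerns can be illustrated with a toy map/reduce decomposition: the point is that only computations expressed as independent per-partition steps plus an associative merge can be distributed across a cluster. This is a minimal sketch in plain Python; in a real big data setting the partitions would live on different nodes (e.g. under a framework such as Spark) rather than in one process.

```python
from functools import reduce

def partial_stats(partition):
    # Per-partition aggregation: count and sum. In a cluster, this step
    # runs where the data lives; only the small tuple travels to the reducer.
    return (len(partition), sum(partition))

def merge(a, b):
    # Associative merge of partial results -- the property that makes
    # the computation distributable in the first place.
    return (a[0] + b[0], a[1] + b[1])

# Hypothetical numeric data, split into partitions as a cluster would hold it
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]

count, total = reduce(merge, map(partial_stats, partitions))
mean = total / count  # global mean computed without centralising the raw data
```

Writing the model this way, rather than as a single loop over all the data, is exactly the kind of engineering concern the paragraph above refers to.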
Data scientist: the working method
As for the working method, we are in a data-driven model, which is wholly incompatible with the rigidity and clarity of traditional methods such as the waterfall, but much closer to the latest agile development methodologies.
A data-driven approach means starting from data rather than from processes: from the data materially produced in a specific reality, not just those theoretically offered by a given system. In fact, with equal data traces and collection applications, two different realities can produce very different data: imagine, for example, how different the data collected by the SAP WM (warehouse management) module of an aeronautical company are from those of a large retail company, just as the CRM data of a company selling luxury goods will differ greatly from those of one marketing consumer goods. While formally dealing with the same data, perhaps even managed by the same rigid applications, the contents will be very different in terms of volume, distribution of values, degree of movement and even completeness. In other words, the distribution of variables, the samplings and the cases to be treated during data preparation depend heavily on individual contexts; this also means that it can in no way be assumed that a model built in one reality can be used immediately in another, however similar.
No study assumption, even in well-known environments, can ignore an initial data profiling activity designed to understand the distribution of values (variables) and to highlight all the contextual features reflected in the specific cases to be addressed for correct data preparation. The importance of these often undervalued preliminary phases grows exponentially when dealing with lesser-known data, for which non-deterministic models may be needed from the most exploratory moments onwards, or for controlling their variability and transforming it into knowledge.
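A first profiling pass of the kind described can be as simple as measuring, for each variable, its missing rate and the distribution of its values. The following is a minimal sketch in plain Python; the records and field names are hypothetical.

```python
from collections import Counter

# Hypothetical records from a source system: the schema is fixed, but the
# actual contents (completeness, distribution) are what profiling reveals.
records = [
    {"channel": "web",   "amount": 120.0},
    {"channel": "web",   "amount": None},
    {"channel": "store", "amount": 80.0},
    {"channel": "web",   "amount": 95.0},
]

def profile(records, field):
    """Return the missing rate and value distribution for one variable."""
    values = [r.get(field) for r in records]
    missing = sum(v is None for v in values) / len(values)
    distribution = Counter(v for v in values if v is not None)
    return {"missing_rate": missing, "distribution": distribution}

print(profile(records, "channel"))
# e.g. missing_rate 0.0, distribution Counter({'web': 3, 'store': 1})
```

Run against two different realities with the same schema, this kind of profile makes the context-dependence discussed above concretely visible before any modelling starts.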
A data-driven approach has to deal with uncertainty about the result within a defined time frame, while being aware that a result is almost always possible and can potentially be very important. When data exist, it is only a question of time and capacity; the method can only be iterative and heuristic, proceeding by successive learning cycles, where there is always a result, at least at the level of knowledge. While the lack of a guaranteed result within definite time periods can be daunting, this approach matches well with the guiding principles of agile methodologies: controlled iterations, validation of results with the involvement of business users, corrections and changes of direction in itinere, following the most promising fronts of investigation that emerge along the way.
Data-driven does not, however, mean without an objective, because iterations become more rapid the more they are oriented towards a clear purpose; nevertheless, it is always necessary to admit the possibility of failure, at least in the first iterations, which will still serve to refine the objective or the implementation strategy.
Once clarified, structured, implemented and verified, a process-driven model can be considered concluded, in the sense that it will continue to be used but its potential is fully expressed. A data-driven model, on the contrary, never reaches a conclusion, precisely because data change over time, continuously generating new potential. For this reason, a data-driven company should plan for continuous exploratory activity on its data and a permanent data science lab.
We can thus conclude that, in order to distil value from the multitude of available data, it is not enough to change the technology architecture: it is certainly necessary to change the development model, and sometimes the organisational model as well.