With the explosion of Big Data, Data Science has found new life and new areas of application. A discipline that until recently was identified with the term “Data Mining” has now acquired a broader scope, operating on ever larger masses of data, including unstructured data such as text, video, images and audio files. This shift has required distributed platforms capable of containing, storing and processing such information at scale.
If, on the one hand, the exploratory approach aimed at understanding the data for analytical purposes has remained unchanged, on the other, the working environment has grown in complexity, built as it is from heterogeneous and innovative technologies.
How do Data Scientists work?
In this context, in order to interact with Hadoop or, more generally, with Big Data environments, today’s Data Scientist is called on to use ad-hoc tools that make it possible to explore and filter data through distributed operations. Vendors have in fact developed many applications to facilitate this heuristic approach: think of the various Notebooks, or of the Data Discovery and Data Wrangling tools that support data acquisition-aggregation and cleaning respectively. Once the investigative phase is over, the exploratory process must be translated into executable and reproducible code, through the definition of a processing pipeline that completes the analytical process.
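The idea of turning exploratory work into a reproducible pipeline can be sketched in a few lines of plain Python. The stage names, fields and sample records below are purely illustrative, not taken from any real project; the point is that each exploratory step becomes a named, reusable stage that can be run again on new data.

```python
# A minimal sketch: exploratory steps captured as named, reusable
# stages and composed into a reproducible pipeline. All field names
# and sample records are illustrative.

def drop_incomplete(records):
    """Remove records missing the required field."""
    return [r for r in records if r.get("amount") is not None]

def normalize_currency(records):
    """Standardize amounts to a single unit (here: cents -> euros)."""
    return [{**r, "amount": r["amount"] / 100} for r in records]

def run_pipeline(records, stages):
    """Apply each stage in order, so the process is repeatable."""
    for stage in stages:
        records = stage(records)
    return records

raw = [{"amount": 1250}, {"amount": None}, {"amount": 300}]
clean = run_pipeline(raw, [drop_incomplete, normalize_currency])
print(clean)  # [{'amount': 12.5}, {'amount': 3.0}]
```

In a real Big Data environment the same principle applies, but each stage would typically be a distributed operation (for example a Spark transformation) rather than a list comprehension.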
Where the R and Python languages are no longer sufficient, Data Scientists turn to other computational paradigms on Hadoop (such as Spark, MapReduce and H2O), which make Machine Learning libraries and algorithms available for processing large datasets by exploiting the capabilities of the cluster, but which require good knowledge of the technological platform and specific programming skills. These are the skills cut out for the Data Engineer, who deals with a variety of languages, knows the Hadoop ecosystem well, and ensures optimized, high-performing processing flows.
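The MapReduce paradigm mentioned above can be illustrated locally with plain Python. This is a single-machine sketch of the classic word-count example, not a distributed implementation: on Hadoop or Spark the same map and reduce phases would run in parallel across the cluster.

```python
from collections import defaultdict

# Single-machine illustration of the MapReduce paradigm: the map
# phase emits (key, value) pairs, the reduce phase aggregates the
# values per key. A framework like Hadoop distributes both phases.

def map_phase(lines):
    # Emit a (word, 1) pair for every word in every line.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "data pipeline"]
print(reduce_phase(map_phase(lines)))
# {'big': 2, 'data': 2, 'cluster': 1, 'pipeline': 1}
```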
Is the Data Scientist really so sexy?
In fact, at the very beginning, when the Harvard Business Review defined it as the sexiest role of the twenty-first century, the Data Scientist was a mythological figure: an IT unicorn able to do everything, from Machine Learning to programming (distributed or not), from creating engaging graphical interfaces to deep knowledge of Big Data platforms and their tools. After the (somewhat predictable) disillusionment, the various skills were assigned to distinct roles, which are nevertheless required to understand and support each other in a virtuous circle of collaboration.
Ever since Machine Learning and Advanced Analytics projects became a necessity for many companies, which have understood their competitive benefits as well as their intrinsic difficulties, the organization of work has become a concern for team leaders and project managers.
In terms of skills and responsibilities, there is a significant overlap between Data Engineers and Data Scientists, even though they differ in the goal they pursue: Data Engineers build the infrastructures and architectures that support data generation, while Data Scientists concentrate on advanced mathematics and statistical analysis of the data so generated. How far can the Data Engineer push the refinement and tuning (optimization) of the analytical code without compromising its functionality? And which IT skills are the minimum requirement for a Data Scientist?
Data Engineer vs Data Scientist
The question of how such different backgrounds can coexist is still the subject of debate: engineering Data Science is a non-trivial problem that even the most important web companies at the forefront of Big Data are taking on. The problem is not only one of team mix and effective collaboration, but also of methodology: how can a project based on iterative validation processes, which must anticipate failures and continuous readjustments, be planned and managed in an IT context where deadlines and development methods are well established?
The physiological differences between the design of software (such as the development of a Java application, an ETL process or a Spark job on Hadoop) and a project with analytical purposes on non-certified data (often coming from external sources such as social networks and open data, or collected through web crawling) are far from obvious when estimating development effort.
The hidden danger lies in the data itself, which may be incomplete, dirty or insufficient to complete the analysis. According to many recent industry surveys, such as the CrowdFlower 2016 Data Science Report, activities related to cleaning and pre-processing the data, better known as “data preparation”, account for around 80% of the total effort. The time devoted to these operations yields no immediate result in the eyes of the PM or the client, yet it guarantees the objectivity and impartiality of the algorithm; it can therefore be perceived as a worrying sign of inconclusiveness or slowness. The success of the analysis is not a foregone conclusion, precisely because of the data issues already cited, just as the accuracy of the algorithm may fall short of expectations, and all this is difficult to fit into the logic of project management.
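What the roughly 80% figure for data preparation looks like in practice can be suggested with a small sketch: deduplication, missing-value handling and format normalization. The field names and records below are illustrative assumptions, not from the cited report.

```python
# A minimal sketch of typical "data preparation" work: dropping
# incomplete rows, normalizing formats and removing duplicates.
# Field names and records are illustrative.

def prepare(records):
    seen, cleaned = set(), []
    for r in records:
        email = (r.get("email") or "").strip().lower()
        if not email or email in seen:   # drop incomplete and duplicate rows
            continue
        seen.add(email)
        cleaned.append({"email": email, "age": r.get("age", 0) or 0})
    return cleaned

raw = [
    {"email": "Ada@Example.com ", "age": 36},
    {"email": "ada@example.com", "age": 36},   # duplicate after normalization
    {"email": None, "age": 20},                # incomplete
]
print(prepare(raw))  # [{'email': 'ada@example.com', 'age': 36}]
```

Steps like these rarely impress a stakeholder on their own, which is exactly why the time they consume is easy to underestimate in a project plan.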
Fail fast is best
The challenge of making Data Science solutions operational in technologically demanding scenarios is still open, and the approach to it is incremental, built on experience. The “fail fast” motto, that is, fail quickly in order to learn and improve just as quickly, is very fitting in this context: it is best to tackle code engineering and the applicable development methodologies from the outset, starting with small projects run by mixed, multidisciplinary working groups, in order to understand how to achieve the best result through continuous improvement.