The growing demand for solutions sometimes generously labelled Big Data is often accompanied by the expectation that they will cure long-standing discontents of data management, confusing the technology with the project work needed to reach one's goal. Technologically speaking, by properly sizing the cluster to be implemented, volume and velocity can be brought into play, understanding the latter as the response time of a system in operation, not as the speed of the data collection process, in which other factors that we will consider take over.
In the traditional distinction between data-producing systems (the so-called operational or transactional systems) and systems of analysis, there have been repeated attempts to unify the two levels by shifting the analysis onto the operational level itself (operational intelligence, real-time analytics), so as to remedy the long waits and hard-to-justify costs of bringing new information into production. Basing analysis entirely on the operational level, however, has drawbacks (the high cost of in-memory appliances) and limitations (contention between operational and analytical workloads; the bias of a single operational view when working not with one ERP but with multiple management systems), which have curbed this trend.
With the advent of Big Data, that is, of distributed and scalable systems that do not require high-end hardware, a more viable way of overcoming this distinction seems to be on offer, and it is tempting to believe that the data warehouse will become obsolete, superseded by a broader data lake containing all the original data at the operational level.
However, it is necessary to clarify what technology can solve and where it remains a simple, albeit useful, support.
While the potential of a data lake holding operational data without time limits is comparable to that of the data warehouse, it is not true that realising that potential is a marginal activity, or one solved simply by a change of infrastructure. The data modelling done in a data warehouse already provides a first semantic layer and a network of controls (referential integrity) that also defines the minimum guaranteed quality of the data that are loaded. It should also be noted that a data warehouse normally solves other kinds of problems as well, such as:
- the historical link between master data that change over time (the same logical object which changes code from year to year)
- the variability of the characteristics of the same object over time (belonging to a group, to a portfolio, to an organisational unit)
- the reconciliation of data and master data from different systems (the same client with a different identifier in the CRM or billing system)
- the reconciliation of data and master data from different companies in the case of mergers and acquisitions.
No data lake, even one capable of receiving the operational data of all a company's systems for a time tending to infinity, will solve the issues mentioned above, which arise from the data themselves and not from the infrastructure or hosting paradigm. Thus, even if one wished to follow the path of abandoning the data warehouse, it would be necessary to develop a whole series of transformation and derivation processes over the primary data in the data lake, in order to ensure the same consistency, informational functionality, possibility of intersection and historical review that we usually expect today from a data warehouse.
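One of those transformation processes, the historical link between master data whose codes change over time, can be sketched as a slowly-changing-dimension style lookup. The table, codes and dates below are purely illustrative assumptions, not taken from any real system:

```python
from datetime import date

# Hypothetical master-data history: the same logical product is identified
# by a different operational code each year ("A17" in 2019, "B42" from 2020).
# A stable surrogate key plus validity dates preserves the historical link.
history = [
    {"surrogate_key": 1, "source_code": "A17",
     "valid_from": date(2019, 1, 1), "valid_to": date(2019, 12, 31)},
    {"surrogate_key": 1, "source_code": "B42",
     "valid_from": date(2020, 1, 1), "valid_to": date(9999, 12, 31)},
]

def resolve(source_code, as_of):
    """Map an operational code, valid on a given date, to its surrogate key."""
    for row in history:
        if row["source_code"] == source_code and row["valid_from"] <= as_of <= row["valid_to"]:
            return row["surrogate_key"]
    return None  # unknown code, or code not valid on that date

# Both yearly codes resolve to the same logical object:
assert resolve("A17", date(2019, 6, 1)) == resolve("B42", date(2020, 6, 1)) == 1
```

On a data lake this resolution logic does not come for free with the platform; it is exactly the kind of derivation process that must be built and maintained over the primary data.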
Precisely because the issues of the data are not resolved by a technology platform, the questions of information quality recur on the data lake exactly as they do on traditional architectures, with the one difference that there is no longer any structural barrier at the input. Big Data systems are built above all to receive data (like a large staging area), regardless of format, origin and quality. To turn them into valuable information, however, all the issues raised so far must be addressed, perhaps with different approaches and techniques, more oriented to end use than to overall consistency, but in any case with activities and costs to be accounted for.
If then, as we hope, there is also a shift towards Big Data to accommodate new data, external to the strictly business domain, the problems are amplified, but interesting new opportunities open up as well.
What are the opportunities with Big Data?
The innovative challenge of Big Data is fully captured by initiatives that do not reduce to a purely technological exercise and do not remain confined within the IT perimeter. The most interesting opportunities arise when working with new data, or operating in a new way on old data, or doing both at once: seeking to enrich already governed information with other data, whether internal or external.
The first step towards external data is usually taken by marketing, which aims to assess the reputation of the brand or the prevailing sentiment towards a certain initiative. This is certainly an important space, but it is limited to people with whom some form of relationship already exists, and so builds almost self-referential information. A step forward can be taken by building information within an enlarged context, integrating content that is not strictly related to one's own information perimeter (open data, social media, weather, fee-based data sets, linked data, etc.). However, since this is never detailed information about clearly identifiable individuals, it may appear unusable if one persists in the idea that the only usable data are those that match one's systems exactly, in which a real individual can be identified and traced back to that individual's registry instance (something that, moreover, could raise privacy concerns from more than one side). It would be enough, however, to reverse the approach and generalise one's own data just enough to make them comparable with the external data, operating essentially by cluster. At that level there is a great deal of usable information, enriching the cluster with data on income, spending, propensity towards certain types of activity, and so on.
It can never be said that, because they fall in a certain cluster, a given customer (of whom only the minimum information needed to regulate a contract is known with certainty) definitely earns a certain amount or really spends in a certain way. It can, however, be said with reasonable accuracy that people very similar to that customer usually fall into a certain income bracket and are more or less likely to buy certain product categories.
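The reversal of approach described above can be sketched in a few lines: the customer record keeps only its contract data, while the external statistics are attached to the generalised cluster, never to the individual. All names, bands and figures here are illustrative assumptions:

```python
# Generalise the customer just enough to form a comparable cluster key.
# Only two age bands are handled, as a sketch.
def cluster_key(customer):
    age_band = "25-34" if 25 <= customer["age"] <= 34 else "35-44"
    return (age_band, customer["city"])

# Hypothetical external data set, keyed by the same generalised attributes:
# income and propensity are properties of the cluster, not of any person.
external_stats = {
    ("25-34", "Milan"): {"income_bracket": "30-40k", "propensity_travel": 0.62},
    ("35-44", "Milan"): {"income_bracket": "40-55k", "propensity_travel": 0.48},
}

customer = {"id": "C001", "age": 29, "city": "Milan"}  # minimal contract data
enriched = {**customer, **external_stats[cluster_key(customer)]}
```

The enriched record states what holds for people similar to C001, which is precisely the reasonable-accuracy claim made above, not a fact about the individual.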
The same could be done by examining the territory from a higher vantage point: you may know your customers' addresses of residence but nothing more about the areas in which they live. They could, however, be placed territorially at a less exact level (not by address but by neighbourhood, city, province or region), and the information enriched with demographic indicators (number of inhabitants by gender, age group, Italian/foreign) or linked to the social and industrial fabric (level of education, employment rate, number of SMEs), and so on.
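Once customers are placed at, say, neighbourhood level, relating them to published indicators becomes a simple aggregation. The neighbourhoods and population figures below are invented for illustration; penetration per thousand inhabitants is just one possible measure of real strength of attraction in an area:

```python
from collections import Counter

# Each customer is placed only at neighbourhood level, not by exact address.
customer_areas = ["Navigli", "Navigli", "Isola", "Navigli", "Isola"]

# Hypothetical external demographic indicator at the same territorial level.
inhabitants = {"Navigli": 48000, "Isola": 31000}

counts = Counter(customer_areas)
# Customers per 1,000 inhabitants: a measure of attraction that a bare
# year-on-year comparison of sales cannot provide.
penetration = {area: 1000 * counts[area] / inhabitants[area] for area in counts}
```

The same pattern extends to any indicator published at neighbourhood, city, province or region level: the only requirement is that customers be generalised to the granularity of the external data.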
Only in this way is it possible to gather the elements for a full analysis of one's real strength of attraction, rather than merely comparing the current year with the previous one. Only in this way is it possible to go beyond the usual performance questions (have I achieved my goals? Have I limited the losses? Have I maximised revenues? Have I made processes more efficient?) and face the more challenging questions with new tools, aimed at setting policies of development rather than containment (have I seized every opportunity? Have I missed a few trains? Am I inside a slow change that I am failing to grasp?).
There is a broadening of the real market for data and data-based services, confirming that their value is not abstract but genuinely convertible, a market in which the prefix Big no longer even characterises the technology platform.