By the end of 2017, 50% of businesses will have recruited a Chief Data Officer responsible for the quality and governance of company data (data governance), according to Gartner. Data governance deals with the following aspects:
- manage the metadata catalog to obtain readable information from large volumes of raw data: create order
- secure data by activating digital solutions for their protection (compliance): understand who can use them and for what purpose
- track the activities performed on data (auditing): understand when, how, why, where and by whom they are used
- put tools in place for displaying the origin of data and their transformations (data provenance and lineage): identify sources and transformation algorithms and facilitate the interpretation of results
- apply all the necessary data quality rules: improve data.
Data management is therefore the critical success factor of any project that affects large volumes of data.
From anarchy to governance
We have witnessed a phase of enthusiasm stemming from the ability to access a large amount of publicly available data instantly, without the constraint of having them certified beforehand. Although the meaning of these data was not known in advance, the expectation was that a hidden meaning could be derived through greater freedom of interpretation. It was thought that the new professional figure of the Data Scientist would be guided by the data themselves toward deep understanding, applying sophisticated algorithms and taking advantage of the computing power of big data platforms and their ability to gather and correlate information.
After a first phase of effervescence in terms of both technology and techniques of analysis, we have witnessed the consolidation of technology and the reshaping of this purely heuristic analytical approach to the search for business value. In fact, it has become clear that no accurate and thorough analysis can be carried out without full mastery of the data.
Data should be described with metadata and cataloged so that their intrinsic business meaning is known and their suitability for analysis can be evaluated without running into privacy and data protection issues. Learning about business data should come about through intense and continuous exchange with domain experts and the business. At the same time, an effort must be made to regulate external data and define the rules for their use.
Data Protection and the new regulations
The new European General Data Protection Regulation (GDPR) is arousing great interest. Regardless of the means used, this regulation applies to any organization that collects the personal data of residents of the European Union, no matter where they are stored. The regulation formalizes, for example, the requirement for companies to report any data breach quickly (within 72 hours), “privacy by design” (that is, designing IT solutions that take the management of sensitive data into account from the start) and the famous right to be forgotten. This involves extensive use of software for managing security, encryption, data masking, pseudonymization and other activities.
The data processing techniques that such software allows must be applied appropriately, without interfering with the subsequent use of the processed data in various modes of analysis (for example, how can I use a masked datum in my analyses?).
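One possible answer to the masked-datum question is deterministic pseudonymization: the same identifier always maps to the same token, so counts and joins still work while the identifier itself is no longer visible. The following is a minimal sketch, assuming a keyed hash (HMAC-SHA256) as the masking technique. The key, the record layout and the function name `pseudonymize` are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical secret; in practice it would be held in a key management system.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a deterministic keyed-hash token for a personal identifier."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability only; this raises collision risk


# Hypothetical purchase records keyed by e-mail address.
purchases = [
    {"email": "anna@example.com", "amount": 30.0},
    {"email": "marco@example.com", "amount": 12.5},
    {"email": "anna@example.com", "amount": 7.5},
]

# Replace the identifier with its token before the data reach the analysts.
masked = [
    {"customer": pseudonymize(p["email"]), "amount": p["amount"]}
    for p in purchases
]

# The analysis still works on the token: number of purchases per customer.
purchases_per_customer = Counter(row["customer"] for row in masked)
```

Because the token is stable, the masked dataset can still be grouped or joined on `customer`. Whoever holds the key can recompute a token from a known identifier and so re-link the records, which is why the output counts as pseudonymized rather than anonymized data.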
Starting in 2018, also in the big data field, organizations will have to ensure compliance with the rules through specific governance measures that include detailed documentation, recording and ongoing risk assessment.
Not just a lot of data, but many types of users
We have already mentioned two new professional figures that come into play in this scenario, the Data Scientist and the Chief Data Officer, but the variety of users on a large-scale platform does not end there.
Other categories of users are analysts, business users and IT professionals with expertise in various big data tools, from both an application and a systems point of view (for example, for the proper management of the cluster in which information is stored). For each category, data access behavior must be described, monitored and sometimes prohibited.
To increase business-to-business knowledge (“stretch competence” as Forrester calls it) and improve decision-making, business users must have access to raw (uncleaned and not yet aggregated) data, in order to obtain from them the greatest possible amount of information useful for their business.
Data operations performed by Data Scientists, who by their mandate work in an exploratory way and seek correlations and hidden values (data mining), can place a heavy load on the technological platform, as well as affect predefined data access rules. Think of aggregating different source datasets with different access policies: how are policies on the resulting dataset established? Which conflicts come into play? Who can handle potential disputes over the data?
The right compromise must first be thought through, then shared and finally implemented through ad hoc digital tools.
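To make the question about the resulting dataset concrete, here is a minimal sketch of one possible convention, "most restrictive wins": the derived dataset is readable only by roles allowed on every source, and it inherits the highest sensitivity level. This is an assumption for illustration, not a rule prescribed by any regulation or tool. The `AccessPolicy` class, the role names and the sensitivity scale are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sensitivity scale, ordered from least to most restrictive.
SENSITIVITY_ORDER = ["public", "internal", "confidential"]


@dataclass(frozen=True)
class AccessPolicy:
    allowed_roles: frozenset
    sensitivity: str


def combine(a: AccessPolicy, b: AccessPolicy) -> AccessPolicy:
    """Most restrictive wins: intersect the roles, keep the higher sensitivity."""
    stricter = max(a.sensitivity, b.sensitivity, key=SENSITIVITY_ORDER.index)
    return AccessPolicy(a.allowed_roles & b.allowed_roles, stricter)


# Hypothetical policies attached to two source datasets.
sales = AccessPolicy(
    frozenset({"analyst", "data_scientist", "business_user"}), "internal"
)
hr = AccessPolicy(frozenset({"data_scientist", "hr_manager"}), "confidential")

# Policy of the dataset obtained by joining the two sources.
joined = combine(sales, hr)

# An empty role set signals a conflict that a person has to arbitrate.
needs_arbitration = not joined.allowed_roles
```

In this example only `data_scientist` keeps access to the joined dataset, which is classified as `confidential`. When the intersection is empty, no automatic rule applies and the dispute has to go to whoever owns the data, which is the organizational question the paragraph above raises.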
The role of business
From an organizational and cultural point of view, data ownership involves quite a few problems. As we have said, pooling corporate data (breaking data silos) can lead to a loss of data ownership, causing security and compliance issues.
The business should therefore be strongly engaged from the outset to establish the rules and resolve issues of this nature through appropriate guidelines.
Again citing Forrester, in addition to supporting technologies and processes, data governance must guarantee data quality, accuracy, consistency and usability.
Given that big data offer a wide range of data and process management technologies, to which specific data governance technologies are added, it is important to emphasize that the topic remains squarely the preserve of the business, which must not delegate data governance to IT but must keep a tight grip on it.
There is a strong danger that technological complexity, coupled with the difficulty of reconciling the different languages of business and IT, will blur the real issue of governance. It is the task of the business to identify the priorities and to collaborate with IT on identifying appropriate ways to benefit from data usage.