TECH | Oct 13, 2016

Big Data: problems and specific technologies

What are the characteristic layers and the main problems in managing Big Data, and what is its relationship with open source?

The term Big Data, together with the famous three Vs (volume, velocity, variety) that characterize it, was introduced 15 years ago by the then META Group – now Gartner – which dedicated its first Big Data Hype Cycle to the subject only in 2012, when use of the term had become truly pervasive. That moment also marked a keen interest from the market and from companies, which were beginning to ask serious questions about the potential hidden in the multiplicity of data. In 2013-2014 Big Data climbed a very steep upward curve, quickly structuring itself into different technological categories, then dipped rapidly towards the ‘Trough of Disillusionment’ zone of the Hype Cycle before leaving it definitively in 2015. This is a positive sign of the maturity of demand, even before supply, because it has become clear that great value can be drawn from the multiplicity of data potentially available today.

It is evident that the valuation of some recent giants (Twitter, Facebook, LinkedIn) rests on treating data as an asset, just as the ability shown by other mega players (Google, Amazon) to extract valuable information has given companies confidence that they too can put their data to good use, possibly blended with external data, not only to improve the efficiency of internal processes but also to put development strategies on a new footing.

Layer characteristics, problems and specific technologies

Big Data is a multifaceted subject covering the various stages of the data life cycle, each with its own problems, technologies and purposes. Scalability and distributed management are the qualifying technological assets, but they do not exhaust the problems and specificities of the Big Data paradigm, which is better understood as a set of layers in a logical data management architecture; examining each layer in turn also makes it clearer what is new and where the challenge lies at each point.

  • Storage. The first problem of a data-centred architecture is obviously that of storing the data and managing it operationally. This is the level traditionally occupied by RDBMSs in their various forms (OLTP, OLAP, analytical databases, database machines, appliances), all essentially oriented to managing structured data, for which database modeling is a founding step, not only to guarantee adequate performance but also to ensure a first level of data quality and cleanliness. Velocity and volume on structured data can also be handled with conventional solutions, which tend to concentrate the problem on a single, ever more powerful and optimized machine. Big Data, however, means changing direction and moving towards distributed architectures, in principle all the way to commodity hardware. But beyond this, what fundamentally changes is the conception of the database: no longer strongly and rigidly modeled, but far more open to accepting data in any form, like a large reserve, a data lake or data reservoir, as it is now called. The guiding principle is to save the data, whatever it is, preferring the possibility of having it over the need to give it an immediate, clear semantics or a minimum guaranteed level of quality, with an eye to integrating new data beyond what is traditionally present and managed in business processes.
  • Data processing. At the processing level, it follows that processes are no longer shaped by a rigidly binding target structure: the loading phase (now called ingestion) is much leaner and essentially works to make data available on the cluster, as if it were a large staging area, without losing data even at very high rates of production and/or consumption. The result, at least at first, is that the data lake is a large warehouse of data in which the urgency to deposit and preserve prevails over the need to understand and organize. Understanding comes in a second processing moment, driven by the purpose of use and the specific analysis rather than by the intrinsic semantics of the data. Datasets and algorithms take the place of modeling and transport, so that data processing, initially compressed into the storage level, now sits much closer to the level of analysis and consumption (a minimal ingestion and schema-on-read sketch follows this list).
  • Analysis. Clearly, the first purpose of collecting and storing data is to analyze it afterwards and to make use of it over time. Big Data technologies, however, were created with storage and processing in mind more than analysis, so that in the early stages running massive analyses proved very critical: such systems were slow when stressed in reading. This limit has since been overcome by the various acceleration solutions offered in specific distributions; the advanced analysis of Big Data clusters is now the new frontier in the race for value and has revitalized perennially interesting topics such as data mining and machine learning, so far practiced only in specialized fields. An emerging trait concerns data visualization for the production of insight: perceptions of phenomena rather than detailed measurements. If the data are many, heterogeneous and with semantics not necessarily well known a priori, the first step is exploratory, aimed at general understanding rather than strictly analytical work. For this reason, the ability to carry out data exploration, together with new forms of visualization aimed at communicating rather than describing, must be considered an essential element of data analysis in its own right (see the second sketch after this list).
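
To make the storage and ingestion points above more concrete, here is a minimal sketch in Python using Apache Spark (PySpark), one common engine in the Hadoop ecosystem discussed below. The paths and the idea of landing JSON events as-is are hypothetical illustrations of the data lake and schema-on-read approach, not a prescribed implementation.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingestion-sketch").getOrCreate()

# Ingestion: land the raw events exactly as they arrive, with no modeling,
# cleaning or schema imposed up front (paths are hypothetical).
raw = spark.read.text("/incoming/events/")          # each line kept as an opaque string
raw.write.mode("append").text("/lake/raw/events")   # the data lake as a large staging area

# Schema-on-read: a structure is applied only when the data is used,
# here by letting Spark infer it from the stored JSON lines.
events = spark.read.json("/lake/raw/events")
events.printSchema()
events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS rows FROM events").show()
```

The point is only that the modeling effort shifts from load time to use time; any comparable engine in the ecosystem could play the same role.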
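
For the exploratory first pass described under Analysis, a similarly hedged sketch: a quick profiling of the same hypothetical event data before any formal model or visualization is built. The column names (source, user_id) are invented for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-exploration-sketch").getOrCreate()

# Exploratory pass: understand what actually landed in the lake
# before committing to a model or a specific analysis.
events = spark.read.json("/lake/raw/events")   # hypothetical lake path

events.printSchema()        # which fields are really there, and of what type?
events.describe().show()    # basic statistics on the numeric columns

# A first, rough view of the phenomenon: volumes and distinct users per source
# (both column names are hypothetical).
(events.groupBy("source")
       .agg(F.count("*").alias("rows"),
            F.approx_count_distinct("user_id").alias("distinct_users"))
       .orderBy(F.desc("rows"))
       .show())
```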

The role of Open Source and Cloud

It is quite clear that Big Data is currently a dominant trend in IT, closely related to other hot topics such as IoT and smart cities. This is reflected in a ferment of technologies and solutions that creates a panorama as rich as it is varied, and not easy to govern or steer. A further sign is that, for Big Data, open source is the clear reference, at least as a starting point. The ecosystem that grew up around the Apache Hadoop project is in fact a reference also for the commercial solutions of the major vendors, which systematically include open source components in their distributions and, in some cases, actively help to develop them.

Organizing a Big Data architecture is not like choosing an RDBMS, where it is a matter of comparing comparable benchmarks. The aims and challenges must first be expressed at the different levels set out above, and the most appropriate technology then chosen accordingly. Often, especially when requirements are not yet well defined, it is useful to proceed by successive proofs of concept (POCs), measured on real data, to evaluate the benefit that a given Big Data system (made of technology and data) can offer an individual organization.

In this sense, open source and cloud are certainly two important resources, at least in the most exploratory phases, where, at reduced cost and in less time, it is possible to proceed by successive approximations and verify in practice the potential ROI before moving on to investment plans and more substantial transformations.

Grazia Cazzin