TECH | Nov 9, 2017

Big Data: the payback of SQL

How databases have changed and how they will change in the future

“It is perhaps fair to say that from the perspective of many engineers working on the Google infrastructure,
the SQL vs. NoSQL dichotomy may no longer be relevant.”
Source: “Spanner: Becoming a SQL System”

 

We are witnessing a strange phenomenon: after a binge of new technologies for coping with a whole new landscape related to Big Data management, we are increasingly coming across solutions that adopt a relational approach to data or at most provide an SQL interface for their querying.

A technological chapter that seemed closed, or at least did not at first seem so interesting, has thus reopened for Big Data. Let’s see how.

The origin of SQL

The origins of relational databases date back to the Seventies, when databases were implemented that were easy to use for writing, reading, and maintenance operations, based on a modeling of data in the form of tables consisting of rows and columns in relation to each other. The language for querying these entities was originally called SEQUEL (Structured English QUEry Language), consisting of high-level declarative access, translated from the database into a set of procedures to run on physical files.

The relational database is an alternative to what was used at that time, namely solutions related to indexed files (ISAM/VSAM type) for mainframes and is met with great success in the IT world, with a supply of commercial products for businesses, of which Oracle offers a first example.

SQL (Structured Query Language) becomes a standard in 1986 with the emergence of SQL ANSI, and large investment in relational DBs are made by vendors such as IBM, Microsoft and Oracle, also opening the door to open source databases such as MySQL and PostgreSQL, and to various technologies, such as applications for Business Intelligence or ETL (Extract, Transform and Load) tools.

Relational databases in the era of Big Data

With the arrival of Big Data, relational databases lose some popularity because they are often unsuitable for scenarios where the data to be handled can no longer be represented by a rigid schema of rows and columns, or the volume of data is so significant that it requires a scalable hardware solution with lower economic impact. The harshest criticism concerns precisely that aspect. From a traditional point of view, the computing and storage capabilities of the database can only be achieved by increasing the server’s hardware features (scale-ups), which results in an increase in costs according to the volume of data to be managed.

The NoSQL (Not Only SQL) movement and Hadoop-based systems installed on low-cost clusters of machines meet these new needs and are concerned with resolving the theme of scalability as well as a number of other issues, the most significant of which are:

  • handling semi-structured/unstructured data or with a flexible schema
  • supporting very large datasets with high response performance for reading and writing operations
  • making APIs (Application Programming Interfaces) available for interaction
  • increasing the number of cluster servers for handling greater volumes (scale-out).

Were we better off before?

Great enthusiasm and ferment last for a few years, but some perplexities soon emerge. How can the consistency of data be ensured? How do you interface with tools consolidated in the company and by business, such as those dedicated to BI and ET? How can these databases be queried? How is it possible that a simple grouping has to engage a developer who is an expert with the tool? And further: how can flexible enough access to data be guaranteed such as to ensure full understanding of information without having to modify the data model?

These are all questions for which some NoSQLs have tried to provide an answer. For example, Cassandra has equipped its system with an SQL-like language which does not, however, protect against the NO (understood as negation!) SQL functioning of the DB. Also on the Hadoop front, for working on unstructured or semi-structured data, solutions called “query-engine” or “SQL on Hadoop” that every distribution or Hadoop vendor integrates have been alternating for some years now.

Increasingly often, systems are being sought that do not only provide excellent performance in querying aimed at reading data for analytic purposes, as happens in data warehouses, but also provide mechanisms for managing transactions. These two approaches are often mutually exclusive in many Big Data storage technologies, which preferably meet OLPA (On Line Analytical Processing) analytics requirements and sacrifice the OLTP (On Line Transaction Processing) component. Writing operations such as insertions, deletions and changes are in fact available on NoSQL databases, but often cannot be queried through SQL.

As a result, so-called HTAP (Hybrid Transactional/Analytical Processing) hybrid databases are emerging, which are designed to handle both requirements, albeit at the expense of performance. It is on this front that SQL and NoSQL finally converge for managing structured data and scaling of a few orders of magnitude compared with other well-known MPP (Massively Parallel Processing) proprietary systems. Among the latest exponents are Google’s Spanner or Cloudera’s Kudu.

SQL is back!

A second youth for SQL is no doubt welcome news to all those clients of Big Data who, over the years, have focused on this language and who initially felt the possibility of interrogating the Big Data contained in the Data Lake precluded. This is not enough: new opportunities are also arising in this period for interfacing with real-time data via SQL.

A recent announcement is the release of a solution for carrying out “streaming SQL” on data in the Apache Kafka Messaging system, an integral part of many Hadoop distributions and a technological standard. This allows you to perform real-time analysis through continuous querying in SQL language of data that transit in the queue.

Similarly, it is possible to find tools dedicated to streaming data processing that allow you to use this syntax for real-time data querying, characterized by being part of a continuous flow. An example is Amazon Kinesis present in the PaaS cloud platform of AWS.

In conclusion, more and more areas of application lie ahead for SQL language, even where its adoption would seem less obvious but provides valid support for the use of Big Data tools.

Monica Franceschini