Although still perceived as innovative technology, it is worth underlining that the elephantine Hadoop is already 12 years old. As a complex solution for storing and processing multi-format data, it has not always been adopted wisely; introduced in inappropriate scenarios, it has even proved counterproductive. In 2017, Gartner estimated that 60% of Big Data projects (most of them probably based on Hadoop) had been unsuccessful.
In such circumstances, the investment in know-how, and in the infrastructure needed to manage a cluster, certainly aggravates the situation.
Processing in Hadoop is based on the principle of data locality: storage and processing reside on the same cluster nodes, so that computational tasks run on the DataNodes that already hold the data to be processed, avoiding data traffic on the network. Furthermore, a typical Hadoop cluster must, by definition, be composed of nodes with the same hardware configuration, in order to guarantee uniform performance.
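The data-locality principle can be illustrated with a minimal scheduling sketch. This is not the Hadoop/YARN API – the names `block_locations`, `free_nodes` and `pick_node` are hypothetical – it only shows the idea of preferring a node that already stores a replica of the block over one that would have to fetch it across the network.

```python
# Hypothetical sketch of data-locality-aware task placement.
# A real YARN scheduler is far more sophisticated; this only
# captures the "prefer a node holding the data" heuristic.

def pick_node(block_id, block_locations, free_nodes):
    """Return (node, placement) for a task, preferring data-local execution."""
    replicas = block_locations.get(block_id, set())
    # Node-local: a free node that already stores a replica of the block.
    for node in free_nodes:
        if node in replicas:
            return node, "node-local"
    # Otherwise the block must travel over the network to the chosen node.
    return next(iter(free_nodes)), "remote"

# Block blk_001 is replicated on DataNodes dn1 and dn3.
block_locations = {"blk_001": {"dn1", "dn3"}}
print(pick_node("blk_001", block_locations, ["dn2", "dn3"]))  # ('dn3', 'node-local')
print(pick_node("blk_001", block_locations, ["dn2", "dn4"]))  # ('dn2', 'remote')
```

When storage and compute are decoupled, the "node-local" branch can never fire: every read is remote, which is why network bandwidth becomes the decisive factor discussed below.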
These characteristics have long dictated the rules of deployment, almost discouraging adoption in the Cloud, despite the potential it offers in terms of fast provisioning and scalability.
Demand for elasticity
Technologies that enable microservices architectures and DevOps practices, such as containerization solutions like Docker and their orchestrators (for example Kubernetes), are currently very attractive because they make it possible to scale processes elastically across multiple instances, and they lend themselves well to continuous integration (CI) and continuous deployment (CD) pipelines.
Similarly, in the world of Cloud Computing, elasticity is the automated capability to scale IT resources transparently, in response to runtime conditions or as predefined by the Cloud provider or consumer. Elasticity is often considered a fundamental justification for Cloud adoption, mainly because it is closely associated with reduced upfront investment and costs proportional to actual usage.
From what has been said so far, common paths and a shared scenario begin to emerge, and Hadoop is moving in the same direction, supporting the execution of YARN containers inside Docker containers in the cluster, although this support is still experimental.
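As a rough idea of what enabling this looks like, the sketch below shows a minimal `yarn-site.xml` fragment, assuming a Hadoop 3.x NodeManager with the Linux container executor; property names are from the Hadoop documentation, but exact values and prerequisites (Docker installed, `container-executor.cfg` configured) depend on the distribution and version.

```xml
<!-- Sketch: allow the NodeManager to launch YARN containers via Docker. -->
<!-- Requires LinuxContainerExecutor and a matching container-executor.cfg. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
  <value>default,docker</value>
</property>
```

Individual jobs then opt in at submission time through environment variables such as `YARN_CONTAINER_RUNTIME_TYPE=docker` and `YARN_CONTAINER_RUNTIME_DOCKER_IMAGE`, selecting the image to run in.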
Another road traveled by several Cloud providers is to offer Hadoop in PaaS (Platform as a Service) mode, thanks to the decoupling of storage from the processing layer.
How can data locality be given up without compromising the performance of Hadoop?
Cloud service providers have increased their network capacity to 10 Gbps or more, frequently reaching 40 Gbps. Thanks to this renewed bandwidth, the network data traffic caused by separating the storage and processing layers in Hadoop is no longer an insurmountable performance problem. In addition, the advent of low-cost Object Stores – such as Amazon Web Services’ S3, Microsoft’s Azure Blob Storage, IBM Cloud Object Storage or Google Cloud Storage, to give just a few examples – has further accelerated the trend. These storage systems expose interfaces compatible with the HDFS (Hadoop Distributed File System) API and permit a further lowering of infrastructure costs.
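In practice, this compatibility is exposed through Hadoop-side connectors. As one example, the S3A connector lets Hadoop tools address an S3 bucket through the familiar FileSystem API; below is a minimal `core-site.xml` sketch, where the credentials are placeholders and, in production, would normally come from a credential provider rather than plain configuration.

```xml
<!-- Sketch: let Hadoop address Amazon S3 through the S3A connector. -->
<!-- YOUR_ACCESS_KEY / YOUR_SECRET_KEY are placeholders. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```

With this in place, a path like `s3a://my-bucket/data/` can be used wherever an `hdfs://` path would appear, for example in `hadoop fs -ls s3a://my-bucket/data/` (bucket name hypothetical).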
But why decouple? What is the trump card?
This technique has significant advantages in terms of flexibility, scalability, maintainability and cost. In particular, it makes it possible to optimize the infrastructure for the workloads it must support, or to handle processing peaks. In a traditional on-premises Hadoop cluster, by contrast, computation is tightly coupled to storage: if the resources dedicated to computation need to grow, new DataNodes must be added, each with its own disk component, even when the volumes being managed do not require more storage.
Vice versa, if the cluster needs to work with very large datasets, more DataNodes must be bought to store the data. Each node will include expensive RAM and CPU resources, even if the cluster’s overall computing needs are already met by the existing nodes. This is particularly true of clusters that manage petabytes of data, which often contain a lot of cold data that requires no computational resources at all.
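The economics of the two preceding paragraphs can be sketched with a back-of-envelope model. The unit costs below are entirely hypothetical; the point is only the structure of the calculation: in a coupled cluster every extra storage node also carries compute hardware, while a decoupled architecture grows the Object Store alone.

```python
# Back-of-envelope cost sketch (hypothetical unit prices) comparing
# coupled storage+compute scaling with decoupled storage-only scaling.

CPU_RAM_COST_PER_NODE = 4000   # hypothetical cost of the compute part of a DataNode
DISK_COST_PER_NODE = 1000      # hypothetical cost of the storage part of a DataNode

def coupled_cost(extra_storage_nodes):
    # Every extra DataNode carries both components, needed or not.
    return extra_storage_nodes * (CPU_RAM_COST_PER_NODE + DISK_COST_PER_NODE)

def decoupled_cost(extra_storage_nodes):
    # Only the storage layer grows; compute capacity is left untouched.
    return extra_storage_nodes * DISK_COST_PER_NODE

print(coupled_cost(10))    # 50000
print(decoupled_cost(10))  # 10000
```

The gap widens precisely in the petabyte-scale, cold-data scenario described above, where most of the purchased CPU and RAM would sit idle.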
Another consideration concerns the previously discussed topic of elasticity: adding and removing DataNodes on the basis of real-time demand triggers significant HDFS rebalancing work to track the new allocation of the blocks into which files are divided, wasting processing resources that could be used for actual workloads.
The scenario described concerns the PaaS solutions from which Hadoop derives the greatest benefit in the terms outlined above, but there are other Cloud delivery models, the most common being IaaS (Infrastructure as a Service), where Hadoop is installed on VMs operated by the Cloud provider. In this scenario horizontal scalability still has advantages, since adding nodes to the cluster takes only a few clicks; in terms of elasticity, however, the best one can hope for is “graceful decommissioning” when the aim is to decrease the number of nodes – an operation that, contrary to what one might think, is far more expensive than the previous one.
The differences between the various Cloud solutions – public, private or hybrid – are also influencing the evolutionary choices of Big Data systems, of Hadoop distributions (Cloudera and Hortonworks, for example, have developed dedicated modules for cluster provisioning and for simplifying the creation and execution of workloads in the Cloud) and of Hadoop itself in its most characteristic components.