To read the anonymized log of calls made from smartphones and reconstruct the names, addresses, relationships and even the health conditions of their owners: this is what two Stanford researchers recently achieved by correlating phone metadata (such as the user’s geographic location, the duration and frequency of calls, and the numbers contacted) with other information publicly accessible through social networks, search engines and web portals.
This result clearly demonstrates the potential of Big Data: large amounts of data, seemingly independent of one another, can reveal invaluable information when analyzed and correlated.
Storage and data analysis: opportunities and risks
Data analysis is of great interest to companies and organizations, which can use it to identify new business opportunities, target their initiatives, optimize their actions and processes, and thereby reduce operational costs. In recent years the cost of storing information has fallen rapidly, and the gradual spread of devices and sensors of all kinds contributes to producing large quantities of data to be recorded. These factors define a new trend in the IT industry: record as much data as possible and store it for a long time, so that its value can be extracted by future analysis.
The outlook has changed dramatically from just a few years ago, when data storage was used mainly for backup. Where archives were once typically kept offline and periodically overwritten, data are now progressively accumulated and often not properly protected. However, guaranteeing the confidentiality, integrity and authenticity of data in this new scenario is particularly challenging.
Sizing security measures has always been a matter of compromise between the value attributed to the information to be protected and the resources a potential attacker would need to carry out an attack. The big data scenario undermines these criteria: it is extremely difficult to determine the value of the data to be protected, since even seemingly irrelevant, non-sensitive data can reveal valuable information once correlated with each other or with other sources freely available on the web. Consequently, it is equally difficult to assess the cost (in computational resources and time) that a potential attacker is willing to invest to threaten a system.
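To make the attacker-cost side of this trade-off concrete, a back-of-envelope estimate of exhaustive key search is useful. The sketch below is purely illustrative: the attacker rate of one trillion keys per second is a hypothetical assumption, not a benchmark of any real adversary.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def avg_bruteforce_years(key_bits: int, keys_per_second: float) -> float:
    """Expected time to find a key by exhaustive search.

    On average the attacker must try half the keyspace, i.e. 2**(bits-1) keys.
    """
    return 2 ** (key_bits - 1) / keys_per_second / SECONDS_PER_YEAR


# Hypothetical attacker trying one trillion keys per second:
for bits in (56, 80, 128):
    print(f"{bits}-bit key: ~{avg_bruteforce_years(bits, 1e12):.2e} years")
```

The point of the exercise is how steeply the cost grows with key size: under the same assumed attacker, a 56-bit key falls in under a day, while a 128-bit key remains far beyond any plausible storage period.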
Guidelines and future prospects
To limit the risks associated with data storage, it is possible to follow some best practices:
- selecting the data to be kept: although this can be an extremely complex activity, it is worth identifying which data should be retained and which discarded, striking the best balance between the benefits of having the information available and the risk its retention introduces
- controlling access to stored data: when the application scenario allows it, it is advisable to keep information offline or on networks protected by strict access controls
- sizing security measures appropriately: where necessary, cryptographic techniques should be applied to ensure the confidentiality, integrity or authenticity of the data. It is also necessary to verify that the parameters used (for example, the size of the encryption keys) are adequate to withstand attacks over the entire period envisaged for storage
- managing the entire data lifecycle, paying particular attention to the deletion phase. It is also appropriate to regulate the issue, defining what organizations may and must do during every step of the data lifecycle.
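As an illustration of the integrity and authenticity point above, here is a minimal sketch using only Python's standard library: each stored record is sealed with an HMAC-SHA256 tag, so that later tampering is detectable when the record is read back. The key handling is deliberately simplified and hypothetical; in practice the key would live in a key-management system, not alongside the data.

```python
import hashlib
import hmac
import secrets

# Hypothetical long-term storage key; in a real deployment it would be
# kept in a KMS/HSM, separate from the archived data it protects.
storage_key = secrets.token_bytes(32)

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(record: bytes) -> bytes:
    """Append an HMAC-SHA256 tag so later tampering is detectable."""
    tag = hmac.new(storage_key, record, hashlib.sha256).digest()
    return record + tag


def verify(sealed: bytes) -> bytes:
    """Return the record if its tag checks out, else raise ValueError."""
    record, tag = sealed[:-TAG_LEN], sealed[-TAG_LEN:]
    expected = hmac.new(storage_key, record, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return record


sealed = seal(b"patient-id=42;diagnosis=...")
assert verify(sealed) == b"patient-id=42;diagnosis=..."
```

Note the constant-time comparison (`hmac.compare_digest`): a naive `==` on tags can leak timing information to an attacker probing the verifier.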
Finally, it is certainly worth following the research efforts that have for years been studying new cryptographic solutions designed to enable data analysis and processing while preserving confidentiality. Among the approaches that currently appear most promising are:
- homomorphic encryption: a cryptographic model (of which a few applications already exist) that allows algorithms to run on encrypted information, without any need to decrypt it
- secure distributed computing: a family of cryptographic protocols that allow multiple parties to jointly process data, sharing the final result while keeping each party’s input confidential.
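As a toy illustration of the first approach, the sketch below implements the Paillier cryptosystem, a well-known additively homomorphic scheme, using deliberately tiny primes (utterly insecure, chosen only so the example runs instantly). The homomorphic property is that multiplying two ciphertexts yields an encryption of the sum of the plaintexts, so a third party can add values it cannot read.

```python
import math
import secrets

# Toy primes for illustration only; real deployments use primes of
# 1024+ bits each.
p, q = 1789, 1867
n = p * q
n2 = n * n
g = n + 1                                  # standard choice of generator
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)


def L(x: int) -> int:
    return (x - 1) // n


mu = pow(L(pow(g, lam, n2)), -1, n)        # modular inverse (Python 3.8+)


def enc(m: int) -> int:
    """Encrypt m with fresh randomness r coprime to n."""
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2


def dec(c: int) -> int:
    return (L(pow(c, lam, n2)) * mu) % n


c1, c2 = enc(12), enc(30)
assert dec((c1 * c2) % n2) == 12 + 30      # addition under encryption
```

Paillier supports only addition (and multiplication by plaintext constants); fully homomorphic schemes, which support arbitrary computation, exist but remain far more expensive, which is why the article speaks of "a few applications" rather than widespread use.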