TECH | Jun 26, 2019

Data between compression and comprehension

From mp3s to the human brain, up to machine learning algorithms: that's why (and how) compressing means understanding the essential information.

The possibility of reducing the size of a file to then be able to reconstruct it completely is one of those techniques that we use every day, often without realizing it, and that greatly simplify our use of information technologies.

There are two particular circumstances where we appreciate the possibility of compressing files, folders and entire disks:

  • when we want to save space on a backup device, such as an external hard drive
  • when we want to save time in transmitting a file along a communication channel, for example an internet connection.

Brief history of data compression

The techniques of data compression which do not cause any loss of information (the so-called lossless compression) date back to the dawn of computer science and refer to the so-called “theory of information”, an application of probability calculation to the study of data transmission. In the 1940s, its inventor, the great American mathematician and engineer Claude Shannon, called it the “mathematical theory of communication”, which consisted in establishing the requirements for a data transmission along a physical channel of communication, therefore subject to disturbance and interference, in order to ensure that the message arrived in its entirety.

When studying this problem of data completeness, Shannon had to establish what the content of the transmitted message was and he formulated the mathematical concept of information measured in bits. That same theory therefore allowed to study the coding techniques of the information contained in a message in more compact codes, such as to allow the reconstruction of the message itself. Thus, data compression.

Compression equals data loss?

Many compression algorithms that we use on a daily basis, lose some information when compressing a message, according to a compromise which attributes more or less value to the ratio “quantity of information/reduction of resources to be processed”: for example, we all use video and audio in mp4 and mp3 formats, which are formats in which the information is compressed and sampled, in order to allow a partial reconstruction, but such as to maintain the essential content of the original message. Clearly, an expert ear recognizes, for example, the difference between the original audio and an mp3 compression of a piano concert, but for pop songs or podcasts, where the purity of the sound and the distinction of the different harmonics which, superimposed, compose it, these are not essential: it is not by chance, in fact, that it is the most common format on Internet, since to move an audio file through the network it is necessary to exchange very long messages. The PDF format is already compressed too and in fact, if we try to zip it, the space gained is laughable.

A fortiori video formats are always subjected to compression, which samples the signal so as not to transmit it in its entirety, but eliminates those parts considered “negligible”. Similar techniques allow filtering a signal in order to eliminate the background noise or to enhance some features and repress others: some hearing aids, for example, cut frequencies which are annoying for those suffering from certain diseases of the auditory system and who develop migraines in any public room because of the background noise, which normally our ears and brains can filter and ignore.

Because, to be honest, we also compress information, in various ways and for various reasons: in fact, we have been taught this from an early age. When we had to make “summaries” of a text in primary schools, we were asked to make a compression (not lossless, but losing some parts of the original message) to extract the basic information content and neglect the rest.

Compression calls for comprehension: and machine learning?

It is significant that such an ability is linked to the comprehension of a text, as if to say that compressing implies knowing how to understand and that therefore compression requires comprehension. In this era when the skills of the human mind are increasingly reproduced, imitated and enhanced by software, it is not surprising that data compression is also used to understand these: for example, compression is used for data mining, the extraction of “hidden” but important information from data sets.

Several studies based on a theory which combines Shannon’s theory with the Turing computability theory, the so-called “algorithmic information theory”, or Kolmogorov complexity (the brilliant Russian mathematician who axiomatized the calculation of probability), have created algorithms by means of which, by compressing data and then comparing the codes resulting from the compression according to more or less simple and easily calculable metrics, it is possible to classify data in a similar way to what the machine learning algorithms such as neural networks do.

Techniques of this type work well with sequences of symbols of a certain alphabet, such as texts in human languages or even DNA, which, simplifying, is a text constructed by sequencing four characters (the bases A, C, G and T). The complete genome of a human being requires about 3 Gigabytes to be stored. However, to compare portions of DNA and to better understand how certain genes activate certain proteins, different DNAs are compared, by compressing them using lossless compression algorithms specifically designed for DNA, thus achieving interesting clustering results.

Similar techniques are applied to strings which do not necessarily represent texts in a language, but information extracted from images or videos, and in many cases this information can be classified simply by compressing it and applying simple correlation formulas, which use only the length of the compressed file and not the compressed data itself.

What is the key to algorithms and to data analysis techniques?

This surprising correlation between the meaning and the content of information of a datum and the length of the compressed datum can be explained by the theories we have mentioned and therefore provides a classification and clustering tool which, for example, compared with neural networks, has the enviable feature of knowing why it works and does not depend on configuration parameters, which must be calibrated on the basis of specific cases.

Then again, our brain, of which a neural network is an oversimplified model, seems to work differently: in order to learn to distinguish dogs from cats, for example, a neural network must be trained with millions of examples. But a three-year-old can distinguish a dog from a cat without having seen that many. If we ask a child to draw a cat, it will provide a very schematic drawing in which, however, there will be distinctive features such as the whiskers or the tail, which somehow capture and characterize the information of “being a cat”. In other words, with that drawing the child compresses the object he/she has in mind by eliminating the non-essential information and keeping only the distinguishing factors, so that we can understand the object he/she has in mind when looking at the compressed message.

Perhaps in the relationship between compression and comprehension (which, after all, are almost the same word in our language!) there is one of the keys to algorithms and data analysis techniques in the years to come, close to the human way of treating information, not in the architecture of the algorithm but in the substance of how it processes data and provides its responses to the stimuli of the environment. Finally, if you thought this article was interesting: try summarizing it!

Paolo Caressa