TECH | Jul 24, 2018

Even neural networks imagine

Constructing (and deconstructing) to do something new. Well, almost anything

When we were children many of us were passionate about Lego constructions: those who are old enough will remember that once upon a time the famous bricks were not differentiated and specialized like today, but more primitive, with simple and essential shapes: parallelepipeds and cubes of various thicknesses, and some joining pieces with curved or triangular shapes. Although what could be constructed was illustrated on the boxes, the bricks always ended up being used to assemble something else. Perhaps we first tried with the construction shown on the cover, but then it was disassembled, reassembled and so on until something new, original and unique was obtained.

The pedagogical value of those afternoons spent constructing and dismantling bricks is undeniable. This sort of “deconstructivist-constructivist” method is, in fact, a learning strategy to create things that are new but similar to things that already exist: it is not surprising then that, in recent times, there have been those who have implemented it in Machine Learning systems.

Deducing data from other data

Neural networks have now become pervasive not only in the field of Artificial Intelligence but also in all information technology, where the most disparate applications are used, all of which can be traced back to a very simple idea: deducing data from other data. For example, a network that classifies a text to decide its “sentiment” deduces Boolean type data (positive/negative) from textual data (a string). The network actually accepts numbers and produces numbers: in this case the numbers in input encode the text, the ones in output, based on a threshold value, give rise to two classes: the numbers below the threshold (negative) and those above the threshold (positive).

When the network is trained, its output is compared with an input X whose sentiment S is known: if the network at input X associates a sentiment equal to S, it has answered correctly, otherwise it has erred and the parameters internal to the network (weights) are adjusted to improve the answer in the future.

So a neural network responds to numbers with other numbers, having internally constructed a model for the response based on a pre-packaged set of given answers: if we show it a million tweets whose sentiment is known and encoded, the network learns to recognize positive and negative tweets and, after the training period on this test data, it can be used on data that it has never seen before, using its own internal model to classify them. This mechanism is a bit common to all neural networks, regardless of their structure and scope of application.

A “training set” of data – which is a set of pairs (X, Y) where X is an input and Y the output expected for it – is used in the training: X is fed to the network, a Y’ is obtained from it and it is compared with the correct Y to eventually correct the behavior of the network itself. In this way, after various “epochs” of learning, the network will be able to correctly associate the Xs with the corresponding Ys.

This is why we say that a network “classifies” or “predicts” data. Nevertheless, this is our linguistic suggestion: given numbers in input, what the network does is produce numbers in output so that the association between input and output is consistent with a set of test data.

Neural networks and images

Input X and output Y are usually quite different: for example, X is an image and Y is a label (we might want a network that distinguishes dogs from cats, or ducks from platypuses, etc.). However, nothing stops X and Y from being of the same type, for example two images. In this case, the network produces images starting from images, and we would like to train it so that image X matches something similar to image Y.

A “simple” case is when X = Y: then the network is trained to reproduce the data that is passed to it in input. To do this in a non-trivial way and internally encoding the structure of this data, the network first associates an internal encoding with image X, usually more compressed, and then deduces a new image from the internal encoding. Thus this type of network actually contains two networks: an encoder that “compresses” the input information into an internal format (encoding) and a decoder that “decompresses” the encoding into an output image. The purpose is to create images similar to those of input.



These networks have two interesting characteristics: they are not supervised, that is, the training set data need not be labeled or otherwise associated with an expected output (in fact, the expected output is the input itself); they are not used to predict or classify but rather to do the inverse, namely to provide examples of given classes.

For example, given a “positive” input, a “generative network” (this is the name usually given to these models) could produce a text with positive sentiment and, given a “negative” input, produce a text with negative sentiment. Or it could produce images of dogs, cats, ducks, platypuses and anything else on request: none of these images would correspond to a photo ever “seen” by the network, but would be a synthesis of a photo similar to the photos it had been able to analyze.

Business applications are interesting: for example, given a very schematic drawing, such as a sketch for a new model of handbag, the network could finish it by adding colors, contours or even styles on request. The following image shows how to apply the style of a painting by Van Gogh to a photo in order to obtain the place of the photo as he would have painted it (more or less!):





Other impressive examples of the power of these networks are given in the following illustrations (taken from a scientific article of 2016): in the first of them we see a row of bedrooms imagined by a generative network; in the second line we see the result when the network was asked not to insert windows, which have been deleted or transformed into something different.

In the following illustration we see instead a sort of “arithmetic of expressions” in which, starting from some faces, the network is asked to subtract a characteristic and add another to create a new face with the desired characteristic: in this case from a smiling woman and a man with a neutral expression we want a smiling man, so we subtract a woman with a neutral expression:


The figure shows three different results produced by the network for this request.

These truly amazing results are obtained by combining in a very interesting way techniques of convoluted neural networks (particular “deep” networks) according to some schemes that have been proposed in the literature in recent years, and which we could explore in a future article.

Paolo Caressa