TECH | Feb 20, 2018

Machine learning and word immersion

Transforming semantically similar words into nearby points in space: what is the relationship between Cartesian axes and words?

In high school, all Italians have heard the story of Manzoni's famous "washing clothes in the [river] Arno". The story goes that Alessandro Manzoni, author of the Italian novel Promessi Sposi (The Betrothed), who hailed from Lombardy, used the expression "rinsing his 'panni' [clothes] in the Arno" to describe the linguistic revision of his novel, meant to ensure that it used the Italian spoken in Tuscany (in Florence, to be precise); the operation succeeded so well that the phrase now refers directly to the Arno, "in whose waters I rinsed my cenci [rags]".

This immersion of lexical clothes in the beautiful river that winds through Tuscany was, at the time, a poetic metaphor, but it comes to mind today when reading the modern expression "word embedding": this designates a technique for analyzing texts written in any natural language which essentially immerses a text in a river of numbers, numbers which are then fed (offered to drink, one might say!) to a machine learning algorithm in order to obtain information on the semantics of the text itself.

Transforming symbolic into numeric information

The idea of transforming symbolic information into numerical information is anything but recent (like many of the good ideas in modern computing): it has been used for decades in the field of Information Retrieval, that is, in techniques for carrying out efficient searches within a corpus of documents, where it is known as the Vector Space Model.
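
A minimal sketch of the idea in Python, using a tiny invented corpus: each document becomes a vector whose coordinates count the occurrences of each vocabulary word, so the whole collection lives in a Cartesian space with one dimension per distinct word (real Information Retrieval systems refine this with weighting schemes, but the principle is the same).

```python
# Vector Space Model sketch: documents represented as term-count vectors.
# The three "documents" below are toy examples invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "my cat chased the dog",
]

# One Cartesian dimension per distinct word of the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc):
    """Map a document to its point in the len(vocabulary)-dimensional space."""
    words = doc.split()
    return [words.count(term) for term in vocabulary]

for doc in docs:
    print(doc, "->", to_vector(doc))
```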

To understand it, we need a little imagination: we all know (again from high school) the idea of Cartesian geometry, which consists of identifying points of the plane with pairs of numbers. More precisely, each point of the plane is represented by a pair of coordinates, which are just numbers, and vice versa. The plane itself can thus be thought of as the set of pairs (x, y) of numbers, where x and y vary independently and in all possible ways. The same is true for Cartesian space, only this time points are represented by triples of numbers (x, y, z).

Points on the Cartesian plane: each point is a pair of numbers, its coordinates with respect to the axes. For example, (2, 3) has coordinates x = 2 and y = 3, which means that projecting the point onto the x axis the projection falls at distance 2 from the origin, and projecting it onto the y axis it falls at distance 3 from the origin.
(Source: Wikipedia)

In the nineteenth century, mathematicians realized that although it is impossible to visualize or imagine spaces with more than three dimensions, it is still possible to treat them algebraically: indeed, by analogy with the previous cases, we can think of a space with N dimensions as the set of N-tuples of numbers (x1, x2, …, xN), where each xi varies independently of the others in all possible ways. Each position in an N-tuple is called a "dimension" and the set of all N-tuples is called N-dimensional Cartesian space.

All the geometry of the plane and of space can be carried out in an N-dimensional space; in particular we can speak of points, lines, planes, and so on. It is also possible to calculate the distance between two points using an immediate extension of the distance between two points in the plane, again a recollection from high school: if X = (x1, x2, …, xN) and Y = (y1, y2, …, yN) are points of the N-dimensional space, then their Euclidean distance is

d(X, Y) = √( (x1 − y1)² + (x2 − y2)² + … + (xN − yN)² )

For those who like to reflect on these things, the formula is basically the Pythagorean theorem in N dimensions.
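
As a concrete check of the formula, here is a minimal Python sketch of the N-dimensional distance; the two example points are invented for illustration.

```python
import math

def euclidean_distance(x, y):
    """Distance between two points of N-dimensional space: the square root
    of the sum of the squared coordinate differences (the Pythagorean
    theorem extended to N dimensions)."""
    assert len(x) == len(y), "the two points must have the same dimension"
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# In the plane (N = 2) this is the familiar high-school formula...
print(euclidean_distance((0, 0), (3, 4)))                    # 5.0
# ...and exactly the same code works for, say, N = 5.
print(euclidean_distance((1, 2, 3, 4, 5), (5, 4, 3, 2, 1)))  # about 6.32
```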

Cartesian spaces and AI

The use of N-dimensional Cartesian spaces is essential in artificial intelligence: algorithms like neural networks expect the data to be processed to be points in a Cartesian space, that is, N-tuples of numbers. Algorithms that process text are no exception: to understand how a text can be handled in this way, let us see how to rinse our clothes in N-dimensional spaces or, metaphor aside, how to immerse words in a Cartesian space of N dimensions.

The standard technique consists, first of all, in splitting the text into the succession of words that compose it, and in identifying in some way words that represent the same thing (for example, by writing everything in lower case, etc.). The idea is then to choose a dimension N high enough to give "space" to the words we want to insert: for example, for a text of billions of words (many of which are obviously repeated) we will choose a space of dimension N = 100 or thereabouts.
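
A minimal sketch of this first step, assuming a simple regular-expression tokenizer (real systems use more sophisticated normalization, such as handling punctuation, accents and inflected forms):

```python
import re

def tokenize(text):
    """Split a text into the succession of its words, lower-casing everything
    so that different spellings of the same word are identified."""
    return re.findall(r"\w+", text.lower())

print(tokenize("The cat sat on the mat. The Cat, again!"))
# ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'cat', 'again']
```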

If we succeed in mapping words onto points of the N-dimensional Cartesian space so that semantically similar words end up on nearby points (in the sense of the Euclidean distance defined above), then the relationship of semantic proximity (i.e. "having a similar meaning") is transformed into a relationship of geometric proximity between points.

Mapping words

One of the surprising achievements of the last few years in the field of machine learning is precisely a family of algorithms of this type, word embedding algorithms to be exact, that manage, simply by analyzing a corpus of documents, to map the words that recur in them so as to transform semantics into geometry.

The idea behind these algorithms actually comes from the research of linguists in the 1950s (such as Zellig Harris and John Rupert Firth) who stated that the meaning of a word is somehow encoded by the words surrounding it in the texts in which it appears: in simpler terms, that meaning is related to context or, as Firth wrote, "you shall know a word by the company it keeps".

A class of distributional algorithms of this kind, word2vec, was invented in 2013 by Tomas Mikolov, who at the time worked at Google (and is now at Facebook Research): it consists in using a particular type of shallow neural network, with a single hidden layer, that learns in an unsupervised way the contexts of words and encodes them in that internal layer of the network itself; this layer contains as many N-tuples of numbers as there are words encountered by the network in its training, and thus provides the required association between the words of the corpus and points in the Cartesian space of dimension N.
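
A minimal sketch of how such a model can be trained in Python, assuming the gensim library (version 4 or later) and a handful of invented toy sentences in place of a real corpus of billions of words:

```python
from gensim.models import Word2Vec

# Toy corpus: in practice this would be millions of tokenized sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "man", "and", "a", "woman", "walk", "along", "the", "arno"],
]

model = Word2Vec(
    sentences,
    vector_size=100,  # N, the dimension of the Cartesian space
    window=5,         # how many surrounding words define the "context"
    min_count=1,      # keep even rare words in this toy example
)

# Every word of the corpus is now a point of the 100-dimensional space.
print(model.wv["king"].shape)            # (100,)
print(model.wv.similarity("king", "queen"))
```

With a real corpus, the similarity values and distances between these points reflect the semantic proximity of the words, which is exactly the property exploited by the applications mentioned below.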

The surprising fact about this algorithm, which makes it valuable in many applications (automatic translation, textual search, recommendation systems, etc.), is that it not only transforms semantically similar words into nearby points of space, but also translates concepts into linear relations: concepts that link pairs of words correspond to segments joining the corresponding points. In this way a sort of "geometry of meaning" is obtained, which allows semantic concepts to be translated into relations in Cartesian space, whose formulae have been studied for four centuries.

Projection onto the Cartesian plane of the word2vec result applied to a corpus in English: the segment that joins the points corresponding to the words "king" and "man" and the segment that joins "queen" and "woman" are identical apart from their position: applying the segment that goes from "man" to "king" to the point "woman" leads to the point "queen". In other words, that segment encodes the concept of "royalty".
(source: http://www.caressa.it/vettorieparole/)
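
The "royalty" segment of the figure can be reproduced with a few lines of vector arithmetic; the sketch below assumes gensim and one of its downloadable pretrained models (here GloVe vectors computed on Wikipedia, chosen only as an example: any good embedding trained on a large English corpus would do):

```python
import gensim.downloader as api

# Pretrained 100-dimensional word vectors (downloaded on first use).
vectors = api.load("glove-wiki-gigaword-100")

# "king" - "man" + "woman": apply the segment going from "man" to "king"
# starting at the point "woman", then look for the nearest word.
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Expected: [('queen', ...)] -- the geometry encodes the concept of royalty.
```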

As always, good old ideas find new application dimensions (it is appropriate to say!) in modern technologies, a reminder that science and technology must preserve the memory of their past in order to build their future.

Paolo Caressa