TECH | Apr 19, 2019

Learning through mistakes. What about artificial intelligences?

Mistakes are necessary in machine learning too. That is how machines become much better than us.

We have all been told to “learn through your mistakes”; the Romans, more cautiously, used to say that “to err is human” (but to persist is diabolical). Now that the concept of humanity seems destined to expand to include new forms of intelligence, we can ask ourselves whether making mistakes is still an exclusively human trait.

In fact, machine learning algorithms, which now populate our devices and have become familiar, at least in name if not in substance, to almost everyone interested in technology, do just that: they learn from their mistakes. A neural network, for example, is trained to recognize or generate something starting from a set of correct answers: it learns only from its mistakes, by trying to minimize them. This is quite different from the way we learn. At school, my daughter is not repeatedly and randomly subjected to thousands of questions taken from a training set, answering each one and receiving feedback on her answer. Instead, general principles are explained to her, and she then does exercises to understand or consolidate those principles, not to extract knowledge from the examples.
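To make "learning only from mistakes" concrete, here is a minimal sketch in Python. It is not a real neural network, only a single linear unit fitted by gradient descent, and the training set, the learning rate and the function names are all assumptions made up for illustration. The point is that the parameters change only in proportion to the error made on each example.

```python
# Hypothetical training set: inputs x paired with "correct answers" y = 2*x + 1.
training_set = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


def train(examples, epochs=2000, learning_rate=0.05):
    """Fit y ~ w*x + b by repeatedly shrinking the prediction error."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:
            error = (w * x + b) - y  # the "mistake" made on this example
            # Nudge each parameter in the direction that reduces error**2.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b


def mean_squared_error(examples, w, b):
    """Average squared mistake over the whole training set."""
    return sum(((w * x + b) - y) ** 2 for x, y in examples) / len(examples)


w, b = train(training_set)
print("error before training:", mean_squared_error(training_set, 0.0, 0.0))
print("error after training: ", mean_squared_error(training_set, w, b))
```

Nobody tells the program the rule behind the data: it starts out wrong and gets less wrong with every pass over the examples.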

When the mistake is necessary

However, the proverb “learning through mistakes” is there to tell us that in many cases there are no general principles to appeal to or that, if they exist, they are difficult to understand and implement. As children (and even as adults), we have all learned something “at our own expense”, like the fact that putting a hand on a fire burns us, that annoying a wasp can cause it to sting us, that putting the cat in the microwave is not a way to dry it but one that kills it, and so on.

The lessons we learn from such mistakes, some of them fatal, typically remain indelibly imprinted in our memory: they are not part of a general scheme, but they are events that help us avoid further foolishness.

With children, who are less experienced in life and therefore tend to make more mistakes, we often use a reward-and-punishment approach which, to borrow another old proverb, we could call “the carrot and the stick”: we reward a correct action that should be repeated and perhaps adopted as a healthy habit, and we punish an action that must not be repeated and risks becoming a bad habit. Depending on the number and severity of the punishments, this age-old method seems to work, and in fact it continues to be used. It is the quintessence of “learning through mistakes”, now elevated to a method.

Reinforcement learning

The same system is used to teach machines to behave in a certain way, according to a machine learning paradigm called reinforcement learning, that is, learning driven by rewards. The idea is very simple and easily described.

To begin with, the “learner”, so to speak, is called the “agent”: an entity that at each moment finds itself in a certain state, described by a set of variables, within what is called, with an ecological flavor, the “environment”. The environment changes, so the agent can find itself in various states according to the circumstances. In any case, at every moment the agent must perform an action which, depending on the state of the environment, can be judged positive or negative by an impartial (“super partes”) entity, the “interpreter”. In this scenario the interpreter plays the role of “oracle” (or divinity) who “judges and gives” (as the Poet says) rewards or punishments depending on the agent’s behavior:

[Figure: the loop between agent, environment and interpreter. Source: Wikipedia]

This situation could describe, for example, an adult (the interpreter) who decides whether to reward or punish a child (the agent) according to the circumstances (the environment) and the child’s behavior (the actions). If the child tidies up the room after playing, he or she may receive a prize, and may be punished for not doing so; but if the child tidies up when it is time to leave for school, and is therefore late, the same action in a different state will earn a punishment. In this case the agent is sentient: it decides and learns, or rather learns to decide.
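The scenario of the child and the room can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two states, the two actions, the +1 and -1 values, and the function names. The sketch only shows the loop itself: the environment presents a state, the agent acts, the interpreter judges.

```python
import random

# Hypothetical states and actions, taken from the example of the child.
STATES = ["after_playing", "time_for_school"]
ACTIONS = ["tidy_room", "leave_room"]


def interpreter(state, action):
    """Judge an action in a given state: +1 is a reward, -1 a punishment."""
    if state == "after_playing":
        return 1 if action == "tidy_room" else -1
    # When it is time for school, tidying up makes the child late.
    return -1 if action == "tidy_room" else 1


def run_episode(choose_action, steps, rng):
    """Let the agent act for `steps` moments and return the total reward."""
    total = 0
    for _ in range(steps):
        state = rng.choice(STATES)           # the environment changes
        action = choose_action(state, rng)   # the agent acts
        total += interpreter(state, action)  # the interpreter judges
    return total


def random_agent(state, rng):
    """An agent that has learned nothing yet and acts at random."""
    return rng.choice(ACTIONS)


def well_behaved_agent(state, rng):
    """An agent that already knows the right action for each state."""
    return "tidy_room" if state == "after_playing" else "leave_room"


rng = random.Random(0)
print("random agent:      ", run_episode(random_agent, 100, rng))
print("well-behaved agent:", run_episode(well_behaved_agent, 100, rng))
```

Note that the interpreter looks at the pair (state, action), never at the action alone: tidying the room is rewarded in one state and punished in the other.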

In the case of an algorithm, the agent must follow internal rules of behavior dictated by the algorithm itself, what is called a “policy”: more precisely, a policy assigns, for each state of the environment and for each possible action, a probability of choosing that action in that state, based on estimates of how likely the action is to lead to a reward. Given these estimates it is possible to calculate what probabilists call the “expectation” (the expected value), and therefore to evaluate how much reward (or how much punishment) a chain of behaviors across the various states will bring.
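As a rough sketch of this calculation, the Python snippet below uses the common convention in which the policy stores, for each state, the probability of choosing each action. The probabilities, the rewards and the function names are invented for illustration, and the chain of states is simply given in advance, which is a simplification.

```python
# Hypothetical policy: for each state, the probability of choosing each action.
policy = {
    "after_playing": {"tidy_room": 0.7, "leave_room": 0.3},
    "time_for_school": {"tidy_room": 0.2, "leave_room": 0.8},
}

# Hypothetical rewards (+1) and punishments (-1) for each state/action pair.
reward = {
    ("after_playing", "tidy_room"): 1,
    ("after_playing", "leave_room"): -1,
    ("time_for_school", "tidy_room"): -1,
    ("time_for_school", "leave_room"): 1,
}


def expected_reward(state):
    """Average reward in `state`, weighting each action by its probability."""
    return sum(p * reward[(state, action)] for action, p in policy[state].items())


def expected_chain_reward(states):
    """Expected total reward over a given sequence of visited states."""
    return sum(expected_reward(state) for state in states)


print(expected_reward("after_playing"))
print(expected_chain_reward(["after_playing", "time_for_school"]))
```

With these made-up numbers the agent behaves well more often than not in both states, so the expected reward is positive in each of them, but it stays below the maximum of 1 per step because the policy still leaves room for mistakes.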

The goal of a reinforcement learning system is to adjust its policy so as to maximize the likelihood of behaving in the right way at the right time.
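One simple way to change a policy through rewards and punishments is sketched below in Python. It is only one of many possible schemes (a table of estimated rewards with occasional random exploration, usually called "epsilon-greedy"), not the method of any particular system, and the states, actions, reward values and parameter names are all assumptions for illustration.

```python
import random

STATES = ["after_playing", "time_for_school"]
ACTIONS = ["tidy_room", "leave_room"]


def interpreter(state, action):
    """Hypothetical judge: +1 for the right action in this state, else -1."""
    right = "tidy_room" if state == "after_playing" else "leave_room"
    return 1 if action == right else -1


def learn(steps=2000, epsilon=0.1, step_size=0.1, seed=0):
    """Learn by trial and error how much reward each action brings."""
    rng = random.Random(seed)
    # Estimated reward of each action in each state, initially unknown (0).
    value = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    for _ in range(steps):
        state = rng.choice(STATES)
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)  # explore: try something at random
        else:
            # Exploit: pick the action currently believed to be best.
            action = max(ACTIONS, key=lambda a: value[state][a])
        r = interpreter(state, action)
        # Move the estimate a little toward the reward just received.
        value[state][action] += step_size * (r - value[state][action])
    return value


def greedy_policy(value):
    """For each state, the action with the highest estimated reward."""
    return {s: max(ACTIONS, key=lambda a: value[s][a]) for s in value}


print(greedy_policy(learn()))
```

The agent starts with no idea of what is right; every punishment lowers the estimate of the action that caused it, every reward raises it, and the resulting policy ends up tidying the room after playing and leaving it alone when it is time for school.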

What are the areas of application?

These methods, described here in a purely qualitative way, have a solid quantitative theory behind them, which is essentially a branch of the mathematical theory of optimization, and have been applied with considerable success in many “game” situations. The very famous AlphaGo, for example, the algorithm that beat the Go champions and invented new strategies for this complicated pastime, relies on reinforcement learning in an essential way to achieve its superhuman performance.

AlphaGo was followed by a series of artificial “players” based on reinforcement learning techniques combined with deep learning. These obtained spectacular results, such as learning to play video games simply by watching the games of human players and then, having reached a certain level of game knowledge, playing against themselves, using reinforcement learning to improve their performance.

Of course, millions of dollars are not invested in developing algorithms that can only play video games: reinforcement learning is proving to be fundamental for guiding the behavior of robots used in factories, regulating complex infrastructures (such as electricity grids), optimizing performance, learning financial trading strategies and much more.

All this by taking advantage of mistakes and challenging themselves to do better, just as we do (or should do).

Paolo Caressa