TECH | Mar 18, 2020

The data epidemic

How Data Science, if properly understood, could help us interpret the Coronavirus phenomenon

In the last few weeks the attention of the entire nation has been focused on the emergency of the spread of the Coronavirus. It began as a “distant” affair, confined within the boundless territory of the People’s Republic of China, and we followed it absent-mindedly, episodically, standing by in the belief that, like SARS (another coronavirus, in fact), it would sooner or later be tamed by the greatest commercial power on the planet.

Only when the seriousness of the situation in China became clear, at the beginning of the year, and finally when “we found it on our doorstep”, did our interest become more acute and focused. Those who, like me and probably like those reading me at this moment, have an interest in digital technology and in “data science” will certainly have asked themselves how to apply their knowledge to the investigation of this phenomenon, which, in my memory (a pretty long one by now, unfortunately), has generated an emergency, and demanded a reaction, unprecedented in the history of the Republic.

Sure, there have been other epidemics in the past, but their possibilities of spreading were limited: there were fewer plane trips, most people did not travel at all, and so on. It’s true that less attention was once paid to hygiene and sanitary conditions, but until now I had only ever seen a “collective quarantine”, and a much sweeter one than the real curfew we are experiencing these days, in science fiction movies.

Data Scientist, you too

On the social networks frequented by professionals, such as LinkedIn, there was a blossoming of charts, tables, and yet more charts and tables, where the underlying data were essentially the same: those collected by the health and civil protection authorities of the countries concerned, as well as by the WHO and by some particularly prestigious institutions, which themselves produced interactive, navigable dashboards to exploit this wealth of data, a wealth which often amounts to a paucity of information (a ranking of these dashboards can be found, for example, in an article on technologyreview).

But over and above the institutional contributions, many, if not all, tried to take part using the tools at their disposal: today it is easy not only to fly around the world, but also to produce a chart from a data table, or a map, or an interactive website, or what have you. I’ve seen dozens of posts with charts of various kinds, all beautiful from an aesthetic point of view, all aiming at one of the two conclusions which seemed possible:

  1. It’s a common flu which was given media prominence.
  2. We’re all going to die.

The point is that we should obtain information from data: instead we are overwhelmed by a flood of data but remain thirsty for information. It’s a bit like being in a boat in the middle of the ocean: surrounded by a huge mass of water but at risk of dying of thirst.

The spread of “Data Science”, or rather its explosion, is a phenomenon certainly more viral than the Coronavirus, and its means of diffusion is the enormous potential of the technologies at our disposal: with a few lines of Python or R it is possible to read data from an Excel spreadsheet, have a linear regression calculated and produce a chart which illustrates the result. But let me say that this is not really Data Science.
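To give an idea of what “a few lines” means in practice, here is a minimal sketch of exactly that kind of quick analysis; the file name and column names are made up for illustration:

```python
# A minimal sketch of the "few lines of Python" mentioned above.
# The file name and column names are made up for illustration.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("cases.xlsx")  # read the Excel data table
# fit a straight line (degree-1 polynomial) to the daily counts
slope, intercept = np.polyfit(df["day"], df["infected"], 1)

plt.scatter(df["day"], df["infected"], label="data")
plt.plot(df["day"], slope * df["day"] + intercept, label="linear fit")
plt.xlabel("day")
plt.ylabel("infected")
plt.legend()
plt.show()
```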

Data Science implies knowing how to collect data from diverse sources and how to structure them in order to feed them to a model whose purpose is to produce information from the data. It is not enough to divide the dead by the infected to calculate the lethality of a disease, without first wondering, for example, how the infected are counted, and without relating this ratio to time, which turns it into a variable rather than a single number, and so on.
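To make this concrete, here is a toy illustration, with entirely made-up numbers, of why the naive dead/infected ratio is a moving target rather than a single number:

```python
# Toy illustration (made-up numbers) of why dead/infected depends on time:
# the same outbreak yields a different "lethality" on each day it is computed.
confirmed = [100, 400, 1600, 4000, 7000]  # cumulative confirmed cases per day
deaths    = [  1,   8,   50,  180,  400]  # cumulative deaths per day

for day, (c, d) in enumerate(zip(confirmed, deaths), start=1):
    print(f"day {day}: naive lethality = {d / c:.1%}")

# The ratio climbs from 1.0% to 5.7% without the disease itself changing:
# deaths lag infections, and "confirmed" depends on how much testing is done.
```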

In short, it takes a model, and not necessarily always of the same type: it could be a statistical model, an optimization model (of which Machine Learning, and Artificial Intelligence in general, are a special case), a deterministic model, etc. Today we are accustomed to programming tools, such as those mentioned above, which spoil us for choice as regards models and data representations (let those of us who know all the functions and classes of SciPy cast the first stone), which means that we really need to know both the possible data structures and the possible models to apply in order to draw reasonably sensible conclusions.

And usually the models for a specific problem are already there! In the case of epidemic spread, of which the Covid-19 emergency (let’s call it by its name!) is an example, there have been established models for almost a century. Let’s see how they work.

No airtight compartments…

An epidemiological model usually aims to describe the curve of the infected, i.e. the number of people in a population who are infected over time. At the moment of writing, the number of infected people in Italy is growing every day, therefore the underlying curve has an increasing trend: if this trend were never to change, sooner or later we would all become infected. But what we usually observe is that these epidemics have a “peak” of infected people, after which the number tends to decrease until it stabilizes at some value, hopefully zero (which would mean that the epidemic has been completely eradicated). The curve therefore resembles the profile of a mountain, with a ridge rising to the peak and the ridge on the opposite side descending towards the valley, that is, in our metaphor, towards normality.

But the people who are infected were healthy beforehand, so this curve of the infected is linked to a curve of the healthy, which represents the people susceptible to becoming ill: epidemiological models therefore partition the population under observation into different compartments, such that a single individual, at any given moment, belongs to exactly one compartment, and such that the compartments together cover the entire population.

The compartments that are typically considered are:

  1. compartment S which stands for “susceptible”, i.e. the people who are not infected but who are susceptible to contracting the infection
  2. compartment I which stands for “infected” i.e. those who have contracted the disease
  3. compartment R for the “removed”, i.e. the people who have contracted the disease but are no longer in either compartment S or compartment I: for example, those who, unfortunately, have died, or those who have recovered, developing immunity, etc.

There are also other compartments which can be considered, such as compartment “E” of the “exposed”, i.e. susceptible people who, for example, through contact with infected people, may have been infected without yet showing symptoms. But here let’s concentrate on the simplest model, whose dynamics can be schematized as follows:

S → I → R

So, in this model, the population at any given time is divided into susceptible S, infected I and removed R, and the dynamics over time are based on two assumptions:

  1. one or more individuals in compartment S can pass into compartment I
  2. one or more individuals from compartment I can pass into compartment R.

These two qualitative rules become quantitative by specifying, at each instant, how many members of compartment S pass into I and how many members of compartment I pass into R: these two flows are determined by the current values of S, I and R and by two “internal” parameters of the model, whose values determine the qualitative trend of the susceptible, infected and removed curves.
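In the classical formulation (the 1927 model, due to Kermack and McKendrick, recalled at the end of this article), these two internal parameters are conventionally called β, the transmission rate, and γ, the removal rate, and the dynamics sketched above take the form of a small system of differential equations, where N = S + I + R is the (constant) total population:

```latex
\frac{dS}{dt} = -\beta \frac{SI}{N}, \qquad
\frac{dI}{dt} = \beta \frac{SI}{N} - \gamma I, \qquad
\frac{dR}{dt} = \gamma I
```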

Interpreting the model

An epidemiological model of this type always has a “critical threshold”: a positive number such that, if it is greater than 1, the infected first increase up to a peak and then decrease, while if it is smaller than 1 they decrease from the outset.
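For the SIR system written above, this threshold can be made explicit: the infected initially grow exactly when the famous basic reproduction number R₀ = β/γ, weighted by the initially susceptible fraction of the population, exceeds 1:

```latex
R_0 = \frac{\beta}{\gamma}, \qquad
\left.\frac{dI}{dt}\right|_{t=0} > 0 \iff R_0 \cdot \frac{S(0)}{N} > 1
```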

The trend of the curves of the susceptible, infected and removed is therefore determined by this threshold (which depends on the internal parameters of the model mentioned above), and its possible cases are illustrated in the figures below (in these simulations we assume that initially S is 80% of the population and I is 20%). For example, here is the case of a threshold value greater than 1 (time, for example in days, on the horizontal axis; the number of people involved on the vertical axis):

As you can see, in this case the curve of the infected increases exponentially, reaches its maximum (the peak of the infection) and then decreases to zero after 60 days. The curve of the susceptible decreases exponentially (because the curve of the infected increases) up to the peak of the infection, then essentially stabilizes. The curve of the removed, on the other hand, grows ever more slowly until it has absorbed almost the entire population.

If, on the other hand, the critical threshold of the model is less than 1, we have a continuous decrease of the infected until they die out:

Notice that in this second chart the infected all end up in compartment R, which, however, does not cover the entire population: more than half of it remains susceptible, i.e. is never touched by the virus.

These two scenarios basically exhaust the possibilities of this model: clearly the specific numbers can change and greatly so, but the possible trends are the two we have illustrated.
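Here is a minimal sketch, in the spirit of the Python simulations referenced at the end of this article, of how curves of this kind can be produced from the SIR equations above; the values of β and γ are made up, chosen only so that the two runs land on opposite sides of the threshold:

```python
# Minimal SIR simulation sketch; beta and gamma are made-up values,
# chosen only to land on opposite sides of the critical threshold.
import numpy as np
from scipy.integrate import odeint
import matplotlib.pyplot as plt

def sir(y, t, beta, gamma, n):
    """Right-hand side of the SIR differential equations."""
    s, i, r = y
    return [-beta * s * i / n, beta * s * i / n - gamma * i, gamma * i]

n = 1000.0                     # total population
y0 = [0.8 * n, 0.2 * n, 0.0]   # initially S = 80%, I = 20%, R = 0
t = np.linspace(0, 60, 300)    # 60 days

for beta, gamma, label in [(0.5, 0.1, "above threshold"),
                           (0.1, 0.5, "below threshold")]:
    s, i, r = odeint(sir, y0, t, args=(beta, gamma, n)).T
    plt.plot(t, i, label=f"infected ({label})")

plt.xlabel("days")
plt.ylabel("people")
plt.legend()
plt.show()
```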

It still remains to be understood, in order to be able to apply the model:

  1. how to determine the parameters on which it depends (and therefore the critical threshold that determines the qualitative development of the model)?
  2. how reliable is the result?

These questions arise in any modelling of a phenomenon, and answering them requires both theory and practice. In general, the answer to the first is to use “parameter estimation” techniques: given a historical series of real data, even a partial one, for the values of S, I and R, optimization techniques (for example, Machine Learning ones) are used to determine the parameter values whose model curves are as close as possible to those of the real historical series.
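As a sketch of what such an estimation might look like, assuming a short, entirely invented series of daily infected counts, one can search for the β and γ whose model curve best matches the observations:

```python
# Sketch of parameter estimation for the SIR model on hypothetical data:
# find the (beta, gamma) minimizing the squared distance between the
# model's infected curve and an invented "historical series".
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

def sir(y, t, beta, gamma, n):
    s, i, r = y
    return [-beta * s * i / n, beta * s * i / n - gamma * i, gamma * i]

n, y0 = 1000.0, [990.0, 10.0, 0.0]
days = np.arange(20)
observed_infected = np.array([10, 14, 19, 27, 37, 51, 70, 95, 128, 170,
                              222, 284, 354, 428, 500, 563, 611, 640, 649, 638],
                             dtype=float)  # made-up data, for illustration only

def loss(params):
    beta, gamma = params
    i_model = odeint(sir, y0, days, args=(beta, gamma, n))[:, 1]
    return np.sum((i_model - observed_infected) ** 2)

best = minimize(loss, x0=[0.3, 0.1], bounds=[(1e-6, 2.0), (1e-6, 2.0)])
print("estimated beta, gamma:", best.x)
```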

The second question is usually answered with statistical methods which relate the model’s sensitivity to errors in the estimated parameters to the accuracy of its conclusions.

It should be noted that, while the model treats these parameters as constants, in reality they vary: indeed, the battle we are all fighting at the moment consists precisely in trying to push the critical threshold below 1, in which case the number of infected people would immediately start to decrease, and it would mean that we have left the peak behind us.

And then what?

To conclude this discussion of epidemic models, I will leave you with a few points to reflect on, which should always be borne in mind before embarking on hasty deductions:

  • Although the SIR model presented here is not among the most sophisticated, it gives an idea of what the models actually in use look like: the surprising thing is that it dates back to 1927! But this is a familiar story: most of the concepts which today pass for “new”, and on which innovation is based, actually have distant roots; just think of optimization methods, which date back to the 1700s!
  • A model is never as complex as the reality it aims to describe: this obvious remark can be put even more sharply by saying that in a model we describe only a very few features of reality, through which we somehow “aspire” to tackle a problem effectively: but we may have chosen the wrong features (and therefore the wrong model).
  • We must always, with humility, turn first of all to those who study the domain in which we want to run our model: however much of a wizard we may be with numbers and algorithms, virologists and epidemiologists are the experts on epidemics, and some basic information, even if it apparently lies outside the quantitative data, is always useful; in this case, for example, one can learn more about the various Coronaviruses and their characteristics by reading this short page.
  • The data must be transformed into information before being processed: for example, it is easy to talk about mortality, lethality and similar notions, when in fact the epidemiological issues behind these terms are not trivial at all (see for example this entry in the Treccani online dictionary).
  • Before producing a chart we should always have a model, an algorithm, or in any case a method which is repeatable and reproducible by anyone, designed to answer clear questions: we do not pass directly from the data to the answer; in between there is the model, which is, after all, what the “Science” in “Data Science” stands for.

For a more detailed technical analysis, with Python simulations, let me refer you to this GitHub repository.

Paolo Caressa