Lately there has been much talk of fake news, but less of fake data and fake charts, which are subtler and more dangerous because quantitative measures are granted credibility more readily than simple qualitative statements.
The exponential increase in the production of data and charts is leading to an overload of numbers, percentages and trends on virtually anything, with a background noise that risks distorting ideas and opinions.
It’s not a new problem. It was voiced over 60 years ago by Darrell Huff in his now classic booklet “How to Lie with Statistics”, which warned of the risks and negative consequences of the improper (or, worse, stupid and ignorant) use of data and statistics. Today those risks and consequences are amplified by orders of magnitude by the viral spread that the web and social media guarantee.
Because if “words are important,” as someone said, data and numbers are something more. They are the politically correct language of the fact & data mindset, today proclaimed by many but actually practised by few. The methods, tools and terms of statistics are fundamental for deriving economic, social and business trends from data. However, without intelligent, honest and informed use by the writer, and a minimal level of functional literacy, the results may be semantic nonsense, when they are not outright malicious manipulation meant to steer opinions and shift consensus.
Absolute and relative values
Aware that I am venturing into an emotionally charged minefield, I start from a premise that is non-negotiable for me; namely that, besides having no scientific basis, any form of discrimination against people on an ethnic or racial basis is incompatible with every rule of civil coexistence.
That said, since the road to hell is always paved with good intentions, if I want to uphold a sacrosanct principle but do so with partial or wrongly processed data (no matter whether through ignorance or for the sake of it), I risk causing the opposite effect, providing a convenient pretext for the same alarmism and prejudices that I intended to counter.
A report from ADN Kronos on August 29 was titled ‘Rape, fewer cases by foreigners and more by Italians’. A September 14 article on Wired.it was titled ‘Rape, 6 out of 10 perpetrators are Italian’. Both articles draw conclusions from absolute values and percentages aggregated on the total (e.g. 6 out of 10 belong to group X), but these have little significance unless the size of each group is taken into account.
As is clear from the table below – which also contains ISTAT data at January 1, 2017, in addition to those from the two articles – the social group of male Italian residents is about 12 times greater than the social group of male foreign residents. It is therefore normal to expect that Italians reported or arrested for gender-based violence are the majority – the 6 out of 10 in the Wired title – but for foreigners there is an incidence (number of reports/arrests per 100,000 individuals) about 7 times higher, with little difference between 2016 and 2017.
* On the basis (ISTAT) of around 29.4 million Italian male residents and around 2.4 million foreign male residents at 1 January 2017
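As a rough check of the reasoning above, the following sketch uses only figures quoted in the text: the two ISTAT population totals and the "6 out of 10" share from the Wired headline. The share is a rounded headline number, not the actual count of reports, and the function name is just an illustrative choice, so the result is an approximation of the ratio derived from the full data.

```python
# Male residents at 1 January 2017 (ISTAT figures quoted in the text)
italian_males = 29_400_000
foreign_males = 2_400_000

# "6 out of 10" from the Wired headline, taken at face value.
# Assumption: this rounded share stands in for the real counts.
italian_share = 0.6
foreign_share = 1 - italian_share


def relative_incidence(share_a, pop_a, share_b, pop_b):
    """How many times higher group A's per-capita rate is than group B's."""
    return (share_a / pop_a) / (share_b / pop_b)


population_ratio = italian_males / foreign_males
incidence_ratio = relative_incidence(
    foreign_share, foreign_males, italian_share, italian_males
)

print(f"Population ratio (Italians / foreigners): {population_ratio:.1f}")
print(f"Incidence ratio (foreigners / Italians): {incidence_ratio:.1f}")
```

With the rounded headline share the incidence ratio comes out at roughly 8, the same order of magnitude as the factor of about 7 obtained from the actual numbers of reports and arrests.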
Is this conscious misinformation, or simple ignorance of the basics of arithmetic and statistics on the part of important magazines and press agencies?
We should start from a principle: detecting a greater incidence of specific phenomena or types of offence in some social groups, perhaps extending the analysis to include data on the prison population, is a neutral measurement of a numeric indicator. It is then up to sociologists, demographers and anthropologists to make a serious and non-partisan analysis of the possible causes of the phenomenon. Obviously not on ethnic, racial or Lombrosian (that is, non-scientific) grounds, but rather by focusing on the cultural or economic insecurity of this or that social group. This also completely changes the actions we could undertake to counter and mitigate the phenomenon.
Averages and medians
The difference between average and median becomes clear with a simple example. Let’s imagine a dataset with 11 numbers.
2 3 3 4 7 9 11 12 14 17 25
They can represent whatever you want: temperature values recorded at the same time on 11 different days, or the number of aces served by Roger Federer in the last 11 matches he played in Grand Slam tournaments. The average is given by the sum of all N values divided by N, and the result in this case is 9.73. The median, on the other hand, is the value of the central element of the ordered dataset; in this case it is 9. If the number of elements in the dataset were even, the median would be calculated as the average of the two central elements.
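The two definitions can be checked on the same 11 numbers with Python's standard statistics module. This is only a minimal sketch; the ten-element list is an illustrative variation (the original dataset without its last value) used to show the even case.

```python
from statistics import mean, median

# The 11-number dataset from the example above, already sorted
values = [2, 3, 3, 4, 7, 9, 11, 12, 14, 17, 25]

avg = mean(values)    # sum of all values divided by their count
med = median(values)  # central element of the ordered dataset

# Illustrative variation: drop the last value to get an even count,
# so the median becomes the average of the two central elements (7 and 9)
even_values = values[:-1]
even_med = median(even_values)

print(f"average = {avg:.2f}, median = {med}")
print(f"median of the 10-element variant = {even_med}")
```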
In this example, average and median values are close by virtue of the distribution of values, but this is not always the case. If you live in a small town of 1,000 families where the income of each family is about 50,000 euros a year, the average income and the median income in the relative dataset virtually coincide.
Imagine now that Warren Buffett (who, according to Forbes, earned 12 billion dollars in 2016) were to fall in love with your town during a holiday in Italy and came to live there. The most representative income figure in social and economic terms at this point would be the median income which, even with the arrival of Buffett in the area, would remain at 50,000 euros (given that the central element of the dataset remains unchanged). However, the average income per family would shoot up to about 11 million euros. An incautious (or not very serious) storyteller could exploit the widespread ignorance of the difference between average and median to embark on a fake sociological analysis about the “town of the virtuous well-off, with an average family income of 11 million euros per year, who nevertheless live in very normal houses and drive a runabout”.
If the use of numbers, tables, averages and medians can confuse ideas, with charts the mistreatment of data becomes more sophisticated. Let’s start with the old and well-known tricks to “emphasize” or “minimize” certain trends by appropriately choosing the unit of measure and the range of values.
The following three figures show an example, using the data from the previous table on gender violence. They are very simple line charts showing the trend of the number of acts of violence (vertical axis) from 2016 to 2017 (horizontal axis). The blue lines represent Italians, the red ones foreigners.
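The effect of a single extreme value can be reproduced in a few lines. The sketch below assumes that every family earns exactly 50,000 euros and converts the 12 billion dollars at a hypothetical rate of 0.9 euros per dollar; the rate is an assumption made only to land near the "about 11 million" figure in the text.

```python
from statistics import mean, median

FAMILY_INCOME_EUR = 50_000           # from the example in the text
N_FAMILIES = 1_000                   # from the example in the text
BUFFETT_INCOME_USD = 12_000_000_000  # Forbes figure quoted in the text
USD_TO_EUR = 0.9                     # assumption: illustrative exchange rate

# The town before the new arrival: everyone earns the same
town = [FAMILY_INCOME_EUR] * N_FAMILIES
before_mean, before_median = mean(town), median(town)

# The same town with one extreme income added
town_with_buffett = town + [BUFFETT_INCOME_USD * USD_TO_EUR]
after_mean = mean(town_with_buffett)
after_median = median(town_with_buffett)

print(f"before: mean = {before_mean:,.0f}, median = {before_median:,.0f}")
print(f"after:  mean = {after_mean:,.0f}, median = {after_median:,.0f}")
```

The median does not move, because the central element of the sorted list is still an ordinary family, while the average is dragged up by one value.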
This first chart offers a relatively “neutral” representation, limiting the range of values on the vertical axis between 800 and 1600, which is the range in which the set of considered values falls. From 2016 to 2017, we notice a slight increase in the crimes committed by Italians and a very slight decrease in those committed by foreigners. The total number of crimes (Italians + foreigners) is, however, slightly on the rise.
Imagine now that you want to make public opinion believe that acts of violence are actually not increasing (which is false: overall they are increasing, even if only slightly). As the second chart below shows, it is sufficient to extend the range of values on the vertical axis to 0 – 3200. In this way, the slopes of the two straight lines are “crushed”. The red line becomes virtually horizontal, giving the impression that the crimes committed by foreigners have remained unchanged (thus hiding the slight but real decrease), while the increase in crimes committed by Italians is virtually imperceptible.
If, however, in order to counter populist and xenophobic tendencies, you want to shift attention to the increase in acts of violence committed by Italians, it would be sufficient to narrow the range of values represented on the vertical axis to between 1450 and 1550, as in the third chart below. There would remain only the series of crimes committed by Italians, with a slope suggesting a very rapidly deteriorating situation.
Three completely different messages, only by altering the ranges of values represented, given the same starting raw data.
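The mechanism behind the three charts can be quantified without drawing anything: what the eye perceives is the share of the chart height that a line climbs, and that share depends on the axis range. In the sketch below the three ranges are those described above, while the start and end counts are hypothetical placeholders, since the actual table values are not reproduced here.

```python
# Hypothetical counts for one series in 2016 and 2017 (placeholders,
# not the values from the table)
start, end = 1_470, 1_530

# Vertical-axis ranges described for the three charts
axis_ranges = {
    "neutral (800-1600)": (800, 1600),
    "flattened (0-3200)": (0, 3200),
    "exaggerated (1450-1550)": (1450, 1550),
}


def visual_rise(y_start, y_end, y_min, y_max):
    """Fraction of the chart height covered by the change in value."""
    return (y_end - y_start) / (y_max - y_min)


rises = {
    label: visual_rise(start, end, y_min, y_max)
    for label, (y_min, y_max) in axis_ranges.items()
}

for label, rise in rises.items():
    print(f"{label}: the line climbs {rise:.1%} of the chart height")
```

The same change of 60 units fills a small fraction of the chart in the first two cases and more than half of it in the third.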
There are then more sophisticated, but equally inconsistent, graphic elaborations, like this one, again on the same topic, from the Info Data channel of Sole 24 Ore.
The article’s argument is that the incidence of sexual violence does not depend on specific nationality, but only – and in a linear way – on the number of males present in a social group. If this doubles, acts of violence double. If it is halved, the number of acts of violence decreases by half.
The proposed model of analysis calls for creating a chart where the horizontal axis refers to the number of prisoners for sexual offences and the vertical axis to the number of male individuals of a generic nationality. Each nationality therefore corresponds to a point on the chart (a small rhombus is used in the representation to better illustrate this). If the Sole 24 Ore analyst’s argument were correct, all the points for the nationalities considered would ideally lie on the same line, or deviate very little from it. This is called linear regression analysis, and its goodness of fit is measured by an indicator (R-squared, computed from the squared distances of the data from the line) whose value equals 1 in the case of a perfect linear relationship between the data.
The following figure shows the result.
Source: Sole 24 Ore
The small rhombus at the upper right represents the Italians, and the model appears to work very well, with only slight deviations from the regression line. The gradient of that line (its angular coefficient) represents the incidence of the phenomenon, that is, the extent to which the number of detainees for that kind of crime increases as the population increases. In other words, a steep line indicates a modest increase in cases of violence as the male population increases (a positive sign), while a “crushed” line indicates a more significant increase (a negative sign); in either case the increase or decrease would still be proportional for all groups, regardless of nationality.
The problem is that 62 different nationalities are represented in the chart, mixing populations resident in Italy in statistically irrelevant numbers (e.g. Mongolians, 47 individuals, with 1 detainee for rape) with groups of hundreds of thousands or, in the case of Italians, tens of millions of individuals. The result is 61 rhombuses (the rest of the world) condensed in a very limited area, with the 62nd rhombus (Italians) very far away. Under these conditions any correlation would give an R-squared close to 1 (0.981 in this case).
In more intuitive terms, if you have a very circumscribed cloud of points and a single point far away, and if you draw a straight line that goes through the distant point and the center of the cloud, this will in any case appear to approximate the values well, even if the cloud data were random and unrelated.
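This intuition can be tested on purely synthetic data. The sketch below is not the Sole 24 Ore dataset: it generates 61 random, unrelated points in a small area, adds one arbitrary distant point, and compares R-squared with and without it. For a straight-line fit, R-squared equals the squared correlation coefficient, which is what the hypothetical helper r_squared computes.

```python
import random


def r_squared(xs, ys):
    """R-squared of a simple linear fit (squared Pearson correlation)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)


rng = random.Random(0)  # fixed seed so the run is repeatable

# 61 unrelated points confined to a small area (synthetic data)
cloud_x = [rng.uniform(0, 100) for _ in range(61)]
cloud_y = [rng.uniform(0, 100) for _ in range(61)]

# One distant point, playing the role of the 62nd rhombus
# (assumption: the coordinates are arbitrary, chosen only to be far away)
far_x, far_y = 10_000, 10_000

r2_cloud = r_squared(cloud_x, cloud_y)
r2_with_outlier = r_squared(cloud_x + [far_x], cloud_y + [far_y])

print(f"R-squared, cloud only:         {r2_cloud:.3f}")
print(f"R-squared, cloud plus outlier: {r2_with_outlier:.3f}")
```

The cloud alone shows essentially no linear relationship, yet adding the single distant point pushes R-squared close to 1.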
However, if you exclude the data of Italians from the analysis, you get the following chart, where the slope of the line is significantly different and the deviations are wider. Basically, the linear correlation is no longer evident and therefore the incidence of the phenomenon would seem in fact to depend not only on the number of males in each social group (starting argument), but in effect also on its nationality.
Source: Sole 24 Ore
Finally, the angular coefficient for Italy that can be obtained from the first chart is one detainee every 17,000 individuals. Excluding Italy from the analysis, the coefficient drops to one detainee every 2,200 inhabitants; that is, the incidence of this type of offence among foreigners is about 7.5 times higher.
The “lax” regression analysis of Sole 24 Ore was also taken up by other media, such as Huffington Post, which in this article copies and pastes a summary, titling it “The data show that the idea that foreigners commit more rapes than Italians is nothing more than a commonplace”. In “absolute” terms this is clearly true, but by reasoning “in proportion” to the number of individuals in the two populations we have seen that this is not the case. The difference in incidence between the two social groups for this particular offence is confirmed as a factor of between 7 and 7.5, whether using the ADN Kronos and Wired data (number of reports/arrests) or the Sole 24 Ore datasets (prison population).
The opinions we elaborate and the decisions we make should be borne out by information obtained from correct data processing. Too often, however, the use of statistical instruments and techniques – in themselves neutral – is distorted, exaggerated and over-simplified, with wrong choices in terms of merit and method.
The media, from the ultra-populist and pseudo-scientific to those that are in theory more moderate, all bear great responsibility for how they choose data, how they handle them, and above all how they present them as news, without giving in to confirmation bias by bending the numbers to fit their pre-established views. At the end of the day, it is a matter of truly exercising that freedom of the press with respect to which Italy ranks 77th in the world in 2017.
For our part, as users of information, we must refrain from easy conclusions and heuristic shortcuts. Let’s try to verify sources and methods as far as possible; let’s read carefully and with a critical spirit. In a nutshell, even if we are not data scientists by profession, let’s be the first to cultivate a culture of data. Or at least let’s try, hoping that sooner or later the media will follow us.