“Data has reasons which reason does not understand”
After my previous article on Ingenium about fake data and fake charts, a specific subset of the better-known concept of fake news, I must confess that I’ve intensified my searches around the internet, and now I see false or incorrectly interpreted numbers almost everywhere.
In fact, I’m a “knowing victim” of that particular cognitive bias known as the law of the instrument, which leads one to have excessive confidence in a tool or method just because it’s familiar; this is summed up well by the saying “if all you have is a hammer, everything looks like a nail”.
Nevertheless, it’s undeniable that many of the decisions we are called upon to make, from selecting which technology to use in an IT project to choosing what food to consume, are – unfortunately – ever more commonly driven by fake data.
On that note, we’ve already looked at information on gender-related violence, which often begins with correct and reliable data that is then presented only in part, calculated improperly, or applied to a context where it does not fit.
Fake-data-driven Big Data Projects
Problem: you have to choose which database technology to use to create an information system which will need to manage enormous quantities of data. So you opt for a NoSQL solution, based on a benchmark which shows that this type of solution, thanks to its distributed architecture, is in general more scalable – i.e. it maintains a steady performance as the volume of data managed increases – compared to a classic relational database.
A benchmark is nothing more than a table of numbers which shows measured values as a function of the reference parameters, and which can be put to a proper use or a fake use.
If what you need to create is a precision-agriculture IoT system, which captures data around the clock (precipitation in millimeters, wind speed, humidity at ground level and overhead, atmospheric pressure) from sensors and drones deployed across several dozen hectares of cultivated land, the scalability benchmark would correctly guide your choice. The data collected is not specially structured and, regardless of how it may later be processed, you only have to guarantee the data’s efficient acquisition, without any need for targeted modifications to individual records, i.e. without transactional support.
The same benchmark becomes fake, in the sense of incomplete or partial, if on the other hand what you need to create is a billing system for a global services supplier with tens of millions of clients, such as a large telecommunications or utilities provider. Here the issue is not only to guarantee performance as the quantity of data acquired increases, but also to guarantee the consistency of transactions on structured data (we’re talking about bills and payments), both when the data is acquired and when it is subsequently modified. In this case, the choice cannot be guided solely by scalability benchmarks, and a good old relational database with native SQL and ACID transactions is still the best solution.
In reality, all the independent studies – i.e. studies not sponsored by this or that vendor – that I was able to consult concluded that there is no single class of databases whose performance is absolutely better than the others, even when the volume of data being managed grows. The best option always depends on the characteristics of the data (structured, linked) and on the operations to be performed on it (massive uploads, precise queries or clustering, modifications with transactional support).
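The two scenarios above can be condensed into a toy rule of thumb. The sketch below is only an illustration: the Workload fields and the suggest_database_class function are hypothetical names, and the rule encodes nothing more than the two examples discussed in this article, not the outcome of any benchmark.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a project's data and operations."""
    structured: bool          # is the data strongly structured (e.g. bills)?
    needs_transactions: bool  # must modifications be transactionally consistent?
    massive_ingestion: bool   # is high-volume acquisition the main requirement?


def suggest_database_class(w: Workload) -> str:
    """Toy rule of thumb derived from the article's two examples only."""
    if w.needs_transactions:
        # Billing-style systems: consistency comes before raw scalability.
        return "relational (SQL, ACID)"
    if w.massive_ingestion and not w.structured:
        # Sensor-style systems: efficient acquisition is what matters.
        return "NoSQL (distributed)"
    # Anything else: no class wins in general, so measure on your own data.
    return "no clear winner: benchmark on your own data"


iot = Workload(structured=False, needs_transactions=False, massive_ingestion=True)
billing = Workload(structured=True, needs_transactions=True, massive_ingestion=True)

print(suggest_database_class(iot))
print(suggest_database_class(billing))
```

The point of the sketch is the order of the questions: the scalability benchmark only becomes relevant once the transactional requirement has been ruled out.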
If you’re interested in the subject you can go more in depth here with a full set of good comparison data between SQL and NoSQL databases, both commercial and open source, and the instructions for reproducing the benchmarking tests on a Linux Ubuntu system.
Fake-data-driven Food Choices
Reasoning with the aid of complete data rather than fake or partial data, we can demystify various “post-truths” that are more or less in vogue, such as the claim that palm oil is dangerous for our health or for that of the planet, due to the deforestation caused by its cultivation. This was recently reiterated to me by the mother of one of my daughter’s kindergarten classmates, who is proudly against the vaccinations recently made mandatory for school admission but very careful in selecting snacks for her girl that don’t contain palm oil.
In May 2016, the EFSA (European Food Safety Authority) published a report on the presence of some glycerol derivatives in vegetable oils. The document offers information that is very technical, but rich in good data on the fact that the refining process for most vegetable oils (not just palm oil, but also canola, coconut, corn, sunflower, etc.) at temperatures over 200°C causes the formation of substances which, if consumed in large quantities, can be potentially detrimental.
The Italian Minister of Health later asked the EFSA for clarifications, and the Istituto Superiore di Sanità then declared that the presence of these substances in refined vegetable oils had been known for years, clarifying that there was no data available that showed a correlation between the use of palm oil and the development of tumors in humans.
Nevertheless, a fake use of the report’s data, together with the hypersensitivity of public opinion on the topic, led to a surge of unjustified alarmism which drove the supermarket chain Coop to announce it was suspending the production and sale of its products containing palm oil, ignoring all the other vegetable oils presented in the report. Some multinational food & beverage companies had already done the same, publicly announcing the absence of palm oil in their products solely for marketing purposes, to take advantage of an important cross-target of health fanatics and environmentalists.
In terms of deforestation, palm oil has an average yield of 3.47 tonnes per hectare: 5 times more than canola oil (0.65 t/hectare), 6 times more than sunflower (0.58 t/hectare), 9 times more than soy (0.37 t/hectare) and 11 times that of olive oil (0.32 t/hectare). In 2013-2014, palm oil represented more than a third of global production of vegetable oils while occupying only 6% of the land cultivated for this purpose.
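The multiples quoted above are easy to check for yourself. The short sketch below recomputes them from the yields given in the text; the variable names are my own, and the yields are simply the figures cited here, not independently verified values.

```python
# Average yields in tonnes of oil per hectare, as quoted in the text.
yields_t_per_ha = {
    "palm": 3.47,
    "canola": 0.65,
    "sunflower": 0.58,
    "soy": 0.37,
    "olive": 0.32,
}

palm = yields_t_per_ha["palm"]

# How many times more oil one hectare of palm yields than each other crop.
ratios = {
    crop: palm / y
    for crop, y in yields_t_per_ha.items()
    if crop != "palm"
}

for crop, ratio in ratios.items():
    # Hectares needed to produce one tonne of oil from this crop.
    hectares_per_tonne = 1 / yields_t_per_ha[crop]
    print(f"{crop}: palm yields {ratio:.1f}x more; "
          f"{hectares_per_tonne:.2f} ha per tonne of oil")
```

Rounded to the nearest whole number, the ratios come out at 5, 6, 9 and 11, which matches the multiples stated above.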
If we combine this data with predictions about growing world need for vegetable fats due to population increase, we might conclude that palm oil production seems to be more earth-friendly compared with that of other vegetable oils. I’ll try to discuss this with that mother who is rightly worried about her daughter’s health.
Now that you have some more information to help you more knowledgeably select the database best suited to your Big Data project and the best snack, you also shouldn’t be afraid to put a locally-sourced sirloin in your cart, rather than apparently more sustainable basmati rice, which arrived on your supermarket’s shelves after a long trip from India.
In 2014, you may have read this news article on the ANSA website, whose title declared “74% of greenhouse gas emissions caused by bovines”. The text of this article clarified right away that bovines were responsible not for 74% of all greenhouse gas emissions, but 74% of those from livestock, which is in turn equal to 10% of total emissions. But the lazy reader is struck by the headline alone, a textbook example of fake data.
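The gap between the headline and the text is a matter of one multiplication. The sketch below assumes the reading that livestock as a whole accounts for 10% of total emissions and that bovines account for 74% of that livestock share; the variable names are mine, and the percentages are those reported in the article, not checked against the original source.

```python
# Shares as reported in the text of the news article (assumed reading).
bovine_share_of_livestock = 0.74   # bovines' share of livestock emissions
livestock_share_of_total = 0.10    # livestock's share of total emissions

# Bovines' share of ALL greenhouse gas emissions.
bovine_share_of_total = bovine_share_of_livestock * livestock_share_of_total

print(f"Headline suggests: {bovine_share_of_livestock:.0%} of all emissions")
print(f"Text actually implies: {bovine_share_of_total:.1%} of all emissions")
```

Under that reading, the headline overstates the bovine share of total emissions by a factor of ten.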
The subject is complex and well-documented and if you wish to explore it further in depth, so as to decide what to put on the dinner table tonight, you can start here for example.
The mistreatment of data is a subject we should become more cognizant of. The manipulation and misuse of data ought to be at the top of our priorities. Avoiding fake-data-driven decisions requires dedication, procedures and resources. Where to begin? By encouraging a bit of scientific culture at all levels, something which is sadly somewhat lacking in our country. Let’s try to emulate scientists, who, when using data, subject all of it to a full, rigorous examination. This way, progress in knowledge is slower and more laborious, but much more reliable.
Translated into practice, when we read something with supporting data – at least where it isn’t the product of a double-blind peer review process – let’s not take the interpretation it gives of the data at face value. Let’s try to examine the data, checking its consistency with the headline or the comments that accompany it, and checking what sources it may have been taken from, giving a higher rating of authority to official organizations, best of all international ones, not directly connected to economically interested parties.
Starting right now, can you handle working a bit more and making a few fewer decisions, provided they are based on more reliable and complete information?