TECH | May 3, 2018

Can algorithms be racist?

The problem of data representativeness in a survey on DBpedia

“Is the iPhone racist?”

The question was asked in a Newsweek article on December 18, 2017, and echoed by British newspapers such as the Mirror and The Sun, which picked up on the discontent of some Chinese users a few weeks after the launch of the new smartphone model. These users had discovered that they could unlock other people’s phones with their own face, even when the two people were not related.

How is such an “oversight” possible? It’s hard to say, because Apple naturally does not disclose the details of its technology. The problem, for which there are of course known remedies, is not limited to Apple devices. It seems that various facial recognition models, even when trained on millions of images, do not perform acceptably on certain segments of the population or ethnic groups.

Does discrimination reside in data?

We hypothesized that the origin of such “discrimination” lies in the data. To test this, we decided to analyze one of the most representative databases in the world: Wikipedia, or rather DBpedia.

Why DBpedia? DBpedia is one of the central hubs of the Linked Open Data galaxy: an enormous and extremely valuable body of structured information, accompanied by an extraordinary variety of connections. The project is a huge multilingual Knowledge Graph that rests on a public infrastructure and stores knowledge in machine-readable form, which users can explore through simple SPARQL queries.

DBpedia is a spectacular artifact of mankind’s knowledge, derived from the “democratic” contributions of Wikipedia users. It is beyond the scope of this analysis to establish whether DBpedia has actually played a role in the training of any facial recognition algorithms. However, we know for sure that its older counterpart, Freebase, has had considerable importance in recent works.

So, while Wikipedia is still one of the favorite datasets for training algorithms, DBpedia (which stores the complex system of relationships that interconnects the encyclopedia’s pages) is an ideal starting point for many types of training.

Now, from a practical point of view, let’s assume that a Martian landing on our planet wants to get an idea of the composition of mankind by accessing one of the most extensive and “universal” resources ever conceived. We tried to work out, on the Martian’s behalf, how the birthplaces of the “celebrities” who find space in DBpedia are distributed.

The number of people present in the English version of DBpedia available online (by far the richest and most complete) is around 3 million, including inconsistent, duplicate and incomplete records.

After trying to minimize the impact of these inconsistencies, for the sake of simplicity we decided to restrict our analysis to people born after 1850 for whom a place of birth was indicated. In doing so, we came to just under one million people.
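The selection described above can be sketched as a SPARQL query assembled in Python. This is only an illustration, not the query used for the survey: the function name `build_birth_query` and its parameters are hypothetical, and the sketch assumes the DBpedia ontology terms `dbo:Person`, `dbo:birthDate` and `dbo:birthPlace`. Birth dates in DBpedia are not always typed consistently, so a filter like this one may well miss some records.

```python
def build_birth_query(min_year: int = 1850, limit: int = 100) -> str:
    """Build a SPARQL query for people born after `min_year` who have a birthplace.

    Hypothetical helper: it only builds the query text and does not contact
    any endpoint. The string could be submitted to the public DBpedia SPARQL
    endpoint (https://dbpedia.org/sparql).
    """
    # "Born after 1850" is read here as a birth date later than 1850-12-31.
    cutoff = f"{min_year:04d}-12-31"
    return f"""
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?person ?birthDate ?birthPlace WHERE {{
  ?person a dbo:Person ;
          dbo:birthDate ?birthDate ;
          dbo:birthPlace ?birthPlace .
  FILTER (?birthDate > "{cutoff}"^^xsd:date)
}}
LIMIT {limit}
""".strip()


if __name__ == "__main__":
    print(build_birth_query())
```

Requiring both `dbo:birthDate` and `dbo:birthPlace` in the same pattern drops people for whom either value is missing, which mirrors the restriction to people with an indicated place of birth.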

In many cases, to trace the country of origin, we had to cross-reference this data with another important source of Linked Open Data, Wikidata, and make difficult decisions, such as assigning Russia/USSR to the European continent (ed.: Wikipedia places Russia and Turkey in both the Asian and European continents) or mapping the no-longer-existing Austro-Hungarian Empire to present-day Austria.
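Decisions of this kind can be made explicit as small lookup tables, so that they remain visible and easy to revise. The sketch below is a hypothetical illustration: the table names, the function names and the spelling of the keys are assumptions, and only the two choices named in the text (Russia/USSR counted as Europe, Austro-Hungarian Empire mapped to Austria) are encoded.

```python
from typing import Dict, Optional

# Hypothetical tables: they encode only the two decisions mentioned in the text.
HISTORICAL_TO_MODERN: Dict[str, str] = {
    "Austria-Hungary": "Austria",
}
CONTINENT_OVERRIDES: Dict[str, str] = {
    "Russia": "Europe",
    "Soviet Union": "Europe",
}


def normalize_state(name: str) -> str:
    """Map a historical entity to its present-day state, if a rule exists."""
    return HISTORICAL_TO_MODERN.get(name, name)


def continent_of(state: str, default_continents: Dict[str, str]) -> Optional[str]:
    """Return the continent for a state, applying manual overrides first.

    `default_continents` stands in for whatever state-to-continent table
    is available (an assumption; the source does not describe it).
    """
    state = normalize_state(state)
    if state in CONTINENT_OVERRIDES:
        return CONTINENT_OVERRIDES[state]
    return default_continents.get(state)


if __name__ == "__main__":
    defaults = {"Austria": "Europe", "Japan": "Asia"}
    print(continent_of("Austria-Hungary", defaults))
    print(continent_of("Soviet Union", defaults))
```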

We then reproduced the results of the survey in a graph, restricting ourselves to celebrities born in states with more than 10 million inhabitants, assuming that less populous states would not have significantly altered our estimates. We thus analyzed the weight that each nation, and therefore each continent, has in DBpedia.

Comment on the graph: the left half shows how the world population is divided among the continents. The thickness of the lines is proportional to each continent’s population (Asia dominates with 4 billion inhabitants). The right half shows, for each continent, the fraction of celebrities registered on DBpedia from each of the selected states. The overall thickness of the beam in the right half is proportional to the number of celebrities belonging to each continent. If the number of people present on DBpedia for each continent were proportional to that continent’s population today, the right and left beams would have identical thickness.
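The comparison that the graph makes visually can also be written as a ratio of two shares. The following is a minimal sketch under stated assumptions: the function names are hypothetical and the numbers are toy values chosen for readability, not the survey data. A ratio of 1.0 means a continent appears on DBpedia in proportion to its population; values above 1.0 indicate over-representation, values below 1.0 under-representation.

```python
from typing import Dict


def shares(counts: Dict[str, float]) -> Dict[str, float]:
    """Turn absolute counts into fractions of the total."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {key: value / total for key, value in counts.items()}


def representation_ratio(population: Dict[str, float],
                         entries: Dict[str, float]) -> Dict[str, float]:
    """Share of entries divided by share of population, per continent."""
    pop_share = shares(population)
    entry_share = shares(entries)
    return {
        continent: entry_share.get(continent, 0.0) / share
        for continent, share in pop_share.items()
        if share > 0  # a continent with no population has no defined ratio
    }


if __name__ == "__main__":
    # Toy numbers only, NOT taken from the survey.
    toy_population = {"Continent X": 800, "Continent Y": 200}
    toy_entries = {"Continent X": 25, "Continent Y": 75}
    for name, ratio in representation_ratio(toy_population, toy_entries).items():
        print(f"{name}: {ratio:.2f}")
```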

What are the results of our survey?

More than 74% of the people registered on DBpedia and born after 1850 were born in North America or Europe, even though these two continents today account for less than 20% of the world’s population. For each state, we divided the number of celebrities by today’s actual population. The states most represented on DBpedia are the small islands of Oceania (Niue, Tuvalu and Nauru), which have populations of a few thousand inhabitants. San Marino and the Principality of Monaco follow. Among states with more than 5 million inhabitants, Norway is the most represented. The least represented states are Sudan, Ethiopia, Tanzania, Niger and Yemen.
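The per-state calculation described here, the number of celebrities divided by today’s population, followed by a ranking, can be sketched as follows. The function name, the `min_population` parameter and all the numbers are assumptions for illustration; the state names are placeholders and the values are not the survey results.

```python
from typing import Dict, List, Tuple


def rank_per_capita(entries: Dict[str, int],
                    population: Dict[str, int],
                    min_population: int = 0) -> List[Tuple[str, float]]:
    """Rank states by entries per inhabitant, most represented first.

    States below `min_population` are skipped, which allows a separate
    ranking among larger states only.
    """
    ratios = []
    for state, inhabitants in population.items():
        if inhabitants <= 0 or inhabitants < min_population:
            continue
        # States with no entries at all get a ratio of zero.
        ratios.append((state, entries.get(state, 0) / inhabitants))
    return sorted(ratios, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy numbers only, NOT taken from the survey.
    toy_population = {"Tiny": 2_000, "Mid": 6_000_000, "Big": 100_000_000}
    toy_entries = {"Tiny": 20, "Mid": 6_000, "Big": 10_000}

    print(rank_per_capita(toy_entries, toy_population))
    print(rank_per_capita(toy_entries, toy_population, min_population=5_000_000))
```

A per-capita ratio naturally favors very small states, since a handful of entries weighs heavily against a population of a few thousand; the threshold makes it possible to look at larger states separately.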

It is interesting to note that India and China, countries that together host more than a third of the world’s population today, are represented in our “Universal Encyclopedia” by only tiny fractions of a percent of their actual population (0.002% and 0.00014% respectively). And Africa, a continent with 1 billion people, is less represented than either Germany or France. Our Martian would probably conclude that the landmasses of our planet are predominantly called Europe and North America, and that mankind is dominated by far by the native population of these continents.

These findings can be explained in different ways: by the fact that the version of DBpedia used as a reference is the English one, by far the richest and most populated; by the fact that the history of the last century in Western countries is very well documented; or by the fact that many countries have only recently experienced strong demographic growth, which leaves them proportionately under-represented. Furthermore, this analysis necessarily rests on some simplifying assumptions (people move, and a person’s culture is not necessarily that of their country of origin). Nevertheless, the presence of images (and most probably of values, also in a figurative sense) of Americans and Europeans on Wikipedia appears to be at least an order of magnitude greater than that of any other continent or state in the world.

So is the iPhone racist?

Most likely not, but this question leaves room for some considerations.

The use of Artificial Intelligence necessarily brings with it a cultural framework, which, in Deep Learning more than anywhere, stems from the data used for training. Training a model, whether for facial recognition, translation or a chatbot, means conveying with it values that guide decisions and actions.

On the one hand, therefore, all of us, as users of “Artificial Intelligences”, must not forget that the result of any algorithm, however logically coherent, always depends on the way in which the algorithm was conceived and trained. In practice, every type of prediction has specific limits of validity, which are generally well defined when the algorithm is designed, but of which, unfortunately, the end user is not always fully aware.

On the other hand, Data Scientists interested in building a model that touches on delicate aspects of our everyday life (ethnic group, gender, religion, culture) are called on to be fully aware of the representativeness of their data and of the ways in which their algorithms are trained, insofar as their resources are considered “universal”. A self-referential “Intelligence” is neither useful nor constructive for the world in which we live.

Michele Gabusi