Annotated photograph from IBM's Diversity in Faces dataset
Recently, IBM was criticized for using publicly available photos from Flickr and other photo-sharing sites to train neural networks without the users' permission. Formally, everything is legal: all of the photos are published under Creative Commons licenses. Still, people feel uneasy knowing that an AI is being trained on their faces. Some did not even know they had been photographed; as is well known, you do not need a person's permission to photograph them in a public place.
As reported in the media, IBM used approximately 1 million user photos from Flickr to train its facial recognition system. It later turned out that IBM did not actually copy the photos from Flickr: the images are part of the YFCC100M dataset of 99.2 million photos available for training neural networks, which was compiled by Yahoo, Flickr's former owner.
It turns out that the IBM story is only the tip of the iceberg. The company simply happened to take the flak, while in fact user photographs have long been used to train all kinds of systems; it has become common practice: "Our study showed that the US government, researchers and corporations used images of immigrants, children who have been abused, and dead people to test their facial recognition systems," writes Slate. The publication emphasizes that such practices extend even to government agencies such as the National Institute of Standards and Technology (NIST).
In particular, NIST runs the Facial Recognition Verification Testing (FRVT) program for standardized testing of facial recognition systems developed by third-party companies. The program evaluates every system in the same way, so they can be objectively compared with one another. In some cases, cash prizes of up to $25,000 are awarded to the winners. But even without a cash reward, a high score on a NIST test is a powerful commercial incentive for a developer: potential customers immediately take notice of such a system, and an A+ rating can be mentioned in press releases and marketing materials.
NIST uses large datasets of facial photographs taken from different angles and under various lighting conditions.
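By way of illustration only (this is not NIST's code or the exact FRVT protocol, and the scores and figures below are invented), a standardized verification benchmark of this kind typically works as follows: every system scores the same genuine and impostor photo pairs, and the systems are then ranked by their false non-match rate (FNMR) at a fixed false match rate (FMR).

```python
import numpy as np

def fnmr_at_fmr(genuine_scores, impostor_scores, target_fmr=1e-3):
    """False non-match rate at a fixed false match rate.

    genuine_scores: similarity scores for photo pairs of the same person.
    impostor_scores: similarity scores for photo pairs of different people.
    """
    impostor = np.sort(np.asarray(impostor_scores))[::-1]  # highest scores first
    # Pick the threshold so that roughly target_fmr of impostor pairs
    # score at or above it (i.e. would be falsely accepted).
    k = max(int(np.floor(target_fmr * len(impostor))), 1)
    threshold = impostor[k - 1]
    # Genuine pairs scoring below the threshold are falsely rejected.
    fnmr = float(np.mean(np.asarray(genuine_scores) < threshold))
    return fnmr, threshold

# Hypothetical similarity scores for two competing systems (made-up numbers,
# just to show how the same test ranks different vendors).
rng = np.random.default_rng(0)
systems = {
    "vendor_A": (rng.normal(0.8, 0.10, 10_000), rng.normal(0.3, 0.10, 100_000)),
    "vendor_B": (rng.normal(0.7, 0.05, 10_000), rng.normal(0.3, 0.10, 100_000)),
}

for name, (genuine, impostor) in systems.items():
    fnmr, thr = fnmr_at_fmr(genuine, impostor, target_fmr=1e-3)
    print(f"{name}: FNMR = {fnmr:.4f} at FMR = 0.001 (threshold = {thr:.3f})")
```

Because every vendor is measured on the same pairs and at the same operating point, the resulting numbers can be compared directly, which is what makes a high score in such a test so valuable commercially.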
A Slate investigation revealed what kinds of photographs end up in NIST's datasets. Many shots were taken by Department of Homeland Security (DHS) employees in public places: DHS staff photographed passers-by while posing as tourists taking pictures of their surroundings.
NIST datasets contain millions of images of people. Since the data was collected in public places, literally anyone could end up in these databases. NIST actively distributes its datasets, allowing anyone to download, store and use the photos to develop face recognition systems (the child-exploitation images are not published). It is impossible to say how many commercial systems use this data, but numerous scientific projects certainly do, writes Slate.
In a comment to the publication, a NIST spokesman said that the FRVT databases were collected by other government agencies for their own purposes; this also applies to the database of children's photos. NIST uses this data in strict accordance with the law and existing regulations. He confirmed that the child-abuse image database really is used to test commercial products, but the children in it are anonymized: their names and places of residence are not indicated. NIST employees do not view these photos; they are stored on DHS servers.
The dataset with photos of children has been in use since at least 2016. According to the developer documentation, it includes "photos of children aged from infant to teenager", and most of the pictures show "coercion, violence and sexual activity". These images are considered especially difficult to recognize because of the greater variability of pose, context, and so on.
This dataset is probably used to train and test automatic systems for filtering obscene content.
Journalists also draw attention to the bias of the Multiple Encounter Dataset, a collection of mugshots. Although Black people make up only 12.6% of the US population, they account for 47.5% of the photos in this criminal database, so an AI trained on it can pick up this bias and become racist.
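As a rough illustration of that skew (using only the percentages cited above, expressed as a simple over-representation ratio; this is an informal sketch, not a statistic from the article itself):

```python
# Rough illustration of the skew described above; the figures are the
# percentages cited in the article, not independently verified statistics.
population_share = 0.126  # share of Black people in the US population (~12.6%)
dataset_share = 0.475     # their share in the Multiple Encounter Dataset (~47.5%)

# Over-representation ratio: how many times more often the group appears
# in the mugshot data than in the population as a whole.
over_representation = dataset_share / population_share
print(f"Over-representation: {over_representation:.1f}x")  # about 3.8x
```

A group that appears in the training data almost four times as often as in the population is exactly the kind of imbalance a model can absorb as bias.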