Deleting unethical datasets isn’t enough

The researchers’ analysis also suggests that Labeled Faces in the Wild (LFW), a dataset introduced in 2007 and the first to use face images taken from the Internet, has changed purpose several times over nearly 15 years of use. While it started out as a resource for evaluating facial recognition models for research only, it is now used almost exclusively to evaluate systems intended for use in the real world. This is despite a label on the dataset’s website warning against such use.

More recently, the dataset was reused in a derivative called SMFRD, which added face masks to each of the images to advance facial recognition during the pandemic. The authors note that this could raise new ethical challenges. Privacy advocates, for example, have criticized such applications for fueling surveillance, and in particular for enabling governments to identify masked protesters.

“This is a very important paper, because people’s eyes have generally not been opened to the complexities, harms and potential risks of datasets,” says Margaret Mitchell, an AI ethics researcher and a leader in responsible data practices, who was not involved in the study.

For a long time, the culture within the AI community has been to assume that data exists to be used, she adds. This paper shows how that assumption can lead to problems down the line. “It’s really important to think about the different values that a dataset encodes, as well as the values that are encoded by having a dataset available,” she says.

A fix

The study authors provide several recommendations for the AI community going forward. First, creators should communicate more clearly about the intended use of their datasets, both through licensing and through detailed documentation. They should also impose stricter limits on access to their data, perhaps by requiring researchers to sign terms of agreement or asking them to complete an application, especially if they intend to construct a derivative dataset.

Second, research conferences should set standards for how data should be collected, labeled and used, and they should create incentives for the creation of responsible datasets. NeurIPS, the largest AI research conference, already includes a checklist of best practices and ethical guidelines.

Mitchell suggests going even further. As part of the BigScience project, a collaboration among AI researchers to develop an AI model capable of analyzing and generating natural language according to a rigorous ethical standard, she experimented with the idea of creating dataset management organizations: teams of people who not only manage the retention, maintenance and use of the data, but also work with lawyers, activists and the general public to ensure that the data meets legal standards, is collected only with consent, and can be deleted if someone chooses to remove personal information. Such management organizations would not be necessary for all datasets, but they certainly would be for scraped data that might contain biometric or personally identifiable information or intellectual property.

“Collecting and monitoring datasets is not a one-time job for one or two people,” she says. “If you do it responsibly, it breaks down into a ton of different tasks that require in-depth thinking, in-depth expertise, and a variety of different people.”

Over the past few years, the field has increasingly moved toward the belief that more carefully curated datasets will be essential to overcoming many of the industry’s technical and ethical challenges. It is now clear that building more responsible datasets is not enough. Those who work in AI must also make a long-term commitment to maintaining and using them ethically.
