Armed with a belief in the generative potential of technology, a growing faction of researchers and companies is seeking to address AI bias by creating artificial images of people of color. Proponents argue that AI-powered generators can fill the diversity gaps in existing image databases by supplementing them with synthetic images. Some researchers are using machine learning architectures to map existing photographs of people onto new races in order to “balance the ethnic distribution” of datasets. Others, like Generated Media and Qoves Lab, are using similar technologies to create entirely new portraits for their image banks, “creating…faces of every race and ethnicity,” as Qoves Lab puts it, to provide “a really honest set of face data.” In their view, these tools will eliminate data bias by cheaply and efficiently producing diverse images on command.
The problem these technologists seek to solve is a critical one. AI is riddled with defects: it unlocks phones for the wrong person because it cannot tell Asian faces apart, falsely accuses people of crimes they did not commit, and mistakes Black people for gorillas. These spectacular failures are not anomalies, but rather the inevitable consequences of the data that AI is trained on, which for the most part skews heavily white and male, making these tools imprecise instruments for anyone who does not fit this narrow archetype. In theory, the solution is simple: we just need to build more diverse training sets. In practice, however, this has proved to be an incredibly time-consuming task, due to the scale of the input data such systems require, as well as the magnitude of current data gaps (for example, an IBM study found that six out of eight prominent facial datasets consisted of over 80 percent fair-skinned faces). That diverse datasets might be created without manual sourcing is an enticing possibility.
However, as we take a closer look at how this proposal could affect both our tools and our relationship to them, the long shadows of this seemingly convenient solution begin to take on a frightening shape.
Computer vision has been in development in one form or another since the mid-20th century. Initially, researchers tried to build tools from the top down, manually defining rules (“human faces have two symmetrical eyes”) to identify the desired class of images. These rules would be converted into a computational formula and then programmed into the computer to help it look for pixel patterns that matched those of the object being described. However, this approach proved largely unsuccessful, given the sheer variety of subjects, angles and lighting conditions that can make up a photograph, and the difficulty of translating even simple rules into coherent formulas.
Over time, the increase in the number of publicly available images made a more bottom-up process possible through machine learning. With this methodology, large aggregates of labeled data are fed into the system. Through “supervised learning,” the algorithm takes this data and learns to distinguish between the categories designated by the researchers. This method is much more flexible than the top-down one because it does not rely on rules that may vary under different conditions. By learning from a variety of inputs, the machine can identify the relevant similarities between images of a given class without being told explicitly what those similarities are, producing a much more adaptable model.
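To make the bottom-up idea concrete, here is a minimal sketch of supervised learning, assuming a deliberately tiny setup: each "image" is reduced to two made-up numeric features, and the classifier simply averages the labeled examples of each class and assigns new inputs to the nearest average. The function names, feature values and labels are all hypothetical and chosen for illustration; real vision systems learn from millions of pixels with far more elaborate models.

```python
from collections import defaultdict


def train_centroids(examples):
    """Learn one mean feature vector (centroid) per label from labeled examples."""
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [s / counts[label] for s in total]
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical labeled training data: (feature vector, label).
training_data = [
    ([0.9, 0.8], "face"),
    ([0.8, 0.9], "face"),
    ([0.1, 0.2], "not_face"),
    ([0.2, 0.1], "not_face"),
]

model = train_centroids(training_data)
print(predict(model, [0.7, 0.75]))
print(predict(model, [0.05, 0.3]))
```

Note that no rule about what a face looks like is ever written down. The categories come entirely from the labeled examples, which is also why the model can only be as good as those examples.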
However, the bottom-up method is not perfect. In particular, these systems are largely limited by the data they are given. As technology writer Rob Horning puts it, technologies of this kind “assume a closed system.” They have trouble extrapolating beyond their given parameters, leading to limited performance when confronted with subjects on which they have not been sufficiently trained; discrepancies in the data, for example, led Microsoft's FaceDetect to have a 20 percent error rate for Black women, while its error rate for white men hovered around 0 percent. The ripple effect of these training biases on performance is why technology ethicists have begun preaching the importance of data diversity, and why companies and researchers are racing to solve the problem. As the popular saying in the field of artificial intelligence goes, “garbage in, garbage out.”
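Disparities of this kind only become visible when errors are counted per group rather than in aggregate. The sketch below shows one way to do that, assuming each evaluation record is a hypothetical `(group, predicted, actual)` tuple. The group names and records are invented for illustration and are not taken from any study mentioned here.

```python
from collections import defaultdict


def error_rate_by_group(records):
    """Compute the fraction of wrong predictions separately for each group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}


# Invented evaluation records: (group, predicted label, actual label).
records = [
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_a", "match", "match"),
    ("group_a", "no_match", "no_match"),
    ("group_b", "match", "match"),
    ("group_b", "match", "no_match"),  # the only wrong prediction
    ("group_b", "no_match", "no_match"),
    ("group_b", "match", "match"),
    ("group_b", "no_match", "no_match"),
]

rates = error_rate_by_group(records)
overall = sum(1 for _, p, a in records if p != a) / len(records)
print(rates)
print(overall)
```

In this toy data the single aggregate figure looks modest, while the per-group breakdown shows that every error falls on one group.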
This principle applies equally to image generators, which also require large datasets to learn the art of photorealistic rendering. Most facial generators today use Generative Adversarial Networks (GANs) as their underlying architecture. At their core, GANs work by pitting two networks, a Generator and a Discriminator, against each other. While the Generator creates images from input noise, the Discriminator tries to sort the generated fakes from the real images provided by the training set. Over time, this “adversarial” relationship allows the Generator to improve until it creates images that the Discriminator cannot identify as fake. The initial inputs serve as the anchor for this process. Historically, tens of thousands of these images have been required to obtain sufficiently realistic results, which indicates how important a diverse training set is to the proper development of these tools.
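The Generator and Discriminator loop can be sketched in a few lines under heavy simplifying assumptions. In this toy version the "images" are single numbers drawn from a made-up real distribution, the Generator is a linear function of noise, the Discriminator is a one-input logistic classifier, and the gradients are written out by hand. All names and hyperparameters are hypothetical. It illustrates the alternating update pattern only, not how production face generators are built.

```python
import math
import random


def sigmoid(s):
    """Logistic function, with the input clamped to avoid overflow in exp."""
    s = max(-60.0, min(60.0, s))
    return 1.0 / (1.0 + math.exp(-s))


def train_toy_gan(steps=2000, lr=0.01, seed=0):
    """Alternate Discriminator and Generator updates on 1-D toy data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # Generator: x_fake = a * z + b
    w, c = 0.1, 0.0  # Discriminator: D(x) = sigmoid(w * x + c)

    for _ in range(steps):
        x_real = rng.gauss(4.0, 0.5)  # sample from the assumed "real" data
        z = rng.gauss(0.0, 1.0)       # input noise for the Generator
        x_fake = a * z + b

        # Discriminator step: minimize -[log D(x_real) + log(1 - D(x_fake))].
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = (d_real - 1.0) * x_real + d_fake * x_fake
        grad_c = (d_real - 1.0) + d_fake
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimize -log D(x_fake), i.e. try to fool
        # the freshly updated Discriminator.
        d_fake = sigmoid(w * x_fake + c)
        grad_a = (d_fake - 1.0) * w * z
        grad_b = (d_fake - 1.0) * w
        a -= lr * grad_a
        b -= lr * grad_b

    return (a, b), (w, c)


generator, discriminator = train_toy_gan()
print("generator (a, b):", generator)
print("discriminator (w, c):", discriminator)
```

The real samples are the only signal about what counts as "real" in this loop: the Discriminator learns from them directly, and the Generator learns only through the Discriminator's judgments. Whatever the training set lacks, neither network has any way to learn.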