The “Diet Soda” of Data


In 1952, a father-and-son team debuted the first “diet” soda. The problem, as Hyman and Morris Kirsch (of Kirsch Beverages) saw it, was that diabetic and cardiovascular patients lacked sweet beverages that were more health-conscious than regular sugary drinks. The Kirsches’ solution was No-Cal: a carbonated, sugar-free beverage. Initially, No-Cal came in ginger ale and black cherry flavors and achieved modest success; by the very next year, however, it had scaled up to seven flavors and grossed upwards of six million dollars in regional sales. By 1956, the Kirsches were running print advertisements featuring the film and television actress Kim Novak, directly marketing their “diet” soda to women who wanted flavorful soft drinks that would help them lose weight and keep their figures trim. From that point on, other beverage companies developed their own versions of diet soda, hawking them as equally calorie-free and tasty and catapulting diet soda into one of the most popular drinks in the United States. Diet soda “owed its success not to what was different and new about it but rather to what was the same as other soft drinks.”

Today, very few people associate diet soda with remedies for diabetics. Advertising and marketing persuade consumers that diet soda is a safe, healthy alternative to regular soda, its sugary counterpart. People are assured that diet soda tastes the same as, if not better than, its sugary predecessor. In other words, soda manufacturers promise consumers that they don’t have to sacrifice sweetness to consume fewer calories, or none at all. People can indulge in as much sweet-tasting diet soda as they want without, allegedly, the risk of gaining weight.

Unwittingly drawing on the marketing success of diet soda, MIT’s Laboratory for Information and Decision Systems (LIDS) recently launched its “Synthetic Data Vault,” with the statement “Synthetic data is a bit like diet soda. To be effective, it has to resemble the ‘real thing’ in certain ways.”

Like many people, I’ve become a bit inured to the predictions that AI will lead to the end of humanity. However, I felt a palpable existential dread creep down my spine when I read the MIT promise to feed the next generation of AIs on “synthetic” data. Why?

Perhaps my concern stems from the fact that I am an avid drinker of Diet Coke (my children call it an addiction). And perhaps it’s because I worry about the research suggesting that diet soda is not effective for weight loss, or good for one’s health in general.

Or it could be that my anxiety about the comparison of synthetic data to diet soda stems from my work on the Black Beyond Data project at Johns Hopkins University. This project focuses on Black communities and data ethics, and it leads me to be a bit circumspect, at times, about technological innovation. The Black Beyond Data Reading Group has hosted over twenty scholars who work on different aspects of data, including Kadija Ferryman, who details the risks that AI-assisted medical technology holds for marginalized communities, including Black people, when it is developed on biased data.

Both of these reasons, I’ve discovered, feed my concern about a technological future that promises to rely heavily on synthetic data.


Before I explore my response to MIT’s synthetic data–diet soda analogy, it is necessary to discuss what synthetic data is and why the machine learning community says it’s the next big thing in AI. Many people outside the rarefied world of computer science have yet to hear about synthetic data (though they may be familiar with some of its forms, such as digital humans).

Synthetic data is AI generated: it is created by algorithms trained on real-world data, but it does not contain any personally identifiable information (PII) or other sensitive properties of that data. Synthetic data is not new. In 1993, Harvard University professor Donald B. Rubin explained that “synthetic data” could replicate microdata, “unit-level data obtained from sample surveys, censuses, and administrative systems,” without compromising confidentiality and privacy (at that time, the worry was the improper release of data).
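Rubin’s idea is easiest to see in miniature. Below is a minimal sketch (my own illustration with invented toy data, not the Synthetic Data Vault’s actual method): fit a simple statistical model to confidential records, then release draws from the model instead of the records themselves.

```python
# Toy illustration of synthetic microdata: fit a model to "confidential"
# records, then release samples from the model rather than the records.
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for confidential survey microdata: (age, income) for 1,000 people.
real = np.column_stack([
    rng.normal(45, 12, 1000),          # ages
    rng.normal(52_000, 18_000, 1000),  # incomes
])

# The "model" here is just the columns' means and covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# The synthetic release: statistically similar in aggregate,
# but no row corresponds to any real person.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(np.round(mean), np.round(synthetic.mean(axis=0)))
```

Real systems use far richer models than a single Gaussian, but the logic is the same: the release preserves aggregate patterns while severing the link to individual records.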

Recently, synthetic data has emerged as an important technological innovation. In 2014, a team of computer scientists that included Ian Goodfellow (popularly known as the GANfather) created generative adversarial networks, or GANs: two competing AI models, a generator and a discriminator, that together produce synthetic or fake images. GANs involve unsupervised machine learning; the algorithms are trained on unlabeled data. Over time, the discriminator gets better at spotting fakes, and the generator gets better at producing outputs that are strikingly like the original data. Generative models like GANs are the source of AI that can create deepfakes, a pervasive problem in tech that policymakers are scrambling to address in anticipation of their use to meddle in the 2024 presidential election.
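To make the adversarial mechanics concrete, here is a minimal, self-contained sketch in PyTorch (my own illustration, not Goodfellow’s code): a tiny generator learns to mimic a simple one-dimensional “real” distribution while a discriminator learns to catch its fakes.

```python
# Minimal GAN sketch: generator vs. discriminator on a 1-D Gaussian.
import torch
import torch.nn as nn

torch.manual_seed(0)

# "Real" data the generator never sees directly: mean 4.0, std 1.5.
def real_batch(n):
    return torch.randn(n, 1) * 1.5 + 4.0

generator = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
discriminator = nn.Sequential(
    nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid()
)
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(3000):
    # 1) Train the discriminator: real samples labeled 1, fakes labeled 0.
    real = real_batch(64)
    fake = generator(torch.randn(64, 8)).detach()  # detach: freeze G here
    d_loss = bce(discriminator(real), torch.ones(64, 1)) + \
             bce(discriminator(fake), torch.zeros(64, 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 2) Train the generator: try to make the discriminator answer "real" (1).
    fake = generator(torch.randn(64, 8))
    g_loss = bce(discriminator(fake), torch.ones(64, 1))
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()

with torch.no_grad():
    samples = generator(torch.randn(5000, 8))
# After training, these should drift toward the real mean (~4.0) and std (~1.5).
print(f"synthetic mean={samples.mean():.2f}, std={samples.std():.2f}")
```

The same tug-of-war, scaled up to deep convolutional networks and image data, is what produces photorealistic synthetic faces and deepfakes.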

Even as GANs and other generative AI models present unprecedented challenges in disinformation, they have created new opportunities to produce synthetic data. In 2022, MIT Technology Review named synthetic data one of its breakthrough technologies of the year. Little wonder that Alexander Linden of Gartner, Inc., an American technological research firm, predicts that the majority of data used for AI and analytics projects will be synthetic by 2030. After all, synthetic data is already widely used in finance, in the automotive industry, and in many of Amazon’s technologies, like Alexa.


So, why has synthetic data sparked so much discussion? And why is it rapidly becoming a mainstay in how AI is developed?

Much of the interest in synthetic data stems from a simple lack of data available for machine learning. Despite a world that appears to be inundated with and driven by data, quality data for training algorithms is challenging to obtain.

First, AI requires big data: large language models are built from enormous amounts of text. OpenAI created ChatGPT with 570 gigabytes of text data, or about 300 billion words. Google utilized even more data, 1.56 trillion words, to develop its chatbot, Bard. The development of ChatGPT has not occurred without controversy and has raised new questions about the ethics and ownership of data. Pending lawsuits against OpenAI have emphasized the sheer scale of data necessary for AI technologies, a demand that is impossible to satisfy with real-world data alone.

However, even if it were possible, the quality of the data might negate the quantity. AI technology not only requires lots of data; the data must meet a particular qualitative standard (as well as comply with privacy policies). Simply put, it is a challenge to collect data from humans rather than from the web, according to Aidan Gomez, cofounder and CEO of Cohere (which develops AI-powered tools and solutions for natural language processing use cases). Gomez, along with a team of researchers, developed the transformer, the deep learning architecture that is the basis of LLMs. Large language models, Gomez states, are much more sensitive to human data and will deviate from machine learning engineers’ programming.

And it is machine learning that is driving the growing interest in synthetic data, according to a recent report from the Royal Society. That interest centers on three specific areas: privacy; de-biasing data and fairness; and increasing diversity in data, or making datasets more robust. Synthetic data, according to its strongest advocates, will fix both problems, the quantitative and the qualitative, and thus produce the data required to develop AI.

A particularly seductive reason to employ synthetic data is the claim that it will sufficiently diversify training data and fix algorithmic bias. In other words, machine learning developers will no longer have to contend with the risk of creating algorithms that perpetuate or produce racism, sexism, classism, and homophobia because they were trained on datasets that lack diversity. A digital start-up company, Mindtech, whose slogan is “The Home of Synthetic AI Training Data,” states that it “will advance AI computer vision by reducing bias and enhancing diversity” with what it calls “Digital DNA,” which includes AI-generated skin tone, ethnicity, age, physical size, and clothing.

In the world of Mindtech, there’s no need to be bothered with the pesky privacy, ethical, and regulatory issues that emerge with the use of diverse real-world data.

“What you really want is models to be able to teach themselves,” Gomez notes. “You want them to be able to … ask their own questions, discover new truths, and create their own knowledge. That’s the dream.” In essence, Gomez desires a tech world in which humans are marginal to the process of developing AI technologies.


Since synthetic data is still a new development in artificial intelligence, relatively few critiques of it exist. Still, one notable article on the politics of synthetic data, by Benjamin N. Jacobsen at the University of York, deserves discussion. First, Jacobsen argues that synthetic data holds out the tantalizing promise that AI systems can produce highly variable data, which could then be used to develop algorithmic systems that respond efficiently to moments of crisis like the recent COVID-19 pandemic. In other words, machine learning engineers can create synthetic datasets to develop models that anticipate future global health outbreaks.

Jacobsen connects his first argument to a second one by stating that synthetic data supposedly “enables the de-risking of algorithms.” It would do so, allegedly, by creating so much variability that it would be impossible to develop any AI technology that is unfair.

What these two premises fail to consider, Jacobsen concludes, are the “ethicopolitical implications” of tying synthetic data’s value to its being risk-free. Data that is not real becomes more valuable because it lacks risk; he calls this a “technology of risk.” Jacobsen’s argument is particularly significant given how time- and resource-intensive ethical data use in AI can be. Barr Moses, CEO and cofounder of Monte Carlo (the tech industry’s first end-to-end data observability platform, a system for monitoring, managing, and maintaining data to ensure its quality), points out that tech developers often view a slower pace as contrary to efficiency. But in reality, warns Moses, “moving too quickly also means we can risk overlooking the hazards waiting in the wings.” Companies hoping to avoid these hazards entirely, while also cutting costs and speeding up development, would rather use supposedly risk-free synthetic data, which ostensibly carries little prospect of harm.


This idea of risk-free synthetic data brings me back to the diet soda analogy that MIT used to discuss the benefits of synthetic data. After all, diet soda manufacturers are not alone in declaring that they can help consumers enjoy questionable goods without any sort of consequences. They are not even alone in convincing consumers it’s in their interest to partake of products that may harm them.

In the 1960s, Big Tobacco created a concerted and well-researched campaign to market menthol cigarettes as a “therapeutic product” to urban Black communities, as Keith Wailoo shows. By then, there was already a growing awareness of the dangers of smoking. Still, brands such as Kool relentlessly promoted menthol cigarettes as a better, healthier option than other cigarettes. In so doing, they seemingly offered Black smokers a risk-free and culturally specific experience. Sound familiar?

Is it fair to compare the marketing of diet soda and cigarettes with discussions of synthetic data? Maybe not, given the physical harm that both products may cause to people’s health.

Nonetheless, the persuasive marketing tactics used to sell diet soda make MIT’s comparison more apt than its authors probably realized. Like the manufacturers and advertisers of diet soda, people in the machine learning community and the tech industry extol the benefits of synthetic data, suggesting that it has all the advantages of real-world data without pesky issues such as privacy and consent. And the growing number of synthetic data start-ups has fueled a larger tech discourse extolling the advantages and promise of synthetic data.

In other words, like diet soda, much of the success of synthetic data relies on convincing companies and other industries of its value. The Brouton Lab, a data science and machine learning company, states, “AI-generated data is cheap, and it only requires data science/machine learning techniques.” Steve Harris, CEO of Mindtech, where “the future of AI is synthetic,” states, “The rise of synthetic data provides a clear route to tackling the many dilemmas in collecting real-world data.”

One of the biggest promises that some tech companies make is that synthetic data can fix or mitigate algorithmic bias. Some of them, astutely recognizing the commercial value of technology that corrects algorithmic bias, have seized on synthetic data as a sort of silver bullet for marketing their data services to businesses.

In 2021, CVEDIA, a company that uses synthetic data to provide machine learning algorithms for data-limited applications, announced that it had created fully synthetic AI with no bias. YData, a data-centric AI company, notes, “Behind the scenes, our powerful synthesizers can learn patterns in the data and accurately synthesize more data points for the minority class, thus mitigating the bias by oversampling the minority class. … We invite you to get in touch with us and experience it yourself!” Wilson Pang, chief technology officer at Appen (a company that provides data collection services for a wide range of businesses and organizations) and coauthor of Real World AI: A Practical Guide for Responsible Machine Learning, writes that synthetic data entails the removal of human involvement from artificial intelligence; in his words, “No Human Means No Human Bias.” He goes on to say:

With synthetic data rising in popularity, the future is bright for AI. As more and more companies adopt the concept to supplement human-collected datasets, we can expect much more inclusive and representative datasets that will lead to safer and more equitable applications for all genders and races.
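The oversampling YData describes is a real and well-established technique. Here is a minimal sketch of the SMOTE-style idea (a simplification: true SMOTE interpolates toward a sample’s nearest neighbors, while this version pairs random minority samples): new minority-class points are synthesized by interpolating between existing ones.

```python
# SMOTE-style oversampling sketch: synthesize minority-class points by
# interpolating between pairs of real minority samples.
import numpy as np

rng = np.random.default_rng(0)

def oversample_minority(X_min, n_new):
    """Return n_new synthetic points lying on line segments between
    randomly chosen pairs of real minority-class samples."""
    i = rng.integers(0, len(X_min), size=n_new)  # base points
    j = rng.integers(0, len(X_min), size=n_new)  # partner points
    t = rng.random((n_new, 1))                   # interpolation weights
    return X_min[i] + t * (X_min[j] - X_min[i])

# Toy imbalance: 500 majority samples vs. 20 minority samples, in 2-D.
X_majority = rng.normal(0.0, 1.0, size=(500, 2))
X_minority = rng.normal(3.0, 0.5, size=(20, 2))

synthetic = oversample_minority(X_minority, n_new=480)
balanced = np.vstack([X_minority, synthetic])
print(balanced.shape)  # (500, 2): the classes are now numerically balanced
```

Note what the sketch reveals: every synthetic point lies between two real ones. The method can rebalance a dataset, but it cannot add information about people the original data never captured, which is precisely the critics’ worry.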

The comparison of synthetic data to diet soda remains useful for understanding how tech companies market it as a magical elixir that can fix algorithmic bias.


Most discussions of algorithmic bias have centered on the underrepresentation of racial and gender groups in training datasets. In a larger sense, this focus is due to the scholars and tech critics who have raised the issue. Overwhelmingly, they have been white women and people of color, many of them Black women. Joy Buolamwini and Timnit Gebru, authors of “Gender Shades,” one of the most cited critical exposés of algorithmic bias in Big Tech, concluded that commercial gender-classification algorithms “performed best for lighter individuals and males overall. The classifiers performed worst for darker females.” The hazards of being both Black and femme in most Western societies increase the harm of algorithmic bias for people who occupy these identities.

It’s too soon to tell whether algorithmic bias will subside as one of the major rationales for synthetic data. However, tech companies’ current use of it to promote their services reflects “tech solutionism,” otherwise known as “an assumption that technology can correct for human error, pushing systems toward a standard of computational objectivity.” Tech solutionism as a marketing strategy sells customers a vision of innovation that, even when it has unintended consequences, can always be tweaked and reconfigured to correct mistakes and failures.

There’s a fundamental problem with this premise, according to Ruha Benjamin and Greta Byrum. “What if new technologies sold to solve social problems are simply ill-suited to the complexity of the environments where they are introduced?” they ask about technological fixes. “What if, sometimes, the answer isn’t technology?”

Their questions strike at the heart of one of the biggest problems with tech companies’ drive to generate synthetic data as a quick fix for algorithmic bias. Such bias, after all, is a long-standing problem, rooted in a set of complex social factors. Synthetic data merely glosses over systemic issues, according to the South African artificial intelligence engineer and computer scientist Tshilidzi Marwala, and can lead to the overrepresentation of data classes from specific racial groups. “We need to fix these problems so that data poverty that leads to the need to generate synthetic data is minimized, especially in the developing world,” he writes. In this sense, he notes, synthetic data can be used, but only as a last resort.

Marwala and a growing group of computer scientists who voice concern about the use of synthetic data to rewire the bias out of algorithms hint at a vociferous debate on the horizon. Consider, on the one hand, Harris, the CEO of Mindtech, insisting that synthetic data allows computer scientists “to remove the biggest roadblock of all … human bias”; then, on the other hand, Karen Hao, the senior AI editor at MIT Technology Review, claiming, “Meanwhile, little evidence suggests that synthetic data can effectively mitigate the bias of AI systems.”

Such debates recall the persistent contention over diet soda’s health benefits, safety, and efficacy in helping people lose and control their weight. In other words, the risk-free, sugar-free, and calorie-free drinking experience that companies sell consumers is highly suspect. In some sense, this doesn’t matter. Drinking diet soda offers the sensation of doing something that tastes good with low stakes and little risk. For many people, that sensation is what they are consuming, despite compelling evidence that suggests otherwise.

For the time being, tech companies, and much of the machine learning community, will continue to frame synthetic data as a technological advancement that can eliminate algorithmic bias. Like the manufacturers of diet soda who market diet soda as a sweet remedy for obesity, tech companies promote synthetic data as a cure-all for algorithmic discrimination.

What makes diet soda so attractive is that it tastes like the original, non-diet version, without the pain of high calories. In a similar vein, proponents of synthetic data argue that it offers all the benefits of real-world data without the concerns and issues of inequity and privacy.

However, the questions raised by actual data are not simply annoying and cumbersome issues. They are profound ethical provocations about equity and justice, which are useful for addressing larger questions about humans’ relationship with each other and technology.

Consider the soda industry, which generates hundreds of millions of dollars in diet soda sales each year, despite there being no evidence that the product offers health benefits. Such an industry tells us we cannot afford to buy tech companies’ latest illusion: that synthetic data offers a risk-free method for eradicating bias in machine learning.

This article was commissioned by Mona Sloane.

Featured image by Maxim Berg / Unsplash (CC by Unsplash License).


