Research Finds Large Language and Vision Models Becoming Less Effective When Saturated with Machine-Generated Content
Maintaining the right balance between authentic, human-created content and auto-generated data can prevent smart content generation tools from losing their ability to respond in diverse, natural-sounding language, according to researchers from Khalifa University, the Technology Innovation Institute (TII) in Abu Dhabi, New York University Abu Dhabi (NYUAD), and the University of California, Berkeley (UC Berkeley), US.
To avoid Model Autophagy Disorder (MAD), in which the intelligent system recycles its own data and becomes stuck producing repetitive, low-quality output, the researchers found that the amount of machine-generated data in training should be kept considerably smaller than the amount of human-origin data.
The findings, published on Cornell University’s arXiv preprint server under the title ‘How Bad is Training on Synthetic Data? A Statistical Analysis of Language Model Collapse,’ will be presented at the Conference on Language Modeling 2024 (COLM 2024), held at the University of Pennsylvania, US, from 7-9 October 2024. The research team includes Dr. Merouane Debbah, Professor of Computer and Communication Engineering at Khalifa University; Mohamed El Amine Seddik, Senior Researcher at TII; Dr. Soufiane Hayou, Researcher at UC Berkeley; Dr. Pierre Youssef, Associate Professor of Mathematics; and Suei-Wen Chen, Research Assistant at NYUAD.
GPT-2, GPT-3, and GPT-4 have shown they can respond intuitively using pre-existing information, and ChatGPT made these kinds of advanced language models accessible to everyday people. When these large language models (LLMs) train on text, and even images, generated by other intelligent systems rather than on real human work, machine-generated data, even in small quantities, can eventually ‘poison’ the smart content apps, leading to MAD or model collapse. Previous studies have highlighted how this self-consuming loop erodes linguistic diversity: models from the current generation produce data that pollutes existing information on the web, setting up next-generation models to train on contaminated data.
Using a simplified mathematical model known as a linear softmax next-token model, the researchers analysed how the system’s probability estimates for each next word change over successive generations of training. Through simulations and tests with realistic GPT-2-style language models on actual data, they confirmed that training on data sampled from a previous-generation model always leads to model collapse. By carefully controlling the amount of synthetic data in the training mix, they demonstrated that these findings also hold beyond purely theoretical settings.
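The recursive loop described above can be illustrated with a small simulation. The sketch below is not the authors’ code: it uses a toy bigram (next-token) task in place of a full language model, and the vocabulary size, corpus size, and number of generations are illustrative assumptions. Each generation’s model is fit only on text sampled from the previous generation’s model, so its estimation error drifts away from the human data distribution rather than averaging out.

    # Minimal sketch (not the authors' code) of the self-consuming training loop:
    # a next-token model is fit on data, new text is sampled from it, and the next
    # "generation" trains only on that synthetic text. All sizes are assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    V = 8            # vocabulary size (assumed)
    N = 2000         # tokens sampled per generation (assumed)
    GENERATIONS = 5  # number of self-consuming training rounds (assumed)

    # Ground-truth "human" data source: a random next-token (bigram) distribution.
    true_bigram = rng.dirichlet(np.ones(V), size=V)   # row i = P(next token | token i)

    def sample_corpus(bigram, n):
        """Sample a token sequence from a next-token distribution."""
        tokens = np.empty(n, dtype=int)
        tokens[0] = rng.integers(V)
        for t in range(1, n):
            tokens[t] = rng.choice(V, p=bigram[tokens[t - 1]])
        return tokens

    def fit_next_token_model(tokens):
        """Estimate P(next | current) from counts; with one-hot contexts this is
        the optimum a (linear) softmax next-token model would converge to."""
        counts = np.full((V, V), 1e-3)                # light smoothing
        for prev, nxt in zip(tokens[:-1], tokens[1:]):
            counts[prev, nxt] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def error_vs_truth(model):
        """Average total-variation distance to the human data distribution."""
        return float(0.5 * np.abs(model - true_bigram).sum(axis=1).mean())

    # Generation 0 trains on human data; every later generation trains only on
    # text sampled from the previous generation's model, so estimation error
    # compounds across generations instead of averaging out.
    corpus = sample_corpus(true_bigram, N)
    for g in range(GENERATIONS):
        model = fit_next_token_model(corpus)
        print(f"generation {g}: distance from human distribution = {error_vs_truth(model):.4f}")
        corpus = sample_corpus(model, N)              # fully synthetic data for the next round

In the study’s setting, it is keeping the share of synthetic data small relative to fresh human-origin data that limits this kind of drift.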
Dr. Merouane Debbah said: “With the adoption of generative large language and vision models, the amount of synthetic data on the web is growing at an unprecedented rate. Several works have shown that incorporating synthetic data in the training can hurt the performance of trained diffusion models. In fact, practitioners are willingly using synthetic data to train next-generation models, leading to model collapse. As intelligent systems become increasingly advanced and widespread, our research lays important groundwork for better understanding and mitigating model collapse in future machine-generated content.”
Alisha Roy
Science Writer
4 June 2024