22.10.2024; Lecture series
Colloquium: Does Generative AI Erode Its Own Training Data? Empirical Evidence of the Effects on Data Quantity and Characteristics from a Q&A Platform
(Erasmus University Rotterdam)
How do generative artificial intelligence (GAI) systems affect their own training data, such as the data drawn from Q&A platforms? The Paradox of Reuse Theory posits that GAI usage substitutes for traffic to the Q&A platforms where the training data originates. However, it remains an open question whether and how this substitution affects the subsequent generation of training data on these platforms. We address this question by leveraging the launch of ChatGPT and using rich StackExchange panel data. We find a decrease in the number of questions on the platform, driven by casual users abandoning it, while the questions that remain are the more complex ones and exhibit greater novelty. Further analysis shows that the more the number of questions decreases, the more the remaining questions increase in complexity and novelty. These findings support the development of the Paradox of Reuse Theory by illuminating a special case in which training data erosion may be beneficial to the GAI.