Aug 29, 2025
September 1, 2025 - Start of the DFG project "Generate-IT. LLM-generated texts in Italian: A linguistic study"
On September 1, 2025, the DFG project "Generate-IT. LLM-Generated Texts in Italian: A Linguistic Study" starts at the Chair of Romance Linguistics (French/Italian), Institute of Romance Studies, TU Dresden.
Applicant: Prof. Dr. Anna-Maria De Cesare
Research Associate: Dr. Giulia Mantovani
The advent of large language models (LLMs) such as GPT-3.5 has completely revolutionized our ability to generate human-like texts in a variety of languages, including Italian. Several studies - mainly on English - claim or show that LLM-generated outputs are high-quality texts that are in many ways comparable to and indistinguishable from human-written texts. At the same time, it has been observed that LLM-generated texts can suffer from "algorithmic bias" and may even contain patterns and structures that are similar to English. English "fingerprints" in generated Italian texts are not surprising: in LLMs such as those of the GPT family, English texts are over-represented in the training data.
Given this situation, many research questions arise in the field of linguistics, in particular on the characteristics of LLM-generated texts in Italian (and other languages). A first set of questions concerns the influence of English on these texts: What forms can the above-mentioned "fingerprints" take, how frequent are they, and how consistently do they appear in the results of LLMs trained on datasets in which English texts are weighted differently? A second important question is whether we are witnessing the emergence of a new linguistic variety in the architecture of contemporary Italian and whether this variety appears impoverished and simplified compared to human-authored texts due to the algorithmic bias typical of LLMs.
The aim of the DFG project "Generate-IT. LLM-Generated Texts in Italian: A Linguistic Study" is to answer these open and topical questions by describing and explaining the properties of LLM-generated texts in Italian. The project will also address theoretical issues, in particular the need to take into account a new dimension of language variation related to the medium used for language production (artificial neural networks). An important aspect is the nature of the linguistic features relevant to this dimension of language variation. These questions will be addressed by conducting an empirical study based on self-compiled representative corpora of LLM-generated texts and comparable corpora of human-written texts.
Overall, the DFG project aims to develop a new, dynamic and innovative branch of research in the field of (Italian) linguistics. It will complement the research on generated texts carried out in neighboring fields (in particular in computational linguistics and natural language generation), and it will pave the way for future interdisciplinary studies between linguists and LLM developers.
DFG - GEPRIS - Generate-IT. LLM-Generated Texts in Italian: A Linguistic Study
The announcement can be downloaded as a PDF: DFG Projekt_Generate-IT_DeCesare.pdf.