In a paper dated January 23, 2024, Onkar Litake, Niraj Yagnik, and Shreyas Labhsetwar from the University of California, San Diego, demonstrated that basic data augmentation techniques are more effective than large language models (LLMs) at improving model performance on text classification tasks.
The authors compared several data augmentation techniques for text classification, including easy data augmentation (EDA), back-translation, paraphrasing using LLMs, text generation using LLMs, and text expansion using LLMs, across six Indian languages: Hindi, Telugu, Marathi, Gujarati, Sindhi, and Sanskrit. For each of the six languages, they applied the augmentations to two tasks: binary text classification and multi-class text classification.
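To give a sense of what a "basic" technique looks like in practice, here is a minimal Python sketch of two of the four standard EDA operations, random swap and random deletion. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `alpha` setting, and the whitespace tokenization are all hypothetical choices made for this example. The other two EDA operations, synonym replacement and random insertion, need a synonym resource for the target language and are left out.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen token positions, n_swaps times."""
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment(sentence, n_aug=4, alpha=0.1, seed=0):
    """Return n_aug noisy variants of a sentence.

    alpha (an assumed default) sets both the share of tokens swapped
    and the per-token deletion probability. Whitespace tokenization is
    a simplification that may not suit every language or script.
    """
    rng = random.Random(seed)
    tokens = sentence.split()
    n_swaps = max(1, int(alpha * len(tokens)))
    variants = []
    for k in range(n_aug):
        # Alternate between the two operations.
        if k % 2 == 0:
            new_tokens = random_swap(tokens, n_swaps, rng)
        else:
            new_tokens = random_deletion(tokens, alpha, rng)
        variants.append(" ".join(new_tokens))
    return variants


if __name__ == "__main__":
    # Hindi example sentence ("This film was very good").
    for variant in augment("यह फिल्म बहुत अच्छी थी", n_aug=4, alpha=0.2):
        print(variant)
```

Each variant keeps the label of the original sentence, so the augmented examples can simply be appended to the training set. Because neither operation relies on a dictionary or a pretrained model, the same code can be applied to any of the six languages that can be split on whitespace.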
As the authors explained, the main motivation for this work was the lack of research on data augmentation for Indian languages, despite its potential to enhance natural language processing (NLP) tasks such as news classification, hate detection, emotion analysis, sentiment analysis, and spam classification.

