To ensure a fair comparison, researchers filmed participants in three specific settings: family breakfasts, game nights, and casual car rides.
The launch follows the data collection project known as Cross-Linguistic Pragmatics: Norms, Rules, and Morality in Daily Life (NoRM-aL). A four-year study, NoRM-aL set out to answer specific questions, such as which linguistic and multimodal resources people use when rules are relevant, negotiated, or enforced.
Based on this data, the PECII corpus shows how people use language and gestures to “police” one another, such as when someone gently corrects a relative’s table manners or reminds a friend of a forgotten game rule. It also enables users to look at cross-linguistic and cross-cultural similarities and differences in normative boundaries.
Behavioral Everyday Speech
In the paper titled “Introducing the ‘Parallel European Corpus of Informal Interaction’ (PECII),” researchers from the three collaborating universities describe the dataset as “an analytic sketch of the practices people use to initiate turns that interfere with and seek to rectify another’s (problematic) behavior, focusing on their variability across languages and settings / activity-contexts.”
Unlike general linguistic surveys, the project collected its data through a network of principal investigators and their research teams doing fieldwork.
In most cases, the recordings were made using stationary cameras to minimize the “observer effect” (conditioning and interference). Researchers would set up the equipment and then either leave the room or participate naturally in the interaction.
For the car rides, for example, each session was filmed from two camera angles. This allowed researchers to capture not just the words spoken, but also vital non-verbal cues and spatial dynamics that define how people routinely communicate in real time.
In one of the videos, three German speakers drive together for over two hours, speaking naturally to one another and discussing several topics. At one point in the dialogue, they talk about the virtues of sunflower or pumpkin seeds on a roll, and one person points out that putting pine nuts in lasagna is odd. (By contrast, Italians would not find pine nuts to be an odd ingredient in lasagna and other dishes, so a conversation on this topic would be quite different.)
In a recorded breakfast interaction of a German-speaking family with young children, characteristic turns of phrase can be heard, such as the mother using diminutives when talking to her children and all participants discussing family plans, including raising the bike seats and putting on an attachment for their dog. (This type of interaction is potentially common in families across all languages and cultures in the dataset.)
With these types of samples, PECII serves as a digital time capsule that preserves the informal discourse and everyday situations of 21st-century life. It is also a historical record of social norms that are often too mundane to be documented elsewhere with any level of scholarly rigor.
Additionally, unlike standard speech datasets that rely on pre-selected or scripted text, PECII captures natural human speech, complete with overlapping voices, slang, and heavy accents.
This combination makes it an ideal benchmark for applications like automatic transcription tools and chats. However, at the time of writing, papers, presentation transcripts, and related projects make very few mentions of a potential use of the PECII data for AI training.
The data is initially being funneled into academic teaching and training, where it serves as a sort of laboratory for students to study human behavior and cross-cultural and linguistic diversity.
The corpus is accessible with permission from DGD for research purposes.