Breakfasts, Game Nights, Car Rides: The PECII Corpus for Natural Language Interactions

In February 2026, the Leibniz-Institut für Deutsche Sprache (IDS, Leibniz Institute for German Language) released version 2.25 of the Parallel European Corpus of Informal Interaction (PECII), a multilingual dataset integrated into the Database for Spoken German (DGD).

Developed through an international collaboration led by academic researchers from the University of Basel, UCLA, and IDS Mannheim, the first-of-its-kind project provides a comprehensive data foundation for scholars seeking to understand the nuances of everyday language and behavior in social and family situations.

PECII is the result of a major technical and logistical undertaking, capturing authentic interaction in nearly 77 hours of audio and video recordings. The tagged dataset of 600,000 tokens (units of speech) lets users see and hear the vernacular across four languages (German, British English, Italian, and Polish) and multiple cultures.

To ensure a fair comparison, researchers filmed participants in three specific settings: family breakfasts, game nights, and casual car rides.

The launch follows the data collection project known as Cross-Linguistic Pragmatics: Norms, Rules, and Morality in Daily Life (NoRM-aL). A four-year study, NoRM-aL set out to answer specific questions, such as which linguistic and multimodal resources people use when rules are relevant, negotiated, or enforced.

Based on this data, the PECII corpus shows how people use language and gestures to “police” one another, such as when someone gently corrects a relative’s table manners or reminds a friend of a forgotten game rule. It also enables users to look at cross-linguistic and cross-cultural similarities and differences in normative boundaries.

Behavioral Everyday Speech

In the paper titled “Introducing the ‘Parallel European Corpus of Informal Interaction’ (PECII),” researchers from the three collaborating institutions describe the dataset as “an analytic sketch of the practices people use to initiate turns that interfere with and seek to rectify another’s (problematic) behavior, focusing on their variability across languages and settings / activity-contexts.”

Unlike general linguistic surveys, the project collected its data through a network of principal investigators and their research teams conducting fieldwork.

In most cases, the recordings were made using stationary cameras to minimize the “observer effect” (conditioning and interference). Researchers would set up the equipment and then either leave the room or participate naturally in the interaction.

For the car rides, for example, each session was filmed from two camera angles. This allowed researchers to capture not just the words spoken, but also the vital non-verbal cues and spatial dynamics that define how people routinely communicate in real time.

In one of the videos, three German speakers drive together for over two hours, speaking naturally to one another and discussing several topics. At one point in the dialogue, they talk about the virtues of sunflower or pumpkin seeds on a roll, and one person points out that putting pine nuts in lasagna is odd. (By contrast, Italians would not find pine nuts to be an odd ingredient in lasagna and other dishes, so a conversation on this topic would be quite different.)

In a recorded breakfast interaction of a German-speaking family with young children, characteristic turns of phrase can be heard, such as the mother using diminutives when talking to her children and all participants discussing family plans, including raising the bike seats and putting on an attachment for their dog. (This type of interaction is potentially common in families across all languages and cultures in the dataset.)

With these types of samples, PECII serves as a digital time capsule, preserving the informal discourse and everyday situations of 21st-century life. It is also a historical record of social norms that are often too mundane to be documented elsewhere with any level of scholarly rigor.

Additionally, unlike standard speech datasets that rely on pre-selected or scripted text, PECII captures natural human speech, complete with overlapping voices, slang, and heavy accents.

This combination makes it an ideal benchmark for applications like automatic transcription tools and chats. However, at the time of writing, there are very few mentions in papers, presentation transcripts, and related projects of a potential use for the PECII data for AI training.

The data is initially being funneled into academic teaching and training, where it serves as a sort of laboratory for students to study human behavior and cross-cultural and linguistic diversity.

The corpus is accessible with permission from DGD for research purposes.