An open archive to contribute to the preservation of the world's linguistic heritage

The evanescence of analytic work means, for many linguists involved in language documentation, that their most important academic creations will be their archived notes, recordings, and supporting materials. (Andrew Garrett, UC Berkeley)

The Pangloss Collection hosts recordings of little-documented languages, most of which are currently endangered. These documents are painstakingly produced by professional linguists working to rescue the world's linguistic diversity, which is dwindling in parallel with the world's biodiversity.

The target languages are typically studied in the field, in their geographic and social context. Dialectologists like to say that each word has a history of its own (Jaberg 1908: 6); likewise, each linguistic document has a history of its own. Linguistic resources result from the collaboration between the author of the document (a native speaker) and the visiting linguist, a collaboration which often extends over many years. Thus, Georges Dumézil referred to the last speaker of the Ubykh language as "my teacher and friend Tevfik Esenç".

The Pangloss Collection developed over more than twenty years of sustained work by researchers and specialized engineers at CNRS. It grows year after year, through contributions that come from French research centres and their partners in various places across the globe.

For how many languages does the Pangloss Collection host data sets?

As of 2020, the collection contains some 780 hours of recordings in more than 170 languages. Over 40% of the resources (1530 out of 3600) are transcribed, annotated and translated, allowing listeners to access the contents. The choice of translation languages is up to the depositors: thus, someone working in Brazil may choose Portuguese rather than English as the main language of translation. If you would like to volunteer an additional translation (for instance, translating the Ubykh story "The goat and the sheep", which currently only has an English translation, into another language such as German, Turkish, Russian or Chinese), you are welcome to get in touch. All contributions are gratefully acknowledged in the catalogue entry (the metadata) of the document concerned.

Integration in international networks

The Pangloss Collection is a member of DELAMAN, the Digital Endangered Languages and Musics Archives Network. It is hosted by the Cocoon platform (Collection de Corpus Oraux Numériques), which is one of the participating archives of OLAC, the Open Language Archives Community.
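
For readers who work with catalogue metadata programmatically: OLAC participating archives expose their records through the OAI-PMH harvesting protocol. The sketch below is an illustration only, not a description of the actual Cocoon service. The base URL is a placeholder, the sample response is invented for the example, and the function names are hypothetical. It shows how a standard ListRecords request is formed and how records could be counted per language from a Dublin Core response.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlencode

# Placeholder endpoint (assumption): replace with the real OAI-PMH base URL.
BASE_URL = "https://example.org/oai"

# Standard namespaces for OAI-PMH responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def count_by_language(xml_text):
    """Count dc:language values across the records of a ListRecords response."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for record in root.iterfind(".//oai:record", NS):
        for lang in record.iterfind(".//dc:language", NS):
            if lang.text:
                counts[lang.text.strip()] += 1
    return counts


# Invented sample response (not real catalogue data), trimmed to the
# elements that the function above reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
                     xmlns:dc="http://purl.org/dc/elements/1.1/">
  <ListRecords>
    <record><metadata><dc:language>Ubykh</dc:language></metadata></record>
    <record><metadata><dc:language>Ubykh</dc:language></metadata></record>
    <record><metadata><dc:language>Example</dc:language></metadata></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    print(count_by_language(SAMPLE))
```

A real harvest would also need to follow the protocol's resumption tokens to page through large result sets; that step is left out here to keep the sketch short.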

An Open Archive within a free and decentralized Internet

The Pangloss website does not use cookies or track visitors' activity. In keeping with its Open Science policy, the Pangloss Collection follows basic principles of transparency, respect for users' privacy, and free orientation of attention.

  • Transparency and respect for privacy: major platforms collect data concerning the behaviour of their users, and use them for various types of profiling. By contrast, on the Pangloss website, your activity is not recorded.
  • Free orientation of attention: apart from the foregrounding of a pair of resources on the Pangloss Collection home page, the interface aims to be neutral and to let you choose the area that you are interested in and the type of documents that you wish to consult. In the same spirit, "professional" mode, designed for linguists and computer scientists, is available to anyone (simply set the toggle in the upper right-hand corner), without requiring credentials or login.

An interface designed to facilitate interdisciplinary collaborations with Natural Language Processing

The "professional" interface is designed to facilitate access not only for linguists but also for computer scientists and engineers. For a computer scientist interested in little-studied languages, obtaining data can be a daunting task: identifying a dataset that has the required characteristics, gaining access rights, downloading, carrying out format conversion (pre-processing), and so on. A lot is at stake in collaboration between linguists and specialists in Natural Language Processing. As an example, data sets from the Pangloss Collection are being used in experiments applying automatic transcription tools to language documentation.

Our tools, like our data, are open and freely available, source code included. In addition, the site is designed to allow batch download of data sets. Please do not hesitate to let us know your wishes and recommendations. We would be happy to discuss any collaborations you would like to set up.

Bibliographical references (open access)

Michailovsky, Boyd, Martine Mazaudon, Alexis Michaud, Séverine Guillaume, Alexandre François & Evangelia Adamou. 2014. Documenting and researching endangered languages: the Pangloss Collection. Language Documentation and Conservation 8. 119–135.

Jacobson, Michel, Boyd Michailovsky & John B. Lowe. 2001. Linguistic documents synchronizing sound and text. Speech Communication 33. 79–96.