• Depositors need to be aware that preparing data for archiving and online distribution is time-consuming and still receives little institutional recognition: time spent putting together a beautiful data set and making provisions for its long-term conservation hardly counts at all in career-building.
  • If you nonetheless believe (as we do) that this is an important and worthwhile pursuit, it is best to take archiving goals into account from the earliest steps of data collection. The quality of the data benefits greatly from the efforts that go into the careful design of the data collection procedure.
  • In order to facilitate conversion to logically structured text (XML) later, it is best to use simple tools, such as a plain text editor (e.g. Notepad++) or a software package such as Toolbox. If you painstakingly typeset glossed texts in a proprietary format such as MS-Word, you will probably have a terrible time converting them to logically structured text later.

Suggestions for data collection

    • Technology evolves rapidly, and local conditions differ greatly from one fieldwork location to another. Will you have access to the power grid? This is not the case on many islands of Oceania, for instance. If there is electricity, are there risks of power outages? Some locations have frequent power outages, which can disrupt recording sessions. An Uninterruptible Power Supply may carry you through short outages. What is the voltage? If it is low and/or irregular, a voltage stabilizer may be necessary to avoid damage to electronic devices. Last but not least, are the sockets grounded? In the absence of grounding, using electric power from the grid is hazardous because the voltage differential, which can reach hundreds of volts, tends to leak into electronic devices such as microphones, causing serious electric shocks. Neither a voltage stabilizer nor an Uninterruptible Power Supply solves the problem of grounding.
    • For recording a good audio signal, the acoustics of the place of recording matter most of all. Reverberation makes recordings unpleasant to listen to and detracts considerably from audio quality. Clapping one's hands is enough to get an idea of whether a place is suitable for recording: the less reverberation there is, the better. A room that has wall-to-wall carpet and bookshelves on the walls (and thick curtains) is likely to dampen reverberation efficiently, yielding good results for recording. At the opposite extreme, a room with bare hard surfaces (cement walls, tiling, glass panes without curtains...) is likely to be strongly reverberant and to yield poor results. The second most important factor is the quality and appropriateness of the microphones. Recording in stereo is recommended, even for recording one speaker: stereo audio opens possibilities for signal processing later on.
    • The recording device (at present, typically a solid-state recorder) should be chosen together with the microphones, so that the recording device provides a level of preamplification that is appropriate for the specific microphones that you will use. Other criteria for choice include sturdiness (how rugged the device needs to be, in view of the fieldwork location and your working habits), suitability for your purposes (if you want the recorder to be inconspicuous, for instance), and battery life when the device is operated on batteries (some devices drain alkaline batteries within a very short time).
    • The Lacito research centre has a wide range of recorders, microphones, video cameras and other devices for recording during fieldwork. Lab members should contact Balthazar Do Nascimento for further information. Reminder: book well in advance (at the same time as you apply for travel authorization). Audio and video equipment is in great demand during the summer holidays (July and August).
    • Audio files:
    • The recommended format for audio is WAV (and FLAC for very large audio files, over 2 GB).
    • The minimum quality is audio CD standard: sampling at 44,100 Hz on 16 bits. As of 2017, adopting 48,000 Hz / 24-bit is highly recommended. This allows you to leave a margin of security in recording (on the order of 12 decibels) while preserving possibilities for digital signal amplification (volume increase) after recording.
    • An ounce of practice is worth a pound of teaching. Take the time to test your equipment carefully before you leave for the field, and test it again thoroughly in the field before recording sessions, so as to be able to focus on elicitation tasks during the sessions.
    • Video files:
    • The recommended format for video is currently MP4.
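As an illustration of the audio recommendations above, the following is a minimal sketch that checks whether a WAV file meets the stated minimum (44,100 Hz, 16 bits) and is recorded in stereo. It relies only on Python's standard `wave` module, which reads uncompressed PCM WAV files (not FLAC). The function name `check_wav` and the constant names are assumptions made for this example, not part of any Pangloss tool.

```python
import wave

MIN_RATE_HZ = 44_100      # audio CD sampling rate, the stated minimum
MIN_SAMPLE_BYTES = 2      # 2 bytes per sample = 16 bits


def check_wav(path):
    """Return a list of problems found in a PCM WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()      # sample width in bytes
        channels = wav.getnchannels()

    problems = []
    if rate < MIN_RATE_HZ:
        problems.append(f"sampling rate {rate} Hz is below {MIN_RATE_HZ} Hz")
    if width < MIN_SAMPLE_BYTES:
        problems.append(f"sample width {width * 8} bits is below 16 bits")
    if channels < 2:
        problems.append("recording is mono; stereo is recommended")
    return problems
```

Calling `check_wav("some_recording.wav")` (a hypothetical file name) returns an empty list when the file meets all three criteria. As background for the 24-bit recommendation: each bit of sample depth corresponds to roughly 6 dB of dynamic range, so a 12 dB safety margin uses about two of the eight extra bits that 24-bit recording offers over 16-bit.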
  • Researchers tend to underestimate the challenge of speech data collection. Some reflections on this topic, with practical recommendations, are available here.
    • Devising a system for file names from the very beginning is indispensable to arrive at an orderly corpus.
    • Add indications of contents and other meaningful elements in the file names. For instance: YearOfRecording_NameOfLanguage_PlaceRecorded_ShortTitleOfDocument.wav or YearOfRecording_SpeakerCode.wav.
    • Never use spaces in file names: use underscore _ instead. Never use diacritics or special characters in a file name.
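The naming rules above can be applied automatically. The sketch below builds a file name following the pattern YearOfRecording_NameOfLanguage_PlaceRecorded_ShortTitleOfDocument, replacing spaces with underscores and removing diacritics and special characters. The function names `safe_part` and `make_file_name` and the sample values are hypothetical, chosen for illustration only.

```python
import re
import unicodedata


def safe_part(text):
    """Clean one file-name component: no spaces, diacritics or special characters."""
    # Decompose accented letters (e.g. "é" becomes "e" + combining accent),
    # then drop the combining marks so that only the base letter remains.
    decomposed = unicodedata.normalize("NFKD", text.strip())
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Replace each run of whitespace with a single underscore.
    underscored = re.sub(r"\s+", "_", no_marks)
    # Drop anything that is not an ASCII letter, digit, underscore or hyphen.
    # Letters with no decomposition (such as "ø") are dropped by this step.
    return re.sub(r"[^A-Za-z0-9_-]", "", underscored)


def make_file_name(year, language, place, title, extension="wav"):
    """Join the cleaned components with underscores and add the extension."""
    parts = [str(year), language, place, title]
    return "_".join(safe_part(part) for part in parts) + "." + extension
```

For example, `make_file_name(2017, "Japhug", "Village Été", "The fox's tale")` yields `2017_Japhug_Village_Ete_The_foxs_tale.wav`.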
    • Each resource has to be documented by providing the information needed to index the data properly.
    • These are: a title for the document, the name of the language, the place of recording, the names of the participants (researchers, language consultants...), the date of recording, a description of the contents, and information about restrictions on distribution (in case the language consultants do not wish the data to be publicly available, or you wish to keep the data private while using them for your publications).
    • A form is available for inputting these pieces of information (downloadable here)
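To keep track of these pieces of information while still in the field, one possibility is a small record per resource with a check for fields left empty. The sketch below is only an illustration: the class name `ResourceMetadata`, its field names and the helper `missing_fields` are assumptions made for this example and do not reproduce the layout of the downloadable form.

```python
from dataclasses import dataclass, fields


@dataclass
class ResourceMetadata:
    """One record per resource; field names are hypothetical."""
    title: str = ""
    language: str = ""
    place: str = ""
    participants: tuple = ()     # researchers, language consultants...
    date: str = ""               # date of recording
    description: str = ""
    restrictions: str = ""       # write "none" explicitly if there are none


def missing_fields(record):
    """Return the names of the fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]
```

A record created with only a title, language, date, description and restrictions would be reported as missing `place` and `participants`.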
    • Ideally, three backup copies need to be made to avoid accidental data loss due to device failure.
    • If the three copies are kept in the same place, they may be destroyed at the same time (due to fire, an electrical surge...), so at least one copy should be kept in a different location.
    • In the medium term (two or three years), the data should be deposited at a specialized institution that will ensure their long-term conservation. From then on, data safekeeping and data integrity are guaranteed, and you no longer need to improvise multiple local copies and other 'non-professional' precautions against data loss. Public distribution of data is technically independent from 'conservation archiving', so even if you plan to defer public availability for as long as you wish, you should deposit the data in an archive for the sake of security.
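Until the data are deposited, it can be useful to verify that the backup copies mentioned above are complete and identical to the originals. The following is a minimal sketch, assuming each backup mirrors the folder structure of the original; it compares files by SHA-256 checksum using only Python's standard library. The function names `sha256_of` and `compare_copies` are made up for this example.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_copies(original_dir, backup_dir):
    """Return (relative path, problem) pairs for files missing or altered in the backup."""
    original_dir, backup_dir = Path(original_dir), Path(backup_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append((str(relative), "missing"))
        elif sha256_of(source) != sha256_of(copy):
            problems.append((str(relative), "differs"))
    return problems
```

An empty result means that every file in the original folder has an identical counterpart in the backup; the check does not report extra files that exist only in the backup.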
    • In principle, safeguarding a set of files is better than nothing BUT...
    • In practice, a description of the files' contents and a set of explicit file naming conventions have to be prepared by the person who records the corpus; otherwise difficulties (sometimes insuperable) are encountered when trying to put together the pieces of the jigsaw puzzle.
    • Methods for transcribing, translating and glossing linguistic documents have essentially remained the same over the past 100 years (since the time of Franz Boas). Texts are transcribed, and word-for-word translations are added (technically, this is called interlinear glossing), as well as sentence-level translations (and sometimes additional levels, such as a more polished text-level translation).
    • Logical text structuring constitutes a breakthrough. It unequivocally associates an element in the target language (morpheme, word, sentence, text...) with its gloss and translation, in as many languages as the researcher provides. The association is not defined implicitly through typographic layout, but explicitly, by the XML markup.
    • The implementation of logically structured text that we propose can be referred to as the Pangloss format. This is a simple pivot format that is easily compatible with various tools (Toolbox, Elan, Transcriber, SayMore, Praat...).
    • Annotation: "Comments, notes, explanations added to a document"
    • In the Pangloss Collection, annotation refers to transcriptions and translations of the contents of an audio recording.
      Annotation can be distributed in different formats:
      • As a PDF file containing a scan of manuscript notes.
      • A logically structured text in XML format: a sample is shown below.
      • Annotation files as defined in the Pangloss Collection have four levels:

        • TEXT: covers all the information
        • S (sentence): indicates the division of the text into sentences
        • W (word): indicates the division of a sentence into words
        • M (morpheme): indicates the division of a word into morphemes

        For each element, it is possible to add translations and glosses by means of a TRANSL tag (translation).
        A NOTE tag is also allowed, to add comments about the text or about individual sentences. Finally, to synchronize the annotation with the audio which it documents, an AUDIO tag is added for each sentence.
        This tag indicates the beginning and end of the sentence on the recording.
        This synchronization allows access to the audio from the annotation, and conversely, it allows for listening to the audio while seeing the time-synchronized annotation.
        Here is an example of annotated text and its recording, as they can be consulted on the website.

        To visualize the XML file, click here.
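To make the four levels described above more concrete, here is a small hand-made illustration read with Python's standard `xml.etree.ElementTree` module. It is a sketch under stated assumptions, not an actual file from the collection: the element names TEXT, S, W, M, TRANSL, NOTE and AUDIO come from the description above, whereas the FORM element, the `start`, `end` and `xml:lang` attributes, the time codes and the invented words are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Hand-made sample. FORM and the attribute names are assumptions; the
# "words" are invented placeholders, not data from a real language.
SAMPLE = """\
<TEXT>
  <S>
    <AUDIO start="0.00" end="2.35"/>
    <FORM>kata mi</FORM>
    <TRANSL xml:lang="en">The dog sleeps.</TRANSL>
    <W>
      <FORM>kata</FORM>
      <TRANSL xml:lang="en">dog</TRANSL>
      <M><FORM>ka</FORM><TRANSL xml:lang="en">animal</TRANSL></M>
      <M><FORM>ta</FORM><TRANSL xml:lang="en">small</TRANSL></M>
    </W>
    <W>
      <FORM>mi</FORM>
      <TRANSL xml:lang="en">sleep</TRANSL>
    </W>
    <NOTE>Placeholder comment about this sentence.</NOTE>
  </S>
</TEXT>
"""


def sentences(xml_text):
    """List each sentence with its time codes, words and translation."""
    root = ET.fromstring(xml_text)
    result = []
    for s in root.iter("S"):
        audio = s.find("AUDIO")
        result.append({
            "start": float(audio.get("start")),   # beginning of the sentence
            "end": float(audio.get("end")),       # end of the sentence
            "words": [w.findtext("FORM") for w in s.findall("W")],
            "translation": s.findtext("TRANSL"),  # sentence-level TRANSL only
        })
    return result
```

Because glosses and translations are attached to elements by the markup rather than by page layout, a few lines of code are enough to retrieve each sentence together with its position in the recording.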

Steps for depositing a corpus

  • Transmission of the corpus to the Pangloss team
  • Discussion about rights of diffusion and archiving
  • Verification of validity of formats
  • Formatting or conversion if necessary
  • Validation by the person in charge of this corpus inside the Pangloss team
  • Final exchanges before the corpus is placed online
  • Deposit of the corpus in Cocoon (the data repository that hosts the Pangloss Collection)
  • Definition of access rights
  • Archiving at CINES (via Cocoon)

Two examples

  • Data collection: Since the beginning of the 2000s, Guillaume Jacques has been collecting and transcribing large amounts of recordings in the Japhug (Rgyalrong) language.
  • Archiving and online deposit: Since 2012, Guillaume Jacques has deposited many documents. Only a fraction have interlinear glosses and text-to-sound alignment at sentence level; most have a complete transcription. These abundant data are now freely available to specialists of Sino-Tibetan languages, who with the help of an online dictionary are able to understand the texts.
  • Perspective for enrichment: Transcription and translation can only be carried out by a specialist of the language (who, no matter how proficient, still needs to verify some points with a native language consultant). On the other hand, tasks such as sound-to-transcription alignment and lemmatization can be taken up by students as part of their training, or be entrusted to research assistants. Gradual enrichment of the collection is planned as a collective endeavor, aiming at full lemmatization. This distribution of tasks allows for a richer record in the end, as larger amounts of text can be transcribed by the researcher, who can focus on the task that requires his/her core skills.
  • As of 2016, three texts in the Romani language, with full glosses, had been deposited by Evangelia Adamou. They contain an innovation in the markup: words borrowed from other languages are encoded as such by means of a tag in the XML document, and are signalled by a specific character style in the consultation interface.
  • This example illustrates possibilities for enriching documents by adding pieces of information that had not been thought of when the original core Pangloss format was designed.