Microsoft releases largest publicly available speech data for three Indian languages to aid researchers

Abhishek Baxi

Microsoft India

Microsoft India has announced the availability of the Microsoft Indian Language Speech Corpus to help researchers and academia build Indian language speech recognition for all applications where speech is used.

Available for Telugu, Tamil, and Gujarati, this is the largest publicly available Indian language speech dataset and includes audio and corresponding transcripts.

The Speech Corpus content is provided as part of the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.

The Microsoft Indian Language Speech Corpus was launched to address the scarcity of adequate digital data for text, speech, and linguistic resources, all of which are essential for building large machine learning models for many vernacular languages across the world. The development of accurate digital tools in Indian languages has been slow owing to subtle differences in enunciation, accent, diction, and slang across various regions in India.

Microsoft’s Indian Language Speech Corpus was tested at Interspeech 2018, the world’s largest and most comprehensive conference on the science and technology of spoken language processing, and was used to create high-quality speech recognition models, thus validating the efficacy of the Corpus.

It is imperative that India’s increasing digital literacy is supported by a multilingual digital world, and initiatives like these for researchers and academia will help accelerate innovation in voice-based computing for India.