
Research Projects

Crowdsourcing for Language Processing (CLAP): A platform for collecting multilingual data for speech and language processing

Implementing Organization

Indian Institute of Technology (IIT)
Principal Investigator
Prof. Preethi Jyothi
Indian Institute of Technology (IIT)
Co-Principal Investigator
Dr Kameswari Chebrolu
Associate Professor
Indian Institute of Technology (IIT)

Project Overview

Over the last decade, Artificial Intelligence (AI) has increasingly made inroads into society and our lives. However, as with all other technologies, making it accessible to people from all strata of Indian society is an important challenge. This challenge manifests itself mostly at the interface between humans and computers. A key component of this interface that is highly sensitive to the cultural and linguistic background of the users is automatic speech recognition (ASR). To build competitive ASR systems for Indian languages, one requires large amounts of labeled speech data, i.e., speech clips in different Indian languages accompanied by their corresponding text. Publicly available repositories of labeled speech in Indian languages are currently limited. This project aims to collect large volumes of labeled speech in a number of Indian languages in a scalable manner using crowdsourcing. Towards this end, the investigators will undertake the following:

1) The investigators will design an Android mobile application and a corresponding backend server to crowdsource tasks for labeled speech in various Indian languages. Users of this app (the workforce) will be given two types of tasks to complete: a "Speak" task, where users read out prompts in their native tongues, and a "Verify" task, where users confirm whether a prompt and its corresponding speech (obtained from a different user) are well matched.

2) To collect large volumes of data, it is essential to have effective mechanisms to recruit the workforce and to retain it. For recruitment, the investigators will explore Facebook/Google social media advertising and will contact student bodies (as part of the National Social Service scheme), taxi driver associations, etc. As incentives, the investigators will employ gamification, PayTM-based money transfers, and the coupling of AI education with data collection by presenting internship opportunities to top students. The investigators will explore various combinations of these mechanisms to determine the scheme that gives the best return on investment.

3) Given the crowdsourced nature of the collection, poor-quality data can creep into the corpora. To tackle this challenge, the investigators will employ a host of techniques to post-process the data. Operations such as noise reduction and volume control will be applied to all speech clips. Verify tasks coupled with majority voting could be used to catch instances of spamming in Speak tasks. As an additional quality measure, the investigators will perform automatic random checks on each user using gold-standard Verify tasks (where the outcome is known) to catch spammers.

The investigators intend to release the collected speech data as publicly available corpora that researchers and industry practitioners can use to build new systems or bootstrap their existing ones. The investigators believe this would be a very valuable contribution towards furthering research on speech technologies in Indian languages.
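The two quality checks described in item 3 (majority voting over Verify tasks and gold-standard checks on each user) can be sketched as follows. This is only an illustrative sketch, not the investigators' implementation: the record shape `(verifier_id, clip_id, says_match)`, the function names, and the `min_votes`, `threshold`, and `min_gold` values are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_vote(verifications, min_votes=3):
    """Decide per clip whether prompt and speech match, by strict majority.

    `verifications` is an iterable of (verifier_id, clip_id, says_match)
    tuples (a hypothetical record shape). Clips with fewer than `min_votes`
    votes, or with a tie, are left undecided and omitted from the result.
    """
    votes = defaultdict(list)
    for _verifier, clip_id, says_match in verifications:
        votes[clip_id].append(bool(says_match))

    decisions = {}
    for clip_id, clip_votes in votes.items():
        if len(clip_votes) < min_votes:
            continue  # not enough Verify responses yet
        yes = sum(clip_votes)
        no = len(clip_votes) - yes
        if yes != no:
            decisions[clip_id] = yes > no
    return decisions


def flag_spammers(verifications, gold, threshold=0.7, min_gold=2):
    """Flag verifiers whose accuracy on gold-standard clips is below `threshold`.

    `gold` maps clip_id -> known correct answer. Verifiers who have seen
    fewer than `min_gold` gold clips are not judged (assumed policy).
    """
    seen = Counter()
    correct = Counter()
    for verifier, clip_id, says_match in verifications:
        if clip_id in gold:
            seen[verifier] += 1
            correct[verifier] += int(bool(says_match) == gold[clip_id])

    return sorted(
        verifier
        for verifier, n in seen.items()
        if n >= min_gold and correct[verifier] / n < threshold
    )


if __name__ == "__main__":
    # Toy data: u3 disagrees with the known answer on both gold clips.
    verifications = [
        ("u1", "c1", True), ("u2", "c1", True), ("u3", "c1", False),
        ("u1", "g1", True), ("u2", "g1", True), ("u3", "g1", False),
        ("u1", "g2", False), ("u2", "g2", False), ("u3", "g2", True),
    ]
    gold = {"g1": True, "g2": False}
    print(majority_vote(verifications))
    print(flag_spammers(verifications, gold))
```

In this sketch, a Speak clip that a majority of verifiers reject would be treated as suspect, while verifiers who fail the gold-standard checks could have their votes discarded before the majority is computed. How the two checks are actually combined is a design choice the overview leaves open.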
Funding Organization
Department of Science and Technology (DST)
Ministry of Education (MoE)
Quick Information
Area of Research
Computer Sciences and Information Technology
Focus Area
Development of multilingual data collection platform
Start Year
2019
Sanction Amount
₹ 36.89 lakh
Status
Ongoing
Output
No. of Research Papers
00
Technologies (If Any)
00
No. of PhDs Produced
N/A
Startup (If Any)
00
No. of Patents
Filed: 00
Granted: 00