Curating and constructing benchmarks and development of ML models for low-level NLP tasks in Hindi-English code-mixing
Implementing Organization
Indian Institute of Technology (IIT)
Principal Investigator
Mr. Mayank Singh
Indian Institute of Technology (IIT) Gandhinagar, Gujarat
About
This project aims to build an ecosystem for research on the code-mixing behavior of Indian users across social media platforms. The NLP community works mostly on English, European languages and Mandarin; the project is intended to shift part of that attention towards low-resource Indic languages and to enable researchers to develop state-of-the-art tools and algorithms for India-specific text, which remains largely unexplored.

Here, we focus on code-mixing, where tokens or phrases from multiple languages are mixed within a single written or spoken sentence (intra-sentential mixing). Sometimes, whole sentences written or spoken in different languages are also mixed (inter-sentential mixing). The example below shows English mixed with Hindi in a Hinglish (mixture of Hindi and English) sentence.

English: We have a fully autonomous vehicle.
Romanized Hindi: hamaare paas poori tarah se svaayatt vaahan hai
Hinglish: Hamare paas fully autonomous vaahan hai

First, we aim to curate and construct manually annotated benchmark datasets for Indic code-mixed languages. To the best of our knowledge, this is the first proposal that considers Indic code-mixed language pairs. Unlike current initiatives that curate small annotated datasets (on the order of a few thousand), we aim to curate large-scale datasets (at least 100 times larger) for Hindi-English code-mixing. We aim to curate data from two social media platforms, Twitter and Instagram, and from several India-specific manually created corpora such as the PM's Mann ki Baat episodes, speeches made by political leaders (BJP, INC, etc.) across the country, and PIB releases.

Second, we plan to develop models for low-level NLP tasks such as token-level language identification, named entity recognition, POS tagging and matrix language prediction. We also plan to develop machine translation models for the three proposed code-mixing pairs. The large-scale curated datasets will also help in developing pre-trained language models and the high-dimensional representations that are widely explored in recent NLP work.

Third, as a commitment to public availability, we aim to develop a web portal that makes the curated datasets, tools and resources easily accessible. The portal is planned to offer the following features: (i) the curated datasets and associated APIs, (ii) a submission system for test results on the proposed low-level NLP tasks, to enable competition among the research community, (iii) a demo portal where users submit text samples in real time and get the output of the trained models, and (iv) annotation frameworks for constructing similar annotated datasets in the future.
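To make two of the low-level tasks named above concrete, here is a minimal sketch of token-level language identification and a crude matrix language guess, applied to the Hinglish example sentence. It is an illustration only, not the project's method: the two word lists are hypothetical and hand-picked for this one sentence, the function names tag_tokens and matrix_language are invented for the sketch, and taking the majority language as the matrix language is a simplifying assumption (the matrix language is more properly the one that supplies the grammatical frame).

```python
# Toy lexicons (hypothetical, hand-picked for this single example).
# A real system would learn these distinctions from annotated data.
HINDI_ROMANIZED = {"hamare", "hamaare", "paas", "poori", "tarah",
                   "se", "svaayatt", "vaahan", "hai"}
ENGLISH = {"we", "have", "a", "fully", "autonomous", "vehicle"}


def tag_tokens(sentence):
    """Label each whitespace-separated token as 'hi', 'en', or 'other'."""
    tags = []
    for token in sentence.split():
        word = token.strip(".,!?").lower()
        if word in HINDI_ROMANIZED:
            label = "hi"
        elif word in ENGLISH:
            label = "en"
        else:
            label = "other"
        tags.append((token, label))
    return tags


def matrix_language(tags):
    """Crude heuristic (an assumption): the most frequent language wins."""
    counts = {}
    for _, label in tags:
        if label != "other":
            counts[label] = counts.get(label, 0) + 1
    return max(counts, key=counts.get) if counts else None


tags = tag_tokens("Hamare paas fully autonomous vaahan hai")
print(tags)
print(matrix_language(tags))
```

On the example sentence, four tokens are tagged as Hindi and two as English, so the heuristic picks Hindi as the matrix language. A lexicon lookup of this kind breaks down quickly on real social media text, where romanized spellings vary and many words are ambiguous between the two languages; that is one reason large annotated benchmarks are needed.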
Patents
0
Source
Science and Engineering Research Board (SERB), DST 2022-23
Science and Engineering Research Board (SERB), New Delhi
Anusandhan National Research Foundation (ANRF)
Quick Information
Area of Research
Engineering Sciences
Start Year
2023
End Year
2026
Sanction Amount
₹ 47.67 L
Status
Ongoing
Contact
singh.mayank@iitgn.ac.in
Output
No. of Research Papers
00
Technologies (If Any)
00
No. of PhD Produced
00
No. of Patents
Filed: 00
Granted: 00
Disclaimer:
Information available on this portal is sourced from various organizations and is provided for informational purposes only. Users are advised to verify details from the respective official sources.