Curating and constructing benchmarks and development of ML models for low-level NLP tasks in Hindi-English code-mixing

Implementing Organization

Indian Institute of Technology (IIT)

Principal Investigator

Mr. Mayank Singh

Indian Institute Of Technology (IIT) Gandhinagar, Gujarat

About

This project focuses on developing an ecosystem to strengthen research on the code-mixing behavior of Indian users across social media platforms. The project aims to shift the focus of the NLP community, which mostly works on English, European languages and Mandarin, towards low-resource Indic languages. It will enable researchers to develop state-of-the-art tools and algorithms for understanding the least explored India-specific text.

Here, we focus on code-mixing, where tokens or phrases from multiple languages are mixed, in writing or speech, inside a single sentence (intra-sentential mixing). Sometimes, whole sentences written or spoken in different languages are mixed (inter-sentential mixing). Below, we present an example where English is mixed with Hindi within a Hinglish (mixture of Hindi and English) sentence.

English: We have a fully autonomous vehicle.
Romanized Hindi: hamaare paas poori tarah se svaayatt vaahan hai
Hinglish: Hamare paas fully autonomous vaahan hai

In this project proposal, we first aim to curate and construct manually annotated benchmark datasets for Indic code-mixed languages. To the best of our knowledge, this is the first proposal that considers Indic code-mixed language pairs. Unlike current initiatives that focus on curating small annotated datasets (on the order of a few thousand), we aim to curate large-scale datasets (at least 100 times larger) for Hindi-English code-mixing. We aim to curate data from two social media platforms, Twitter and Instagram, and from several India-specific manually created corpora such as the PM’s Mann ki Baat episodes, speeches made by political leaders (BJP, INC, etc.) across the country, PIB releases, etc.

Secondly, we plan to develop models for low-level NLP tasks such as token-level language identification, named entity recognition, POS tagging and matrix language prediction. We also plan to develop machine translation models for the three proposed code-mixing pairs. The large-scale curated datasets will also help in developing pre-trained language models and high-dimensional representations, which are widely explored in recent NLP initiatives.

Thirdly, as a promise of public availability, we aim to develop a web portal for easier access to the curated datasets, tools and resources. The portal aims to contain the following features: (i) the curated datasets and associated APIs, (ii) a submission system for test results on the proposed low-level NLP tasks, to enable competition among the research community, (iii) a demo portal to submit real-time text samples and get the output from the trained models, and (iv) annotation frameworks for constructing similar annotated datasets in the future.
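
To make two of the low-level tasks named above more concrete, here is a minimal Python sketch of token-level language identification and matrix language prediction applied to the Hinglish example sentence. It is only an illustration under stated assumptions: the word lists, the function names `tag_tokens` and `matrix_language`, and the majority-vote rule for the matrix language are hypothetical and are not the project's data, models or methods. A real system would learn these labels from annotated data.

```python
# Toy illustration only: these word lists are hypothetical and hand-picked
# for one sentence; they are not part of the project's datasets or models.
HINDI_WORDS = {"hamare", "paas", "vaahan", "hai"}
ENGLISH_WORDS = {"fully", "autonomous"}


def tag_tokens(sentence):
    """Label each whitespace-separated token as 'hi', 'en', or 'unk'."""
    tags = []
    for token in sentence.split():
        word = token.lower()
        if word in HINDI_WORDS:
            tags.append((token, "hi"))
        elif word in ENGLISH_WORDS:
            tags.append((token, "en"))
        else:
            tags.append((token, "unk"))
    return tags


def matrix_language(tags):
    """Guess the matrix language as the most frequent known tag.

    This majority vote is a simplifying assumption made for this sketch.
    """
    counts = {}
    for _, lang in tags:
        if lang != "unk":
            counts[lang] = counts.get(lang, 0) + 1
    return max(counts, key=counts.get) if counts else "unk"


tags = tag_tokens("Hamare paas fully autonomous vaahan hai")
print(tags)
print(matrix_language(tags))
```

In this sketch, "fully" and "autonomous" receive the English tag and the remaining four tokens receive the Hindi tag, so the majority vote picks Hindi as the matrix language of the sentence.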

Patents

Source

Science and Engineering Research Board (SERB), DST 2022-23

Funding Organization

Science and Engineering Research Board (SERB), New Delhi

Anusandhan National Research Foundation (ANRF)

Quick Information

Area of Research

Engineering Sciences

Start Year

2023

End Year

2026

Sanction Amount

₹ 47.67 lakh

Status

Ongoing

Contact

singh.mayank@iitgn.ac.in

Output

No. of Research Papers

Technologies (If Any)

No. of PhD Produced

No. of Patents

Filed : 00

Grant : 00
