Linguistics and Computational Linguistics Research @CTRANS

The links and details on this page are currently being updated!

This page lists the Linguistics and Computational Linguistics research carried out by Dr. Ritesh Kumar and his group at the K.M. Institute of Hindi and Linguistics and the Centre. Explore the resources and technologies they have developed below.

Research on Minoritised and Endangered Languages and Varieties of Language

You have probably never heard the names of most of the languages listed on this page, let alone come across language descriptions, resources or technologies for them. One of our primary aims is to produce at least basic resources and language processing technologies for all of the hundreds or even thousands of Indian and South Asian languages. Here we list the very humble start that we have made in that direction, and we hope to accelerate this work and give it a major push in the coming years.

Datasets

Indo-Aryan Language Identification


UniMorph Repo
UD Repo
Datasets

Magahi

Magahi Part-of-Speech Tagger

Magahi Morphological Analyser

Magahi UD Treebank

Magahi Unimorph Paradigms

UniMorph Repo
UD Repo
Datasets

Braj Bhasha

Braj Bhasha Part-of-Speech Tagger

Braj Bhasha Morphological Analyser

Braj Bhasha Treebank

Braj Bhasha Unimorph Paradigms

Datasets

Awadhi

Awadhi Part-of-Speech Tagger

Datasets

Bundeli

Text corpus of Bundeli. It is in the final stages of proofreading and annotation, and we hope to release it soon.

Datasets

Bodo

Text corpus of Bodo.

Datasets

Taluitew

Politeness in Taluitew.

Datasets

Western Hindi Variety [Speech]

Speech Corpus of Western Hindi.

Datasets

Eastern Hindi Variety [Speech]

Speech corpus of an Eastern Hindi variety (from Bihar). The corpus is currently being processed, and we expect to release it soon.

Hate Speech and Aggressive Language Research

Research on aggressive language began very early at our University, which is probably why Hindi is today a rather resource-rich language as far as aggressive and hate speech research is concerned. In the last couple of years there have been more resource and technology development efforts in the field for Hindi, but our University was undoubtedly a pioneer. We strive to continue this tradition as we have now started exploring hateful and aggressive speech in other major languages of India beyond Hindi.

Project Page
Datasets

The Aggression Project

Supported by the UK-India Education and Research Initiative and carried out in collaboration with multiple institutions from India and the UK, the project led to the development of the first speech dataset of over 50 hours in Hindi and English marked with aggression, as well as a tool to automatically recognise aggression in Hindi and English speech. A demo of the tool is available at http://panlingua.co.in/art/ and the dataset will be released soon at https://github.com/kmi-linguistics/speech-aggression.

Project Page
Datasets

Detection of Aggressive Behaviour on Social Media

Supported by Microsoft Research, this project resulted in the first dataset from Twitter and Facebook in Hindi and English, as well as models for automatic detection of aggression in social media text. Visit the GitHub page for the dataset.

Project Page
Datasets

OffensEval


Project Page
Datasets

The ComMA Project

Supported by Facebook Research, this ongoing project aims to build multilingual, multimodal datasets and systems for recognising misogyny, communalism and other forms of aggression in Indian languages such as Bangla, Hindi, Meitei, English and others. Please visit the project website for more details.


Applications / Competitions / Shared Tasks

Besides the two major areas of research mentioned above, we contribute to different areas of Linguistics via course projects, dissertations, etc. as well as other kinds of research. We list some of our significant contributions here, especially those which are built as assistive technologies for researchers working in language documentation and revitalisation of endangered, minoritised and lesser-known Indian languages. We also list some of the systems that we submitted for shared tasks / public competitions.

App

mScrabble

mScrabble is a multilingual mobile and web-based version of the popular language game Scrabble, aimed especially at the endangered and lesser-known languages of India. More importantly, it can generate mScrabble games for different languages using only a dictionary of words in the language concerned and a list of characters in its script. The app is currently available for Koda and Mahali (two critically endangered Austro-Asiatic Indian languages), Magahi, Bhojpuri, Awadhi and Braj Bhasha, besides Hindi, Bangla and English. More languages are being added continuously.
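As a rough illustration of how a game could be derived from only a word list and a character list, the sketch below splits each dictionary word into units from the character list, then gives frequent units more tiles and fewer points. This is not the mScrabble implementation. The function names `segment` and `build_tile_set`, the parameters `total_tiles` and `max_points`, and the frequency-based scoring rule are all assumptions made for this example.

```python
from collections import Counter


def segment(word, charset):
    """Split a word into units from charset, preferring the longest match.

    Units may span several code points (e.g. a consonant plus a vowel sign),
    which is why the character list is used instead of iterating over the string.
    """
    units, i = [], 0
    max_len = max(len(c) for c in charset)
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if piece in charset:
                units.append(piece)
                i += size
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return units


def build_tile_set(words, characters, total_tiles=100, max_points=10):
    """Hypothetical tile generator: frequent units get more tiles and fewer points."""
    charset = set(characters)
    counts = Counter()
    for word in words:
        counts.update(segment(word, charset))
    if not counts:
        raise ValueError("the dictionary contains no usable words")
    total = sum(counts.values())
    top = max(counts.values())
    tiles = {}
    for ch in characters:
        freq = counts.get(ch, 0)
        # Every character gets at least one tile; because of rounding, the
        # tile counts only approximately add up to total_tiles.
        n_tiles = max(1, round(total_tiles * freq / total))
        points = max(1, round(max_points * (1 - freq / top)))
        tiles[ch] = {"tiles": n_tiles, "points": points}
    return tiles
```

Calling `build_tile_set(words, characters)` with a language's word list and character list returns a mapping from each character to its tile count and point value. A real game would also need board layout and word validation, which are left out here.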

App

Bahubhashi (Multilingual)

Bahubhashi is a multilingual language assistant that can perform various language-related tasks, such as spell and grammar checking, word prediction, etc., for Indian languages. It is currently under active development, and a beta version will soon be released for Hindi and a few lesser-known and endangered languages such as Magahi, Bhojpuri, Koda, Mahali, etc.

App

Linguistic Field Data Management and Analysis System [LiFe]

LiFe is a project to build an app for managing linguistic field data, especially in the context of language documentation. It is intended to assist in analysing, publishing and exporting the data in multiple formats, and to store the data in a way that can be leveraged for developing NLP applications. The app is currently under development, and we hope to release the test alpha version within the next couple of months. It is mainly targeted at researchers working in the fields of language documentation and revitalisation.

Shared Task: SigTyp2020


Shared Task: HASOC 2019


Shared Task: WMT2019 [Similar Language Translation]