Master of Science in Computational Linguistics

Departments and Partners

The M.Sc. in Computational Linguistics is offered jointly by 4 departments of the University:

  • Department of Linguistics (K.M. Institute of Hindi and Linguistics) - Coordinating Institution

  • Department of Statistics (Institute of Social Sciences)

  • Department of Computer Science (Institute of Engineering and Technology)

  • Department of Mathematics (Institute of Basic Sciences)

The courses in the programme are co-taught by external collaborators, researchers and faculty from the following industry and academic partners (this list is updated continuously as more people come on board):

  • Indian Institute of Technology, Kharagpur

  • Microsoft Research India, Bangalore

  • Panlingua Language Processing LLP, New Delhi

The programme will be coordinated by the Department of Linguistics under the aegis of the proposed Centre for Transdisciplinary Studies.

Motivation

Processing natural languages is one of the most difficult and, at the same time, most essential requirements of modern times. As computers have made inroads into every sphere of our lives, it has become clear that the full potential of automation and artificial intelligence cannot be realised without machines that can understand, process and generate human languages. As a result, Natural Language Processing (NLP) and Computational Linguistics have found applications in areas ranging from healthcare and defence to the legal domain and digital assistants. The most common applications of NLP include chatbots, personal assistants and other voice-enabled applications (also called Conversational AI), machine translation, text summarization, OCR technologies, search engines and information retrieval systems, automatic speech recognition and numerous others.

That said, such technologies are largely unavailable for our own major, regional and local languages. Chatbots don't work in Braj Bhasha (or even in official languages like Bodo or Manipuri), there are no machine translation systems for Awadhi or Santhali, and there are hardly any resources with which to build them. To make things worse, there are very few trained computational linguists in South Asia / India working on South Asian / Indian languages; they can be counted on one's fingers. One of the main reasons for this is the lack of dedicated, state-of-the-art training centres and departments in the region. Barring a couple of departments and degrees, which are still bound by parochial disciplinary boundaries, India (and South Asia in general) unfortunately doesn't have a truly multi-disciplinary centre for teaching and research in the field. This programme aims to fill that gap and create the first truly multi-departmental programme in computational linguistics in South Asia / India, where students will be trained by experts from 4 different departments of the University (Linguistics, Mathematics, Statistics and Computer Science) and mentored by professionals and experts from industry. We hope that this experimental programme will train outstanding talent in India who will contribute to the development of language technologies for South Asian / Indian languages (especially low-resourced, minority and endangered languages) in multiple domains (especially with respect to socially-aware systems).

Objectives

The purpose of this programme is to acquaint students with the fascinating area of Natural Language Processing and Computational Linguistics and to train them for research in this field, especially with respect to working with low-resource, minority and endangered languages and the development of socially-aware NLP systems. The course outline has been prepared so that students acquire a sound background in the different theoretical and methodological orientations of Linguistics, Computer Science and Computational Linguistics. At the same time, since NLP is an applied field, they should be well equipped to develop real-life applications that are both innovative and useful for the different linguistic and cultural communities of the country. Thus every course includes both theoretical and application-oriented components.

Course Structure

The programme is broadly divided into 4 groups:

  • Language Sciences [Group A]

  • Computer Science [Group B]

  • Mathematics/Statistics [Group C]

  • Natural Language Processing [Group D]

The soft core courses in each group are designed to introduce students to the different areas and methodologies needed by a computational linguist, while the elective courses either provide a more in-depth study of these areas or introduce a new, emerging area of interest. Thus, while being aware of the different orientations in the field, students can select areas of interest and study them in greater depth through the elective courses.

Depending on the students' background, an evaluation of their needs and their own interests and preferences, each student may be assigned to one or more of Groups A, B and C by the course adviser, in consultation with the programme director and coordinators. Students will need to complete all the core courses from their assigned group(s). Group D will remain compulsory for all students. In general, students with a background in Linguistics may not need to complete courses from Group A; similarly, depending on their prior experience, students with a background in computer science or maths / statistics may be exempt from completing courses in Group B or C or both. In such situations, students may still choose to complete the core courses from these groups; alternatively, they may opt for elective courses from any of the groups or departments to complete the credit requirements for the programme.

The detailed course outline for each course may be decided by the course instructor in consultation with the program academic committee / board of studies at the beginning of the semester.

The complete list of courses and a brief description of the soft core courses can be downloaded from the following link

Evaluation Pattern

The normal evaluation pattern of the courses will largely follow the CBCS guidelines of the University and will be as follows:

Continuous Assessment: 40%

Lab / Project: 30%

End-Sem Exam: 30% (written / practical exam of minimum 3 hours)

In Groups A and C, the end-sem exam will be a written exam of at least 3 hours.

In Groups B and D, the end-sem exam may consist of a mix of practical and written components, with the written component carrying no more than 50% of the weightage. In these courses, the written exam may also be dropped entirely.

It is to be noted that the question paper setting and assessment for the end-sem exam will be carried out by the concerned course instructor, as per the CBCS norms.

For the course ALI107, Field Methods and Language Documentation, the evaluation pattern will be as follows:

Field Work Report: 60% (evaluated by BOTH external examiner and course instructor(s))

Presentation (based on the report): 40%
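
As a rough illustration of how the weightings above combine into a final course score, the following Python sketch applies them to a set of component marks. The function and dictionary names, and the assumption that every component is marked out of 100, are illustrative only and are not part of the University's CBCS rules.

```python
# Illustrative sketch only: names and the 0-100 marking scale are assumptions.

# Weightings (in percent) taken from the evaluation pattern described above.
DEFAULT_WEIGHTS = {
    "continuous_assessment": 40,
    "lab_project": 30,
    "end_sem_exam": 30,
}
FIELD_METHODS_WEIGHTS = {  # ALI107, Field Methods and Language Documentation
    "field_work_report": 60,
    "presentation": 40,
}


def weighted_score(marks, weights):
    """Combine component marks (each assumed to be out of 100) into a final percentage."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100 percent")
    if set(marks) != set(weights):
        raise ValueError("marks and weights must cover the same components")
    for component, mark in marks.items():
        if not 0 <= mark <= 100:
            raise ValueError(f"{component}: mark must be between 0 and 100")
    return sum(marks[name] * weights[name] for name in weights) / 100


regular = weighted_score(
    {"continuous_assessment": 80, "lab_project": 70, "end_sem_exam": 60},
    DEFAULT_WEIGHTS,
)
field_methods = weighted_score(
    {"field_work_report": 75, "presentation": 85},
    FIELD_METHODS_WEIGHTS,
)
print(regular, field_methods)
```

With these example marks, the regular pattern works out to 0.4 × 80 + 0.3 × 70 + 0.3 × 60 = 71, and the ALI107 pattern to 0.6 × 75 + 0.4 × 85 = 79.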

The courses ALI172, Fundamental of Programming for NLP with Python, and ALI175, User Applications for NLP, are lab courses, and there will be no written examination for them.

Non-credit courses will be graded pass/fail only, and their assessment pattern may be decided by the course instructor.

Credit Requirements and Duration

The M.Sc. in Computational Linguistics is a 2-year full-time programme spread over 4 semesters. In any semester, 20 credits is the normal workload for coursework (excluding internships, dissertations, etc.). A student may register for extra credits, up to a maximum of 30 credits per semester, and must register for a minimum of 10 credits per semester. A total of 96 credits is to be earned across the 4 semesters, including 76 credits from the core courses, the project / internship and the dissertation.

In the first semester, there will be 1 core course from each of Groups B, C and D and 2 core courses from Group A. In the second semester, there will be 1 core course from each of Groups A, B, C and D. In the third semester, there will be 1 core course from each of Groups A, B and D. In the final semester, there will be 1 core course from each of Groups A and D. The remaining courses are electives and may be taken from any of the 4 groups or from outside the participating departments, subject to the minimum credit requirement mentioned above.

There will also be a Capstone Project (leading to the development of a real-life application) or an Internship of at least 1 month in industry or academia, to be completed across the first two semesters. Students will also need to complete a dissertation spanning the last 2 semesters of the programme (and submitted in the last semester). The report of the project, or of the work done during the internship, may be integrated into the dissertation submitted at the end of the programme. Electives throughout the 4 semesters may be chosen from those offered by the department as well as from any other department, subject to the condition that a minimum of 76 credits is earned from the departmental core / elective courses, the project / internship in the 3rd or the 4th semester, and the dissertation across the last two semesters. The remaining 20 credits may be earned from departmental electives, open electives (from any department in the University other than those jointly offering the programme) or a combination of these.
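
The credit rules above can be sketched as a simple check on a student's 4-semester plan. This is only an illustrative sketch: the function name, the plan format, and the assumption that the 10 to 30 credit registration limits apply to all credits registered in a semester are hypothetical and not taken from the University's regulations.

```python
# Illustrative sketch only: the plan format and how the limits are applied are assumptions.

TOTAL_REQUIRED = 96        # total credits across the programme
DEPARTMENTAL_MINIMUM = 76  # departmental core / electives, project / internship, dissertation
MIN_PER_SEMESTER = 10
MAX_PER_SEMESTER = 30
SEMESTERS = 4


def check_credit_plan(plan):
    """Return a list of problems found in a plan.

    `plan` is a list with one (departmental_credits, open_elective_credits)
    pair per semester. An empty list means no problem was found.
    """
    problems = []
    if len(plan) != SEMESTERS:
        problems.append(f"expected {SEMESTERS} semesters, got {len(plan)}")
    for number, (departmental, open_elective) in enumerate(plan, start=1):
        registered = departmental + open_elective
        if not MIN_PER_SEMESTER <= registered <= MAX_PER_SEMESTER:
            problems.append(
                f"semester {number}: {registered} credits is outside "
                f"{MIN_PER_SEMESTER}-{MAX_PER_SEMESTER}"
            )
    departmental_total = sum(departmental for departmental, _ in plan)
    total = departmental_total + sum(open_elective for _, open_elective in plan)
    if departmental_total < DEPARTMENTAL_MINIMUM:
        problems.append(
            f"only {departmental_total} departmental credits, need {DEPARTMENTAL_MINIMUM}"
        )
    if total < TOTAL_REQUIRED:
        problems.append(f"only {total} credits in total, need {TOTAL_REQUIRED}")
    return problems


# A plan with 76 departmental and 20 open elective credits (96 in total).
valid_plan = [(20, 4), (20, 4), (18, 6), (18, 6)]
print(check_credit_plan(valid_plan))
```

A plan that meets every rule returns an empty list; otherwise each entry in the list names the rule that was not met.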

Admission Requirements

Minimum Eligibility: A bachelor's degree in any discipline with 45% marks or equivalent (for admission to the 1st year)

Number of Seats: 10

Mode of Admission: Entrance Exam / Interview / Merit (as decided by the admission committee)

Fee Structure

Semester Fees**: Rs. 8,000 per semester

Examination Fees**: Rs. 2,070 per semester

Dissertation Fees**: Rs. 2,000 (one time)

**Subject to revision and updating by the relevant bodies of the University / state government / any other competent body