IIT-Bombay leads push for India-centric AI

Jun 01, 2025 06:16 AM IST

Mumbai: Indian Institute of Technology (IIT) Bombay has released 16 new datasets on AIKosh, the central government’s platform that provides a repository of datasets to enable artificial intelligence (AI) innovation. This is a major step in developing AI that understands India’s linguistic and cultural landscape, said professor Ganesh Ramakrishnan of IIT Bombay. These datasets will support innovation and research in AI and machine learning (ML), especially in areas involving Indian languages, scripts, documents, media, and audiovisual content.

Indian Institute of Technology (IIT) Bombay has released 16 new datasets on AIKosh, the central government’s platform that provides a repository of datasets to enable artificial intelligence (AI) innovation (Praful Gangurde / HT Photo)

The effort is part of BharatGen, a multilingual large language model (LLM) initiative led by IIT Bombay and funded by the Department of Science and Technology. So far, BharatGen has contributed 16 India-centric datasets and launched 21 AI models on AIKosh. The initiative includes top institutions such as the International Institute of Information Technology in Hyderabad and the IITs of Kanpur, Mandi, Madras, Hyderabad, and Indore.

IIT Bombay’s datasets are designed to build a solid foundation for developing Indian AI tools and applications. These include over 218,000 sentences for improving digitisation of Sanskrit texts, audio-visual data on practical skills like upcycling discarded materials into toys and organic farming, English-Sanskrit translations with 53,000 sentences for modern prose, over 78 hours of Sanskrit audio for speech recognition, multilingual question-answer sets in 11 Indian languages, including Hindi and English, math word problems in Hindi and English for AI reasoning, and table detection datasets in 14 Indian languages.

The release also includes visual question answering models (systems capable of answering questions about an image), datasets to improve translation accuracy and recognise text in videos, a comprehensive overview of Indian Knowledge Systems (IKS), cross-lingual video and text retrieval in seven Indian languages (allowing AI to retrieve relevant information when the document is written in a different language from the query), and handwritten and printed text detection datasets.

These datasets and models are part of a broader effort by IIT Bombay and BharatGen to build sovereign AI models for India, aligned with the India AI Mission, a central government initiative that aims to build an ecosystem for AI innovation by enhancing data quality and facilitating access to computing resources. The team is not just fine-tuning existing models, but training new ones from scratch using Indian data. They are also building benchmarks to test these models for Indian use in conversation and education.

A major highlight of this initiative is the launch of ‘Param 1’, a bilingual foundational language model with 2.9 billion parameters. It supports both English and Hindi and has been trained on 36% Indic language data, significantly more than international models like Meta’s Llama, which had less than 0.01%.

“Pre-training (the initial stage of training a machine learning model on a large dataset) is an enormous undertaking and often a barrier for many. That’s why we’ve taken on this challenge,” said professor Ramakrishnan, who leads BharatGen. Developers can now fine-tune Param 1 to build Indic chatbots, copilots (virtual assistants for research), and knowledge systems. “We hope our efforts toward creating a sovereign Generative AI ecosystem and milestones such as the release of such LLM model checkpoints, serves as a foundation for India-specific solutions,” said professor Ramakrishnan.

Alongside Param 1, BharatGen has launched over 20 speech models across 19 Indian languages. These include speaker-adaptive text-to-speech (TTS) systems that can mimic a speaker’s voice in languages like Hindi, Tamil, Telugu, Marathi, and Bengali. Advanced speaker-conditioned TTS models and automatic speech recognition systems have also been developed to make voice-based applications more natural and inclusive.

“Our goal is not just to build AI models but to provide resources that startups and system integrators can leverage,” said professor Ramakrishnan.

Origin: Hindustan Times