Indonesia has more than 700 languages. Can AI save them?
Abstract
The article discusses the threat of extinction facing hundreds of regional languages and dialects in Indonesia, and how the government is turning to artificial intelligence (AI) and large language models (LLMs) to help preserve these endangered languages.
Q&A
[01] Growing up in Indonesia and the threat to the Using language
1. What is the Using language, and how did Antariksawan Jusuf's experience with it change over time?
- Antariksawan Jusuf grew up speaking Using with his family and friends in Banyuwangi, a regency in the Indonesian province of East Java.
- Only when he went to university in Bali, where he had to speak the national language, Bahasa Indonesia, did he realize that Using was in danger of dying out.
- Antariksawan says "Using is threatened by modernization" and that "a lot of parents now prefer Bahasa Indonesia when they communicate with their children."
2. What is the status of regional languages and dialects in Indonesia overall?
- Indonesia has more than 700 regional languages and nearly 800 dialects across its vast archipelago.
- However, more than 400 dialects are at risk of becoming extinct by the end of the 21st century, according to researchers.
[02] The role of AI and LLMs in preserving endangered languages
1. How are large language models (LLMs) typically trained, and what are the challenges for low-resource languages?
- Popular LLMs like GPT, Gemini, and Llama are largely trained on English data, excluding billions of people who speak other languages.
- For low-resource languages, which may be widely spoken but have little data online, there are concerns about whether the available data accurately represents the cultures of their speakers.
2. What is the Indonesian government doing to address this challenge?
- The government has turned to AI technology and LLMs to help preserve regional languages and make them more accessible.
- It is working to build its own multilingual LLMs for low-resource and endangered languages.
3. What are some examples of LLM initiatives in Indonesia?
- Yellow.AI launched Komodo-7B, an LLM trained on Bahasa Indonesia and 11 other regional languages.
- Singaporean startup Wiz.AI launched an LLM for Bahasa Indonesia. The SEA-LION family of open-source LLMs is also trained on Bahasa Indonesia and other Southeast Asian languages.
- Indosat Ooredoo Hutchison is developing Garuda LLM, which can be applied across industries while preserving Bahasa Indonesia and its dialects.
[03] Challenges and opportunities in preserving regional languages
1. What are the key challenges in digitizing and preserving regional languages in Indonesia?
- There is a paucity of high-quality data, including books, media, academic papers, and code repositories, for most regional languages.
- There are concerns about data sources and potential biases, especially in a country with rampant censorship and government control of information.
2. What are the efforts being made to address these challenges?
- Antariksawan Jusuf helped publish a Bahasa Indonesia-Using dictionary and has written a novel in the two languages.
- He has also set up a collective in Banyuwangi to preserve the Using language and culture, publishing short stories, novels, and videos.
- The collective is working with the Banyuwangi regional library to digitize the literature and make it more accessible.
3. What is the potential impact of AI and LLMs in preserving regional languages in Indonesia?
- Antariksawan is hopeful that AI technology and LLMs can help preserve the Using language and make it more accessible to the younger generation.
- The government and tech companies see the potential in reaching Indonesia's 275 million people through multilingual LLMs that can capture linguistic nuances and cultural contexts.