What The Future Of Translation Tech Means For The Basque Language

Published 16 hours ago. 9-minute read.

People wearing colorful vests and waving flags

The 2024 Korrika, a biennial relay race celebrating the Basque language. (Photo by Unanue)

Europa Press via Getty Images

In a warehouse-like space on a narrow island in Bilbao, Spain, linguists and technologists are testing the possibilities of automated translation. Their projects include antispoofing work to better detect and combat synthetic voices, which are now highly sophisticated; vocal analysis of calls to potentially identify early signs of neurological disorders; and a limited set of speech commands in elevators, which may be especially useful to people with disabilities.

This is the Bilbao base of Vicomtech, a nonprofit research foundation focused on technology. Its funders include private companies and four layers of government (provincial, regional, national, and European). The strong influence of local governments, in particular, is a common theme across both language-revitalization and technology-development projects in the Basque Country.

An automated translation program that Vicomtech worked on, Itzuli, is used for 300,000 translations a day, according to the organization. Itzuli is embedded in a government website, where it allows general translation between Basque and Spanish, French, and English. It also offers formal translation, appropriate for legal language, between Basque and Spanish. And the developers are working to add an offering specific to the Bizkaian dialect of Basque.

However, Itzuli is still less well known than Google Translate, which remains convenient for many Basque Country businesses, even if it’s not quite as sophisticated. (Google did not respond to a request for comment regarding Google Translate and Basque.)

Basque (euskara), a language spoken in parts of northern Spain and southern France, is unusual for several reasons. Most languages spoken in Europe are Indo-European, but many linguists believe that Basque predates them. It’s now essentially unique in Western Europe.

While many minority languages in Europe are dwindling, Basque is bucking the trend. Over 1 million people can now speak or understand it. Some of the numbers are dramatic. For instance, in 1997–98, 40% of students in the Basque Autonomous Community (BAC) of northern Spain chose to take their university entrance exams in Basque rather than Spanish. By 2018–19 this had shot up to over 70%, according to Euskararen Etxea, a museum and cultural center dedicated to the Basque language.

This points to another unusual feature of Basque: it’s a young language. In contrast, many minority languages remain the preserve of the oldest community members. In the BAC, 22% of residents older than 70 speak Basque, a figure dwarfed by the over 90% of 10–14-year-olds who do.

However, while Basque has grown significantly as a language of education and culture, it is not yet spoken casually to the same degree. “Basque is a young language because it is children and young people who use it most, and that includes use on the street,” according to Euskararen Etxea.

Euskararen Etxea.

Christine Ro

Also, the expansion of Basque has been uneven. It is declining in the French Basque Country, though overall Basque punches above its weight in terms of representation. For example, there are about the same number of active users for the Basque and Uzbek versions of Wikipedia, although Uzbekistan has roughly 18 times the population.

Basque has had a tumultuous recent history. It was banned under the Spanish dictatorship of Francisco Franco, which began in 1936. In the decades that followed, the Basque nationalist group Euskadi Ta Askatasuna (Eta) killed over 800 people while agitating, among other things, for protection of the Basque language. In the Basque regions, language battles have been closely intertwined with tensions, sometimes violent, over identity and power. Controversies have continued over, for example, proposed Basque language requirements for some public jobs.

It wasn’t until 1968 that a standardized version (euskara batua) was created. Language enthusiasts have embraced new technologies such as video games for keeping Basque alive. Now, digitizing Basque is part of the regional government’s drive to both safeguard the language and to invest heavily in technology.

This is symbolized by Zorrotzaurre, the artificial island housing Vicomtech’s Bilbao office. Construction is occurring all over this formerly run-down strip of land, which many industrial companies abandoned after the 1980s. The island still appears modest, but two international starchitects have left their fingerprints on it. Zorrotzaurre’s master plan was drawn up by Zaha Hadid, and the island is connected to the mainland by Frank Gehry Bridge (Frank Gehry Zubia), named for the architect whose Guggenheim Museum design was a controversial and expensive gamble that has hugely paid off. Now, Vicomtech associate director Jorge Posada says of the authorities’ plans, “they want to create a kind of Guggenheim effect” for Zorrotzaurre as well.

The technology has advanced faster than some people’s desire to incorporate it.

A logical source of Basque-language content for tech developers, including Vicomtech, is the public broadcaster, Euskal Irrati Telebista (EITB). EITB has five TV channels, of which two are fully in Basque, and six radio stations, with two of them exclusively in Basque. “As a public service, it is one of our big goals” to preserve the Basque language, says Igor Jainaga Irastorza, the chief technology officer for EITB. “It’s one of our foundational basics.”

So far, the broadcaster is taking a cautious approach to AI-based translation technologies, with automatic transcription being the first critical step. Jainaga has seen much improvement in the services over the last few years. He calls them “good enough for being helpful,” especially for general purposes or non-native speakers. But overall, “we are going slowly with these [AI-based] services, because what we see is that if technology is not mature enough, it can introduce noise in the production processes.”

The broadcaster hasn’t set a specific accuracy threshold that the tools need to reach; “it’s best effort,” Jainaga reports. It’s particularly important to avoid language-based errors in certain types of content: “If it’s an entertainment program, maybe it’s not as critical as if it’s a news program.”

That balance of caution and context means that EITB allows different levels of AI-powered translation for different types of programming. As Jainaga says, “We have a big mixture of some of the programs being transcribed by humans, some with automatic processes and some with automatic transcription with human checks, mainly with the products that are coming from outside.”

More specifically, for some of EITB’s news programs, the automatic transcription of subtitles may be supervised by humans. Some online broadcasts have automatic transcription with human checks, but not automatic translation. The audio platform Guau has automatic transcription and translation. And the recently launched news site Orain allows automatic translation into Spanish, English, and French (using Itzuli).

Itzuli interface on the euskadi.eus website.

Christine Ro

All of this needs localization into Basque. In weather forecasts, repeated weather-related terms may be easy to automate with 100% accuracy. But AI models may need to be trained to accurately reproduce names of athletes and small towns, for instance. “If you are giving that service to the people of the Basque Country, what they expect is that the names of the towns or local people are properly spelled,” Jainaga says.

One theme that has emerged from the creation of AI language tools for this small language is the importance of quality over quantity in amassing data and developing models. Jainaga comments, “Big companies or other developers can…eat all the info on the internet available,” potentially without obtaining rights. “With minority languages, we have less information, so the only thing that we can do from our point is to have good-quality data.”

An organization currently working on collecting high-quality language data is Euskorpora, a young nonprofit whose partners include government departments, private companies, and language institutes. (EITB and Vicomtech are also partners.) Euskorpora’s flagship project is the Basque Language Digital Corpus, a collection of audio, text, and video samples of Basque from varied settings, with different language varieties represented over time. The intention is for this corpus to be available to anyone who wants to use it, though likely with some sort of payment structure for commercial uses.

This type of corpus is needed, according to Leire Barañano Orbe, Euskorpora’s general manager, because other Basque corpora for training machine learning models have focused on research or academic exploration. She believes that “this distinction is crucial, as research-oriented projects often prioritize innovation and theoretical advancements, while commercial efforts aim to create practical, user-ready tools.”

Another difference with the Basque Language Digital Corpus is that Euskorpora is spending a lot of time and care on making sure that they have all the legal permissions for the content they would like to incorporate. In contrast, some other datasets for machine learning models may have murky origins. For instance, it’s challenging to gather enough spontaneous snippets of audio and video. So Euskorpora is looking into using audio from call centers—though this would require careful consideration to ensure that all such data is anonymous, with no identifying details captured.

Audio is also a challenge for Vicomtech. It can be hard to capture good-quality audio from real-world recordings on the street, or to refine speech recognition in noisy environments like elevators or factory floors. For the moment, direct speech-to-speech translation is not mature enough, according to Arantza Del Pozo, head of speech and language technologies at Vicomtech. And there is a “concatenation of errors” when AI systems translate between speech and text, she says.

The quality-over-quantity approach means that Basque language tools won’t be the biggest. Nor will they be the quickest, given the European Union’s more careful approach to regulating AI, compared to the U.S. and China. Vicomtech isn’t looking to be the fastest or the first, Posada says.

Another gap in recorded spoken language is in specialized areas like law and engineering, where there may not be many media samples using this type of specific language. So for such areas, Euskorpora is considering using some proportion of synthetic data to supplement the real-world data. There again, care would be needed to avoid distorting the datasets.

Like just about everyone working on Basque language tools, Barañano of Euskorpora wants to ensure the vitality of the language. She believes that the main European languages have been very strong in terms of digital transformation, but there has been a large and widening gap for other languages.

Closing that gap means tapping into not only government resources, but also larger networks of collaboration and support. For a language fighting for survival, no one organization can go it alone. Barañano believes that “this collective effort can advance both the preservation and modernization of a minority language in an increasingly digital world.”

Reporting for this story was supported by a press trip organized by the Provincial Council of Bizkaia.

Origin: Forbes