Machine learning improves Arabic speech transcription capabilities

With advances in speech and natural language processing, there is hope that one day you will be able to ask your virtual assistant what the best salad ingredients are. Already, you can ask a home gadget to play music or to open on voice command, a feature found in many devices.

If you speak Moroccan, Algerian, Egyptian, Sudanese, or any other dialect of Arabic, it is a different story: the dialects vary enormously from region to region, and some are mutually unintelligible. And if your native tongue is Arabic, Finnish, Mongolian, Navajo, or any other language with a high level of morphological complexity, you may feel left out.

These complex constructions intrigued Ahmed Ali and drove him to look for a solution. He is a senior engineer in the Arabic Language Technologies group at the Qatar Computing Research Institute (QCRI), part of Qatar Foundation’s Hamad Bin Khalifa University, and the founder of ArabicSpeech, a “community that exists for the benefit of Arabic speech science and speech technologies.”


Ali became captivated by the idea of talking to cars, appliances, and gadgets many years ago while working at IBM. “Can we build a machine that can understand different dialects: an Egyptian pediatrician automating a prescription, a Syrian teacher helping children understand the essentials of their lesson, or a Moroccan chef describing the best couscous recipe?” he says. However, the algorithms that power these machines cannot sift through the roughly 30 varieties of Arabic, let alone make sense of them. Today, most speech recognition tools work only in English and a handful of other languages.

The coronavirus pandemic has further fueled an already growing reliance on voice technologies, as natural language processing has helped people comply with stay-at-home guidelines and physical distancing measures. And while we already use voice commands to help with online shopping and to manage our households, the future holds even more applications.

Millions of people around the world use massive open online courses (MOOCs) for their open access and unlimited participation. Speech recognition is one of the main features of MOOCs: students can search for specific topics within the spoken content of a course and turn on translations via subtitles. Speech technology also makes it possible to digitize lectures, displaying spoken words as text in university classrooms.


According to a recent article in Speech Technology magazine, the voice and speech recognition market is expected to reach $26.8 billion by 2025, as millions of consumers and businesses around the world come to rely on voice bots not only to interact with their devices or cars but also to improve customer service, drive health-care innovations, and improve accessibility and inclusion for people with hearing or speech disabilities.

In a 2019 survey, Capgemini predicted that by 2022 more than two in three consumers would opt for voice assistants over visits to stores or bank branches, a share that could well increase given the home-based, physically distanced life and commerce that the pandemic has imposed on the world for more than a year and a half.

Nonetheless, these devices fail to serve large parts of the globe. For the millions of people who speak those 30 varieties of Arabic, that is a largely missed opportunity.

Arabic for machines

Even English- and French-speaking voice bots are far from perfect. Yet teaching machines to understand Arabic is particularly tricky, for several reasons. Here are three commonly recognized challenges:

  1. Lack of diacritics. Arabic dialects are vernacular, that is, primarily spoken. Most of the available text is written without diacritics, meaning it lacks accents such as the acute (´) or grave (`) that indicate the sound values of letters. As a result, it is difficult to determine where the vowels go.
  2. Lack of resources. There is a dearth of labeled data for the various Arabic dialects. Collectively, they lack the standardized orthographic rules that dictate how to write a language, including norms for spelling, hyphenation, and emphasis. These resources are crucial for training computer models, and their scarcity has hampered the development of Arabic speech recognition.
  3. Morphological complexity. Arabic speakers engage in a lot of code switching. For example, in areas colonized by the French (the North African countries of Morocco, Algeria, and Tunisia), the dialects include many borrowed French words. Consequently, there is a high number of so-called out-of-vocabulary words, which speech recognition technologies cannot make sense of because these words are not Arabic.
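
To make the first and third challenges concrete, here is a minimal Python sketch. It is not QCRI's code. It assumes that the vowel marks in question are the Unicode code points U+064B through U+0652, and the helper names `strip_diacritics` and `oov_rate`, along with the toy vocabulary, are invented for illustration. The first part shows two different Arabic words becoming identical once their diacritics are dropped; the second measures what share of an utterance falls outside a recognizer's vocabulary.

```python
# Illustrative sketch only; names and toy data are assumptions, not QCRI code.

# Arabic short-vowel and related marks, U+064B (fathatan) to U+0652 (sukun).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, mimicking how most text is written."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def oov_rate(tokens, vocabulary) -> float:
    """Fraction of tokens that are missing from the recognizer's vocabulary."""
    if not tokens:
        return 0.0
    unknown = sum(1 for tok in tokens if tok not in vocabulary)
    return unknown / len(tokens)


# Two different words, written with escapes so the code points are explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "kataba": he wrote
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # "kutub": books

# Without diacritics both collapse to the same three letters (kaf, ta, ba).
print(strip_diacritics(KATABA) == strip_diacritics(KUTUB))  # True

# Hypothetical vocabulary and utterance (Latin placeholders for readability).
vocabulary = {"w1", "w2", "w3"}
utterance = ["w1", "w2", "borrowed", "w3"]
print(oov_rate(utterance, vocabulary))  # 0.25
```

A recognizer trained on undiacritized text has to resolve that kind of ambiguity from context alone, and every borrowed word missing from its vocabulary counts against it.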

“But the field is moving at breakneck speed,” says Ali. Moving it forward even faster is a collaborative effort among many researchers. Ali’s Arabic Language Technologies lab is leading the ArabicSpeech project to bring together Arabic translations with the dialects native to each region. For example, Arabic dialects can be divided into four regional groups: North African, Egyptian, Gulf, and Levantine. However, since dialects do not respect borders, the division can go as fine as one dialect per city; a native Egyptian speaker, for example, can tell his Alexandrian dialect apart from that of a fellow citizen from Aswan (a distance of 1,000 kilometers on the map).

Building a tech-savvy future for all

At this point, machines are about as accurate as human transcribers, thanks in large part to advances in deep neural networks, a subfield of machine learning in artificial intelligence that relies on algorithms inspired by how the human brain works, biologically and functionally. Until recently, however, speech recognition systems were somewhat cobbled together. The technology has a history of relying on separate modules for acoustic modeling, building pronunciation lexicons, and language modeling, each of which must be trained separately. More recently, researchers have trained models that convert acoustic features directly into text transcriptions, potentially optimizing all parts for the end task.

Even with these advancements, Ali still cannot give voice commands to most devices in his native Arabic. “It’s 2021 and I still can’t talk to a lot of machines in my dialect,” he comments. “I mean, now I have a device that can understand my English, but automatic multi-dialect Arabic speech recognition has not yet taken place.”

Making this happen is the goal of Ali’s work, which has produced the first transformer for recognizing Arabic speech and its dialects, one that has achieved unmatched performance so far. Dubbed the QCRI Advanced Transcription System, the technology is currently used by the broadcasters Al Jazeera, DW, and BBC to transcribe online content.

There are several reasons why Ali and his team have succeeded in building these speech engines now. Mainly, he says, “It is necessary to have resources in all dialects. We need to develop the resources so that we can then train the model.” Advances in computer processing mean that computationally intensive machine learning now runs on graphics processing units, which can rapidly process and display complex graphics. As Ali says, “We have great architecture, great modules, and we have data that represents reality.”

Researchers from QCRI and Kanari AI recently built models that can achieve human parity on Arabic broadcast news. The system demonstrates the impact of subtitling Al Jazeera’s daily reports. While the human error rate (HER) for English is about 5.6%, the research found that the HER for Arabic is significantly higher and can reach 10%, owing to the morphological complexity of the language and the lack of standard spelling rules in dialectal Arabic. Thanks to recent advances in deep learning and end-to-end architectures, the Arabic speech recognition engine manages to outperform native speakers on broadcast news.
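
The article does not spell out how these error rates are computed. A common convention in speech recognition, assumed here, is the word error rate: the number of word substitutions, deletions, and insertions needed to turn a transcript into the reference, divided by the number of words in the reference. The sketch below uses a hypothetical function name and made-up English sentences, and says nothing about how the QCRI study was actually scored.

```python
# Illustrative sketch only; assumes the error rates are word error rates.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word and one dropped word out of six.
reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(f"{word_error_rate(reference, hypothesis):.1%}")
```

Under this convention, an error rate of 10% means roughly one word in ten differs from the reference transcript.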

While Modern Standard Arabic speech recognition seems to work well, researchers from QCRI and Kanari AI are focused on testing the limits of dialect processing, and they are achieving excellent results. Since nobody speaks Modern Standard Arabic at home, attention to dialect is what we need for our voice assistants to understand us.

This content was written by the Qatar Computing Research Institute, Hamad Bin Khalifa University, a member of Qatar Foundation. It was not written by MIT Technology Review’s editorial staff.
