Technology Innovation Institute launches the world’s largest Arabic NLP model
Technology Innovation Institute (TII), a global research centre, has launched NOOR, the world’s largest Arabic natural language processing (NLP) model to date. The NOOR model carries out varied, cross-domain tasks simply from natural language instructions.
To build NOOR, researchers at TII designed an end-to-end pipeline for collecting high-quality data, including crawling, filtering, and curation at scale. TII’s specialists also built optimized services for extreme-scale distributed training and serving, in order to deliver applications with efficient inference and model specialization.
TII’s team of advanced researchers and specialists at its Artificial Intelligence (AI) Cross-Centre Unit joined forces on this initiative with LightOn, a technology company that unlocks extreme-scale machine intelligence for businesses, to revolutionize Arabic NLP models.
Prof. Mérouane Debbah, Chief Researcher, Digital Science Research Centre and AI Cross-Centre Unit, TII, said: “With NOOR, TII has expanded the scope of the modern standard Arabic model by leveraging know-how in large language models to build cross-disciplinary, cutting-edge expertise in this new generation of AI research.”
NOOR’s training dataset is the world’s largest high-quality cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information to significantly widen the applicability of the model.
Dr. Ebtesam Almazrouei, Director, AI Cross-Centre Unit, TII, said: “Large language models have taken the world of natural language processing by storm, and we are proud to introduce this cutting-edge model with 10 billion parameters, the world’s largest Arabic NLP model. The uniquely large Arabic dataset collected to train the model is the result of months of work that included scraping, filtering, and curating varied sources.”
Dr. Almazrouei pointed out that the NOOR model is based on the popular Transformer architecture. As a decoder-only model, similar in structure to GPT-3, it is designed for generative tasks, and its architecture has been updated to reflect recent developments in machine learning, including improved positional embeddings.
To help ensure quality at scale in the NOOR dataset, the TII team designed an automated filtering pipeline based on machine learning techniques. These tools identify text that resembles high-quality reference material and shield the model from exposure to spam content.
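For readers unfamiliar with the term, a decoder-only Transformer generates text one token at a time, with each position allowed to look only at itself and at earlier positions. The short Python sketch below illustrates that causal attention pattern in general terms. It is not taken from NOOR's code, and the function name causal_mask is a hypothetical chosen for illustration.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len matrix of booleans.

    Entry [i][j] is True when position i is allowed to attend to
    position j, which in a decoder-only model means j <= i.
    """
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


# Print the pattern for a 4-token sequence: "x" = visible, "." = hidden.
for row in causal_mask(4):
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

Each row shows what one token can see: the first token sees only itself, while the last sees the whole sequence so far. This is what lets such a model continue a prompt written as a natural language instruction.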
Leveraging state-of-the-art 3D parallelism, which combines data, tensor, and pipeline parallelism, NOOR was trained on a high-performance computing resource with 128 A100 GPUs, allowing computations to be distributed and the available hardware to be used efficiently.
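In 3D parallelism, the GPUs are arranged along three axes (tensor, pipeline, and data parallelism), and the three degrees multiply to the total GPU count. The announcement does not say how NOOR's 128 GPUs were divided, so the Python sketch below only enumerates the layouts that are arithmetically possible. The function name parallel_layouts and the cap of 8 on the tensor-parallel degree (a common choice when nodes hold 8 GPUs) are assumptions made for illustration.

```python
def parallel_layouts(world_size, max_tensor_parallel=8):
    """List (tensor, pipeline, data) degrees whose product is world_size.

    max_tensor_parallel is an illustrative cap, assuming tensor
    parallelism stays within a single 8-GPU node.
    """
    layouts = []
    for tensor in range(1, max_tensor_parallel + 1):
        if world_size % tensor:
            continue  # tensor degree must divide the GPU count
        remaining = world_size // tensor
        for pipeline in range(1, remaining + 1):
            if remaining % pipeline:
                continue  # pipeline degree must divide what is left
            layouts.append((tensor, pipeline, remaining // pipeline))
    return layouts


# Show every candidate split of 128 GPUs under the assumptions above.
for tensor, pipeline, data in parallel_layouts(128):
    print(f"tensor={tensor:>2}  pipeline={pipeline:>3}  data={data:>3}")
```

One of the candidates is 8 x 4 x 4: each model replica spans 32 GPUs (8-way tensor parallel across 4 pipeline stages), and 4 such replicas train on different batches in parallel. Which layout performs best in practice depends on model size, memory, and interconnect, none of which the article details.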
Dr. Almazrouei also noted that this was only the first step in TII’s efforts to contribute to the wider UAE Strategy for Artificial Intelligence, through supporting AI integration across key sectors of the economy.