In cooperation with institutes in Arabic-speaking countries, including AASTMT, BA, and IIAI, the cognitive model and data research team of the Beijing Academy of Artificial Intelligence (BAAI) has published ArabicText 2022, the world's largest open-source Arabic text dataset for pre-training language models.
By collecting, aggregating, and cleaning publicly available Arabic web data, we obtained a high-quality text dataset of more than 200GB, the largest in the open-source community worldwide. For data cleaning, we applied and optimized WudaoCleaner, an efficient and effective web text cleaning tool proven on WuDaoCorpora. We also integrated the open-source Arabic text cleaning toolkit ArabertProcessor into the cleaning pipeline to ensure language-specific data quality. Moreover, informative data such as news and encyclopedia articles accounts for more than 65% of the dataset, which suggests that language models can readily acquire prior knowledge from our corpus.
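The general shape of such a cleaning pass can be sketched in a few lines of standard-library Python. This is only an illustration and not the actual WudaoCleaner or ArabertProcessor code: the function names (`normalize`, `arabic_ratio`, `clean_corpus`), the thresholds, and the choice of steps (Unicode normalization, tatweel removal, whitespace collapsing, a minimum-length and Arabic-ratio filter, exact deduplication) are all assumptions made for the example.

```python
import hashlib
import re
import unicodedata

# Basic Arabic Unicode block; hypothetical choice for this sketch.
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF]")
TATWEEL = "\u0640"  # elongation character, often used purely for styling
WHITESPACE = re.compile(r"\s+")


def normalize(text):
    """Apply NFKC normalization, drop tatweel, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.replace(TATWEEL, "")
    return WHITESPACE.sub(" ", text).strip()


def arabic_ratio(text):
    """Return the fraction of alphabetic characters that are Arabic."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(1 for c in letters if ARABIC_CHAR.match(c)) / len(letters)


def clean_corpus(docs, min_chars=20, min_arabic_ratio=0.5):
    """Normalize documents, filter short or non-Arabic ones, drop exact duplicates.

    The thresholds are illustrative defaults, not values from the dataset.
    """
    seen = set()
    kept = []
    for doc in docs:
        text = normalize(doc)
        if len(text) < min_chars or arabic_ratio(text) < min_arabic_ratio:
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        kept.append(text)
    return kept
```

Calling `clean_corpus(raw_documents)` on a list of strings returns the normalized documents that pass the filters, in their original order. A production pipeline would typically add near-duplicate detection and quality scoring on top of these steps.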
Based on ArabicText 2022, we trained the Arabic Language Model (ALM), which has been released on GitHub.