  • Using advanced algorithms to uncover large-scale patterns in ancient texts, researchers are gaining insight into how ideas and memory were shaped in the 700-1500 CE period.
Professor Bowen Savant: Using machine learning on ancient texts

ISMC’s project has made two major advances during the research period.

Professor Sarah Bowen Savant, of AKU’s Institute for the Study of Muslim Civilisations, is using advanced algorithms to uncover large-scale patterns in ancient texts. She leads the KITAB—Knowledge, Information Technology and the Arabic Book—project, which has been created by an international team of experts in IT, history and the Arabic language.






KITAB acts like an online toolbox that sheds light on how ideas and memory were shaped in the 700-1500 CE period. The application of advanced technology and the project’s open access nature align with efforts to promote knowledge sharing and partnerships towards achieving the Sustainable Development Goals (SDGs).

The team has now completed the first release of its entire corpus through Zenodo, an online platform that supports open access to research.

The corpus features 1,859 authors and 4,288 titles totalling 755,689,541 words. If multiple versions of the same title are counted, there are 7,144 titles totalling 1,520,667,360 words. The texts are part of OpenITI, the Open Islamicate Texts Initiative, a multi-institutional effort to construct the first machine-actionable scholarly corpus of pre-modern Islamicate texts in multiple languages, including Arabic, Persian, Ottoman Turkish and Urdu. The initiative seeks to encourage computational analysis of these written traditions, with KITAB being a major contributor of Arabic texts.

To date, most of the Arabic texts have been drawn from open-access online collections of pre-modern and modern Arabic texts. The team and its partners are currently annotating them.

All major versions of the corpus, as well as analytical datasets generated from the corpus using different methods, will continue to be published on Zenodo as part of the project’s commitment to open access.


As part of the effort, the team is working with the Qatar National Library to create an online corpus and digital research pipeline for the Sira of Ibn Ishaq (d. 767). The Sira is an important and exemplary case of a dispersed text within the early Arabic tradition. No single, original, complete text survives today. Instead, there are multiple versions, in fragmentary form, scattered within hundreds of other books from the 9th century to early modern times. Well-known witnesses to the text include the commentary of Ibn Hisham (d. 828), which contains two of the original four parts but is often mistakenly referred to as the complete Sira of Ibn Ishaq.

Members of the team are also seeking to improve alignments between texts that contain sections from the Sira and to make them available for online study. Possible research questions relate to the manner of production, transmission and circulation of texts from the period of Ibn Ishaq’s lifetime to the present.

The digital research pipeline relies on innovations in optical character recognition, text reuse detection, data modelling and data visualisation to shed light on this important text.
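
To give a rough sense of what text reuse detection involves, the sketch below compares two passages by the overlapping runs of words they share. It is an illustration only, not the KITAB pipeline: the function names (`shingles`, `jaccard`, `reuse_score`), the choice of three-word runs and the sample sentences are all assumptions made for this example, and real work on Arabic texts would also need text normalisation and alignment steps that are not shown here.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping n-word runs ("shingles") in a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Share of shingles common to both sets, from 0.0 (none) to 1.0 (all)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def reuse_score(text_a, text_b, n=3):
    """Score how much two passages overlap, based on shared n-word runs."""
    return jaccard(shingles(text_a, n), shingles(text_b, n))


# Invented sample passages (hypothetical, not taken from the corpus).
earlier = "the caravan left the city at dawn and travelled north"
later = "it is reported that the caravan left the city at dawn"

print(reuse_score(earlier, later))
```

A higher score suggests that one passage may quote or rework the other. Scores like this only flag candidate pairs; deciding how a passage was actually transmitted remains a question for the historian.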