Text Analytics and Transcription Technology for Quranic Arabic
Keywords:
tajwid, prosody, phonemic transcription, phrase boundary
Abstract
Natural Language Processing Working Together with Arabic and Islamic Studies is a two-year project funded by the UK Engineering and Physical Sciences Research Council (EPSRC) to study prosodic-syntactic mark-up in the Quran (Atwell et al. 2013). Tajwīd, or correct Quranic recitation, is very important in Islam. The original insight informing this project is that tajwīd mark-up in the Quran can be treated as additional text-based data for computational analysis. This mark-up is already incorporated into Quranic Arabic script; it identifies phrase boundaries of different strengths, as well as lengthened syllables that denote prosodically and semantically salient words. We have developed a grapheme-phoneme mapping scheme (Brierley et al. 2016), plus state-of-the-art software (Sawalha et al. 2014), for generating a stressed and syllabified phonemic transcription, or citation form, for each word in the entire text of the Quran, using the International Phonetic Alphabet (IPA). This canonical pronunciation tier for Classical Arabic has been informed and evaluated by Arabic linguists, tajwīd scholars, and phoneticians, and is published in an open-source Boundary-Annotated Quran corpus and machine learning dataset (ibid.). We use statistical techniques such as keyword extraction to explore semiotic relationships between sound and meaning in the Quran, invoking a Saussurean-type view of the sign as ‘...a bi-unity of expression and content...’ (Dickins 2007). Our investigation entails: (i) text data mining for statistically significant phonemes, syllables, words, and correlates of rhythmic juncture; and (ii) interpretation of results from interdisciplinary perspectives: Corpus Linguistics, tajwīd science, Arabic Linguistics, and Phonetics and Phonology.
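As an illustration of the kind of keyword extraction mentioned above, the following minimal sketch scores items (phonemes, syllables, or words) by how strongly they are over-represented in a study text relative to a reference text. It assumes the log-likelihood (G2) keyness statistic that is common in Corpus Linguistics; the abstract does not state which statistic the project uses. The function names, the 3.84 cut-off (the chi-square critical value for p < 0.05 with one degree of freedom), and the toy syllable lists are all hypothetical and are not drawn from the Boundary-Annotated Quran corpus.

```python
import math
from collections import Counter


def log_likelihood(a, b, n1, n2):
    """G2 keyness for an item seen `a` times in a study text of `n1` tokens
    and `b` times in a reference text of `n2` tokens."""
    # Expected frequencies if the item were equally likely in both texts.
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    g2 = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


def keywords(study, reference, threshold=3.84):
    """Return (item, G2) pairs over-represented in `study`, highest score first.

    `threshold` is an assumed cut-off, not a value taken from the project.
    """
    study_counts, ref_counts = Counter(study), Counter(reference)
    n1, n2 = sum(study_counts.values()), sum(ref_counts.values())
    scored = []
    for item, a in study_counts.items():
        b = ref_counts.get(item, 0)
        # Keep only items that are relatively more frequent in the study text.
        if a / n1 > b / n2:
            g2 = log_likelihood(a, b, n1, n2)
            if g2 >= threshold:
                scored.append((item, g2))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy, made-up syllable sequences standing in for two sections of a text.
study_section = ["la"] * 30 + ["mu"] * 10 + ["ka"] * 10
reference_section = ["la"] * 10 + ["mu"] * 20 + ["ka"] * 20
result = keywords(study_section, reference_section)
print(result)
```

In this toy input only the syllable "la" is flagged as key, because it is the only item that is relatively more frequent in the study section than in the reference section. The same scoring could in principle be applied to any unit of the transcription tier, provided both texts are tokenised in the same way.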