Corpus Linguistics for Understanding the Quran Eric Atwell, Kais Dukes, Nora Abbas, Abdul-Baquee Muhammad I-AIBS Institute for Artificial Intelligence and Biological Systems School of Computing University of Leeds

Corpus Linguistics for Understanding the Quran

Eric Atwell, Kais Dukes, Nora Abbas, Abdul-Baquee Muhammad

I-AIBS Institute for Artificial Intelligence and Biological Systems, School of Computing, University of Leeds

The Challenge: An interdisciplinary approach to understanding the Quran

(1) What is the Quran?

Holy Book                  Prophet            Text Dated
Suhuf Ibrahim (Scrolls)    Abraham            ?
The Tawrat (Torah)         Moses              1500 BCE?
The Zabur (Psalms)         David              1000 BCE?
The Injil (Gospel)         Jesus              1 CE
The Quran                  Muhammad (PBUH)    610-632 CE

The last in a series of 5 religious texts

(1) What is the Quran?

- Classical Arabic

- Islamic Law (legal logic)

- Divine guidance & direction

- Scientific & philosophical knowledge

- Has inspired many scientific achievements, e.g. Algebra and linguistics

The central religious text of Islam

(2) Traditional Arabic Linguistics

Originated with Arabs studying the language of the Quran (detailed analysis for at least 1000 years):

- Orthography (diacritics and vowelization)

- Etymology (Semitic roots)

- Morphology (derivation and inflection)

- Syntax (origins of dependency grammar)

- Discourse Analysis & Rhetoric

- Semantics & Pragmatics

(3) Computational Linguistics

Current use of computing to analyze the Quran is mostly…

- Keyword search (useful)

- Frequency analysis (numerology?)

Where are we now?

(3) Computational Linguistics

Example question-answering dialog system:

Question: How long should I breastfeed my child for?

Answer: Mothers should suckle their offspring for two years, if the father wishes to complete the term (The Holy Quran, Verse 2:233).

- How far can we go?

- Is an artificial intelligence system realistic?

An AI approach to understanding the Quran

Central Hypothesis: Augmenting the text of the Quran with rich annotation will lead to a more accurate AI system.

- Prepare the data by annotating the Quran.

- Use the data to build an AI system for concept search and question-answering.

Annotating the Quran: Challenges

Orthography - Complex script verified in Unicode?

Morphology - Arabic is highly inflected and this is challenging to model by computer

Syntax - Phrase structure or dependency grammar?

Semantics - Lexical semantics, ontology, logic, lexical frames?

Annotating the Quran: Solutions

- Recent computational advances have made it possible to annotate the Quran to very high accuracy

- Community effort using volunteers

- Leverage existing resources from Traditional Arabic Grammar

- Automatic annotation followed by manual verification

Recent Advances: Orthography

Google Search for verse (68:38) on Jan 21, 2008 shows many typos

Does an accurate digital copy of the Quran exist?

Encoding Issues

- Missing diacritics

- Simplified script (not Uthmani)

- Windows code page 1256, not Unicode
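The encoding issues listed above can be illustrated with a minimal Python sketch. It relies only on Python's built-in `cp1256` codec and the `unicodedata` module; the sample word and the helper names `strip_diacritics` and `fits_cp1256` are illustrative assumptions, not part of any tool named in these slides.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop combining marks such as the Arabic short-vowel signs."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))

def fits_cp1256(text: str) -> bool:
    """Report whether every character exists in Windows code page 1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False

# Illustrative sample: kaf, ta, ba, each followed by a fatha (U+064E).
vowelized = "\u0643\u064E\u062A\u064E\u0628\u064E"

# A copy with "missing diacritics": same letters, no vowel marks.
bare = strip_diacritics(vowelized)

# Legacy bytes decode cleanly to Unicode when the code page is known.
round_trip = vowelized.encode("cp1256").decode("cp1256")

# U+0670 (superscript alef), used in Uthmani script, has no cp1256 byte.
uthmani_mark = "\u0670"
print(fits_cp1256(vowelized), fits_cp1256(uthmani_mark))
```

The last check suggests why a simplified script in a legacy code page cannot carry the full Uthmani orthography, while Unicode can.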

Recent Advances: Orthography

Tanzil Project (http://tanzil.info)

- Stable version released May 2008

- Uses Unicode XML encoding, including the special characters designed for the complex Arabic script of the Quran

- Manually verified to 100% accuracy by a group of experts who have memorized the entire text of the Quran

Recent Advances: Orthography

Java Quran API (http://jqurantree.org)

March 2009

- Java classes for querying the Tanzil XML of the Quran

- First step towards software package for analyzing the Quran
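As a rough illustration of what querying an XML copy of the text involves, here is a short Python sketch (not the Java API itself). The element and attribute names (`quran`, `sura`, `aya`, `index`, `text`) are an assumption about the XML layout, and the verse texts are placeholders rather than Quranic text.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in an assumed Tanzil-like layout; texts are placeholders.
SAMPLE_XML = """<quran>
  <sura index="1" name="sample-sura-1">
    <aya index="1" text="placeholder text A"/>
    <aya index="2" text="placeholder text B"/>
  </sura>
  <sura index="2" name="sample-sura-2">
    <aya index="1" text="placeholder text C"/>
  </sura>
</quran>"""

def load_verses(xml_text: str) -> dict:
    """Map (chapter, verse) pairs to the verse text."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses

verses = load_verses(SAMPLE_XML)
print(verses[(1, 2)])
```

Keying verses by (chapter, verse) mirrors the Chapter:Verse referencing convention used throughout the slides.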

Recent Advances: Morphology

- Buckwalter Arabic Morphological Analyzer (2002)

- Morphological Analysis of the Quran at the University of Haifa, Israel (2004)

- Lexeme & feature based morphological representation of Arabic (Nizar Habash, 2006)

The Haifa Corpus (2004)

Multiple analyses for each word (up to 5):

rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg
rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen

Not a manually verified corpus

Authors report an F-measure of 86%

Non-standard annotation scheme not familiar to traditional Arabic linguists (e.g. extracting a list of all verbs in the corpus is non-trivial)

Arabic text is only encoded phonetically instead of using the original Arabic. Searching for the possible morphological analyses for a specific word is not easy
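To show why such analysis strings take extra work to query, here is a small Python sketch that splits the two strings quoted above into fields. The field order (root, pattern, part of speech, then features) is an assumption inferred from those two examples only, not a documented specification of the Haifa format.

```python
def parse_analysis(analysis: str) -> dict:
    """Split a '+'-separated analysis into assumed fields."""
    fields = analysis.split("+")
    root, pattern, pos = fields[:3]
    return {"root": root, "pattern": pattern, "pos": pos, "features": fields[3:]}

# The two alternative analyses quoted on the slide for a single word.
analyses = [
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Pron+Dependent+1P+Sg",
    "rbb+fa&l+Noun+Triptotic+Masc+Sg+Gen",
]

parsed = [parse_analysis(a) for a in analyses]

# Filtering by part of speech has to go through every candidate analysis.
noun_readings = [p for p in parsed if p["pos"] == "Noun"]
print(len(noun_readings), parsed[1]["features"])
```

Because each word carries several unverified candidates, a query such as "list all verbs" would have to decide which candidate to trust for every word.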

The Quranic Arabic Corpus

http://corpus.quran.com

- Manually verified (99% accuracy)

- Popular website with very positive feedback

- million(s) of visitors

1. Initial tagging using Buckwalter Analyzer

2. Paid annotator working for 3 months

3. Community of volunteers verifying against existing books of Traditional Arabic Grammar which analyse the Quran

Shows Arabic and English morphological analysis side-by-side, with phonetic transcription, search and translation.

The Quranic Arabic Corpus http://corpus.quran.com/

• Kais Dukes: Arabic Language Computing Applied to the Quran, PhD (part-time)

An open-source online focus for linguistic research on Classical Arabic:

- Morphology - each word shows colour-coded morphological analysis

- Syntax - each verse shows a dependency parse following Arabic tradition

- Semantics - entities and concepts are linked to an ontology

- Translation - word-for-word English translations to aid understanding

- Machine Learning - annotations provide training data for a parser

- Impact on society - dozens of researchers collaborated on the analysis and over a million visitors have used the website this year.

The Quranic Arabic Corpus: Part-of-speech Tagging

Tag     Name                      Arabic Name
N       Noun                      اسم
PN      Proper noun               اسماء علم
PRON    Personal pronoun          ضمير
DEM     Demonstrative pronoun     اسم اشارة
REL     Relative pronoun          اسم موصول
ADJ     Adjective                 صفة
V       Verb                      فعل
P       Preposition               حرف جر
PART    Particle                  حرف
INTG    Interrogative particle    حرف استفهام
VOC     Vocative particle         حرف نداء
NEG     Negative particle         حرف نفي
FUT     Future particle           حرف استقبال
CONJ    Conjunction               حرف عطف
NUM     Number                    رقم
T       Time adverb               ظرف زمان
LOC     Location adverb           ظرف مكان
EMPH    Emphatic lām prefix       لام التوكيد
PRP     Purpose lām prefix        لام التعليل
IMPV    Imperative lām prefix     لام الامر
INL     Quranic initials          حروف مقطعة

- Part-of-speech tags adapted from Traditional Arabic Grammar, and mapped to English equivalents (not the other way around)

- These tags apply to words in the Quran, as well as to individual morphological segments in the text

The Quranic Arabic Corpus: Verified Uthmani Script

- Unicode Uthmani Script

- Sourced from the verified Tanzil project

The Quranic Arabic Corpus: Phonetics (faja'alnāhumu)

- Phonetic transcription generated algorithmically

- Guided by Arabic vowelized diacritics
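A toy Python sketch of diacritic-guided transcription follows. It is not the corpus's actual algorithm: the letter table covers only the characters of the example word, the long-vowel rule is a simplification, and the word is spelled here in simplified script with an explicit alef, which is an assumption about how to write the slide's example.

```python
# Consonants needed for the example word only (a deliberately tiny table).
CONSONANTS = {
    "\u0641": "f",   # fa
    "\u062C": "j",   # jim
    "\u0639": "'",   # ain
    "\u0644": "l",   # lam
    "\u0646": "n",   # nun
    "\u0647": "h",   # ha
    "\u0645": "m",   # mim
}
SHORT_VOWELS = {"\u064E": "a", "\u0650": "i", "\u064F": "u"}  # fatha, kasra, damma
SUKUN = "\u0652"  # marks a consonant with no following vowel
ALEF = "\u0627"

def transcribe(word: str) -> str:
    """Transcribe a fully vowelized word, character by character."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in SHORT_VOWELS:
            out.append(SHORT_VOWELS[ch])
        elif ch == SUKUN:
            continue
        elif ch == ALEF and out and out[-1] == "a":
            out[-1] = "\u0101"  # alef after fatha lengthens the vowel
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return "".join(out)

# fa + ja'al + naa + hum + u, written with explicit escapes.
WORD = ("\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652"
        "\u0646\u064E\u0627\u0647\u064F\u0645\u064F")
print(transcribe(WORD))
```

The point of the sketch is that the short-vowel marks carry the information the transcription needs, which is why an unvowelized copy of the text could not be transcribed this way.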

The Quranic Arabic Corpus: Interlinear translation

- Word-for-word translation from accepted sources

- Interlinear translation scheme

The Quranic Arabic Corpus: Location Reference (21:70:4)

- Common standard for verses (Chapter:Verse)

- Extended in the QAC corpus to include word numbers and segment numbers, e.g. (21:70:4:2)
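The location scheme can be sketched as a small parser. This is a minimal Python illustration under the assumption that a reference has two to four colon-separated numbers, optionally wrapped in parentheses; the `Location` type and `parse_location` function are hypothetical names, not part of the corpus software.

```python
import re
from typing import NamedTuple, Optional

class Location(NamedTuple):
    chapter: int
    verse: int
    word: Optional[int] = None      # present in (chapter:verse:word)
    segment: Optional[int] = None   # present in (chapter:verse:word:segment)

_PATTERN = re.compile(r"\(?(\d+):(\d+)(?::(\d+))?(?::(\d+))?\)?")

def parse_location(ref: str) -> Location:
    """Parse references such as '2:233', '(21:70:4)' or '(21:70:4:2)'."""
    match = _PATTERN.fullmatch(ref.strip())
    if match is None:
        raise ValueError(f"not a location reference: {ref!r}")
    numbers = [int(g) if g is not None else None for g in match.groups()]
    return Location(*numbers)

print(parse_location("(21:70:4:2)"))
print(parse_location("2:233"))
```

Leaving `word` and `segment` as `None` lets the same type address a whole verse, a word, or a single morphological segment.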

The Quranic Arabic Corpus: Morphological Segmentation

- Division of a single word into multiple segments

- Part-of-speech tag assigned to each segment

- Traditional Arabic Grammar rules used for division

The Quranic Arabic Corpus: Morphological segment features

The Quranic Arabic Corpus: Arabic Grammar Summary

The Quranic Arabic Treebank: Syntactic Annotation

- Dependency Grammar based on إعراب (i'rāb)

- Syntactico-semantic roles for each word

The Quranic Arabic Treebank: What’s new about this research?

- First Treebank of Classical Arabic

- Free Treebank of the Quran

- Well-defined formal representation of Traditional Arabic Grammar using hybrid constituency/dependency graphs

Automatic Annotation: Classical Arabic Dependency Parser

- Joakim Nivre (2009): dependency parsing using a shift/reduce queue/stack architecture with machine learning

- Following a similar architecture, but with hand-written rules, the custom parser has an F-measure of 77.2%
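The queue/stack idea can be sketched in a few lines of Python. This is a generic arc-standard style transition system, not the authors' parser: the transition names, the English toy sentence, and the fixed transition sequence are all illustrative assumptions. In a real system the next transition would be chosen by a learned model or, as described above, by hand-written rules.

```python
from collections import deque

def run_transitions(words, transitions):
    """Apply SHIFT / LEFT / RIGHT transitions; return (head, dependent) arcs."""
    stack = []
    queue = deque(range(len(words)))   # word indices still to be read
    arcs = []
    for action in transitions:
        if action == "SHIFT":
            stack.append(queue.popleft())
        elif action == "LEFT":
            # Top of stack becomes head of the item just below it.
            dependent = stack.pop(-2)
            arcs.append((stack[-1], dependent))
        elif action == "RIGHT":
            # Item below the top becomes head of the top item.
            dependent = stack.pop()
            arcs.append((stack[-1], dependent))
        else:
            raise ValueError(f"unknown transition: {action}")
    return arcs

# Toy sentence and a hand-picked transition sequence (both hypothetical).
words = ["he", "reads", "books"]
sequence = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
arcs = run_transitions(words, sequence)
print([(words[h], words[d]) for h, d in arcs])
```

Each word is shifted once and attached once, so the number of transitions grows linearly with sentence length.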

Quran ‘Search for a Concept’ Tool

Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany;

Noorhan Abbas. Qurany: A Tool to Search for Concepts in the Quran (PDF). MSc by Research Thesis, School of Computing, Leeds University, 2009

Quran ‘Search for a Concept’ Tools

What do the available Quran tools on the net provide?

What is the main problem with these tools?

What about the Recall value of their results?

What is the main reason for these poor results?

• The SearchTruth tool: 48% (Search Truth, http://www.searchtruth.com/)

• The Holy Quran Viewer tool: 34% (Holy Quran Viewer, http://www.2muslims.com/directory/Detailed/223253.shtml)

• The University of Southern California tool: 49% (MSA-USC Qur’an Database, http://www.usc.edu/dept/MSA/reference/searchquran.html)
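Recall, the measure asked about above, and the F-measure quoted elsewhere in these slides can be computed as in the following Python sketch. The verse identifiers and result sets are hypothetical placeholders, not data from the evaluation of any of these tools.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant items that the search actually returned."""
    relevant = set(relevant)
    return len(relevant & set(retrieved)) / len(relevant)

def precision(retrieved, relevant):
    """Fraction of the returned items that are relevant."""
    retrieved = set(retrieved)
    return len(retrieved & set(relevant)) / len(retrieved)

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical example: four verses are relevant to a concept,
# and a keyword search returns three verses, two of them relevant.
relevant = {"v1", "v2", "v3", "v4"}
retrieved = {"v1", "v2", "v9"}

r = recall(retrieved, relevant)
p = precision(retrieved, relevant)
print(r, p, f_measure(p, r))
```

A keyword search misses relevant verses that express the concept with different words, which lowers recall even when precision is acceptable.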

• What is a CONCEPT?

• NOT just a “keyword”

• “index term” in a textbook?

Quran ‘Search for a Concept’ Tool

• General/Abstract Concepts:

– Women’s financial status

– Main pillars of Islam

– Characteristics of Paradise

• Concrete Concepts:

– Names of places (Makkah, Mecca, Meccah)

– Names of prophets, angels, etc. (Musa, Moses)

– Names of Holy Books (The Book (Bible), Bible, New Testament)

Quran ‘Search for a Concept’ Tool

Quran ‘Search for a Concept’ Tool

• What does my tool look like?


Quran ‘Search for a Concept’ Tool

Handling the Concrete Concepts

– Eight Parallel English Translations

– Search for one English word or a group of words in one search request

– Search for one Arabic word or a group of words in one search request

– Search for a mixed list of Arabic and English words in one search request

– Offers a list of synonyms for the English words
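The features listed above (several parallel translations, multi-word requests, synonym lists) can be sketched roughly as follows. This is a minimal Python illustration, not the Qurany implementation: the synonym groups reuse the spelling variants mentioned earlier in the slides, while the verse identifiers and sentences are hypothetical placeholders.

```python
import re

# Spelling variants treated as one concrete concept (from the slides' examples).
SYNONYMS = [
    {"mecca", "makkah", "meccah"},
    {"moses", "musa"},
]

def expand(term):
    """Return the term together with any listed synonyms."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}

def search(terms, translations):
    """Return ids of verses where any translation contains any expanded term."""
    wanted = set()
    for term in terms:
        wanted |= expand(term)
    hits = []
    for verse_id, texts in translations.items():
        words = set()
        for text in texts:
            words.update(re.findall(r"\w+", text.lower()))
        if wanted & words:
            hits.append(verse_id)
    return sorted(hits)

# Placeholder data: each verse id maps to several parallel "translations".
TOY = {
    "v1": ["a sentence mentioning Makkah", "a sentence mentioning Mecca"],
    "v2": ["a sentence mentioning Musa", "a sentence mentioning Moses"],
    "v3": ["an unrelated sentence", "another unrelated sentence"],
}

print(search(["Meccah", "Moses"], TOY))
```

Searching all parallel translations at once means a verse is found even if only one translator happened to use the spelling the user typed.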

Quran ‘Search for a Concept’ Tool

• General/Abstract Concepts

– Imported from the ‘Mushaf Al Tajweed’ index of topics published by Dar Al-Maarifa in Syria.

– The tool has 15 main concepts.

– The tool covers all the concepts in both languages, Arabic and English.

– The total number of concepts covered is 1170.

– For example, to represent:

• Women’s financial status

• Main pillars of Islam

• Characteristics of Paradise

Knowledge representation and text mining of the Qur'an

• Abdul-Baquee Muhammad

• http://www.comp.leeds.ac.uk/scsams/

• http://www.textminingthequran.com/wiki

Qur'anic Applications: Text Mining The Quran

- Verse similarity: Allows you to see all verses that share a certain percentage of characters with your input verse.

- Quranic Chapter Relatives: Allows you to see the strongest relatives of a given Quran chapter.

- Word Cloud: See word clouds of a sura or group of suras of the Qur'an.

- Qur'an Concordance: Concordance over lemma.

- Part-of-Speech Display of Sura: View a sura of the Qur'an with color-coded part-of-speech tags.

- Quranic word co-occurrence: Allows you to enter a Quranic term to find its most frequent neighbors.

- N-gram Search: Search up to 5-gram phrases of the Quran with a frequency of 5 or more.

- Pronoun References: Given a verse, see all pronoun references within this verse.

- List of Concepts: See a list of concepts arising from pronoun referents in the Quran.
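The n-gram search described above can be sketched as a frequency table over token sequences. This is a minimal Python illustration, not the website's implementation: the function name, its parameters, and the toy token lists are assumptions, with the defaults set to the "up to 5-gram, frequency of 5 or more" limits stated in the list.

```python
from collections import Counter

def frequent_ngrams(verses, max_n=5, min_freq=5):
    """Count every n-gram (1 <= n <= max_n) and keep those seen min_freq+ times."""
    counts = Counter()
    for tokens in verses:
        for n in range(1, max_n + 1):
            # Slide a window of n tokens across the verse.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# Toy token lists standing in for tokenized verses (hypothetical data).
toy_verses = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["x", "a", "b"],
]

# Smaller limits than the defaults so the toy data produces results.
result = frequent_ngrams(toy_verses, max_n=2, min_freq=3)
print(result)
```

N-grams are counted within each verse only, so a phrase is never formed across a verse boundary in this sketch.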

AI for understanding the Quran

Kais Dukes developed the Quranic Arabic Corpus, the first online annotated linguistic resource to show the Arabic "irab" morphology and grammar for each word and verse in the Holy Quran, including word-by-word morphology and English gloss, and an ontology of Quranic concepts;

Nora Abbas developed the first Quran "search for a concept" tool and website, Qurany;

Abdul-Baquee Sharaf developed tools and resources for text mining the Quran, including verse similarity, lemma concordance and collocation, and for text mining the Hadeeth.