The main goal of this research is to investigate and develop appropriate text collections, tools, and procedures for Arabic document classification. The following specific objectives have been set to achieve the main goal:
To investigate the impact of preprocessing tasks, including normalization, stop word removal, and stemming, on the accuracy of the Arabic DC system.
To introduce a novel Arabic stemming technique in order to improve the accuracy of the document classification system. The new algorithm attempts to overcome the deficiencies of state-of-the-art Arabic stemming techniques by dealing with MWEs and foreign Arabized words and by reducing the majority of broken plural forms to their singular forms.
To use an Arabic text summarization technique as a feature reduction technique that eliminates noise in the documents and selects the most salient sentences to represent the original documents.
To explore the impact of different feature selection techniques on the accuracy of Arabic document classification, and to propose and implement a new variant of the Term Frequency-Inverse Document Frequency (TFIDF) weighting method that takes into account the importance of the first appearance of a word and the compactness of the word, which can be taken as factors that determine the important features in a document.
To implement various classifiers and compare their performance.
Despite the achievements in document classification, the performance of document classification systems is far from satisfactory. Document classification tasks are characterized by natural language, which means that DC is closely related to natural language processing (NLP) and requires knowledge of its subject matter. In general, natural language exhibits many syntactic and semantic ambiguities in addition to its complexities. In the context of DC, a researcher tries to address various problems arising from the characteristics of documents in the process of feature extraction and feature representation, or problems emanating from the classification algorithms. The following sections describe these research problems.
1.1.1. Text Preprocessing Problem
The preprocessing stage is challenging and can affect the performance of any DC system positively or negatively. Therefore, improving the preprocessing stage for a highly inflected language such as Arabic will enhance the efficiency and accuracy of the Arabic DC system. In spite of the lack of standard Arabic morphological analysis tools, most of the previous studies on Arabic DC have proposed the use of preprocessing tasks to reduce the dimensionality of feature vectors without comprehensively examining their contribution to the effectiveness of the DC system. One of the challenges facing researchers in Arabic document classification is the absence of a strong and effective stemming algorithm. Arabic is a morphologically complex language that uses both inflectional and derivational morphology. Based on these types of morphology, a single word may yield hundreds or even thousands of variant forms. The importance of using stemming in document classification lies in the fact that it makes the process less dependent on particular forms of words and reduces the high dimensionality of the feature space, which, in turn, enhances the performance of the classification system. In spite of the rapid progress of research in other languages, the Arabic language still suffers from a shortage of research and development. The state-of-the-art Arabic stemmers suffer from high stemming error rates due to understemming and overstemming errors, and they ignore the handling of multiword expressions (MWEs), broken plural forms, and Arabized words. Therefore, the limitations of the current Arabic stemming methods have motivated the author to investigate, in chapter 5, a novel Arabic stemming technique for extracting the roots of Arabic words in order to improve the accuracy of the document classification system.
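To illustrate why stemming choices matter, the following minimal sketch shows the light-stemming idea of stripping common Arabic prefixes and suffixes. The affix lists and the minimum stem length are simplified assumptions chosen for illustration; this is not the stemmer proposed in this thesis.

```python
# Minimal Arabic light-stemming sketch (illustration only).
# The affix inventories below are a small, assumed subset of the
# affixes real light stemmers handle.

PREFIXES = ["وال", "بال", "كال", "فال", "ال", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ه", "ة"]

def light_stem(word: str, min_len: int = 3) -> str:
    """Strip one leading prefix and one trailing suffix, keeping a
    minimum stem length so short words are not destroyed."""
    for p in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(p) and len(word) - len(p) >= min_len:
            word = word[len(p):]
            break
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= min_len:
            word = word[:-len(s)]
            break
    return word
```

Such affix stripping reduces a sound plural like المعلمون to معلم, but it leaves broken plurals (e.g. كتب, the plural of كتاب) untouched, since they are formed by internal vowel change rather than affixation. This is precisely one of the deficiencies the proposed stemmer addresses.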
1.1.2. High Dimensionality of the Feature Space
Extremely high-dimensional feature spaces and large volumes of data are common problems in automatic document classification. High dimensionality problems arise because the number of features used in the classification process grows with the dimensionality of the feature vectors [13, 15, 48, 49]. Practical examples show that the number of features constituting the dimensionality can amount to thousands.
A large number of features are irrelevant to the classification task and can be removed without affecting the classification accuracy, for several reasons. First, the performance of some classification algorithms is negatively affected when dealing with a high dimensionality of features. Second, an over-fitting problem may occur when the classification algorithm is trained on all features. Finally, some features are common and occur in all or most of the categories.
In order to solve this problem, the dimensionality of the feature vector must be reduced without degrading classification performance. It is therefore important to extract the features with high discriminating power using various techniques. Text summarization, feature selection, and feature weighting are common techniques used in document classification to reduce the high dimensionality of the feature space and to improve the efficiency and accuracy of the classification system. The term frequency (TF) weighted by inverse document frequency (IDF), abbreviated as TFIDF, can partially solve the problem of variation in the content and length of documents, but it cannot solve the problem of the distribution of the important words within a document. In general, a document is written in an organized manner to describe its main topic(s). For example, the main topic of a news article may be mentioned in the title and the first part of the document to draw the reader's attention. Therefore, depending on their location, the parts of a document may contribute to its main topic(s) to different degrees. In this thesis, we propose, in chapter 6, new feature weighting methods that treat the problem of the distribution of the important words within the document.
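The standard TFIDF weighting discussed above can be sketched as follows. Note that a term's weight depends only on how often it occurs, never on where it occurs, which is exactly the limitation motivating the position-aware methods of chapter 6. The toy tokenized documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Standard TFIDF: weight(t, d) = tf(t, d) * log(N / df(t)).

    `docs` is a list of tokenized documents; returns one dict of
    term -> weight per document. Term positions play no role, so a
    word in the title weighs the same as one buried at the end.
    """
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]
```

A term occurring in every document receives weight zero (log(N/N) = 0), matching the earlier observation that features common to all categories carry little discriminating power.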
In order to satisfy the objectives stated in this research, the research questions of this study can be summarized as follows:
What is the impact of text preprocessing techniques such as normalization, stop word removal, and stemming on the performance of the Arabic DC system? What Arabic text preprocessing methods are available for implementation in this research? What are their advantages and disadvantages? How can their performance be compared and improved in order to improve the accuracy of the Arabic document classification system?
What is the impact of feature reduction techniques on Arabic document classification? How can the problems of the high dimensionality of the feature space and of selecting the features most important for understanding a document be overcome?
Which classification algorithms perform best when applied to different representations of the Arabic datasets?
This research focuses on exploring different preprocessing techniques, dimensionality reduction techniques and investigating their effect on Arabic document classification performance. More specifically, the main contributions of this thesis are as follows:
Demonstrate that using preprocessing tasks such as normalization, stop word removal, and stemming for Arabic datasets has a significant impact on classification accuracy, especially given the complicated morphological structure of the Arabic language. Furthermore, we demonstrate that choosing appropriate combinations of preprocessing tasks provides a significant improvement in the accuracy of document classification, depending on the feature size and classification technique.
In this thesis, we propose a novel stemmer for Arabic document classification. The proposed stemmer attempts to overcome the weaknesses of the root-based and light stemming techniques, in addition to dealing with the majority of broken plural forms, MWEs, and foreign Arabized words. We compare the proposed stemmer with well-known Arabic stemmers, including root-based stemming (the Khoja stemmer) and light stemming (the Larkey stemmer), to study its contribution to improving the classification system. The comparison is carried out over different datasets, classification techniques, and performance measures.
Demonstrate that using document summarization techniques helps improve the efficiency of Arabic document classification by reducing the high dimensionality of the feature space without affecting the value or content of the documents, thus saving memory space and execution time in the document classification process.
In this thesis, we investigate the impact of different feature selection techniques, namely Information Gain (IG), the Ng-Goh-Low (NGL) coefficient, Chi-square testing (CHI), and the Galavotti-Sebastiani-Simi (GSS) coefficient, which significantly reduce the dimensionality of the feature space and thus improve the performance of the Arabic document classification system.
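As an example of how such techniques score features, the chi-square (CHI) statistic for a term and a category can be computed from a 2x2 contingency table of document counts. This is the textbook form of the measure, shown here only as an illustration of the feature selection step.

```python
def chi_square(a, b, c, d):
    """Chi-square score of term t for category k.

    a: docs in k containing t      b: docs outside k containing t
    c: docs in k without t         d: docs outside k without t
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0                  # degenerate table: no evidence
    return n * (a * d - b * c) ** 2 / denom
```

A term perfectly correlated with a category scores highest, while a term distributed independently of the category scores zero; ranking terms by this score and keeping the top ones is what reduces the dimensionality of the feature space.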
In this thesis, we investigate the impact of feature representation schemes on the accuracy of Arabic document classification. A document usually consists of several parts, and the important features, those most closely associated with the topic of the document, appear in the first parts or are repeated in several parts of the document. Therefore, the proposed weighting methods take into account the importance of the first appearance of a word and the compactness of the word, which can be taken as factors that determine the important features in a document.
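A hypothetical sketch of such a position-aware weight is given below. The particular first-appearance and compactness factors, and the way they are combined, are illustrative assumptions and not the exact formulas proposed in chapter 6.

```python
def position_aware_tf(positions, doc_len):
    """Illustrative position-aware term frequency (assumed formula).

    `positions` are the 0-based token positions of a term in a document
    of `doc_len` tokens. The raw frequency is boosted when the term
    first appears early (first-appearance factor) and when its
    occurrences span a large part of the document (compactness factor).
    """
    tf = len(positions)
    first = 1.0 - positions[0] / doc_len                    # early -> close to 1
    spread = (positions[-1] - positions[0] + 1) / doc_len   # wide span -> close to 1
    return tf * (1.0 + first) * (1.0 + spread)
```

Under this sketch, a term that first appears in the opening of a 100-token document and recurs throughout outweighs a term with the same frequency confined to the final sentences, even though plain TF treats both identically. Whether compactness should reward spread or concentration is itself a design choice; the version above rewards spread, as an assumption.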
Unfortunately, there is no freely available benchmark dataset for Arabic document classification. One of the aims of this research is therefore to compile a dataset for Arabic document classification that covers different text genres, to be used in this research and, in the future, as a benchmark for computational linguistics research, including text mining and information retrieval. The dataset was collected from several published papers on Arabic document classification and by scanning well-known and reputable Arabic websites. Compiling a freely and publicly available corpus is a step forward for the field of Arabic document classification.