International Journal of Electrical and Computer Engineering (IJECE) Vol. No. June 2016, pp. ISSN: 2088-8708. DOI: 10.11591/ijece.

Automatic Extraction of Malay Compound Nouns Using A Hybrid of Statistical and Machine Learning Methods

Muneer A. Hazaa1, Nazlia Omar2, Fadl Mutaher Ba-Alwi3, Mohammed Albared3
1 Faculty of Computer Science and Information Technology, Thamar University, Yemen
2 Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia
3 Faculty of Computer and Information Technology, Sana'a University, Yemen

Article Info

ABSTRACT

Identifying compound nouns is important for a wide spectrum of applications in natural language processing, such as machine translation and information retrieval. Extraction of compound nouns requires deep or shallow syntactic preprocessing tools and large corpora. This paper investigates several methods for extracting noun compounds from Malay text corpora. First, we present the empirical results of sixteen statistical association measures for Malay compound noun extraction. Second, we introduce the possibility of integrating multiple association measures. Third, this work provides a standard data set intended as a common platform for evaluating research on the identification of compound nouns in the Malay language. The standard data set contains 7,235 unique N-N candidates, of which 2,970 are N-N compound noun collocations. The extraction algorithms are evaluated against this reference data set. The experimental results demonstrate that a group of association measures (T-test, Piatetsky-Shapiro (PS), C_value, FGM and the rank combination method) are the best association measures and outperform the other association measures for collocations in the Malay corpus. Finally, we describe several classification methods for combining the scores of the basic association measures, followed by their evaluation.
Evaluation results show that classification algorithms significantly outperform individual association measures. The experimental results obtained are quite satisfactory in terms of precision, recall and F-score.

Article history: Received Oct 12, 2015; Revised Dec 7, 2015; Accepted Dec 21, 2015

Keyword: Association Measures, Classification Algorithms, Compound Nouns, Malay Language

Copyright © 2016 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author: Muneer A. Hazaa, Faculty of Computer and Information Technology, Dhamar University, Yemen. Email: muneer_hazaa@yahoo.

Journal homepage: http://iaesjournal.com/online/index.php/IJECE

INTRODUCTION

Compound nouns are a commonly occurring construction in natural languages. A compound noun is made up of two or more nouns which together function syntactically as a single noun, such as "golf club" or "computer science". The syntax and semantics of compound nouns are discussed in detail by Levi. Compound nouns which consist of two words are analyzed syntactically by means of the rule N → N N, applied recursively where needed. Compounds of more than two nouns are ambiguous in syntactic structure. Noun-noun compounds, as a subset of compound nouns, characteristically occur with high frequency and high lexical and semantic variability. Noun compounds (or NCs) have received a great deal of attention in recent years in the computational linguistics literature. Identifying compound noun Multiword Expressions (MWEs) and understanding their syntax and semantics is difficult but important for many Natural Language Processing (NLP) applications, particularly parsing, and dictionary-based applications such as machine translation and question answering. Extracting Malay compound nouns accurately is a challenging task. Hence, this study attempts to improve the effectiveness of Malay noun compound extraction by proposing a hybrid of statistical and machine learning methods.
The main activity in our research work is to observe and find an acceptable technique for extracting noun-noun compounds in Malay. Compounds have been a recurrent focus of attention within theoretical and cognitive linguistics and, in the last decade, also within computational linguistics. Considerable research has addressed the automatic identification of multiword units and noun-noun compounds, and the classification of semantic relationships between the components of compounds. Most of these studies on noun-noun compounds deal only with English and a few other languages; not much research has been carried out at this level for Malay.

Various lexical association measures have been suggested in the literature for the identification of MWEs. These association measures are mathematical formulas that compute an association score between two or more words based on their occurrences and co-occurrences in a text corpus. The scores indicate the potential for a candidate to be a collocation. They can be used for ranking (candidates with high scores at the top), or for classification (by setting a threshold and discarding all bigrams below this threshold). Overviews of the most widely used techniques are available in the literature.

Compound nouns in Malay have been classified into three major types based on their syntactic structure. The syntactic structure of the first and second categories is a noun followed by a noun, for example "gunung-ganang" and "kapal layar" (sailing ship). For the third category, the syntactic structure is a noun followed by a non-noun word. The POS of the non-noun word can be a determiner, verb, adjective, adverb, preposition phrase or ordinal. In this study, our work focuses on the automatic extraction of N-N Malay compound noun multiword expressions. In this paper, first, several statistical association measures are investigated for the identification of noun-noun compounds in a Malay corpus.
After that, we present an automatic noun-noun compound extraction method based on a weighted combination of the ranked lists produced by multiple lexical association measures. Finally, we describe several classification methods which use association measure scores as their feature sets. The experiments presented in this paper were performed on Malay data, and our attention was restricted to the first and second categories of Malay noun compounds.

This paper is organized as follows. Section 2 gives a summary of related work. Section 3 describes our Malay noun compound extraction methods. Section 4 presents the evaluation methods, the experimental results and a discussion of the results. Finally, Section 5 concludes the study and outlines future work.

RELATED WORK

Several approaches to MWE extraction have been proposed for various languages, such as English and German. Generally speaking, these approaches can be divided into four mainstream methodologies: statistical approaches, linguistic methods, hybrid methods, and machine learning methods.

Among statistical methods for MWE extraction, Church and Hanks first presented the concept of association measures and proposed Mutual Information (MI) as an objective measure for estimating word association. Pecina presented an empirical evaluation of a comprehensive list of automatic collocation extraction methods (association measures for bigram collocation extraction) and concluded that, on Czech data, MI has the best performance. Yoshida et al. proposed a new method (Enhanced Mutual Information and Collocation Optimization) to extract MWEs from text. Their results show that the new method significantly improves the performance of multiword expression extraction in comparison with a classic MI extraction method. Chakraborty and Dandapat, Mitra et al. used statistical measures to extract Noun-Noun (N-N) and Noun-Verb (N-V) collocations as MWEs in Bengali corpora, respectively.
Kunchukuttan and Damani developed a system for Hindi compound noun MWE extraction from a Hindi corpus. Their extraction methods are based on statistical co-occurrence.

The linguistic methods for MWE extraction are based on the POS tags of words, which express the grammatical and syntactic requirements for a word sequence to be an MWE. Bourigault proposed a grammatical analysis method for the extraction of terminological noun phrases. Argamon, Dagan et al. proposed a memory-based approach to learning language patterns from corpora. Their method relies on local POS information of a word sequence instead of fully parsing a sentence.

The hybrid approach combines both statistical and linguistic information about word sequences. Dias proposed a hybrid system which uses mutual expectation to score both the association of words and the association of POS patterns in tagged corpora. Su, Wu et al. designed an automatic compound retrieval method to extract compounds within a text. They use n-gram mutual information, relative frequency count and POS as the features for compound extraction.

IJECE Vol. No. June 2016 : 925-935

Among machine learning methods, Pecina used a machine learning approach for MWE extraction. The method uses 55 kinds of association measures, such as joint probability, MI and t-score, to score each compound noun candidate. After that, a machine learning method (linear logistic regression, linear discriminant analysis and neural networks) is used to classify new collocation candidates, using the association measure scores as features, and to determine whether or not they are MWEs. The machine learning methods significantly improved the ranking of collocation candidates on all of the data sets compared with the best individual association measure. Duan, Lu et al. developed a bio-inspired approach for multi-word expression extraction.

RESEARCH METHOD

We have developed a system that extracts bigram compound noun MWEs from a text corpus.
The compound noun extractor creates a ranked list of Malay compound nouns. Several approaches, which rely mainly on the statistical co-occurrence information of the compound nouns and on POS patterns, are used. The basic system architecture is shown in Figure 1. The following subsections discuss the extraction methods in detail.

Corpus Acquisition

Corpora have been extensively employed in several NLP tasks as the basis for automatically learning models for language analysis and generation. In this step, we crawl and collect news articles written in the Malay language from the Malaysian National News Agency (BERNAMA) news source (http://w. com/bernama/v6/index.). The size of the corpus is 49,661 news articles and 13,346,381 tokens.

Figure 1. Extraction and Filtration of Compound Nouns Multiword Units (pipeline stages: Corpus Acquisition, Corpus Preprocessing, Candidate Generation, Candidate Compound Nouns, Automatic CNs Extraction Methods (ranking and classification), Final Compound Nouns lists)

Preprocessing

In this phase, all crawled web pages are preprocessed by removing all HTML tags, identifying the main content, removing noise automatically and breaking the content down into a sequence of individual tokens. After that, all-uppercase, capitalized and mixed-case words are lowercased. Punctuation, special symbols and numbers are removed. Table 1 shows the n-gram statistics of our corpus.

Table 1. Statistics of the Malay corpus
Number of types:
Number of tokens: 13,346,381
Number of unique bi-grams: 705,680
Number of bi-grams: 13,296,724
Number of unique tri-grams: 1,730,916
Number of tri-grams: 13,247,067

Candidate Generation

In this phase, we tag all nouns in the text corpus using a list of Malay nouns obtained from a small manually annotated tagged corpus and a Malay lexicon which contains Malay words with their possible POS tags.
This phase yields all possible N-N collocations that occur in the corpus. From the tagged corpus, any two consecutive words tagged Noun and Noun, respectively, are extracted as a candidate N-N compound. These compound noun candidates are then passed to the next phase for automatic compound noun extraction. Candidates which occur with very low frequency are discarded: only candidate compound noun collocations whose frequency in the corpus is greater than or equal to three are considered.

Automatic Extraction

Once the candidate N-N compounds have been extracted in the candidate generation phase, each compound noun MWE candidate extracted from the corpus is ranked or classified. In our task, several statistical co-occurrence measures and sequence-type models are calculated for each of the extracted candidates, and the candidate collocations are ranked or classified by these measures. The measures can be used for ranking (candidates with high scores at the top), or for classification (by setting a threshold and discarding all bigrams below this threshold).

Statistical co-occurrence association model

The major statistical measures used and evaluated for N-N compound recognition in our study are the following.

Pointwise Mutual Information (PMI), Z-score, T-test: These methods compare the observed frequencies of collocation candidates with the frequencies expected under the assumption of independence in the target pairs (w1, w2). Krenn carried out a thorough evaluation of the t-score, z-score and MI measures and showed that the t-score outperformed the other association measures for collocations in a German corpus. The statistical measures t-score, z-score and MI are computed from the following counts: N, the total number of instances of NNCs; f_w1, the total number of instances of w1; O, the total number of instances of the pair (w1, w2); and f_w2, the total number of instances of w2.
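The candidate generation and scoring steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the tag name "N", the function names and the toy input are hypothetical, and the formulas are the standard textbook forms of the three measures, with expected count E = f_w1 · f_w2 / N, PMI = log2(O / E), t = (O − E) / √O and z = (O − E) / √E. Sentence boundaries are ignored for brevity.

```python
import math
from collections import Counter


def nn_candidates(tagged_tokens, min_freq=3, noun_tag="N"):
    """Count adjacent noun-noun pairs and keep those seen at least min_freq times."""
    pair_counts = Counter(
        (w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:])
        if t1 == noun_tag and t2 == noun_tag
    )
    return {pair: n for pair, n in pair_counts.items() if n >= min_freq}


def association_scores(O, f_w1, f_w2, N):
    """Return PMI, t-score and z-score for one candidate pair (assumed standard forms)."""
    E = f_w1 * f_w2 / N  # expected pair count if w1 and w2 were independent
    return {
        "pmi": math.log2(O / E),
        "t_score": (O - E) / math.sqrt(O),
        "z_score": (O - E) / math.sqrt(E),
    }


# Toy input (hypothetical): "kad kredit" occurs three times as an N-N bigram.
tagged = [("kad", "N"), ("kredit", "N"), ("baru", "ADJ")] * 3
candidates = nn_candidates(tagged)

# Toy counts (hypothetical), chosen so that E = 1.
scores = association_scores(O=10, f_w1=20, f_w2=50, N=1000)
```

Sorting the candidates by any one of the returned scores gives a ranked list of the kind evaluated later in the paper; the frequency cut-off of three mirrors the threshold stated in the text.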
Chi-square test (χ²-test): Pearson's test of independence can be used to test whether the words in a collocation are independent of each other. The χ²-test is a classical method that is widely used for this type of analysis. It is computed from the following counts: N, the total number of instances of NNCs; O, the total number of instances of the pair (w1, w2); f_w1, the total number of instances of w1; f_w2, the total number of instances of w2; the number of pairs that contain neither w1 nor w2; the number of pairs that contain w2 but not w1; and the number of pairs that contain w1 but not w2.

Phi coefficient: In statistics, the Phi coefficient is a measure of association for two binary variables. The Phi coefficient has been adopted in several works on compound extraction.

Log Likelihood Ratio (LLR): The likelihood-ratio test is a more general test of significance than the χ²-test and makes no assumption of approximation to the normal distribution. The LLR has proved to give better results. The log-likelihood is calculated with a formula adjusted for the co-occurrence contingency table. For a given pair of words (w1, w2), let a be the number of windows in which w1 and w2 co-occur, let b be the number of windows in which only w1 occurs, let c be the number of windows in which only w2 occurs, and let d be the number of windows in which neither of them occurs; the log-likelihood ratio is computed from these four counts.

Other methods: In addition to the methods described above, other statistical association measures such as the Dice coefficient, odds ratio, Jaccard (J), Normalized Expectation (NE), Mutual Dependency (MD) and Mutual Expectation (ME) are also used. These measures are widely used in collocation extraction.

Methods based on the statistics of compound nouns and their components

The C-value Approach: The C-value method is an efficient domain-independent multi-word term recognition method which combines linguistic and statistical information.
C-value is sensitive to nested compounding through its enhanced statistical measure of frequency of occurrence. C-value is defined in terms of the following quantities: CN, a candidate compound noun; the number of simple nouns that CN consists of; the frequency of occurrence of CN in the corpus; the set of extracted candidate terms that contain CN; and c(CN), the number of those candidate terms.

Combining frequency and geometric mean of nouns (FGM): The main advantage of this method is that it takes into account both the statistics of the compound noun space and the actual use in a corpus within one scoring function. Here f(CN) is the number of independent occurrences of the noun CN, #LN(N) and #RN(N) are the numbers of distinct simple words which directly precede or succeed N, and LN(N) and RN(N) are the frequencies of nouns that directly precede or succeed N.

Rank combination

Each of the above association measures produces a ranked list. We tried the following approach to combine these ranked lists.

Rank Aggregation (RA): The aim is to combine the ranked lists produced by several association measures using the ordinal ranks of the elements in each list. The weighted combination method has proved to give better results than the individual measures. Given multiple ordered lists L1, L2, ..., Lm of CNs, the rank aggregation problem is to combine these lists into a single ranked list. We use the following rank aggregation heuristic, which is called Borda's positional ranking: given lists L1, L2, ..., Lm, for each candidate c in NNCs and each list Li, the score of c in Li is the number of candidates ranked below c in Li. The total Borda score of c is the sum of its scores over all m lists. The candidates are then sorted by descending Borda score.
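Borda's positional ranking as described above can be sketched as follows. This is a minimal illustration under assumptions that the text does not specify: a candidate missing from a list simply receives no points from that list, ties are broken alphabetically, and the three input rankings are made-up toy lists rather than results from the paper. The function name is hypothetical.

```python
def borda_aggregate(ranked_lists):
    """Combine ranked lists (best candidate first) with Borda's positional ranking."""
    totals = {}
    for ranking in ranked_lists:
        size = len(ranking)
        for position, candidate in enumerate(ranking):
            # Score in this list = number of candidates ranked below this one.
            totals[candidate] = totals.get(candidate, 0) + (size - 1 - position)
    # Sort by descending total score; ties are broken alphabetically (assumption).
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Toy rankings (hypothetical), one per association measure.
t_test_list = ["kenaikan harga", "harga minyak", "kad kredit"]
ps_list = ["harga minyak", "kenaikan harga", "kad kredit"]
fgm_list = ["kenaikan harga", "kad kredit", "harga minyak"]

combined = borda_aggregate([t_test_list, ps_list, fgm_list])
```

Because only ordinal positions are used, the raw scores of the individual measures never need to be made comparable before aggregation.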
Statistical Classification

The main idea is to feed statistical and linguistic information about two adjacent Malay nouns to a machine learning classification framework. As shown in the previous sections, the statistical association measures only measure the association strength of pairs of words. After that, their scores are usually ranked. Then, thresholds or evaluation points are set by users to evaluate them against a standard test set. However, the scores, even after ranking, cannot indicate explicitly whether the scored pairs of words are compound nouns or not. For example, the word pair "kad kredit" ("credit card") is scored 61.65 by the t-test and ranked ninth in the t-test list, but this information cannot tell clearly whether "kad kredit" is a Malay CN or not. Compound noun extraction can instead be formulated as a binary classification problem in which each candidate is assigned one of two classes. Each compound noun candidate x is described by a feature (attribute) vector, each component of which is the statistical score given by one of the above association measures. We have several association scores, given by several association measures, for each candidate, and we want to combine them to achieve better performance. In other words, the classification algorithms integrate all the association measures described above and use their scores as attributes or features to classify N-N candidates. We evaluated several classification methods for compound noun extraction.

Linear Logistic Regression: Logistic regression predicts the probability of an outcome that can only have a binary response. Logistic regression can handle several predictors (numerical and categorical). The multiple logistic regression model expresses the log-odds of the positive class as a linear function of the predictors, and the predicted probability is obtained by applying the logistic function to that linear combination. Each coefficient β_i controls the effect of the predictor x_i: the farther β_i falls from 0, the stronger the effect of the predictor x_i.
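To make the logistic regression classifier concrete, the sketch below uses the standard textbook form, in which the predicted probability is 1 / (1 + exp(−(α + β_1 x_1 + ... + β_p x_p))). It is an illustration only, not the authors' implementation: the plain stochastic gradient fitting loop, the learning rate, the function names and the four toy feature vectors (standing in for standardized association scores) are all assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(alpha, betas, x):
    """Predicted probability that candidate x is a compound noun."""
    return sigmoid(alpha + sum(b * v for b, v in zip(betas, x)))


def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit intercept and coefficients by simple stochastic gradient ascent."""
    alpha, betas = 0.0, [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - predict_proba(alpha, betas, xi)
            alpha += lr * err
            betas = [b + lr * err * v for b, v in zip(betas, xi)]
    return alpha, betas


# Toy data (hypothetical): two association scores per candidate, label 1 = compound noun.
X = [[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]]
y = [1, 1, 0, 0]
alpha, betas = fit_logistic(X, y)
p_high = predict_proba(alpha, betas, [1.5, 1.5])
p_low = predict_proba(alpha, betas, [-1.5, -1.5])
```

A candidate would then be labelled a compound noun when its predicted probability exceeds a chosen cut-off such as 0.5; the cut-off itself is another assumption of this sketch.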
Linear Discriminant Analysis: Linear Discriminant Analysis (LDA) is a popular tool for multiclass discriminative dimensionality reduction. The basic idea of LDA is to find a one-dimensional projection, defined by a vector $v$, that maximizes class separation. The method maximizes the ratio of the between-class variance to the within-class variance in a given data set, thereby guaranteeing maximal separability:

$$ J(v) = \frac{v^{T} S_B v}{v^{T} S_W v} $$

Support Vector Machines: SVMs were proposed to solve two-class problems by finding the optimal separating hyperplane between two classes of data. Suppose that X is a set of labeled training points (feature vectors) $(x_1, y_1), \ldots, (x_n, y_n)$, where each training point $x_i \in \mathbb{R}^N$ is given a label $y_i \in \{-1, +1\}$, $i = 1, \ldots, n$. The goal in SVM is to estimate a function that acts as a classifier for such points; it can be found by solving a convex optimization problem that includes a regularization parameter.

EXPERIMENTS AND DISCUSSION

Data set and experiment setup

To create an evaluation gold standard, manual identification of compound noun MWEs was done on a Malay corpus. All N-N compound noun collocations were manually annotated by a native speaker. The entire reference data set contains 16,535 N-N candidates (7,235 unique N-N candidates); 2,970 of the 7,235 are N-N compound noun collocations. We evaluate the extraction algorithms against the reference set of compound noun collocations manually extracted from the 8,200 files. As described above, the collocation statistics were collected from a larger corpus of 49,661 Malay news documents (13,346,381 words) from the Malaysian National News Agency (BERNAMA) news source (http://w. com/bernama/v6/index.). Using a larger corpus provided more evidence for the statistical measures we used. Since we manually annotated the entire reference data set, we used the standard metrics precision and recall to evaluate the automatic compound noun extraction methods.
These metrics are computed at different ranks, called Evaluation Points (EP). Precision at evaluation point k is the proportion of the top k ranked candidates that are true compound nouns. Recall at evaluation point k is the proportion of all true compound nouns in the reference set that appear among the top k ranked candidates. The F-1 score at evaluation point k is the harmonic mean of precision and recall at k.

Experimental results and analysis

In our experiment, we incrementally examined the n-highest ranked candidate lists returned by each method. The precision values are calculated for the first 100, 200, 500, 1000 and 2000 top ranked candidates. The precision of the different methods is shown in Figure 2. The x-axis represents the Evaluation Points, while the y-axis represents the precision values (the percentage of true N-N compound nouns) achieved at these Evaluation Points. The performance metrics (Precision, Recall and F-score) for all methods are also shown in Table 2.

A first analysis of the precision curves and the other metrics in Table 2 reveals two distinct curve types. Some of the methods start with very high precision, which then decreases quite substantially. On the contrary, other methods start with low precision, which then increases slightly. The precision curve of each measure is important for this purpose because a monotonically decreasing graph indicates that more N-N compound noun collocations are in the upper ranks than in the lower ranks. Although all methods have approximately the same precision for the top 3000 ranked list, finding a bigger proportion of the true N-N compound nouns at an early stage is simply more economical. It is quite clear from the results in Table 2 and Figure 2 that T-test, PS, C_value and FGM prove to be good measures for the automatic extraction of Malay N-N compound noun collocations as MWEs, since their precision scores are higher at almost all evaluation points, while the worst measure appears to be the CS method. For example, 99, 99, 98 and 98 of the top 100 N-N candidates ranked by T-test, PS, C_value and FGM, respectively, are N-N compound noun collocations.
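The evaluation-point metrics used in this section can be sketched as follows, assuming the usual definitions: precision at k is the number of true compound nouns among the top k divided by k, recall at k divides the same count by the size of the reference set, and F-1 is their harmonic mean. The function name and the toy ranked list and gold set are hypothetical and do not reproduce any result from the paper.

```python
def metrics_at_k(ranked, gold, k):
    """Return (precision, recall, F1) for the top-k candidates of a ranked list."""
    hits = sum(1 for candidate in ranked[:k] if candidate in gold)
    precision = hits / k
    recall = hits / len(gold)
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy ranked list and reference set (hypothetical).
ranked = ["kad kredit", "bulan april", "harga minyak", "muka surat"]
gold = {"kad kredit", "harga minyak", "laman web"}

p2, r2, f2 = metrics_at_k(ranked, gold, k=2)
p4, r4, f4 = metrics_at_k(ranked, gold, k=4)
```

Calling the function at k = 100, 200, 500, 1000 and 2000 for each measure would produce the kind of precision curve described in the text.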
The top five candidates for each method and their corresponding tags are shown in Table 4. In fact, these methods show an interesting behavior compared to their behavior in other languages. The results obtained using these algorithms on the Malay corpus are better than the results reported by other evaluation studies for other languages. It is important to note from Table 2 and Table 3 that some methods which are not mathematically equivalent (i.e., assigning identical scores to input candidates), such as T-test and PS, achieve the same average precision and produce the same lists of ranked candidates. The ability to identify such groups of association measures may help in simplifying their formulas.

Figure 2. Overall Precision of different measures at different evaluation points

For the rank combination experiments, we combined the best four methods (T-test, PS, C_value and FGM). Table 4 shows the performance (Precision, Recall and F-score) of Borda's positional ranking method and its top five candidates. Borda's positional ranking, which performs an approximate aggregation of the ranked lists, has been used as a standard ranking function in previous studies. However, in our case, Borda's positional ranking behaves in the same way as the individual measures it combines.

Table 2. The performance metrics (Precision, Recall and F-score) for all methods at different evaluation points (surviving method labels: LLR, DICE, CHI, KAPPA, T-test, Odd, PHI, FGM, Jacc.)

To avoid incommensurability of the association measures in our experiments, we used a common preprocessing technique for score standardization: all association measure values are centered at zero and scaled to unit variance.
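The standardization step just described (centering at zero and scaling to unit variance) can be sketched as below. This is a minimal illustration that assumes the population variance is used and that a constant score column maps to zeros; the text does not state either choice, and the function name is hypothetical.

```python
import math


def standardize(scores):
    """Center scores at zero and scale them to unit (population) variance."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    if std == 0:
        # A constant column carries no information; map it to zeros (assumption).
        return [0.0] * n
    return [(s - mean) / std for s in scores]


# Toy scores (hypothetical) from one association measure.
z_scores = standardize([1.0, 2.0, 3.0])
```

Applying the function separately to the scores of each association measure puts all features on a comparable scale before they are passed to a classifier.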
To evaluate the machine learning methods, the precision, recall and F1-measure of all classification methods were obtained by vertical averaging in ten-fold cross-validation on the same reference data as in the earlier experiments. In each cross-validation step, nine folds were used for training and one fold for testing. All classification methods performed very well. Detailed results (precision, recall and F1-measure) of all classification methods are given in Table 5. The best result was achieved by the support vector machine: SVM achieves precision, recall and F1-measure of 75.44%, 87.78% and 81.14%, respectively. The experiments show that classification algorithms which combine the association scores given by several association measures lead to a significant performance improvement in comparison with the individual basic methods. In fact, the experimental results obtained are quite satisfactory, especially when compared with results obtained in other works. In earlier work, a hybrid of linguistic and statistical approaches was proposed for identifying compound nouns. It is clear that the hybrid method which combines statistical and machine learning methods outperforms the hybrid of linguistic and statistical methods.

Table 3.
Top 10 Malay N-N candidates extracted by different methods (candidate groups and labels in the order in which they survive)

panggung wayang, sahabat handai, karenah birokrasi, makhluk perosak, kanun keseksaan, jejari kentang, adat resam, akar umbi, wakaf mempelam, karbon dioksida
LLR: sahabat handai, panggung wayang, karenah birokrasi, barah otak, paya pahlawan, hukum syarak, makhluk perosak, ais krim, milo ais, tumbuhan ubatan
DICE: jem madu, sahabat handai, lubuk yu, pendingin hawa, laman web, harta intelek, kanun keseksaan, hukum syarak, panggung wayang, khabar angin
sanak sudara, pustaka sufi, angin sakal, tuanku maharajalela, pokok mempisang, online kegilaan, emas kerajang, penubuhan platun, mesyuarat informal, bukit tekoh
NCN NCN NCN NCN NCN NCN NCN NCN NCN NCN NCN
CHI: sahabat handai, lubuk yu, jem madu, pendingin hawa, laman web, harta intelek, kanun keseksaan, hukum syarak, panggung wayang, khabar angin
jem madu, sahabat handai, lubuk yu, pendingin hawa, laman web, harta intelek, kanun keseksaan, hukum syarak, panggung wayang, khabar angin
KAPPA: jaksa pendamai (NCN), laman web, kanun keseksaan, pendingin hawa, harta intelek, musim perayaan, penghilang dahaga, tali pinggang, topi keledar, akar umbi
kenaikan harga, ehwal pengguna, harga minyak, kementerian perdagangan (CN), bahan api, kerajaan negeri, ketua pegawai, harga barang, kad kredit, musim perayaan
T-test: kenaikan harga, ehwal pengguna, harga minyak, kementerian perdagangan, bahan api, kerajaan negeri, ketua pegawai, harga barang, kad kredit, musim perayaan
lubuk yu, sahabat handai, jem madu, panggung wayang, nira nipah, karenah birokrasi, barah otak, era globalisasi, ais krim, kondominium pangsapuri
kenaikan harga, ehwal pengguna, harga minyak, kementerian perdagangan, kerajaan negeri, bahan api, ketua pegawai, harga barang, kad kredit, musim perayaan
Odd: roti canai, mahkamah sesyen, kanun keseksaan, sungai nyiur, penyaman udara, musim tengkujuh, kacang buncis, muka sauk, setebal muka, panggung wayang
NCN NCN NCN NCN NCN
PHI: pengarah syarikat, makanan ternakan, pakej umrah, perlindungan harta, produk buatan, pegawai jabatan, bot pukat, bulan april, permohonan lesen, muka surat
jem madu, sahabat handai, lubuk yu, pendingin hawa, laman web, harta intelek, kanun keseksaan, hukum syarak, panggung wayang, khabar angin
FGM: kenaikan harga, harga minyak, kementerian perdagangan, ehwal pengguna, kerajaan negeri, harga barang, bahan api, ketua pegawai, harga bahan, stesen minyak
Jacc.: jaksa pendamai, laman web, kanun keseksaan, pendingin hawa, harta intelek, musim perayaan, penghilang dahaga, tali pinggang, topi keledar, akar umbi
NCN NCN NCN

Table 4. Results for rank combination
Columns: Method, Evaluation Point, Precision, Recall, F-score, Top 5 ranked candidates
Top 5 ranked candidates: kenaikan harga, ehwal pengguna, harga minyak, kementerian perdagangan, bahan api

Table 5. Performance of Classification Methods Combining All Association Measures
Methods: SVM, LDA, GLM
Columns: Precision, Recall

CONCLUSIONS

In the present work, we have developed a compound noun MWE extraction system which ranks collocations using statistical methods. We developed and manually annotated a reference data set containing 5,610 Malay N-N bigrams, 1,854 of which were agreed to be N-N compound nouns. We implemented several lexical association measures, employed them for N-N compound noun extraction and evaluated them against the reference data set. The results obtained using these algorithms on the Malay corpus are better than the results reported by other evaluation studies for other languages. The results also show that T-test, PS, C_value, FLR and RC are good measures for the automatic extraction of Malay N-N compound nouns. Finally, we employed three classification models (linear logistic regression, linear discriminant analysis and support vector machines) to combine the association scores of the individual measures. Evaluation results show that these models significantly outperform the individual association measures. SVM achieves precision, recall and F1-measure of 75.44%, 87.78% and 81.14%, respectively.
In the future, we will implement and evaluate other available methods suitable for this task. In addition, we will focus especially on automatically interpreting compound noun relations and on improving the quality of the training and testing data. Finally, we will attempt to demonstrate the contribution of collocations in selected application areas, such as machine translation or information retrieval.

REFERENCES