International Journal of Electrical and Computer Engineering (IJECE), Vol., No., August 2017, pp. 2142-2151, ISSN: 2088-8708, DOI: 10.11591/ijece

Classification with Single Constraint Progressive Mining of Sequential Patterns

Regina Yulia Yasmin, Putri Saptawati, Benhard Sitohang
School of Electrical Engineering & Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung 40132, Indonesia

Article history: Received Dec 4, 2016; Revised Apr 4, 2017; Accepted Apr 18, 2017

Keywords: CBS_CLASS*; classification-by-sequence with single constraint PISA; PISA*; sequential pattern mining; single constraint progressive mining of sequential patterns

ABSTRACT
Classification based on sequential pattern data has become an important topic to explore. One piece of research already carried out was Classify-By-Sequence (CBS). CBS classified data based on sequential patterns obtained from AprioriLike sequential pattern mining. The sequential patterns obtained were called CSP (Classifiable Sequential Patterns). CSP was used as classifier rules or features for the classification task. CBS used the AprioriLike algorithm to search for sequential patterns. However, the AprioriLike algorithm took a long time to search for them. Moreover, not all sequential patterns were important to the user. In order to get the right and meaningful features for classification, the user applies a constraint in sequential pattern mining. The constraint is also expected to reduce the number of sequential patterns that are short and less meaningful to the user. Therefore, we developed CBS_CLASS* with Single Constraint Progressive Mining of Sequential Patterns (Single Constraint PISA, or PISA*). CBS_CLASS* with PISA* was proven to classify data in a shorter time, since it processes a smaller number of sequential patterns that still conform to the user's needs. The experiment results showed that, compared to CBS_CLASS, CBS_CLASS* reduced the classification execution time by 89%. Moreover, the accuracy of the classification process was maintained.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Regina Yulia Yasmin, School of Electrical Engineering & Informatics, Institut Teknologi Bandung, Jl. Ganesha 10, Bandung 40132, Indonesia. Email: regina.yasmin@gmail.com

INTRODUCTION
The classification process is needed to categorize data of various types for various purposes, for instance classifying web click data to find out the website search patterns of active internet users, classifying face or voice biometric data to recognize its source, and classifying e-commerce data to categorize prospective customers. The large amount of data with rapidly increasing growth rates and various data types, especially big data, requires the development of more efficient classification methods. Moreover, the time order in which data appear is expected to bring knowledge that can be utilized to classify the data. A classification method for data that appear in time sequence can be used, for example, to categorize the class of poker step sequences; that study was conducted to determine whether a particular poker step will result in victory or defeat. Classification of data that appear in time sequence is called sequential data classification. Sequential data as the input for the classification process can be provided directly or preprocessed using sequential pattern mining. Various studies on sequential pattern mining have been carried out with a variety of mining techniques.
Several approaches to sequential pattern mining include Apriori, GSP, SPADE, PrefixSpan, and PISA. Research on integrating the classification process with preprocessing based on sequential pattern mining has been done, as has integrating sequential pattern mining with a clustering process. By integrating sequential pattern mining as preprocessing with the subsequent mining process, performance was improved.

Research on the classification of sequential patterns has been done by combining AprioriLike sequential pattern mining with the classification process. The approach was called CBS (Classify-by-Sequence). Basically, the AprioriLike algorithm searches for sequential patterns by generating sequence candidates, constructed iteratively from the shortest sequences toward longer ones. However, the AprioriLike algorithm consumed a long time because the iterative process was repeated over a large amount of data. Moreover, the number of generated sequential patterns was excessive, and most of them were short and trivial to users, so they burdened the subsequent classification process. Hence, the classification process using AprioriLike sequential pattern mining took a longer time. Moreover, from a business point of view, users of data mining sometimes only require data analysis from a certain perspective. This viewpoint is adopted from the organization's needs. For example, users are generally more interested in getting knowledge from the latest data than from the old data. Therefore, there is a need to improve the performance of sequential pattern mining in preprocessing, by improving the execution time needed to obtain sequential patterns and by reducing the number of sequential patterns to only those that satisfy the user's need. A constraint in sequential pattern mining is expected to answer this need.

The sequential pattern mining algorithm is developed based on Progressive Mining of Sequential Patterns (PISA). The PISA algorithm arranges a sequence database in the form of a Progressive Sequence tree (PS-tree) and does not generate sequence candidates, so the mining process becomes faster. We name the single constraint progressive mining of sequential patterns PISA*. In addition to its ability to generate sequential patterns that satisfy the users' needs, a constraint can reduce the number of short and trivial sequential patterns. The smaller number of sequential patterns reduces the workload of the subsequent classification process. It is believed that the classification process will be faster with a better accuracy level. Therefore, this study is expected to provide solutions to the following problems: (1) whether sequential pattern mining, especially PISA*, can lower the number of sequential patterns, and (2) whether it can improve classification time performance while maintaining the accuracy level of the classification process.

In order to solve these problems, we propose a method, Classify-By-Sequence_Class* (CBS_CLASS*), which integrates PISA* for finding sequential patterns with the classification process. The objective of this research is to classify data based on sequential patterns found using PISA*. Therefore, CBS_CLASS is modified to integrate the PISA* algorithm with the classification process. This study is expected to improve the total classification execution time while maintaining the accuracy level. This paper is organised as follows. Section I contains the introduction, related work, the CBS approach, and PISA*. Section II explains CBS_CLASS* with PISA*, the proposed approach. Section III explains the results and analysis.
Section IV contains the conclusion of this paper and future research.

Related Work
Sequential pattern mining is one of the developments of frequent itemset mining. Let I = {i1, i2, ..., in} denote the set of all items; a subset of I is called an itemset. A sequence s = <t1, t2, ..., tm>, where each tj ⊆ I, is an ordered list of itemsets. Each itemset in a sequence represents the set of events that occur at the same time (the same timestamp); different itemsets appear at different times. A sequence A = <a1, a2, ..., an> is a subsequence of a sequence B = <b1, b2, ..., bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Research on mining sequences as features for classification had been done to classify step sequences in order to predict whether the steps will lead to victory or defeat. That study introduced a Feature Mining approach and took three steps: (1) a repeated simulation step to generate execution traces, (2) mining the simulated execution traces to obtain features, and (3) training a classifier using the features to predict the success or failure of simulated traces. The study used two processes: a sequential pattern mining process to obtain sequential patterns that correlate with the target class, and a classification process using the features obtained from the previous process. However, CBS was proven to have better accuracy performance than Feature Mining, since CBS integrates AprioriLike sequential pattern mining with probabilistic induction so that sequential patterns can be used as classifier rules.

Classify-by-Sequence (CBS) Approach
Classify-by-Sequence (CBS) is a classification process that builds a model from sequential patterns. As shown in Figure 1, CBS performs two processes sequentially: (1) it searches for sequential patterns through AprioriLike sequential pattern mining, and (2) it classifies based on the sequential patterns, which act as classifier rules. The sequential patterns obtained from AprioriLike sequential pattern mining are also called Classifiable Sequential Patterns (CSP). AprioriLike seeks sequential patterns of length n+1 from those of length n; because of this iterative process, AprioriLike requires a long time to obtain sequential patterns.

Figure 1. CBS scheme

The data processed in CBS are categorical data. Each record is associated with a particular class. Denote by Di the subset that consists of time-series data associated with class i, so that D = {D1, D2, D3, ..., Dn}. AprioriLike searches for Classifiable Sequential Patterns (CSP) as classifier rules. Therefore, a sequential pattern i corresponds to class m, written SPi ⇒ Cm, where SPi is a sequence such as <a1, a3, a7> and Cm is class m.

The CBS algorithm is divided into two variants, namely CBS_ALL and CBS_CLASS. CBS_ALL mines all data in the database, while CBS_CLASS first divides the data based on the corresponding class before mining. Since CBS_CLASS builds a classification model from each class, CBS_CLASS was proven to achieve better accuracy than CBS_ALL. The CBS_CLASS method is as follows: (1) divide the data based on the corresponding class or label; (2) from each class, search for the sequential patterns with AprioriLike; (3) classify data using the sequential patterns, which act as Classifiable Sequential Patterns (CSP). The CSP is used as a classifier rule. To find sequential patterns, AprioriLike uses a minimum support. Each CSP is given a score based on the length of the sequential pattern, and the score is then normalized so that the maximum score is 1.
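To make the subsequence relation and the length-based CSP scoring above concrete, the following is a minimal Python sketch of one plausible reading, in which a sequence is an ordered list of itemsets and each CSP is weighted by its item count, normalized so that the longest CSP scores 1. The helper names and this particular notion of "length" are illustrative assumptions, not code from the paper.

from typing import FrozenSet, List

Sequence = List[FrozenSet[str]]  # an ordered list of itemsets

def is_subsequence(a: Sequence, b: Sequence) -> bool:
    """True if every itemset of `a` is contained, in order, in some itemset of `b`."""
    j = 0
    for itemset in a:
        while j < len(b) and not itemset <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1
    return True

def normalized_csp_scores(csps: List[Sequence]) -> List[float]:
    """Weight each CSP by its total item count and normalize by the longest CSP."""
    lengths = [sum(len(itemset) for itemset in csp) for csp in csps]
    longest = max(lengths) if lengths else 1
    return [length / longest for length in lengths]

For example, is_subsequence([frozenset({"a1"}), frozenset({"a3"})], [frozenset({"a1", "a2"}), frozenset({"a4"}), frozenset({"a3"})]) returns True, matching the subsequence definition given above.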
Figure 2 shows the CBS_CLASS algorithm, which consists of the AprioriLike algorithm and the classification process.

Input:  dataset, minimum support
Output: class of sequential patterns
Method:
CBS_CLASS(Dataset D, min_sup)
  For each ci in class_set(D) do
    Di   = class_dataset(D, ci)
    CSPi = FindSP(Di, min_sup)        // get the CSP using AprioriLike
  End

// start of the AprioriLike search for sequential patterns
FindSP(Dataset D, min_sup)
  SP1 = {large-1 items}
  For (i = 2; SPi-1 is not empty; i++) do
    SP_Ci = gen_candidateSP(SPi-1)
    For each data d in D do
      SPs = SP_Ci ∩ subseq(d)
      For each sp in SPs do
        sp.sup++
      End
    End
    SPi = {sp | sp in SPs, sp.sup >= min_sup}
  End
  Return the union of all SPi
// end of the AprioriLike algorithm

// start of the program for classifying test data
Class_of_sequence(test)
  Total_score[] = new array[class_count(D)]
  For each cspi in CSP do
    Total_score[class of cspi] += cspi.length
  End
  Score_array[] = new array[class_count(D)]
  For each cspi in CSP do
    If cspi is a subsequence of test then
      For each cm in belong_class_set(cspi) do
        Score_array[m] += cspi.length / Total_score[m]   // count the score for the test data
      End
    End if
  End
  k = index_of(max{Score_array[]})
  return ck
// end of the program for classifying test data

Figure 2. CBS_CLASS algorithm

Progressive Mining of Sequential Patterns (PISA)
A constraint is a limitation set by the user before the mining process begins. A constraint in sequential pattern mining is expected to reduce the number of sequential patterns. There are many types of user constraints, such as item, length, super-pattern, aggregate, regular expression, duration, and gap constraints. A constraint can be (1) an item constraint, a limitation requiring a sequential pattern to contain or not contain certain items; (2) a length-less constraint, requiring that the length of a sequential pattern be shorter than a given value; (3) a length-more constraint, requiring that the length of a sequential pattern be longer than a given value; (4) a super-pattern constraint, adopted to look for sequential patterns that contain a particular pattern as a set of sub-patterns; (5) an aggregate constraint, a constraint on an aggregate of the items in the pattern; (6) a regular expression constraint, expressed as a regular expression, such as a disjunction or Kleene closure over the set of items; (7) a duration constraint, requiring a sequential pattern to satisfy a predetermined time difference between its start and end transactions; and (8) a gap constraint, requiring a predefined time difference between two adjacent transactions.

Based on how a constraint is checked on sequential patterns, constraints can be categorized as monotonic or anti-monotonic. Constraint checking utilizes the characteristics of subsequences and super-sequences. A constraint CM is said to be monotonic if, whenever a sequence satisfies CM, any super-sequence of it also satisfies CM. Meanwhile, a constraint CA is said to be anti-monotonic if, whenever a sequence satisfies CA, any subsequence of it also satisfies CA.

PISA represents a sequence database as a progressive sequence tree (PS-tree). PISA uses the concept of a time frame named the Period of Interest (POI), which is a sliding window that moves continuously in time. The POI makes PISA flexible in adding or deleting data. PISA processes a sequence database composed of the itemsets of each sequence ID, ordered by occurrence timestamp.
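To make the constraint categories and the monotonic/anti-monotonic distinction above concrete, the following is a minimal Python sketch of single-constraint predicates checked against a candidate pattern. The function names and the flat list-of-itemsets pattern representation are illustrative assumptions rather than the paper's implementation.

from typing import Callable, FrozenSet, List

Pattern = List[FrozenSet[str]]

def item_constraint(item: str) -> Callable[[Pattern], bool]:
    """Monotonic: once a pattern contains `item`, every super-sequence of it does too."""
    return lambda p: any(item in itemset for itemset in p)

def length_more_constraint(min_len: int) -> Callable[[Pattern], bool]:
    """Monotonic: a pattern with at least `min_len` items keeps that property when extended."""
    return lambda p: sum(len(itemset) for itemset in p) >= min_len

def length_less_constraint(max_len: int) -> Callable[[Pattern], bool]:
    """Anti-monotonic: if a pattern has fewer than `max_len` items, so does every subsequence."""
    return lambda p: sum(len(itemset) for itemset in p) < max_len

PISA* exploits exactly these properties: once a pattern is known to satisfy a monotonic constraint, its super-sequences can be accepted without rechecking, and once a pattern satisfies an anti-monotonic constraint, so do all of its subsequences.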
The PS-tree built by the PISA algorithm records the sequence ID, label, and timestamp of each element. The PISA algorithm traverses from the root node to the leaf nodes and calculates the occurrence frequency of each item; if the occurrence frequency of an item is larger than the minimum support, it is reported as a sequential pattern. However, the original PISA approach does not accommodate constraint checking when searching for sequential patterns. Therefore, PISA* was developed so that it is able to check a single constraint.

CBS_CLASS* WITH PISA*: THE PROPOSED APPROACH
Classification CBS* with Single Constraint Progressive Mining of Sequential Patterns (PISA*) is a classification approach that integrates the sequential pattern mining of PISA* with the classification process. The classification process needs sequential patterns as input. Figure 3 shows that the sequential patterns are obtained from PISA* sequential pattern mining, which is conducted before the classification starts. Single constraint PISA (PISA*) searches for sequential patterns that meet a minimum support and a single constraint within the user-defined POI. The minimum support, the single constraint, and the POI are set before the sequential pattern mining process starts. The advantages of PISA* are (1) flexibility in adding or removing data, (2) conformance of the sequential patterns to the user's needs, and (3) removal of short and trivial sequential patterns. The sequential patterns are used as CSP (Classifiable Sequential Patterns) that act as classification rules, denoted SPi ⇒ Cm, meaning sequential pattern i belongs to class m. The classification rules are used to determine the class of new data.

Figure 3. Classification by sequence with the prior PISA* process

This approach has proven to perform better in searching for sequential patterns based on a minimum support and a single user-defined constraint. The single constraints that have been tested are the item constraint, the length-less constraint, and the length-more constraint. Moreover, this approach was also proven to decrease the number of short and trivial sequential patterns and to run in less time than the original PISA algorithm.

Like CBS_CLASS, CBS_CLASS* splits the database based on the corresponding class. Denote by Dm the partition containing the sequential data that corresponds to class m. The database is represented as D = {D1, D2, D3, D4, ..., Dn}, with n classes. The data set Dm consists of time-series data instances (im1, im2, im3, ..., imn), where imn represents the value of Dm at the n-th time point. On each database related to the corresponding class, PISA* seeks sequential patterns that meet the minimum support and a single constraint. A sequential pattern obtained this way is called a CSP (Classifiable Sequential Pattern). Each CSP is weighted based on the length of the sequential pattern; a longer sequential pattern is given a greater weight, and the maximum weight of each CSP is 1.

CBS_CLASS* with PISA* proceeds as follows:
1. For each database Dm of class m, the minimum support, the single constraint, and the POI are predefined based on user requirements. The single constraint may be an item, length-less, or length-more constraint.
2. Find sequential patterns using PISA* as follows:
   a. Transform the database into a sequence database with the properties sequence ID, timestamp, and label (the label is also called the itemset).
   b. Using Procedure Traverse: build and traverse the PS-tree of itemset elements within the POI. Every new element is recorded at the root node, while every element that has a predecessor is recorded at a common node.
   c. While traversing, check whether a sequential pattern satisfies the minimum support, i.e., whether its sequence list size ≥ minimum support × number of sequences.
   d. Using Procedure PisaConstraintItem: to obtain sequential patterns that satisfy the single constraint, check each sequential pattern against the single constraint within the POI time frame. Check the constraint according to its anti-monotonic or monotonic property and apply it to all sequential patterns. For an anti-monotonic constraint, such as the length-less constraint, if a sequence satisfies the anti-monotonic constraint CA then any subsequence of it also satisfies CA; hence, if SPExisting satisfies the constraint and SP is a subsequence of SPExisting, then SP also satisfies the constraint. For a monotonic constraint, such as the item, length-more, or super-pattern constraint, if a sequence satisfies the monotonic constraint CM then any super-sequence of it also satisfies CM; hence, if SPExisting satisfies the constraint and SP is a super-sequence of SPExisting, then SP also satisfies the constraint.
3. Using Procedure Class_of_sequence: classify the data.
   a. For each obtained sequential pattern (CSP), compute the CSP score by accumulating the length of the sequential pattern.
   b. For each CSP, compute the normalized score: if the CSP is a subsequence of a sequence in the existing CSP array, then score = accumulated (length of sequence / total score).
   c. Check the sequential patterns found in the test data against the list of CSPs. If the corresponding class is not found, check the subsequences of the test sequential patterns, from the longest to the shortest, against the list of CSPs recursively until the corresponding class is found.
4. Based on the scores obtained in step 3, obtain index k, where k is the index of the maximum score. The class obtained is the class with index k.

The CBS_CLASS* with PISA* algorithm is described in Figure 4.

Input:  dataset, minimum support, POI, and single constraint criteria (for example, an item constraint)
Output: class of sequential patterns
Method:
CBS_CLASS*(Dataset D, min_sup, POI, label)
  For each ci in class_set(D) do
    Di   = class_dataset(D, ci)
    CSPi = PISA*(Di, min_sup, POI, label)
  End

PISA*(Dataset D, min_sup, POI, label)
  Var PS           // PS-tree
  Var currentTime  // current timestamp
  Var eleSet       // used to store elements ele
  While (there is still a new transaction) do
    eleSet = read all ele at currentTime
    Traverse(currentTime, PS)
    PisaConstraintItem(sp, label)    // item constraint checking
    currentTime++
  End

// start of the modified PISA algorithm
Procedure Traverse(currentTime, PS)
  For (each node of PS in post order) do
    If (node is Root)
      For (ele of every seq in eleSet) do
        For (all combinations of elements in the ele) do
          If (element == label of one of the child nodes)
            Update the timestamp of seq to currentTime
          Else
            Create a new sequence with currentTime
          Else  // create a child
            Create a new child with element, seq and currentTime
    Else  // the node is a common node
      For (every seq in the seq_list) do
        If (seq.timestamp <= currentTime - POI)
          Delete seq from seq_list and continue to the next seq
        If (there is a new ele of the seq in eleSet)
          For (all combinations of elements in the ele) do
            If (element is not on the path from Root)
              If (element == label of one of the child nodes)
                Child.seq_list.timestamp = seq.timestamp
              Else
                Create a new sequence with seq.timestamp
              Endif
            Else  // create a child
              Create a new child with element, seq and seq.timestamp
            Endif
          EndFor
        Endif
      EndFor
    Endif
    If (seq_list.size == 0)
      Delete this node and all of its children from its parent
    If (seq_list.size >= support * sequence number)
      Output the labels of the path from Root to this node as a SP
  End

Figure 4. The proposed CBS_CLASS* algorithm
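As a compact illustration of the classification step just described (Procedure Class_of_sequence), the following Python sketch scores a test sequence against a list of CSPs, each stored with its class and normalized weight, and falls back to progressively shorter subsequences of the test data when nothing matches. It reuses the is_subsequence helper from the earlier sketch; the data layout and the particular shortening strategy are simplifying assumptions, not the paper's exact procedure.

from typing import Dict, FrozenSet, List, Tuple

Sequence = List[FrozenSet[str]]

def classify(test_seq: Sequence,
             csps: List[Tuple[Sequence, str, float]]) -> str:
    """Accumulate the weight of every CSP that is a subsequence of `test_seq`,
    per class, and return the class with the highest total score."""
    scores: Dict[str, float] = {}
    for pattern, cls, weight in csps:
        if is_subsequence(pattern, test_seq):  # helper from the earlier sketch
            scores[cls] = scores.get(cls, 0.0) + weight
    if not scores:
        # Fallback: retry with a shorter subsequence of the test data,
        # mirroring the longest-to-shortest recursive check described above.
        if len(test_seq) > 1:
            return classify(test_seq[1:], csps)
        raise ValueError("no CSP matches the test sequence")
    return max(scores, key=scores.get)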
RESULTS AND ANALYSIS
The experiments use e-commerce sales data containing categorical data. The data set comprises about 22,699 transactions with the attributes product ID, date of sales, customer ID, and order ID. The objective of the experiments is to compare the classification time performance and the accuracy of CBS_CLASS and CBS_CLASS*. To obtain the accuracy level, stratified k-fold cross validation is used for testing. In these experiments, CBS_CLASS uses the original PISA method to search for sequential patterns, while CBS_CLASS* uses PISA* with a single constraint. The constraint used is either an item constraint, a length-less constraint, or a length-more constraint. The item constraint requires sequential patterns to contain one specific item. The length-less constraint requires that the length of sequential patterns be less than a certain number. Conversely, the length-more constraint requires that the length of sequential patterns be greater than or equal to a certain number.

First, the data in each database Dm corresponding to class m were represented as a sequence database based on sales date and time. Figure 5 shows the sequence database representation. The sequence ID (SID) represents a sequence from the user's analysis point of view; in this case, the sequence ID is based on the sales date. Each SID consists of a series of itemsets, and each itemset consists of product IDs. The itemsets in each SID are grouped by the same timestamp; in this experiment, the timestamp is the sales hour. Moreover, the POI is used as a user-defined sliding time window between specified sales-hour intervals, so, based on sales hours, the maximum POI is 23. In this experiment, the POI is set to 5, meaning that the sliding time window spans 5 hours.

Figure 5. Sequence database representation in each Dm

Sequential patterns are sequences of itemsets, for example:
Sequential pattern 1: <(...), (...), ..., (...)>
Sequential pattern 2: <(...), (...), ..., (...)>
Sequential pattern m: <(...), (...), ..., (...)>

In this experiment, the classification process categorizes sequential patterns into two classes, prospect and non-prospect products. For example, sequential patterns generated by PISA* that act as features for the classification process are as follows.
Class of prospect products:
Sequential pattern 1: <(Samsung Galaxy V SM-G313HZ White), (Samsung Galaxy Tab S 8.4" SM-T705NT - Titanium Brown)>
Sequential pattern 2: <(Preorder Microsoft Lumia 535 Black), (Samsung Galaxy V SM-G313HZ White)>, and so on.
Class of non-prospect products:
Sequential pattern 1: <(Emtec USB 2.0 Flash Drive 8 GB Black Panther), (V-Gen Micro SDHC 16 GB)>
Sequential pattern 2: <(PNY Dual USB Flash Drive - OU1 16GB), (NiQue Laptop Backpack i-Protect Biru), (V-Gen Micro SDHC 16 GB)>, and so on.

The objective of the first experiment was to compare the number of sequential patterns produced by PISA in CBS_CLASS and by PISA* in CBS_CLASS*. The experiment was conducted at minimum support 0. Figure 6 shows that the number of sequential patterns obtained from PISA* in CBS_CLASS* is lower than in CBS_CLASS. This result confirms the hypothesis that CBS_CLASS* with the prior PISA* process can reduce the number of sequential patterns to only those that satisfy the user constraint.

Figure 6. Comparison of the number of sequential patterns at minimum support 0.
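To illustrate the sequence-database representation used above, in which the sales date serves as the sequence ID, the sales hour as the timestamp, and the product IDs sold in that hour as the itemset, the following is a minimal Python sketch; the flat (date, hour, product) record layout is an assumption made for illustration, not the paper's actual data schema.

from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

# Each raw transaction is assumed to be (sales_date, sales_hour, product_id).
Transaction = Tuple[str, int, str]

def build_sequence_db(transactions: List[Transaction]) -> Dict[str, List[FrozenSet[str]]]:
    """Group product IDs by sales date (sequence ID) and sales hour (timestamp),
    producing one ordered list of itemsets per sequence."""
    grouped: Dict[str, Dict[int, set]] = defaultdict(lambda: defaultdict(set))
    for date, hour, product in transactions:
        grouped[date][hour].add(product)
    return {date: [frozenset(items[hour]) for hour in sorted(items)]
            for date, items in grouped.items()}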
The objective of the second experiment was to compare the classification execution time of CBS_CLASS and CBS_CLASS*. The experiment was conducted at minimum support 0. Figure 7 shows that the classification execution time of CBS_CLASS* is much lower than that of CBS_CLASS. This result confirms the hypothesis that CBS_CLASS* with the prior PISA* process reduces the classification execution time, since the classification processes a smaller number of sequential patterns, which therefore do not burden the classification process.

Figure 7. Comparison of classification execution time at minimum support 0.

The third experiment compared the accuracy level of CBS_CLASS* and CBS_CLASS at minimum support 0. As shown in Figure 8, the accuracy level of CBS_CLASS* is higher than that of CBS_CLASS. From this result it can be concluded that the smaller number of sequential patterns satisfying the user's constraint increases the classification accuracy. Moreover, checking subsequences of the test data against the list of CSPs increases the classification accuracy level. This meets the expectation that, besides reducing the classification execution time, CBS_CLASS* is capable of maintaining the accuracy level of the classification.

Figure 8. Comparison of classification accuracy at minimum support 0.

The fourth experiment compared the classification execution time of CBS_CLASS* and CBS_CLASS at several minimum supports. Figure 9 shows that CBS_CLASS* consumes less classification execution time than CBS_CLASS at those minimum supports. This meets the expectation that CBS_CLASS* is more efficient than CBS_CLASS.

Figure 9. Comparison of classification execution time at different minimum supports

The fifth experiment compared the accuracy level of CBS_CLASS* and CBS_CLASS at several minimum supports. Figure 10 shows that CBS_CLASS* maintains the accuracy level compared to CBS_CLASS. It is concluded that CBS_CLASS* can lower the execution time and maintain the accuracy level as well.

Figure 10. Comparison of classification accuracy at different minimum supports

CONCLUSION
The experiment results proved that the proposed method, CBS_CLASS*, gives significantly better classification execution time than CBS_CLASS and can also maintain the accuracy level. Compared to CBS_CLASS, CBS_CLASS* reduces the classification execution time by 89%. This is because CBS_CLASS* only processes the smaller number of sequential patterns that satisfy the user's constraint, so the classification execution time is reduced. Moreover, checking subsequences of the test data against the list of CSPs can increase the classification accuracy level. This result gives good prospects for further research to incorporate multiple constraints in sequential pattern mining and integrate them with the classification process; we believe such an approach can yield even better performance.

ACKNOWLEDGEMENTS
We would like to thank Prof. Sitohang and Dr. Saptawati for their guidance, assistance with this research, and comments that greatly improved the manuscript.

REFERENCES