International Journal of Electrical and Computer Engineering (IJECE), February 2019, pp. 409-416
ISSN: 2088-8708, DOI: 10.11591/ijece

Granularity analysis of classification and estimation for complex datasets with MOA

Chanintorn Jittawiriyanukoon
Graduate School of eLearning, Assumption University, Thailand

Article history: Received Mar 15, 2018; Revised Jul 26, 2018; Accepted Aug 16, 2018

Keywords: Big data curation; Classification; Estimation; Granularity level; MOA; Parallel processing; Regression based machine

ABSTRACT
Dispersed and unstructured datasets make it difficult to determine the exact amount of space they require. Depending upon the size and the distribution of the data, especially when the classes are strongly associated, the level of granularity needed for a precise classification of the datasets grows. Data complexity is one of the major attributes governing the proper level of granularity, as it has a direct impact on performance. Dataset classification is a vital step in complex data analytics, designed to ensure that a dataset is ready to be efficiently scrutinized. Data collections routinely contain missing, noisy and out-of-range values. Analytics performed on data that has not been carefully classified for such problems can produce unreliable outcomes. Hence, classification of complex data sources by machine learning algorithms helps safeguard the accuracy of the gathered datasets. Dataset complexity and pre-processing time reflect the effectiveness of each algorithm. Once the complexity of a dataset is characterized, the comparatively simpler dataset can be investigated further with parallelism. Speed-up performance is measured by execution in MOA. The proposed classification approach outperforms existing methods and improves the granularity level of complex datasets.

Copyright © 2019 Institute of Advanced Engineering and Science. All rights reserved.

Corresponding Author:
Chanintorn Jittawiriyanukoon, Graduate School of eLearning, Assumption University, Thailand
Email: pct2526@yahoo.

1. INTRODUCTION
Complex datasets shape both the prospects and the inquiries that affect data analytics. The complexity of a dataset indicates the difficulty a data scientist experiences while curating insights: a complex dataset is usually more problematic to classify than a regular one and generally requires a diverse set of technical approaches to do so. Complex datasets require increased effort to outline the data prior to visualization and curation. Characterizing the complexity of datasets is therefore essential, and the forthcoming complexity has to be taken into account as well. Big data represents a complex dataset; the massive amount of data slows even high-speed computers down to a near-bottleneck stage when calculating and extracting insights. Other implications derive from distinct sources. Various sources can generate disorganized datasets or datasets that follow dissimilar structures, and the data must be pre-processed in order to comply with the primary repository format. In order to iron out the bottleneck of complex dataset processing, data transformation and refining steps (pre-processing) help reduce processing power and time. Besides, a data mining approach based upon the integration of knowledge has been introduced. The pre-processing steps of business-oriented data are opted to form an ontology ambitious information system (OAIS). The knowledge base is then used to help sort out the post-processing of interpretation.
Finally, the integration of objective and subjective criteria in teaching is evaluated to develop expert knowledge. Pre-processing of datasets incorporates normalization, attribute extraction, noise removal, classification and structure re-configuration. Nawi et al. have presented an artificial neural network based algorithm for data pre-processing. The algorithm has turned out to be widely applicable and has become an analytical tool for data mining, pattern recognition and machine learning. Big data has been mined using a parallelism approach as introduced in [ref]; however, that mining approach does not mention how to discard redundant and messy data, which is an important pre-processing step. The relation between pre-processing and complex datasets under different technological approaches has been studied experimentally in [ref]. Various frameworks for analytical tools such as Flink, Spark and MapReduce have also been issued for complex data learning. Insights from big data curation and the infrastructure for analytics at Twitter have been presented by [ref]. A dynamic role in assisting data scientists with big data is emphasized there, but comprehensive insights are not available. Data analytics from several algorithms must be aggregated into a production system, yet they succeed in sharing outputs for academic study of Twitter.

In this research, the performance of several pre-processing models is investigated in order to specify the granularity level, decrease noisy samples and correct possible errors in the training samples. The main objectives are to confirm the accuracy of classification, to simplify the computation and to improve the pre-process. Bayesian, Boosting, Nearest Neighbor and the proposed classification models are introduced in this paper. Additionally, the complex datasets proceed to be executed in a post-processing environment. To accelerate the post-processing calculation, the parallel processing system presented in [ref] is employed. The MOA simulation results and the speed-up performance are summarized. In the simulations, complex datasets obtained from a public repository are used. The remainder of the paper is organized as follows: Sections 2 and 3 present the theoretical context of complex dataset characteristics and the pre-processing approaches, respectively. Section 4 presents the parallel estimation model. Results and analysis are finally given in Section 5.

2. COMPLEX DATASETS
It is known that there is a debate about "big data". It is about complexity per se; the difficulty in handling the data is a matter of size. Enormous effort is needed to make use of a big volume of data, if only to point out where to manipulate it. Complexity reflects a tedious task; even a trivial dataset can exhibit complexity that makes it hard for data scientists to mine with current techniques. Data from various senders, or different datasets from the same sender, is structured dissimilarly. For instance, one unit keeps a few different files, while another unit stores the information in a database. Furthermore, some database instances contain duplicate content that is identical to the file content. Making use of data from multiple sources, without duplicating or losing information, necessitates a pre-processing task (a small illustrative sketch of such a merge-and-deduplication step is given below). By the definition of "big data", the collected data size can upset both the processing units and the applications used to analyze it. The size can be in petabytes (PB); the taller the dataset, the more problematic it is to squeeze it into built-in memory while processing.
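To make the multiple-source issue concrete, the short sketch below merges two differently structured sources into a common schema and drops duplicate records. It is an illustrative example only, written with the pandas library and made-up column names; it is not part of the paper's pre-processing pipeline.

import pandas as pd

# Two hypothetical sources describing the same entities with different schemas.
files_part = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [10.0, 5.5, 7.2]})
db_part = pd.DataFrame({"cust_id": [3, 4], "amount": [7.2, 9.9]})

# Align the second source to the common schema, merge, and drop exact duplicates.
db_part = db_part.rename(columns={"cust_id": "customer_id"})
merged = pd.concat([files_part, db_part], ignore_index=True).drop_duplicates()
print(merged)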
Let A denote a given dataset matrix which contains a rows and b columns, [Ai1, Ai2, Ai3, ..., Aib] for each i = 1, 2, 3, ..., a. The matrix A is presumed to be a deterministic set. Obviously, the state space of the dataset becomes (a, b) and the computational cost is O(a x b). The level of granularity is vital for the development of a full report or dashboard and for data integration or visualization. It is simpler for a developer to drill down into the finest detail of the datasets; nevertheless, there is a balance between data indexing and the computational cost of analytical depth. Data curation that favors granular drill-down involves a larger, ad hoc amount of data, because data integration, summarization and pre-processing are skipped.

Diverse databases communicate in dissimilar query languages. Structured Query Language (SQL) is the principal means of querying data from a central relational database, but if third-party hardware is used then its syntax and API have to be interfaced, and additionally the communication protocols and the internal database structure must be understood to gain access. An analytical tool has to be elastic in order to allow a built-in connection to the destined database through an API; otherwise a bulky process of extracting the data into an SQL database or warehouse cannot be avoided. Processing multimedia data warehoused in table style is a burden, but unstructured massive data is another tedious task, since it is a rich-text oriented dataset plus video and audio streams. Various types of data follow diverse rules, and settling on a single type of truth data among them all is critical in order to support decision making. Disseminated data occurs whenever data is stored in several places, for instance at the workplace, in clouds, or in different branches. These data are isolated, and collecting them all is not easy. Moreover, after collection, some standardization, normalization and cleansing are compulsory before the different datasets can be cross-referenced and manipulated. Location-based datasets are gathered according to the related objectives and applications. Lastly, not only the current data is taken into account but also the forthcoming speed of the data (growth rate); it may be altering or rising. If the datasets are frequently updated, meaning that additional datasets are being appended, this inflates the required computational resources and amplifies the aforementioned complexities of type, size and format.

In practice, when complexity occurs in the data, the development of analytical tools is needed, based on either clustering analysis or a classification method. Even if such a tool irons out all data analysis problems, a dataset such as those shown in Figure 1 may arise. Note that it can be estimated neither by a straight line nor easily segmented into clusters; it is complex per se, as it demonstrates spherical, recurring or loopy structures. Figure 1 shows examples of complex data whose characteristics traditional techniques cannot fully classify.

Figure 1. Example structures of complex data

3. PREPROCESSING METHODS
In this section, the pre-processing approaches are described. Our proposed method, which is applicable to complex data, the classification algorithms and a comprehensive discussion are given.

Bayesian classification
One of the classical predictors is the Bayesian classifier, built on the simple hypothesis that all input parameters are assumed to be independent. This classification is recognized for its minimum computational cost as well as its low complexity. Let there be m different classes (C1, C2, C3, ..., Cm), and let the trained Bayesian classifier predict that X belongs to class Ci with high accuracy. The classification model performs as follows. Let each tuple be an n-dimensional attribute vector X = (x1, x2, x3, ..., xn) over n finite attributes, and suppose X is assigned to the class Ci such that P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m and j ≠ i. The Bayesian classifier calculates the probability of Ci as P(Ci|X) = P(X|Ci) P(Ci) / P(X). The values P(X) and P(X|Ci) are approximated from the training dataset (an n-dimensional table of tuples). The algorithm simply accumulates the counts as it takes in a new batch of examples. The Bayesian classification algorithm is described in Figure 2.

Algorithm Bayesian
Require: dataset matrix [A] which contains a rows and b columns
Ensure: [A]a x b
for i = 1 to a do
  for j = 1 to b do
    Build a frequency table for all the features against Ci
    Construct the likelihood table for the features against Ci
    Compute the conditional probabilities for Ci
    Compute the maximum probability for Ci
  end for
end for

Figure 2. Bayesian algorithm
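As an illustration of the rule above, the following minimal sketch builds frequency and likelihood counts in the spirit of Figure 2 for a toy categorical dataset and picks the class maximizing P(X|Ci) P(Ci). It is plain Python under simplifying assumptions (toy data, no Laplace smoothing) and is not the MOA implementation used in the experiments.

from collections import Counter, defaultdict

# Toy training data: two categorical attributes and a binary class label.
X = [("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "hot")]
y = ["no", "yes", "yes", "no"]

class_counts = Counter(y)                      # class frequencies -> priors P(Ci)
likelihood = defaultdict(Counter)              # (class, attribute index) -> value counts
for xs, c in zip(X, y):
    for j, v in enumerate(xs):
        likelihood[(c, j)][v] += 1

def classify(x):
    scores = {}
    for c, nc in class_counts.items():
        p = nc / len(y)                        # prior P(Ci)
        for j, v in enumerate(x):
            p *= likelihood[(c, j)][v] / nc    # independence assumption: product of P(xj|Ci)
        scores[c] = p                          # proportional to P(Ci|X); P(X) cancels out
    return max(scores, key=scores.get)         # class with maximum posterior probability

print(classify(("sunny", "hot")))              # -> "no" on this toy data
# A practical implementation would add Laplace smoothing to avoid zero counts.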
Boosting classification
Boosting denotes an algorithm which converts weak (fragile) learners into strong learners. The weighting parameter decomposes the matrix A into two equal parts: the first half of the weight is allocated to the correctly classified part, and the second half is assigned to the misclassified part. A Poisson distribution is employed to compute the random probability used to train the model. The key concept of boosting is to accept a sequence of weak learners, with a weight parameter applied to the samples that were wrongly classified in the previous iteration; the weighting parameter is altered according to the boosting weight as the computation proceeds through each round. The estimation keeps accumulating through a weighted sum or a weighted majority to produce the final result. The algorithm listed in Figure 3 explains the boosting iteration, where I[·] equals 1 when its condition holds and 0 otherwise.

Algorithm Boosting
Require: dataset matrix [A] which contains a rows and b columns
Ensure: [A]a x b decomposed into [A1] and [A2]
N = dimension of [A]
Set: initial weight parameter wn = 1/N
for i = 1 to a do
  for j = 1 to b do
    for k = 1 to K do
      Accept classifier Ck after minimizing the error of the weight parameter Ek
      Compute Ek = Σn wn I[Ck(xn) ≠ yn]
      Compute εk = Σn wn I[Ck(xn) ≠ yn] / Σn wn
      Compute αk = ln((1 - εk) / εk)
      Randomize through a Poisson distribution to update the weight parameter:
        wn = wn exp{αk I[Ck(xn) ≠ yn]}
    end for
    Estimate using the final result YK(x) = sgn(Σk αk Ck(x)), YK(x) ∈ {-1, 0, 1}
  end for
end for

Figure 3. Boosting algorithm

Nearest neighboring classification
Nearest neighboring with k neighbors (k-NN), when used in classification, has several properties that differ from the algorithms described above. It is non-parametric and requires no hypotheses about the probability density function of the inputs. In case of an unknown input distribution, k-NN is healthier than parametric methods; however, parametric algorithms tend to generate fewer errors because they take the input probability distribution into account. k-NN is a lazy machine learning algorithm, which analyzes the data during the testing phase rather than in the training period. An advantage of lazy k-NN is that it rapidly adjusts to any changes, as it does not assume a complete dataset from the beginning. A major disadvantage, however, is the huge computational cost incurred during the testing period. In k-NN classification, an input is classified by the majority vote of its k nearest neighbors. The algorithm is presented in [ref].
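The short sketch below illustrates lazy k-NN classification, using k = 5 and k = 15 to mirror the NN5 and NN15 configurations compared in the result tables. It relies on scikit-learn and synthetic data purely as stand-ins for illustration; the paper's experiments use MOA.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, illustrative data (not one of the paper's four datasets).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_tr, y_tr)                          # "lazy": fitting only stores the training set
    print(k, knn.score(X_te, y_te))              # the cost is paid at prediction (testing) time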
Proposed classification
The proposed method is a logistic regression based learner which combines classifiers in order to maximize the probability of the monitored values. At the base level of calculation, there are diverse learning algorithms that are trained individually based upon a perfect training set. This is unlike other algorithms, which opt for the sample values that minimize the sum of squared errors. The proposed method involves combining the pre-processing techniques and post-processing their outputs at a deep learning level. Note that the original learners are not customized, while the proposed mechanism aims at higher classification accuracy and higher performance on complex datasets. The proposed model is trained on the meta-outputs from the base level of calculation. The algorithm is depicted in Figure 4.

Proposed Algorithm
Require: dataset matrix [A] which contains a rows and b columns
Ensure: [A]a x b and M classifiers
N = dimension of [A]
for i = 1 to a do
  for j = 1 to b do
    for k = 1 to P do                  /** Base level calculation **/
      Train learner Mk with dataset A
    end for
    for q = 1 to N do                  /** Maximize probability based on regression **/
      Am = [a'q, bq], where a'q = m0 + m1 aq + m2 aq^2 + ... + mP aq^P
    end for
    Apply learner M with Am            /** Deep level calculation **/
    Restore M
  end for
end for

Figure 4. Proposed algorithm
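As a rough illustration of the proposed stacking idea, the sketch below trains Bayesian, boosting and k-NN base learners and combines their outputs with a logistic regression meta-learner. It uses scikit-learn's stacking ensemble and synthetic data purely as assumed stand-ins; the paper's own experiments run in MOA, and the exact combination of base outputs follows the regression of Figure 4 rather than this library default.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

base = [("bayes", GaussianNB()),                         # base-level learners, trained individually
        ("boost", AdaBoostClassifier(random_state=1)),
        ("knn5", KNeighborsClassifier(n_neighbors=5))]
meta = StackingClassifier(estimators=base,
                          final_estimator=LogisticRegression(max_iter=1000))
meta.fit(X_tr, y_tr)                                     # meta-learner fitted on the base outputs
print(meta.score(X_te, y_te))                            # classification accuracy on held-out data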
Granularity and performance
In a pre-processing approach, the number of classes observed for the process designates a diverse distribution of the dataset. As far as performance is concerned, it implies the dispersion of the original dataset among the classifiers. Granularity is used to measure the level of hierarchy (in a decision tree), the relative size, the level of detail, the depth of penetration and the scale in a dataset. Accordingly, the performance of any classification differs based on the number of selected classes. One reason is that the capability of the learning algorithms becomes less rational under data shortage. However, higher granularity develops the structure of a healthier model, owing to the detail of the state space. In this research, the following focuses are fulfilled. Firstly, the dependency on the granularity level in complex datasets is investigated, and the classifiers for an experimental learner with complex datasets are chosen. Secondly, the training results list the benefit of a higher granularity for all datasets. Lastly, the model that is robust in terms of data granularity is further analyzed under higher processing power in order to examine the speed-up performance and the efficiency. The following metrics are used to evaluate the performance of the proposed technique. The accuracy is the number of acceptable classifications over the total number of instances. The processing time consumed by each classifier is quantified for the efficiency comparison. The speed-up reflects the performance of a parallel processing system in comparison with a slower (sequential) version; it is computed as the sequential time over the parallel reference time.

4. ESTIMATION METHOD
The open-source simulation tool called MOA is employed for the analytics. Four complex datasets have been selected, and the granularity analysis of the pre-processing methods has been accumulated. The execution has been run on a Fujitsu Windows 8 machine with an Intel Core i5 CPU, a 2.67 GHz processor and 8 GB of RAM on board. The datasets have been selected so that they differ in the number of attributes, instances, details and size. Datasets 1, 2, 3 and 4 are run on a single server (M/M/1), and each dataset is also divided into 4 subtasks to be independently processed on four parallel processors (M/M/4). The parallel processing time is inclusive of the splitting time and the re-assembling time (an illustrative sketch of this split-process-reassemble scheme is given after Table 11). The splitting is based upon software developed by [ref], and the simulation model is shown in Figure 5. A performance evaluation of parallel processing for reducing problem complexity and time is also presented in [ref]. The simulation results for one and four processing units are depicted in Table 11.

Figure 5. Simulation model

5. RESULTS AND ANALYSIS
The Mean Absolute Error (MAE), Root Mean Squared Error (RMSE) and simulation runtime for the four datasets are tabulated in Table 1. The granularity and completeness of these four datasets are shown in Tables 2-5. It can be seen that datasets 2-4 are complete, while only dataset 1 contains a high percentage of missing values and is considered an incomplete dataset. The performance of the pre-processing methods described in Section 3 is reported over all metrics, namely the Area Under the Receiver Operating Characteristic curve (AUROC), Classification Accuracy (CA) and precision. The pre-processing performance evaluations for each dataset are shown in Tables 6-9. In all cases the proposed method marginally outperforms the others. The proposed method's pre-processing time in msec is then taken into account in order to compute the parallel processing (post-processing) in the simulation model shown in Figure 5. In order to compare with other research, the Naive Bayes (NB) in Spark pre-processing mechanism is considered. Note that NB-Spark reports only AUROC, as depicted in Table 10. The speed-up metric for these four datasets is calculated from the simulation results as displayed in Table 11. In the case of datasets #3 and #4, the pre-processing time improves the speed-up, as it differs significantly from the post-processing time.

Table 1. MOA simulation results (MAE, RMSE and runtime for datasets 1-4)
Table 2. Granularity of dataset #1 (skewness, kurtosis, dispersion and missing (%) per attribute)
Table 3. Granularity of dataset #2 (skewness, kurtosis, dispersion and missing (%) per attribute)
Table 4. Granularity of dataset #3 (skewness, kurtosis, dispersion and missing (%) per attribute)
Table 5. Granularity of dataset #4 (skewness, kurtosis, dispersion and missing (%) per attribute)
Table 6. Pre-processing performance of dataset #1 (runtime, AUROC (%), CA (%) and precision (%) for Boost, NN5, NN15, Bay and Proposed)
Table 7. Pre-processing performance of dataset #2 (runtime, AUROC (%), CA (%) and precision (%) for Boost, NN5, NN15, Bay and Proposed)
Table 8. Pre-processing performance of dataset #3 (runtime, AUROC (%), CA (%) and precision (%) for Boost, NN5, NN15, Bay and Proposed)
Table 9. Pre-processing performance of dataset #4 (runtime, AUROC (%), CA (%) and precision (%) for Boost, NN5, NN15, Bay and Proposed)
Table 10. Pre-processing performance comparison on dataset #4 (Boost, NN5, NN15, Bay, NB-Spark and Proposed; NB-Spark reports AUROC only, other entries N/A)

Table 11. Results comparison for one and four processing units (pre:post residual time)
           M/M/1     M/M/4     Speed-up
Dataset 1  74:109    74:42
Dataset 2  46:60     46:18
Dataset 3  9:286     9:109
Dataset 4  80:1591   80:459
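As a rough illustration of the split-process-reassemble scheme of Figure 5, the sketch below divides a workload into four subtasks, processes them on four worker processes, and reports the speed-up as the sequential time over the parallel time. The workload, the library choice (Python multiprocessing) and the timings are synthetic stand-ins for illustration and are not the paper's MOA measurements.

import time
from multiprocessing import Pool

import numpy as np

def post_process(chunk):
    # Stand-in for the post-processing applied to one subtask.
    return np.sort(chunk).sum()

if __name__ == "__main__":
    data = np.random.rand(4_000_000)

    t0 = time.perf_counter()
    total_seq = post_process(data)                 # one processing unit (M/M/1)
    t_seq = time.perf_counter() - t0

    t0 = time.perf_counter()
    chunks = np.array_split(data, 4)               # splitting into 4 subtasks
    with Pool(processes=4) as pool:
        parts = pool.map(post_process, chunks)     # four processing units (M/M/4)
    total_par = sum(parts)                         # re-assembling the partial results
    t_par = time.perf_counter() - t0               # includes splitting and re-assembling time

    print("speed-up = %.2f" % (t_seq / t_par))     # sequential time / parallel reference time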
6. CONCLUSION
In a parallel processing system, several processing units are connected in parallel with each other, and this combined structure is fed with a complex dataset. Since the dataset is complex, pre-processing techniques are compulsory. The proposed pre-processing algorithm has been introduced and outperforms the other existing methods in both the CA and the precision analysis. The proposed classification method also improves the granularity level of complex datasets. Finally, parallel processing is employed to measure the post-processing time and the speed-up metrics. It is clear that dataset complexity and pre-processing time reflect the effectiveness of each algorithm, and the speed-up is based on the runtime of the MOA simulation. Future research will consider an approximation technique in order to lessen the processing time complexity incurred by the simulation. A follow-up publication will touch on the concept of optimizing both CA and precision in the pre-process.

REFERENCES