JUSIKOM PRIMA (Jurnal Sistem Informasi dan Ilmu Komputer Prima)
Vol. 6, August 2022
E-ISSN: 2580-2879
K-MEANS ALGORITHM DATA MINING
IN SALES LEVEL ANALYSIS
Dodi Nofri Yoliadi
Fakultas Ushuluddin Adab dan Dakwah, UIN Mahmud Yunus Batusangkar
Jl. Jenderal Sudirman, Limo Kaum, Kec. Lima Kaum, Kabupaten Tanah Datar, Sumatera Barat 27217
E-mail: dodinofriyoliadi@uinmybatusangkar.
ABSTRACT- In this research, sales data for electronic goods were collected from CV. Berkah Elektronik and processed. The data were then clustered with the K-Means algorithm to learn which electronic products sell well on the market and which do not. Clustering was carried out with the Tanagra 1.4.48 software, and the K-Means algorithm was used to draw conclusions about which items sell well and which do not, with the price of goods and the number of goods sold as inputs. Manual calculation and testing with the Tanagra application produce the same clusters, namely products that do not sell well (Cluster_KMeans_2) and products that sell well (Cluster_KMeans_1).
Keywords: Clustering, Sales, K-Means Algorithm, Data Mining, Tanagra.
INTRODUCTION
Data mining is a technique used to find information in a database. It typically uses mathematics and machine learning to extract and identify useful information from large databases. Data mining is the process of extracting information from very large collections of data. In fact, data mining is one step of KDD (knowledge discovery in databases). Knowledge discovery as a process consists of data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation.
Computerized systems are now widely used by businesses to obtain important information about their marketing activities. One type of company that uses such systems is a business that sells electronic goods. For these companies, the function of data processing is to obtain precise, accurate, and easy-to-understand information about sales of electronic goods. Various techniques can be used to obtain this information, one of which is the clustering technique available in data mining, namely the K-Means clustering algorithm.
Among clustering techniques, K-Means is one of the simplest and easiest. In this technique, objects are grouped into k groups, and the value of k must be determined before clustering begins. K-Means is simple and easy to implement, and it can group large amounts of data and create clusters quickly. The goal of the K-Means algorithm is to cluster objects in such a way that the distance of each object to the center of its cluster is minimal.
Data grouping can be applied to analyze the level of sales of electronic goods at CV. Berkah Elektronik. The sooner the level of sales of electronic goods is known, the better, because the company can then make decisions. Analyzing the sales level of the available products with the K-Means clustering algorithm is expected to provide accurate and precise information about the types of electronic goods that sell the most and are most in demand by buyers.
THEORY STUDY
1 Data Mining
Data mining is a technique used to find information in a database. It typically uses mathematics and machine learning to extract and identify useful information from large databases. Data mining refers to the process of extracting information from very large data sets. In fact, data mining is one step of KDD (knowledge discovery in databases). Knowledge discovery as a process consists of data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation.
2 Data Mining Cycles
According to CRISP-DM, the data mining life cycle is divided into six phases. The sequence of phases is adaptive: the next step depends on the output of the previous one. The six CRISP-DM phases are:
1. Business Understanding phase
2. Data Understanding phase
3. Data Preparation phase
4. Modeling phase
5. Evaluation phase
6. Deployment phase
3 Clustering
Clustering is a well-known and widely used technique in data mining. Researchers in the field are still working to improve clustering models because the methods developed so far are still heuristic. Work on determining the optimal number of clusters and the best clustering is still in progress. Therefore, current methods cannot guarantee that the clustering results will be optimal. However, from a practical point of view, the results achieved are usually quite good.
The main objective of the clustering method is to group a set of data or objects into clusters (groups) in such a way that each cluster contains data that are as similar as possible. In clustering, we try to place similar (close) objects in one cluster and make the distance between clusters as large as possible.
4 Cluster Analysis
One of the data mining methods used in this study is clustering, in which the method identifies objects that share certain characteristics and then uses these characteristics as characteristic vectors, or centroids. Cluster analysis finds groups of objects such that the objects in one group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The goal of cluster analysis is to minimize the distance within a cluster and maximize the distance between clusters.
Clustering can be applied to data that are quantitative (numerical), qualitative (categorical), or a combination of both. The data can come from observations of a process. Each observation consists of n measured variables, grouped into an n-dimensional vector $z_k = [z_{1k}, \ldots, z_{nk}]^T$, $z_k \in \mathbb{R}^n$. A set of N observations is denoted by $Z = \{ z_k \mid k = 1, 2, \ldots, N \}$ and is represented as an n x N matrix.
5 K-Means Algorithm
The K-Means algorithm is one of the most popular and widely used clustering algorithms in industry. It is based on a simple idea. First, determine how many clusters are to be formed. Any object, or the first element of a cluster, can be chosen to act as the center of the cluster. Several alternatives for implementing K-Means have been proposed, together with developments of the related computational theory. These include the choice of:
1. the distance space used to calculate the distance between the data and the centroid;
2. the method for allocating data back to each cluster;
3. the objective function that is used.
K-Means is a partitional clustering method. Every data point must belong to exactly one cluster, and a data point that belongs to one cluster at one step of the process may move to another cluster at the next step. K-Means divides the data into k separate partitions, where k is a positive integer. The K-Means algorithm is known for its simplicity and its ability to cluster large amounts of data, including outliers, very quickly.
6 K-Means Technique
K-Means takes an input parameter k and divides a set of n objects into k clusters, so that intra-cluster similarity is high and inter-cluster similarity is low. Cluster similarity is measured with respect to the mean of the objects in the cluster, which can be seen as the cluster's center of gravity or center of mass. The K-Means algorithm then repeats the following steps until stability is achieved (no object can be moved to another cluster):
1. Determine the center coordinates of each cluster.
2. Determine the distance of each object to each center.
3. Group the objects according to their minimum distance.
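The repeated steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name `kmeans`, the `max_iter` safeguard, and the sample points are all hypothetical, and the initial centers are simply passed in by the caller.

```python
import math


def kmeans(points, initial_centers, max_iter=100):
    """Repeat the assign/update steps until no object changes cluster.

    points and initial_centers are sequences of equal-length numeric tuples.
    Returns (assignment, centers), where assignment[i] is the index of the
    cluster that points[i] belongs to.
    """
    centers = [tuple(c) for c in initial_centers]  # work on a copy
    assignment = None
    for _ in range(max_iter):
        # Distance of each object to each center; keep the nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stability: no object moved to another cluster
        assignment = new_assignment
        # Recompute each center as the mean of the cluster's members.
        for j in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centers


# Hypothetical two-dimensional sample, not data from the paper.
sample = [(1, 1), (1, 2), (8, 8), (9, 8)]
labels, final_centers = kmeans(sample, [(1, 1), (8, 8)])
```

Stopping when the assignment no longer changes is equivalent to stopping when the centers no longer move, since the centers are computed only from the assignment.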
RESEARCH METHOD
The research methodology comprises the steps that must be taken to carry out the research work and also serves as a guide for the researcher. Research is a series of scientific activities aimed at solving a problem. Its task is to find explanations and answers to problems and to show possible alternatives for solving them. A frame of reference is needed so that the research steps are carried out in a structured manner and the results obtained are maximized. This research framework is shown in Figure 1.
Figure 1. Research Framework
1 Literature Study
At this stage, the researcher reviewed the literature on data mining, clustering, and the K-Means algorithm. The literature came from various sources, such as textbooks, journals, websites, essays, e-books, and academic articles.
2 Information Collection
At this stage, data are collected from CV. Berkah Elektronik through direct observation. The data collected are sales data. After the data have been collected and examined, the next step is to group the sales data; the results of this grouping can be used to decide which problems are to be solved and which solutions are to be sought.
3 Formulate the Problem
After the data collection stage is complete, the next step is to formulate the question about the existing problem: how can the level of sales of electronic products at CV. Berkah Elektronik be determined with the K-Means algorithm based on sales data?
4 Analysis and Planning in General
This step describes the analytical methods commonly used in the design to determine the best-selling electronic goods.
5 System Implementation
In this phase, the data are processed with the Tanagra software to analyze and evaluate the results of the K-Means algorithm.
6 Testing Data with Tanagra
The next step after implementation is system testing. In this phase, a series of tests is carried out to ensure that the system behaves according to the designed model. The steps in the data mining testing mechanism are as follows:
Data selection: Data must be selected from the operational data set before the data mining phase in KDD begins. The selected data used in the data mining process are stored in a file separate from the operational database.
Interpretation/evaluation: The results must be presented so that they are understood by interested parties. This step is part of the KDD process called interpretation. In this phase, it is examined whether the patterns or information found contradict previously existing facts or hypotheses.
7 Draw Conclusions
At the final stage of the research, conclusions are drawn by comparing the results of the manual calculation with the results of the K-Means clustering algorithm in the system implementation phase.
RESULTS AND DISCUSSION
1 Data Analysis
CV. Berkah Elektronik keeps information about the company's sales activities. One of these is the data used to record the company's transactions, consisting of several attributes such as brand, type of goods, name of goods, price of goods, and number of goods sold. These data are used as attributes for data processing. A best seller is defined as an item that sells well (an item that sells in large quantities), while a non-seller is an item that does not sell well (it is not selling or has a sales volume below the average).
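The below-average criterion in this definition can be illustrated with a short Python sketch. The function name `sales_label`, the label strings, and the sample quantities are hypothetical, and treating every item at or above the average as a best seller is an assumption made here for illustration; the study itself assigns the labels by clustering, not by this rule.

```python
def sales_label(amount_sold, all_amounts):
    """Label one item by comparing its sales volume with the average volume."""
    average = sum(all_amounts) / len(all_amounts)
    # Assumption: "below the average" means non-seller, everything else best seller.
    return "non-seller" if amount_sold < average else "best seller"


# Hypothetical quantities sold for three items (average = 6).
amounts = [2, 4, 12]
labels = [sales_label(a, amounts) for a in amounts]
```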
2 K-Means Algorithm for Clustering Analysis
K-Means is a partitioning method of data mining. Every data point must belong to exactly one cluster, and a data point that belongs to one cluster at one step of the process may move to another cluster at the next step. K-Means divides the data into K separate partitions, where K is a positive integer. The K-Means algorithm is known for its simplicity and its ability to cluster large data sets, including outliers, very quickly.
The following flowchart of the K-Means algorithm explains the steps of the algorithm, assuming that the input parameters are the number of records in the data and the number of initial cluster centers, K = 2 in this research:
Pre-processing/cleaning: Before the data mining process can be carried out, the data that are the focus of KDD must be cleaned. The cleaning process includes, among other things, removing duplicate data, checking for inconsistent data, and correcting errors in the data, such as typos. Existing data may also be enriched with other data relevant to KDD, such as external data or information.
Transformation/coding: Coding is the process of transforming the selected data so that they are suitable for the data mining process. Coding in KDD is a creative process and depends heavily on the type or pattern of information to be retrieved from the database.
Interpretation/evaluation: The patterns produced by the data mining process must be presented in a form that is easily understood by interested parties.
Figure 2. K-Means Process Flowchart
Determination of the number of clusters: In this first stage, the number of clusters is determined using the data obtained.
Define the initial cluster centers: To determine the initial cluster centers, random numbers are generated that index records in the input data. The initial centers are therefore taken from the data themselves, by picking records at random, rather than by defining new points.
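Choosing the initial centers from the data by random record numbers can be sketched as follows. This is only an illustration: the function name `pick_initial_centers`, the optional `seed` argument (added so that a run can be repeated), and the sample records are assumptions, not part of the study.

```python
import random


def pick_initial_centers(points, k, seed=None):
    """Pick k distinct records from the data to act as initial cluster centers."""
    rng = random.Random(seed)
    # Random record numbers: k distinct positions in the input data sequence.
    indices = rng.sample(range(len(points)), k)
    return [points[i] for i in indices]


# Hypothetical (price, amount sold) records, not data from the paper.
records = [(1.5, 20), (2.0, 18), (7.5, 3), (9.0, 2), (4.0, 10)]
centers = pick_initial_centers(records, k=2, seed=0)
```

Because the centers are sampled without replacement, two clusters never start from the same record.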
Calculate the distance of each object to the cluster centers: The distance between the data and the cluster centers is measured with the Euclidean distance. Take the data value and the cluster center values, then calculate the Euclidean distance from the data to each cluster center.
Grouping data objects: The calculated distances are compared, and the shortest distance between the data and a cluster center is selected; this distance indicates that the data belong to the same group as the nearest cluster center. The data grouping procedure is as follows:
1. Take the distance from the data to the center of each cluster.
2. Find the smallest distance value.
3. Assign the data to the cluster whose center is closest.
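The distance measure and the grouping rule described above are commonly written as follows. This is the standard form of the Euclidean distance, given here as a sketch; the symbols $x_i$ (a data record), $c_j$ (a cluster center), and $n$ (the number of attributes) are notation assumed for illustration and do not come from the paper.

```latex
d(x_i, c_j) = \sqrt{\sum_{m=1}^{n} \left( x_{im} - c_{jm} \right)^2},
\qquad
\operatorname{cluster}(x_i) = \arg\min_{j \in \{1, \ldots, k\}} d(x_i, c_j)
```

With the two input attributes used in this study, price and number of goods sold, the sum has only two terms ($n = 2$).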
Establishment of a new cluster center: A new cluster center is calculated as the average of the members of the cluster. The new cluster centers are used in the next iteration if the results have not yet converged. The iteration stops when the maximum number of iterations entered by the user is reached or when the results have converged (the new cluster centers are the same as the old cluster centers). To determine the cluster centers, find the number of members in each cluster and then calculate the new center using Formula 2.
Calculate the distance to the new cluster centers: Calculate the Euclidean distance from all data to the new centers (C1 and C2), as in the earlier distance calculation step. Once the distances are obtained, compare them as before.
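The center update and the stopping test described above are commonly written as follows. This is the standard K-Means form rather than a reproduction of the paper's Formula 2; the symbols $S_j$ (the set of records currently assigned to cluster $j$), $|S_j|$ (its number of members), and the iteration index $t$ are notation assumed here.

```latex
c_j^{(t+1)} = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i ,
\qquad
\text{stop when } c_j^{(t+1)} = c_j^{(t)} \text{ for all } j
\ \text{ or } \ t = t_{\max}
```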
4 The Clustering Process Uses the K-Means Algorithm
In this phase, the main process is carried out on the brand-based merchandise sales data of CV. Berkah Elektronik, that is, the segmentation or grouping of the goods sales data found in the sales reports with the K-Means clustering method. The sales data obtained from the sales reports are used to apply the K-Means algorithm to the sale of goods. The experiments were carried out with the following settings: number of clusters: 2; number of records: 37; number of attributes: 5.
Determination of the number of clusters: The number of clusters is determined from the existing sales data. The three target attributes in this case are brand, product, and type.
Initial cluster center determination: The initial cluster centers (centroids) are chosen at random from the existing sales data. For the initial determination, it is assumed:
Cluster center 1: (…, …)
Cluster center 2: (…, …)
Calculation of object distance to cluster center: The distance between the data and the cluster centers is measured with the Euclidean distance. From the 37 data points prepared as samples, the initial cluster centers were selected, namely C1 = (…, …) and C2 = (…, …). The distance from each remaining sample to the cluster centers is then calculated, for example with M.
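The distance calculation for a single record can be worked through in Python as follows. All numbers in this sketch are hypothetical and chosen only for illustration; they are not the centers or records used in the study, and the function name `distances_to_centers` is likewise an assumption.

```python
import math


def distances_to_centers(record, centers):
    """Return the Euclidean distance from one record to every cluster center."""
    return [math.dist(record, center) for center in centers]


# Hypothetical (price, amount sold) values, NOT the paper's data.
c1 = (2.0, 10.0)
c2 = (6.0, 2.0)
record = (3.0, 7.0)

d1, d2 = distances_to_centers(record, [c1, c2])
# The record joins the cluster whose center is nearest.
nearest = 1 if d1 <= d2 else 2
print(f"d(record, C1) = {d1:.3f}, d(record, C2) = {d2:.3f}, cluster = {nearest}")
```

Here the squared differences are 1 + 9 for C1 and 9 + 25 for C2, so the record is closer to C1.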
3 Interpretation/Evaluation
Interpret the patterns resulting from data mining, and then check whether the patterns or information found contradict previously existing facts or hypotheses.
5 System Implementation
Testing the results of the analysis is very important to verify whether the results are correct. This test checks the accuracy of the results of manual data processing. The stages of testing to determine the level of sales of electronic goods at CV. Berkah Elektronik use sample sales data and the Tanagra 1.4.48 software, with the following steps:
Figure 3. Database View
After determining the database to be used, we start working in the Tanagra 1.4.48 application.
Figure 4. Dataset Display
After the dataset is displayed, add a DEFINE STATUS component and set price and amount as inputs in Define Attribute Statuses. In the same way, set brand, product, and type as targets in Define Attribute Statuses.
Figure 5. Display Define Status 1 Target
In the next stage, open Clustering in the menu at the bottom, select K-Means, and drag it onto Define Status 1. Then open the parameters of K-Means to display the K-Means Parameters screen. To display the K-Means results, use Scatterplot in the Data Visualization menu.
Figure 6. Scatterplot Display
The next step is to take Export Dataset from the Data Visualization menu and place it under K-Means 1, where it appears as Export Dataset 1. To see its results, select its parameters and set where to store the output of the whole process, from start to finish, as shown below:
Figure 7. DIg Op Prm Export Dataset View
After the storage location has been set, the stored output can be viewed easily through Export Dataset 1, as shown in the following figure.
Figure 8. View Export Dataset 1
In the next stage, we view the output of the processed data.
Figure 9. Display Output for 60 Records
Figure 10. Display Output for 60 Records (Continued)
The output in Figure 10 is read as follows. If Cluster_KMeans_1 (c_kmeans_1), then sales level = selling well. If Cluster_KMeans_2 (c_kmeans_2), then sales level = not selling.
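The rule for reading the cluster labels in the exported output can be expressed as a simple lookup. This is a sketch under stated assumptions: the label strings `c_kmeans_1` and `c_kmeans_2` are taken from the text, while the function name `sales_level`, the wording of the sales levels, and the sample rows are hypothetical and do not describe the actual layout of the Tanagra export file.

```python
# Cluster label -> sales level, following the rule stated in the text.
SALES_LEVEL = {
    "c_kmeans_1": "selling well",
    "c_kmeans_2": "not selling",
}


def sales_level(cluster_label):
    """Translate a K-Means cluster label into a sales level."""
    return SALES_LEVEL[cluster_label]


# Hypothetical exported rows: (product type, cluster label).
rows = [("MX-J1G", "c_kmeans_1"), ("DCR-SX60E", "c_kmeans_2")]
report = [(product, sales_level(label)) for product, label in rows]
```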
6 Discussion of Test Results
From the stages of testing carried out both manually and with the Tanagra 1.4.48 software, it can be concluded that the centroid values and clustering results produced by Tanagra 1.4.48 are the same as those obtained by manual calculation.
The same holds for testing with all the data. The manual calculation on 60 data records, which required three iterations of calculating Euclidean distance one and Euclidean distance two, placed 12 data records in Cluster 1 and 28 data records in Cluster 2. After the manual calculation, the same test was run with the Tanagra 1.4.48 software to compare the results. With Tanagra 1.4.48, the 60 sales data records yielded 12 data records in Cluster 1 and 28 in Cluster 2. It can therefore be said that the manual calculation and the Tanagra 1.4.48 software give the same result when all the data are used.
The first cluster obtained from the 60 data records contains 36 items, presented in the following table:
Table 1. First Cluster Results
PRODUCT
TYPE
21FU6RLR
14FU7ABF
19LH20RC
DVD
DV-452
KULKAS
MESIN CUCI
GNVL12SL-1K
WF-15CR1T
MESIN CUCI
WP-60IN-2T
PANASONIC
BLENDER
MX-J1G
3,95
PANASONIC
PANASONIC
PRICE
FEU-409
KULKAS
NR-A17KX1P
PANASONIC
MESIN CUCI
NAW70BC1-2T
POLITRON
PLE-686F-1
POLITRON
DISPENCER
PWC-107F
POLITRON
DVD
DVD-2183
KULKAS
PR-211SM2P
KULKAS
PR-169VSB1P
POLITRON
POLITRON
POLITRON
MESIN CUCI
PAW6T205-2T
POLITRON
MIX-1403
POLITRON
PS-21UM
POLITRON
APS-29UM
AHA9KCY-1
KULKAS
SJG170TZB-1
16,25
SHARP
SHARP
SHARP
KULKAS
SJG200UZS-2
SHARP
21DXS888
SHARP
29DXS200
MESIN CUCI
VH-8200E1T
TOSHIBA
TOSHIBA
KULKAS
GRH240PD-2P
PHILIPS
BLENDER
HR-2071
PHILIPS
SETRIKA
HI-114
RINAI
KOMPOR GAS
RI-522A
CS29M21ML
AS-05RLN1
SAMSUNG
SAMSUNG
DSC-W180B
SONY
CAMERA
DIGITAL
DSC-W210C
SONY
PLAYSTATION
TECHTRON
DVD
P-DVD9500
The results of the second cluster when testing 60 data points produce 12 items presented in the following table:
Table 2. Second Cluster Results
MERK
PRODUCT
TYPE
CT-07LCS-2
AMOUNT
KIPAS ANGIN
CAMERA
DIGITAL
MERK
SONY
SCPH90006-1
PRICE
AMOUNT
KULKAS
ST-05ICE1.
GNB352YLC1B
KULKAS
GNM392YPC-2
CSC10KKP-2
PANASON
PANASON
KULKAS
NR-B203G2P
PANASON
THL32C10X
PANASON
SHARP
TH-L37X10
AHAP5HHY1.
SHARP
AHAP9KHL-2
SONY
HANDYCA
DCR-SX60E
SONY
MINI
COMPO
MHCGN1300D
The K-Means algorithm is considered an algorithm that helps in grouping data whose properties can be clearly recognized and makes it easier for users to extract information from relevant data.
CONCLUSION
From the descriptions and explanations discussed in this study, several conclusions can be drawn:
1. With the K-Means algorithm, determining the level of sales of electronic products is easier than with the manual method.
2. The resulting grouping is used to draw conclusions about sales of electronic goods at CV. Berkah Elektronik, where the K-Means algorithm determines which products are selling well and which are not.
3. The selection of variables and attributes has a significant impact on the data used.
4. The development of this system from a manual system to a computerized system was motivated by the problems that arose in the old system.
Suggestions to consider for the future:
1. This study experimented with only one of the existing data mining clustering methods, namely the K-Means algorithm. In the future, more in-depth research is needed to compare its results with those of other methods and find out which gives good results.
2. The software used is Tanagra 1.4.48. Although this software is easy to use, users first need to understand how to operate it.
REFERENCES