Jurnal Informatika Universitas Pamulang
Publisher: Program Studi Teknik Informatika Universitas Pamulang
Vol. No. September 2020. ISSN: 2541-1004 e-ISSN: 2622-4615 32493/informatika.

A Survey on Phishing Website Detection Using Hadoop

Muhammad Rayhan Natadimadja1, Maman Abdurohman2, Hilal Hudan Nuha3
School of Computing, Telkom University, Jl. Telekomunikasi Terusan Buah Batu, Bandung, Indonesia
e-mail: 1mrayhann@student. id, 2abdurohman@telkomuniversity. hilalnuha@telkomuniversity.

Submitted Date: August 31st, 2020
Revised Date: September 26th, 2020
Reviewed Date: September 22nd, 2020
Accepted Date: September 30th, 2020

Abstract

Phishing is an activity carried out by phishers with the aim of stealing the personal data of internet users, such as user IDs, passwords, and bank account details, which the phishers then use for their own interests. The average internet user is easily trapped by phishers because the websites they visit closely resemble the original ones. Because several attributes must be considered, most internet users find it difficult to distinguish an authentic website from a fake one. There are many ways to detect a phishing website, but existing phishing website detection systems are too time-consuming and depend heavily on the database they maintain. In this research, Hadoop MapReduce is used to quickly retrieve the attributes of a website that play an important role in identifying phishing, and then to inform users whether the website is a phishing website or not.

Keywords: Phishing, Hadoop, Website, Information Security, Phishing Detection

Introduction

Some people will do everything they can to get what they want, and some of them, like phishers, use their knowledge in harmful ways. Phishers build fake websites designed to steal personal data, such as user IDs, passwords, and debit/credit card details, from those who access them.
Average internet users may not be able to tell whether a website is a phishing site, because it is almost identical to the real one. Phishing is similar to fishing, except that instead of catching fish, phishers capture personal information from a person or organization (Pham, Nguyen, Tran, Huh, & Hong, 2. They build fake websites to steal personal data without users realizing that they have handed information to phishers. According to the Anti-Phishing Working Group, there were 266,387 phishing websites in the third quarter of 2019 (Figure 1). This is a 46 percent increase from the second quarter of 2019, which amounted to 182,465 (Anti-Phishing Working Group, 2. Phishing therefore remains a serious crime, because it can result in substantial losses.

Figure 1. 2019 Phishing web report

Figure 2. Most phisher target diagram

http://openjournal. id/index. php/informatika

According to Figure 2, MarkMonitor, a member of APWG, observed that SAAS/Webmail was the largest target of phishers in the third quarter of 2019 (Anti-Phishing Working Group, 2. Attacks on file hosting and eCommerce sites were less popular in the third quarter of 2019, while payment sites remained the second-largest target after SAAS/Webmail.

Many techniques have been used to detect phishing sites. These are usually built into e-mail services and browsers, for example Google Safe Browsing and SmartScreen Filter. Because phishing attacks take advantage of human ignorance of the internet, this is a difficult problem to solve. All of these anti-phishing efforts were developed to minimize the impact of phishing attacks.
Literature Review

In today's developing technology industry, security problems have caused anxiety among users both at work and at home. Incidents that exploit human vulnerability have increased in recent years (Dunlop, Groat, & Shelly, 2. In this era, there have been many developments in security systems aimed at making security the top priority and at taking preventive action as quickly as possible, to avoid attacks by people who wish to commit crimes in cyberspace. Some cyber security practitioners currently use a reliable and stable detection technique for phishing websites (Mahajan & Siddavatam, 2. This system uses a crawler (Rakshith & Prabhakara, 2. to detect URLs in the database and the web pages to be checked, which are then passed to MapReduce to be checked for authenticity. MapReduce is used to improve the performance of phishing site searches. The method takes a page from a phishing website and compresses its image to reduce the intensity (Tangy. Uz. Caiy. Mamoulisy, & Chengy, 2. The compressed result is distributed into several bins of a fixed size to produce a histogram. This histogram is used to compare the data with existing datasets.

Current detection of phishing attacks falls mostly into two categories: detecting and filtering phishing e-mails, and detecting and filtering phishing websites. Both approaches are important in countering phishing attacks. Phishing e-mails and websites deserve more attention because of their unpredictable nature, which sometimes lets phishing attacks escape the filters that have been installed.
Apart from e-mails and websites, phishers also use other channels such as SMS, malware, social media, and online games (Hong, 2. This study focuses on phishing attacks through websites, for which several detection techniques have been used or suggested.

The detection technique used in existing browsers such as Firefox and Chrome is to blacklist websites that have been registered in a database. The main weakness of a blacklist is that it is built from reports by the volunteers who find the sites, so it must be updated frequently and manually, which takes a long time. The technique is therefore weak against new websites created that same day (Jain & Gupta, 2.

Another well-known technique in phishing detection is Visual Cryptography (Kumar & Kumar, 2. Others use logos and textual content from a web page (Chiew, Chang, Sze, & Tiong, 2. A frequent example is a captcha, which blocks interruptions coming from other machines but is not very effective at preventing other interruptions.

Heuristic detection has also been used to deal with phishing. Heuristics estimate whether a web page exhibits the characteristics of a phishing page (Zhu, Chen, Ye, Li, & Liu, 2. This technique can recognize phishing websites based on a set of features extracted from them (Tan, Chiew, Wong, & Sze, 2. However, relying on heuristics alone is not enough, because phishers can design their websites so that they cannot be detected by heuristics. Website visitors can be fooled easily because of the resemblance to the original website. Cantina is one example of a well-known heuristic-based approach. Another approach proposes detecting phishing websites using Google PageRank, but relies only on the PageRank value (Sunil & Sardana, 2.
It is difficult to identify whether a site is really a phishing website or not, because the website could be a newly created official website or a low-ranked blog. Aaron Blum, Brad Wardman, and Thamar Solorio proposed research (Blum, Wardman, Solorio, & Warner, 2. focusing on the idea of limiting the source of features that can facilitate information extraction through the host. The URL is treated as a binary feature vector. The vector is fed to the algorithm, which determines from it whether the URL is phishing or not. Ramesh Gowtham and Ilango Krishnamurthi proposed an anti-phishing system with a filtering mechanism based on 15 heuristic features (Gowtham & Krishnamurthi, 2. However, the login window must match the features provided for the result to be accurate. Rakesh Verma and Keith Dyer proposed a set of lexical URL features, including how many letters a URL contains (Verma & Dyer. However, if the URL does not contain spelling errors, these features may not work properly.

Machine learning is another technique used to detect phishing. Machine learning techniques rely on a set of features extracted from every web page, and require genuine websites as well as phishing websites as training data (Qabajeh, Thabtah, & Chiclana, 2. The accuracy of the result is greatly affected by the quality of the websites in the training set (Rao & Pais. Despite these challenges, machine learning has become an active topic in phishing website detection research. Several studies have been carried out using varied datasets and different classification algorithms (Abdeljaber, Mohammad, Thabtah, & McCluskey, 2013; Feng et al., 2018; Sahingoz, Buber, Demir, & Diri.
The accuracy of an algorithm is affected by the features used in classification, and some studies have considered intelligent methods for choosing features (Rajab, 2. Feature selection is an important task in building a good, generalized phishing detector. Currently, a widely used option is heuristics (Babagoli, Aghababa, & Solouk, 2.

URL-based detection is also used to detect phishing websites. This technique analyzes the features of a URL and reports whether a website is dangerous. Marchal et al. propose a phishing detection system that uses lexical analysis of URLs as well as search engine queries (Marchal, Francois, State, & Engel, 2. However, queries sent across the network can increase the space required as well as the costs. James et al. study lexical-based phishing detectors together with the information obtained from the web page (James, Sandhya, & Thomas, 2. This approach relies on special features made for certain websites, and is therefore not suitable at large scale.

There are also other phishing website detection techniques, such as those based on user habits. Srinivasa Rao and Alwyn R Pais exploit phishing web pages to find out what happens when data is entered on the website, for example by submitting fake credentials and observing the contents of the login page (Rao & Pais. However, there are some limitations regarding the login system: for example, some websites accept an incorrect password only three times. Also, on some websites the login field cannot be detected correctly, so fake credentials cannot be submitted automatically.

Finding phishing targets is useful for analyzing the behavior of an attacker and can help users reach legitimate web pages. In (Ramesh, Gupta, & Gamya, 2. , they propose to classify hyperlinks from suspicious web pages according to the related domain. However, this method requires analysis of many links, and candidates for phishing targets may not be included among the hyperlinks. In (Wenyin, Fang, Quan, Qiu, & Liu, 2. , they detect phishing targets from suspicious web pages by constructing and reasoning over a Semantic Link Network. Web page detection can be done with this method, but it comes at a fairly high cost.

Because the open internet is used for many kinds of online activities, users must be prepared for the threat of cyber crime. There are many types of cyber crime, and phishing is one of the most popular (Pujara & Chaudhari, 2. Phishing will remain a dangerous attack despite extensive research on phishing website filters (Gutierrez et al., 2. For this reason, the Anti-Phishing Working Group (APWG) produces a monthly report recording phishing attacks. Another group that plays a role in fighting phishing is Phishtank. Phishtank is a web-based application that provides a crowdsourcing service for reporting and validating websites (Dobolyi & Abbasi, 2. Phishtank users can submit websites they suspect of phishing, and if the suspicion is confirmed, the site's URL is entered into the Phishtank database.

1 Phishing

Phishing is a method of committing fraud by tricking the target with the intention of stealing the target's account (Mao, Tian, Li, Wei, & Liang. Phishing is also often known as website violence (Satish & K, 2. The term comes from the word fishing, meaning to lure the victim into a trap. Phishing is carried out to steal important information of a person or an organization, such as personal and financial information.
Phishing is a serious crime and web threat because it can cause large financial losses (Mohammad, Thabtah, & McCluskey, 2015; Thabtah & Kamalov, 2. The purpose of phishers is to deceive users into providing their sensitive information (Abdelhamid, Ayesh, & Thabtah, 2. To trap internet users who frequently visit websites, attackers create phishing web pages (which are similar to social media) so that victims enter their personal information on those web pages. Attackers usually publish links to their phishing websites on social media to trick users into visiting their phishing pages, because social media is an easy place to catch inexperienced users and divert them to these websites. Stolen information is usually a password or information about a user's credit card (Baykara & Gürel, 2. Because the website resembles an official site, average users enter their personal data into the phishing site. Information often stolen by these websites includes the user's account number, the user's username and password, credit card information, and e-banking information. Phishing of this kind is also often found in users' e-mails.

In studies of user experience with phishing attacks, users are fooled by phishing websites (Volkamer, Renaud, Reinheimer, & Kunz, 2. for these five reasons: users lack knowledge of URLs; users do not know which websites can be trusted; users do not see the full URL because of redirection or a hidden URL; users do not have time to question the authenticity of a website, or enter the website accidentally; and users cannot distinguish a phishing website from the official one. Although caution and user experience are important for avoiding phishing, users may not be able to avoid phishing scams completely (Greene, Steves, & Theofanos, 2. This is because, before carrying out an attack, attackers also take into account the habits and characteristics of the user (Curtis, Rajivan, Jones, & Gonzalez, 2. Cyberattacks can cost up to billions of dollars in losses as well as the loss of confidential user information (Shaikh, Shabut, & Hossain, 2. In addition, attackers can also target users' mobile devices, especially now that smartphone use is increasing (Goel & Jain, 2.

2 Hadoop

Hadoop is a Java-based open source framework under Apache that supports applications running on big data. Hadoop is used to handle large amounts of data, whether structured, semi-structured, or unstructured. Hadoop replicates data across several nodes in a cluster, so that if one node fails the data remains available on the others. The name Hadoop comes from a toy elephant owned by Doug Cutting's son. Hadoop was developed by Mike Cafarella and Doug Cutting in 2005.

3 MapReduce

Google introduced MapReduce, a programming model for processing large datasets (Zhang & Chen, 2. The MapReduce framework processes large datasets using many nodes, commonly called a cluster or grid. Processing can take place on data in a filesystem or a database. MapReduce usually consists of three stages: Map, Shuffle, and Reduce.

4 Phishtank

Phishtank was launched in October 2006. Phishtank is a community-based service that provides a place to report and verify phishing. Users can report a website URL suspected of being a phishing site, and the Phishtank community then votes on whether the URL is phishing or not. Phishtank is used by the Opera web browser, the online reputation and internet security browser plugin Web of Trust, Yahoo! Mail, the McAfee antivirus, and Kaspersky. The blacklist approved by Phishtank can be downloaded as a JSON file.

5 Phishing Website

Figure 3. Example of phishing website.

Phishing website pages have an interface similar to the original website, but different URLs. A cautious and experienced user can distinguish official, genuine websites from their URLs alone. However, due to time constraints, some users do not look at the entire URL, because they believe that a URL from social media is genuine. Using this kind of fraud, phishers try to obtain victims' sensitive and personal information (Gupta, Arachchilage, & Psannis, 2. Once users have entered a website they believe to be genuine, like the one in Figure 3, they can easily provide their personal information without suspicion because of its similarity to the original.

Writing Method

This research is a literature study based on reliable sources relating to phishing, the techniques used to detect it, and how to handle it using Hadoop. The paper is motivated by the lack of literature summarizing phishing detection techniques and methods, and solutions for speeding up phishing detection.

Result and Discussion

The proposed system uses the MapReduce technique to generate the attributes of a phishing website. Users enter the URL they want to visit, the website is analyzed, and the attribute values are calculated. The dataset from Phishtank is compared with the attributes obtained by MapReduce, and the comparison determines whether the website is a phishing website or not. Based on Figure 4, the phishing detection system works as follows. Users enter the URL they want to check.
Hadoop MapReduce extracts the attributes from the given URL. The extracted attributes are compared with the data in the dataset. The data in the dataset is given to the classifier, which builds the rules used for the comparison. The classifier forwards the data and rules to the predict module, which then decides whether the website is phishing or not.

1 Dataset

Building this phishing detection system requires a large dataset of URLs that is representative of the internet. To build a reliable dataset, the URLs used in this system are taken from the Phishtank website. These URLs are the core of the system's rule-making algorithm, and their attributes are used in the construction of the phishing website detection system. The dataset is provided as input to the classifier, which is applied in the WEKA machine learning and data mining tool. The datasets are arranged hierarchically.

2 Attribute Generator

In the proposed system, the attribute generator is the module that plays an important role in determining the genuineness of a URL. The system uses layered attributes organized in three layers. Layer one contains the identity of the URL and domain. Layer two consists of security and encryption, and source code and script. Layer three includes web address bar, page style, and social human factor.

Figure 4. Overview of proposed system

Figure 5. Architecture of Attribute that will be used.

After searching many documents to find which attributes are needed for this system, the architectural model for the attributes in Figure 5 to Figure 8 is based on (Aburrous, Hossain, Dahal, & Thabtah, 2.
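
As an illustration of the layered attributes, the following minimal sketch computes a few layer-one (URL and domain) attributes in the Map, Shuffle, and Reduce style. It is plain Python standing in for Hadoop MapReduce, and the attribute names, the checks, and the 75-character cut-off are hypothetical choices made for this example, not the attributes of the cited model.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse


def map_url(url):
    """Map step: emit (attribute, flag) pairs; flag 1 means suspicious."""
    host = urlparse(url).hostname or ""
    # Hypothetical layer-one checks on the URL and domain identity.
    is_ip = re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host) is not None
    yield ("ip_address_host", int(is_ip))
    yield ("at_symbol", int("@" in url))
    yield ("long_url", int(len(url) > 75))  # 75 is an assumed cut-off
    yield ("many_subdomains", int(host.count(".") > 3))


def shuffle(pairs):
    """Shuffle step: group the emitted flags by attribute name."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(groups):
    """Reduce step: count how many URLs were flagged for each attribute."""
    return {key: sum(values) for key, values in groups.items()}


def suspicious_counts(urls):
    """Run map, shuffle, and reduce over a list of URLs."""
    pairs = (pair for url in urls for pair in map_url(url))
    return reduce_counts(shuffle(pairs))
```

For example, `suspicious_counts(["http://192.168.0.1/login"])` flags only the `ip_address_host` attribute. On a real cluster the map and reduce functions would run on different nodes; here they run in one process only to show the data flow.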
This attribute model is used because it takes visual aspects into account and is not limited to specific purposes; it can also be used to determine the attributes of a general website. Attributes are generated by several rules. The authenticity of a website is inversely proportional to the value of its suspicious attributes. Hadoop MapReduce separates the attributes, which reduces the computation time for each dataset. The separated attributes are then compared to determine the genuineness of the website.

Figure 6. Detail of Attributes in layer one.

Figure 7. Detail of Attributes in layer two.

Figure 8. Detail of Attributes in layer three.

3 Classifier

The function of this module is to fetch data from the database and build rules for deciding whether a website is phishing or not. To make rules that can be trusted, the data mining tool WEKA (Hall et al., 2. can be used to help the process. With its collection of data mining algorithms, it can help determine the most suitable rules. The PART algorithm is used in this system. In WEKA, PART is a rule learner that builds a decision list from partial C4.5 decision trees. This algorithm is very useful when faced with a large database. The module works by receiving the attributes and passing them to the predict module. The layer then acts as a coordinator between the rules that are made and the attributes that are received. By classifying attribute values correctly, the layers can estimate the nature of a website. To keep the system simple, websites are classified into three categories: Trustworthy, Suspicious, and Phishing.
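
The three categories can be illustrated with a minimal sketch. It assumes each attribute has already been reduced to a flag (1 for suspicious, 0 otherwise). The thresholds 0.3 and 0.6 and the function names are hypothetical, and the fixed cut-offs merely stand in for rules that a learner such as PART would derive from the dataset.

```python
def suspicion_score(attributes):
    """Fraction of attributes flagged as suspicious, between 0 and 1."""
    if not attributes:
        return 0.0
    return sum(attributes.values()) / len(attributes)


def classify(attributes, suspicious_at=0.3, phishing_at=0.6):
    """Map a dict of 0/1 attribute flags to one of the three categories.

    The thresholds are assumed values for illustration, not learned rules.
    """
    score = suspicion_score(attributes)
    if score >= phishing_at:
        return "Phishing"
    if score >= suspicious_at:
        return "Suspicious"
    return "Trustworthy"
```

The score rises with the number of suspicious attributes, so a higher score means lower authenticity, which matches the inverse relation described for the attribute generator.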
4 Predict

The task of this module is to make decisions based on input from the attribute generator and the classifier. The rules from the classifier are used to make the decision. The other input comes from the attribute generator using Hadoop MapReduce: the attributes of the web page are extracted by Hadoop MapReduce and then forwarded to the predict module (Baitule & Deshpande, 2. Using Hadoop MapReduce, attributes from the attribute generator can be compared with the arranged datasets.

Table 1. Result Summary

Paper | Method | Dataset | Accuracy | Remarks
Utilisation website logo (Chiew et al. | Heuristic/SVM | Phishtank and Alexa | 93. | Can detect image-based phishing
Effective Phishing Websites Detection Model (Zhu et al. | Neural network, optimal feature selection | UCI Dataset, Phishtank, and Alexa | 99. | Continuous change of features and can deal with phishing with sensitive features
PhishWho (Tan et al. | Heuristic | Phishtank, OpenPhish, and Alexa | | Cannot address visual cloning, use three phishing website, and loaded to client's
PageRank (Sunil & Sardana, 2. | Heuristic/Google PageRank | Phishtank | | Only relying on value of PageRank and cannot zero-day
Efficient learning framework (Rao & Pais, 2. | J48, AdaboostM1, Random Forest, SVM, Bayers | UCI repository | | Training set depends on the quality, and uses various algorithms
Novel neural network (Feng et al., 2. | Neural Network/Monte Carlo Algorithm | Phishtank, Yandex | | All pages must be downloaded, using 30
Machine learning based (Sahingoz et al., 2. | K-star, kNN, SMO | UCI Datasets | | Has a relatively huge dataset and uses various algorithms
Heuristic nonlinear regression (Babagoli et al., 2. | Meta-heuristic/Decision and Wrapper | Phishtank | | Uses third-party service and 20 features
PhishScore (Marchal et al., 2. | SVM, LMT, Jrip, PART | Phishtank and Alexa | 99. | Real-time detecting system and
Analyzing the feign relationship (Ramesh et al., 2. | TVD algorithm | Google, Alexa, Netcrafts, Millersmiles, Phishtank, ReasonablePhishing Webpage list | 99. | Minimal use of third-party service and low false positive rate

Reported Output

Using Hadoop MapReduce will speed up the process of detecting phishing websites. Because Hadoop MapReduce runs in a distributed environment, the attribute distribution process runs faster, so results are obtained faster. The experimental results reported in the literature are summarized in Table 1.

Conclusion

The main objective of the proposed system is to improve the search performance for phishing websites, especially the speed. This can be achieved with Hadoop MapReduce by spreading tasks across several different nodes, so that users can find out more quickly whether a URL is phishing or not, and the overall system response is faster.

References