Use of HOG Descriptors in Phishing Detection

Use of HOG Descriptors in Phishing Detection
Ahmet Selman Bozkir, Ebru Akcapinar Sezer
Hacettepe University, Computer Engineering Department, Turkey
ISDFS 2016


Author (Y) - Hello, I am Selman Bozkir from the Computer Engineering Department of Hacettepe University, Turkey. Before my talk, I would like to welcome the chairman and the distinguished delegates. As I could not come to the beautiful city of Little Rock, I prepared this voice-recorded presentation.

Let me start presenting my work, entitled "Use of HOG Descriptors in Phishing Detection".

Topics
What is phishing?
Facts and the rise of phishing attacks
Existing approaches
Why vision based scheme?
HOG descriptors
Demonstration of developed method
Experiments and Results
Conclusion

Author (Y) - Today, we will start with what phishing is, move on to some facts about this digital crime, and address the increase in phishing attacks. Then we will briefly review the existing approaches in anti-phishing studies as well as the benefits of vision-based methods. In the following section, HOG descriptors will be briefly introduced. Next, we will dive into the details of the developed approach, and we will finish with the results.

What is phishing?
Phishing is a scamming activity that creates a visual illusion for computer users by presenting fake web pages which mimic their legitimate targets, in order to steal valuable digital data such as credit card information or e-mail passwords.
Phone phreaking + fishing -> phishing

Author (Y) - Let's start by stating what phishing is.

The word phishing was coined around 1996 by hackers stealing America Online accounts and passwords. As a term, phishing is a scamming activity that creates a visual illusion for computer users by presenting fake web pages which mimic their legitimate targets, in order to steal valuable digital data such as credit card information or e-mail passwords.

By the way, hackers commonly replace the letter f with ph, a nod to the original form of hacking known as phone phreaking. So the word phishing originates from phone phreaking.

Facts and figures
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Author (Y) - Now let's give some facts about the state of phishing. According to the 2016 Phishing Trends & Intelligence Report of PhishLabs, 77% of the phishing attacks were carried out in the USA. China comes in second place with only 5%.

Facts and figures
In 2012-2013, 37.3 million users were affected by phishing attacks*
37.3M
* Source: 2013 Verizon Data Breach Investigation Report

Author (Y) - Between 2012 and 2013, 37.3 million users were affected by phishing attacks.

Facts and figures
1 million confirmed malicious phishing sites on over 130,000 unique domains (as of 2013)
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Author (Y) - As of 2013, 1 million confirmed malicious phishing sites had been created on over 130,000 unique domains.

Facts and figures
Average lifetime of phishing pages is 32 hours
Risk of zero-day attacks is getting higher because they are not discovered by blacklists
32h
* Source: APWG, Phishing activity trends paper. [Online]. Available at http://www/antiphishing.org/resources/apwg-papers/

Author (Y) - It is also reported that the average lifetime of phishing pages is around 32 hours. This implies that blacklist approaches are becoming more vulnerable to zero-day attacks.

Facts and figures
Consumer-oriented phishing attacks targeted:
financial institutions
cloud storage/file hosting sites
webmail and online services
ecommerce sites
payment services
90%
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Author (Y) - According to the latest reports, 90% of consumer-oriented phishing attacks focused on financial institutions, cloud storage/file hosting sites, webmail and online services, ecommerce sites, and payment services.

Facts and figures
financial institutions
payment services
cloud storage/file hosting sites
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Author (Y) - PhishLabs also reports that while there is a downward trend in attacks targeting financial institutions and payment services, there is an increase in phishing related to cloud storage and file hosting sites.

This means that phishing is shifting somewhat toward stealing not only money but also non-monetary personal data.

Existing Anti-Phishing Approaches

Author (Y) - Now let's talk about the state of anti-phishing studies.

Anti-phishing studies can be categorized in many different ways. Here, however, we group them into four distinct categories. In the first, general content-based methods are employed. DOM-based methods, on the other hand, are used to detect visual or structural similarities in HTML files. Nonetheless, DOM-based analysis has started to become insufficient due to new kinds of tricks in HTML markup. Furthermore, with the advent of computer vision methods such as local visual features or scale-invariant features, vision-based methods have come into use in recent years.

Why vision based scheme?
Substitution of textual HTML elements with image or applet-like content
Zero-day attacks need pro-active solutions
Dynamic / AJAX type content loading
Different DOM organizations between legitimate and fake web pages
More robust to complex backgrounds or page layouts
And most important, vision-based solutions are in concordance with human perception
* Source: PhishLabs 2016 Phishing Trends & Intelligence Report

Author (Y) - Let's investigate why there is a growing trend toward vision-based methods.

With the advent of vision-based methods, and because they are in concordance with human visual perception, vision-based anti-phishing approaches have become popular in recent years.

Moreover, there are some reasons behind this trend.

First of all, there is a war between phishers and anti-phishing solutions. Expert phishers tend to replace textual content with image-like elements. In addition, dynamic content loading causes failures in conventional DOM-based methods.

It is also known that current web page design trends favor complex graphical interfaces, and vision-based methods are robust in analyzing and detecting phishing web pages in such situations.

Most important of all, purely vision-based analysis simulates human perception. This makes vision-based methods superior to DOM or blacklist approaches. As a consequence, vision-based methods provide a faster and more robust defense against zero-day attacks.

Methodology: HOG Features and Descriptors
Histogram of Oriented Gradients
Dalal & Triggs, 2005
A good way to characterize and capture local object appearance or shapes by utilizing the distribution of intensity gradients or edge directions.
Preferred because:
(i) HOG descriptors are able to capture visual cues of the overall page layout;
(ii) they are able to provide a certain degree of rotation and translation invariance.

Author (Y) - Before presenting the proposed method, I think this is a good moment to talk briefly about HOG.

Histogram of Oriented Gradients was introduced by Dalal and Triggs. It is a powerful computer vision method used for characterizing and capturing local object appearance or shapes by utilizing the distribution of intensity gradients or edge directions.

In essence, HOG descriptors are designed to represent and reveal orientations in a local patch of an image.
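To make the idea of an orientation histogram concrete, here is a minimal sketch in pure Python of how one cell of a grayscale image could be summarized into 9 unsigned-orientation bins. The function name `cell_orientation_histogram`, the central-difference gradient, and the hard binning without interpolation or block normalization are simplifying assumptions for illustration, not the implementation used in this work.

```python
import math

def cell_orientation_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations
    (0 to 180 degrees) for one cell given as rows of grayscale values."""
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    height, width = len(cell), len(cell[0])
    # Border pixels are skipped because central differences need both neighbors.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat region, no orientation to record
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(int(angle // bin_width), bins - 1)
            hist[index] += magnitude
    return hist

# A vertical edge: intensity changes only along x, so every gradient
# falls into the first bin (orientations near 0 degrees).
cell = [[0, 0, 255, 255]] * 4
hist = cell_orientation_histogram(cell)
print(hist)
```

Concatenating such histograms over all cells of a page screenshot gives one long feature vector per page.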

We prefer HOG descriptors for two reasons. First, they are able to capture visual cues of the overall page layout. Second, they provide a certain degree of rotation and translation invariance.

Developed approach in details

Author (Y) - Now let's see the details of the proposed approach.

Our main idea is based on dividing the web page surface into grid cells and computing the gradient orientations in each cell. We then calculate the visual similarity of legitimate and fake web pages by summing up histogram intersection values. Note that we compute 9 orientation bins, as proposed in the paper by Dalal & Triggs.
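The comparison step can be sketched as follows. The sketch assumes each page is already described by a list of per-cell 9-bin histograms. It L1-normalizes each cell histogram and averages the per-cell intersections so that the score falls between 0 and 1. The normalization, the averaging, and the names `histogram_intersection` and `page_similarity` are assumptions, since the talk only states that histogram intersection values are summed.

```python
def histogram_intersection(h1, h2):
    """Intersection of two histograms after L1 normalization, in [0, 1]."""
    s1, s2 = sum(h1), sum(h2)
    if s1 == 0 and s2 == 0:
        return 1.0  # assumption: two edge-free cells count as identical
    if s1 == 0 or s2 == 0:
        return 0.0
    return sum(min(a / s1, b / s2) for a, b in zip(h1, h2))

def page_similarity(cells_a, cells_b):
    """Mean histogram intersection over corresponding grid cells."""
    if len(cells_a) != len(cells_b):
        raise ValueError("both pages must use the same grid")
    scores = [histogram_intersection(a, b) for a, b in zip(cells_a, cells_b)]
    return sum(scores) / len(scores)

# Two toy pages with two cells each; only the first cell differs.
page = [[4, 0, 0, 0, 0, 0, 0, 0, 0], [0, 2, 2, 0, 0, 0, 0, 0, 0]]
other = [[0, 0, 0, 0, 0, 0, 0, 0, 4], [0, 2, 2, 0, 0, 0, 0, 0, 0]]
same_score = page_similarity(page, page)
mixed_score = page_similarity(page, other)
print(same_score, mixed_score)
```

Multiplying the score by 100 gives a percentage comparable in form to the similarity values reported in the results.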

Our system consists of two modules. The first module, called Wrapper, was designed and implemented to find the effective page boundaries and take a screenshot of the web page. Once the target region of interest is determined, the effective portion is cropped and prepared as input to the next module. We decided to use the topmost 1024 pixels.

The second module, called Hogger, was implemented to take the screenshot of the target web page and output a concatenated HOG feature vector.

Throughout the Wrapper implementation we employed the Mozilla GeckoFx.NET browser API. The wrapper window was set to take exactly 1024-pixel-wide screenshots. Next, we cropped away the portion below 1024 pixels. For cases where the height of the web page is less than 1024 pixels, we applied a dominant color detection method to fill the empty lower part. Finally, the output image was converted to grayscale in order to increase the gradient computation accuracy.
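The height normalization and grayscale conversion can be illustrated with a small pure-Python sketch. It represents a screenshot as rows of (r, g, b) tuples. Taking the most frequent color as the dominant color and using BT.601 luma weights for grayscale are assumptions made here for illustration, as are the names `fit_height` and `to_grayscale`; the talk does not specify which methods the Wrapper uses.

```python
from collections import Counter

def fit_height(pixels, target_height=1024):
    """Crop rows below target_height, or pad short pages at the bottom
    with the dominant color."""
    if len(pixels) >= target_height:
        return [list(row) for row in pixels[:target_height]]
    width = len(pixels[0])
    counts = Counter(color for row in pixels for color in row)
    dominant = counts.most_common(1)[0][0]  # assumption: most frequent color
    padding = [[dominant] * width for _ in range(target_height - len(pixels))]
    return [list(row) for row in pixels] + padding

def to_grayscale(pixels):
    """Convert RGB tuples to luma values using BT.601 weights (an assumption)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in pixels]

white, red = (255, 255, 255), (255, 0, 0)
short_page = [[white, white, red], [white, white, white]]
fitted = fit_height(short_page, target_height=4)  # 4 instead of 1024 for brevity
gray = to_grayscale(fitted)
print(len(fitted), gray[0], gray[3])
```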

During the experiments we applied two cell size configurations: 64-pixel-wide and 128-pixel-wide cells. In the next part, we will investigate the effect of cell size.

Experiments
For the first dataset, 50 unique phishing pages reported on Phishtank between 14 December 2015 and 5 January 2016 were collected.
For the legitimate web page pairs, we collected 18 legitimate home pages from the Alexa top 500 web site directory.
Afterwards, we shuffled the page URLs in order to obtain 100 distinct legitimate home page pairs.
64-pixel-wide and 128-pixel-wide cells were employed.

Author (Y) - For the experiments, we prepared two datasets. The first one was built from 50 unique phishing web pages collected from Phishtank, the well-known phishing web page archive. The second one was created by selecting 18 legitimate home pages from the Alexa top 500 web sites.

Then we shuffled the legitimate home pages in order to build 100 legitimate web page pairs. Our hypothesis was that legitimate-fake web page pairs would yield higher similarity scores than legitimate-legitimate page pairs.
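One way to build such a set of pairs is sketched below. With 18 pages there are 18 * 17 / 2 = 153 distinct unordered pairs, so 100 distinct pairs can be drawn without repetition. The function name `sample_page_pairs`, the fixed seed, and the placeholder site names are assumptions; the talk does not describe the exact shuffling procedure.

```python
import itertools
import random

def sample_page_pairs(pages, n_pairs, seed=0):
    """Draw n_pairs distinct unordered pairs of different pages."""
    all_pairs = list(itertools.combinations(pages, 2))
    if n_pairs > len(all_pairs):
        raise ValueError("not enough distinct pairs")
    return random.Random(seed).sample(all_pairs, n_pairs)

# Placeholder names, not the actual sites used in the experiments.
pages = [f"site-{i:02d}" for i in range(18)]
pairs = sample_page_pairs(pages, 100)
print(len(pairs), pairs[0])
```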

During the experiments we applied two cell size configurations. 64 px and 128 px wide cells were used in order to investigate the sensitivity to the granularity of the HOG vectors.

Results - 1

Similarity of Pairs of Phishing Pages (50 pages)
Statistic             HOG-64 px cells    HOG-128 px cells
min                   51.873 %           49.910 %
max                   98.861 %           98.390 %
mean                  78.868 %           78.637 %
standard deviation    12.147 %           10.963 %

Statistics of Phishing and Their Target Page Pairs in HOG-64 and HOG-128

Similarity of Pairs of Legitimate Pages (100 unique pairs)
Statistic             HOG-64 px cells    HOG-128 px cells
min                   38.420 %           45.683 %
max                   74.459 %           77.092 %
mean                  60.739 %           66.012 %
standard deviation    11.026 %           9.492 %

Statistics of Unique Legitimate Page Pairs in HOG-64 and HOG-128

Author (Y) - In the left table we see the statistics of the first dataset, which covers pairs of phishing pages and their legitimate targets. In the right table we see the results obtained by comparing legitimate-legitimate web page pairs.

If we examine the summary tables, we can easily see that the similarity scores of the two test sets differ notably. This implies that HOG descriptors are suitable for phishing detection tasks. Therefore our hypothesis is validated.

Results - 2
Similarity scores of unique legitimate page pairs

Author (Y) - In this graph we see the similarity scores of the 100 legitimate web page pairs.

The orange line depicts the values in the HOG-64 configuration. The blue line shows the HOG-128 values.

Looking at the graph, we see that in most cases HOG-64 discriminates better than HOG-128.

On the other hand, the maximum similarity score between the pairs is about 75%.

Results - 3
Similarity scores of phishing pages and their legitimate targets

Author (Y) - In this graph we see the similarity scores of phishing web pages and their legitimate targets.

As we expected, the similarity scores are notably higher than those of the legitimate pairs.
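This gap suggests a simple decision rule. The sketch below assumes a fixed cut-off of 75%, roughly the highest similarity observed between legitimate pairs, and uses a hypothetical function name `is_phishing_suspect`. It illustrates the idea only and is not a tuned detector.

```python
def is_phishing_suspect(similarity_percent, threshold=75.0):
    """Flag a page whose similarity to a protected legitimate page
    reaches the threshold (both given in percent)."""
    return similarity_percent >= threshold

# Mean similarities reported for the HOG-64 configuration in this talk.
phishing_mean_flag = is_phishing_suspect(78.868)    # phishing/target pairs
legitimate_mean_flag = is_phishing_suspect(60.739)  # legitimate/legitimate pairs
print(phishing_mean_flag, legitimate_mean_flag)
```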

On the other hand, there is no significant difference between the HOG-64 and HOG-128 configurations for fake-legitimate pairs. Moreover, a similarity score of around 75% appears to be a good threshold for raising a phishing alert.

Discussion and Conclusion
This work is the first study that employs HOG in phishing detection.
It provides a robust method for phishing detection, as it is purely vision based and able to capture local visual cues on the web page surface.
However, we identified some shortcomings.
Image contents in phishing web pages are generally different from the legitimate ones, so image invariance must be provided in order to achieve better and more robust phishing detection.
The method must also be verified with a more comprehensive dataset.

References
Y. Zhang, J. Hong, L. Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites, WWW 2007.
N. Chou, R. Ledesma, Y. Teraguchi, D. Boneh, and J.C. Mitchell, Client-Side Defense against Web-Based Identity Theft, In Proceedings of The 11th Annual Network and Distributed System Security Symposium (NDSS '04).
Netcraft, Netcraft Anti-Phishing Toolbar. Visited: April 20, 2016. http://toolbar.netcraft.com/
E. Medvet, E. Kirda and C. Kruegel, Visual-Similarity-Based Phishing Detection, Securecomm 08 International Conference on Security and Privacy in Communication Networks, 2008.
W. Zhang, H. Lu, B. Xu and H. Yang, Web Phishing Detection Based on Page Spatial Layout Similarity, Informatica, vol. 37, pp. 231-244, 2013.
A.Y. Fu, L. Wenyin and X. Deng, Detecting Phishing Web Pages with Visual Similarity Assessment Based on Earth Mover's Distance (EMD), IEEE Transactions on Dependable and Secure Computing, pp. 301-311, 2006.
M.E. Maurer and D. Herzner, Using visual website similarity for phishing detection and reporting, In CHI12 Extended Abstracts on Human Factors in Computing Systems, 2012.
G. Wang, H. Liu, S. Becerra, K. Wang, Verilogo: Proactive Phishing Detection via Logo Recognition, Technical Report CS2011-0669, UC San Diego, 2011.
T. Chen, S. Dick, J. Miller, Detecting Visually Similar Web Pages: Application to Phishing Detection, ACM Transactions on Internet Technology, 10(2), 2010.