
Recognition of phishing attacks utilizing anomalies in phishing websites

Sunil Chaudhary

University of Tampere

Department of Computer Sciences

Computer Science/Software Development

M.Sc. thesis

Supervisor: Eleni Berki

November 2012


University of Tampere

Department of Computer Sciences

Computer Science/Software Development

Sunil Chaudhary: Recognition of phishing attacks utilizing anomalies in phishing websites

M.Sc. Thesis, 78 pages, 15 index and appendix pages

November 2012

The fight against phishing has resulted in several promising phishing prevention

techniques. However, they are only partially able to address the phishing problem.

There are still a large number of Internet users who are tricked into disclosing their personal

information to fake websites every day. This might be because existing phishing

prevention techniques are either not foolproof or they are unable to deal with the

emerging changes in phishing.

The main purpose of this thesis is to identify anomalies that can be found in the

Uniform Resource Locators (URLs) and source codes of phishing websites and

determine an efficient way to employ those anomalies for phishing detection. In order to

do that, I performed a meta-analysis of several existing phishing prevention

techniques, specifically heuristic methods. Then, I selected forty-one anomalies, which

can be found in the URLs and source codes of phishing websites and are also

mentioned or utilized by past studies. This is followed by the verification of those

anomalies using an experiment conducted on twenty online phishing websites. The

study revealed that some anomalies, which were once significant for phishing detection,

are no longer included in present day phishing websites, and several anomalies are also

widely present in legitimate websites. Such ambiguous anomalies need further analysis

to determine their significance in phishing detection. Moreover, it was also found that

several heuristic methods use an insufficient set of anomalies which introduces

inaccuracy in their results. Finally, in order to design an efficient heuristic method

employing anomalies that can be found in URLs and source codes of phishing websites,

it is suggested to give due priority to anomalies that are difficult for phishers to

bypass, found only in phishing websites, seriously harmful, independent of other

anomalies, and quick to evaluate.

Key words and terms: phishing, phishing prevention, URL, DOM objects, whitelist, blacklist, heuristics, meta-analysis, software quality.


Acknowledgement

I would like to express my sincere thanks and deep appreciation to my professor and

supervisor Eleni Berki for her guidance and valuable comments. I am equally thankful

to Marko Helenius (Tampere University of Technology) for the constructive feedback. I

would also like to thank Linfeng Li for sharing his experiences on phishing research and

suggesting various useful materials that I used for my thesis.

My sincere thanks also go to my English teachers, Robert Hollingsworth and

Julie Rajala, who helped me to get familiar with the rules of academic writing. I would

also like to thank my professors Jyrki Nummenmaa and Zheying Zhang as well as all

the attendees of the seminar course entitled “Master’s Thesis Seminar in Software

Development” for their suggestions and feedback. Last but not least, I am thankful to my

professor Mikko Ruohonen, who provided me with a summer traineeship and ample freedom to

complete a large part of my thesis during the traineeship period.

Sunil Chaudhary

2nd December 2012, Tampere


Contents

1. Introduction

1.1. The phishing epidemic

1.2. Research questions

1.3. Anomalies in phishing websites are suitable for phishing detection

1.4. Thesis contribution

1.5. Thesis outline

2. Review of phishing prevention methods

2.1. Meaning of phishing prevention methods

2.2. Important factors for effective phishing prevention methods

2.2.1. Phishers’ behavior and phishing techniques

2.2.2. Internet users’ behavior and decision making process

2.3. Objectives of existing phishing prevention methods

2.3.1. Reasons behind internet users’ tendency to fall for phishing

2.3.2. Design techniques to educate and aware about phishing

2.3.3. Design effective UI and warning to alert about phishing

2.3.4. Development of countermeasure to automatically detect phishing

2.3.5. Evaluate the effectiveness of existing phishing prevention methods

2.3.6. The need to invent proactive strategies for phishing prevention

2.4. Classification of phishing prevention techniques

2.5. Phishing prevention applications

3. Analysis of strength and limitations of technical phishing prevention methods

3.1. List based methods

3.1.1. Whitelist method

3.1.2. Blacklist method

3.2. Heuristic methods

3.2.1. Use of visual similarity measures in phishing detection

3.2.2. Use of search engine in phishing detection

3.2.3. Use of anomalies in phishing websites for phishing detection

4. Investigating anomalies in phishing websites

4.1. Anomalies found in the URLs of phishing websites

4.2. Anomalies found in the source codes of phishing websites

4.3. Verification of anomalies using online phishing websites

4.4. Discussion on findings

5. Conclusions

6. Limitations and future development work

References

Appendix


1. Introduction

1.1. The phishing epidemic

Online services are an integral part of modern society. They make information readily

accessible from any place through the Internet. This feature is equally utilized by both

service providers and users. Service providers are able to penetrate and cover large

markets easily at a low operational cost whilst users are able to choose from a wide

range of services and are able to use them regardless of time and location.

Unfortunately, these services have not escaped the attention of cybercriminals. One of

the major drawbacks of using such services is the risk of phishing.

Phishing is a fraudulent activity carried out using an electronic communication to

acquire personal information for malicious purposes. This information can include bank

or financial institution authentication credentials, social security numbers, credit card

details, and online shopping account information with which phishers usually defraud

their victims. Phishers employ a number of techniques, such as social engineering

schemes and technical subterfuge [APWG, 2012], in order to lure potential victims and

make them divulge their account details and other sensitive information.

(i) Social engineering scheme. In general, phishers use emails masquerading as

being from a legitimate and trustworthy source, such as a bank, an auction

site, or an online commerce site [APWG, 2012], and redirect victims to an

authentic-looking counterfeit website to deceive recipients into disclosing

sensitive information. Many other media, such as postal mail, phone calls, and

instant messaging, are also used to reach potential victims and lure them into

disclosing their confidential information. However, fake emails and phony

websites are an easy and economically viable means of targeting a large number of

potential victims at a time, which might also be a reason they are widely used to

conduct phishing.

The fake emails and phony websites used by phishers have evolved to

become technically deceptive and hard for casual detection methods to detect.

Phishing emails often create a sense of urgency to motivate Internet users to take

prompt action, for example by asking potential victims to update, validate, or confirm

their account information for various reasons, such as to receive an award,

to help the bank in its procedures, to avoid having their account


suspended, or to stop the account from misuse. Similarly, phishers have also been

found exploiting current events, for example, phishing

attacks that emerged after the Haiti earthquake purporting to be from relief

organizations or the victims themselves and asking for donations, and FIFA World

Cup-themed phishing attempts. Phishing websites often use the original website’s

layout, logo, trademark, and even a similar domain name to make them look

similar to the genuine websites. Furthermore, mirroring original websites to

generate fake websites makes it harder to differentiate them even for people with

adequate knowledge about phishing. It has also been reported that some

phishing websites claim to sell products, such as software, games, and sex pills

at high discount, and then steal the bank information when the Internet user

enters it into their websites to buy the products.

(ii) Technical subterfuge. Phishers plant crimeware onto Personal Computers (PCs)

of potential victims to steal their credentials directly [APWG, 2012]. Many hackers

have been involved with phishing and use advanced hacking techniques. Some of

the mechanisms used in technical subterfuge are:

Session hijacking is used, often by corrupting the local navigation

infrastructure to misdirect potential victims to a fake website or an authentic

website through proxies controlled by the phishers. Techniques, such as

pharming, cross-site scripting attack, cross-site request forgery, domain

name typos, and man-in-the-middle attacks are implemented to carry out

session hijacking [Milletary, 2006].

An uncontrolled flood of spam emails is sent with malware in the

attachment or with a link that, when clicked, surreptitiously installs

specialized malware on the Internet users’ computers. Such malware is

designed to monitor and intercept the victims’ keystrokes and mouse clicks.

Sometimes malware is even designed to capture screenshots of webpages

visited by the victims and ultimately send the captured information to the

phishers. More advanced malware designed to capture network packets and

protocol information, and password harvesters that look for username and

password information on the victims’ computers, are also found to be

employed by phishers.


There has been a rapid increase in phishing attacks from the first half of 2008 to

the first half of 2012, as shown in Figure 1.

Figure 1: Phishing websites detected from 2008 to 2012

Many factors are responsible for the growth of phishing. One of the major factors is the

unawareness of Internet users that their personal information is actively targeted by

criminals and as a consequence they neglect to take precautionary measures while

performing online transactions. Likewise, many online service providers lack organizational

policies and procedures for contacting customers [Dhamija et al., 2006], although

presently many big organizations that are phishing prone do seem to have acted to

improve the situation. Moreover, phishing is a very lucrative cybercrime with a high

benefit return against little risk. The exponential growth in the use of the Internet and

online services has resulted in a rapid increase in potential victims encouraging many

new criminals and inspiring them to use different new sophisticated techniques to

deceive Internet users more effectively. Additionally, the technical

resources required for phishing are easily accessible, which has enabled even criminals with

little technical knowledge to conduct phishing successfully. Many do-it-yourself

phishing kits are available online and can be downloaded for free. These kits also

contain software for spamming that enables phishers to easily reach large numbers of

potential victims. There are also various websites that offer guidance on

designing and conducting phishing.


Phishing is a leading cause of identity theft on the Internet and causes billions of

dollars of damage worldwide every year. It has an adverse impact on the economy

through direct and indirect losses experienced by businesses and customers.

The direct loss is the financial damage, i.e., the amount that phishers

withdraw from their victims’ accounts.

The indirect losses are an adverse impact on customers’ confidence towards

online commerce and services, the diminished reputation of victimized

organizations, and the resources spent to combat phishing.

Moreover, the convenience of e-commerce seems to be embraced by both

cybercriminals and users on an equal basis. Financial services are the industry sector most

targeted by phishing, as shown in Figure 2.

Figure 2: Targeted industry sectors by phishing [APWG, 2012]

With the prevalence of phishing attacks and the increasing vulnerability of users’

confidential and personal information, it is increasingly important to provide Internet

users with an effective and reliable phishing prevention method.

There is no silver bullet to eliminate the problem of phishing. It depends partially

on well-designed technology and equally on the browsing habits of Internet users.

Well-designed technology includes techniques that can efficiently tackle successful phishing

techniques and a usable design that takes into consideration what humans can and cannot

do well [Dhamija et al., 2006]. Li et al. [2007] emphasize improving the quality of

system design and the need for well-defined security requirements to protect system

users from phishing. Good browsing habits mean that Internet users are familiar with

phishing and are able to detect it. They also include trust in the anti-phishing software


which Internet users have installed in their system and their reaction to the warning

from the anti-phishing system installed. However, an empirical study has shown that

many of the Internet users neglect warnings from the anti-phishing system [Dhamija et

al., 2006]. Many Internet users do not understand phishing attacks or do not understand

the sophistication of phishing [Wu et al., 2006b].

There are several promising solutions provided by security experts and researchers

against phishing. These systems build an awareness of potential phishing attempts, and

develop and promote suitable technology solutions that help to protect Internet users

against phishing. They implement prevention, detection, and response measures. They

are available in a variety of forms: integrated with popular anti-virus systems, e.g., anti-

phishing tool in Norton antivirus software, as an embedded feature of renowned web

browsers, e.g., Google Safe Browsing toolbar [Google Safe Browsing] used in Mozilla

Firefox browser, and as separate tools and add-ons that can be used in server and client

machines, e.g., eBay toolbar [eBay Toolbar’s Account Guard]. They employ different

techniques, such as blacklists, e.g., Netcraft Anti-phishing toolbar [Netcraft], whitelists,

e.g., SmartScreen Filter [MSDN IEBlog], content-based detection, e.g., CANTINA

[Zhang et al., 2007a], analysis of web page source code or URL, e.g.,

CANTINA+ [Xiang et al., 2011], comparing visual similarity of the whole webpage or

layout or logo, e.g., online tool called “SiteWatcher Anti-phishing Tech” [Liu et al.,

2006], analysis of data submitted by users online, e.g., SpoofGuard [Chou et al., 2004],

and use of a reputable search engine, e.g., CANTINA [Zhang et al., 2007a]. There has

been good progress in identifying countermeasures; however, there has also been an

increase in attack diversity and technical sophistication to circumvent both detection

and users’ suspicions too. This means as countermeasures are implemented to thwart

one method of stealing information, criminals search for new vulnerabilities to be

exploited. This also means they always have additional opportunities available to them.
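
As one way to picture how the list-based and heuristic techniques named above might be combined, the following minimal Python sketch consults a whitelist and a blacklist first and falls back to a heuristic only for unlisted hosts. It is an illustration under stated assumptions, not the design of any tool cited here: the host names, the two lists, the function names classify and looks_suspicious, and the placeholder heuristic are all hypothetical.

```python
from typing import Callable
from urllib.parse import urlsplit

# Hypothetical, hand-made lists for illustration only; real tools obtain
# such lists from maintained services.
WHITELIST = {"www.example-bank.test"}
BLACKLIST = {"example-bank.test.badhost.test"}


def looks_suspicious(url: str) -> bool:
    """Placeholder heuristic (an assumption): flag URLs that contain '@'."""
    return "@" in url


def classify(url: str, heuristic: Callable[[str], bool] = looks_suspicious) -> str:
    """Check the lists first; use the heuristic only for unlisted hosts."""
    host = (urlsplit(url).hostname or "").lower()
    if host in WHITELIST:
        return "legitimate"
    if host in BLACKLIST:
        return "phishing"
    # Heuristics can cover unlisted sites but are less certain than a list
    # hit, so the sketch reports "suspicious" rather than "phishing".
    return "suspicious" if heuristic(url) else "unknown"
```

Passing the heuristic as a parameter keeps the ordering of the checks separate from the choice of anomalies, so a richer heuristic can be substituted without changing the list lookups.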

1.2. Research questions

The most common and straightforward technique to commit phishing attacks is to

deploy a webpage that mimics the look and feel of a target organization’s website.

There are several heuristic methods which employ anomalies in the URLs and source

codes in order to identify phishing websites. Many anti-phishing tools in use, such as

SpoofGuard [Chou et al., 2004], Netcraft Anti-phishing toolbar [Netcraft], CallingID

toolbar [CallingID], eBay toolbar [eBay Toolbar’s Account Guard], and SmartScreen


Filter [MSDN IEBlog] also implement heuristic methods for phishing detection.

However, there are several anti-phishing tools, such as Cloudmark Anti-fraud toolbar

[Cloudmark] and EarthLink toolbar [EarthLink] that still rely on manual verifications

and blacklists for phishing detection [Zhang et al., 2007b]. Ironically, even

anti-phishing tools, such as eBay Toolbar and SmartScreen Filter, that use heuristic methods

do not use them as the first line of detection [MSDN IEBlog, eBay Toolbar’s Account Guard],

since heuristic methods introduce higher inaccuracy in the results compared to list-based

methods. Therefore, further research is required to improve the results of heuristic

methods. In this thesis, I have worked on answering the following two questions:

(i) What are the most common anomalies found in URLs and source code of

phishing websites?

(ii) How could these anomalies be deployed in order to recognize phishing

attempts?

I believe that, in order to enhance the accuracy of heuristic methods, the two crucial

factors are the selection of suitable anomalies and the design of a suitable method to employ

them.

Some of the related studies that use anomalies in the source codes and URLs of

phishing websites for phishing detection include Pan and Ding [2006], who looked for

anomalies in webpage and cookies for phishing detection; Gastellier-Prevost et al.

[2011], who evaluated URL and webpage source code; Garera et al. [2007], who

analyzed the features of URL for discrepancies; and Alkhozae and Batarfi [2011], who

looked for abnormalities in webpage with respect to the W3C standard. However,

Garera et al. [2007] excluded the source code of the website despite the fact that

important anomalies can also be found in the source code of phishing websites, whilst

all the other studies seem to neglect some vital factors during selection, calibration,

and deployment of the anomalies, as evidenced by the high inaccuracy in their results.

In addition, the studies were performed some years ago, but the trend in phishing is very

dynamic. There is a high chance that anomalies that were important during their studies

may no longer be valid. Many other related studies are analyzed in chapter three.

1.3. Anomalies in phishing websites are suitable for phishing detection

Although phishing sites are cheap and easy to build [Pan and Ding, 2006], these cheaply

made websites are often poorly designed and coded, and do not properly meet

recognized standards, for example, the recommendations from the World Wide Web


Consortium (W3C) [Alkhozae and Batarfi, 2011] and the Google guidelines [Garera et

al., 2007]. Their quality score in the Google crawl database was found to be either very

low or they had no score [Gastellier-Prevost et al., 2011]. Moreover, phishing websites

have a very short lifetime: on average, a phishing website domain remains online for

3 days, 31 minutes and 8 seconds [McGrath and Gupta, 2008]. Given this short lifetime,

phishers naturally prefer not to concentrate on website design and quality

improvement, but rather to work on more profitable activities, such as pushing more

emails and websites to potential victims, infecting users’ PCs with malicious software to

use them as proxies, and designing a distributed architecture that includes registering

many domains from various registrars in order to direct traffic to one of their other domains

when any of their domains is removed. In addition, phishing websites often imitate

genuine websites and claim false identities, which is not possible without

introducing some anomalies. Therefore, these anomalies can be utilized to detect

phishing. The other benefits of using such anomalies found in the URLs and DOM

objects of websites for a phishing detection method are:

(i) It is not dependent on any specific phishing strategy and is equally valid for

all kinds of phishing websites.

(ii) It does not depend on any external factors, such as databases, and

(iii) It does not require any changes in user browsing habits.
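
To illustrate the point that such a check needs nothing beyond the URL itself, the following minimal Python sketch looks for a few URL anomalies using only the standard library. The four checks, their names, and the dot-count threshold are illustrative assumptions chosen for this sketch; they are not the forty-one anomalies selected in this thesis, and the example host names are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def url_anomalies(url: str) -> list[str]:
    """Return the names of simple, illustrative anomalies found in a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    found = []

    # An IP address used in place of a domain name.
    try:
        ipaddress.ip_address(host)
        found.append("ip_address_host")
    except ValueError:
        pass

    # '@' in the authority part: the text before it is not the real host.
    if "@" in parts.netloc:
        found.append("at_symbol")

    # Unusually many dots in the host name (the threshold is an assumption).
    if host.count(".") >= 4:
        found.append("many_subdomains")

    # The page is not served over HTTPS.
    if parts.scheme != "https":
        found.append("no_https")

    return found
```

For example, url_anomalies("http://www.example-bank.test@198.51.100.7/login") returns ["ip_address_host", "at_symbol", "no_https"], whereas url_anomalies("https://www.example.test/account") returns an empty list. A single anomaly of this kind may also occur on legitimate websites, so the result is better read as evidence to be weighed than as a verdict.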

1.4. Thesis contribution

In order to determine anomalies that are found in the URLs and source codes of

phishing websites, I performed a meta-analysis of several past studies related to phishing

prevention, specifically heuristic methods. Then, I selected forty-one anomalies that can

be found in the URLs and source code of phishing websites. After that, I performed an

experiment on twenty online phishing websites to verify those anomalies and

determine their significance in phishing detection. Finally, I suggest ways in

which anomalies in the URLs and source code of phishing websites can be effectively

utilized during phishing detection. In general, the thesis makes the following

contributions:

(i) A systematic classification of phishing prevention techniques and

applications.

(ii) The meta-analysis of phishing prevention methods.


(iii) A set of forty-one anomalies that can be found in the URLs and source code

of phishing websites.

(iv) Results from an experiment conducted on twenty online phishing websites in

order to verify the significance of anomalies in phishing detection.

(v) Necessary guidelines to help in deployment of anomalies for phishing

prevention methods.

1.5. Thesis outline

The thesis proceeds as follows. Chapter two reviews phishing prevention techniques

and also includes a systematic classification of phishing prevention techniques and

applications. Chapter three includes the meta-analysis of list-based methods and

heuristic methods along with various related studies on them with their main contents,

specialities, and limitations. Chapter four lists the anomalies found in the URLs and

source codes of phishing websites which can be employed for phishing detection, explains

the setup of the experiment conducted on twenty online phishing websites, presents the

results obtained from the experiment, and discusses the findings. Chapter five

presents conclusions, and the last chapter, chapter six, covers the limitations of this

research and some future research and development work.

2. Review of phishing prevention methods

2.1. Meaning of phishing prevention methods

Phishing utilizes the union of technology and social engineering. Social engineering is

about the exploitation of human vulnerabilities [Odaro and Sanders, 2011]. There are

various limitations which arise from human behaviour and decision making processes

(e.g., greed and fear affect decisions), and social norms (e.g., ethics, legality) which,

unfortunately, so far do not have an exact technical solution valid for all scenarios. In

order to overcome those limitations, Internet users’ own judgement is required to correctly

make security-critical decisions. However, phishers use social engineering and

technology in a strategic manner to distract their potential victims [Jakobsson, 2005].

Therefore, phishing prevention techniques target both components (i.e., technology and

social engineering) related to phishing. More precisely, a phishing prevention technique is any

technical or non-technical solution designed to either stop sensitive information from

leaking to a counterfeit website or make leaked data useless [Cao et al., 2008].


In order to address the problem of phishing, the American Bankers Association

[2005] recommends developing a comprehensive set of procedures that perform:

(i) Detection. Detection means to keep a vigilant eye on phishing and discover

when any new phishing activity occurs before it can victimize Internet users. It

also includes a solution that extracts information about the phishing website.

(ii) Prevention. Prevention means to help in reducing the frequency of phishing

attempts that Internet users receive or educate Internet users so that they are less

likely to respond to phishing attempts.

(iii) Response. Response means to focus on the precaution and action which

have to be taken after the detection of phishing. It is also related to information

flow about the culprit website and process of removing the phishing websites.

Even though it is recommended for the banking sector, it is valid for curbing all other

kinds of phishing as well. The three procedures are shown in Figure 3.

Figure 3: Phishing prevention procedures [American Bankers Association, 2005]

2.2. Important factors for effective phishing prevention methods

In order to prevent phishing attacks, it is vital to comprehend phishers’ behaviour and

phishing techniques along with Internet users’ behaviour and their decision making

process. An analysis of phishers’ behaviour and techniques provides insight into the

technical and social engineering techniques applied for phishing. Likewise,

Internet users’ behaviour and their decision making process shed light on what

Internet users are good at doing and on their vulnerabilities. Detecting a phishing attempt

is of little use if Internet users either fail to notice or ignore the warnings

from a phishing prevention system. Therefore, Internet users’ response limitations

should be respected. These should further be supported by suitable usability.


2.2.1. Phishers’ behavior and phishing techniques

Computer security attacks are of three kinds:

(i) Physical attacks. They target physical infrastructure and networks to cause physical

outages, such as cutting power or data transmission cables.

(ii) Syntactic attacks. They target vulnerabilities and loopholes in software, such as

problems in cryptographic algorithms and protocols.

(iii) Semantic attacks. They target people’s behaviour and the way they interact with

computers and the web, such as the use of social engineering to manipulate Internet

users and steal their information.

This means that phishing includes both syntactic and semantic attacks [Downs et al.,

2006]. This also implies that a phishing prevention system should protect Internet users

from both syntactic and semantic attacks.

According to Singh [2007], the schemes used by phishers can roughly be classified

into the following four kinds:

(i) Dragnet method. It uses spammed emails, websites, or pop-up windows bearing

falsified corporate identification.

(ii) Rod-and-Reel method. It targets specific prospective victims with whom initial

contact is already made, and sends false information to prompt their disclosure

of personal or financial data.

(iii) Lobsterpot method. It consists of creating a forged website that imitates a

legitimate website so that the victims mistake the spoofed website for a

legitimate one and provide their personal data.

(iv) Gillnet phishing. It uses malicious code which infects the user’s system with a

Trojan horse or changes the settings of the user’s system. Consequently, the Internet

user is directed to a phishing website when trying to visit a legitimate website, or the

keystrokes of the user’s personal information are recorded and transmitted to

phishers.

In all these techniques, the phishing schemes seem to typically rely on three basic

elements:

Phishing solicitations often use familiar corporate trademarks and trade names, as

well as recognized security agency names and logos. This can be seen from

Figure 4; it is a phishing website for “Paypal” that also uses “Verisign” logo.


Figure 4: A phishing website for Paypal

The solicitations routinely contain warnings or information about an award, a

lottery, or other similar messages intended to cause the recipients immediate

concern or worry about access to an existing financial account. An example

of a phishing email informing about a grant can be seen in Figure 5.

Figure 5: A phishing email

The solicitations rely on two facts pertaining to authentication of the e-

mails:

1. Online consumers often lack the tools and technical knowledge to

authenticate messages, especially from financial institutions and e-

commerce companies; and


2. Most of the available tools and techniques are inadequate for robust

authentication or can easily be spoofed [Wu et al., 2006b].

In fact, these are the elements against which existing anti-phishing techniques work

and which future research on phishing prevention techniques will also address. There are

several heuristic methods that use logo comparison, look for the misuse of security

agency logos, and other properties to detect phishing. Heuristic methods are discussed in

later chapters. In addition, various spam and phishing email filters are in use to protect

against phishing attacks. Some e-commerce organizations have their own toolbars

designed for their customers, e.g., eBay’s toolbar, which can alert their clients about

phishing targeting eBay [eBay Toolbar’s Account Guard].

2.2.2. Internet users’ behavior and decision making process

Human behaviour makes the decision making process very complex. The

outcome depends on probability. It is affected by various factors, such as beliefs,

preferences, past experiences, subjected situations, current states, and others. Further

studies can:

Improve the understanding of factors that make Internet users fall for phishing,

and

Guide security experts to design countermeasures which can effectively protect

Internet users from phishing.

There has been little work done related to Internet users’ behaviour and decision making

process in the context of phishing. There is, however, work related to human behaviour

and decision making process in other research contexts. Only a few security scientists

have contributed to human behaviour and decision making process with respect to

phishing. The experiment by Dhamija et al. [2006] on why people fall for phishing is an

example of such work. This study focused on finding limitations of existing phishing

prevention techniques. Their study revealed that Internet users have their own

preferences of characteristics for identifying phishing, and their decision making

process is affected by various factors, such as their past experiences with phishing and their

subjected situation (e.g., the reaction of a person desperately looking to buy a FIFA World Cup ticket

to FIFA World Cup-themed phishing will differ from that of a person who

has not thought about watching the FIFA World Cup). For instance, in this experiment

participants were asked to identify phishing, which affected their decisions; participants

were found to be misled by attractive and luring sentences in emails or websites.

13

Moreover, subjected situation was a key factor; in the experiment there was no penalty

for wrong decision which affected participants’ decisions.

A classic example of the impact of belief on decision making during phishing detection is
the experimental case study performed on a bank’s employees by Aburrous et al. [2010]. They
found that some Internet users strongly believe that they are capable of detecting all kinds
of phishing attacks and therefore avoid using anti-phishing tools, which unintentionally
exposes them to a higher risk of phishing attacks. One of the in-depth studies of Internet
users’ behaviour while interacting with phishing was done by Dong et al. [2008]. Their
research focused on Internet users’ behaviour during interaction with phishing websites and
on their decision making process. They also designed a model called the “user-phishing
interaction model” after a cognitive walkthrough of four hundred phishing websites,
identifying users’ activities, the information used, and the assumptions/executions that
Internet users make during their interaction with a phishing webpage. A diagrammatic
representation of the information Internet users may use when encountering phishing attacks
is shown in Figure 6.

Figure 6: The overview of User-Phishing Interaction [Dong et al., 2008]

External information. This is the information that users perceive from the user interface
(including phishing emails/communication), as well as from other sources (such as expert
advice).

Knowledge and context. This is the information that users perceive from their environment,
social networks, past experience, things happening around them, etc.

Expectation and previous perception. After each action, Internet users have some
expectations. This is the information retrieved from these expectations and from their
understanding of the system.

In their Decision Making Model, Dong et al. [2008] mentioned the following two kinds

of decision that users make when interacting with phishing activities and reflect in their

content.

Decide on a series of actions to take. This decision is made consciously and affects the
decision whether to proceed or not.

Decide whether to proceed or not. This decision is usually made subconsciously.

Both decision making processes are further divided into the following three steps:

Construction of the perception of the situation

Generation of possible actions to respond

Generation of assessment criteria and choosing an action.

A diagrammatic representation of their Decision Making Model is shown in Figure 7.

Figure 7: Decision Making Model [Dong et al., 2008]

2.3. Objectives of existing phishing prevention methods

Several phishing prevention methods have resulted from different studies on protection
against phishing. These studies are primarily motivated by the following objectives:

(i) Finding the reasons behind Internet users’ tendency to fall for phishing

(ii) Designing techniques to educate Internet users and raise awareness about phishing

(iii) Designing effective User Interfaces (UI) and warnings to alert about phishing

(iv) Developing countermeasures to automatically detect phishing

(v) Evaluating the effectiveness of existing phishing prevention methods, and

(vi) Inventing proactive strategies for phishing prevention.

References and examples from all these research studies are given below.

2.3.1. Reasons behind Internet users’ tendency to fall for phishing

It is not uncommon for novice Internet users to be victimized by phishing; but, shockingly,
even those with adequate knowledge about phishing are tricked by phishers [Odaro and Sanders,
2011]. A study conducted by Aburrous et al. [2010] on a bank’s employees found that even
employees from the Information Technology (IT) department, who are chiefly responsible for
remaining alert to phishing, got tricked. Likewise, in a study by Dhamija et al. [2006],
ninety percent of the participants were tricked by good phishing websites. A number of such
studies have examined the reasons behind Internet users’ tendency to fall for phishing.

The empirical study by Friedman et al. [2002] on users’ conceptions of web security revealed
that many Internet users are unable to differentiate between secure and insecure website
connections. The meaning of security varies from one Internet user to another, and many look
to UI components that can easily be copied from the original website as cues for a secure
connection. Likewise, the study by Dhamija et al. [2006] found that many Internet users are
unable to differentiate between legitimate and spoofed websites. Many Internet users use the
content of the website as a cue for authenticity. A number of Internet users rely on the
padlock icon, animated graphics, pictures, and design touches, such as logos and favicons, to
differentiate between genuine and fake websites.

Most surprisingly, many Internet users do not hesitate to reveal their personal information
to a spoofed website despite warnings from the phishing prevention tools installed on their
systems. Dhamija et al. [2006] also blamed the ineffectiveness of existing phishing
prevention solutions as a reason behind Internet users falling for phishing. These solutions
are mostly technical and usually neglect some crucial non-technical aspects in their design.

Similarly, the study by Downs et al. [2006] on Internet users’ mental models when reading
email and browsing the web, and on their vulnerability to manipulation, revealed that merely
having knowledge of and experience with phishing is an ineffectual strategy for phishing
prevention, especially in the case of new phishing methods. One reason mentioned is that
current awareness techniques do not effectively convey possible vulnerabilities or strategies
for identifying phishing emails. Another reason could be that applying certain knowledge too
rigidly can lead users to suspect genuine email and web-based actions [Odaro and Sanders,
2011], which is unlikely to work for the many people who conduct business via the web.

Wu et al.’s [2006b] study also found that many Internet users use website appearance and
content to differentiate between fraudulent and legitimate websites. Moreover, security is
rarely the primary goal of Internet users. They also indicate that sloppy web practices help
confuse Internet users and expose them to risk. For example, the same web form is used to
submit both sensitive and non-sensitive information, some legitimate websites use Internet
Protocol (IP) address URLs, and some legitimate websites have login pages without Secure
Sockets Layer (SSL) or use SSL for so short a time that Internet users do not notice it.
Moreover, Ma [2006] and Wu et al. [2006b] mention that the lack of an alternative is a factor
behind Internet users falling for phishing. Almost all phishing prevention approaches detect
probable phishing, but they rarely provide an alternative way to proceed, which forces
Internet users to take the risk. Human behaviour also plays some role in making Internet
users fall into the phishing trap.

2.3.2. Design techniques to educate and aware about phishing

Phishing is largely dependent on the human factor, so educating Internet users and raising
awareness about phishing is one of the potential countermeasures. Not all phishing attempts
are difficult to recognize. The majority of phishing attacks contain visible distinguishing
factors that can help Internet users identify them; however, the majority of Internet users
are either not aware of these factors or not clear about them. Phishing attacks exploit this
inability to distinguish legitimate websites from phishing websites. Surveys and studies
undertaken by Friedman et al. [2002], Dhamija et al. [2006], Karakasiliotis et al. [2007],
Jagatic et al. [2007], Herzberg and Jbara [2004], and Odaro and Sanders [2011] have revealed
that Internet users lack proper knowledge about phishing. Their skill in identifying phishing
attacks is not adequate, and they often misclassify phishing websites as legitimate websites
and vice versa. Undoubtedly, not all phishing attacks can be detected manually. Yet manual
detection by Internet users can make a big difference in reducing the number of people
falling for phishing. Wu et al. [2006b] found a significant improvement in Internet users’
ability to detect phishing attacks after they had read a tutorial about phishing sent by
email. Various kinds of material are available to educate Internet users, raise their
awareness about phishing, and teach techniques to detect it manually.

Many online training materials are published by various government and non-government
organizations, businesses, security organizations, universities, etc. Most of the
organizations that work on the prevention of phishing (e.g., APWG, antivirus companies,
universities working on phishing) or are targeted by phishing attacks (e.g., banks,
e-business companies, finance companies) have included on their official websites
information about phishing and instructions to follow when encountering it. An example of
such information on the website of Nordea Bank, Finland is shown in Figure 8.

Figure 8: ‘About phishing’ page in Nordea Bank, Finland website

Many other online materials are also available. “Anti-Phishing Phil”, an interactive game,
and “PhishGuru”, an interactive training system, were designed by the CyLab Usable Privacy
and Security (CUPS) Laboratory at Carnegie Mellon University to educate Internet users about
phishing websites. The experiment by Sheng et al. [2007] on the role of games in educating
Internet users about phishing showed that the game is more effective than other means, such
as reading text or online tutorial material. A screenshot of the “Anti-Phishing Phil” game is
shown in Figure 9.

Figure 9: A screenshot of the educational game called “Anti-Phishing Phil”

Similarly, “Phish or Not Phish”, an online quiz developed by VeriSign, is available for
free. It displays snapshots of two similar-looking websites and asks users to identify the
snapshot taken from a phishing website. After each answer, it explains why one of the
snapshots is from a phishing website. A screenshot of “Phish or Not Phish” is shown in
Figure 10.

Figure 10: A screenshot of the online quiz called “Phish or Not Phish”

2.3.3. Design effective UI and warning to alert about phishing

Dhamija et al. [2006] have mentioned that phishing cannot be solved solely by a traditional
cryptography-based security framework; it equally requires attention to usability and user
experience. Several studies have indicated that a bad or ineffective user interface is one of
the prominent factors behind the weak performance of anti-phishing software. Wu et al.
[2006a] pointed out that placing warning indicators in a peripheral area, as many phishing
prevention solutions do, is an example of very poor design. Further, they mention that such
warning indicators send a very weak signal in comparison to the much larger, centrally
displayed spoofed web page. The study by Zhang et al. [2007b] also revealed the poor
usability of existing phishing prevention tools. Some examples of poor design in phishing
prevention tools are:

Use of red and green colour indicators, which is a poor choice for users with red/green
colour blindness unless some other noticeable cue is included along with them,

Use of pop-up dialog boxes for warnings, when popular browsers (e.g., Internet Explorer
(IE), Google Chrome, Mozilla Firefox) have an option to block such boxes and, besides that,
most Internet users dismiss such boxes without reading them. An option to disable pop-up
dialogs in IE 9 is shown in Figure 11.

Some examples of anti-phishing toolbars that use poor ways to notify users of phishing
attacks are eBay’s Account Guard and SpoofGuard. eBay’s Account Guard shows a green icon to
indicate that the webpage belongs to eBay or PayPal, a grey icon for unidentified websites,
and a red icon to indicate a potential phishing website [eBay Toolbar’s Account Guard].
SpoofGuard displays traffic light colours (red: above its threshold value; yellow: probably
hostile; green: low score, probably safe) to indicate a website’s chance of being a phish
[Chou et al., 2004].
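As a minimal sketch of such a score-based indicator, the following Python fragment maps a
score to one of three traffic light states. The threshold values, the assumption that the
score lies between 0 and 1, and the names `Indicator` and `indicator_for` are illustrative
assumptions and are not taken from SpoofGuard. A text label is returned together with the
colour, so that the state does not depend on colour alone.

```python
from dataclasses import dataclass

# Hypothetical thresholds: the actual values used by SpoofGuard are not given here.
HOSTILE_THRESHOLD = 0.7
SUSPICIOUS_THRESHOLD = 0.4


@dataclass(frozen=True)
class Indicator:
    colour: str
    label: str  # text cue, so the state is readable without colour vision


def indicator_for(score: float) -> Indicator:
    """Map a spoof score in [0, 1] to a traffic light indicator."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score > HOSTILE_THRESHOLD:
        return Indicator("red", "Likely phishing")
    if score > SUSPICIOUS_THRESHOLD:
        return Indicator("yellow", "Possibly hostile")
    return Indicator("green", "Probably safe")
```

For example, under these assumed thresholds `indicator_for(0.5)` yields the yellow state with
the label "Possibly hostile".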

Figure 11: Highlight of pop-up blocker in Internet Explorer 9

Currently, significant improvement can be seen in the usability of some phishing prevention
tools. Popular browsers use active warnings that are displayed on the full page. Such a
warning cannot go unnoticed by Internet users. Internet Explorer uses both active and passive
warnings: when it has confirmed that a website is a phishing website, it uses an active
warning, whilst for a suspected webpage a passive warning is used. An active warning
displayed by the Google Chrome browser is shown in Figure 12.

Figure 12: Active warning message displayed in Google Chrome

Similarly, many other researchers have designed user friendly interfaces. Dynamic Security
Skins [Dhamija and Tygar, 2005] uses a random photographic image in the background of the
password window as a cue to differentiate between legitimate and fake websites. Each Internet
user is assigned a unique image and is advised to enter the password only after his or her
personal image is loaded. SpoofStick displays the website’s real domain and exposes websites
that obscure their domain name [SpoofStick]. TrustBar makes SSL more visible by displaying
the logos of the website and its certificate authority [Herzberg and Gbara, 2004]. The
Netcraft toolbar enforces the display of browser navigational controls (toolbar and address
bar) in all windows, to defend against pop-up windows that attempt to hide the navigational
controls. In addition, it clearly displays a site’s hosting location, including the country,
which helps in evaluating fraudulent URLs [Netcraft].

2.3.4. Development of countermeasures to automatically detect phishing

Human ability to detect phishing is limited and varies among Internet users. Moreover,
manual methods of identifying phishing can be fooled. Therefore, several software tools have
been developed to identify phishing. These software tools can be phishing email filters,
such as Phishing Identification by Learning on Features of Email Received (PILFER) [Fette et
al., 2006], which uses a machine learning based approach to examine a set of ten features in
a suspected email. PILFER also uses a Support Vector Machine (SVM) classifier for its
reference implementation. Another approach to spam filtering is greylisting, which blocks
spam at the mail server level based on the behaviour of the sending server rather than the
content of the message. A mail server that employs greylisting deliberately rejects mail
from unknown or suspect sources with a temporary error until a configured period of time has
passed. It relies on the fact that many spam sources, i.e., the Simple Mail Transfer Protocol
(SMTP) software used by spammers, do not maintain queues for retrying message transmission.
When a sender has proven itself able to properly retry delivery, it is added to the whitelist
so that its mail is no longer impeded. However, the problem with phishing email filters is
that they fail to stop phishing attacks that use other mediums, such as IRC, Messenger, and
advertisements [IBM Internet Security Systems, 2007]. Moreover, such phishing email filters
are unable to stop all malicious emails.
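The greylisting behaviour described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the implementation of any particular mail server:
the sender is identified by the triplet (client IP address, envelope sender, envelope
recipient), the delay defaults to 300 seconds, and the class name `Greylist` and the return
values "accept" and "tempfail" are hypothetical.

```python
import time
from typing import Callable, Dict, Set, Tuple

# Assumed sender identity: (client IP, envelope sender, envelope recipient).
Triplet = Tuple[str, str, str]


class Greylist:
    def __init__(self, min_delay: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.min_delay = min_delay              # configured waiting period in seconds
        self.clock = clock                      # injectable clock, useful for testing
        self.first_seen: Dict[Triplet, float] = {}
        self.whitelist: Set[Triplet] = set()

    def check(self, client_ip: str, sender: str, recipient: str) -> str:
        """Return 'accept' or 'tempfail' for one delivery attempt."""
        key = (client_ip, sender, recipient)
        if key in self.whitelist:
            return "accept"                     # sender has already proven itself
        now = self.clock()
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            # The sender retried after the waiting period: whitelist it.
            self.whitelist.add(key)
            del self.first_seen[key]
            return "accept"
        return "tempfail"                       # temporary error; a proper sender retries
```

A sending server that queues and retries after the waiting period is accepted on its later
attempt, whereas a spam source that never retries is never accepted.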

There are several software tools, mostly in the form of browser toolbars, that detect
phishing attacks using other mediums as well as email. Anti-phishing tools are available
integrated in popular anti-virus software, such as Norton antivirus and AVG antivirus;
in-built in popular browsers, such as Internet Explorer, Mozilla Firefox, and Google Chrome;
and as independent applications or web browser add-ons, such as FraudEliminator [Fraud
Eliminator], Netcraft toolbar [Netcraft], eBay toolbar [eBay Toolbar’s Account Guard],
EarthLink toolbar [EarthLink], GeoTrust TrustWatch toolbar [Geo Trust], SpoofGuard [Chou et
al., 2004], CallingID Toolbar [CallingID], Cloudmark Anti-Fraud toolbar [Cloudmark], Google
Safe Browsing [Google Safe Browsing], SpoofStick [SpoofStick], and TrustBar [Herzberg and
Gbara, 2004]. These tools employ heuristic methods, list based methods, or both for phishing
detection. Heuristic methods check the characteristics of a website and decide whether it is
phishing or not, whilst list based methods maintain a list of either genuine websites
(whitelist) or phishing websites (blacklist) and check whether the website is in the list to
decide whether it is phishing or not. Each technique has its own pros and cons. Since this
thesis is also about heuristic methods for phishing detection, the later chapters cover the
details of the heuristic methods and list based methods used for automatic detection of
phishing.
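The following minimal sketch illustrates how a list based check and a heuristic check can
complement each other. It is an illustration only and does not reproduce any of the tools
named above. The function names, the host-level lists, and the choice of a single heuristic
(an IP address used as the host of the URL) are assumptions; because some legitimate
websites also use IP address URLs, the heuristic reports "suspicious" rather than
"phishing".

```python
import ipaddress
from typing import Set
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lower-cased host part of a URL, or '' if there is none."""
    return (urlsplit(url).hostname or "").lower()


def host_is_ip_address(host: str) -> bool:
    """Heuristic: True if the host is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def classify(url: str, whitelist: Set[str], blacklist: Set[str]) -> str:
    """List lookup first; fall back to the heuristic for unlisted hosts."""
    host = host_of(url)
    if host in blacklist:
        return "phishing"
    if host in whitelist:
        return "legitimate"
    if host_is_ip_address(host):
        return "suspicious"
    return "unknown"
```

The order of the checks is a design choice: the lists give a definite answer for known
hosts, and the heuristic is consulted only for hosts that neither list covers.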

There is also DNSSEC Validator [DNSSEC Validator], an add-on for the Mozilla Firefox browser
that detects DNS spoofing. DNSSEC Validator only compares the DNS records of the domain name
used in the page address with the IP address from which Firefox downloads the page in order
to detect DNS spoofing. A screenshot of DNSSEC Validator is shown in Figure 13.

Figure 13: A screenshot of browser add-on “DNSSEC Validator” [DNSSEC Validator]
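The comparison performed by such an add-on can be sketched as below. How the add-on obtains
the validated DNS records and the connected address is not described here, so both are simply
passed in as arguments; the function name `dns_matches_connection` is hypothetical.

```python
import ipaddress
from typing import Iterable


def dns_matches_connection(dns_addresses: Iterable[str],
                           connected_address: str) -> bool:
    """True if the address the page was downloaded from appears in the DNS records.

    Addresses are parsed before comparison so that different spellings of the
    same IPv6 address compare equal.
    """
    connected = ipaddress.ip_address(connected_address)
    return any(ipaddress.ip_address(a) == connected for a in dns_addresses)
```

A result of False indicates that the page came from an address that the DNS records do not
list, which is the condition reported as possible DNS spoofing.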

2.3.5. Evaluate the effectiveness of existing phishing prevention methods

Despite wide media coverage of phishing and numerous phishing prevention techniques,
phishing remains effective. This raises a serious concern about the efficiency of the
methods used for phishing prevention. Many studies have been conducted to examine the
efficiency of existing phishing prevention methods. These studies expose the reliability of
phishing prevention methods and at the same time point out their deficiencies, which can
help to improve existing as well as forthcoming phishing prevention methods.

The study by Wu et al. [2006b] on the effectiveness of security toolbars revealed that
existing security toolbars largely fail to mitigate phishing attacks. They pointed out
several factors that can be responsible for the ineffectiveness of phishing prevention
methods, such as an alert display that is very small in comparison to the content display
and is located at the periphery, where it goes unnoticed; security not being the primary
goal of Internet users; and distrust towards such toolbars due to their false positives.

Another similar study by Zhang et al. [2007b] observed the tool performance, testing
methodology, and user interface of eleven selected phishing prevention tools (i.e., CallingID
toolbar, Cloudmark Anti-Fraud toolbar, EarthLink toolbar, eBay toolbar, Firefox 2, GeoTrust
TrustWatch toolbar, Microsoft Phishing Filter in Windows Internet Explorer 7, Netcraft
anti-phishing toolbar, Netscape browser, and SpoofGuard). It revealed that these tools are
underperforming and that all of them are incapable of protecting Internet users from
phishing attacks that use sophisticated techniques. Their performance varies with the source
of the phishing URLs they use. Further, many of the tools failed even for very simple
exploits. The authors suggest that no single phishing prevention method can ensure high
performance; multiple methods that support each other, used together in an anti-phishing
tool, can provide better results.

Some studies evaluated only the effectiveness of the in-built anti-phishing toolbars of web
browsers. For instance, Ludl et al. [2007] analyzed the effectiveness of the blacklists
maintained by Microsoft and Google. The blacklist maintained by Microsoft is used in Internet
Explorer, whilst the blacklist maintained by Google is used by Google Chrome and Mozilla
Firefox. Since Google Chrome, Mozilla Firefox, and Internet Explorer are the most widely used
web browsers, their in-built anti-phishing toolbars are also the most widely used. The study
focused on three crucial factors: the coverage and quality of the blacklists, and the list
update time. It indicated that blacklist based phishing prevention is satisfactorily
effective, especially Google’s; however, the inability of blacklist based phishing
prevention to detect new phishing attacks can be handled to a large extent using heuristic
techniques, in the way the IE browser uses a heuristic technique to complement its list
based technique.

Likewise, Bian et al. [2009] evaluated the effectiveness of three external online resources
(Google PageRank system, Yahoo! Inlink data, and Yahoo! Directory service). Their findings
suggested that such online resources can increase the efficiency of detection when used in
conjunction with existing countermeasures.

Similarly, Egelman et al. [2008] studied the effectiveness of Internet browser warnings and
found that most Internet users heed active warnings (79% in their experiment), whilst a
passive warning was no different from not displaying any warning. They further found that
the active warning in Mozilla Firefox was more helpful than the IE active warning. Sheng et
al. [2009] also performed an empirical analysis of the effectiveness of phishing blacklists
and found that phishing blacklists are a poor choice for fighting zero hour phishing
websites. Li and Helenius [2007] performed a heuristic usability evaluation of five selected
anti-phishing client-side applications (i.e., Google toolbar, Netcraft toolbar, SpoofGuard,
Phishing Filter in IE, and anti-phishing IEPlug). They suggested the following three points
for an effective usability design of anti-phishing client-side applications:

(i) The toolbar’s status should be visible to Internet users, and anti-phishing client-side
applications should have an intuitive interface.

(ii) Warnings should help Internet users to take the correct decision. The warning for a
suspicious webpage should not be as strong as the warning for a detected webpage.

(iii) Anti-phishing client-side applications should be supported by a suitable help system.

2.3.6. The need to invent proactive strategies for phishing prevention

Most of the investigations on phishing are motivated towards finding a new reactive
technique. A reactive technique is often effective against the types of phishing that
existed when the technique was designed, but it abruptly fails to detect a phishing attack
that employs a new technique. Current trends in research are chiefly targeted towards
defending against attacks from phishers or taking down phishing websites, while scammers are
continuously devising new attacks. In fact, no adequate effort seems to be applied to reach
the root of the problem. There is a need for more research that can:

Strengthen the weak points in legitimate systems and make them tedious to misuse

Develop strategies to retaliate against and circumvent the phishers, and

Track the phishers to bring them under law enforcement

Law enforcement could be difficult in those countries that have no legal provision for such
cases; however, a study by APWG [2012] showed that the countries that host most of the
phishing websites are developed countries, with the USA topping the list with an average of
about fifty percent of the phishing websites hosted there. Other countries hosting most of
the phishing websites in the first quarter of 2012 are shown in Figure 14.

Figure 14: Top countries hosting phishing websites [APWG, 2012]

Most of the countries in the list have stern laws against cyber crime, so tracking such
people and punishing them under the law can discourage many would-be phishers, or at least
those who are not technically skilled yet still conduct phishing. Similarly, making phishing
more difficult to conduct can strongly affect the naïve players in phishing. Some proactive
strategies are directed towards reducing phishing.

One way is to use web crawlers, similar to those used by search engines, to search for
phishing websites and pass this information to the appropriate Internet Service Provider
(ISP) to bring the websites down. However, this technique has some limitations. Many
countries do not have a legal provision for removing such websites. Moreover, such detection
takes time, which can be enough for scammers to achieve their illegal aims.

Another similar technique is to flood the phishers’ database with false information, also
called poisoning, which is not a Denial of Service (DoS) attack. This can make it difficult
for phishers to differentiate between valid and false data and can sometimes even make the
database useless. This technique, too, has limitations. It requires tracking the spoof
websites without any false negative. The time taken to track the fraudulent website can be
sufficient for phishers to victimize many Internet users. Moreover, any false negative
result can cause serious consequences and lawsuits.

Another proactive technique is to keep watch on downloads of a corporation’s logo. Many
phishers use an authentic logo to give their fake websites a more real look. However, this
technique, too, has some limitations. Firstly, the corporation’s logo is also used by the
corporation’s partners and some other legitimate websites, and phishers can easily download
it from them. Figure 15 shows a page from a website that has the logos of various banks, and
hundreds of such websites are available online.

Figure 15: Logos of various banks used in a personal blog website

Secondly, making a copy of a legitimate website’s logo is not difficult for many good
designers. After all, how many Internet users can correctly differentiate between a
legitimate logo and its copy is still a question to be answered.

One of the prominent works on proactive strategies to track phishers is by McRae and Vaughn
[2007], who used web bugs and honeytokens. In their experiment, they used uniquely named
HyperText Markup Language (HTML) image tags of one pixel by one pixel for each phishing
e-mail as honeytokens or web bugs. The HTML and image links were entered as the values of all
variables with a text data type in phishing website forms and submitted. When phishers viewed
the results in an HTML-enabled environment that does not filter or block the loading of
third party images, the image was retrieved from the server by the attacker. This retrieval
is used to gather information about the individuals or groups who viewed the data collected
by phishing schemes. However, this technique, too, can be bypassed using various approaches,
for instance:

View the results of the phishing form in text-only viewers.

Disable the HTML code and prevent any referral from being made in the web server log.

Disable loading of third party images in the browser used.

Use a web proxy (usually some hacked system) to view the results.
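The web bug idea described above can be sketched as follows. This is only an illustration of
the principle and not the code used by McRae and Vaughn [2007]: the tracking host, the URL
layout `/bug/<token>.gif`, and all function names are assumptions.

```python
import secrets
from html import escape
from typing import Dict, Iterable, Tuple


def make_web_bug(tracking_host: str) -> Tuple[str, str]:
    """Return (token, html_tag) for a uniquely named 1x1 pixel image."""
    token = secrets.token_hex(16)  # 32 hex characters, unique per submission
    src = f"https://{tracking_host}/bug/{token}.gif"  # assumed URL layout
    tag = f'<img src="{escape(src, quote=True)}" width="1" height="1" alt="">'
    return token, tag


def fill_text_fields(field_names: Iterable[str], tag: str) -> Dict[str, str]:
    """Use the web bug as the value of every text field of a phishing form."""
    return {name: tag for name in field_names}


def token_from_request_path(path: str) -> str:
    """Recover the token from a logged request path such as /bug/<token>.gif."""
    name = path.rsplit("/", 1)[-1]
    return name[:-4] if name.endswith(".gif") else name
```

When a request for the image later appears in the tracking server’s log, the recovered token
links the request to the phishing form submission that carried it.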

Another proactive approach is from Hacker Factor Solutions [2005], who proposed using page
encoding to encapsulate each web page in order to stop phishing websites generated using
mirroring techniques. The availability of various mirroring techniques (the web browser’s
“Save as” option is the simplest; other tools include “wget”, WebWhacker, Templeton, telnet,
and netcat) has drastically reduced the time and effort phishers need to make fake websites.
In fact, such techniques act as a catalyst for the vigorous growth of phishing and encourage
many novice cybercriminals to perform phishing. However, the problem with this approach is
that it uses JavaScript code to decode the page content, whilst all popular web browsers have
options to disable JavaScript, and at present only a few websites require JavaScript to be
enabled. Figure 16 shows the options to disable JavaScript in the Google Chrome browser.

Figure 16: Options to enable and disable JavaScript in Google Chrome

Moreover, there are add-ons like NoScript for the Mozilla Firefox browser that can be used
to allow the execution of JavaScript, Java, Flash, and other plugins only on selected
websites. Figure 17 shows the options to disable the execution of scripts in NoScript.

Figure 17: Options to enable and disable script execution in NoScript

There are many other issues with this approach; some of them are:

Search engines will be unable to index the page since it is encoded.

It cannot provide protection against phishing malware.

It requires a routine (maybe weekly) change of the encoding algorithm.

It needs specialized skill and more time to develop such websites.

Likewise, another technique is from Li et al. [2007], who suggest misuse-oriented
prevention, i.e., protection from phishing attacks with the misuse case method from a system
design perspective. Security requirements are often not stated during requirement
elicitation and analysis, leaving vulnerabilities in future information systems which are
later exploited by scammers. Such vulnerabilities can be fixed using a misuse case approach
at requirement gathering: a designer is asked to abuse each use case, and then a
countermeasure is identified and employed. The process continues iteratively until the use
case is foolproof. The methodology of misuse cases can be summarized as follows:

a. Design the use cases of the system;

b. Impersonate a misuser who intends to compromise the system;

c. Design the misuse for a specific use case;

d. Find a countermeasure for the misuse case;

e. Judge whether the countermeasure is vulnerable; if yes, go to step c, otherwise go to
the next step;

f. Find whether there is a possible vulnerability or misuse; if yes, go to step c,
otherwise security requirement elicitation ends.
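The iterative structure of steps a to f can be sketched as a loop. This is only a schematic
reading of the steps and not code from Li et al. [2007]. The two callables stand in for the
designer’s judgement and are hypothetical; checks e and f are collapsed into the single call
`find_misuse`, which is given the countermeasures found so far, and `max_rounds` is an added
safety bound.

```python
from typing import Callable, Dict, List, Optional, Sequence


def elicit_security_requirements(
    use_cases: Sequence[str],                                   # step a
    find_misuse: Callable[[str, List[str]], Optional[str]],     # steps b, c, e, f
    find_countermeasure: Callable[[str, str], str],             # step d
    max_rounds: int = 100,
) -> Dict[str, List[str]]:
    """Collect countermeasures per use case until no further misuse is found."""
    requirements: Dict[str, List[str]] = {uc: [] for uc in use_cases}
    for uc in use_cases:
        for _ in range(max_rounds):
            # Ask the (impersonated) misuser for a misuse that still works
            # against the countermeasures collected so far.
            misuse = find_misuse(uc, requirements[uc])
            if misuse is None:
                break  # no remaining misuse: elicitation for this use case ends
            requirements[uc].append(find_countermeasure(uc, misuse))
    return requirements
```

In practice both callables would be human activities rather than functions; the sketch only
makes the control flow of the loop explicit.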

Even though the technique could be beneficial in cases where websites are hacked and
compromised to conduct phishing, it is hard to see how it could prevent the majority of
phishing, in which phishers develop independent websites or ask for information through
email. Moreover, no matter how foolproof a system is designed to be, hackers may find some
way to intrude. This can be seen from the news of the attack on the computer system of the
Pentagon (the headquarters of the United States Department of Defense), in which 24,000
files were stolen [NYDailyNews.com, July 14 2011], and the news that a hacker succeeded in
hacking computer systems owned by Oracle, NASA (National Aeronautics and Space
Administration), the U.S. Army, and the U.S. Department of Defense [IDG News Service, May 10
2012].

2.4. Classification of phishing prevention techniques

There are several promising techniques that contribute significantly to preventing phishing
attacks. These techniques have to deal with both technical and non-technical factors.
Therefore, at the first level, phishing prevention techniques can be classified into
technical methods and non-technical methods. The technical methods can be further
categorized into list based methods and heuristic methods [Dunlop et al., 2010]. A
classification hierarchy of phishing prevention techniques is shown in Figure 18.

Figure 18: Classification of phishing prevention techniques

Technical methods. Technical methods deal with technical vulnerabilities in

information systems; tools for phishing detection, prevention, and response; and the

design of games, online tutorials, and quizzes for Internet awareness. Some

examples are: anti-virus software integrated with phishing prevention; built-in systems in

web browsers; and software tools, such as FraudEliminator, Netcraft toolbar, eBay

toolbar, EarthLink toolbar, Geo Trust Trustwatcher toolbar, SpoofGuard,

CallingID toolbar, Cloudmark Anti-Fraud Toolbar, Google Safe Browsing,

SpoofStick, TrustBar, AntiPhish, DOMAntiphish, and PwdHash.

o List based methods. List based methods classify websites as either

phishing or trusted and maintain them in a lookup database in the form of

either a blacklist or a whitelist. These lists can contain IP addresses, domain

names, or URLs. A blacklist is a collection of the IP addresses, domain names, or

URLs of phishing websites, whilst a whitelist is a collection of the IP

addresses, domain names, or URLs of legitimate websites.

List based methods are discussed in detail in section 3.1.

o Heuristic methods. Heuristic methods check one or more

characteristics of a website and decide whether it is a phishing or a legitimate

website. They utilize properties such as the HTML and script code of the website,

the URL, UI design, and page content for phishing website identification.

Heuristic methods are discussed in more detail in section 3.2.


Non-technical methods. Non-technical methods deal with factors that are

related to studying Internet users’ behavior, the social engineering principles and

techniques used by phishers, the legality of using particular techniques, training Internet

users about phishing, information and guidelines for safe browsing, and cyber

laws to punish phishing culprits.

Since the purpose of this thesis is to concentrate on technical methods, i.e., list based

methods and heuristic methods specifically used in browser based applications for

phishing prevention, non-technical methods are not discussed further here.

2.5. Phishing prevention applications

Both list based methods and heuristic methods are implemented in server-side

applications and client-side applications (i.e., browser based applications, since client-side

applications are widely used as web browser toolbars) used for phishing

prevention. According to their implementation architecture, client-side applications

are further categorized into two types: client-server structured applications and

independent applications [Li et al., 2007]. A classification hierarchy of phishing

prevention applications is shown in Figure 19.

Figure 19: Classification of phishing prevention applications

(i) Server side applications. Server side applications are deployed on servers

(e.g., organizational servers, email servers, ISP servers) for phishing identification

and remedy. Bayesian filters, for instance, are installed on the server to detect phishing

emails. Although such filters are an effective technique for phishing prevention, they

cannot be one hundred percent accurate and, above all, email is not the sole channel

of phishing attacks (other popular channels are message boards, web banner

advertising, and instant chats, such as Internet Relay Chat (IRC) and instant

messengers). Many other applications that use IP address

and URL blacklists, heuristics, and fingerprinting (comparing known samples of

phishing messages against incoming emails) are deployed on ISPs’ servers for

phishing prevention.

(ii) Client side applications or browser based applications. The web browser is the

most common means used by Internet users to access web content.

There are other means too, but they are usually tricky and complex, which

makes them unsuitable for general Internet users. Furthermore, the browser is the foremost

layer with which an Internet user interacts, and tracking the user’s activity at this level

is potentially more effective. Its strategic position makes it suitable for warning

Internet users directly and effectively [Sheng et al., 2009]. A study by

Egelman et al. [2008] even found that the phishing warning in Mozilla Firefox 2 was very

effective, and was able to stop all participants in their study from entering

sensitive information into fraudulent websites. In addition, the web browser market

is dominated by a small number of browsers, i.e., Google Chrome, Internet

Explorer, Mozilla Firefox, Safari, and Opera. Taken together, these factors make it

comparatively easy to handle phishing at the browser level.

This does not mean that the use of the web browser is free of limitations. Most

browser based techniques act only once the webpage is loaded, which is risky with

respect to malware and other malicious code used for phishing [Garera et

al., 2007; Ma et al., 2009]. Another factor that has always been challenging for

researchers and security experts in browser based techniques is the mode of

displaying warning messages. Passive warnings used to notify about phishing,

such as a change in colour or a pop-up with textual information displayed at the

corner or periphery of the browser without interrupting the browsing activity, are either

unnoticed or neglected by Internet users [Wu et al., 2006]. The current trend is to use

active warnings, which force Internet users to notice and take action by

interrupting the browsing activity. However, it is debatable how acceptable

such interruptive warnings are, particularly in the case of false positives, i.e.,

warnings raised for legitimate websites. This

might be a reason why IE uses an active warning when it is confirmed that the

website is a phishing website and a passive warning for doubtful

websites. Thus, such warnings should be precise and accurate. Any wrong

warning or alert can raise questions about the reliability of the tool, which ultimately

reduces Internet users’ trust in it.

Despite some limitations in the use of browser based techniques for phishing

prevention, they are widely used. Nowadays, most phishing prevention

applications concentrate on the most vulnerable client side [Li

et al., 2007], for which browser based applications are highly suitable. Such

applications are either built in or are independent browser toolbars that can

be embedded into the web browser. The current versions of all popular browsers

(Google Chrome, IE, and Mozilla Firefox) come with a built-in phishing

prevention system and some other features (e.g., blocking pop-up windows, enabling

and disabling JavaScript or Active Scripting in IE, and warning when sites try to install

add-ons in Mozilla Firefox) that contribute to the fight against phishing attacks. Some

examples of independent browser toolbars are: Netcraft Anti-phishing toolbar,

eBay’s Account Guard, SpoofGuard, and Microsoft Anti-phishing toolbar for IE.

Client-server structured applications. Client-server structured applications

routinely request updates and maintenance from the server. Such

toolbars are usually made by commercial organizations, such as Google,

Microsoft, and Netcraft. Mozilla Firefox uses Google Safe Browsing; it

updates its blacklist for the first time when the feature is enabled and after

that updates it every thirty minutes. It communicates with the Google

server on two occasions: during the regular update of the blacklist, and when

a reported phishing website is encountered, so that before blocking the

website it double-checks that the website has not been removed from the list since the

last update. Similarly, Google Chrome contacts the Google servers within

five minutes of start-up, and approximately every half an hour thereafter,

to download updated lists of suspected phishing websites. Likewise, IE

from version 8 onwards uses “SmartScreen Filter” for phishing detection, which performs

both local verification and online lookup for phishing website identification.

SmartScreen Filter uses both the list based method and the heuristic method.

First, in local verification, it

looks for the website’s URL in the whitelist (generated by Microsoft) stored

on the user’s computer. If the website is not found in the list, it uses

the heuristic method to detect probable deception. When the heuristic

method indicates that the website is suspicious, it sends the website address to

the Microsoft online service in order to compare it with the service’s blacklist, i.e.,

online lookup. Figure 20 shows the option to enable “SmartScreen Filter” in

IE 9.

Figure 20: SmartScreen Filter in IE9
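The lookup order described for SmartScreen Filter (local whitelist, then local heuristics, then online blacklist) can be illustrated with a minimal sketch. This is not Microsoft's implementation: the list contents, the toy heuristic rule, and all function names below are hypothetical and serve only to show the order of the checks.

```python
from urllib.parse import urlparse

# Hypothetical data; the real lists are maintained by the vendor.
LOCAL_WHITELIST = {"www.example-bank.com"}
ONLINE_BLACKLIST = {"http://198.51.100.7/login/example-bank"}


def looks_suspicious(url: str) -> bool:
    """Toy heuristic (an assumption): IP address used as host, or '@' in the URL."""
    host = urlparse(url).hostname or ""
    host_is_ip = host.replace(".", "").isdigit()
    return host_is_ip or "@" in url


def online_lookup(url: str) -> bool:
    """Stand-in for the query sent to the remote blacklist service."""
    return url in ONLINE_BLACKLIST


def check_url(url: str) -> str:
    host = urlparse(url).hostname or ""
    if host in LOCAL_WHITELIST:       # 1. local verification against the whitelist
        return "trusted"
    if not looks_suspicious(url):     # 2. local heuristic check
        return "no warning"
    if online_lookup(url):            # 3. online lookup against the blacklist
        return "blocked"
    return "suspicious"
```

In this sketch only URLs that fail the whitelist check and trip the heuristic are sent to the online service, which mirrors the idea that the remote lookup is the last and most expensive step.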

Similarly, the Netcraft Toolbar communicates with the Netcraft web

server’s database to obtain the blacklist of phishing websites [Netcraft]. In

addition, the toolbar displays other information related to the website,

such as the date it was first surveyed, the country where it is hosted, its popularity

amongst toolbar users, and other information that can be seen in Figure

21.

Figure 21: Netcraft Toolbar [Netcraft]

Independent applications. Independent applications use the data stored in

the local system to identify a deceptive website. The working mechanism of

such toolbars is as follows: after the webpage is downloaded onto the

local computer, the toolbar compares the characteristics of the website with the data

stored locally. When any anomalies are detected, it warns the Internet

user. An example of such toolbars is SpoofGuard, a plug-in for IE that

accesses the IE history file along with three additional files stored in the

user profile directory for phishing detection. The three additional files

are: a read-only file of host names of email sites, such as

Hotmail, Yahoo!Mail, and Gmail, used in the referring page check; a file

of hashed password history (domain name, username, and password);

and a file of hashed image history [SpoofGuard].

3. Analysis of strength and limitations of technical phishing

prevention methods

The two technical methods (i.e., list based methods and heuristic methods) for phishing

prevention are further decomposed into their constituents depending on strategies used

for phishing detection. Further details about them with their pros and cons, and several

studies related to them are discussed in the following sections:

3.1. List based methods

List based methods are reactive techniques for phishing prevention. They maintain a

lookup database of either trusted websites (whitelist) or malicious websites (blacklist).

Such a list can be either maintained locally or hosted at a central server.

3.1.1. Whitelist method

A whitelist is a list of trusted websites that an Internet user visits on a regular basis. When

the whitelist is exclusive, it allows access to only those websites which are

considered trusted and thus is highly effective against zero hour phishing. It also does

not accept a phishing website as trusted unless there is a wrong entry in the whitelist.

However, it is very difficult to determine beforehand all the websites which users may

want to browse and to update the list accordingly on time. Any failure in updating the

whitelist causes many legitimate websites to be rejected and a severe usability penalty,

which might also be a reason behind the low popularity of whitelists. SmartScreen Filter

[MSDN IEBlog] is a feature in the IE8 and IE9 browsers that uses a whitelist for phishing

prevention; however, it further uses the heuristic method and the blacklist method in order

to confirm a phishing webpage. Anti-Phishing IEPlug [Li and Helenius, 2007] is another

toolbar made for Internet Explorer that uses the whitelist method. It uses a whitelist of

domain names maintained by the Internet user or computer administrator. It checks whether

the webpage that the Internet user wants to visit contains a password input field. When a

password input field is detected, it checks whether the address contains any of the domain

names in the whitelist. It warns the Internet user when the address to be visited contains

a keyword that is saved in the whitelist, but the actual domain is different.
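The check performed by a whitelist toolbar of this kind can be sketched as follows. This is only an illustration of the rule as described above, not the actual Anti-Phishing IEPlug code: the whitelist entries, the way a keyword is derived from a domain name, and the function names are assumptions.

```python
import re
from urllib.parse import urlparse

WHITELIST = {"paypal.com", "ebay.com"}  # hypothetical whitelist entries


def has_password_field(html: str) -> bool:
    """Detect an <input type="password"> element with a simple pattern."""
    pattern = r"<input[^>]*type\s*=\s*[\"']?password"
    return re.search(pattern, html, re.IGNORECASE) is not None


def should_warn(url: str, html: str) -> bool:
    """Warn when a whitelisted keyword appears in the address of a page that
    asks for a password, but the actual domain is not the whitelisted one."""
    if not has_password_field(html):
        return False
    host = (urlparse(url).hostname or "").lower()
    for domain in WHITELIST:
        keyword = domain.split(".")[0]  # assumption: first label is the keyword
        on_whitelisted_domain = host == domain or host.endswith("." + domain)
        if keyword in url.lower() and not on_whitelisted_domain:
            return True
    return False
```

For example, an address such as `http://paypal.com.secure-update.example/login` contains the keyword but resolves to a different domain, so a page with a password field at that address would trigger the warning.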

Very few studies have focused on improving the whitelist method. One

such study is by Cao et al. [2008]. They designed an approach called “Automated

Individual White-List (AIWL)” that stores all familiar websites with a Login User

Interface (LUI). AIWL uses a Naïve Bayesian classifier to identify websites

with a login page. Each time an Internet user submits confidential information to a

website that is not in the list, the user gets an alert message. A new website is added to

the list when the user continues to submit confidential information to the website

several times despite the warning. Although this approach includes a mechanism for

automatically updating the whitelist, which differentiates it from a general whitelist

method, it has several limitations, such as:

The initial list used by this method is not generated automatically, which means it will

either initially have no entries or have to rely on some other mechanism for the initial

list.

The update mechanism used by this method is highly dependent on Internet

users’ ability to distinguish legitimate websites, although studies have shown that

Internet users are not good at identifying phishing websites [Friedman et al.,

2002; Dhamija et al., 2006; Karakasiliotis et al., 2007; Jagatic et al., 2007;

Herzberg and Jbara, 2004; Odaro and Sanders, 2011].

The reliability of a method that alerts the user even for legitimate websites, and many

times for the same legitimate website, is in itself questionable.
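The auto-update behaviour of AIWL, and the repeated alerts criticized above, can be made concrete with a small sketch. It is a simplification under stated assumptions: the class name and the threshold of three submissions are hypothetical (the text only says "several times"), and the Naïve Bayesian login page detection is omitted.

```python
class IndividualWhitelist:
    """Simplified model of an automatically updated individual whitelist."""

    def __init__(self, threshold: int = 3):
        self.trusted = set()        # domains already in the whitelist
        self.pending = {}           # domain -> number of submissions so far
        self.threshold = threshold  # hypothetical value for "several times"

    def submit(self, domain: str) -> bool:
        """Record a credential submission; return True if the user is alerted."""
        if domain in self.trusted:
            return False
        self.pending[domain] = self.pending.get(domain, 0) + 1
        if self.pending[domain] >= self.threshold:
            # The user kept submitting despite the alerts, so the site is added.
            self.trusted.add(domain)
            del self.pending[domain]
        return True
```

With a threshold of three, the user is alerted on each of the first three submissions to the same legitimate website and only the fourth passes silently, which illustrates the third limitation listed above. The same mechanism would also admit a phishing website if the user ignores the alerts, which illustrates the second.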

In conclusion, the whitelist method can be an effective technique when used to

complement other techniques, such as the blacklist method and the heuristic method. It

can be used for first level verification, so that legitimate websites which Internet

users visit very often do not have to go through a time-consuming verification process

and, most importantly, do not get misclassified.


3.1.2. Blacklist method

A blacklist is a list of the IP addresses, Domain Names (DNs), or URLs of treacherous

websites. Although the IP addresses and DNs used by scammers can be blocked,

phishers often use hacked DNs and servers [MarkMonitor Inc., 2008],

so blocking whole DNs or IP addresses can unintentionally block many legitimate

websites which share the same IP addresses and DNs. Therefore, blacklisting URLs is

comparatively more appropriate [Sheng et al., 2009]. The blacklist is a widely used

technique for phishing prevention. Even the popular web browsers (i.e., Google

Chrome, Mozilla Firefox, and IE) use a blacklist for phishing detection. It flags only

those websites that are included in the blacklist, so it has a very low false positive rate

and is favoured over heuristic methods. The low false positive rate and the simplicity of

design and implementation, especially within a browser, may be the reasons behind the

popularity of the blacklist method. The low false positive rate also reduces the liability

risk of incorrectly labelling a legitimate website as a phish.

Despite all these benefits and the wide popularity of the blacklist, it poses the

following three main challenges.

(i) Zero hour phishing. It takes time to include a new phishing website in the

blacklist. Thus, the blacklist is ineffective against zero hour phishing, leaving Internet

users vulnerable to a phishing website until it is discovered. An empirical analysis by

Sheng et al. [2009] of tools that use blacklists revealed that most such

tools are able to catch less than 20% of phish at zero hour. Moreover, the

majority of phishing websites are short lived and most of the damage is done

during this short time span. Thus, any delay in updating the list reduces the effectiveness

of the blacklist.

(ii) Update mechanisms. Every day, hundreds of new phishing websites are

added to the Internet. Most blacklists, for instance PhishTank, rely on

manual verification of websites because of its high accuracy, despite the fact that

manual verification is a time-inefficient process. Some blacklists, such

as the Google blacklist, use automatic verification employing heuristics via

machine learning techniques, which is a quick process but introduces

comparatively more inaccuracy into the list. The compilation and maintenance of a

blacklist is itself a multi-step process, whose two steps are:


Data (phishing URLs) gathering. Data (phishing

URLs) need to be gathered from various sources, such as spam traps, filter detections,

user reports (APWG list, PhishTank list), and lists compiled by other parties,

such as takedown vendors or financial institutions.

Verification of websites. After the data gathering, the websites need to be

verified to identify phishing websites. This verification often relies

on human reviewers for reliability. Sometimes verification by multiple

reviewers is needed for a more accurate result. PhishTank’s statistics

showed that the manual review of URLs takes a considerable amount

of time, ranging from a median of over ten hours in March 2009 to a

median of over fifty hours in June 2009 for a single URL [Whittaker et

al., 2010]. Although PhishTank was able to improve this figure significantly,

dropping the median time to identify a phish to 12 hours in January

2010 and to 2.4 hours in January 2011, its verification mechanism still

leaves several suspected URLs unidentified [Liu et al., 2011]. The

verification mechanism prescribed by PhishTank requires 4 votes to

confirm a website as a phish, and URLs that receive fewer than 4

votes (whose votes are also called “wasted votes”) are declared unidentified URLs.

Moreover, it should be noted that the median time of 2.4 hours is measured after the

suspected website is submitted, which means there may be a further delay

before the website is submitted to PhishTank, whereas most

victims fall for a phishing scam within eight hours of the start of the attack

[Kumaraguru et al., 2009]. On top of that, phishing websites grow endlessly,

making it difficult to always keep the list up to date. Even human

verification is prone to human error. Moore and Clayton [2008] found a

power-law issue among the participants of PhishTank (i.e., participants who

participate only occasionally are more prone to making errors in labelling), and

at the same time taking human effort entirely out of the loop is too

risky [Edwards et al., 2007].

(iii) Matching mechanisms. The third difficulty of the blacklist method is the way

of matching the URLs that an Internet user enters with those in the list. Exact

matching of URLs can be easily evaded by URLs automatically generated by

phishers [Prakash et al., 2010]; for example, the Rock-Phish gang uses phishing

toolkits to generate a large number of slightly varied URLs for a single phishing

website [MarkMonitor Inc., 2008]. A way to tackle this problem is to include

an ability to detect such variations in the URLs, but this introduces more inaccuracy

into the blacklist.
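The trade-off described in (iii) can be illustrated with a small sketch. The URLs and the host-only matching rule below are hypothetical; they only show why exact matching is easy to evade and why a looser rule can flag legitimate pages.

```python
from urllib.parse import urlparse

# Hypothetical blacklist entry on a shared hosting domain.
BLACKLIST = {"http://sharedhost.example/~mallory/login.php?id=1001"}


def exact_match(url: str) -> bool:
    """Flag a URL only if it is literally present in the blacklist."""
    return url in BLACKLIST


def host_match(url: str) -> bool:
    """Looser rule (an assumption): flag any URL whose host is blacklisted."""
    listed_hosts = {urlparse(entry).hostname for entry in BLACKLIST}
    return urlparse(url).hostname in listed_hosts


variant = "http://sharedhost.example/~mallory/login.php?id=1002"  # varied phish
legit = "http://sharedhost.example/~alice/blog.html"              # same host

print(exact_match(variant))  # the varied URL evades exact matching
print(host_match(variant))   # the looser rule catches the variant ...
print(host_match(legit))     # ... but also flags the legitimate page
```

The sketch also reflects the earlier point that blocking a whole domain can unintentionally block legitimate websites that share it.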

Therefore, it is clear that the efficiency of a blacklist basically depends on the

following factors:

list accuracy,

list update mechanism, and

URLs matching mechanism

Several studies have worked on these factors in order to increase the

efficiency of the blacklist technique. One such study is by Liu et al. [2011], which aims

to improve the list update mechanism while maintaining its accuracy. They suggest

improving the wisdom of crowds to maintain extremely low false positive rates while also

reducing the time to verify attacks. They designed an approach called “Aquarium”, which is a

crowdsourcing technique that clusters similar phishes together and asks the manually

trained participants to vote on the cluster rather than on individual phish. The mechanism

uses website URLs submitted to PhishTank that are yet to be verified. The URLs are passed

through a whitelist to filter out some of the legitimate pages and reduce the

effort required of reviewers. After that, the remaining URLs are clustered using Density-Based

Spatial Clustering of Applications with Noise (DBSCAN) and the shingling algorithm,

which is commonly used in search engines for duplicate page detection. Finally, the clustered

URLs are submitted to Amazon’s Mechanical Turk (MTurk) system as Human

Intelligence Tasks (HITs) for verification by the participants. The weighting model for

votes from participants is based on their voting history. Like PhishTank, this

approach requires a minimum of four votes to classify a cluster of URLs as phish.

Although this approach improves the efficiency of reviewers in terms of quantity, their

efficiency in terms of quality is still questionable. Moreover, limitations such as wasted

votes, the power-law issue in participation, limitations of MTurk (e.g., reviewers

can have different browsing experiences or get distracted by their physical

environment) [Kittur et al., 2008], and the inability to correct

incorrect classifications made by participants still persist.
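The idea of grouping near-duplicate phishing pages by shingling can be sketched as follows. This is not the Aquarium implementation: the shingle size, the similarity threshold, and the function names are assumptions, and a simple greedy grouping is used here in place of DBSCAN to keep the example short.

```python
def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (word k-grams) of a page's text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two shingle sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_similar(pages: dict, threshold: float = 0.5) -> list:
    """Greedy grouping (a stand-in for DBSCAN): a page joins the first
    cluster whose representative it resembles closely enough."""
    clusters = []  # list of (representative shingles, [page ids])
    for page_id, text in pages.items():
        page_shingles = shingles(text)
        for representative, members in clusters:
            if jaccard(page_shingles, representative) >= threshold:
                members.append(page_id)
                break
        else:
            clusters.append((page_shingles, [page_id]))
    return [members for _, members in clusters]
```

A reviewer would then vote once per returned group instead of once per page, which is the source of the efficiency gain described above.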

Similarly, to make the classification mechanism swift and timely, Whittaker et al.

[2010] designed a scalable machine learning classifier that automatically classifies

phishing pages and is used to maintain Google’s blacklist. This classifier examines the

features which human reviewers look for in suspected websites to identify phishes,

e.g., the page’s URL, the page’s HTML content collected by a crawler, and hosting

information. It uses a logistic regression classifier to make the final decision. The classifier

classifies websites submitted by Internet users and also those collected from

Gmail’s spam filters. Moreover, the blacklist maintained by Google was found to be more

effective than its contemporaries in phishing prevention [Ludl et al., 2007]. The

problem with the approach of Whittaker et al. [2010] is twofold. First, its efficiency

depends on the efficiency of Gmail’s spam filter, although there are various other channels

(e.g., Internet Relay Chat (IRC), web banner advertising, instant messengers, and other

email services like Hotmail, Yahoo!Mail, and RediffMail) that scammers use to reach their

potential victims [IBM Internet Security Systems, 2007]. Second, it depends on the

willingness of Internet users to report suspected websites, although several studies have

shown that Internet users are not good at identifying phishing websites [Friedman et al., 2002;

Dhamija et al., 2006; Downs et al., 2006; Wu et al., 2006b].
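How a logistic regression classifier turns page features into a final decision can be shown with a minimal sketch. The feature names, weights, bias, and threshold below are invented for illustration; they are not the features or parameters used by Whittaker et al. [2010], whose model is trained on a large data set.

```python
import math

# Hypothetical binary features and weights, for illustration only.
WEIGHTS = {"host_is_ip": 2.0, "has_password_field": 1.5, "brand_in_path": 1.2}
BIAS = -3.0


def phishing_probability(features: dict) -> float:
    """Logistic model: sigmoid of the weighted sum of the feature values."""
    z = BIAS + sum(WEIGHTS[name] * float(value) for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(features: dict, threshold: float = 0.5) -> bool:
    """Return True when the page is scored as phishing."""
    return phishing_probability(features) >= threshold
```

A page showing all three hypothetical anomalies scores well above the threshold, while a page showing none scores close to zero; in practice the weights would be learned from labelled pages rather than set by hand.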

Likewise, an approach called “PhishNet” by Prakash et al. [2010] attempts to

tackle the URL matching problem. PhishNet uses two components:

(i) A URL prediction component. It works offline and systematically generates new

URLs that are modified forms of the URLs in an existing blacklist by employing

various heuristics, such as: changing the Top Level Domain (TLD); IP address

equivalence, i.e., grouping together URLs having the same IP address;

directory structure similarity, i.e., grouping together URLs with a similar directory

structure; query string substitution; and brand name equivalence, i.e.,

replacing one brand name with another.

(ii) An approximate URL matching component. It performs an approximate

matching of the URLs entered by Internet users against the URLs in the blacklist.

PhishNet utilizes the finding that malicious URLs, even after mutation, usually remain

syntactically close to each other or semantically the same, i.e., they use the same IP address.

The verification of generated URLs to find whether they are indeed malicious is

done with the help of Domain Name System (DNS) queries and content matching

techniques in an automated fashion, thus ensuring minimal human effort. The matching

is performed using a novel data structure that performs approximate matches with

incoming URLs based on regular expressions and hash maps to catch syntactic and

semantic variations. Even though this is a novel technique for generating various

modified forms of URLs, it seems to utilize very few heuristic features to

check whether a newly generated URL belongs to a phish. This means a phishing

website may get misclassified, especially because it looks for ninety percent similarity to

the parent URL’s webpage in order to declare a page as phish.
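Two of the URL prediction heuristics, TLD replacement and brand name equivalence, can be sketched as follows. This is an illustration rather than the PhishNet implementation: the candidate TLDs, the brand list, and the function names are hypothetical, and the sketch assumes simple host names without ports.

```python
from urllib.parse import urlparse, urlunparse

TLDS = ["com", "net", "org"]   # hypothetical candidate TLDs
BRANDS = ["paypal", "ebay"]    # hypothetical brand names


def tld_variants(url: str) -> set:
    """Generate candidate URLs by swapping the top level domain."""
    parts = urlparse(url)
    labels = parts.netloc.split(".")
    variants = set()
    for tld in TLDS:
        if tld != labels[-1]:
            new_host = ".".join(labels[:-1] + [tld])
            variants.add(urlunparse(parts._replace(netloc=new_host)))
    return variants


def brand_variants(url: str) -> set:
    """Generate candidate URLs by replacing one brand name with another."""
    variants = set()
    for old in BRANDS:
        if old in url:
            for new in BRANDS:
                if new != old:
                    variants.add(url.replace(old, new))
    return variants


def predict_urls(blacklisted_url: str) -> set:
    """Candidate URLs only; each would still need verification before listing."""
    return tld_variants(blacklisted_url) | brand_variants(blacklisted_url)
```

The generated URLs are only candidates. As described above, each one would still have to be verified, for example through DNS queries and content matching, before being added to the blacklist.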

In conclusion, an effective blacklist must be comprehensive, error free, and

timely. An incomplete blacklist fails to protect a portion of its users. Similarly, a

blacklist with wrong entries results in unwanted warnings, which gradually train Internet

users to disobey the warnings [Whittaker et al., 2010]. Moreover, untimely updates can

significantly degrade the quality of the list. Therefore, an effective blacklist can be achieved

only when it uses an error free automatic classifier with broad sources from which to receive

suspected websites for verification and possesses a URL matching mechanism that can

detect all derivative URLs of phishing URLs. The study by Sheng et al. [2009] found

that tools that use a heuristic method to complement the blacklist perform better than those

using only a blacklist, especially against zero hour phishing. Table 1 shows a summary

of list based methods with their main characteristics, pros, and cons.

3.2. Heuristic methods

Heuristic methods examine one or more characteristics of websites in order to detect

phishing websites. These characteristics are anomalies in the components of phishing

websites. In fact, even the automatic verification of phishing websites used to maintain

blacklists employs heuristic methods. Some of the heuristic methods are analyzed next.

3.2.1. The use of visual similarity measures for phishing detection

Phishing websites often imitate the look and feel of official websites with the same

layouts, styles, key regions, rendering, blocks, and most of the contents. They use

various non-text elements, such as images and flash objects, to display contents. Such

mimicry of an authentic website, with only minimal required changes, is often difficult for

Internet users to distinguish. Moreover, the use of non-text elements to display web

contents makes detection even harder for general content based anti-phishing techniques.

There are some techniques, for instance the technique proposed by Pang and Ding [2006], that

use Optical Character Recognition (OCR) to analyze the contents of images, but they still

fail to analyze website elements such as flash objects and advertisement banners.


However, such cases can be efficiently handled by phishing prevention

techniques that employ visual similarity measures to differentiate between bogus and

original websites. All visual similarity measures use a database to store genuine websites’

data. When a suspicious website is encountered, its data are compared to the data of

genuine websites stored in the database to detect differences. The genuine websites’

data are stored in one of the following forms:

(i) DOM elements of genuine websites. In this case, the DOM elements of genuine

websites are compared with those of suspicious websites.

(ii) Captured images of genuine websites. In this case, features in the images of

genuine websites are compared with those of suspicious websites using

various techniques of Image Recognition (IR).

There are several studies that used DOM elements comparison for the visual

similarity measure. One of the approaches is from Wenyin et al. [2005], which consists

of four modules:

(i) Suspicious URL detection module. It is the source for suspicious URLs which

are obtained from transformation of the true URLs and various suspicious URLs

detected in emails.

(ii) Suspicious webpage processing module. It validates whether any real webpage

exists for the URL supplied by the “Suspicious URL detection module” and

generates a representation of the found webpage, i.e., blocks and features of

suspicious webpage.

(iii) True webpage processing module. It obtains a representation of the true

webpage, i.e., blocks, features, and weight of the true webpage.

(iv) Visual similarity assessment module. It compares the true webpage and each

suspicious webpage and finally calculates their visual similarity based on their

intermediate representations.

The approach by Wenyin et al. [2005] uses three similarity metrics, i.e., block level

similarity, layout similarity, and overall style similarity, defined on webpage segmentation

to calculate the visual similarity between two websites. In block level similarity, the

similarity of the features that represent text blocks and image blocks is measured.

In layout similarity, the ratio of the weighted number of matched blocks in

the suspected website to the total number of blocks in the true webpage is calculated.

The overall style similarity focuses on the visual style of the webpage, which can be

represented by several format definitions, e.g., font family, background colour, text

alignment, and line spacing. The final verdict is made on the basis of the similarity weight

of the suspected webpage, which needs to exceed the similarity threshold for the webpage

to be declared a phishing website.
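The layout similarity ratio and the threshold decision can be illustrated with a short sketch. The exact weighting scheme is not given in this section, so the sketch makes an assumption: the ratio is taken as the summed weight of the matched blocks over the summed weight of all blocks of the true webpage, which reduces to matched blocks over total blocks when every weight is one. The block names, weights, and threshold are hypothetical.

```python
def layout_similarity(block_weights: dict, matched_ids: set) -> float:
    """block_weights maps each block of the true page to its weight;
    matched_ids holds the true-page blocks matched in the suspected page."""
    total = sum(block_weights.values())
    if total == 0:
        return 0.0
    matched = sum(w for block, w in block_weights.items() if block in matched_ids)
    return matched / total


def is_phishing(similarity: float, threshold: float) -> bool:
    """The suspected page is declared phishing when it exceeds the threshold."""
    return similarity > threshold


weights = {"logo": 3, "login_form": 5, "footer": 2}  # hypothetical true page
score = layout_similarity(weights, {"logo", "login_form"})
print(score, is_phishing(score, threshold=0.75))
```

Here a suspected page that reproduces the logo and the login form, but not the footer, reaches a layout similarity of 0.8 and is flagged under the hypothetical threshold of 0.75.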

Another similar approach, called “SiteWatcher”, is from Liu et al. [2006]. It uses

visual similarity comparison and comprises two sequential processes: the first process

runs at the email server and the second process performs the visual similarity comparison.

It requires true URLs and their associated keywords to be registered with the system. The

process at the email server monitors and analyzes both incoming and outgoing emails to

find messages that contain keywords associated with a genuine website. All

embedded URLs from the messages that contain keywords are sent for visual similarity

assessment. After that, the second process performs visual similarity assessment at the

block, layout, and style levels. The visual similarity assessment includes the extraction of

visual features and the matching of the suspected website against the original website.

This matching is performed first at block level, both visually and semantically, and then on

the position constraints among blocks. It calculates the layout similarity (i.e., the weighted

number of matched blocks divided by the total number of blocks in the true page) and

calculates the overall style similarity, on the basis of the distribution of feature values, as

the correlation coefficient of the two pages’ histograms. It issues phishing reports to the

respective genuine website’s owner when the visual similarity is higher than the corresponding

threshold values.
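The use of a correlation coefficient between two pages' histograms as a style similarity can be sketched as follows. The sketch assumes the Pearson correlation coefficient and treats each histogram as a plain list of counts over the same style feature values; how SiteWatcher builds its histograms is not specified in this section.

```python
import math


def correlation(hist_a: list, hist_b: list) -> float:
    """Pearson correlation coefficient of two equally long histograms."""
    n = len(hist_a)
    if n == 0 or n != len(hist_b):
        raise ValueError("histograms must be non-empty and of equal length")
    mean_a = sum(hist_a) / n
    mean_b = sum(hist_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(hist_a, hist_b))
    dev_a = math.sqrt(sum((a - mean_a) ** 2 for a in hist_a))
    dev_b = math.sqrt(sum((b - mean_b) ** 2 for b in hist_b))
    if dev_a == 0 or dev_b == 0:
        return 0.0  # a flat histogram carries no style information
    return cov / (dev_a * dev_b)
```

Two pages whose style features are distributed in the same proportions give a coefficient close to 1, whereas pages with opposite distributions give a coefficient close to -1.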

The problem with both of the above mentioned approaches is that they rely on a feed of

legitimate websites and cannot detect phishing websites that target websites other than

those in the database. The approach by Liu et al. [2006] even needs unique keywords

that can represent the legitimate website, which is an additional burden on Internet

users. Moreover, such approaches, which rely completely on code, can be easily deceived

by the following tricks: rewriting the HTML code so that it gives the same design but uses

different DOM objects from the legitimate page, as shown in Figure 22; using images that

provide the same look as the spoofed website; and using code obfuscation techniques to

alter the code. In addition, such approaches can produce false positives when the same

theme is used to generate different websites.


Figure 22: HTML codes and screenshots of the sign-in page in eBay.com [Lam et al.,

2009]

An approach that addresses the case shown in Figure 22, i.e., the same design built from different DOM objects than the legitimate website, is proposed by Lam et al. [2009]. It uses visual similarity-based phishing detection that is effective even for polymorphic phishing web pages. Polymorphic web pages are visually identical to authentic web pages but use different source code components than the authentic webpage. The approach performs page layout analysis and layout block matching to calculate the degree of similarity using image processing techniques. The authentic webpage is stored in a database. When a suspected webpage is found, both the authentic and the suspected web pages are treated as images, and Otsu’s thresholding method is applied to transform them into black and white images. The degree of similarity is ranked using a classifier trained to handle such cases. However, this approach, too, cannot detect phishing websites that use code obfuscation techniques to alter the source code. Moreover, using two processes for phishing website validation does not come for free; it usually degrades time performance. In addition, it still cannot detect websites that are not in the database.
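The binarization step mentioned above can be illustrated with a minimal sketch of Otsu's method. This is a generic textbook version, not the implementation of Lam et al. [2009]; the function names and the flat list of 8-bit grey levels used as input are assumptions made for illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]            # pixels with level <= t (background)
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg   # pixels with level > t (foreground)
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels, threshold):
    """Map each grey level to 0 (black) or 1 (white)."""
    return [1 if p > threshold else 0 for p in pixels]


# Hypothetical page with a dark half and a bright half.
page = [10] * 50 + [200] * 50
t = otsu_threshold(page)
black_and_white = binarize(page, t)
```

Once both pages are reduced to black and white images in this way, layout blocks can be compared without any reference to the underlying DOM.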

The problems in visual similarity measures that arise from the dependency on source code can be overcome by analyzing the features in captured images of legitimate and suspicious websites. An approach by Fu et al. [2006] uses Earth Mover’s Distance (EMD) to calculate the visual similarity of web pages. It first extracts the URLs from emails containing keywords associated with the protected websites and then converts the web pages associated with those URLs into normalized images. Next, it obtains the images’ signatures, which comprise colour and coordinate features. Finally, visual similarity is computed using the linear programming algorithm of EMD. The final classification is made on the basis of the similarity value of the suspected webpage: when it exceeds the threshold value of the protected webpage, the suspected webpage is classified as a phishing website. However, the problem with this approach is that it uses a colour histogram, which is unsuitable for web pages since websites usually contain very few colours [Liu et al., 2006]. Moreover, even a minor change in dynamic components, which often goes unnoticed by Internet users, can significantly alter the colour histogram. In addition, the use of a colour histogram has a high chance of false negative results for websites that are designed using a popular theme.
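To give an intuition for EMD, the following sketch computes it for the simplified one-dimensional case of two histograms over the same ordered bins. In that case the distance reduces to a running sum and no linear programming is needed. Fu et al. [2006] work with colour and coordinate signatures, so this is only an illustration; the function names and the `max_distance` threshold are assumptions.

```python
def normalise(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    return [h / total for h in hist]


def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two histograms over the same ordered bins.

    The mass that must be carried past each bin boundary is accumulated;
    the total amount carried is the distance (in units of bins).
    """
    a, b = normalise(hist_a), normalise(hist_b)
    carried, work = 0.0, 0.0
    for pa, pb in zip(a, b):
        carried += pa - pb
        work += abs(carried)
    return work


def is_visually_similar(hist_a, hist_b, max_distance=0.5):
    """A small distance means high similarity; the threshold is hypothetical."""
    return emd_1d(hist_a, hist_b) <= max_distance
```

A suspected page whose histogram is within the threshold of a protected page would be reported as a probable imitation of that page.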

Another approach that uses images but analyzes many other image features is from Cordero and Blain [2006]. Their approach uses differences in the image rendering of web pages to identify phishing websites. It captures a Tagged Image File Format (TIFF) image of the entire rendered web page, which is turned into more manageable feature vectors by calculating a joint histogram with two features, resulting in 256 features per image. It uses the Cocoa/Safari engine for website rendering and GNU Octave and ImageMagick for data pre-processing. Although this approach compares far more features than the approach by Fu et al. [2006], it also has various limitations. It uses the image rendering and layout of a webpage for phishing website detection despite the fact that both are affected by changes in window size. Even changes in font type and font size alter the appearance of a webpage. In addition, a website uses several dynamic components, such as advertisement banners and flash objects, that are cumbersome to compare using this approach, since the image changes with each scene.

Likewise, an image-based approach that also claims to handle dynamic objects in a webpage is by Chen et al. [2009]. It treats phishing page detection as an image matching process. It takes a snapshot of the suspicious webpage and uses Contrast Context Histogram (CCH) to extract discriminative keypoints, which are matched with those of the authentic web pages often targeted by phishers. The data of such authentic web pages are stored in a database from a reliable source. Computer vision and image processing are used to compare the similarity. The degree of similarity is calculated using the k-means algorithm, and when it exceeds a certain threshold, the suspected webpage is considered to be a phishing website. Even though this approach is effective against dynamic objects, such as advertisement banners, flash objects, and video, it does not mention the degradation in time performance that can occur due to the processing of dynamic objects.

Unlike all of the above-mentioned approaches, Wang et al. [2011] proposed an approach called “Verilogo”, which does not analyze the image of the whole webpage; rather, it analyzes only the logo used in the webpage. The main assumption of Verilogo is that a logo is an easy means of recognition and is deeply associated with a given organization, so it is often included in phishing websites to give a false impression of authenticity. It stores heavily phished logos and their related information in a database. It matches the logo used by the suspected webpage against the logos stored in the database using a computer-vision algorithm, and then validates whether the hosting IP address of the suspected webpage is authorized to use that logo. It warns Internet users when they enter keyboard input into a webpage that is not authorized to use the logo. Even though comparing logos is lighter than comparing whole web pages, it protects only the websites whose logo information is stored in the database. Moreover, it needs a list of all organizations that are allowed to use a particular logo, which is another impractical requirement.

The common limitation of all the above-mentioned techniques that use visual similarity measures for phishing detection is that they need to know the legitimate websites beforehand, which is impractical. In order to remove this limitation, Medvet et al. [2008] proposed an approach that uses three features to determine webpage similarity:

• Text pieces, including their style-related features,

• Images embedded in the webpage, and

• The overall visual appearance of the webpage as seen by the Internet user (after the web browser has rendered the webpage).

This approach does not need an initial list of legitimate web pages; instead, it remembers each pair of credentials (e.g., username and password) together with the webpage into which the Internet user enters them. When the Internet user enters the same credentials into any new webpage, it performs the similarity comparison. The procedure is to retrieve the suspicious webpage, transform it into a signature, and compare that signature with the stored signature of the legitimate webpage. In case of similarity, it raises an alert. However, this approach neglects the fact that many Internet users use the same credentials for different websites. Moreover, some banks and organizations (e.g., Nordea Bank) use one-time passwords, and such cases cannot be protected by this approach.

To sum up, visual similarity measures are suitable for server-based (e.g., ISP server) phishing prevention techniques, where the server administrator can maintain the list of phishing-prone websites. However, it remains an open question whether maintaining such a list is feasible.

3.2.2. Use of search engine in phishing detection

Several search engines (e.g., Google, Bing, Yahoo!, Baidu) maintain a crawl database and perform page ranking to display search results. The PageRank algorithm, formulated by Google founders Larry Page and Sergey Brin, uses factors such as the number of inbound links, the number of outbound links, and a damping factor. Moreover, Google publishes a set of recommended webmaster guidelines to prevent the removal of websites from the Google search engine index [Google Webmaster Guidelines]. Together, these suggest that a webpage must follow the Google webmaster guidelines and must have many inbound links in order to gain a high page rank. On the contrary, phishing web pages usually have a very short life span and have even been found to disobey the recommended guidelines [Garera et al., 2007]. Therefore, phishing websites are either absent from the search results or possess a very low page rank. In addition, a search for a phishing website usually returns very few results, and these mostly consist of other phishing websites and websites that maintain lists of malicious websites, such as PhishTank. Many researchers have applied these features of search engines to phishing detection. The two vital components of this approach are the extraction of search keywords and the selection of the search engine. Some of the proposed approaches that use search engines for phishing detection are mentioned next.
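The role of inbound links, outbound links, and the damping factor can be seen in a minimal power-iteration sketch of PageRank. This is the simplified textbook formulation, not the ranking actually deployed by Google; the toy link graph, the damping value of 0.85, and the function name are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as {page: [pages it links to]}.

    Every link target must also appear as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small base share regardless of its links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page splits its damped rank among its outbound links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without outbound links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: nobody links to the phishing page.
graph = {
    "bank": ["news"],
    "news": ["bank"],
    "blog": ["bank"],
    "phish": ["bank"],
}
ranks = pagerank(graph)
```

In this toy graph the page without inbound links ends up with the minimum possible rank, which mirrors the observation that short-lived phishing pages rarely accumulate inbound links.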

One approach that uses a search engine for phishing detection is by Ma [2006]. It is implemented as a plug-in for the Mozilla Firefox web browser that extracts unique keywords from the website to be analyzed and uses them as the query for the Google search engine. Then the URL of the suspected site is compared with the URLs of the top search results. In case of a mismatch, it interrupts the Internet user and suggests one of the top-ranked search results. However, the problem with this approach is that it specifies neither the keyword extraction method nor the number of search results to be compared.

A similar approach that is clear on both of the problems noted for Ma [2006] is by Zhang et al. [2007b]. They proposed an approach called “A Content-Based Approach to Detecting Phishing Website”, or simply CANTINA, that examines the content of a webpage to identify phishing. It implements the Term Frequency–Inverse Document Frequency (TF-IDF) algorithm used in Information Retrieval (IR) and the Robust Hyperlink algorithm. The TF-IDF algorithm is used to determine the importance of a word in a document, and the Robust Hyperlink algorithm is used to determine broken hyperlinks. The two ideas behind this approach are:

• Phishers usually copy legitimate websites to generate phishing web pages. In that case, the Robust Hyperlink algorithm can be used to find the original log-in page.

• Phishing websites often contain the original brand name, which is common in the legitimate webpage but relatively rare on the web. Again in this case, the Robust Hyperlink algorithm can be applied to determine the actual owner of the webpage.

The general working mechanism is as follows. First, it calculates the score of each term on the webpage using TF-IDF and generates a lexical signature from the top five terms. The signature, concatenated with the domain name (even when the signature already contains the domain name), is then fed to the search engine (in this case Google). Finally, it classifies the suspected webpage as a phishing webpage if its domain name does not lie in the top thirty search results. The suspected webpage is also classified as a phishing webpage when the search returns no results. The limitations of this approach are:

• It works only with web pages whose content is in English.

• It takes time because it involves querying Google.

• It can be bypassed using techniques such as using image content instead of textual content, using unrelated text in invisible form (i.e., a font colour identical to the webpage background colour), changing enough words in the webpage, and using a webpage that already ranks high in search engine results.

• It uses a linear classifier, which has its own limitations [Xiang et al., 2011].
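The CANTINA mechanism described above can be sketched as follows. The sketch is a simplified illustration, not the code of Zhang et al. [2007b]: the smoothed IDF formula, the small reference corpus passed in by the caller, and all function names are assumptions, and the search engine query itself is left out, so the list of result domains must be supplied by the caller.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page_text, corpus, top_n=5):
    """Return the top_n terms of the page ranked by TF-IDF.

    `corpus` is a list of reference documents used to estimate how
    common each term is on the web (an assumption of this sketch).
    """
    terms = tokenize(page_text)
    tf = Counter(terms)
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for tokens in doc_tokens if term in tokens)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF
        scores[term] = (count / len(terms)) * idf
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


def build_query(signature, domain):
    """Concatenate the lexical signature with the page's domain name."""
    return " ".join(signature + [domain])


def is_phishing(page_domain, result_domains, top_k=30):
    """Phishing if the domain is not among the top_k results (or none exist)."""
    return page_domain not in result_domains[:top_k]
```

Because an empty result list never contains the page's domain, the zero-result case is classified as phishing without a separate check.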

Likewise, Xiang and Hong [2009] proposed an approach that uses a search engine in association with other techniques for phishing detection. Their approach uses IR methods to recognize the identity that a webpage claims and detects a phishing webpage by examining the discrepancies between the claimed identity and its original identity. It uses Named Entity Recognition (NER) algorithms to reduce false positives. The identity-oriented component is aided by a keywords-retrieval component that employs search engines to detect potential phishing web pages by searching for the keywords of greatest importance with respect to IR. It includes whitelist methods and a login-form detector to filter good web pages and control false positive results. Even though this approach handles false positives better, it still has the limitations mentioned for CANTINA.

Similarly, Huh and Kim [2011] proposed an approach that is lighter than all the above-mentioned approaches that use search engines for phishing detection. It uses the full URL string of the suspected webpage, without parameters, as the search engine query, which exempts it from the tedious process of keyword extraction. The total number of search results and the ranking of the suspected webpage are used to determine whether it is legitimate or fake. It relies on the observation that legitimate web pages get a large number of search results and are usually ranked first, whilst phishing web pages get only a few results and usually have a low rank or no rank at all. The approach was validated using three reputable search engines: Google, Yahoo!, and Bing. However, the problem with this approach is that it fails to detect phishing web pages hosted on compromised popular websites.
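The two steps of this heuristic, building the query and judging the outcome, can be sketched as below. The thresholds `min_results` and `max_rank` are hypothetical values chosen for illustration and are not taken from Huh and Kim [2011]; the search itself is not performed, so the result count and rank are assumed to be supplied by the caller.

```python
from urllib.parse import urlsplit, urlunsplit


def url_without_parameters(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def looks_legitimate(result_count, rank, min_results=10, max_rank=3):
    """Judge a page from the number of results and its rank among them.

    `rank` is the 1-based position of the page in the results,
    or None when the page does not appear at all.
    """
    if rank is None:
        return False
    return result_count >= min_results and rank <= max_rank


query = url_without_parameters("http://example.com/login.php?cmd=1&x=2#top")
```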

To sum up, using a search engine is an effective approach for phishing detection. The results tend to be accurate because search engines already index and rank websites efficiently. Moreover, the approach by Ma [2006] offers Internet users an alternative website with which to continue browsing. One of the reasons that lead Internet users to risk visiting a suspected website despite a warning from the security system could be the lack of an alternative; most phishing prevention systems just warn Internet users and rarely provide any substitute. Furthermore, the approach by Huh and Kim [2011], which uses the whole URL as the query, improves the quality of the search keywords. In addition, search engine-based detection is independent of other resources, such as databases, and is equally effective against zero-hour phishing.

However, the use of search engines for phishing detection, too, has several limitations, and some of them are mentioned next.

• It is the search engine that determines whether a website should be indexed or not. This decision is based on how closely the website adheres to the recommended webmaster guidelines on design and content, technical aspects, and quality. These guidelines help to make a website search engine friendly [Google Webmaster Tools, Bing Webmaster Tools]. A search engine spider crawls a website on the basis of several factors; for instance, Google considers factors such as PageRank, links to a page, and crawling constraints like the number of parameters in a URL [Google Webmaster Tools]. Moreover, Google PageRank is updated approximately every three months [Huh and Kim, 2011], and the case is similar with other search engines. However, the concern is how many new legitimate websites actually follow the webmaster guidelines. There are many legitimate websites designed by novice designers who are unacquainted with the webmaster guidelines. The situation might improve when Content Management System (CMS) tools, such as Joomla!, Drupal, and WordPress, are used to design web pages. However, there are still many fresh legitimate websites that rank very low in search results or do not appear in them at all. Such websites are misclassified by the phishing prevention approaches that use search engines. Even legitimate websites whose rank would eventually improve can suffer misclassification for up to three months in the case of Google.

• Such phishing prevention approaches can be easily bypassed by abusing a legitimate website that already has a top ranking in search engine results or by registering a legitimate website to conduct phishing, even though such processes are comparatively expensive.

• Phishers can manipulate the ranking algorithms to get a good ranking for their websites in search engine results.

• Search results vary with the search engine used. Figure 23 shows snapshots of the search results after a legitimate URL is entered as a query into two popular search engines, i.e., Google and Bing.

Figure 23: Same URL searched using Google.com and Bing.com

Thus, it is suitable to use popularity of websites to support other heuristic properties for

phishing detection.

3.2.3. Use of anomalies in phishing websites for phishing detection

Phishing websites mimic the look and feel of genuine websites at interface level, but

they are different at code level. In fact, they also contain many anomalies in their web

objects, HyperText Transfer Protocol (HTTP) transactions, and claimed identities [Pan

and Ding, 2006]. These anomalies can exist in their URLs, DOM objects, or webpage

contents. There are several studies that have utilized the varied sets of these anomalies

for phishing detection. Some of the prominent studies are mentioned next.

An approach by Chou et al. [2004] is a browser plug-in called “SpoofGuard”, which is designed for client-side defence against phishing. SpoofGuard examines properties such as the domain name, URL, links, and images to identify probable spoof attacks. It also checks the browsing history to verify whether the given domain has been visited before, and whether the webpage was opened by clicking a link in an email message. Most importantly, it stores the hash values of post data, i.e., username and password, together with the domain name where the credentials are used. When Internet users enter any credentials, it compares the post data with the stored credentials and their respective domain names. It warns the user when the credentials match but the domain names differ. The two major problems in this approach are:

• It neglects the fact that many Internet users use the same credentials for different domain names, which can produce false negative results, and

• It does not protect websites that use one-time passwords, i.e., passwords valid for only one login session. It will store several credentials for a single domain name; precisely, one entry for every login.
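The credential check described above can be sketched as follows. This is not SpoofGuard's code: the class name, the use of SHA-256, and the unsalted digest are assumptions made to keep the illustration short.

```python
import hashlib


class CredentialStore:
    """Remembers hashed credentials and the domains where they were used."""

    def __init__(self):
        self._seen = {}  # digest -> set of domains already trusted

    @staticmethod
    def _digest(username, password):
        # Only a hash of the post data is kept, never the plain text.
        data = (username + "\x00" + password).encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def check_and_remember(self, username, password, domain):
        """Return True (warn) when known credentials meet an unknown domain."""
        key = self._digest(username, password)
        known = self._seen.setdefault(key, set())
        warn = bool(known) and domain not in known
        if not warn:
            known.add(domain)  # first use or a repeat on a trusted domain
        return warn


store = CredentialStore()
store.check_and_remember("alice", "secret", "bank.example")            # first use
suspicious = store.check_and_remember("alice", "secret", "evil.example")
```

The sketch also makes both listed problems visible: a user who deliberately reuses a password on a second legitimate domain triggers the same warning, and a one-time password produces a new digest at every login, so the store grows without ever matching.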

Likewise, Pan and Ding [2006] proposed an approach which detects phishing from

anomalies in the DOM objects of phishing websites. It employs two major components:

(i) Identity Extractor uses IR algorithm and χ² test to extract web identity, and

(ii) SVM as Page Classifier takes input of web identity and a set of structural

features (i.e., web objects or properties relevant to web identity) to determine

whether a webpage is phishing page or not.

They also suggest using Optical Character Recognition (OCR) to extract contents from

phishing websites that use images in the place of textual contents. The main limitation

of this approach is that it uses an assumption “the distribution of identity-related words

usually deviates from that of other words” which is not completely true and this can be

observed from the high false positive results produced by the approach [Xiang et al.,

2011].

Similarly, an approach by Alkhozae and Batarfi [2011] looks for violations of W3C recommendations in webpage source code to identify phishing websites. The general mechanism is to assign an appropriate weight to each characteristic (W3C violation) and an initial weight to the suspected website. Each characteristic that occurs in the suspected website deducts its weight from the initial weight. The final decision is taken on the basis of the weight that remains after the examination: the smaller the remaining weight, the higher the probability that the website is a phishing website. The main problem with this approach is that it depends on violations of W3C recommendations when it is unclear how many web developers really know and follow them. In addition, the web development industry follows other web standards as well, such as Internet Standards (STD) documents [IETF]. Moreover, the approach can be bypassed by a phishing website that follows most of the W3C recommendations.
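The weight-deduction mechanism can be sketched in a few lines. The characteristic names, their weights, the initial weight of 100, and the decision threshold are all hypothetical; Alkhozae and Batarfi [2011] define their own characteristics and values.

```python
# Hypothetical characteristics (W3C violations) and their weights.
WEIGHTS = {
    "missing_doctype": 10.0,
    "form_action_foreign_domain": 40.0,
    "broken_links": 15.0,
}


def remaining_weight(found_violations, weights=WEIGHTS, initial_weight=100.0):
    """Deduct the weight of every characteristic found in the page."""
    score = initial_weight
    for violation in set(found_violations):  # each characteristic counts once
        score -= weights.get(violation, 0.0)
    return score


def classify(score, threshold=50.0):
    """The smaller the remaining weight, the more likely the page is phishing."""
    return "phishing" if score < threshold else "legitimate"


score = remaining_weight(["missing_doctype", "form_action_foreign_domain"])
verdict = classify(score)
```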

The problem with the above-discussed approaches by Chou et al. [2004], Pan and Ding [2006], and Alkhozae and Batarfi [2011] is that they load websites in order to determine whether they are phishing websites, which ultimately exposes Internet users to phishing conducted using malicious code. To overcome this danger, Garera et al. [2007] proposed a phishing prevention approach that uses only anomalies in the URLs of phishing websites to detect them. This approach uses various distinguishing features of phishing URLs and a logistic regression classifier (trained with data from Google), which also includes obfuscation-style heuristics and general heuristics based on Google’s index infrastructure. The main problem of depending solely on URLs for phishing detection is that such an approach can be easily deceived by using either registered domains or compromised legitimate websites to conduct phishing.

Another similar approach that uses only URL analysis is by Ma et al. [2009]. Their approach uses statistical methods from machine learning to identify phishing websites. It examines the lexical features (i.e., textual features of URLs) and host-based features (i.e., IP address properties, WHOIS properties, domain name properties, and geographical properties) of URLs in order to determine the reputation of websites. The problems with this approach are:

• It can misclassify legitimate websites that use URLs containing benign tokens stated in the approach.

• It can misclassify legitimate websites that use free hosting services.

• It cannot detect phishing websites that use compromised legitimate websites.

• It can misclassify legitimate websites that use redirection services.

• It can misclassify legitimate websites hosted in reputable geographical regions, such as the USA, despite the fact that more than fifty percent of phishing websites are hosted in the USA [APWG, 2012].

• It can misclassify websites that possess international TLDs but are hosted in the USA.

Even though URL analysis protects Internet users from malicious software, it lacks the accuracy that could be gained by also analyzing DOM objects and webpage contents. A more robust approach called “CANTINA+”, designed by Xiang et al. [2011], uses resources including URLs, HTML DOMs, third-party services, and search engines to detect phishing websites. It uses five features from CANTINA (discussed in section 3.2.2) and ten additional discriminative features for phishing website identification. It employs two filters:

(i) Hash-based filter. It uses the SHA1 hash algorithm for duplicate page detection.

(ii) Login form detection. It looks for three main characteristics of a login form, i.e., FORM tags, INPUT tags, and login keywords (it searches for 42 different login keywords).
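The two filters can be sketched with the Python standard library as shown below. This is an illustration under stated assumptions, not the CANTINA+ implementation: the short keyword tuple stands in for the 42 login keywords, which are not listed here, and the page is hashed as raw HTML without any normalisation.

```python
import hashlib
from html.parser import HTMLParser

# Hypothetical stand-in for the 42 login keywords used by CANTINA+.
LOGIN_KEYWORDS = ("login", "log in", "sign in", "signin", "password", "username")


def page_fingerprint(html):
    """SHA1 digest of the page, used to spot exact duplicates of known pages."""
    return hashlib.sha1(html.encode("utf-8")).hexdigest()


class LoginFormDetector(HTMLParser):
    """Looks for INPUT tags inside FORM tags and for login keywords."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.has_input_in_form = False
        self.has_keyword = False

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.in_form = True
        elif tag == "input" and self.in_form:
            self.has_input_in_form = True
            values = " ".join(value or "" for _, value in attrs).lower()
            if any(keyword in values for keyword in LOGIN_KEYWORDS):
                self.has_keyword = True

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False

    def handle_data(self, data):
        # Keywords in any visible text of the page also count.
        if any(keyword in data.lower() for keyword in LOGIN_KEYWORDS):
            self.has_keyword = True


def has_login_form(html):
    detector = LoginFormDetector()
    detector.feed(html)
    detector.close()
    return detector.has_input_in_form and detector.has_keyword
```

A page without a login form cannot collect credentials through a form, so filtering such pages out early is one way to control false positives before the classifier runs.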

Finally, it employs a machine learning detection model, based on the discriminative features, that is extensively trained as a classifier. Even though CANTINA+ is more robust than CANTINA, it still has some limitations:

• It is unable to detect cross-site scripting attacks.

• It cannot detect phishing that is conducted using compromised legitimate websites.

• It cannot detect phishing websites that use images instead of textual content.

The above-mentioned approaches detect phishing, but they do not report what kind of attack it is. Choi et al. [2011] proposed a machine learning approach to detect malicious URLs of all kinds, including phishing, spamming, and malware infection. Along with detection, it also identifies the attack type. It uses various discriminative features (e.g., lexical, link popularity, webpage content, DNS fluxiness, and network) for detection. The methods used are SVM for the detection of malicious URLs, and RAkEL and ML-kNN for identifying the attack types of malicious URLs. The main problem with a machine learning approach is that its effectiveness depends on the type of data used for training. Moreover, phishing schemes are dynamic, so such a classifier has to be updated regularly.

To sum up, anomalies in the URLs and source code of phishing websites can be a promising way to differentiate between phishing and legitimate websites. An approach called “Phishark”, designed by Gastellier-Prevost et al. [2011] to study the effectiveness of URL and page content analysis for phishing detection, also showed that anomalies can be an effective means to distinguish between legitimate and phishing websites. The major challenge in using anomalies for phishing prevention is posed by legitimate websites developed by novice web developers or, more precisely, by web developers who are unaware of Internet security and various web development standards. Such web developers unintentionally introduce several anomalies into their work, and their websites are usually misclassified.

Table 1 is the summary of technical phishing prevention methods with their main characteristics, pros, and cons.

| Methods | Characteristics | Pros | Cons |
|---|---|---|---|
| Whitelist method | It uses a list of trusted websites and checks whether a given website is present in the list or not. | (i) It is effective against zero hour phishing. (ii) It produces almost no false positive results. (iii) It is simple in design. | (i) It has difficult update mechanism. |
| Blacklist method | It uses a list of treacherous websites and checks whether a given website is present in the list or not. | (i) It has low false positive results. (ii) It is simple in design. | (i) It is ineffective against zero hour phishing. (ii) It has difficult update mechanism. (iii) It has difficult URLs’ matching mechanism. |
| Visual similarity measures | It stores the information of the DOM elements or captured images of the legitimate websites and compares the information from its database with that of the suspicious websites. | It is effective against phishing attacks targeting websites whose information is stored in its database. | (i) It needs to store data about the legitimate websites which has to be protected from phishing. (ii) It cannot detect phishing attacks which target the websites not in its database. |
| Use of search engine | It extracts search keywords from the given websites and searches the keywords using a search engine. Then, it compares whether the given URL is in the top search results. | (i) It is simple in design. (ii) It is very much suitable for anti-phishing tools that can suggest alternative links to Internet user. | (i) It can misclassify many legitimate websites. (ii) Its accuracy depends on selected search engine. |
| Use of anomalies in phishing websites | It looks for the characteristics in DOM objects or URLs of the websites. | (i) It is not dependent on any specific phishing strategy and is equally valid for all kinds of phishing websites. (ii) It does not depend on any external factors, such as databases. (iii) It does not require any changes in user browsing habits. | (i) It is complex in design. (ii) Its accuracy varies with the list of used phishing-characteristics. |

Table 1: Summary of technical phishing prevention methods

4. Investigating anomalies in phishing websites

One of the main objectives of this thesis is to identify the important anomalies found in the URLs and source code of phishing websites. Therefore, I compiled as many distinctive anomalies as possible. There are two possible ways to gather anomalies. One way is to analyze phishing websites together with the corresponding legitimate websites to discover their differences, but this is a time-consuming process. Therefore, I selected the second way and chose past studies as the sources of anomalies, since those anomalies have already been confirmed to occur in phishing websites. I collected several past studies, for example, studies by Chou et al. [2004], Fette et al. [2006], Pan and Ding [2006], Garera et al. [2007], McGrath and Gupta [2008], Ma et al. [2009], Bian et al. [2009], Alkhozae and Batarfi [2011], Xiang et al. [2011], Choi et al. [2011], and Gastellier-Prevost et al. [2011], and picked all non-redundant anomalies. The anomalies that I have listed are mentioned next.

4.1. Anomalies found in the URLs of phishing websites

Use IP address in URLs. Some phishing websites use an IP address in their URLs, either to replace the host name or as a substring of the URL, in order to confuse Internet users. APWG [2012] reported that 1.19%, 1.4%, and 2.09% of the phishing websites used URLs containing an IP address during the first quarter of 2012. An example of such a URL is:

http://184.173.179.200/~agarwal/rbc/

However, some genuine web applications, usually those used in intranets, can also contain an IP address in their URLs.
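A minimal sketch of how this anomaly might be checked is given below, using the standard `urllib.parse` and `ipaddress` modules. The function names and the regular expression for dotted IPv4 substrings are assumptions; other encodings of an address (e.g., hexadecimal) are not covered.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Four dot-separated groups of digits that are not part of a longer number.
IPV4_PATTERN = re.compile(r"(?<![\d.])(?:\d{1,3}\.){3}\d{1,3}(?![\d.])")


def host_is_ip(url):
    """True when the host part of the URL is an IP address."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def contains_ip(url):
    """True when an IP address is the host or a substring of the URL."""
    if host_is_ip(url):
        return True
    rest = url.split("://", 1)[-1]
    for match in IPV4_PATTERN.finditer(rest):
        try:
            ipaddress.IPv4Address(match.group())
            return True
        except ValueError:
            continue  # e.g. 999.1.1.1 is not a valid address
    return False
```

Since intranet applications legitimately use such URLs, a positive result is better treated as one signal among several than as a verdict.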

URLs contain brand, or domain, or host name. In this form of phishing URL, the targeted company’s brand, domain, or host name is included in the path segment of the URL. McGrath and Gupta [2008] found that 50%-75% of phishing websites’ URLs contain the targeted brand, domain, or host name. According to the APWG report for the first quarter of 2012, 49.53%, 45.39%, and 55.42% of the phishing websites used URLs containing the targeted company’s brand, domain, or host name. An example of such a URL is: http://fatloss4babyboomers.com/paypal.html

However, a brand, domain, or host name is also used by most genuine websites in their URLs.

URLs use http in place of https, i.e., abnormal SSL certificate. Most phishing websites use an unsecured connection to transfer sensitive information. Valid Secure Sockets Layer (SSL) certificates are issued by authorized organizations, which verify a website before issuing a certificate. Acquiring such a certificate therefore makes a phishing website susceptible to detection techniques and sometimes even exposes the phisher to the danger of being traced. In addition, Internet users are not good at differentiating between secure and unsecure connections [Gastellier-Prevost et al., 2011]. Some phishing websites were reported to use SSL certificates that were either invalid or inconsistent with the claimed identity, but this is now rare in practice since all recent versions of popular web browsers, such as Google Chrome, Mozilla Firefox, and IE, have detection systems for them. An example of a phishing website that uses http is: http://coachbronek.com/muz4/index.php.

However, there are some authentic websites, such as Facebook and Viadeo, which use SSL only for a very short time to validate the users’ credentials [Gastellier-Prevost et al., 2011].

URLs contain misspelled or derived domain name. Phishers use various tricks to derive a domain name that looks similar to the genuine domain name but disobeys the URL naming conventions. Often, such a derived domain name is a registered domain name. Some of the techniques used to generate derived domain names for phishing websites are:

o Replace characters of the real domain name with similar-looking elements (can be Hexadecimal, Integer). An example of such a URL is: http://paypa1.com, where the character ‘l’ is replaced by the number one.

o Introduce a hyphen (-) in the domain name. An example of such a URL is: http://www.adm-ahtuba.astranet.ru/semite.html

o Shift the characters of the domain name. An example of such a URL is: http://www.paypla.com, where the positions of the characters ‘a’ and ‘l’ are interchanged.

However, several genuine websites have URLs that contain meaningless words, and this can complicate the detection of phishing websites’ URLs.
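One possible way to flag substituted or interchanged characters is an edit distance that counts a swap of adjacent characters as a single operation (the optimal string alignment variant of the Damerau-Levenshtein distance). The sketch below is an assumption of how such a check could look, not a method taken from the cited studies; the brand list and the `max_distance` value are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance with insertion, deletion, substitution, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[rows - 1][cols - 1]


def resembles_brand(domain_label, brands, max_distance=1):
    """Return the brands that the label almost, but not exactly, matches."""
    label = domain_label.lower()
    return [b for b in brands if 0 < osa_distance(label, b) <= max_distance]


suspects = resembles_brand("paypa1", ["paypal", "ebay"])
```

Both examples from the list above, paypa1 and paypla, are at distance one from paypal, while an exact match is deliberately not reported.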

URLs using long host name. Phishing websites’ URLs are usually longer than normal URLs. McGrath and Gupta [2008] found that URL lengths peak at 22 characters for legitimate websites in DMOZ, whilst they peak at 67 characters for the URLs in PhishTank and at 107 characters for the URLs in MarkMonitor. They further found that only a few URLs in DMOZ were longer than 75 characters, and that the longest URLs in PhishTank and MarkMonitor were more than 150 characters long. In addition, they found that phishing domains (without TLD) are shorter than legitimate domains: domain length (without TLD) peaks at 10 characters for the URLs in DMOZ, whereas it peaks at 7 characters for the URLs in PhishTank and MarkMonitor. An example of such a URL is:

http://fodamat.com/templates/fodamat/webscr/PayPal.com/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f998ca054efbdf2c29878a435fe324eec2511727fbf3e9efe4eb694d5cae9e96bf5176d35f4070ec44eb694d5cae9e96bf5176d35f4070ec4
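The length measurements discussed above can be sketched as follows. This is only an illustration: the function name `length_features` is hypothetical, and using 75 characters as a cut-off is an assumption based on the observation by McGrath and Gupta [2008] that only a few DMOZ URLs were longer.

```python
from urllib.parse import urlsplit

# Length that only a few DMOZ URLs exceeded according to McGrath and Gupta [2008];
# treating it as a decision threshold is an assumption of this sketch.
LONG_URL_THRESHOLD = 75


def length_features(url: str) -> dict:
    """Return simple length measurements for a URL."""
    host = urlsplit(url).hostname or ""
    return {
        "url_length": len(url),
        "host_length": len(host),
        "is_long": len(url) >= LONG_URL_THRESHOLD,
    }
```

Reporting the host length separately matters because, as noted above, phishing URLs tend to be long while phishing domains tend to be short.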

Use short URLs. Some phishing websites use URL shortening services, such as TinyURL [McGrath and Gupta, 2008; Gastellier-Prevost et al., 2011], to shorten their URLs, which ultimately redirect to long URLs. An example of such a URL is: http://prophor.com.ar/prophor/wells/alerts.php, which redirected to the URL http://specialneedssvg.org/wp/wp-admin/import/wellsfargo/wellsfargo/wellsfargo2011/index.php

Use “//” character in URLs’ path. When a URL’s path contains the “//” character sequence, it is suspicious and there is a greater chance that it will redirect [Gastellier-Prevost et al., 2011]. An example of such a URL is:

http://bganketa.com/libraries/eBaiISAPI.dll.htm?https://signin.ebay.co.uk/ws/eBayISAPI.dll?SignIn

However, there are some genuine websites that satisfy the condition. An example is the login page URL for Gmail:

https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2
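A check for “//” has to ignore the “//” that follows the scheme. The sketch below, whose function name `double_slash_locations` is hypothetical, uses the standard `urllib.parse.urlsplit` function to report in which component the sequence occurs. Applied to the two example URLs above, it reports the query string rather than the path in both cases, so a detector that inspects only the path component would miss them.

```python
from urllib.parse import urlsplit


def double_slash_locations(url: str) -> list:
    """Return the URL components (after the scheme) that contain '//'."""
    parts = urlsplit(url)
    found = []
    if "//" in parts.path:
        found.append("path")
    if "//" in parts.query:
        found.append("query")
    if "//" in parts.fragment:
        found.append("fragment")
    return found
```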

URLs use unknown or unrelated domain name. Sometimes phishers use a domain name that is either completely unknown or unrelated to the target. An example of such a URL, targeting PayPal, is: http://www.traitembal.com/backoffice/images-backoffice/dossier/

However, it is legal to have a unique domain name.

URLs use multiple Top Level Domains (TLD) within domain name. Some phishing websites’ URLs use multiple TLDs within the domain name. Such URLs can be detected from the number of dots (.) used in the URL. It has been found that genuine URLs contain, on average, fewer than five dots (.) [Zhang et al., 2007a]. An example of a phishing URL with more than five dots is:

http://paypal.com.bin.webscr.skin.a5s4d6a5sdas56d6554y65564y65564y4a56s4d56as4d65sad4.shoppingcarblumenau.com.br/

However, there are some legitimate websites whose URLs contain more than five dots. An example of such a URL is the URL of the login page of “Hotmail.com”:

https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&ct=1351023508&rver=6.1.6206.0&wp=MBI&wreply=http:%2F%2Fmail.live.com%2Fdefault.aspx&lc=1033&id=64855&mkt=en-us&cbcxt=mai&snsc=1
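The dot count can be taken over the whole URL or over the host name only. The following sketch (the function name `dot_counts` is hypothetical) computes both. For the two examples above, the host-only count separates the phishing URL from the Hotmail login URL, whose host login.live.com contains only two dots; whether this holds in general is not established here.

```python
from urllib.parse import urlsplit


def dot_counts(url: str) -> dict:
    """Count dots in the whole URL and in the host name only."""
    host = urlsplit(url).hostname or ""
    return {"url_dots": url.count("."), "host_dots": host.count(".")}
```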

Use encoded URLs. Using obfuscated text, i.e., the ASCII, hexadecimal, or octal equivalent of readable text, in a URL is another technique used to hide the identity of a phish. Sometimes an encoded IP address is used in the URL. Such text is less likely to be readable and can easily deceive Internet users. An example of such a URL is:

http://www.absolutewealthsystem.com/www.paypal.it_service-security_confermation/it/Processing1.php?cmd=_Processing&dispatch=5885d80a13c0db1fb6947b0aeae66fdbfb2119927117e3a6f876e0fd34af4365dcbd1864c8b4dcf443a6f60fef107b96dcbd1864c8b4dcf443a6f60fef107b96
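Two simple signs of encoding can be extracted from a URL: percent-escaped characters, and a host written as a single number instead of a dotted name. The sketch below is an illustration under stated assumptions: the function name `encoding_features` is hypothetical, and treating a purely numeric host as an encoded IP address is an assumption of this sketch.

```python
import re
from urllib.parse import urlsplit, unquote

PERCENT_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")
# A host written as one hexadecimal (0x...) or all-digit number instead of a
# dotted name; interpreting this as an encoded IP address is an assumption.
NUMERIC_HOST = re.compile(r"^(0[xX][0-9A-Fa-f]+|\d+)$")


def encoding_features(url: str) -> dict:
    """Report percent-escapes, the decoded URL, and whether the host is numeric."""
    host = urlsplit(url).hostname or ""
    return {
        "percent_escapes": len(PERCENT_ESCAPE.findall(url)),
        "decoded": unquote(url),
        "numeric_host": bool(NUMERIC_HOST.match(host)),
    }
```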

Uses special character ‘@’ in URLs. The special character ‘@’ is used in a URL to redirect the user to a website different from the one that appears in the address bar. When a URL contains an ‘@’ symbol, the string on the left side of the symbol is disregarded and the actual address is the string on the right side of the symbol [Zhang et al., 2007a]. An example of such a URL is:

http://www.amazon.com:[email protected]

URLs use different port number. Some phishing websites use a port other than port 80 [Gastellier-Prevost et al., 2011]. It was found that 1.19%, 0.68%, and 0.26% of the phishing websites did not use port 80 in January, February, and March of 2012, respectively [APWG, 2012].
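Both the ‘@’ trick and the port check concern the authority part of a URL, which the standard `urllib.parse.urlsplit` function separates into user information, host, and port. The following is a minimal sketch; the function name `authority_features` and the URL used in the usage note (with the reserved `.example` domain) are hypothetical, and only port 80 is treated as the default because that is the port named in the text.

```python
from urllib.parse import urlsplit


def authority_features(url: str) -> dict:
    """Split the authority part of a URL into user info, real host, and port."""
    parts = urlsplit(url)
    return {
        "has_userinfo": parts.username is not None,
        "shown_before_at": parts.username,  # text left of '@' (before any ':')
        "real_host": parts.hostname,        # host the browser actually contacts
        # Only an explicit port other than 80 is flagged (assumption of this sketch).
        "non_default_port": parts.port not in (None, 80),
    }
```

For the hypothetical URL `http://www.amazon.com:fake@phish.example/login`, the sketch reports `www.amazon.com` as the text shown before the ‘@’ and `phish.example` as the real host.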

URLs with abnormal DNS record. Legitimate websites usually have a DNS record; however, phishing websites usually do not. If they do have one, most of its fields remain empty. Figure 24 shows the DNS lookup result obtained using the My-Addr.com tool for the phishing URL: http://188.138.124.133/www.paypal.com/session_id/87544455623222414898896521458754/index.htm#

Figure 24: DNS record for a phish URL tested using My-Addr.com tool

However, an incomplete DNS record can also belong to a legitimate website, whilst a complete DNS record can belong to a fake website.

Life of Domain. In general, the life of phishing sites is not long. Even when they have a registered domain, it is usually a recently registered one. Phishing websites become active immediately after registration [McGrath and Gupta, 2008; Zhang et al., 2007a]. However, many recently registered legitimate websites are added to the Internet every day.

Number of sensitive words in URLs. Several suggestive word-tokens are used in phishing websites’ URLs [Garera et al., 2007]. The eight word-tokens used by Garera et al. [2007] in their classifier are: webscr, secure, banking, ebayisapi, account, confirm, login, and signin. An example of such a URL is:

http://paypal.com.cgi.bin.webscr.cmd.login.submit.dispatch.8f9j89u54iu5l5469t6d6sd4.boquetequalityproperties.net/pay/
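Finding the eight word-tokens in a URL can be sketched as a simple lookup. The token list is taken from the text above; the function name `sensitive_tokens` and the use of case-insensitive substring matching are assumptions of this sketch and may differ from how Garera et al. [2007] tokenize URLs.

```python
# The eight word-tokens reported by Garera et al. [2007].
SENSITIVE_TOKENS = ("webscr", "secure", "banking", "ebayisapi",
                    "account", "confirm", "login", "signin")


def sensitive_tokens(url: str) -> list:
    """Return the sensitive word-tokens that occur in the URL, in list order."""
    lowered = url.lower()
    return [token for token in SENSITIVE_TOKENS if token in lowered]
```

For the example URL above, the sketch finds the two tokens webscr and login.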

Use of free web hosting. Free web hosting services are widely misused by phishers to host their phishing websites [McGrath and Gupta, 2008]. Most phishing websites use a domain that is specifically registered for hosting phishing sites, or they use web hosting services that are available for free [Prakash et al., 2010]. An example of such a URL is:

http://arnodits.net/ysCntrlde/webscr_prim.php?YXJub2RpdHMubmV0NTAxNmNmYTVjMzY4NQ==MTM0MzY3MjIyOQ

However, many legitimate websites also use free web hosting services.

URLs popularity. Page rank depicts the relative importance of a website within a set of websites. A higher page rank indicates that the website is more important, and mostly only a legitimate website can achieve it [Garera et al., 2007; Choi et al., 2011]. The techniques of Ma [2006], Zhang et al. [2007a], Xiang and Hong [2009], and Huh and Kim [2011] use search engine ranking for phishing website detection. A screenshot of the results returned by Google for a phishing URL is shown in Figure 25.

Figure 25: Google search results for a phish URL

However, phishing websites can use compromised URLs which are already popular, whilst newly designed websites can have very low popularity. Moreover, the ranking varies with the search engine used, as shown in Figure 23.

No credible in-neighbor search results [Bian et al., 2009]. The domain of a legitimate website usually has inlinks from various credible websites, while phishing websites mostly do not have inlinks from legitimate websites. In fact, most of the time phishing websites do not have any inlinks at all. This does not mean that all legitimate websites will have inlinks; several legitimate websites may not have any inlinks either. Some of the methods that can be used to obtain the inlinks are: “link:[no space]DomainToSearch” in Google, “link:[space]DomainToSearch” in Yahoo! and Bing, the Bing webmaster tool, and the Google webmaster tool.

URLs absence in relevant web category [Bian et al., 2009]. When the keywords of a legitimate website are entered into Yahoo! Directory, it lists the websites that are relevant to the provided keywords, which also include the legitimate website. However, a phishing website either does not get any results or is absent from the results. Again, this does not guarantee that all legitimate websites will have non-zero result counts. Several legitimate websites were found to have zero result counts.

Number of “Bag of words” in URLs. The frequency of strings delimited by ‘/’, ‘?’, ‘.’, ‘=’, ‘-’, and ‘_’ can be used for phishing detection [Ma et al., 2009]. In general, phishing websites’ URLs contain these symbols more frequently than normal websites’ URLs. An example of such a URL is:

http://artesax.com/~citcompa/paypal_priv8_us_2012/index.htm?cmd=_login-run&dispatch=063c19f9f888ffe32e5abeba112f5b33063c19f9f888ffe32e5abeba112f5b33
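Splitting a URL on the listed delimiters can be sketched as follows. The delimiter set is taken from the text above; the function names `url_tokens` and `delimiter_count` are hypothetical, and whether a detector should use the tokens themselves or only their number is left open here.

```python
import re

# Delimiters listed in the text: '/', '?', '.', '=', '-', '_'
DELIMITERS = re.compile(r"[/?.=\-_]")


def url_tokens(url: str) -> list:
    """Split a URL on the delimiters and drop empty strings."""
    return [token for token in DELIMITERS.split(url) if token]


def delimiter_count(url: str) -> int:
    """Count how many delimiter characters the URL contains."""
    return len(DELIMITERS.findall(url))
```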

Domain name character composition. McGrath and Gupta [2008] found that domain names from DMOZ resemble the relative letter frequencies of the English language, whilst domain names from PhishTank and MarkMonitor have a less pronounced peak at each of the vowels. Likewise, they also found that the relative popularity of the letters of the English language differs between legitimate and phishing domain names. The letters ‘a’, ‘c’, and ‘e’ have significantly different probabilities of appearing in English language documents or DMOZ domain names, but they have very similar probabilities of occurrence in phishing domain names.

URLs hosted by geographical location. The majority of phishing websites are hosted in the USA [APWG, 2012]. This might be because the USA hosts the highest number of other websites as well.

TLD triplets used in URLs. It has been found that the triplets corresponding to the TLDs most often used by spammers are .us, .cn, and .com [Gastellier-Prevost et al., 2011]. However, they are also widely used TLDs for genuine websites.

4.2. Anomalies found in the source codes of phishing websites

Abnormal anchor URLs. Genuine websites use anchors to provide navigational guidance. The URLs used in the anchors usually belong to their own domain and sometimes to a different domain. However, in phishing sites such anchor URLs are mostly from a different domain. It has also been found that sometimes the anchors in phishing websites do not link to any page; for example, the anchor URL (AURL) can be “file:///E/” or “#”.

Abnormal Server Form Handler (SFH). Security is one of the prime concerns for organizations that do online transactions. Such organizations require credentials for login, which are generally a username and a password. Thus, their websites include an SFH. Legitimate websites always take action upon the submission of a form; however, in phishing websites the SFH can contain either “about:blank” or “#”. Moreover, the SFHs of legitimate sites are handled by a server of the same domain. So whenever the form is handled by a server of a foreign domain, it makes the website suspicious.

Abnormal request URLs. Request URLs (RURLs) are the links to external objects (images, external scripts, CSS), also called resources. W3C recommends that websites use resources from the page’s own domain, and this is widely followed by genuine websites. However, spoof websites often load these resources from the victim websites to make the phishing websites look and feel similar to the legitimate websites. This means that the request URLs used by fraudulent websites are often from a different domain. Some genuine websites also use resources from domains other than their own; however, they do so for very few resources, whilst phishing websites use a different domain for most of their resources.
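The three anomalies above (abnormal anchor URLs, abnormal SFH, and abnormal request URLs) share one measurement: the share of links that are empty or point to a different host than the page itself. The sketch below uses only the Python standard library. The class and function names are hypothetical, and comparing full host names instead of registered domains is a simplification that would, for example, treat www.example.com and static.example.com as different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect anchor URLs, form actions, and requested resource URLs."""

    def __init__(self):
        super().__init__()
        self.anchors, self.forms, self.resources = [], [], []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and "href" in a:
            self.anchors.append(a["href"] or "")
        elif tag == "form":
            self.forms.append(a.get("action") or "")
        elif tag in ("img", "script") and a.get("src"):
            self.resources.append(a["src"])
        elif tag == "link" and a.get("href"):
            self.resources.append(a["href"])


def is_abnormal(link: str, page_url: str) -> bool:
    """True if the link is empty, '#', 'about:blank', a file: URL, or on another host."""
    link = link.strip()
    if link in ("", "#", "about:blank") or link.lower().startswith("file:"):
        return True
    target = urlsplit(urljoin(page_url, link)).hostname
    return target != urlsplit(page_url).hostname


def abnormal_ratios(html: str, page_url: str) -> dict:
    """Return the fraction of abnormal anchors, form handlers, and resources."""
    collector = LinkCollector()
    collector.feed(html)
    ratios = {}
    for name in ("anchors", "forms", "resources"):
        links = getattr(collector, name)
        bad = sum(is_abnormal(link, page_url) for link in links)
        ratios[name] = bad / len(links) if links else 0.0
    return ratios
```

Returning ratios rather than yes/no answers reflects the observation above that genuine websites also use a few foreign resources; which ratio should count as suspicious is not specified here.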

Abnormal Cookie. A cookie is used to identify users and their previous activity on a website. Cookies are an important part of portals and online shopping websites. A cookie is always bound to the domain of the website’s server. However, in phishing websites, the cookie either points to the phishing site’s own domain, which is inconsistent with the claimed identity, or points to the genuine website’s domain, which differs from the phishing domain.

Mismatch hyperlink. A mismatch hyperlink is used to mislead Internet users. Although the links that appear to Internet users are those of the original websites, they direct to the phishing websites when they are clicked. For instance,

<a href="http://www.profusenet.net/checksession.php">

https://secure.regionset.com/Ebamking/logon/</a>
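A mismatch hyperlink can be found by comparing the host in the visible link text with the host in the href attribute. The following is a minimal sketch using the standard `html.parser` module; the class and function names are hypothetical, and only anchors whose visible text starts with http:// or https:// are examined.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class MismatchFinder(HTMLParser):
    """Find anchors whose visible text is a URL on a different host than the href."""

    def __init__(self):
        super().__init__()
        self.href = None      # href of the anchor currently open, if any
        self.text = []        # text fragments seen inside that anchor
        self.mismatches = []  # (shown_url, real_href) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            shown = "".join(self.text).strip()
            if shown.lower().startswith(("http://", "https://")):
                if urlsplit(shown).hostname != urlsplit(self.href).hostname:
                    self.mismatches.append((shown, self.href))
            self.href = None


def find_mismatches(html: str) -> list:
    """Return all (shown_url, real_href) pairs with differing hosts."""
    finder = MismatchFinder()
    finder.feed(html)
    return finder.mismatches
```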

Use of illegal pop-up windows. A phisher uses a pop-up window and asks Internet users to fill in their information. It could be a borderless window placed above the real website that looks very much like a part of the genuine website. There are two ways to create pop-up windows. One uses HTML, which is the common practice, for instance:

<div onClick="window.open('mona.html')">

The other uses Javascript, which is illegal:

onClick="javascript:popup('mona.html')"

All popular web browsers have features to block pop-up windows [Alkhozae and Batarfi, 2011].

Harmful forms. Phishing websites usually use a form that asks users to fill in other details along with the username and password [Ludl et al., 2007]. The numbers of input fields, text fields, password fields, hidden fields, and other fields, such as radio buttons and check boxes, can be used for phishing detection [Ludl et al., 2007]. Specifically, <input> tags that accept text and are accompanied by words such as “credit card” can indicate phishing [Zhang et al., 2007a]. Such a form usually contains a submit button. Figure 26 shows a form in a phishing website.

Figure 26: A form in a phishing website

Use of onMouseOver to hide the link. Some phishing websites include an onMouseOver handler to hide their abnormal link. An example of a code snippet that uses onMouseOver is given below:

<a href="http://www.abc.com" onMouseOver="window.status='Click here to go to ABC'; return true">ABC</a>

onMouseOver="window.status='Click here to go to ABC'; return true"

Number of Script tag. In general, phishing websites are found to use a larger number of Javascript tags and plain text pages than legitimate websites [Ludl et al., 2007]. Thus, a large number of Javascript tags in a website makes it suspicious.

Presence of Javascript functions. There are some native Javascript functions, such as escape(), eval(), link(), unescape(), exec(), and search(), which occur predominantly in phishing websites containing cross-site scripting and web-based malware [Choi et al., 2011]. The presence of these functions in high counts in a website makes it suspicious.
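Counting script tags and calls to the listed functions can be sketched with regular expressions. The function list is taken from the text above; the function name `script_features` is hypothetical, and scanning the whole page source rather than only the script bodies is a simplification of this sketch.

```python
import re
from collections import Counter

# Function names listed by Choi et al. [2011] as quoted in the text.
SUSPICIOUS_FUNCTIONS = ("escape", "eval", "link", "unescape", "exec", "search")

SCRIPT_TAG = re.compile(r"<script\b", re.IGNORECASE)
# A listed name that is not preceded by a word character and is followed by "(".
CALL = re.compile(r"(?<![\w$])(" + "|".join(SUSPICIOUS_FUNCTIONS) + r")\s*\(")


def script_features(html: str) -> dict:
    """Count <script> tags and calls to the listed Javascript functions."""
    return {
        "script_tags": len(SCRIPT_TAG.findall(html)),
        "calls": Counter(CALL.findall(html)),
    }
```

The lookbehind keeps a call to unescape() from also being counted as a call to escape().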

IFrame redirection. An IFrame is used to embed another webpage within the current webpage. It creates a frame or window on a webpage so that another page can be displayed inside this frame. Some phishing websites have been found to use a borderless IFrame, which can be hard for Internet users to detect manually.

Mismatch in form fields and domain name. Phishing websites use their own domain name but put the text of legitimate websites in the <title> tag, which makes it a complete mismatch [Gastellier-Prevost et al., 2011]. This can be applied for phishing detection.

Disabled right click. Some phishing websites disable the right mouse click. A simple Javascript function can be used to disable it. A code snippet that can disable the right click is given below:

function disableclick(e) {
  // In DOM mouse events, button 2 is the right mouse button (1 is the middle button).
  if (e.button == 2) { return false; }
}

Use authentic logo. Almost all phishing websites use the logo of the legitimate website to imitate its appearance [Zhang et al., 2007a]. Verifying this anomaly needs a record of all the logos of legitimate websites that are highly targeted by phishers, which creates a dependency.

Integrate security logo. Most phishing websites use a security logo, such as VeriSign [Gastellier-Prevost et al., 2011], to give a look of genuineness. Detecting it needs prior knowledge about all existing security logos. Figure 27 shows a phishing website that uses the “VeriSign” logo.

Figure 27: Phishing website with a company’s logo and VeriSign’s logo

Keyword/Description. These objects and properties provide information about the website, such as its copyright, ownership, and content. Although website mirroring is quite a simple process (the “Save as” option of all popular browsers is one of the simplest methods for website mirroring), this information can still be helpful for phishing detection. In fact, there are already some phishing prevention techniques, such as the Bayesian Filter, which use it for phishing detection.

Sloppiness or lack of familiarity with English. Some phishing websites contain silly spelling mistakes, grammatical errors, and inconsistencies in their web contents. Sometimes this is done deliberately in order to bypass anti-phishing tools that use a content-based filtering technique, such as the Bayesian Filter. However, designing a tool to check language mistakes is in itself another challenge. Moreover, there are many phishing websites in languages other than English.

Email function. Some phishing websites include a function that sends email to the phisher. When a victim enters the information, the function sends an email with all the information to the phisher. An example of Javascript code that sends email is:

function sendMail() {
var link = "mailto:[email protected]"
+ "[email protected]"
+ "&subject=" + escape("This is my subject")
+ "&body=" + escape(document.getElementById('myText').value);
window.location.href = link; }

Such code can also be written in another programming language that runs on the server and therefore cannot be seen on the client side.

4.3. Verification of the anomalies using online phishing websites

An experiment was conducted to verify the anomalies listed in the aforementioned subchapters (i.e., 4.1 and 4.2). For the experiment, I used twenty online phishing websites that had already been validated as phishing websites by PhishTank. I selected, in order, the top twenty phishing websites that were verified as phishing on 9 August 2012. The list of the phishing websites’ URLs used for the experiment is included in the Appendix. I verified most of the anomalies, but a few anomalies were not verified due to technical complications. These include the anomalies related to grammatical mistakes in the web contents. I mainly used the login pages of the phishing websites for the experiment, since the login page is the entry point and phishing has to be detected at this point. I mostly used existing tools and environments for the experiment. The benefit of using existing tools is that they are online and stable, and their results can be trusted. The tools and environments used are:

The Google search engine was used to obtain the popularity of the phishing URLs. The complete URL of each phishing website was used as the search keyword.

The Google, Yahoo!, and Bing search engines were used to find credible in-neighbor search results for the phishing websites’ URLs.

Yahoo! Directory was used to obtain the relevant web category. The name of the spoofed organization was used as the keyword.

The DNS and WHOIS tool of My-Addr.com was used to get the DNS record of each phishing website’s URL.

The Check/Search Port tool of My-Addr.com was employed to get the port used by the phishing websites.

Notepad++ was used as a source code viewer, and its ‘find’ feature was used to search for the tags of DOM objects.

Utility applications written in the C Sharp programming language (.Net platform) were used to extract properties from URLs and DOM objects.

In order to verify the anomalies, I chose one phishing website at a time and looked for all the anomalies in that website. I always started with the anomalies which require the website to be online, e.g., URLs hosted geographical location, URLs popularity, no credible in-neighbour search results, URLs with abnormal DNS record, URLs use different port number, use of free web hosting, and life of domain. One of the major challenges was that phishing websites do not remain online for a long time. Therefore, I had to make sure that I obtained the required information before somebody took the website down. Then, I downloaded the phishing webpage for source code analysis, and after that I analyzed its URL for anomalies. I analyzed the source code of the phishing website last.

During the analysis, I first checked whether the anomalies were present in the selected phishing website or not. Then, I obtained the count of occurrences for those anomalies whose count is necessary to differentiate between a legitimate website and a phishing website, such as number of “Bag of words” in URLs, URLs use multiple Top Level Domains (TLD) within domain name, and number of script tags. I also calculated the mean and median values of the counts of occurrences. The mean value is calculated when the data set (i.e., the set of values formed from the count of occurrences of an anomaly in each phishing website) is evenly distributed; otherwise, the median value is calculated. The results of the experiment are listed in Table 2 and Table 3. Table 2 contains the anomaly types and the number of phishing websites containing those anomalies in their URLs.

Properties | Results (Occurrence/Total)
Use IP address in URLs | 2/20
URLs contain brand or domain or host name | 12/20
URLs use http in place of https, i.e., abnormal SSL certificate | 20/20
URLs contain misspelled or derived domain name | 0/20
URLs use large host name | 9/20 URLs length equal or greater than 75 characters; Mean=96.9
Use short URLs | 2/20
Use “//” characters in URLs path | 1/20
URLs use unknown or unrelated domain name | 8/20
URLs use multiple Top Level Domains (TLD) within domain name | 20/20; Mean=3
Use encoded URLs | 4/20
Uses special character ‘@’ in URLs | 0/20
URLs use different port number | 0/20
URLs with abnormal DNS record | Complete=11; Incomplete=8; Not Found=1
Number of sensitive words in URLs | 9/20
Number of “Bag of words” in URLs | 20/20; Mean=9
URLs popularity | 18/20; Median Results Count=3
No credible in-neighbour search results | 20/20
URLs absence in relevant web category | 20/20
Life of domain | Unknown, cannot obtain the life of domain
Use of free web hosting | Unknown, cannot obtain information about the web hosting servers
Domain name character composition | Unable to classify
URLs hosted geographical location | 10/20 United States; 3/20 Spain; 1/20 each for France, Italy, Switzerland, Hong Kong, Vietnam, Turkey; 1/20 Unknown
TLD triplets used in URL | 11/20 use .com

Table 2: Number of phishing websites containing anomalies in their URLs

Similarly, Table 3 contains the anomaly types and the number of phishing websites containing those anomalies in their source codes.

Properties | Results (Occurrence/Total)
Abnormal anchor URLs | 18/20
Abnormal Server Form Handler (SFH) | 20/20
Abnormal request URLs | 18/20
Abnormal cookie | 3/20
Mismatch hyperlink | 0/20
Use of illegal pop-up windows | 0/20
Harmful forms | 20/20
Use of onMouseOver to hide the link | 0/20
Number of script tags | 20/20; Mean=28
Presence of Javascript functions | 10/20
IFrame redirection | 0/20
Email functions | 0/20
Mismatch in form fields and domain name | 19/20
Disable right click | 0/20
Use authentic logo | 20/20
Integrate security logo | 11/20
Keyword/Description | Unknown, phishes used various languages.
Sloppiness or lack of familiarity with English | Unknown, phishes used various languages.

Table 3: Number of phishing websites containing anomalies in their source codes
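The rule for reporting either the mean or the median of the occurrence counts, described before Table 2, can be sketched as follows. The text does not state how an even distribution was judged, so the symmetry test used here (mean and median differ by at most a fraction `tolerance` of the standard deviation) is purely an assumption of this sketch, as are the function name `central_value` and the default tolerance.

```python
import statistics


def central_value(counts, tolerance=0.25):
    """Return ('mean', value) for roughly symmetric data, else ('median', value).

    The symmetry test is an assumption of this sketch, not the criterion
    used in the experiment.
    """
    mean = statistics.mean(counts)
    median = statistics.median(counts)
    spread = statistics.pstdev(counts)
    if spread == 0 or abs(mean - median) <= tolerance * spread:
        return "mean", mean
    return "median", median
```

With such a rule, a single extreme count (for example, one URL that returns far more search results than the others) switches the reported value from the mean to the median.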

4.4. Discussion on findings

The anomalies present in source codes are clearer than those found in URLs. Most of the anomalies in source code can be analyzed locally, which means they do not need an Internet connection and are almost independent of the Internet speed once the web pages have been loaded. Likewise, the majority of anomalies in source codes need only textual matching, except for a few anomalies which need image matching and English grammar rules. One of the major problems in analyzing anomalies in source codes is that the web pages need to be loaded, which exposes Internet users to vulnerabilities from malicious code, keyloggers, and botnets. Although the risk from malicious code, keyloggers, and botnets can be reduced by using a sandbox browser to load the webpage for analysis, this cannot guarantee complete protection from malware and malicious code [Sabanal and Yason, 2012].

In contrast, the analysis of anomalies in URLs does not need to load the web pages, which means Internet users can be safe from phishing conducted using malicious software. However, some of the anomalies found in URLs need an Internet connection and are time-consuming to check.

The experiment revealed that not all anomalies are equally important. Some important results from the experiment are:

A promising set of anomalies, which occurred with high frequency and were strong indicators of phishing, is listed in Table 4.

Anomaly types

Abnormal Server Form Handler (SFH)

Harmful forms

URLs uses http in place of https or abnormal SSL certificate

URLs contain brand or domain or host name

Abnormal anchor URLs

Abnormal request URLs

Mismatch in form fields and domain name

Table 4: Promising anomalies

Some anomalies occur frequently and are also important for phishing detection; however, they need prior information about the owner of the legitimate website and the owner of the security logo. The list of such anomalies is in Table 5.

Anomaly types

Authentic logo used

Security logo integrated

Table 5: Anomalies dependent on external factors

It was also found that some of the anomalies, which are easy to avoid, are either

rarely present (Table 6) or are absent (Table 7) in phishing websites.

Anomaly types

Use IP address in URLs

Use encoded URLs

Use ‘//’ characters in URLs path

Abnormal Cookie

Use short URLs

Table 6: Important anomalies that are less occurring

Anomaly types

Uses special character ‘@’ in URLs

Mismatch hyperlink

Use of illegal pop-up windows

Use of onMouseOver to hide the link

IFrame redirection

Email functions

Disable right click

URLs contain misspelled or derived domain name

URLs use unknown or unrelated domain name

Table 7: Important anomalies absent in phishing websites

Some of the anomalies can have a higher time overhead, which can make them unsuitable in certain circumstances, for example, when the Internet speed is slow. The list of such anomalies is in Table 8.

Anomaly types

URLs with abnormal DNS record

No credible in-neighbor search results

URLs absence in relevant web category

Life of domain

Use of free web hosting

URLs hosted geographical location

URLs Popularity

URLs use different port number

Table 8: Anomalies with higher time overhead

There are some anomalies which are not clear in the sense that the same

anomalies also exist in legitimate websites. Therefore, such anomalies need

further analysis to clarify exactly when their presence can declare a website as a

phishing website. The list of such anomalies is in Table 9.

Anomaly types

URLs use multiple TLD within domain name

TLD triplets used in URL

Number of sensitive words in URLs

Number of Script tag

Number of ‘Bag of words’ in URLs

URLs use large host name

Presence of Javascript functions

Table 9: Vague anomalies (need further analysis)

Although Zhang et al. [2007a] stated that a genuine website contains fewer than five dots (‘.’) in its URL (i.e., the anomaly “URLs use multiple TLD within domain name”), only three phishing websites that satisfy the condition were found during the experiment, whilst there are legitimate websites whose login page URLs have more than five dots, e.g., https://login.live.com/login.srf?wa=wsignin1.0&rpsnv=11&ct=1350861003&rver=6.1.6620.0&wp=MBI&wreply=http:%2F%2Fmail.live.com%2Fdefault.aspx&lc=1033&id=64648&mkt=en-us&cbcxt=mai&snsc=1, which is the login page URL for “Hotmail.com” and has seven dots.

Similarly, McGrath and Gupta [2008] mentioned that a long genuine URL has a maximum length of seventy-five characters and that genuine URLs in general have a length of twenty-two characters. But some of the phishing websites used for the experiment have URLs shorter than twenty-two characters, whilst there are genuine websites whose login page URLs are longer than seventy-five characters, e.g., https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2, which is the login page URL of “Gmail.com”.

Similarly, the TLDs covered by the anomaly “TLD triplets used in URL” are the most common TLDs, and millions of legitimate websites use them.

Likewise, for the anomalies “Number of sensitive words in URL”, “Number of Script tag”, and “Number of ‘Bag of words’ in URL”, even though several websites contain them, it is unclear what number should indicate phishing, and interestingly they are also very common in legitimate websites.

Some anomalies are associated with the English language, whereas several phishing websites were found to be non-English. The list of such anomalies is in Table 10.

Anomaly types

Sloppiness or lack of familiarity with English

Domain name character composition

Keyword/Description

Table 10: Anomalies dependent on English language

Anomalies in phishing websites can be an effective way to detect phishing, but proper methods are needed for the selection, calibration, and deployment of those anomalies. During the examination of suspected websites, there is a need to look for an anomaly or a group of anomalies that are hard for phishers to manipulate and are unexpected in legitimate websites. Some important points that can be utilized during the deployment of anomalies for heuristic methods are:

(i) Priority should be given to the anomalies which phishers cannot easily avoid.

Elimination of these anomalies costs phishers time, effort, and money. Further, it makes such phishing websites easier to detect, and sometimes it puts phishers at risk of being traced. An example is “URLs use http in place of https or abnormal SSL certificate”.

Anomalies that are crucial for usability and social engineering. The removal of such anomalies can easily be noticed by Internet users, so phishers are forced to include them. An example is the authentic logo used in phishing websites.

Anomalies that are a vital part of phishing and for which phishers usually do not have a good alternative. An example is the use of an abnormal Server Form Handler (SFH).

(ii) Priority should be given depending on the harmfulness of anomalies.

The more harmful an anomaly is when it is included in a website, the more important it is. An example is the use of an abnormal Server Form Handler (SFH).

(iii) Priority should be given to anomalies on the basis of the time taken for analysis versus the importance of the anomalies.

It is important to consider the time required to analyze an anomaly and the impact it makes on the phishing detection procedure. There should not be a time overhead. An example is checking URLs popularity, which can have a time overhead when the Internet is slow.

(iv) Priority should be given to independent anomalies.

Priority should be given to independent anomalies over dependent anomalies. Some anomalies need other anomalies to make sense in phishing detection. Examples of such anomalies are: "Harmful form" and "URLs uses http in place of https, i.e., abnormal SSL certificate".

(v) There is a possibility that an anomaly will also occur in legitimate websites other than that of the domain owner.

Priority should be given to anomalies that can occur in legitimate websites but are against recognized standards or practices, over anomalies that can occur in legitimate websites and are not objected to by recognized standards. An example of an anomaly which is against the recognized standards is “Use of illegal pop-up windows”. Similarly, an anomaly which is not against the recognized standards is “Presence of Javascript functions”.

It is recommended to employ anomalies that are strong indicators of phishing in heuristic methods; however, the irony is that most phishers try to get rid of those anomalies. Therefore, heuristic methods also have to rely on anomalies that are not strong indicators and can easily be found in legitimate websites. In addition, many web developers either lack information on standards, such as those of W3C, ISO, Ecma International, and the Google Guidelines, that relate to the best practices in web development, or they deliberately do not follow these standards. Such developers unintentionally include several anomalies in their websites which are also characteristics of phishing websites, because of which their websites get misclassified.

One of the prime reasons for such misclassification is that current heuristic methods that look for anomalies in the URLs and source code of a suspected website usually examine each anomaly separately and assign a particular score to each of them. The problem with this approach is that it penalizes all websites equally whenever an anomaly is present. As a result, several unimportant anomalies, which also occur in legitimate and improperly designed websites, can accumulate enough score to declare a legitimate website a phishing website. Moreover, this is not how human decision making works. A human considers the other circumstances before reaching a final verdict, and the verdict is justifiable. Such decision making should be applied to phishing detection too. A technique similar to that of Ludl et al. [2007], who employed the J48 algorithm to extract a decision tree for classifying phishing and legitimate websites, can be more effective in such cases. It can provide intuitive insight into which features are important in classifying a data set.
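The difference between additive scoring and a tree-like decision can be illustrated with a minimal sketch. A hand-written two-level rule stands in here for a learned decision tree; it is not the J48 tree of Ludl et al. [2007]. The anomaly identifiers, the scores, and the threshold are hypothetical.

```python
# Hypothetical per-anomaly scores and threshold; not taken from any cited tool.
SCORES = {
    "javascript_functions": 1,
    "popup_window": 1,
    "abnormal_sfh": 2,
    "http_login_form": 2,
}
THRESHOLD = 2


def additive_verdict(found):
    """Flag a page when the summed anomaly scores reach the threshold."""
    return sum(SCORES[a] for a in found) >= THRESHOLD


def tree_verdict(found):
    """Hand-written decision tree: weak anomalies count only in context."""
    if "abnormal_sfh" in found:
        return True
    if "http_login_form" in found:
        # A pop-up is treated as decisive only next to an unprotected login form.
        return "popup_window" in found
    return False


# A legitimate but carelessly built page with two weak anomalies.
sloppy_but_legitimate = {"javascript_functions", "popup_window"}
print(additive_verdict(sloppy_but_legitimate))  # True: false positive
print(tree_verdict(sloppy_but_legitimate))      # False
```

In the additive variant the two weak anomalies add up to the threshold, whereas the tree-like variant looks at the accompanying circumstances first, which is closer to the human decision making described above.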

5. Conclusions

Phishing is a concept almost a decade and a half old, having emerged in the mid-1990s. It is also one of the most highly publicised cyber crimes, since it is related to money and adversely impacts business and the general public interest. Moreover, the majority of phishing uses a technically simple method, i.e., creating authentic-looking forged websites and reaching potential victims through spam. Admittedly, some phishing employs complex techniques, such as cross-site request forgery, cross-site scripting, dynamic pharming, botnets, malicious code, and key logger software. However, there is no countermeasure that can

outperform the others and protect against every kind of phishing. A number of studies have worked on technical and non-technical aspects with the objective of finding remedies for phishing. They claim to be more effective than their contemporaries but, unfortunately, most of them perform well only for a certain kind of phishing and usually fail to counter the various tricky phishing strategies. This might be because phishing does not just exploit technical vulnerabilities; it equally exploits human vulnerabilities. There can be exact solutions for technical vulnerabilities, but the exploitation of human behaviour and decision making does not

have any precise remedy. Additionally, the methods adopted by phishers are constantly changing. When security experts succeed in designing a countermeasure for one method, phishers discover new routes to make successful attacks. One common mistake that most phishing prevention techniques make is that they treat the user’s purpose for web browsing and the significance of security as two separate components. They inform the user that something is wrong and prohibit proceeding; however, they do not provide suitable alternatives [Ma, 2006; Wu et al., 2006b]. They usually neglect the fact that security is not the prime concern of Internet users, and this drives Internet users to take risks despite warnings. Further, the design of phishing prevention techniques is complicated by several issues. Most phishing prevention techniques fail to overcome one or more of these issues. Some of these issues are:

Accuracy in results. The results from any phishing prevention system should be accurate, i.e., free of false positives and false negatives. Any error in the results diminishes the credibility of the system and ultimately discourages Internet users from using it, or encourages them to take risks and fall for phishing. At the same time, it is a challenge for a phishing prevention system when a website is doubtful but the system cannot confirm whether it is a phishing website or not.

Effective warning. It is very important to have an effective method to warn Internet users and stop them from revealing their credentials to phishing websites. This is one of the major challenges for anti-phishing tools. Several past studies have shown that passive alert signals or messages are either unnoticed or ignored by Internet users [Dhamija et al., 2006; Wu et al., 2006a; Zhang et al., 2007b]. An active warning, i.e., refusing to connect, must be absolutely certain; otherwise it is unacceptable. Moreover, in the case of passive warnings, the frequency of alert messages should be such that no phish is missed and, at the same time, Internet users remain comfortable. Bombarding Internet users with alert messages can drive them to switch off anti-phishing tools. It was also found that too frequent alert messages desensitize Internet users, who are then more likely to reveal their personal details to phishing websites [ITNOW, 2012].

Execution time matters. Time is an important factor in all kinds of software. It matters even more for client-side phishing prevention toolbars, which verify a webpage before loading it. Therefore, a slow system can strongly demotivate Internet users from using it. However, this constraint forces such tools to rely on anomalies that are quick to analyse, even though they might not be very effective in practice at detecting phishing.

Address security and Internet users’ intentions together. Security and Internet users’ intentions cannot be dealt with separately. The majority of phishing prevention tools make the mistake of separating them. They attempt to solve the security problem and disregard the Internet user’s specific intention. They inform the user that something is wrong, but never tell the user specific ways to continue. It is recommended to integrate the security concerns into the critical path of the Internet user’s task [Wu et al., 2006b] and to provide suitable alternatives when phishing is detected. However, determining alternatives requires an extra process, which affects execution time.

Scale problem. Phishing is very dynamic, and phishers constantly look for ways to bypass phishing prevention techniques. It also means that the more popular a phishing prevention technique is, the more effort phishers will apply to evade it. Therefore, phishing prevention techniques have to be updated constantly to cover emerging trends in phishing.

Usability and Internet users’ behaviour under controlled conditions. Almost all studies of usability and Internet users’ behaviour are performed under controlled conditions due to ethical and legal issues. Such studies are unable to capture all the factors that can influence the results. However, such studies cannot be conducted under uncontrolled conditions because of privacy, ethical, and legal issues.

Therefore, there is a need for more studies and research to develop robust technical approaches. This equally requires some flexibility from the social and legal sectors so that such studies can be conducted freely.
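The accuracy issue discussed above can be made concrete by computing the false positive and false negative rates of a detector from labelled verdicts, using the definitions given in the appendix. The following is a minimal sketch; the sample verdicts are invented for illustration and are not results from this thesis.

```python
def error_rates(actual, predicted):
    """Return (false_positive_rate, false_negative_rate).

    actual and predicted are equally long sequences of booleans,
    where True means "phishing" and False means "legitimate".
    """
    pairs = list(zip(actual, predicted))
    # False positive: a legitimate website classified as phishing.
    fp = sum(1 for a, p in pairs if not a and p)
    # False negative: a phishing website classified as legitimate.
    fn = sum(1 for a, p in pairs if a and not p)
    legitimate = sum(1 for a in actual if not a)
    phishing = sum(1 for a in actual if a)
    fpr = fp / legitimate if legitimate else 0.0
    fnr = fn / phishing if phishing else 0.0
    return fpr, fnr


# Invented sample: 3 phishing and 5 legitimate websites.
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]
fpr, fnr = error_rates(actual, predicted)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

Reporting the two rates separately matters because, as noted above, a false positive undermines trust in the tool while a false negative exposes the user to the attack.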

The current trends in phishing prevention are mostly reactive techniques. Therefore, there is a need for proactive strategies for phishing prevention. The web development industry needs technology and practices that make it difficult for phishers to conduct phishing. One of the major factors encouraging scammers to conduct phishing is its low cost and high benefit. When the benefits are reduced, fewer and fewer people will be interested in conducting phishing. Awareness of security and standards among web developers is another necessary factor. For instance, web developers should properly fill in all the different fields of the source code with information related to their domain name by clearly identifying every HTML tag [Gastellier-Prevost et al., 2011]. In addition, web developers should not use features that are disallowed by recognized standards, such as the recommendations from W3C and the standards published by ISO. They should develop code in a way that facilitates phishing prevention methods. Similarly, companies should follow standards and guidelines to make their websites easier to distinguish from phony websites. There is also a need for work on technology that can trace phishers and help law enforcement authorities to punish them. This does not mean that phishing can be eliminated; however, it can be significantly reduced.

Last but not least, non-technical methods can be a vital player in the war against phishing. However, many of the organizations prone to phishing still do not provide information or counselling to their new customers about the dangers of phishing until those customers are victimized. This might be because counselling requires resources, and there is also a chance that customers wrongly interpret it as a weakness of the organization. Many organizations do include static information about phishing on their websites, but it is dull for many customers and they hardly read it. Therefore, there is a need for improvement in the presentation of such information. For instance, techniques such as puzzles and games can be a motivating and effective way to teach customers about phishing.

6. Limitations and future development work

In this thesis, the experiment was conducted only on phishing websites, so I believe the results could be more accurate if the same study were conducted on legitimate websites as well. More importantly, the results are based solely on a meta-analysis of past studies followed by an experiment on phishing websites. To obtain a clear picture of the results, it is necessary to apply them in real-time anti-phishing software. Therefore, designing such software is the main future development work arising from this thesis.

References

[APGW, 2012] Phishing activity trends report: 1st half 2012. Report January-March 2012. Available as: http://www.antiphishing.org/reports/apwg_trends_report_q1_2012.pdf (retrieved on 5th May 2012)

[American Bankers Associaion, 2005] ABA works on fraud: phishing prevention and resolution. Available as: http://www.angelinabank.com/phishing063005.pdf (retrieved on 15th October 2012)

[Bing Webmaster Tools] How to submit a sitemap. Available as: http://onlinehelp.microsoft.com/en-US/bing/hh204487.aspx (retrieved on 7th July 2012)

[CallingID] CallingID toolbar. Available as: http://www.callingid.com/Default.aspx (retrieved on 17th November 2012)

[Cloudmark] Cloudmark Anti-Fraud toolbar. Available as: http://www.cloudmark.com/en/products/cloudmark-desktopone/index (retrieved on 17th November 2012)

[DNSSEC Validator] DNSSEC Validator 1.1.5. Available as: https://addons.mozilla.org/en-us/firefox/addon/dnssec-validator/ (retrieved on 18th November 2012)

[IDG News Service, May 10 2012] NASA and pentagon hacker TinKode receives two years suspended jail sentence. Available as: http://news.idg.no/cw/art.cfm?id=F21FFE88-01F3-6A5A-F13AD8F4C45D72FC (retrieved on 16th November 2012)

[EarthLink] EarthLink toolbar. Available as: http://www.earthlink.net/software/domore.faces?tab=toolbar (retrieved on 17th November 2012)

[eBay Toolbar’s Account Guard] Using eBay toolbar’s account guard. Available as: http://pages.ebay.com.au/help/account/toolbar-account-guard.html (retrieved on 28th July 2012)

[Fraud Eliminator] Fraud Eliminator toolbar. Available as: http://www.topsecretsoftware.com/fraud-eliminator.html (retrieved on 17th November 2012)

[Geo Trust] Geo Trust Trustwatcher toolbar. Available as: http://dnstree.com/com/trustwatch/ (retrieved on 17th November 2012)

[Google Safe Browsing] Google Safe Browsing API. Available as: https://developers.google.com/safe-browsing/ (retrieved on 17th November 2012)

[Google Support] Phishing and malware detection. Available as: https://support.google.com/chrome/bin/answer.py?hl=en&answer=99020&p=cpn_safe_browsing (retrieved on 31st July 2012)

[Google Webmaster Guidelines] Best practices to help google find, crawl, and index your site. Available as: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=35769 (retrieved on 7th July 2012)

[Google Webmaster Tools] How often does Google crawl the web? Available as: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=34439 (retrieved on 7th July 2012)

[Hacker Factor Solutions, 2005] Anti-Phishing: page encoding. Available as: http://www.hackerfactor.com/papers/ap-page_encoding.pdf (retrieved on 2nd March 2012)

[IBM Internet Security Systems, 2007] The phishing guide: understanding & preventing phishing attacks. Available as: http://www-935.ibm.com/services/us/iss/pdf/phishing-guide-wp.pdf (retrieved on 2nd March 2012)

[ITNOW, 2012] Overload information, ITNOW- The Chartered Institute for IT, autumn 2012.

[MarkMonitor Inc., 2008] Whitepaper- Rock phishing: the threat and recommended countermeasures. Available as: https://www.markmonitor.com/download/wp/wp-rock-phish.pdf (retrieved on 2nd March 2012)

[MSDN IEBlog] IE8 security part III: SmartScreen filter. Available as: http://blogs.msdn.com/b/ie/archive/2008/07/02/ie8-security-part-iii-smartscreen-filter.aspx (retrieved on 22nd July 2012)

[Netcraft] Why use the Netcraft toolbar? Available as: http://toolbar.netcraft.com/ (retrieved on 23rd July 2012)

[NYDailyNews.com, July 14 2011] Pentagon hacked, 24,000 files stolen by ‘foreign intruders’ in cyber attack. Available as: http://articles.nydailynews.com/2011-07-14/news/29792364_1_cyber-attack-terrorist-group-pentagon-computer-system (retrieved on 28th July 2012)

[PhishTank] Online valid phishes. Available as: http://www.phishtank.com/phish_search.php?valid=y&active=All&Search=Search (retrieved on 9th of August 2012)

[SpoofStick] SpoofStick 1.02. Available as: https://whatapp.org/spoofstick/ (retrieved on 28th July 2012)

[SpoofGuard] SpoofGuard. Available as: http://crypto.stanford.edu/SpoofGuard/ (retrieved on 9th October 2012)

[Aburrous et al., 2010] Maher Aburrous, M.A. Hossain, Keshav Dahal, and Fadi Thabtah, Experimental case studies for investigation e-banking phishing techniques and attacks strategies. Springer Science+ Business Media, LLC 2010.

[Alkhozae and Batarfi, 2011] Mona Ghotaish Alkhozae and Omar Abdullah Batarfi, Phishing websites detection based on phishing characteristics in the webpage source code. IJICT, Volume 1 No.6, October 2011, ISSN-2223-4985.

[Bian et al., 2009] Kaigui Bian, Jung-Min “Jerry” Park, Michael S. Hsiao, France Belanger, and Janine Hiller, Evaluation of online resources in assisting phishing detection. In: Proc. of 2009 Ninth Annual International Symposium on Applications and the Internet, Pages 30-36.

[Cao et al., 2008] Ye Cao, Weili Han, and Yueran Le, Anti-phishing based on automated individual white list. ACM 978-1-60558-294-8/08/10.

[Chen et al., 2009] Kuan-Ta Chen, Chun-Rong Huang, Chu-Song Chen, and Jau-Yuan Chen, Fighting phishing with discriminative keypoint features. IEEE Internet Computing, 1089-7801/09.

[Choi et al., 2011] Hyunsang Choi, Bin B. Zhu, and Heejo Lee, Detecting malicious web links and identifying their attack types. In: Proc. of 2nd USENIX Conference on Web Application Development 2011.

[Chou et al., 2004] Neil Chou, Robert Ledesma, Yuka Teraguchi, and John C. Mitchell, Client-side defence against web-based identity theft. In: Proc. of 11th Annual Network and Distributed System Security Symposium, 2004.

[Cordero and Blain, 2006] Arel Cordero and Tamara Blain, Catching phish: Detecting phishing attacks from rendered website images. University of California, Berkeley, CA, 94720, 12th December, 2012. Also available as: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.9084&rep=rep1&type=pdf (retrieved on 27th July 2012).

[Dhamija et al., 2006] Rachna Dhamija, J.D.Tygar, and Marti Hearst, Why phishing works. ACM 1-59593-178-3/06/0004.

[Dhamja and Tygar, 2005] Rachna Dhamija and J.D. Tygar, The battle against phishing: Dynamic security skins. In: Proc. Symposium On Usable Privacy and Security (SOUPS) 2005, July 6-8, 2005, Pittsburgh, PA, USA.

[Dong et al., 2008] Xun Dong, John A. Clark, and Jeremy Jacob, Modeling user-interaction. IEEE 2008, 1-4244-1543-8/08.

[Downs et al., 2006] Julie S. Downs, Mandy B. Holbrook, and Lorrie Faith Cranor, Decision strategies and susceptibility to phishing. In: Proc. of Symposium On Usable Privacy and Security (SOUPS), July 12-14, 2006, Pittsburgh, PA, USA.

[Dunlop et al., 2010] Matthew Dunlop, Stephen Groat, and David Shelly, GoldPhish: using images for content-based phishing analysis. In: Proc. of Fifth International Conference on Internet Monitoring and Protection, 2010, ICIMP, pp.123-128.

[Edwards et al., 2007] W. Keith Edwards, Erika Shehan Poole, and Jennifer Stoll, Security automation considered harmful? ACM 978-1-60558-080-7/07/09.

[Egelman et al., 2008] Serge Egelman, Lorrie Faith Cranor, and Jason Hong, You’ve been warned: An empirical study of the effectiveness of web browser phishing warnings. In: Proc. of CHI 2008, April 5-10, 2008, Florence, Italy. ACM 1-59593-178-3/07/0004.

[Fette et al., 2006] Ian Fette, Norman Sadeh, and Anthony Tomasic, Learning to detect phishing emails. Carnegie Mellon University, School of Computer Science, Technical Report CMU-CyLab-06-012. Available as: http://www.cs.cmu.edu/~tomasic/doc/2007/FetteSadehTomasicWWW2007.pdf (retrieved on 2nd May 2012).

[Florêncio and Herley, 2006] Dinei Florêncio and Cormac Herley, Analysis and improvement of anti-phishing schemes. Security and Privacy in Dynamic Environments IFIP International Federation for Information Processing Volume 201, 2006, pp 148-157.

[Friedman et al., 2002] Batya Friedman, Helen Nissenbaum, David Hurley, Daniel C. Howe, and Edward Felten, Users’ conceptions of risks and harms on the web: A comparative study. ACM 1-58113-454-1/02/0004.

[Fu et al., 2006] Anthony Y. Fu, Liu Wenyin, and Xiaotie Deng, Detecting phishing web pages with visual similarity assessment based on Earth Mover’s Distance (EMD). In: IEEE Transactions on Dependable and Secure Computing, Vol. 3, No. 4, October-December 2006.

[Garera et al., 2007] Sujata Garera, Niels Provos, Monica Chew, and Aviel D. Rubin, A framework for detection and measurement of phishing attacks. ACM 978-1-59593-886-2/07/0011.

[Gastellier-Prevost et al., 2011] Sophie Gastellier-Prevost, Gustavo Gonzalez Granadillo, and Maryline Laurent, Decisive heuristics to differentiate legitimate from phishing sites. In: Proc. of Network and Information System Security (SAR-SSI), 2011 Conference. ACM 978-1-4577-0735-3.

[Herzberg and Gbara, 2004] Amir Herzberg and Ahmad Gbara, TrustBar: protecting (even naïve ) web users from spoofing and phishing attacks. Bar Ilan University, Dept. of Computer Science. Available as: http://u.cs.biu.ac.il/~herzbea/Papers/ecommerce/spoofing.htm (retrieved on 23rd July 2012).

[Huh and Kim, 2011] Jun Ho Huh and Hyoungshick Kim, Phishing detection with popular search engines: Simple and effective. In: Proc. of Springer-Verlag Berlin Heidelberg 2011, FPS 2011, LNCS 6888, pp.194-207, 2011.

[Jagatic et al., 2007] Tom Jagatic, Nathaniel Johnson, Markus Jakobsson, and Filippo Menczer, Social phishing. ACM, Volume 50 Issue 10, October 2007, Pages 94-100.

[Jakobsson, 2005] Markus Jakobsson, Modeling and preventing phishing attacks. In: Proc. the 9th International Conference on Financial Cryptography and Data Security, Pages 89-89.

[Karakasiliotis et al., 2007] Athanasios Karakasiliotis, Steven Furnell, and Maria Papadaki, An assessment of end-user vulnerability of phishing attacks. Journal of Information Warfare, 6 (1), 2007, pp. 17-28.

[Kittur et al., 2008] Aniket Kittur, Ed H. Chi, and Bongwon Suh, Crowdsourcing user studies with Mechanical Turk. In: Proc. CHI 2008, April 5–10, 2008, Florence, Italy. ACM 978-1-60558-011-1/08/04

[Kumaraguru et al., 2009] Ponnurangam Kumaraguru, Justin Cranshaw, Alessandro Acquisti, Lorrie Cranor, Jason Hong, Mary Ann Blair, and Theodore Pham. School of phish: A real-world evaluation of anti-phishing training. In: Proc. of 5th Symposium on Usable Privacy and Security (SOUPS ’09).

[Lam et al., 2009] Ieng-Fat Lam, Wei-Cheng Xiao, Szu-Chi Wang and Kuan-Ta Chen, Counteracting phishing page polymorphism: An image layout analysis approach. In: Proc. of ISA 2009.

[Li et al., 2007] Linfeng Li, Marko Helenius, and Eleni Berki, Phishing-resistant systems: security handling with misuse cases design. In: Proc. of SQM07, 389-404, 2007.

[Li and Helenius, 2007] Linfeng Li and Marko Helenius, Usability evaluation of anti-phishing toolbars. Journal in Computer Virology, volume 3, 163-184, DOI 10.1007/s11416-007-0050-4.

[Liu et al., 2006] Wenyin Liu, Xiaotie Deng, Guanglin Huang and Anthony Y. Fu, An anti-phishing strategy based on visual similarity assessment. In: Proc. of IEEE Internet Computing, ACM 1089-7891/06.

[Liu et al., 2011] Gang Liu, Guang Xiang, Bryan A. Pendleton, Jason I. Hong, and Wenyin Liu, Smartening the crowds: computational techniques for improving human verification to fight phishing scams. In: Proc. Symposium On Usable Privacy and Security (SOUPS) 2011, July 20-22, 2011, Pittsburgh, PA, USA.

[Ludl et al., 2007] Christian Ludl, Sean Mcallister, Engin Kirda, and Christopher Kruegel, On the effectiveness of techniques to detect phishing sites. In: Proc. of DIMVA’07 Proceedings of the 4th International Conference on Detection of Intrusions and Malware, and Vulnerability. Springer-Verlag Berlin, Heidelberg 2007, ISBN: 978-3-540-73613-4 doi.

[Ma, 2006] Robert Ma, Phishing attack detection by using a reputable search engine. University of Toronto, Dept. of Electrical and Computer Engineering. Available as: http://www.eecg.toronto.edu/~lie/Courses/ECE1776-2006/Projects/Phishing2a-proposal.pdf (retrieved on 7th July 2012).

[Ma et al., 2009] Justin Ma, Lawrence K. Saul, Stefan Savage, and Geoffrey M. Voelker, Beyond blacklists: Learning to detect malicious web sites from suspicious urls. In: Proc. of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1245-1254, June-July 2009.

[Martino and Perramon, 2010] Antonio San Martino and Xavier Perramon, Phishing secrets: history, effects, and countermeasures. International Journal of Network Security, Vol.11, No.3, PP.163-171, November 2010.

[McGrath and Gupta, 2008] D. Kevin McGrath and Minaxi Gupta, Behind phishing: An examination of phisher modi operandi. In: Proc. of 1st Usenix Workshop on Large-Scale Exploits and Emergent Threats. San Francisco, California, USA: USENIX Association Berkeley, CA, USA, 2008, p. Article No.4.

[McRae and Vaughan, 2007] Craig M. McRae and Rayford B. Vaughn, Phighting the phisher: Using web bugs and honeytokens to investigate the source of phishing attacks. In: Proc. of 40th Annual Hawaii International Conference on System Sciences (HICSS ‘07) 0-7695-2755-8/07.

[Medvet et al., 2008] Eric Medvet, Engin Kirda, and Christopher Kruegel, Visual similarity-based phishing detection. ACM ISBN #978-1-60558-241-2.

[Milletary, 2006] Jason Milletary, Technical trends in phishing attacks. United States Computer Emergency Readiness Team (US-CERT), 2006. Available as http://www.us-cert.gov/reading_room/phishing_trends0511.pdf (retrieved on 2nd May 2012).

[Moore and Clayton, 2008] Tyler Moore and Richard Clayton, Evaluating the wisdom of crowds in assessing phishing websites. In: Proc. of Financial Cryptography and Data Security (FC) 2008, LNCS 5143, pp. 16-30.

[Pan and Ding, 2006] Ying Pan and Xuhua Ding, Anomaly based web phishing page detection. In: Proc. of 22nd Annual Computer Security Applications Conference (ACSAC’06), Computer Society, 2006.

[Prakash et al, 2010] Pawan Prakash, Manish Kumar, Rao Kompella and Minaxi Gupta, PhishNet: Predictive blacklisting to detect phishing attacks. In: Proc. of IEEE INFOCOM on Computer Communication 2010.

[Odaro and Sanders, 2011] Ugiomo S. Odaro and Benjamin G. Sanders, Social engineering: phishing for a solution. In: Proc. of IT Security for the Next Generation-European Cup 2011, Kaspersky Lab.

[Rasmussen and Aaron, 2011] Rod Rasmussen and Greg Aaron, Global phishing survey: trends and domain name use in 1H2011. APWG Report January-June 2011 .Available as: http://www.antiphishing.org/reports/APWG_GlobalPhishingSurvey_1H2011.pdf (retrieved on 3rd May 2012).

[Sabanal and Yason, 2012] Paul Sabanal and Mark Vincent Yason, Digging deep into the flash sandboxes. IBM Security Systems. Available as: http://media.blackhat.com/bh-us-12/Briefings/Sabanal/BH_US_12_Sabanal_Digging_Deep_WP.pdf (retrieved on 17th November 2012)

[Sheng et al., 2007] Steve Sheng, Bryant Magnien, Ponnurangam Kumaraguru, Alessandro Acquisti, Lorrie Faith Cranor, Jason Hong, and Elizabeth Nunge, Anti-Phishing Phil: The design and evaluation of a game that teaches people not to fall for phish. In: Proc. of Symposium on Usable Privacy and Security (SOUPS) 2007, July 18-20, 2007, Pittsburgh, PA, USA.

[Singh, 2007] N.P. Singh, Online frauds in banks with phishing. Journal of Internet Banking and Commerce, August 2007, vol.12, no.2.

[Wang et al., 2011] Ge Wang, He Liu, Sebastian Becerra, Kai Wang, Serge Belongie, Hovav Shacham, and Stefan Savage, Verilogo: Proactive phishing detection via logo recognition. University of California, San Diego, Dept. of Computer Science and Engineering. Technical Report CS211-0969, US San Diego, August 2011. Available as: http://cseweb.ucsd.edu/~hovav/dist/verilogo.pdf (checked on August 2nd, 2012).

[Wenyin et al., 2005] Liu Wenyin, Guanglin Huang, Lui Xiaoyue, Zhang Min, and Xiaotie Deng, Detection of phishing webpages based on visual similarity. ACM 1-59593-051-5/05/0005.

[Whittaker et al., 2010] Colin Whittaker, Brian Ryner, and Marria Nazif, Large-scale automatic classification of phishing pages. Google Inc., Research at Google: Research Areas & Publications. Available as: http://research.google.com/pubs/pub35580.html. (retrieved on 26th July, 2012).

[Wu et al., 2006a] Min Wu, Robert C. Miller, Greg Little, Web Wallet: Preventing phishing attacks by revealing user intentions. In: Proc. of The Second Symposium on Usable Privacy and Security (SOUPS 2006). pp. 102-113 2006.

[Wu et al., 2006b] Min Wu, Robert C. Miller, and Simson L. Garfinkel, Do security toolbars actually prevent phishing attacks? ACM 1-59593-178-3/06/0004.

[Xiang and Hong, 2009] Guang Xiang and Jason I. Hong, A hybrid phish detection approach by identity discovery and keywords retrieval. ACM 978-1-60558-487-4/09/04.

[Xiang et al., 2011] Guang Xiang, Jason Hong, Carolyn P. Rose, and Lorrie Cranor, CANTINA+: A feature-rich machine learning framework for detecting phishing websites. ACM Transactions on Information and System Security (TISSEC) Volume 14 Issue 2, September 2011, Article No. 21.

[Zhang et al., 2007a] Yue Zhang, Jason Hong and Lorrie Cranor, CANTINA: A Content-Based Approach to Detecting Phishing Web Sites. ACM 978-1-59593-654-7/07/0005.

[Zhang et al., 2007b] Yue Zhang, Serge Egelman, Lorrie Cranor, and Jason Hong, Phinding phish: Evaluating anti-phishing tools. In: Proc. of the 14th Annual Network and Distributed System Security Symposium (NDSS 2007).

Appendix Important terminology and definitions

The Anti-Phishing Working Group (APWG)

An international consortium formed to fight against phishing and on-line fraud.

Active warning

Warning that forces Internet users to notice it by interrupting their activity.

Code obfuscation

The act of converting code into a form that is difficult to understand, mainly performed to protect the code from reverse engineering.

Crimeware

Software designed for conducting cybercrime.

Cross-site request forgery

A malicious exploitation of a website in which the legitimate user is forced to

execute unauthorized commands.

Cross-site scripting

An attack in which malicious code is injected into the client side of legitimate

webpage.

DNS spoofing

An attack that causes a DNS server to return wrong IP addresses, diverting traffic to another computer.

Domain name typos

An act of generating a list of misspelled and mistyped variants of an entered domain name.

Denial of Service (DOS)

An attack on a network by flooding it with useless traffic.

DOM (Document Object Model) objects

Document Object Model is a platform- and language-neutral interface that will

allow programs and scripts to dynamically access and update the content,

structure and style of documents. [W3C]

DMOZ

A web directory.

False negative

A phishing website is misclassified as a legitimate website.

False positive

A legitimate website gets misclassified as a phishing website.

Heuristic methods

A technique in which various characteristics of the websites are checked to differentiate whether it is a phishing website or not.

Malware

Malicious software used to disrupt computer operation and also used to conduct phishing.

Malicious code

Any code or script in software system that is intended to cause undesired effect, security breach, or damage to the system. [Wikipedia]

Man in the middle attacks

An intrusion into an existing connection to intercept the exchanged data and inject false information.

MarkMonitor

A company that develops Internet brand protection software and services.

Mirroring of website

Act of creating an exact copy of another website.

Passive warning

Warning that just displays the message without interrupting Internet user activity.

Password harvester

Malicious software that looks for username and password information in the victims’ computer.

Pharming

An attack intended to redirect a website’s traffic to a bogus website.

PhishTank

An anti-phishing website.

Sandbox

A security mechanism for programs from untrusted sources.

Session hijacking

An exploitation of computer session in order to get an unauthorised access to information or services in a computer. [Wikipedia]

Secure Socket Layer

A cryptographic protocol used for secure communication over the Internet.

Spam

Unsolicited bulk messages, usually, used for advertisement.

Trojan horse

A kind of malware.

List of URLs for the valid phishing websites used for the experiment (Source: PhishTank)

S.N   URLs   Brands
1.   http://agenciasck.goldenbiz.com.br/   SCK Imperial
2.   http://credit10.webobo.biz/download.php?id_menu=3441921/   Haboo
3.   http://deutchland-konto.ntdll.net/img/glyph/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f84f1036d8f209d3d19ebb6f4eeec8bd0eaf4a55ab8d6b037be0813c1fa7ae828caf4a55ab8d6b037be0813c1fa7ae828c   Paypal
4.   http://lehoapaper.com/Paypal_Virefication/1596578fae650778e27f8ffbd70c4502/   Paypal
5.   http://masterstudio.es/wp-includes/js/crop/   Paypal
6.   http://ilhanpolat.com/account/id/78550375/paypal/pp/update/webscr/6998GSQ64976W84f356Gi6Bn432/profile/webscr/pp/us/www.Paypal.com/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f8a8350c132bc41e0934cfc023d4e8f9e5fb78214886cead8bcd4c1677f8e7572cfb78214886cead8bcd4c1677f8e7572c   Paypal
7.   http://188.138.124.133/www.paypal.com/session_id/87544455623222414898896521454598/index.htm#   Paypal
8.   http://pornographicrecordings.com/img/icons/tabs/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f84f1036d8f209d3d19ebb6f4eeec8bd0eb8fde1c0e2ec85dcf4341e5b995664adb8fde1c0e2ec85dcf4341e5b995664ad   Paypal
9.   http://sreeramsolutions.com/ayyalu/images/login.php   CAPITEC Bank
10.   http://sreeramsolutions.com/ayyalu/images/capitec.htm   CAPITEC Bank
11.   http://prophor.com.ar/prophor/wells/alerts.php   WELLS FARGO
      http://specialneedssvg.org/wp/wp-admin/import/wellsfargo/wellsfargo/wellsfargo2011/index.php
12.   http://rrnow.findhere.org/   Time Warner Cable
13.   http://paypal.com.login.secure.md5.id.0645654032132165461321.fabianpulido.com/b22668f2a2c3063efb7749ac67fef65a/   Paypal
14.   http://net77-43-56-76.mclink.it/.ss/   Sparkasse
      http://78.188.234.21/.ss3/?https://bankingportal.kreissparkasse-heinsberg.de/portal/portal/StartenIPSTANDARD
15.   http://godknwswhy.x90x.net/   Yahoo!Mail
16.   http://zulumarket.com/negocio/index.html   CHASE
17.   http://abnerindonesia.com/billingcenter/aol/XKklowI9292O02/DBMECX8QgQ1BHaQQv4pYZFzemQbF/verify/Accounts/Secure_Area/aol/update.php   AOL Mail
18.   http://abnerindonesia.com/billingcenter/aol/XKklowI9292O02/DBMECX8QgQ1BHaQQv4pYZFzemQbF/verify/Accounts/Secure_Area/aol/   AOL Mail
19.   http://alex.24openstore.de/PayPal/webscr.php?cmd=_login-run&dispatch=5885d80a13c0db1f1ff80d546411d7f8a8350c132bc41e0934cfc023d4e8f9e5eb7cfbb17ec87b191acc343bb447f8e9eb7cfbb17ec87b191acc343bb447f8e9   Paypal
20.   http://us.battlle.net.htm.isnyeo.info/battle_net_account.html?ref=https%3A%2F%2Fus.battle.net%2Faccount%2Fmanagement%2Findex.xml&app=bam&t=1   BATTLENET
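Two URL anomalies that are visible in the list above, an IP address used as the host (entry 7) and a brand domain placed outside the registered domain (entries 7 and 13), can be checked with a few lines of code. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the checks are simplifications that may not match the exact anomaly definitions used in this thesis.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url):
    """True when the host part of the URL is a literal IP address."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def brand_outside_registered_domain(url, brand_domain):
    """True when brand_domain appears in the host or path of the URL
    although the host itself does not belong to brand_domain."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    brand = brand_domain.lower()
    genuine = host == brand or host.endswith("." + brand)
    return not genuine and brand in (host + parts.path).lower()


# Entries 7, 13 and 5 from the list above.
url_7 = ("http://188.138.124.133/www.paypal.com/session_id/"
         "87544455623222414898896521454598/index.htm#")
url_13 = ("http://paypal.com.login.secure.md5.id.0645654032132165461321."
          "fabianpulido.com/b22668f2a2c3063efb7749ac67fef65a/")
url_5 = "http://masterstudio.es/wp-includes/js/crop/"

for url in (url_7, url_13, url_5):
    print(host_is_ip(url), brand_outside_registered_domain(url, "paypal.com"))
```

Entry 5 triggers neither check, which illustrates why URL anomalies alone are not sufficient and have to be combined with anomalies found in the source code.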
