
CHAPTER 7

SERVER SIDE PHISHING FILTERING TECHNIQUES

7.1 INTRODUCTION

Client-side phishing techniques were discussed in the previous chapters. This chapter looks at server-side techniques. Servers are usually the more attractive targets, since a single server holds the credentials and account data of many users. Because phishing attacks are becoming more sophisticated, the client information submitted to the server cannot always be trusted. In addition to stealing the user's password, phishers also steal more sensitive information by imitating a successful authentication. They may even check the validity of the password by forwarding it to the legitimate server, or hijack the user's login session by means of a man-in-the-middle attack. A phishing attack is usually carried out by an email or an instant message that tries to lure recipients to a fake website, where they reveal their personal credentials.

A number of countermeasures have been proposed and developed for protecting websites against phishing attacks. Server-side protection uses SSL certificates, user-selected logos and other security indicators to help users verify the legitimacy of websites. When phishing is carried out via email, the attacker sends out a large number of messages that appear to come from a genuine source, such as a trusted business or financial institution. The emails include urgent requests to submit personal information. Typically, the phisher claims that an account must be updated immediately. A link in the email message leads to an authentic-looking website, where users enter their real information. The personal information provided to this site, however, goes directly to the party committing the phishing attack, and not to the legitimate business being imitated.

The term 'server side' here does not refer only to the servers in financial institutions; it also includes the servers throughout the network. Although many methods exist for protecting users' information, individuals and organizations still frequently lose it. The literature shows that the existing server-side anti-phishing techniques do not identify phishers reliably. Thus, a large gap remains between what these techniques deliver and the confidentiality that users expect. Considering the importance of users' personal information, better phishing prevention techniques are needed: techniques that protect the confidentiality of the information and serve everyone who uses the website.

Phishing prevention is a promising way to address this problem. In this work, to keep users' personal information secure, users should be able to tell when the expected step-by-step process is not being followed; at that point they should stop and not give out their personal information. To meet these requirements, four server-side techniques are introduced: the one-time password mechanism, the watermarking mechanism, session hijacking prevention, and e-mail phishing filtering. In the one-time password mechanism, the password becomes available only after the first step of authentication. In the watermarking mechanism, the specified watermark must be present during authentication. The one-time URL method is applied against session hijacking, and e-mail filtering methods are implemented against e-mail phishing.

7.2 ONE-TIME PASSWORD MECHANISM

A password that is valid for only one session or transaction is known as a One-Time Password (OTP), and it avoids the risks of traditional static passwords. In this system, the password becomes available only after a secret code has been generated. Users are authenticated with an encrypted security code delivered on demand over a reliable communication channel. The user database on the server side maps each user name to the user's identity on that second communication path. When a user wants to access the website, the server sends an encrypted security code to the user through this channel. On receipt, the user decrypts the code and enters it to log in. The security code is encrypted with the private key and decrypted with the public key; the decryption is done by the user.

Registration, which is part of the admin process, involves the following. The user must choose a login name, fill in all the required information fields, and provide at least one type of personal contact information (e-mail address or mobile number). The website should list all the services that it uses to deliver the security code, so that the user can choose the preferred service. The use of a security question is not mandatory; it depends on the website provider's policy or the user's wish. However, such questions make the authentication process more secure. The proposed system is shown in Figure 7.1. The steps are as follows:

(i) The validation page is sent to the customer. The page contains the login name used by the website.

(ii) If the customer's login name is new to the website, the customer is asked for permission to add the login name to the website's contact list.

(iii) After the login name has been approved by both the website and the customer, the website sends an account validation message to the user via the designated communication channel.

Figure 7.1 One-time password system for preventing a phishing attack

Next, the user starts the actual login process by browsing to the login page, which contains an input field for the customer's login name and a CAPTCHA test. If the login name is not recognized by the website, an error page must be displayed. If the account name is valid, the website looks up the customer's registered account and sends an acknowledgement to that account. If the acknowledgement message is valid, the customer enters the assigned security code on the input page. On receipt of the security code, the website must make sure that it was submitted from exactly the same IP address as the one from which the customer requested the login.

[Figure 7.1 diagram, split into Client and Server sides. Labels: User, Registration, Login, Decryption, Generate security code, DB. Messages: user name and CAPTCHA code, acknowledgement, encrypted code, security code.]

7.2.1 Implementation Process

This solution can only be implemented on the web server's side. If the system is to offer practical resistance to phishing attacks, it must impose minimal overhead, since a solution that significantly slows the web browsing experience is unlikely to be adopted. The first login page, shown in Figure 7.2, contains an input field for the user's login name and the CAPTCHA test. If the login name is not valid, an error message is shown. If it is valid, the website checks the user's registered account and sends an acknowledgement to that user.

Figure 7.2 First login page

Next, the customer enters the assigned security code on the input page, as shown in Figure 7.3. On receipt of the security code, the website checks whether it is valid. If it is not, an error message is displayed; the user is allowed at most n incorrect attempts.

Figure 7.3 Identification of the user name
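The server-side checks described above (a short-lived code, submission from the requesting IP address, and at most n wrong attempts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the class name `SecurityCodeStore`, the six-digit code format, the 30-second lifetime and n = 3 are all hypothetical, and the encrypted delivery of the code is left out.

```python
import hmac
import secrets
import time

CODE_LIFETIME_SECONDS = 30   # assumed; the text only says "a few seconds"
MAX_ATTEMPTS = 3             # assumed value of n


class SecurityCodeStore:
    """In-memory stand-in for the server-side user database (hypothetical)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._pending = {}   # login name -> pending code record

    def issue(self, login, client_ip):
        """Create a one-time code bound to the IP address that asked to log in."""
        code = f"{secrets.randbelow(10**6):06d}"
        self._pending[login] = {
            "code": code,
            "ip": client_ip,
            "expires": self._clock() + CODE_LIFETIME_SECONDS,
            "attempts": 0,
        }
        return code   # would be sent over the user's registered channel

    def verify(self, login, client_ip, submitted):
        """Accept the code once, from the same IP, before it expires."""
        record = self._pending.get(login)
        if record is None:
            return False
        if self._clock() > record["expires"]:
            del self._pending[login]        # expired codes are discarded
            return False
        if client_ip != record["ip"]:
            return False                    # must come from the requesting IP
        if hmac.compare_digest(submitted.encode(), record["code"].encode()):
            del self._pending[login]        # one-time: never accepted twice
            return True
        record["attempts"] += 1
        if record["attempts"] >= MAX_ATTEMPTS:
            del self._pending[login]        # n wrong attempts: code is revoked
        return False
```

A login handler would call `issue` after the user name and CAPTCHA have been accepted, and `verify` when the security code page is submitted. The clock is injectable only so that expiry can be exercised without waiting.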

7.2.2 Security analysis of one-time password mechanism

Besides phishing, this system also mitigates several other attacks that trouble websites:

(i) Denial of service attack

A denial of service attack could be launched against any part of the Internet connectivity and network infrastructure. In the proposed solution, the website authenticates the customer by asking him/her to input the security code already assigned by the website. The customer authenticates the website by first checking the sender of the acknowledgement message.

(ii) IP Spoofing

In IP spoofing, the attacker forges the source IP address so that the attack traffic appears to originate from the target computer's own address, which can cause an operating system such as Windows to crash or lock up. The proposed solution restricts the locations from which IP spoofing attacks can be launched. If the attacker uses the same IP address as the user in the same local network concurrently, the user can detect it. The lifetime of the security code is only a few seconds, so it is not practical for the attacker to log in to the protected website via the same IP address.

(iii) Server spoofing

On Windows 95 stations, LANMAN authentication can be requested from the client by running the C2MYAZZ utility, which the attacker uses to act as the server during user login sessions. If the attacker succeeds in tricking the client, he will be able to read the user's login details from the network packets. The proposed solution does not require a preset password to log in, thereby avoiding password theft.

(iv) Man in the middle attack

An attacker may watch a session being opened on a network. Once authentication is over, he might attack the client system to disable it, and then use IP spoofing to claim to be the client who was just authenticated and take over the session. In the proposed solution, suppose the attacker discovers both the customer's web account name and the security code for the current session. Since the lifetime of the security code is very short, it would be of little use to the attacker.

This approach can be deployed for websites requiring a high level of security, and it would ultimately help in retaining customers' confidence in web-based commerce. Compared with the consequences of phishing, the added delay, measured in milliseconds, is negligible. The approach is more effective at blocking phishing sites than the earlier techniques, but this basic check does not stop every phishing site. So, the next technique is implemented with watermarking.

7.3 WATERMARKING MECHANISM

The purpose of a watermark is to identify a work and prevent its unauthorized use. A visible watermark is a common way of identifying images and protecting them from unauthorized use online. Here, the watermark message is intended to be distinct for every user, and carries a shared secret between the company and the user in order to stop attacks such as phishing. The proposed system is shown in Figure 7.4.

Figure 7.4 System for preventing phishing attack using watermarking

[Figure 7.4 diagram, split into Client and Server sides. Labels: User, User Registration, Login, Application, Auto generation of Secret code, Encrypt the code, DB, Retrieve the secret code based on user ID, Decrypt the code, Show the code to user, Show as a watermarking, Increase Credibility.]

Here, the client enters the URL to view the required webpage of a particular web server. To increase the credibility of the website, the client machine's current date and time are displayed in the client browser. Usually, when a phishing attack occurs, the page may be redirected during the money transaction. When clients need to enter personal details, such as an online banking password or ATM PIN, they need to log in. After logging in, the client may not know whether he is on the correct page. This is where the watermarking technique plays a major role, by establishing the credibility of that particular webpage. After logging in, and before giving personal details, the user can check the credibility of the webpage. This check is made possible by the watermarking mechanism, as shown in Figure 7.5.

Before logging in to a commercial website, the user can see his/her machine's date and time in the logo, which is initiated from the server. After logging in, if the user places the cursor over the logo, the secret code is displayed. This secret code is user dependent and is stored in the server database. If the user places the cursor at the top of the web page, the user's name is displayed.

However, an attacker may break into the server database to obtain a user's secret code, and then show the watermark on a fake website to make it look legitimate. To avoid this problem, the server encrypts the secret code with a symmetric-key encryption algorithm before storing it in the database.

This algorithm converts the secret code into an encrypted form that cannot be read by humans. When the user logs in to the website, the secret code of that particular user is fetched from the database and decrypted into a human-readable form, using the symmetric-key decryption algorithm.

Figure 7.5 Watermarking system

This encrypted watermarking mechanism is more secure than the previous ones, since the date and time are initiated from the server and the secret code is displayed only after decryption. Even if the attacker breaks into the server database, he/she cannot read the secret code. The technique is simple and easy to use, and at the same time effective in preventing phishing attacks.

Here, the main advantage is that the secret code is decrypted on the client side, and the server sends only the encrypted secret code to the client. This helps to prevent man-in-the-middle attacks.

Some of the existing watermarking techniques are a little more costly than this method, since they need additional software, such as ImageMagick. This approach is platform independent, and it places no load on the client side, such as the use of an additional tool. But this basic check does not prevent all phishing attacks. So, the next technique addresses session hijacking.

7.4 PREVENTING PHISHING THROUGH SESSION HIJACKING

Session hijacking is the method by which an attacker gains access to a user's session by obtaining the session ID, and then acts as the authorized user. This lets the attacker do anything in the network that the legitimate user is allowed to do.

Normally, the session ID is stored in a cookie or in the URL, and authentication is carried out only at the initial setup time; the hijacker takes advantage of this by intruding into the session in real time. Session hijacking can be a possible reason for unexpected responses from a website, or for no response to user inputs.

A web-based detection and prevention mechanism called the Session Hijacking Attack Prevention System (SHAPS) is used to prevent session hijacking. It fetches all requests and responses, and validates them for session hijacking. It prevents session fixation by validating the hostname, IP address, and session ID; mismatched sessions are invalidated.

The Universally Unique Identifier (UUID) and Time Stamp (TS) are used to generate the URL for short-lived services, which also acts as a gateway for every request. The Secure Hash Algorithm (SHA-1) is used in a non-static web session to prevent session hijacking by dynamically creating session identifiers, which are further used to weed out the phishing website.

The Session Hijacking Attack Prevention System has two parts. The first part is the normal client-server communication, and the second part contains the phases of SHAPS. Each phase contains web services for preventing a session hijacking attack. The repository stores the necessary information about the client's request, so that it can be retrieved later to validate the request.

The architecture of the system is described in terms of its components and their interactions. A component is a part of the system that exposes a well-defined interface, and components interact with each other through these interfaces. To prevent session hijacking in a web application, the client's request should be captured and checked against the attacker's session ID by different types of web services.

The SHAPS phases comprise three web services, namely, the session fixation preventer web service, the one-time URL web service and the non-static web session creator web service. Each is a separate web service that detects and prevents the session hijacking attack in web applications.

7.4.1 Session Fixation Preventer Web service

Normally, a session ID is issued by the web server to the user session. This approach, however, ignores one very important issue: an attacker may issue a session ID to the client's session, thereby forcing the client to use a session chosen by the attacker. This class of attack is called session fixation. It is a form of session hijacking in which the user's session ID is fixed by the attacker in advance, instead of being generated randomly at login time. Figure 7.6 describes the session fixation attack preventer web service.

Figure 7.6 Session Fixation Attack Preventer Web Service

[Figure 7.6 diagram labels: Client, Web Server, Database, Web Service, Req, Res, Fetching data from request, Validating fetched request, Regenerated session ID, Original session data.]

In a normal client-server environment, the client sends a request through the browser and the server responds to the browser with the web application. Because HTTP is a stateless protocol, a session is maintained for each user to provide reliable communication between the client and the server. Each session has a unique session identifier, which serves as the identification key of the user. If the targeted client clicks a link prepared by the attacker, the attacker's session is fixed to the client session in a weakly developed web application. The attacker already knows the fixed session ID; hence, the attacker can enter the client's account without providing the user name and password, and retain access until the client signs out of the session. Figure 7.6 provides a solution to this problem for weakly developed web applications. When a request prepared by the attacker is sent on behalf of the client, it arrives with the attacker's session ID fixed to the client session.

After a successful login, the session ID generated by the web server is displayed in an alert message box. It is used to keep apart the sessions of different users accessing the same server. An alert message is also displayed on the attacked web page if the attacker-fixed session is blocked by the web service, if the session ID is mismatched, or if no session ID is generated.

When the session fixation preventer web service is applied to a weakly developed web application, every request is fetched by the web service and its essential information is stored in the repository. The essential information is the session ID, IP address, date, time, and host name of the request. Using this information, the validation service checks whether a requested URL carries the same session ID and the same host name but a different IP address within a short period of time. If so, the validation service identifies the session as an attacker session. After login, the attacker-fixed session ID is regenerated and stored in the repository, and the service then allows the request through to the web server. The attacker can no longer attack the client session by fixing the attacker's session ID to it. Figure 7.7 shows the XML structure of the fixation preventer service.
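The validation rule described above (same session ID and host name, different IP address, within a short period) and the regeneration of the session ID at login can be sketched as follows. This is a simplified illustration, not the SHAPS web service itself: the class name `FixationGuard`, the in-memory dictionary standing in for the repository, and the 300-second window are assumptions.

```python
import secrets
import time

SUSPICION_WINDOW_SECONDS = 300   # assumed length of the "short period of time"


class FixationGuard:
    """Hypothetical stand-in for the session fixation preventer web service."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._repository = {}   # session ID -> (IP address, host name, first seen)

    def is_suspicious(self, session_id, ip, host):
        """Record the request and flag a known session ID arriving from a new IP."""
        now = self._clock()
        seen = self._repository.get(session_id)
        if seen is None:
            self._repository[session_id] = (ip, host, now)
            return False
        seen_ip, seen_host, seen_at = seen
        return (
            ip != seen_ip
            and host == seen_host
            and now - seen_at <= SUSPICION_WINDOW_SECONDS
        )

    def regenerate(self, old_session_id, ip, host):
        """Issue a fresh ID at login so that a pre-set (fixed) ID becomes useless."""
        self._repository.pop(old_session_id, None)
        new_id = secrets.token_urlsafe(32)
        self._repository[new_id] = (ip, host, self._clock())
        return new_id
```

In this sketch the application would call `regenerate` immediately after a successful login and hand the returned ID to the client, so that any ID planted before login no longer maps to the authenticated session.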

<?xml version="1.0" encoding="UTF-8"?>
<service name="FixationService" scope="application"
         class="fixation.SessionFixationServiceLifeCycle">
  <description>
    FixationService
  </description>
  <messageReceivers>
    <messageReceiver mep="http://www.w3.org/2004/08/wsdl/in-out"
                     class="org.apache.axis2.rpc.receivers.RPCMessageReceiver"/>
  </messageReceivers>
  <parameter name="ServiceClass">
    Fixation.sessionFixation
  </parameter>
</service>

Figure 7.7 XML Structure of the fixation preventer service

7.4.2 One time URL Web Service

The main goal of the one-time URL is to provide security for short-lived services, such as transaction services, account activation services and password reset services. The one-time URL is valid for only one access. It is generated by the web server and the web service using a UUID and a time stamp, as shown in Figure 7.8.

Figure 7.8 One-time URL web service

[Figure 7.8 diagram labels: Client, Web Server, Database, Web Service, Req, Res, Requested for service, One time URL Creation (UUID+TS), Validation (Handler), One time URL, Original Request Information.]

The one-time URL is generated in a web application for accessing sensitive information or services such as money transfer, account activation or resetting secret details. These services are highly confidential and must be protected against outside users and attackers. The client-server communication environment records the number of requests and responses exchanged by the client and server in a session. A SOAP object is used to make the HTTP request; the header field of the SOAP object is optional here. The Universally Unique Identifier and Time Stamp are placed in the SOAP header field to secure the client request. The same UUID is never assigned to different clients, though the same TS sometimes is. However, the lifetime differs based on the request, as shown in Figure 7.9.

A URL can be constructed manually, but an attacker cannot access anything in the system with a manually created URL. The one-time URL web service contains a handler that acts as a security gateway through which every client request passes. The handler evaluates each user request against the one-time URL. If the incoming URL has already been used, the handler blocks the request to the particular service, invalidates the transaction session, and advises the client to start again. A correct one-time URL has a short lifetime, within which the client has to complete the operation. If the client is idle or accesses the service too slowly, the URL expires. These two security mechanisms, single use and a short lifetime, are included in the one-time URL to secure short-lived services in a weakly developed web application.

<?xml version="1.0" encoding="UTF-8"?>
<serviceGroup>
  <service name="TRANSACTION SERVICE">
    <parameter name="ServiceClass" locked="false">com.example.TransactionServices</parameter>
    <operation name="aspectWithdraw">
      <actionMapping></actionMapping>
      <messageReceiver class="org.apache.axis2.receivers.RawXMLINOutMessageReceiver"/>
    </operation>
  </service>
</serviceGroup>

Figure 7.9 XML structure of a transaction service
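The two properties of the one-time URL, single use and a short lifetime, can be sketched as follows. This is a minimal illustration assuming an in-memory token table in place of the repository; the class name `OneTimeUrlHandler`, the query parameter names `id` and `ts`, the base URL and the 120-second lifetime are hypothetical, and the SOAP transport is left out.

```python
import time
import uuid

URL_LIFETIME_SECONDS = 120   # assumed lifetime; the text only says "short"


class OneTimeUrlHandler:
    """Hypothetical gateway that issues and validates one-time URLs."""

    def __init__(self, base_url, clock=time.time):
        self._base_url = base_url
        self._clock = clock
        self._issued = {}   # UUID token -> time stamp at which it was issued

    def create(self):
        """Build a URL that carries a fresh UUID and the current time stamp."""
        token = uuid.uuid4().hex
        issued_at = int(self._clock())
        self._issued[token] = issued_at
        return f"{self._base_url}?id={token}&ts={issued_at}"

    def validate(self, token, ts):
        """Accept a token once, and only while its lifetime has not run out."""
        issued_at = self._issued.pop(token, None)   # pop enforces single use
        if issued_at is None or issued_at != ts:
            return False    # unknown, already used, or hand-made URL
        return self._clock() - issued_at <= URL_LIFETIME_SECONDS
```

Because the token is removed on the first lookup, a replayed URL is rejected even if it is still within its lifetime, and a manually constructed URL fails because its UUID was never issued.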


7.4.3 Non Static Web Session Creator Web Services

Popular websites such as Hotmail, Gmail and Yahoo are preferred targets for session hijacking. Attackers constantly try to capture the cookie or session ID in order to access the system under the victim's identity. This vulnerability is basically due to the use of a static session ID. This module prevents hijacking by using a non-static session ID, which makes captured session IDs useless to the attacker, as shown in Figure 7.10.

Figure 7.10 Non static web session creator web service module

[Figure 7.10 diagram labels: Client, Web Server, Database, Web Service, Req, Res, Cookies of User request, Non-Static session creation, Dynamic ID, Cookies storage and mapping, Original cookies.]

This system provides non-static session IDs instead of static session IDs. The idea is that each HTTP request must use a different session ID, to protect against session hijacking attacks. A method is designed that generates the dynamic session ID using the Secure Hash Algorithm (SHA-1), which produces a 160-bit value. This algorithm takes an input of variable length and produces an output of fixed length, called the message digest.

The non-static web session creator creates a dynamic ID for each request, based on the SHA-1 algorithm. In the authentication phase, the client requests a bank web page, and the server sends the client a login page such as https://www.chennaibank.com/login.jsp. The user enters a username and password. Once the server has validated them, it generates a static session ID and updates the repository located in the server's memory. The static ID is an identifier for the user session. At the same time, the web service generates a secret key that is known only to the server and the client. The static session ID, the secret key, the client time and the user name are stored. For each subsequent request, the web service generates a dynamic ID with which the requesting user accesses the website.

The system uses a method that generates the dynamic session ID from a secret key, a static session ID, and the client's session time, using equations (7.1) and (7.2):

\[ A = \mathrm{Hash}(\text{Secret Key} + \text{Client Time}) \tag{7.1} \]

\[ B = \text{Static ID} + \text{Client Time} \tag{7.2} \]

The dynamic ID is then generated by joining the two parts, so that A and B can later be read back from it:

\[ \text{Dynamic ID} = A + B = \mathrm{Hash}(\text{Secret Key} + \text{Client Time}) + \text{Static ID} + \text{Client Time} \]

Verifying a dynamic ID can be done in the six steps given in Figure 7.11.

1. Read the static ID from part B of the dynamic ID, and use it to look up the secret key and username in the repository.

2. Read the client time from part B of the dynamic ID, and append it to the secret key (from step 1).

3. Calculate Hash (secret key + client time).

4. Read part A of the dynamic ID and compare it with the result of step 3.

5. If the comparison in step 4 is a match, this session belongs to the correct requesting user. If it is not a match, this session is incorrect.

6. If the session in step 5 is correct, verify the client time. If the client time is different from the server's time (server time), then the web service is alerted and the session is terminated.

Figure 7.11 Algorithm for verifying dynamic ID
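Equations (7.1) and (7.2) and the six verification steps can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: "+" is taken to be string concatenation, a ":" separator inside B and the 60-second clock tolerance are assumptions introduced so that the parts can be parsed and compared, and the repository is a plain dictionary.

```python
import hashlib
import hmac

HASH_HEX_LENGTH = 40          # a SHA-1 digest is 160 bits = 40 hex characters
MAX_CLOCK_SKEW_SECONDS = 60   # assumed tolerance between client and server time


def make_dynamic_id(secret_key, static_id, client_time):
    """Build Dynamic ID = A + B as in equations (7.1) and (7.2)."""
    a = hashlib.sha1(f"{secret_key}{client_time}".encode()).hexdigest()  # (7.1)
    b = f"{static_id}:{client_time}"                                     # (7.2)
    return a + b


def verify_dynamic_id(dynamic_id, repository, server_time):
    """Return the username if the dynamic ID passes all six steps, else None."""
    a, b = dynamic_id[:HASH_HEX_LENGTH], dynamic_id[HASH_HEX_LENGTH:]
    static_id, _, client_time_text = b.rpartition(":")
    if static_id not in repository or not client_time_text.isdigit():
        return None
    secret_key, username = repository[static_id]                  # step 1
    material = f"{secret_key}{client_time_text}".encode()         # step 2
    expected = hashlib.sha1(material).hexdigest()                 # step 3
    if not hmac.compare_digest(a.encode(), expected.encode()):    # steps 4-5
        return None
    if abs(server_time - int(client_time_text)) > MAX_CLOCK_SKEW_SECONDS:
        return None                                               # step 6
    return username
```

For example, with `repository = {"S123": ("k3y", "alice")}`, an ID built by `make_dynamic_id("k3y", "S123", 1700000000)` verifies as `"alice"` at server time 1700000010, while an ID built with a different key, or presented long after its client time, is rejected.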


7.4.4 Performance Comparison

Table 7.1 shows the results of evaluating SHAPS with different types of vulnerable inputs. The performance of the developed system is analysed in terms of response time, with and without the prevention services. The response times for the session fixation preventer, the one-time URL, and the non-static web session creator were measured independently with ten different samples of vulnerable links and scripts.

Table 7.1 Comparative assessment based on Time of the session hijacking

Response time in milliseconds, without and with the prevention service

No. of Test         Session Fixation Preventer      One time URL                    Non static web session creator
                    Without         With            Without         With            Without         With
1                   25              43              23              90              134             457
2                   21              78              36              110             35              80
3                   27              43              75              99              38              60
4                   13              18              52              105             41              82
5                   26              16              56              80              78              84
6                   22              27              43              111             35              55
7                   27              31              45              85              30              48
8                   19              23              16              65              16              59
9                   26              30              53              99              31              200
10                  28              35              37              74              53              135
Avg Response Time   23.4            34.4            43.6            91.8            49.1            126.0
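The averages in Table 7.1, and the delay that each prevention service adds, can be recomputed from the ten samples. The short script below is only an arithmetic check; the measurements are copied from the table, while the dictionary layout and the function name `overhead` are illustrative.

```python
from statistics import mean

# Response times in milliseconds, copied from Table 7.1:
# (without prevention service, with prevention service) for tests 1 to 10.
RESULTS = {
    "Session fixation preventer": (
        [25, 21, 27, 13, 26, 22, 27, 19, 26, 28],
        [43, 78, 43, 18, 16, 27, 31, 23, 30, 35],
    ),
    "One-time URL": (
        [23, 36, 75, 52, 56, 43, 45, 16, 53, 37],
        [90, 110, 99, 105, 80, 111, 85, 65, 99, 74],
    ),
    "Non-static web session creator": (
        [134, 35, 38, 41, 78, 35, 30, 16, 31, 53],
        [457, 80, 60, 82, 84, 55, 48, 59, 200, 135],
    ),
}


def overhead(without, with_service):
    """Return (mean without, mean with, added milliseconds) for one service."""
    base, protected = mean(without), mean(with_service)
    return base, protected, protected - base


for name, (without, with_service) in RESULTS.items():
    base, protected, added = overhead(without, with_service)
    print(f"{name}: {base:.1f} ms -> {protected:.1f} ms (+{added:.1f} ms)")
```

The recomputed means agree with the last row of the table, and the added delay is about 11.0 ms for the session fixation preventer, 48.2 ms for the one-time URL, and 76.9 ms for the non-static web session creator.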

Even though the response time is higher with this filtering system, the system provides more security than the existing techniques. With session hijacking prevented, phishing can be avoided effectively. The next server-side technique addresses e-mail phishing.

7.5 E-MAIL PHISHING

A phishing email attempts to fraudulently acquire personal information, such as an account password or credit card information. The email appears to come from a legitimate source, but actually does not. Many e-mail tools, as well as most browser tools, use lists to classify “good” (whitelists) and “bad” (blacklists) sources or senders. Typically, blacklists block the IP address of the e-mail (SMTP) server, the sender domain, or even the whole e-mail address domain of a sender. Blocking the IP address or domain can cause problems when the sender uses the SMTP server of a shared provider, and blocking the whole e-mail address domain of the sender can be ineffective, because the source address could be forged (Cleber et al 2011). This approach uses a robust three-stage classification model that can be implemented on web servers to automatically detect phishing messages and discover the impersonated entity in those messages. The approach, called Mail Sieve, combines soft computing and the three-stage classification model for filtering phishing emails. Here, the existing classifier algorithms are arranged as a multi-tier classification process to classify phishing email and to find the optimum ordering. Moreover, supervised classification techniques, a major stream of data mining, are used in this work to assess the severity of the phishing attacks. The supervised learning algorithms used are Decision Tree induction, Multilayer Perceptron, Naïve Bayes, Bayesian Network and Radial Basis Function classification.

7.5.1 Clustering the Values

In this approach, the email messages are initially parsed using the Chilkat Reader. The parsed mails are represented as texts, which are then validated for legitimacy. First, the texts are checked for missing values by the soft computing technique. The missing values are replaced by standard terms, using the K-Cluster algorithm. K-means is one of the simplest learning algorithms that solve the well-known clustering problem. The given data set is partitioned into a chosen number of clusters (k). One center is defined for each of the k clusters, and the centers are placed so that each can stand in for the related and missing values of its cluster. The process associates each point of the data set with its nearest center. A loop is then run until the target value has replaced the missing values. In this loop, the k centers change their location step by step, until no more changes occur. The algorithm minimizes an objective function called the squared error function, which is given by (www.deib.polim.it):

\[ J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2 \]

where,

\( \lVert x_i - v_j \rVert \) is the Euclidean distance between \( x_i \) and \( v_j \).

\( c_i \) is the number of data points in the ith cluster.

\( c \) is the number of cluster centers.
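The squared error function and the iterative update of the centers can be sketched as follows. This is a minimal pure-Python illustration of standard k-means on numeric points, not the implementation used in this work; the function names are hypothetical, and the mapping of e-mail terms to numeric values is not covered.

```python
import math


def assign(points, centers):
    """Index of the nearest center for every point (Euclidean distance)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def update(points, labels, centers):
    """Move each center to the mean of the points assigned to it."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)   # an empty cluster keeps its center
            continue
        new_centers.append(
            tuple(sum(axis) / len(members) for axis in zip(*members))
        )
    return new_centers


def objective(points, labels, centers):
    """Squared error J(V): sum of squared distances to the assigned centers."""
    return sum(
        math.dist(p, centers[label]) ** 2 for p, label in zip(points, labels)
    )


def k_means(points, centers, max_iterations=100):
    """Alternate assignment and update until the centers stop moving."""
    for _ in range(max_iterations):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)
```

Calling `k_means` with a list of coordinate tuples and k initial centers returns the final centers together with the cluster index of every point; `objective` then evaluates J(V) for that result.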

The algorithm is given in Figure 7.12.

168

Figure 7.12 Algorithm for K-Cluster

Initially, the values are grouped into K clusters. For example, end-user, user, client and similar words related to the customer are grouped into one cluster, named the customer cluster. Banking terms such as bank, banking and finance are grouped into another cluster, called bank. For every cluster, a center point or value is fixed, termed the centroid or target value. The values in the cluster are then replaced by the target value until no replacements remain. The work flow is shown in Figure 7.13: the terms are grouped into clusters, the centroid value is fixed for every cluster, and the phished values are then replaced by the target value.
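The replacement of related terms by a cluster's target value can be sketched as a simple lookup. This is only an illustration under stated assumptions: the cluster contents below reuse the example words from the text, while the names `CLUSTERS`, `build_lookup` and `replace_terms` and the tokenisation rule are hypothetical and not taken from the original system.

```python
import re

# Hypothetical clusters: each target value (centroid term) with its related terms.
CLUSTERS = {
    "customer": {"end-user", "user", "client"},
    "bank": {"banking", "finance"},
}


def build_lookup(clusters):
    """Map every related term, and the target itself, to the target value."""
    lookup = {}
    for target, terms in clusters.items():
        lookup[target] = target
        for term in terms:
            lookup[term] = target
    return lookup


def replace_terms(text, clusters=CLUSTERS):
    """Lower-case the text, split it into words and replace clustered terms."""
    lookup = build_lookup(clusters)
    words = re.findall(r"[a-z0-9-]+", text.lower())
    return " ".join(lookup.get(word, word) for word in words)


print(replace_terms("Dear client, your banking details"))
```

Words that belong to no cluster pass through unchanged, so the sketch only normalises the vocabulary and does not decide legitimacy.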

After the missing values are replaced, the e-mail is checked in the three-stage classification model. In stage one, the mails are checked for legitimacy in the subject; in stage two, they are checked for the content; and in stage three, they are checked for the sender’s IP address.

1. Place k points in the space represented by the objects that are being clustered. These k points represent the initial group centroids.

2. Assign each object to the group that has the closest centroid.

3. When all the objects have been assigned, recalculate the positions of the k centroids.

4. Repeat steps 2 and 3 until the centroids stop moving. This separates the objects into groups from which the metric to be minimized can be calculated.
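The four steps above can be sketched in plain Python as follows. This is a minimal sketch, not the thesis implementation: it assumes numeric points, at least k of them, and takes the first k points as the initial centroids; the function names `kmeans` and `squared_error` are illustrative.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of numeric tuples; returns (centroids, labels)."""
    # Step 1: take the first k points as the initial centroids (an assumption).
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[j])  # empty cluster keeps its centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def squared_error(points, centroids, labels):
    """Objective J: sum of squared Euclidean distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels, squared_error(points, centroids, labels))
```

The result of k-means depends on the initial centroids, so a different initialisation may converge to a different grouping on less clearly separated data.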

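Figure 7.13 Workflow of the K-Cluster algorithm

[Flowchart: Start; number of clusters K; find centroid; replace missing values by centroid; are all missing values replaced? (NO: repeat, YES: End)]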

7.5.2 Three Stage Classification Model

The stage-one classifier validates the texts in the mail subject. It selects the texts and checks them against the predefined keywords listed in the FilterKeywords.xml file. Based on the keyword match, the mail is marked as either legitimate or spam. Illegitimate mails are moved to the spam or junk folder; a mail found to be good is passed to the stage-two classifier, which checks the legitimacy of the content. The content is checked for phishing keywords as well as for embedded images. The output is again either good mail or spam mail. If invalid, the mail is moved to the spam or junk folder. If legitimate, the output is fed as input to the stage-three classifier. This stage labels the message as either good or spam after validating the sender's IP address, which is checked against the real-time blacklist of Spamhaus.org. If the received mail is marked as spam, it is moved to the spam or junk folder. Otherwise, the mail is legitimate and is delivered directly to the inbox.
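The control flow of the three stages can be sketched as below. The sketch rests on several assumptions: the keyword set is hypothetical (the original list lives in FilterKeywords.xml, whose contents are not shown), the embedded-image check of stage two is omitted, and the blacklist test is passed in as a callable named `is_blacklisted` so that no network access is needed. The helper `dnsbl_query_name` only builds the name that a conventional DNS blacklist lookup would resolve, using the Spamhaus zone `zen.spamhaus.org` as an assumed default; how the original system queried Spamhaus is not stated.

```python
import re

# Hypothetical keyword list; the source keeps its own list in FilterKeywords.xml.
KEYWORDS = {"verify", "account", "login", "update"}


def contains_keyword(text, keywords=KEYWORDS):
    """True if any word of the text is one of the filter keywords."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return bool(words & keywords)


def dnsbl_query_name(ip, zone="zen.spamhaus.org"):
    """Build the DNS name a DNSBL lookup would resolve for an IPv4 address."""
    return ".".join(reversed(ip.split("."))) + "." + zone


def classify(subject, body, sender_ip, is_blacklisted):
    """Return 'spam' at the first failing stage, otherwise 'good'."""
    if contains_keyword(subject):   # stage one: subject
        return "spam"
    if contains_keyword(body):      # stage two: content (image check omitted)
        return "spam"
    if is_blacklisted(sender_ip):   # stage three: sender's IP address
        return "spam"
    return "good"


print(classify("Verify your account", "", "192.0.2.1", lambda ip: False))
print(dnsbl_query_name("127.0.0.2"))
```

Passing the blacklist check in as a function keeps the stage order testable in isolation; a real deployment would replace the lambda with an actual lookup.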

Any number of mails can be checked for phishing. The user accounts can be configured for any mail server, such as Gmail or Yahoo; for example, Gmail is configured as imap.gmail.com. Many user accounts can be checked for the configured mail server. The accounts whose mails are to be checked are configured in the credentials.xml file. The user id and password are encoded and then stored in the credentials.xml file, separated by a semicolon. The folder to which the illegitimate mails are to be moved must also be specified for every user account; the folder name can be Spam, Junk, or any name convenient to the user. The user credentials are encoded for security reasons, using the encoder/decoder.exe file. The credentials.xml file has to be configured as shown in Figure 7.14.

Figure 7.14 User-ID and Password configuration

<?xml version="1.0" encoding="UTF-8"?>

<CredentialList>

<Credentials>cG9ubWFuaW1hbGxpa2FAZ21haWwuY29t;R29vZ2xlQDEyM

w==;Junk

</Credentials>

<Credentials>cG9ubWFuaWFubmFtYWxhaUBnbWFpLmNvbQ==;MXFheiFB

UVo===;Spam</Credentials>

<Credentials>vdGVzdEBnbWFpbC5jb20===;M2VrZGZtIUAj;Junk

</Credentials>

<Credentials>c2h5bmlkYWxpbmFAZ21haWwuY29t;cEAkMTIzDQpwQkMTI

zNDU2;Spam

</Credentials>

</CredentialList>

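Reading a file laid out like Figure 7.14 can be sketched as follows. The sketch assumes that the encoding is Base64, which the strings in the figure resemble but which the text does not state, and that every entry has exactly the three semicolon-separated fields user id, password and folder. The account below is invented for illustration, and the names `parse_credentials` and `encode_entry` are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def encode_entry(user, password, folder):
    """Build one 'user;password;folder' entry with Base64-encoded credentials."""
    def enc(value):
        return base64.b64encode(value.encode("utf-8")).decode("ascii")
    return f"{enc(user)};{enc(password)};{folder}"


def parse_credentials(xml_text):
    """Return (user_id, password, folder) tuples from a CredentialList document."""
    accounts = []
    for node in ET.fromstring(xml_text).iter("Credentials"):
        user_b64, password_b64, folder = node.text.strip().split(";")
        user = base64.b64decode(user_b64).decode("utf-8")
        password = base64.b64decode(password_b64).decode("utf-8")
        accounts.append((user, password, folder))
    return accounts


# Invented sample account, not taken from the figure.
sample = (
    "<CredentialList><Credentials>"
    + encode_entry("alice@example.com", "not-a-real-password", "Junk")
    + "</Credentials></CredentialList>"
)
print(parse_credentials(sample))
```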

A logger is also maintained. The logger is a console application, and the console displays all the details of the mails checked. Error messages such as “unable to connect to Gmail host” and “invalid userid or password” are shown in the console window. If there is no new mail, the message “Email box is empty or no new mails” is displayed. If a mail is invalid, the message “Mail with such subject is illegitimate and thus moved to spam” is written to the log, and all the details of the illegitimacy are grouped in the console. The details include the mail subject/content, the legitimacy check, and the spam information. The illegitimate mails are highlighted in red, and the legitimate mails are shown in the default color, as shown in Figure 7.15.

Figure 7.15 Received E-mail details

The results are then classified with supervised learning algorithms to verify that they are correct and to measure their accuracy.

7.5.3 Using Supervised Learning Algorithm

In the three-stage classifiers, fifteen features are considered for checking the subject header and the content of the e-mail. The fifteen features are listed below.


(i) Popup

Phishing attacks can be carried in e-mails if the attacker inserts any forms or links to compromised websites. The attacker may include scripts that create a popup and then load a form in that popup, to trick the user into entering sensitive data. Hence, the presence of a popup suggests that the mail may be an attempt to phish sensitive data.

(ii) Text “Verify Account”

If an e-mail is found to contain the text “Verify Account”, “Verify Email”, “Bank”, “Debit”, “fwd”, “reply”, “Click”, “Here”, “login”, “update” or any of their variants, it is worth checking the e-mail for further symptoms of phishing. The presence of these texts does not necessarily indicate a phishing attempt, but such texts are an easy way to lure people into clicking malicious links.

(iii) JavaScript

JavaScript is normally used to validate forms in websites. Its presence in an e-mail indicates that the e-mail is likely to be malicious, because JavaScript can change the text of a document and can thus be used to trick users in various ways.

(iv) onClick attribute

The onClick attribute can make an HTML element clickable and redirect the user to another URL, which the element would not normally do.


(v) Change of window status

The status text of the browser window can be changed through the window.status property in JavaScript. This can be used to give the user false information, for example by loading contents from other websites while showing the legitimate website’s address in the status bar.

(vi) IP address in URLs

Some phishing attacks are hosted on PCs infected with viruses or malware, and the only way to link to them is by using their IP address. Legitimate e-mails seldom use links with an IP address. A link in an e-mail whose host is an IP address (e.g., http://101.56.3.48/login.facebook.com/login) is therefore suspicious.

(vii) ReplyTo modification

The attacker may set the ‘replyto’ field of the e-mail to the e-mail address of the legitimate company, so that a reply from the user goes back to the legitimate company and the user does not become suspicious about the sender’s identity. Hence, it is important to check whether the sender address and the ‘reply to’ address differ. If they are from different domains, this helps in identifying phishing attempts.

(viii) Number of unique domains in URLs

Legitimate e-mails contain links to only one or two domains. If the number is high, the e-mail is probably an attempt to phish user data from the receiver.


(ix) Number of words in Subject

Most legitimate e-mails have fewer than five to ten words in their subjects. Hence, a large number of words in the subject indicates that the e-mail may be an attempt to phish sensitive data from the user.

(x) Richness of the Vocabulary

Phishing e-mails normally contain the same words in different forms. This reduces the richness of the content. The richness can be calculated by the type-token ratio, as shown in equation (7.3).

$$\mathrm{TTR} = \frac{\text{Types}}{\text{Tokens}} \times 100 \qquad (7.3)$$

Types: number of different word forms

Tokens: total number of words
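A small worked example may make the ratio concrete. The sketch below follows the usual convention for the type-token ratio, taking types as the distinct word forms and tokens as all the words in the text; the function name `type_token_ratio` and the tokenisation rule are assumptions for illustration only.

```python
import re


def type_token_ratio(text):
    """Percentage of distinct word forms (types) among all words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


# Seven tokens, four types: verify, your, account, now.
print(type_token_ratio("Verify your account verify your account now"))
```

A repetitive message therefore scores lower than one in which every word is different. The sketch counts exact word forms only; merging different forms of the same word would need a stemmer, which is not shown here.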

(xi) Number of Periods in URL

Legitimate URLs can also contain a number of dots, so dots alone do not make a URL a phishing URL. This feature is simply the maximum number of periods (‘.’) contained in any of the links present in the e-mail, and is a continuous feature.

(xii) Link in Image

Linking an image to a URL makes possible many of the deceptions seen in phishing attacks. It is difficult for a phisher to launch an attack with plain links, because the user is less likely to click a plain-text link. The attacker can instead lure the user into clicking an attractive image, and thus redirect the user to a phishing site.


(xiii) Number of hyperlinks

It denotes the total number of hyperlinks present in the content.

(xiv) Cascading Style Sheet (CSS)

It denotes the CSS applied in the content of the message.

(xv) Number of words in subject with at least fifteen characters

These features form the vector X = <F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, F11, F12, F13, F14, F15>, which can be used to differentiate between phishing and legitimate e-mails.
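A few of the listed features can be extracted with the Python standard library, as sketched below for features (vi), (vii), (viii), (xi) and (xiii). The sketch makes simplifying assumptions that are not part of the original system: links are found with a regular expression rather than an HTML parser, every distinct host name counts as a separate domain, and the function names `url_features` and `reply_to_differs` are hypothetical.

```python
import ipaddress
import re
from email.utils import parseaddr
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def host_is_ip(host):
    """True if the host part of a URL is a literal IP address."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def url_features(body):
    """Extract simple link-based features from the text of an e-mail body."""
    urls = URL_RE.findall(body)
    hosts = [urlparse(u).hostname or "" for u in urls]
    return {
        "ip_in_url": any(host_is_ip(h) for h in hosts),                # (vi)
        "unique_domains": len(set(hosts)),                             # (viii)
        "max_periods": max((u.count(".") for u in urls), default=0),   # (xi)
        "hyperlinks": len(urls),                                       # (xiii)
    }


def reply_to_differs(from_header, reply_to_header):
    """Feature (vii): sender and Reply-To addresses are in different domains."""
    from_domain = parseaddr(from_header)[1].rpartition("@")[2].lower()
    reply_domain = parseaddr(reply_to_header)[1].rpartition("@")[2].lower()
    return bool(reply_domain) and from_domain != reply_domain


body = (
    'Click <a href="http://101.56.3.48/login.facebook.com/login">here</a> '
    "or visit https://www.example.com/a.b"
)
print(url_features(body))
print(reply_to_differs("Bank <support@bank.example>", "help@other.example"))
```

Each value would become one component of the feature vector X; the remaining features, such as popups or CSS, need HTML inspection and are left out of this sketch.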

7.5.4 Evaluation

The experimental results of the three-stage classification model are based on two data sets. One data set is based on the subject header, and the other on the content of the e-mail. The test data set consists of 1535 e-mails. The classification algorithms, Decision Tree induction, Multilayer Perceptron, Naïve Bayes, Bayesian Network, and Radial Basis Function, are implemented and trained using WEKA. Weka 3.4.4 is an open-source, portable, GUI-based workbench that collects data pre-processing tools and state-of-the-art machine learning algorithms. 10-fold cross validation is used to evaluate the robustness of the classifiers. Prediction accuracy, measured as the ratio of the correctly classified instances to the total number of test cases in the test data set, is the primary performance measure. Prediction accuracy and training time are the two criteria used to evaluate the performance of the trained models. The prediction accuracies of the models are compared, and Table 7.2 gives the results of the five classifiers.
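The measures reported in the following tables can be related to one another through a two-class confusion matrix, as sketched below. The counts used here are invented for illustration and are not results from this study, and the sketch computes precision and recall for a single positive class, whereas a workbench such as WEKA may report class-weighted averages. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F-measure and Cohen's kappa from 2x2 counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # correctly classified / total test cases
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Agreement expected by chance, from the row and column totals.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = p_yes + p_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


# Invented counts: 40 phishing mails caught, 10 missed, 10 false alarms, 40 clean.
print(metrics(tp=40, fp=10, fn=10, tn=40))
```

With these counts the accuracy is 0.8 while the kappa statistic is only 0.6, because kappa discounts the agreement that would be expected by chance.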

Table 7.2 Performance analysis based on the subject for the five classifiers

Criteria for Evaluation                 Supervised Learning algorithm
                                        DT        MLP       NB        BN        RBF
Kappa Statistic                         0.7927    0.7868    0.7102    0.7097    0.7603
Mean Absolute Error                     0.1484    0.1265    0.1519    0.1607    0.1818
Relative Absolute Error (%)             29.6802   25.2909   30.3735   32.1486   36.3645
Root Relative Squared Error (%)         29.6802   25.2909   30.3735   32.1486   36.3645
Correctly classified instances (%)      89.6322   89.3419   85.4948   85.4706   88.0111
Incorrectly classified instances (%)    10.3678   10.6581   14.5020   14.5294   11.9889
Precision (%)                           89.7      89.3      87.9      87.8      88.6
Recall (%)                              89.6      89.3      85.5      85.5      88
F Measure (%)                           89.6      89.3      85.3      85.2      88
ROC Area (%)                            95.3      95        90.1      91.9      94.5
Time taken to build model (Sec)         0.13      0.21      0.17      0.24      1.16

The comparison graphs between the classifiers for precision, recall, F Measure and ROC for the subject in e-mails are shown in Figure 7.16. They show that the Decision Tree is the best algorithm for this set of features. Moreover, decision tree induction takes the least time to build the model and attains the highest precision among the five algorithms, as shown in Figure 7.17.

Figure 7.16 Comparison Graphs between classifiers for subject

[Bar chart: Precision, Recall, F Measure and ROC (% of value) for the classifiers DT, MLP, NB, BN and RBF]

Figure 7.17 Building time between the classifiers for subject

[Bar chart: time taken to build the model (sec) for the classifiers DT, MLP, NB, BN and RBF]

The 10-fold cross validation results of the five classifiers for the content of the e-mails are summarized in Table 7.3.

Table 7.3 Performance analysis based on content for the five classifiers

Criteria for Evaluation                 DT        MLP       NB        BN        RBF
Kappa Statistic                         0.9664    0.9618    0.9485    0.9281    0.9035
Relative Absolute Error (%)             5.0333    3.85      5.5438    7.5845    115.3835
Root Relative Squared Error (%)         25.4202   27.6015   30.7141   36.6595   39.2758
Correctly classified instances (%)      98.3184   98.0886   97.4232   96.407    95.173
Incorrectly classified instances (%)    1.6816    1.9114    2.5768    3.593     4.827
Precision (%)                           98.3      98.1      97.4      96.4      95.2
Recall (%)                              98.3      98.1      98.3      97.3      95.2
F Measure (%)                           98.3      98.1      97.5      96.4      95.2
ROC Area (%)                            98.4      98.2      98.4      98.6      97.6
Time taken to build model (Sec)         0.13      0.23      0.16      0.18      0.71


The comparison graphs between the classifiers for precision, recall, F Measure and ROC for content in e-mails are shown in Figure 7.18. The figure also shows that the Decision Tree is the best algorithm for this set of features. Moreover, decision tree induction takes the least time to build the model and attains the highest precision among the five algorithms, as shown in Figure 7.19.

Figure 7.18 Comparison Graphs between classifiers for the content

[Bar chart: Precision, Recall, F Measure and ROC (% of accuracy) for the classifiers DT, MLP, NB, BN and RBF]

Figure 7.19 Building time between the classifiers for the content

[Bar chart: time taken to build the model (sec) for the classifiers DT, MLP, NB, BN and RBF]

The 10-fold cross validation results of the five classifiers for the subject and content together of the e-mails are summarized in Table 7.4.

Table 7.4 Performance analysis based on the subject and content for the five classifiers

Criteria for Evaluation                 DT        MLP       NB        BN        RBF
Kappa Statistic                         0.9472    0.9472    0.9296    0.9693    0.9112
Relative Absolute Error (%)             9.6891    9.1745    9.7541    3.6701    16.2106
Root Relative Squared Error (%)         31.3747   31.2648   36.1425   23.5682   40.053
Correctly classified instances (%)      97.3627   96.1472   96.4796   98.4636   95.5601
Incorrectly classified instances (%)    2.6373    3.8528    3.5204    1.5364    4.4399
Precision (%)                           99.4      97.4      96.5      98.5      95.6
Recall (%)                              99.3      94.4      95.5      96.5      94.6
F Measure (%)                           99.2      97.4      96.5      98.5      95.6
ROC Area (%)                            99.4      97.8      97.7      99.4      97.2
Time taken to build model (Sec)         0.03      0.18      0.04      0.11      0.55


The comparison graphs between the classifiers for precision, recall, F Measure and ROC for the combination of the subject and content in e-mails are shown in Figure 7.20. This shows that the Decision Tree is the best algorithm for this set of features. The comparison graph for the time taken to build the model for the subject, content, and both combined, is shown in Figure 7.21.

Figure 7.20 Comparison graph between the classifiers for the subject and content

[Bar chart: Precision, Recall, F Measure and ROC (% of value) for the classifiers DT, MLP, NB, BN and RBF]

Figure 7.21 Building time between the classifiers for the subject and content

[Bar chart: time taken to build the model (sec) for the classifiers DT, MLP, NB, BN and RBF; series: time taken for subject, time taken for content, time taken for subject & content]

7.5.5 Performance Comparison with the Existing Techniques

The proposed e-mail phishing filter (Mail Sieve) is compared with the existing tools Mail Washer and G-Lock Spam Combat. The analysis of their performance shows that both existing tools also process mails that have already been read, as shown in Table 7.5.

Table 7.5 Comparative analysis with the existing tools

S. No.   Features                          Mailwasher   G-lock spam   Mail sieve
1        Process read mails                Yes          Yes           No
2        Process unread mails              Yes          Yes           Yes
3        Read mails by message subject     Yes          Yes           Yes
4        Mails marked as spam by user      Yes          Yes           No
5        Mails marked as spam default      No           No            Yes
6        Mails moved to spam               Yes          No            Yes
7        Mails moved to trash              No           Yes           No
8        Log reader                        Yes          Yes           Yes
9        Time consumption                  High         Medium        Low
10       User friendliness                 Good         Better        Better

The end user generally does not worry about the read mails; she/he wants to know the quality of the unread mails only. The system reads the unread mails and moves the illegitimate ones directly to spam or junk without user intervention, which neither of the two tools does. Both tools ask the user to mark a mail as good or bad and update their filters based on the user’s decision.


The Mail Sieve is also compared with existing techniques, namely multi-tier phishing detection (Rafiqul Islam and Jemal Abawajy 2013), soft computation based imputation (Kancherla et al 2012) and phishing detection using CRF and LDA (Venkatesh and Harry 2013), in terms of accuracy, as shown in Table 7.6.

Table 7.6 Comparison with the existing approaches based on accuracy

S. No.   Different existing approaches   Accuracy (%)
1        Multi-tier                      97
2        Soft computation                82.46
3        CRF & LDA                       98.8
4        Mail sieve                      99.4

Table 7.6 shows that the Mail sieve approach (proposed system) is more accurate than the existing techniques. The comparison graph is shown in Figure 7.22.

Figure 7.22 Comparison graph with the existing approaches based on accuracy

[Bar chart of accuracy (% of value) for the different approaches: Multi-tier 97, Soft computation 82.46, CRF & LDA 98.8, Mail sieve 99.4]

7.6 SUMMARY

Secure bank transactions are achieved in the server side phishing filtering techniques by implementing effective one-time password and watermarking mechanisms.

The one-time URL mechanism used against session hijacking provides more secure communication for the individual and the organization.

The Session Fixation preventer system achieved high accuracy in preventing session fixation by the attacker.

The Three-Stage e-mail filtering technique achieved 99.4% accuracy in filtering phishing e-mails.

To produce efficient results over e-mail and offer a better e-mail process, features are selected from different e-mails, and the classification algorithms are used to achieve low false negative and false positive rates.
