CHAPTER 7
SERVER SIDE PHISHING FILTERING TECHNIQUES
7.1 INTRODUCTION
The client side anti-phishing techniques were discussed in the previous
chapters. This chapter looks into the server side techniques. Servers are
favoured targets, since they hold the credentials of many users. Because
phishing attacks are getting trickier, the client information submitted to the
server cannot always be trusted. In addition to stealing the user’s password,
phishers also steal more sensitive information by imitating a successful
authentication. They may even check the validity of the password by
forwarding it to the legitimate server, or hijack the user’s login session
with a man in the middle attack. A phishing attack is usually carried out
through an email or an instant message that lures recipients to a fake
website, where they reveal their personal credentials.
A number of countermeasures have been proposed and developed
for protecting websites against phishing attacks. Server-side protection uses
SSL certificates, user selected logos and other security indicators to help
users verify the legitimacy of websites. When phishing is carried out via
email, the attacker sends out a large number of messages that appear to come
from a genuine source, such as a trusted business or financial institution. The
emails include urgent requests for personal information. Typically, the
phisher claims that an account must be updated immediately. The email
message provides a link to an authentic-looking website, where users enter
their genuine information. The personal information provided to this site,
however, goes directly to the party committing the phishing attack, and not to
the imitated legitimate business.
The term 'server side' here does not mean only the servers in
financial institutions. It also includes the servers running throughout the
network. Though many methods exist for protecting users’ information, many
individuals and organizations still lose their information. The literature
reveals that the existing server side anti-phishing techniques do not identify
phishers reliably. Thus, there exists a large gap between what the server side
techniques deliver and the secrecy that users expect. Considering the
importance of the user’s personal information, there is a need for better
phishing prevention techniques that protect the secrecy of the information
for everyone who uses the website.
Preventing phishing with more advanced server side methods is a
promising way to resolve this problem. In this work, to keep the personal
information of users secure, users should be able to notice when the step by
step authentication process does not proceed as expected, at which point they
should stop the process and not give their personal information. To meet these
requirements, four server side techniques are introduced: the one-time password
mechanism, the watermarking mechanism, preventing phishing through session
hijacking, and e-mail phishing filtering. In the one-time password mechanism,
the password becomes available only after the first step of authentication. In
the watermarking mechanism, the specified watermark must be present during
authentication. The one-time URL method is applied against session hijacking,
and e-mail filtering methods are implemented against e-mail phishing.
7.2 ONE-TIME PASSWORD MECHANISM
A password that is valid for only one session or transaction is
known as a One-Time Password (OTP), and it helps in avoiding the risks of
traditional static passwords. In this system, a password becomes available
only after a secret code has been generated. Users are authenticated with an
encrypted security code delivered on demand via a reliable communication
channel. The user database at the server side maps each user’s name to the
corresponding identity on that second communication path. When a user wants to
access the website, the server sends an encrypted security code to the user
through this channel. On receipt of the encrypted security code, the user has
to decrypt it and enter it at login. The security code is encrypted with the
private key and decrypted with the public key. The decryption is done by the
user.
The admin process, consisting of registration, involves the
following steps. The user must choose a login name, fill in all the required
information fields, and provide at least one type of personal contact
information (e-mail address or mobile number). The website should list all
the services that it uses to deliver the security code, so that the user can
choose the preferred service. The use of a security question is not mandatory;
it depends on the website provider’s policy or the user’s wish. However, such
questions make the authentication process more secure. The proposed
system is shown in Figure 7.1. The steps are as follows:
(i) The validation page is sent to the customer. The page contains
the login name used by the website.
(ii) If the customer’s login name is new to the website, the
customer is asked for permission to add the login name to the
website’s contact list.
(iii) After the login has been approved by both the website and the
customer, the website sends an account validation message to
the user via the designated communication channel.
Figure 7.1 One-time password system for preventing a phishing attack
Next, the user starts the actual login process by browsing to the login
page, which contains an input field for the customer’s login name and a
CAPTCHA test. If the user’s login name is not recognized by the website, an
error page must be displayed. If the user’s account name is valid, the website
looks up the customer’s registered account and sends an acknowledgement to
that account. If the acknowledgement message is valid, the customer enters
the assigned security code on the input page. On receipt of the security code,
the website has to make sure that the customer submits the security code from
exactly the same IP address from which the login was requested.
[Figure 7.1 labels: client, server, user registration, login, decryption,
generate security code, DB, user name and CAPTCHA code, acknowledgement,
encrypted code, security code.]
7.2.1 Implementation Process
This solution can be implemented only on the web server’s side. If
the system is to offer a practical defence against phishing attacks, it must
impose minimal overhead, since a solution that significantly slows the web
browsing experience is unlikely to be adopted. The first login page, shown in
Figure 7.2, contains an input field for the user’s login name and the CAPTCHA
test. If the user’s login name is not valid, an error message is shown. If the
user’s name is valid, the website checks the user’s registered account and
sends an acknowledgement to that user.
Figure 7.2 First login page
Next, the customer enters the assigned security code on the input
page, as shown in Figure 7.3. On receipt of the security code, the website
checks whether the user has submitted a valid security code. If it is not
valid, an error message is displayed; the user may enter a wrong security code
at most n times.
Figure 7.3 Identification of the user name
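The server side checks described above (a short lived security code, a limit of n wrong attempts, and submission from the same IP address that requested the login) can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the class name SecurityCodeStore, the 30 second lifetime, the limit of three attempts and the six digit code format are all assumptions, and the encryption and delivery of the code over the second channel are left out.

```python
import hmac
import secrets
import time

CODE_LIFETIME_SECONDS = 30  # assumed; the text only says the code lives a few seconds
MAX_ATTEMPTS = 3            # assumed value of n


class SecurityCodeStore:
    """Hypothetical in-memory store of pending security codes."""

    def __init__(self, clock=time.time):
        self._clock = clock
        # login name -> [code, requesting IP, expiry time, attempts left]
        self._pending = {}

    def issue(self, login_name, client_ip):
        """Create a fresh six digit code bound to the requesting IP address."""
        code = f"{secrets.randbelow(10**6):06d}"
        expiry = self._clock() + CODE_LIFETIME_SECONDS
        self._pending[login_name] = [code, client_ip, expiry, MAX_ATTEMPTS]
        return code

    def verify(self, login_name, client_ip, submitted_code):
        """Accept the code once, from the same IP, before it expires."""
        entry = self._pending.get(login_name)
        if entry is None:
            return False
        code, issued_ip, expiry, attempts_left = entry
        if self._clock() > expiry:
            del self._pending[login_name]  # expired codes are discarded
            return False
        if client_ip != issued_ip:
            return False  # submitted from a different address than the login request
        if hmac.compare_digest(code, submitted_code):
            del self._pending[login_name]  # a code is valid for one login only
            return True
        entry[3] = attempts_left - 1
        if entry[3] <= 0:
            del self._pending[login_name]  # n wrong attempts used up
        return False
```

A caller would invoke issue() when the login name and CAPTCHA have been accepted, deliver the returned code over the registered channel, and call verify() when the input page is submitted.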
7.2.2 Security Analysis of One-Time Password Mechanism
Besides phishing, this system also resists several other attacks that
trouble websites:
(i) Denial of service attack
A denial of service attack can be launched against any part of the
Internet connectivity and network infrastructure. In the proposed solution, the
website authenticates the customer by asking him/her to input the security
code already assigned by the website. The customer authenticates the website
by first checking the sender of the acknowledgement message.
(ii) IP Spoofing
In IP spoofing, the attacker fakes the source IP address so that the
attack packets appear to originate from the target computer’s own address,
which can cause an operating system such as Windows to crash or lock up. The
proposed solution restricts the locations from which IP spoofing attacks can
be launched. If the attacker uses the same IP address as the user in the
same local network concurrently, the user can detect it. The lifetime of the
security code is only a few seconds, so it is not feasible for the attacker to
log in to the protected website via the same IP address.
(iii) Server spoofing
On Windows 95 stations, an attacker can run the C2MYAZZ utility to
request LANMAN authentication from the client, acting like the server during
the user login session. If the attacker succeeds in tricking the client, he
can read the user’s login details from the network packets. The proposed
solution does not require a preset password to log in, thereby avoiding
password theft.
(iv) Man in the middle attack
An attacker may watch a session being opened on a network. Once
authentication is over, he might attack the client system to disable it, and
use IP spoofing to claim to be the client who was just authenticated and take
over the session. In the proposed solution, suppose the attacker discovers both
the customer’s web account name and the security code for the current session.
Since the life span of the security code is very short, it would be of little
use to the attacker.
This approach is intended for websites requiring a high
level of security, and it would ultimately help in retaining the customer’s
confidence in web-based commerce. Compared with the consequences of
phishing, the increase in response time, which is of the order of
milliseconds, is negligible. The approach prevents phishing more effectively
than the earlier techniques. However, this basic check does not stop phishing
sites completely. So, the next technique is implemented with watermarking.
7.3 WATERMARKING MECHANISM
The purpose of a watermark is to identify a work and prevent its
unauthorized use. A visible watermark is a common way of identifying images
and protecting them from unauthorized use online. Here, the watermark message
is intended to be distinct for every user, and carries a secret shared between
the company and the user in order to stop attacks like phishing. The proposed
system is shown in Figure 7.4.
Figure 7.4 System for preventing phishing attack using watermarking
[Figure 7.4 labels: client, server, user registration, login, application,
auto generation of secret code, show the code to user, encrypt the code, DB,
retrieve the secret code based on user ID, decrypt the code, show as a
watermark, increase credibility.]
Here, the client enters the URL to view the required webpage of
a particular web server. To increase the credibility of the website, the
client machine’s current date and time are displayed in the client browser.
Usually, when a phishing attack occurs, the page is redirected during the
money transaction. When clients need to enter personal details, such as an
online banking password or ATM PIN, they need to log in. After logging in, the
client may not know whether he is on the correct page or not. This is where
the watermarking technique plays a major role, by giving the webpage a high
level of credibility. After logging in, and before giving any personal
details, the user can check the credibility of the webpage. This check is made
possible by the watermarking mechanism, as shown in Figure 7.5.
Before logging in to a commercial website, the user can see his/her
machine’s date and time in the logo, which is initiated from the server. After
logging in, if the user places the cursor over the logo, the secret code is
displayed. This secret code is user dependent and is stored in the
server database. If the user places the cursor at the top of the web page, the
user’s name is displayed.
However, an attacker may hack the server database to get the
secret code of the user, and may show the watermark on the fake website to
make it look legitimate. To avoid this problem, the server encrypts the
secret code with a symmetric key encryption algorithm before storing it in
the database.
This algorithm converts the secret code into an encrypted
format that cannot be understood by humans. When the user logs in to the
website, the secret code of that particular user is fetched from the
database and decrypted into a human readable format with the symmetric
key decryption algorithm.
Figure 7.5 Watermarking system
This encrypted watermarking mechanism is more secure than the
previous ones, since the date and time are initiated from the server and the
secret code is displayed only after decryption. Even if the attacker hacks the
server database, he/she cannot read the secret code. The technique is simple
and easy to deploy, and at the same time efficient in preventing phishing
attacks.
Here, the main advantage is that the server sends only the encrypted
secret code to the client, and the code is decrypted on the client side. This
prevents man in the middle attacks.
Some of the existing watermarking techniques are costlier
than this method, since they need additional software, such as
ImageMagick. This approach is platform independent, and places no load on
the client side, such as the use of an additional tool. However, this basic
check does not prevent all phishing attacks. So, the next technique addresses
session hijacking.
7.4 PREVENTING PHISHING THROUGH SESSION HIJACKING
Session hijacking is the method by which an attacker gains access to
the user's session by obtaining the session ID and acting like the authorized
user. This gives the attacker free access to do anything in the
network that the legitimate user is allowed to do.
Normally, a cookie or the URL stores the session ID, and the
authentication procedure is carried out only at the initial setup time; the
hijacker takes advantage of this by intruding into the session in real
time. Session hijacking can be a reason behind an unexpected response from
the website, or no response to the user’s inputs.
A web based detection and prevention mechanism called the Session
Hijacking Attack Prevention System (SHAPS) is used to prevent session
hijacking. It fetches all the requests and responses, and validates them for
session hijacking. It prevents session fixation by validating the host name, IP
address and session ID; mismatched sessions are invalidated.
The Universally Unique Identifier (UUID) and Time Stamp (TS)
are used to generate the URL for short lived services, and the URL also acts
as a gateway for every request. The Secure Hash Algorithm (SHA-1) is used
in a non-static web session to prevent session hijacking, by dynamically
creating session identifiers, which are further used to weed out the phishing
website.
The Session Hijacking Attack Prevention System has two parts.
The first part is the normal client server communication, and
the second part contains the phases of the SHAPS. Each phase contains web
services for preventing a session hijacking attack. The repository stores the
necessary information about the client’s request, so that it can be retrieved
later to validate the client’s request.
The architecture of the system is described in terms of its
components and their interactions. A component is a part of the system that
performs a well defined function and interacts with other
components through its interface. To prevent the session hijacking
attack in a web application, the client’s request should be
captured and checked against the attacker’s session ID through different types
of web services.
The SHAPS phases comprise three web services, namely, the session fixation
preventer web service, the one time URL web service and the non static web
session creator web service. Each is a separate web service that detects and
prevents the session hijacking attack in web applications.
7.4.1 Session Fixation Preventer Web service
Normally, a session ID is issued by the web server to the user session. This
approach, however, ignores one very important issue: an attacker can
issue a session ID to the client’s session, thereby forcing the
client to use a session chosen by the attacker. This class of attack is called
session fixation. It is one of the session hijacking attacks, because the
user's session ID has been fixed by the attacker in advance, instead of being
generated randomly at login time. Figure 7.6 describes the session fixation
attack preventer web service.
Figure 7.6 Session Fixation Attack Preventer Web Service
[Figure 7.6 labels: client, web server, web service, database, request (Req),
response (Res), fetching data from request, validating fetched request,
regenerated session ID, original session data.]
In a normal client and server environment, the client sends a
request through the browser, and the server responds to the browser with the
web application. Because HTTP is a stateless protocol, a session is maintained
for reliable communication between the client and the server for each user.
Each session has a unique session identifier, which serves as the
identification key of the user. In a session fixation attack, the attacker
sends the client a link that carries a session ID selected by the attacker. If
the targeted client clicks the link as a request, the attacker’s session is
fixed to the client session in a weakly developed web application. The
attacker already knows the fixed session ID; hence, the attacker can enter the
client’s account without providing the user name and password, and keeps
access until the client signs out of the session. The web service in
Figure 7.6 provides a solution to this problem for weakly developed web
applications: it intercepts a request carrying an attacker fixed session ID
before that ID is bound to the client session.
After a successful login, the session ID generated by the web server is
displayed in an alert message box. This ID is used to distinguish the
different user sessions accessing the same server. An alert message is also
displayed on the screen when the web service blocks an attacker fixed session,
when the session ID is mismatched, or when no session ID has been generated.
When the session fixation preventer web service is applied to a weakly
developed web application, each request is fetched by the web service, and
its essential information is stored in the repository. The essential
information is the session ID, IP address, date, time, and host name of the
request. Using this information, the validation service checks whether the
requested URL arrives with the same session ID and the same host name but from
a different IP address within a short period of time. If so, the validation
service identifies it as an attacker session. The service then allows the
request to reach the web server only after login, and the attacker fixed
session ID is regenerated and stored in the repository. Now, the
attacker cannot attack the client session by fixing the attacker’s session ID
to the client session. Figure 7.7 shows the XML structure of the fixation
preventer service.
Figure 7.7 XML Structure of the fixation preventer service
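The validation rule of the session fixation preventer (same session ID and host name, different IP address, within a short period) can be sketched as below. This is a minimal illustration under stated assumptions, not the web service itself: the names RequestRecord and FixationPreventer, the 60 second window and the 32 character replacement ID are hypothetical, and the repository is a plain in-memory list.

```python
import secrets
from dataclasses import dataclass

WINDOW_SECONDS = 60  # assumed length of the "short period of time"


@dataclass
class RequestRecord:
    """Essential information that the text says is stored per request."""
    session_id: str
    ip_address: str
    host_name: str
    timestamp: float


class FixationPreventer:
    """Hypothetical stand-in for the session fixation preventer web service."""

    def __init__(self):
        self.repository = []  # earlier RequestRecord objects

    def is_fixation_attempt(self, record):
        # Same session ID and host name, but a different IP address,
        # seen within the time window, is treated as an attacker session.
        for earlier in self.repository:
            if (earlier.session_id == record.session_id
                    and earlier.host_name == record.host_name
                    and earlier.ip_address != record.ip_address
                    and abs(record.timestamp - earlier.timestamp) <= WINDOW_SECONDS):
                return True
        return False

    def handle(self, record):
        """Return (session ID to continue with, whether fixation was detected)."""
        suspicious = self.is_fixation_attempt(record)
        if suspicious:
            # Regenerate the session ID so that the fixed one becomes useless.
            record = RequestRecord(secrets.token_hex(16), record.ip_address,
                                   record.host_name, record.timestamp)
        self.repository.append(record)
        return record.session_id, suspicious
```

In this sketch the regenerated ID is what gets stored in the repository, so a later request that reuses the attacker fixed ID no longer matches the client session.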
7.4.2 One time URL Web Service
The main goal of the one time URL is to provide security to short
lifetime services, like transaction services, account activation services and
password reset services. The one time URL is valid for only one access.
It is generated by the web server and the web service from a UUID and a
time stamp, as shown in Figure 7.8.
<?xml version="1.0" encoding="UTF-8"?>
<service name="FixationService" scope="application"
         class="fixation.SessionFixationServiceLifeCycle">
    <description>
        FixationService
    </description>
    <messageReceivers>
        <messageReceiver mep="http://www.w3.org/2004/08/wsdl/in-out"
                         class="org.apache.axis2.rpc.receivers.RPCMessageReceiver"/>
    </messageReceivers>
    <parameter name="ServiceClass">
        Fixation.sessionFixation
    </parameter>
</service>
Figure 7.8 One-time URL web service
The one time URL is generated in a web application for accessing
sensitive information, or services such as money transfer, account
activation or the reset of secret details. These services are highly
confidential, and must be protected against outside users and attackers. The
client server communication environment records the number of requests and
responses exchanged by the client and server in a session. A SOAP object is
used to make an HTTP request; the header field of the SOAP object is normally
optional. Here, the Universally Unique Identifier and Time Stamp are placed in
the SOAP header field to secure the client request. The same UUID is never
assigned to different clients, though the same TS may be. However,
the lifetime differs, based on the request, as shown in Figure 7.9.
[Figure 7.8 labels: client, web server, web service, database, request (Req),
response (Res), requested for service, one time URL creation (UUID + TS),
one time URL, validation (handler), original request information.]
Figure 7.9 XML structure of a transaction service
A URL can also be created manually, but an attacker cannot access
anything in the system with a manually created URL. The handler contained in
the one time URL web service acts as a security gateway
through which every client request passes. The handler
evaluates each user request against the one time URL. If the incoming
URL has already been used, the handler blocks the request for the particular
service, invalidates the transaction session, and advises the client to start
again. A correct one time URL has a short lifetime, within which the client
has to complete the operation. If the client is idle or accesses the service
slowly, the URL expires. These two security mechanisms, single use and a short
lifetime, are included in the one-time URL to secure the short lifetime
services in a weakly developed web application.
<?xml version="1.0" encoding="UTF-8"?>
<serviceGroup>
    <service name="TRANSACTION SERVICE">
        <parameter name="ServiceClass" locked="false">com.example.TransactionServices</parameter>
        <operation name="aspectWithdraw">
            <actionMapping/>
            <messageReceiver class="org.apache.axis2.receivers.RawXMLINOutMessageReceiver"/>
        </operation>
    </service>
</serviceGroup>
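The two mechanisms of the one time URL, single use and a short lifetime, can be sketched with a small handler. This is an illustration only: the text places the UUID and time stamp in the SOAP header, whereas this sketch carries them in the query string for brevity, and the class name OneTimeUrlHandler, the parameter names uuid and ts, and the 120 second lifetime are assumptions.

```python
import time
import uuid
from urllib.parse import parse_qs, urlparse

URL_LIFETIME_SECONDS = 120  # assumed; the text only says the lifetime is short


class OneTimeUrlHandler:
    """Hypothetical gateway that issues and validates one time URLs."""

    def __init__(self, base_url, clock=time.time):
        self._base_url = base_url
        self._clock = clock
        self._issued = {}  # UUID token -> time stamp at creation

    def create(self):
        """Build a URL from a fresh UUID and the current time stamp."""
        token = uuid.uuid4().hex
        ts = int(self._clock())
        self._issued[token] = ts
        return f"{self._base_url}?uuid={token}&ts={ts}"

    def validate(self, url):
        """Accept a URL only if it was issued here, is unused and has not expired."""
        query = parse_qs(urlparse(url).query)
        token = query.get("uuid", [""])[0]
        ts = query.get("ts", [""])[0]
        # pop() removes the token, so a second request with the same URL fails.
        issued_at = self._issued.pop(token, None)
        if issued_at is None or str(issued_at) != ts:
            return False  # manually created or already used URL
        return self._clock() - issued_at <= URL_LIFETIME_SECONDS
```

A manually created URL fails because its token was never issued, and a slow or idle client fails the lifetime check, which mirrors the behaviour described for the handler.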
7.4.3 Non Static Web Session Creator Web Services
Popular websites such as Hotmail, Gmail and Yahoo are preferred targets
for session hijacking. Attackers constantly try to capture the cookie or
session ID to access the system using the victim’s identity.
This vulnerability is basically due to the use of a static session ID. This
module prevents hijacking by using a non-static session ID, thereby making
the captured session IDs useless to the attacker, as shown in Figure 7.10.
Figure 7.10 Non static web session creator web service module
[Figure 7.10 labels: client, web server, web service, database, request (Req),
response (Res), cookies of user request, cookies storage and mapping,
original cookies, non-static session creation, dynamic ID.]
This system provides non static session IDs instead of static session
IDs. The idea is that each HTTP request must use a different session ID, to
provide protection from session hijacking attacks. A method is designed
which generates the dynamic session ID using the Secure Hash
Algorithm (SHA-1), which produces a 160-bit value. This algorithm takes an
input of variable length and produces an output of fixed length, called the
message digest.
The non-static web session creator creates a dynamic ID for each
request based on the SHA-1 algorithm. In the authentication
phase, the client requests a bank web page, and the server
sends the client a login page such as
https://www.chennaibank.com/login.jsp. The user enters the username and
password; once the server has validated them, it generates
a static session ID and updates the repository located in the server’s
memory. The static ID is an identifier for the user session. At the same time,
the web service generates a secret key that is known only to the server and
the client. The static session ID, the secret key, the client time and the
user name are stored. In each subsequent request, the web service generates a
dynamic ID for the requesting user to access the particular website.
The system designed a method which generates the dynamic
session ID from a secret key, a static session ID, and the client’s session time,
by using the equations (7.1) and (7.2).
The dynamic session ID is made up of two parts, A and B, which are
computed from equations (7.1) and (7.2); here + denotes appending one value
to another:

\[ A = \mathrm{Hash}(\text{Secret Key} + \text{Client Time}) \tag{7.1} \]
\[ B = \text{Static ID} + \text{Client Time} \tag{7.2} \]
\[ \text{Dynamic ID} = A + B = \mathrm{Hash}(\text{Secret Key} + \text{Client Time}) + \text{Static ID} + \text{Client Time} \]

Verifying a dynamic ID can be done in the six steps given in Figure
7.11.
Figure 7.11 Algorithm for verifying dynamic ID
1. Read the static ID from part B of the dynamic ID, and use it to
look up the secret key and user name in the repository.
2. Read the client time from part B of the dynamic ID, and append
it to the secret key (from step 1).
3. Calculate Hash (secret key + client time).
4. Read part A of the dynamic ID and compare it with the result of step 3.
5. If the comparison in step 4 is a match, the session belongs to the
correct requesting user. If it is not a match, the session is incorrect.
6. If the session in step 5 is correct, verify the client time. If the client
time is different from the server time, the web service is
alerted and the session is terminated.
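Equations (7.1) and (7.2) and the six verification steps can be sketched as follows. This is a minimal illustration under several assumptions that the text does not state: the parts are joined with a "|" delimiter so that they can be read back, the client time is an integer number of seconds, the client time may differ from the server time by up to 30 seconds, and the repository is a dictionary mapping the static ID to the secret key and user name.

```python
import hashlib
import hmac

SEP = "|"            # assumed delimiter between the parts of the dynamic ID
MAX_CLOCK_SKEW = 30  # assumed tolerance in seconds between client and server time


def make_dynamic_id(secret_key, static_id, client_time):
    """Dynamic ID = A + B, with A from equation (7.1) and B from equation (7.2)."""
    a = hashlib.sha1(f"{secret_key}{client_time}".encode()).hexdigest()
    b = f"{static_id}{SEP}{client_time}"
    return f"{a}{SEP}{b}"


def verify_dynamic_id(dynamic_id, repository, server_time):
    """Return the user name if the dynamic ID is valid, otherwise None."""
    try:
        a, static_id, client_time_text = dynamic_id.split(SEP)
        client_time = int(client_time_text)
    except ValueError:
        return None  # malformed dynamic ID
    # Step 1: look up the secret key and user name by the static ID.
    entry = repository.get(static_id)
    if entry is None:
        return None
    secret_key, username = entry
    # Steps 2 and 3: append the client time to the secret key and hash it.
    expected = hashlib.sha1(f"{secret_key}{client_time}".encode()).hexdigest()
    # Steps 4 and 5: compare the result with part A of the dynamic ID.
    if not hmac.compare_digest(a, expected):
        return None
    # Step 6: reject the session if client and server times disagree.
    if abs(server_time - client_time) > MAX_CLOCK_SKEW:
        return None
    return username
```

Because the client time changes with every request, each request carries a different dynamic ID, so a captured ID is of little use once the time check fails.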
7.4.4 Performance Comparison
Table 7.1 shows the results of evaluating SHAPS with different
types of vulnerable inputs. The performance of the developed system is
analysed based on the response time, with the prevention services as well as
without them. The response times for the session fixation preventer, one time
URL, and non static web session creator were measured
independently with ten different samples of vulnerable links and scripts.
Table 7.1 Comparative assessment based on response time of the session
hijacking prevention services (response time in milliseconds, without and
with the prevention service)

               Session fixation       One time URL           Non static web
               preventer                                     session creator
Test no.       Without    With        Without    With        Without    With
1              25         43          23         90          134        457
2              21         78          36         110         35         80
3              27         43          75         99          38         60
4              13         18          52         105         41         82
5              26         16          56         80          78         84
6              22         27          43         111         35         55
7              27         31          45         85          30         48
8              19         23          16         65          16         59
9              26         30          53         99          31         200
10             28         35          37         74          53         135
Average        23.4       34.4        43.6       91.8        49.1       126.0
Even though the response time is higher with this filtering system, the
system provides more security than the existing techniques. By preventing
session hijacking, phishing can be avoided efficiently. The next server side
technique addresses e-mail phishing.
7.5 E-MAIL PHISHING
A phishing email attempts to fraudulently acquire personal
information, such as an account password or credit card information. The
email appears to come from a legitimate source, but actually does not. Many
e-mail tools, as well as most browser tools, apply lists to classify “good”
(whitelists) and “bad” (blacklists) sources or senders. Typically, the
blacklists block the IP address of the e-mail (SMTP) server, the sender
domain, or even the whole e-mail address domain of a sender. Blocking the IP
address or domain can cause problems when the sender uses an SMTP server of
any provider, and blocking the whole sender’s email address domain can be
ineffective, because the source address could be forged (Cleber et al 2011).
In this approach, a robust three-stage classification model is used, which
can be implemented on web servers to automatically detect phishing messages
and discover the entity impersonated in those messages. The approach, called
Mail Sieve, combines soft computing and the three-stage classification model
for filtering phishing emails. Here, the
existing classifier algorithms are rearranged into a multi-tier classification
process to classify phishing email and to find the optimum arrangement.
Moreover, supervised classification techniques, a major stream of data
mining, are used to assess the severity of the phishing attacks.
The supervised learning algorithms used are Decision Tree induction,
Multilayer Perceptron, Naïve Bayes, Bayesian Network and Radial Basis
Function classification.
7.5.1 Clustering the Values
In this approach, the email messages are initially parsed using the
Chilkat Reader. The parsed mails are treated as texts, which are then
validated for legitimacy. First, the texts are checked for missing values by
the soft computing technique. The missing values are replaced by standard
terms using the K-Cluster algorithm. K-means is one of the simplest
learning algorithms that solve the well-known clustering problem. The given
data set is partitioned into a certain number of clusters (k). One center is
defined for each of the k clusters, and is placed so that it can replace the
related and missing values. The process associates each point of the
given data set with its nearest center. A loop is then run in which the k
centers change their location step by step, until no more changes occur and
the target values have replaced the missing values. The
algorithm minimizes an objective function, the squared error function, which
is given by (www.deib.polim.it):
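\[ J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} \left( \lVert x_i - v_j \rVert \right)^2 \]
where,
\( \lVert x_i - v_j \rVert \) is the Euclidean distance between \( x_i \) and \( v_j \),
\( c_i \) is the number of data points in the i-th cluster, and
\( c \) is the number of cluster centers.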
The algorithm is given in Figure 7.12.
Figure 7.12 Algorithm for K-Cluster
Initially, the values are grouped into k clusters. For example,
end-user, user, client and similar words related to the customer are grouped
into a cluster named the customer cluster. Banking terms like bank, banking
and finance are grouped into another cluster, called bank. For each
cluster, a center point or value is fixed, termed the centroid or target value.
Then, the values in the cluster are replaced by the target value until no
more replacements remain. The work flow is shown in Figure 7.13.
After replacing the missing values, the e-mail is checked by the
three stage classification model. In stage one, the mail is checked for
legitimacy in the subject; in stage two, in the content;
and in stage three, in the sender’s IP address.
1. Place k points in the space represented by the objects that are being
clustered. These k points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all the objects have been assigned, recalculate the positions of the
k centroids.
4. Repeat Steps 2 and 3 until the centroids stop moving. This separates the
objects into groups, from which the metric to be minimized can be
calculated.
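The four steps can be sketched in Python as follows. This is a minimal illustration and not the implementation used in this work: the function names, the random choice of initial centroids, the `max_iter` limit and the sample points in the usage note are all assumptions made for the example. The returned objective is the squared error J(V) defined earlier.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_index(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels, objective) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k of the objects as initial centroids (an assumption;
    # other initialisation schemes are possible).
    centroids = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        # Step 2: assign each object to the group with the closest centroid.
        labels = [nearest_index(p, centroids) for p in points]
        # Step 3: recalculate each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                # An empty group keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = [nearest_index(p, centroids) for p in points]
    objective = sum(squared_distance(p, centroids[label])
                    for p, label in zip(points, labels))
    return centroids, labels, objective
```

For instance, calling `k_means([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` separates the three points near the origin from the three points near (10, 10).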
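[Flowchart: Start -> Number of clusters K -> Find centroid -> Replace missing values by centroid -> Are all missing values replaced? (NO: repeat, YES: End)]
Figure 7.13 Workflow of the K-Cluster algorithm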
7.5.2 Three Stage Classification Model
The stage-one classifier validates the text in the mail subject. It
selects the text and checks it against the predefined keywords listed in the
FilterKeywords.xml file. Based on the keyword match, the mail is marked as
either legitimate or spam. If it is illegitimate, the mail is moved to the
spam or junk folder. If it is found to be good, it is passed to the stage-two
classifier, which checks the legitimacy of the content. The content is
checked for phishing keywords as well as for embedded images. The output is
either good mail or spam mail. If the mail is invalid, it is moved to the
spam or junk folder. If it is legitimate, the output is fed as input to the
stage-three classifier. This classifier labels the message as either good or
spam after validating the IP address. The received IP address is checked
against the black list of the real-time site Spamhaus.org. If the received
mail is marked as spam, it is moved to the spam or junk folder. Otherwise,
the mail is legitimate and is sent directly to the inbox.
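The control flow of the three stages can be sketched as below. This is an illustrative outline only, under several assumptions: the keyword sets stand in for the contents of FilterKeywords.xml, a local set of addresses stands in for the Spamhaus.org black-list lookup, the embedded-image check of stage two is omitted, and the `Mail` class and all function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Mail:
    subject: str
    body: str
    sender_ip: str


# Hypothetical stand-ins for FilterKeywords.xml and the real-time black list.
SUBJECT_KEYWORDS = {"verify account", "update"}
CONTENT_KEYWORDS = {"login", "click here"}
BLACKLISTED_IPS = {"203.0.113.7"}


def contains_keyword(text, keywords):
    """True if any keyword occurs in the text, ignoring case."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def classify(mail):
    """Return (label, reason); a mail must pass all three stages to be good."""
    # Stage one: keywords in the subject.
    if contains_keyword(mail.subject, SUBJECT_KEYWORDS):
        return "spam", "stage one: subject keyword"
    # Stage two: keywords in the content (image check not modelled here).
    if contains_keyword(mail.body, CONTENT_KEYWORDS):
        return "spam", "stage two: content keyword"
    # Stage three: sender IP address against the black list.
    if mail.sender_ip in BLACKLISTED_IPS:
        return "spam", "stage three: blacklisted IP"
    return "good", "passed all stages"
```

A mail labelled "spam" would be moved to the configured spam or junk folder, and a mail labelled "good" would go to the inbox. Ordering the stages from cheapest to most expensive check means most illegitimate mails are rejected before any network lookup is needed.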
The system can check as many mails as required. User accounts can be
configured for any mail server, such as Gmail or Yahoo. For example, Gmail is
configured as imap.gmail.com. Several user accounts can be checked for
phishing on the configured mail server. These accounts are listed in the
credentials.xml file. For security reasons, the user id and password of each
account are encoded with the encoder/decoder.exe file and then written to
credentials.xml, separated by a semicolon. The folder to which illegitimate
mails are to be moved must also be given for every user account. The folder
can be named Spam, Junk, or any name convenient to the user. The
credentials.xml file has to be configured as shown in Figure 7.14.
Figure 7.14 User-ID and Password configuration
<?xml version="1.0" encoding="UTF-8"?>
<CredentialList>
<Credentials>cG9ubWFuaW1hbGxpa2FAZ21haWwuY29t;R29vZ2xlQDEyM
w==;Junk
</Credentials>
<Credentials>cG9ubWFuaWFubmFtYWxhaUBnbWFpLmNvbQ==;MXFheiFB
UVo===;Spam</Credentials>
<Credentials>vdGVzdEBnbWFpbC5jb20===;M2VrZGZtIUAj;Junk
</Credentials>
<Credentials>c2h5bmlkYWxpbmFAZ21haWwuY29t;cEAkMTIzDQpwQkMTI
zNDU2;Spam
</Credentials>
</CredentialList>
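A file with this layout could be read as sketched below. The sketch assumes that each Credentials entry has the form user;password;folder and that the encoder produces Base64 text, which the values in Figure 7.14 resemble but which is not stated here. The `Account` type, the function names and the sample account in the usage note are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Account(NamedTuple):
    user: str
    password: str
    folder: str


def encode_entry(user, password, folder):
    """Build one 'user;password;folder' entry (Base64 encoding is assumed)."""
    def enc(text):
        return base64.b64encode(text.encode("utf-8")).decode("ascii")
    return f"{enc(user)};{enc(password)};{folder}"


def decode_field(field):
    """Decode one Base64 field, ignoring line wraps inside it."""
    compact = "".join(field.split())
    return base64.b64decode(compact).decode("utf-8")


def parse_credentials(xml_text):
    """Return one Account per <Credentials> element of a <CredentialList>."""
    root = ET.fromstring(xml_text)
    accounts = []
    for node in root.findall("Credentials"):
        user_field, password_field, folder = (node.text or "").split(";")
        accounts.append(Account(decode_field(user_field),
                                decode_field(password_field),
                                folder.strip()))
    return accounts
```

For example, an entry built with `encode_entry("alice@example.com", "s3cret!", "Junk")` and wrapped in `<CredentialList><Credentials>...</Credentials></CredentialList>` parses back to the same user, password and folder.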
A log is also maintained. The logger is a console application whose
window displays the details of every mail checked. Error messages such as
“unable to connect to Gmail host” and “invalid userid or password” are shown
in the console window. If there is no new mail, the message “Email box is
empty or no new mails” is displayed. If a mail is illegitimate, the message
“Mail with such subject is illegitimate and thus moved to spam” is written to
the log, and all the details of the illegitimacy are grouped in the console.
The details include the mail subject/content, the legitimacy check, and the
spam info. Illegitimate mails are highlighted in red, and legitimate mails
are shown in the default color, as shown in Figure 7.15.
Figure 7.15 Received E-mail details
The results are then classified with supervised learning algorithms to
verify that they are correct and to measure their accuracy.
7.5.3 Using Supervised Learning Algorithm
In the three-stage classifier, fifteen features are considered for
checking the subject header and the content of the mail. The fifteen features
are listed below.
(i) Popup
Phishing attacks can be found in emails if the attacker inserts any
forms or links to compromised websites. The attacker may include scripts that
create a popup and then load a form in it, to trick the user into entering
sensitive data. Hence, the presence of a popup suggests that the mail may be
an attempt to phish sensitive data.
(ii) Text “Verify Account”
If an email contains the text “Verify Account”, “Verify Email”, “Bank”,
“Debit”, “fwd”, “reply”, “Click”, “Here”, “login”, “update” or any of their
variants, it is worth checking the email for further symptoms of phishing.
The presence of these texts does not necessarily indicate a phishing attempt,
but they are an easy way to lure people into clicking malicious links.
(iii) JavaScript
JavaScript is normally used to validate forms in websites. Its presence
in an email indicates that the email is likely to be malicious, because
JavaScript can change the text of a document and can thus be used to trick
users in various ways.
(iv) onClick attribute
The onClick attribute can make an HTML element clickable and redirect
the user to another URL, a redirect that the element would not normally
perform.
(v) Change of window status
The text in the browser’s status bar can be changed through the
window.status property in JavaScript. This can be used to give the user false
information, for example loading contents from other websites while showing
the legitimate website’s address in the status bar.
(vi) IP address in URLs
Some phishing attacks are hosted on PCs infected with a virus or
malware, and the only way to link to them is by their IP address. Legitimate
email seldom uses links with an IP address. This feature flags a link in an
email whose host is an IP address (e.g.
http://101.56.3.48/login.facebook.com/login).
(vii) ReplyTo modification
The attacker may set the ‘replyto’ field of the email to the email
address of the legitimate company, so that the user’s reply goes to the
legitimate company and the user does not become suspicious about the sender’s
identity. Hence, it is important to check whether the sender address and the
‘reply to’ address differ. If they belong to different domains, this helps in
identifying phishing attempts.
(viii) Number of unique domains in URLs
Legitimate emails contain links to only one or two domains. If the
number is high, the email is probably an attempt to phish user data from the
receiver.
(ix) Number of words in Subject
Most legitimate e-mails have fewer than five to ten words in their
subjects. Hence, a large number of words in the subject indicates that the
e-mail may be an attempt to phish sensitive data from the user.
(x) Richness of the Vocabulary
Phishing emails normally contain the same words in different forms,
which reduces the richness of the content. Richness can be calculated by the
type-token ratio (TTR), as shown in equation (7.3).
\[ \mathrm{TTR} = \frac{\text{Types}}{\text{Tokens}} \times 100 \tag{7.3} \]
Types: number of different word forms
Tokens: total number of words
(xi) Number of Periods in URL
Legitimate URLs can also contain several dots, so a high count alone
does not make a URL a phishing URL. This feature is simply the maximum number
of periods (‘.’) contained in any of the links present in the email, and is a
continuous feature.
(xii) Link in Image
Linking an image to a URL makes many of the deceptions seen in phishing
attacks possible. Launching an attack with plain links is difficult for a
phisher, because the user is less likely to click a plain-text link. An
attractive image, however, can lure the user into clicking it, and the
attacker can thus redirect the user to a phishing site.
(xiii) Number of hyperlinks
It denotes the total number of hyperlinks which are available in the
content.
(xiv) Cascading Style Sheet (CSS)
It denotes the CSS applied in the content of the message.
(xv) Number of words in subject with at least fifteen characters
These features form the vector X = <F1, F2, F3, F4, F5, F6, F7, F8, F9,
F10, F11, F12, F13, F14, F15>, which is used to differentiate between
phishing and legitimate e-mails.
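Several of these features can be computed directly from the raw text of a mail. The sketch below is illustrative only and covers a subset of the fifteen features. It makes simplifying assumptions: links are found with a plain regular expression rather than an HTML parser, "domain" means the host name of a link, and the richness of the vocabulary is taken as distinct words over total words times 100. All names are hypothetical.

```python
import ipaddress
import re
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def host_is_ip(url):
    """True if the host part of the URL is a literal IP address."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def type_token_ratio(text):
    """Distinct words divided by total words, as a percentage."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return 100.0 * len(set(words)) / len(words)


def domain_of(address):
    """Domain part of an e-mail address, lower-cased."""
    return address.rsplit("@", 1)[-1].lower()


def extract_features(subject, body, sender, reply_to):
    """Return a dictionary with a subset of the features listed above."""
    urls = URL_PATTERN.findall(body)
    hosts = {urlparse(url).hostname for url in urls} - {None}
    subject_words = subject.split()
    return {
        "ip_in_url": any(host_is_ip(url) for url in urls),            # (vi)
        "reply_to_differs": domain_of(sender) != domain_of(reply_to),  # (vii)
        "unique_domains": len(hosts),                                  # (viii)
        "subject_words": len(subject_words),                           # (ix)
        "type_token_ratio": type_token_ratio(body),                    # (x)
        "max_periods_in_url": max((url.count(".") for url in urls),
                                  default=0),                          # (xi)
        "hyperlinks": len(urls),                                       # (xiii)
        "long_subject_words": sum(len(word) >= 15
                                  for word in subject_words),          # (xv)
    }
```

The resulting dictionary can be turned into a numeric vector in a fixed feature order before it is passed to a classifier.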
7.5.4 Evaluation
The experimental results of the three-stage classification model are
based on two data sets: one built from the subject header and the other from
the content of the e-mail. The test data set consists of 1535 e-mails. The
classification algorithms Decision Tree induction (DT), Multilayer Perceptron
(MLP), Naïve Bayes (NB), Bayesian Network (BN), and Radial Basis Function
(RBF) are implemented and trained using WEKA. Weka 3.4.4 is an open source,
portable, GUI-based workbench that collects data pre-processing tools and
state-of-the-art machine learning algorithms. 10-fold cross validation is
used to evaluate the robustness of the classifiers. Prediction accuracy, the
primary performance measure, is the ratio of the correctly classified
instances to the total number of test cases in the test data set. Prediction
accuracy and training time are the two criteria used to evaluate the
performance of the trained models. The prediction accuracy of the models is
compared, and Table 7.2 gives the results for the five classifiers.
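The measures reported in the following tables can be illustrated with a short Python sketch. It is not the WEKA implementation: it treats phishing as the single positive class, whereas WEKA reports class-weighted averages, and it splits the data into consecutive folds without the shuffling and stratification that WEKA applies. The function names and the confusion-matrix counts in the usage note are hypothetical.

```python
def evaluate(tp, fp, fn, tn):
    """Accuracy, precision, recall, F measure and kappa from binary counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f_measure": f_measure, "kappa": kappa}


def k_fold_indices(n, k=10):
    """Yield (train, test) index lists for k consecutive folds over n items."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size
```

With hypothetical counts of 40 true positives, 10 false positives, 5 false negatives and 45 true negatives, `evaluate(40, 10, 5, 45)` gives an accuracy of 0.85 and a kappa of 0.7. For the 1535 e-mails used here, `k_fold_indices(1535)` produces ten folds of 153 or 154 test instances each.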
Table 7.2 Performance analysis based on the subject for the five classifiers

Criteria for Evaluation                 Supervised Learning algorithm
                                        DT        MLP       NB        BN        RBF
Kappa Statistic                         0.7927    0.7868    0.7102    0.7097    0.7603
Mean Absolute Error                     0.1484    0.1265    0.1519    0.1607    0.1818
Relative Absolute Error (%)             29.6802   25.2909   30.3735   32.1486   36.3645
Root Relative Squared Error (%)         29.6802   25.2909   30.3735   32.1486   36.3645
Correctly classified instances (%)      89.6322   89.3419   85.4948   85.4706   88.0111
Incorrectly classified instances (%)    10.3678   10.6581   14.5020   14.5294   11.9889
Precision (%)                           89.7      89.3      87.9      87.8      88.6
Recall (%)                              89.6      89.3      85.5      85.5      88
F Measure (%)                           89.6      89.3      85.3      85.2      88
ROC Area (%)                            95.3      95        90.1      91.9      94.5
Time taken to build model (Sec)         0.13      0.21      0.17      0.24      1.16
The comparison graphs between the classifiers for precision, recall,
F Measure and ROC for the subject in e-mails are shown in Figure 7.16. They
show that the Decision Tree is the best algorithm for this set of features.
Moreover, decision tree induction takes the least time to build the model and
gives the highest precision of the five algorithms, as shown in Figure 7.17.
Figure 7.16 Comparison Graphs between classifiers for subject
Figure 7.17 Building time between the classifiers for subject
The 10-fold cross validation results of the five classifiers for the
content of the e-mails are summarized in Table 7.3.
[Figure 7.16 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis % of value; series Precision, Recall, F Measure, ROC]
[Figure 7.17 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis Time taken to build (Secs); series Time taken]
Table 7.3 Performance analysis based on content for the five classifiers

Criteria for Evaluation                 Supervised Learning algorithm
                                        DT        MLP       NB        BN        RBF
Kappa Statistic                         0.9664    0.9618    0.9485    0.9281    0.9035
Mean Absolute Error                     0.0252    0.0192    0.0277    0.0379    0.0769
Relative Absolute Error                 5.0333    3.85      5.5438    7.5845    115.3835
Root Relative Squared Error (%)         25.4202   27.6015   30.7141   36.6595   39.2758
Correctly classified instances (%)      98.3184   98.0886   97.4232   96.407    95.173
Incorrectly classified instances (%)    1.6816    1.9114    2.5768    3.593     4.827
Precision (%)                           98.3      98.1      97.4      96.4      95.2
Recall (%)                              98.3      98.1      98.3      97.3      95.2
F Measure (%)                           98.3      98.1      97.5      96.4      95.2
ROC Area (%)                            98.4      98.2      98.4      98.6      97.6
Time taken to build model (Sec)         0.13      0.23      0.16      0.18      0.71
The comparison graphs between the classifiers for precision, recall,
F Measure and ROC for content in e-mails are shown in Figure 7.18. The figure
also shows that the Decision Tree is the best algorithm for this set of
features. Moreover, decision tree induction takes the least time to build the
model and gives the highest precision of the five algorithms, as shown in
Figure 7.19.
Figure 7.18 Comparison Graphs between classifiers for the content
Figure 7.19 Building time between the classifiers for the content
The 10-fold cross validation results of the five classifiers for the
subject and content together of the e-mails are summarized in Table 7.4.
[Figure 7.18 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis % of accuracy; series Precision, Recall, F Measure, ROC]
[Figure 7.19 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis Time taken to build (sec); series Time taken]
Table 7.4 Performance analysis based on the subject and content for the five classifiers

Criteria for Evaluation                 Supervised Learning algorithm
                                        DT        MLP       NB        BN        RBF
Kappa Statistic                         0.9472    0.9472    0.9296    0.9693    0.9112
Mean Absolute Error                     0.0484    0.0459    0.0488    0.0184    0.0811
Relative Absolute Error                 9.6891    9.1745    9.7541    3.6701    16.2106
Root Relative Squared Error             31.3747   31.2648   36.1425   23.5682   40.053
Correctly classified instances          97.3627   96.1472   96.4796   98.4636   95.5601
Incorrectly classified instances        2.6373    3.8528    3.5204    1.5364    4.4399
Precision (%)                           99.4      97.4      96.5      98.5      95.6
Recall (%)                              99.3      94.4      95.5      96.5      94.6
F Measure (%)                           99.2      97.4      96.5      98.5      95.6
ROC Area (%)                            99.4      97.8      97.7      99.4      97.2
Time taken to build model (Sec)         0.03      0.18      0.04      0.11      0.55
The comparison graphs between the classifiers for precision, recall,
F Measure and ROC for the combination of the subject and content in e-mails
are shown in Figure 7.20. This clearly shows that the Decision Tree is the best
algorithm for this set of features. The comparison graph for the time taken to
build the model for the subject, content, and both combined, is shown in
Figure 7.21.
Figure 7.20 Comparison graph between the classifiers for the subject
and content
Figure 7.21 Building time between the classifiers for the subject and
content
[Figure 7.20 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis % of value; series Precision, Recall, F Measure, ROC]
[Figure 7.21 plot: x-axis Classifiers (DT, MLP, NB, BN, RBF); y-axis Time taken to build (sec); series Time taken for subject, Time taken for content, Time taken for subject & content]
7.5.5 Performance Comparison with the Existing Techniques
The proposed e-mail phishing filter (Mail Sieve) is compared with the
existing tools Mail Washer and G-Lock Spam Combat. The analysis of their
performance shows that both tools also process mails that have already been
read, as shown in Table 7.5.
Table 7.5 Comparative analysis with the existing tools

S. No.  Features                          Mail Washer   G-Lock Spam Combat   Mail Sieve
1       Process read mails                Yes           Yes                  No
2       Process unread mails              Yes           Yes                  Yes
3       Read mails by message subject     Yes           Yes                  Yes
4       Mails marked as spam by user      Yes           Yes                  No
5       Mails marked as spam by default   No            No                   Yes
6       Mails moved to spam               Yes           No                   Yes
7       Mails moved to trash              No            Yes                  No
8       Log reader                        Yes           Yes                  Yes
9       Time consumption                  High          Medium               Low
10      User friendliness                 Good          Better               Better
The end user generally does not worry about mails that have already
been read and wants to know the quality of the unread mails only. The
proposed system reads only the unread mails and moves the illegitimate ones
directly to the spam or junk folder without user intervention, which neither
of the two tools does. Both tools ask the user to mark a mail as good or bad
and update their filters based on the user’s decision.
Mail Sieve is also compared with existing techniques, namely
multi-tier phishing detection (Rafiqul Islam and Jemal Abawajy 2013), soft
computation based imputation (Kancherla et al 2012) and phishing detection
using CRF and LDA (Venkatesh and Harry 2013), in terms of accuracy, as shown
in Table 7.6.
Table 7.6 Comparison with the existing approaches based on accuracy

S. No.  Different existing approaches   Accuracy (%)
1       Multi-tier                      97
2       Soft computation                82.46
3       CRF & LDA                       98.8
4       Mail sieve                      99.4
Table 7.6 shows that the Mail sieve approach (the proposed system)
achieves higher accuracy than the existing techniques. The comparison graph
is shown in Figure 7.22.
Figure 7.22 Comparison graph with the existing approaches based on
accuracy
[Figure 7.22 plot: x-axis Different approaches (Multi-tier, Soft computation, CRF & LDA, Mail sieve); y-axis % of value; series Accuracy; bar values 97, 82.46, 98.8, 99.4]
7.6 SUMMARY
Secure bank transactions are achieved in the server side phishing
filtering techniques by implementing effective one-time password and
watermarking mechanisms.
The one-time URL mechanism used against session hijacking provides
more secure communication for the individual and the organization.
The Session Fixation preventer system achieved high accuracy in
preventing session fixation by the attacker.
The Three-Stage e-mail filtering technique achieved 99.4% accuracy
in filtering phishing e-mails.
To produce efficient results over e-mail and offer a better e-mail
process, features are selected from different e-mails, and classification
algorithms are used to achieve low false negative and false positive rates.