
Invitation or Bait? Detecting Malicious URLs in Facebook Events

Sonu Gupta Department of Computer Science

Jaypee Institute of Information Technology Noida, India

[email protected]

Shelly Sachdeva Department of Computer Science

National Institute of Technology Delhi New Delhi, India

[email protected]

Abstract—With 2.2 billion monthly active users, Facebook is the most popular Online Social Network. Given its huge popularity and diverse features such as pages, events, groups, etc., it is potentially the most attractive platform for cybercriminals to launch various attacks. In this paper, we study the role of Facebook Events in disseminating malicious URLs in the network. We focus our analysis on Facebook Events that are created by Facebook Pages. Existing services such as Web of Trust (WOT) and other blacklists follow crowdsourcing models; thus, malicious URLs can only be detected once they are widespread on the network and have done significant damage. Therefore, we train a supervised machine learning model on our labeled dataset to create an efficient classifier for the automatic detection of malicious Facebook Events, independent of blacklists and third-party reputation services. Our model is able to classify malicious events with 75% accuracy using a Support Vector Machine. To the best of our knowledge, this is the first paper to study the presence of malicious URLs in Facebook Events.

Keywords— Online Social Media, Malicious URLs, Facebook Events

I. INTRODUCTION

From a mere chatroom service in the early 1990s to live video streaming in 2018, Online Social Networks (OSNs) have evolved rapidly and are still evolving. There are more than 150 OSNs, with Facebook, Twitter, and Instagram being among the most popular. They are the best medium to quickly spread information to a large audience. With 2.2 billion monthly active users, Facebook is the most popular OSN.1 Given its huge popularity and diverse features such as pages, events, groups, etc., it is potentially the most attractive platform for cyber-criminals to launch various attacks. It has been claimed that Facebook spammers make over $200 million just by posting links.2 A study also shows the presence of malicious pages on Facebook [12]. Such malicious URLs can cause financial loss and thus degrade the user experience.

The presence of malicious URLs on OSNs has been studied by several researchers. There are many ways to spread malicious URLs on a network. On Twitter, a URL can be appended to a tweet. On Instagram, it can be part of a caption. Similarly, on Facebook it can be disseminated easily via posts.

1 https://en.wikipedia.org/wiki/Facebook

2 http://www.theguardian.com/technology/2013/aug/28/facebook-spam-202-million-italian-research

Unlike Twitter, which is a micro-blogging network, Facebook offers a wide variety of applications, such as Facebook Groups and Events, to disseminate and share information. With everything going online, Facebook provides a way to invite people to both offline and online events using online event invitations. Such events can be created by both Facebook users and Facebook Pages. Depending on the event owner’s preference, an event can be public or private. When a Facebook event is created by a page, it is public by default, whereas when a user profile is used to create an event, the owner can decide to keep it public or private. The owner of the event invites other Facebook users and may permit or restrict other guests from inviting more users. Recipients can give different types of responses: ‘Interested’, ‘Going’, ‘Maybe’ or ‘Declined’. A Facebook event typically consists of an event photo, event name, host name, venue, start date and time, description, and a link to the event’s ticketing website, if any. Often the description of an event contains multiple URLs. We recently discovered that Facebook Events are also exploited to spread malicious URLs. According to a report3, 490 million people use Facebook events each month. Over 38 million public events were created in 2016 alone, and nearly 35 million people view a public event each day.

Whenever a news-making incident takes place, user activity increases on OSNs, including Facebook. Cybercriminals often take advantage of such situations to spread malicious URLs in the network. However, existing work on identifying malicious URLs on other OSNs like YouTube, Twitter, etc. cannot be directly applied to Facebook Events, as these techniques rely mainly on features that are not publicly accessible for Facebook through the Graph API. These features include profile and network data, the total number of posts published, friends lists, etc.

In this paper, we identify Facebook Events that spread malicious URLs in the network. We focus our analysis on Facebook Events created by various Facebook Pages. All the events were made public by the event creators. Figure 1 shows the architecture of the system to detect malicious URLs in Facebook Events. We collect data using Facebook’s Graph API4 and store it in a NoSQL database.

3 https://events.fb.com/

4 https://developers.facebook.com/docs/graph-api/


In this work, we use MongoDB.5 Using supervised machine learning algorithms, we train a model to classify events as malicious or legitimate. To the best of our knowledge, this is the first work to study the presence of malicious URLs in Facebook Events.
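As an illustration, collected event objects could be persisted as documents roughly as follows. This is a minimal sketch; the database and collection names and the assumption that each event JSON carries an "id" field are illustrative, not details taken from the paper.

# Minimal sketch: persisting collected events in MongoDB with pymongo.
# Database/collection names and the event "id" field are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
events = client["fb_events"]["events"]

def store_event(event_json):
    # Upsert one event document, keyed by its Facebook event id.
    events.update_one({"id": event_json["id"]},
                      {"$set": event_json},
                      upsert=True)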

Fig. 1. The architecture of the system to detect malicious URLs in Facebook Events

The main contributions of this work are:

• Highlighted that Facebook Events, too, can be exploited to spread malicious URLs

• Proposed a set of 36 features to detect malicious URLs present in Facebook Events

• Developed a supervised machine learning model to classify a Facebook Event as malicious or legitimate.

The rest of the paper is organized as follows: Section II describes the work related to this paper. Section III explains the methodology we used for data collection and data annotation. Section IV describes the features used to train the classifier and presents the experimental results. Section V explains the limitations of this work. Finally, Section VI concludes and discusses future work.

II. RELATED WORK

In this section, we discuss previous work on the identification of malicious content on OSNs. First, we describe existing work on Facebook; in the later part of this section, we discuss past work on other OSNs.

A. Identification of malicious content on Facebook

Gao et al. characterized and detected spam campaigns initiated from Facebook accounts using a dataset comprising almost 187 million posts [7]. They found that more than 70% of all malicious wall posts advertised phishing websites. They also found that more than 97% of the offending accounts were compromised accounts rather than fake accounts created solely for spamming. This approach relied heavily on URL blacklists to detect malware, phishing, and spam.

5 https://www.mongodb.com/

In follow-up work, Gao et al. addressed the challenge of reconstructing campaigns in real time by adopting incremental clustering and parallelization [8]. They identified six features to distinguish spam campaigns from legitimate message clusters. They also developed and evaluated an accurate and efficient system that could be easily deployed on the OSN server side to provide online spam filtering in real time. Ahmed and Abulaish presented a Markov Clustering based approach for the detection of spam profiles on Facebook [4]. The authors crawled the content posted by 320 users, of which 165 were manually labeled as spammers, and extracted three features from these profiles, which were used as input to the Markov Clustering model.

Stringhini et al. [6] leveraged a honeypot model to crawl data of spammers present on Facebook. In [10], Rahman et al. proposed a social malware detection technique that leverages the social context of posts. With this approach, they were able to attain a maximum true positive rate of 97% using an SVM model trained on only six features.

Dewan and Kumaraguru characterized a dataset of 4.4 million public posts generated on Facebook and identified 11,217 malicious posts containing URLs [11]. They observed that Facebook pages participated more in generating malicious content than legitimate content. They deployed a REST API and a browser plug-in to identify malicious Facebook posts in real time. In another study, Dewan et al. identified and characterized Facebook pages engaged in spreading URLs pointing to various malicious domains [12]. They automated the process of detecting malicious Facebook pages by training multiple supervised learning algorithms on the dataset. An accuracy of 84.13% was achieved using an artificial neural network trained on a fixed-size bag-of-words.

B. Identification of malicious content on other OSNs

There are several studies that use machine learning based models to identify malicious content on other OSNs such as Twitter and YouTube [2], [3], [5]. The effectiveness of these models is due to features like the number of followers and followees, the date of joining the network, etc. [5], which are not publicly available on Facebook.

There is a browser-based tool to detect phishing URLs [14], but it requires the user to click on a suspicious URL. In [1], Aggarwal et al. built an effective mechanism to detect phishing links on Twitter. They used a set of 42 features to train the classifier and developed a RESTful API along with a Chrome extension that calls this API to detect phishing URLs in real time. They achieved an accuracy of 92.52% with a Random Forest based machine learning model. Thomas et al. developed a real-time system that first crawls and then assesses the URLs submitted to a web service, classifying the URLs as spam or legitimate in real time [9]. This system depends on features of the landing page, which might not be available in some cases.

III. DATA COLLECTION AND LABELED DATASET

In this section, we describe how we collected data from Facebook for analysis and how we built a true positive dataset of Facebook events containing malicious URLs.


Data collection involves two steps: (i) collecting data from Facebook, and (ii) labeling each event as malicious or legitimate.

A. Data Collection

We focus our analysis on Facebook events created by various Facebook pages. For our study, we required Facebook Events containing ticketing URLs and/or URLs in the event description, so we collected only such events. To collect data, we used various keywords to search for relevant Facebook events using Facebook’s Graph API Search endpoint. Figure 2 shows our data collection procedure. We used diverse keywords such as Happiness, Life, TEDx, Britney Spears, Art of Living, Events, Nightlife, alcohol, sex, porn, rave party, hukka, news, politics, sports, Tsunami, terrorist, terrorism, flood, Engineers, love, etc. We observed that there were several events related to each of these keywords.

Fig. 2. The data collection procedure used.
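As a rough illustration, the keyword search could be issued against the Graph API search endpoint as sketched below. This is a minimal sketch, assuming an API version in which event search was still available (Facebook has since restricted this endpoint); the field list, API version, and access-token handling are assumptions rather than details from the paper.

# Minimal sketch of searching public events by keyword via the Graph API.
# Endpoint availability, field names, and the token are illustrative assumptions.
import requests

GRAPH_SEARCH_URL = "https://graph.facebook.com/v2.12/search"
ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"  # placeholder

def search_events(keyword):
    # Return the list of event objects matching the keyword.
    params = {
        "q": keyword,
        "type": "event",
        "fields": "id,name,description,start_time,attending_count,"
                  "interested_count,maybe_count,declined_count,noreply_count",
        "access_token": ACCESS_TOKEN,
    }
    resp = requests.get(GRAPH_SEARCH_URL, params=params)
    resp.raise_for_status()
    return resp.json().get("data", [])

for keyword in ["TEDx", "Nightlife", "politics"]:
    for event in search_events(keyword):
        print(event["id"], event.get("name"))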

B. Labeling Events as Malicious or Legitimate

Initially, we curated a list of the unique URLs present in the events. We observed that many of the URLs were created using various URL shortener services like bitly, Google URL shortener, etc. All the unique URLs collected through the Graph API were resolved using URLexpander6 to capture the destination page in the case of shortened URLs. To label events as malicious or legitimate, we use the Web of Trust7 (WOT) API, a crowdsourcing-based website reputation service. WOT computes the reputation of websites using ratings received from users and information from third-party sources. Reputations are measured for a target URL in several domains, such as child abuse. For each target and domain pair, the system computes a reputation estimate and the confidence in that estimate; together, these indicate the amount of trust in the target URL in the given domain. The reputation score ranges from 0 to 100: the higher the reputation score, the greater the community’s trust in the website. A reputation score below 60 is unsatisfactory. Each reputation score has an associated confidence score, also in the range 0 to 100; the higher the confidence score, the more reliable the reputation estimate.

6 http://urlexpander.net/

7 https://www.mywot.com/

For instance, the WOT add-on requires a confidence score greater than or equal to 10 before it presents a warning about a website. In addition to the reputation and confidence scores, the rating system also computes categories for websites based on votes from users and third parties. Category data aims to explain the reason behind a poor reputation score. There are three categories: (1) Negative, (2) Questionable, and (3) Positive. In our dataset, URL domains with a low reputation score (<60) and a high confidence score (>=10) were marked as malicious. We also labeled a URL as malicious if it fell under the Negative or Questionable category. In much past research, WOT has been used to label OSN data and has proved to be effective [11], [12]. If a URL is found to be malicious, we label the source event as malicious; otherwise, we label the event as legitimate. Table I shows the description of our dataset. We collected 223 events using the keywords mentioned in Section III-A. After eliminating all events that do not contain URLs, we are left with 62 events. Finally, we have 17 events with malicious URLs in them.

TABLE I. DESCRIPTION OF THE DATASET

No. of Events                            223
No. of Events containing URL              62
No. of Events containing ticket URLs      31
No. of Events having websites             43
No. of Events with malicious URLs         17
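A minimal sketch of the labeling rule described above is given below; it assumes the WOT reputation, confidence, and category values have already been retrieved for each URL, and the function and variable names are illustrative.

# Minimal sketch of the event-labeling rule from Section III-B.
# Assumes WOT reputation/confidence scores and categories were already fetched.
NEGATIVE_OR_QUESTIONABLE = {"negative", "questionable"}

def url_is_malicious(reputation, confidence, categories):
    # Thresholds used in the paper: reputation < 60 with confidence >= 10,
    # or membership in a Negative/Questionable category.
    if reputation < 60 and confidence >= 10:
        return True
    return any(c.lower() in NEGATIVE_OR_QUESTIONABLE for c in categories)

def label_event(event_urls_wot):
    # event_urls_wot: list of (reputation, confidence, categories) tuples,
    # one per URL found in the event. Returns 'malicious' or 'legitimate'.
    for rep, conf, cats in event_urls_wot:
        if url_is_malicious(rep, conf, cats):
            return "malicious"
    return "legitimate"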

IV. AUTOMATIC DETECTION OF MALICIOUS EVENTS

In [13], Sheng et al. have shown that URL blacklists and reputation services are ineffective initially and often take time to update. Therefore, there is a need for an automated system to detect malicious Facebook Events. To meet this need, we trained supervised learning algorithms on our labeled dataset to create an efficient model for the automatic detection of malicious Facebook Events, independent of blacklists and third-party reputation services. We curated a feature set using some event-specific features and some features from previous work [5]. To perform all the experiments, we used scikit-learn, a machine learning library for the Python programming language.8

A. Feature Selection

In this section, we describe the features we used to train our machine learning models. Table II shows the features used for the machine learning experiments. The feature set is divided into five categories, as follows.

1) Event-Based Features (F1): There are some basic attributes associated with all Facebook Events, such as interested count, maybe count, attending count, declined count, and no reply count. We observed that a malicious Facebook Event has a larger no reply count compared with a legitimate Event.

8 http://scikit-learn.org


In contrast, its attending count, declined count, and interested count are low. One interesting observation is that there are several Facebook Events for which the difference between the start time of the event and the event creation time is several years. Figure 3 is an example of one such event; here, the event is scheduled for 10 years from now.

Fig. 3. An example of an event which is scheduled for 10 years from now.

2) Event Description Based Features (F2): We found that more than 90% of the events we collected have a description. A description can be short or very elaborate, with multiple URLs in it. We found that some of the URLs are shortened using bitly and other such services. Using a shortened URL in a Facebook Event description, where there is no character limit, is a clear sign of deception; a minimal check for such shorteners is sketched below.
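As an illustration, the presence of shortened URLs in a description could be flagged as follows; the list of shortener domains is an assumption, not taken from the paper.

# Minimal sketch of flagging shortened URLs in an event description.
# The shortener domain list is illustrative.
import re
from urllib.parse import urlparse

SHORTENER_DOMAINS = {"bit.ly", "goo.gl", "tinyurl.com", "ow.ly", "t.co"}
URL_PATTERN = re.compile(r"https?://\S+")

def has_shortened_url(description):
    # Return True if any URL in the description uses a known shortener.
    for url in URL_PATTERN.findall(description or ""):
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host in SHORTENER_DOMAINS:
            return True
    return False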

3) Event’s Post Based Features (F3): Every event has a discussion forum where guests and the owner can post. We found that malicious Facebook Events have fewer posts; if there are any posts, they are mostly advertisements or links to other webpages. We also observed that user engagement with these posts is very low.

4) Link-Based Features (F4): There are multiple ways in which links can be posted in a Facebook Event. The two main ways are through event tickets and in the event description. Apart from these, links are also posted in the discussion forums of events. Cybercriminals exploit these places to inject malicious URLs into events.

5) Event Host Based Features (F5): An event host can be a page or a user profile. The credibility of a Facebook Event depends on the owner of the event, and there can be multiple owners of an event. We observed that malicious Facebook Events are mostly created by Facebook pages. Also, the fan count of these pages is quite low; for our dataset, such pages have, on average, 1,043 fans. We also found six user profiles that were involved in creating malicious Facebook Events.

TABLE II. FEATURES USED FOR MACHINE LEARNING EXPERIMENTS

Event (12): has ticketing URL, interested count, maybe count, attending count, declined count, no reply count, is canceled, can guests invite, event name length, no. of words in the name, start time, location

Event Description (11): emoticons (smile or frown), no. of words, average word length, no. of sentences, average sentence length, no. of English dictionary words, no. of characters, no. of URLs, no. of URLs per word, no. of uppercase characters, no. of words / no. of unique words

Event Posts (4): no. of URLs, share count, like count, comment count

Link (6): has HTTP / HTTPS, parameter count, parameter length, no. of subdomains, path length, hyphen count

Event Host (3): has a website, fan count, is verified?
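A minimal sketch of how the link-based (F4) features in Table II could be computed for a single URL is given below; the subdomain count is a simple approximation that ignores public-suffix rules, and the feature names are illustrative.

# Minimal sketch of computing the link-based features (F4) from Table II
# for a single URL, using only the Python standard library.
from urllib.parse import urlparse, parse_qsl

def link_features(url):
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    return {
        "has_https": parsed.scheme == "https",
        "parameter_count": len(parse_qsl(parsed.query)),
        "parameter_length": len(parsed.query),
        # Rough approximation: labels beyond the registered domain count as subdomains.
        "num_subdomains": max(host.count(".") - 1, 0),
        "path_length": len(parsed.path),
        "hyphen_count": url.count("-"),
    }

print(link_features("http://free-tickets.example.com/win?e=1&ref=abc"))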

B. Classification Algorithms

In this section, we explain the supervised machine learning algorithms we used to train our classifier to detect malicious events. We performed experiments with K-Nearest Neighbors, Naïve Bayes, Support Vector Machines, and Decision Trees. All the trained models were evaluated using 10-fold cross-validation.

1) K-Nearest Neighbour (k-NN): k-NN is a type of instance-based learning. It does not learn or construct a generalized internal model; instead, it stores the instances of the data it sees during training. Classification is based on a majority vote of the nearest neighbors of each data point: a query point is assigned the class that has the most examples among that point’s nearest neighbors.

2) Naïve Bayes Algorithms: Naïve Bayes algorithms are a family of supervised machine learning algorithms based on Bayes’ theorem with the simplifying assumption of independence between every pair of features. Given a class y and a dependent feature vector x_1, ..., x_n, Bayes’ theorem states the following relationship:

P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}

Assuming that the features are independent, so that

P(x_i \mid y, x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n) = P(x_i \mid y)

for all i, this relationship can be written as:

P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}


Since P(x_1, \dots, x_n) is constant for a given input data point, we use the following classification rule:

P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)

\hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y)

3) Support Vector Machines (SVMs): SVMs are also known as support vector networks (SVNs). They are supervised machine learning models that analyze data for classification. In a binary classification problem, consider a set of training data points, each associated with one of two classes. An SVM algorithm builds a model that assigns new data points to one of the two classes, acting as a binary linear classifier. In other words, an SVM model is a representation of the training data as points in space, mapped so that the data points of the different classes are separated by a margin that is as wide as possible. New data points are then mapped into the same space and predicted to belong to a class based on which side of the margin they fall.

4) Decision Tree: A decision tree is a type of machine learning algorithm in which a tree is constructed from class-labeled training data points. A decision tree can be understood as a flow-chart-like structure in which every non-leaf node represents a test on an attribute, each branch represents the outcome of a test, and each leaf node carries the class label to which the data point belongs. The uppermost node in a tree is the root node. The idea is to build a model that predicts the value of a target class based on multiple inputs.

C. Results

We trained four machine learning models using 17 unique malicious events as the positive class and 45 unique legitimate events containing one or more URLs as the negative class. We are aware that the classes are imbalanced; we kept them imbalanced because, in a real-world scenario, the number of positive examples would also be small. We performed all the experiments using the scikit-learn library. The 10-fold cross-validation on our training examples resulted in an accuracy of 75% using the SVM classifier. Table III describes the results in detail, and a minimal sketch of the evaluation setup follows the table.

TABLE III. RESULTS FOR SUPERVISED LEARNING EXPERIMENTS

Classifier                 Accuracy
K-Nearest Neighbour        69%
Naïve Bayes                31%
Support Vector Machine     75%
Decision Tree              62%
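The evaluation setup can be sketched as follows. This is a minimal sketch, assuming the 36 features have already been extracted into a feature matrix X with labels y (synthetic placeholders are used here), scikit-learn default hyperparameters (the paper does not report the settings), and the Gaussian variant of Naïve Bayes, which is also an assumption.

# Minimal sketch of 10-fold cross-validation over the four classifiers.
# X and y are placeholders standing in for the 62 events x 36 features dataset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((62, 36))            # placeholder feature matrix
y = rng.integers(0, 2, size=62)     # placeholder labels: 1 = malicious, 0 = legitimate

classifiers = {
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Support Vector Machine": SVC(),
    "Decision Tree": DecisionTreeClassifier(),
}

for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.2f}")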

V. LIMITATIONS

Due to Facebook’s API restrictions, it is difficult to collect data; thus, we trained our model on the small dataset we were able to collect. Also, we do not know what fraction of public events is returned by the Graph API. Furthermore, to create the ground truth dataset of malicious Facebook Events, we used the WOT API. We understand that WOT follows a crowd-sourcing model and thus might be biased. Nevertheless, WOT explicitly states that it tracks a user’s behavior before determining how much to trust that user, in order to keep ratings trustworthy.

VI. CONCLUSION

OSNs witness an increase in the volume of content during real-life incidents. This, in turn, provides cyber-criminals with a thriving environment to disseminate malicious content on OSNs. We studied Facebook Events and found a considerable number of malicious URLs in the network. We noticed significant differences between malicious and legitimate events and used them to train machine learning models for automatic detection of malicious events. The feature set we curated consists of features that can be easily computed with little preprocessing. The model we trained is able to classify malicious events with 75% accuracy. Our analysis can be easily extended to study public Facebook Groups as well; thus, we further intend to study the presence of malicious entities in Facebook Groups. Facebook also supports multiple non-English languages. At present, our system works only on events in English; in the future, we would like to address this limitation. We also observed that there are many fake Facebook events that do not contain any malicious URL. Such events tend to create the illusion of an event that does not exist. In the future, we also intend to explore NLP techniques to combat this issue.

ACKNOWLEDGEMENT

I would like to thank my advisor and other faculty members at JIIT for their support and feedback.

REFERENCES

[1] A. Aggarwal, A. Rajadesingan, and P. Kumaraguru. “Phishari: Automatic realtime phishing detection on Twitter.” In eCrime Researchers Summit (eCrime), 2012, pages 1–12. IEEE, 2012.

[2] A. H. Wang. “Don’t follow me: Spam detection in Twitter.” In SECRYPT, pages 1–10. IEEE, 2010.

[3] C. Grier, K. Thomas, V. Paxson, and M. Zhang. “@spam: The underground on 140 characters or less.” In CCS, pages 27–37. ACM, 2010.

[4] F. Ahmed and M. Abulaish. “An MCL-based approach for spam profile detection in online social networks.” In IEEE TrustCom, pages 602–608. IEEE, 2012.

[5] F. Benevenuto, G. Magno, T. Rodrigues, and V. Almeida. “Detecting spammers on twitter.” In CEAS, volume 6, page 12, 2010.

[6] G. Stringhini, C. Kruegel, and G. Vigna. “Detecting spammers on Social Networks.” In ACSAC, pages 1–9. ACM, 2010.

[7] H. Gao, J. Hu, C. Wilson, Z. Li, Y. Chen, and B. Y. Zhao. “Detecting and characterizing social spam campaigns.” In Internet Measurement Conference, pages 35–47. ACM, 2010.


[8] H. Gao, Y. Chen, K. Lee, D. Palsetia, and A. N. Choudhary. “Towards online spam filtering in social networks.” In NDSS, 2012.

[9] K. Thomas, C. Grier, J. Ma, V. Paxson, and D. Song, “Design and evaluation of a real-time URL spam filtering service,” in Security and Privacy (SP), 2011 IEEE Symposium on. IEEE, 2011, pp. 447–462.

[10] M. S. Rahman, T.-K. Huang, H. V. Madhyastha, and M. Faloutsos. “Efficient and scalable socware detection in Online Social Networks.” In USENIX Security Symposium, pages 663–678, 2012.

[11] P. Dewan and P. Kumaraguru, “Towards automatic real time identification of malicious posts on Facebook,” In IEEE Conference on Privacy, Security and Trust. IEEE, 2015.

[12] P. Dewan, S. Bagroy and P. Kumaraguru, “Hiding in plain sight: Characterizing and detecting malicious Facebook pages,” In Int. Conf. on Advances in Social Networks Analysis and Mining, IEEE, 2016.

[13] S. Sheng, B. Wardman, G. Warner, L. Cranor, J. Hong, and C. Zhang. “An empirical analysis of phishing blacklists.” In Sixth Conference on Email and Anti-Spam (CEAS), 2009.

[14] Y. Zhang, S. Egelman, L. Cranor, and J. Hong, “Phinding phish: Evaluating anti-phishing tools,” in Proceedings of the 14th Annual Network and Distributed System Security Symposium (NDSS), 2007.