Static Analysis of Third-Party Web Tracking
Mentor: Michelle Mazurek
Xing Niu <[email protected]>, Hao Zhou <[email protected]>
University of Maryland, College Park
April 6, 2015

Data Sets

In this project, we used the third-party web tracking data provided by Prof. Michelle Mazurek. It contains the web browsing histories of 33 users. The histories are organized as directed graphs in which vertices are web domains and edges represent connections from first to third parties; each edge also indicates whether cookie tracking was observed. A first-party website is one that a user browses directly, whereas a third-party site is one that the first-party website connects to in the background while it is loading.

We merged all 33 graphs (one graph per user) into a single graph, excluded reflexive edges, and collapsed selected third-level domains into second-level domains (e.g. d3fw5vlhllyvee.cloudfront.net to cloudfront.net). The result is a large directed graph with 4,732 vertices and 37,381 edges. Each edge is weighted by the number of users who experienced that connection.
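The merging step can be sketched as follows. This is a minimal illustration, not the procedure we ran in NodeXL: the function names, the `COLLAPSE_SUFFIXES` rule set and the example domains are hypothetical, and the per-edge cookie flag is omitted for brevity.

```python
from collections import defaultdict

# Hypothetical rule set: hosts under these suffixes are collapsed to the suffix.
COLLAPSE_SUFFIXES = {"cloudfront.net"}


def collapse(domain):
    """Map e.g. d3fw5vlhllyvee.cloudfront.net to cloudfront.net."""
    for suffix in COLLAPSE_SUFFIXES:
        if domain == suffix or domain.endswith("." + suffix):
            return suffix
    return domain


def merge_user_graphs(user_graphs):
    """Merge per-user edge lists into one weighted directed graph.

    user_graphs: one edge list per user; each edge is a
    (first_party, third_party) pair.
    Returns {(src, dst): number of users who experienced that edge}.
    """
    users_per_edge = defaultdict(set)
    for user_id, edges in enumerate(user_graphs):
        for src, dst in edges:
            src, dst = collapse(src), collapse(dst)
            if src == dst:  # exclude reflexive edges
                continue
            users_per_edge[(src, dst)].add(user_id)
    return {edge: len(users) for edge, users in users_per_edge.items()}


# Two made-up users; the first also has a reflexive edge that is dropped.
example = [
    [("news.example", "d3fw5vlhllyvee.cloudfront.net"),
     ("news.example", "news.example")],
    [("news.example", "abc.cloudfront.net"),
     ("shop.example", "ads.example")],
]
merged = merge_user_graphs(example)
```

Collecting user ids in a set, rather than incrementing a counter, keeps the weight equal to the number of distinct users even if one user's history contains the same connection several times.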

In addition, to complement the dataset, we looked up the category of every web domain in it using an online service, Blue Coat [1].

Abstract

We used NodeXL to analyze a third-party web tracking dataset in this project. We found that certain categories of websites are more likely to be third-party sites and to use cookies in ways that compromise user privacy. Moreover, websites “collude” with each other selectively. By analyzing the in-degree and out-degree of the vertices, we found potential personal data flows and interesting patterns: some mixed websites act more like first parties, others more like third parties.

They track you and take over your privacy

Figure 1: The distribution of third-party sites over category.

Since the web domains are linked as a directed graph in which each first party points to its third parties, we can use in-degree to measure how many websites a third-party site is tracking. For example, a third-party site embedded by three first-party sites has an in-degree of 3.

As shown in Figure 1, we arrange the third-party sites (those with an in-degree of at least 5) by category in a treemap. Within each category, the vertices are laid out as a grid. Vertex size also encodes in-degree.
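The selection behind Figure 1 can be approximated in a few lines. This is a sketch under assumptions: the edge list and the `category_of` lookup are made-up stand-ins for the merged graph and the Blue Coat categories, and the function names are ours.

```python
from collections import Counter, defaultdict


def in_degrees(edges):
    """Count the distinct first parties pointing at each target domain."""
    return Counter(dst for _, dst in set(edges))


def third_parties_by_category(edges, category_of, min_in_degree=5):
    """Group domains with in-degree >= min_in_degree by category.

    Returns {category: [(domain, in_degree), ...]} with each list sorted
    by descending in-degree, which mirrors the size coding in the treemap.
    """
    groups = defaultdict(list)
    for domain, degree in in_degrees(edges).items():
        if degree >= min_in_degree:
            category = category_of.get(domain, "Uncategorized")
            groups[category].append((domain, degree))
    for members in groups.values():
        members.sort(key=lambda item: (-item[1], item[0]))
    return dict(groups)


# Made-up data: five sites embed ads.example, only two embed cdn.example.
edges = [(f"site{i}.example", "ads.example") for i in range(5)]
edges += [("site0.example", "cdn.example"), ("site1.example", "cdn.example")]
category_of = {"ads.example": "Web Ads/Analysis", "cdn.example": "Content Servers"}

groups = third_parties_by_category(edges, category_of)
```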

We can easily notice that Web Ads/Analysis, Technology/Internet (e.g. Web APIs), Marketing and Content Servers (e.g. cloud storage servers and content distribution networks) are major third-party sites that are tracking us.

It is interesting that some third-party sites also link to other parallel third-party sites. One reason is that they have several subdomains and some act as first-parties while some act as trackers. Another reason is they are sharing users’ information with others. We will discuss this issue later.

Illustration: first-party sites pointing to a third-party site (in-degree = 3).

Many third-party sites track users with cookies, small pieces of text stored locally that contain information related to browsing actions.

For example, when a user browses first-party-1.com, ad.third-party.com stores a cookie on his/her computer. ad.third-party.com can then push a targeted ad to this user when he/she browses first-party-2.com.
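The mechanism can be illustrated with a toy model. This sketch does not reproduce any real ad network: the `ThirdPartyTracker` class and the `uid-` identifier format are hypothetical, and a real browser would send the cookie in HTTP headers rather than as a function argument.

```python
import itertools


class ThirdPartyTracker:
    """Toy model of a tracker embedded in several first-party pages."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = {}  # cookie id -> first-party sites where it was seen

    def handle_request(self, first_party, cookie=None):
        """Record the visit and return the cookie the browser should keep."""
        if cookie is None:  # first contact: issue a new identifier
            cookie = f"uid-{next(self._ids)}"
        self.history.setdefault(cookie, []).append(first_party)
        return cookie


tracker = ThirdPartyTracker()
# The user loads first-party-1.com, which embeds the tracker: no cookie yet.
browser_cookie = tracker.handle_request("first-party-1.com")
# Later the user loads first-party-2.com; the browser sends the stored cookie.
browser_cookie = tracker.handle_request("first-party-2.com", browser_cookie)
```

After the second request, the tracker can link both visits to the same identifier, which is what makes the targeted ad on first-party-2.com possible.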

Figure 2 shows which third-party sites are tracking users with such cookies (by applying dynamic filters).

Not surprisingly, Web Ads and Marketing sites are big fans of cookies, because these small pieces of information help them select more personalized ads, which in turn bring in more money.

Content Servers, on the other hand, seldom use cookies because in most cases they are just hard drives from the perspective of first-party websites.

Illustration: first-party-1.com and first-party-2.com both embed ad.third-party.com, which uses the same cookie on each.

Figure 2: The distribution of third-party sites using cookies over category.

Websites “collude” with each other

Figure 3: The tracking connectivity between websites grouped by category.

We have mentioned that Web Ads/Analysis, APIs, Marketing and Content Servers dominate web tracking. The massive number of connections to them obscures many interesting tracking chains.

Figure 3 reveals the network among the top categories except those big four. The size of each disk measures the number of first-party websites in that category. The opacity of each edge measures the number of connections.

By tracing the links, we can find some “arrow collectors” such as Social Networking and Audio/Video Clips. It is easy to imagine a website embedded with Twitter timelines or YouTube videos.

Some meaningful links reveal how websites “collude” with each other. For example, Education links to Job Search/Careers; Restaurant/Dining/Food and Health link to Shopping; Travel and Daily Living link to each other; …

Privacy is valuable only if it is shared with the person who treasures it.
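The category-level view in Figure 3 amounts to aggregating domain-level edges by category after setting the big four aside. A minimal sketch, assuming a made-up edge list and category lookup; the names `BIG_FOUR` and `category_links` are ours, not NodeXL features.

```python
from collections import Counter

# The four dominant categories that are set aside in Figure 3.
BIG_FOUR = {"Web Ads/Analysis", "Technology/Internet", "Marketing", "Content Servers"}


def category_links(edges, category_of, excluded=BIG_FOUR):
    """Count first-to-third-party connections between pairs of categories."""
    links = Counter()
    for src, dst in edges:
        a, b = category_of.get(src), category_of.get(dst)
        if a is None or b is None or a in excluded or b in excluded:
            continue  # skip uncategorized domains and the big four
        links[(a, b)] += 1
    return links


# Made-up example: two education sites embed a job board, one also embeds ads.
edges = [
    ("uni.example", "jobs.example"),
    ("college.example", "jobs.example"),
    ("uni.example", "ads.example"),
]
category_of = {
    "uni.example": "Education",
    "college.example": "Education",
    "jobs.example": "Job Search/Careers",
    "ads.example": "Web Ads/Analysis",
}
links = category_links(edges, category_of)
```

The resulting counts play the role of edge opacity in the figure: the more connections between two categories, the more visible the link.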

(Personal) data flows through the network

Figure 4: The scatter plot of website domains between in-degree and out-degree.

Some websites are pure first parties: they only handle personal data when users are browsing them. Some are pure third-party sites: they take part in personal data sharing but never appear in the address bar. The others are mixtures that carry personal data both in and out.

Figure 4 is a scatter plot of all vertices, with the X-axis indicating in-degree and the Y-axis indicating out-degree. Vertices are color-coded by PageRank. Vertex size is mapped to total degree (in-degree plus out-degree).

By definition, all pure first-party websites lie on the vertical axis and all pure third-party sites lie on the horizontal axis. We can also tell whether a mixed site inside the quadrant is closer to a first party or to a third party.

Theoretically, once personal data flows to a mixed site, it is also ready to flow to another third-party site.
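The three roles follow directly from the two degrees, as this short sketch shows. The function name and the example edges are illustrative assumptions.

```python
def classify_sites(edges):
    """Return {site: (in_degree, out_degree, role)} for a directed edge list.

    Edges point from a first party to a third party, so a site with no
    incoming edges is a pure first party and a site with no outgoing
    edges is a pure third party.
    """
    in_deg, out_deg = {}, {}
    for src, dst in set(edges):
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1

    roles = {}
    for site in set(in_deg) | set(out_deg):
        i, o = in_deg.get(site, 0), out_deg.get(site, 0)
        if i == 0:
            role = "pure first-party"   # lies on the vertical axis
        elif o == 0:
            role = "pure third-party"   # lies on the horizontal axis
        else:
            role = "mixed"              # inside the quadrant
        roles[site] = (i, o, role)
    return roles


# news.example embeds portal.example, which in turn embeds ads.example.
roles = classify_sites([("news.example", "portal.example"),
                        ("portal.example", "ads.example")])
```

In this example portal.example is the mixed site: data that reaches it from news.example can flow on to ads.example.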

In the appendix, we provide another version of the scatter plot that separates first and third-party sites more clearly.

Some patterns are also notable.

Information websites such as news, sports and weather sites are closer to first-party, while social networks and content sharing sites are closer to third-party.

Figure 5: The scatter plot of website domains between in-degree and out-degree. Some information sites and social network sites are labeled.


Online shopping sites are closer to first-party. This may seem surprising, since online shopping sites are the primary sponsors of ads. The reason is that ads are usually displayed by advertisement networks such as Google AdSense, not by the online shopping websites themselves.

Web giants such as Yahoo!, Google and Bing are always in the top right corner, reflecting their complex businesses.


Figure 6: The scatter plot of website domains between in-degree and out-degree. Some online shopping sites and Web “giants” are labeled.

Third-party websites tend to form alliances

We used the Wakita-Tsurumi method to cluster the graph. Only websites with an in-degree of at least 5 were kept, since they clearly act as third-party sites. As shown in Figure 7, the whole graph can be clustered into several groups. Interestingly, each of the web “giants” belongs to its own group, and together they pull in nearly all third parties. Each group is a potential alliance centered on one of the web “giants”. This offers an alternative to grouping third parties by category: members of the same group may share similar content preferences.

Figure 7: The clustering of third-party sites. The size of vertices represents the out-degree.
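To give a feel for how modularity-based clustering forms such groups, the sketch below implements a naive greedy modularity merge. It is not the Wakita-Tsurumi method itself, which adds heuristics to make this kind of greedy merging scale to large graphs, and it is not NodeXL's implementation. It treats the graph as undirected and unweighted, which is a simplifying assumption, and the example graph is made up.

```python
from collections import defaultdict
from itertools import combinations


def greedy_modularity(edges):
    """Repeatedly merge the pair of communities with the largest modularity gain.

    Merging communities a and b changes modularity by
    2 * (links / (2m) - deg_a * deg_b / (2m)^2), where links is the number
    of edges between them and deg_x is the sum of degrees in x.
    Merging stops when no merge has a positive gain.
    """
    pairs = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    m = len(pairs)
    degree = defaultdict(int)
    for u, v in pairs:
        degree[u] += 1
        degree[v] += 1

    communities = [{v} for v in sorted(degree)]
    while len(communities) > 1:
        best_gain, best_pair = 0.0, None
        for i, j in combinations(range(len(communities)), 2):
            a, b = communities[i], communities[j]
            links = sum(1 for u, v in pairs
                        if (u in a and v in b) or (u in b and v in a))
            if links == 0:
                continue  # only connected communities are merge candidates
            deg_a = sum(degree[v] for v in a)
            deg_b = sum(degree[v] for v in b)
            gain = 2 * (links / (2 * m) - deg_a * deg_b / (2 * m) ** 2)
            if gain > best_gain:
                best_gain, best_pair = gain, (i, j)
        if best_pair is None:
            break
        i, j = best_pair
        communities[i] |= communities[j]
        del communities[j]
    return communities


# Two triangles joined by a single edge (c-d).
clusters = greedy_modularity([
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
])
```

On this small graph the two triangles end up as separate clusters, because merging them across the single bridging edge would lower modularity.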

Critique of NodeXL

NodeXL is an efficient graph visualization tool. It is equipped with many automatic layouts such as Fruchterman-Reingold, Harel-Koren, Circle and so on. Organizing groups into a treemap is a terrific feature that makes the layout clean and attractive. We also benefited from the convenience of computing many useful network metrics, which helped us dig interesting patterns out of the graphs.

However, in many cases NodeXL cannot accomplish tasks as expected and breaks some of Prof. Shneiderman’s rules of interface design. For example:

Rule 6. Permit easy reversal of actions.

• NodeXL does not offer UNDO! Although this is due to restrictions imposed by Excel [2], it makes it very inconvenient for users to manipulate data and try out features in NodeXL.

Rule 1. Strive for consistency.

• When we collapse groups, the groups act as vertices. However, many features provided for ordinary vertices are not available to groups, such as visibility, tooltip, vertex size, specifying the X and Y axes, in-degree and out-degree, …

• Filters (i.e. visibility in Autofill, and Dynamic Filters) support limited or even no logic operations (AND, OR, etc.). With Autofill we can restrict only one attribute, and with Dynamic Filters we can only perform AND.

• The graph layout shown in Excel, the image copied to the clipboard and the image saved to a file all differ from one another.

Other suggestions:

• All items in Autofill accept only one attribute. It is fine for features such as size and color, but is not a good idea for tooltip and visibility (as mentioned above).

• When laying out the graph as a circle, NodeXL always displays an imperfect circle, as shown in Figure 3. We suggest that this be corrected.

• Filters can only filter continuous values. We suggest that they also support discrete values (e.g. category).

• NodeXL stops responding easily. Since operations cannot be undone, it is difficult for users to debug.

Conclusion

In this project, we used NodeXL to analyze a third-party web tracking dataset that combines the web browsing histories of 33 users. This report presents four key insights.

By laying out the distribution of third-party sites over categories in a treemap, we found that Web Ads and APIs, Marketing and Content Servers are the major third-party sites that are tracking us. Among them, Ads and Marketing sites are more likely to take over users’ privacy by using cookies.

Besides those big four categories, many websites are also enthusiastic about embedding Social Networking posts and Audio/Video Clips. Excluding popular third-party sites, we discovered that many websites only share personal data with related third parties.

We further found that a website may be a first-party and a third-party website simultaneously, which creates potential personal data flows. These mixed websites still lean toward either the first-party or the third-party role. Web “giants” such as Yahoo! and Google have comprehensive businesses.

Web “giants” also draw in most third-party sites, each forming a potential alliance.

Finally, we pointed out that NodeXL is a very efficient graph visualization tool that provides plenty of automatic layouts and network metrics. However, it breaks some rules of interface design. We proposed suggestions to improve it.

References

[1] http://sitereview.bluecoat.com/sitereview.jsp
[2] https://nodexl.codeplex.com/discussions/211379

Appendix: Separating First and Third-Party Sites

Figure 8: The scatter plot of website domains that separates first and third-party vertically.

Figure 8 is another version of the scatter plot that sets first and third-party sites apart. The X coordinate is calculated as arctan(in-degree/out-degree) and ranges from 0 to π/2. 0 is mapped to pure first-party while π/2 is mapped to pure third-party.

The vertical axis makes it possible to rank popular sites with degree larger than or equal to 10.
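The X coordinate can be computed as sketched below. The arctangent of a non-negative ratio lies between 0 and π/2, and using `math.atan2` avoids a division by zero for pure third-party sites, whose out-degree is 0. The function name `party_angle` is ours, and raising an error for isolated vertices is a design choice of this sketch.

```python
import math


def party_angle(in_degree, out_degree):
    """Return arctan(in_degree / out_degree) as an angle in [0, pi/2].

    0 corresponds to a pure first party (no incoming edges) and pi/2 to a
    pure third party (no outgoing edges).
    """
    if in_degree == 0 and out_degree == 0:
        raise ValueError("an isolated vertex has no defined angle")
    # atan2(y, x) equals arctan(y / x) for x > 0 and pi/2 for x == 0, y > 0.
    return math.atan2(in_degree, out_degree)


first_party = party_angle(0, 7)   # pure first party
balanced = party_angle(3, 3)      # equal in-degree and out-degree
third_party = party_angle(4, 0)   # pure third party
```

A site with equal in-degree and out-degree lands in the middle of the axis, at π/4.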
