Dark Web Collection, Search, and Analysis
Dr. Hsinchun Chen
Director, Artificial Intelligence Lab
University of Arizona
http://ai.arizona.edu
Acknowledgements: NSF CRI; NSF EXP-LA; DTRA, DOD CTFP, NPS; (AFRL WMD, CIA, FBI)
Leaderless Jihad and the Internet
• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• Before 2004: face-to-face interactions; participants around 26 years old
• After 2004: interactions on the Internet (Madrid, Dutch Hofstad, Cairo, Toronto…; Irhabi007 and Muntada); participants around 20 years old
Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
Data, text, and web mining: from COPLINK to Dark Web
Intelligence and Security Informatics
Newsweek Magazine, March 3, 2003
A computerized way for police to coordinate crime databases
Washington Post, March 6, 2008
National dragnet is a click away! COPLINK in use in 1,600 police agencies in US!
ABC News April 15, 2003
Google for Cops: Coplink software helps police search for cyber clues to bust criminals
The New York Times, November 2, 2002
COPLINK assisted in DC sniper investigation
COPLINK project in the press
Dark Web Overview
Dark Web: Terrorists’ and cyber criminals’ use of the Internet
Collection: Web sites, forums, blogs, YouTube, Second Life
Analysis and Visualization: Link and content analysis; Web metrics analysis; Authorship analysis; Sentiment analysis; Multimedia analysis
Our collection is about 2 TB in size, with close to 500M pages, files, and messages from more than 10,000 Dark Web sites.
Project Seeks to Track Terror Web Posts, 11/11/2007
Researchers say tool could trace online posts to terrorists, 11/11/2007
Mathematicians Work to Help Track Terrorist Activity, 9/14/2007
Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007
Dark Web project in the press
Dark Web Forum Crawler System
Middle Eastern Web Collection File Types
Dynamic files (e.g., PHP, ASP, JSP) are widely used in extremist Web sites, indicating a high level of technical sophistication.
Multimedia files (videos, images) are also heavily used in extremist Web sites.
| Terrorist Collection | # of Files | Volume (Bytes) |
|---|---|---|
| Total | 222,687 | 12,362,050,865 |
| Indexable Files | 179,223 | 4,854,971,043 |
| - HTML Files | 44,334 | 1,137,725,685 |
| - Word Files | 278 | 16,371,586 |
| - PDF Files | 3,145 | 542,061,545 |
| - Dynamic Files | 130,972 | 3,106,537,495 |
| - Text Files | 390 | 45,982,886 |
| - Powerpoint Files | 6 | 6,087,168 |
| - XML Files | 98 | 204,678 |
| Multimedia Files | 35,164 | 5,915,442,276 |
| - Image Files | 31,691 | 525,986,847 |
| - Audio Files | 2,554 | 3,750,390,404 |
| - Video Files | 919 | 1,230,046,468 |
| Archive Files | 1,281 | 483,138,149 |
| Non-Standard Files | 7,019 | 1,108,499,397 |
[Figure: Number of Files Distribution (Arabic). Pie chart with segment labels 80%, 16%, 0%, 4%; legend: Indexable Files, Multimedia Files, Archive Files, Non-Standard Files]
[Figure: Volume Distribution (Arabic). Pie chart with segment labels 39%, 48%, 4%, 9%; legend: Indexable Files, Multimedia Files, Archive Files, Non-Standard Files]
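The four category totals in the file-type table can be converted into percentage shares, which is the kind of split the two distribution charts report. The sketch below is only a worked calculation on the table's figures; the dictionary layout and function name are illustrative assumptions, not part of the original system.

```python
# Category totals copied from the file-type table: (number of files, volume in bytes).
CATEGORIES = {
    "Indexable Files": (179_223, 4_854_971_043),
    "Multimedia Files": (35_164, 5_915_442_276),
    "Archive Files": (1_281, 483_138_149),
    "Non-Standard Files": (7_019, 1_108_499_397),
}


def shares(column):
    """Return each category's percentage of the column total (0 = files, 1 = bytes)."""
    total = sum(values[column] for values in CATEGORIES.values())
    return {name: 100.0 * values[column] / total
            for name, values in CATEGORIES.items()}


file_share = shares(0)
volume_share = shares(1)

for name in CATEGORIES:
    print(f"{name:20s} files {file_share[name]:5.1f}%   volume {volume_share[name]:5.1f}%")
```

Multimedia accounts for a small share of the file count but the largest share of the volume, which is consistent with the remark that multimedia is heavily used.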
CyberGate System: Analysis & Visualization
Results: Intensity Relationship
[Figure: two scatter plots of violence scores (y-axis, 0 to 400) against hate scores (x-axis, 0 to 400), one for U.S. forum scores and one for Middle Eastern forum scores]
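Measuring Hate and Violence: US vs. Middle Eastern Groups
| Statistic | U.S. | Middle Eastern |
|---|---|---|
| N | 4676 | 3349 |
| beta (slope) | 0.079 | 0.682 |
| t-Stat | 21.354 | 48.265 |
| P-Value | 0.000 | 0.000 |
| R-Square | 0.076 | 0.486 |
Hate and violence scores are positively and significantly correlated in both groups, and far more strongly for Middle Eastern groups (R-Square 0.486 vs. 0.076).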
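Statistics of this kind (slope, t-statistic, R-Square) are what a simple linear regression of violence scores on hate scores produces. The sketch below assumes ordinary least squares with a single predictor; the slides do not state the exact model used, and the sample scores are made up for illustration.

```python
import math


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    syy = sum((b - mean_y) ** 2 for b in y)

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual sum of squares, then R-squared and the slope's standard error.
    sse = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    r_square = 1.0 - sse / syy
    std_err = math.sqrt(sse / (n - 2) / sxx)
    t_stat = slope / std_err if std_err > 0 else math.inf

    return {"n": n, "slope": slope, "t_stat": t_stat, "r_square": r_square}


# Hypothetical per-message scores, not taken from the study.
hate = [0, 50, 100, 150, 200, 250]
violence = [10, 40, 60, 120, 130, 180]
fit = simple_ols(hate, violence)
print(fit)
```

A steeper slope with a higher R-Square means violence scores rise more predictably with hate scores, which is the contrast the table draws between the two groups.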
Number of Posts By Month: Al-Firdaws vs. Montada
Al-Firdaws has consistently had 2,500-3,000 posts per month since the second half of 2006.
Montada was very active in 2002 and 2005.
[Figure: Al-Firdaws Posts By Month. Chart of # posts (0 to 3,500) per month, Jan-05 to Jul-07]
[Figure: Montada Posts By Month. Chart of # posts (0 to 25,000) per month, Sep-00 to May-07]
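Monthly post counts of this kind can be derived directly from message timestamps. The sketch below assumes each post carries a date; the function name and sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def posts_by_month(post_dates):
    """Count posts per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in post_dates)
    return dict(sorted(counts.items()))


# Hypothetical post dates for illustration only.
sample_dates = [date(2006, 7, 3), date(2006, 7, 19), date(2006, 8, 1), date(2007, 1, 5)]
monthly = posts_by_month(sample_dates)
for (year, month), count in monthly.items():
    print(f"{year}-{month:02d}: {count}")
```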
Affect Intensities: Al-Firdaws vs. Montada
[Figure panels: Al-Firdaws - Anger; Montada - Anger; Al-Firdaws - Violence; Montada - Violence]
Al-Firdaws has considerably higher violence intensity, and also greater anger intensity, than Montada.
Arabic Writeprint Feature Set
Feature Set (418)
• Lexical (79)
    • Char-Based (48): Char-Level (4), Letter Frequency (35), Special Char. (9)
    • Word-Based (31): Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)
• Syntactic (262): Punctuation (12), Function Words (200), Word Roots (50)
• Structural (62)
    • Word Structure (14): Message Level (5), Paragraph Level (6), Contact Information (3)
    • Technical Structure (48): Font Color (29), Font Size (8), Embedded Images (4), Hyperlinks (7)
• Content Specific (15): Race/Nationality (11), Violence (4)
Arabic Feature Extraction Component
[Figure: Arabic feature extraction pipeline in four numbered steps. Components shown: incoming message, elongation filter (Count +1, Degree +5), filtered message, root dictionary, root clustering algorithm, similarity scores (SC), max(SC)+1, generic feature extractor (all remaining feature values), feature set]
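The elongation filter can be illustrated with a short sketch. It assumes elongation is written with the Arabic tatweel character (U+0640), that "Count" is the number of elongated words, and that "Degree" is the total number of elongation characters removed; these readings of the diagram labels are assumptions, not details confirmed by the slides.

```python
import re

TATWEEL = "\u0640"  # Arabic elongation (kashida) character
_ELONGATION_RUN = re.compile(TATWEEL + "+")


def elongation_filter(message):
    """Strip elongation and return (filtered_message, count, degree).

    count  = number of words containing elongation (assumed meaning)
    degree = total number of elongation characters removed (assumed meaning)
    """
    count = 0
    degree = 0
    for word in message.split():
        runs = _ELONGATION_RUN.findall(word)
        if runs:
            count += 1
            degree += sum(len(run) for run in runs)
    filtered = _ELONGATION_RUN.sub("", message)
    return filtered, count, degree


# One word stretched with five tatweel characters: count goes up by 1, degree by 5.
stretched = "\u0633" + TATWEEL * 5 + "\u0644\u0627\u0645"
filtered, count, degree = elongation_filter("abc " + stretched)
print(filtered, count, degree)
```

Filtering before the remaining features are extracted keeps stretched spellings from being treated as distinct words.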
Sliding Window + PCA: Turning Text into Dots
1. Compute eigenvectors for the 2 principal components of the feature group (example values from the slide: 0.533 0.956 -0.541 0.445 0.034 0.089 0.653 0.456 0.975 -0.085 0.143 -0.381).
2. Extract feature usage vectors Z from the message text, one per sliding window (e.g., 1,0,0,2,1,2 and 0,1,3,0,1,0).
3. Transform into 2-dimensional space: x = Z·e_x and y = Z·e_y, where e_x and e_y are the two eigenvectors.
Repeat steps 2 and 3 for each window.
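The sliding-window projection can be sketched in a few functions. This is a minimal illustration, assuming term counts as the feature usage vectors and a plain power iteration to obtain the two leading eigenvectors of the covariance matrix; the window size, feature list, and sample text are made up, and the original system's feature groups are far richer.

```python
import math


def sliding_windows(tokens, size, step=1):
    """Return consecutive token windows of the given size."""
    last_start = max(len(tokens) - size, 0)
    return [tokens[i:i + size] for i in range(0, last_start + 1, step)]


def usage_vector(window, features):
    """Count how often each feature term occurs in one window."""
    return [float(window.count(f)) for f in features]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def covariance(rows):
    """Sample covariance matrix of the feature usage vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvectors(matrix, k=2, iterations=1000):
    """Power iteration with deflation for the k leading eigenvectors."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    vectors = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed start keeps runs repeatable
        for _ in range(iterations):
            w = [dot(row, v) for row in m]
            norm = math.sqrt(dot(w, w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        eigenvalue = dot(v, [dot(row, v) for row in m])
        vectors.append(v)
        # Deflate: remove the component just found before searching for the next.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
             for i in range(d)]
    return vectors


def writeprint_points(tokens, features, size=5, step=1):
    """Project each window's usage vector Z onto the two eigenvectors."""
    rows = [usage_vector(w, features) for w in sliding_windows(tokens, size, step)]
    e_x, e_y = top_eigenvectors(covariance(rows), k=2)
    return [(dot(z, e_x), dot(z, e_y)) for z in rows], (e_x, e_y)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
points, (e_x, e_y) = writeprint_points(tokens, ["the", "on", "and", "sat"])
for x, y in points:
    print(f"{x:7.3f} {y:7.3f}")
```

Each window becomes one dot, so a longer text yields a cloud of dots whose shape reflects how the author's feature usage varies.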
[Figure: author writeprints (Author A, 10 messages; Author B, 10 messages) alongside anonymous messages]
ClearGuidance.com (Toronto Plot): Participant Network Visualization
ClearGuidance Forum “Experts”
The series of overlapping circular patterns for bag-of-word features indicates that the author’s discussion revolves around a related set of topics.
Bag-of-words features are predominantly related to religious topics, e.g., Adam, angels.
Many large red blots indicate the presence of features unique to this author, e.g., Adam, angels.
This author was later arrested as a major culprit in the Toronto terror plot (“Soldier of God”). He uses many violent affect terms.
Radar chart showing violent affect feature usages.
Selected feature is use of the term “jihad”, which is the highest in the forum.
Selected feature (i.e., “jihad”) is shown in red.
This author constantly attempts to justify acts of violence and terrorism: “…there are so many paid sheikhs stuck in this life….no point going to them for fatwas…personally speaking…cuz they don’t even agree with jihad in the first place”
Dark Web Forum Tools
Information contained within Dark Web forums represents a significant source of knowledge for security and intelligence organizations.
We have developed tools supporting the large-scale collection, search, and analysis of Dark Web forums, specifically addressing the needs of security analysts.
• Collection: AZ Forum Spider
• Search: AZ Forum Portal, AZ Sentiment Analyzer
• Analysis: AZ CyberGate Text Analyzer
AZ Forum Spider
Automated collection of forum communications; weekly update
Proxy servers and parameters
Site map, URL ordering, and forum extraction
Incremental spider
Collection visualization
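Incremental collection with URL ordering can be sketched as follows. This is a toy illustration, not the AZ Forum Spider itself: it assumes index pages (boards) are always re-read to discover new threads, while thread pages collected in an earlier run are skipped. The `fetch_links` callback, the `is_index` predicate, and the URL layout are hypothetical.

```python
from collections import deque


def incremental_crawl(seed, fetch_links, collected, is_index):
    """Return thread URLs that were not collected in earlier runs.

    seed        -- starting URL (e.g., the forum's front page)
    fetch_links -- callable returning the outgoing links of a URL (hypothetical)
    collected   -- set of thread URLs gathered by previous runs
    is_index    -- callable telling index pages apart from thread pages
    """
    frontier = deque([seed])
    visited = {seed}
    new_pages = []
    while frontier:
        url = frontier.popleft()
        if not is_index(url):
            new_pages.append(url)  # a thread page we have not stored before
            continue
        # URL ordering: thread links first, then further index pages, alphabetically.
        links = sorted(set(fetch_links(url)), key=lambda u: (is_index(u), u))
        for link in links:
            if link in visited:
                continue
            visited.add(link)
            if is_index(link) or link not in collected:
                frontier.append(link)
    return new_pages


# A tiny in-memory "forum" standing in for real HTTP requests.
site = {
    "/": ["/board/1", "/board/2"],
    "/board/1": ["/thread/a", "/thread/b"],
    "/board/2": ["/thread/c", "/thread/a"],
}
new_threads = incremental_crawl(
    seed="/",
    fetch_links=lambda url: site.get(url, []),
    collected={"/thread/a"},
    is_index=lambda url: url == "/" or url.startswith("/board/"),
)
print(new_threads)
```

Passing the previously collected set in as an argument keeps each weekly update limited to pages that are new since the last run.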
Collection – AZ Forum Spider
Forum List
Spidering Status
Collection Statistics
Spidering Profile
AZ Forum Portal
Current version: 13M messages (340K members) across 29 major Jihadi forums in English, Arabic, French, German and Russian
Forum analysis by forum, thread, member, time period, or topic
Social network analysis and visualization
Google Translation
Dark Web Forum Portal
Forum Portal Data Set
| Name | Language | Time Span | Number of Members | Number of Threads | Number of Messages |
|---|---|---|---|---|---|
| Al-Boraq | Arabic | 01/08/2006 - 01/02/2010 | 3,503 | 52,322 | 223,648 |
| Al-Fallujah | Arabic | 09/19/2006 - 01/02/2010 | 5,853 | 74,899 | 547,712 |
| Al-Firdaws* | Arabic | 01/02/2005 - 12/06/2007 | 2,187 | 9,359 | 39,715 |
| Midad al-Suyuf | Arabic | 03/18/2006 - 01/02/2010 | 1,597 | 11,232 | 38,382 |
| Alokab | Arabic | 04/08/2005 - 12/31/2009 | 1,547 | 8,096 | 55,947 |
| Al-Qimmah | Arabic | 11/23/2007 - 01/02/2010 | 287 | 12,097 | 23,709 |
| Alsayra | Arabic | 04/05/2001 - 12/31/2009 | 66,705 | 147,598 | 1,227,207 |
| Ansar | Arabic | 11/07/2008 - 01/02/2010 | 1,224 | 12,041 | 46,928 |
| At-tahadi | Arabic | 04/14/2008 - 01/02/2010 | 313 | 2,599 | 5,406 |
| Hanin Net | Arabic | 11/27/2006 - 01/12/2010 | 2,837 | 96,239 | 821,478 |
| Hawaa World | Arabic | 01/01/2001 - 01/02/2010 | 113,579 | 40,501 | 2,251,553 |
| Hadramout | Arabic | 11/25/2000 - 12/29/2009 | 29,491 | 151,694 | 1,552,227 |
| Ma’arik | Arabic | 07/29/2007 - 01/03/2010 | 1,880 | 15,288 | 57,047 |
| Al-Mujahidin | Arabic | 11/09/2007 - 01/02/2010 | 4,259 | 29,980 | 140,930 |
| Montada | Arabic | 09/25/2000 - 12/29/2009 | 40,291 | 120,181 | 1,412,028 |
Data Set (Cont’d)
| Name | Language | Time Span | Number of Members | Number of Threads | Number of Messages |
|---|---|---|---|---|---|
| Ana al-Muslim | Arabic | 10/08/1985 - 11/26/2009 | 12,215 | 179,791 | 1,343,370 |
| Shumukh | Arabic | 03/21/2007 - 01/02/2010 | 3,938 | 46,666 | 289,201 |
| Ansar | English | 12/08/2008 - 01/02/2010 | 377 | 11,133 | 29,056 |
| Gawaher | English | 10/24/2004 - 01/01/2010 | 6,790 | 210,656 | 569,709 |
| Islamic Awakening | English | 04/28/2004 - 12/31/2009 | 2,361 | 25,112 | 116,009 |
| Islamic Network* | English | 06/09/2004 - 05/07/2008 | 1,573 | 11,974 | 87,314 |
| Islamic Web-Community | English | 11/14/2000 - 12/31/2009 | 745 | 6,262 | 24,850 |
| Turn To Islam | English | 06/02/2006 - 01/01/2010 | 9,926 | 38,702 | 308,970 |
| Ummah | English | 04/01/2002 - 12/31/2009 | 14,349 | 71,218 | 1,192,583 |
| Al Minha Dj | French | 06/01/2008 - 01/04/2010 | 313 | 2,007 | 6,421 |
| Forums d’aslama | French | 10/06/2004 - 01/03/2010 | 2,665 | 20,468 | 131,559 |
| Al-Mourabitoune | French | 05/05/2002 - 03/27/2009 | 3,198 | 7,905 | 72,140 |
| Ansar | German | 02/27/2009 - 01/02/2010 | 62 | 726 | 1,645 |
| KavkazChat | Russian | 03/21/2003 - 01/03/2010 | 5,634 | 6,144 | 558,042 |
| Total | | | 339,699 | 1,422,890 | 13,174,786 |
Forum Statistics Summary (Cont’d)
Cross Forum Search
Single Forum Search & Translation
Search: bomb, iraq
Translations of thread titles
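Keyword search across one or several forums can be illustrated with a small inverted index. The sketch below assumes comma-separated query terms that must all appear in a message (AND semantics); the message records, forum names, and function names are hypothetical, and the portal's actual search engine is not described in the slides.

```python
import re
from collections import defaultdict


def build_index(messages):
    """Map each lowercase term to the positions of the messages containing it."""
    index = defaultdict(set)
    for position, message in enumerate(messages):
        for term in re.findall(r"\w+", message["text"].lower()):
            index[term].add(position)
    return index


def search(index, messages, query, forum=None):
    """Return messages containing every query term, optionally within one forum."""
    terms = [t.strip().lower() for t in query.split(",") if t.strip()]
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return [messages[i] for i in sorted(hits)
            if forum is None or messages[i]["forum"] == forum]


# Hypothetical records for illustration only.
messages = [
    {"forum": "forum_a", "text": "News report on a bomb blast in Iraq"},
    {"forum": "forum_b", "text": "Discussion about Iraq elections"},
    {"forum": "forum_b", "text": "Bomb disposal units deployed in Iraq"},
]
index = build_index(messages)
cross_forum = search(index, messages, "bomb, iraq")
single_forum = search(index, messages, "bomb, iraq", forum="forum_b")
print(len(cross_forum), len(single_forum))
```

Leaving `forum` unset gives a cross-forum search, while setting it restricts the same query to a single forum.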
SNA Reply Network
1. Bint ul Islam (290 postings)
2. Iloveislam (239 postings)
3. Abuhannah (173 postings)
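A ranking by number of postings, together with the who-replies-to-whom links a social network view needs, can be derived from thread data. The sketch below assumes each post is recorded as an (author, replied-to author) pair; the member names and record format are made up for illustration.

```python
from collections import Counter


def reply_network(posts):
    """Build posting counts and weighted reply edges.

    posts -- iterable of (author, replied_to) pairs; replied_to is None
             for a post that starts a thread.
    """
    posts = list(posts)
    postings = Counter(author for author, _ in posts)
    edges = Counter((author, target) for author, target in posts
                    if target is not None and target != author)
    return postings, edges


# Hypothetical members and replies.
posts = [
    ("member_a", None),
    ("member_b", "member_a"),
    ("member_a", "member_b"),
    ("member_c", "member_a"),
    ("member_a", "member_c"),
    ("member_b", "member_a"),
]
postings, edges = reply_network(posts)
for rank, (member, count) in enumerate(postings.most_common(3), start=1):
    print(f"{rank}. {member} ({count} postings)")
```

The edge weights record how often one member replied to another, which is the input a network visualization would draw from.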
AZ Sentiment Analyzer
Portal for the sentiment and affect analysis of forums, measuring member opinions and emotions
Characterizes the affects conveyed in forum text, and the underlying sentiment polarity
By forum, thread, member, or time period
Keyword search
Search – AZ Sentiment Analyzer
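Sentiment polarity grouped by forum, thread, member, or time period can be sketched with a simple lexicon lookup. This is only an assumption-laden illustration: the slides do not describe the analyzer's actual method, and the tiny lexicon, field names, and messages below are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical toy lexicon: positive scores for positive terms, negative for negative terms.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}


def polarity(text):
    """Sum the lexicon scores of the words in a message."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(word, 0.0) for word in words)


def mean_polarity_by(messages, key):
    """Average message polarity per group (e.g., key='member' or key='forum')."""
    groups = defaultdict(list)
    for message in messages:
        groups[message[key]].append(polarity(message["text"]))
    return {group: sum(scores) / len(scores) for group, scores in groups.items()}


# Hypothetical messages for illustration only.
sample = [
    {"member": "a", "forum": "f1", "text": "Great day"},
    {"member": "a", "forum": "f1", "text": "Bad news"},
    {"member": "b", "forum": "f2", "text": "Terrible"},
]
by_member = mean_polarity_by(sample, "member")
by_forum = mean_polarity_by(sample, "forum")
print(by_member, by_forum)
```

Changing the `key` argument switches the grouping without touching the scoring, mirroring the by-forum, by-thread, by-member, or by-time-period views.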
AZ CyberGate Text Analyzer
Comprehensive system for the analysis and visualization of forum communications
Shows all text features
Utilizes Writeprint and Ink Blot techniques in text analysis
Incorporates rich visualization based upon multi-dimensional scaling and parallel coordinates
Analysis – AZ CyberGate Text Analyzer
Conclusion
The web offers extremists a rich medium for recruiting, communication, and radicalization.
Information contained within Dark Web sites, forums, blogs, multimedia, etc. represents a significant source of knowledge for security and intelligence organizations.
A computational approach to Dark Web research spans collection, search, and analysis.
Dark Web research could potentially assist in terrorism research and intelligence analysis.
Dark Web Forum Portal available now!!!
Dark Web Collection, Search, and Analysis
For more information:
Dr. Hsinchun Chen, University of Arizona
http://ai.arizona.edu