Extracting information from darknet market advertisements ...
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 786687
Extracting information from darknet market advertisements and forums
• Sven Schlarb, Mina Schütz, Clemens Heistracher: Austrian Institute of Technology, Competence Unit Data Science & Artificial Intelligence
• Faisal Ghaffar: IBM Innovation Exchange, Cognitive Computing Group
SECURWARE 2020 November 21, 2020 to November 25, 2020 - Valencia, Spain
Overview
Named entity recognition & Relationship extraction
Overview
Manual annotation
Model creation
Working directories
Named Entity Recognition (CKNER)
Relationship Extraction (RELEXT)
Working directories
Storage/Search API
Process Flow
Process Flow
Harvesting → Scraping → Annotation → Model Creation → Information Extraction
Additional diagram labels: Replay, Label suggestions
Annotation interface
(RecogitoJS integrated)
Annotation interface – Loading data
Load grams weapons/drugs dataset
Navigate through annotated postings
Choose vocabulary for annotation
Create regex pattern (e.g. Regex101.com)
Get named entity suggestions
Regular expression batch labelling
Multi-user
Multi-user/copkit1
Multi-user/admin
Definition-based batch labelling
• Useful for initial labelling (before user annotation starts)
• Definition file in JSON format required
• Regex terms can be used instead of fixed strings to match tokens
• Filename convention: prelabel_definitions_*.json
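The slides do not show the schema of the definition file, so the following is only a minimal sketch under assumed field names (`label`, `term`, `regex`), all of them hypothetical. It illustrates how fixed strings and regex terms from a `prelabel_definitions_*.json` file could be turned into character-offset label suggestions.

```python
import json
import re

# Hypothetical content of a file such as prelabel_definitions_weapons.json.
# The real schema is not given in the slides; "label", "term" and "regex"
# are assumed field names used for illustration only.
DEFINITIONS_JSON = r"""
[
  {"label": "WEAPON",  "term": "glock", "regex": false},
  {"label": "CALIBER", "term": "\\b\\d+(\\.\\d+)?\\s?mm\\b", "regex": true}
]
"""


def compile_definitions(raw_json):
    """Turn each definition into a (label, compiled pattern) pair."""
    compiled = []
    for entry in json.loads(raw_json):
        # Fixed strings are escaped so that they match literally.
        source = entry["term"] if entry["regex"] else re.escape(entry["term"])
        compiled.append((entry["label"], re.compile(source, re.IGNORECASE)))
    return compiled


def prelabel(text, definitions):
    """Return sorted (start, end, label, matched_text) suggestions."""
    spans = []
    for label, pattern in definitions:
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label, match.group()))
    return sorted(spans)


definitions = compile_definitions(DEFINITIONS_JSON)
suggestions = prelabel("Glock 17, 9mm, barely used", definitions)
```

In a real setup the JSON text would be read from every file matching the naming convention (for example via `glob.glob("prelabel_definitions_*.json")`) rather than from an inline string.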
Definition-based batch labelling
Training/test data
Two basic functions for handling train/test data are offered:
1.) Merge test set into training set (Data Set: Training + Test → Training)
→ Recommended before starting user annotation
2.) Split training set into training and test set (Training → Training + Test)
→ Recommended before starting model creation
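The two functions can be sketched as follows. This is an illustration only: the record format, the 20% test fraction, and the fixed seed are assumptions, not settings taken from the tool.

```python
import random


def merge_test_into_train(train, test):
    """Return one training list holding all records; the test set becomes empty."""
    return train + test, []


def split_train(train, test_fraction=0.2, seed=42):
    """Shuffle a copy of the training records and split off a test set."""
    records = list(train)
    random.Random(seed).shuffle(records)
    n_test = int(len(records) * test_fraction)
    return records[n_test:], records[:n_test]


train = [f"posting-{i}" for i in range(8)]
test = ["posting-8", "posting-9"]

# Before user annotation: work on one merged set.
train, test = merge_test_into_train(train, test)

# Before model creation: hold out a fraction for evaluation.
train, test = split_train(train, test_fraction=0.2)
```

Merging first means that annotators see every posting; splitting again afterwards gives a held-out set that reflects the final annotations.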
Merge test set into training set
Split training set into training and test set
Restore backups
Persist annotations
• Update training/test data files with user annotations (user annotations are removed afterwards)
Technical background:
Named entity recognition
Preparation: Text classification
Heistracher, C. & Schlarb, S.: Machine Learning Techniques for the Classification of Product Descriptions from Darknet Marketplaces. Proceedings of the 11th International Conference on Applied Informatics, 2020.
• Goal: evaluate whether pre-trained embeddings, when transferred to the darknet domain, improve classification performance over simple vector space models such as TF-IDF.
• Best performance was achieved with the TensorFlow Universal Sentence Encoder and a linear SVM.
Named entity recognition with spaCy
• spaCy is an open-source library for NLP.
• spaCy’s standard models can be used as a basis to build domain-specific models.
• For example, spaCy’s “en_core_web_lg” is a convolutional neural network trained for multiple tasks on the OntoNotes 5 dataset.
• 18 entity types are available in the pretrained model.
• Words are represented as GloVe vectors that were trained on the Common Crawl dataset.
spaCy Named Entity Recognition - Evaluation
• The dataset used for the evaluation was a subset of (Gwern2015) that contained strings from a list of weapon-related terms.
• This subset was manually categorized and the categories were used as labels related to weapons.
• The subset of the data contained 1580 product descriptions including: 1582 Drugs, 769 Weapons, 319 ebooks, 96 3D-printing.
Precision 76.97
Recall 70.88
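The slide does not say how matches were counted. As a minimal sketch, assuming exact matching of (start, end, label) spans, precision and recall can be computed as below. The toy spans are invented for illustration; the last line only derives the F1 score implied by the two reported figures (about 73.8).

```python
def precision_recall(gold, predicted):
    """Entity-level precision and recall over sets of (start, end, label) spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy spans, invented for illustration.
gold = {(0, 5, "WEAPON"), (10, 13, "CALIBER"), (20, 27, "DRUG")}
predicted = {(0, 5, "WEAPON"), (10, 13, "CALIBER"), (30, 34, "DRUG"), (40, 44, "WEAPON")}
p, r = precision_recall(gold, predicted)  # 2 of 4 predictions, 2 of 3 gold spans

# F1 implied by the precision and recall reported on the slide.
reported_f1 = f1_score(76.97, 70.88)
```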
Technical background:
Relation extraction
Relation extraction
‐ Focus on Transformer models
  ‐ Can be pre-trained on smaller datasets
  ‐ Model can be used for NLP downstream tasks
  ‐ Several papers already published with BERT, ALBERT, RoBERTa, GPT, ...
  ‐ Mostly trained and evaluated on the SemEval 2010 Task 8 & TACRED datasets
‐ Various approaches
  ‐ Most commonly used: sentence-level
  ‐ Semantic Role Labeling
  ‐ Joint Approaches (NER + RE)
  ‐ Multi-Relation-Extraction
Relation extraction – approach
‐ Entities are already annotated
‐ No annotated ground truth for relations exists for our dataset
  1. Unsupervised relation extraction
  2. Using an already labeled dataset with pre-defined labels such as SemEval 2010 Task 8
‐ Two models fit our approach:
  ‐ R-BERT
    ‐ BERT, RoBERTa and ALBERT are available
    ‐ Sentence-level
  ‐ OpenNRE
    ‐ BERT
    ‐ Sentence-, bag-, document-level and few-shot
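Because the entities are already annotated, preparing input for a sentence-level model is mostly a matter of marking the two candidate entities. The sketch below follows the marker convention described in the R-BERT paper (`$` around the first entity, `#` around the second). The function name, the span format, and the example sentence are illustrative assumptions, not part of the project's code.

```python
def mark_entities(sentence, first, second):
    """Wrap two non-overlapping character spans in R-BERT style markers.

    `first` and `second` are (start, end) offsets; the first entity is
    surrounded by "$", the second by "#".
    """
    spans = [(first, "$"), (second, "#")]
    # Insert from right to left so that earlier offsets stay valid.
    for (start, end), marker in sorted(spans, reverse=True):
        sentence = (
            sentence[:start]
            + f"{marker} {sentence[start:end]} {marker}"
            + sentence[end:]
        )
    return sentence


text = "Seller ships ebooks from Germany"
marked = mark_entities(text, first=(13, 19), second=(25, 32))
print(marked)
```

The marked sentence is then passed to the classifier together with a relation label, for example one taken from the SemEval 2010 Task 8 label set.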
Relation extraction – R-BERT
Relation extraction – OpenNRE
SOURCES
• Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. OpenNRE: An open and extensible toolkit for neural relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 169–174, Hong Kong, China, November. Association for Computational Linguistics. https://github.com/thunlp/OpenNRE
• Shanchan Wu and Yifan He. 2019. Enriching pre-trained language model with entity information for relation classification. https://github.com/monologg/R-BERT
Thank you!
COPKIT Webinar, 16 September 2020 at 11:00 AM CEST
IBM’s Moment Recognizer (MoRec)
Faisal Ghaffar (IBM), Brian Maguire (IBM)
Wednesday 19th August 2020
2. Component Overview
• As an (LEA) end-user, I would like to extract knowledge from DNM forums to
- understand thread discussions
- determine topics of discussion
- determine trends in topics over time
- identify experts on specific topic(s)
Events timeline
Given a thread, extract who said what over a daily timeline
MoRec
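The slides do not describe how MoRec builds the timeline. As a rough sketch of the grouping step alone, posts of a thread can be bucketed by calendar day; the field names (`author`, `timestamp`, `text`) and the sample thread are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def daily_timeline(posts):
    """Group posts by calendar day: {date: [(author, text), ...]}."""
    timeline = defaultdict(list)
    # ISO 8601 timestamps in one format sort correctly as strings.
    for post in sorted(posts, key=lambda p: p["timestamp"]):
        day = datetime.fromisoformat(post["timestamp"]).date()
        timeline[day].append((post["author"], post["text"]))
    return dict(timeline)


# Invented thread for illustration; field names are assumptions.
thread = [
    {"author": "user_b", "timestamp": "2020-08-19T14:05:00", "text": "Escrow only."},
    {"author": "user_a", "timestamp": "2020-08-18T09:30:00", "text": "Is the vendor reliable?"},
    {"author": "user_a", "timestamp": "2020-08-19T16:40:00", "text": "Thanks, will do."},
]

timeline = daily_timeline(thread)
for day, entries in timeline.items():
    print(day, entries)
```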
Complete Process
Extracted Text
Preprocessing
Word Segmentation
Post Segmentation
Thread Segmentation
Domain Dictionary
Topic Detection
Summarization
Q&A and NE
Events Generation
Domain Knowledge
Topic Detection
1. Given a darknet forum, determine topics of discussion from individual post messages.
MoRec
Community support
Business activities
Risk management
Text Summarization
Provide ‘extractive’ and ‘abstractive’ summary
MoRec
Summary
Question answering
Given a forum, retrieve answers to questions related to text
Example: question-answer from a context text. Image Source: https://rajpurkar.github.io/mlx/qa-and-squad/
Model and its Performance
RoBERTa
- Based on BERT, with the following differences:
- Dynamic word masking instead of static masking
- Removed the NSP (next sentence prediction) training objective
- Trained with batches of 8K sequences, compared to 256 for BERT
Evaluation Metric
• Label Ranking Average Precision (LRAP)
• Final model: {'LRAP': 0.967, 'eval loss': 0.185}
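For readers unfamiliar with the metric, the following is a minimal pure-Python sketch of the standard LRAP definition. For each true label it measures which fraction of the labels ranked at or above it are also true, then averages over labels and samples. It is not the evaluation code behind the reported score, and unlike some library implementations it simply skips samples that have no true label.

```python
def lrap(y_true, y_score):
    """Label ranking average precision for multi-label predictions.

    y_true: rows of 0/1 indicators; y_score: rows of predicted scores.
    Samples without any true label are skipped in this sketch.
    """
    total, counted = 0.0, 0
    for truth, scores in zip(y_true, y_score):
        relevant = [j for j, t in enumerate(truth) if t]
        if not relevant:
            continue
        sample_sum = 0.0
        for j in relevant:
            # Rank of label j: number of labels scored at least as high.
            rank = sum(1 for s in scores if s >= scores[j])
            # How many of those labels are true labels.
            hits = sum(1 for k in relevant if scores[k] >= scores[j])
            sample_sum += hits / rank
        total += sample_sum / len(relevant)
        counted += 1
    return total / counted if counted else 0.0


score = lrap([[1, 0, 0], [0, 0, 1]], [[0.75, 0.5, 1.0], [1.0, 0.2, 0.1]])
```

In this toy example the first sample contributes 1/2 (its true label is ranked second) and the second contributes 1/3 (its true label is ranked last of three). A value close to 1, such as the reported 0.967, means that true labels are almost always ranked above false ones.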
Component functions
• Multi-label categorization based on the trained LM
- Does a forum post belong to CaaS and carding, hacking, trafficking, drugs?
• Trained language models (LM) (RoBERTa) on darknet data
• Summarization updated based on the LM
• Topic detection based on embeddings
• Similarity search function
• Timeline creation from forum threads
• Question answering (TBD)
• Trade network extraction and analysis (TBD)
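How the similarity search is implemented is not stated in the slides. A common way to search over embeddings is cosine similarity, sketched below with invented three-dimensional vectors; the function names, the `top_k` parameter, and the toy index are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query, index, top_k=2):
    """Return the top_k (post_id, similarity) pairs, best first."""
    scored = [(post_id, cosine(query, vec)) for post_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings", invented for illustration.
index = {
    "post-1": [0.9, 0.1, 0.0],
    "post-2": [0.0, 1.0, 0.2],
    "post-3": [0.8, 0.3, 0.1],
}
results = most_similar([1.0, 0.2, 0.0], index)
```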
Component Interface (Swagger)
Thank you!
Time for Questions and Discussion