Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously...

56
Breaking CAPTCHAs on the Dark Web Using neural networks to enable scraping RP #62, Kevin Csuka & Dirk Gaastra Supervisor: Yonne de Bruijn, Fox-IT 6 February, 2018 University of Amsterdam

Transcript of Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously...

Page 1: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Breaking CAPTCHAs on the Dark WebUsing neural networks to enable scraping

RP #62, Kevin Csuka & Dirk Gaastra

Supervisor: Yonne de Bruijn, Fox-IT6 February, 2018

University of Amsterdam

Page 2: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Introduction

Page 3: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Scraping the Dark Web

Useful for threat intelligence companies

... sometimes hard to get to.

Mainly the blockades, such as CAPTCHAs, is an issue for the scrapers.

1

Page 4: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Scraping the Dark Web

Useful for threat intelligence companies

... sometimes hard to get to.

Mainly the blockades, such as CAPTCHAs, is an issue for the scrapers.

1

Page 5: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Scraping the Dark Web

Useful for threat intelligence companies

... sometimes hard to get to.

Mainly the blockades, such as CAPTCHAs, is an issue for the scrapers.

1

Page 6: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

CAPTCHA

Figure 1: CAPTCHA example

• Completely Automated Public Turing test to tell Computer andHumans Apart

• Test to determine whether the user is human or not

2

Page 7: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

CAPTCHA

Figure 1: CAPTCHA example

• Completely Automated Public Turing test to tell Computer andHumans Apart

• Test to determine whether the user is human or not

2

Page 8: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Main question

How would a scraper be able to circumvent CAPTCHAs thatprevent it from properly scraping dark web websites?

Sub-questions:

1. Impact of solving CAPTCHAs2. Solve CAPTCHAs by using Optical Character Recognition (OCR)?3. Solving CAPTCHAs by using Machine Learning (ML)

3

Page 9: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Main question

How would a scraper be able to circumvent CAPTCHAs thatprevent it from properly scraping dark web websites?

Sub-questions:

1. Impact of solving CAPTCHAs

2. Solve CAPTCHAs by using Optical Character Recognition (OCR)?3. Solving CAPTCHAs by using Machine Learning (ML)

3

Page 10: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Main question

How would a scraper be able to circumvent CAPTCHAs thatprevent it from properly scraping dark web websites?

Sub-questions:

1. Impact of solving CAPTCHAs2. Solve CAPTCHAs by using Optical Character Recognition (OCR)?

3. Solving CAPTCHAs by using Machine Learning (ML)

3

Page 11: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Main question

How would a scraper be able to circumvent CAPTCHAs thatprevent it from properly scraping dark web websites?

Sub-questions:

1. Impact of solving CAPTCHAs2. Solve CAPTCHAs by using Optical Character Recognition (OCR)?3. Solving CAPTCHAs by using Machine Learning (ML)

3

Page 12: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Related Work

Page 13: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Related Work

1. Lawrence et al. created their own dark web scraping tool, D-miner;CAPTCHAs were solved by human labor [1]

2. Ryan Mitchell demonstrated how to solve CAPTCHAs using OpticalCharacter Recognition with Tesseract [2]

3. Torch has previously been used to train a neural network to solveCAPTCHAs by Arun Patala [3]

4

Page 14: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Related Work

1. Lawrence et al. created their own dark web scraping tool, D-miner;CAPTCHAs were solved by human labor [1]

2. Ryan Mitchell demonstrated how to solve CAPTCHAs using OpticalCharacter Recognition with Tesseract [2]

3. Torch has previously been used to train a neural network to solveCAPTCHAs by Arun Patala [3]

4

Page 15: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Related Work

1. Lawrence et al. created their own dark web scraping tool, D-miner;CAPTCHAs were solved by human labor [1]

2. Ryan Mitchell demonstrated how to solve CAPTCHAs using OpticalCharacter Recognition with Tesseract [2]

3. Torch has previously been used to train a neural network to solveCAPTCHAs by Arun Patala [3]

4

Page 16: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Methods

Page 17: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Methods

Two methods to solve the questions:

1. Categorizing dark web websites2. Breaking CAPTCHAs

5

Page 18: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?• Are there any duplicates?• Which ones block scraping?• What kind of blockade are they using?

6

Page 19: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?• Are there any duplicates?• Which ones block scraping?• What kind of blockade are they using?

6

Page 20: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?

• Are there any duplicates?• Which ones block scraping?• What kind of blockade are they using?

6

Page 21: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?• Are there any duplicates?

• Which ones block scraping?• What kind of blockade are they using?

6

Page 22: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?• Are there any duplicates?• Which ones block scraping?

• What kind of blockade are they using?

6

Page 23: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Analysis of 633 dark web websites

• Which ones are up?• Are there any duplicates?• Which ones block scraping?• What kind of blockade are they using?

6

Page 24: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor2. Exploiting bugs in the implementation that allow the attacker to

bypass the CAPTCHA3. Character recognition software to solve the CAPTCHA

7

Page 25: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor2. Exploiting bugs in the implementation that allow the attacker to

bypass the CAPTCHA3. Character recognition software to solve the CAPTCHA

7

Page 26: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor

2. Exploiting bugs in the implementation that allow the attacker tobypass the CAPTCHA

3. Character recognition software to solve the CAPTCHA

7

Page 27: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor2. Exploiting bugs in the implementation that allow the attacker to

bypass the CAPTCHA

3. Character recognition software to solve the CAPTCHA

7

Page 28: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor2. Exploiting bugs in the implementation that allow the attacker to

bypass the CAPTCHA3. Character recognition software to solve the CAPTCHA

7

Page 29: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

There are 3 common approaches to defeat CAPTCHAs:

1. Using a service which solves CAPTCHAs through human labor2. Exploiting bugs in the implementation that allow the attacker to

bypass the CAPTCHA3. Character recognition software to solve the CAPTCHA

8

Page 30: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs - Dataset

Testing two common types of CAPTCHA:

Figure 2: CAPTCHAs set 1, generated using PHP

Figure 3: CAPTCHAs set 2, generated with Python

9

Page 31: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

Figure 4: Training the neural network

10

Page 32: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

Figure 5: Login web page with generated CAPTCHA

11

Page 33: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs

Figure 6: Workflow of solving CAPTCHA with TensorFlow via Scrapy12

Page 34: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Results

Page 35: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Figure 7: Percentage of scraping blockade using CAPTCHAs(n = 465 )

13

Page 36: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Figure 7: Percentage of scraping blockade using CAPTCHAs(n = 465 )

13

Page 37: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

1. Categorizing websites

Figure 8: Percentage of scraping blockades using CAPTCHAs(n = 465, n = 55)

14

Page 38: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Figure 9: Success rate of Tesseract and TensorFlow (n = 1,000), higher isbetter

15

Page 39: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Figure 9: Success rate of Tesseract and TensorFlow (n = 1,000), higher isbetter 15

Page 40: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Levenshtein distance: minimal edit distance to get the correct result [5]

E.g. kitten to mitten = 1

Figure 10: Combined Levenshtein distance, lower is better

16

Page 41: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

2. Breaking CAPTCHAs - TensorFlow vs. Tesseract

Levenshtein distance: minimal edit distance to get the correct result [5]

E.g. kitten to mitten = 1

Figure 10: Combined Levenshtein distance, lower is better16

Page 42: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Conclusion

Page 43: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Conclusion

• Circumventing CAPTCHAs is necessary to scrape blocked parts ofwebsites

• Machine Learning is most effective• However, if immediacy takes precedent over success rate and

accuracy, then Tesseract (OCR) might be a better option

17

Page 44: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Conclusion

• Circumventing CAPTCHAs is necessary to scrape blocked parts ofwebsites

• Machine Learning is most effective

• However, if immediacy takes precedent over success rate andaccuracy, then Tesseract (OCR) might be a better option

17

Page 45: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Conclusion

• Circumventing CAPTCHAs is necessary to scrape blocked parts ofwebsites

• Machine Learning is most effective• However, if immediacy takes precedent over success rate and

accuracy, then Tesseract (OCR) might be a better option

17

Page 46: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Page 47: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

A more granular analysis of dark web websites:

• What content?• Any content hidden, due to lack of privileges?

18

Page 48: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

A more granular analysis of dark web websites:

• What content?

• Any content hidden, due to lack of privileges?

18

Page 49: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

A more granular analysis of dark web websites:

• What content?• Any content hidden, due to lack of privileges?

18

Page 50: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Increase readability for Tesseract by ”cleaning up” the image

Figure 11: Removing noise from CAPTCHA [6]

19

Page 51: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Achieve a more efficient training model, by using character segmentation

Figure 12: CAPTCHA character segmentation [7]

20

Page 52: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Try more CAPTCHAs:

• Increased difficulty• If software to generate the CAPTCHAs, including the answers, is not

available; send a training set to be solved by human labor. Thiscosts money, $ 1,39 per 1,000 images [8]

21

Page 53: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Try more CAPTCHAs:

• Increased difficulty

• If software to generate the CAPTCHAs, including the answers, is notavailable; send a training set to be solved by human labor. Thiscosts money, $ 1,39 per 1,000 images [8]

21

Page 54: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Future Research

Try more CAPTCHAs:

• Increased difficulty• If software to generate the CAPTCHAs, including the answers, is not

available; send a training set to be solved by human labor. Thiscosts money, $ 1,39 per 1,000 images [8]

21

Page 55: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

Questions

?

22

Page 56: Breaking CAPTCHAs on the Dark WebCharacter Recognition with Tesseract [2] 3.Torch has previously been used to train a neural network to solve CAPTCHAs by Arun Patala [3] 4 Related

References

[1] Lawrence, H., Hughes, A., Tonic, R., & Zou, C. (2017, October).D-miner: A framework for mining, searching, visualizing, and alerting ondarknet events. In Communications and Network Security (CNS), 2017IEEE Conference on (pp. 1-9). IEEE.

[2] Mitchell, R. (2015). Web scraping with Python: collecting data fromthe modern web. ” O’Reilly Media, Inc.”.

[3] Arun Patala. https://deepmlblog.wordpress.com/2016/01/03/how-to-break-a-captcha-system/

[4]people.cs.pitt.edu

[5]extremetech.com

[6]ahm3dibrahim.wordpress.com

[7] medium.com

[8] http://www.deathbycaptcha.com/

23