Published May 21, 2024 | Version v1

PhishDecloaker Datasets

  • 1. National University of Singapore
  • 2. Shanghai Jiao Tong University

Description

This record contains the datasets that accompany the paper "PhishDecloaker: Detecting CAPTCHA-cloaked Phishing Websites via Hybrid
Vision-based Interactive Models", published at USENIX Security'24.

Phishing Kit Dataset

  • Section: 2
  • Description: For empirical study. 
  • Contents: 100 defanged PHP phishing kits representing the following brands:
1. Microsoft
2. Banco de Oro
3. Microsoft OneDrive
4. Deutsche Kreditbank
5. Adobe Acrobat
6. N26
7. Absa Group
8. DHL
9. Microsoft
10. Correos
11. Kempinski Summerland Hotel & Resort Beirut
12. Vantage West Credit Union
13. Netflix
14. Agencia Tributaria
15. Square
16. Chronopost
17. PayPal
18. American Express
19. Allegro
20. LinkedIn
21. Virtru
22. Citibank
23. AOL
24. Credit Agricole
25. Mercado Pago
26. Université de Pau et des Pays de l'Adour (UPPA)
27. Fifth Third Bank
28. Columbia Bank
29. Alibaba Mail
30. Microsoft OneDrive
31. Intesa Sanpaolo
32. Santander
33. America First Credit Union
34. Barclays
35. Interac
36. USPS
37. Wells Fargo
38. Yahoo
39. XFINITY
40. Berliner Sparkasse
41. OneDrive
42. Standard Bank
43. Wells Fargo
44. aruba.it
45. Bancolombia
46. Caisse d’Epargne
47. DubaiPay
48. Chase Bank
49. M&T Bank
50. Postmaster
51. Volksbanken Raiffeisenbanken
52. Facebook
53. Huntington Bank
54. Commonwealth Bank of Australia
55. Orange
56. Shopify
57. Google Drive
58. WalletConnect
59. Meritrust Credit Union
60. Credit Agricole
61. Desjardins
62. Postbank
63. Dropbox
64. DocuSign
65. DPDgroup
66. L'Assurance Maladie
67. Adobe Acrobat
68. Global Sources
69. Microsoft Excel
70. SFR
71. FedEx
72. Citibank
73. Royal Credit Union
74. GoDaddy
75. ADP
76. International Card Services
77. Israeli Post
78. UNI Financial Cooperation
79. TD Bank
80. ATB Mobile
81. HSBC
82. Bank of Montreal
83. RBC Royal Bank
84. IONOS
85. AlaskaUSA Federal Credit Union
86. French Government
87. UOL SAC
88. Banco Itaú Paraguay
89. Amazon
90. Apple
91. AT&T
92. Australian Government
93. Bank of America
94. BNP Paribas
95. eBay
96. ING Group
97. Instagram
98. MetaMask
99. SingTel
100. Société Générale

Landscape Dataset

  • Section: 4.3
  • Description: For training the rotation CAPTCHA solver model.
  • Contents: 7,268 natural and man-made landscape images (320×180).
  • Format: JPEG images.

CAPTCHA Detection Dataset

  • Section: 5.2.1
  • Description: For training the CAPTCHA detection model.
  • Contents: 19,680 webpage screenshots (1920×1080), 10,680 with annotated CAPTCHA bounding boxes, 9,000 without.
  • Format: PNG images with annotations in PASCAL VOC and COCO formats. All bounding boxes are labeled with a single "CAPTCHA" class (no categorization by CAPTCHA type).
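
As a rough illustration of how the PASCAL VOC annotations might be consumed, the sketch below reads bounding boxes from a VOC-style XML string using only the Python standard library. It assumes the conventional VOC layout (`object`, `name`, `bndbox` with `xmin`/`ymin`/`xmax`/`ymax`); the sample annotation and the helper name `read_voc_boxes` are made up for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def read_voc_boxes(xml_text):
    """Return (label, xmin, ymin, xmax, ymax) tuples from a VOC-style annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are sometimes stored as floats, so parse leniently.
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


# Hypothetical annotation for a 1920x1080 screenshot (illustrative values only).
SAMPLE = """<annotation>
  <filename>example.png</filename>
  <size><width>1920</width><height>1080</height><depth>3</depth></size>
  <object>
    <name>CAPTCHA</name>
    <bndbox><xmin>760</xmin><ymin>400</ymin><xmax>1160</xmax><ymax>680</ymax></bndbox>
  </object>
</annotation>"""

print(read_voc_boxes(SAMPLE))
```

A screenshot without CAPTCHAs would simply yield an empty list, assuming its annotation file contains no `object` elements.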

CAPTCHA Recognition Dataset

  • Section: 5.2.2
  • Description: For training the CAPTCHA recognition model.
  • Contents: 6,612 CAPTCHA images distributed across 38 classes.
  • Format: PNG images with their corresponding class labels in CSV.
CAPTCHA classes:
1. baidu_slide_rotate
2. dingxiang_audio
3. dingxiang_click_area
4. dingxiang_click_difference
5. dingxiang_click_font
6. dingxiang_click_icon
7. dingxiang_click_vr
8. dingxiang_click_word
9. dingxiang_drag
10. dingxiang_slide_puzzle
11. dingxiang_slide_puzzle2
12. dingxiang_slide_rotate
13. geetest_checkbox
14. geetest_click_icon
15. geetest_click_phrase
16. geetest_click_word
17. geetest_game_playing
18. geetest_game_playing2
19. geetest_select
20. geetest_slide_puzzle
21. hcaptcha
22. hcaptcha_checkbox
23. netease_click_icon
24. netease_click_phrase
25. netease_click_vr
26. netease_click_word
27. netease_drag
28. netease_slide
29. press_and_hold
30. recaptchav2
31. recaptchav2_checkbox
32. tencent_slide
33. text_1
34. text_2
35. text_3
36. text_4
37. text_5
38. text_6

CAPTCHA Open-set Dataset

  • Section: 5.2.2
  • Description: For testing the CAPTCHA detection and recognition pipeline.
  • Contents: 1,100 webpage screenshots (1920×1080), all of which have annotated CAPTCHA classes spanning 11 different categories.
  • Format: PNG CAPTCHA and screenshot images with their corresponding class labels in CSV.
CAPTCHA classes:
1. arkose_select_2
2. capycaptcha_drag
3. dicecaptcha_qa
4. funcaptcha_select
5. funcaptcha_select_2
6. funcaptcha_select_3
7. funcaptcha_select_4
8. funcaptcha_select_5
9. funcaptcha_select_6
10. keycaptcha_drag
11. mtcaptcha_text

Ablation Dataset

  • Section: 5.4
  • Description: For training the CAPTCHA recognition model.
  • Contents: 722 webpage screenshots (1920×1080), 622 with CAPTCHAs spanning 38 classes, 100 without.
  • Format: PNG images with their corresponding bounding boxes and class labels in CSV. Class IDs 0-37 map directly to the class names in the CAPTCHA Recognition Dataset. Class ID 38 marks samples without CAPTCHAs.
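
The class ID mapping described above could be sketched as follows. This assumes that ID 0 corresponds to the first entry of the CAPTCHA Recognition Dataset class list and that IDs follow the order in which the classes are listed in this record; that ordering is an assumption and should be checked against the CSV files. The names `CLASS_NAMES`, `NO_CAPTCHA_ID`, and `class_name` are hypothetical helpers.

```python
# Class names in the order listed for the CAPTCHA Recognition Dataset.
# Assumption: class ID i in the ablation CSV refers to CLASS_NAMES[i].
CLASS_NAMES = [
    "baidu_slide_rotate", "dingxiang_audio", "dingxiang_click_area",
    "dingxiang_click_difference", "dingxiang_click_font", "dingxiang_click_icon",
    "dingxiang_click_vr", "dingxiang_click_word", "dingxiang_drag",
    "dingxiang_slide_puzzle", "dingxiang_slide_puzzle2", "dingxiang_slide_rotate",
    "geetest_checkbox", "geetest_click_icon", "geetest_click_phrase",
    "geetest_click_word", "geetest_game_playing", "geetest_game_playing2",
    "geetest_select", "geetest_slide_puzzle", "hcaptcha", "hcaptcha_checkbox",
    "netease_click_icon", "netease_click_phrase", "netease_click_vr",
    "netease_click_word", "netease_drag", "netease_slide", "press_and_hold",
    "recaptchav2", "recaptchav2_checkbox", "tencent_slide",
    "text_1", "text_2", "text_3", "text_4", "text_5", "text_6",
]

# Per the record, ID 38 marks screenshots without a CAPTCHA.
NO_CAPTCHA_ID = 38


def class_name(class_id):
    """Map an ablation-dataset class ID to a class name, or None for 'no CAPTCHA'."""
    if class_id == NO_CAPTCHA_ID:
        return None
    if 0 <= class_id < len(CLASS_NAMES):
        return CLASS_NAMES[class_id]
    raise ValueError(f"unknown class ID: {class_id}")


print(class_name(0), class_name(37), class_name(38))
```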

Files

ablation-dataset.zip

Files (5.7 GB)

Size       Checksum
342.7 MB   md5:dea75d2b452736511e044e034df49967
4.1 GB     md5:b40a4b9c2774a03ee55f0a0d6926a611
595.1 MB   md5:9d73960a6f113b49edb5780d1f19fc3d
322.7 MB   md5:c4c6042f01a1da2c41a77420d1976bdc
79.3 MB    md5:2f17ea845a079e9a2a3164930601506e
172.6 MB   md5:8395ddea3bf84469fedf785941660e3a
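
After downloading an archive, its integrity can be checked against the MD5 checksums listed above. The following is a minimal sketch using Python's standard `hashlib`; the helper names `md5_of` and `matches_checksum` are illustrative, and the file path in the usage comment is a placeholder.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file against a checksum given as 'md5:<hex>' or plain hex."""
    expected_hex = expected.split(":", 1)[-1].strip().lower()
    return md5_of(path) == expected_hex


# Usage (placeholder path; pair each archive with the checksum shown for it):
# matches_checksum("path/to/archive.zip", "md5:<checksum from the list>")
```

Reading in chunks keeps memory use low, which matters for the multi-gigabyte archive in this record.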

Additional details

Software

Programming language
Python, JavaScript, PHP
Development Status
Active