{ "access": { "embargo": { "active": false, "reason": null }, "files": "public", "record": "public", "status": "open" }, "created": "2022-06-30T10:27:41.247182+00:00", "custom_fields": {}, "deletion_status": { "is_deleted": false, "status": "P" }, "files": { "count": 1, "enabled": true, "entries": { "rba-dataset.zip": { "checksum": "md5:cc1b1078b3929650e6c08678caffcc57", "ext": "zip", "id": "d291c4c0-1c8a-47f0-b824-0615480c39a7", "key": "rba-dataset.zip", "metadata": null, "mimetype": "application/zip", "size": 1093700330 } }, "order": [], "total_bytes": 1093700330 }, "id": "6782156", "is_draft": false, "is_published": true, "links": { "access": "https://zenodo.org/api/records/6782156/access", "access_links": "https://zenodo.org/api/records/6782156/access/links", "access_request": "https://zenodo.org/api/records/6782156/access/request", "access_users": "https://zenodo.org/api/records/6782156/access/users", "archive": "https://zenodo.org/api/records/6782156/files-archive", "archive_media": "https://zenodo.org/api/records/6782156/media-files-archive", "communities": "https://zenodo.org/api/records/6782156/communities", "communities-suggestions": "https://zenodo.org/api/records/6782156/communities-suggestions", "doi": "https://doi.org/10.5281/zenodo.6782156", "draft": "https://zenodo.org/api/records/6782156/draft", "files": "https://zenodo.org/api/records/6782156/files", "latest": "https://zenodo.org/api/records/6782156/versions/latest", "latest_html": "https://zenodo.org/records/6782156/latest", "media_files": "https://zenodo.org/api/records/6782156/media-files", "parent": "https://zenodo.org/api/records/6782155", "parent_doi": "https://zenodo.org/doi/10.5281/zenodo.6782155", "parent_html": "https://zenodo.org/records/6782155", "requests": "https://zenodo.org/api/records/6782156/requests", "reserve_doi": "https://zenodo.org/api/records/6782156/draft/pids/doi", "self": "https://zenodo.org/api/records/6782156", "self_doi": 
"https://zenodo.org/doi/10.5281/zenodo.6782156", "self_html": "https://zenodo.org/records/6782156", "self_iiif_manifest": "https://zenodo.org/api/iiif/record:6782156/manifest", "self_iiif_sequence": "https://zenodo.org/api/iiif/record:6782156/sequence/default", "versions": "https://zenodo.org/api/records/6782156/versions" }, "media_files": { "count": 0, "enabled": false, "entries": {}, "order": [], "total_bytes": 0 }, "metadata": { "additional_descriptions": [ { "description": "Data set belonging to the following publication:\n\nStephan Wiefling, Paul Ren\u00e9 J\u00f8rgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069", "type": { "id": "notes", "title": { "de": "Anmerkungen", "en": "Notes" } } } ], "creators": [ { "affiliations": [ { "name": "H-BRS University of Applied Sciences" } ], "person_or_org": { "family_name": "Wiefling", "given_name": "Stephan", "identifiers": [ { "identifier": "0000-0001-7917-6065", "scheme": "orcid" } ], "name": "Wiefling, Stephan", "type": "personal" } }, { "affiliations": [ { "name": "Telenor Digital" } ], "person_or_org": { "family_name": "J\u00f8rgensen", "given_name": "Paul Ren\u00e9", "identifiers": [ { "identifier": "0000-0003-3806-714X", "scheme": "orcid" } ], "name": "J\u00f8rgensen, Paul Ren\u00e9", "type": "personal" } }, { "affiliations": [ { "name": "Telenor Digital" } ], "person_or_org": { "family_name": "Thunem", "given_name": "Sigurd", "identifiers": [ { "identifier": "0000-0001-7569-8501", "scheme": "orcid" } ], "name": "Thunem, Sigurd", "type": "personal" } }, { "affiliations": [ { "name": "H-BRS University of Applied Sciences" } ], "person_or_org": { "family_name": "Lo Iacono", "given_name": "Luigi", "identifiers": [ { "identifier": "0000-0002-7863-0622", "scheme": "orcid" } ], "name": "Lo Iacono, Luigi", "type": "personal" } } ], 
"description": "
Login Data Set for Risk-Based Authentication
\n\n\n\n\nSynthesized login feature data of >33M login attempts and >3.3M users on a large-scale online service in Norway. Original data collected between February 2020 and February 2021.
\n
This data set aims to foster research and development for Risk-Based Authentication (RBA) systems. The data was synthesized from the real-world login behavior of more than 3.3M users at a large-scale single sign-on (SSO) online service in Norway.
\n\nThe users used this SSO to access sensitive data provided by the online service, e.g., cloud storage and billing information. We used this data set to study how the Freeman et al. (2016) RBA model behaves on a large-scale online service in the real world (see Publication). The synthesized data set can reproduce the results obtained on the original data set (see Study Reproduction). Beyond that, you can use this data set to evaluate and improve RBA algorithms under real-world conditions.
\n\nWARNING: The feature values are plausible but entirely artificial. Therefore, you should NOT use this data set in production systems, e.g., intrusion detection systems.
\n\nOverview
\n\nThe data set contains the following features related to each login attempt on the SSO:
\n\nFeature | Data Type | Description | Range or Example
---|---|---|---
IP Address | String | IP address belonging to the login attempt | 0.0.0.0 - 255.255.255.255
Country | String | Country derived from the IP address | US
Region | String | Region derived from the IP address | New York
City | String | City derived from the IP address | Rochester
ASN | Integer | Autonomous system number derived from the IP address | 0 - 600000
User Agent String | String | User agent string submitted by the client | Mozilla/5.0 (Windows NT 10.0; Win64; ...
OS Name and Version | String | Operating system name and version derived from the user agent string | Windows 10
Browser Name and Version | String | Browser name and version derived from the user agent string | Chrome 70.0.3538
Device Type | String | Device type derived from the user agent string | (mobile, desktop, tablet, bot, unknown) [1]
User ID | Integer | Identification number related to the affected user account | [Random pseudonym]
Login Timestamp | Integer | Timestamp related to the login attempt | [64 Bit timestamp]
Round-Trip Time (RTT) [ms] | Integer | Server-side measured latency between client and server | 1 - 8600000
Login Successful | Boolean | True: Login was successful, False: Login failed | (true, false)
Is Attack IP | Boolean | IP address was found in known attacker data set | (true, false)
Is Account Takeover | Boolean | Login attempt was identified as account takeover by incident response team of the online service | (true, false)
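As a minimal sketch of how the data types in the table could be mapped to a typed record, the following Python snippet converts one row of string values into a `LoginAttempt` object. It assumes that the column headers in the data file match the feature names in the table above and that booleans are stored as the text `true`/`false`; both are assumptions that should be checked against the actual file. `LoginAttempt`, `parse_bool`, and `parse_row` are hypothetical names, and only a subset of the features is included.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LoginAttempt:
    """Subset of the features listed in the overview table."""
    user_id: int
    login_timestamp: int
    ip_address: str
    country: str
    asn: int
    device_type: str  # may be empty for user agent strings that could not be parsed
    rtt_ms: int
    login_successful: bool
    is_attack_ip: bool
    is_account_takeover: bool


def parse_bool(text: str) -> bool:
    """Parse 'true'/'false' (case-insensitive); reject anything else."""
    value = text.strip().lower()
    if value == "true":
        return True
    if value == "false":
        return False
    raise ValueError(f"not a boolean: {text!r}")


def parse_row(row: Mapping[str, str]) -> LoginAttempt:
    """Convert one row of strings into a typed record.

    The keys are ASSUMED to equal the feature names from the table.
    """
    return LoginAttempt(
        user_id=int(row["User ID"]),
        login_timestamp=int(row["Login Timestamp"]),
        ip_address=row["IP Address"],
        country=row["Country"],
        asn=int(row["ASN"]),
        device_type=row["Device Type"],
        rtt_ms=int(row["Round-Trip Time (RTT) [ms]"]),
        login_successful=parse_bool(row["Login Successful"]),
        is_attack_ip=parse_bool(row["Is Attack IP"]),
        is_account_takeover=parse_bool(row["Is Account Takeover"]),
    )
```

Rows produced by `csv.DictReader` have the required shape, so `parse_row` could be applied to each of them once the real header names have been verified.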
Data Creation
\n\nAs the data set targets RBA systems, especially the Freeman et al. (2016) model, the statistical feature probabilities between all users, globally and locally, are identical to the original data set for the categorical data. All the other data was randomly generated while maintaining logical relations and temporal order between the features.
\n\nThe timestamps, however, are not identical and contain randomness. The feature values related to IP address and user agent string were randomly generated from publicly available data, so they were very likely not present in the real data set. The RTTs resemble real values but were randomly assigned among users per geolocation. Therefore, the RTT entries probably appear at different positions than in the original data set.
\n\nThe country was randomly assigned per unique feature value. Based on that, we randomly assigned an ASN related to the country, and generated the IP addresses for this ASN. The cities and regions were derived from the generated IP addresses for privacy reasons and do not reflect the real logical relations from the original data set.
\n\tThe device types are identical to the real data set. Based on that, we randomly assigned the OS, and based on the OS the browser information. From this information, we randomly generated the user agent string. Therefore, all the logical relations regarding the user agent are identical to those in the real data set.
\n\tThe RTT was randomly drawn based on the login success status and the synthesized geolocation data. We did this to ensure that the RTTs are realistic.
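The country, ASN, and IP address steps described above can be illustrated with a small Python sketch. It is not the procedure used to create this data set, only an illustration of the idea: each unique original value is mapped once to a random country, then to an ASN of that country, then to an IP address inside a prefix of that ASN, so the frequency of each value is preserved. The lookup tables `ASNS_BY_COUNTRY` and `PREFIX_BY_ASN` are hypothetical and use ASNs and prefixes reserved for documentation; `synthesize_network` is a hypothetical name.

```python
import ipaddress
import random

# HYPOTHETICAL lookup tables (documentation-only ASNs and prefixes).
ASNS_BY_COUNTRY = {"NO": [64500, 64501], "US": [64502]}
PREFIX_BY_ASN = {
    64500: "192.0.2.0/24",
    64501: "198.51.100.0/24",
    64502: "203.0.113.0/24",
}


def synthesize_network(original_values, rng):
    """Map each unique original value to one synthetic (country, asn, ip) triple.

    Because every occurrence of a value receives the same triple, the
    relative frequencies of the values are unchanged by the replacement.
    """
    mapping = {}
    for value in sorted(set(original_values)):
        # Step 1: random country per unique value.
        country = rng.choice(sorted(ASNS_BY_COUNTRY))
        # Step 2: random ASN that belongs to this country.
        asn = rng.choice(ASNS_BY_COUNTRY[country])
        # Step 3: random IP address inside a prefix of this ASN.
        network = ipaddress.ip_network(PREFIX_BY_ASN[asn])
        ip = str(network[rng.randrange(network.num_addresses)])
        mapping[value] = (country, asn, ip)
    return mapping
```

Passing a seeded `random.Random` instance as `rng` makes the mapping repeatable. The device type, OS, browser, and user agent string could be chained in the same way.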
\n\tRegarding the Data Values
\n\nDue to unresolvable conflicts during the data creation, we had to assign some unrealistic IP addresses and ASNs that are not present in the real world. Nevertheless, these do not affect the risk scores generated by the Freeman et al. (2016) model.
\n\nYou can recognize them by the following values:
\n\nASNs with values >= 500000
\n\tIP addresses in the range 10.0.0.0 - 10.255.255.255 (10.0.0.0/8 CIDR range)
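The two criteria above translate directly into a check. The following Python sketch uses the standard `ipaddress` module; the function and constant names are hypothetical, and the threshold of 500000 is the value stated in the list.

```python
import ipaddress

# Values taken from the list above.
PLACEHOLDER_NETWORK = ipaddress.ip_network("10.0.0.0/8")
PLACEHOLDER_ASN_MIN = 500_000


def is_placeholder_ip(ip: str) -> bool:
    """True if the address lies in 10.0.0.0 - 10.255.255.255."""
    return ipaddress.ip_address(ip) in PLACEHOLDER_NETWORK


def is_placeholder_asn(asn: int) -> bool:
    """True if the ASN is one of the unrealistic values (>= 500000)."""
    return asn >= PLACEHOLDER_ASN_MIN
```

Whether such entries should be excluded depends on the analysis. They do not influence the Freeman et al. (2016) risk scores, so filtering mainly matters when the values are joined with external IP or ASN data.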
\n\tStudy Reproduction
\n\nBased on our evaluation, this data set can reproduce our study results regarding the RBA behavior of an RBA model using the IP address (IP address, country, and ASN) and user agent string (Full string, OS name and version, browser name and version, device type) as features.
\n\nThe calculated RTT significances for countries and regions inside Norway are not identical using this data set, but have similar tendencies. The same is true for the median RTTs per country. This is because the available number of entries per country, region, and city changed with the data creation procedure. However, the RTTs still reflect the real-world distributions of different geolocations by city.
\n\nSee RESULTS.md for more details.
\n\nEthics
\n\nBy using the SSO service, the users agreed to the data collection and evaluation for research purposes. For study reproduction and fostering RBA research, we agreed with the data owner to create a synthesized data set that does not allow re-identification of customers.
\n\nThe synthesized data set does not contain any sensitive data values, as the IP addresses, browser identifiers, login timestamps, and RTTs were randomly generated and assigned.
\n\nPublication
\n\nYou can find more details on our conducted study in the following journal article:
\n\nPump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service (2022)
\nStephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono.
\nACM Transactions on Privacy and Security
Bibtex
\n\n@article{Wiefling_Pump_2022,\n author = {Wiefling, Stephan and Jørgensen, Paul René and Thunem, Sigurd and Lo Iacono, Luigi},\n title = {Pump {Up} {Password} {Security}! {Evaluating} and {Enhancing} {Risk}-{Based} {Authentication} on a {Real}-{World} {Large}-{Scale} {Online} {Service}},\n journal = {{ACM} {Transactions} on {Privacy} and {Security}},\n doi = {10.1145/3546069},\n publisher = {ACM},\n year = {2022}\n}\n\n
License
\n\nThis data set and the contents of this repository are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. See the LICENSE file for details. If the data set is used within a publication, the following journal article has to be cited as the source of the data set:
\n\nStephan Wiefling, Paul René Jørgensen, Sigurd Thunem, and Luigi Lo Iacono: Pump Up Password Security! Evaluating and Enhancing Risk-Based Authentication on a Real-World Large-Scale Online Service. In: ACM Transactions on Privacy and Security (2022). doi: 10.1145/3546069
\n\n[1] A few (invalid) user agent strings from the original data set could not be parsed, so their device type is empty. Perhaps this parse error is useful information for your studies, so we kept these 1526 entries.
\n\t