Exploratory Data Analysis on Username-Password Dataset

Passwords act as a first line of defense against malicious or unauthorized access to one's personal information. With increasing digitization, choosing strong passwords has become even more important. In this paper, the authors perform Exploratory Data Analysis on a 100 million Email-Password Database. The analysis provides insights into the most common passwords, the character sets of passwords, the most common domains, average length, password strength, the frequencies of letters, numbers and symbols (special characters), the most common letter, number and symbol, and the ratio of letters, numbers and symbols in passwords. Together, these statistics highlight the general trends that users follow while creating passwords. Using the results of this paper, users can make intelligent decisions while creating passwords for themselves, i.e., avoid the most common features and thereby create more robust and less vulnerable passwords.


Introduction
In this digital world, almost everything has a password [1] [2] connected to it: from making online transactions to updating profile pictures on social media platforms, passwords are required as a means of authentication. Having a secure and robust password has therefore become a necessity. A single user can have multiple passwords for multiple accounts. As the number of passwords grows, complex passwords become hard to remember, so people tend to select common and weak passwords. People also often use the same password for multiple accounts, which makes all of those accounts vulnerable in the case of a data breach [3].
Some websites have now started encouraging [4] users to choose a strong password that combines different character sets (lowercase letters, uppercase letters, numbers and symbols, i.e., special characters) and is sufficiently long (typically at least 8 characters). To increase security, websites have also started offering 2 Factor Authentication (2FA) [5]. As the name suggests, 2FA requires more than one form of verification. This means that a password alone is not enough to log in; a one-time password (OTP) or an authentication code is also required. It is always recommended to enable 2FA when possible.
There are many ways in which hackers can break into an account, either by guessing or by cracking passwords. Attacks such as the Dictionary Attack [6] and the Brute Force Attack [7][8] can break into an account protected by a weak password. A Dictionary Attack uses an automated script that tries every word in a dictionary or a supplied word list. A Brute Force Attack runs a script that tries all possible combinations of a supplied character set. Many password-cracking wordlists, such as rockyou [9], CrackStation [10], Weakpass [11] and SkullSecurity [12], are also available on the web for hackers to launch attacks on accounts with weak passwords. Some passwords also contain Unicode Characters. In this paper, Unicode Characters [13] are special characters that fall outside the ASCII [14] range (code points above 127) and are commonly found on a general keyboard layout. The results of this analysis can also help readers make intelligent decisions about their passwords and choose a strong password for their accounts.

Cleaning Data
The dataset had many anomalies that fell outside the scope of the analysis: some email-password combinations were blank, some contained Unicode Characters, and duplicate entries were also present. All such entries were identified and removed from the dataset. The cleaning was performed with code written by the authors in Python [16], which is available at [17].
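The authors' cleaning code is at [17] and is not reproduced here. As a minimal sketch of the three cleaning steps described above, assuming one record per line in the form `email:password` (the on-disk format is not stated in this paper, and the function name `clean_records` is hypothetical), the filtering could look like this:

```python
def clean_records(lines):
    """Yield unique (email, password) pairs, skipping anomalous entries.

    Assumption: each input line looks like "email:password".
    """
    seen = set()
    for line in lines:
        # Split on the first ':' only, so passwords may themselves contain ':'.
        email, sep, password = line.rstrip("\n").partition(":")
        if not sep or not email or not password:
            continue  # blank or malformed combination
        if not (email.isascii() and password.isascii()):
            continue  # contains characters outside the ASCII range
        record = (email, password)
        if record in seen:
            continue  # duplicate entry
        seen.add(record)
        yield record
```

Holding every record in an in-memory set is only practical for a subset of a dataset of this size; for the full data, an external sort followed by de-duplication would be a more realistic choice.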

Sorting Data
Because the dataset is very large, performing the analysis in one pass is difficult. The dataset was therefore broken into smaller sections to ease the analysis. The sorting was performed with code written in Python by the authors, which is available at [17].
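The paper does not state how the sections were formed. One simple possibility, shown here only as an illustration, is to read the records in fixed-size chunks so that each chunk fits in memory; the helper name `iter_chunks` and the chunk size are assumptions, not the authors' choices.

```python
from itertools import islice


def iter_chunks(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted
        yield chunk
```

Because `iter_chunks` accepts any iterable, it can be fed an open file object directly, and the per-chunk results (for example, counters) can then be merged at the end.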

Performing Exploratory Data Analysis (EDA)
For the EDA, a set of basic features was selected, and the analysis was performed on each of them. The code for all the following subsections was also written by the authors and is available at [18].

Password Strength
We defined three different categories of passwords: weak, moderate, strong. To define the parameters, we used the following approach.
The score was incremented whenever one of the following conditions was met:
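The authors' actual conditions and cut-offs are those implemented in the code at [18]. The sketch below only illustrates the general shape of such a score-based classifier: every condition in `CONDITIONS` and both thresholds in `strength` are hypothetical placeholders chosen for the example, not the criteria used in this paper.

```python
import string

# HYPOTHETICAL conditions, for illustration only (not the authors' list).
CONDITIONS = [
    lambda pw: len(pw) >= 8,
    lambda pw: any(c in string.ascii_lowercase for c in pw),
    lambda pw: any(c in string.ascii_uppercase for c in pw),
    lambda pw: any(c in string.digits for c in pw),
    lambda pw: any(c in string.punctuation for c in pw),
]


def strength(password):
    """Classify a password as weak, moderate or strong from its score."""
    # The score goes up by one for each condition the password satisfies.
    score = sum(1 for condition in CONDITIONS if condition(password))
    # HYPOTHETICAL thresholds: 0-2 weak, 3-4 moderate, 5 strong.
    if score <= 2:
        return "weak"
    if score <= 4:
        return "moderate"
    return "strong"
```

Keeping the conditions in a list makes it easy to swap in the real criteria without touching the classification logic.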

Character set of passwords
All the passwords were run through a counter program that recorded which character sets each password used. The following are the seven character-set categories into which the passwords were classified.
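A counter of this kind can be sketched as follows. The label format (`"lowercase + numbers"`, etc.) and the function names are assumptions made for this example; the authors' own category names are the ones used in their code at [18].

```python
import string
from collections import Counter


def charset_category(password):
    """Return a label naming the character classes present in a password."""
    classes = []
    if any(c in string.ascii_lowercase for c in password):
        classes.append("lowercase")
    if any(c in string.ascii_uppercase for c in password):
        classes.append("uppercase")
    if any(c in string.digits for c in password):
        classes.append("numbers")
    if any(c in string.punctuation for c in password):
        classes.append("symbols")
    # Passwords with none of the four classes (e.g. empty) fall into "other".
    return " + ".join(classes) or "other"


def count_categories(passwords):
    """Count how many passwords fall into each character-set label."""
    return Counter(charset_category(pw) for pw in passwords)
```

Note that this sketch produces mutually exclusive labels, whereas the category names reported in the results suggest that the authors' categories may overlap.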

Lowercase Letter Frequency
A counter program iterated through all the passwords and calculated the frequency of occurrences of all the lowercase letters. The results were presented in a bar graph along with their frequencies.

Uppercase Letter Frequency
A similar counter program iterated through all the passwords and calculated the frequency of occurrences of all the uppercase letters. The results were presented in a bar graph along with their frequencies.

Numbers Frequency
All the passwords were passed through an iterator which kept count of the frequency of all the numbers occurring in the passwords. The results were presented in a bar graph along with their frequencies.

Symbols (Special Characters) Frequency
To calculate the frequency of symbols like @, !, #, etc., all the passwords were checked for the occurrence of symbols. A counter function kept the count of the occurrences of symbols with their frequencies. The results were presented in a graph.
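The four frequency counts described above (lowercase letters, uppercase letters, numbers and symbols) share the same structure and can be sketched in a single pass. This is an illustration rather than the authors' code; in particular, treating Python's `string.punctuation` as the symbol set is an assumption, and the function name `char_frequencies` is hypothetical.

```python
import string
from collections import Counter

LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)
SYMBOLS = set(string.punctuation)  # assumed symbol set


def char_frequencies(passwords):
    """Count occurrences of each character, grouped by character class."""
    freq = {
        "lowercase": Counter(),
        "uppercase": Counter(),
        "numbers": Counter(),
        "symbols": Counter(),
    }
    for password in passwords:
        for ch in password:
            if ch in LOWER:
                freq["lowercase"][ch] += 1
            elif ch in UPPER:
                freq["uppercase"][ch] += 1
            elif ch in DIGITS:
                freq["numbers"][ch] += 1
            elif ch in SYMBOLS:
                freq["symbols"][ch] += 1
            # Any other character (e.g. non-ASCII) is ignored here.
    return freq
```

For a bar graph of the top ten entries in one class, `freq["lowercase"].most_common(10)` returns the (character, count) pairs in descending order of frequency.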

Most Common Password
A Python dictionary was created to keep count of all the unique passwords along with their frequencies to find the most common password in the dataset. The top 10 most common passwords along with their frequencies were presented in a bar graph.
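A minimal sketch of this step, using `collections.Counter` (a dictionary subclass from the standard library) in place of a hand-written dictionary; the function name `top_passwords` is hypothetical.

```python
from collections import Counter


def top_passwords(passwords, n=10):
    """Return the n most frequent passwords as (password, count) pairs."""
    return Counter(passwords).most_common(n)
```

For example, `top_passwords(["123456", "password", "123456"], n=1)` gives `[("123456", 2)]`.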

Most Common Domain
Each email address in the dataset includes its domain name (the part after the '@' sign). The unique domains, along with their frequencies, were then plotted in a bar graph.
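Extracting and counting the domains can be sketched as below. Lowercasing the domain and skipping addresses without a usable domain are assumptions of this example, as are the function names.

```python
from collections import Counter


def email_domain(email):
    """Return the lowercased domain of an email address, or None if absent."""
    # rpartition splits on the LAST '@', which separates local part and domain.
    _local, sep, domain = email.strip().rpartition("@")
    return domain.lower() if sep and domain else None


def top_domains(emails, n=10):
    """Return the n most frequent domains as (domain, count) pairs."""
    domains = (email_domain(e) for e in emails)
    return Counter(d for d in domains if d is not None).most_common(n)
```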

Most Common Unicode Characters
Many passwords in the dataset also contained Unicode Characters. All the occurrences of Unicode Characters were plotted in a bar graph along with their frequencies.
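As an illustration, the count can be obtained by keeping only the characters whose code point lies outside the ASCII range. The function name `non_ascii_frequencies` is hypothetical.

```python
from collections import Counter


def non_ascii_frequencies(passwords):
    """Count every character whose Unicode code point is above 127."""
    counts = Counter()
    for password in passwords:
        # ord() gives the Unicode code point; ASCII covers 0 to 127.
        counts.update(ch for ch in password if ord(ch) > 127)
    return counts
```

As with the other counters, `non_ascii_frequencies(passwords).most_common(10)` yields the ten most frequent such characters.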

The ratio of the number of Alphabetic Letters to the length of the password
This ratio indicates what percentage of a password is composed of Alphabetic Letters. The ratio was calculated for each individual password and then averaged over the entire dataset.

The ratio of number of Numeric Digits to the length of the password
This ratio indicates what percentage of a password is composed of Numeric Digits. The ratio was calculated for each individual password and then averaged over the complete dataset.

The ratio of number of Symbols to the length of the password
This ratio indicates what percentage of a password is composed of symbols. It was also calculated for each individual password and then averaged over the complete dataset.
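The three ratios above can be sketched together. This example assumes that the dataset-level figure is the arithmetic mean of the per-password ratios, that empty passwords are skipped, and that symbols are the characters in Python's `string.punctuation`; the function name `mean_class_ratios` is hypothetical.

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
SYMBOLS = set(string.punctuation)  # assumed symbol set


def mean_class_ratios(passwords):
    """Average, over all passwords, the share of letters, digits and symbols."""
    totals = {"letters": 0.0, "digits": 0.0, "symbols": 0.0}
    counted = 0
    for password in passwords:
        if not password:
            continue  # avoid dividing by a zero length
        length = len(password)
        totals["letters"] += sum(c in LETTERS for c in password) / length
        totals["digits"] += sum(c in DIGITS for c in password) / length
        totals["symbols"] += sum(c in SYMBOLS for c in password) / length
        counted += 1
    if counted == 0:
        return totals
    return {name: total / counted for name, total in totals.items()}
```

Averaging per-password ratios weights every password equally; dividing total letters by total characters would instead weight longer passwords more heavily, so the two approaches generally give different numbers.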

Additional Features of the Data set
Five additional features about the dataset were also observed.

Password Strength
Fig 1(a) indicates the frequency of weak, moderate and strong passwords present in the dataset. Fig 1(b) shows the same information in the form of a pie chart for easy visualization. It has been observed that only 0.3% of the passwords are strong and more than 50% of the passwords are weak in strength.

Character set of passwords
Fig 2(a) shows the frequency of the seven character-set categories. A bar graph is also plotted in Fig 2(b), which indicates that the categories 'small(lowercase) + big(uppercase) + numbers', 'small(lowercase) + numbers' and 'small(lowercase) + big(uppercase) + numbers + without_symbols' are the most common character sets, with almost 72% each.

Lowercase Letter Frequency
Fig 3(a) shows the frequency of all the 26 lowercase letters in the form of a bar graph. The top 10 most common lowercase letters found in the dataset are represented in Fig 3(b). It has been observed that the most common lowercase letter 'a' occurred 54517313 times and the least common lowercase letter 'q' occurred 4654476 times in the dataset.

Uppercase Letter Frequency
Fig 4(a) shows the frequency of all the 26 uppercase letters in the form of a bar graph. The top 10 most common uppercase letters found in the dataset are represented in Fig 4(b). It has been observed that the most common uppercase letter 'A' occurred 2491065 times and the least common uppercase letter 'X' occurred 501590 times in the dataset.

Numbers Frequency
Fig 5 shows the frequency of numeric digits occurring in the dataset. The most common numeric digit was found to be '1', occurring 61991073 times, and the least common numeric digit was found to be '7', occurring 22170243 times in the dataset. The frequencies of the digits '6', '4', '8' and '5' were also found to be approximately equal.

Symbols (Special Characters) Frequency
Fig 6(a) shows the twenty-five unique symbols present in the dataset with their frequency of occurrence. Fig 6(b) shows the top ten most common symbols along with their frequencies in the form of a bar graph. The symbol '.' (full stop) is the most common symbol, with a frequency of 1485430. The symbol '>' (greater than) is the least common symbol, with a frequency of only 6500.

Most Common Domain
Fig 8 shows the top 10 most common domains found in the dataset. Domain 'yahoo.com' was found to be the most common domain in the dataset with a frequency of 18409390, i.e., 17% of the entire dataset. It was followed by 'hotmail.com', 'mail.ru', 'gmail.com' and 'aol.com', with frequencies of 10895244, 8033848, 7224411 and 6650848 respectively.

Most Common Unicode Characters
Some passwords in the dataset also contain Unicode Characters (characters whose code point lies outside the ASCII range, i.e., is greater than 127). Fig 9 shows the top 10 most common Unicode Characters along with their frequencies plotted in a bar graph. The most common Unicode character found in the dataset was 'а' (Cyrillic small letter a, Unicode code point 1072) with a frequency of 1942387, i.e., 19% of all the Unicode characters present in the dataset.