Tuesday, August 29, 2017

320 Million Hashes Exposed

Earlier this month (August 2017), Troy Hunt, founder of the website Have I been pwned? [0], released over 319 million passwords [1] in the form of SHA-1 hashes, compiled from various data breaches in which the passwords were stored as unhashed plaintext. Making this data public allows future passwords to be cross-checked in a secure manner, in the hope of preventing password re-use, especially of passwords from breaches that exposed unhashed plaintext.

Our group (in collaboration with @m33x and @tychotithonus) attempted to crack/recover as many of the hashes as possible, both for research purposes and, of course, to satisfy our curiosity while treating this opportunity as a challenge. Although each of the pwned password packs released at the time (3 in total at this writing) was labeled as containing 40-character ASCII-HEX SHA-1 hashes, we worked under the assumption that “No hash list larger than a few hundred thousand entries contains only one kind of hash!” - and these lists were no exception.

Nested Hashes
Although the majority of the recovered passwords were ordinary plaintext, as expected, we also noticed a number of “plaintexts” that were themselves hashes or some other form of non-plaintext. This suggested that we were dealing with more than just SHA-1.

Out of the roughly 320 million hashes, we were able to recover all but 116 of the SHA-1 hashes, a roughly 99.9999% success rate. In addition, we attempted to take it a step further and resolve as many “nested” hashes (hashes within hashes) as possible to their ultimate plaintext forms. Through the use of MDXfind [2] we were able to identify over 15 different algorithms in use across the pwned-passwords-1.0.txt and the successive update-1 and update-2 packages following that. We also added support for SHA1SHA512x01 to Hashcat [3].

Taking a deeper dive into the found “plaintexts,” we realized there were hashes-within-hashes, hashes of seemingly garbage data, what appear to be “seeded” hashes, and more. Here is a list of the hash types we found:

There are other hashes we have not completely resolved yet - some of which may be seeded hashes. For example, we see:


… and much more.

Personally Identifiable Information
We also saw unusual strings, caused by incorrect import/export, that were already present in the original leak. These link the hash to the owner of the password, which was clearly not intended by Troy. We found more than 2.5m email addresses and about 230k email:password combinations. The strings followed patterns such as:
<firstname.lastname@tld><:.,;| /><password>
<truncated-firstname.lastname@tld><:.,;| /><password>
<@tld><:.,;| /><password>
<username><:.,;| /><password>
<firstname.lastname@tld><:.,;| /><some-hash>

Trash / Other Non-Passwords
Furthermore, there were obviously other strings that were not passwords, but rather fragments of files.  For example:

005a97e5323dac9a43c06bb5fe0a75973ee5e23f:<div><embed src="http://apps.rockyou.com/fxtext.swf?ID=31478642&nopanel=true&stage=true" quality="high" scale="noscale" width="405.37" height="116.475" wmode="transparent" name="rockyou" type="application/x-shockwave-flash" pluginspage="http://www.macrom

006bb7e8893618b02f979dd425e689b4ae64df10:honeyDo you realize who is in this image: http://thecoolpics.com/who.jpg . Just think for a moment and tell me o you realize who is in this image: http://thecoolpics.com/who.jpg . Just think for a moment and tell me soon ;))

Bad Line Parsing
We observed a number of passwords that appeared as if they were truncated at length 40 but contained data following the linefeed terminator of the input lists.


We assumed this was either caused by a parsing error or some anomaly. To recover these strange processed plaintexts, some utilities were coded [4] to emulate the particular behavior of concatenating successive lines while restricting them to 40 characters.
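As a rough illustration of the idea (this is not the code from [4]), the following Python sketch generates candidate plaintexts by joining each line with the lines that follow it and cutting the result at 40 characters. The function name, the use of "\n" as the joining character, and the rule of only emitting candidates that reach the full 40 characters are our assumptions for the example.

```python
from typing import Iterable, Iterator

MAX_LEN = 40  # truncation length observed in the recovered plaintexts


def truncated_concatenations(lines: Iterable[str], max_len: int = MAX_LEN,
                             newline: str = "\n") -> Iterator[str]:
    """Yield candidates made of a line plus the lines following it.

    Assumption: the faulty importer read past the linefeed, so a candidate
    is line[i] + newline + line[i+1] + ... cut off at max_len characters.
    """
    items = list(lines)
    for i in range(len(items)):
        buf = items[i]
        j = i + 1
        # Append following lines until the buffer is long enough to truncate.
        while len(buf) < max_len and j < len(items):
            buf += newline + items[j]
            j += 1
        # Emit only candidates that span a linefeed and reach max_len.
        if j > i + 1 and len(buf) >= max_len:
            yield buf[:max_len]


if __name__ == "__main__":
    sample = ["a" * 30, "b" * 20, "c"]
    for candidate in truncated_concatenations(sample):
        print(repr(candidate))
```

The generated candidates can then be fed to a cracker as an ordinary dictionary; candidates containing a linefeed need an encoding that survives a line-based wordlist.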


Furthermore, to find the correct position where the initial parsing error occurred, we searched our dictionaries from the right to the left (see [4]) concatenating characters like this:


 An example of a bad/invalid email imported into the haveibeenpwned.com website

Hashcat’s Hexception
During hash processing, we also caught a glimpse into Troy’s methodology.  We believe that he processed some “cracked” passwords as well, suggested by the presence of $HEX[] plaintexts. This also revealed a bug in Hashcat’s $HEX[] encoding.

For example, consider the following hash:


Initially, when this was found with Hashcat, it appeared as:


The hash could not be verified as the solution since:


We discovered that Hashcat fails to correctly encode a literal string with $HEX[], if the literal string starts with $HEX[.  This means that if you take the output of Hashcat, say from hashcat.pot and try to re-crack it using the passwords in the hashcat.pot file - you will end up with “unsolvable” hashes.  As part of our work involves building dictionaries that we can reuse, we consider this a significant bug.
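To make the failure mode concrete, here is a minimal Python sketch of the $HEX[] convention as we understand it. It is not Hashcat's code: the function names are ours, and the rule for when to hex-encode (non-printable bytes or a ':' separator) is a simplification. The important part is the last check in needs_hex, which forces hex encoding for any literal that itself begins with $HEX[.

```python
import re

# A field counts as hex-encoded only if it is exactly $HEX[<even hex digits>].
HEX_RE = re.compile(r"^\$HEX\[((?:[0-9a-fA-F]{2})*)\]$")


def decode_plain(field: str) -> bytes:
    """Decode a potfile-style plaintext field into raw password bytes."""
    match = HEX_RE.match(field)
    if match:
        return bytes.fromhex(match.group(1))
    return field.encode("utf-8")


def needs_hex(password: bytes) -> bool:
    # Simplified rule (an assumption): encode anything that is not printable
    # ASCII or that contains the ':' field separator.
    if any(b < 0x20 or b > 0x7E or b == ord(":") for b in password):
        return True
    # The case behind the bug: a literal that looks like $HEX[...] must be
    # encoded too, otherwise it is decoded when the output is read back.
    return password.startswith(b"$HEX[")


def encode_plain(password: bytes) -> str:
    """Encode raw password bytes so that decode_plain() returns them unchanged."""
    if needs_hex(password):
        return "$HEX[" + password.hex() + "]"
    return password.decode("ascii")


if __name__ == "__main__":
    literal = b"$HEX[41]"            # a password that really is this string
    print(decode_plain("$HEX[41]"))  # written out unencoded, it reads back as b'A'
    print(encode_plain(literal))     # encoded, it round-trips
```

With this rule, decode_plain(encode_plain(p)) returns p for every byte string p, which is the property a reusable dictionary needs.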

Some tools [5] were put together to properly re-encode the output from Hashcat, into the proper string:


This then works properly as a reusable password with Hashcat and MDXfind, as it decodes into the literal string:


This issue has been resolved in a beta version of Hashcat [6].

We also uncovered a second bug in Hashcat, which was later corrected in a beta version. When using certain rules, we found that some of the solutions Hashcat reported did not hash back to the correct value. We ended up with hundreds of “solutions” that really were not solutions at all. This is one of the reasons that we always try to double-check our work, to ensure that we have accurate hashes and plaintexts.
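A double-check of this kind can be sketched in a few lines of Python. The example below assumes hash:plaintext lines, plain SHA-1 only (no nested algorithms), and $HEX[] decoding as described earlier; the function names are ours and this is not the tooling we actually used.

```python
import hashlib
import re

HEX_RE = re.compile(r"^\$HEX\[((?:[0-9a-fA-F]{2})*)\]$")


def decode_plain(field: str) -> bytes:
    """Turn a plaintext field (possibly $HEX[...]) into raw bytes."""
    match = HEX_RE.match(field)
    return bytes.fromhex(match.group(1)) if match else field.encode("utf-8")


def verify_sha1_pairs(lines):
    """Return the lines whose plaintext does not hash back to the listed SHA-1."""
    failures = []
    for line in lines:
        line = line.rstrip("\r\n")
        # Split on the first ':' only; the plaintext may contain more colons.
        digest, sep, plain = line.partition(":")
        recomputed = hashlib.sha1(decode_plain(plain)).hexdigest()
        if not sep or recomputed != digest.lower():
            failures.append(line)
    return failures


if __name__ == "__main__":
    good = hashlib.sha1(b"abc").hexdigest() + ":abc"
    bad = hashlib.sha1(b"abc").hexdigest() + ":abd"
    print(verify_sha1_pairs([good, bad]))  # only the second line is reported
```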

As a final check, we took just the SHA1x01 passwords we found and re-ran them through both Hashcat (Beta v3.6.0-351-gec874c1) and MDXfind. The results were quite illuminating. The test system used was a 4-core Intel Core i7-6700K system with 4x GTX1080 cards and 64GB of memory. Using Hashcat, we found that loading more than about 250,000,000 hashes at a time was not possible [7]; as a result, the list was broken up into chunks of 225m hashes.

Run | Time to Complete | Hashes Found
Hashcat | 55 minutes |
MDXfind (all hashes) | 9 minutes |
MDXfind (225m chunks) | 9 minutes |

From our usage patterns, it is evident that both applications have their strengths and caveats. MDXfind shows its strength when the hashlist is too large to fit into GPU memory, when many algorithms need to be checked in parallel, and when very long password strings need to be tested. Hashcat, on the other hand, shines when parallel compute is needed, such as when running large rule sets and large keyspaces. Using the tools in tandem gives us the best of both worlds, since we can feed the left list of each successive attack into either program to achieve optimal efficiency and coverage.

To further illustrate the problem with reusing recovered passwords (and the importance of validation), the hashes were re-run using only the passwords found by Hashcat (Beta v3.6.0-351-gec874c1). This resulted in 86,954 hashes not being recovered, primarily due to the $HEX[] encoding error that Hashcat makes.

Distributed Tasks
Once the hashlist was small enough that its size had a negligible effect on search speed, distributed brute-force and mask attacks were conducted via Hashtopussy [8], a Hashcat wrapper. Combining our hardware, we were able to achieve peak speeds of over 180GH/s on SHA-1; to put things into perspective, that's roughly the speed of 25x GTX1080s. We were able to cover ?a length 1-8, ?l?d length 9-10 and ?b length 1-6 effortlessly.
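For a feel of what these keyspaces mean at that speed, the arithmetic can be sketched as below. The charset sizes are Hashcat's built-in ones (?a = 95 printable ASCII characters, ?b = 256 byte values); reading “?l?d” as a 36-character lowercase-plus-digit set used at every position is our assumption for the example, as is treating the 180GH/s peak as a sustained rate.

```python
def keyspace(charset_size: int, min_len: int, max_len: int) -> int:
    """Number of candidates for all lengths from min_len to max_len inclusive."""
    return sum(charset_size ** n for n in range(min_len, max_len + 1))


RATE = 180e9  # hashes per second: the peak speed quoted above, assumed sustained

ATTACKS = {
    "?a length 1-8": keyspace(95, 1, 8),
    "?l?d length 9-10": keyspace(36, 9, 10),  # assumed 36-character custom charset
    "?b length 1-6": keyspace(256, 1, 6),
}


def hours_at_peak(candidates: int, rate: float = RATE) -> float:
    """Hours needed to exhaust a keyspace at the given hash rate."""
    return candidates / rate / 3600


if __name__ == "__main__":
    for name, size in ATTACKS.items():
        print(f"{name}: {size:.3e} candidates, ~{hours_at_peak(size):.1f} h at peak")
```

Under these assumptions the three attacks come out at roughly ten hours, six hours and half an hour respectively, which is why exhausting them was practical once the work was distributed.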

Statistical Properties
To speed up the analysis of such a large volume of plaintexts, we wrote a custom tool, “Panal” (to be released at a later time), to quickly and accurately analyse our dataset of over 320 million passwords. The longest password we found was 400 characters, while the shortest was only 3 characters long. About 0.06% of passwords were 50 characters or longer, while 96.67% of passwords were 16 characters or less. Roughly 87.3% of passwords fall into four character sets: LowerNum (47.5%), LowerCase (24.75%), Num (8.15%), and MixedNum (6.89%). In addition, we saw UTF-8 encoded passwords along with passwords containing control characters. See [9] for full Panal output.
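Panal itself is not shown here, but the kind of character-set bucketing behind these percentages can be sketched as follows. The category definitions are our reading of the names (LowerNum = lowercase letters plus digits, MixedNum = both letter cases plus digits, and so on); the remaining labels (MixedCase, UpperCase, UpperNum, Other) are hypothetical names added for completeness.

```python
import string
from collections import Counter

LOWER = set(string.ascii_lowercase)
UPPER = set(string.ascii_uppercase)
DIGITS = set(string.digits)


def charset_class(password: str) -> str:
    """Assign a password to a coarse character-set bucket."""
    chars = set(password)
    # Anything outside a-z, A-Z, 0-9 (symbols, UTF-8, control characters).
    if not chars or chars - (LOWER | UPPER | DIGITS):
        return "Other"
    has_lower = bool(chars & LOWER)
    has_upper = bool(chars & UPPER)
    has_digit = bool(chars & DIGITS)
    if has_lower and has_upper:
        return "MixedNum" if has_digit else "MixedCase"
    if has_lower:
        return "LowerNum" if has_digit else "LowerCase"
    if has_upper:
        return "UpperNum" if has_digit else "UpperCase"
    return "Num"


def class_shares(passwords):
    """Percentage of passwords per bucket."""
    counts = Counter(charset_class(p) for p in passwords)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    print(class_shares(["abc123", "password", "123456", "Passw0rd", "p@ss"]))
```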


Blocking common passwords during account creation has positive effects on the overall password security of a website [10]. While blacklisting 320m leaked passwords might sound like a good idea to further improve password security, it can have unforeseeable consequences for usability (i.e., the level of user frustration). Conventional blacklist approaches typically include the 10k most common passwords to limit the consequences of online password guessing attacks. Until now, there has been no evidence to show which blacklist size provides an optimal balance.

Post written in collaboration with @m33x and @tychotithonus

[0] 2017-08-03: Have I been pwned? by Troy Hunt
[1] 2017-08-03: Introducing 306 Million Freely Downloadable Pwned Passwords 
[2] 2017-08-03: MDXfind v1.93
[3] 2017-08-28: Hashcat sha1(sha512($pass)) patch
[4] 2017-08-27: Some tools we developed to deal with incorrectly parsed strings
[6] 2017-08-20: Hashcat Issue “hexify also all password of format $HEX[]”
[7] 2017-08-18: Hashcat Issue Potential Silent Cracking Failures at Certain Hash-Count
[8] 2017-08-03: Hashtopussy by s3inlc
[9] 2017-08-29: Panal (Password Analysis) 320m HIBP Passwords
[10] 2017-08-03: Password Creation in the Presence of Blacklists

Sunday, June 25, 2017

32hex is not MD5? What are Youku talking about?


During April 2017, various online sources alleged that Youku, a Chinese video hosting service, was hacked and that roughly 100 million user accounts were compromised. These sources stated that Youku usernames, along with passwords hashed with the MD5 and SHA1 algorithms, were leaked. We decided to take a closer look in early June and present our findings in this post.

Of the 99,075,692 lines of data present in the leak provided to us, we were able to extract 99,028,838 usable hash strings. The extracted hash strings varied in length from 30 to 32 ASCII-hex characters, which suggested to us that they were MD5-like (a full MD5 digest is 32 hex characters). After de-duplicating the hashes we were left with 57,205,528 hashes, suggesting there was password re-use in this data.

A common practice, especially on Chinese websites, is for developers to employ a form of ob-security (security through obscurity) in their password storage schemes. We suspect this is most likely done to deter the hashes from being loaded into off-the-shelf password crackers. Another explanation would be that mistakes were made in processing the data.

As we started to work on this data set, it quickly became apparent that there were more than just MD5 hashes in this file.  We were able to identify both iterated MD5 hashes, as well as more complex sub-string iterated hashes.  Each of these also appeared as a chopped (last digits removed) value as well.  The majority of hashes were MD5($pass), but we found a sizeable number of MD5(MD5($pass)) and MD5(MD5(MD5($pass))).  The substring hashes were of the form MD5(substr(MD5($pass),8,16)).
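For reference, the four constructions can be written out in Python as below. This is a sketch under two assumptions that matched our reading of the data: each inner digest is fed to the next round as a lowercase hex string, and substr(...,8,16) means offset 8 (zero-based) with length 16, i.e. hex characters 8 to 24. The function names are ours.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def md5_iterated(password: bytes, rounds: int) -> str:
    """MD5 applied `rounds` times, re-hashing the lowercase hex digest each time."""
    digest = md5_hex(password)
    for _ in range(rounds - 1):
        digest = md5_hex(digest.encode("ascii"))
    return digest


def md5_sub8_16(password: bytes) -> str:
    """MD5(substr(MD5($pass),8,16)): hash the middle 16 hex characters."""
    inner = md5_hex(password)
    return md5_hex(inner[8:24].encode("ascii"))


def candidate_hashes(password: bytes) -> dict:
    """All four variants described above for one candidate password."""
    return {
        "MD5($pass)": md5_iterated(password, 1),
        "MD5(MD5($pass))": md5_iterated(password, 2),
        "MD5(MD5(MD5($pass)))": md5_iterated(password, 3),
        "MD5(substr(MD5($pass),8,16))": md5_sub8_16(password),
    }


if __name__ == "__main__":
    for name, value in candidate_hashes(b"123456").items():
        print(f"{name}: {value}")
```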

The number of different MD5 variations used in hashing the passwords could be attributed to a number of factors, which we cannot know and can only make assumptions about. The simplest explanation is that the developers changed the hashing method across successive updates to their website. Other explanations are that they merged with another service and imported its user accounts along with their hashes, or that different account types, such as operators and users, used different hashing schemes.

Dealing with the chopped hashes was not a problem for our tools. MDXfind natively supports partial matching of hashes, but we did modify hashcat to support these as well. See below for an example patch based on hashcat 3.6.0. A “clean” version including MD5sub8-24MD5 may be released at a later point. This required both small changes in the input parser, as well as the kernel code. We then ran the cracked passwords as a dictionary with MDXfind to mark the hashes correctly.
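The idea of partial matching is simple enough to illustrate in Python. The sketch below is not the hashcat patch and not how MDXfind does it internally; it only shows the principle of indexing every target by its first 30 hex characters (the shortest length we saw) and accepting a candidate when its full MD5 digest starts with the stored, possibly chopped, value. All names are ours.

```python
import hashlib
from collections import defaultdict

MIN_LEN = 30  # shortest hash string seen in the dump


def build_prefix_index(hashes, prefix_len: int = MIN_LEN):
    """Index target hashes (full or chopped) by their first prefix_len characters."""
    index = defaultdict(list)
    for h in hashes:
        h = h.strip()
        if len(h) >= prefix_len:
            index[h[:prefix_len]].append(h)
    return index


def match_chopped(password: bytes, index, prefix_len: int = MIN_LEN):
    """Return every stored hash that is a prefix of MD5(password)."""
    full = hashlib.md5(password).hexdigest()
    return [h for h in index.get(full[:prefix_len], []) if full.startswith(h)]


if __name__ == "__main__":
    full = hashlib.md5(b"abc").hexdigest()
    targets = [full, full[:30], full[:31], "0" * 32]
    index = build_prefix_index(targets)
    print(match_chopped(b"abc", index))  # the full hash and both chopped forms
```

Even a 30-character prefix still carries 120 bits of the digest, so a prefix match is, for practical purposes, as good as a full match.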


Of the 99 million hashes we parsed, we were able to recover 94.836 million - roughly a 95.7% success rate. Interestingly, we noticed about 1.5 million MD5-like hashes which were in uppercase ASCII-hex form, as opposed to lowercase like the rest. We were not able to recover any of these hashes, and it is possible that they are either salted or use a more exotic algorithm.

We found 48 million unique passwords, which solved the 94.8 million hashes. The top-25 passwords for this list are typical for this type of website. It is interesting to note that the fourth most common password, ‘xuanchuan’, is the romanized form of 宣傳, which translates to “propaganda” in English.

Perhaps the most interesting thing about this leak was the number of “created” or “generated” accounts we found.  Many, perhaps even the majority, of the accounts use what we consider to be generated email addresses and certainly machine-generated passwords.  While the exact number is difficult to calculate with certainty, we suspect tens of millions of these accounts are generated.

For example, there are 222 accounts we believe were created on October 10, 2011, at 14:25:03, all with 11 character random usernames @qq.com.  Why do we believe this?  Because they share exactly the same password: “2011-10-10 14:25:03”.   These accounts are part of a larger group of 606,733 accounts all created that day, presumably between 14:25 and 15:33.  There were an additional 22,741 accounts similar to these created, we believe, on October 14, 2011 - again with a similar style of @qq.com accounts (but using 9 character user names).  We do not believe that any of these qq.com accounts exist.
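Spotting such groups is straightforward once the passwords are recovered. A minimal sketch, assuming the data is available as (email, password) pairs and that the generated passwords follow the exact “YYYY-MM-DD HH:MM:SS” shape shown above (the function names are ours):

```python
from collections import Counter
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp_password(password: str):
    """Return a datetime if the password is a valid timestamp, else None."""
    if len(password) != 19:  # "2011-10-10 14:25:03" is 19 characters long
        return None
    try:
        return datetime.strptime(password, TIMESTAMP_FORMAT)
    except ValueError:
        return None


def timestamp_password_counts(records):
    """Count accounts per timestamp-shaped password."""
    counts = Counter()
    for _email, password in records:
        if parse_timestamp_password(password) is not None:
            counts[password] += 1
    return counts


if __name__ == "__main__":
    sample = [
        ("abcdefghijk@qq.com", "2011-10-10 14:25:03"),
        ("lmnopqrstuv@qq.com", "2011-10-10 14:25:03"),
        ("someone@example.com", "hunter2"),
    ]
    for password, n in timestamp_password_counts(sample).most_common():
        print(password, n)
```

Sorting the resulting counts by timestamp gives the kind of per-second creation profile described above.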

Another example is the uppercase ASCII-hex hashes. 1,563,853 (all but 1538) of these have email addresses like this: 037d6909-04a9-4b45-a309-157ef846c573@qzone.com. Having a UUID as the email address is strange enough, but we also looked into qzone.com. The records of DNSTrails show that an MX record for this domain only existed between October 2008 and August 2009. Also, the Wayback Machine of archive.org doesn’t have any snapshots from that period. These facts lead us to believe that these are generated accounts.

One thing to take from this is that ob-security doesn't really help. In addition, it is interesting to see how many different plays on MD5 are used in this leak. It is always a good idea not to assume that a single hash algorithm is being used, even if the hashes come from a single data set. Hopefully we have provided an interesting read, and we would love to find out why there are 1.5 M hashes which seem slightly different from the rest. If you know something, contact us.