Sunday, June 25, 2017

32hex is not MD5? What are Youku talking about?


During April 2017, various online sources alleged that Youku, a Chinese video hosting service, was hacked and that roughly 100 million user accounts were compromised. These sources stated that Youku usernames, along with passwords hashed with the MD5 and SHA1 algorithms, were leaked. We decided to take a closer look in early June and present our findings in this post.

Of the 99,075,692 lines of data present in the leak provided to us, we were able to extract 99,028,838 usable hash strings. The extracted hashes varied in length from 30 to 32 ASCII-hex characters, which suggested to us that they were MD5-like rather than SHA1 (a SHA1 digest is 40 hex characters). After de-duplicating the hashes we were left with 57,205,528 unique hashes, suggesting there was password re-use in this data.
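
The filtering and de-duplication step described above can be sketched roughly as follows. This is a minimal illustration, not the tooling we actually used: the function names are hypothetical, and it assumes the hash field has already been split out of each line of the dump.

```python
import re
from collections import Counter

# Assumption: a usable hash is 30 to 32 hex characters, as observed in the dump.
HASH_RE = re.compile(r"[0-9a-fA-F]{30,32}")

def extract_hashes(fields):
    """Return the fields that look like full or chopped MD5 hex strings."""
    cleaned = (f.strip() for f in fields)
    return [f for f in cleaned if HASH_RE.fullmatch(f)]

def summarize(hashes):
    """Count total hashes, unique hashes, and hashes per length."""
    by_length = Counter(len(h) for h in hashes)
    return {
        "total": len(hashes),
        "unique": len(set(hashes)),
        "by_length": dict(by_length),
    }
```

A gap between "total" and "unique" is what points to password re-use, since these hashes appear to be unsalted.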

A common practice, especially on Chinese websites, is for developers to employ a form of security through obscurity ("ob-security") in their password storage schemes. We suspect this is most likely done to deter the hashes from being loaded into off-the-shelf password crackers. Another explanation would be that mistakes were made in processing the data.

As we started to work on this data set, it quickly became apparent that there were more than just MD5 hashes in this file. We were able to identify both iterated MD5 hashes and more complex sub-string iterated hashes. Each of these also appeared in a chopped form, with the last digits removed. The majority of hashes were MD5($pass), but we found a sizeable number of MD5(MD5($pass)) and MD5(MD5(MD5($pass))). The substring hashes were of the form MD5(substr(MD5($pass),8,16)).
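
To make the four schemes concrete, here is a small sketch that computes each variant for a candidate password. It assumes the inner hashes are lowercase hex strings, that substr(...,8,16) means 16 characters starting at zero-based offset 8 (consistent with the MD5sub8-24MD5 naming), and that passwords are UTF-8 encoded. The function names are ours for illustration only.

```python
import hashlib

def md5_hex(text: str) -> str:
    """Lowercase hex MD5 of a string (UTF-8 encoding is an assumption)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def candidate_hashes(password: str) -> dict:
    """Compute the MD5 variants identified in the dump for one password."""
    h1 = md5_hex(password)
    h2 = md5_hex(h1)
    h3 = md5_hex(h2)
    # Assumption: characters 8..23 (zero-based) of the first MD5 are re-hashed.
    sub = md5_hex(h1[8:24])
    return {
        "MD5($pass)": h1,
        "MD5(MD5($pass))": h2,
        "MD5(MD5(MD5($pass)))": h3,
        "MD5(substr(MD5($pass),8,16))": sub,
    }

def chop(full_hash: str, length: int) -> str:
    """Return the chopped form: the hash with its last digits removed."""
    return full_hash[:length]
```

For example, chop(candidate_hashes("password")["MD5($pass)"], 30) gives the 30-character form of a plain MD5 hash as it might appear in the file.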

The number of different MD5 variations used to hash the passwords could be attributed to a number of factors; we cannot know which, and can only make assumptions. The simplest explanation is that the developers changed the hashing method across successive updates to their website. Another is that Youku merged with another service and imported its user accounts along with their hashes. Alternatively, different account types, such as operators and users, may have used different hashing schemes.

Dealing with the chopped hashes was not a problem for our tools. MDXfind natively supports partial matching of hashes, and we modified hashcat to support these as well, which required small changes in both the input parser and the kernel code. See below for an example patch based on hashcat 3.6.0. A “clean” version including MD5sub8-24MD5 may be released at a later point. We then ran the cracked passwords as a dictionary with MDXfind to mark each hash with its correct algorithm.
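
The idea behind partial matching and marking can be illustrated in a few lines of Python. To be clear, this is neither the hashcat patch nor how MDXfind works internally; it is only a sketch of the principle, with hypothetical names, and it lists just two algorithms to stay short. Stored hashes are indexed by their first 30 characters (the shortest length we saw), and a candidate matches when the stored string is a prefix of the full digest.

```python
import hashlib
from collections import defaultdict

PREFIX_LEN = 30  # shortest hash length observed in the dump

def md5_hex(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()

# Only two schemes are listed here; others would be added the same way.
ALGORITHMS = {
    "MD5": md5_hex,
    "MD5(MD5)": lambda p: md5_hex(md5_hex(p)),
}

def label_hashes(stored_hashes, wordlist):
    """Map each matched stored hash to (algorithm name, password)."""
    # Index stored hashes (full or chopped) by their first 30 characters.
    index = defaultdict(list)
    for stored in stored_hashes:
        index[stored[:PREFIX_LEN]].append(stored)

    labels = {}
    for word in wordlist:
        for name, func in ALGORITHMS.items():
            full = func(word)
            for stored in index.get(full[:PREFIX_LEN], []):
                # A chopped hash matches if it is a prefix of the full digest.
                if full.startswith(stored):
                    labels[stored] = (name, word)
    return labels
```

Dropping up to two hex digits still leaves 120 bits to compare, so the chance of a false match is negligible in practice.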


Of the 99 million hashes we parsed, we were able to recover 94.836 million, a success rate of roughly 95.7%. Interestingly, we noticed about 1.5 million MD5-like hashes in uppercase ASCII-hex form, as opposed to lowercase like the rest. We were not able to recover any of these hashes, and it is possible they are either salted or use a more exotic algorithm.

We found 48 million unique passwords, which solved the 94.8 million hashes. The top 25 passwords in this list are typical for this type of website. It is interesting to note that the fourth most common password, ‘xuanchuan’, is the romanized form of 宣傳, which translates to English as propaganda.

Perhaps the most interesting thing about this leak was the number of “created” or “generated” accounts we found.  Many, perhaps even the majority, of the accounts use what we consider to be generated email addresses and certainly machine-generated passwords.  While the exact number is difficult to calculate with certainty, we suspect tens of millions of these accounts are generated.

For example, there are 222 accounts we believe were created on October 10, 2011, at 14:25:03, all with 11-character random usernames. Why do we believe this? Because they share exactly the same password: “2011-10-10 14:25:03”. These accounts are part of a larger group of 606,733 accounts all created that day, presumably between 14:25 and 15:33. We believe an additional 22,741 similar accounts were created on October 14, 2011, again in the same style (but using 9-character usernames). We do not believe that any of these accounts belong to real users.
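
One way to surface clusters like this from the recovered passwords is to look for passwords that parse as timestamps and count how many accounts share each one. The sketch below assumes records are available as (username, password) pairs and that the timestamp format is exactly "YYYY-MM-DD HH:MM:SS"; the function names are hypothetical.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: generated passwords have the exact form "YYYY-MM-DD HH:MM:SS".
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def timestamp_password(password):
    """Return the datetime a password encodes, or None if it is not one."""
    if not TIMESTAMP_RE.fullmatch(password):
        return None
    try:
        return datetime.strptime(password, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        # Right shape, but not a real date or time (e.g. month 13).
        return None

def accounts_per_timestamp(records):
    """Count accounts sharing each timestamp-style password."""
    counts = Counter()
    for _username, password in records:
        ts = timestamp_password(password)
        if ts is not None:
            counts[ts] += 1
    return counts
```

Sorting the resulting counter by timestamp makes bursts of account creation, such as a run of consecutive seconds on a single day, easy to spot.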

Another example is the uppercase ASCII-hex hashes. 1,563,853 of these (all but 1,538) have a UUID as the email address. That is strange enough on its own, but we also looked into the domain these addresses use. The records of DNSTrails show that an MX record for this domain only existed between October 2008 and August 2009. The Wayback Machine also has no captures of the domain from that period. These facts lead us to believe that these are generated accounts.

One thing to take from this is that ob-security doesn't really help. It is also interesting to see how many different plays on MD5 were used in this leak. It is always a good idea not to assume a single hash algorithm is being used, even when the hashes come from a single data set. Hopefully we have provided an interesting read. We would love to find out why there are 1.5 million hashes which seem slightly different from the rest; if you know something, contact us.