Recovering Mercurial code from Bitbucket

Today I received the type of e-mail that we all know will one day arrive: an e-mail from someone trying to locate a file that doesn't exist anymore.

The problem is very simple: friends of mine are trying to download code from https://bit.ly/2jIu1Vm to replicate results from an earlier paper, but the link redirects to https://bitbucket.org/villalbamartin/refexp-toolset/src/default/. You may recognize that URL: it belongs to Bitbucket, the company that infamously dropped support for Mercurial a couple of months ago despite hosting one of the largest collections of Mercurial repositories on the internet.

This is the story of how I searched for that code, and even managed to recover some of it.

Offline backup

Unlike in typical stories, several backup copies of this code existed. As in most stories, however, they all suffered terrible fates:

  • There was a migration of most of our code to GitHub, but this specific repo was missed: it belonged to our university group (everyone in that group had access to it) but it was not created under the group account.
  • Three physical copies of this code existed. One lived in a hard drive that died, one lived in a hard drive that may be lost, and the third one lives on my hard drive... but it may be missing a couple of commits, because I was not part of the project at that time.

At this point my copy is the best one remaining, and it doesn't seem to be too outdated. But could we do better?

Online repositories

My next step was figuring out whether a copy of this repo still exists on the internet - it is well known that everything online is being mirrored all the time, and it was only a question of figuring out who was more likely to have a copy.

My first stop was Archive Team, the volunteer group whose collections are hosted by the Internet Archive. This team famously downloaded 245K public repos from Bitbucket, and therefore they were my first choice when checking whether someone still had a copy of our code.

The experience yielded mixed results: accessing the repository with my browser is impossible because the page throws a large number of errors related to Content Security Policy, missing resources, and deprecated attributes. I imagine no one has looked at it in quite some time, as is to be expected when dealing with historical content. On the command line, however, it mostly works: I can download the contents of my repo with a single command:

hg clone --stream https://web.archive.org/web/2id_/https://bitbucket.org/villalbamartin/refexp-toolset

I say "mostly works" because my repo has a problem: it uses sub-repositories, which apparently Archive Team failed to archive. I can download the root directory of my code, but important subdirectories are missing.
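For context, Mercurial declares sub-repositories in a .hgsub file at the repository root, mapping a local path to an external source; a crawler that mirrors only the main repository never follows those links, which is presumably why the sub-repositories are missing. A hypothetical example of what such a file looks like (these paths and URLs are illustrative, not the actual contents of our repo):

```
tools/parser = https://bitbucket.org/someuser/parser
data/corpus  = https://bitbucket.org/someuser/corpus
```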

My second stop was the Software Heritage archive, an initiative focused on collecting, preserving, and sharing software code in a universal software storage archive. They partnered with the Mercurial hosting platform Octobus and produced a second mirror of Bitbucket projects, most of which can be nicely accessed via their properly-working web interface. For reasons I don't entirely understand, this web interface does not show my repo, but luckily for us the website also provides a second, more comprehensive list of archived repositories, where I did find a copy.

As expected, this copy suffers from the same sub-repo problem as the other one. But if you are looking for any of the remaining 99% of code that doesn't use subrepos, you can probably stop reading here.

Deeper into the rabbit hole

At this point, we need to bring out the big guns. Seeing as the Software Heritage/Octobus mirror is already providing me with the raw files they have, I don't think I can get more out of them than I currently am. The Internet Archive, on the other hand, could still have something of use: if they crawled the entire interface with a web crawler, I may be able to recover my code from there.

The surprisingly-long process goes like this: first, you go to the Wayback Machine, give them the repository address, and find the date when the repository was crawled (you can see it in their calendar view). Then go to the Kicking the bucket project page and search for a date that roughly matches. In my case the repository was crawled on July 6, but the raw files I was looking for were stored in a file titled 20200707003620_2361f623. To identify this file I simply went through all files created on or after July 6, downloaded their index (in my case, the one named ITEM CDX INDEX) and used zgrep to check whether the string refexp-toolset (the key part of the repo's name) appeared in any of them. Once I had identified the proper file, downloading the raw 28.9 GB WEB ARCHIVE ZST file took about a day.
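The index search can be scripted. A minimal sketch, assuming you have downloaded the gzipped CDX indexes into the current directory; the function name is mine, not part of any tool, and the file names in the usage line are hypothetical:

```shell
# Search gzipped CDX indexes for a string and print the matching index files.
find_index() {
  local needle="$1"; shift
  local idx
  for idx in "$@"; do
    # zgrep reads gzipped files directly; -q makes it just report a match.
    zgrep -q -- "$needle" "$idx" && echo "$idx"
  done
}

# Usage (hypothetical file names):
# find_index refexp-toolset bitbucket_2020070*.cdx.gz
```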

Once you have downloaded this file, you need to decompress it. It is compressed with ZST, meaning that you probably need to install the zstd tool or similar (the zstd package worked for me in Devuan, so it's probably available in Ubuntu and Debian too). But we are not done! See, the ZST standard allows the use of an external dictionary, without which you cannot open the WARC file (you get a Decoding error (36) : Dictionary mismatch error). The dictionaries are all available at the bottom of the same file list. How do you identify the correct one? In my case, the file I want to decompress is called bitbucket_20200707003620_2361f623.1592864269.megawarc.warc.zst, so the correct dictionary is the one called archiveteam_bitbucket_dictionary_1592864269.zstdict.zst. This file has a .zst extension itself, so don't forget to extract it too!

Once you have extracted the dictionaries, found the correct one, and extracted the contents of your warc.zst file (unzstd -D <dictionary> <file>), it is time to access the file. The Webrecorder Player didn't work too well because the dump is too big, but the warctools package was helpful enough for me to realize... that the files I need are not in this dump either.
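If you want to peek inside a decompressed dump without extra tooling, the WARC format is plain enough that you can grep the record headers directly: every record carries a WARC-Target-URI header with the archived URL. A rough sketch under the assumption that the dump has been decompressed to dump.warc (the file name and search pattern are placeholders):

```shell
# Print the WARC-Target-URI header of every record whose URL matches a pattern.
warc_urls() {
  local pattern="$1" warc="$2"
  # -a treats the (mostly binary) WARC body as text so grep doesn't bail out.
  grep -a "^WARC-Target-URI:" "$warc" | grep -- "$pattern"
}

# Usage:
# warc_urls refexp-toolset dump.warc
```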

So that was a waste of time. On the plus side, if you ever need to extract files from the Internet Archive, you now know how.

Final thoughts

So far I seem to have exhausted all possibilities. I imagine that someone somewhere has a copy of Bitbucket's raw data, but I haven't managed to track it down yet. I have opened an issue regarding sub-repo cloning, but I don't expect it to be picked up anytime soon.

The main lesson to take away from here is: backups! I'm not saying you need 24/7 NAS mirroring, but you need something. If we had four copies and three of them failed, that should tell you all you need to know about the fragility of your data.

Second, my hat goes off both to the Internet Archive team and to the collaboration between the Software Heritage archive and Octobus. I personally like the latter more because their interface is a lot nicer (and more functional) than the Internet Archive's, but I also appreciate the possibility of downloading everything and sorting through it myself.

And finally, I want to suggest that you avoid Atlassian if you can. Atlassian has the type of business mentality that would warm Oracle's heart if they had one. Yes, I know they bought Trello and it's hard to find a better Kanban board, but remember that Atlassian is the company that, in no particular order,

  • regularly inflicts Jira on developers worldwide,
  • bought Bitbucket and then gutted it, and
  • sold everyone on the promise of local hosting and then discontinued it last week for everyone but their wealthiest clients, forcing everyone else to move to the cloud. Did you know that Australia has legally-mandated encryption backdoors? And do you want to take a guess on where Atlassian's headquarters are? Just saying.