Today I received the type of e-mail that we all know will arrive one day: an e-mail where someone is trying to locate a file that doesn't exist anymore.
The problem is very simple: friends of mine are trying to download code from https://bit.ly/2jIu1Vm to replicate results from an earlier paper, but the link redirects to https://bitbucket.org/villalbamartin/refexp-toolset/src/default/. You may recognize that URL: it belongs to Bitbucket, the company that infamously dropped their support for Mercurial a couple of months ago despite being one of the largest Mercurial hosts on the internet.
This is the story of how I searched for that code, and even managed to recover some of it.
Unlike typical stories, several backup copies of this code existed. Like most stories, however, they all suffered terrible fates:
- There was a migration of most of our code to GitHub, but this specific repo was missed because it belongs to our University group (everyone in that group had access to it) but it was not created under the group account.
- Three physical copies of this code existed. One lived on a hard drive that died, one lived on a hard drive that may be lost, and the third one lives on my hard drive... but it may be missing a couple of commits, because I was not part of that project at that time.
At this point my copy is the best one, and it doesn't seem to be that outdated. But could we do better?
My next step was figuring out whether a copy of this repo still exists on the internet - it is well known that everything online is being mirrored all the time, and it was only a question of figuring out who was more likely to have a copy.
My first stop was Archive Team, from the people behind the Internet Archive. This team famously downloaded 245K public repos from Bitbucket, and therefore they were my first choice when checking whether someone still had a copy of our code.
The experience yielded mixed results. Accessing the repository with my browser is impossible because the page throws a large number of errors related to Content Security Policy, missing resources, and deprecated attributes. I imagine no one has looked at it in quite some time, as is to be expected when dealing with historical content. On the command line, however, it mostly works, and I can download the content of my repo with a single command:
hg clone --stream https://web.archive.org/web/2id_/https://bitbucket.org/villalbamartin/refexp-toolset
I say "mostly works" because my repo has a problem: it uses sub-repositories, which apparently Archive Team failed to archive. I can download the root directory of my code, but important subdirectories are missing.
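Mercurial records sub-repositories in a .hgsub file at the root of the repository, one "path = source" line each. A rough way to see which of them a clone is missing is sketched below. It assumes the clone's working copy actually contains .hgsub, which may not hold if the checkout aborted early, and the function name missing_subrepos is made up for this example.

```shell
# missing_subrepos REPO_DIR
# Prints every sub-repository path declared in REPO_DIR/.hgsub that does not
# exist as a directory inside the clone. Prints nothing if there is no .hgsub.
missing_subrepos() (
    repo=$1
    [ -f "$repo/.hgsub" ] || exit 0

    # Each mapping line looks like "path = source". Blank lines, comments,
    # and section headers such as [subpaths] are skipped.
    awk -F= '
        /^[ \t]*$/    { next }
        /^[ \t]*[#[]/ { next }
        NF >= 2 { gsub(/^[ \t]+|[ \t]+$/, "", $1); print $1 }
    ' "$repo/.hgsub" |
    while IFS= read -r path; do
        [ -d "$repo/$path" ] || printf '%s\n' "$path"
    done
)
```

Calling missing_subrepos refexp-toolset after the clone would then list the declared directories that never made it into the archive.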
My second stop was the Software Heritage archive, an initiative focused on collecting, preserving, and sharing software code in a universal software storage archive. They partnered up with the Mercurial hosting platform Octobus and produced a second mirror of Bitbucket projects, most of which can be nicely accessed via their properly-working web interface. For reasons I don't entirely get, this web interface does not show my repo, but luckily for us the website also provides a second, more comprehensive list of archived repositories where I did find a copy.
As expected, this copy suffers from the same sub-repo problem as the other one. But if you are looking for any of the remaining 99% of code that doesn't use subrepos, you can probably stop reading here.
Deeper into the rabbit hole
At this point, we need to bring out the big guns. Seeing as the SH/Octobus repo is already providing me with the raw files they have, I don't think I can get more out of them than what I currently do. The Internet Archive, on the other hand, could still have something of use: if they crawled the entire interface with a web crawler, I may be able to recover my code from there.
The surprisingly long process goes like this: first, you go to the Wayback Machine, give them the repository address, and find the date when the repository was crawled (you can see it in their calendar view). Then go to the Kicking the bucket project page and search for a date that roughly matches it. In my case the repository was crawled on July 6, but the raw files I was looking for were stored in a file titled 20200707003620_2361f623. To identify this file I simply went through all files created on or after July 6, downloaded their index (in my case, the one named ITEM CDX INDEX), and used zgrep to check whether the string refexp-toolset (the key part of the repo's name) was contained in any of them. Once I identified the proper file, downloading the raw 28.9 GB WEB ARCHIVE ZST file took about a day.
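The index search can be sketched as a small shell function. This is only an illustration: it assumes the CDX indexes were saved locally as gzip-compressed files, and the function name find_in_indexes is made up for this example. It pipes gzip -dc into grep, which has the same effect as the zgrep check described above.

```shell
# find_in_indexes NEEDLE FILE...
# Prints the name of every gzip-compressed index file that mentions NEEDLE.
find_in_indexes() (
    needle=$1
    shift
    for index in "$@"; do
        # -F treats the needle as a literal string, -q only reports match/no match
        if gzip -dc "$index" | grep -qF -e "$needle"; then
            printf '%s\n' "$index"
        fi
    done
)
```

With the indexes in the current directory, something like find_in_indexes refexp-toolset *.cdx.gz would print only the index whose item is worth downloading.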
Once you have downloaded this file, you need to decompress it. The file is compressed with ZST, meaning that you probably need to install the zstd tool or similar (this one worked in Devuan, so it's probably available in Ubuntu and Debian too). But we are not done! See, the ZST standard allows you to use an external dictionary, without which you cannot open the WARC file (you get a "Decoding error (36) : Dictionary mismatch" error). The list of all dictionaries is available at the bottom of this list. How to identify the correct one? In my case, the file I want to decompress is called bitbucket_20200707003620_2361f623.1592864269.megawarc.warc.zst, so the correct dictionary is the one called […]. This file has a .zst extension, so don't forget to extract it too!
Once you have extracted the dictionaries, found the correct one, and extracted the contents of your warc.zst file (unzstd -D <dictionary> <file>), it is finally time to access the file. The Webrecorder Player didn't work too well because the dump is too big, but the warctools package was helpful enough for me to realize... that the files I need are not in this dump either.
So that was a waste of time. On the plus side, if you ever need to extract files from the Internet Archive, you now know how.
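For reference, the decompression and inspection steps can be sketched as two shell functions. Treat this as a rough outline under a few assumptions: both function names are made up, the dictionary file name ends in .zst, and grep supports the -a flag (GNU and BSD grep do). The second function is a crude stand-in for a proper WARC tool: it only lists the WARC-Target-URI headers that mention a given string.

```shell
# extract_warc DICT_ZST WARC_ZST
# Decompresses the dictionary first, then uses it to decompress the WARC.
extract_warc() (
    dict_zst=$1
    warc_zst=$2
    case $dict_zst in
        *.zst) ;;
        *) echo "expected a dictionary ending in .zst" >&2; exit 1 ;;
    esac
    dict=${dict_zst%.zst}
    zstd -d "$dict_zst" -o "$dict" &&
    zstd -d -D "$dict" "$warc_zst" -o "${warc_zst%.zst}"
)

# list_warc_uris WARC NEEDLE
# Prints the WARC-Target-URI header lines of an uncompressed WARC file
# that contain NEEDLE, with the trailing carriage returns removed.
list_warc_uris() (
    warc=$1
    needle=$2
    # -a makes grep treat the binary payloads in the file as text
    grep -a '^WARC-Target-URI:' "$warc" | grep -F -e "$needle" | tr -d '\r'
)
```

If list_warc_uris prints nothing for the repo name, the dump does not contain the pages you are after, and you have your answer without loading the whole file into a viewer.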
So far I seem to have exhausted all possibilities. I imagine that someone somewhere has a copy of Bitbucket's raw data, but I haven't managed to track it down yet. I have opened an issue regarding sub-repo cloning, but I don't expect it to be picked up anytime soon.
The main lesson to take away from here is: backups! I'm not saying you need 24/7 NAS mirroring, but you need something. If we had four copies and three of them failed, that should tell you all you need to know about the fragility of your data.
Second, my hat goes off both to the Internet Archive team and to the collaboration between the Software Heritage archive and Octobus. I personally like the latter more because their interface is a lot nicer (and more functional) than the Internet Archive's, but I also appreciate the possibility of downloading everything and sorting it myself.
And finally, I want to suggest that you avoid Atlassian if you can. Atlassian has the type of business mentality that would warm Oracle's heart if they had one. Yes, I know they bought Trello and it's hard to find a better Kanban board, but remember that Atlassian is the company that, in no particular order,
- regularly inflicts Jira on developers worldwide,
- bought Bitbucket and then gutted it, and
- sold everyone on the promise of local hosting and then discontinued it last week for everyone but their wealthiest clients, forcing everyone else to move to the cloud. Did you know that Australia has legally-mandated encryption backdoors? And do you want to take a guess on where Atlassian's headquarters are? Just saying.