Recovering Mercurial code from Bitbucket

Today I received the type of e-mail that we all know will arrive one day: an e-mail where someone is trying to locate a file that doesn't exist anymore.

The problem is very simple: friends of mine are trying to download code from https://bit.ly/2jIu1Vm to replicate results from an earlier paper, but the link redirects to https://bitbucket.org/villalbamartin/refexp-toolset/src/default/. You may recognize that URL: it belongs to Bitbucket, the company that infamously dropped its support for Mercurial a couple months ago despite being one of the largest hosts of Mercurial repositories on the internet.

This is the story of how I searched for that code, and even managed to recover some of it.

Offline backup

Unlike typical stories, several backup copies of this code existed. Like most stories, however, they all suffered terrible fates:

  • There was a migration of most of our code to GitHub, but this specific repo was missed because it belongs to our university group (everyone in that group had access to it) but it was not created under the group account.
  • Three physical copies of this code existed. One lived in a hard drive that died, one lived in a hard drive that may be lost, and the third one lives in my hard drive... but it may be missing a couple commits, because I was not part of that project at that time.

At this point my copy is the best one, and it doesn't seem to be that outdated. But could we do better?

Online repositories

My next step was figuring out whether a copy of this repo still exists on the internet - it is well known that everything online is being mirrored all the time, and it was only a question of figuring out who was more likely to have a copy.

My first stop was Archive Team, from the people behind the Internet Archive. This team famously downloaded 245K public repos from Bitbucket, and therefore they were my first choice when checking whether someone still had a copy of our code.

The experience yielded mixed results: accessing the repository with my browser is impossible because the page throws a large number of errors related to Content Security Policy, missing resources, and deprecated attributes. I imagine no one has looked at it in quite some time, as it is to be expected when dealing with historical content. On the command line, however, it mostly works: I can download the content of my repo with a single command:

hg clone --stream https://web.archive.org/web/2id_/https://bitbucket.org/villalbamartin/refexp-toolset

I say "mostly works" because my repo has a problem: it uses sub-repositories, which apparently Archive Team failed to archive. I can download the root directory of my code, but important subdirectories are missing.
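To see exactly which pieces are missing, it helps to know that Mercurial declares sub-repositories in a file called .hgsub at the root of the repository, one "path = source" entry per line. The following is a minimal sketch, assuming that .hgsub made it into the cloned working copy; the function names are my own, and the check for a nested .hg directory is just a simple heuristic for "this sub-repo was actually cloned".

```python
from pathlib import Path


def parse_hgsub(text: str) -> dict[str, str]:
    """Map each sub-repository path to its source, as listed in .hgsub."""
    subrepos = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, since the source may contain "=" too.
        path, sep, source = line.partition("=")
        if sep:
            subrepos[path.strip()] = source.strip()
    return subrepos


def missing_subrepos(repo_root: Path) -> dict[str, str]:
    """Return the declared sub-repositories that have no clone on disk."""
    hgsub = repo_root / ".hgsub"
    if not hgsub.exists():
        return {}
    declared = parse_hgsub(hgsub.read_text())
    return {
        path: source
        for path, source in declared.items()
        if not (repo_root / path / ".hg").is_dir()
    }
```

Calling missing_subrepos(Path("refexp-toolset")) on the cloned directory would then list every sub-repository that still needs to be tracked down, together with the URL it originally pointed to.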

My second stop was the Software Heritage archive, an initiative focused on collecting, preserving, and sharing software code in a universal software storage archive. They partnered up with the Mercurial hosting platform Octobus and produced a second mirror of Bitbucket projects, most of which can be nicely accessed via their properly-working web interface. For reasons I don't entirely get this web interface does not show my repo, but luckily for us the website also provides a second, more comprehensive list of archived repositories where I did find a copy.

As expected, this copy suffers from the same sub-repo problem as the other one. But if you are looking for any of the remaining 99% of code that doesn't use subrepos, you can probably stop reading here.

Deeper into the rabbit hole

At this point, we need to bring out the big guns. Seeing as the SH/Octobus repo is already providing me with the raw files they have, I don't think I can get more out of them than what I currently do. The Internet Archive, on the other hand, could still have something of use: if they crawled the entire interface with a web crawler, I may be able to recover my code from there.

The surprisingly-long process goes like this: first, you go to the Wayback Machine, give them the repository address, and find the date when the repository was crawled (you can see it in their calendar view). Then go to the Kicking the bucket project page, and search for a date that kind of matches that. In my case the repository was crawled on July 6, but the raw files I was looking for were stored in a file titled 20200707003620_2361f623. In order to identify this file I simply went through all files created on or after July 6, downloaded their index (in my case, the one named ITEM CDX INDEX) and used zgrep to check whether the string refexp-toolset (the key part of the repo's name) was contained in any of them. Once I identified the proper file, downloading the raw 28.9 GB WEB ARCHIVE ZST file took about a day.
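If you have many index files to go through, the zgrep step can be scripted. This is a rough sketch rather than the exact procedure I followed: it assumes the indexes were saved into one directory as gzip-compressed files ending in .cdx.gz (my guess at a sensible naming scheme, adjust the pattern to whatever you downloaded), and the function names are made up for this example.

```python
import gzip
from pathlib import Path


def cdx_lines_mentioning(index_path: Path, needle: str) -> list[str]:
    """Return the lines of one gzip-compressed CDX index that contain needle."""
    matches = []
    with gzip.open(index_path, "rt", encoding="utf-8", errors="replace") as index:
        for line in index:
            if needle in line:
                matches.append(line.rstrip("\n"))
    return matches


def indexes_mentioning(index_dir: Path, needle: str) -> list[Path]:
    """List the *.cdx.gz files in index_dir that mention needle at least once."""
    return [
        path
        for path in sorted(index_dir.glob("*.cdx.gz"))
        if cdx_lines_mentioning(path, needle)
    ]
```

Running indexes_mentioning(Path("indexes"), "refexp-toolset") returns the index files worth a closer look, which in turn tells you which of the huge archives is worth downloading.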

Once you have downloaded this file, you need to decompress it. This file is compressed with ZST, meaning that you probably need to install the zstd tool or similar (this one worked in Devuan, so it's probably available in Ubuntu and Debian too). But we are not done! See, the ZST standard allows you to use an external dictionary without which you cannot open the WARC file (you get a Decoding error (36) : Dictionary mismatch error). The list of all dictionaries is available at the bottom of this list. How to identify the correct one? In my case, the file I want to decompress is called bitbucket_20200707003620_2361f623.1592864269.megawarc.warc.zst, so the correct dictionary is the one called archiveteam_bitbucket_dictionary_1592864269.zstdict.zst. This file has a .zst extension, so don't forget to extract it too!
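The mapping between archive and dictionary can be written down as a small helper. Keep in mind that this generalizes from the single pair of file names I had in front of me, so treat the pattern as an assumption: the project name is taken from the start of the archive name, and the number right before .megawarc.warc.zst is taken as the dictionary identifier.

```python
import re

# Pattern inferred from one example file name; other dumps may differ.
MEGAWARC_NAME = re.compile(
    r"^(?P<project>[A-Za-z0-9]+)_.+\.(?P<dict_id>\d+)\.megawarc\.warc\.zst$"
)


def dictionary_for(warc_name: str) -> str:
    """Guess the name of the dictionary file that matches a megawarc file."""
    match = MEGAWARC_NAME.match(warc_name)
    if match is None:
        raise ValueError(f"unexpected megawarc file name: {warc_name}")
    return (
        f"archiveteam_{match['project']}_dictionary_"
        f"{match['dict_id']}.zstdict.zst"
    )
```

For my file, dictionary_for("bitbucket_20200707003620_2361f623.1592864269.megawarc.warc.zst") gives archiveteam_bitbucket_dictionary_1592864269.zstdict.zst, which is the dictionary that worked.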

Once you have extracted the dictionaries, found the correct one, and extracted the contents of your warc.zst file (unzstd -D <dictionary> <file>) it is now time to access the file. The Webrecorder Player didn't work too well because the dump is too big, but the warctools package was helpful enough for me to realize... that the files I need are not in this dump either.
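For reference, the two extraction steps can be chained in a short script. This sketch only builds and runs the same unzstd calls described above; it assumes that unzstd is on your PATH and that it writes its output next to the input with the .zst suffix removed, which is how it behaved for me. The function names are my own.

```python
import subprocess
from pathlib import Path


def decompression_commands(dictionary_zst: Path, warc_zst: Path) -> list[list[str]]:
    """Build the two unzstd calls: first the dictionary, then the archive."""
    # "name.zstdict.zst" becomes "name.zstdict" once extracted.
    dictionary = dictionary_zst.with_suffix("")
    return [
        ["unzstd", str(dictionary_zst)],
        ["unzstd", "-D", str(dictionary), str(warc_zst)],
    ]


def decompress(dictionary_zst: Path, warc_zst: Path) -> None:
    """Run both steps in order, stopping at the first failure."""
    for command in decompression_commands(dictionary_zst, warc_zst):
        subprocess.run(command, check=True)
```

Keeping the command construction separate from the execution makes it easy to print the commands first and check them by hand before committing to a decompression that takes a while.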

So that was a waste of time. On the plus side, if you ever need to extract files from the Internet Archive, you now know how.

Final thoughts

So far I seem to have exhausted all possibilities. I imagine that someone somewhere has a copy of Bitbucket's raw data, but I haven't managed to track it down yet. I have opened an issue regarding sub-repo cloning, but I don't expect it to be picked up anytime soon.

The main lesson to take away from here is: backups! I'm not saying you need 24/7 NAS mirroring, but you need something. If we had four copies and three of them failed, that should tell you all you need to know about the fragility of your data.

Second, my hat goes off both to the Internet Archive team and to the collaboration between the Software Heritage archive and Octobus. I personally like the latter more because their interface is a lot nicer (and more functional) than the Internet Archive's, but I also appreciate the possibility of downloading everything and sorting it myself.

And finally, I want to suggest that you avoid Atlassian if you can. Atlassian has the type of business mentality that would warm Oracle's heart if they had one. Yes, I know they bought Trello and it's hard to find a better Kanban board, but remember that Atlassian is the company that, in no particular order,

  • regularly inflicts Jira on developers worldwide,
  • bought Bitbucket and then gutted it, and
  • sold everyone on the promise of local hosting and then discontinued it last week for everyone but their wealthiest clients, forcing everyone else to move to the cloud. Did you know that Australia has legally-mandated encryption backdoors? And do you want to take a guess on where Atlassian's headquarters are? Just saying.

Netflix and sound whitewashing

Note: I wrote this article in August, but I didn't realize it wasn't published until October. I kept the published date as it was, but if you didn't see it before, well, that's why.

Are you familiar with a small streaming company called "Netflix"? If so, you might recognize their opening sound. And even if you don't, you might have seen one of their multiple recent press campaigns regarding this topic, from a recent episode of the Twenty Thousand Hertz podcast on all the sound choices that go into their logo to their announcement that Hans Zimmer has worked on making it longer for cinema productions.

What none of those articles are saying is that this sound is also the sound of Kevin Spacey hitting a desk at the end of Season 2 of House of Cards. Yes, that House of Cards, the critically-acclaimed series that made Netflix's stock jump 70 percent even before it started and put Netflix on the map. If I were a Netflix executive back then, I would be proud of having the series as part of my corporate identity.

If I were an executive today, however, I would be terrified of people forever remembering that my company's official sound, the one that plays before every show, was first heard in a scene with an actor who was very publicly accused of sexual assault in 2017. So I can understand why someone would feel that a change is needed, and I'm all for it. No one is blaming Netflix (as far as I know) for not running background checks on their actors.

Having said that, it seems that Netflix has gone out of its way to completely erase any trace that this ever happened, in what has to be one of the most pointless history rewrites in some time. In the above-mentioned podcast, a sound engineer talks about all the sounds that came together to compose the current Netflix sound, from a ring on a cabinet to the sound of an anvil, with no mention whatsoever of Kevin Spacey hitting any desks.

Suffice to say, I was confused by this omission, so I dug a bit more and found a Facebook post from August 2019 from the Twenty Thousand Hertz podcast official account, where they posted:

"I'm convinced the @netflix sonic logo was originally built from Frank Underwood banging on the desk at the end of House of Cards Season 2. BUT, I'm dying to know who enhanced it! I can't find anything online! (...)".

I can only conclude that the "it's a ring on a cabinet" story is technically true and a sound engineer has actually used it to enhance Kevin Spacey's desk banging sound, but they conveniently "forgot" to mention the relation between these two facts. One of the answers to this Quora question mentions that "The tapping on the table with his (Kevin Spacey) ring is associated with completing a mission or one of his plans being accomplished", which sheds even more light on why they were banging rings on furniture to begin with. And let's pray that the hand wearing the ring wasn't Kevin Spacey's...

None of this is mentioned in the podcast. As for the longer version composed by Hans Zimmer, it does not include the original soundbite at all. I believe that Netflix is going on a PR campaign to rewrite their history, has convinced the Twenty Thousand Hertz podcast people to just go with it, and has so far been very successful.

And yet, I have to ask... why? Was it so difficult to come out and say "we don't want to be associated with this sound anymore, and therefore we are releasing a new one"? I honestly don't care about Netflix or House of Cards (which I have not seen), but I am kind of annoyed at such a transparent attempt to hide their history behind a PR campaign. Or even worse, that they seem to have gotten away with it.

A hot-dog is a sandwich, but you can't call it that

I remember seeing two main camps in the old debate on whether a hot-dog is a sandwich: those that argued "it's meat between two pieces of bread, therefore it's a sandwich" and those that counterargued "if I asked whether you wanted a sandwich and then gave you a hot-dog, you would be surprised". And both sides are right! That said, one of them is more right from a language-theoretical point of view, and that's the point of today's post.

(Spoiler: the "not a sandwich" camp is right. Sorry, pro-sandwichers!)

According to Herbert Clark a dialogue is a cooperative activity. It is a joint activity in which both speakers try to accomplish a common objective, ranging from something as formal as "make me understand where the train station is" to as vague an objective as "let's kill some time". And because it is cooperative, we do not expect the other person to be deliberately obtuse. If I ask "Can I get you something to drink?" and someone replies "I don't know, can you?", nobody would assume that this person has a genuine interest in my capacity for carrying drinks. Instead, we would immediately see this for what it is: that this person is not cooperating, that any reasonable person would have understood what I meant, and that this person stopped cooperating on purpose. Whether he did it to make a joke or because he's a jerk, that's a topic for a future discussion.

There is also a principle called the "maxim of quantity" (one of Grice's four maxims) according to which a person will always give as much information as possible, but not so much that it breaks the dialogue. If someone asks me where I come from, my answer can be as precise as a specific neighborhood or as vague as "somewhere near the border with Brazil". My answer will depend on how familiar I believe the other person to be with South-American geography, because I don't want to give them excessive information that they cannot handle. Again, I'm cooperating.

Which brings us to the hot-dog debate. From a taxonomic point of view, a hot-dog is a sandwich: it is composed of a piece of processed meat between two pieces of bread, which is as clear as it gets. But this is only half the story.

The definition of sandwich came long after the sandwich itself. It is an artificial construct designed to model and understand a set of real-life language usages. Spoken language, on the other hand, is the real deal. Language rules and word definitions model how we speak, and not the other way around. All definitions are artificial and, therefore, may not always reflect the way we actually use those words.

Bringing this all together, I would only use the word "sandwich" to describe a hot-dog if I had a reason to believe the other person doesn't know what a hot-dog is. If you know what a hot-dog is and I know you know what a hot-dog is, using the word "sandwich" to speak about a hot-dog is neither maximally informative (I am giving you less information than I could) nor cooperative (I know which word would help us the most, but I'm not using it). "Sandwich" is a catch-all word that can only be used when no better word exists.

Even worse: you know we both know what a hot-dog is. By choosing "sandwich", I am actively leading you to believe that I want to offer something that can be described as a "sandwich" but not as a "hot-dog". Fans of malicious compliance will argue that this is not technically untrue, but you and I both know that there's little practical difference between "I told you a lie" and "I told you something that any sane person in the world would understand in a certain way while secretly using a different, opposite interpretation that I kept to myself".

So there you have it. A hot-dog is a sandwich if you stick to rigid categories created by researchers with a tenuous grasp of the real world at best (you know, people like me), but you are only allowed to use it if you talk to people who never heard about hot-dogs before. Using the word "sandwich" for a hot-dog in any other context is uncooperative, mildly dishonest, and kind of a jerk move. People do not use the word "sandwich" like that and, since spoken language is where "true" language usage lies, they are the ones who are using it right.

I'm running out of forums

Like so many immigrants (or expats) around the world, I like to follow the news regarding what's happening back in my home country. But getting complete, reliable information has been getting more and more difficult every year, and 2020 is the year in which I finally ran out of news sources.

Unlike my parents' generation, I don't consider newspapers a reasonable source of reliable information. The problem is that, following the example set by Fox News, the largest newspapers at home have substituted opinions and extremely biased articles for facts, and outrage has replaced objectivity as the main selling point. All of this seasoned with local celebrity gossip, of course.

If despite my best judgment I decide to check what's going on based on newspapers, I currently start with the biggest newspaper (which is very right-leaning) and then compensate with their main opposition (which is, as expected, very left-leaning). I then figure out which stories are common to both, and decide which version is more likely to be true - one newspaper's fair trial is the other newspaper's witch hunt, and one newspaper's smart move is the other newspaper's national betrayal. Finally, I check which stories have been mentioned in only one of them, and decide on a reasonable narrative for why only one of them is talking about it. Suffice to say, doing this in the morning on my phone takes a lot of effort.

In my case, this was one problem (probably the only one) solved by the news aggregator Reddit. The sub-reddit for my country used to be relatively good at highlighting the main issues in the public consciousness, and reading the main posts used to be enough to get a good, updated picture. But not anymore: extremely lax moderation and general apathy have turned the forum into a meme-laden wasteland where all information has been replaced by bad jokes, political outrage, and the occasional "kill all poors" post. So I quit it for good, and haven't looked back.

Which leads me to today. I have not yet found any source of news that I can trust to inform me. Sure, I can easily get distracted, entertained, outraged, tricked, and lied to, but informed? Good luck with that. I have remained uninformed for a couple months now, and I don't see that changing anytime soon.

Being a programmer who has spent a bit of time toying around with NLP, I have been working on a solution - a small news aggregator where I can get a proper sense of what's going on. But it hasn't been easy: news agencies no longer offer RSS feeds, important data is locked behind proprietary formats made intentionally hard to read, and there are barely any corpora around in Spanish that I could use to train models.

I don't have a deeper point. I can't promise that my solution will work, because it's entirely possible that it never makes it out of vaporware. I don't have a website to recommend, because websites are getting bad faster than good ones are popping up. And I'm not going to talk about the cesspool that unmoderated forums are because we all know about that already.

I just wanted to say: I am very unhappy with this situation.

How to draw

I am okay at drawing. That means: I am probably better at drawing than a random person walking down the street but I'm far, far behind the type of artists that regularly post on Instagram. I am also mostly self-taught: I took some initial lessons via mail from the well-known (at the time) Modern Schools, gave up for a couple years, and picked it back up in my late teens when I needed something to do besides programming and not having friends. Some of my drawings have been published, and one in particular has been stolen countless times by people who think copying things from the internet without attribution is fine.

I was recently asked what I would recommend to someone who wants to learn how to draw. This question took me by surprise for two reasons: one, because I was never asked this before, and two, because my answer was surprisingly useless even to me:

Any tutorial you find online will give you the right steps. But you'll only understand them after you already know how to draw.

This is a pointless answer, which also happens to be 100% correct. This post is my attempt at giving a slightly clearer answer, explaining why anyone would think that my advice makes sense and hopefully giving beginners some good pointers on where to start.

Note 1: this post contains links to drawings of naked people. If you are not comfortable with drawn nudity, you should probably not follow the links and definitely reconsider whether figure drawing is good for you.

The boring advice

All drawing is, at its core, more or less the same. Whether you are interested in realistic drawing, comic drawing, manga drawing (a term I hate), webcomics or editorial cartoons, the art of representing human figures in 2D is based on 90% the same rules. Sure, US comics have more muscles and Japanese manga characters have no nose, but the fundamentals are the same. A typical drawing curriculum should include:

  • How to sketch a human figure. This guide is relatively good, while this one sucks for reasons I'll explain later on. If you've seen those wooden figures, they are useful for getting the hang of this step.
  • The proportions of the human body. More specifically, this guide on how many heads you need to draw a full body.
  • The proportions of the face. This is annoying enough that it often warrants a section of its own.
  • How perspective works. One-, two-, and three-point perspective are the typical ones.
  • How shadows work. Getting it perfect will take a long time, but "dark part is dark" will get you far with little effort.

Once you reach this point, you can either start learning about muscles and improve your anatomy (have you ever stopped to think about how weird knees look?), or become a caricaturist and call it a day.

The number of books and tutorials out there covering all these points is virtually infinite, and therefore any book you choose is probably going to be fine. If you want some more specific advice, multiple generations of artists have learned with Andrew Loomis' books, which are freely available on the Internet Archive. You should start with Figure drawing for all it's worth, follow up with Drawing the head and hands, and freshen up your perspective with the first half of Successful drawing.

Practical advice I: Keep drawing

There are two extra pieces of advice worth discussing.

The standard advice says "keep drawing until you become good at it", which is technically true but only barely. The full, honest version should say:

Start with one drawing. It will suck, and that's fine. Once you're finished, look at it objectively and enumerate its defects. For your next drawing, focus on solving those defects. Repeat until you consider yourself good enough [1].

In other terms: you can draw circles all day and all night for years, but that won't make you any better at drawing squares. If you want to get better at drawing, you first need to be aware of what's there to improve.

That doesn't mean that you can't be happy about something you just drew. Few things are as rewarding as putting your art supplies to the side, looking at your drawing, and admiring something knowing that you made it. All I'm saying is: you need to know what your blind spots are. If you are like me and your eyes are always sliiiiightly out of alignment, it is perfectly fine to still be happy about that portrait you just made. But if you are not honest and accept that yes, that one eye looks weird, then you will never learn how to fix it [2].

Practical advice II: Copy other people

As the quote goes, "Good artists copy, great artists steal". Therefore, it is your duty as an aspiring artist to copy as much as you can. Most self-taught artists I know started the same way, copying drawings over and over until they felt comfortable enough to start doing their own.

My suggestion: find an artist you like. Pick one of their drawings and copy it. Add the final work to your sketch folder. Repeat. This exercise serves several purposes:

  • It will improve your pencil grip, make your lines stronger, and improve your technique overall.
  • It allows you to focus on a sub-part of the problem (drawing a figure) without having to worry about the complicated stuff - you don't need to think about perspective, shadows or posture because the artist already did it for you.
  • It helps you to build your personal portfolio. It will help you visualize your progress, and gives you something to brag about whenever someone learns you are drawing and asks you to see something you've done. Plus, it's not like you wanted to throw those drawings away, right?
  • It will help you answer questions you didn't know you had. Do you want to know how to draw a feminine-looking nose? Copy one of Phil Noto's illustrations. Would you like to know how a professional goes from zero to done? You can watch professionals like Jim Lee do a couple pieces in real time online and even explain their process as they go. Are you wondering how much attention to pay to clothes and background? Once you notice that classical painters couldn't care less about whatever is below your shoulders, maybe you won't lose sleep over it either.

Eventually, you'll start noticing that different artists have different skills to offer. Maybe that guy draws cool hands, that other artist draws clothes very well, and that third other one has very expressive faces. Copying their work helps you understand the tricks they are using, and adding them to your repertoire helps you develop your own style.

Rest of the owl

The final point is both super important and really difficult to explain to beginners.

Are you familiar with the how to draw an owl meme? This picture is very popular in amateur art circles because it goes straight to the core issue: that most tutorials will take your hand and guide you step-by-step, but then they will let go at a critical step and you'll fall down a metaphorical cliff.

The root of the problem, I think, is that one step where the book tells you to "do what feels natural" or to "just keep going". What these people forget, however, is that learning what feels natural takes a lot of practice!

This tutorial I mentioned above is as bad as it gets: the instructions tell you to "Draw some vertical and horizontal lines to plan your drawing", which is completely useless advice that only makes sense once you know which lines to draw and where. Whoever wrote that guide has forgotten what it was like to be a beginner, and their advice is really not helping.

When that happens, you have two choices. You can look for a better tutorial, or you can keep going, and see how far you make it. There is no shame in trying and failing, and who knows? Maybe you'll still make it. Truth be told, there is a point at which no tutorial can help you and all that's left for you to do is to just draw. But that only applies to specific, advanced tutorials. It is the sad truth that, as a beginner, you will often recognize bad tutorials only once you are stuck in them.

Nobody said the life of an artist was easy.

Closing remarks

This guide ended up being longer than I intended, and half as long as it should be. That's always going to be a problem: the average artist does not let structure get in the way of their vision, and any attempt at a "formal" answer will stop halfway (as I have complained before). That said, if you would still appreciate a more structured approach, I have heard good things about Betty Edwards' book Drawing on the right side of the brain.

And finally: have fun. All of this advice is useful for when you want to get objectively better, but there's a lot to be said in favor of simply drawing because you enjoy it.

Happy drawing!


  1. Fair warning: in my experience, most artists never feel that they are "good enough". This is a well-known bug of art.

  2. I believe the process of "find defect, correct defect, repeat" is why most artists I know are never happy about their work. Seriously, go to an artist and tell them you like a particular drawing of theirs - there's a good chance that they'll give some excuse for why the drawing sucks.

