How I broke (and fixed) docs.atomist.com
or, a week in the life of a programmer
Lately I’m working on our documentation. We write it in markdown, turn it into a web site, and then serve it from s3. To turn it into a web site, we use mkdocs and the material theme for mkdocs. Mkdocs is written in Python. Then we test it with HtmlProofer, which is in Ruby. Okay.
A week ago, I set out to add an “Edit on GitHub” link to each of our pages.
That’s built-in functionality in mkdocs; define a
repo_url and an
mkdocs.yml and it should just work. It didn’t work right away (although now I wonder whether I just missed the little pencil symbol because I was looking for text). Before I dug into figuring out why, I upgraded mkdocs and material because we were two breaking versions behind; the latest is 3.0.4 and we were on 1.0.4. If I’m gonna study a tool, I want to use the latest docs.
The broken links
The upgrade was no trouble as far as producing the site. HtmlProofer, though, found a bunch of broken links. To troubleshoot this, I went through pretty much all the docs on mkdocs and on material. (They’re beautiful docs; there’s a reason we use these tools.) Then dug around in the material templates and the mkdocs Python code. I created an issue on mkdocs (it’s fixed already!) and on material (the maintainer said thanks!) for the bug, and then worked around it by adding an exception to our HtmlProofer invocation, after looking at its docs to find out how to do it. Which required breaking our HtmlProofer call into its own script because we call it in three places.
After that, there was still a broken link. I diagnosed this one also as a bug that could be fixed either in mkdocs or the template, but didn’t have the heart to make another issue report. I worked around it instead, by overriding that page in the template. (Just now, having noticed how nice the maintainers were, I made the effort to create another issue. I even tried to make a PR but the build steps didn’t work on my computer. This is not a surprise.)
A brief interval of work that I wanted to do
Now the tests and build work, and the upgrade is done. After this, I had an adventure getting the edit link working without having a GitHub repo link in the upper-right corner (it was useless and animated, yuck). To have the “Edit on GitHub” I need
repo_url defined in the config, but that always results in a repo link as well, so I had to override the entire
header.html to remove that link. When we update the theme, that override will be out of date. I considered various ways to use Atomist to make sure I remember to do that, then settled for a detailed commit message.
By the end of the day, I had a PR in to our docs repository with the upgrade. Over the weekend, I got that reviewed, modified and merged into master.
In which I break the entire site
The master branch of this repository gets published to GitHub pages for a final review. It looked fine there, so I deployed it to s3. A few hours later, I stopped by docs.atomist.com, and oh no!
That is not what our site should look like! None of the styles are loading!
Is it me? is it everyone? is it my browser? I tried clearing some caches, I tried a few browsers, then I rolled the site back. Then forward, and looked at it some more, then back again. There were some 404s on CSS, but later there weren’t, and then everything was loading OK but still it looked like garbage. Computed styles showed nearly-empty in Firefox but had more stuff in Chrome (later, someone pointed out that Chrome has a bunch of default styles).
This was beyond my paltry web diagnostic skills. The next day I asked for help from the team, and Danny volunteered to be a second pair of eyes. We modified the build process to push to a subdirectory so that we could leave the working site up while inspecting the newer, nonworking version. Danny spotted that the CSS files were loading, but the content type in the headers was wrong. It should be
text/css but is
text/plain . So the browser is loading the file and then ignoring it with no error. 😠
Useful! I had already noticed that the CSS files had changed name formats after the upgrade. Instead of
application-palette-f1354325.css (or something similar) we have
application-palette.f231453.css. Ah-ha, what if the dot is causing something to think it is not a
.css file but a
.f231453.cssfile? I went looking for that. I searched source for
application-palette and found it in the material theme.
application-palette.css in the
src directory and
application-palette-f1354325.css in the
material directory, which (according to mkdocs.yml) is where the templates for the theme live. So something is adding that extra number, in some sort of build process. OK, how does it build? There’s a
package.json so I check it for scripts. Sure enough,
scripts.build contains a call to …
make. OK, look for a
Makefile. Yup, and that contains a call to … webpack. Gah! I’ve been avoiding webpack because I know it is deep. Where does it get its definition? probably that
webpack.config.js file. Look in there, and it lists plugin after plugin, all of which are unfamiliar. Noooooo. But then I spot it! I found that stupid dot in the webpack.config.js … but changing that would mean rebuilding the theme, so I look for another way.
Which is good, because that wasn’t even the problem. Later I noticed that all the CSS files had the wrong content type, not just the ones with the dots. But I learned something, right? Right.
Next I searched for “s3 content type” since whatever makes a website available from s3 is sending these. That proved fruitful. The content type comes from metadata on s3, associated with each uploaded file. I opened the AWS console and looked in S3, found this bucket, found the CSS, looked at its metadata. Sure enough, it has a content-type element set to text/plain. So how does that get there?
Not pictured: at least half an hour of learning enough about the s3 command-line interfaces to be able to list metadata. (For the record, it’s
aws s3api head-object --bucket my.bucket --key path/to/file ) This included some frustration of “why is it saying Forbidden when I clearly have read permissions” which resolved to “oh right, because I’m authenticated in this one terminal but not this other one.” The API is pretty complicated; there are two. The friendly one does not list metadata. But it did let me manually
aws s3 sync some files up, which let me test more things. That program understands that
.css files are
While I’m working on this, I post updates and notes to myself in Slack, in my jessitron-stream channel. David notices and contributes some history and some research. Our files get to s3 using s3cmd. That should be setting the content-type metadata. David remarked “There used to be complaints in the build logs that python-magic was not installed so s3cmd was going to guess the content type,” so he installed
python-magic like 3 weeks ago to end that warning.
He linked to this issue: https://github.com/s3tools/s3cmd/issues/198 (and maybe there was another one?) and suggested adding
--no-mime-type --no-guess-content-type to the s3 arguments. He also removed
python-magic from the build. I tried those arguments. It complained about the first one being invalid (maybe because
python-magic was gone?) so I removed it. The upload happened, but when I visited that version of the site, it asked me if I wanted to download this DMG! (I’m on a Mac. That would be a .exe on Windows.) The content-type of
index.html was set to
octet-stream. Um, no, that’s worse. Deeper in that issue thread I found a suggestion to use
--guess-content-typeand tried that. My commit message (on a branch) was “Wave the wand this direction” because this is spellcasting, not understanding.
Lo, it worked! Everything worked! I rebased those changes to get rid of all the intermediate things we tried, merged them to master and tagged the new version to trigger deploy. Hooray, we are able to update docs.atomist.com again!
A red herring
In Slack, we got a ping from our designer, who was having trouble building the site now. David and I were like, oh no, the upgrade broke something. Setting up the development environment for these docs, with Python and Ruby, is a pain. Ben was getting an error that resolved on Google to “wrong version of Python,” something about an exception in a loop which was a change between Python 3.6 and 3.7. Ben uninstalled and installed Python eight different ways. Both David and I joined a screenshare to help. Now nothing (including
pip, Python’s package manager) can find the Python library
zlib, which is a wrapper of a native
libz library for compresssion (which is installed; he has xcode tools on his mac). This means
pip can’t install packages, because it can’t unzip anything, including
virtualenv, which we use to control the version of Python and of libraries. His machine is a giant circle of middle fingers.
I am not even gonna try to list the things we tried here. It was a mess. Homebrew was involved, and
sudo rm. The worst part is, you know what the problem was? He hadn’t updated mkdocs. He hadn’t pulled the master branch. What he had done was upgrade Python, which didn’t work with the old version of mkdocs but did work with the new! This was a few hours of all of our lives we would like to have back.
Not so fast
But this story is not over! Oh, no. The next day, some people complained on our Slack that the docs site was not loading. They were seeing the unstyled garbage. Clear the browser caches, same problem. David went to our CDN, CloudFront, and told it to not cache these things, and to manually refresh the caches. But NO. Somewhere in the bowels of the internet, bad content-types are cached for these CSS files. The files haven’t changed, so the caches decline to refresh. They do not notice that the content-type has changed. Having been bad for an hour or two, those files are now bad for some unknown amount of time, only to some people. The URLs are cursed.
The only thing to do is to rename them. I write a script that renames all the
.css files and changes all references to them. I kluge that into our build process, after we build the site and before we copy it up to s3. This works.
OK. Incident over, as far as we can tell.
There is no such thing as “root cause” in systems this complex. There are “conditions that allowed this to be a problem.” And crucially, there are many conditions and actions that kept it from being worse. Eliminating the former, trying to “make sure this never happens again” is Safety-I. Amplifying the latter, sharpening our vision into potential problems strengthening our ability to solve them, is Safety-II. In this analysis, I’ll remark briefly on the Safety-I sources of problems and more extensively on Safety-II sources of resilience.
How did this happen?
It seems likely that adding the
python-magic library contributed. That changed the behavior of
s3cmd, except that it didn’t show until new files were created. So
s3cmd’s behavior of not updating the metadata on files that already exist made this problem into a sneaky lurking one, dark debt.
Ironically, that library was added to reduce debt, as a response to this line in the build:
WARNING: Module python-magic is not available. Guessing MIME types based on file extensions.
It turns out guessing MIME types based on file extensions is a great way to do it. It’s great because it’s predictable by humans, a key property of collaborative automation. This beats clever-but-unpredictable magic.
The upgrade of mkdocs and material did trigger the problem, because it met the necessary condition of adding new files. It’s tempting to avoid upgrades, because upgrades often trigger latent problems, just like this. But upgrades also remove problems, like Ben’s upgrade to Python 3.7.
How did it get so bad?
We didn’t know before deploying the live site that this wasn’t going to work. Later, we had to develop a way of deploying to s3 without overwriting the live site in order to diagnose it. I also didn’t notice immediately that it was broken, so the site was garbage for a few hours … long enough for some CDN nodes to grab and keep the evil content-types.
Gosh, there was so much.
- I checked the site at all. That was deliberate.
- Communication and people: I couldn’t figure this out without Danny and David. Our daily standup, Zoom, and Slack were essential collaboration tools. Our notifications from Atomist in the #docs channel of Atomist community Slack helped us see what the others were doing.
This is not a checklist. This is a set of learnings that affect our priorities for future work.
- We want to see the rendering of a version of the site on s3 before we release it. This is something to build in the Atomist SDM I’m making for this site, which will replace the inflexible Travis build.
- I want to check the site after each release. I can make my SDM send me a direct message whenever one is done.
- We could switch from
aws s3 syncfor more predictable behavior, testable on more systems.
- Setting up the right versions of Python and Ruby on a local computer is bad. I want a development process that uses Docker for isolation. Ideally, an SDM that runs locally in Docker. (That’s already been on my list, and now it seems more important.)
- Don’t trust libraries with “magic” in the name.
If this type of analysis is useful to you, follow John Allspaw on Twitter, and me on The Composition.