Clearly, we can get better at managing the crunch time around deployment.
The last time we deployed, there were a few tense moments, but our rigorous test-everything-from-a-production-install process helped us do it smoothly. This time, not so much. Here are a few reasons why, and here's what I can do to make things better.
- I had set $access_check to FALSE because I wasn't sure if we could get in to update the system. The IT architect logged in as a super administrator and ran update.php. However, since $access_check was FALSE, it apparently didn't check at all whether the user was logged in as a super administrator, so we ran into bugs that assumed account 1 was running the update (related to node saving). Symptom: the updates ran, but some of them weren't fully applied. We only detected this the day after (the perils of doing an evening deployment when you're tired). I thought that reloading the database backup and reapplying the changes (properly, this time!) would've been cleaner, but my other team members voted for fixing things manually. So that was stressful.
The problem occurred a couple of times during QA testing, which is how I realized that update.php was misbehaving. I wrote about it, but I didn't review the other developers' code for potential issues, and I didn't emphasize the potential pitfalls during our meeting.
To do this better next time, we can set up a more formal and regular code review process, and I can communicate more explicitly. We could try to always run update.php with $access_check = TRUE, but it may need to be FALSE in some cases in the future, and it's better to be aware of the potential problems.
- After we deployed, we found out that a subdomain we were using hadn't been set up in DNS. We were no longer in control of the domain record because we had turned that over to the nonprofit partner who was supposed to be managing the site.
To do this better next time, we should make sure our QA and production setups are as close as possible (we had been using wildcards for QA), and we should test new domains.
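A pre-deployment check along these lines could catch a missing DNS record before launch. This is just a minimal sketch (the function name and hostnames are mine, not something we actually run):

```python
import socket

def check_domains(hostnames):
    """Return the hostnames that don't resolve in DNS.

    Run this against the real (non-wildcard) production domains
    before deploying, so a missing subdomain record shows up early.
    """
    missing = []
    for host in hostnames:
        try:
            socket.gethostbyname(host)
        except socket.gaierror:
            missing.append(host)
    return missing

# Hypothetical usage with placeholder domains:
# check_domains(["www.example.org", "members.example.org"])
```

An empty result means everything resolves; anything returned is a record to chase down with whoever controls the domain.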
- I had been in crunch mode for 10 days (since the weekend before). It's difficult to maintain sprint-like energy and focus for that long, and I was feeling physically fatigued after I stayed up relatively late to finish the deployment.
To do this better next time, I need to insist on taking breaks, even if that doesn't seem like being much of a team player. Also, I need to reset my sleep cycle as quickly as possible.
- Give people feedback and send them patches instead of just fixing the code for them. I don't get fazed when code changes underneath me. I've worked with too much open source, I guess.
I just try to figure out what changed, why, and how to work with the new structure. Other people can feel alienated from their code, though, and they lose that feeling of ownership. Better to hand things over to other people, perhaps with a few tips, even if it means it won't be finished as quickly.
- Communicate changes more often and more explicitly. I liked having a Sametime group chat running. I don't like sending lots of e-mail, and having the chat made it easier for me to keep others in the loop.
- Make sure tests are up to date, and run them regularly. There were a few bugs I missed because I hadn't run the test suite, and I hadn't run it because it takes a lot of time on my system. I should make the time to do that (using it as break time if necessary), and I can also set up a testing environment so that other people can run the tests easily. Speaking of that - I spent nearly a day tracking down failures due to other people's changes because they didn't verify their work against our test suite. I need to figure out how to build more common ownership of our test suite, and how to get them to run the tests themselves. The SimpleTest web interface is okay, but it's still not as convenient as Drush. Maybe a line item in our administration interface... Hmm... Next time, I could also set up regular tests that e-mail us the results.
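The regular e-mailed test runs could be as simple as a cron job that runs the suite and mails the output to the team. Here's a rough sketch; the test command and addresses are placeholders, and actually sending the message is left commented out:

```python
import subprocess
from email.message import EmailMessage

def run_tests_and_report(command, recipients):
    """Run the test suite and build an e-mail summarizing the results."""
    result = subprocess.run(command, capture_output=True, text=True)
    status = "PASS" if result.returncode == 0 else "FAIL"
    msg = EmailMessage()
    msg["Subject"] = "Nightly test run: " + status
    msg["To"] = ", ".join(recipients)
    msg.set_content(result.stdout + result.stderr)
    return msg

# Hypothetical invocation; the drush command and address are placeholders:
# report = run_tests_and_report(["drush", "test-run", "--all"], ["team@example.com"])
# smtplib.SMTP("localhost").send_message(report)
```

Putting FAIL right in the subject line means nobody has to open the message to know whether something broke overnight.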
- Build little tools to help. Instead of analyzing the source code by hand in order to come up with the number of lines we changed (needed this for IBM Legal), I wrote a tool that analyzed our source code based on the Subversion history. It was pretty cool. It took me about 30 minutes to write, and we ended up running it twice. I expect it would've taken us three hours to do that all by hand. Yay! =)
- Make sure developers know about the gotchas we encountered.
- Set up an automated test environment and make sure other developers take ownership of the results.
- Keep a group chat running. I participate in that quite a lot. E-mail, not so much.
- Take more breaks.