This article is written for those who are interested in our path, and it will be a bit different because it deals with technology only superficially. We think the team aspects behind our transition to continuous delivery are much more interesting than the technologies we used to accomplish it. While technologies have a shelf life of three or four years, our team characteristics should be useful for much longer.
Please don’t consider this article as a prescription on how to adopt continuous delivery. All teams are different, so perhaps another path would be better, or maybe continuous delivery won’t be useful at all to your team. Our team agrees that continuous delivery was right for us, and I think this is where our path really began.
1. Building the right team culture
When we were asked to rewrite the previous version of Basefarm’s provisioning system, we wanted to do so using an agile methodology.
For us, this meant building a team culture where we trust each other, with a minimum of office politics, where people can ask for and receive help without being thought less of, where we would rather produce working software than have meetings, and where we have all the knowledge needed to build the application inside the team.
As a corollary to this, only those on the team get a say in how the team solves problems. If you are not part of solving the issues that might arise, your opinion will be listened to, but you do not have a right to block the team from trying something new. We think this is an important point to make because humans are often skeptical of change, and granting non-team-members the power to stop experimentation would lead to no progress being made.
Not for everyone
Being an agile team is not for everyone. Over the years, seven people have left our team. Two of these left because they had fundamental issues with what we were doing. These are hardly numbers that are large enough to be statistically significant, but there are people out there who prefer to work in teams that are organized differently. We think it is safe to say that of the people currently on the team, many of them would probably leave a team that was not “agile”. However your team is organized, this is something that needs to be addressed. Our team has an explicit team contract which addresses some of these concerns, and we will hopefully be able to handle any future problems when they appear.
2. Stop wasting time in meetings
The first casualty of the team becoming agile was the “department meeting” because we needed the time taken up by meetings to make changes that could improve our efficiency. In our team, meetings have concrete agendas that are known beforehand, and it is not OK to call meetings with a generic name other than “Standup”, “Grooming/planning” and “Retrospect”, which are our three recurring meeting types. A few people on the team have worked in other teams where there were two-hour meetings every week for keeping the department manager updated. We wanted our team to avoid this style of communication because it steals days of working time every week.
3. Reduce the impact of failure
The next big change we endured was to take a good look at our own team. Developers are often in the habit of pointing to other IT functions and saying we could have done better. An example can be big IT processes where someone at some point made a mistake, and then a process, performed by humans, was put into place to ensure it didn’t happen again. We think this is often the cause of byzantine processes. Every step in the process makes sense individually, but taken as a whole the process becomes big and slow.
When we turned around and looked at ourselves, we found we were stuck in the same trap, but in a different context. The underlying fear was the same as our hosting colleagues had – we figured that production outages were a bad thing, so we had accrued a fair bit of magic over the years to placate the failure gods. Once we allowed ourselves to see this, we also accepted that we would have failures in production. It is inevitable.
By switching the team goal from “no failures” to “as few failures as possible at the greatest speed possible” we have hopefully avoided reintroducing big, complex processes, and we routinely ask new team members how we are doing.
4. Constrain the size of changes
The provisioning system we make is designed in such a way that every change made to the machine room is constrained in scope. This ensures that, if things fail, only the things inside the limited scope fail, not the whole machine room. Sequences of operations are ordered in such a way that steps are performed in the sequence with the least likelihood of failure. When we select systems for our machine rooms, these nonfunctional requirements are reflected in the demand that the systems must allow updates that are limited in scope. On top of that, there is an extensive automated QA regime so we know when existing behavior changes, but more about that later.
This is a mindset change that sounds very fluffy, but was a very powerful one for us. It allowed us to get rid of a lot of very strict routines which had been put in place mostly to force people to think before making a change, and often doing so by forcing them into meetings. Instead, we could think more about mitigation and recovery, and about creating systems that could be automated to perform the tasks consistently every time.
Our experience is that the number of outages has gone down, even with a trivial process and without being able to point fingers at specific people who started installation of a new version. We mostly blame our automated tests for this success.
5. Use automated testing for safe deployment
We decided to invest in a regression test suite before writing a single line of code. One reason for this was that the old system was written in a language that doesn’t lend itself to testing as much as our current JDK/Scala combination does, and not having regression tests was proving to be costly. If our provisioning system handles our infrastructure the wrong way, our customer solutions might stop working. Our provisioning system is business critical to us.
We also saw that manual testing had a quadratic cost over time, growing with the number of features and the complexity of the system. If down payments are not made to automate your manual tests, the cost is going to become prohibitive very fast.
This is the beauty of automated tests. They exchange the quadratic cost of manual testing for a one-off payment and a small maintenance cost. The one-off payment is not much bigger than performing a manual test of the same feature once. Our QA regime is the first technical investment we made that enables continuous delivery. If you want hands-off installation of new versions, you will want to know that each one works as well as or better than the last one.
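To make the cost argument concrete, here is a minimal sketch of the two cost curves. The model and all numbers are our illustrative assumptions, not measurements from the team: one feature is added per release, every existing feature is re-tested by hand on each release, automating a test costs 1.5 times one manual run, and maintenance is a small flat cost per release.

```python
def manual_cost(releases, cost_per_test=1.0):
    """Cumulative cost when release k re-tests all k features by hand."""
    return sum(k * cost_per_test for k in range(1, releases + 1))


def automated_cost(releases, cost_per_test=1.0,
                   automation_factor=1.5, maintenance_per_release=0.5):
    """Cumulative cost when each feature's test is automated once.

    automation_factor and maintenance_per_release are assumed values,
    chosen only to illustrate the shape of the curve.
    """
    one_off = releases * automation_factor * cost_per_test
    maintenance = releases * maintenance_per_release
    return one_off + maintenance


# After 20 releases the manual total has grown quadratically (1 + 2 + ... + 20),
# while the automated total has grown linearly.
manual_total = manual_cost(20)
automated_total = automated_cost(20)
```

Under these assumptions the manual total after 20 releases is 210 test-runs' worth of effort, against 40 for the automated suite, and the gap keeps widening with every release.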
6. Keep coordination simple
The next big milestone for us was to move most of the dependencies for our application into the team. In the 1990s it was thought that operations would host an “application server” which would host many applications, which were then delivered by the developers according to a specification.
Our team adopted the big application servers at the start because we wanted to reduce the number of changes we made simultaneously. Working off the existing application server delivery model already in use by Basefarm was a sensible tradeoff for a time, but application servers have a cost associated with them, and the cost comes in terms of coordination with the external operations team. Nowadays, virtual machines provide many of the benefits that application servers once offered, which removes most of the reason for using them.
The application was rebuilt to embed a small Jetty server inside, causing it to become self-contained. We integrated with a startup script delivered by the operations people, and required an Oracle JVM and a small property file to be present on the host running the application.
Operations manual for efficiency
We also delivered a tiny operations manual that told the 24/7 operations team how to restart the application, where logs were located and who to call if a restart didn’t work. From this point on, the development team took on the application server hosting in addition to operating the application itself. The overhead related to the nonfunctional aspects of the application has mostly disappeared as a result. This pattern coincides with what we do for many customers who have in-house developed solutions. At some point, a developer will have to wake up if the problem is big enough.
7. Work yourself away from coordination
Our next big overhead turned out to be patching the database. We delivered a set of patch files that were applied by the DBA. This required coordination between the DBA, the operations team, and the development team.
When this coordination went wrong we got outages. We found ourselves in situations where the application worked perfectly in the test installation, but not in production, and the cause was that the DBA had forgotten to create a table, for whatever reason. No application can be expected to work consistently in this kind of situation.
As a result, we taught our application how to automatically apply the database schema changes using a library called Flyway. This removed all the outages caused by people trying to coordinate a process divided amongst three different functions. We have been working off this model for half a decade, and we have had two outages, which is an order of magnitude better than what we experienced when database updates were manual.
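The idea behind letting the application patch its own schema can be sketched in a few lines. This is not Flyway's API and not our production code. It is a simplified illustration of the general technique such tools use: numbered migrations, plus a bookkeeping table recording which ones have already been applied. The table name `schema_history`, the `MIGRATIONS` dictionary and the `server` table are hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

# Hypothetical migrations keyed by version number. A migration tool would
# typically read these from versioned SQL files shipped with the application.
MIGRATIONS = {
    1: "CREATE TABLE server (id INTEGER PRIMARY KEY, hostname TEXT NOT NULL)",
    2: "ALTER TABLE server ADD COLUMN datacenter TEXT",
}


def migrate(conn):
    """Apply every migration not yet recorded, in version order.

    Returns the list of versions applied by this call.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_history (version INTEGER PRIMARY KEY)"
    )
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_history")}
    newly_applied = []
    for version in sorted(MIGRATIONS):
        if version in applied:
            continue  # already applied on an earlier startup
        conn.execute(MIGRATIONS[version])
        conn.execute("INSERT INTO schema_history (version) VALUES (?)", (version,))
        conn.commit()
        newly_applied.append(version)
    return newly_applied


# The application calls migrate() on every startup, before serving requests.
conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # fresh database: both migrations are applied
second_run = migrate(conn)  # restart: nothing left to apply
```

Because the same code path runs in test and in production, a table can no longer be forgotten in one environment: either the migration is part of the release or it is not.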
8. Implement database refactoring for even more efficient deployment
By this time, the application still required one to five minutes of downtime while upgrading it, and the application was seeing enough use that this downtime was becoming a problem. This was by far the biggest problem to overcome for the team. Basefarm is a managed service provider, and the immediate reaction to this downtime was that the upgrades of our application should be moved out of production hours. Our team chose to go another way.
The reason for the downtime was the database. When installing a new version of the application the database got patched, and our trivial implementation of database patching couldn’t guarantee the old code would be happy running off the new database schema. We sat down in the team and applied a set of patterns from the book “Refactoring Databases” (Ambler and Sadalage), and from that point on we were “+1 compatible” on most of our releases, meaning that we had changed our development process so that we didn’t apply schema changes that the previous version of the application would be unhappy with.
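As an illustration of what a backward-compatible schema change can look like, here is a minimal sketch of a column rename done as an additive step with a transition period. It is one possible pattern in the spirit of the book, not a record of the refactorings we actually applied. The table, the column names `hostname` and `fqdn`, the trigger name and the helper functions are all hypothetical, and SQLite stands in for the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE server (id INTEGER PRIMARY KEY, hostname TEXT NOT NULL)")
conn.execute("INSERT INTO server (hostname) VALUES ('web01')")


def old_version_write(conn, name):
    # The previous release only knows about the old column.
    conn.execute("INSERT INTO server (hostname) VALUES (?)", (name,))


def old_version_read(conn):
    return [row[0] for row in conn.execute("SELECT hostname FROM server ORDER BY id")]


def new_version_write(conn, name):
    # During the transition the new release fills both columns,
    # so the previous release still sees complete rows.
    conn.execute("INSERT INTO server (hostname, fqdn) VALUES (?, ?)", (name, name))


def new_version_read(conn):
    return [row[0] for row in conn.execute("SELECT fqdn FROM server ORDER BY id")]


# Additive migration: introduce the new column, backfill it, and keep it in
# sync for rows written by the old code. The old column is NOT dropped yet.
conn.execute("ALTER TABLE server ADD COLUMN fqdn TEXT")
conn.execute("UPDATE server SET fqdn = hostname")
conn.execute("""
    CREATE TRIGGER server_sync_fqdn AFTER INSERT ON server
    WHEN NEW.fqdn IS NULL
    BEGIN
        UPDATE server SET fqdn = NEW.hostname WHERE id = NEW.id;
    END
""")

# Both releases can run against the migrated schema at the same time.
old_version_write(conn, "web02")
new_version_write(conn, "db01")
old_rows = old_version_read(conn)
new_rows = new_version_read(conn)
```

The old column and the trigger would only be removed in a later release, once no running version reads or writes `hostname` any more.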
9. Increase communication with the users
Technically, this change was relatively easy. Talking to worried users has never been easy, and we cannot stress enough the importance of spending time with the users of the system so that their level of discomfort stays at a manageable level. We decided early on that we would try to make the development team available to the 550+ Basefarm employees to address issues when they arise because not feeling you can get help is a surefire way of escalating user discomfort.
Providing person-to-person support to 550+ people has worked out, and to this day, the development team is available in a chat channel where anyone can ask questions. We have had to add a handful of specialists to the channel to be able to handle questions that are focused around best practices for using our application, but overall this kind of communication has been instrumental in creating the kind of safety net that has allowed our users to feel comfortable that our regular changes will not interfere with their ability to do their work.
Chat channel for fast communication
The chat channel has also been a benefit to the team because it has reduced the random interruptions we saw before, where one of our users would stop by whoever was perceived to be the highest-ranking developer to get help. Interruptions like these break flow and destroy productivity. Having a chat channel lowers the expectation of an immediate response that people standing in a doorway have, which enables developers to poll the channel when they have a bit of free time instead of having to break off what they were doing. In addition, the workload is spread away from the senior people, who would otherwise be the ones to catch all the questions.
Don’t underestimate fear
At this point we had a pipeline running our tests which was not connected to the installation system. The final part of the puzzle was to use the existing pipeline for the installation task, too. This was the final step that would allow us a fully automated installation of new versions.
Even with our communication efforts there was a fair bit of fear at this point because our users felt uncertain. They wanted mails each time we upgraded the application in case it broke during the upgrade.
Sending mails is an automatable task, so we complied with this. We were just happy that we didn’t have to upgrade outside of office hours. It turns out that we were asked to remove the automated mails within three months because the upgrades turned out not to fail, which caused the mails to become an irritant. Sometimes, it takes time to build trust in new systems.
The automation of the installation process has not caused any outages to date. We are, again, seeing a much better performance from the automated process than from the manually performed one.
Our starting point
Our team was lucky in that we started out as an empowered, trusted team, mostly left to figure out how we wanted to work on our own. “Succeeding with Agile” by Mike Cohn contains a good introduction to what a team needs in order to become good at change.
This article started out by trying to focus away from “what” we did, and instead trying to describe “why” we did it. We believe the mistake of focusing on the “what” instead of the “why” is very common, and our team has had to fight it constantly as it evolved. It turns out lean adopters struggled with the same problem, and “Toyota Kata” by Mike Rother contains several war stories from the manufacturing world that have helped us stop when we were focusing too much on technology and rituals to the detriment of becoming better at change.
Today, change has become a constant force for our team, and continuous delivery is one aspect that has led us to be able to produce more value in a better way. Tomorrow, we will improve in some other way. Your own path towards becoming better at change will probably be very different, with different forces pulling your team in different directions than ours, but we hope this article can be of some help.