CASE STUDY

Infrastructure
as code

August 01, 2023

A startup grew faster than its IT infrastructure. We helped them bring their infrastructure up to speed and positioned them for greater success.

One of our clients is a successful SaaS startup in the logistics-technology space. Like many startups, they have a small engineering headcount. Unlike many startups, their modest user base of small-to-medium enterprise users grew suddenly and rapidly.

Also like many startups, limited resources and rapid growth forced hard choices. To keep their current users satisfied and attract prospects, they prioritized new features at the expense of their infrastructure. Their infrastructure was functional, but it was managed manually.

The challenges of manual infrastructure management

Our client’s processes were reasonably well documented and included some automation scripts, but they were mostly manual. Servers, databases, caches, DNS records, and so on were provisioned and managed through the AWS Management Console and remote shell access to servers.

When our client was small and agile, manual infrastructure management was the right decision and worked great. Then the disadvantages started piling up.

Unrepeatable

Our client deployed separate production and non-production environments. Each of them was stood up separately and manually. The local development environments used by their product engineers didn’t resemble their deployment environments. Subtle, difficult-to-find differences led to variance in product behaviour and performance.

Data loss and downtime

They couldn’t make any strong guarantees about disaster recovery. When a resource went down, an engineer would attempt to remotely access the host to triage the issue. If the engineer couldn’t recover a resource, or deleted it in error, they had to recreate it manually.

There was no redundancy because they hosted their SaaS in a single geographical region. This also lengthened the time to recover from data loss or regional network outages.

Slow scaling

Like any SaaS, a sudden traffic spike puts pressure on the database and application servers. Reacting to it was complicated. First, an engineer needed to notice it quickly enough. If that happened, the engineer responded by manually provisioning extra resources. The final step was to keep an eye on the spike until it ended, then decommission the unneeded resources.

Opaque operation

Our client had no visibility into the state of the infrastructure, how its parts interacted, or how it was managed. After making a change, it was difficult, sometimes impossible, to know the impact. This increased planning risk because they couldn’t confidently observe the consequences. And it prevented simple checks that minimize human error. How could they know whether a database was provisioned correctly, or that an extra zero wasn’t fat-fingered somewhere, allocating 1000 GB instead of 100?

Reinvented wheels don’t roll smoothly

Our client created ad-hoc scripts whose only purpose was to reduce the pain of manual provisioning and configuration. Major parts, like databases, were self-hosted. These reinvented wheels were costly to maintain and prevented our client from adopting more powerful practices and tools. Examples:

Our client couldn’t take advantage of continuous integration and delivery (CI/CD).
Hiring was complicated because there is less interest in maintaining legacy code than in leveraging industry-standard technologies to develop new features.
Training was longer and more costly, forcing experienced engineers to support onboarding instead of driving feature development.
Low bus factor: as the saying goes, it wouldn’t take many people getting hit by a bus to put the infrastructure at risk. In other words, only a few key people held crucial knowledge about these reinvented wheels.

Our solution started with infrastructure as code

After concluding our discovery and analysis, Avante IO recommended several solutions. The first was the paradigm known as Infrastructure as Code (IAC).

Instead of repeatedly filling in forms in a web browser by hand and logging into remote servers, IAC lets a development team declare the desired IT infrastructure in files. We chose Terraform, a proven open-source IAC tool.

Terraform by itself already brings advantages:

Terraform processes these files to automate the provisioning and management of infrastructure.
It is platform agnostic, supports the most widely used cloud providers, and is easy to extend.
Terraform files are human-readable, making them self-documenting.
These files integrate naturally into a development team’s existing workflows for reuse, peer review, version control, and so on.
It is supported by a large community of users across the private and public sectors.
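As an illustration (not our client’s actual configuration — the names, sizes, and engine here are hypothetical), a database declared in Terraform looks something like this:

```hcl
# Hypothetical example: an RDS PostgreSQL instance declared as code.
# The allocated size is plainly visible and peer-reviewable — no chance
# of an unnoticed extra zero turning 100 GB into 1000 GB.
resource "aws_db_instance" "app_db" {
  identifier        = "app-db"
  engine            = "postgres"
  instance_class    = "db.t3.medium"
  allocated_storage = 100 # GB

  username = "app"
  password = var.db_password # supplied securely, never committed

  backup_retention_period = 7 # daily backups, kept for one week
  multi_az                = true
}
```

Running `terraform plan` previews exactly what would change before `terraform apply` provisions it, which is what makes the files reviewable like any other code.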

From Terraform, we could then implement the rest of our solutions:

Containerization
An audit trail
Multi-regional deployments
Automatic scaling
Step 1

You can’t describe what you can’t see

With a plan in hand, we started with what we didn’t know. Avante IO consultants teamed up with our client’s infrastructure group to map out the topology of their existing infrastructure. This let us write the Terraform source code to precisely declare the infrastructure’s resources, their relationships, and the outcomes that we expect when the infrastructure is properly provisioned.

Step 2

Containers and orchestration

In parallel with the IAC effort, the team began containerizing the SaaS with Docker, another proven, well-supported, open source technology.

A container is a portable, independent environment that an application runs in. A single Docker container can be loaded and run on any platform that hosts Docker, and that includes pretty much all cloud providers.

We analyzed the various servers that made up the existing SaaS to identify the essential software that made their product run. We declared this software (and nothing more) in Dockerfiles.
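A minimal sketch of what such a Dockerfile can look like (the base image, dependencies, and server command here are hypothetical, not our client’s actual stack):

```dockerfile
# Hypothetical example: a slim image containing only what the service needs.
FROM python:3.11-slim

WORKDIR /app

# Install dependencies first so Docker can cache this layer between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the application code itself.
COPY . .

# Run the web server; the port is wired up by the orchestrator.
CMD ["gunicorn", "--bind", "0.0.0.0:8000", "app:app"]
```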

To coordinate the containers, we used an orchestration tool, Kubernetes, another proven, open-source technology. Kubernetes automates the deployment, scaling, and management of containers.
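For example (a sketch with hypothetical names and sizes), a Kubernetes Deployment declares how many replicas of a container should run, and Kubernetes keeps reality matching that declaration:

```yaml
# Hypothetical example: run three replicas of the containerized service.
# If a container crashes, Kubernetes replaces it automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: app-server
  template:
    metadata:
      labels:
        app: app-server
    spec:
      containers:
        - name: app-server
          image: registry.example.com/app-server:1.0.0 # hypothetical registry
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: 250m
              memory: 256Mi
```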

For deeper transparency, we again turned to an industry-proven solution, AWS CloudTrail. With a few lines in Terraform, we were able to set up a log of activity among the servers in the infrastructure.
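A sketch of what those few lines can look like (bucket and trail names are hypothetical; a real setup also needs an S3 bucket policy granting CloudTrail write access):

```hcl
# Hypothetical example: an audit trail of AWS API activity,
# delivered to an S3 bucket for review by leadership or auditors.
resource "aws_s3_bucket" "trail_logs" {
  bucket = "example-infrastructure-trail-logs"
}

resource "aws_cloudtrail" "main" {
  name                          = "infrastructure-trail"
  s3_bucket_name                = aws_s3_bucket.trail_logs.id
  is_multi_region_trail         = true
  include_global_service_events = true
}
```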

Step 3

Zero-downtime migration and reality

We took a zero-downtime approach to migrate infrastructure to IAC:

Migrate non-production environments first, from the lowest-level to the highest-level environments.
For each migration, document the lessons learned and apply them to subsequent migrations.
Going live: We ran the new infrastructure next to the existing production infrastructure. When we were satisfied with the new infrastructure, we pointed the DNS to the new replicas. Rolling back was easy too: just point DNS back to the old infrastructure.
Once new infrastructure was validated, we decommissioned the old production infrastructure.
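The DNS cutover in the steps above can also be done gradually. As an illustrative sketch (zone, record, and load-balancer names are hypothetical), Route 53 weighted routing lets a small share of traffic reach the new stack first:

```hcl
# Hypothetical example: send 10% of traffic to the new infrastructure.
# Raise the weight as confidence grows; set it to 0 to roll back instantly.
resource "aws_route53_record" "app_new" {
  zone_id        = var.zone_id
  name           = "app.example.com"
  type           = "A"
  set_identifier = "new-stack"

  weighted_routing_policy {
    weight = 10
  }

  alias {
    name                   = aws_lb.new_stack.dns_name
    zone_id                = aws_lb.new_stack.zone_id
    evaluate_target_health = true
  }
}
```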

One notable challenge, both in containerizing the product and in migrating from manually managed infrastructure to IAC, was that there were gaps in the team’s understanding of how the existing production systems worked at an architectural level.

We discovered that some parts of our client’s SaaS didn’t lend themselves completely to IAC. There were several manual “hacks” that engineers routinely performed, like freeing disk caches, manually killing zombie processes, and so on. This isn’t a fault of our client; it’s just a reality of real-world software engineering.

In some cases, these discoveries helped us, giving valuable insight into undocumented parts of the system. That sometimes gave us the opportunity to switch from self-managed services to less costly alternatives. In other cases, the best decision was to accept that the cost of overhauling a service would outweigh the benefits of automation.

Outcomes

A fast return on investment

The project itself was planned and implemented in only four months. The results were impressive and immediate.

Deployment time
Before: 180 minutes
After: 20 minutes

Production incidents from architectural interruption
Before: 4.6 incidents/month
After: 0.2 incidents/month

Uptime
Downtime per incident: 40 minutes before, 5 minutes after
Overall uptime: 80% before, 97% after

Not only that, but all-cause production incidents were reduced by 60%. Our client saw these quantitative improvements even though they doubled their release rate from once a month to every two weeks.

The qualitative improvements are even more exciting.

Repeatable anywhere

Containers give our client a distributable, scalable, and repeatable copy of their product that can run on the cloud, on-premises, or even on the laptops of their software engineering and QA team members with no manual setup.

And we can add redundancy easily, with multi-regional deployments that let our client recover quickly from data loss and network outages.

Scaling is automatic

Containerization and orchestration solved the scaling problem. Scaling is now responsive, cost-effective, and reliable. When Kubernetes detects increasing load, it provisions more containers. When the load returns to normal, Kubernetes responds accordingly, saving cloud fees.
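Concretely, this behaviour comes from a Kubernetes autoscaler. A minimal sketch (the names and thresholds here are hypothetical):

```yaml
# Hypothetical example: scale between 2 and 20 replicas,
# targeting 70% average CPU utilization across the pods.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```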

Faster updates, more reliability

Productivity has improved because engineers aren’t slowed down by their reinvented wheels. And they have more confidence in their infrastructure because it’s simpler to examine, reconfigure, and verify. Changing the infrastructure’s topology and configuration are subject to a code review, like any other code in the SaaS.

Our client can continue to focus on product development without worrying about infrastructure maintenance. Changing anything in the SaaS, from upgrading a Python package to adding an entirely new feature, is faster. With Docker, a feature engineering team develops the change on their laptops. Then the QA team builds and tests the same Docker container on a variety of platforms automatically through CI/CD. Both teams can develop, build, and test the change without waiting for architects and DevOps engineers to provision new environments.

Transparency

There are two ways to get radical transparency into the system.

First, anyone can refer to the configuration files to know precisely how the infrastructure is organized, configured, and managed.

Second, with CloudTrail, we set up an easily configurable log of the infrastructure’s operation. Leadership or an external auditor can view a record of anything that changes in the infrastructure.

Hiring and training costs less

Hiring and training engineers is simpler now that our client uses industry-recognized tools and practices. Candidates are more likely to already know a lot about their SaaS infrastructure before they’re even hired. For example, our client now has access to a larger talent pool of engineers with Terraform experience and AWS certifications.

Avante IO leaves the teams it works with better off than it found them. As we worked with our client’s staff, we trained them to use these new technologies and practices. On-boarding new hires is no longer a challenge because knowledge of the infrastructure isn’t limited to the minds of a few engineers.

New opportunities

Migrating to IAC is an ongoing success. Our client is enormously satisfied with the improvements to the infrastructure’s reliability, automation, and transparency, not to mention the business’s ability to hire and train more effectively.

The future of their infrastructure is bright:

Compliance

Our client can accommodate their own clients who want a self-hosted version of the SaaS, thanks to the new flexibility for spinning up new environments.

Portability

Migrating to a cloud provider besides AWS is now possible.

Enhancement

Leverage Terraform further with Terragrunt.

Harden

Security auditing.

Experimentation

Deeper, more insightful feature testing with A/B deployments.

Flexibility

More flexible, less risky deployment: our client can partially roll out riskier updates to a subset of containers. And they can roll back if something looks off.

Agility

Weekly and even nightly releases.

Thank you for reading. Contact us below if you have any questions.


Avante IO
© Avante IO Inc.

🇨🇦 Built in Toronto