Cloud — Moving from 1 to N
Our 3+ year Cloud journey: Why “just” moving to the Cloud is not enough.
Moving to the Cloud is like taking your company from 0 to 1. While 0 to 1 is hard, going from 1 to 10 or 10 to 100 is a different challenge. It is more than a job; it is a journey.
For the Cloud journey, we’ve found four factors that are critical to advancing beyond level 1: availability, reliability, security and frugality. It is hard to skip any one of them. These matter just as much in an on-premise or self-hosted environment, but in the Cloud the resources are not under your control, so proactively identifying and addressing each of these areas throughout your journey helps you avoid bigger issues.
At Zenefits, we were born in the Cloud. We started our journey using AWS on day 1.
Zenefits Cloud Journey
In early 2018, I was picked to lead the Infrastructure portfolio at Zenefits. We inherited a well-built infrastructure setup, and the breadth it covered amazed me. You name it, and it was built or was in the process of being built: a beautiful CI/CD setup, a very agile release process (twice a day, ten times a week) and some great Dev Productivity tools (Spoofs). A marvel!
Of course, there were also gaps that required immediate attention. Velocity is sometimes a deterrent to Quality, but our charter was to balance both.
On my first day, we began re-imagining the infrastructure team and announced its restructuring. The next morning, I got a ping from both the CTO and our engineering SVP: “The site is down!” What a start.
Moving to the Cloud, from 0 to 1, will be different for every company: for some it will be tougher, for others more complex. But to progress beyond that level, there are a number of consistent challenges and opportunities for which all companies should be prepared.
Key Challenges
Some known and significant challenges we confronted included:
- Uptime and availability cannot be compromised.
- Our commitment to continuous feature innovation must continue, so infrastructure updates cannot slow down product releases, which typically happen five days a week.
- Faster time to market means attention to developer productivity, and the right tools to support them.
- Proactively working to find the root cause for known and emerging product issues.
- Balancing resource utilization and optimizing the money drain of cloud expenses.
- The biggest of all: learning the entire setup, hiring great talent and ensuring smooth knowledge transfer.
As a SaaS company supporting small businesses, our mission is to “Level the playing field for the other 99.7%.” Therefore, it is crucial that we deliver a reliable, available, secure, scalable, robust and cost-effective infrastructure.
Moving Onward: The Journey
Team
Nothing great is ever accomplished on your own
The team is always a pivotal part of the story. Whether you are taking over a team or building it from scratch, the two most essential tenets in the early phases are to build trust and to lead by example from the front. Do note: trust is not only between manager and team, it is also critical among all members and across all stakeholders.
In our case, we had the unique challenge of rebuilding the team from the ground up. Given our velocity and the pace of hiring, building trust helped us move from a forming team to a performing team in quick succession. As a leader, be honest, open and reliable, and show integrity.
Knowledge Transition
All knowledge is either tacit or rooted in tacit knowledge.
When you are balancing high velocity and quality, documentation tends to get missed. Documenting everything is a challenging problem, but the most important part is capturing the tacit knowledge.
This was true for us. When we started rebuilding the team, we also had to plan a transition phase and knowledge transfer. We tackled this by inviting members from other groups to join shadow sessions and by recording those sessions. Our advice: drive the sessions with use-case-based situations, and identify and record the actions taken. These recordings helped us manage the setups in their existing state and onboard the new team efficiently.
Codifying
Before software can be reusable, it first has to be usable.
Infrastructure as Code (IaC) goes hand-in-hand with Cloud development. When you are transforming things for the better, make sure your infrastructure is in code. An additional recommendation would be to stay as cloud-provider-agnostic as possible.
Non-standard code and setup were challenges for us as well. Given the breadth of things at Zenefits, it added to our pain to have a few services set up via CloudFormation, some using Terraform, and secrets managed in a private repository. That is not bad in itself, but non-uniform code makes SRE/DevOps work hard. The bigger problem for us was that most of the production setup was hand-configured. Executing the existing code could cause production outages and leave servers out of sync with each other on configuration. One of the very first things we did was move to Terraform + Ansible.
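To make the “servers out of sync” problem concrete, here is a minimal sketch of configuration drift detection: compare what the code declares with what is observed on each server. This is an illustration only, not how Terraform or Ansible work internally, and the function name, keys and sample values are all hypothetical.

```python
MISSING = "<missing>"  # sentinel for a key present on only one side


def find_drift(declared: dict, observed: dict) -> dict:
    """Return {key: (declared_value, observed_value)} for every mismatch."""
    drift = {}
    for key in sorted(declared.keys() | observed.keys()):
        want = declared.get(key, MISSING)
        have = observed.get(key, MISSING)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical declared state (what the code says) and observed state
# (what a hand-configured server actually runs).
declared = {"nginx_workers": 8, "open_files_limit": 65535}
fleet = {
    "app-1": {"nginx_workers": 8, "open_files_limit": 65535},
    "app-2": {"nginx_workers": 4, "open_files_limit": 65535, "debug": True},
}

for host, observed in fleet.items():
    drift = find_drift(declared, observed)
    print(host, "in sync" if not drift else f"drifted: {drift}")
```

When every change flows through code, a report like this should stay empty; any entry means someone changed a server by hand or the code was never applied.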
Tech Debt
No matter the stage in your journey, you will have some tech debt. The focus here should NOT be to make your code the most beautiful, textbook example of every design pattern (the same holds for any other kind of programming). Focus, instead, on keeping up with the latest tech solutions to achieve the metrics you are targeting.
In the case of Zenefits, the earlier version of the Cloud setup focused on using VMs (EC2), with secrets stored in a private repository. Due to velocity during the early stage, we had multiple SaltStack master setups, and any standard change had to be pushed to each one. As noted, while we moved our tech stack to Terraform, we also containerized our production setup, moved from the classic ELB to AWS ALBs, and updated our cloud setup to eliminate a set of other tech debts.
Key Metrics
The items mentioned above are essential and matter a lot. They are all continuous improvement “table stakes.” You should also make sure you have target metrics that you can attain. Do timebox these and plan accordingly. Try to keep in sight the long tail and things that give better ROI for the work planned. Done well, these metrics will gradually move into the table stakes category at some point as well.
Availability
High Availability (HA) is one of the critical factors for successful Cloud adoption. There is no free lunch. You don’t get HA just by migrating to or adopting the Cloud, or by setting up your resources across multiple data centers (Availability Zones, or AZs, in AWS).
Identify your SPoF (Single Point of Failure) setups. Identify where you want an active-active and where you want an active-passive setup.
Zenefits, moving from 0 to 1, had the excellent foresight to run production across multiple AZs so that user traffic was load-balanced. Given that our SaaS product is an all-in-one HR product for small to mid-sized businesses, we wanted to have > 99.9% uptime. Three 9s means roughly 10 minutes of downtime every week (0.1% of the 10,080 minutes in a week). As a team, we agreed to aim higher than three 9s. Note that these metrics require a core team, and they contribute heavily to customer satisfaction and retention.
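The downtime budget behind an availability target is simple arithmetic: the allowed downtime is the period multiplied by (1 - availability). A small sketch, where the function name is ours and purely illustrative:

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


WEEK = timedelta(days=7)
YEAR = timedelta(days=365)

for target in (99.9, 99.95):
    weekly_min = downtime_budget(target, WEEK).total_seconds() / 60
    yearly_hours = downtime_budget(target, YEAR).total_seconds() / 3600
    print(f"{target}% -> {weekly_min:.1f} min/week, {yearly_hours:.1f} h/year")
```

At 99.9% the weekly budget works out to about 10 minutes, and at 99.95% it is about 5 minutes, which is why each extra fraction of a nine is so much harder to hold.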
We have an Availability of > 99.95% during the business hours and overall > 99.9%.
A significant piece of work toward this was making sure our databases run in all 3 AZs (earlier, they were in 2 AZs). While moving to 3 AZs, we also eliminated a big SPoF in our setup by adopting ProxySQL and Orchestrator.
Scalability
Scalability is crucial for your Cloud journey. Generally, investing in scale before you need to is not a smart business proposition; however, some forethought into the design can save valuable time and resources in the future. Scaling your infrastructure is mostly associated with increasing capacity. Simultaneously, we should look at elasticity — right-sizing your resources for the load.
Zenefits, on AWS, had been using Auto Scaling Groups (ASGs) for the in-house developer productivity applications. In production, however, our application servers ran on plain EC2 instances. It is easy to say that we moved to ASGs there; the hard problem was identifying how many servers were required and running the right number of workers. We used to run the same server capacity irrespective of the time.
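The “how many servers” question can be framed as a small calculation: expected load, plus some headroom, divided by what one node can handle, clamped to a floor and a ceiling. The sketch below is a simplified illustration of that right-sizing idea, not our actual sizing logic; the function name, parameters and numbers are all assumptions.

```python
import math


def nodes_needed(requests_per_sec: float, per_node_capacity: float,
                 headroom: float = 0.3, min_nodes: int = 2,
                 max_nodes: int = 100) -> int:
    """Return the node count for the given load, clamped to [min_nodes, max_nodes]."""
    if per_node_capacity <= 0:
        raise ValueError("per_node_capacity must be positive")
    raw = requests_per_sec * (1 + headroom) / per_node_capacity
    return max(min_nodes, min(max_nodes, math.ceil(raw)))


# Hypothetical load profile (requests per second) at different times of day.
load_by_hour = {"03:00": 40, "09:00": 900, "13:00": 1500, "22:00": 200}

for hour, rps in load_by_hour.items():
    print(hour, nodes_needed(rps, per_node_capacity=100))
```

Running a fixed fleet means paying for the peak-hour count all day; an elastic fleet follows the curve instead.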
While I don’t have to expand on the need and importance of scalability, here are two examples to make the point.
- At the start of 2018, I asked one of our Site Reliability Engineers how long it takes to add a new node to the fleet. The answer was three days. Seeing the disbelief in my eyes, the engineer tried to course-correct and said it could be one day if pushed. During July 2018, while working on a site incident, we tried our best to add a node to our fleet. We spent 6 hours, but we were not successful.
- During Q1 2019, Zenefits had a potential situation where the load increased drastically. At this point, after a quick war room deliberation, we agreed to scale our fleet. And we went from 6 to 75 nodes in under 5 minutes.
We went from taking hours and days to add a node to scaling to 75 nodes in under 5 minutes.
Underneath a successful scalability and elasticity model, you need an adequate monitoring and alerting system. We had to shore up our monitoring setup to get more details on the state of our fleet. Earlier, we were only able to see the CPU metrics. With automation and these changes, we could see the details and right-size our fleet better.
Reliability
Reliability is often confused with availability. Note: reliability will not increase just by making your setup highly available. My recommendation: read and leverage the AWS Well Architected Framework.
At Zenefits, we implemented many of the defined needs, which advanced our reliability.
- High Availability: Necessary to do this, but this is not sufficient alone.
- Recovery: Automatic recovery is necessary. We already had a load-balanced setup, and we moved from classic ELB to ALB to improve on it. While moving our databases to 3 AZs, we eliminated a big SPoF in our setup by adopting ProxySQL and Orchestrator. Setting up automatic recovery is not sufficient, though: you should also test recovery periodically. We do annual disaster recovery testing of infrastructure at Zenefits for multiple failure levels and improve based on the findings.
- Rigorous Monitoring: Monitoring and alerting are essential and should cover both capacity and system metrics. We approached this in multiple ways: threshold-based alerts on data from our logging setup, and more in-depth monitoring of infrastructure metrics. Guessing at capacity and growth is one of the significant pitfalls we have observed. We started tracking our capacity plans for databases more aggressively (monthly) and overall capacity once every six months.
- Keep up with Technology: It is essential to keep updating infrastructure to avoid falling behind. We did a big bang change to our infrastructure by upgrading our OS version (from ubuntu-14.04 to ubuntu-18.04) and moving to Amazon Linux 2. We also implemented a patch management process to keep any newly reported vulnerabilities in check. Recently we moved to MySQL 8.0 as well.
- Communicate Changes: This seems like a small thing, and it often gets the least attention. Keeping track of what is released is no less important than anything else in this article. We simplified this by automating our process to post the released GitHub PRs on public channels, both for application and infrastructure.
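As an illustration of the threshold-based alerting mentioned under Rigorous Monitoring, here is a minimal sketch that fires when the error rate over a sliding window of recent log events crosses a threshold. The class name, window size and threshold are assumptions for the example, not our production values.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` events exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05):
        self.events = deque(maxlen=window)  # oldest events fall off automatically
        self.threshold = threshold

    def record(self, is_error: bool) -> bool:
        """Record one event; return True if the alert should fire."""
        self.events.append(is_error)
        if len(self.events) < self.events.maxlen:
            return False  # not enough data yet to judge the rate
        return sum(self.events) / len(self.events) > self.threshold


alert = ErrorRateAlert(window=10, threshold=0.2)
stream = [False] * 10 + [True] * 3  # healthy traffic, then a burst of errors
fired_at = [i for i, is_error in enumerate(stream) if alert.record(is_error)]
print("alert fired at event indexes:", fired_at)
```

Waiting for a full window before alerting trades a little detection delay for fewer false alarms right after a restart.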
Reduced our recovery time on multiple AZ failures — from 45 minutes to < 5 minutes.
Developer Productivity
Developer productivity has a lot of sub-components. From Zenefits’ perspective, developer productivity tools deserve a set of blogs of their own. We consider this set a small startup in itself within our product fleet.
Dev productivity is a culture, a set of principles. The more you immerse yourself in it, the more agile you are. Zenefits invested heavily in building in-house tools for continuous integration and continuous testing. Over this period, I have interacted with numerous CI/CD tool providers, and the product we have is, if not better, very competitive on features.
At a higher level, all the investments in developer productivity push towards improving your MTTI and MTTR.
- Spoofs: We moved our development and testing environment to the Cloud. We call it a Spoof. A Spoof is a dynamically generated cloud setup for engineers to test their code or triage an issue. In the old version of Spoof, the data was at best 12 hours old. After a set of improvements to both the Cloud infrastructure setup and the database setup, we got this down to < 2 hours. If needed, we can refresh data in < 5 minutes, but we choose to keep it at 2 hours.
- Speed to Market: With the growing need for a better time to market (aka Speed to Market), one significant uplift over this period was enhancing our CI/CD and CT tools. We track test run, build and release times. We reduced the median testing time across all pull requests from 65 minutes to < 20 minutes, and production deployment build times from 1 hour 30 minutes to < 5 minutes at the 95th percentile. The table and charts below show our targets, and we continuously monitor and alert on threshold breaches here.
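Tracking a median and a 95th percentile against targets can be sketched in a few lines. The targets below mirror the figures quoted above (median test time under 20 minutes, 95th percentile build time under 5 minutes); the function names, the nearest-rank percentile method and the sample data are assumptions for illustration.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breaches(test_minutes, build_minutes,
             median_test_target=20, p95_build_target=5):
    """Return a list of human-readable target breaches (empty if all targets are met)."""
    found = []
    median_test = statistics.median(test_minutes)
    if median_test >= median_test_target:
        found.append(f"median test time {median_test} min >= {median_test_target} min")
    p95_build = percentile(build_minutes, 95)
    if p95_build >= p95_build_target:
        found.append(f"p95 build time {p95_build} min >= {p95_build_target} min")
    return found


# Hypothetical samples, in minutes.
print(breaches(test_minutes=[12, 18, 25, 14, 16], build_minutes=[2, 3, 3, 4, 9]))
```

A percentile target is stricter than a median one: a single slow build in a small sample is enough to breach it, which is the point when the tail is what engineers actually feel.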
Frugality
I think frugality drives innovation, just like other constraints do. One of the only ways to get out of a tight box is to invent your way out. — Jeff Bezos
The quote above summarizes our approach. You never know when the expenditure on Cloud could spiral upwards.
When we took over, our Cloud expenditure was high and growing. The bigger problem was identifying precisely where the money was being spent. In the Cloud, one of the powerful things available is tagging. Decide on the primary tags that you want to have and filter resources based on those tags. Set up a cost management tool; we use CloudHealth from VMware. Identify projects that overlap with cost as a theme. Just reducing cost can be a big win, but you always need to balance that against increasing reliability, availability and scalability, because they come with a price.
Our initial target was to analyze our costs, and then we decided to bring them down. As I write this blog, we have brought our expenses down by nearly 35%.
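The value of tagging is easiest to see as a roll-up: group spend by a primary tag and surface whatever is untagged. The sketch below uses made-up resource records; the field names, tag key and amounts are hypothetical and do not represent a real billing export or the CloudHealth API.

```python
from collections import defaultdict


def cost_by_tag(resources, tag_key):
    """Sum monthly cost per value of `tag_key`; resources without the tag go under 'untagged'."""
    totals = defaultdict(float)
    for resource in resources:
        owner = resource.get("tags", {}).get(tag_key, "untagged")
        totals[owner] += resource["monthly_cost"]
    return dict(totals)


# Hypothetical inventory with monthly cost in dollars.
resources = [
    {"id": "db-1", "monthly_cost": 120.0, "tags": {"team": "payroll"}},
    {"id": "app-1", "monthly_cost": 80.0, "tags": {"team": "payroll"}},
    {"id": "app-2", "monthly_cost": 45.5, "tags": {"team": "benefits"}},
    {"id": "vm-old", "monthly_cost": 30.0},  # nobody tagged this one
]

for team, cost in sorted(cost_by_tag(resources, "team").items()):
    print(f"{team}: ${cost:.2f}")
```

The "untagged" bucket is often the most useful line in such a report, since it points at spend that nobody currently owns.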
Savings of > $1 million on the Cloud cost in 3 years.
Learnings
Along the journey comes a lot of learning. As you work through it, run retrospectives on themes such as:
- Accountability: As a leader, build accountability within your team. This starts with you: build trust and trust your team. Take the front seat during failures and step back during successes.
- Decision making: Being decisive is critical. It can make or break things. Emotional Intelligence is important and closely related to decision making. EI is contagious, so use it and respect it. Trust (keep listening to) your instincts, evaluate them closely and then stand by your decisions.
- Radical candor: Calling a spade a spade takes effort. Don’t shy away from being direct with the team or other stakeholders. At times, being open and honest creates challenges: be prepared to re-assess and step back. When that happens, set your emotions aside and be confident enough to wear your failures as badges.
- Data-Oriented: Data is essential. Set your targets based on metrics and timebox them. Timeboxing is vital to make sure you are not chasing the long tail.
Summary
I am sure many of you out there are treading, or will have to walk, the same path. Some are further along the journey, some are catching up, and some are planning to start. I hope some aspects here resonate with you and are helpful.
Using the Cloud and improving it is a journey, just like any product development. While adopting the Cloud is essential, it is equally important to keep adapting to it. Don’t hang up your boots after adopting. Keep innovating and timebox your targets. The Cloud continues to evolve, and it is crucial to keep abreast of what is coming up. It is not the end of the world if you are not using the latest and greatest, but if you are not willing to adapt, evolution stops :).
As they say, it takes a village to raise a child. Thanks to all my extended team, peers and stakeholders who have been part and parcel of this journey. Great, respectful teamwork is critical to success in your Cloud journey. At the start our CTO said, “Remember, this is a marathon and not a sprint,” and it continues to be a memorable marathon.
Please share your opinions and thoughts in the comments and let us know your experience.
Let us learn more together, #InItTogether.