Your organization has made the decision to move from a self-hosted or colocated data center to a cloud infrastructure. How do you make the move with the future in mind? At Credit Karma, we run our microservices in data centers and multiple cloud environments. Following a cloud-agnostic approach gives us the flexibility to run in multiple clouds and evaluate different options. Here, we’ll cover an approach for migrating to Google Cloud using cloud-agnostic tools.
Finding the right fit
Cloud adoption is a common theme across enterprises and startups today. Organizations face different challenges and take different paths to migrate to the cloud. Your organization may already have a substantial infrastructure in company-owned or shared data centers. There may be regulatory and compliance reasons for maintaining a secured data center presence that are difficult to satisfy in a cloud environment. So how do you find the right fit?
One option is to use cloud-provider-specific technologies like Google App Engine and deployment tooling like Google Deployment Manager. This is 100% cloud adoption and is a form of vendor lock-in. This approach is best suited for new startups that want to scale quickly. Tools like Spinnaker can be used if you are starting with a clean slate.
In cases where you already have an established deployment toolchain like Salt or Ansible, you can leverage the same tooling to deploy applications in cloud environments and remain cloud agnostic. This allows for a planned cloud adoption with minimal changes to the existing deployment infrastructure.
If you’ve already invested in Docker-based deployments, this is a much easier exercise. You can use Kubernetes or another container ecosystem for moving to the cloud.
Tools of the trade
The cloud ecosystem is a buffet of tools, each with their own strengths. Here are some common tools and insights about each that we’ve learned along the way.
In the case of a self-hosted data center, the IT/Operations team provides hardware and storage to development teams. We use Terraform to abstract out the hardware provisioning process. It has providers for Google Cloud and AWS.
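As a minimal sketch of what that abstraction can look like, a Terraform configuration for a single Google Compute Engine instance might resemble the following. The project, zone, machine, and image names here are illustrative placeholders, not our actual configuration:

```hcl
# Hypothetical values throughout; substitute your own project and image.
provider "google" {
  project = "example-infra-project"
  region  = "us-central1"
}

resource "google_compute_instance" "app" {
  name         = "app-server-1"
  machine_type = "n1-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      # Boot from an organization base image rather than a stock OS image.
      image = "org-base-image"
    }
  }

  network_interface {
    network = "default"
  }
}
```

Because the same configuration structure works with Terraform's AWS provider, the provisioning workflow itself stays cloud agnostic.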
Application deployment and configuration management tools of your choice can work in the cloud. You can run a master or masterless infrastructure as required by the tool. Cloud environments require more effort to maintain a framework with a centralized master, so be aware that the master can become a bottleneck.
Cloud environments work with machine images. This gives a nice hook into the deployment infrastructure: bake basic functionality into a base organization image. The image can contain all of the common and security tooling for the organization. The role of Packer templates should be limited to image creation.
Baking images
Application Image creation
In order to provide a self-service Spark infrastructure, we decided to use Jenkins with Terraform and Salt. We bake images using Salt scripts to speed up boot time for end users. A Jenkins instance/cluster provisions a machine in Google Cloud using Packer and provisions the base image using Salt. The provisioning process can include security hardening, common package installation and network-related configurations.
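A simplified Packer template for this kind of bake might look like the sketch below, using Packer's `googlecompute` builder and `salt-masterless` provisioner. The project ID, image names, and state-tree path are hypothetical:

```json
{
  "builders": [{
    "type": "googlecompute",
    "project_id": "example-infra-project",
    "source_image_family": "debian-9",
    "zone": "us-central1-a",
    "image_name": "org-base-{{timestamp}}",
    "image_family": "org-base",
    "ssh_username": "packer"
  }],
  "provisioners": [{
    "type": "salt-masterless",
    "local_state_tree": "./salt/states"
  }]
}
```

The `salt-masterless` provisioner applies the local state tree during the bake, so the resulting image already carries the base role's packages and configuration.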
The provisioning process “highstates” a machine with a basic organization role, which is great because it guarantees compliance and existence of base packages.
Google Cloud supports the concept of image families, which point to the latest image in a family and can therefore be used by all downstream images. This ensures that the latest software and configurations are used downstream. Images are created in a single project under infrastructure control and can be shared across projects.
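In Terraform, an image family can be resolved to its latest member with the `google_compute_image` data source. A sketch with hypothetical family and project names:

```hcl
# "org-base" and "example-infra-images" are illustrative placeholders.
data "google_compute_image" "base" {
  family  = "org-base"
  project = "example-infra-images"  # shared, infrastructure-controlled project
}

# Downstream resources reference the resolved image, so every new machine
# or image bake automatically picks up the latest family member.
output "latest_base_image" {
  value = data.google_compute_image.base.self_link
}
```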
Advanced image baking processes can be used to minimize image creation time.
Application or infrastructure deployment process
Infrastructure software like Hadoop and Apache Spark, as well as service applications, can use the same framework for installation. The following example illustrates how we automate the creation of transient Spark clusters at Credit Karma.
Jenkins and some orchestration machines use Terraform to create required machines using a base image. They assign roles to the required machines (as “grains” in Salt). All machines are “highstated” to the required role. You can pull the latest provisioning code into the machine image or bake it in the base department image.
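The role-to-state mapping can be expressed in a Salt top file that targets the `role` grain assigned at provisioning time. A sketch with hypothetical role and state names:

```yaml
# salt/states/top.sls -- role and state names are illustrative.
base:
  'role:org-base':
    - match: grain
    - security.hardening
    - common.packages
  'role:spark-master':
    - match: grain
    - spark.master
  'role:spark-worker':
    - match: grain
    - spark.worker
```

Highstating a machine then applies exactly the states that match its `role` grain, so the same command works for every machine in the cluster.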
Application Provisioning
This process can work for any application deployment using dynamic information rendered in Salt grains and pillars during the machine provisioning process by Terraform.
The Terraform state is version controlled for coordination between different Jenkins jobs.
Challenges
Cloud applications have their own set of challenges and nuances. A machine in the cloud can go away at any time or be taken out for maintenance. Application developers need to design applications to be fault tolerant and highly available. Cloud environments provide shutdown hooks to tap into shutdown signals, which can enable applications to initiate migration.
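On Google Compute Engine, one such hook is the `shutdown-script` metadata key, which runs before the VM is terminated. A minimal sketch in Terraform, with illustrative instance, image, and service names:

```hcl
resource "google_compute_instance" "worker" {
  name         = "spark-worker-1"       # hypothetical name
  machine_type = "n1-highmem-8"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "org-base-image"          # illustrative image name
    }
  }

  network_interface {
    network = "default"
  }

  metadata = {
    # Runs on shutdown; drain work before the VM disappears.
    "shutdown-script" = <<-EOT
      #!/bin/bash
      systemctl stop spark-worker
    EOT
  }
}
```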
Google Cloud encrypts storage volumes to keep your data safe. One thing to note is that applications will need some sort of shared credentials to access data in other services or databases. These credentials can be provisioned on systems using a secret management system like Vault, or placed dynamically on individual servers during provisioning via HSMs or access-restricted files. We suggest you take care to exclude these files from the image creation process.
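As one hedged example of provisioning a credential at deploy time rather than bake time, a Salt state can render a secret from pillar data (which can itself be backed by Vault through Salt's vault ext_pillar). The application name and paths below are hypothetical:

```yaml
# Writes a DB password onto the host during highstate; because this runs
# at machine provisioning time, the file never enters the Packer bake.
myapp_db_credentials:
  file.managed:
    - name: /etc/myapp/db_password
    - contents_pillar: myapp:db_password
    - user: myapp
    - group: myapp
    - mode: '0400'
```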
Many Google Cloud services are still in early beta stages. The APIs keep changing to allow for better enterprise adoption and ease of use. For example, Netflix had to create a lot of tooling to use the AWS cloud. Google can benefit from that experience to provide similar services in the cloud by default.
Choosing what works for you
There are many ways to migrate your applications to the cloud. You can adopt cloud-native technologies or solutions, use Docker, or take a hybrid approach that uses the cloud as a VM provider. How you adopt the cloud is ultimately determined by the stage of your company and the risks involved. Our best recommendation is to move the dev and QA environments first, then apply that experience to the production setup. Having a consistent deployment strategy allows you to maintain a hybrid environment. At Credit Karma, we started by running non-critical services and adopted some core cloud products like BigQuery, which helped us reduce our operational overhead. It’s been great to see adoption of Google Cloud Storage, Compute Engine and BigQuery at Credit Karma, and our Data Platform team is excited to keep learning as we migrate more services and analytics to the cloud. If you want to help out, check out roles in our Data Platform team at creditkarma.com/careers.
Thank you to Anurag Phadke, Zack Loebel-Begelman, and Eva Vander Giessen.