Setting up a startup's staging environment
February 15, 2021
To illustrate this article, you can refer to the GitHub repository Dot-H/simple-staging-env-ex.
At eKee, when the team and the number of users started growing, we had an issue. We wanted to test things out and propose different versions of the product to the users. The problem was that we were a startup, and duplicating the whole environment was way too expensive.
In this article, I'll briefly explain our backend architecture, the solution I went with and the tools used to develop it.
The backend environment
At eKee we had a hybrid architecture. Without the fancy words, it just means that our stateful services (databases, knowledge base software, GitLab...) were running on on-premise servers while the stateless services (engine, API, websites...) were running in a cloud infrastructure using Kubernetes.
Everything was deployed using a GitOps approach: tagging a new version of a service would trigger a new deployment. In order to do so, we were using Ansible for the on-premise services and Helm coupled with Argo CD for the in-cloud services.
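To give a concrete idea of the in-cloud part, here is a minimal sketch of what an Argo CD Application backing one of the services could look like (the repository URL, path and namespaces are placeholders, not our actual configuration):
# application.yaml (sketch -- repoURL, path and namespaces are placeholders)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ekee-api
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eKee-io/deployments.git  # hypothetical GitOps repository
    targetRevision: main
    path: charts/ekee-api
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true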
The specification
Based on this environment, here is what I wanted to develop:
- Expose a staging.[service] subdomain for each service, which would resolve to a specific version of this service;
- The version of a service could be any of its branches on git. If I want to test feature-A on the ekee-api service, I would deploy the latest commit of the branch feature-A;
- Each service in the staging environment can use a different branch. They can be deployed independently;
- The staging environment relies as much as possible on the production one: no extra deployment is needed;
- Deployments must be simple. If developers need to do more than two operations, they won't use it (we all know we are a lazy kind of people).
Following this specification, here is the developer experience I wanted:
- Developer A develops a new interface to share documents. The branch doc-sharing-v2 is created in the repository ekee-dashboard,
- After committing some beautiful engineering, Developer A opens a pull request and labels it staging,
- Once the CI returns no error, Developer A clicks deploy in Argo CD,
- When the PR is closed and/or the staging label is removed, the resources used by the staging environment are released.
The implementation
Adding the workflow around the staging label in GitHub
To trigger special workflows upon specific events, GitHub exposes a wonderful set of tools through what they call GitHub Actions.
Here is the action we need to tag the PR's latest commit when the staging label is present, and untag it when the PR is closed or unlabeled:
name: Release staging version
on:
pull_request:
    types: [opened, synchronize, reopened, labeled, unlabeled, closed]
jobs:
tag:
if: ${{ !contains(github.event.pull_request.labels.*.name, 'no-ci') }}
name: Release staging
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 1
# Retrieve the branch name when there is a `staging` label
- name: Extract branch name
shell: bash
run: echo "branch=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT
id: extract_branch
# Tag the current commit with `staging-BRANCH_NAME` when there is a `staging` label
- name: "Tag staging"
uses: eKee-io/git-tag-action@fix-not-a-git-dir
if:
${{ !contains(fromJson('["unlabeled", "closed"]'), github.event.action)
&& contains(github.event.pull_request.labels.*.name, 'staging') }}
env:
TAG: staging-${{ steps.extract_branch.outputs.branch }}
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
BRANCH: ${{ steps.extract_branch.outputs.branch }}
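The workflow above only covers the tagging half. A sketch of the cleanup step could look like the following, assuming the credentials set up by actions/checkout are allowed to push to the repository:
      # Delete the tag when the PR is closed or when the `staging` label is removed
      - name: "Untag staging"
        if: ${{ github.event.action == 'closed' || (github.event.action == 'unlabeled' && github.event.label.name == 'staging') }}
        run: git push origin ":refs/tags/staging-${{ steps.extract_branch.outputs.branch }}" || true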
Building the staging image to deploy
Nothing fancy here! What we deploy is a Docker image, so the strategy is pretty simple:
- Put a Dockerfile in the root directory of each service,
- Add an action, triggered on tagging events, that builds the Docker image,
- Tag the Docker image with the same tag as the git one,
- Push this image to your registry so it is accessible by your cloud resources.
To illustrate this, I used the GitHub Container Registry in Dot-H/simple-staging-env-ex.
With GitHub workflows, I had an issue I didn't have with GitLab: a workflow inside a branch does not seem to be able to trigger another one without an explicit workflow_run trigger.
So no fancy chaining through the tag event is possible. We need to use the workflow_run trigger in build_image.yaml:
name: Build docker image
on:
workflow_run:
workflows: ["Release staging version"]
types:
- completed
env:
REGISTRY: ghcr.io
IMAGE_NAME: ${{ github.repository }}
jobs:
deploy:
name: Build the docker image
runs-on: ubuntu-latest
permissions:
contents: read
packages: write
steps:
- name: Checkout repository
uses: actions/checkout@v3
      # The workflow runs on the main branch; we need to check out the
      # commit which triggered the `workflow_run` event
with:
ref: ${{ github.event.workflow_run.head_sha }}
- name: Log in to the Container registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }} # Login to ghcr.io
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
      # Nicely format the tags & labels. This may be useful if, as for me,
      # your actor name contains uppercase letters
- name: Extract metadata (tags, labels) for Docker
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
tags: |
type=raw,value=staging-${{ github.event.workflow_run.head_branch }}
- name: Build and push Docker image
uses: docker/build-push-action@v3
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
Deploying the resources in k8s
Once the staging image is built and present on your container registry, it is available to your infrastructure.
In order for the infrastructure to handle the deployment of the staging version, we need to add a staging version of the deployment, the service and, potentially, the ingress exposing it.
To do so without having to rewrite everything, I use Helm and its templates. I won't explain in detail the Helm chart I developed for the example, but rather the small Helm tricks I used. Take a glance at the example chart to see what I mean.
Staging specific values
The Helm staging values are defined in values-staging.yaml and by default extend values.yaml.
The staging value file is composed as follows:
nameOverride: "NAME"
image:
pullPolicy: Always
staging:
tag: staging-BRANCH_NAME
enabled:
echo-server: true
The staging.enabled map controls which services compose the staging deployment, using the image tag given by staging.tag. When used, the tag is expected to be available in the container registry.
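To illustrate how this flag can be consumed, here is a minimal sketch of a template guard (the echo-server key follows the example values above; the actual condition in the example chart may differ):
# deployment.yaml (sketch of the guard)
# Rendered in production, or in staging when the service is enabled
{{- if or (not .Values.staging.tag) (index (.Values.staging.enabled | default dict) "echo-server") }}
apiVersion: apps/v1
kind: Deployment
# ...
{{- end }}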
The nameOverride permits giving a name to the resources created in the staging environment. This avoids conflicts with the production ones.
You may wonder why we don't rely on namespaces here. The main reason is that
it makes sharing resources between the staging environment and the production
one harder and the configuration less clear. With this approach, having staging
versions using only a subset of the real environment is way simpler.
The image.pullPolicy: Always setting causes the staging pods to always pull their images. This is important because our building process does not change the tag when a new version is deployed; it just publishes a new image. Therefore we want the pods to always pull, to make sure they run the latest version.
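Concretely, the container spec of the deployment template could tie these values together as follows (a sketch; image.repository is an assumed value name):
# deployment.yaml (sketch of the container spec)
      containers:
        - name: echo-server
          # Fall back to the chart's appVersion when no staging tag is set
          image: "{{ .Values.image.repository }}:{{ .Values.staging.tag | default .Chart.AppVersion }}"
          imagePullPolicy: {{ .Values.image.pullPolicy }}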
Make staging elements use production resources
To permit one template file to be used by both the production & staging
deployments, their name are computed using the staging-env.name
template.
Nevertheless, some resources like secrets aren't meant for staging. For that
purpose, the staging-env.strictName
template permits to keep a fixed name which can then be used in staging
deployments:
# deployment.yaml
#....
- name: ECHO_SERVER_USERNAME
valueFrom:
secretKeyRef:
name: {{ include "staging-env.strictName" . }}-secret
key: username
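For reference, a minimal version of these two helpers could look like this (a sketch; the helpers in the example chart may be more elaborate):
# _helpers.tpl (sketch)
{{/* Name following nameOverride, so staging releases get their own resources */}}
{{- define "staging-env.name" -}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}

{{/* Fixed name ignoring nameOverride, so staging can reference production resources */}}
{{- define "staging-env.strictName" -}}
{{- .Chart.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}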
Ignored resources
To prevent some resources from being created in the staging deployments, each file to ignore is wrapped in a conditional clause:
# secret.yaml
{{- if not .Values.staging.tag }} # Not in staging mode
apiVersion: v1
kind: Secret
metadata:
name: {{ include "staging-env.strictName" . }}-secret
type: Opaque
data:
username: YWRtaW4=
password: MWYyZDFlMmU2N2Rm
{{- end }} # Not in staging mode
Deploy a staging environment
Finally, thanks to all those tricks, deploying a staging environment is as simple as deploying a production one. All we need to do is:
- Give it a name;
- Specify the tag to use for the Docker images;
- Indicate which services are enabled.
helm install staging-echo-server . \
--values values-staging.yaml \
--set nameOverride=staging-echo-server \
--set staging.tag=staging-update-echo \
--set staging.enabled.echoServer=true
And that's it! The host echo.staging-echo-server.alx-b.com will be available and route to the staging-update-echo version of the echo server.
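The routing comes from the ingress template, which builds the host from the computed name. Here is a sketch of the relevant rule (the domain follows the example above; the service port is assumed):
# ingress.yaml (sketch)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "staging-env.name" . }}-ingress
spec:
  rules:
    # e.g. echo.staging-echo-server.alx-b.com when nameOverride=staging-echo-server
    - host: echo.{{ include "staging-env.name" . }}.alx-b.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ include "staging-env.name" . }}-echo-server
                port:
                  number: 80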
To uninstall the staging environment and free all the allocated resources, we just need to run:
helm delete staging-echo-server
Conclusion
In this article, we've seen how combining GitHub workflows, Docker and Helm allowed us to set up a whole staging environment just by adding the staging label to a pull request.
With this base, you can build more complex and complete environments. At eKee we used this in an architecture of 8 interconnected services, and despite multiple staging environments running at once, the extra cost stayed low. Nevertheless, I recommend having a mirrored database for your staging environments. A mistake happens quickly when you deploy the latest commits of unmerged PRs 👀
I did not take the time to show you how you could connect the deployments to GitHub. Personally, I like to use Argo CD to manage deployments. It handles everything from the deployments to the rollbacks, and it has a nice interface so my engineers can manage their staging environments without having to type commands. I felt like this section was making the article way too big, but if you think it's needed, don't hesitate to send me a message on LinkedIn.