Setting up a startup's staging environment

February 15, 2021

The code illustrating this article is available in the GitHub repository Dot-H/simple-staging-env-ex.

At eKee, when the team and the number of users started growing, we hit a problem: we wanted to test things out and expose different versions of the product to our users, but as a startup, duplicating the whole environment was way too expensive.

In this article, I'll briefly describe our backend architecture, the solution I settled on and the tools used to build it.

The backend environment

At eKee we had a hybrid architecture. Without the fancy words, it just means that our stateful services (databases, knowledge base software, GitLab...) were running on on-premise servers while the stateless services (engine, API, websites...) were running in a cloud infrastructure using Kubernetes.

Everything was deployed using a GitOps approach: tagging a new version of a service would trigger a new deployment. To do so, we used Ansible for the on-premise services and Helm coupled with Argo CD for the in-cloud services.
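To give an idea, an Argo CD Application manifest looks roughly like this (the repository URL, chart path and names are hypothetical, not our actual setup):

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ekee-api                # hypothetical service name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eKee-io/deployments.git  # hypothetical repo
    targetRevision: HEAD
    path: charts/ekee-api
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:     # sync the cluster automatically on every chart change
      prune: true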

The specification

Based on this environment, here is what I wanted to develop:

  • Expose a staging.[service] subdomain for each service, resolving to a specific version of that service;
  • The version of a service can be any of its branches on Git. If I want to test feature-A on the ekee-api service, I deploy the latest commit of branch feature-A;
  • Each service in the staging environment can use a different branch and be deployed independently;
  • The staging environment relies as much as possible on the production one: no extra deployment needed;
  • Deployments must be simple. If developers need to do more than two operations, they won't use it (we all know we are lazy kind of people).

Following this specification, here is the developer experience I wanted:

  1. Developer A develops a new interface to share documents. The branch doc-sharing-v2 is created in the repository ekee-dashboard,
  2. After committing some beautiful engineering, Developer A opens a pull request and labels it staging,
  3. Once the CI returns no error, Developer A clicks deploy in Argo CD,
  4. When the PR is closed and/or the staging label is removed, the resources used by the staging environment are released.

The implementation

Adding the workflow around the staging label in GitHub

To trigger special workflows upon specific events, GitHub exposes a wonderful set of tools through what they call GitHub Actions.

Here is the action we need to tag the PR's latest commit when the staging label is present and untag it when the PR is closed or unlabeled:

name: Release staging version
on:
  pull_request:
    # `unlabeled` and `closed` let the untag step below run on those events
    types: [opened, synchronize, reopened, labeled, unlabeled, closed]
jobs:
  tag:
    if: ${{ !contains(github.event.pull_request.labels.*.name, 'no-ci') }}
    name: Release staging
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 1
      # Retrieve the PR's branch name
      - name: Extract branch name
        shell: bash
        run: echo "branch=${GITHUB_HEAD_REF}" >> $GITHUB_OUTPUT
        id: extract_branch
      # Tag the current commit with `staging-BRANCH_NAME` when there is a `staging` label
      - name: "Tag staging"
        uses: eKee-io/git-tag-action@fix-not-a-git-dir
        if:
          ${{ !contains(fromJson('["unlabeled", "closed"]'), github.event.action)
          && contains(github.event.pull_request.labels.*.name, 'staging') }}
        env:
          TAG: staging-${{ steps.extract_branch.outputs.branch }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          BRANCH: ${{ steps.extract_branch.outputs.branch }}
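      # Untag when the PR is closed or the `staging` label is removed.
      # Sketch: a plain `git push --delete` does the job here; any
      # equivalent mechanism (e.g. the same tag action) works too.
      - name: "Untag staging"
        if: ${{ contains(fromJson('["unlabeled", "closed"]'), github.event.action) }}
        # `|| true` so the job doesn't fail when the tag was never created
        run: git push origin --delete "refs/tags/staging-${{ steps.extract_branch.outputs.branch }}" || true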

Building the staging image to deploy

Nothing fancy here! What we deploy is a Docker image, so the strategy is pretty simple:

  1. Put a Dockerfile in the root directory of each service,
  2. Add an action triggered on tagging events which builds the Docker image,
  3. Tag the Docker image with the same tag as Git,
  4. Push this image to your registry so that it is accessible by your cloud resources.

To illustrate this, I used the github container registry in Dot-H/simple-staging-env-ex.
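For instance, a minimal Dockerfile could look like the following sketch (the base image and start command are placeholders, not necessarily what the example repository uses):

# Dockerfile -- minimal sketch, base image and command are placeholders
FROM node:16-alpine
WORKDIR /app
# Copy the manifests first to benefit from Docker's layer caching
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
CMD ["node", "server.js"]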

With GitHub workflows I hit an issue I didn't have with GitLab: a workflow running on a branch does not seem to be able to trigger another one without the explicit workflow_run trigger (events created with the default GITHUB_TOKEN don't start new workflow runs).

So no fancy chaining through the tag event is possible. We need to use this trigger in build_image.yaml:

name: Build docker image
on:
  workflow_run:
    workflows: ["Release staging version"]
    types:
      - completed
env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}
jobs:
  deploy:
    name: Build the docker image
    # `completed` also fires on failed runs, only build when CI succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3
        # The workflow runs on the main branch, we need to checkout to the
        # commit which triggered the `workflow_run` action
        with:
          ref: ${{ github.event.workflow_run.head_sha }}
      - name: Log in to the Container registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }} # Login to ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      # Nicely format the tags & labels. This may be useful if, as in my
      # case, your actor name contains uppercase letters
      - name: Extract metadata (tags, labels) for Docker
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          tags: |
            type=raw,value=staging-${{ github.event.workflow_run.head_branch }}
      - name: Build and push Docker image
        uses: docker/build-push-action@v3
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

Deploying the resources in k8s

Once the staging image is built and present on your container registry, it should be available to your infrastructure.

For the infrastructure to handle the staging version, we need to add a staging variant of the Deployment and the Service, and potentially expose it through an Ingress.

To do so without rewriting everything, I use Helm and its templates. I won't explain the Helm chart I developed for the example in detail here, but rather the small Helm tricks I used. Take a glance at the chart in the example repository to see what I mean.

Staging specific values

The Helm staging values are defined in values-staging.yaml and by default extend values.yaml.

The staging values file is composed as follows:

nameOverride: "NAME"
image:
  pullPolicy: Always
staging:
  tag: staging-BRANCH_NAME
  enabled:
    echoServer: true

The staging.enabled map controls which services compose the staging deployment using the staging.tag. When used, the tag is expected to be available in the container registry.

The nameOverride gives a distinct name to the resources created in the staging environment, which prevents them from conflicting with the production ones. You may wonder why we don't rely on namespaces here. The main reason is that namespaces make sharing resources between the staging and production environments harder and the configuration less clear. With this approach, having staging versions use only a subset of the real environment is way simpler.

The image.pullPolicy: Always causes the staging pods to always pull their images. This is important because our build process does not change the tag when a new version is deployed; it just publishes a new image. Therefore we want the pods to always pull so they always run the latest version.
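For illustration, the Deployment template can pick the staging tag over the default one like this (a sketch assuming the standard image.repository and image.tag chart values):

# deployment.yaml (sketch) -- staging.tag, when set, overrides the default tag
image: "{{ .Values.image.repository }}:{{ .Values.staging.tag | default .Values.image.tag }}"
imagePullPolicy: {{ .Values.image.pullPolicy }}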

Make staging elements use production resources

To permit one template file to be used by both the production and staging deployments, resource names are computed using the staging-env.name template. Nevertheless, some resources like secrets aren't meant for staging. For these, the staging-env.strictName template keeps a fixed name which can then be referenced from staging deployments:

# deployment.yaml
# ...
- name: ECHO_SERVER_USERNAME
  valueFrom:
    secretKeyRef:
      name: {{ include "staging-env.strictName" . }}-secret
      key: username
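For reference, here is a minimal sketch of what these two helpers could look like; the actual definitions in the example chart may differ:

{{/* _helpers.tpl (sketch) */}}
{{- define "staging-env.strictName" -}}
{{/* Fixed name: always the chart name, whatever the overrides */}}
{{- .Chart.Name -}}
{{- end -}}

{{- define "staging-env.name" -}}
{{/* Overridable name: nameOverride wins, so staging releases don't clash */}}
{{- default .Chart.Name .Values.nameOverride | trunc 63 | trimSuffix "-" -}}
{{- end -}}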

Ignored resources

To prevent resources from appearing in staging deployments, the files to ignore are wrapped in a conditional clause:

# secret.yaml
{{- if not .Values.staging.tag }} # Not in staging mode
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "staging-env.strictName" . }}-secret
type: Opaque
data:
  username: YWRtaW4=
  password: MWYyZDFlMmU2N2Rm
{{- end }} # Not in staging mode

Deploy a staging environment

Finally, thanks to all those tricks, deploying a staging environment is as simple as deploying a production one. All we need to do is:

  • Give it a name;
  • Specify the tag to use for the docker images;
  • Inform which services are enabled.

helm install staging-echo-server . \
  --values values-staging.yaml \
  --set nameOverride=staging-echo-server \
  --set staging.tag=staging-update-echo \
  --set staging.enabled.echoServer=true

And that's it! The host echo.staging-echo-server.alx-b.com will be available and route to the staging-update-echo version of the echo server.
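For illustration, such a host can come out of an Ingress template along these lines (a sketch; the real chart's fields may differ):

# ingress.yaml (sketch) -- the host embeds the overridable release name
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: {{ include "staging-env.name" . }}-ingress
spec:
  rules:
    - host: echo.{{ include "staging-env.name" . }}.alx-b.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: {{ include "staging-env.name" . }}
                port:
                  number: 80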

To uninstall the staging environment and free all the allocated resources, we just need to run:

helm delete staging-echo-server

Conclusion

In this article we've seen how combining GitHub workflows, Docker and Helm let us set up a whole staging environment just by adding the staging label to a pull request.

With this base, you can build more complex and complete environments. At eKee we used this in an architecture with 8 services connected to each other, and despite multiple staging environments running at once, the extra cost stayed low. Nevertheless, I recommend having a mirrored database for your staging environments. Mistakes happen quickly when you deploy the latest commits of unmerged PRs 👀

I did not take the time to show how you could connect the deployment to GitHub. Personally, I like to use Argo CD to manage deployments: it handles everything from deployments to rollbacks and has a nice interface that lets engineers manage their staging environments without typing commands. I felt this section would make the article way too big, but if you think it's needed, don't hesitate to send me a message on LinkedIn.
