Sharing data across hybrid cloud and local CI/CD environments
Senior Technical Content Marketing Manager
This short tutorial demonstrates how you can work on data stored on your own infrastructure or in hybrid cloud CI/CD environments using CircleCI’s shared workspaces functionality — without having to configure VPNs, SSH tunnels, or other additional infrastructure.
If you frequently work with specialized execution environments or handle sensitive data in your CI/CD workloads, having the option to selectively share data between local and cloud environments can give you additional flexibility to customize your resources on a per-job basis.
CircleCI execution environments: managed cloud compute vs self-hosted runners
CircleCI is a flexible CI/CD platform that lets you run jobs and workflows in a specified execution environment, with the results being reported back to the CircleCI web console. Execution environments can be provided by either our managed cloud compute or your own machines using local self-hosted runners.
With our hosted cloud offering, you don’t have to manage any infrastructure and only pay for what you use (after using up all your build minutes on our generous free tier). Cloud execution environments include Docker, Linux virtual machines, macOS, Windows, Arm, and GPU and can be spun up on-demand for every job in your workflow. With Docker, you can bring your own image or run on one of our pre-built convenience images optimized for speed and efficiency.
Self-hosted runners allow you to execute CircleCI jobs on your own cloud instances hosted on AWS, Azure, Google Cloud, or your own physical machines. Using local runners is popular with organizations that want to keep the data their CI/CD workflows process on their local network. This approach is also favored by those that need to tailor their execution environments down to the hardware level for specialized tasks, such as intensive GPU processing for machine learning or automatic scaling of CI/CD infrastructure.
There and back again: working on your local data in the cloud
You may encounter situations in which you want to use CircleCI’s cloud compute resources to process data stored on your local infrastructure or hybrid cloud. In many use cases, you would expose this data to your external CI/CD platform in one of two ways:
- Storing it in a shared location that is externally accessible
- Setting up a VPN so that your CI/CD environment can connect to it and access internal resources
These options are fine for many scenarios but impractical for others.
For example, you may have an automated machine learning workflow that processes data using a self-hosted runner that has direct access to your private data store. The amount of data is growing quickly, however, and your self-hosted runner is starting to struggle under the load. In this case, CircleCI’s managed cloud compute will enable you to process the data faster without having to invest in and maintain additional infrastructure.
Rather than having to set up a potentially complex VPN to connect your automations to your data, it would be much more convenient to just load the required data on your self-hosted runner and then work on it in the cloud.
Persisting data between cloud and local execution environments with workspaces
The solution described above is exactly what CircleCI’s workspaces feature lets you do: data created or loaded in a job running in one execution environment can be persisted and then attached in a job running in another. When a job persists its workspace, the data is uploaded to CircleCI and made available for download by downstream jobs in the same workflow, even if they run in a different location. No VPN or workarounds are required.
In the example below, data is created on a self-hosted runner, persisted to a workspace, loaded in a cloud execution environment provided by CircleCI, and then checked to confirm that the transfer was successful.
How to run this example
Everything in this example is performed within the CircleCI configuration file. The data used to demonstrate persisting and loading data between jobs and environments is all generated using Bash commands defined in this configuration file.
To run this example, you’ll need to fork the example repository and create a CircleCI project from it.
Then, you will need to update all occurrences of RUNNER_NAMESPACE/RUNNER_RESOURCE_CLASS in the included .circleci/config.yml to match the details provided after setting up your own self-hosted runner. This file contains the complete job and workflow definitions, including the steps shown below.
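As a rough sketch of where that placeholder appears, a job targeting a self-hosted runner might look like the fragment below. It assumes a machine-style runner, which is selected with machine: true; the echo step is purely illustrative and is not taken from the example repository.

```yaml
jobs:
  create-test-data:
    # Assumption: a machine-style self-hosted runner
    machine: true
    # Replace with <your-namespace>/<your-resource-class> from your runner setup
    resource_class: RUNNER_NAMESPACE/RUNNER_RESOURCE_CLASS
    steps:
      # Illustrative step only, to show that the job runs on the runner
      - run: echo "Running on a self-hosted runner"
```

Jobs without a runner resource class, such as the cloud job later in the workflow, do not need this change.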
Notes on where to store persisted data
Because CircleCI is not guaranteed to have control over, or access to, directories outside of the configured working directory, especially when using self-hosted runners, we recommend that you create and mount workspaces within the working directory of the job.
Workspaces are not intended for long-term storage — only for persisting data between jobs — so make sure you store any data you want to be kept outside of your CI/CD infrastructure before your workflow ends.
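When persisting data, you should also be aware of how files in persisted workspaces behave: workspaces are additive, so each job that persists data adds a layer on top of what earlier jobs persisted, and downstream jobs cannot remove files from it. Keep in mind potential usage limits if you are moving large amounts of data, and persist only the files that later jobs actually need.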
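One possible way to follow both recommendations is to stage the files you want to share in a dedicated subdirectory of the working directory and persist only that directory. The fragment below is a sketch under that assumption; the directory name workspace and the file contents are hypothetical and do not come from the example repository.

```yaml
steps:
  - run:
      name: Stage only the files the next job needs
      command: |
        # Hypothetical staging directory inside the working directory
        mkdir -p workspace
        # Hypothetical file contents, for illustration only
        echo "Hello from the self-hosted runner" > workspace/test_data.txt
  - persist_to_workspace:
      # The workspace root is the staging directory, not the whole working directory
      root: workspace
      paths:
        - test_data.txt
```

Because paths are relative to root, a downstream job that attaches this workspace at . would find the file at test_data.txt, and nothing else from the runner's working directory is uploaded.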
Persisting data to a workspace
To persist data, add the persist_to_workspace key as a step to a job. Make sure this appears after the data has been created or loaded:
jobs:
  create-test-data:
    steps:
      - persist_to_workspace:
          # Workspaces let you persist data between jobs - https://circleci.com/docs/workspaces/
          # Must be an absolute path or relative path from working_directory.
          # This is a directory on the container that is taken to be the root directory of the workspace.
          # In this example, the workspace root is the working directory (.)
          root: .
          paths:
            - test_data.txt
In this case, the file at the path test_data.txt is persisted to a workspace in the create-test-data job. The workspace root is set to the working directory root (.).
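For context, here is a minimal sketch of how the whole create-test-data job could fit together, with a step that generates the file before it is persisted. The step name and the text written to the file are assumptions made for illustration; the actual commands in the example repository may differ.

```yaml
jobs:
  create-test-data:
    # Assumption: a machine-style self-hosted runner
    machine: true
    resource_class: RUNNER_NAMESPACE/RUNNER_RESOURCE_CLASS
    steps:
      - run:
          name: Create test data
          command: |
            # Hypothetical contents; any Bash command that writes the file would do
            echo "Hello from the self-hosted runner" > test_data.txt
      # Persist only after the file exists
      - persist_to_workspace:
          root: .
          paths:
            - test_data.txt
```

The ordering matters here: if persist_to_workspace ran before the file was written, there would be nothing at that path to upload.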
Loading data from the persisted workspace
Data is not persisted between jobs by default, so when the data is required in another job, load it by adding the attach_workspace key as a step. Its at path defines where the workspace is mounted in the new job; in this example, it matches the root that the data was persisted from:
jobs:
  check-test-data:
    steps:
      - attach_workspace:
          # Must be absolute path or relative path from working_directory
          # In this example, the workspace root is the working directory (.)
          at: .
Above, within the check-test-data job, the workspace is attached at the root of the working directory. This means that the test_data.txt file persisted to it in the preceding job can be read at the same path in the new job.
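To illustrate the cloud side, the sketch below runs check-test-data in a Docker execution environment on CircleCI's managed compute. The choice of the cimg/base:current convenience image and the cat step are assumptions for illustration; any cloud executor could be used instead.

```yaml
jobs:
  check-test-data:
    # Assumption: a CircleCI convenience image on managed cloud compute
    docker:
      - image: cimg/base:current
    steps:
      - attach_workspace:
          # Mount the workspace at the working directory
          at: .
      - run:
          name: Show the persisted file
          # The file appears at the same relative path it was persisted from
          command: cat test_data.txt
```

Note that this job has no resource_class pointing at a runner, which is what places it on managed cloud compute rather than on your own machine.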
Once the above steps have been added to their respective jobs, the jobs can be added to a workflow:
workflows:
  test-workspace:
    jobs:
      - create-test-data
      - check-test-data:
          requires:
            - create-test-data
Jobs in workflows will run concurrently unless the requires condition is added. Above, the check-test-data job is configured to run only after the create-test-data job has been completed.
Testing for success
The example configuration generates a file containing a text string on a self-hosted runner and persists the workspace containing this file.
The next job, which is run on CircleCI’s managed cloud environment, verifies the file’s presence and contents once it has been loaded via the attached workspace.
If the test file was successfully created and persisted on the self-hosted runner and then loaded and read in the cloud execution environment, the workflow will be reported as successful in the CircleCI web console.
If a problem occurs during the execution of a workflow, you will be notified so that you can take action.
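A verification step along these lines could be sketched as follows. It assumes the file is named test_data.txt, as in the example, and that the first job wrote the hypothetical string shown; adjust both to match your own data.

```yaml
steps:
  - attach_workspace:
      at: .
  - run:
      name: Verify test data
      command: |
        # Fail the step if the file did not arrive with the workspace
        test -f test_data.txt
        # Fail the step if the file does not contain the expected (hypothetical) string
        grep -q "Hello from the self-hosted runner" test_data.txt
```

Either command exiting with a non-zero status fails the step, which in turn marks the job and the workflow as failed.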
Run CI/CD workflows anywhere with CircleCI
CircleCI provides the tools you need to automate your CI/CD workflows and other scriptable tasks across multiple environments, including local infrastructure and private/hybrid clouds. While we provide images and cloud resources to support most common use cases, you can use your own images and machines for full control over your CI/CD environments, and mix them with our managed cloud environments for extra on-demand processing power.
If you’re not yet using CircleCI, you can get started with up to 6,000 free build minutes per month.