This post is a follow-on to our overview of persisting data in Workflows. To learn how best to use workspaces, caching and artifacts, read our introductory post.
Workspaces are a feature of Workflows and are used to move data from a job in a Workflow to subsequent jobs.
The name “workspace” may conjure up an image of a single storage location which jobs can add to and remove from at will. However, Workflows can introduce a lot of concurrency into a build process, and there is no defined order that concurrent jobs run in. In these circumstances a single mutable storage location leads to jobs flip-flopping from pass to fail in different workflow runs since they never get a consistent view of the workspace contents. This significantly impacts your ability to get a repeatable build process if your workflow has concurrent jobs.
So the workspace needs to be built from immutable layers (we use gzipped tarballs stored in a blob store). Each job can add a new immutable layer to the workspace when it persists data.
But that’s not enough: for repeatability we need to be able to work out what workspace contents a job should be able to see when it uses the attach_workspace command.
E.g. the following workflow:
    workflows:
      version: 2
      jobs:
        - compile-assets
        - compile-code
        - test:
            requires:
              - compile-assets
              - compile-code
        - code-coverage
        - deploy:
            requires:
              - test
              - code-coverage
Produces this graph (where I’ve labelled jobs with a letter in addition to the job-name):
    +---------------------+
    |  code-coverage (D)  +---------------------------+
    +---------------------+                           |
                                                      |   +--------------+
                                                      +--->  deploy (E)  |
                                                      |   +--------------+
    +----------------------+                          |
    |  compile-assets (A)  +---+                      |
    +----------------------+   |                      |
                               |   +------------+     |
                               +--->  test (C)  +-----+
                               |   +------------+
    +--------------------+     |
    |  compile-code (B)  +-----+
    +--------------------+
Job C should definitely see the workspace additions made by jobs A and B, but what about job D? Sometimes job D may run before job C and sometimes after; there aren’t any guarantees.
Fortunately we can turn to the workflow graph for help. The rule is that a job can only see data added to the workspace by jobs that are upstream of it in the workflow graph. So job C can never see data from job D.
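As a concrete illustration of this rule, here is a minimal config sketch for jobs A and C. It uses CircleCI's persist_to_workspace and attach_workspace steps; the Docker image, the commands and the directory names are placeholders chosen for the example, not taken from a real project.

```yaml
# Sketch only: image, commands and paths are placeholders.
version: 2
jobs:
  compile-assets:            # job A
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - run:
          name: Compile assets (placeholder command)
          command: mkdir -p build/assets && echo compiled > build/assets/app.css
      - persist_to_workspace:
          root: build        # the paths below are relative to this directory
          paths:
            - assets         # stored as one new immutable workspace layer
  test:                      # job C, downstream of A (and B) in the workflow
    docker:
      - image: cimg/base:stable
    steps:
      - checkout
      - attach_workspace:
          at: build          # layers from upstream jobs are unpacked here
      - run: ls build/assets
```

Because code-coverage (D) is not upstream of test, nothing that D persists would appear under build/ in the test job, whichever of the two happens to finish first.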
Another consideration is that a job can “overwrite” data that a previous job stored in the workspace. If job B and job C both store a file called code.jar into the workspace with different content, which one should job E see?
Again we turn to the workflow graph: the workspace layers produced by the upstream jobs are applied in a topological order, i.e. the layers produced by a job’s ancestors are always applied before the layers produced by the job itself. One valid order for applying layers when attaching the workspace to job E is A, B, C, D, but another entirely valid order is D, B, A, C.
In all cases the layers from job C are applied after the layers from job B, so we know that job E will see code.jar from job C.
The final consideration is what happens when two concurrent jobs (e.g. jobs A and B, or jobs C and D) add the same file to the workspace. If jobs C and D both store code.jar, which one should job E see? The workflow graph doesn’t help here because the jobs are concurrent, but we want to avoid ambiguity and definitely avoid flip-flopping between workflow runs. Since the goal is to have repeatable workflows, we flag the situation and fail the attach_workspace command with an error highlighting the ambiguous file.
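One way to stay clear of this failure is to have concurrent jobs persist to disjoint paths. The fragment below is a sketch under that assumption. Only the persist steps are shown, and the output directory and its subdirectories are hypothetical names.

```yaml
# Fragment: only the persist steps are shown; directory names are hypothetical.
jobs:
  test:                        # job C
    steps:
      - persist_to_workspace:
          root: output
          paths:
            - test/code.jar        # stored in the workspace as test/code.jar
  code-coverage:               # job D, concurrent with C
    steps:
      - persist_to_workspace:
          root: output
          paths:
            - coverage/code.jar    # a different path, so job E sees both files
```

With this layout job E would find two distinct files after attaching the workspace, so there is nothing ambiguous to flag.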
Re-running a workflow with a workspace
The concerns and implementation highlighted above might sound very theoretical, without much practical benefit for simple workflows that don’t have any concurrency, but all this immutability and careful ordering of how layers are applied means we can share portions of a workspace between workflow runs!
This is how re-running just the failed jobs in a workflow works: we can determine all the workspace layers for the successful jobs from the previous run and use those in the re-run. This saves you time and resources since we don’t need to run the successful jobs again.
- Workspaces are intended to move data that has been produced during a job to later jobs in a workflow.
- We can achieve repeatability by using immutability and the workflow graph to work out which data each job in a workflow can see.
- Workspaces aren’t magic; ultimately a workspace is a set of tarballs created and stored in a blob store. Attaching a workspace needs to download and unpack tarballs for every upstream job that persisted data to the workspace.
That sounds awesome, but there’s still all this choice–which feature should I use for what?
The first thing to bear in mind is that everything you store carries some penalty to archive, upload, and download, regardless of which feature is used. Bandwidth is not infinite and latency is not zero. And don’t forget filesystem operations: packing and unpacking tens to hundreds of thousands of tiny files into an archive takes time.
[footnote: on my multicore MacBook Pro dev machine with SSD and no virtualisation overhead it takes 23 seconds just to create the archive of the node_modules directory for a typical customer’s set of Node packages.]
Workspaces offer a tempting convenience, putting your whole checkout and dependencies into the workspace, but there are downsides to this convenience:
- causes an archive and upload of all your dependencies
- causes an archive and upload of your entire git repo (including history)
- causes re-downloads of these even if you have already restored a cache
For speedy workflows we want to minimise the archive, upload, download and unpack operations.
What should you do?
- Use the workspace for data produced in a job that needs to be available to subsequent jobs. This is workspaces’ bread and butter and works well with re-runs.
- Keep dependencies in the CircleCI cache and use restore_cache to copy them into your containers. If you key on the checksum of your dependencies manifest (package.json, Gemfile.lock, project.clj, etc.) then you can avoid archiving and uploading the dependency data in every workflow. The CircleCI caching mechanism will only save a new cache if the content of your dependencies manifest changes. Since dependencies can be large and/or have a large number of files, this can save significant time.
- Be specific about what you persist into your workspace to avoid storing data your subsequent jobs don’t need. E.g. if you don’t need to use git in downstream jobs, avoid adding .git to your workspace, since it can contain large amounts of data.
- Don’t store the same data in both the cache and the workspace, or you’ll pay the cost of downloading it twice.
- If you do need your full git repo available, consider using checkout to check out the repo from your VCS provider. This is less clear-cut than the other advice because general network “weather” has a larger impact on VCS provider to CircleCI transfers.
- Use artifacts for any data produced by a workflow that you want to be able to access outside the CircleCI system after the workflow has finished.
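Pulling the advice above together, a config along these lines might be a reasonable starting point. Treat it as a sketch: the Node image, the deps-v1 key prefix, the npm commands and the dist directory are all assumptions made for illustration.

```yaml
# Sketch only: image, cache key name, commands and paths are assumptions.
version: 2
jobs:
  build:
    docker:
      - image: cimg/node:lts
    steps:
      - checkout
      - restore_cache:
          keys:
            - deps-v1-{{ checksum "package.json" }}
      - run: npm install
      - save_cache:              # a new cache is saved only when the key changes
          key: deps-v1-{{ checksum "package.json" }}
          paths:
            - node_modules
      - run: npm run build       # assumed to write its output to dist/
      - persist_to_workspace:
          root: .
          paths:
            - dist               # only the build output, not .git or node_modules
  deploy:
    docker:
      - image: cimg/node:lts
    steps:
      - attach_workspace:
          at: .
      - store_artifacts:         # keeps the output reachable after the workflow ends
          path: dist
          destination: build-output
```

Dependencies travel through the cache, job output travels through the workspace, and anything you want to inspect after the workflow finishes goes to artifacts. A downstream job that needs node_modules would add its own restore_cache step rather than pulling them from the workspace.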