Tutorials · Sep 26, 2025 · 11 min read

Build an automated ETL pipeline for cryptocurrency data with CircleCI

Marcelo Oliveira

Senior Software Engineer

To stay ahead in the crypto world, you need the latest information about cryptocurrencies. With so many coins out there and prices changing all the time, knowing which ones are doing the best gives you a quick snapshot of what’s hot right now. Whether you’re investing, just curious, or trying to understand the market better, this information makes it easier to spot trends and make smarter decisions. Plus, it’s a great way to keep things simple and not get overwhelmed by all the noise in the crypto space.

In this tutorial, you will build a CI/CD-driven ETL (Extract, Transform, Load) pipeline that processes cryptocurrency data using the free, publicly available CoinGecko API. The result is a polished Markdown file with a top-10 market cap snapshot, automatically uploaded to an AWS S3 bucket. You will automate the entire process using CircleCI.

Prerequisites

To follow this tutorial, you will need:

  • A GitHub account and a repository to host the project
  • A CircleCI account connected to your GitHub account
  • An AWS account with permissions to create IAM identity providers, IAM roles, and S3 buckets
  • Basic familiarity with Python

Project structure

Before diving into the code, take a look at the overall file layout. This will help ensure you understand where each part of the ETL pipeline belongs. Create this folder structure on your local machine. You will add code to each file as you follow along with the article.

etl-pipelines/
├── .circleci/
│   └── config.yml
├── etl.py
└── requirements.txt

Setting up your dependencies

Open the requirements.txt file and add these lines.

annotated-types==0.7.0
certifi==2025.4.26
charset-normalizer==3.4.2
idna==3.10
mypy_extensions==1.1.0
numpy==2.2.5
packaging==25.0
pandas==2.2.3
pandera==0.23.1
pydantic==2.11.4
pydantic_core==2.33.2
python-dateutil==2.9.0.post0
pytz==2025.2
requests==2.32.3
six==1.17.0
typeguard==4.4.2
typing-inspect==0.9.0
typing-inspection==0.4.0
typing_extensions==4.13.2
tzdata==2025.2
urllib3==2.4.0

These libraries form the foundation of a lightweight yet efficient ETL pipeline. You won’t need to install or run them manually—CircleCI will manage everything automatically during the pipeline execution. Pandas is used for manipulating and analyzing tabular data, Requests fetches real-time cryptocurrency data from external APIs, and Pandera ensures data quality by validating the structure of Pandas DataFrames. The remaining packages are either direct dependencies or required by these core libraries to function properly.
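
If you want to check the pinned libraries locally before pushing, here is an optional sanity check; it assumes you have installed requirements.txt into a local virtual environment:

# Optional local check: confirm the core libraries import cleanly and report their versions
import pandas as pd
import pandera as pa
import requests

print("pandas:", pd.__version__)
print("pandera:", pa.__version__)
print("requests:", requests.__version__)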

Creating the ETL code

The etl.py script is the brain of the operation. It contains each step necessary for the ETL process. Open the etl.py file and add the required imports:

import sys
import requests
import pandas as pd
import pandera as pa
import os
import json
from datetime import datetime

Extract

Every ETL pipeline starts with data available in an expected format. But how do we ensure our source is reliable and structured enough to kick off our automation? For this pipeline, you will use the CoinGecko API to retrieve real-time market data for the top cryptocurrencies.

def extract_data():
    print("Fetching cryptocurrency data from CoinGecko API")
    url = "https://api.coingecko.com/api/v3/coins/markets"
    params = {
        "vs_currency": "usd",
        "order": "market_cap_desc",
        "per_page": 10,
        "page": 1,
        "sparkline": False
    }
    response = requests.get(url, params=params)
    response.raise_for_status()
    data = response.json()

    os.makedirs("data", exist_ok=True)
    with open("data/raw_data.json", "w") as f:
        json.dump(data, f)
    print("Raw data saved to data/raw_data.json")

    return data

Extract from API

The extract_data function sends a GET request to the CoinGecko markets API to retrieve real-time data on the top 10 cryptocurrencies sorted by market capitalization. It specifies query parameters such as currency (USD), sorting order, and pagination to fine-tune the results. After confirming a successful response with raise_for_status(), the JSON payload is parsed and saved to a local data/raw_data.json file. This ensures the data is reproducible and accessible for downstream processes without depending on live API calls every time the pipeline runs.
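
To try the extract step on its own, you can import the function and inspect the payload. This is a minimal sketch; it assumes etl.py is in your current directory and the dependencies are installed:

# Quick local check of the extract step
from etl import extract_data

records = extract_data()              # hits the CoinGecko API and writes data/raw_data.json
print(len(records))                   # expect 10 entries, one per coin
print(sorted(records[0].keys())[:5])  # a few of the fields used downstream, e.g. current_price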

Transform

Now that we have the raw JSON data, what’s next? Since raw data can be hard to interpret or visualize directly, this step involves cleaning and formatting the data to make it easier to work with and present.

def transform_data(data=None):
    if data is None:
        print("Loading raw data from file")
        with open("data/raw_data.json", "r") as f:
            data = json.load(f)

    print("Transforming data")
    df = pd.json_normalize(data)
    df = df[["id", "symbol", "name", "current_price", "price_change_24h", "price_change_percentage_24h", "market_cap"]]
    df.columns = [col.replace("_", " ").title() for col in df.columns]

    # Combine "Price Change 24H" and "Price Change Percentage 24H"
    df['Price Change 24H'] = df.apply(
        lambda row: f"{row['Price Change 24H']:.2f} ({row['Price Change Percentage 24H']:.2f}%)",
        axis=1
    )
    df = df.drop(columns=['Price Change Percentage 24H'])

    df.to_csv("data/transformed_data.csv", index=False)
    print("Transformed data saved to data/transformed_data.csv")

    return df

The transform_data function begins by loading the raw JSON data from a file. It then uses pandas to normalize the nested JSON fields into a flat DataFrame. After selecting the relevant columns, the function renames them to make them more readable. Next, it combines the “Price Change 24H” and “Price Change Percentage 24H” into a single column that provides a clearer view of the price change in both absolute and percentage terms. The resulting DataFrame is saved to a CSV file (data/transformed_data.csv) for further processing.

Data transformation
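
The combined column is plain Python string formatting. Here is a quick illustration of the format it produces, using made-up numbers:

# Illustrative only: the "absolute (percentage%)" format built in transform_data
price_change = -123.456     # hypothetical 24h change in USD
price_change_pct = -1.2345  # hypothetical 24h change in percent
combined = f"{price_change:.2f} ({price_change_pct:.2f}%)"
print(combined)             # -123.46 (-1.23%)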

Validate

As APIs evolve, their data structures may change. How do we future-proof our pipeline against breaking changes? Enter data validation:

def validate_data(df=None):
    if df is None:
        print("Loading transformed data from file")
        df = pd.read_csv("data/transformed_data.csv")

    print("Validating data schema")
    schema = pa.DataFrameSchema({
        "Id": pa.Column(str),
        "Symbol": pa.Column(str),
        "Name": pa.Column(str),
        "Current Price": pa.Column(float),
        "Price Change 24H": pa.Column(object),
        "Market Cap": pa.Column(int),
    })
    validated_df = schema.validate(df)

    validated_df.to_csv("data/validated_data.csv", index=False)
    print("Validated data saved to data/validated_data.csv")

    return validated_df

Data validation

With Pandera, you define a contract to which your data must adhere.

The validate_data function begins by loading the transformed data from a CSV file. It then defines a schema using Pandera to validate the data structure, ensuring each column conforms to the expected data type (e.g., strings for IDs and names, floats for prices, and integers for market caps).

Execution proceeds only if the data matches the schema. If the data structure changes, the schema.validate(df) call will raise an exception and the CircleCI pipeline will fail. This gives engineers a chance to intervene, fix the code, and re-run the pipeline.

Once validated, the function saves the data as data/validated_data.csv, providing a reliable dataset for subsequent stages in the pipeline.
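
To see how a contract violation surfaces, here is a minimal sketch using a deliberately broken, hypothetical DataFrame; the same kind of exception is what fails the CircleCI job:

import pandas as pd
import pandera as pa

schema = pa.DataFrameSchema({"Market Cap": pa.Column(int)})
bad = pd.DataFrame({"Market Cap": ["not a number"]})  # wrong dtype on purpose

try:
    schema.validate(bad)
except pa.errors.SchemaError as err:
    print(f"Validation failed: {err}")
    # In etl.py the exception is not caught, so the process exits non-zero
    # and the pipeline step fails.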

Load

Finally, it’s time to present the cleaned and validated data. But how do we make it both shareable and visually intuitive? Markdown tables strike a balance between readability and compatibility.

def load_data(df=None):
    if df is None:
        print("Loading validated data from file")
        df = pd.read_csv("data/validated_data.csv")

    print("Saving cryptocurrency data to Markdown file")
    md_path = "data/crypto.md"  # Define the Markdown file path
    md_dir = os.path.dirname(md_path) # Get the directory
    if md_dir: # Check if the directory exists
        os.makedirs(md_dir, exist_ok=True)  # Create the directory if it doesn't exist
        print(f"Markdown directory created: {md_dir}") # Log

    title = "# Top 10 Cryptocurrencies by Market Cap"
    description = "Data obtained from the [CoinGecko API](https://api.coingecko.com/api/v3/coins/markets)."
    markdown_table = create_markdown_table(df.copy()) # Use a copy
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S %Z%z")
    footer = f"*Last updated: {timestamp.strip()}*"

    with open(md_path, "w") as f:
        f.write(title + "\n\n")
        f.write(description + "\n\n")
        f.write(markdown_table + "\n\n")
        f.write(footer + "\n")

    print(f"Cryptocurrency data successfully saved to: {md_path}")

The load_data() function builds a Markdown file with a title, a data table (created via create_markdown_table()), and a timestamped footer, writing everything to data/crypto.md. This final file is meant to be publicly shared via S3 to display the latest market data in a human-readable format.
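
One small caveat: datetime.now() returns a naive timestamp, so the %Z and %z directives in the footer render as empty strings. If you want the timezone to appear, a possible tweak (not part of the script above) is to use an aware datetime:

from datetime import datetime, timezone

# Timezone-aware timestamp so %Z/%z are populated, e.g. "2025-09-26 14:03:21 UTC+0000"
timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S %Z%z")
print(timestamp)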

Then, add the helper functions that generate the Markdown table, format currency values, and fetch the coin icons:

def create_markdown_table(df):
    """Creates a formatted Markdown table with right-aligned numerical columns."""
    df['Icon'] = df['Symbol'].apply(get_crypto_icon_url)
    df['Current Price'] = df['Current Price'].apply(format_currency)
    df['Market Cap'] = df['Market Cap'].apply(format_market_cap)
    df = df[['Icon', 'Name', 'Current Price', 'Price Change 24H', 'Market Cap']]

    # Create the Markdown string with right alignment for the last 3 columns
    markdown_lines = ["| " + " | ".join(df.columns) + " |"]
    markdown_lines.append("| ---| ---| ---:| ---:| ---:|")
    for _, row in df.iterrows():
        row_values = [str(val) for val in row.tolist()]
        # Alignment is handled by the "---:" delimiters; the leading space just pads the cell
        row_values[-3:] = [f" {val}" for val in row_values[-3:]]
        markdown_lines.append("| " + " | ".join(row_values) + " |")

    return "\n".join(markdown_lines)

The create_markdown_table() function builds a Markdown table from a cryptocurrency DataFrame. First, it improves the information by adding icon URLs for each coin and formatting prices and market caps for better readability. It then selects key columns—like name, price, and 24-hour change—and generates a Markdown string with proper formatting.
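
To preview the table it produces, you can feed it a tiny DataFrame of hypothetical values (note that the icon lookup performs a network HEAD request; offline it simply falls back to the ¤ symbol):

import pandas as pd
from etl import create_markdown_table

# One hypothetical row with the columns the function expects
sample = pd.DataFrame([{
    "Symbol": "btc",
    "Name": "Bitcoin",
    "Current Price": 65000.0,              # hypothetical price in USD
    "Price Change 24H": "120.50 (0.19%)",  # already combined by transform_data
    "Market Cap": 1_280_000_000_000,       # hypothetical market cap
}])
print(create_markdown_table(sample))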

def get_crypto_icon_url(symbol):
    """Returns an HTML <img> tag for the coin icon, or a fallback symbol if not found."""
    base_url = "https://raw.githubusercontent.com/cjdowner/cryptocurrency-icons/master/32/color/"
    filename = f"{symbol.lower()}.png"
    icon_url = f"{base_url}{filename}"

    try:
        response = requests.head(icon_url, timeout=3)
        if response.status_code == 200:
            return f'<img src="{icon_url}" width="16" height="16" align="absmiddle"> '
    except requests.RequestException:
        pass

    return "¤ "  # Fallback emoji if icon not found

The get_crypto_icon_url() function returns a small icon based on the symbol parameter. It builds a URL into a public GitHub repository of cryptocurrency icons, sends a HEAD request to confirm the icon exists, and returns an HTML <img> tag pointing to it when found. Otherwise, it falls back to a generic currency symbol (¤).

def format_currency(value):
    """Formats a numeric value as currency with commas for thousands."""
    return f"${value:,.2f}"

def format_market_cap(value):
    """Formats a large integer as market capitalization with appropriate units."""
    if value >= 1_000_000_000_000:
        return f"${value / 1_000_000_000_000:.2f}T"
    elif value >= 1_000_000_000:
        return f"${value / 1_000_000_000:.2f}B"
    elif value >= 1_000_000:
        return f"${value / 1_000_000:.2f}M"
    else:
        return f"${value:,.2f}"

The format_currency() and format_market_cap() functions format numbers as money (e.g., $1,234.56), and convert large numbers into compact, human-friendly units like millions (M), billions (B), or trillions (T).
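
A couple of quick checks show what the formatters return (the values are illustrative; assumes etl.py is importable):

from etl import format_currency, format_market_cap

print(format_currency(1234.5))               # $1,234.50
print(format_market_cap(2_100_000_000_000))  # $2.10T
print(format_market_cap(48_500_000_000))     # $48.50B
print(format_market_cap(750_000))            # $750,000.00 (below the 1M threshold)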

This Markdown export step finalizes our pipeline output and prepares the data for upload. You could use this file in newsletters, dashboards, or static websites.

Load data

Finally, add the block that executes specific ETL commands according to the parameter passed to the Python script:

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("No command provided. Available commands: extract_data, transform_data, validate_data, load_data, etl_pipeline")
        sys.exit(1)

    command = sys.argv[1]

    if command == "extract_data":
        extract_data()
    elif command == "transform_data":
        transform_data()
    elif command == "validate_data":
        validate_data()
    elif command == "load_data":
        load_data()
    else:
        print(f"Unknown command: {command}")
        sys.exit(1)
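
The CircleCI job will call these commands one at a time, but nothing stops you from exercising the whole flow locally first. A minimal sketch, assuming etl.py and its dependencies are available:

# Chain the four stages in-process; each one also persists its output under data/
from etl import extract_data, transform_data, validate_data, load_data

data = extract_data()
df = transform_data(data)
validated = validate_data(df)
load_data(validated)  # writes data/crypto.md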

CircleCI workflow configuration

This project is orchestrated through CircleCI, a cloud-based continuous integration and continuous delivery (CI/CD) platform.

The configuration defines three jobs: project_checkout, etl, and upload. Each job has a clearly defined responsibility, forming a streamlined pipeline that automates the entire data lifecycle—from acquisition to publication.

Workflow overview

Here is a diagram showing the three workflow jobs and the steps you will implement:

ETL diagram

The pipeline flows logically:

  1. Checkout the codebase
  2. Run the ETL pipeline
  3. Upload the processed data to AWS S3

CircleCI orbs

Open the .circleci/config.yml and add the following lines:

version: 2.1

orbs:
  aws-cli: circleci/aws-cli@4.0.0
  aws-s3: circleci/aws-s3@3.0.0

Orbs are CircleCI’s reusable configuration packages; they significantly reduce boilerplate in your config.

Job: project_checkout

This job leverages a full virtual machine (ubuntu-2204) with Docker layer caching enabled for faster builds. It checks out the code and persists it into a shared workspace, allowing subsequent jobs to access the exact project state.

jobs:
  project_checkout:
    machine:
      image: ubuntu-2204:edge
      docker_layer_caching: true
    steps:
      - checkout
      - persist_to_workspace:
          root: .
          paths:
            - .

Job: etl

The next CircleCI job will provision a Docker container and execute the crucial steps necessary for the ETL workflow:

  etl:
    docker:
      - image: cimg/python:3.10
    steps:
      - attach_workspace:
          at: ./
      - run:
          name: Install dependencies
          command: |
            python -m pip install --upgrade pip
            pip install -r requirements.txt

      - run:
          name: Extract Data
          command: python etl.py extract_data

      - run:
          name: Transform Data
          command: python etl.py transform_data

      - run:
          name: Validate Data
          command: python etl.py validate_data

      - run:
          name: Load Data
          command: python etl.py load_data

      - persist_to_workspace:
          root: .
          paths:
            - .

      - run:
          name: Verify file exists
          command: |
            ls -l data
            cat data/crypto.md

      - store_artifacts:
          path: data/crypto.md
          destination: output-data

Running this job in an isolated Docker container ensures a consistent environment. Dependencies are installed fresh, preventing contamination from local environments or previous runs.

Each ETL phase is executed individually. This granularity makes it easier to pinpoint failures and improves observability. Also, each ETL function reads from and writes to its own file, so the next step can easily pick up the data it needs.

Once the ETL is finished, the output is verified and exposed to the next job. Finally, storing the Markdown output as a CircleCI artifact gives you a snapshot of each successful pipeline run.

Job: upload

The last job, upload, is part of a CircleCI pipeline that prepares and uploads the final ETL file to your Amazon S3 bucket using AWS credentials provided through OpenID Connect (OIDC).

  upload:
    executor: aws-cli/default
    steps:
      - attach_workspace:
          at: .

      - aws-cli/setup:
          profile_name: OIDC-User
          role_arn: arn:aws:iam::${AWS_ACCOUNT_ID}:role/circleci-role
          region: ${AWS_REGION}

      - run:
          name: Prepare S3
          command: |
            # Create the S3 bucket
            echo "Creating S3 bucket: ${S3_BUCKET}..."

            aws s3api create-bucket \
              --bucket "${S3_BUCKET}" \
              --region "${AWS_REGION}" \
              --no-cli-pager

            echo "Bucket created: ${S3_BUCKET}"

            # Configure public access block
            aws s3api put-public-access-block \
              --bucket "${S3_BUCKET}" \
              --public-access-block-configuration BlockPublicAcls=false,IgnorePublicAcls=false,BlockPublicPolicy=false,RestrictPublicBuckets=false \
              --no-cli-pager

            # Set ownership controls
            aws s3api put-bucket-ownership-controls \
              --bucket "${S3_BUCKET}" \
              --ownership-controls 'Rules=[{ObjectOwnership=BucketOwnerPreferred}]' \
              --no-cli-pager

            echo "Bucket public access and ownership controls configured."

      - aws-s3/copy:
          from: data/crypto.md
          to: s3://${S3_BUCKET}/crypto.md
          arguments: |
            --acl public-read

This job uses the aws-cli and aws-s3 orbs within the CI/CD workflow to securely authenticate, create the S3 bucket if it doesn’t already exist, configure its access policies, and upload the local crypto.md file to the bucket, making it publicly accessible. The commands rely on dynamic environment variables, which you’ll configure later in the setup.

CircleCI workflows

The last YAML section declares the jobs within the workflow and defines that they run sequentially:

workflows:
  ci-cd:
    jobs:
      - project_checkout
      - etl:
          requires:
            - project_checkout
      - upload:
          requires:
            - etl

Push code to GitHub

Commit your local changes to the ETL Pipelines project and push them to your GitHub repository. Then refer to the official guide to create a project on CircleCI and associate it with your ETL Pipelines GitHub repository.

Setting up AWS identity provider and role

In this section, you will create an IAM identity provider and an IAM role in AWS. This configuration is necessary to establish a trust relationship between your AWS account and CircleCI’s OpenID Connect tokens.

  1. In your CircleCI dashboard, select Organization Settings from the sidebar:

CircleCI dashboard

  2. Take note of your organization ID on the “Organization Settings” page:

CircleCI organization settings

  3. Log in to your AWS console, search for the “IAM” service and open it:

AWS Console

  4. From the sidebar, select Identity providers under the “Access management” section. Click the Add provider button on the Identity providers page.

AWS identity providers

  5. On the “Add Identity Provider” page, select OpenID Connect as the provider type. Then, set the provider URL to https://oidc.circleci.com/org/<organization-id>, replacing <organization-id> with your actual CircleCI organization ID. Use the same organization ID as the audience. Click Add provider to save the provider.

Add identity provider

  6. On the sidebar, select Roles under the “Access management” section. On the Roles page, select Create role.

![AWS IAM roles menu](2025-05-19-aws-roles-page.png)

  7. On the Select trusted entity page, set the trusted entity type to Web Identity. Under the “Web identity” section, choose the identity provider and audience you configured in the previous steps. Click Next to continue.

![Select trusted entity](2025-05-19-aws-iam-set-trusted-identity.png)

  8. On the Add permissions page, select only the AmazonS3FullAccess permission, which is needed for uploading our Markdown file to the S3 bucket. Click Next.

![Add permissions](2025-05-19-aws-select-policies.png)

  9. On the Name, review, and create page, enter the name “circleci-role” for the role. Click the Create role button.

![Create role](2025-05-19-aws-iam-name-role.png)

Note: For more information on OpenID Connect protocol integration, please refer to Using OpenID Connect identity tokens to authenticate jobs with cloud providers.

Configuring CircleCI environment variables

Now you can set up the environment variables used in your CircleCI workflow. Follow this guide to add environment variables to your project. Create the following:

| Environment Variable Key | Value |
| --- | --- |
| AWS_ACCOUNT_ID | Your AWS account ID |
| AWS_REGION | Your AWS region, e.g. us-east-1 |
| S3_BUCKET | A unique name for your target S3 bucket |

Note: The S3 bucket name must be globally unique across all AWS users. If it’s not, the bucket creation step in the pipeline will fail.

Verifying the CircleCI pipeline

With everything now set, trigger the pipeline manually. It should execute successfully without any errors.

![Successful workflow execution](2025-05-19-successful-workflow-execution.png)

Accessing the markdown file from the AWS S3 bucket

To view the generated Markdown file, go to the AWS console, open the S3 service, and navigate to the bucket you created. You should see the crypto.md file.

![Created crypto.md file](2025-05-19-newly-created-file.png)

Because the bucket was configured to be public, anyone can access the file using its object URL.

![AWS Console, S3 File](2025-05-19-raw-markdown.png)

Note: The content will be properly formatted when viewed in a Markdown viewer.

You now have a live cryptocurrency data report fully managed by CI/CD. From now on, your pipeline can be triggered either manually on the CircleCI dashboard or automatically via GitHub commits. The result is a publicly accessible, auto-updating Markdown page that tracks cryptocurrency prices. It is simple, elegant, and maintained entirely by code and cloud infrastructure.

Conclusion

Using AWS resources, a few Python functions, and a simple CircleCI configuration, you’ve built a complete ETL solution that extracts, transforms, validates, and publishes a live report.

But this is just the beginning. Now it’s your turn. Starting with this code, try customizing it by:

  • Adding email alerts
  • Building complete HTML dashboards
  • Adding historical trend charts using Matplotlib or Plotly
  • Adding testing and linting steps to your pipeline
  • Integrating a scheduled trigger to run the pipeline hourly

You can check out the complete source code used in this tutorial on GitHub. Feel free to use the repository as a starting point for your own ETL processes and deployments.


Marcelo Oliveira is a senior software engineer from São Paulo, Brazil who combines technical expertise with a passion for developing innovative solutions. As a dedicated technical writer, he contributes significantly to the developer community through educational content that simplifies complex concepts and empowers programmers of various skill levels to enhance their capabilities.