CI/CD パイプラインでの Android アプリケーションの\nベンチマーキング

Users expect apps that are highly performant, load fast, and interact smoothly. These qualities are not just a nice-to-have anymore. Adding performance validation as part of your release flow addresses one of these challenges: ensuring smooth interactions.

Google recently released updates to their Jetpack benchmarking libraries. One notable addition is the Macrobenchmark library. This lets you test your app’s performance in areas like startup and scroll jank. In this tutorial you will get started with the Jetpack benchmarking libraries, and learn how to implement it as part of a CI/CD flow.

Prerequisites

You will need a few things to get the most from this tutorial:

Experience with Android development and testing, including instrumentation testing
Experience with Gradle
A free CircleCI account
A Firebase and Google Cloud platform account
Android Studio Narwhal Feature Drop or higher

About the project

The project is based on an earlier testing sample I created for a blog post on testing Android apps in a CI/CD pipeline.

I have expanded the sample project to include a benchmark job as part of a CI/CD build. This new job runs app startup benchmarks using the Jetpack Macrobenchmark library.

Jetpack benchmarking

Android Jetpack offers two types of benchmarking: Microbenchmark and Macrobenchmark. Microbenchmark, which has been around since 2019, allows for making performance measurement of application code (think caching or similar processes that might take a while to process).

Macrobenchmark is the new addition to Jetpack. It allows you to measure your app’s performance as a whole on easy-to-notice areas like app startup and scrolling. The sample application uses these macrobenchmark tests for measuring app startup.

Both benchmarking libraries and approaches work with the familiar Android Instrumentation framework that runs on connected devices and emulators.

Set up the library

The library setup is documented well on the official Jetpack site - Macrobenchmark Setup.

Note: This tutorial will not cover all steps in fine detail because they might change in future preview releases.

Here is an overview of the procedure

Create the Macrobenchmark module
Configure the target app
Set up signing

1. Create the Macrobenchmark module:

Create a new test module named macrobenchmark. In Android Studio create an Android library module, and change the build.gradle.kts to instead use libs.plugins.test as the plugin. The new module needs minimum SDK level to API 29: Android 10.
Make several modifications to the new module’s build.gradle.kts file. Change test implementations to implementation, point to the app module you want to test, and make the release build type debuggable.

1. Configure the target app:

Add the <profileable> tag to your app’s AndroidManifest.xml to enable detailed trace information capture.
Create a benchmark build type that mimics your release configuration but uses debug signing for local testing.
Ensure your app includes ProfilerInstaller 1.3 or higher for profile capture functionality.

1. Set up signing:

Specify the local release signing config. You can use the existing debug config for the benchmark build type.

Please refer to the guide in the official Macrobenchmark library documentation for complete step-by-step instructions.

Writing and executing Macrobenchmark tests

Macrobenchmark library introduces a few new JUnit rules and metrics.

Here’s a simple startup benchmark test taken from the performance-samples project on GitHub:

package com.example.macrobenchmark.benchmark.startup

import androidx.benchmark.macro.StartupTimingMetric
import androidx.benchmark.macro.junit4.MacrobenchmarkRule
import androidx.test.ext.junit.runners.AndroidJUnit4
import androidx.test.filters.LargeTest
import com.example.macrobenchmark.benchmark.util.DEFAULT_ITERATIONS
import com.example.macrobenchmark.benchmark.util.TARGET_PACKAGE
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@LargeTest
@RunWith(AndroidJUnit4::class)
class SampleStartupBenchmark {
    @get:Rule
    val benchmarkRule = MacrobenchmarkRule()

    @Test
    fun startup() = benchmarkRule.measureRepeated(
        packageName = TARGET_PACKAGE,
        metrics = listOf(StartupTimingMetric()),
        iterations = DEFAULT_ITERATIONS,
        setupBlock = {
            // Press home button before each run to ensure the starting activity isn't visible.
            pressHome()
        }
    ) {
        // starts default launch activity
        startActivityAndWait()
    }
}

This startup benchmark measures how long it takes for your app to launch and become interactive using the StartupTimingMetric(). The test runs multiple iterations (defined by DEFAULT_ITERATIONS) to get consistent measurements, and provides concrete performance data you can use to track regressions over time.

For generating Baseline Profiles alongside your benchmarks, you can also include a profile generator:

package com.example.macrobenchmark.baselineprofile

import androidx.benchmark.macro.junit4.BaselineProfileRule
import androidx.test.internal.runner.junit4.AndroidJUnit4ClassRunner
import org.junit.Rule
import org.junit.Test
import org.junit.runner.RunWith

@RunWith(AndroidJUnit4ClassRunner::class)
class StartupProfileGenerator {
    @get:Rule
    val rule = BaselineProfileRule()

    @Test
    fun profileGenerator() {
        rule.collect(
            packageName = TARGET_PACKAGE,
            maxIterations = 15,
            stableIterations = 3,
            includeInStartupProfile = true
        ) {
            startActivityAndWait()
        }
    }
}

This baseline profile generator creates Baseline Profiles - a powerful Android optimization technique that improves app performance by about 30% from the first launch. It guides Android Runtime (ART) to perform Ahead-of-Time (AOT) compilation on critical code paths, and the includeInStartupProfile = true parameter also generates Startup Profiles for DEX file layout optimization.

When you ship an app with Baseline Profiles, Google Play processes them and delivers optimized code to users immediately after installation.

Evaluating benchmark results

Once your benchmarks complete, they export results in a JSON file containing detailed timing metrics for each test. However, the benchmark command won’t indicate how well they performed - that analysis is up to you.

I wrote a Node.js script that evaluates benchmark results and fails the build if any measurements exceed expected thresholds:

const benchmarkData = require('/home/circleci/benchmarks/com.example.macrobenchmark-benchmarkData.json')

const STARTUP_MEDIAN_THRESHOLD_MILIS = 400

// Handle different possible metric names
function getStartupMetrics(benchmark) {
    // Check for different possible startup metric names
    if (benchmark.metrics.startupMs) {
        return benchmark.metrics.startupMs;
    } else if (benchmark.metrics.timeToInitialDisplayMs) {
        return benchmark.metrics.timeToInitialDisplayMs;
    } else if (benchmark.metrics.timeToFullDisplayMs) {
        return benchmark.metrics.timeToFullDisplayMs;
    }
    return null;
}

let err = 0;

// Process all startup-related benchmarks
benchmarkData.benchmarks.forEach((benchmark, index) => {
    const metrics = getStartupMetrics(benchmark);

    if (!metrics) {
        console.log(`⚠️  Benchmark ${index + 1} (${benchmark.name || 'unnamed'}) - No startup metrics found, skipping`);
        return;
    }

    const benchmarkName = benchmark.name || `benchmark_${index + 1}`;
    const className = benchmark.className || 'unknown';
    const medianTime = metrics.median;

    let msg = `Startup benchmark (${benchmarkName}) - ${medianTime.toFixed(2)}ms`;

    if (medianTime > STARTUP_MEDIAN_THRESHOLD_MILIS) {
        err = 1;
        console.error(`${msg} ❌ - OVER THRESHOLD ${STARTUP_MEDIAN_THRESHOLD_MILIS}ms`);
        console.error(`  Class: ${className}`);
        console.error(`  Runs: [${metrics.runs.map(r => r.toFixed(2)).join(', ')}]`);
        console.error(`  Min: ${metrics.minimum.toFixed(2)}ms, Max: ${metrics.maximum.toFixed(2)}ms`);
    } else {
        console.log(`${msg} ✅`);
        console.log(`  Class: ${className}`);
        console.log(`  Runs: [${metrics.runs.map(r => r.toFixed(2)).join(', ')}]`);
    }
});

// If no benchmarks were processed, that's an error
if (benchmarkData.benchmarks.length === 0) {
    console.error('❌ No benchmarks found in the data file');
    err = 1;
}

process.exit(err)

This script handles different metric types that benchmarks may export, provides detailed output for debugging, and uses proper exit codes to integrate with CI systems. It will fail the build if any startup time exceeds the 400ms threshold.

To establish your threshold, run benchmarks on a known-good version of your app and use those results as your baseline.

Running benchmarks in a CI/CD pipeline

You can run these samples in Android Studio, which gives you a nice printout of your app’s performance metrics, but won’t do anything to ensure your app is always performant. To do that you need to integrate benchmarks in your CI/CD process.

The steps required are:

Build the release variants of app and Macrobenchmark modules
Run the tests on Firebase Test Lab (FTL) or similar tool
Download the benchmark results
Store benchmarks as artifacts
Process the benchmark results to get timings data
Pass or fail the build based on the results

I created a new benchmarks-ftl job to run my tests:

version: 2.1

orbs:
  android: circleci/android@3.1.0
  gcp-cli: circleci/gcp-cli@3.3.2

jobs:
  benchmarks-ftl:
    executor:
      name: android/android_docker
      tag: 2025.03.1-node
    steps:
      - checkout
      - android/restore_gradle_cache
      - run:
          name: Build app and test app
          command: ./gradlew app:assembleBenchmark macrobenchmark:assembleBenchmark
      - gcp-cli/setup:
          version: 404.0.0
      - run:
          name: run on FTL
          command: |
            gcloud firebase test android run \
              --type instrumentation \
              --app app/build/outputs/apk/benchmarkRelease/app-benchmarkRelease.apk \
              --test macrobenchmark/build/outputs/apk/benchmarkRelease/macrobenchmark-benchmarkRelease.apk \
              --device model=tokay,version=34,locale=en,orientation=portrait \
              --timeout 45m \
              --directories-to-pull /sdcard/Download \
              --results-bucket gs://android-sample-benchmarks \
              --results-dir macrobenchmark \
              --environment-variables clearPackageData=true,additionalTestOutputDir=/sdcard/Download,no-isolated-storage=true
      - run:
          name: Download benchmark data
          command: |
            mkdir ~/benchmarks
            gsutil cp -r 'gs://android-sample-benchmarks/macrobenchmark/**/artifacts/sdcard/Download/*'  ~/benchmarks
            gsutil rm -r gs://android-sample-benchmarks/macrobenchmark
      - store_artifacts:
          path: ~/benchmarks
      - run:
          name: Evaluate benchmark results
          command: node scripts/eval_startup_benchmark_output.js

workflows:
  test-and-build:
    jobs:
      - benchmarks-ftl:
          filters:
            branches:
              only: main # Run benchmarks only on main branch

This job runs the macrobenchmark tests on a real device in Firebase Test Lab. We use the Android Docker executor with Node.js for our evaluation script.

Key aspects of this configuration:

Benchmark build type: It will build the benchmark variant for proper profiling capabilities
Real device testing: Uses Firebase Test Lab with a Pixel 8a (Android 14) for reliable performance measurements. Running benchmarks on emulators is often flaky due to virtualization overhead and inconsistent performance, so real devices are strongly recommended for accurate results.
Device selection: You can explore the list of available Firebase devices if you want to test against a different device.
Extended timeout: Includes a 45-minute timeout to accommodate longer benchmark runs. Note that 45 minutes is the maximum timeout for physical devices on Firebase Test Lab.
Branch filtering: Runs benchmarks only on the main branch to control resource usage

After building the benchmark APKs, we initialize the Google Cloud CLI and run tests in Firebase Test Lab. Once tests complete, we download the benchmark data from Cloud Storage, store it as build artifacts, and run our evaluation script to pass or fail the build based on performance thresholds.

CircleCI workflow for building, testing, macrobenchmarking and release

We will create a workflow that combines all these testing strategies where unit tests run on every commit to catch basic logic errors quickly, while UI tests run on multiple Android versions but only on feature branches to save resources. Instrumentation tests run on the main branch across multiple API levels for thorough compatibility testing, and benchmarks run only on main branch commits after unit tests pass to ensure performance validation. Finally, release builds only happen after all other tests pass, guaranteeing quality releases. The workflow uses job dependencies and branch filtering to create an efficient pipeline that balances thorough testing with resource management.

Here’s how the complete workflow looks with unit tests, instrumentation tests, benchmarks, and release builds all working together:

version: 2.1

orbs:
  android: circleci/android@3.1.0
  gcp-cli: circleci/gcp-cli@3.3.2

jobs:
  unit-test:
    executor:
      name: android/android_docker
      tag: 2025.03.1
    steps:
      - checkout
      - android/restore_gradle_cache
      - android/run_tests:
          test_command: ./gradlew testDebug
      - android/save_gradle_cache
      - run:
          name: Save test results
          command: |
            mkdir -p ~/test-results/junit/
            find . -type f -regex ".*/build/test-results/.*xml" -exec cp {} ~/test-results/junit/ \;
          when: always
      - store_test_results:
          path: ~/test-results
      - store_artifacts:
          path: ~/test-results/junit

  android-test:
    parameters:
      system-image:
        type: string
        default: system-images;android-35;google_apis;x86_64
    executor:
      name: android/android_machine
      resource_class: large
      tag: 2024.11.1
    steps:
      - checkout
      - android/start_emulator_and_run_tests:
          test_command: ./gradlew :app:connectedDebugAndroidTest
          system_image: << parameters.system-image >>
      - run:
          name: Save test results
          command: |
            mkdir -p ~/test-results/junit/
            find . -type f -regex ".*/build/outputs/androidTest-results/.*xml" -exec cp {} ~/test-results/junit/ \;
          when: always
      - store_test_results:
          path: ~/test-results
      - store_artifacts:
          path: ~/test-results/junit

  benchmarks-ftl:
    # ... (same as shown above)

  release-build:
    executor:
      name: android/android_docker
      tag: 2025.03.1
    steps:
      - checkout
      - android/restore_gradle_cache
      - run:
          name: Assemble release build
          command: |
            ./gradlew assembleRelease
      - store_artifacts:
          path: app/build/outputs/apk/release/app-release-unsigned.apk

workflows:
  test-and-build:
    jobs:
      - unit-test
      - android/run_ui_tests:
          executor:
            name: android/android_machine
            resource_class: large
            tag: 2024.11.1
          filters:
            branches:
              ignore: main # regular commits
      - android-test:
          matrix:
            alias: android-test-all
            parameters:
              system-image:
                - system-images;android-35;google_apis;x86_64
                - system-images;android-33;google_apis;x86_64
                - system-images;android-32;google_apis;x86_64
                - system-images;android-30;google_apis;x86
                - system-images;android-29;google_apis;x86
          name: android-test-<<matrix.system-image>>
          filters:
            branches:
              only: main # Commits to main branch
      - benchmarks-ftl:
          requires:
            - unit-test
          filters:
            branches:
              only: main # Run benchmarks only on main branch
      - release-build:
          requires:
            - unit-test
            - android-test-all
            - benchmarks-ftl
          filters:
            branches:
              only: main # Commits to main branch

Running the workflow on CircleCI

In this section, you will learn how to automate the workflow using CircleCI.

Setting up the project on CircleCI

On the CircleCI dashboard, go to the Projects tab, search for the GitHub repo name and click Set Up Project for your project.

Setup project on CircleCI

You will be prompted to add a new configuration file manually or use an existing one. Because you have already pushed the required configuration file to the codebase, select the Fastest option. Enter the name of the branch hosting your configuration file and click Set Up Project to continue.

Configure project on CircleCI

Completing the setup will trigger the pipeline and the first build might fail as the environment variables are not set.

Create a Google Cloud service account

Before configuring environment variables, you’ll need to create a Google Cloud service account with the appropriate permissions for Firebase Test Lab. The service account needs the following IAM roles:

Firebase Test Lab Admin - To run tests on Firebase Test Lab
Storage Object Admin - To store and retrieve test results from Cloud Storage
Cloud Storage Admin - To manage Cloud Storage buckets for benchmark data

You can create a service account and download its JSON key file by following the official Google Cloud documentation. Make sure to assign the required permissions during the service account creation process.

Once you have the JSON key file downloaded, you’ll need to base64 encode it before adding it to CircleCI as an environment variable.

Set environment variables

Our benchmarking workflow requires Google Cloud Platform access for Firebase Test Lab. On the project page, click Project settings and head over to the Environment variables tab. On the screen that appears, click on Add environment variable button and add the following environment variables:

GCLOUD_SERVICE_KEY: Your Google Cloud service account key in JSON format (base64 encoded).
GOOGLE_PROJECT_ID: Your Google Cloud project ID where Firebase Test Lab is enabled.

Note: To get your service account key, create a service account in the Google Cloud Console with Firebase Test Lab Admin permissions, download the JSON key file, and base64 encode it before adding to CircleCI.

Once you add the environment variables, they should appear on the dashboard with their keys visible but values hidden for security.

Set environment variables on CircleCI

Now that the environment variables are configured, trigger the pipeline again. This time the build should succeed and your benchmarks will run on Firebase Test Lab.

Successful CircleCI build

You can also go to the Firebase console to view more details about the test execution.

Analyze test details on Firebase

Conclusion

In this article, I covered how to include Android app performance benchmarking in a CI/CD pipeline alongside your other tests. This prevents performance regressions from reaching your users as you add new features and other improvements.

We used the new Android Jetpack macrobenchmarking library and showed ways to integrate it with Firebase Test Lab to run benchmarks on real devices. We showed how to analyze the results, passing the build if the application startup time exceeds our allowed threshold.