Runpod Flash (we’ll just call it “flash” from here) lets you write your code locally while it provisions GPU / CPU resources for you on Runpod on the fly.
This makes it possible to iterate quickly on any idea you might have, for example serving a model as an API, without worrying about Docker or setting up endpoints for testing.
All of that is taken care of by flash with a couple of simple commands.
I’ll be using uv for this tutorial. If you don’t have it installed yet, please install it first (it’s fast!).
In order to use flash everywhere, I recommend installing it globally:
uv tool install tetra_rp
The package is still called tetra_rp (our internal name during development), but it will be renamed soon. Sorry for any confusion :D
You can check that the installation worked by running this:
flash --version
This should output something like this:
Runpod Flash CLI v0.18.0
The next thing we have to do is to create a new project:
flash init hello-world
cd hello-world
This creates the following structure:
hello-world/
├── main.py # FastAPI entry point
├── workers/
│ ├── gpu/
│ │ ├── __init__.py # FastAPI router
│ │ └── endpoint.py # @remote decorated function
│ └── cpu/ # Optional CPU worker (not used in this tutorial)
│ ├── __init__.py
│ └── endpoint.py
├── .env.example
├── requirements.txt
└── README.md
For this tutorial we’ll only work with workers/gpu/endpoint.py and ignore the CPU worker.
Create a virtual environment and install dependencies:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
Copy the .env.example file:
cp .env.example .env
Edit .env and set your Runpod API key:
RUNPOD_API_KEY=your_key_here
Great, now the project is set up and we can take a look at how it works.
The @remote decorator tells flash that this function should run on Runpod instead of in your local process when it’s called.
Now we’ll replace workers/gpu/endpoint.py with a version that’s almost the same as the original, but simplified and heavily commented so it’s clearer what happens and how to use flash:
from tetra_rp import remote, LiveServerless, GpuGroup

# Configuration for your serverless endpoint on Runpod (GPU, scaling, timeouts)
# Multiple @remote functions can share this config:
# each call is a separate job on the same endpoint
gpu_config = LiveServerless(
    name="hello-world-gpu",  # Unique name for this endpoint
    gpus=[GpuGroup.ANY],  # GPU choice; GpuGroup.ANY lets Runpod pick any available GPU
    workersMin=0,  # Scale to zero when idle
    workersMax=3,  # Max concurrent workers
    idleTimeout=5,  # Seconds before worker shuts down when no new request comes in
)

# Dependencies to install on the remote worker (not in your local virtualenv)
dependencies = ["torch"]

@remote(
    resource_config=gpu_config,
    dependencies=dependencies
)
async def gpu_hello(input_data: dict) -> dict:
    # Import dependencies that you want to use, which will only exist on the remote worker
    import torch

    # Get the message from the input
    name = input_data.get("message", "World")

    # Your code runs on an actual GPU
    greeting = f"Hello {name}, from a GPU!"
    gpu_name = torch.cuda.get_device_name(0)
    cuda_available = torch.cuda.is_available()

    # You decide what to return
    return {
        "greeting": greeting,
        "gpu": gpu_name,
        "cuda_available": cuda_available,
    }
We use GpuGroup.ANY here so Runpod can choose any available GPU. If you want to target specific GPU families instead, you can pass groups like GpuGroup.ADA_24 (RTX 4090) or GpuGroup.AMPERE_80 (A100 80GB). The tetra_rp README keeps an up-to-date list under "Available GPU types".

Flash also sets up a FastAPI endpoint for you in workers/gpu/__init__.py. That file wires gpu_hello into FastAPI as a POST /gpu/hello route, so that's the path we'll call later over HTTP.
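For reference, that router is roughly shaped like this (an illustrative sketch, not the exact generated file; the /gpu prefix is typically added when main.py includes the router):

# workers/gpu/__init__.py (illustrative sketch of what flash generates)
from fastapi import APIRouter

from .endpoint import gpu_hello

router = APIRouter()

@router.post("/hello")
async def hello(input_data: dict) -> dict:
    # Awaiting the @remote-decorated function dispatches the work to Runpod
    return await gpu_hello(input_data)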
Now that our project is set up, we can start the local development server that connects your code to the remote workers:
flash run
This command starts the local HTTP server and you should see output similar to this:
Starting Flash Server
Entry point: main:app
Server: http://localhost:8888
Auto-reload: enabled
Press CTRL+C to stop
INFO: Will watch for changes in these directories: ['/path/to/hello-world']
INFO: Uvicorn running on http://localhost:8888 (Press CTRL+C to quit)
INFO: Started server process [86759]
INFO: Waiting for application startup.
INFO: Application startup complete.
Open http://localhost:8888/docs in your browser to access the Swagger UI.
In Swagger UI, expand /gpu/hello, click “Try it out”, enter a name, and hit “Execute”.
Or use curl:
curl -X POST http://localhost:8888/gpu/hello \
-H "Content-Type: application/json" \
-d '{"message": "NERDDISCO"}'
Your terminal will show logs similar to this:
2025-12-01 17:31:43,276 | INFO | Created endpoint: s8w0gpafvl6707 - hello-world-gpu-fb
2025-12-01 17:31:44,040 | INFO | LiveServerless:s8w0gpafvl6707 | Started Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2
2025-12-01 17:31:44,090 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | Status: IN_QUEUE
2025-12-01 17:31:44,479 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | .
2025-12-01 17:31:45,757 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | ..
...
2025-12-01 17:33:38,591 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | Status: COMPLETED
2025-12-01 17:33:38,618 | INFO | Worker:na8dbgu8htou5e | Delay Time: 109571 ms
2025-12-01 17:33:38,619 | INFO | Worker:na8dbgu8htou5e | Execution Time: 910 ms
dependency_installer.py:35 Installing Python dependencies: ['torch']
INFO: 127.0.0.1:52355 - "POST /gpu/hello HTTP/1.1" 200 OK
The first request is slow (almost two minutes in the logs above) because the worker has to cold start and install torch. Subsequent requests are much faster (~1-2s) as long as the worker stays warm. The idleTimeout setting controls how long workers stay alive after completing work.
The response looks like this:
{
"greeting": "Hello NERDDISCO, from a GPU!",
"gpu": "NVIDIA GeForce RTX 4090",
"cuda_available": true
}
When you send a request to /gpu/hello, your local flash dev server calls the gpu_hello() function in endpoint.py.
Because gpu_hello is decorated with @remote, flash does a couple of steps on Runpod instead of running them on your machine:
1. It creates (or reuses) a serverless endpoint called hello-world-gpu (later requests reuse the same endpoint). Flash uses the name you set in LiveServerless to identify this endpoint, so as long as you keep that name the same you're talking to that one endpoint.
2. It sends your input (input_data) as a job to that endpoint.
3. The dependencies=["torch"] are installed on the worker the first time it starts (or when you change them); once they're there, later requests will reuse them.
4. It runs gpu_hello on the GPU and captures the return value.

Once this is running, you can keep editing the same workers/gpu/endpoint.py file and every time you save, the changes are transferred to the worker.
You can then trigger it again via Swagger or curl.
As long as the worker is still warm (within your idleTimeout window), flash will reuse the same GPU worker and you’ll see your updated result come back almost immediately instead of waiting for another cold start.
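Since gpu_hello is an ordinary async function under the hood, you can also experiment with calling it directly from a small script instead of going through HTTP. This is a sketch based on how tetra_rp decorated functions are normally awaited; it assumes RUNPOD_API_KEY is available in your environment and may differ slightly depending on your flash version:

# try_hello.py - call the @remote function without the FastAPI layer (sketch)
import asyncio

from workers.gpu.endpoint import gpu_hello

async def main():
    # The @remote decorator sends this call to the Runpod endpoint and awaits the result
    result = await gpu_hello({"message": "NERDDISCO"})
    print(result)

if __name__ == "__main__":
    asyncio.run(main())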
You can now literally run your code on a B200.
To stop the local server, just press Ctrl+C.
The remote worker will automatically shut down after 5 seconds of inactivity (thanks to idleTimeout=5 in our config), so you won’t keep paying for the GPU.
To fully remove the endpoint definition, go to your Runpod Console > Serverless.
Flash is still in alpha. Things may change or break. I'll do my best to update this article when that happens.
Currently these are the things you should know:
- flash run runs the HTTP server on your machine, which acts as a bridge to the remote workers.
- Env vars don't auto-forward to workers; pass them explicitly via env={} in the config (see the sketch below).
- Use flash deploy to convert this dev endpoint into a production-grade setup.

Please report any issues on GitHub or reach out to us on our Discord.
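For the env-vars note above, forwarding them through the config might look like this (a hedged sketch based on the env={} hint; HF_TOKEN is just an illustrative variable name, and the exact parameter is best double-checked in the tetra_rp README for your version):

import os

from tetra_rp import LiveServerless, GpuGroup

gpu_config = LiveServerless(
    name="hello-world-gpu",
    gpus=[GpuGroup.ANY],
    workersMin=0,
    workersMax=3,
    idleTimeout=5,
    # Env vars from your .env are not forwarded automatically,
    # so pass the ones your worker needs explicitly
    env={"HF_TOKEN": os.environ.get("HF_TOKEN", "")},
)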