Runpod Flash (we’ll just call it “flash” from here) lets you write your code locally while it provisions GPU / CPU resources for you on Runpod on the fly.
This makes it possible to iterate quickly on any idea you might have, for example serving a model as an API, without worrying about Docker or setting up endpoints for testing.
All of that is taken care of by flash with a couple of simple commands.
I’ll be using uv for this tutorial. If you don’t have it installed yet, please install it first (it’s fast!).
In order to use flash everywhere, I recommend installing it globally:
uv tool install tetra_rp
The package is still called tetra_rp (our internal name during development), but it will be renamed soon. Sorry for any confusion :D
You can check that the installation worked by running this:
flash --version
This should output something like this:
Runpod Flash CLI v0.18.0
The next thing we have to do is to create a new project:
flash init hello-world
cd hello-world
This creates the following structure:
hello-world/
├── main.py # FastAPI entry point
├── workers/
│ ├── gpu/
│ │ ├── __init__.py # FastAPI router
│ │ └── endpoint.py # @remote decorated function
│ └── cpu/ # Optional CPU worker (not used in this tutorial)
│ ├── __init__.py
│ └── endpoint.py
├── .env.example
├── requirements.txt
└── README.md
For this tutorial we’ll only work with workers/gpu/endpoint.py and ignore the CPU worker.
Create a virtual environment and install dependencies:
uv venv
source .venv/bin/activate
uv pip install -r requirements.txt
Copy the .env.example file:
cp .env.example .env
Edit .env and set your Runpod API key:
RUNPOD_API_KEY=your_key_here
Great, now the project is set up and we can take a look at how it works.
The @remote decorator tells flash that this function should run on Runpod instead of in your local process when it’s called.
Now we’ll replace workers/gpu/endpoint.py with a version that’s almost the same as the original, but simplified and heavily commented so it’s clearer what happens and how to use flash:
from tetra_rp import remote, LiveServerless, GpuGroup

# Configuration for your serverless endpoint on Runpod (GPU, scaling, timeouts)
# Multiple @remote functions can share this config:
# each call is a separate job on the same endpoint
gpu_config = LiveServerless(
    name="hello-world-gpu",  # Unique name for this endpoint
    gpus=[GpuGroup.ANY],  # GPU choice; GpuGroup.ANY lets Runpod pick any available GPU
    workersMin=0,  # Scale to zero when idle
    workersMax=3,  # Max concurrent workers
    idleTimeout=5,  # Seconds before worker shuts down when no new request comes in
)

# Dependencies to install on the remote worker (not in your local virtualenv)
dependencies = ["torch"]

@remote(
    resource_config=gpu_config,
    dependencies=dependencies
)
async def gpu_hello(input_data: dict) -> dict:
    # Import dependencies that you want to use, which will only exist on the remote worker
    import torch

    # Get the message from the input
    name = input_data.get("message", "World")

    # Your code runs on an actual GPU
    greeting = f"Hello {name}, from a GPU!"
    gpu_name = torch.cuda.get_device_name(0)
    cuda_available = torch.cuda.is_available()

    # You decide what to return
    return {
        "greeting": greeting,
        "gpu": gpu_name,
        "cuda_available": cuda_available,
    }
We use GpuGroup.ANY here so Runpod can choose any available GPU. If you want to target specific GPU families instead, you can pass groups like GpuGroup.ADA_24 (RTX 4090) or GpuGroup.AMPERE_80 (A100 80GB). The tetra_rp README keeps an up-to-date list under "Available GPU types".

Flash also sets up a FastAPI endpoint for you in workers/gpu/__init__.py. That file wires gpu_hello into FastAPI as a POST /gpu/hello route, so that's the path we'll call later over HTTP.
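For reference, that router is roughly shaped like this (an illustrative sketch, not the exact generated file; the /gpu prefix is typically added when main.py includes the router):

# workers/gpu/__init__.py (illustrative sketch of what flash generates)
from fastapi import APIRouter

from .endpoint import gpu_hello

router = APIRouter()

@router.post("/hello")
async def hello(input_data: dict) -> dict:
    # Awaiting the @remote-decorated function dispatches the work to Runpod
    return await gpu_hello(input_data)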
Now that our project is set up, we can start the local development server that connects your code to the remote workers:
flash run
This command starts the local HTTP server and you should see output similar to this:
Starting Flash Server
Entry point: main:app
Server: http://localhost:8888
Auto-reload: enabled
Press CTRL+C to stop
INFO: Will watch for changes in these directories: ['/path/to/hello-world']
INFO: Uvicorn running on http://localhost:8888 (Press CTRL+C to quit)
INFO: Started server process [86759]
INFO: Waiting for application startup.
INFO: Application startup complete.
Open http://localhost:8888/docs in your browser to access the Swagger UI.
In Swagger UI, expand /gpu/hello, click “Try it out”, enter a name, and hit “Execute”.
Or use curl:
curl -X POST http://localhost:8888/gpu/hello \
-H "Content-Type: application/json" \
-d '{"message": "NERDDISCO"}'
Your terminal will show logs similar to this:
2025-12-01 17:31:43,276 | INFO | Created endpoint: s8w0gpafvl6707 - hello-world-gpu-fb
2025-12-01 17:31:44,040 | INFO | LiveServerless:s8w0gpafvl6707 | Started Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2
2025-12-01 17:31:44,090 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | Status: IN_QUEUE
2025-12-01 17:31:44,479 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | .
2025-12-01 17:31:45,757 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | ..
...
2025-12-01 17:33:38,591 | INFO | Job:8406eeeb-d174-4c36-867a-0e2a53e2084b-e2 | Status: COMPLETED
2025-12-01 17:33:38,618 | INFO | Worker:na8dbgu8htou5e | Delay Time: 109571 ms
2025-12-01 17:33:38,619 | INFO | Worker:na8dbgu8htou5e | Execution Time: 910 ms
dependency_installer.py:35 Installing Python dependencies: ['torch']
INFO: 127.0.0.1:52355 - "POST /gpu/hello HTTP/1.1" 200 OK
The first request is slow (almost two minutes in the logs above) because the worker has to cold start and install torch. Subsequent requests are much faster (~1-2s) as long as the worker stays warm. The idleTimeout setting controls how long workers stay alive after completing work.
The response looks like this:
{
"greeting": "Hello NERDDISCO, from a GPU!",
"gpu": "NVIDIA GeForce RTX 4090",
"cuda_available": true
}
When you send a request to /gpu/hello, your local flash dev server calls the gpu_hello() function in endpoint.py.
Because gpu_hello is decorated with @remote, flash does a couple of steps on Runpod instead of running them on your machine:
1. It creates (or reuses) a serverless endpoint called hello-world-gpu (later requests reuse the same endpoint). Flash uses the name you set in LiveServerless to identify this endpoint, so as long as you keep that name the same you're talking to that one endpoint.
2. It sends your input (input_data) as a job to that endpoint.
3. The dependencies=["torch"] are installed on the worker the first time it starts (or when you change them); once they're there, later requests will reuse them.
4. It runs gpu_hello on the GPU and captures the return value.

Once this is running, you can keep editing the same workers/gpu/endpoint.py file and every time you save, the changes are transferred to the worker.
You can then trigger it again via Swagger or curl.
As long as the worker is still warm (within your idleTimeout window), flash will reuse the same GPU worker and you’ll see your updated result come back almost immediately instead of waiting for another cold start.
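Since gpu_hello is an ordinary async function under the hood, you can also experiment with calling it directly from a small script instead of going through HTTP. This is a sketch based on how tetra_rp decorated functions are normally awaited; it assumes RUNPOD_API_KEY is available in your environment and may differ slightly depending on your flash version:

# try_hello.py - call the @remote function without the FastAPI layer (sketch)
import asyncio

from workers.gpu.endpoint import gpu_hello

async def main():
    # The @remote decorator sends this call to the Runpod endpoint and awaits the result
    result = await gpu_hello({"message": "NERDDISCO"})
    print(result)

if __name__ == "__main__":
    asyncio.run(main())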
You can now literally run your code on a B200.
To stop the local server, just press Ctrl+C.
The remote worker will automatically shut down after 5 seconds of inactivity (thanks to idleTimeout=5 in our config), so you won’t keep paying for the GPU.
To fully remove the endpoint definition, go to your Runpod Console > Serverless.
Flash is still in alpha. Things may change or break. I'll do my best to update this article when that happens.
Currently these are the things you should know:
- flash run runs the HTTP server on your machine, which acts as a bridge to the remote workers.
- Env vars don't auto-forward to workers; pass them explicitly via env={} in the config (see the sketch below).
- Use flash deploy to convert this dev endpoint into a production-grade setup.

Please report any issues on GitHub or reach out to us on our Discord.
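For the env-vars note above, forwarding them through the config might look like this (a hedged sketch based on the env={} hint; HF_TOKEN is just an illustrative variable name, and the exact parameter is best double-checked in the tetra_rp README for your version):

import os

from tetra_rp import LiveServerless, GpuGroup

gpu_config = LiveServerless(
    name="hello-world-gpu",
    gpus=[GpuGroup.ANY],
    workersMin=0,
    workersMax=3,
    idleTimeout=5,
    # Env vars from your .env are not forwarded automatically,
    # so pass the ones your worker needs explicitly
    env={"HF_TOKEN": os.environ.get("HF_TOKEN", "")},
)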