serverless - Runpod Documentation

Manage Serverless endpoints, including creating, listing, updating, and deleting endpoints.

runpodctl serverless <subcommand> [flags]

Alias

You can use sls as a shorthand for serverless:

runpodctl sls list

Subcommands

List endpoints

List all your Serverless endpoints:

runpodctl serverless list

List flags

--include-template

bool

Include template information in the output.

--include-workers

bool

Include workers information in the output.

Get endpoint details

Get detailed information about a specific endpoint:

runpodctl serverless get <endpoint-id>

Get flags

--include-template

bool

Include template information in the output.

--include-workers

bool

Include workers information in the output.

Create an endpoint

Create a new Serverless endpoint from a template or from a Hub repo:

# Create from a template
runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090"

# Create from a template with a model reference
runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090" \
  --model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main

# Create a CPU endpoint
runpodctl serverless create --template-id "tpl_abc123" --compute-type CPU

# Create from a Hub repo
runpodctl hub search vllm  # Find the Hub ID
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm"

# Create from a Hub repo and attach a model reference
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --gpu-id "NVIDIA GeForce RTX 4090" \
  --model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main

# Create from a Hub repo with custom environment variables
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm" \
  --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  --env MAX_TOKENS=4096

When using --hub-id, GPU IDs and container disk size are automatically pulled from the Hub release config. You can override the GPU type with --gpu-id. Environment variables from the Hub release are included automatically, and you can override or add to them with --env.

Serverless templates vs Pod templates: Serverless endpoints require a Serverless-specific template. Pod templates (like runpod-torch-v21) cannot be used because they include configuration, which Serverless does not support. When creating a template with runpodctl template create, use the --serverless flag to create a Serverless template.

Each Serverless template can only be bound to one endpoint at a time. To create multiple endpoints with the same configuration, create a separate template for each endpoint.

Create flags

--name

string

Name for the endpoint. Must be at least 3 characters. If omitted, a name is auto-generated in the format endpoint-XXXXXXXX.

--template-id

string

Template ID to use (required if --hub-id is not specified). Use runpodctl template search to find templates.

--hub-id

string

Hub listing ID to deploy from (alternative to --template-id). Use runpodctl hub search to find repos.

--gpu-id

string

GPU type for workers. Accepts either a GPU type ID (e.g., NVIDIA A40, NVIDIA GeForce RTX 4090) or a GPU pool ID (e.g., ADA_24, AMPERE_48). Use runpodctl gpu list to see available GPUs.

--gpu-count

int

default:"1"

Number of GPUs per worker.

--compute-type

string

default:"GPU"

Compute type (GPU or CPU). For CPU endpoints, use --instance-id to specify the CPU instance type.

--instance-id

string

default:"cpu3g-4-16"

CPU instance ID. Only valid with --compute-type CPU. If omitted, defaults to cpu3g-4-16.

--workers-min

int

default:"0"

Minimum number of workers.

--workers-max

int

default:"3"

Maximum number of workers.

--data-center-ids

string

Comma-separated list of preferred datacenter IDs. Use runpodctl datacenter list to see available datacenters.

--network-volume-id

string

Network volume ID to attach for single-region deployments. Use runpodctl network-volume list to see available network volumes. Mutually exclusive with --network-volume-ids.

--network-volume-ids

string

Comma-separated list of network volume IDs for multi-region deployments. Mutually exclusive with --network-volume-id.
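Because --network-volume-id and --network-volume-ids are mutually exclusive, a wrapper script may want to check its inputs before calling the CLI. The following is a minimal sketch under that assumption. The function name volume_flag and its two positional arguments are hypothetical and not part of runpodctl; the function only prints the flag that would be passed.

```shell
# Hypothetical helper (not part of runpodctl): choose exactly one volume flag.
# $1 = single network volume ID (may be empty)
# $2 = comma-separated network volume IDs (may be empty)
volume_flag() {
  if [ -n "$1" ] && [ -n "$2" ]; then
    echo "error: --network-volume-id and --network-volume-ids are mutually exclusive" >&2
    return 1
  fi
  if [ -n "$1" ]; then
    printf '%s %s\n' "--network-volume-id" "$1"
  elif [ -n "$2" ]; then
    printf '%s %s\n' "--network-volume-ids" "$2"
  fi
}

volume_flag "vol_a" ""         # prints: --network-volume-id vol_a
volume_flag "" "vol_a,vol_b"   # prints: --network-volume-ids vol_a,vol_b
```

The IDs vol_a and vol_b are placeholders; real IDs come from runpodctl network-volume list.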

--min-cuda-version

string

Minimum CUDA version required for workers (e.g., 12.4). Workers will only be scheduled on machines that meet this CUDA version requirement.

--scale-by

string

Autoscaling strategy: delay (scales based on queue wait time in seconds) or requests (scales based on pending request count).

--scale-threshold

int

Trigger point for the autoscaler. For delay, this is the target queue wait time in seconds. For requests, this is the pending request count that triggers scaling.

--idle-timeout

int

Idle timeout in seconds. Workers shut down after being idle for this duration. Valid range: 1-3600 seconds.
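The documented range can be checked before the value is passed to --idle-timeout. This is a small sketch only; valid_idle_timeout is a hypothetical helper name, not a runpodctl feature, and the CLI may perform its own validation.

```shell
# Hypothetical pre-flight check (not part of runpodctl).
# Succeeds only if $1 is a whole number of seconds in the documented 1-3600 range.
valid_idle_timeout() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 3600 ]
}

if valid_idle_timeout 30; then
  echo "30 is a valid idle timeout"
fi
```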

--flash-boot

bool

Enable or disable flash boot for faster worker startup. When enabled, workers start from cached container images.

--execution-timeout

int

Execution timeout in seconds. Jobs that exceed this duration are terminated. The CLI accepts seconds but converts to milliseconds internally.
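If you need to compare the flag value with a figure expressed in milliseconds, the conversion described above is a multiplication by 1000. The helper name seconds_to_ms below is hypothetical and only illustrates the arithmetic; it does not reproduce the CLI's internal code.

```shell
# Convert a timeout in seconds (the unit the flag accepts) to milliseconds.
seconds_to_ms() {
  echo $(( $1 * 1000 ))
}

seconds_to_ms 600   # prints: 600000
```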

--env

string

Environment variable in KEY=VALUE format. Use multiple --env flags to set multiple variables. These values only apply when deploying from --hub-id, where they override the Hub release defaults. With --template-id, environment variables come from the template, so --env is ignored and the CLI prints a note to that effect.
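When a script assembles several variables, each one needs its own --env flag in KEY=VALUE form. The sketch below shows one way to build those flags and reject malformed entries first. The function name env_flags is hypothetical and not part of runpodctl; it only prints the flags, one per line.

```shell
# Hypothetical helper (not part of runpodctl): turn KEY=VALUE arguments into
# repeated --env flags, one per line, and reject entries without a key or "=".
env_flags() {
  for pair in "$@"; do
    case "$pair" in
      [!=]*=*) printf '%s %s\n' "--env" "$pair" ;;
      *) echo "error: expected KEY=VALUE, got '$pair'" >&2; return 1 ;;
    esac
  done
}

env_flags MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct MAX_TOKENS=4096
# prints:
# --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct
# --env MAX_TOKENS=4096
```

As noted above, these flags only take effect when deploying with --hub-id.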

--model-reference

string

Model reference URL to attach to the endpoint. Use multiple --model-reference flags to attach multiple models. Works with both --template-id and --hub-id, and requires GPU compute type.

Update an endpoint

Update endpoint configuration:

runpodctl serverless update <endpoint-id> --workers-max 5

Update flags

--name

string

New name for the endpoint.

--workers-min

int

New minimum number of workers.

--workers-max

int

New maximum number of workers.

--idle-timeout

int

New idle timeout in seconds.

--scaler-type

string

Scaler type (QUEUE_DELAY or REQUEST_COUNT).

--scaler-value

int

Scaler value.

--flash-boot

bool

Enable or disable flash boot for faster worker startup.

--execution-timeout

int

Execution timeout in seconds. Jobs that exceed this duration are terminated.

Delete an endpoint

Delete an endpoint:

runpodctl serverless delete <endpoint-id>

Serverless URLs

Access your Serverless endpoint using these URL patterns:

Operation	URL
Async request	`https://api.runpod.ai/v2/<endpoint-id>/run`
Sync request	`https://api.runpod.ai/v2/<endpoint-id>/runsync`
Health check	`https://api.runpod.ai/v2/<endpoint-id>/health`
Job status	`https://api.runpod.ai/v2/<endpoint-id>/status/<job-id>`
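The URL patterns in the table can be assembled in a script as follows. This is a sketch only: endpoint_url is a hypothetical helper name, my-endpoint-id and job-123 are placeholder values, and authentication and request bodies are not covered on this page and are not shown.

```shell
# Hypothetical helper: build a Serverless URL from the patterns in the table.
# $1 = endpoint ID
# $2 = operation: run, runsync, health, or status
# $3 = job ID (required for status only)
endpoint_url() {
  base="https://api.runpod.ai/v2/$1"
  case "$2" in
    run|runsync|health)
      echo "$base/$2" ;;
    status)
      if [ -z "${3:-}" ]; then
        echo "error: status requires a job ID" >&2
        return 1
      fi
      echo "$base/status/$3" ;;
    *)
      echo "error: unknown operation '$2'" >&2
      return 1 ;;
  esac
}

endpoint_url my-endpoint-id runsync
# prints: https://api.runpod.ai/v2/my-endpoint-id/runsync
endpoint_url my-endpoint-id status job-123
# prints: https://api.runpod.ai/v2/my-endpoint-id/status/job-123
```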
