Manage Serverless endpoints, including creating, listing, updating, and deleting endpoints.
runpodctl serverless <subcommand> [flags]

Alias

You can use sls as a shorthand for serverless:
runpodctl sls list

Subcommands

List endpoints

List all your Serverless endpoints:
runpodctl serverless list

List flags

--include-template
bool
Include template information in the output.
--include-workers
bool
Include worker information in the output.

Get endpoint details

Get detailed information about a specific endpoint:
runpodctl serverless get <endpoint-id>

Get flags

--include-template
bool
Include template information in the output.
--include-workers
bool
Include worker information in the output.

Create an endpoint

Create a new Serverless endpoint from a template or from a Hub repo:
# Create from a template
runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090"

# Create from a template with a model reference
runpodctl serverless create --template-id "tpl_abc123" --gpu-id "NVIDIA GeForce RTX 4090" \
  --model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main

# Create a CPU endpoint
runpodctl serverless create --template-id "tpl_abc123" --compute-type CPU

# Create from a Hub repo
runpodctl hub search vllm                                         # Find the hub ID
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm"

# Create from a Hub repo and attach a model reference
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --gpu-id "NVIDIA GeForce RTX 4090" \
  --model-reference https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct:main

# Create from a Hub repo with custom environment variables
runpodctl serverless create --hub-id cm8h09d9n000008jvh2rqdsmb --name "my-vllm" \
  --env MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct \
  --env MAX_TOKENS=4096
When using --hub-id, GPU IDs and container disk size are automatically pulled from the Hub release config. You can override the GPU type with --gpu-id. Environment variables from the Hub release are included automatically, and you can override or add to them with --env.
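The override-or-add rule for --env can be pictured as a merge in which the last value given for a key wins. The sketch below only models that rule; it is not the CLI's implementation. The function name merge_env, the default values, and the LOG_LEVEL variable are hypothetical.

```shell
# merge_env: read KEY=VALUE lines on stdin (defaults first, overrides last)
# and print one line per key, keeping the last value seen for each key.
# Keys are printed in the order in which they first appeared.
merge_env() {
  awk '
    {
      eq = index($0, "=")
      if (eq == 0) next                      # skip lines that are not KEY=VALUE
      key = substr($0, 1, eq - 1)
      if (!(key in value)) order[++n] = key  # remember first-seen order
      value[key] = substr($0, eq + 1)        # later lines overwrite earlier ones
    }
    END { for (i = 1; i <= n; i++) print order[i] "=" value[order[i]] }
  '
}

# Hypothetical Hub release defaults (first two lines), followed by the
# values that would be passed with --env (last three lines).
printf '%s\n' \
  'MODEL_NAME=default-model' \
  'MAX_TOKENS=2048' \
  'MODEL_NAME=meta-llama/Llama-3.1-8B-Instruct' \
  'MAX_TOKENS=4096' \
  'LOG_LEVEL=debug' | merge_env
```

In this model, MODEL_NAME and MAX_TOKENS end up with the overriding values, and LOG_LEVEL is added alongside them.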
Serverless templates vs Pod templates: Serverless endpoints require a Serverless-specific template. Pod templates (like runpod-torch-v21) cannot be used because they include Pod-specific configuration that Serverless does not support. When creating a template with runpodctl template create, use the --serverless flag to create a Serverless template.
Each Serverless template can only be bound to one endpoint at a time. To create multiple endpoints with the same configuration, create a separate template for each endpoint.

Create flags

--name
string
Name for the endpoint. Must be at least 3 characters. If omitted, a name is auto-generated in the format endpoint-XXXXXXXX.
--template-id
string
Template ID to use (required if --hub-id is not specified). Use runpodctl template search to find templates.
--hub-id
string
Hub listing ID to deploy from (alternative to --template-id). Use runpodctl hub search to find repos.
--gpu-id
string
GPU type for workers. Accepts either a GPU type ID (e.g., NVIDIA A40, NVIDIA GeForce RTX 4090) or a GPU pool ID (e.g., ADA_24, AMPERE_48). Use runpodctl gpu list to see available GPUs.
--gpu-count
int
default:"1"
Number of GPUs per worker.
--compute-type
string
default:"GPU"
Compute type (GPU or CPU). For CPU endpoints, use --instance-id to specify the CPU instance type.
--instance-id
string
default:"cpu3g-4-16"
CPU instance ID to use with --compute-type CPU. Defaults to cpu3g-4-16 when omitted. Not valid with GPU compute.
--workers-min
int
default:"0"
Minimum number of workers.
--workers-max
int
default:"3"
Maximum number of workers.
--data-center-ids
string
Comma-separated list of preferred datacenter IDs. Use runpodctl datacenter list to see available datacenters.
--network-volume-id
string
Network volume ID to attach for single-region deployments. Use runpodctl network-volume list to see available network volumes. Mutually exclusive with --network-volume-ids.
--network-volume-ids
string
Comma-separated list of network volume IDs for multi-region deployments. Mutually exclusive with --network-volume-id.
--min-cuda-version
string
Minimum CUDA version required for workers (e.g., 12.4). Workers will only be scheduled on machines that meet this CUDA version requirement.
--scale-by
string
Autoscaling strategy: delay (scales based on queue wait time in seconds) or requests (scales based on pending request count).
--scale-threshold
int
Trigger point for the autoscaler. For delay, this is the target queue wait time in seconds. For requests, this is the pending request count that triggers scaling.
--idle-timeout
int
Idle timeout in seconds. Workers shut down after being idle for this duration. Valid range: 1-3600 seconds.
--flash-boot
bool
Enable or disable flash boot for faster worker startup. When enabled, workers start from cached container images.
--execution-timeout
int
Execution timeout in seconds. Jobs that exceed this duration are terminated. The CLI accepts seconds but converts to milliseconds internally.
--env
string
Environment variable in KEY=VALUE format. Use multiple --env flags to set multiple variables. These values only apply when deploying from --hub-id, where they override the Hub release defaults. With --template-id, environment variables come from the template, so --env is ignored and the CLI prints a note to that effect.
--model-reference
string
Model reference URL to attach to the endpoint. Use multiple --model-reference flags to attach multiple models. Works with both --template-id and --hub-id, and requires GPU compute type.
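Several of the flags above carry constraints that can be checked before calling the CLI. The following is a minimal sketch of such pre-flight checks in POSIX shell. The function names are hypothetical, and the checks only restate the rules listed in the flag descriptions; the CLI's own validation may differ.

```shell
# Each check returns 0 when the value is acceptable and 1 otherwise.

# --name must be at least 3 characters.
valid_name() {
  [ "${#1}" -ge 3 ]
}

# --idle-timeout must be a whole number of seconds between 1 and 3600.
valid_idle_timeout() {
  case "${1:-}" in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  [ "$1" -ge 1 ] && [ "$1" -le 3600 ]
}

# --network-volume-id ($1) and --network-volume-ids ($2) are mutually
# exclusive: at most one of them may be non-empty.
valid_volume_flags() {
  [ -z "${1:-}" ] || [ -z "${2:-}" ]
}

# --instance-id ($2) is only valid with --compute-type CPU ($1), and
# --model-reference ($3) requires GPU compute.
valid_compute_flags() {
  if [ "$1" != "CPU" ] && [ -n "${2:-}" ]; then return 1; fi
  if [ "$1" != "GPU" ] && [ -n "${3:-}" ]; then return 1; fi
  return 0
}

# --execution-timeout is given in seconds; this prints the same duration
# in milliseconds.
seconds_to_ms() {
  echo $(( $1 * 1000 ))
}

valid_name "my-vllm" && valid_idle_timeout 5 && echo "flags look valid"
seconds_to_ms 600   # prints 600000
```

Keeping each rule in its own function makes it easy to drop the ones that do not apply, for example the volume check on endpoints without network volumes.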

Update an endpoint

Update endpoint configuration:
runpodctl serverless update <endpoint-id> --workers-max 5

Update flags

--name
string
New name for the endpoint.
--workers-min
int
New minimum number of workers.
--workers-max
int
New maximum number of workers.
--idle-timeout
int
New idle timeout in seconds.
--scaler-type
string
Scaler type (QUEUE_DELAY or REQUEST_COUNT).
--scaler-value
int
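Scaler value: the threshold that the chosen scaler type acts on (for example, a queue delay in seconds for QUEUE_DELAY, or a pending request count for REQUEST_COUNT).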
--flash-boot
bool
Enable or disable flash boot for faster worker startup.
--execution-timeout
int
Execution timeout in seconds. Jobs that exceed this duration are terminated.
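The update command names its scaling flags differently from create: --scaler-type and --scaler-value rather than --scale-by and --scale-threshold. The sketch below translates between the two sets of values. The pairing of delay with QUEUE_DELAY and requests with REQUEST_COUNT is an assumption inferred from the flag descriptions, and scaler_type_for is a hypothetical helper.

```shell
# scaler_type_for: translate a --scale-by value (used by create) into the
# matching --scaler-type value (used by update).
# ASSUMPTION: delay <-> QUEUE_DELAY and requests <-> REQUEST_COUNT.
scaler_type_for() {
  case "${1:-}" in
    delay)    echo "QUEUE_DELAY" ;;
    requests) echo "REQUEST_COUNT" ;;
    *)        echo "unknown strategy: ${1:-}" >&2; return 1 ;;
  esac
}

# Print the update command instead of running it, so it can be reviewed first.
echo runpodctl serverless update "<endpoint-id>" \
  --scaler-type "$(scaler_type_for delay)" --scaler-value 4
```

Replace the placeholder endpoint ID and remove the leading echo to run the command for real.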

Delete an endpoint

Delete an endpoint:
runpodctl serverless delete <endpoint-id>

Serverless URLs

Access your Serverless endpoint using these URL patterns:
Operation       URL
Async request   https://api.runpod.ai/v2/<endpoint-id>/run
Sync request    https://api.runpod.ai/v2/<endpoint-id>/runsync
Health check    https://api.runpod.ai/v2/<endpoint-id>/health
Job status      https://api.runpod.ai/v2/<endpoint-id>/status/<job-id>
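The URL patterns above can be assembled with a small helper, sketched here in POSIX shell. The names endpoint_url and submit_async, the endpoint ID abc123, and the job ID job-42 are hypothetical. The submit_async function additionally assumes that the API key is stored in RUNPOD_API_KEY and sent as a Bearer token, and that the JSON payload matches whatever your handler expects; none of this is stated on this page.

```shell
# endpoint_url: build the URL for one operation on a Serverless endpoint.
# Args: endpoint ID, operation (run | runsync | health | status),
#       job ID (required for status only).
endpoint_url() {
  base="https://api.runpod.ai/v2/$1"
  case "${2:-}" in
    run|runsync|health)
      echo "$base/$2" ;;
    status)
      [ -n "${3:-}" ] || { echo "status needs a job ID" >&2; return 1; }
      echo "$base/status/$3" ;;
    *)
      echo "unknown operation: ${2:-}" >&2; return 1 ;;
  esac
}

# submit_async: POST a JSON payload ($2) to the async URL of endpoint $1.
# ASSUMPTIONS: Bearer-token auth with the key in RUNPOD_API_KEY; the payload
# shape is defined by your handler. Defined here but not called.
submit_async() {
  curl -sS -X POST "$(endpoint_url "$1" run)" \
    -H "Authorization: Bearer $RUNPOD_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$2"
}

endpoint_url "abc123" status "job-42"
# prints https://api.runpod.ai/v2/abc123/status/job-42
```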