Cloud Run Cheat Sheet

What is Cloud Run, what is a container, an instance, example Python deployment.

1. What is Cloud Run?

What is the difference between a Cloud Run Service, a Cloud Run Job, and a Cloud Run Worker Pool? Which should I use in which scenario?

1. Cloud Run Service (The Web Server)

  • What it is: A managed resource that listens for HTTP/gRPC requests or events via a stable URL.
  • Key Behavior:
    • Scale-to-Zero: It shuts down completely when no traffic exists and spins up instantly (Cold Start) when a request arrives.
    • Autoscaling: Scales based on request concurrency or CPU utilization.
  • When to use: Web APIs, websites, webhooks, or any application that needs to respond to a user or another service via HTTP.

2. Cloud Run Job (The Script)

  • What it is: A resource designed to execute a specific task that runs to completion and then stops.
  • Key Behavior:
    • No Endpoint: It does not have a URL; you trigger it manually, via a schedule (Cron), or via the CLI/API.
    • Array Jobs: It can run multiple independent "tasks" in parallel (e.g., 50 instances of the same container, each processing a different piece of data).
  • When to use: Database migrations, nightly data processing, batch image resizing, or AI model training.

3. Cloud Run Worker Pool (The Always-On Processor)

  • What it is: A relatively new resource type designed for pull-based or background workloads that don't have an HTTP endpoint.
  • Key Behavior:
    • No Inbound HTTP: Unlike a Service, it has no URL. It "pulls" work from an external source.
    • Always-On Logic: It typically maintains a set number of instances to continuously process a stream of data.
    • Autoscaling: Unlike Services (which autoscale out of the box), Worker Pools do not autoscale by default. You either set the instance count manually or run a custom autoscaler: a separate process that monitors your queue depth and calls the Worker Pool API to scale up or down.
  • When to use: Kafka consumers, Pub/Sub pull subscribers, RabbitMQ workers, or background "daemons" that need to run continuously without a web interface.
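The pull pattern is easy to sketch with the standard library alone; here `queue.Queue` stands in for a real broker like Pub/Sub or Kafka (a minimal toy sketch, not a production consumer):

```python
import queue
import threading

def worker(task_queue: queue.Queue, results: list) -> None:
    """Pull-based worker: no inbound HTTP, it actively pulls tasks
    from a source until a None sentinel tells it to stop."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        results.append(task * 2)  # stand-in for real processing

# Simulate a broker feeding three tasks, then shutting the worker down.
q: queue.Queue = queue.Queue()
results: list = []
t = threading.Thread(target=worker, args=(q, results))
t.start()
for task in [1, 2, 3]:
    q.put(task)
q.put(None)
t.join()
print(results)  # [2, 4, 6]
```

A real Worker Pool would run this loop forever against the broker's pull API instead of stopping on a sentinel.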

What is Cloud Run's pricing model?

Cloud Run uses a fully managed, pay-as-you-go pricing model. You are charged only for the resources your workload actually consumes, metered in 100ms increments.

What You Pay For
  • CPU (vCPU-seconds) - Time your container is executing
  • Memory (GiB-seconds) - Memory allocated while running
  • Requests - For Cloud Run Services (per HTTP request)
  • Network egress - Outbound traffic based on Google Cloud networking rates
Billing Modes

1. Request-based billing (default for Services)

  • CPU & memory billed only while handling requests
  • Supports scale-to-zero
  • Ideal for APIs and web services

2. Instance-based billing

  • Billed for the entire lifetime of the instance (even when idle)
  • Used by Jobs and Worker Pools
  • Good for background or long-running workloads

There is also a monthly Free Tier that covers a baseline amount of vCPU-seconds, GiB-seconds, and requests.
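To get a feel for request-based billing, here is a back-of-envelope estimate. The rates below are illustrative placeholders, not official prices; always check the current pricing page for your region:

```python
# Back-of-envelope estimate under request-based billing. The rates are
# ILLUSTRATIVE placeholders, NOT official Google Cloud prices.
VCPU_SECOND_USD = 0.000024        # assumed $/vCPU-second
GIB_SECOND_USD = 0.0000025        # assumed $/GiB-second
PER_MILLION_REQUESTS_USD = 0.40   # assumed $ per million requests

def monthly_cost(vcpu: float, gib: float, busy_seconds: float,
                 requests: int) -> float:
    """Request-based billing: CPU/RAM are metered only while handling requests."""
    compute = vcpu * busy_seconds * VCPU_SECOND_USD
    memory = gib * busy_seconds * GIB_SECOND_USD
    reqs = requests / 1_000_000 * PER_MILLION_REQUESTS_USD
    return compute + memory + reqs

# 1 vCPU / 512 MiB instance busy for 100k request-seconds, serving 2M requests:
print(f"${monthly_cost(vcpu=1, gib=0.5, busy_seconds=100_000, requests=2_000_000):.2f}")
```

Note how the request-seconds term dominates: if the service scales to zero, the compute and memory terms drop to whatever time was actually spent handling traffic.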

What can be deployed in Cloud Run?

On Google Cloud Run, you can deploy any stateless containerized application that runs in a Linux-based Docker container (x86_64 or ARM64).

  • For a Cloud Run Service, the container must listen for HTTP/gRPC traffic on the port given by the PORT environment variable (default 8080).
  • For Jobs and Worker Pools, the container does not need to expose an HTTP port — it just needs to run and perform its task.

In short: If it runs in a Linux container, Cloud Run can run it — whether it’s a FastAPI app, a Node service, a Go binary, a batch script, or a background worker.

What is a Cloud Run container? When I push a Docker image and use Cloud Run to run that image, what is it really?

When you write a Dockerfile, build it, and push it to Google Artifact Registry, you are uploading a set of compressed layers (essentially a snapshot of your Linux environment and Python code). When Cloud Run "runs" it, it is NOT running on a dedicated Virtual Machine just for you.

It is running on Google's massive, planetary-scale internal server fleet (called Borg). Because Google is running your code on the same physical hardware as other companies' code, they cannot trust standard Docker (which shares the host OS kernel).

What it really is:

Depending on your configuration, your container runs inside one of two secure sandboxes.

  • First-generation execution environment: Google uses a technology called gVisor. It intercepts every single system call your Python code makes to the Linux kernel and emulates it safely. It's a "fake" kernel.
  • Second-generation execution environment: Google spins up a lightning-fast, highly stripped-down microVM just for your container. It behaves exactly like a full Linux machine, meaning it has perfect compatibility for heavy Python C-extensions (like NumPy, PyTorch, or pandas).

You are renting an isolated, heavily sandboxed Linux process on a Google supercomputer.

What is a Cloud Run instance?

An instance is one running copy of your container. Think of your Docker image as a blueprint. The instance is the actual house built from the blueprint.

  • If you set Cloud Run limits to 4 vCPUs and 16GB RAM, one instance has exactly that much power.
  • If you set concurrency to 80, this one instance will accept up to 80 simultaneous requests.
  • If request #81 comes in, the Cloud Run Load Balancer says, "This instance is full!" and instantly builds a second instance (another 4 vCPU / 16GB RAM clone) to handle request #81.

Now you have two instances. You are paying for two instances.
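The scaling decision above is just a ceiling division over the concurrency limit:

```python
import math

def instances_needed(concurrent_requests: int, concurrency_limit: int) -> int:
    """How many instances Cloud Run needs: each instance accepts up to
    `concurrency_limit` simultaneous requests, then a new one is spun up."""
    return math.ceil(concurrent_requests / concurrency_limit)

print(instances_needed(80, 80))   # 1 -- exactly full
print(instances_needed(81, 80))   # 2 -- request #81 forces a second instance
print(instances_needed(250, 80))  # 4
```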

How to explain the concurrency + instances configuration in Cloud Run?

Concurrency isn't just a number; it's a cost and performance lever.

  • Default: 80
  • Max: 1000

The Best Combination:

I/O Bound (FastAPI/Async): High concurrency (80-250). Since Python spends most of its time waiting for DBs or APIs, one instance can juggle many requests.

CPU Bound (AI/Data Science): Low concurrency (1-10). If one request takes 100% of a vCPU, sending a second request to that same instance will just make both requests 2x slower.

Formula for Python: If you are using a synchronous server (like Gunicorn/Flask), set workers × threads equal to your Cloud Run concurrency so the instance can actually handle the simultaneous load.

2. Python Deployment Examples

How to set up Cloud Run together with the Dockerfile?

In Cloud Run, there is one non-negotiable rule: your server must listen on 0.0.0.0 (all network interfaces) and bind to the port Cloud Run expects (default 8080). Ideally the port is read from the PORT environment variable rather than hard-coded.

Then you need to build the image, push it to Artifact Registry, and deploy to Cloud Run using that image.
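Here is a minimal sketch of the 0.0.0.0 + PORT rule using only the standard library; a real app would use Flask or FastAPI, but the binding rule is identical:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, fmt, *args):
        pass  # keep per-request logs quiet

def main() -> None:
    # Cloud Run injects the port through the PORT env var (default 8080);
    # binding to 0.0.0.0 makes the server reachable on all interfaces.
    port = int(os.environ.get("PORT", 8080))
    HTTPServer(("0.0.0.0", port), Handler).serve_forever()
```

Your container's entrypoint would simply call main(); the same PORT-reading pattern applies to a Gunicorn bind string like `-b 0.0.0.0:$PORT`.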

So I have a web server in Cloud Run. Let's say I have a single Cloud Run instance, can I send concurrent requests to the same web server? With Flask? FastAPI?

Yes, but how they are handled depends entirely on your framework and server configuration. Cloud Run will gladly send 80 requests at the exact same millisecond to your single container instance.

With Flask (WSGI / Synchronous) — By default, standard Python is synchronous. If you don't configure your WSGI server (like Gunicorn) with threads or multiple workers, Flask processes one request at a time. The other 79 requests will sit in a queue waiting for the first one to finish.

With FastAPI (ASGI / Asynchronous) — FastAPI runs on an event loop. If your code uses async def and is doing I/O-bound tasks (like waiting for OpenAI's API or a database), a single worker can pause the waiting request and start processing the other 79 concurrently.
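The FastAPI behavior can be demonstrated with plain asyncio: ten "requests" that each wait 0.2s on I/O finish in roughly 0.2s total on a single event loop (a toy sketch, with asyncio.sleep standing in for a real database or API call):

```python
import asyncio
import time

async def fake_io_call(i: int) -> int:
    """Stand-in for an awaited DB/API call (e.g. an OpenAI request)."""
    await asyncio.sleep(0.2)  # the event loop is free while we 'wait'
    return i

async def main() -> float:
    start = time.perf_counter()
    # 10 'requests' awaited concurrently on ONE event loop, ONE worker.
    await asyncio.gather(*(fake_io_call(i) for i in range(10)))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"{elapsed:.2f}s")  # ~0.2s total, not 2s
```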

If I have one Gunicorn server using Flask and pushing this to Cloud Run, will one instance be able to handle concurrent requests?

Only if you configure Gunicorn correctly.

If your Dockerfile just says:
CMD ["gunicorn", "-b", "0.0.0.0:8080", "app:app"]

You have 1 synchronous worker. It will handle exactly 1 request at a time. If a request takes 5 seconds to run your AI inference, the next request waits 5 seconds. Concurrency is effectively 1.

To fix this for Flask:
You must add threads or workers to Gunicorn.

CMD ["gunicorn", "--workers", "1", "--threads", "8", "-b", "0.0.0.0:8080", "app:app"]

Now, Gunicorn uses a thread pool. When Request A hits an I/O pause, Python releases the GIL (Global Interpreter Lock) and switches to Request B. You can now handle 8 concurrent requests, without putting any 'async' in front of the code.
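The same effect can be demonstrated with a plain thread pool, which is essentially what `--threads 8` gives you (here time.sleep stands in for an I/O wait that releases the GIL):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(i: int) -> int:
    """A synchronous Flask-style handler: blocks on I/O for 0.2s.
    time.sleep releases the GIL, so the other threads keep running."""
    time.sleep(0.2)
    return i

start = time.perf_counter()
# Same idea as `gunicorn --threads 8`: a pool of 8 threads serving requests.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(handle_request, range(8)))
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s")  # ~0.2s, not 1.6s -- the 8 requests overlapped
```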

If I have one Gunicorn server, with 1 Uvicorn worker, using FastAPI and pushing this to Cloud Run, will one instance be able to handle concurrent requests?

Yes, highly concurrently—but ONLY for I/O-bound tasks.

Because Uvicorn uses an asynchronous event loop (asyncio), 1 single Uvicorn worker can handle hundreds of concurrent requests. If your endpoints are async def and you do a network call (e.g., await client.chat.completions.create(...)), concurrency works perfectly.

However, if you put CPU-bound work (e.g., matrix multiplications, pandas data processing, or standard synchronous PyTorch model inference) directly inside an async def function, you will block the event loop.

If the event loop is blocked, your concurrency immediately drops back to 1. No other requests can be processed until the CPU finishes that math. One way to not block the event loop is to run the CPU-bound code in a threadpool (run_in_threadpool function in FastAPI).

Another way is to define the route as def and have synchronous code inside. This makes FastAPI handle the concurrency itself.
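The offloading idea can be sketched with the standard library's run_in_executor; FastAPI's run_in_threadpool is a convenience wrapper around the same thread-pool mechanism:

```python
import asyncio
import time

def cpu_bound(n: int) -> int:
    """Blocking work (think pandas crunching or a sync model forward pass)."""
    total = 0
    for i in range(n):
        total += i * i
    return total

async def main() -> int:
    loop = asyncio.get_running_loop()
    # Offload the blocking call to a worker thread so the event loop
    # stays free; FastAPI's run_in_threadpool does essentially this.
    heavy = loop.run_in_executor(None, cpu_bound, 2_000_000)
    await asyncio.sleep(0.01)  # the loop can still serve other awaits
    return await heavy

print(asyncio.run(main()))
```

Note that threads only help here because the heavy work leaves the event loop responsive; for truly parallel CPU execution you still need multiple processes (workers).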

If my Dockerfile has one Gunicorn server with 4 Uvicorn workers, does that mean that I have 4 independent servers? Will they share the Cloud Run RAM? The CPU?

Are they 4 independent servers?

No, they are 1 master process (Gunicorn) that has spawned 4 child OS processes (Uvicorn workers). Gunicorn acts as a process manager: it listens on port 8080 and distributes incoming requests across the 4 child workers. Because they are separate OS processes, they bypass the Python GIL, allowing true parallel CPU execution.

Will they share the CPU?

Yes. They share the vCPUs allocated to that specific Cloud Run instance. If you allocated 2 vCPUs, the Linux scheduler will multiplex your 4 Uvicorn workers across those 2 vCPUs. If all 4 workers try to do heavy CPU inference at the exact same time, they will fight for CPU time, and your latency will spike.

Will they share the RAM?

Yes and No. This is critical for ML models. They share the total pool of RAM allocated to the Cloud Run instance (e.g., if you allocated 8GB, all 4 workers must fit inside that 8GB). However, they do not share memory state with each other. They are independent processes.

  • The OOM (Out of Memory) Danger: If your FastAPI app loads a 2GB transformer model into memory globally on startup, and you have 4 workers, each worker loads its own copy of the model. 4 workers × 2GB = 8GB. You will likely trigger a Cloud Run OOM kill because you've exhausted the instance memory.
  • Copy-on-Write (CoW) trick: You can use Gunicorn's --preload flag. This loads the app (and your ML model) into the Master process before it forks the 4 workers. The OS shares the RAM across the 4 workers using Copy-on-Write. However, due to Python's reference counting, CoW is easily broken in Python, and memory will often duplicate anyway once requests start hitting the server.
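The OOM arithmetic above can be captured in a tiny helper (the 0.5 GiB overhead figure is an assumption for illustration, covering the interpreter and base libraries):

```python
def fits_in_instance(model_gib: float, workers: int, instance_gib: float,
                     overhead_gib: float = 0.5) -> bool:
    """Without --preload, each Gunicorn worker loads its own copy of the
    model, so required RAM grows linearly with the worker count.
    overhead_gib is an assumed per-instance baseline, not a measured value."""
    required = workers * model_gib + overhead_gib
    return required <= instance_gib

# A 2 GiB model x 4 workers needs ~8.5 GiB -> OOM kill on an 8 GiB instance:
print(fits_in_instance(model_gib=2, workers=4, instance_gib=8))  # False
print(fits_in_instance(model_gib=2, workers=1, instance_gib=8))  # True
```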

3. Cloud Run Specifics

What is CPU Throttling?

  • Throttling (Default): You pay per request. CPU is throttled to nearly zero the moment the response is sent. Result: Background threads freeze.
  • Instance-based billing or Always-Allocated: You pay for the entire life of the instance. CPU is always available.

The New "Startup CPU Boost": There is now a middle ground. You can enable Startup CPU Boost, which gives your container extra CPU power only during startup and for 10 seconds after. This is the "Goldilocks" setting for Python apps with heavy imports (like pandas or tensorflow)—it speeds up cold starts without the high cost of Always-On CPU.

How does container image size and the ENTRYPOINT initialization affect "Cold Start" latency? How to optimize it using Uvicorn or Gunicorn?

Cold start is the latency between the first request arriving and the container being ready to serve it.

Contrary to popular belief, image size matters less than you think. Google uses a "streaming" file system for images. It pulls the chunks it needs to start the app first. However, a 2GB image will still be slower than a 200MB image because of metadata overhead. Use Slim or Distroless images.

Optimize:

  • Uvicorn/Gunicorn: If you use Gunicorn with multiple workers, each worker has to import your entire Python app (and all its libraries like torch or pandas). This multiplies your startup time.
  • Optimization: In a serverless environment, one worker is often better. Let Cloud Run handle the scaling by spinning up more instances rather than trying to manage a complex worker pool inside one container.

What happens to a request when Cloud Run decides to scale down or "terminate" an instance? SIGTERM signal

When Cloud Run decides to scale down (because traffic dropped) or a deployment happens, it doesn't just pull the plug.

  • The Signal: It sends a SIGTERM to your container (Process ID 1).
  • The Grace Period: You have 10 seconds (default) to wrap up.
  • The Kill: After the grace period, it sends SIGKILL.

Why this matters: If you are halfway through writing a file to Cloud Storage or updating a DB, you need to catch that SIGTERM in Python using the signal library to shut down gracefully.

The Persistence Myth: You decide to save a temporary file to /tmp/data.csv during a request. A second request comes in 5 minutes later. Is that file guaranteed to be there? Why or why not?

No. There are two reasons why the file is not guaranteed to be there:

  • Horizontal Scaling: The second request might be routed to a different instance (a completely fresh container) that doesn't share the /tmp directory of the first one.
  • Scale-to-Zero: If no other traffic occurred during those 5 minutes, Cloud Run likely terminated the original instance to save costs, wiping its in-memory file system (RAM) entirely.

The Rule: Cloud Run is stateless. Treat /tmp as volatile storage that exists only for the duration of a single request. For persistence, use Cloud Storage or a Cloud SQL database.