Show HN: Building a Multi-Region Vertex AI Inference Router with Cloud Run
This article describes a solution that uses Cloud Run as a "smart router" to achieve multi-region Vertex AI inference, overcoming the standard global load balancer's inability to dynamically rewrite Vertex AI endpoint paths.
Multi-Regional Inference With Vertex AI
When running mission-critical inference workloads on Vertex AI, relying on a single region is a risk. Whether due to capacity constraints or unexpected regional outages, you need a strategy to fail over seamlessly.
The ideal architecture uses a Global Load Balancer to route traffic to the nearest healthy region. However, wiring the load balancer directly to Vertex AI endpoints presents a unique challenge that standard load balancing features cannot solve.
The Problem: The “Double Rewrite” Dilemma
Vertex AI Endpoints are strictly regional, and their resource IDs are globally unique.
A standard Global Load Balancer can route traffic based on health, but it cannot dynamically rewrite the URL path to swap ID 12345 for 67890 based on which backend region it selects. While it can modify Host headers, it lacks the logic to handle dynamic path restructuring.
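To make the "double rewrite" concrete, here is a small illustration using the placeholder endpoint IDs from the text (project name and the region-to-ID mapping are assumptions): failing over between regions requires changing both the regional API host and the endpoint ID embedded in the path, and a load balancer can only do the former.

```python
# Each region has its own Vertex AI host AND its own endpoint ID.
# IDs below are the placeholder values used in this article.
REGIONAL_ENDPOINTS = {
    "us-central1": "12345",
    "us-east4": "67890",
}

def predict_url(project: str, region: str) -> str:
    """Build the region-specific :predict URL for a Vertex AI endpoint."""
    endpoint_id = REGIONAL_ENDPOINTS[region]
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/endpoints/{endpoint_id}:predict"
    )
```

A health-based failover from us-central1 to us-east4 must rewrite both the `{region}` segments and the endpoint ID, which is exactly the dynamic path restructuring the load balancer cannot express.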
The Solution: The “Smart Router” Pattern
To achieve seamless failover, you can introduce an intelligent middle layer. First, deploy a lightweight Cloud Run service in each region to act as a "Smart Router."
The Traffic Flow:
Failover: If the local Vertex endpoint fails (e.g., HTTP 503), the Cloud Run instance catches the error and immediately retries against the remote region’s PSC endpoint.
Implementation Guide
Example with dummy model at https://github.com/bernieongewe/vertex-ai-multi-regional-inference
Configure Private Service Connect (PSC) with Global Access
Create a private connection from your VPC to Vertex AI. You must enable Global Access so the Cloud Run router in us-central1 can reach the endpoint in us-east4 during a failover event.
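As a sketch of this step (resource, network, and subnet names are placeholders, and the service attachment URI comes from your PSC-enabled Vertex AI endpoint), the PSC address and forwarding rule might look like:

```shell
# Reserve an internal IP in your VPC for the PSC endpoint.
gcloud compute addresses create vertex-psc-ip \
  --region=us-central1 \
  --subnet=my-subnet

# Create the PSC forwarding rule. --allow-psc-global-access lets the
# Cloud Run router in the other region reach this endpoint during failover.
gcloud compute forwarding-rules create vertex-psc-fr \
  --region=us-central1 \
  --network=my-vpc \
  --address=vertex-psc-ip \
  --target-service-attachment=SERVICE_ATTACHMENT_URI \
  --allow-psc-global-access
```

Repeat in the second region; without `--allow-psc-global-access`, cross-region failover traffic would be dropped.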
Deploy the “Smart Router” (Cloud Run)
This Python code handles the logic the Load Balancer cannot: it fixes the protocol, authentication, and path rewriting.
main.py
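The full `main.py` lives in the linked repository; below is a minimal sketch of the failover core, under stated assumptions: the PSC IPs, endpoint IDs, and region ordering are illustrative, and real deployments would add ID-token auth and wrap this in an HTTP server.

```python
import json
import urllib.request

# Illustrative regional config (placeholder PSC IPs and endpoint IDs).
# Order matters: the local region is tried first.
REGIONS = [
    {"name": "us-central1", "psc_ip": "10.0.0.5", "endpoint_id": "12345"},
    {"name": "us-east4", "psc_ip": "10.1.0.5", "endpoint_id": "67890"},
]

def vertex_url(region: dict, project: str) -> str:
    # PSC resolves the Vertex host to an internal IP, but the path still
    # needs the region-specific endpoint ID -- the "double rewrite."
    return (
        f"https://{region['psc_ip']}/v1/projects/{project}"
        f"/locations/{region['name']}/endpoints/{region['endpoint_id']}:predict"
    )

def predict_with_failover(payload: dict, project: str, send=None):
    """Try the local Vertex endpoint; on any error, retry the remote region."""
    if send is None:
        send = _http_send  # real HTTP in production
    errors = []
    for region in REGIONS:
        try:
            return send(vertex_url(region, project), payload)
        except Exception as exc:  # e.g. HTTP 503 from an unhealthy endpoint
            errors.append((region["name"], repr(exc)))
    raise RuntimeError(f"all regions failed: {errors}")

def _http_send(url: str, payload: dict):
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())
```

The `send` parameter is injected so the failover logic can be exercised without a live endpoint; in the deployed service it defaults to the real HTTP call.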
Network Configuration for Cloud Run
Cloud Run instances are isolated by default. To reach your internal PSC IPs, you must attach them to your VPC using Direct VPC Egress.
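A sketch of the deploy command with Direct VPC Egress (service, image, network, and subnet names are placeholders):

```shell
# Attach the router to the VPC so it can reach the internal PSC IPs.
gcloud run deploy smart-router \
  --region=us-central1 \
  --image=IMAGE_URL \
  --network=my-vpc \
  --subnet=my-subnet \
  --vpc-egress=private-ranges-only
```

`private-ranges-only` keeps public egress on the default path while routing RFC 1918 traffic (the PSC IPs) through the VPC.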
The Load Balancer Setup
Use Serverless Network Endpoint Groups (NEGs) to route traffic to Cloud Run.
Critical Caveat: When creating the Backend Service, do not specify --protocol or --port-name. Serverless NEGs are incompatible with named ports and will throw an error if defined.
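Sketched with illustrative resource names, the NEG and backend service creation might look like this (note the absence of `--protocol` and `--port-name`, per the caveat above):

```shell
# One serverless NEG per region, pointing at that region's Cloud Run router.
gcloud compute network-endpoint-groups create router-neg-us-central1 \
  --region=us-central1 \
  --network-endpoint-type=serverless \
  --cloud-run-service=smart-router

# Global backend service: no --protocol or --port-name for serverless NEGs.
gcloud compute backend-services create vertex-router-backend \
  --global \
  --load-balancing-scheme=EXTERNAL_MANAGED

gcloud compute backend-services add-backend vertex-router-backend \
  --global \
  --network-endpoint-group=router-neg-us-central1 \
  --network-endpoint-group-region=us-central1
```

Repeat the NEG creation and `add-backend` step for the second region so the GLB can steer traffic to whichever router is closest and healthy.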
Inference Via The Load Balancer
Once your architecture is deployed, sending prediction requests requires a slight shift from standard Vertex AI workflows. Because the Global Load Balancer (GLB) serves as a generic entry point (e.g., https://predict.example.com/predict) and relies on the "Smart Router" to inject the specific regional Endpoint IDs, you cannot use the standard Vertex AI SDK endpoint.predict() method. The SDK is designed to automatically construct the full, specific path to a resource (e.g., .../locations/us-central1/endpoints/12345...), which bypasses the generic routing logic we've built.
Instead, you must send requests using a standard HTTP client like curl or Python's requests library, targeting the GLB's IP or domain directly. This allows the load balancer to receive the generic request and hand it off to the Cloud Run proxy for dynamic path rewriting.
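A minimal client sketch using only the standard library (the GLB hostname is the example value from the text, and the payload shape assumes a standard Vertex AI `instances` body):

```python
import json
import urllib.request

# Generic GLB entry point from the article; the Smart Router behind it
# fills in the region-specific endpoint path.
GLB_URL = "https://predict.example.com/predict"

def build_request(instances):
    """Build a region-agnostic prediction request aimed at the GLB."""
    body = json.dumps({"instances": instances}).encode()
    return urllib.request.Request(
        GLB_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

if __name__ == "__main__":
    # Only reachable with the architecture deployed behind a real domain.
    with urllib.request.urlopen(build_request([[1.0, 2.0]])) as resp:
        print(json.loads(resp.read()))
```

Note the client never mentions a region or an endpoint ID; that indirection is precisely what makes the failover transparent to callers.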
Other Caveats
If you are building this, watch out for these specific traps:
Written by Bernie Ongewe
Passionate technologist helping organizations integrate production workloads in the cloud and on premises. Personal views, not my employer's.