Limits
Authoritative source: gpu_service/core/limits.py (input bounds) and
gpu_service/core/job_store.py (TTL). Numbers here mirror those files;
the constants win on drift.
Input length per task
Oversized or undersized inputs are rejected with 422 validation_failed
before any GPU work. Sequences must contain only A/C/G/T/N
(case-insensitive).
| Task | Endpoint | Min bp | Max bp |
|---|---|---|---|
| Promoter | POST /v1/tasks/promoter/predict | 1 | 500,000 |
| Splice | POST /v1/tasks/splice/predict | 1 | 500,000 |
| Enhancer | POST /v1/tasks/enhancer/predict | 1 | 500,000 |
| Chromatin | POST /v1/tasks/chromatin/predict | 1 | 500,000 |
| Annotation | POST /v1/tasks/annotation/predict | 1 | 500,000 |
| Expression (TSS-centered) | POST /v1/tasks/expression/predict | 1 | 500,000 |
Body cap: 16 MiB.
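A client-side pre-check can catch most 422 validation_failed rejections before a request is sent. The sketch below mirrors the bounds and alphabet stated above; `check_sequence` and its constants are illustrative names, not part of the service, and the server-side constants remain authoritative.

```python
import re

# Bounds copied from the table above; the service constants win on drift.
MIN_BP = 1
MAX_BP = 500_000

# A/C/G/T/N only, case-insensitive, as stated above.
_ALLOWED = re.compile(r"[ACGTN]+", re.IGNORECASE)


def check_sequence(seq: str) -> list:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not (MIN_BP <= len(seq) <= MAX_BP):
        problems.append(f"length {len(seq)} outside {MIN_BP}..{MAX_BP} bp")
    if seq and not _ALLOWED.fullmatch(seq):
        problems.append("contains characters other than A/C/G/T/N")
    return problems
```

An empty return value means the sequence should pass the length and alphabet checks; it does not account for the 16 MiB body cap, which applies to the whole request.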
The expression model uses a fixed 9,198 bp TSS-centered window (±4,599 bp). Off-length sequences are accepted (the tokenizer truncates or pads to this window), but predictions on non-TSS-centered or off-length input are not guaranteed to be biologically meaningful. If you have raw genomic input where the TSS is not known in advance, contact us; server-side annotation-to-expression chaining can be enabled per-tenant.
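When the TSS position is known, the window can be cut client-side so the input arrives at exactly 9,198 bp. This is a minimal sketch under one assumption: that the window spans the half-open range [tss - 4599, tss + 4599) on a 0-based index. The service's exact centering convention is not stated here, and `tss_window` is a hypothetical helper.

```python
HALF_WINDOW = 4_599           # ±4,599 bp around the TSS
WINDOW = 2 * HALF_WINDOW      # 9,198 bp


def tss_window(chrom_seq: str, tss: int) -> str:
    """Slice a 9,198 bp window around a 0-based TSS index.

    Assumption: the window is [tss - 4599, tss + 4599); adjust if the
    service documents a different centering convention.
    """
    start = tss - HALF_WINDOW
    end = tss + HALF_WINDOW
    if start < 0 or end > len(chrom_seq):
        raise ValueError("TSS too close to the sequence edge for a full window")
    return chrom_seq[start:end]
```

Raising near the sequence edge, rather than returning a short slice, avoids silently relying on the tokenizer's padding.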
Latency
Rough sync latency at the recommended input size, on a warm model.
| Task | Sync latency at recommended size | When to go async |
|---|---|---|
| Promoter / splice / enhancer / chromatin | 0.3–10 s | inputs > ~100 kbp (splice: > ~250 kbp) |
| Annotation | 1–60 s | inputs > ~30 kbp |
| Expression (fixed 9,198 bp window) | 0.5–3 s | n/a — sync is always safe |
Cold start
If a task's model isn't already loaded into GPU memory, the first request
pays a model-load cost. The response carries meta.cold_start: true and
meta.model_load_time_ms. Cold start adds 5–15 s for the smaller models
and 30–90 s for expression and annotation. Subsequent calls are warm.
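To keep cold-start cost out of latency statistics, a client can read the two meta fields named above. The sketch assumes the response body is JSON with a top-level `meta` object; `cold_start_info` is an illustrative helper, not a service API.

```python
from typing import Optional, Tuple


def cold_start_info(body: dict) -> Tuple[bool, Optional[int]]:
    """Return (was_cold_start, model_load_time_ms or None) from a response body."""
    meta = body.get("meta") or {}
    cold = bool(meta.get("cold_start", False))
    # Only report a load time when the response says it was a cold start.
    load_ms = meta.get("model_load_time_ms") if cold else None
    return cold, load_ms
```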
Sync delivery: timeout and guidance
Sync delivery (the default, no Prefer header) is best-effort within the
upstream HTTP read timeout of 300 seconds. A request that takes longer than
that is terminated by the edge proxy and surfaces to your client as a
connection reset or 504 gateway_timeout. The body in this case is the
proxy's, not the unified {error: {...}} envelope; this is the only place
where that happens. Pick async whenever you expect a request to push past
~60 s.
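Because the proxy's body does not follow the unified envelope, error handling should not assume JSON. The sketch below classifies a failed response from its status and raw body. The `proxy_timeout` and `unknown` labels are client-side inventions for illustration, and `error.code` is the only envelope field assumed.

```python
import json


def classify_failure(status: int, body_text: str) -> str:
    """Return the service error code, or a client-side label for non-envelope bodies."""
    try:
        payload = json.loads(body_text)
    except ValueError:
        payload = None  # not JSON, e.g. the edge proxy's own error page
    if isinstance(payload, dict) and isinstance(payload.get("error"), dict):
        return str(payload["error"].get("code", "unknown"))
    # No unified envelope: a 504 here is most likely the proxy timeout.
    return "proxy_timeout" if status == 504 else "unknown"
```

A connection reset never reaches this function; it surfaces as an exception in the HTTP client and needs its own handler.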
Hard sync cap
No partner-visible endpoint enforces a hard sync cap today; every task
accepts sync up to its per-task max (see "Input length per task" above).
The 413 sync_too_large error class is reserved in the schema
(SyncTooLargeDetails) for future use; you do not need to handle it on the
current contract beyond switching on error.code.
Recommended async opt-in (client-side guidance, not enforced)
Use Prefer: respond-async when your input exceeds the threshold below.
These are calibrated against typical inference times on the production GPU.
Sync still works above these thresholds, but bursty traffic plus GPU contention can
push individual requests past the 300 s proxy window without warning.
| Task | Recommended async above |
|---|---|
| Promoter | 100,000 bp |
| Splice | 250,000 bp |
| Enhancer | 100,000 bp |
| Chromatin | 100,000 bp |
| Annotation | 30,000 bp |
| Expression (TSS-centered) | n/a — input is the fixed 9,198 bp window; sync is always safe. |
If sync is critical for your workload and these guidelines force more async than you can stomach, contact us; we can profile your distribution and tune the proxy timeout for your tenant.
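The thresholds above can be encoded as a small lookup that decides whether to send the Prefer header. The task keys and the `delivery_headers` helper are illustrative; the numbers are copied from the table and are guidance, not an enforced contract.

```python
# Recommended async thresholds in bp, copied from the table above.
# Expression is absent on purpose: its input is the fixed 9,198 bp window.
ASYNC_ABOVE_BP = {
    "promoter": 100_000,
    "splice": 250_000,
    "enhancer": 100_000,
    "chromatin": 100_000,
    "annotation": 30_000,
}


def delivery_headers(task: str, seq_len: int) -> dict:
    """Return the extra request headers for the recommended delivery mode."""
    threshold = ASYNC_ABOVE_BP.get(task)
    if threshold is not None and seq_len > threshold:
        return {"Prefer": "respond-async"}
    return {}  # sync is the default when no Prefer header is sent
```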
Per-key quotas
Three limiters run side by side. Each is configured per partner; your account owner tells you the values issued for your key. Defaults for the partner tier:
| Setting | Default for partner tier | Enforced? |
|---|---|---|
| Concurrent in-flight requests | 2 | ✅ — exceeds cap → 429 too_many_requests, Retry-After: 1. |
| Per-minute request rate | 60 | ✅ — token bucket; capacity = rate ÷ 6 (10-second burst, default 10). Empty bucket → 429 too_many_requests, Retry-After ≈ seconds-to-next-token. |
| Edge per-IP cap | 10 r/s burst 20 on api.* | ✅ — at nginx, returns 429 too_many_requests with the unified {error: {...}} envelope (see errors). |
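To build intuition for the per-minute limiter, the arithmetic above (capacity = rate ÷ 6, refill at rate ÷ 60 tokens per second) can be modeled locally. This is a client-side model for reasoning about pacing, not the service's implementation; the class and method names are made up for the example.

```python
import math


class TokenBucket:
    """Local model of a token bucket with capacity = rate_per_minute / 6."""

    def __init__(self, rate_per_minute: int, now: float = 0.0):
        self.capacity = rate_per_minute / 6      # 10 seconds' worth of tokens
        self.refill_per_s = rate_per_minute / 60
        self.tokens = self.capacity              # start full
        self.updated = now

    def try_acquire(self, now: float) -> int:
        """Return 0 if a token was taken, else whole seconds until one is available."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0
        return math.ceil((1 - self.tokens) / self.refill_per_s)
```

With the default rate of 60 per minute, the model allows a burst of 10 requests and then one more request per second.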
RateLimit-* headers (every authenticated response)
The application emits the IETF httpapi-ratelimit-headers draft header set on
every authenticated 2xx and on every 429 from this service, so you can pace
from header state without inferring rate from 429s:
| Header | Meaning |
|---|---|
| RateLimit-Limit | Token-bucket capacity (= rate_per_minute ÷ 6). |
| RateLimit-Remaining | Tokens left after this request, integer. |
| RateLimit-Reset | Seconds until the bucket refills to full. |
| RateLimit-Policy | <capacity>;w=60 (capacity per 60-second window). |
| Retry-After | On 429 only, seconds until at least one token is available. |
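The headers in the table can be parsed into a small state dict that a client consults before its next request. The sketch assumes all values arrive as integer strings, as the table describes; `parse_ratelimit` and `suggested_wait_s` are hypothetical helper names.

```python
from typing import Mapping, Optional


def _to_int(value: Optional[str]) -> Optional[int]:
    """Parse an integer header value; missing or malformed values become None."""
    try:
        return int(value) if value is not None else None
    except ValueError:
        return None


def parse_ratelimit(headers: Mapping[str, str]) -> dict:
    """Extract rate-limit state from response headers (names are case-insensitive)."""
    h = {k.lower(): v for k, v in headers.items()}
    return {
        "limit": _to_int(h.get("ratelimit-limit")),
        "remaining": _to_int(h.get("ratelimit-remaining")),
        "reset_s": _to_int(h.get("ratelimit-reset")),
        "retry_after_s": _to_int(h.get("retry-after")),
    }


def suggested_wait_s(state: dict) -> int:
    """Seconds to wait before the next request; conservative when the bucket is empty."""
    if state["retry_after_s"] is not None:
        return state["retry_after_s"]
    if state["remaining"] == 0 and state["reset_s"] is not None:
        return state["reset_s"]  # waits for a full refill, which over-waits
    return 0
```

Falling back to RateLimit-Reset waits for a full refill, which is longer than strictly needed; Retry-After is preferred whenever it is present.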
Pacing guidance
- Pace at ~80% of your issued RateLimit-Limit to leave headroom for bursts; serialize a small worker pool against your concurrency cap rather than firing N parallel requests at it.
- The token bucket and concurrency semaphore are independent: a 429 can come from either. Both carry the same RateLimit-* spine, but only the rate-bucket 429 has a Retry-After calibrated against refill time; the concurrency 429 uses Retry-After: 1.
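The pacing guidance can be sketched with asyncio: a semaphore sized to the concurrency cap bounds in-flight requests, and a fixed interval between submissions keeps the rate at about 80% of the limit. The sketch assumes RateLimit-Limit represents 10 seconds' worth of tokens (rate ÷ 6). `send` stands in for whatever coroutine performs the actual HTTP call, and all names are illustrative.

```python
import asyncio


def min_interval_s(ratelimit_limit: int, fraction: float = 0.8) -> float:
    """Seconds between submissions to stay at `fraction` of the refill rate.

    Assumption: RateLimit-Limit is 10 seconds' worth of tokens (rate / 6).
    """
    return 10.0 / (ratelimit_limit * fraction)


async def run_paced(jobs, send, concurrency: int = 2, interval_s: float = 0.0):
    """Run send(job) for each job with bounded concurrency and paced submission."""
    sem = asyncio.Semaphore(concurrency)

    async def worker(job):
        async with sem:  # never more than `concurrency` requests in flight
            return await send(job)

    tasks = []
    for job in jobs:
        tasks.append(asyncio.create_task(worker(job)))
        await asyncio.sleep(interval_s)  # spread submissions over time
    return await asyncio.gather(*tasks)  # results come back in job order
```

With the partner-tier defaults this would be called as `asyncio.run(run_paced(jobs, send, concurrency=2, interval_s=min_interval_s(10)))`, which spaces submissions 1.25 s apart.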
Async result store
| Property | Value |
|---|---|
| TTL | 24 h from last activity |
| Storage | In-process |
| Persistence | None — resets on restart |
Fetch after TTL or after a restart returns 410 job_expired.
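Since results live in process memory and vanish on restart, clients should treat 410 job_expired as a signal to resubmit rather than retry the fetch. A minimal sketch of that decision follows, with a local estimate of the TTL. The helper names are hypothetical, and the local clock check is only a hint because a restart can expire a job earlier.

```python
from datetime import datetime, timedelta, timezone

JOB_TTL = timedelta(hours=24)  # 24 h from last activity, per the table above


def likely_expired(last_activity: datetime, now: datetime) -> bool:
    """Local estimate only: a service restart can expire a job sooner."""
    return now - last_activity >= JOB_TTL


def should_resubmit(status: int, error_code: str) -> bool:
    """True when a result fetch failed because the job is gone for good."""
    return status == 410 and error_code == "job_expired"
```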