Limits

Authoritative source: gpu_service/core/limits.py (input bounds) and gpu_service/core/job_store.py (TTL). Numbers here mirror those files; the constants win on drift.

Input length per task

Oversized or undersized inputs are rejected with 422 validation_failed before any GPU work. Sequences must contain only A/C/G/T/N (case-insensitive).

Task	Endpoint	Min bp	Max bp
Promoter	`POST /v1/tasks/promoter/predict`	1	500,000
Splice	`POST /v1/tasks/splice/predict`	1	500,000
Enhancer	`POST /v1/tasks/enhancer/predict`	1	500,000
Chromatin	`POST /v1/tasks/chromatin/predict`	1	500,000
Annotation	`POST /v1/tasks/annotation/predict`	1	500,000
Expression (TSS-centered)	`POST /v1/tasks/expression/predict`	1	500,000

Body cap: 16 MiB.

The expression model uses a fixed 9,198 bp TSS-centered window (±4,599 bp). Off-length sequences are accepted (the tokenizer truncates or pads to this window), but predictions on non-TSS-centered or off-length input are not guaranteed to be biologically meaningful. If you have raw genomic input where the TSS isn't pre-known, contact us; server-side annotation to expression chaining can be enabled per-tenant.

Latency

Rough sync latency at the recommended input size, on a warm model.

Task	Sync latency at recommended size	When to go async
Promoter / splice / enhancer / chromatin	0.3–10 s	inputs > ~100 kbp
Annotation	1–60 s	inputs > ~30 kbp
Expression (fixed 9,198 bp window)	0.5–3 s	n/a — sync is always safe

Cold start

If a task's model isn't already loaded into GPU memory, the first request pays a model-load cost. The response carries meta.cold_start: true and meta.model_load_time_ms. Cold start adds 5–15 s for the smaller models and 30–90 s for expression and annotation. Subsequent calls are warm.

Sync delivery: timeout and guidance

Sync delivery (the default, no Prefer header) is best-effort within the upstream HTTP read timeout of 300 seconds. A request that takes longer than that is terminated by the edge proxy and surfaces to your client as a connection reset or 504 gateway_timeout. The body in this case is the proxy's, not the unified {error: {...}} envelope; this is the only place where that happens. Pick async whenever you expect a request to push past ~60 s.

Hard sync cap

No partner-visible endpoint enforces a hard sync cap today; every task accepts sync up to its per-task max (see "Input length per task" above). The 413 sync_too_large error class is reserved in the schema (SyncTooLargeDetails) for future use; you do not need to handle it on the current contract beyond switching on error.code.

Recommended async opt-in (client-side guidance, not enforced)

Use Prefer: respond-async when your input exceeds the threshold below. These are calibrated against typical inference times on the production GPU. Sync still works under them, but bursty traffic plus GPU contention can push individual requests past the 300 s proxy window without warning.

Task	Recommended async above
Promoter	100,000 bp
Splice	250,000 bp
Enhancer	100,000 bp
Chromatin	100,000 bp
Annotation	30,000 bp
Expression (TSS-centered)	n/a — input is the fixed 9,198 bp window; sync is always safe.

If sync is critical for your workload and these guidelines force more async than you can stomach, contact us; we can profile your distribution and tune the proxy timeout for your tenant.

Per-key quotas

Three limiters run side by side. Each is configured per partner; your account owner tells you the values issued for your key. Defaults for the partner tier:

Setting	Default for partner tier	Enforced?
Concurrent in-flight requests	2	✅ — exceeds cap → `429 too_many_requests`, `Retry-After: 1`.
Per-minute request rate	60	✅ — token bucket; capacity = `rate ÷ 6` (10-second burst, default 10). Empty bucket → `429 too_many_requests`, `Retry-After` ≈ seconds-to-next-token.
Edge per-IP cap	10 r/s burst 20 on `api.*`	✅ — at nginx, returns `429 too_many_requests` with the unified `{error: {...}}` envelope (see errors).

`RateLimit-*` headers (every authenticated response)

The application emits the IETF httpapi-ratelimit-headers draft set on every authenticated 2xx and on every 429 from this service, so you can pace from header state without inferring rate from 429s:

Header	Meaning
`RateLimit-Limit`	Token-bucket capacity (= `rate_per_minute ÷ 6`).
`RateLimit-Remaining`	Tokens left after this request, integer.
`RateLimit-Reset`	Seconds until the bucket refills to full.
`RateLimit-Policy`	`<capacity>;w=60` — capacity per 60-second window.
`Retry-After`	On `429` only, seconds until at least one token is available.

Pacing guidance

Pace at ~80% of your issued RateLimit-Limit to leave headroom for bursts; serialize a small worker pool against your concurrency cap rather than firing N parallel requests at it.
The token bucket and concurrency semaphore are independent: a 429 can come from either. Both carry the same RateLimit-* spine, but only the rate-bucket 429 has a Retry-After calibrated against refill time; the concurrency 429 uses Retry-After: 1.

Async result store

Property	Value
TTL	24 h from last activity
Storage	In-process
Persistence	None — resets on restart

Fetch after TTL or after a restart returns 410 job_expired.

Input length per task​

Latency​

Cold start​

Sync delivery: timeout and guidance​

Hard sync cap​

Recommended async opt-in (client-side guidance, not enforced)​

Per-key quotas​

RateLimit-* headers (every authenticated response)​

Pacing guidance​

Async result store​