Limits

Authoritative source: gpu_service/core/limits.py (input bounds) and gpu_service/core/job_store.py (TTL). Numbers here mirror those files; the constants win on drift.

Input length per task

Oversized or undersized inputs are rejected with 422 validation_failed before any GPU work. Sequences must contain only A/C/G/T/N (case-insensitive).

| Task | Endpoint | Min bp | Max bp |
| --- | --- | --- | --- |
| Promoter | POST /v1/tasks/promoter/predict | 1 | 500,000 |
| Splice | POST /v1/tasks/splice/predict | 1 | 500,000 |
| Enhancer | POST /v1/tasks/enhancer/predict | 1 | 500,000 |
| Chromatin | POST /v1/tasks/chromatin/predict | 1 | 500,000 |
| Annotation | POST /v1/tasks/annotation/predict | 1 | 500,000 |
| Expression (TSS-centered) | POST /v1/tasks/expression/predict | 1 | 500,000 |

Body cap: 16 MiB.
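
A client-side pre-check can catch most 422 validation_failed responses before a request is sent. The following is a minimal sketch based on the bounds and alphabet stated above; the function name `check_sequence` and the constants are illustrative, and the server-side constants remain authoritative.

```python
import re

# Bounds copied from the table above; the service's own constants win on drift.
MIN_BP = 1
MAX_BP = 500_000
_ALLOWED = re.compile(r"[ACGTN]+", re.IGNORECASE)


def check_sequence(seq: str) -> list[str]:
    """Return a list of problems; an empty list means the sequence looks acceptable."""
    problems = []
    if not (MIN_BP <= len(seq) <= MAX_BP):
        problems.append(f"length {len(seq)} outside {MIN_BP}..{MAX_BP} bp")
    if seq and not _ALLOWED.fullmatch(seq):
        problems.append("contains characters other than A/C/G/T/N")
    return problems
```

An empty return value means the input passes these two checks; it does not guarantee acceptance, since the body cap and any other server-side rules still apply.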

The expression model uses a fixed 9,198 bp TSS-centered window (±4,599 bp). Off-length sequences are accepted (the tokenizer truncates or pads to this window), but predictions on non-TSS-centered or off-length input are not guaranteed to be biologically meaningful. If you have raw genomic input where the TSS isn't known in advance, contact us; server-side annotation-to-expression chaining can be enabled per tenant.
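
If you already know the TSS position, you can cut the window yourself before calling the expression endpoint. The sketch below assumes a 0-based TSS index, takes 4,599 bp on each side, and pads with N at sequence edges; the exact centering convention and the padding character are assumptions, not confirmed service behavior, and `tss_window` is a hypothetical helper.

```python
FLANK_BP = 4_599
WINDOW_BP = 2 * FLANK_BP  # 9,198 bp


def tss_window(chrom_seq: str, tss_index: int) -> str:
    """Return a WINDOW_BP slice around a 0-based TSS index, N-padded at the edges."""
    if not 0 <= tss_index < len(chrom_seq):
        raise ValueError("tss_index is outside the sequence")
    start = tss_index - FLANK_BP
    end = tss_index + FLANK_BP
    # Pad with N wherever the window would run past either end (assumed convention).
    left_pad = "N" * max(0, -start)
    right_pad = "N" * max(0, end - len(chrom_seq))
    core = chrom_seq[max(0, start):min(len(chrom_seq), end)]
    return left_pad + core + right_pad
```

With this convention the TSS base lands at offset 4,599 of the returned window.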

Latency

Rough sync latency at the recommended input size, on a warm model.

| Task | Sync latency at recommended size | When to go async |
| --- | --- | --- |
| Promoter / splice / enhancer / chromatin | 0.3–10 s | inputs > ~100 kbp |
| Annotation | 1–60 s | inputs > ~30 kbp |
| Expression (fixed 9,198 bp window) | 0.5–3 s | n/a; sync is always safe |

Cold start

If a task's model isn't already loaded into GPU memory, the first request pays a model-load cost. The response carries meta.cold_start: true and meta.model_load_time_ms. Cold start adds 5–15 s for the smaller models and 30–90 s for expression and annotation. Subsequent calls are warm.
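
To separate cold-start cost from inference time in your own metrics, you can read the two meta fields from the response. This is a minimal sketch that assumes the response body is a JSON object with a top-level `meta` object; `cold_start_info` is a hypothetical helper name.

```python
from typing import Optional, Tuple


def cold_start_info(response_body: dict) -> Tuple[bool, Optional[int]]:
    """Return (cold_start, model_load_time_ms); the time is None on warm calls."""
    meta = response_body.get("meta") or {}
    cold = bool(meta.get("cold_start", False))
    load_ms = meta.get("model_load_time_ms") if cold else None
    return cold, load_ms
```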

Sync delivery: timeout and guidance

Sync delivery (the default, no Prefer header) is best-effort within the upstream HTTP read timeout of 300 seconds. A request that takes longer than that is terminated by the edge proxy and surfaces to your client as a connection reset or 504 gateway_timeout. The body in this case is the proxy's, not the unified {error: {...}} envelope; this is the only place where that happens. Pick async whenever you expect a request to push past ~60 s.
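
Because a proxy timeout does not carry the unified envelope, error handling should not assume the body is JSON. The sketch below shows one way to normalize both cases; the returned dictionary shape and the `source` labels are illustrative choices, not part of the API contract.

```python
import json


def parse_error(status: int, body: str) -> dict:
    """Normalize an error response into {"code": ..., "source": ...}."""
    try:
        payload = json.loads(body)
    except (ValueError, TypeError):
        payload = None
    # The service's own errors use the unified {error: {...}} envelope.
    if isinstance(payload, dict) and isinstance(payload.get("error"), dict):
        return {"code": payload["error"].get("code"), "source": "service"}
    # A 504 without the envelope is treated as the edge proxy's timeout.
    if status == 504:
        return {"code": "gateway_timeout", "source": "proxy"}
    return {"code": None, "source": "unknown"}
```

A connection reset never reaches this function; it surfaces as an exception in your HTTP client and should be handled the same way as the proxy timeout.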

Hard sync cap

No partner-visible endpoint enforces a hard sync cap today; every task accepts sync up to its per-task max (see "Input length per task" above). The 413 sync_too_large error class is reserved in the schema (SyncTooLargeDetails) for future use; you do not need to handle it on the current contract beyond switching on error.code.

Use Prefer: respond-async when your input exceeds the thresholds below. These are calibrated against typical inference times on the production GPU. Sync is still accepted above them, but bursty traffic plus GPU contention can push individual requests past the 300 s proxy window without warning.

| Task | Recommended async above |
| --- | --- |
| Promoter | 100,000 bp |
| Splice | 250,000 bp |
| Enhancer | 100,000 bp |
| Chromatin | 100,000 bp |
| Annotation | 30,000 bp |
| Expression (TSS-centered) | n/a: input is the fixed 9,198 bp window, so sync is always safe. |
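
The thresholds above can be applied mechanically when building a request. A minimal sketch follows: the lowercase task keys and the `delivery_headers` helper are illustrative, and the numbers are copied from the table, so they should be re-checked if the table changes.

```python
from typing import Dict, Optional

# Recommended async thresholds in bp; None means sync is always safe.
ASYNC_ABOVE_BP: Dict[str, Optional[int]] = {
    "promoter": 100_000,
    "splice": 250_000,
    "enhancer": 100_000,
    "chromatin": 100_000,
    "annotation": 30_000,
    "expression": None,
}


def delivery_headers(task: str, seq_len: int) -> Dict[str, str]:
    """Return the extra request headers for the recommended delivery mode."""
    threshold = ASYNC_ABOVE_BP[task]
    if threshold is not None and seq_len > threshold:
        return {"Prefer": "respond-async"}
    return {}  # sync is the default when no Prefer header is sent
```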

If sync is critical for your workload and these guidelines force more async than you can stomach, contact us; we can profile your distribution and tune the proxy timeout for your tenant.

Per-key quotas

Three limiters run side by side. Each is configured per partner; your account owner tells you the values issued for your key. Defaults for the partner tier:

| Setting | Default for partner tier | Enforced? |
| --- | --- | --- |
| Concurrent in-flight requests | 2 | ✅ Exceeding the cap returns 429 too_many_requests with Retry-After: 1. |
| Per-minute request rate | 60 | ✅ Token bucket; capacity = rate ÷ 6 (10-second burst, default 10). An empty bucket returns 429 too_many_requests with Retry-After ≈ seconds-to-next-token. |
| Edge per-IP cap | 10 r/s, burst 20, on api.* | ✅ Enforced at nginx; returns 429 too_many_requests with the unified {error: {...}} envelope (see errors). |
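
To reason about how the per-minute limiter behaves under bursts, it can help to model it locally. The sketch below is a generic token bucket using the stated capacity (rate ÷ 6); the refill rate of rate ÷ 60 tokens per second is an assumption inferred from the "10-second burst" description, and the class is not the service's implementation.

```python
class TokenBucket:
    """Local model of a token bucket with capacity = rate_per_minute / 6."""

    def __init__(self, rate_per_minute: float, now: float = 0.0):
        self.capacity = rate_per_minute / 6
        self.refill_per_s = rate_per_minute / 60  # assumed refill rate
        self.tokens = self.capacity
        self.updated = now

    def try_acquire(self, now: float) -> float:
        """Return 0.0 if a token was taken, else seconds until the next token."""
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return 0.0
        return (1 - self.tokens) / self.refill_per_s
```

Under these assumptions, a key at the default rate of 60 can send 10 requests back to back, after which roughly one request per second is sustainable.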

RateLimit-* headers (every authenticated response)

The application emits the IETF httpapi-ratelimit-headers draft set on every authenticated 2xx and on every 429 from this service, so you can pace from header state without inferring rate from 429s:

| Header | Meaning |
| --- | --- |
| RateLimit-Limit | Token-bucket capacity (= rate_per_minute ÷ 6). |
| RateLimit-Remaining | Tokens left after this request, integer. |
| RateLimit-Reset | Seconds until the bucket refills to full. |
| RateLimit-Policy | <capacity>;w=60 (capacity per 60-second window). |
| Retry-After | On 429 only, seconds until at least one token is available. |
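
These headers can be read into a small structure and turned into a wait time. The sketch below is one possible client-side interpretation: `RateState`, `read_rate_state`, and `suggested_wait` are hypothetical names, and waiting for the full RateLimit-Reset when no tokens remain is a deliberately conservative choice rather than documented guidance.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class RateState:
    limit: Optional[int]
    remaining: Optional[int]
    reset_s: Optional[int]
    retry_after_s: Optional[int]


def _to_int(value: Optional[str]) -> Optional[int]:
    try:
        return int(value)
    except (TypeError, ValueError):
        return None  # header missing or not an integer


def read_rate_state(headers: Mapping[str, str]) -> RateState:
    """Parse the RateLimit-* and Retry-After headers, ignoring header-name case."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return RateState(
        limit=_to_int(lowered.get("ratelimit-limit")),
        remaining=_to_int(lowered.get("ratelimit-remaining")),
        reset_s=_to_int(lowered.get("ratelimit-reset")),
        retry_after_s=_to_int(lowered.get("retry-after")),
    )


def suggested_wait(state: RateState) -> float:
    """Seconds to wait before the next request, preferring Retry-After."""
    if state.retry_after_s is not None:
        return float(state.retry_after_s)
    if state.remaining == 0 and state.reset_s is not None:
        return float(state.reset_s)  # conservative: wait for a full refill
    return 0.0
```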

Pacing guidance

  • Pace at ~80% of your issued RateLimit-Limit to leave headroom for bursts; serialize a small worker pool against your concurrency cap rather than firing N parallel requests at it.
  • The token bucket and concurrency semaphore are independent: a 429 can come from either. Both carry the same RateLimit-* spine, but only the rate-bucket 429 has a Retry-After calibrated against refill time; the concurrency 429 uses Retry-After: 1.
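
The pacing guidance can be sketched as a small worker pool whose size matches the concurrency cap, with submissions spaced out in time. This example interprets "~80%" as 80% of the per-minute rate, which is an assumption; `run_paced` and its `send` callback are hypothetical, and `send` stands in for whatever function performs one API request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_paced(jobs, send, concurrency=2, rate_per_minute=60,
              headroom=0.8, sleep=time.sleep):
    """Submit jobs at a fixed interval through a pool sized to the concurrency cap."""
    interval = 60.0 / (rate_per_minute * headroom)  # seconds between submissions
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = []
        for job in jobs:
            futures.append(pool.submit(send, job))
            sleep(interval)  # pace submissions below the issued rate
        # Results come back in submission order; exceptions re-raise here.
        return [f.result() for f in futures]
```

The pool size bounds in-flight requests, while the interval bounds the average request rate. A production client would still need to honor Retry-After on any 429 it receives.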

Async result store

| Property | Value |
| --- | --- |
| TTL | 24 h from last activity |
| Storage | In-process |
| Persistence | None; resets on restart |

Fetch after TTL or after a restart returns 410 job_expired.