Epic 38 — Performance, Security & Accessibility Hardening

Decision (user, 2026-06-12): Run three platform-wide review tracks before beta — Performance (fast requests = better UX), Security (backend + data + user-facing safety), and Accessibility (full WCAG 2.1 AA, including visual contrast). Each track is audit-first: an audit ticket produces a written findings list with severities, then remediation tickets fix the findings. Audits cover both repos (castyou-backend, castyou-frontend: apps/app, apps/landing, packages/design-system).

Why: The platform has grown across 30+ epics with feature-level testing but no cross-cutting pass. Known smells already on record: design-system Input labels are not programmatically associated with their inputs ([[E2E Test Gotchas]] works around it by selecting on placeholder — an accessibility bug, not a test quirk); GraphQL resolvers were added incrementally by many epics so authorization and N+1 patterns are inconsistent; the app ships media-heavy pages (portfolio, feed, discover) where payload size and query waterfalls directly hurt UX. A talent marketplace also holds sensitive PII (minors may appear in casting contexts, pet owners, physical characteristics, location, salary expectations) — data exposure is a user-safety issue, not just a technical one.

Tracks & severity model: Findings are logged in docs/audits/ (one markdown file per audit) with severity P0 (exploit / data leak / blocker — fix immediately), P1 (real user impact — fix in this epic), P2 (improvement — backlog). Remediation tickets in this epic cover P0/P1; P2s become follow-up tickets only if cheap.

Scope notes: Epic 37 already covers content moderation (user safety from content); this epic covers safety from data exposure and account compromise. Landing visual polish stays gated behind Epic 15 ([[Landing Polish Gate]]) — but accessibility and security fixes to the landing are in scope here (a11y/security are not "polish"). All UI fixes land in @castyou/design-system first ([[Design System Rule]]); any list touched gets pagination verified ([[Pagination Rule]]).


PERF-AUDIT-001 — Backend & frontend performance audit

  • [x] Done (2026-06-15 — docs/audits/2026-06-performance.md, 17 findings: 2 P0/11 P1/4 P2; src/plugins/queryTiming.ts behind PERF_TRACE)

Files:

  • Create: docs/audits/2026-06-performance.md — findings with severity + measurement
  • Create: castyou-backend/src/plugins/queryTiming.ts — temporary GraphQL plugin logging per-resolver duration + Prisma/Mongo query counts per operation (kept behind PERF_TRACE=true, useful permanently)

Audit checklist:

  • N+1 queries: instrument the top operations (feed, jobs list, discover, search, portfolio, applications dashboard, admin tables) and log query-count-per-request; anything that scales with result count is a finding. Prisma relation access inside list field resolvers is the prime suspect.
  • Missing indexes: EXPLAIN ANALYZE the hot Prisma queries (feed ordering, follow lookups, search ILIKE paths, moderation/report queues, notification unread counts); cross-check schema.prisma @@index coverage against actual where/orderBy clauses. Same for Mongo (messages by conversation, notifications by user+read).
  • Over-fetching: GraphQL fields resolved eagerly but rarely requested (e.g. heavy JSON columns, counts); responses returning full entities where the UI needs a projection.
  • Payload & media: portfolio/feed image sizes served from R2 (are thumbnails used everywhere a thumbnail fits?), video poster usage, missing CDN cache headers.
  • Frontend: Lighthouse runs (mobile profile) on landing home, app feed, jobs, portfolio, profile; bundle analysis (vite build --mode analyze / rollup-plugin-visualizer) for oversized chunks (Konva, dnd-kit, ffmpeg-adjacent libs leaking into the main bundle); React Query waterfalls (sequential dependent queries that could parallelize); re-render hotspots on the feed.
  • Redis utilization: hot reads that are recomputed per request but could be cached (suggestion engine, search facets); feature flags are already cached, so verify their TTLs.
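
The N+1 rule in the checklist above ("anything that scales with result count is a finding") can be checked mechanically once query counts are logged per request. A minimal sketch, assuming a hypothetical `Sample` record shape (not the actual output of queryTiming.ts), made-up sample numbers, and an arbitrary threshold:

```typescript
// Hypothetical shape: one record per traced request of a single operation.
interface Sample {
  resultCount: number; // items returned by the list operation
  queryCount: number; // Prisma/Mongo queries issued while resolving it
}

// Least-squares slope of queryCount over resultCount.
// Slope near 0: constant query count. Slope near 1 or above: roughly one
// extra query per returned item, which is the N+1 signature.
function queriesPerItem(samples: Sample[]): number {
  const n = samples.length;
  if (n < 2) return 0;
  const meanX = samples.reduce((sum, s) => sum + s.resultCount, 0) / n;
  const meanY = samples.reduce((sum, s) => sum + s.queryCount, 0) / n;
  let num = 0;
  let den = 0;
  for (const s of samples) {
    const dx = s.resultCount - meanX;
    num += dx * (s.queryCount - meanY);
    den += dx * dx;
  }
  return den === 0 ? 0 : num / den;
}

// The 0.5 threshold is an assumption chosen for illustration.
function isNPlusOne(samples: Sample[], threshold = 0.5): boolean {
  return queriesPerItem(samples) >= threshold;
}

// Illustrative numbers only, not measurements from the audit.
const scalingOperation: Sample[] = [
  { resultCount: 10, queryCount: 12 },
  { resultCount: 20, queryCount: 22 },
  { resultCount: 50, queryCount: 52 },
];
const flatOperation: Sample[] = [
  { resultCount: 10, queryCount: 3 },
  { resultCount: 20, queryCount: 3 },
  { resultCount: 50, queryCount: 3 },
];
```

Sampling the same operation at several page sizes is what makes the slope meaningful; a single request cannot distinguish "many queries" from "queries that grow with the list".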

Acceptance criteria:

  • Findings doc exists with P0/P1/P2 severities, each with a measurement (query count, ms, KB) — not vibes
  • Query-timing plugin merged behind PERF_TRACE env flag
  • P0/P1 findings mapped to BE-PERF-001 / FE-PERF-001 scopes

BE-PERF-001 — Backend remediation: N+1s, indexes, caching

  • [x] Done (2026-06-15 — all P0/P1 closed: feed + job + petJob N+1 via per-request DataLoaders, 8 Postgres indexes + 2 migrations, Notification Mongo compound indexes, followSuggestions Redis cache. Deferred to follow-up (audit doc): pg_trgm GIN search index, scorer select-projection)

Files:

  • Edit: resolvers flagged by PERF-AUDIT-001 (batch with DataLoader or Prisma include/in queries)
  • Edit: castyou-backend/prisma/schema.prisma — missing @@index entries
  • Run: pnpm db:migrate (Postgres schema change — prisma generate alone is not enough)
  • Edit: Mongoose schemas — missing compound indexes
  • Create: castyou-backend/src/lib/dataloaders.ts — per-request loader registry on GraphQL context (users by id, profiles by userId, follow-state by pair, media counts)

Notes:

  • DataLoader instances must be per-request (constructed in context factory), never module-level — cross-user cache bleed is a security bug, not just staleness.
  • Redis caching only for reads that tolerate staleness (suggestions, public profile aggregates); explicit invalidation or short TTL ≤60s; never cache viewer-specific authorization results.
  • Target from audit baseline: hot list operations issue a constant number of queries regardless of result count (no per-item queries); p95 of instrumented operations improved and recorded in the findings doc as before/after.
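
The per-request rule in the notes above can be sketched as follows. This is an illustration, not the contents of dataloaders.ts: `TinyLoader` is a hand-rolled stand-in for the `dataloader` package so the example has no dependencies, and `UserSource`, `createLoaders` and `createContext` are hypothetical names.

```typescript
type BatchFn<K, V> = (keys: readonly K[]) => Promise<V[]>;

// Stand-in for the `dataloader` package: coalesces load() calls made in the
// same tick into one batch call and memoizes results per instance.
class TinyLoader<K, V> {
  private cache = new Map<K, Promise<V>>();
  private pending: { key: K; resolve: (v: V) => void; reject: (e: unknown) => void }[] = [];
  private batchFn: BatchFn<K, V>;

  constructor(batchFn: BatchFn<K, V>) {
    this.batchFn = batchFn;
  }

  load(key: K): Promise<V> {
    const hit = this.cache.get(key);
    if (hit) return hit;
    const promise = new Promise<V>((resolve, reject) => {
      this.pending.push({ key, resolve, reject });
      // The first key queued in this tick schedules a single flush.
      if (this.pending.length === 1) Promise.resolve().then(() => this.flush());
    });
    this.cache.set(key, promise);
    return promise;
  }

  private flush(): void {
    const batch = this.pending;
    this.pending = [];
    this.batchFn(batch.map((p) => p.key)).then(
      (values) => batch.forEach((p, i) => p.resolve(values[i])),
      (err) => batch.forEach((p) => p.reject(err)),
    );
  }
}

interface User { id: string; name: string }
// Hypothetical data source; in the backend this would wrap a Prisma `in` query.
interface UserSource { findUsersByIds(ids: readonly string[]): Promise<User[]> }

function createLoaders(source: UserSource) {
  return {
    userById: new TinyLoader<string, User | undefined>(async (ids) => {
      const rows = await source.findUsersByIds(ids);
      const byId = new Map(rows.map((u) => [u.id, u] as const));
      return ids.map((id) => byId.get(id)); // results must line up with keys
    }),
  };
}

// Called once per request, so every request starts with empty caches.
function createContext(source: UserSource, viewerId: string | null) {
  return { viewerId, loaders: createLoaders(source) };
}
```

Because `createLoaders` runs inside the context factory, a value memoized while serving one viewer cannot be handed to another; a module-level `const loaders = createLoaders(...)` would share that cache across users.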

Acceptance criteria:

  • Every P0/P1 backend finding closed with before/after numbers appended to the audit doc
  • No resolver issues queries proportional to list size on the instrumented operations
  • Migration applies cleanly; EXPLAIN confirms new indexes are used

FE-PERF-001 — Frontend remediation: code-splitting, media, query tuning

  • [x] Done (2026-06-15 — ReactQueryDevtools dev-only/lazy out of prod bundle, Vite manualChunks vendor split with Konva/dnd-kit kept lazy. Deferred: hover-prefetch, responsive R2 thumbnails)

Files:

  • Edit: castyou-frontend/apps/app/src/App.tsx (or router entry) — React.lazy route-level splitting for heavy/rare routes (reel editor, flier editor/Konva, admin pages)
  • Edit: image-rendering components in @castyou/design-system (loading="lazy", explicit width/height for CLS, thumbnail-first sources)
  • Edit: React Query hooks flagged by the audit — parallelize independent queries, tune staleTime on stable data (profiles, flags), prefetch on hover for job/profile cards
  • Edit: castyou-frontend/apps/landing — next/image usage check, font loading (display: swap)

Notes:

  • Konva (flier editor) and @dnd-kit (reel editor) must not be in the entry chunk — verify with the bundle analyzer before/after.
  • Skeletons from the design system for all above-the-fold loading states ([[Design System Rule]]) — perceived performance counts.
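
One way to keep Konva and @dnd-kit out of the entry chunk is the function form of Rollup's `output.manualChunks`, which Vite exposes under `build.rollupOptions`. A minimal sketch; the chunk names (`vendor-react`, `vendor`) and the exact grouping are assumptions, not the split that shipped:

```typescript
// Libraries that must only load with their lazy routes.
const LAZY_ONLY = ["/node_modules/konva/", "/node_modules/@dnd-kit/"];

// Function form of manualChunks: return a chunk name, or undefined to let
// Rollup place the module with whatever imports it.
function manualChunks(id: string): string | undefined {
  const path = id.replace(/\\/g, "/"); // normalize Windows separators
  if (path.indexOf("/node_modules/") === -1) return undefined; // app code
  for (const lib of LAZY_ONLY) {
    // No forced chunk: the library stays with the route that dynamically imports it.
    if (path.indexOf(lib) !== -1) return undefined;
  }
  if (/\/node_modules\/(react|react-dom|scheduler)\//.test(path)) return "vendor-react";
  return "vendor";
}

// Wiring in vite.config.ts would look like:
//   build: { rollupOptions: { output: { manualChunks } } }
```

A catch-all `vendor` chunk can still pull in dependencies that only the editors use, so the bundle analyzer check named in the note above remains the real verification.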

Acceptance criteria:

  • Entry chunk shrinks measurably (record before/after KB in the audit doc); editor libs only load on their routes
  • Lighthouse performance score on feed/jobs/landing improves vs. audit baseline (record scores)
  • No layout shift from images on feed/portfolio (CLS ≈ 0 on those pages)

SEC-AUDIT-001 — Security audit: authorization, data exposure, abuse resistance

  • [x] Done (2026-06-15 — docs/audits/2026-06-security.md, 11 findings: 0 P0/4 P1/7 P2; authzMatrix.test.ts 357 pass + 4 todo, completeness guard verified)

Files:

  • Create: docs/audits/2026-06-security.md — findings with severity + reproduction steps
  • Create: castyou-backend/src/__tests__/security/authzMatrix.test.ts — table-driven test: every Query/Mutation in the schema × every role (anon, TALENT, PRODUCER, PET_OWNER, AGENCY, ADMIN) asserting the expected allow/deny; schema-introspection-driven so a new resolver without a matrix entry fails the test
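
The completeness guard behind the matrix test can be sketched as two pure functions: one that lists schema operations with no matrix row, and one that expands the matrix into per-role cases. The operation names and expected verdicts below are hypothetical, and how operation names are read from the schema is left out:

```typescript
type Role = "anon" | "TALENT" | "PRODUCER" | "PET_OWNER" | "AGENCY" | "ADMIN";
type Verdict = "allow" | "deny";
type Matrix = Record<string, Record<Role, Verdict>>;

const ROLES: Role[] = ["anon", "TALENT", "PRODUCER", "PET_OWNER", "AGENCY", "ADMIN"];

// Operations present in the schema but absent from the matrix.
// A non-empty result should fail the test run.
function missingEntries(schemaOps: string[], matrix: Matrix): string[] {
  return schemaOps
    .filter((op) => !Object.prototype.hasOwnProperty.call(matrix, op))
    .sort();
}

// One test case per (operation, role) pair.
function expandCases(matrix: Matrix): { op: string; role: Role; expected: Verdict }[] {
  const cases: { op: string; role: Role; expected: Verdict }[] = [];
  for (const op of Object.keys(matrix).sort()) {
    for (const role of ROLES) {
      cases.push({ op, role, expected: matrix[op][role] });
    }
  }
  return cases;
}

// Hypothetical rows for illustration only.
const exampleMatrix: Matrix = {
  "Query.me": {
    anon: "deny", TALENT: "allow", PRODUCER: "allow",
    PET_OWNER: "allow", AGENCY: "allow", ADMIN: "allow",
  },
  "Mutation.deleteJob": {
    anon: "deny", TALENT: "deny", PRODUCER: "allow",
    PET_OWNER: "deny", AGENCY: "deny", ADMIN: "allow",
  },
};
```

Typing each row as `Record<Role, Verdict>` means a row that forgets a role is a compile error, so the runtime guard only has to catch missing operations.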

Audit checklist:

  • Authorization (IDOR): every mutation taking an id — can user A act on user B's job/application/media/draft/message/folder? Field-level: do private fields (email, salary expectations, private media, moderation fields, strikes) leak through any public type path (PublicUser, search results, feed payloads, admin-only fields on shared types)?
  • Auth lifecycle: JWT expiry/rotation, suspension/ban invalidation actually blocks every transport (HTTP + WebSocket subscriptions if any), password reset token single-use + expiry, email-change confirmation, impersonation (Epic 24) cannot escalate or persist.
  • Input handling: GraphQL depth/complexity limits (a nested follow-graph query must not be a DoS vector), upload validation (presigned R2 URLs scoped by content-type/size/key prefix; can a user overwrite another user's object key?), socialProfileUrl() open-redirect guard still holds for any new link surface, raw string interpolation anywhere near Prisma $queryRaw/Mongo queries.
  • Abuse resistance: rate limiting on login (brute force), register/waitlist (spam), messaging + follow + report (harassment/flood), expensive endpoints (search, AI calls — CasTars charging is not a rate limit); account-lockout/back-off behavior.
  • Transport & headers: CORS allowlist (not * with credentials), cookie flags if any, security headers on app + landing nginx/Next configs (CSP at least report-only, X-Content-Type-Options, frame-ancestors), GraphQL introspection + error verbosity in production.
  • Secrets & dependencies: no secrets in repo/client bundles (R2/OpenAI keys server-side only — grep the built app bundle), pnpm audit / npm audit on both repos, Docker images run as non-root.
  • User-facing safety: session visibility ("log out other sessions"), password strength policy, no user enumeration via login/reset/register error differences, external links from user content open with rel="noopener noreferrer" and are plain https only.

Acceptance criteria:

  • Findings doc with severity + repro steps; P0s flagged to the user immediately, not just filed
  • Authz matrix test merged and green (every existing resolver covered; unlisted resolvers fail CI)
  • P0/P1 findings mapped to BE-SEC-001 / FE-SEC-001 scopes

BE-SEC-001 — Backend remediation: authz fixes, rate limiting, hardening

  • [x] Done (2026-06-15 — all P1 + backend P2 closed: messaging PII redaction, R2 key scoping, GraphQL depth(12)/cost(5000) limits, subscription ban gate, uniform auth errors, password policy, agency-invite consent, pet IDOR guard, logoutEverywhere; authzMatrix todos now real. Deferred (audit doc): SEC-F06 token storage [arch decision], SEC-F08 CSP/headers + SEC-F09 container/CI [-> TEST-PSA-001])

Files:

  • Edit: resolvers flagged by SEC-AUDIT-001 — ownership/role guards, field-level redaction
  • Create: castyou-backend/src/middleware/rateLimit.ts — Redis sliding-window limiter (per-user + per-IP), applied to login, register, password reset, sendMessage, follow, reportContent, search, AI endpoints
  • Edit: GraphQL server config — depth + complexity limits, production error masking, introspection off in prod (admin tooling exception if needed)
  • Edit: upload presign service — content-type allowlist, max size, user-scoped key prefixes
  • Edit: nginx/Next configs — security headers (CSP report-only first, X-Content-Type-Options, frame-ancestors, referrer-policy)
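
To show what a depth limit measures, here is a sketch over a simplified selection tree. `FieldNode` is a made-up stand-in for a real GraphQL AST, and in practice the limit would more likely be enforced by a validation rule or server plugin; the limit of 12 is the value recorded in the Done note above. Libraries differ on whether the root counts as depth 0 or 1; this sketch counts it as 1.

```typescript
// Simplified stand-in for a parsed selection set (not the graphql-js AST).
interface FieldNode {
  field: string;
  selections?: FieldNode[];
}

// Depth of the deepest path from this field down to a leaf, counting this field.
function selectionDepth(node: FieldNode): number {
  let deepest = 0;
  for (const child of node.selections ?? []) {
    deepest = Math.max(deepest, selectionDepth(child));
  }
  return 1 + deepest;
}

// Rejects the operation before any resolver runs.
function enforceDepth(root: FieldNode, maxDepth: number): void {
  const depth = selectionDepth(root);
  if (depth > maxDepth) {
    throw new Error(`Query depth ${depth} exceeds limit ${maxDepth}`);
  }
}

// Builds user { followers { followers { ... { id } } } } with `levels` nested fields.
function followChain(levels: number): FieldNode {
  let node: FieldNode = { field: "id" };
  for (let i = 0; i < levels - 1; i++) {
    const isOutermost = i === levels - 2;
    node = { field: isOutermost ? "user" : "followers", selections: [node] };
  }
  return node;
}
```

Depth alone does not bound cost (a shallow query can still request huge lists), which is why the ticket pairs it with a complexity limit.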

Notes:

  • Rate limiter is fail-open on Redis outage (same stance as notifications/moderation enqueue — availability over strictness) but logs loudly.
  • Uniform auth errors: login/reset/register return identical messages + timing for unknown-user vs wrong-password (no enumeration).
  • Every fix gets a regression entry in the authz matrix test or a dedicated security test — a security fix without a test will regress.
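
The fail-open stance in the notes above can be sketched with a sliding-window log behind a small store interface. This is an illustration under assumptions: `WindowStore`, `MemoryStore` and `checkLimit` are hypothetical names, the in-memory store stands in for Redis, and a real Redis-backed store would be asynchronous.

```typescript
// Minimal store contract. A Redis-backed implementation is assumed to exist
// elsewhere; here an in-memory version stands in for it.
interface WindowStore {
  // Records a hit at `now` and returns the number of hits inside the window.
  hit(key: string, now: number, windowMs: number): number;
}

class MemoryStore implements WindowStore {
  private hits = new Map<string, number[]>();

  hit(key: string, now: number, windowMs: number): number {
    const cutoff = now - windowMs;
    const recent = (this.hits.get(key) ?? []).filter((t) => t > cutoff);
    recent.push(now); // rejected attempts also count toward the window
    this.hits.set(key, recent);
    return recent.length;
  }
}

interface Decision {
  allowed: boolean;
  degraded: boolean; // true when the store failed and the limiter failed open
}

function checkLimit(
  store: WindowStore,
  key: string,
  limit: number,
  windowMs: number,
  now: number,
  log: (msg: string) => void = (msg) => console.error(msg),
): Decision {
  try {
    const count = store.hit(key, now, windowMs);
    return { allowed: count <= limit, degraded: false };
  } catch (err) {
    // Fail open: availability over strictness, but never silently.
    log(`rateLimit store failure for ${key}: ${String(err)}`);
    return { allowed: true, degraded: true };
  }
}
```

Keys would typically combine the action with the user id or IP (for example `login:ip:203.0.113.7`, a made-up format) so the per-user and per-IP limits are tracked independently.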

Acceptance criteria:

  • All P0/P1 backend findings closed, each with a test
  • Rate limits verified (burst test) and fail-open verified
  • Suspension/ban kills live access on all transports

FE-SEC-001 — Frontend remediation: client-side hardening + account safety UI

  • [x] Done (2026-06-15 — shared DS ExternalLink rolled out across app+landing, queryClient.clear() on logout, PasswordStrengthMeter on register + change-password, backend error-string sanitization, log-out-everywhere button wired to logoutEverywhere mutation [confirm modal + clearAuth/redirect, i18n en/pt/es])

Files:

  • Edit: user-content link rendering (posts, profiles, portfolio links, messages) — enforce https?: scheme allowlist + rel="noopener noreferrer" target="_blank" in one shared design-system ExternalLink component
  • Edit: token storage/refresh handling per audit findings; logout clears all cached query data (queryClient.clear())
  • Edit: apps/app Profile → Settings — "Active sessions / log out everywhere" + password-strength meter on change/register (DS components)
  • Edit: error boundaries / toast paths — never render raw backend error strings to users

Notes:

  • One ExternalLink component in the design system replaces ad-hoc anchors everywhere ([[Design System Rule]]) — the audit grep should find zero raw target="_blank" anchors in apps afterward.
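
The scheme allowlist inside such a component can be a small pure function, sketched here without React. `safeExternalHref` and `externalLinkAttrs` are hypothetical names; the allowlist follows the `https?:` rule from the Files list above.

```typescript
const ALLOWED_PROTOCOLS = ["http:", "https:"];

// Returns a normalized href for http(s) URLs, or null for anything else
// (javascript:, data:, relative paths, malformed input).
function safeExternalHref(raw: string): string | null {
  let url: URL;
  try {
    url = new URL(raw.trim());
  } catch {
    return null; // not an absolute URL
  }
  return ALLOWED_PROTOCOLS.indexOf(url.protocol) !== -1 ? url.href : null;
}

// Attributes the shared component would spread onto its anchor.
// Returning null lets the caller render plain text instead of a link.
function externalLinkAttrs(raw: string): { href: string; target: string; rel: string } | null {
  const href = safeExternalHref(raw);
  return href === null ? null : { href, target: "_blank", rel: "noopener noreferrer" };
}
```

Parsing with `URL` rather than matching a regex on the raw string handles mixed-case schemes and leading whitespace, since the parser lowercases the scheme before the comparison.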

Acceptance criteria:

  • All P0/P1 frontend findings closed with tests
  • Built app bundle contains no secrets (grep verified in CI or build script)
  • Logout-everywhere works (other session's next request is rejected)

A11Y-AUDIT-001 — WCAG 2.1 AA accessibility audit (app + landing + design system)

  • [x] Done (2026-06-15 — docs/audits/2026-06-accessibility.md, 20 findings: 6 P0/9 P1/5 P2, 17 failing contrast pairs; design-system/src/__tests__/a11y.test.tsx vitest+jest-axe smoke, green via it.todo)

Files:

  • Create: docs/audits/2026-06-accessibility.md — findings per screen × criterion, with severity
  • Create: castyou-frontend/packages/design-system/src/__tests__/a11y.test.tsx — vitest-axe smoke over every exported DS component in default + error + disabled states

Audit checklist:

  • Contrast (explicit user requirement): programmatic sweep of all design-system color tokens — text/background pairs ≥ 4.5:1 (normal) / 3:1 (large text + UI components/focus indicators), in both light and dark themes; check disabled-but-readable text, placeholder text, badge/status colors (notification badge, moderation states, application status chips), charts in admin.
  • Forms: every DS Input/Select/Textarea label programmatically associated (htmlFor/id or wrapping) — the known gap from [[E2E Test Gotchas]]; error messages linked via aria-describedby; required state announced.
  • Keyboard: full traversal of critical flows (register → profile edit → browse jobs → apply; producer post job; messaging) with keyboard only; focus trap + restore in every Modal/Drawer/dropdown (NotificationBell, NavDrawer, report modal); visible focus indicator on all interactive elements; no positive tabindex.
  • Screen reader: landmarks (nav/main/header), page titles per route, heading hierarchy, alt text on user media (portfolio items have titles — use them), icon-only buttons have accessible names, live regions for async results (toasts, "application submitted", unread counts), DataTable semantics + pagination controls announced.
  • Structure & motion: touch targets ≥ 44px on the mobile bottom bar/drawer, prefers-reduced-motion respected by DS transitions, zoom to 200% without loss, no information conveyed by color alone (status chips get icons/text).
  • Tooling: axe automated pass on every route of app + landing (Playwright + @axe-core/playwright in castyou-automated-tests), manual VoiceOver pass on the critical flows.
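
The programmatic contrast sweep rests on the WCAG 2.1 contrast-ratio formula, which fits in a few lines. A sketch follows; the `TokenPair` shape and the token names in the example are hypothetical, while the 4.5:1 and 3:1 thresholds come from the checklist above.

```typescript
// sRGB channel (0-255) to linear value, per the WCAG 2.1 relative luminance definition.
function linearChannel(value: number): number {
  const s = value / 255;
  return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
}

function relativeLuminance(hex: string): number {
  const match = /^#?([0-9a-f]{6})$/i.exec(hex);
  if (!match) throw new Error(`Expected a 6-digit hex color, got "${hex}"`);
  const n = parseInt(match[1], 16);
  const r = linearChannel((n >> 16) & 255);
  const g = linearChannel((n >> 8) & 255);
  const b = linearChannel(n & 255);
  return 0.2126 * r + 0.7152 * g + 0.0722 * b;
}

// (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(a: string, b: string): number {
  const la = relativeLuminance(a);
  const lb = relativeLuminance(b);
  return (Math.max(la, lb) + 0.05) / (Math.min(la, lb) + 0.05);
}

interface TokenPair {
  label: string; // hypothetical token-pair label
  fg: string;
  bg: string;
  min: number; // 4.5 for normal text, 3 for large text and UI components
}

// Returns the failing pairs with their measured ratio, as the audit doc requires.
function failingPairs(pairs: TokenPair[]): { label: string; ratio: number; min: number }[] {
  return pairs
    .map((p) => ({ label: p.label, ratio: contrastRatio(p.fg, p.bg), min: p.min }))
    .filter((r) => r.ratio < r.min);
}
```

Running `failingPairs` once per theme over every text/background combination the design system actually renders gives the "measured ratio per failing pair" list the acceptance criteria ask for.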

Acceptance criteria:

  • Findings doc per screen with WCAG criterion references and severities
  • DS axe smoke test merged (failing cases documented as findings, then fixed in DS-A11Y-001)
  • Token contrast sweep results recorded (every failing pair listed with measured ratio)

DS-A11Y-001 — Design system remediation: labels, contrast tokens, focus, ARIA

  • [x] Done (2026-06-15 — all P0/P1 closed: Input/Textarea/Select/MultiSelect label assoc + error ARIA, Modal/NotificationBell/NavDrawer focus traps, contrast tokens fixed both themes (placeholder/Chip/Badge/focus-ring/Button variants), status not-color-only, prefers-reduced-motion, content-tertiary token. DS axe smoke suite green. [[E2E Test Gotchas]] memory updated)

Files:

  • Edit: castyou-frontend/packages/design-system/src/components/Input (+ Select, Textarea, Checkbox, Radio, Switch) — generate id (useId), associate label via htmlFor, wire aria-describedby for error/help text, aria-invalid on error
  • Edit: design-system color tokens — adjust failing pairs from the audit sweep in both themes (keep brand hues, shift lightness)
  • Edit: Modal, Drawer/NavDrawer, NotificationBell dropdown, menus — focus trap, Esc close, focus restore to trigger, aria-modal/role semantics
  • Edit: Button (icon-only variant requires aria-label — make it a TS-required prop), Badge/status chips (icon or text alongside color), DataTable (caption/scope, aria-sort), Pagination (current-page announcement), Skeleton (aria-busy on containers)
  • Edit: global focus-visible ring token meeting 3:1 against all surfaces; transitions honor prefers-reduced-motion
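
The label and error wiring for form fields reduces to deriving a consistent set of ids and ARIA attributes from one base id. A framework-free sketch; `fieldA11y` and the `-error` / `-help` id suffixes are assumptions, and in the React components the base id would come from `useId` as listed above.

```typescript
interface FieldState {
  id: string; // base id; in React this would come from useId()
  error?: string;
  help?: string;
}

function fieldA11y(state: FieldState) {
  const errorId = state.error ? `${state.id}-error` : undefined;
  const helpId = state.help ? `${state.id}-help` : undefined;
  // Error first so assistive tech announces it before the help text.
  const describedBy = [errorId, helpId]
    .filter((x): x is string => x !== undefined)
    .join(" ");
  return {
    label: { htmlFor: state.id },
    input: {
      id: state.id,
      "aria-describedby": describedBy === "" ? undefined : describedBy,
      "aria-invalid": state.error ? true : undefined,
    },
    errorId, // id to put on the error message element
    helpId, // id to put on the help text element
  };
}
```

With the label tied to the input through `htmlFor`/`id`, tests can select the field by its label text instead of its placeholder.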

Notes:

  • The label association fix changes how tests select inputs — update affected specs to getByLabelText and update the [[E2E Test Gotchas]] memory/Playwright helpers (the placeholder workaround becomes obsolete; that's the point).
  • Token changes ripple everywhere — visual-check the admin area and both themes; this is a contrast fix, not a redesign.

Acceptance criteria:

  • DS axe smoke suite green across all components/states
  • All token pairs from the audit pass AA in both themes
  • Keyboard-only operation of every interactive DS component verified

FE-A11Y-001 — App & landing remediation: structure, flows, media

  • [x] Done (2026-06-15 — landmarks, skip-to-content link, 44px tab targets + aria-current, live regions for unread/async (i18n en/pt/es), alt from media titles, landing semantic main. Deferred: axe Playwright suite [-> TEST-PSA-001], 200% zoom manual pass)

Files:

  • Edit: apps/app shell + routes — landmarks, per-route document.title, heading hierarchy, skip-to-content link
  • Edit: feed/portfolio/jobs/discover screens — alt text from media titles, accessible names on icon actions (follow, report, like, share), live-region announcements for async mutations
  • Edit: apps/landing — same structural pass (semantic sections, image alts, form labels on waitlist/contact)
  • Edit: castyou-automated-tests — add the axe Playwright pass from the audit as a permanent suite

Notes:

  • Landing a11y/security fixes are in scope despite the Epic 15 gate ([[Landing Polish Gate]]) — this is compliance, not polish. No visual redesign of landing here.

Acceptance criteria:

  • Axe Playwright suite green on every app + landing route
  • Critical flows (register → apply; post job; message) completable with keyboard only and coherent under VoiceOver
  • All P0/P1 a11y findings closed

TEST-PSA-001 — CI guards so the wins stick

  • [x] Done — RESHAPED 2026-06-28. The original PR-gated, cross-repo design didn't fit reality: we push directly to dev (CI only fired on PRs/push:main, so it had not run since 2026-05-26), the repos are private on the free plan (branch protection / required checks return 403 Upgrade to GitHub Pro, so "required check" is impossible and moot with no PRs), and the two highest-cost gates needed a cross-repo PAT that was never set. Decision (user, 2026-06-28): lean CI = fast single-repo gates only; move real enforcement to a local pre-push hook.
    • CI kept (both repos, .github/workflows/ci.yml): typecheck, test (BE unit incl. authz-matrix; FE app + DS incl. axe smoke), audit (pnpm audit --audit-level high, blocking). Trigger now push: [main, dev] + pull_request: [main, dev] so direct pushes to dev actually run.
    • Removed from CI (heavy / service-dependent / cross-repo): BE migration-drift (needs a Postgres service), FE bundle-size budget (needs a full prod build), FE graphql-contract (cross-repo schema fetch + PAT), and the entire full-stack axe E2E workflow in castyou-automated-tests (Postgres/Mongo/Redis + dual-repo bring-up + seed; deleted .github/workflows/ci.yml). The artifacts they ran still exist and are runnable locally on demand: pnpm --filter @castyou/app size:check, pnpm --filter @castyou/app gql:check (set BACKEND_SCHEMA_PATH), and pnpm test:axe (axe spec retained at tests/axe-accessibility.spec.ts).
    • Enforcement = local pre-push hook (.githooks/pre-push in BE + FE, wired via core.hooksPath, auto-set on install by a prepare script). Runs typecheck && test && audit with CI=true (so vitest runs once); blocks the push on failure; bypass with git push --no-verify. This is the only thing that can actually block on the free plan.
    • CI_REPO_TOKEN is no longer needed (the only two jobs that used it were removed).

Notes:

  • Tradeoff accepted: migration-drift, bundle-size, gql-contract and full-stack a11y are no longer auto-enforced on push. They remain one local command each; re-promote any of them (e.g. to a pre-beta workflow_dispatch or a beefier plan's required checks) if/when the repo plan or workflow changes.
  • If the team later adopts a PR-into-dev flow or upgrades to GitHub Pro/Team, required checks become possible — revisit then.

Acceptance criteria (revised):

  • Every push to dev runs typecheck + test + audit in CI (reporting), and the pre-push hook blocks a failing push locally.
  • A new resolver without an authz-matrix entry, or a high+ dependency advisory, fails both the hook and CI.