9000x-onboarding · 05 — Healthcheck SQL fix (vector_boundary false-positive)
9000x — Healthcheck vector_boundary SQL fix
Defect surfaced by live onboarding
Pre-9000x, iu_vector_sync_point held 60 unique IUs across 61 indexed
points — just one multi-chunk IU (d3ad5874-… with 2 chunks). The
healthcheck SQL was written to tolerate exactly that single multi-chunk:
-- before (cutter_agent/iu_core/healthcheck.py @ vector_boundary surface):
SELECT count(*) AS sync_point_rows,
count(DISTINCT unit_id) AS unique_units,
(count(DISTINCT unit_id) = count(*)
OR count(*) - count(DISTINCT unit_id) <= 1) AS per_iu_boundary_ok
FROM public.iu_vector_sync_point WHERE sync_status='indexed';
After 9000x onboarding the table has 141 unique IUs across 149 indexed
points — 8 multi-chunk IUs (1 KT-B + 7 from DIEU-35). The old check
returned false because 149 - 141 = 8 > 1, and the Mac cron at
11:10:00Z flipped vector_boundary.ok=false with reason
per-IU boundary breach — a false positive.
Root semantic
The DB-level CHECK iu_vector_sync_point_boundary_chk forbids duplicate
(unit_id, chunk_index) AND requires consistent chunk_count across the
rows of one IU. Multi-chunk IUs are FINE as long as their rows have
distinct chunk_index. The pre-9000x SQL was a stand-in for that
invariant; it was correct only while the corpus had at most one
multi-chunk IU.
Fix applied
-- after:
SELECT count(*) AS sync_point_rows,
count(DISTINCT unit_id) AS unique_units,
NOT EXISTS (
SELECT 1 FROM public.iu_vector_sync_point
WHERE sync_status='indexed' AND unit_id IS NOT NULL
GROUP BY unit_id, chunk_index
HAVING count(*) > 1
) AS per_iu_boundary_ok
FROM public.iu_vector_sync_point WHERE sync_status='indexed';
This surfaces the EXACT same invariant as the DB CHECK with a single
GROUP BY scan. Live run returns 149|141|t — GREEN.
Verification
- All 7 healthcheck surfaces GREEN after the fix (
overall_ok=True):
three_axis_cache: in_sync (table=163, view=163)
directus_collection: 163 rows / 1 read-permission
qdrant_collection: iu_core_iu_chunks (149 indexed)
auto_refresh_trigger: gate=false fires_24h=1
vector_boundary: 149 pts / 141 unique (per_iu_boundary_ok=true)
write_gates: all 6 inert
operator_runtime: open_runs=0 failed_24h=0 active_leases=0
- All 10 tests in
tests/test_iu_core_5000x_healthcheck.pystill pass (they use a fake fixtureper_iu_boundary_ok=True/False, not the SQL itself). - Mac cron will pick up the fix on the next
*/10fire (the wrapper invokescutter_agent/iu_core/healthcheck.pydirectly from the repo).
Lesson
Healthcheck SQL that approximates an invariant by row arithmetic is brittle. When the DB layer already enforces the invariant via a CHECK, the healthcheck should surface the SAME predicate (GROUP BY + HAVING) so the operator sees what the DB sees — not a heuristic that holds only for a transient corpus shape.