Link Detection
How KTX's relationship detection performs on real-world schemas.
KTX infers foreign key relationships between tables even when the database declares no primary keys or foreign key constraints. This is critical for analytics warehouses, where constraints are rarely enforced. This page documents the methodology, scoring pipeline, and a reproducible benchmark you can run yourself.
What this measures
Most analytics warehouses — Snowflake, BigQuery, Redshift — don't enforce referential integrity constraints. Tables like `fct_product_events` reference `dim_accounts` by convention (`account_id` → `id`), but nothing in the schema says so.
KTX's relationship detection discovers these links automatically. The benchmark measures how accurately it recovers known foreign key relationships from a schema with all declared constraints removed — the hardest operating mode.
Metrics tracked:
- Accepted — relationships scored above the accept threshold (default 0.85) and written to the project manifest
- Review — relationships scored between the review threshold (0.55) and accept threshold, flagged for human review
- Rejected — relationships scored below the review threshold
- Skipped — relationships not evaluated (e.g., filtered by candidate limits)
Methodology
Detection pipeline
Relationship detection runs as a multi-stage pipeline during `ktx dev scan`:
- Candidate generation — scans the schema for potential FK relationships using multiple heuristics: exact column name matches, normalized table name matching, name inflection (singular/plural), column suffix patterns (`_id`, `_key`, `_code`, `_uuid`), self-references (`parent_id`, `manager_id`), and optionally embedding similarity and LLM proposals.
- Column profiling — samples up to 10,000 rows per column (configurable via `profile_sample_rows`) to collect statistics: row counts, null rates, distinct value counts, uniqueness ratios, sample values, and text length ranges.
- Validation — tests each candidate relationship against actual data by measuring target uniqueness, source coverage, violation ratio, and value overlap between child and parent columns.
- Scoring — combines seven weighted signals into a confidence score:
| Signal | Weight | What it captures |
|---|---|---|
| Name similarity | 0.24 | How closely column/table names match FK conventions |
| Value overlap | 0.22 | What percentage of FK values exist in the PK column |
| Profile uniqueness | 0.22 | How unique the target column values are |
| Type compatibility | 0.10 | Whether data types are compatible (hard gate — score is 0 if incompatible) |
| Embedding similarity | 0.10 | Semantic similarity between column names |
| Profile null rate | 0.08 | Presence of non-null values |
| Structural prior | 0.04 | Baseline structural hints from schema conventions |
Each signal is normalized to [0, 1], multiplied by its weight, and summed. The final confidence is 0.56 + (weighted_sum × 0.65), clamped to [0, 1].
- Graph resolution — resolves conflicts when multiple candidates target the same column, detects primary keys (by name pattern and validation), and classifies each relationship into `accepted`, `review`, or `rejected` based on thresholds.
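The scoring formula above can be sketched in code. The signal names and weights mirror the table, and the hard gate on type compatibility is applied before the weighted sum; the function itself is a hypothetical illustration, not KTX's actual implementation:

```typescript
// Signal weights from the table above (they sum to 1.0).
const WEIGHTS: Record<string, number> = {
  nameSimilarity: 0.24,
  valueOverlap: 0.22,
  profileUniqueness: 0.22,
  typeCompatibility: 0.1,
  embeddingSimilarity: 0.1,
  profileNullRate: 0.08,
  structuralPrior: 0.04,
};

// Each signal is assumed to already be normalized to [0, 1].
function confidence(signals: Record<string, number>): number {
  // Hard gate: incompatible types zero the score outright.
  if ((signals.typeCompatibility ?? 0) === 0) return 0;
  let weightedSum = 0;
  for (const [name, weight] of Object.entries(WEIGHTS)) {
    weightedSum += weight * (signals[name] ?? 0);
  }
  // Final confidence: 0.56 + (weighted_sum × 0.65), clamped to [0, 1].
  const raw = 0.56 + weightedSum * 0.65;
  return Math.min(1, Math.max(0, raw));
}
```

Note that with every signal at zero (but compatible types) the formula still floors at 0.56 before clamping, which is why the type gate has to short-circuit to zero rather than rely on the weighted sum.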
Threshold configuration
```yaml
scan:
  relationships:
    accept_threshold: 0.85
    review_threshold: 0.55
```

Relationships scoring above `accept_threshold` are automatically accepted into the project manifest. Those between `review_threshold` and `accept_threshold` are flagged for analyst review. Below `review_threshold`, they're rejected.
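Classification against these thresholds reduces to a pair of comparisons. A minimal sketch, assuming inclusive boundaries (KTX's exact handling of a score landing precisely on 0.85 or 0.55 may differ):

```typescript
type Verdict = "accepted" | "review" | "rejected";

// Hypothetical classifier using the default thresholds from the config above.
function classify(confidence: number, accept = 0.85, review = 0.55): Verdict {
  if (confidence >= accept) return "accepted";
  if (confidence >= review) return "review";
  return "rejected";
}

// classify(0.9) → "accepted"; classify(0.6) → "review"; classify(0.4) → "rejected"
```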
Test fixture
The benchmark uses the Orbit-style product warehouse — a synthetic schema modeled after a real SaaS analytics warehouse with all declared constraints removed. The fixture is a SQLite database with 6 tables:
| Table | Role | Estimated rows |
|---|---|---|
| dim_accounts | Dimension | 3 |
| dim_users | Dimension | 4 |
| dim_workspaces | Dimension | 4 |
| fct_product_events | Fact | 5 |
| fct_invoices | Fact | 3 |
| support_tickets | Fact | 4 |
Ground truth: 6 primary keys (one `id` column per table) and 9 foreign key relationships, all `many_to_one`:
| Source column | Target |
|---|---|
| dim_users.account_id | dim_accounts.id |
| dim_workspaces.account_id | dim_accounts.id |
| dim_workspaces.user_id | dim_users.id |
| fct_product_events.account_id | dim_accounts.id |
| fct_product_events.user_id | dim_users.id |
| fct_product_events.workspace_id | dim_workspaces.id |
| fct_invoices.account_id | dim_accounts.id |
| support_tickets.account_id | dim_accounts.id |
| support_tickets.user_id | dim_users.id |
The fixture runs in multiple modes to isolate the contribution of each pipeline stage: with LLM disabled, profiling disabled, validation disabled, and embeddings disabled.
Results
Results for the default configuration will be added after the benchmark run is finalized.
Reproducing the benchmark
Prerequisites
- Node.js 22+
- pnpm
- The KTX repository cloned and dependencies installed (`pnpm install`)
Running
From the repository root:
```shell
pnpm run relationships:verify-orbit
```

This runs `ktx dev scan` against the bundled SQLite fixture with enrichment disabled, then generates a verification report at `examples/orbit-relationship-verification/reports/orbit-verification.md`. The report includes the full relationship summary, enrichment details, artifact paths, and any warnings.
Custom project
To run verification against your own database (e.g., a local Orbit project):
```shell
KTX_ORBIT_PROJECT_DIR=/path/to/your-project pnpm run relationships:verify-orbit
```

Configuration
The benchmark project configuration lives at `examples/orbit-relationship-verification/ktx.yaml`:
```yaml
scan:
  enrichment:
    backend: none
  relationships:
    enabled: true
    llm_proposals: false
    accept_threshold: 0.85
    review_threshold: 0.55
    profile_sample_rows: 10000
    validation_concurrency: 4
```

Adjust `accept_threshold` and `review_threshold` to see how threshold changes affect the accepted/review/rejected distribution. Lower thresholds accept more relationships (higher recall, lower precision); higher thresholds are more conservative.
Broader benchmark suite
Beyond the Orbit fixture, KTX includes a full benchmark corpus at `packages/context/test/fixtures/relationship-benchmarks/` with fixtures across multiple tiers:
- Unit — minimal schemas testing individual heuristics
- Row-bearing — small schemas with data for validation testing
- Product — full warehouse schemas like the Orbit fixture
Fixtures from public datasets (Chinook, Sakila, AdventureWorks, Northwind) supplement the synthetic fixtures. The benchmark runner measures precision, recall, and F1 for both primary key and foreign key detection across all fixtures and modes.
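Measuring precision, recall, and F1 against a fixture's ground truth reduces to set comparison over detected versus known edges. A minimal sketch, where the `source->target` edge-string encoding and the function name are illustrative assumptions, not KTX's actual report format:

```typescript
// Compare predicted FK edges against ground truth.
// Edges are encoded as strings, e.g. "dim_users.account_id->dim_accounts.id".
function prf(predicted: Set<string>, truth: Set<string>) {
  let tp = 0;
  for (const edge of predicted) if (truth.has(edge)) tp++;
  const precision = predicted.size ? tp / predicted.size : 0;
  const recall = truth.size ? tp / truth.size : 0;
  const f1 =
    precision + recall ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1 };
}
```

Lowering the thresholds trades these metrics against each other: more accepted edges raises recall (more true links recovered) at the cost of precision (more spurious links accepted).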