Opening — Why this matters now
Functional dependency (FD) discovery has quietly become a victim of its own success. Modern algorithms can enumerate everything—and that is precisely the problem. On realistic schemas, exhaustive FD discovery produces hundreds of thousands of valid dependencies, most of which are technically correct and practically useless. Computationally expensive. Cognitively overwhelming. Operationally irrelevant.
This paper makes a blunt claim: information overload in FD discovery is self‑inflicted. If you care about normalization, storage efficiency, or update anomalies, you do not need all FDs. You need the most redundant ones.
Background — Context and prior art
Classical FD discovery algorithms (TANE, HyFD, FDHITS, and friends) share a common philosophy: completeness above all. They traverse an exponential attribute lattice, validate candidates using tuple partitions or difference sets, and return the full minimal FD set.
Two structural problems follow:
- Computational blow‑up — exponential in attributes, quadratic in tuples.
- Result explosion — valid but near‑key FDs, which explain almost no redundancy, dominate the output with zero practical payoff.
Previous work introduced the redundancy count as a relevance metric: how much duplicated information an FD explains. Until now, though, redundancy was applied only after discovery, as a post‑hoc ranking filter; the core search process remained blind to relevance.
Analysis — What the paper actually does
The paper reframes FD discovery as a top‑k retrieval problem under redundancy count. Instead of discovering everything and sorting later, it integrates redundancy into the search itself.
Redundancy as a pruning signal
For a valid FD $X \to A$, the redundancy count is:
$$ \text{red}(X \to A) = |r| - |\pi_X| $$
where $|r|$ is the number of tuples in the relation $r$ and $|\pi_X|$ is the number of equivalence classes induced by $X$. Since a valid FD assigns one $A$-value per class, this counts exactly the tuples whose $A$-value is a duplicate the dependency could factor out.
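To make the definition concrete, here is a minimal sketch in plain Python, assuming a relation stored as a list of tuples with attributes addressed by column index. The toy data and function name are illustrative, not the paper's implementation:

```python
def redundancy(rows, lhs):
    """red(X -> A) = |r| - |pi_X|: number of tuples minus the number
    of equivalence classes induced by the LHS attributes X."""
    classes = {tuple(row[i] for i in lhs) for row in rows}
    return len(rows) - len(classes)

rows = [
    ("DE", "Berlin", 10115),
    ("DE", "Berlin", 10117),
    ("FR", "Paris",  75001),
]
print(redundancy(rows, lhs=(0,)))    # X = {country}: 3 tuples, 2 classes -> 1
print(redundancy(rows, lhs=(0, 1)))  # adding city does not refine here -> still 1
```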
Key observation: adding attributes to the LHS can only refine the partition, and a finer partition means less redundancy. Redundancy therefore never increases as the LHS grows, so the value at any lattice node is a safe upper bound for its entire subtree.
This monotonicity enables aggressive pruning: if the maximum possible redundancy of a candidate falls below the current top‑k threshold, the entire subtree can be skipped.
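Continuing the sketch above (again illustrative, with `tau` standing in for the running top‑k threshold), the pruning test itself is a one‑liner:

```python
def can_prune(rows, lhs, tau):
    """For any Y containing X, red(Y -> A) <= red(X -> A): the redundancy
    at X bounds its entire subtree. If even that bound falls below the
    current top-k threshold tau, nothing below X can enter the top-k."""
    return redundancy(rows, lhs) < tau  # reuses redundancy() from above
```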
SDP: Selective‑Discovery‑and‑Prune
The proposed algorithm (SDP) builds on FDHITS’ hypergraph‑based enumeration but adds three critical mechanisms:
| Component | Purpose | Effect |
|---|---|---|
| Redundancy upper bound | Safe pruning | Cuts low‑value branches early |
| Global top‑k threshold $\tau$ | Dynamic cutoff | Tightens pruning over time |
| Shared heap across RHS | Cross‑attribute learning | Early wins prune later searches |
Validation—still expensive—is now reserved for candidates that can actually enter the top‑k set.
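Below is a compressed, hypothetical sketch of how the three mechanisms interact. A flat size‑ordered enumeration stands in for FDHITS' hypergraph traversal, minimality handling is omitted, and `holds`/`topk_fds` are illustrative names rather than the paper's code; `redundancy` is the helper from the earlier sketch:

```python
import heapq
from itertools import combinations

def holds(rows, lhs, rhs):
    """Validate X -> A: every X-equivalence class maps to a single A-value."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def topk_fds(rows, attrs, k):
    heap, tau = [], 0  # min-heap of (redundancy, lhs, rhs), shared across all RHS
    for size in range(1, len(attrs)):            # small LHS = coarse partitions first
        for lhs in combinations(attrs, size):
            bound = redundancy(rows, lhs)        # for a valid FD this IS its redundancy
            if len(heap) == k and bound <= tau:
                continue  # a real lattice traversal would skip all supersets of lhs too
            for rhs in (a for a in attrs if a not in lhs):
                if holds(rows, lhs, rhs):        # expensive validation, done last
                    heapq.heappush(heap, (bound, lhs, rhs))
                    if len(heap) > k:
                        heapq.heappop(heap)
                    if len(heap) == k:
                        tau = heap[0][0]         # threshold tightens as wins accrue
    return sorted(heap, reverse=True)
```

Note how the single heap serves every RHS attribute: one early high‑redundancy FD immediately raises $\tau$ for all subsequent searches, and validation runs only after the bound check clears, which is exactly the inversion the paper argues for.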
Findings — Results with visualization logic
Across 40+ real datasets, SDP consistently outperforms exhaustive discovery.
Performance summary
| Dimension | Exhaustive (FDR) | SDP |
|---|---|---|
| Runtime | Explodes on wide schemas | 10–1000× faster |
| Memory | Frequent OOM failures | Sub‑GB on wide data |
| Candidates validated | Hundreds of thousands | Tens to hundreds |
| Output quality | Unfiltered | High‑redundancy only |
On the Flight dataset (109 attributes), exhaustive discovery generates nearly one million FDs. SDP visits less than 0.01% of the search space—and still returns the exact top‑k.
Why this works
Three optimizations matter:
- Attribute ordering — search low‑cardinality attributes first to raise $\tau$ quickly (see the sketch after this list).
- Pairwise bounds (PCM) — detect “hidden keys” early using attribute pairs.
- Global best‑first scheduling — always explore the most promising branch across all RHS attributes.
The combined effect is not incremental. It is multiplicative.
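Of the three, the ordering heuristic is the easiest to illustrate. A minimal sketch over the same list‑of‑tuples relation as before (the function name is hypothetical):

```python
def order_attributes(rows, attrs):
    """Low-cardinality attributes induce coarse partitions and hence
    high-redundancy FDs; exploring them first raises tau early, which
    makes every subsequent bound check more aggressive."""
    return sorted(attrs, key=lambda a: len({row[a] for row in rows}))
```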
Implications — What this means in practice
This paper quietly redefines what “FD discovery” should mean in operational systems.
- For database normalization, SDP surfaces exactly the dependencies that cause real storage waste.
- For data quality, high‑redundancy FDs highlight synonym columns, derived fields, and logging invariants.
- For large schemas, exhaustive discovery is no longer defensible as a default.
More broadly, the work demonstrates a pattern that extends beyond databases: ranking‑aware search beats exhaustive correctness when relevance is well‑defined and monotone.
Conclusion — Less is more, provably
SDP does not approximate. It does not sample away correctness. It simply refuses to waste time on dependencies that do not matter.
By turning redundancy from a passive metric into an active pruning signal, the paper shows that exact top‑k discovery can scale where exhaustive methods collapse. The message is refreshingly unsentimental: if an FD explains no redundancy, it explains nothing worth storing.
Cognaptus: Automate the Present, Incubate the Future.