Opening — Why this matters now

Functional dependency (FD) discovery has quietly become a victim of its own success. Modern algorithms can enumerate everything—and that is precisely the problem. On realistic schemas, exhaustive FD discovery produces hundreds of thousands of valid dependencies, most of which are technically correct and practically useless. Computationally expensive. Cognitively overwhelming. Operationally irrelevant.

This paper makes a blunt claim: information overload in FD discovery is self‑inflicted. If you care about normalization, storage efficiency, or update anomalies, you do not need all FDs. You need the most redundant ones.

Background — Context and prior art

Classical FD discovery algorithms (TANE, HyFD, FDHITS, and friends) share a common philosophy: completeness above all. They traverse an exponential attribute lattice, validate candidates using tuple partitions or difference sets, and return the full minimal FD set.

Two structural problems follow:

  1. Computational blow‑up — exponential in attributes, quadratic in tuples.
  2. Result explosion — valid but near‑key FDs dominate output with zero practical payoff.

Previous work introduced redundancy count as a relevance metric: how much duplicated information an FD explains. But until now, redundancy was only used after discovery, as a ranking filter. The core search process remained blind to relevance.

Analysis — What the paper actually does

The paper reframes FD discovery as a top‑k retrieval problem under redundancy count. Instead of discovering everything and sorting later, it integrates redundancy into the search itself.

Redundancy as a pruning signal

For a valid FD $X \to A$, redundancy is:

$$ \text{red}(X \to A) = |r| - |\pi_X| $$

where $|r|$ is the number of tuples in the relation instance and $|\pi_X|$ is the number of equivalence classes induced by $X$.
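
As a toy illustration (a minimal Python/pandas sketch; the column names and helper function are hypothetical, not from the paper), redundancy can be computed directly from a group-by:

```python
import pandas as pd

def redundancy(df: pd.DataFrame, lhs: list[str]) -> int:
    """Redundancy of any valid FD with left-hand side `lhs`:
    number of tuples minus the number of equivalence classes
    (distinct value combinations) induced by `lhs`."""
    num_classes = df.groupby(lhs, dropna=False).ngroups   # |pi_X|
    return len(df) - num_classes                           # |r| - |pi_X|

# Toy table (hypothetical columns): zip determines city, so every
# repeated zip value carries one redundant copy of the city value.
df = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94103", "94103"],
    "city": ["NYC",   "NYC",   "NYC",   "SF",    "SF"],
})
print(redundancy(df, ["zip"]))   # 5 tuples - 2 classes = 3
```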

Key observation: adding attributes to the LHS can only refine the partition, and finer partitions mean less redundancy. Redundancy is therefore anti‑monotone along the LHS lattice: the value measured at $X$ upper‑bounds the redundancy of every FD whose left‑hand side contains $X$.

This monotonicity enables aggressive pruning: if the maximum possible redundancy of a candidate falls below the current top‑k threshold, the entire subtree can be skipped.
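
A sketch of the resulting prune test, reusing the hypothetical `redundancy()` helper and toy table from above:

```python
def subtree_can_be_skipped(df, lhs, tau):
    """Prune test, reusing the hypothetical redundancy() helper above.
    Adding attributes to `lhs` only refines its partition, so
    redundancy(df, lhs) upper-bounds red(Y -> A) for every Y containing lhs.
    If even that bound is below the current top-k threshold tau,
    the whole subtree rooted at lhs can be skipped unvalidated."""
    return redundancy(df, lhs) < tau

# With the toy table above: redundancy(df, ["zip"]) == 3, and extending the
# LHS can never increase it; redundancy(df, ["zip", "city"]) is still 3 here
# and would drop to 0 if the added attribute were a key.
```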

SDP: Selective‑Discovery‑and‑Prune

The proposed algorithm (SDP) builds on FDHITS’ hypergraph‑based enumeration but adds three critical mechanisms:

| Component | Purpose | Effect |
| --- | --- | --- |
| Redundancy upper bound | Safe pruning | Cuts low‑value branches early |
| Global top‑k threshold $\tau$ | Dynamic cutoff | Tightens pruning over time |
| Shared heap across RHS | Cross‑attribute learning | Early wins prune later searches |

Validation—still expensive—is now reserved for candidates that can actually enter the top‑k set.
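
To make the mechanics concrete, here is a minimal, hypothetical sketch of redundancy‑aware top‑k search: a single min‑heap shared across all RHS attributes holds the k best FDs found so far, its smallest redundancy serves as the running threshold $\tau$, and candidates whose bound falls below $\tau$ are skipped. It is a breadth‑first toy, not the paper's hypergraph enumeration or its tighter bounds.

```python
import heapq
from itertools import combinations

import pandas as pd

def redundancy(df, lhs):
    """|r| - |pi_X|: tuples minus equivalence classes induced by lhs."""
    return len(df) - df.groupby(list(lhs), dropna=False).ngroups

def topk_fds(df: pd.DataFrame, k: int, max_lhs: int = 2):
    """Hypothetical sketch of redundancy-aware top-k FD search."""
    heap = []  # (redundancy, lhs, rhs): one min-heap shared across all RHS
    cols = list(df.columns)
    for size in range(1, max_lhs + 1):
        for lhs in combinations(cols, size):
            bound = redundancy(df, lhs)        # bound for lhs and all its supersets
            tau = heap[0][0] if len(heap) == k else 0
            if bound < tau:                    # can never enter the top-k
                continue                       # (a real search also skips the supersets)
            for rhs in (c for c in cols if c not in lhs):
                # Validate lhs -> rhs: every lhs class must hold a single rhs value.
                valid = (df.groupby(list(lhs), dropna=False)[rhs]
                           .nunique(dropna=False).max() <= 1)
                if not valid:
                    continue
                entry = (bound, lhs, rhs)
                if len(heap) < k:
                    heapq.heappush(heap, entry)
                elif bound > heap[0][0]:
                    heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)
```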

Findings — Results with visualization logic

Across 40+ real datasets, SDP consistently outperforms exhaustive discovery.

Performance summary

| Dimension | Exhaustive (FDR) | SDP |
| --- | --- | --- |
| Runtime | Explodes on wide schemas | 10–1000× faster |
| Memory | Frequent OOM failures | Sub‑GB on wide data |
| Candidates validated | Hundreds of thousands | Tens to hundreds |
| Output quality | Unfiltered | High‑redundancy only |

On the Flight dataset (109 attributes), exhaustive discovery generates nearly one million FDs. SDP visits less than 0.01% of the search space—and still returns the exact top‑k.

Why this works

Three optimizations matter:

  1. Attribute ordering — search low‑cardinality attributes first to raise $\tau$ quickly.
  2. Pairwise bounds (PCM) — detect “hidden keys” early using attribute pairs.
  3. Global best‑first scheduling — always explore the most promising branch across all RHS attributes.

The combined effect is not incremental. It is multiplicative.
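
A rough sketch of the first heuristic (hypothetical; the paper's exact ordering rule may differ): low‑cardinality attributes induce coarse partitions and therefore high redundancy, so exploring them first raises $\tau$ early.

```python
def lhs_exploration_order(df):
    """Order single-attribute LHS candidates by ascending cardinality.
    Low-cardinality attributes induce coarse partitions, hence large
    redundancy, so validating them first pushes the top-k threshold tau
    up early and lets later branches be pruned more aggressively."""
    return sorted(df.columns, key=lambda col: df[col].nunique(dropna=False))
```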

Implications — What this means in practice

This paper quietly redefines what “FD discovery” should mean in operational systems.

  • For database normalization, SDP surfaces exactly the dependencies that cause real storage waste.
  • For data quality, high‑redundancy FDs highlight synonym columns, derived fields, and logging invariants.
  • For large schemas, exhaustive discovery is no longer defensible as a default.

More broadly, the work demonstrates a pattern that extends beyond databases: ranking‑aware search beats exhaustive correctness when relevance is well‑defined and monotone.

Conclusion — Less is more, provably

SDP does not approximate. It does not sample away correctness. It simply refuses to waste time on dependencies that do not matter.

By turning redundancy from a passive metric into an active pruning signal, the paper shows that exact top‑k discovery can scale where exhaustive methods collapse. The message is refreshingly unsentimental: if an FD explains no redundancy, it explains nothing worth storing.

Cognaptus: Automate the Present, Incubate the Future.