Opening — Why this matters now

Human Activity Recognition (HAR) has quietly become one of those applied ML fields where headline accuracy keeps improving, while real-world reliability stubbornly refuses to follow. Models trained on pristine datasets collapse the moment the sensor moves two centimeters, the user changes, or time simply passes. The industry response has been predictable: larger models, heavier architectures, and now—inevitably—LLMs. The paper behind HAROOD argues that this reflex is misplaced. The real problem is not model capacity. It is evaluation discipline.

Background — From neat datasets to messy reality

Classic HAR benchmarks reward models that interpolate well within a narrow data regime. But deployment settings introduce domain shifts that benchmarks rarely isolate: cross-person differences, sensor placement changes, device heterogeneity, and temporal drift. Prior work on domain generalization (DG) and out-of-distribution (OOD) learning offers a zoo of techniques—Mixup, adversarial alignment, invariant risk minimization, meta-learning—but evidence of their effectiveness in HAR has been scattered and hard to compare.

HAROOD enters as a corrective. Rather than proposing yet another algorithm, it proposes something more uncomfortable: a standardized way to discover that many existing methods do not generalize nearly as well as their papers imply.

Analysis — What HAROOD actually does

HAROOD is a PyTorch-based benchmark designed explicitly for OOD generalization in sensor-based HAR. It unifies:

  • Six public sensor datasets (e.g., DSADS, UCI-HAR, PAMAP2, WESAD)
  • Four domain-shift scenarios: cross-person, cross-position, cross-dataset, and cross-time (a cross-person evaluation sketch follows this list)
  
  • Sixteen OOD/DG algorithms, spanning data manipulation, representation learning, and strategy-based methods
  • Two backbone families: CNNs and Transformers
  • Two model-selection protocols, including oracle selection for diagnostic clarity
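
To make the cross-person scenario concrete, here is a minimal leave-one-person-out evaluation loop in PyTorch. This is a sketch under assumptions, not HAROOD's actual API: the array names (`windows`, `labels`, `subjects`), the `build_model` factory, and the training hyperparameters are illustrative.

```python
# Sketch of a cross-person (leave-one-person-out) OOD evaluation loop.
# Dataset layout, model factory, and hyperparameters are illustrative only.
import numpy as np
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate_cross_person(windows, labels, subjects, build_model, epochs=10, device="cpu"):
    """windows: (N, C, T) sensor windows; labels: (N,) activity IDs;
    subjects: (N,) subject IDs; build_model: zero-arg factory returning a classifier."""
    accuracies = {}
    for held_out in np.unique(subjects):
        train_mask = subjects != held_out   # source domains: every other person
        test_mask = subjects == held_out    # target domain: the unseen person
        train_ds = TensorDataset(torch.as_tensor(windows[train_mask], dtype=torch.float32),
                                 torch.as_tensor(labels[train_mask], dtype=torch.long))
        test_ds = TensorDataset(torch.as_tensor(windows[test_mask], dtype=torch.float32),
                                torch.as_tensor(labels[test_mask], dtype=torch.long))

        model = build_model().to(device)
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()

        model.train()
        for _ in range(epochs):
            for x, y in DataLoader(train_ds, batch_size=64, shuffle=True):
                opt.zero_grad()
                loss_fn(model(x.to(device)), y.to(device)).backward()
                opt.step()

        model.eval()
        correct = total = 0
        with torch.no_grad():
            for x, y in DataLoader(test_ds, batch_size=256):
                pred = model(x.to(device)).argmax(dim=1).cpu()
                correct += (pred == y).sum().item()
                total += y.numel()
        accuracies[held_out] = correct / total
    return accuracies   # report per-person accuracy, not just the average
```

The same skeleton covers the other scenarios by swapping the grouping variable (sensor position, source dataset, or recording period) for the subject ID. Model selection matters too: picking hyperparameters on held-out source-domain data versus on the target domain itself (the "oracle" protocol) can change which method looks best.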

Crucially, HAROOD treats evaluation itself as a first-class research object. Distribution distances between domains are measured explicitly, revealing that performance degradation is not just task-dependent but shaped by dimensionality and class structure. In other words: not all domain shifts are created equal, and pretending otherwise has distorted past conclusions.
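
As a worked illustration of measuring a distance between two domains, the sketch below computes a maximum mean discrepancy (MMD) with an RBF kernel over backbone features. The metric choice, bandwidth, and feature dimensions are assumptions for illustration; the paper's exact distance measure may differ.

```python
# Illustrative domain-distance computation: RBF-kernel MMD^2 between feature
# batches from two domains (e.g., two wearers). One common metric; not
# necessarily the exact distance HAROOD reports.
import torch

def rbf_mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of MMD^2 between samples x (n, d) and y (m, d)."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# Stand-in backbone features from two subjects; a real run would use the
# encoder's outputs on each domain's windows.
feats_a = torch.randn(512, 128)
feats_b = torch.randn(512, 128) + 0.5   # shifted to mimic a domain gap
print(f"MMD^2 between domains: {rbf_mmd2(feats_a, feats_b).item():.4f}")
```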

Findings — What breaks, what survives

Several results are quietly damning:

| Observation | Implication |
| --- | --- |
| Large models underperform small ones in OOD settings | Capacity amplifies spurious correlations |
| Mixup-style augmentation (sketched below) helps large models more than small ones | Data manipulation scales better than alignment |
| Some meta-learning methods collapse to single-class predictions | Optimization stability matters more than theory |
| Classical alignment methods remain competitive | Boring baselines are still dangerous |
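
For reference, the "Mixup-style augmentation" in the second row is the kind of data-manipulation method sketched below: convex combinations of input windows and their labels (Zhang et al., 2018). Shapes, the Beta parameter, and the training step are illustrative assumptions, not HAROOD's exact implementation.

```python
# Minimal Mixup sketch for sensor windows: convex combinations of inputs and
# one-hot labels. Shapes and alpha are illustrative.
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=0.2):
    """x: (B, C, T) sensor windows; y: (B,) integer labels."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_soft = F.one_hot(y, num_classes).float()
    y_mix = lam * y_soft + (1 - lam) * y_soft[perm]
    return x_mix, y_mix   # train with soft-label cross-entropy on the mixed batch
```

In recent PyTorch versions, `nn.CrossEntropyLoss` accepts probability targets directly, so the mixed labels can be fed to the usual loss without extra machinery.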

The authors also test LLM-style approaches (e.g., HARGPT on GPT-OSS-120B). The verdict is blunt: performance is significantly worse than that of lightweight CNN/Transformer models, at far higher cost. Memorization, not abstraction, appears to dominate.

Implications — For practitioners and researchers

HAROOD sends three messages that extend beyond HAR:

  1. Benchmark design is policy. What you measure decides what survives publication.
  2. OOD robustness is structural, not architectural. Bigger models do not fix weak objectives.
  3. LLMs are not free generalization engines. In sensor domains, they often memorize instead of adapt.

For applied teams, the takeaway is pragmatic: invest in evaluation realism before investing in model scale. For researchers, HAROOD is an invitation to stop hiding failure modes behind averaged accuracy.

Conclusion

HAROOD is not flashy. It does not promise state-of-the-art numbers. Instead, it does something rarer: it makes it harder to lie to ourselves about generalization. In 2025, that may be the most valuable contribution of all.

Cognaptus: Automate the Present, Incubate the Future.