HAROOD: When Benchmarks Grow Up and Models Stop Cheating

A wearable model can look brilliant in the lab and embarrass itself on Monday morning.

The user changes. The watch slides down the wrist. A sensor is mounted on the chest instead of the pocket. The same person walks differently after fatigue, injury, aging, or simply because life has the terrible habit of not matching the training set. Human Activity Recognition, or HAR, has always lived with this problem. It turns sensor streams from accelerometers, gyroscopes, EMG, ECG, and other wearable or ambient devices into labels such as walking, running, sitting, cycling, or stress state. It is useful precisely because it moves into the real world. That is also where benchmark accuracy goes to die.

The paper behind HAROOD does not solve this by announcing another heroic architecture with a bigger acronym and a slightly more athletic GPU bill. Good. We have enough of those. Instead, Wang Lu, Yao Zhu, and Jindong Wang propose HAROOD: a unified benchmark for out-of-distribution generalization in sensor-based human activity recognition.¹ Its value is not merely that it evaluates 16 OOD algorithms on six public datasets. Its value is that it names the ways HAR systems fail after deployment.

That naming matters. A model that fails on a new user is not failing in the same way as a model that fails when the sensor position changes. A model that survives one dataset may still break across data sources. A model that works in January may drift by April. “OOD robustness” sounds like one technical virtue. HAROOD shows it is closer to a checklist of separate operational risks.

The uncomfortable lesson is simple: neither an OOD algorithm, nor a Transformer backbone, nor an LLM-style interface automatically buys generalization. The method that wins in one condition may stumble in another. A supposedly stronger backbone may help on one dataset and hurt on another. Oracle-style hyperparameter selection improves results, but only because it has information real teams do not get to use. And the early large-model baseline performs much worse than compact HAR models in the reported test. Apparently the sensor stream did not read the AI hype memo.

HAROOD turns deployment failure into four named tests

Most technical papers about generalization can be read as method-first: here is an algorithm, here is the benchmark, here is the number. HAROOD is better read category-first. The paper’s contribution is a benchmark, but the practical meaning of that benchmark comes from its four domain-shift scenarios:

HAROOD scenario	Real deployment question	Paper setup	Business meaning
Cross-person	Does the model work on people it has never seen?	Subjects are split into separate domains across six datasets.	A fitness, eldercare, workplace-safety, or medical-monitoring model cannot assume one body represents another.
Cross-position	Does the model work when the sensor is worn somewhere else?	DSADS sensor positions become separate domains.	A product spec that says “wearable” is not enough; placement becomes part of the model contract.
Cross-dataset	Does the model transfer across data sources?	DSADS, USC-HAD, PAMAP2, and UCI-HAR are aligned into common activity classes and treated as domains.	Data portability across devices, collection protocols, and datasets is not guaranteed by ordinary accuracy.
Cross-time	Does the model remain reliable as signals drift over time?	EMG, PAMAP2, and WESAD are split chronologically into domains.	Continuous monitoring systems need drift testing, not just a launch-day validation report.

This structure is more useful than a generic leaderboard because businesses do not deploy “generalization.” They deploy models into particular failure modes.

A fall-detection system in assisted living mostly worries about cross-person generalization: younger, healthier training subjects do not fully represent older adults. A sports-tracking product may worry about cross-position generalization: the phone moves from hand to pocket, the watch is worn tighter or looser, and the signal changes. A device vendor or insurance platform may worry about cross-dataset generalization: models trained under one data collection protocol are asked to survive another. A health-monitoring system worries about cross-time generalization: sensor drift, behavior changes, physiological variation, and longitudinal non-stationarity.

HAROOD’s category design forces teams to ask the correct question before asking which model is “best.” That is a small act of civilization.

The benchmark is not another method; it is a referee

The paper uses six public sensor-based datasets: DSADS, USC-HAD, UCI-HAR, PAMAP2, EMG, and WESAD. These datasets vary in subject count, activity classes, sampling rates, sensors, and scale. For example, UCI-HAR contains 30 subjects and six activities, while DSADS contains eight subjects and 19 activities. WESAD is not a classic HAR dataset, but the authors include it because physiological and affective signals overlap with behavioral sensing. That inclusion broadens the benchmark, though it also means readers should not treat every dataset as the same kind of activity-recognition task.

The benchmark evaluates 16 methods: ERM, Mixup, DDLearn, DANN, CORAL, MMD, VREx, LAG, MLDG, RSC, GroupDRO, ANDMask, Fish, Fishr, URM, and ERM++. These include general domain-generalization methods and HAR-specific methods. Each is tested with CNN-based and Transformer-based backbones, under two model-selection protocols.

The first protocol is training-domain validation selection. This is the realistic one. The model is selected using validation data from the source domains, not from the unseen target domain. The second is oracle selection. This is diagnostic, not deployable: it assumes the best target-domain parameter choice can be identified. The authors use it to estimate the method’s potential within the searched parameter range.

That distinction is important enough to slow down for. Oracle selection is not a secret trick for practitioners. It is a mirror. If oracle selection beats ordinary validation, the gap tells us model selection itself is a bottleneck. In HAROOD, oracle selection consistently improves over training-domain validation by nearly two percentage points. The business interpretation is not “use oracle tuning.” The business interpretation is: your validation protocol may be choosing the wrong model for the world you are about to enter.
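The difference between the two protocols is easy to state in code. The following is a minimal Python sketch, not the benchmark's implementation: the `Candidate` class, the configuration names, and every accuracy value are hypothetical and chosen only to illustrate the mechanism.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One trained hyperparameter configuration (hypothetical structure)."""
    name: str
    source_val_acc: float  # accuracy on validation data from the training domains
    target_acc: float      # accuracy on the unseen target domain


def select_by_training_domain_validation(candidates):
    # Realistic protocol: only source-domain validation accuracy is visible.
    return max(candidates, key=lambda c: c.source_val_acc)


def select_by_oracle(candidates):
    # Diagnostic protocol: peeks at target-domain accuracy, so it is not deployable.
    return max(candidates, key=lambda c: c.target_acc)


def selection_gap(candidates):
    """Target accuracy lost by selecting on source validation instead of the oracle."""
    realistic = select_by_training_domain_validation(candidates)
    oracle = select_by_oracle(candidates)
    return oracle.target_acc - realistic.target_acc


# Made-up numbers, for illustration only.
candidates = [
    Candidate("config_a", source_val_acc=0.95, target_acc=0.70),
    Candidate("config_b", source_val_acc=0.93, target_acc=0.74),
    Candidate("config_c", source_val_acc=0.90, target_acc=0.72),
]

print(select_by_training_domain_validation(candidates).name)  # config_a
print(select_by_oracle(candidates).name)                      # config_b
print(round(selection_gap(candidates), 2))                    # 0.04
```

In this toy case the configuration that looks best on source validation is the worst of the three on the target. The gap is only measurable after the fact, which is exactly why it works as a diagnostic and not as a procedure.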

Cross-person: the model meets a body it has never seen

Cross-person generalization is the most intuitive failure mode. People move differently. Their gait, muscle patterns, posture, reaction times, and sensor contact differ. A model trained on one group of subjects may treat another group as a distributional ambush.

HAROOD tests cross-person generalization by dividing subjects into separate domains across multiple datasets. The setup is conceptually clean: train on some subject groups, hold out another group, and ask whether the model still recognizes activities. This is the kind of test that matters in healthcare monitoring and assisted living, where the most important users may be exactly the least represented in the training data.
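A rough sketch of such a split, in Python, might look like the following. The round-robin grouping rule, the number of domains, and the subject ids are assumptions made for illustration; the paper's actual subject-to-domain assignment may differ.

```python
def group_subjects(subject_ids, n_domains):
    """Assign whole subjects to domains round-robin (an illustrative grouping rule)."""
    domains = {f"domain_{i}": [] for i in range(n_domains)}
    for position, subject in enumerate(sorted(subject_ids)):
        domains[f"domain_{position % n_domains}"].append(subject)
    return domains


def leave_one_domain_out(domains):
    """Yield (target_name, train_subjects, test_subjects) for each held-out domain."""
    for target in domains:
        train = [s for name, subjects in domains.items() if name != target for s in subjects]
        yield target, train, list(domains[target])


# Eight hypothetical subject ids, grouped into four domains.
subject_domains = group_subjects(range(1, 9), n_domains=4)
splits = list(leave_one_domain_out(subject_domains))

for target, train, test in splits:
    # No subject may appear on both sides of a cross-person split.
    assert not set(train) & set(test)
    print(target, "train:", train, "test:", test)
```

The important design choice is that the split happens at the subject level, before any windowing. Splitting windows at random would let the same person appear in both training and test data, which quietly converts a cross-person test into an in-distribution one.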

The paper’s result is not “method X solves cross-person HAR.” It is messier and therefore more useful. CORAL performs best on the cross-person DSADS task in the authors’ reported findings, but other tasks show different winners. URM performs best on UCI-HAR. ERM performs best in cross-position tasks. Across scenarios, the paper highlights CORAL, Fish, and Fishr as relatively safe choices when the setting is unclear, but “safe” here means comparatively stable, not magically universal.

For product teams, this changes the procurement question. Instead of asking whether a vendor uses an OOD algorithm, ask whether the model has been tested on users excluded from the training distribution. Better: ask which user groups were held out, how subjects were split, and whether class-level errors were inspected. A model that averages well can still fail on activities that matter most for safety.

The paper’s confusion-matrix analysis makes this point concrete. ERM and LAG show different class strengths on the first EMG task: ERM performs well on some classes while LAG performs better on others, and their misclassification patterns differ. The operational lesson is not that one should blindly ensemble them. The lesson is that class-level behavior matters. For fall detection, rehabilitation, workplace safety, or eldercare, a harmless confusion and a dangerous confusion do not have the same cost.
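To see how average accuracy can hide this, consider a small Python sketch. The labels and predictions below are invented for illustration and are not taken from the paper's EMG results; the helper functions are hypothetical names.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_recall(matrix, labels):
    """Fraction of each true class that was predicted correctly."""
    return {
        label: (matrix[i][i] / sum(matrix[i]) if sum(matrix[i]) else 0.0)
        for i, label in enumerate(labels)
    }


labels = ["walk", "sit", "fall"]
y_true = ["walk"] * 4 + ["sit"] * 4 + ["fall"] * 2

# Two invented models with identical accuracy but different error patterns.
pred_a = ["walk"] * 4 + ["sit"] * 4 + ["sit"] * 2
pred_b = ["walk"] * 3 + ["sit"] + ["sit"] * 3 + ["walk"] + ["fall"] * 2

matrix_a = confusion_matrix(y_true, pred_a, labels)
matrix_b = confusion_matrix(y_true, pred_b, labels)

print(accuracy(matrix_a), per_class_recall(matrix_a, labels))
print(accuracy(matrix_b), per_class_recall(matrix_b, labels))
```

Both toy models score 0.8 overall, yet the first never detects a fall and the second detects every one. A leaderboard sorted by accuracy would call them equal.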

Cross-position: the same activity becomes a different signal

Cross-position generalization sounds trivial until one remembers that sensors do not observe activities directly. They observe motion and physiological signals from a physical location. A wrist, chest, ankle, pocket, and upper arm do not produce interchangeable time series.

HAROOD uses DSADS to turn five body-worn sensor positions into domains. The same activity, collected from a different body position, becomes a domain shift. This is a very practical test. Consumer products often assume users will wear devices correctly. Users respond by being users.

The paper reports that ERM achieves the highest accuracy in cross-position tasks, while performance disparities across algorithms are substantial. This result should make practitioners pause. It means the simple baseline can be highly competitive under a specific shift, and more elaborate OOD machinery does not automatically help. That does not make ERM the new king. It makes the leaderboard conditional.

For business deployment, cross-position testing should influence both model design and user experience design. If a model only works under strict placement, the product should say so, enforce it, or detect when the placement is wrong. A smartwatch fitness feature, a workplace fatigue monitor, and a clinical wearable should not bury placement assumptions in engineering notes. They should treat them as part of reliability.

The practical design choice becomes:

Product decision	What HAROOD implies
Require fixed placement	Test narrowly, document clearly, and detect misuse.
Allow flexible placement	Train and evaluate across position domains; average accuracy is not enough.
Infer placement automatically	Treat placement recognition as a supporting task, not a decorative feature.
Ignore placement	Enjoy your beautiful benchmark number while the wristband migrates.

This is where HAROOD’s category-based framing beats a single accuracy table. Cross-position is not just a lower score. It is a product-specification problem.

Cross-dataset: portability is not a logo on a slide

Cross-dataset generalization is where many deployment dreams go for a small, quiet funeral. A model trained on one dataset may fail on another because collection hardware, sampling rate, preprocessing, subject population, activity definitions, and environmental context all change.

HAROOD constructs cross-dataset tasks using DSADS, USC-HAD, PAMAP2, and UCI-HAR as domains. To make comparison possible, the authors align modalities and restrict the task to six common activity classes. This is already a simplification. Real data portability is usually uglier. Still, the scenario is valuable because it asks a question businesses often skip: will a model trained under one data regime survive another?
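The label-alignment step can be sketched roughly as follows. Everything here is hypothetical: the dataset names, the raw label strings, and the three-class shared label space are placeholders, not the paper's actual six-class mapping.

```python
# Hypothetical shared label space and per-dataset label names (not the paper's mapping).
COMMON_CLASSES = ("walking", "sitting", "standing")

LABEL_MAP = {
    "dataset_a": {"walk": "walking", "sit": "sitting", "stand": "standing"},
    "dataset_b": {"WALKING": "walking", "SITTING": "sitting", "STANDING": "standing"},
}


def harmonize(dataset_name, samples):
    """Map (window, raw_label) pairs to common classes and drop unmapped activities."""
    mapping = LABEL_MAP[dataset_name]
    harmonized = []
    for window, raw_label in samples:
        common = mapping.get(raw_label)
        if common is not None:
            harmonized.append((window, common))
    return harmonized


raw_a = [([0.1, 0.2], "walk"), ([0.0, 0.0], "sit"), ([0.9, 0.8], "rowing")]
raw_b = [([0.2, 0.1], "WALKING"), ([0.5, 0.4], "LAYING")]

print(harmonize("dataset_a", raw_a))  # the "rowing" window is dropped
print(harmonize("dataset_b", raw_b))  # the "LAYING" window is dropped
```

Note what the sketch throws away: any activity that does not exist in every dataset. That is the price of comparability, and it is one reason a cross-dataset score describes a narrower task than the original datasets do.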

For AI vendors, this is a serious issue. Many demos are trained and tested inside the same dataset family. That can prove the pipeline works, but it does not prove portability. A hospital network, fitness app, insurer, or smart-home provider may operate across devices, cohorts, countries, and data protocols. Cross-dataset evaluation is the closest benchmark proxy for that reality.

The paper does not claim that cross-dataset transfer is solved. Its broader finding is that method performance varies across scenarios and architectures. That variation is the key result. If an algorithm’s success depends on which dataset becomes the held-out domain, then the algorithm is not a product guarantee. It is an engineering candidate.

For business interpretation, cross-dataset results should be used as a vendor-diligence tool. Ask whether the model has been evaluated on external datasets, whether activity labels were harmonized, whether sensor channels were aligned, and whether preprocessing choices were kept consistent. If the answer is a confident paragraph without a held-out dataset, invoice the vendor for fiction writing.

Cross-time: drift is not a footnote

Cross-time generalization is the easiest shift to underestimate because nothing obvious needs to change. The user can be the same. The sensor can be the same. The task can be the same. Time alone is enough.

HAROOD simulates temporal distribution shift by splitting EMG, PAMAP2, and WESAD into chronological segments. This scenario is especially relevant for continuous monitoring. Physiological signals drift. Sensors age. User behavior changes. The model that was validated at deployment may gradually become less reliable.
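A chronological split is simple to sketch in Python. The function name, the number of domains, and the timestamps below are assumptions for illustration; the paper's exact segmentation may differ.

```python
def chronological_domains(samples, n_domains):
    """Split (timestamp, window) pairs into n contiguous, time-ordered domains."""
    if n_domains < 1 or n_domains > len(samples):
        raise ValueError("n_domains must be between 1 and the number of samples")
    ordered = sorted(samples, key=lambda sample: sample[0])
    base, extra = divmod(len(ordered), n_domains)
    domains, start = [], 0
    for i in range(n_domains):
        # The first `extra` domains take one additional sample each.
        end = start + base + (1 if i < extra else 0)
        domains.append(ordered[start:end])
        start = end
    return domains


# Ten hypothetical windows with shuffled timestamps.
samples = [(t, f"window_{t}") for t in (7, 2, 9, 4, 1, 8, 3, 6, 5, 0)]
domains = chronological_domains(samples, n_domains=4)

for i, domain in enumerate(domains):
    print(f"domain_{i}:", [timestamp for timestamp, _ in domain])
```

The key property is that no window in an earlier domain is newer than any window in a later one. Shuffling before splitting would erase the drift the scenario is supposed to measure.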

The paper’s window-length experiment belongs here. It is best read as a robustness or sensitivity test, not as the main thesis. On the EMG cross-time setting, the authors rerun methods with different window lengths: 100, 200, and 500. Performance changes across methods. Fish and Fishr remain strong in several configurations, VREx changes materially with longer windows, and some methods remain weak. The point is not that 500 is the magic number. The point is that segmentation choices interact with algorithms.

This matters because time-series AI systems often hide preprocessing choices behind the model name. Window length, overlap, normalization, chronological splitting, and domain definition are not neutral plumbing. They shape the signal the model sees. In a business setting, this means model governance should include preprocessing governance. Otherwise, the team may think it is evaluating a model while actually evaluating a fragile data-window recipe.
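A short sketch shows how much the window recipe changes the data a model receives. The 50% overlap and the 1,000-sample signal are assumptions for illustration; the window lengths 100, 200, and 500 match the values reported above, and the function name is hypothetical.

```python
def sliding_windows(signal, window_length, overlap=0.5):
    """Cut a 1-D signal into fixed-length windows with fractional overlap."""
    if window_length < 1:
        raise ValueError("window_length must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = max(1, int(window_length * (1 - overlap)))
    last_start = len(signal) - window_length
    return [signal[s:s + window_length] for s in range(0, last_start + 1, step)]


signal = list(range(1000))  # stand-in for one sensor channel

for length in (100, 200, 500):
    windows = sliding_windows(signal, window_length=length)
    print(length, len(windows))
```

With these assumed settings, the same recording yields 19, 9, or 3 training examples depending on the window length alone. The algorithm is unchanged; its diet is not.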

Cross-time also connects to model monitoring. If HAROOD’s chronological split exposes performance differences, then deployed systems should not rely only on aggregate launch validation. They need drift checks, periodic re-evaluation, and possibly adaptive selection strategies that do not rely on target labels being conveniently available. The authors point to future work on online or meta-adaptive selection, but the current benchmark does not yet provide a deployable solution for that. It tells us where the bruise is.
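As one example of a label-free drift check, a team could compare a summary feature of recent windows against a reference period. The sketch below uses a simple standardized mean shift. This is a generic monitoring heuristic, not a method from the paper, and the feature values and the threshold of 2.0 are arbitrary assumptions.

```python
import statistics


def mean_shift_score(reference, recent):
    """Absolute shift of the recent mean, in units of the reference standard deviation."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.pstdev(reference)
    shift = abs(statistics.fmean(recent) - ref_mean)
    if ref_std == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / ref_std


def drift_alert(reference, recent, threshold=2.0):
    # The threshold is an arbitrary illustrative choice, not a recommended value.
    return mean_shift_score(reference, recent) > threshold


# Hypothetical per-window feature values, for example signal energy.
reference = [1.0, 2.0, 3.0, 4.0, 5.0]
stable = [2.0, 3.0, 4.0]
drifted = [5.0, 6.0, 7.0]

print(drift_alert(reference, stable))   # False
print(drift_alert(reference, drifted))  # True
```

A check like this cannot say whether accuracy has dropped. It can only say that the input no longer looks like what the model was validated on, which is the cue to re-evaluate.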

The leaderboard has no monarch, which is the point

The paper’s most business-relevant finding is negative: no single method consistently dominates every setting.

That sentence sounds modest. It is not. It attacks a common procurement fantasy: choose the best algorithm, standardize it across products, and move on. HAROOD does not support that fantasy. The authors report that CORAL, URM, ERM, and ERM++ each look strong in different places. ERM++ dominates cross-time generalization with CNN in one setting but drops sharply when the backbone changes. ANDMask and LAG perform poorly under CNN rankings but rise into the top group under Transformer. DDLearn, despite being HAR-specific, does not show stable dominance in the unified benchmark.

This does not mean OOD methods are useless. It means the question “Which OOD method is best?” is too broad to deserve an answer.

A better question is:

Question	Why it matters
Which shift scenario are we targeting?	Cross-person, cross-position, cross-dataset, and cross-time stress different failure modes.
Which backbone is used?	CNN and Transformer backbones change method rankings materially.
Which model-selection protocol is realistic?	Oracle selection estimates potential but cannot be used as a real deployment protocol.
Which classes carry the highest business risk?	Average accuracy can hide dangerous class-specific failures.
Which preprocessing choices are fixed?	Window length and normalization can change observed performance.

The paper’s comparison with HAR-specific methods is particularly useful. Under a unified codebase, LAG and DDLearn do not reproduce the kind of broad dominance one might expect from their original framing. The authors fairly note that modifications for consistency may affect results and invite original authors to contribute implementations. That is the right posture. The benchmark should not become a guillotine for methods. It should become a referee that forces them onto the same field.

Architecture choice is not a decoration

HAROOD tests methods with CNN and Transformer backbones. The results make architecture choice look less like a technical afterthought and more like a first-order deployment variable.

Under a CNN backbone, ERM++ secures top performance in the paper’s overall ranking, while ANDMask and LAG rank much lower. Under a Transformer backbone, ERM++ drops to 14th, while ANDMask and LAG move into the top three. On DSADS, the best Transformer-based method exceeds the best CNN-based method by four points. On WESAD cross-time, the top CNN-based approach outperforms the best Transformer.

The honest interpretation is not “Transformers are better” or “CNNs are safer.” The honest interpretation is that model architecture and OOD method interact. A learning strategy that works with one representation pipeline may fail with another. A business team choosing the algorithm without testing the backbone is only doing half the evaluation.
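One way to make that interaction visible is to rank methods separately per backbone and look at how far each one moves. The sketch below uses placeholder method names and invented scores, so it illustrates the bookkeeping only and does not reproduce any HAROOD result.

```python
def ranks(scores):
    """Map each method to its rank (1 is best) by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {method: position + 1 for position, method in enumerate(ordered)}


def rank_shift(scores_first, scores_second):
    """Positive values mean the method ranks worse under the second backbone."""
    first, second = ranks(scores_first), ranks(scores_second)
    return {method: second[method] - first[method] for method in scores_first}


# Invented scores for placeholder methods under two backbones.
cnn_scores = {"method_a": 0.80, "method_b": 0.78, "method_c": 0.75, "method_d": 0.70}
transformer_scores = {"method_a": 0.71, "method_b": 0.79, "method_c": 0.77, "method_d": 0.80}

print(ranks(cnn_scores))
print(ranks(transformer_scores))
print(rank_shift(cnn_scores, transformer_scores))
```

In this toy table the best and worst methods swap places when the backbone changes, while the middle two stay put. A report that lists only one backbone would show none of that.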

The paper also includes targeted architecture experiments on EMG across small, mid, large, RNN, and LSTM configurations. Larger-capacity models appear to benefit more from aggressive data augmentation such as Mixup, while CORAL remains comparatively robust for smaller models and low-data regimes. RNN and LSTM results are weak in this setup, but that should not be inflated into a universal claim that recurrent models are obsolete. It is evidence within this benchmark configuration, not a funeral service.

Oracle selection is a diagnostic mirror, not a deployment trick

The gap between training-domain validation and oracle selection is one of the paper’s most important signals. Training-domain validation selects hyperparameters using source-domain validation data. Oracle selection assumes the best target-domain parameter choice is known. The authors emphasize that oracle selection is not valid as a real OOD benchmarking method because it leaks target-domain information. They use it to estimate the best achievable performance within the tested parameter range.

The reported gain from oracle selection is nearly two percentage points. In some fields, two points would be a rounding error. In HAR safety and monitoring contexts, two points may mean a meaningful number of missed or misclassified events. The exact business impact depends on the use case, class distribution, and failure cost, but the direction is clear: model selection is part of robustness.

This is where many teams underinvest. They tune on convenient validation splits, report aggregate performance, and assume the selected model is ready for deployment. HAROOD suggests a less comforting view: training-domain validation may select a model that is not the best for the unseen target domain. The business remedy is not to use oracle data. It is to design validation protocols that better approximate target uncertainty: leave-one-domain-out testing, stress testing by user group or device position, synthetic perturbation, and post-deployment monitoring.

The paper does not solve adaptive model selection. It exposes why we need it.

The large-model test is a warning label, not a final verdict

It is tempting to summarize this result as “large models fail because they memorize.” That is punchy. It is also more than this paper proves.

What HAROOD directly shows is narrower and still valuable. The authors add a large-model testing interface and compare HARGPT on GPT-OSS:120B with ERM on USC-HAD for two targets. HARGPT scores 5.48 and 5.54, while ERM scores 67.08 and 64.79. That is not a small gap. It is a crater.

But the correct interpretation is not “LLMs cannot do sensor HAR.” It is: in this reported setup, an LLM-style baseline falls far behind compact HAR models, and the cost-performance tradeoff is poor. The paper itself frames compact CNN and Transformer models as practical for edge deployment and small-model encoders, while noting that the benchmark can support future larger models.

For business teams, this is a useful antidote to reflexive scale worship. Sensor data is not natural language. Prompting raw or transformed time-series data into a large model is not the same as learning robust sensor representations. If a large-model solution is proposed for HAR, it should clear the same scenario tests as compact models: cross-person, cross-position, cross-dataset, and cross-time. Otherwise, it is not a generalization engine. It is a very expensive way to be wrong.

How to read the paper’s evidence

Not every experiment in the paper plays the same role. Mixing them together produces bad interpretation. Here is the clean reading:

Evidence in the paper	Likely purpose	What it supports	What it does not prove
Main benchmark across 16 methods, six datasets, four shifts, two backbones	Main evidence	No single method dominates; method rankings depend on shift and architecture.	That any one method is universally best for all HAR products.
Training-domain validation vs oracle selection	Diagnostic comparison	Model selection is a bottleneck; oracle potential exceeds realistic validation.	That oracle selection can be used in deployment.
Confusion matrices on EMG	Error-pattern analysis	Methods differ by class-level strengths and misclassifications.	That a specific ensemble will always improve performance.
Window-length experiment on EMG cross-time	Robustness/sensitivity test	Preprocessing interacts with method performance.	That one window length is optimal across HAR.
HARGPT vs ERM on USC-HAD targets	Exploratory extension / comparison	LLM-style baseline is much weaker in this setup.	That all future sensor foundation models will fail.
Architecture capacity experiments on EMG	Ablation-style architecture analysis	Capacity, architecture, data augmentation, and alignment methods interact.	That RNNs/LSTMs are universally inferior in every sensor setting.
Inference-time and GPU-use statistics	Implementation and resource analysis	Many compact benchmark methods have comparable inference costs; ERM is efficient.	That total system cost is identical in full deployment.
F1-score comparison	Metric robustness check	Accuracy-based trends largely agree with F1 on reported data.	That accuracy alone is sufficient for safety-critical deployment.

This table is not decorative. It prevents a common reading error: taking a robustness test as a second thesis, or treating an exploratory large-model baseline as the final verdict on an entire model class. HAROOD is broad, but its evidence still has boundaries.

What businesses should do with HAROOD

The direct research contribution is a benchmark. The business contribution is a testing discipline.

For organizations building or buying sensor-based HAR systems, HAROOD suggests a practical sequence:

1. Define the deployment shift before selecting the algorithm. Is the main risk new users, sensor placement, new devices and data protocols, or time drift? “OOD robustness” is not specific enough.
2. Hold out domains that resemble the business risk. If the product targets new patients, hold out people. If it permits different wearing positions, hold out positions. If it must scale across devices, hold out data sources. If it runs continuously, test chronological drift.
3. Evaluate backbones and algorithms together. The paper shows that method rankings can change sharply when switching from CNN to Transformer. Selecting the algorithm alone is not enough.
4. Inspect class-level errors. In safety, healthcare, and compliance settings, class-specific mistakes matter more than average accuracy. A missed fall and a confused low-risk movement are not equivalent.
5. Treat preprocessing as part of the model. Window length, normalization, overlap, sensor-channel alignment, and chronological splitting all affect results. The model card should include them.
6. Use oracle-style results only as an upper-bound diagnostic. If oracle performance is much better than validation-selected performance, the team has a model-selection problem, not a production-ready miracle.
7. Keep compact models in the comparison set. The large-model baseline in HAROOD performs poorly against ERM in the reported USC-HAD test. Even when larger models improve, they still need to beat compact models on cost, latency, privacy, and edge deployment.

This is not glamorous advice. It is useful advice. That already puts it ahead of many AI strategy decks.

Boundaries: what HAROOD clarifies and what it cannot promise

HAROOD is a benchmark, not a deployment certificate.

First, it is built on public datasets. Public datasets are necessary for reproducibility, but they do not cover every operational environment. Aging-related motion changes, long-term sensor degradation, rare clinical events, household-specific context, cultural activity differences, and device manufacturing variation may not be fully represented.

Second, the domain shifts are constructed. Cross-person, cross-position, cross-dataset, and cross-time are realistic categories, but they are still benchmark abstractions. A real product may face all four at once. A wearable used by older adults in humid environments with inconsistent placement and months of sensor drift is not politely choosing one HAROOD scenario.

Third, the hyperparameter search is limited. The authors use 20 combinations because the benchmark is large. This supports comparability, but it may disadvantage algorithms that are sensitive to tuning. The paper is transparent about this limitation.

Fourth, oracle selection is not deployable. Its value is diagnostic. It shows potential performance under idealized selection, not a production procedure.

Fifth, the benchmark currently emphasizes accuracy and F1-style evaluation. The authors include class-level analysis, but business applications often need cost-sensitive metrics, calibration, uncertainty, false-negative constraints, and human-review thresholds. A fall-detection product, for example, should not optimize the same objective as a recreational fitness classifier.
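A cost-sensitive view can be sketched by weighting a confusion matrix with per-error costs. The two confusion matrices and the cost values below are invented for illustration; in practice the costs would come from the product's own risk analysis.

```python
def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def expected_cost(matrix, cost):
    """Average cost per event; rows are true classes, columns are predictions."""
    total = sum(sum(row) for row in matrix)
    weighted = sum(
        matrix[i][j] * cost[i][j]
        for i in range(len(matrix))
        for j in range(len(matrix[i]))
    )
    return weighted / total


# Class order: ["no_fall", "fall"]. All numbers are hypothetical.
model_a = [[90, 0], [5, 5]]   # no false alarms, but half the falls are missed
model_b = [[84, 6], [1, 9]]   # more false alarms, only one missed fall

# Assumed costs: a false alarm costs 1 unit, a missed fall costs 20 units.
cost = [[0, 1], [20, 0]]

print(accuracy(model_a), expected_cost(model_a, cost))  # 0.95 1.0
print(accuracy(model_b), expected_cost(model_b, cost))  # 0.93 0.26
```

Under these assumed costs, the model with lower accuracy is almost four times cheaper per event. Which model is "better" depends on the cost matrix, and the cost matrix is a business decision.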

These boundaries do not weaken HAROOD. They explain how to use it without turning it into another leaderboard idol.

The real contribution is harder self-deception

HAROOD’s main message is not that CORAL, Fish, Fishr, ERM, ERM++, CNNs, or Transformers are the answer. The answer changes with the shift.

The stronger contribution is that HAROOD makes the question harder to cheat. It says: do not show me one dataset. Do not show me one split. Do not show me one backbone. Do not show me an OOD algorithm tested only in the scenario where it behaves nicely. Show me cross-person, cross-position, cross-dataset, and cross-time. Show me validation selection, not just oracle potential. Show me class-level failure. Show me whether the result survives a different architecture and a different window length.

That is what mature benchmarks do. They reduce the space in which convenient stories can hide.

For businesses using sensor-based AI, the implication is direct. Before deploying a HAR model into healthcare monitoring, assisted living, workplace safety, smart homes, or fitness tracking, ask which HAROOD-style failure modes have been tested. If the answer is “we achieved high accuracy on a benchmark,” the correct response is: adorable, but which world did the benchmark simulate?

HAROOD does not make HAR deployment easy. It makes the difficulty visible. In applied AI, that is often where progress actually begins.

Cognaptus: Automate the Present, Incubate the Future.

¹ Wang Lu, Yao Zhu, and Jindong Wang, “HAROOD: A Benchmark for Out-of-distribution Generalization in Sensor-based Human Activity Recognition,” arXiv:2512.10807, 2025/2026. https://arxiv.org/abs/2512.10807
