How many nights? Test–retest reliability and the minimum reliable recording length for ODI-4, rMSSD, and CGM-CV

Michal Planicka · corresponding author — Tepna Project

OxyDex · PulseDex · GlucoDex nodes, Tepna physiological-signal suite

Draft v1 · June 2026 · Analysis tool: nights-icc-analysis.html · Detectors: real oxydex-dsp.js · pulsedex-dsp.js · glucodex-dsp.js (two-pool Web Workers) · generator cohort-gen 1.6 · 100% local

Abstract

Background. A single night (or day) of a wearable physiological signal is one noisy draw of a person's true physiology. Before any single-occasion metric can be read as a stable individual trait — for screening, for tracking, for cohort comparison — we need to know how reliably it separates people, and therefore how many occasions a recording must span. Methods. For each stable (flat-arc) synthetic patient in the suite's 1–12-night longitudinal lane we measured three single-signal metrics per occasion with the production detectors: ODI-4 (OxyDex, per night), rMSSD (PulseDex, per night), and CGM coefficient-of-variation (GlucoDex, per day). One-way random-effects ICC(1,1) decomposes the variance into between-subject vs occasion-to-occasion; the Spearman–Brown prophecy converts the single-occasion ICC into the minimum number of occasions needed to reach a reliability target. Results. ODI-4 and rMSSD were already reliable from a single night (ICC₁ 0.89 and 0.93; both reach ICC≥0.80 at one occasion, rMSSD reaches ICC≥0.90 at one night and ODI-4 at two). Daily CGM-CV, by contrast, showed negligible between-subject variance in the stable cohort (ICC₁≈0): day-to-day fluctuation swamped person-to-person differences, so no number of days made it a reliable individual trait. Conclusion. Reliable individual estimates of ODI-4 and rMSSD need only 1–2 nights; daily glycemic variability is an occasion-level state, not a one-number trait, and must be summarized differently. This is synthetic ground truth: it certifies the reliability structure of the pipeline and recovers the planted variance ratio — it does not establish real-world clinical reliability, which requires a labelled repeated-measures cohort.

Keywords: test–retest reliability · intraclass correlation · Spearman–Brown · oximetry · heart-rate variability · continuous glucose monitoring · wearables · minimum recording length

0. Layman overview (delete before submission)

A wearable reports one number per night (or per day), and people treat it as a fixed fact about themselves. But any single night is one noisy sample — partly the real you, partly just how that particular night went. The practical question: how many nights do you need before the number really describes the person, not the night?

Using simulated stable people (whose true physiology doesn't change) and the real app's detectors, we measured how much of each number's variation separates people versus separates nights. Two metrics — the apnea index (ODI-4) and the heart-rhythm number (rMSSD) — are mostly about the person, so one night is already enough (two at most). The glucose-variability number is the opposite: it changes so much day to day that it mostly describes the day, not the person — so no amount of extra days turns it into a reliable personal trait; it should be summarized differently. (Simulation: this proves the method and the software's variance structure, not a clinical reliability claim.)

1. Introduction

Consumer and clinical wearables increasingly report a single headline number per recording — an overnight desaturation index, a heart-rate-variability summary, a glycemic-variability coefficient — and users and clinicians treat that number as a property of the person. But every recording is a sample: it mixes the person's stable physiology with the night-to-night (or day-to-day) variation of biology, behavior, and the detector itself. If the occasion-to-occasion swing is large relative to the spread between people, a one-night number tells you more about that night than about that person, and a screening cut-point applied to it will mostly sort noise.

The standard instrument for this question is test–retest reliability. The intraclass correlation ICC(1,1) is the fraction of total variance that is between-subject (the "reliable" share); the Spearman–Brown prophecy then says how that reliability grows when several occasions are averaged, and inverts to the fewest occasions needed to clear a target. We ask the practical version of the question directly: how many nights (or days) does each Tepna metric need to be a reliable individual trait? We answer it for three single-signal metrics measured by their production detectors, on stable synthetic patients where the between/within variance ratio is known by construction.

2. Methods

2.1 Cohort — stable patients only

Subjects are the flat-arc patients of the suite's 1–12-night longitudinal lane: each has a single stable latent physiology (no planted trend, treatment response, or drift), so within-subject spread across occasions is genuine occasion-to-occasion biology plus detector noise rather than a designed trajectory. We scan profiles for flat-arc subjects with at least the configured minimum number of occasions (default 5) and measure every occasion. The run reported here covers 5,556 sampled subjects; per-metric subject counts differ by device coverage (CGM is modeled as continuous OTC wear, so its coverage is now close to oximetry/RR rather than a small minority) and by how many subjects contribute enough valid occasions of each signal (Table 1).

2.2 Per-occasion measurement — real detectors

Each occasion is scored by the unmodified production detector, run in its own Web-Worker realm (OxyDex must be isolated because it collides with PulseDex on bare globals; the lean worker pairs PulseDex+GlucoDex, which coexist): ODI-4 from oxydex-dsp.js → processNight (per night), time-domain rMSSD from pulsedex-dsp.js (per night), and the glucose coefficient of variation from glucodex-dsp.js → analyze (per day, after splitting the continuous CGM stream into calendar days on the floating wall-clock). No detector parameters were altered. Timestamps follow the suite Clock Contract so day-splitting and ordering are viewer-timezone-independent.

2.3 Reliability statistics

For each metric we form the ragged set of per-subject occasion vectors (subjects with ≥2 occasions) and estimate the one-way random-effects intraclass correlation by ANOVA:

ICC(1,1) = (MS_between − MS_within) / (MS_between + (n₀ − 1)·MS_within)

where n₀ is the average occasions-per-subject. The single-occasion ICC₁ is the reliability of one night/day; the Spearman–Brown prophecy gives the reliability of an average of m occasions and the minimum m for a target:

ICC(m) = m·ICC₁ / (1 + (m−1)·ICC₁) · m_min = ⌈ target·(1−ICC₁) / ((1−target)·ICC₁) ⌉

As a non-circular check that the detector preserves (rather than manufactures) the reliability structure, the same ANOVA is run on the planted latent targets (planted AHI for ODI, planted rMSSD) and compared against the measured ICC.

3. Results

**Table 1.** Per-metric single-occasion reliability and minimum recording length (5,556-subject stable cohort; real detectors).
Metric	Node	Occasion	Subjects	Occasions	ICC₁	Within-CV	n for ICC≥.80	n for ICC≥.90
ODI-4	OxyDex	night	5018	40983	0.885	40%	1	2
rMSSD	PulseDex	night	4873	39855	0.929	7%	1	1
CGM-CV	GlucoDex	day	4059	32406	0.00	16%	∞	∞

ODI-4 and rMSSD are reliable individual traits from a single occasion: ICC₁ 0.89 and 0.93 mean ~89–93% of the variance separates people rather than nights, so one night already clears the ICC≥0.80 bar — rMSSD is already at ICC≥0.90 from a single night, and ODI-4 reaches it with two. Their occasion-to-occasion noise is modest (ODI within-subject SD ≈0.9 events/h; rMSSD ≈3.0 ms, a 7% within-subject coefficient of variation).

Spearman–Brown reliability curves for ODI-4, rMSSD, and CGM-CV — **Figure 1.** Spearman–Brown reliability curves (live output of `nights-icc-analysis.html`): reliability of an m-occasion average versus the number of occasions. ODI-4 (blue) and rMSSD (amber) start near 0.89/0.93 at a single occasion and clear the 0.80 target immediately; daily CGM-CV (teal) sits flat on the axis (ICC₁≈0), so averaging more days never lifts it. Dark theme is the tool's native rendering.

single-occasion ICC bars with minimum occasions per metric — **Figure 2.** Single-occasion ICC₁ by metric (filled bar = the reliable, between-subject share of variance), with the minimum occasions to reach ICC≥0.80 at right. ODI-4 0.89 and rMSSD 0.93 are reliable from one night; CGM-CV 0.00 has no reliable share, so no recording length suffices.

per-subject rMSSD: night-to-night range vs subject mean — **Figure 3.** Per-subject rMSSD — each column is one subject sorted by mean (vertical bar = night-to-night range, dot = subject mean). The wide between-subject spread against the tight within-subject bars is the high ICC made visible: most of the variance separates people, little of it separates that person's nights.

3.1 Daily glycemic variability is a state, not a one-number trait

Daily CGM-CV behaved qualitatively differently. In the stable cohort its between-subject variance was negligible (ICC₁≈0): stable patients' days look alike in CV, while each person's CV swings day to day (within-subject CV ≈16%). When occasion-to-occasion variation dominates and person-to-person variation does not, Spearman–Brown returns no finite recording length — averaging more days cannot manufacture between-subject signal that is not there. The practical reading is not "CGM-CV is useless" but "daily CGM-CV is an occasion-level state, not a stable one-number individual trait": it should be summarized over a window or paired with its driver (meals, activity, the autonomic context) rather than reported as a single reliable number.

3.2 The detector preserves the reliability structure

**Table 2.** Non-circular check — measured (real-detector) vs latent (planted) single-occasion ICC.
Metric	Measured ICC₁	Latent ICC₁
ODI-4 (OxyDex)	0.89	0.94
rMSSD (PulseDex)	0.93	0.93

Running the same ANOVA on the planted latent targets recovers a comparable reliability structure: rMSSD is reproduced almost exactly (0.93 vs 0.93), and ODI-4's measured ICC (0.89) tracks but sits slightly below the latent ceiling (0.94) — the detector adds a little occasion-level noise but does not invent the between-subject separation. The reliability ordering is therefore a property of the physiology and the detector, not an artifact of the estimator.

4. Discussion

The metrics split cleanly into two regimes. ODI-4 and rMSSD are trait-like: most of their variance is between people, so a single night is already an informative individual estimate and a second night mainly buys precision (ICC≥0.90). Daily CGM-CV is state-like: its variance lives within the person across days, so it indexes the day, not the person, and needs a window-level summary or a coupling model to become a stable descriptor. This distinction matters for any downstream use — a screening cut-point, a "your number changed" alert, a cohort covariate — because applying a trait framing to a state-like metric sorts mostly noise.

Limitations. This is synthetic ground truth. The cohort is flat-arc (stable) by construction, so the within-subject variance is the generator's night-to-night model plus real detector noise — realistic in structure, but not a substitute for measured human test–retest data. The exact ICC values (and especially the CGM-CV≈0 result) depend on the synthetic generator's between-/within-subject variance ratios; a real repeated-measures cohort could shift them, particularly for CGM where real individuals differ more in habitual variability than stable synthetic days do. Per-metric subject counts are large (1,818–2,739), so sampling uncertainty on these ICCs is small. These results certify the reliability structure of the pipeline and the minimum-recording-length method; they motivate, but do not replace, a labelled repeated-measures validation.

5. Reproducibility

Run it: open nights-icc-analysis.html → set subjects / min occasions / target ICC → "Run cohort". Tables 1–2 and Figure 1 populate live; the target-ICC slider re-derives the minimum recording length without re-measuring. Export nights-icc-results.csv, nights-icc-stats.json, nights-icc-figures.png.
Detectors: real oxydex-dsp.js (ODI-4 = processNight().odi4.rate), pulsedex-dsp.js (rMSSD), glucodex-dsp.js (CGM-CV, day-split), each run in its own cohort-worker.js Web-Worker realm (two pools: OxyDex, and lean PulseDex+GlucoDex).
Cohort: flat-arc subjects from cohort-gen.js + synth-gen.js (1–12-night longitudinal lane).
Statistics: ICC(1,1) one-way random effects by ANOVA; Spearman–Brown prophecy for minimum occasions; latent-target ANOVA as the non-circular detector check.
Next: repeat on a real repeated-measures cohort (paired multi-night oximetry / RR / CGM) to convert the pipeline-reliability result into a clinical one.

6. Sample size & statistical power

The ICC's precision is governed by the number of subjects (and, secondarily, occasions per subject); its confidence interval narrows as ~1/√N_subjects. A handful of subjects gives a point ICC but a wide interval; thousands pin it to the second decimal. The qualitative split — ODI-4/rMSSD trait-like, CGM-CV state-like — is unmistakable even at small N, because the ICCs sit near opposite ends (≈0.9 vs ≈0).

**Table 3.** Sample-size guidance for this pilot (flat-arc subjects, ≥5 occasions each).
Tier	Subjects	What it buys
Minimum (acceptable)	~200	ICC point estimates to ≈±0.03 and the minimum-occasions answer (1–2 nights) are stable; the trait-vs-state split is obvious. Below ~50 the ICC CI is too wide to quote.
Recommended	~3,000	ICCs to ≈±0.01, smooth Spearman–Brown curves and per-subject spread plots. Comfortable publication precision.
This run	5,556	Per-metric n = 1,818–5,018 subjects (CGM coverage raised in generator v1.6); ICCs pinned to ≈±0.005.
Diminishing returns	> ~5,000	Beyond here the ICC CIs are already <±0.01; more subjects don't change which regime a metric is in. Adding more occasions per subject (not subjects) is what would refine the minimum-recording-length estimate further.

Practical reading: ~200 subjects to answer “how many nights?” reliably, ~3,000 for tight published ICCs; past ~5,000 the marginal subject adds little. Note this pilot's lever is subjects × occasions — to sharpen the recording-length recommendation specifically, raise the minimum-occasions setting rather than the subject count.

References

Project documentation: CLAUDE.md (Clock Contract, evidence-grade system), COHORT-VALIDATION-BRIEF.md, SYNTHETIC-CORPUS-README.md, Tepna suite.
Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420–428.
Brown W. Some experimental results in the correlation of mental abilities. Br J Psychol. 1910;3(3):296–322 (Spearman–Brown prophecy).
Stöberl AS, Schwarz EI, Haile SR, et al. Night-to-night variability of obstructive sleep apnea. J Sleep Res. 2017;26(6):782–788.
Bertsch K, Hagemann D, Naumann E, et al. Stability of heart rate variability indices reflecting parasympathetic activity. Psychophysiology. 2012;49(5):672–682.
Rodbard D. Glucose variability: a review of clinical applications and research developments. Diabetes Technol Ther. 2018;20(S2):S2-5–S2-15.