Measuring a device's σ without a canonical reference: heart-rate error across the O2Ring, Polar H10 and Verity Sense by repeatability, transfer-standard agreement, and the three-cornered hat

Michal Planicka · corresponding author — Tepna Project

OxyDex (oximetry) · PulseDex / ECGDex (cardiac) nodes, Tepna physiological-signal suite

Draft v1 · June 2026 · Analysis tool: sigma-no-reference-analysis.html · Inputs: real co-recorded device files · 100% local, reproducible

Abstract

Problem. How do you state a consumer sensor's measurement uncertainty (σ) when you own no laboratory-calibrated reference? Approach. A three-rung recipe that needs no certified instrument: (1) repeatability: random scatter from repeated reads, estimated with no reference; (2) transfer-standard agreement: promote the best device on hand (a chest-strap ECG) to a working reference and report Bland–Altman bias, 95% limits of agreement and the accuracy root-mean-square (Arms), recovering the test device's random σ by variance subtraction; (3) the three-cornered hat: with three devices measuring the same quantity at once, solve for each one's individual variance with none assumed canonical. Data. The pulse-oximeter heart-rate channel (Wellue O2Ring) against a Polar H10 ECG strap, co-recorded over seven overnight sessions (128,444 paired 1-Hz seconds), plus a Polar Verity Sense armband as the third corner. Results. The O2Ring pulse carries a negligible mean bias (−0.33 bpm over all seven nights; −0.14 bpm on the six clean nights) but a second-by-second random σ of ≈3.7 bpm on the six clean nights (one motion-corrupted night reached 9.3 bpm), with Arms 3.8 bpm. Across six simultaneous three-device windows (63,231 s), with H10 HR re-derived from raw ECG (Pan-Tompkins QRS) and Verity HR from raw PPG, the three-cornered hat returns an across-window median reference-free σ [95% CI] of 1.83 [1.40–2.12] bpm (O2Ring), 2.04 [1.77–2.23] bpm (H10) and 3.50 [2.27–5.24] bpm (Verity), none assumed canonical. The O2Ring and H10 are statistically indistinguishable (overlapping CIs); the apparent ranking reflects error structure, not fidelity: the O2Ring's firmware emits an internally smoothed integer pulse (visible as diagonal banding in the Bland–Altman plot) that deflates its variance, while the H10 reports instantaneous per-beat HR. The ECG-derived H10 HR matches Polar's onboard R-R to −0.04 bpm, confirming the reference leg. Channel limit. Only the HR channel is testable: the ECG and PPG references carry no SpO₂, so the O2Ring's oxygen-saturation trueness cannot be established here. Capture lesson. The Verity Sense's onboard HR/PPI streams were empty (all-zero / header-only), yet its HR was fully recoverable from the raw photodiode signal at SQI ≈ 1.0; raw-signal logging, not the device's firmware estimate, is what made the third corner possible. This is a single-subject methods pilot: it demonstrates the apparatus and makes no population accuracy claim.

Keywords: measurement uncertainty · metrology without a standard · three-cornered hat · Gray–Allan variance · Bland–Altman · pulse oximetry · heart rate · raw PPG / ECG · QRS detection · consumer wearables

0. Layman overview (delete before submission)

Every gadget that reports your heart rate is a little bit wrong. The usual way to measure how wrong is to compare it against a lab-grade “truth” instrument — but ordinary people don't own one. So how do you put a real error number on a fitness ring without a reference? This paper uses three tricks that need no certified equipment: (1) measure the same thing repeatedly and look at the scatter (that gives precision); (2) borrow the best device you do own — a chest-strap ECG — as a stand-in reference; and (3) the clever one: wear three devices at once and use a bit of algebra (borrowed from atomic-clock testing) to solve for each device's own error with none of them assumed perfect.

What we found, on one person across several nights: the O2Ring's average heart rate is essentially correct (off by a third of a beat), but any single second can be wrong by ~4 beats — fine for an overnight summary, not for instant readings. The most useful surprise was about capture, not accuracy: one armband's built-in heart-rate output was completely empty (its firmware never locked on), yet the raw light-sensor signal it recorded was perfect — we reconstructed a clean heart rate from it ourselves. Lesson: always log the raw waveform, not just the device's own number. This is a real-data methods demonstration on one subject — it shows the recipe works, not a population accuracy rating for these products.

1. Introduction

A number reported by a sensor is only as useful as the uncertainty attached to it. The textbook way to obtain that uncertainty is to compare against a reference traceable to a national standard — a calibrated instrument the consumer device manufacturer, but rarely the end user, can access. This note asks the practical question that arises when no such reference is on the bench: given two or three imperfect devices, what can be said rigorously about each one's σ? The answer separates cleanly into two quantities that are routinely conflated. Precision (repeatability) is random scatter and needs no reference at all — only a stable thing to measure repeatedly. Trueness (bias) is systematic offset and fundamentally requires a comparison. We treat them separately and add a third tool — the three-cornered hat from frequency metrology — that recovers each device's own variance when three measure the same quantity simultaneously, assuming none is perfect.

One mismatch to declare up front. The O2Ring reports two signals — SpO₂ and pulse rate — but the references here (an ECG strap and a PPG armband) measure only heart rate. Everything below concerns the HR channel. The O2Ring's SpO₂ trueness is not obtainable from this data: establishing it requires arterial blood-gas CO-oximetry, the actual gold standard. SpO₂ would admit only a repeatability figure and cross-oximeter agreement, never absolute accuracy, without that reference.

2. Methods

2.1 Devices and channels

Three devices were worn together: a Polar H10 chest strap (single-lead ECG, ECG being the clinical reference for inter-beat timing), a Wellue O2Ring ring oximeter (SpO₂ + photoplethysmographic pulse rate at 1 Hz), and a Polar Verity Sense armband (PPG heart rate). Each Polar device exposes both a firmware-computed heart rate and, via Polar Sensor Logger, its raw waveform (H10 raw ECG at ~130 Hz; Verity raw PPG at ~176 Hz); the O2Ring contributes its firmware-reported 1-Hz SpO₂ and pulse values. Where the analysis benefits, and wherever a device's firmware HR failed, we work from the raw signal with the suite's own detectors (ECGDSP Pan-Tompkins QRS; PPGDSP optical beat detection) rather than the vendor's estimate. The H10 is the most precise heart-rate source (an electrical R-wave is sharper than any optical pulse), so it serves as the transfer standard for the two optical devices.

2.2 Alignment — the Clock Contract does the work

The devices share no clock and no timezone, but each stamps local civil time. Under the suite's time model every record is stored as UTC-normalized floating wall-clock milliseconds (tMs = Date.UTC(y,mo−1,d,h,mi,s)), so two devices that record the same wall-clock second produce the same tMs by construction; alignment is then an exact-second intersection with no zone negotiation. H10 R-R intervals are converted to instantaneous heart rate (60000/RR, with RR in ms) and averaged into 1-Hz buckets; the O2Ring pulse is natively 1 Hz. We keep only seconds present in both streams, discarding the O2Ring's contact-loss rows (exported as "--") and any out-of-range (<30 or >220 bpm) sample.
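The alignment step can be sketched as follows. This is a minimal illustration, not the suite's code: `floatingMs`, `bucketRrTo1Hz` and `intersectSeconds` are hypothetical names, and the input shapes (an array of beats, a `Map` keyed by second) are assumptions made for the example.

```javascript
// Floating wall-clock ms: local civil time is treated as if it were UTC.
function floatingMs(y, mo, d, h, mi, s) {
  return Date.UTC(y, mo - 1, d, h, mi, s);
}

// Average instantaneous HR (60000 / RR) into 1-Hz buckets.
// beats: [{ tMs, rrMs }] where tMs is the beat's floating timestamp.
function bucketRrTo1Hz(beats) {
  const sums = new Map(); // second (ms) -> { sum, n }
  for (const { tMs, rrMs } of beats) {
    const sec = Math.floor(tMs / 1000) * 1000;
    const b = sums.get(sec) || { sum: 0, n: 0 };
    b.sum += 60000 / rrMs;
    b.n += 1;
    sums.set(sec, b);
  }
  const out = new Map();
  for (const [sec, { sum, n }] of sums) out.set(sec, sum / n);
  return out;
}

// Keep only seconds present in both streams and inside lo..hi bpm.
// Non-numeric samples (e.g. a parsed "--" row) fail the finite check.
function intersectSeconds(a, b, lo = 30, hi = 220) {
  const ok = (v) => Number.isFinite(v) && v >= lo && v <= hi;
  const pairs = [];
  for (const [sec, va] of a) {
    const vb = b.get(sec);
    if (ok(va) && ok(vb)) pairs.push({ tMs: sec, a: va, b: vb });
  }
  return pairs.sort((p, q) => p.tMs - q.tMs);
}

// Illustrative values only, not taken from the recordings.
const t0 = floatingMs(2026, 6, 17, 1, 6, 0);
const h10 = bucketRrTo1Hz([
  { tMs: t0 + 100, rrMs: 1000 },  // 60 bpm
  { tMs: t0 + 900, rrMs: 1200 },  // 50 bpm, same second
  { tMs: t0 + 1500, rrMs: 1000 }, // next second
]);
const ring = new Map([[t0, 56], [t0 + 1000, NaN], [t0 + 2000, 58]]);
const paired = intersectSeconds(h10, ring);
console.log(paired); // one surviving second: H10 55 bpm vs ring 56 bpm
```

Because both streams are keyed by the same floating second, the join is a plain map lookup; no timezone arithmetic appears anywhere in the intersection.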

2.3 The three rungs

(1) Repeatability σ. Short-term precision with no reference: the 1-Hz residual after a 7-point rolling median, summarized by a robust SD (1.4826·MAD). On the H10 this is a genuine precision figure; on the O2Ring it is degenerate because the device's reported pulse is internally smoothed (successive seconds are frequently identical), so the O2Ring's precision is read instead from its scatter against the H10.
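A minimal sketch of the repeatability estimator described above. The function names are hypothetical, and skipping edge samples that lack a full 7-point window is an assumption of this example (the paper does not say how the tool treats edges).

```javascript
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Robust SD (1.4826 * MAD) of the residual after a centred k-point
// rolling median. Samples without a full window are skipped.
function repeatabilitySigma(hr, k = 7) {
  const half = (k - 1) / 2;
  const resid = [];
  for (let i = half; i < hr.length - half; i++) {
    resid.push(hr[i] - median(hr.slice(i - half, i + half + 1)));
  }
  const center = median(resid);
  const mad = median(resid.map((r) => Math.abs(r - center)));
  return 1.4826 * mad;
}

// A held (firmware-smoothed) series has no residual at all: the
// estimator is degenerate, which is the O2Ring case described above.
const heldSigma = repeatabilitySigma(new Array(60).fill(58));

// A jittery illustrative series (not real data) gives a non-zero figure.
const jitterSigma = repeatabilitySigma([60, 61, 59, 62, 58, 63, 57, 64, 56]);
console.log(heldSigma, jitterSigma); // 0 and about 2.97
```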

(2) Transfer-standard agreement. With the H10 promoted to a working reference, the per-second difference d = O2Ring − H10 gives the mean bias d̄, the SD of differences s_d, the 95% limits of agreement d̄ ± 1.96·s_d, and the accuracy root-mean-square Arms = √(d̄² + s_d²) — the single number used to grade pulse devices. Because the reference is itself slightly noisy, the test device's random uncertainty is recovered by variance subtraction:

σ_device = √( s_d² − σ_ref² ) , σ_ref = H10 repeatability ≈ 0.7 bpm
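The agreement statistics and the variance subtraction can be sketched as below. The function names are hypothetical; the use of the sample SD (n − 1 denominator) is an assumption, and Arms follows the definition given in the text, √(d̄² + s_d²).

```javascript
// Bland-Altman summary of paired differences d = test - ref.
function blandAltman(test, ref) {
  const d = test.map((v, i) => v - ref[i]);
  const n = d.length;
  const bias = d.reduce((s, v) => s + v, 0) / n;
  const sd = Math.sqrt(d.reduce((s, v) => s + (v - bias) ** 2, 0) / (n - 1));
  return {
    n,
    bias,
    sd,
    loa: [bias - 1.96 * sd, bias + 1.96 * sd],
    arms: Math.sqrt(bias ** 2 + sd ** 2), // definition used in this paper
  };
}

// Variance subtraction: remove the reference's own noise from s_d.
// Returns NaN when the reference is noisier than the observed scatter.
function deviceSigma(sd, sigmaRef) {
  const v = sd ** 2 - sigmaRef ** 2;
  return v >= 0 ? Math.sqrt(v) : NaN;
}

// Tiny illustrative input (not real data): d = [1, -1, 2].
const ba = blandAltman([61, 59, 62], [60, 60, 60]);

// Pooled clean-night figures from Table 1: SD 3.78, sigma_H10 0.74.
const ringSigma = deviceSigma(3.78, 0.74);
console.log(ba, ringSigma.toFixed(2)); // ringSigma prints as 3.71
```

With the pooled clean-night SD of 3.78 bpm and σ_H10 = 0.74 bpm, the subtraction reproduces the 3.71 bpm reported in Table 1.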

(3) Three-cornered hat (Gray–Allan). With three devices A, B, C measuring the same true HR, and pairwise difference variances V_AB, V_AC, V_BC, each device's own variance follows with none assumed canonical:

σ²_A = ½(V_AB + V_AC − V_BC) , σ²_B = ½(V_AB + V_BC − V_AC) , σ²_C = ½(V_AC + V_BC − V_AB)

Shared physiological variation cancels in the differences, leaving device error; a negative output flags correlated errors that break the independence assumption. The estimator is implemented in the tool and runs on every window where all three usable streams coexist — here, the H10 ECG, the raw-recovered Verity PPG, and the O2Ring pulse.
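The three formulas translate directly into a small kernel. The call signature `threeCorneredHat(V_AB, V_AC, V_BC)` is the one named in §5; the return shape, the `hatFromSeries` helper and the NaN convention for negative variances are assumptions of this sketch, not the tool's actual implementation.

```javascript
function variance(xs) {
  const m = xs.reduce((s, v) => s + v, 0) / xs.length;
  return xs.reduce((s, v) => s + (v - m) ** 2, 0) / (xs.length - 1);
}

// Solve each device's own variance from the pairwise difference variances.
function threeCorneredHat(vAB, vAC, vBC) {
  const varA = 0.5 * (vAB + vAC - vBC);
  const varB = 0.5 * (vAB + vBC - vAC);
  const varC = 0.5 * (vAC + vBC - vAB);
  const sigma = (v) => (v >= 0 ? Math.sqrt(v) : NaN);
  return {
    varA, varB, varC,
    sigmaA: sigma(varA), sigmaB: sigma(varB), sigmaC: sigma(varC),
    // A negative variance flags correlated errors; surface it.
    negative: [varA, varB, varC].some((v) => v < 0),
  };
}

// Convenience wrapper for three aligned 1-Hz series of equal length.
function hatFromSeries(a, b, c) {
  const diff = (x, y) => x.map((v, i) => v - y[i]);
  return threeCorneredHat(
    variance(diff(a, b)), variance(diff(a, c)), variance(diff(b, c))
  );
}

// Worked example with the POOLED pairwise SDs from Table 3
// (A = O2Ring, B = H10, C = Verity): 2.86, 4.50 and 4.55 bpm.
const pooled = threeCorneredHat(2.86 ** 2, 4.5 ** 2, 4.55 ** 2);
console.log(pooled.sigmaA, pooled.sigmaB, pooled.sigmaC);
// roughly 1.97, 2.08 and 4.05 bpm
```

The pooled solve is shown only as arithmetic. It is close to, but not the same as, the across-window medians in Table 3 (1.83, 2.04, 3.50 bpm), because pooling all seconds and taking the median of per-window solves are different aggregations.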

3. Results

**Figure 1a.** Three-device overlay: the first ~600 aligned seconds of a clean three-device window. H10 ECG, O2Ring pulse and Verity raw-PPG HR all track the same physiology; the spread between the three is device error. Live output of `sigma-no-reference-analysis.html`.

**Figure 1b.** Bland–Altman against the H10 reference over the three-device overlap: O2Ring−H10 (Arms 2.86 bpm) and Verity−H10 (Arms 4.59 bpm), drawn on the *same* simultaneous seconds so the two are directly comparable. Both bias lines sit near zero; the Verity cloud is visibly wider, reflecting its larger σ. The pronounced **diagonal banding** in the O2Ring cloud is the fingerprint of the device's **firmware smoothing**: its pulse is reported as held, internally smoothed integers, so for each integer value the difference-vs-mean points fall on a line of slope −2, giving one parallel band per beat-per-minute.

**Figure 1c.** Per-night SD of (O2Ring − H10) over the seven nights: six clean nights span 2.3–4.2 bpm; night 06-12 (red) is auto-flagged as motion-corrupted at 9.3 bpm (dashed line = the 1.8× median threshold).

3.1 The O2Ring pulse is unbiased but noisy against ECG

**Table 1.** O2Ring pulse vs Polar H10 ECG, per co-recorded night (1-Hz paired seconds). Reference-free σ = √(SD² − σ²_H10), σ_H10 = 0.74 bpm. ⚑ 06-12 is the night auto-flagged as motion-corrupted (Figure 1c). † 06-18 is a short (~36 min) session whose H10 ECG was contact-flat, so its H10 leg is the strap's onboard R-R; it is the tightest night.
Night	Paired seconds	Bias (bpm)	SD (bpm)	95% LoA (bpm)	Arms (bpm)	Pearson r	Ref-free σ (bpm)
06-10	21,116	−0.10	3.54	±6.9	3.54	0.54	3.46
06-11	23,896	−0.05	3.62	±7.1	3.62	0.65	3.55
06-12 ⚑	21,099	−1.32	9.31	±18.3	9.41	0.31	9.29
06-14	22,283	−0.13	3.68	±7.2	3.68	0.38	3.60
06-15	20,406	−0.24	4.17	±8.2	4.18	0.55	4.10
06-17	17,477	−0.19	4.07	±8.0	4.08	0.31	4.01
06-18 †	2,167	+0.04	2.29	±4.5	2.30	0.76	2.17
Pooled (clean, 6 nt)	107,345	−0.14	3.78	±7.4	3.78	—	3.71
Pooled (all, 7 nt)	128,444	−0.33	5.14	±10.1	5.15	0.49	5.08

The O2Ring's pulse rate is essentially unbiased against ECG: the pooled clean offset is −0.14 bpm, well inside a single 1-bpm quantization step. Its random uncertainty, however, is substantial: a reference-free σ of ≈3.7 bpm on clean nights, i.e. a typical second's pulse can be wrong by several beats even when the average is right. The accuracy figure Arms ≈ 3.8 bpm sits inside the ±5 bpm / 10% tolerance commonly cited for wrist/ring HR, but with little margin. Including the one motion-corrupted night (06-12) inflates every pooled scatter statistic (σ rises to 5.1 bpm and the LoA widen to ±10 bpm), which is exactly why a per-night quality flag, not a single grand pool, is the honest summary.
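The per-night quality flag can be illustrated with the nightly SDs from Table 1 and the 1.8× median threshold quoted for Figure 1c. `flagNights` is a hypothetical name, and taking the median over all seven nights (including the suspect one) is an assumption of this sketch.

```javascript
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Flag nights whose SD exceeds `factor` times the median nightly SD.
function flagNights(nightSd, factor = 1.8) {
  const threshold = factor * median(Object.values(nightSd));
  const flagged = Object.keys(nightSd).filter((k) => nightSd[k] > threshold);
  return { threshold, flagged };
}

// SD column of Table 1 (bpm).
const table1Sd = {
  "06-10": 3.54, "06-11": 3.62, "06-12": 9.31, "06-14": 3.68,
  "06-15": 4.17, "06-17": 4.07, "06-18": 2.29,
};
const nightCheck = flagNights(table1Sd);
console.log(nightCheck); // threshold about 6.62 bpm, flagged: ["06-12"]
```

Using the median rather than the mean keeps the threshold itself from being dragged upward by the very night it is meant to catch.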

Why the correlation is only ≈0.5 despite a near-zero bias. These are resting nights: true HR barely moves, so its variance is small relative to the devices' second-to-second noise (range compression). Pearson r is in effect the ratio of shared signal variance to signal-plus-noise variance, so it looks poor whenever the subject is stable; it is the wrong agreement metric here. Bias, SD/LoA and Arms describe the error directly and do not depend on how much the underlying HR happened to vary.
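The range-compression effect can be made concrete under a simple additive model, X = T + e_A and Y = T + e_B with T, e_A and e_B mutually independent, for which r = σ_T² / √((σ_T² + σ_A²)(σ_T² + σ_B²)). The sketch below evaluates that expression; the true-HR spreads of 2.5 and 15 bpm are illustrative assumptions, not measured values, while 3.7 and 0.74 bpm are the device figures from §3.1.

```javascript
// Expected Pearson r between two devices reading the same true HR,
// assuming independent additive errors (a modelling assumption).
function expectedPearsonR(sdTrue, sigmaA, sigmaB) {
  const vT = sdTrue ** 2;
  return vT / Math.sqrt((vT + sigmaA ** 2) * (vT + sigmaB ** 2));
}

// Same device noise, different amounts of true HR movement.
const restR = expectedPearsonR(2.5, 3.7, 0.74);  // quiet night (assumed SD)
const activeR = expectedPearsonR(15, 3.7, 0.74); // wide HR range (assumed SD)
console.log(restR.toFixed(2), activeR.toFixed(2)); // about 0.54 and 0.97
```

Nothing about the devices changes between the two calls; only the spread of the underlying heart rate does, which is why r says more about the night than about the sensor.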

The H10's own short-term repeatability is 0.74 bpm — small enough that subtracting it changes the recovered O2Ring σ by under 0.1 bpm (3.78 → 3.71), which both justifies treating the strap as a transfer standard and shows the result is insensitive to that assumption.

The O2Ring's firmware smoothing is visible in the data — and it matters for §3.2. Two signatures expose it. First, in the Bland–Altman plot (Fig 1b) the reported pulse is whole-integer and held across runs, so the difference-vs-mean cloud collapses onto a lattice of diagonal bands — one per integer bpm, each of slope −2 (eliminate the held value O from x = (O+H)/2, y = O−H and the locus is the line y = −2x + 2O) — rather than the formless scatter a continuously-valued device leaves. Second, successive seconds are frequently identical, so the device's own second-to-second variance is small because the firmware averages real beat-to-beat variation away, not because it tracks the instantaneous rate better. This is why a same-device self-residual is a degenerate precision proxy for the O2Ring (we read its precision from the scatter against the H10 instead) and — decisively — why the three-cornered hat in §3.2 places the O2Ring's σ below the 130 Hz ECG's: smoothness deflates variance. That ranking is error structure, not sensor fidelity.
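The slope −2 locus is easy to confirm numerically. In this small check the held value and the reference readings are arbitrary illustrative numbers, and `blandAltmanPoint` is a hypothetical helper.

```javascript
// Bland-Altman coordinates for one paired reading.
function blandAltmanPoint(o, h) {
  return { x: (o + h) / 2, y: o - h };
}

// For a held integer O and any reference H, y = -2x + 2O.
const held = 58;
const refs = [55.2, 56.9, 58.4, 60.1, 61.7];
const onBand = refs.every((h) => {
  const { x, y } = blandAltmanPoint(held, h);
  return Math.abs(y - (-2 * x + 2 * held)) < 1e-9;
});
console.log(onBand); // true: every point lies on the band for O = 58
```

Each distinct held integer shifts the intercept 2O by two units, which is what stacks the parallel bands seen in Figure 1b.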

3.2 The third corner, recovered from raw signal — a real three-cornered hat

The Verity Sense was meant to be the third corner, and its onboard outputs nearly defeated it: across every night the heart-rate stream is all-zero and the pulse-interval export is a header with no rows — the band's firmware never locked a pulse. But Polar Sensor Logger also recorded the raw photodiode signal, and the embedded algorithm's failure does not mean the signal is gone. Running the raw PPG through the suite's own optical pipeline (PPGDSP: 0.5–8 Hz band-pass, channel selection, systolic-foot detection, per-beat SQI) recovers a clean heart rate where the firmware reported nothing. We did the same for the reference leg: rather than trust Polar's onboard R-R, we re-derived the H10 heart rate from its raw ECG with the suite's Pan-Tompkins detector (ECGDSP).

**Table 2.** Onboard firmware output vs raw-signal recovery, Verity Sense. The device's own HR/PPI are empty; the raw PPG is excellent.
Source	Onboard usable	Raw-recovered (PPGDSP)	Mean SQI	Clean beats
Verity HR stream (all nights)	0 samples	—	—	—
Verity PPI stream (06-16)	0 beats (header only)	—	—	—
Verity raw PPG (06-16/17)	n/a	full HR series (raw PPG at ~176 Hz)	0.95–1.00	96–100%

The first window came on the night of 06-16: the Verity raw PPG runs ~6 hours and an H10 ECG recording begins at 01:06, while the O2Ring records throughout, so for ~2 hours (01:06–03:04, 7,057 paired seconds) all three devices observe the same heart at once. With the three streams aligned on the shared floating tMs grid, the three-cornered hat recovers each device's own σ with none assumed canonical. The tool runs this as multi-window machinery (an array of windows, each solved by the same kernel; see §5) and reports a per-device σ distribution (median, a 95% CI, the window count and the total simultaneous seconds) rather than a bare point estimate. Six windows now clear the ≥1,000-s, SQI-gated bar (the 06-10/11, 06-11/12, 06-14/15, 06-15/16 and 06-16/17 overnight windows, plus a short 06-18/19 window whose contact-flat ECG forced its H10 leg onto the onboard R-R; 63,231 simultaneous seconds in all), so the σ below is the across-window median with a 95% bootstrap CI over the six windows. One further candidate (06-12/13) was excluded for failing the pre-registered H10↔O2Ring control leg (SD 8.6 bpm vs ~2.7 bpm, from a noisy ECG derivation that night). The H10↔O2Ring leg is carried as a built-in control (it stays bias≈0 / SD 2.2–3.1 bpm in every retained window) and any negative TCH variance is surfaced, not hidden.
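The across-window aggregation can be sketched as a percentile bootstrap of the median. This is an assumed form of the aggregator, not the tool's code: the function names, the return shape and the seeded generator are hypothetical, and the per-window σ values are placeholders, since the paper does not list the individual window results.

```javascript
function median(xs) {
  const s = [...xs].sort((a, b) => a - b);
  const m = s.length >> 1;
  return s.length % 2 ? s[m] : (s[m - 1] + s[m]) / 2;
}

// Linear-interpolated quantile of an already sorted array.
function quantile(sorted, q) {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Small seeded LCG so the demo is repeatable (illustrative only).
function seededRng(seed) {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Resample windows with replacement and collect the medians.
function bootstrapMedianCi(perWindowSigma, { reps = 2000, rng = Math.random } = {}) {
  const n = perWindowSigma.length;
  const meds = [];
  for (let r = 0; r < reps; r++) {
    const sample = Array.from({ length: n },
      () => perWindowSigma[Math.floor(rng() * n)]);
    meds.push(median(sample));
  }
  meds.sort((a, b) => a - b);
  return {
    median: median(perWindowSigma),
    ci95: [quantile(meds, 0.025), quantile(meds, 0.975)],
    nWindows: n,
  };
}

// Placeholder per-window sigmas (bpm); NOT the paper's values.
const demoCi = bootstrapMedianCi([2, 3, 4, 5, 6, 8], { rng: seededRng(42) });
console.log(demoCi);
```

With only six windows the resampled medians take few distinct values, so the interval is coarse; that is one reason further windows are worth capturing.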

**Figure 2.** Three-cornered-hat reference-free σ (live tool output) over the three-device overlap: six windows, 63,231 s. Bars are the aggregate (across-window median) per-device σ (O2Ring 1.83, H10 2.04, Verity 3.50 bpm) with **95% CI whiskers** (bootstrap over the six windows) and faint per-window dots. H10 HR from raw ECG (Pan-Tompkins), Verity HR from raw PPG (PPGDSP), O2Ring native pulse. Each bar is that device's individual σ, solved from the three pairwise difference variances with no device assumed perfect.

**Table 3.** Three-cornered-hat σ (none canonical) — aggregate median across **N=6** three-device windows (63,231 s) with across-window 95% bootstrap CI. Pairwise rows are pooled across all six windows. H10 = ECG-derived.
Device / pair	σ (TCH) [95% CI] (bpm)	Bias (bpm)	SD (bpm)	95% LoA (bpm)	Arms (bpm)	r
O2Ring (pulse)	1.83 [1.40–2.12]	— device σ, reference-free —
H10 (ECG-derived)	2.04 [1.77–2.23]	— device σ, reference-free —
Verity Sense (PPG)	3.50 [2.27–5.24]	— device σ, reference-free —
H10 − O2Ring (control)	—	−0.01	2.86	±5.6	2.86	—
H10 − Verity	—	+0.64	4.55	±8.9	4.59	—
Verity − O2Ring	—	−0.65	4.50	±8.8	4.55	—

The H10↔O2Ring leg shows the same tight agreement seen across the seven nights, and here it doubles as the hat's built-in control leg: it stays bias≈0 / SD 2.2–3.1 bpm in all six retained windows, so a window where it drifts is flagged as mis-aligned rather than trusted, which is exactly what removed 06-12/13. The aggregate reference-free σ ranks O2Ring (1.83) ≈ H10 (2.04) < Verity (3.50) bpm, but the O2Ring and H10 CIs overlap, so those two are statistically indistinguishable and only Verity separates. The most important change from adding windows concerns Verity: its single-window σ of 6.2 bpm (06-16/17) proved to be the worst window, not the typical one. Across six windows the median falls to 3.50 bpm [2.27–5.24], with the other windows clustered near 1.6–4.3 bpm. This is exactly why a single window cannot be trusted for a device-σ claim, and why multi-window capture was made the priority. The H10's ≈2.0 bpm here is not in tension with the ≈0.7 bpm repeatability quoted in §2.3/§3.1, because the two measure different quantities: §2.3's 0.7 bpm is the H10's short-term precision (the 1-Hz residual left after a rolling median, i.e. tracking jitter once slow trend is removed), whereas the three-cornered-hat 2.04 bpm is the H10's total reference-free σ over the overlap, which additionally absorbs 1-Hz bucketing and the beat-to-beat granularity of an instantaneous ECG HR against the smoothed peers. A sanity check anchors the reference: the ECG-derived H10 HR matches Polar's onboard R-R to a −0.04 bpm bias (SD 2.27, r 0.85, n 7,224). Two independent QRS algorithms on the same heart agree, so the H10 leg is sound whichever way it is computed.

Read the ranking as error structure, not a plain accuracy order. The three-cornered hat assumes the three devices' errors are uncorrelated. Two physical facts bend that here: the O2Ring's reported pulse is internally smoothed (successive seconds rarely change), which suppresses its apparent σ below its true instantaneous error; and the raw-recovered Verity HR is instantaneous per beat, so it carries genuine beat-to-beat variability (true HRV) that the smoothed devices average away, inflating its apparent σ. The hat still does what no pairwise method can — separate three devices with none assumed canonical — but the honest reading is that the O2Ring is the smoothest and the raw Verity the least filtered, not simply that one is "more accurate" than another. The decisive, assumption-light lesson is the capture one: log the raw waveform, not the firmware's HR — it turned a dead third corner into a clean one at SQI ≈ 1.0. Concretely: the 130 Hz ECG (H10) carries a higher reference-free σ than the 1 Hz smoothed O2Ring despite being the truer source, and their CIs overlap — so the only statistically robust separation is Verity, at roughly twice the other two.

4. Discussion

The recipe answers the opening question concretely. Without any certified instrument we can state that the O2Ring's pulse rate is unbiased to within a beat and carries a random uncertainty near 3.7 bpm at rest: a number good enough to trust an overnight average while distrusting any single second, and good enough to flag a night where motion has destroyed the signal. The separation of precision from trueness is what makes this defensible: repeatability needed no reference; bias needed one, and a chest-strap ECG, itself uncertified but roughly five times more precise (0.74 vs ≈3.7 bpm), is a sound stand-in whose residual noise we measured and subtracted. The three-cornered hat then removed even that mild assumption across the six windows where all three devices co-recorded, putting a number on each device with none held canonical. The sharpest practical lesson was unexpected: the third device's firmware produced nothing, yet its raw signal was pristine, so the right thing to log is the waveform, and the right thing to run on it is a detector you control.

Limitations.
(i) Single subject, seven nights plus six three-device windows: a methods demonstration, not a population accuracy statement; the σ is this O2Ring on this finger, at rest.
(ii) HR channel only; the O2Ring's SpO₂ trueness is untouched and would need arterial CO-oximetry.
(iii) The reference is a consumer ECG strap, not a clinical monitor: fit for a transfer standard, not a regulatory claim.
(iv) Resting data compresses the dynamic range, so these figures bound rest-HR error and may understate error during rapid HR change; an exercise protocol would probe the high-slew regime.
(v) The three-cornered hat assumes uncorrelated errors; the O2Ring's internal smoothing and the Verity's instantaneous derivation bend that assumption (see the reading of Table 3 in §3.2), so its σ ranking describes error structure, not a clean accuracy order.
(vi) Verity HR is recoverable only where raw PPG was logged; six co-recorded windows now clear the ≥1,000-s SQI-gated bar. One (06-18/19) had a contact-flat H10 ECG, so its reference leg falls back to the strap's onboard R-R rather than ECG-derived QRS, a footnoted substitution that is defensible because the two H10 derivations agree to a few hundredths of a percent (PulseDex's RR↔ECG comparator). One further window (06-12/13) was excluded for failing the H10↔O2Ring control leg, and an earlier 06-13/14 night was auto-skipped for insufficient three-way overlap. With N=6 the reported CI is an across-window bootstrap and the uncorrelated-error assumption is testable; moving toward the upper end of the recommended 5–10 windows (§6) and spanning non-resting states remains a capture task.
(vii) The hat cannot be extended to a fourth corner using the H10's onboard R-R: it comes from the same electrodes and signal as the ECG-derived leg (the two agree to ≈ 0.04% via PulseDex's comparator, error correlation ≈ 1), so it is not an independent measurement. Fed as a separate corner it violates the uncorrelated-error assumption and degenerates the solve (a near-zero mutual variance, typically a negative output); a genuine fourth corner needs a physically separate sensor, which is why the onboard R-R serves only as a same-device cross-check and the 06-18/19 fallback leg.

5. Reproducibility

Run it: open sigma-no-reference-analysis.html → “Run corpus”. Figures 1–2 and Tables 1–3 populate live from the committed device files and raw-derived series; export sigma-no-reference-results.csv, -stats.json, -figures.png.
Inputs (real): Polar H10 *_RR.txt and Wellue O2Ring *.csv for nights 06-10/11/12/14/15/17/18; for the three-device hat, each window's H10 *_ECG_part*.txt (raw ECG) and Verity Sense *_PPG_part*.txt (raw PPG) — six overnight windows spanning 06-10/11 to 06-18/19. All under uploads/, captured with Polar Sensor Logger (Polar devices) and the O2Ring exporter.
Raw-signal recovery (committed derived series): for each window, verity-ppg-derived-<date>-HR.txt = Verity HR from raw PPG via PPGDSP.analyze (SQI-gated) and h10-ecg-derived-<date>-HR.txt = H10 HR from raw ECG via ECGDSP Pan-Tompkins; the 06-18/19 window, whose ECG was contact-flat, instead uses h10-rr-derived-2026-06-19-HR.txt (onboard R-R). All are compact 1-Hz tMs;hr files the live tool ingests.
Time model: the tool mirrors the suite parseTimestamp (Clock Contract) — ISO no-zone for Polar, HH:MM:SS DD/MM/YYYY (DMY) for O2Ring; all display via getUTC*, so results are viewer-timezone-independent.
Estimators: repeatability (rolling-median robust SD), Bland–Altman + Arms + variance subtraction, and threeCorneredHat(V_AB,V_AC,V_BC) applied per window. The hat is driven by a TRIOS[] array of three-device windows: a builder intersects each window's three raw-derived streams on the floating tMs grid (keeping spans ≥1,000 s), the per-window kernel solves σ for each device, and an aggregator reports the median σ with a CI (across-window bootstrap at N≥3; within-window block bootstrap below that), N_windows, total simultaneous seconds, plus negative-variance and H10↔O2Ring control-leg checks. The full per-window distribution is written into sigma-no-reference-stats.json.
Add a window: derive the night's Verity HR from raw PPG (PPGDSP) and H10 HR from raw ECG (ECGDSP), commit them as uploads/*-derived-YYYY-MM-DD-HR.txt, and append a TRIOS entry — step-by-step in SIGMA-WINDOW-DERIVATION.md, which also carries the three-device capture protocol.
Next: extend beyond the current six three-device windows toward ten (capture Verity raw PPG alongside the H10 every session) to tighten the across-window CI and further test the uncorrelated-error assumption; an exercise session to bound high-slew error; multi-subject capture before any accuracy claim.

6. Sample size & statistical power

Unlike the simulation pilots, this paper runs on real captured data, so “power” means having enough co-recorded time — and enough simultaneous three-device time — rather than synthetic patients. Two different sample sizes matter: paired seconds set the precision of the bias/SD/Arms figures (their SE falls as ~1/√n_seconds, and one night already supplies >20,000), while co-recorded nights/sessions set how well we separate a stable device-σ from night-specific artifacts, and three-device overlap windows are the binding constraint on the three-cornered hat.
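The 1/√n scaling can be put in numbers with the normal-theory approximation SE(SD) ≈ SD / √(2(n − 1)). This is a rough sketch under an independence assumption that 1-Hz heart-rate samples do not satisfy (successive seconds are autocorrelated), so the effective n is smaller and the true uncertainty larger than shown; `seOfSd` is a hypothetical helper.

```javascript
// Normal-theory standard error of a sample SD from n independent draws.
function seOfSd(sd, n) {
  return sd / Math.sqrt(2 * (n - 1));
}

// One clean night: SD about 3.78 bpm over roughly 20,000 paired seconds.
const seOneNight = seOfSd(3.78, 20000);
const half95 = 1.96 * seOneNight;
console.log(seOneNight.toFixed(3), half95.toFixed(3)); // 0.019 and 0.037 bpm
```

Even this optimistic figure shows why seconds are not the binding constraint: one night already pins the SD to a few hundredths of a bpm, while the number of nights and three-device windows stays in single digits.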

**Table 4.** Data-sufficiency guidance for reference-free σ (real co-recorded captures).
Quantity	Minimum (acceptable)	Recommended	Diminishing returns
Bias / SD / Arms vs transfer standard	1 clean night (~20k paired s) → SD to ≈±0.05 bpm	5–7 clean nights — separates a stable σ from night artifacts; lets you flag (not average over) a bad night	> ~15 nights: per-night σ already stable; extra nights mainly characterize night-to-night spread, not the central σ
Three-cornered-hat per-device σ	achieved: 6 windows, 63,231 simultaneous s (each ≥1,000 s, SQI-gated) — tool reports an across-window 95% bootstrap CI	5–10 overlap windows on different nights — tightens the across-window CI and probes the uncorrelated-error assumption further	once windows span varied HR/motion states; more identical resting windows add little
HR dynamic range probed	resting only (this run)	+1 exercise/recovery session — bounds high-slew error the rest data can't see	—
Subjects	1 (methods demonstration)	≥10–20 for any population accuracy statement	set by the claim, not by σ precision

Practical reading for this dataset: the per-night HR-error σ is already well-determined (seven nights, 128k paired seconds), so additional captures buy the most where we are currently thinnest — more simultaneous three-device windows (every session that logs Verity raw PPG alongside the H10 adds a corner), and at least one non-resting session to bound error during rapid heart-rate change. More resting single-device nights add the least. This directly shapes what to prioritize as further data is added: capture all three devices together, keep the raw waveforms, and include some movement.

References

D. W. Allan, “Statistics of atomic frequency standards,” Proc. IEEE 54(2):221–230, 1966 — the variance framework underlying frequency-stability estimation.
J. E. Gray & D. W. Allan, “A method for estimating the frequency stability of an individual oscillator,” Proc. 28th Ann. Symp. Frequency Control, 1974 — the three-cornered-hat variance decomposition.
A. Premoli & P. Tavella, “A revisited three-cornered hat method for estimating frequency standard instability,” IEEE Trans. Instrum. Meas. 42(1):7–13, 1993 — handling negative and correlated variance estimates.
F. Torcaso, C. R. Ekstrom, E. A. Burt & D. N. Matsakis, “Estimating frequency stability and cross-correlations,” Proc. 30th PTTI Meeting, 1998 — the N-cornered hat with correlated sources (why a non-independent corner degenerates the solve).
W. J. Riley, “Handbook of Frequency Stability Analysis,” NIST Special Publication 1065, 2008 — practical multi-source variance estimation.
J. M. Bland & D. G. Altman, “Statistical methods for assessing agreement between two methods of clinical measurement,” Lancet 1(8476):307–310, 1986 — bias and 95% limits of agreement.
J. M. Bland & D. G. Altman, “Measuring agreement in method comparison studies,” Stat. Methods Med. Res. 8(2):135–160, 1999 — repeated-measures limits of agreement.
D. Giavarina, “Understanding Bland Altman analysis,” Biochem. Med. 25(2):141–151, 2015 — interpretation of difference-vs-mean plots, including discretization/banding artifacts.
J. Pan & W. J. Tompkins, “A real-time QRS detection algorithm,” IEEE Trans. Biomed. Eng. 32(3):230–236, 1985 — the R-peak detector applied to the raw H10 ECG.
J. Allen, “Photoplethysmography and its application in clinical physiological measurement,” Physiol. Meas. 28(3):R1–R39, 2007 — PPG signal fundamentals.
A. Schäfer & J. Vagedes, “How accurate is pulse rate variability as an estimate of heart rate variability?” Int. J. Cardiol. 166(1):15–29, 2013 — PPG- vs ECG-derived heart rate.
M. Gilgen-Ammann, T. Schweizer & T. Wyss, “RR interval signal quality of a heart rate monitor and an ECG Holter at rest and during exercise,” Eur. J. Appl. Physiol. 119:1525–1532, 2019 — Polar H10 validation against ECG.
B. Bent, B. A. Goldstein, W. A. Kibbe & J. P. Dunn, “Investigating sources of inaccuracy in wearable optical heart rate sensors,” npj Digit. Med. 3:18, 2020 — error sources in PPG heart rate (motion, perfusion).
A. Jubran, “Pulse oximetry,” Crit. Care 19:272, 2015 — oximetry principles and accuracy limits (why SpO₂ trueness is not testable here).
ISO 80601-2-61, “Medical electrical equipment — particular requirements for basic safety and essential performance of pulse oximeter equipment” — the accuracy root-mean-square (Arms) convention; to be cited precisely at submission.
Project documentation: CLAUDE.md (Clock Contract, capture provenance), Tepna suite; analysis apparatus sigma-no-reference-analysis.html; detectors ecgdex-dsp.js (ECGDSP) and ppgdex-dsp.js (PPGDSP); same-device RR↔ECG cross-derivation comparator in pulsedex-app.js.