Methodology
How we measure detection accuracy, what those numbers mean, and how we report them honestly.
What We Measure
AwareFlow uses one binary classifier per habit. Right now there are two: a sniff classifier and a throat clearing classifier. Each one is trained on labeled audio examples to recognize a single specific sound.
Detection runs entirely on your device using the Apple Neural Engine. Audio is analyzed in real time and immediately discarded. Nothing is recorded or sent anywhere. The numbers below are about how reliably the models recognize their target sound, not about anything stored from you.
What We Currently Report
The figures below come from Create ML training, which evaluates each model against held-out portions of the same labeled dataset that its training examples come from.
- Sniff classifier. 90% training accuracy, 90% validation accuracy, 92% test accuracy.
- Throat clearing classifier. 88% validation accuracy.
These numbers reflect how the model performs against held-out splits of its own training data. That is internally meaningful but not the same thing as field accuracy. Field accuracy depends on your environment, your microphone, the distance to your phone, and your individual sound profile.
What These Numbers Don't Mean
A training metric is not a clinical claim. AwareFlow is not a medical device and does not measure or treat any condition.
Real-world accuracy varies. Quiet rooms behave differently from cafes. AirPods behave differently from a phone microphone on a desk. A throat clear from someone with a cold has different acoustics from a routine throat clear during a meeting. The training metrics above don't capture any of that.
The Calibration Lab personalizes the threshold per user, not the model itself. Your experience improves because the threshold adapts to your sounds and your environment, not because the underlying classifier changes.
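To make the distinction between the threshold and the model concrete, here is a minimal sketch of a per-user threshold applied to a fixed classifier's confidence score. It is an illustration, not AwareFlow's actual implementation: the class name, the starting value, the step size, the bounds, and the update rule are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class UserThreshold:
    """Hypothetical per-user cutoff applied to a fixed classifier's confidence score."""

    value: float = 0.50   # assumed starting threshold
    step: float = 0.02    # assumed adjustment per piece of feedback
    lower: float = 0.30   # assumed bounds so the threshold cannot drift too far
    upper: float = 0.95

    def is_detection(self, confidence: float) -> bool:
        # The classifier's score is unchanged; only the cutoff is personal.
        return confidence >= self.value

    def record_rejection(self) -> None:
        # A rejected detection suggests the cutoff is too permissive, so raise it.
        self.value = min(self.upper, self.value + self.step)

    def record_missed(self) -> None:
        # A manually logged miss suggests the cutoff is too strict, so lower it.
        self.value = max(self.lower, self.value - self.step)
```

In this sketch the same confidence score can count as a detection for one user and not for another, even though both users run an identical classifier.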
How AwareFlow Communicates Trust
AwareFlow does not show users a static accuracy percentage in the app. No single number describes your personal experience, so showing one would be misleading.
Instead, AwareFlow uses a Detection Maturity system: every habit moves through three states as the app gathers your feedback.
- Learning. Early sessions. AwareFlow is building a baseline for your environment and your individual sound profile. Expect some over-detection while it tunes.
- Adapting. Pattern stabilizing. Your confirmations and rejections are shaping what the system notices.
- Personalized. Your specific threshold has stabilized. AwareFlow has heard enough of your habits and your environment to respond reliably.
Trust is earned through use, not claimed up front. The state label is visible in the app so you always know what stage you're in.
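The three states above can be thought of as a small state function over your feedback history. The sketch below is hypothetical: the inputs (a feedback count and a measure of how much the threshold moved recently) and both cutoff values are assumptions chosen for illustration, not the rules the app uses.

```python
from enum import Enum


class Maturity(Enum):
    LEARNING = "Learning"
    ADAPTING = "Adapting"
    PERSONALIZED = "Personalized"


# Assumed cutoffs, for illustration only.
MIN_FEEDBACK_EVENTS = 20       # feedback events needed to leave Learning
STABLE_THRESHOLD_DRIFT = 0.01  # recent threshold movement treated as "stable"


def maturity_state(feedback_count: int, recent_threshold_drift: float) -> Maturity:
    """Map feedback volume and recent threshold movement to a maturity state."""
    if feedback_count < MIN_FEEDBACK_EVENTS:
        # Too little feedback to say anything about this user yet.
        return Maturity.LEARNING
    if recent_threshold_drift > STABLE_THRESHOLD_DRIFT:
        # Enough feedback, but the threshold is still moving.
        return Maturity.ADAPTING
    return Maturity.PERSONALIZED
```

The point of the sketch is that the label depends on how settled the personal threshold is, not on any accuracy percentage.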
What We're Working On
Two longer-term measurement efforts are planned. They live here so the methodology stays honest as the product grows.
- Tier 2: field telemetry, quarterly publication. AwareFlow already collects three pieces of feedback you give it during use: when you confirm a detection, when you reject one, and when you manually log a moment the app missed. From those signals, we can compute a precision proxy: out of the detections you confirmed or rejected, what fraction did you confirm? This number will be published quarterly on this page. The aggregate is computed from de-identified summary statistics, not from individual user data, and no audio is ever involved. First publication is targeted for late June 2026, once enough of you have used the app long enough to make the number meaningful.
- Tier 3: bench-validation study. The longer-term goal is a peer-reviewed, published bench validation of each classifier. The target venue is JMIR Formative Research, modeled on Bodymatter's 2025 SleepWatch protocol. Methodology: standardized audio files (target sounds plus standardized confounders such as coughs, talking, and background noise), independent raters who annotate ground truth while blinded to AwareFlow's outputs, and reported sensitivity, specificity, accuracy, and inter-rater reliability. The result is a citable paper, not a marketing number. Estimated timeline is 2 to 3 months of focused effort once the product has stabilized post-launch.
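The Tier 2 precision proxy is simple arithmetic over two counts. A minimal sketch, assuming the aggregate arrives as a total of confirmed detections and a total of rejected detections (the function name and its handling of the empty case are our own choices for illustration):

```python
from typing import Optional


def precision_proxy(confirmed: int, rejected: int) -> Optional[float]:
    """Fraction of reviewed detections that users confirmed.

    Returns None when no detections were reviewed, since the ratio is undefined.
    """
    if confirmed < 0 or rejected < 0:
        raise ValueError("counts must be non-negative")
    reviewed = confirmed + rejected
    if reviewed == 0:
        return None
    return confirmed / reviewed


# Example: 45 confirmed and 5 rejected detections give 45 / 50 = 0.9.
example = precision_proxy(confirmed=45, rejected=5)
```

Detections that nobody confirmed or rejected do not enter the ratio, and manually logged misses are not part of it either, which is why it is a proxy for precision rather than a full accuracy figure.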
Both efforts are about giving you, and the broader research community, a defensible way to evaluate how well AwareFlow does what it claims to do.
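For readers unfamiliar with the Tier 3 metrics, the sketch below shows how sensitivity, specificity, and accuracy follow from confusion-matrix counts. For inter-rater reliability it uses Cohen's kappa between two raters, which is our assumption for illustration; the study protocol may specify a different statistic. Function names are hypothetical.

```python
from typing import Dict, Sequence


def bench_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "sensitivity": tp / (tp + fn),  # share of target sounds that were detected
        "specificity": tn / (tn + fp),  # share of confounders correctly ignored
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of clips")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    labels = set(rater_a) | set(rater_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Sensitivity and specificity are reported separately because a classifier can score well on one while doing poorly on the other, and a single accuracy figure would hide that.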