On-Device AI Insights
How seven statistical engines and Apple's Foundation Models generate personalized walking insights without any data leaving your phone.
We wanted Step FWD to do more than count steps. We wanted it to notice patterns — the kind of patterns you might miss when you’re just glancing at a number each day. Are your Mondays consistently stronger than your Fridays? Are you trending up over the past month? How quickly do you bounce back after a missed goal day?
The catch: we’d committed to keeping everything on-device. No cloud, no accounts, no server to send your data to. That rules out the usual approach of shipping walking data to a backend and running models there. Everything had to run locally, on the phone, with the data that was already there.
Seven engines, one pipeline
The insight system is built around seven statistical engines, each designed to detect a specific pattern in your walking history. They run sequentially on every view refresh, and the three most relevant insights are surfaced to you.
1. Rhythm Analysis
Requires 4 weeks of data. Groups your step counts by day of week and calculates the average for each. Then it identifies your strongest and weakest days and measures the spread between them. If your Mondays average 18,245 steps and your Fridays average 12,761 — that’s a 43% spread, and it’s worth mentioning.
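The app itself is Swift, but the math is small enough to sketch in a few lines. Here is an illustrative Python version of the weekday grouping and spread calculation (function and variable names are ours, not Step FWD's):

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def rhythm_insight(daily_steps):
    """daily_steps: dict mapping date -> step count, ideally 4+ weeks of data."""
    by_weekday = defaultdict(list)
    for day, steps in daily_steps.items():
        by_weekday[WEEKDAYS[day.weekday()]].append(steps)
    averages = {wd: sum(v) / len(v) for wd, v in by_weekday.items()}
    strongest = max(averages, key=averages.get)
    weakest = min(averages, key=averages.get)
    # Spread: how much stronger the best day is, relative to the weakest.
    spread = (averages[strongest] - averages[weakest]) / averages[weakest]
    return strongest, weakest, round(spread * 100)
```

With the article's example (Mondays averaging 18,245, Fridays averaging 12,761), the spread works out to 43%.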
2. Trend Detection
Uses linear regression over your last 3+ weeks. We calculate the slope of your daily step counts, convert it to a weekly change rate, and measure the fit with R². A trend only surfaces if the weekly change exceeds 200 steps (to filter noise) and the R² is above 0.1 (meaning at least 10% of the variance is explained by the trend).
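A minimal Python sketch of the slope and R² gating described above; the 200-step and 0.1 thresholds come from the text, everything else is illustrative:

```python
def trend_insight(daily_steps, min_weekly_change=200, min_r_squared=0.1):
    """daily_steps: chronological list of daily step counts (3+ weeks)."""
    n = len(daily_steps)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_steps) / n
    # Ordinary least squares slope, in steps per day.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_steps))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # R^2: share of daily variance explained by the linear trend.
    syy = sum((y - mean_y) ** 2 for y in daily_steps)
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 0.0
    weekly_change = slope * 7
    if abs(weekly_change) < min_weekly_change or r_squared < min_r_squared:
        return None  # treated as noise; no insight surfaced
    return weekly_change, r_squared
```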
3. Consistency Score
Measures how stable your step counts are day to day using the coefficient of variation (standard deviation divided by mean). The result maps to a 0–100 score: above 80 is “Very Consistent,” below 20 is “Very Irregular.” This insight gets more relevant the further your score is from the middle — being very consistent or very irregular is more interesting than being moderately variable.
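The article does not spell out how the CV maps onto the 0–100 scale, so the mapping below is an assumption (score = (1 − CV) × 100, clamped); the labels and the distance-from-the-middle relevance follow the description, in illustrative Python:

```python
from statistics import mean, pstdev

def consistency_insight(daily_steps):
    # Coefficient of variation: population stddev divided by mean.
    cv = pstdev(daily_steps) / mean(daily_steps)
    # Assumed mapping from CV to a 0-100 score (not documented in the article).
    score = max(0.0, min(100.0, (1 - cv) * 100))
    if score > 80:
        label = "Very Consistent"
    elif score < 20:
        label = "Very Irregular"
    else:
        label = "Moderately Variable"  # illustrative middle label
    # Relevance grows with distance from the middle of the scale.
    relevance = abs(score - 50) / 50
    return round(score), label, round(relevance, 2)
```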
4. Goal Pacing
Only runs for the current period (not historical). Calculates how many days you’ve met your goal so far, projects that rate across the full period, and tells you whether you’re on track. This one peaks in relevance around the midpoint of your period — early on there’s not enough data, and near the end the outcome is already clear.
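A hedged Python sketch of the projection and the midpoint-peaked relevance; the exact relevance curve is our guess, shaped to peak at the halfway point as described:

```python
def pacing_insight(days_met, days_elapsed, period_length):
    """Project the current goal-hit rate across the full period."""
    rate = days_met / days_elapsed
    projected_days_met = round(rate * period_length)
    # Assumed relevance curve: 1.0 at the period midpoint, 0.0 at either end.
    progress = days_elapsed / period_length
    relevance = round(1 - abs(progress - 0.5) * 2, 2)
    return projected_days_met, relevance
```

For example, 10 goal days in the first 15 days of a 30-day period projects to 20 goal days, at peak relevance.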
5. Personal Best
Checks if your best day in the current period is an all-time record or lands in the top 10 percent of your historical days. We include explicit date annotations — the system always tells you which day and how far back the comparison goes, so the insight has real context instead of vague superlatives.
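In illustrative Python, the record/percentile check might look like this (the 90% cutoff implements "top 10th percentile"):

```python
def personal_best_insight(best_this_period, history):
    """history: all historical daily step counts, excluding the current period."""
    if best_this_period > max(history):
        return "all-time record"
    # Top 10 percent: the day beats at least 90% of historical days.
    beaten = sum(1 for s in history if s < best_this_period)
    if beaten / len(history) >= 0.9:
        return "top 10 percent"
    return None
```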
6. Bounce-Back Resilience
Counts how often you recover after a missed goal day. If you miss your goal on Tuesday and hit it on Wednesday, that’s a recovery. The bounce-back rate is recoveries divided by missed days. A high rate suggests resilience; a low rate gets a constructive suggestion rather than judgment.
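The recovery count reduces to a single pass over consecutive day pairs (illustrative Python; note that a miss on the final day has no next day and so is not counted):

```python
def bounce_back_insight(daily_steps, goal):
    """Rate of missed-goal days that are immediately followed by a goal hit."""
    missed, recovered = 0, 0
    for today, tomorrow in zip(daily_steps, daily_steps[1:]):
        if today < goal:
            missed += 1
            if tomorrow >= goal:
                recovered += 1
    return recovered / missed if missed else None
```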
7. Walking Pace Trend
Requires at least 6 completed walking sessions with GPS data. Splits your sessions chronologically into earlier and recent halves, calculates average pace (minutes per kilometer) for each, and reports the percentage change. Only surfaces if the change exceeds 3% — anything less is noise.
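A Python sketch of the halves comparison, with the 6-session minimum and 3% noise floor from the text; since pace is minutes per kilometer, a negative change means you got faster:

```python
def pace_trend_insight(session_paces, min_sessions=6, noise_floor=0.03):
    """session_paces: chronological pace values in minutes per kilometer."""
    if len(session_paces) < min_sessions:
        return None
    mid = len(session_paces) // 2
    earlier = sum(session_paces[:mid]) / mid
    recent = sum(session_paces[mid:]) / (len(session_paces) - mid)
    change = (recent - earlier) / earlier
    if abs(change) <= noise_floor:
        return None  # under 3% is treated as noise
    return round(change * 100, 1)  # percent; negative = faster
```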
Relevance scoring
Each engine assigns a relevance score between 0.0 and 1.0 to its output. Insights below 0.3 are filtered as noise. The remaining insights are sorted by relevance, and only the top 3 are shown.
This prevents information overload. On any given day, you might have data that triggers five or six engines. But the system picks the three most meaningful observations and presents those.
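The filter-and-rank step is simple enough to show directly (illustrative Python; the 0.3 floor and top-3 limit come from the text):

```python
def select_insights(candidates, noise_floor=0.3, limit=3):
    """candidates: list of (relevance, insight) pairs from the seven engines."""
    kept = [c for c in candidates if c[0] >= noise_floor]
    kept.sort(key=lambda c: c[0], reverse=True)  # most relevant first
    return kept[:limit]
```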
From statistics to language
Raw statistics aren’t useful to most people. “CV = 0.23” means nothing. “R² = 0.47” means nothing. The numbers need to become sentences.
This is where Apple’s Foundation Models come in. Each engine produces a promptContext — a structured summary of what it found, written for an LLM to interpret. The on-device model receives this context along with system instructions that define tone, length, and formatting rules.
The model generates a short title (4 words max, no emoji) and a 2–3 sentence body. The rules are specific: use exact numbers from the data, always specify what you’re comparing against, never echo technical statistics, write for a normal person, and be honest about negatives.
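The article names the promptContext field but not its contents, so this Python sketch of one engine's context string is purely illustrative; it encodes two of the rules above (exact numbers, an explicit comparison baseline) plus the "(not today)" date annotation discussed later:

```python
def prompt_context(weekly_change, best_day_label):
    """Build one engine's LLM-facing summary. Purely illustrative shape:
    plain numbers, explicit baselines, annotated dates, no raw statistics."""
    direction = "up" if weekly_change > 0 else "down"
    return (
        f"Daily steps are trending {direction} by about "
        f"{abs(round(weekly_change))} steps per week, "
        f"compared to the previous 3 weeks. "
        f"Best single day: {best_day_label} (not today)."
    )
```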
The generation happens entirely on-device using LanguageModelSession from Apple’s FoundationModels framework. No API call, no network request. Your walking patterns are analyzed where they were collected — on your iPhone.
Fallback path
On-device LLMs aren’t perfect. They can hallucinate dates, misinterpret trends, or produce awkward phrasing. Every insight has a pre-written template title and fallback sentence that can replace the LLM output if generation fails or produces something unusable.
This dual-path approach means insights always work — even on devices without Foundation Models support. The templates are less personalized but still useful: “Steps Trending Up” with a clear factual sentence beats a blank card.
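The dual path reduces to a simple guard. A sketch, assuming `generate` is whatever invokes the on-device model; any failure or empty output falls back to the pre-written template:

```python
def render_insight(generate, template_title, template_body):
    """Try on-device generation; fall back to the template on any failure."""
    try:
        title, body = generate()
        if not title or not body:
            raise ValueError("unusable output")
        return title, body
    except Exception:
        return template_title, template_body

def failing_generator():
    raise RuntimeError("model unavailable")
```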
What we learned about on-device LLMs
Small models struggle with temporal reasoning. A date like “January 15” gets summarized as “today” unless you explicitly annotate it with “(not today)” in the prompt. Technical statistics leak into the output unless you strip them from the prompt context entirely.
We iterated through several prompt strategies:
- Adding explicit parentheticals after every historical date
- Removing all statistical measures (CV, R², stddev) from the prompt
- Including “NEVER say ‘today’” directly in the system instructions
- Requiring explicit comparison baselines (“compared to your last 6 months” instead of “top 1%”)
These aren’t elegant solutions — they’re battle-tested patches for real failure modes we encountered during testing.
Caching
Insights are cached with a 4-hour TTL. The cache invalidates on four triggers: the TTL expiring, today’s steps crossing the daily goal, a calendar day change, or a walking session being completed. This balances freshness with computation cost.
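All four triggers collapse into one staleness check (illustrative Python; the caller is assumed to supply the goal-crossing, day-change, and session-completed flags):

```python
from datetime import datetime, timedelta

TTL = timedelta(hours=4)

def cache_is_stale(cached_at, now, goal_crossed, day_changed, session_completed):
    """Cached insights are recomputed when any of the four triggers fires."""
    return (
        now - cached_at > TTL   # 4-hour TTL expired
        or goal_crossed         # today's steps crossed the daily goal
        or day_changed          # calendar day rolled over
        or session_completed    # a walking session just finished
    )
```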
Why not the cloud
It would have been easier. Ship the data to a server, run a proper model, return polished results. But “easier” isn’t the right metric when you’re building a privacy-first walking app.
Your walking data is health data. Your patterns reveal when you’re active, when you’re sedentary, where you go, how long you’re outside. That data deserves to stay on your device. The statistical engines are fast enough to run locally. The on-device LLM, with its quirks and workarounds, produces results that are good enough to be genuinely useful.
The trade-off is real — cloud models would produce better prose. But the privacy guarantee is worth more than perfect sentences.