Verification study

Built to flag early. Verified in the open.

This is an internal consistency verification, not clinical validation. No real children were involved. We ran 108 synthetic developmental scenarios, including borderline cases, through the public engine and compared the output against hand-labeled expectations across 294 domain-level observations. Result: 98.6% agreement, with 4 threshold-calibration disagreements in which the engine flagged concern more aggressively than the hand labels did.

Methodology

Start with what the engine does. It's built on CDC 2022 milestone guidance, covers 8 developmental domains from 0 to 36 months, includes regression detection, and is fully open source so every rule can be inspected.

The verification set was designed to stress the decision boundaries, not just the easy cases: typical development, known delay patterns, sparse-answer cases, regression scenarios, and 4 borderline threshold cases now flagged for clinical review.

Coverage

8 domains, ages 0 to 36 months, CDC 2022 aligned milestone logic.

Verification set

108 synthetic scenarios, including borderline cases where a clinician might say "monitor and re-check."

Decision rule

High-concern and moderate-concern outputs count as flagged. Normal, watch, low-concern, and insufficient-evidence outputs count as not flagged.
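That binary rule can be sketched in a few lines. The output names below are illustrative stand-ins, not necessarily the engine's actual API:

```javascript
// Sketch of the decision rule described above. Output names are
// illustrative; check the engine's real output enum before relying on them.
const FLAGGED_OUTPUTS = new Set(["high_concern", "moderate_concern"]);

function isFlagged(output) {
  // Everything else (normal, watch, low_concern, insufficient_evidence)
  // counts as not flagged.
  return FLAGGED_OUTPUTS.has(output);
}

console.log(isFlagged("moderate_concern"));      // true
console.log(isFlagged("insufficient_evidence")); // false
```

Note that abstentions land on the not-flagged side of the rule, which is why they count toward agreement only when the hand label is also not-flagged.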

Transparency

The disagreements are published instead of hidden. They show where the engine is intentionally more cautious than the hand-labeled expectation.

Results

  • 98.6% agreement (across hand-labeled profiles)
  • 0.971 Cohen's Kappa (agreement beyond chance)
  • 294 labeled observations (domain-profile pairs)
  • 108 synthetic scenarios (including borderline cases)
  • 4 threshold cases (identified for clinical review)
  • 45 abstentions (15.3% of observations)
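For a binary flagged / not-flagged decision, Cohen's Kappa reduces to a few lines. A minimal sketch, with a made-up 2x2 confusion matrix for illustration (the study's actual confusion counts are in the published report):

```javascript
// Cohen's Kappa for a 2x2 confusion matrix:
//   a = both raters flagged, d = both not flagged, b/c = disagreements.
function cohensKappa(a, b, c, d) {
  const n = a + b + c + d;
  const po = (a + d) / n; // observed agreement
  const pe = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n); // chance agreement
  return (po - pe) / (1 - pe);
}

// Illustrative counts only; not the study's data.
console.log(cohensKappa(20, 1, 2, 27).toFixed(3));
```

Kappa discounts the agreement you would expect by chance alone, which is why it is reported alongside raw agreement.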

Per-domain agreement

Four domains had a single borderline over-flag. The other four matched the hand labels exactly. That is the tradeoff we want from a screening aid: catch known delays, then calibrate edge cases with clinicians.

Domain               | N  | Agreement | Notes
---------------------|----|-----------|-----------------------------------------------
Social-Emotional     | 45 | 97.8%     | 1 borderline over-flag on sparse data
Expressive Language  | 67 | 98.5%     | 1 borderline over-flag near threshold
Gross Motor          | 72 | 100%      | No disagreements
Vision / Hearing     | 23 | 100%      | No disagreements
Cognitive / Play     | 22 | 100%      | No disagreements
Receptive Language   | 34 | 97.1%     | 1 borderline over-flag with too little evidence
Fine Motor           | 17 | 94.1%     | 1 borderline over-flag near threshold
Self-Help / Adaptive | 14 | 100%      | No disagreements
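The headline numbers fall out of the per-domain table by simple arithmetic, which you can check yourself:

```javascript
// Per-domain observation counts and disagreements, copied from the table above.
const domains = [
  ["Social-Emotional", 45, 1],
  ["Expressive Language", 67, 1],
  ["Gross Motor", 72, 0],
  ["Vision / Hearing", 23, 0],
  ["Cognitive / Play", 22, 0],
  ["Receptive Language", 34, 1],
  ["Fine Motor", 17, 1],
  ["Self-Help / Adaptive", 14, 0],
];

const total = domains.reduce((sum, [, n]) => sum + n, 0);
const disagreements = domains.reduce((sum, [, , d]) => sum + d, 0);
const agreement = (100 * (total - disagreements)) / total;

console.log(total, disagreements, agreement.toFixed(1) + "%"); // 294 4 98.6%
```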

The engine knows when to say "I don't know"

In 45 of 294 observations (15.3%), the engine returned insufficient_evidence. Fewer than 2 answered questions in a domain? It declines to classify instead of pretending to know.

In all 45 abstentions, the hand label was also not-flagged, so no known delay was masked. This is the evidence sufficiency gate doing its job. A screening aid that guesses when it lacks data is worse than one that says "come back with more signal."
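A minimal sketch of that evidence-sufficiency gate, assuming the 2-answer threshold described above (function and field names are illustrative, not the engine's real API):

```javascript
const MIN_ANSWERED = 2; // threshold from the rule described above

// Declines to classify a domain with too few answered questions,
// instead of guessing. `classify` stands in for the real scoring
// path and is illustrative only.
function classifyDomain(answers, classify) {
  const answered = answers.filter((a) => a !== null && a !== undefined);
  if (answered.length < MIN_ANSWERED) return "insufficient_evidence";
  return classify(answered);
}

const alwaysNormal = () => "normal";
console.log(classifyDomain([true, null, null], alwaysNormal)); // insufficient_evidence
console.log(classifyDomain([true, false, null], alwaysNormal)); // normal
```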

The honest version

What this proves

  • The engine is built on CDC 2022 milestone guidance across 8 domains from 0 to 36 months
  • Every hand-labeled delay case in this verification set was flagged
  • The only disagreements were borderline cases where the engine erred toward concern
  • Regression detection and sparse-data abstention behave the way we intended
  • Every rule is open source and the full report is reproducible

What this does NOT prove

  • Clinical validation with real patient data
  • Sensitivity or specificity against a clinical gold standard
  • How it compares with ASQ-3, M-CHAT-R/F, Denver, or RBSK outcomes
  • How the thresholds should be finalized for clinical use
  • Whether performance holds across languages, cultures, and care settings

All thresholds are hypothesis-level, aligned to CDC 2022 guidance. Clinical validation with real patient data is the next milestone, and we're actively seeking clinical partners to do that work.

Don't trust us. Run it yourself.

Every number on this page is reproducible. Install the engine, run the validation, check the output. If something doesn't match, open an issue.

# Install
npm install mychild-engine

# Run the full validation suite
npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown

Where the questions come from

Every question in the engine traces back to peer-reviewed developmental science. The synthetic validation methodology is grounded in published work on using synthetic data in clinical contexts.

Primary source: CDC 2022 Revised Developmental Milestones

Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138.

The 2022 revision shifted milestones from the 50th percentile to the 75th percentile. Each milestone in the engine represents a skill that 75% of children would be expected to demonstrate by the listed age.

doi:10.1542/peds.2021-052138

AAP Developmental Surveillance and Screening Policy

Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449.

AAP recommends developmental surveillance at every well-child visit and standardized screening at 9, 18, and 30 months.

doi:10.1542/peds.2019-3449

CDC "Learn the Signs. Act Early." Program

Public domain milestone checklists that formed the basis for the question bank's caregiver-facing wording. These are monitoring tools, not screening instruments.

cdc.gov/act-early/milestones

RBSK — India's National Child Health Screening Program

Rashtriya Bal Swasthya Karyakram. Independently validated at 97% sensitivity, 96.4% specificity against ASQ-3. The engine's question bank has been mapped domain-by-domain against RBSK screening items.

Our questions map to RBSK items, but that doesn't make us RBSK-validated. The scoring logic hasn't been tested with Indian populations yet.

rbsk.mohfw.gov.in

Synthetic Data Validation Methodology

"The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019.

doi:10.1186/s12911-019-0793-0

Synthetic Data in Clinical Outcomes

"Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024.

doi:10.1073/pnas.2414310121

Synthetic Validation of Pediatric Trust Instruments

"Synthetic Validation of Pediatric Trust Instruments Using LLMs." medRxiv preprint. 2025.

medrxiv.org/10.1101/2025.11.25.25340922v1