Built to flag early. Verified in the open.
This is internal consistency verification, not clinical validation. No real children were involved. We ran 108 synthetic developmental scenarios, including borderline cases, through the public engine and compared the output against hand-labeled expectations across 294 domain-level observations. Result: 98.6% agreement (290 of 294 observations), with 4 threshold-calibration disagreements where the engine flagged concern more aggressively than the hand labels.
Methodology
Start with what the engine does. It's built on CDC 2022 milestone guidance, covers 8 developmental domains from 0 to 36 months, includes regression detection, and is fully open source so every rule can be inspected.
The verification set was designed to stress the decision boundaries, not just easy cases: typical development, known delay patterns, sparse-answer cases, regression scenarios, and 4 borderline threshold questions that now need clinical review.
**Coverage.** 8 domains, ages 0 to 36 months, CDC 2022-aligned milestone logic.

**Verification set.** 108 synthetic scenarios, including borderline cases where a clinician might say "monitor and re-check."

**Decision rule.** High-concern and moderate-concern outputs count as flagged. Normal, watch, low-concern, and insufficient-evidence outputs count as not flagged.

**Transparency.** The disagreements are published instead of hidden. They show where the engine is intentionally more cautious than the hand-labeled expectation.
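The decision rule above can be written in a few lines. This is an illustrative sketch: the level names are taken from the text of this page, not from the engine's actual API, which may use different identifiers.

```typescript
// Output levels named on this page; the engine's actual enum may differ.
type ConcernLevel =
  | "high_concern"
  | "moderate_concern"
  | "watch"
  | "low_concern"
  | "normal"
  | "insufficient_evidence";

// Binarization used for the agreement numbers: only high- and
// moderate-concern outputs count as flagged.
function isFlagged(level: ConcernLevel): boolean {
  return level === "high_concern" || level === "moderate_concern";
}
```

Note that abstentions (`insufficient_evidence`) land on the not-flagged side of this rule, which is why they count toward agreement when the hand label is also negative.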
Results
| Metric | Value | Notes |
|---|---|---|
| Agreement | 98.6% | Across hand-labeled profiles |
| Cohen's kappa | 0.971 | Agreement beyond chance |
| Labeled observations | 294 | Domain-profile pairs |
| Synthetic scenarios | 108 | Including borderline cases |
| Threshold cases | 4 | Identified for clinical review |
| Abstentions | 45 | 15.3% of observations |
Per-domain agreement
Four domains had a single borderline over-flag. The other four matched the hand labels exactly. That is the tradeoff we want from a screening aid: catch known delays, then calibrate edge cases with clinicians.
| Domain | N | Agreement | Notes |
|---|---|---|---|
| Social-Emotional | 45 | 97.8% | 1 borderline over-flag on sparse data |
| Expressive Language | 67 | 98.5% | 1 borderline over-flag near threshold |
| Gross Motor | 72 | 100% | No disagreements |
| Vision / Hearing | 23 | 100% | No disagreements |
| Cognitive / Play | 22 | 100% | No disagreements |
| Receptive Language | 34 | 97.1% | 1 borderline over-flag with too little evidence |
| Fine Motor | 17 | 94.1% | 1 borderline over-flag near threshold |
| Self-Help / Adaptive | 14 | 100% | No disagreements |
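The headline metrics, observed agreement and Cohen's kappa, can be recomputed from any pair of binary label arrays. This is a generic sketch of the standard formulas, not the engine's own reporting code:

```typescript
// Observed agreement and Cohen's kappa for two binary raters
// (engine output vs hand label), one array entry per observation.
function agreementAndKappa(engine: boolean[], labels: boolean[]) {
  const n = engine.length;
  let agree = 0;
  let engineFlags = 0;
  let labelFlags = 0;
  for (let i = 0; i < n; i++) {
    if (engine[i] === labels[i]) agree++;
    if (engine[i]) engineFlags++;
    if (labels[i]) labelFlags++;
  }
  const po = agree / n; // observed agreement
  // Chance agreement from each rater's marginal flag rate:
  // both flag by chance, plus both clear by chance.
  const pe =
    (engineFlags / n) * (labelFlags / n) +
    (1 - engineFlags / n) * (1 - labelFlags / n);
  return { agreement: po, kappa: (po - pe) / (1 - pe) };
}
```

With 290 matches out of 294 observations, observed agreement is 290/294 ≈ 98.6%; kappa additionally discounts the agreement expected by chance, which is why it is lower than raw agreement.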
The engine knows when to say "I don't know"
In 45 of 294 observations (15.3%), the engine returned insufficient_evidence. Fewer than 2 answered questions in a domain? It declines to classify instead of pretending to know.
All 45 abstentions fell on observations the hand labels marked negative, so abstaining did not miss any delay case. This is the evidence sufficiency gate doing its job. A screening aid that guesses when it lacks data is worse than one that says "come back with more signal."
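The evidence sufficiency gate described above can be sketched as a guard that runs before any classification. The 2-answer minimum comes from the text; the score field and the 0.5 cutoff here are hypothetical stand-ins, not the engine's actual rules:

```typescript
// Illustrative sketch of an evidence-sufficiency gate.
interface DomainEvidence {
  answeredQuestions: number; // how many questions were answered in this domain
  concernScore: number; // hypothetical 0..1 concern score from milestone rules
}

function classifyDomain(e: DomainEvidence): string {
  if (e.answeredQuestions < 2) {
    // Abstain rather than guess on sparse data.
    return "insufficient_evidence";
  }
  // Hypothetical cutoff for illustration only.
  return e.concernScore >= 0.5 ? "flagged" : "not_flagged";
}
```

The design point is ordering: the gate fires before the scoring logic ever runs, so a high score computed from one answer can never produce a flag.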
The honest version
What this proves
- The engine is built on CDC 2022 milestone guidance across 8 domains from 0 to 36 months
- Every hand-labeled delay case in this verification set was flagged
- The only disagreements were borderline cases where the engine erred toward concern
- Regression detection and sparse-data abstention behave the way we intended
- Every rule is open source and the full report is reproducible
What this does NOT prove
- Clinical validation with real patient data
- Sensitivity or specificity against a clinical gold standard
- How it compares with ASQ-3, M-CHAT-R/F, Denver, or RBSK outcomes
- How the thresholds should be finalized for clinical use
- Whether performance holds across languages, cultures, and care settings
All thresholds are hypothesis-level, aligned to CDC 2022 guidance. Clinical validation with real patient data is the next milestone, and we're actively seeking clinical partners to do that work.
Don't trust us. Run it yourself.
Every number on this page is reproducible. Install the engine, run the validation, check the output. If something doesn't match, open an issue.
```shell
# Install
npm install mychild-engine

# Run the full validation suite
npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown
```
Where the questions come from
Every question in the engine traces back to peer-reviewed developmental science. The synthetic validation methodology is grounded in published work on using synthetic data in clinical contexts.
Primary source: CDC 2022 Revised Developmental Milestones
Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138.
The 2022 revision shifted milestones from the 50th percentile to the 75th percentile. Each milestone in the engine represents a skill that 75% of children would be expected to demonstrate by the listed age.
doi:10.1542/peds.2021-052138

AAP Developmental Surveillance and Screening Policy
Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449.
AAP recommends developmental surveillance at every well-child visit and standardized screening at 9, 18, and 30 months.
doi:10.1542/peds.2019-3449

CDC "Learn the Signs. Act Early." Program
Public domain milestone checklists that formed the basis for the question bank's caregiver-facing wording. These are monitoring tools, not screening instruments.
cdc.gov/act-early/milestones

RBSK: India's National Child Health Screening Program
Rashtriya Bal Swasthya Karyakram. Independently validated at 97% sensitivity, 96.4% specificity against ASQ-3. The engine's question bank has been mapped domain-by-domain against RBSK screening items.
Our questions map to RBSK items, but that doesn't make us RBSK-validated. The scoring logic hasn't been tested with Indian populations yet.
rbsk.mohfw.gov.in

Synthetic Data Validation Methodology
"The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019.
doi:10.1186/s12911-019-0793-0

Synthetic Data in Clinical Outcomes
"Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024.
doi:10.1073/pnas.2414310121

Synthetic Validation of Pediatric Trust Instruments
"Synthetic Validation of Pediatric Trust Instruments Using LLMs." MedRxiv preprint. 2025.
medrxiv.org/10.1101/2025.11.25.25340922v1