Nobody built this. So we did.
An open-source screening engine that tracks developmental milestones across 8 domains, from birth to 36 months. Every question is weighted by clinical evidence, every result is traceable to the exact rule that produced it, and nothing ever leaves your device.
npm install mychild-engine · Apache-2.0 (code) / CC BY-SA 4.0 (data)
The problem
The free checklists from the CDC don't score anything. The validated tools like ASQ-3 and M-CHAT-R/F do, but they're copyrighted and expensive. There's nothing in between.
MyChild Engine fills that gap. It tracks milestones over time, weights each observation by clinical evidence, and tells you exactly why it flagged something. Not a diagnosis. Just enough signal to know when to talk to your pediatrician.
Under the hood
Every question gets scored against the child's corrected age. Results aggregate across domains with evidence sufficiency gates, and every single decision is traceable. No black boxes.
Weighted questions
Not all observations are equal. Each question carries a weight: Low, Medium, High, or Red-Flag. Low-weight items need corroboration before they matter. Red-flags skip the line.
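Here is a minimal sketch of how weight-based triage like this could work. It is not the engine's actual code: the type names, the `classify` function, and the rule that a low-weight miss counts only when at least one other miss accompanies it are all assumptions made for illustration.

```typescript
// Hypothetical types for illustration; not the engine's real API.
type Weight = "low" | "medium" | "high" | "red-flag";

interface Observation {
  questionId: string;
  weight: Weight;
  missed: boolean; // true when the milestone was not observed
}

type Signal = "none" | "watch" | "escalate";

function classify(observations: Observation[]): Signal {
  const missed = observations.filter((o) => o.missed);

  // Red flags skip the line: one is enough to escalate.
  if (missed.some((o) => o.weight === "red-flag")) return "escalate";

  const strong = missed.filter(
    (o) => o.weight === "medium" || o.weight === "high"
  );
  const low = missed.filter((o) => o.weight === "low");

  // Assumption: a low-weight miss only counts when another miss corroborates it.
  const corroboratedLow = missed.length >= 2 ? low.length : 0;

  return strong.length + corroboratedLow > 0 ? "watch" : "none";
}
```

Under these assumed rules, a single missed low-weight item produces no signal, two missed items together produce a "watch", and any missed red-flag item escalates on its own.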
Independent domains
Gross Motor, Fine Motor, Receptive Language, Expressive Language, Social-Emotional, Cognitive, Self-Help, Vision/Hearing. Each one scored on its own. A delay in one doesn't contaminate the others.
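To make "scored on its own" concrete, here is a small sketch that groups answers by domain before computing anything. The domain identifiers, the `DomainAnswer` shape, and the miss-rate metric are hypothetical; the engine's real scoring is weighted and more involved.

```typescript
// Hypothetical identifiers for the eight domains listed above.
const DOMAINS = [
  "gross-motor",
  "fine-motor",
  "receptive-language",
  "expressive-language",
  "social-emotional",
  "cognitive",
  "self-help",
  "vision-hearing",
] as const;

type Domain = (typeof DOMAINS)[number];

interface DomainAnswer {
  domain: Domain;
  achieved: boolean;
}

// Each domain's miss rate is computed only from answers in that domain,
// so a delay in one domain cannot move another domain's number.
function missRateByDomain(
  answers: DomainAnswer[]
): Partial<Record<Domain, number>> {
  const result: Partial<Record<Domain, number>> = {};
  for (const domain of DOMAINS) {
    const inDomain = answers.filter((a) => a.domain === domain);
    if (inDomain.length === 0) continue; // no data: leave the domain unscored
    const missed = inDomain.filter((a) => !a.achieved).length;
    result[domain] = missed / inDomain.length;
  }
  return result;
}
```

Domains with no answers stay absent from the result rather than defaulting to zero, so "no data" is never mistaken for "no concern".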
Age-appropriate bands
Birth through 36 months, plus universal red flags that apply at any age. The engine only asks what's developmentally relevant right now. No premature questions, no wasted anxiety.
False alarm protection
The engine won't flag "high concern" from a single observation. It requires at least 2 independent data points before escalating. One bad day shouldn't send a parent into a spiral.
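A gate like this can be sketched in a few lines. The names below are hypothetical, and two things are assumptions: that "independent" means distinct questions, and that a blocked "high" falls back to "moderate". The engine may define both differently.

```typescript
// Hypothetical concern levels and evidence shape, for illustration only.
type Concern = "none" | "low" | "moderate" | "high";

interface Evidence {
  questionId: string;
  observedOn: string; // ISO date of the observation
}

function capConcern(
  proposed: Concern,
  evidence: Evidence[],
  minIndependent = 2
): Concern {
  // Assumption: repeated answers to the same question are not independent.
  const independent = new Set(evidence.map((e) => e.questionId)).size;

  // Block "high" until enough independent data points back it up.
  if (proposed === "high" && independent < minIndependent) return "moderate";
  return proposed;
}
```

With this gate in place, answering the same question twice on a bad day still counts as one data point, so it cannot push a result to "high" by itself.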
Preterm correction built in
If your child was born before 37 weeks, the engine automatically scores against corrected age instead of birth-date age until 24 months. Parents of preemies don't need to do mental math to figure out what's actually on track.
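For the curious, the mental math looks roughly like this. The sketch below uses the common clinical convention of subtracting the weeks born early, measured against a 40-week term, from the child's chronological age. The function name, the 40-week reference, the weeks-per-month conversion, and applying the 24-month cutoff to chronological age are all assumptions, not the engine's confirmed implementation.

```typescript
const FULL_TERM_WEEKS = 40; // assumed reference for a full-term birth
const PRETERM_THRESHOLD_WEEKS = 37; // from the text: born before 37 weeks
const CORRECTION_CUTOFF_MONTHS = 24; // from the text: adjust until 24 months
const WEEKS_PER_MONTH = 52.1775 / 12; // average weeks in a calendar month

// Hypothetical helper; not the package's computeChildAge.
function correctedAgeMonths(
  chronologicalAgeMonths: number,
  gestationalAgeWeeks: number
): number {
  const isPreterm = gestationalAgeWeeks < PRETERM_THRESHOLD_WEEKS;
  if (!isPreterm || chronologicalAgeMonths >= CORRECTION_CUTOFF_MONTHS) {
    return chronologicalAgeMonths; // no correction applies
  }
  const weeksEarly = FULL_TERM_WEEKS - gestationalAgeWeeks;
  const corrected = chronologicalAgeMonths - weeksEarly / WEEKS_PER_MONTH;
  return Math.max(0, corrected); // never report a negative age
}
```

For example, a 6-month-old born at 32 weeks arrived 8 weeks early, so under these assumptions the sketch scores them as a little over 4 months old.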
Show your work
Every result comes with a plain-English explanation of exactly what drove it. Parents deserve to know why. Clinicians need to verify the reasoning. Both get it.
Rule simulator
Run synthetic child timelines against different threshold configurations. Change a weight, adjust a grace period, see exactly which alerts shift and by how much. We built this because no screening tool, open or proprietary, lets you actually test the rules before deploying them.
The numbers so far
0.906
Cohen's Kappa
Measures how much the engine and the evaluator agree, beyond what you'd expect from random chance. 1.0 is perfect agreement. Anything above 0.8 is considered "almost perfect" in clinical research. Ours is 0.906 across the combined set.
91.4%
Sensitivity
When there's actually a concern, how often does the engine catch it? 91.4% of the time. The remaining 8.6% are misses where the engine said "looks fine" but the evaluator disagreed. For a screening tool, you want this number high.
98.8%
Specificity
When a child is developing normally, how often does the engine correctly say so? 98.8% of the time. Only 1.2% false alarms. This matters because false flags cause unnecessary worry for parents.
71,939
Profiles tested
Synthetic developmental profiles generated by the autoresearch system. Not real children. Each profile simulates a child's milestone responses across multiple domains and age points.
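If you want to check numbers like these yourself, all three metrics fall out of a 2x2 table of engine flags versus evaluator labels. The sketch below uses the standard formulas for sensitivity, specificity, and Cohen's Kappa. The function and field names are ours for illustration, and it assumes a binary flag / no-flag outcome with at least one case of each kind; the engine's own validation may use more than two categories.

```typescript
// Counts from comparing engine output against evaluator labels.
interface ConfusionCounts {
  tp: number; // engine flagged, evaluator agreed there was a concern
  fn: number; // engine said fine, evaluator saw a concern (a miss)
  fp: number; // engine flagged, evaluator saw no concern (false alarm)
  tn: number; // both said no concern
}

interface AgreementMetrics {
  sensitivity: number;
  specificity: number;
  kappa: number;
}

function agreementMetrics({ tp, fn, fp, tn }: ConfusionCounts): AgreementMetrics {
  const n = tp + fn + fp + tn;
  const sensitivity = tp / (tp + fn);
  const specificity = tn / (tn + fp);

  // Observed agreement: fraction of profiles where both raters matched.
  const observed = (tp + tn) / n;

  // Agreement expected by chance, from each rater's marginal totals.
  const expected =
    ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n);

  const kappa = (observed - expected) / (1 - expected);
  return { sensitivity, specificity, kappa };
}
```

With made-up counts of 90 true positives, 10 misses, 5 false alarms, and 95 true negatives, this gives 90% sensitivity, 95% specificity, and a Kappa of about 0.85. Those counts are purely illustrative and are not the engine's results.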
How we tested it
We borrowed the idea from Karpathy's autoresearch. Hand-labeled profiles first to check basic logic. Then adversarial cases to see what breaks.
Hand-verified set
1,747
Profiles
1.000
Kappa
Curated profiles where the expected output is unambiguous. Perfect agreement by construction. This validates the engine's logic on clear-cut cases, not edge-case performance.
Adversarial stress test
70,192
Profiles
0.852
Kappa
70,192 profiles designed to break the engine: regression patterns, sparse data, contradictory answers, preterm edge cases. Specificity held at 98.0%. Of 781 disagreements, most were under-flags. The engine is conservative. That feels like the right failure mode for a screening tool, but it's still a miss.
Ground-truth labels came from an AI clinical evaluator, not licensed clinicians. This is not clinical validation. We publish it because we'd rather show our work than wait until it's perfect.
How it compares
There are a handful of established screening tools that pediatricians and public health programs use around the world. We compared our numbers against theirs. On specificity, the engine is ahead. Sensitivity is in the same range. But our numbers come from synthetic data, so this comparison is directional at best.
Every tool below has been validated with real children in clinical settings. Ours hasn't. You can't put them on equal footing. We include this comparison so you can see where the engine sits relative to the field, not to claim equivalence.
Ages and Stages Questionnaire (ASQ-3)
The most widely used parent-completed screening tool in the world. Clinically validated with ~18,000 children.
82-97%
Sensitivity
83-93%
Specificity
Modified Checklist for Autism in Toddlers (M-CHAT-R/F)
Two-stage autism screening for toddlers 16-30 months. Recommended by the AAP for routine screening at 18 and 24 months.
85-95%
Sensitivity
93-99%
Specificity
Parents' Evaluation of Developmental Status (PEDS)
10-question tool for birth to 8 years. Uses parent concerns as the signal. Validated with ~1,500 children.
74-79%
Sensitivity
70-80%
Specificity
MyChild Engine (combined)
Our engine. 129 evidence-weighted questions across 8 developmental domains, for children 1-24 months. These numbers come from 71,939 synthetic profiles.
91.4%
Sensitivity
98.8%
Specificity
What this proves and what it doesn't
What it shows
- 70,192 adversarial profiles tried to break it. Kappa held at 0.852.
- 98.0% specificity on adversarial data. Low false alarm rate.
- The autoresearch system shipped 14 fixes and tried 325 threshold changes, all of which came back "keep current parameters."
- Questions based on CDC 2022 developmental milestones and mapped to India's RBSK screening program.
- Open source. Every rule inspectable, every result reproducible.
What it does not prove
- Clinical validation with real children. That needs ethics board oversight and clinical partners we don't have yet.
- Real-world sensitivity or specificity. Synthetic profiles are not real children.
- Whether it works across languages, cultures, or care settings.
- Whether the thresholds are right for clinical use. They're hypothesis-level, aligned to the 2022 developmental milestone guidelines.
A prospective clinical study is next. We're looking for clinical partners.
Reproduce it yourself
npm install mychild-engine
npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown
India
India has 26 million births a year. The screening infrastructure doesn't scale.
RBSK (Rashtriya Bal Swasthya Karyakram) is India's national child health screening program. It's been independently validated at 97% sensitivity and 96.4% specificity against ASQ-3. The problem isn't the tool. It's that there aren't enough trained health workers to run it at the scale India needs.
We've mapped our question bank domain-by-domain against RBSK screening items. The questions align. Community health workers running RBSK screenings could use the same questions with the engine's scoring logic on top, on a phone, offline, without a clinician standing next to them.
To be clear: mapping our questions to RBSK doesn't make this RBSK-validated. The scoring logic hasn't been tested against RBSK benchmarks or with Indian populations. This is alignment, not endorsement. The validation work is still ahead of us.
What we're building next
The engine works. Now we're making it smarter.
Adaptive questioning
Right now the engine asks every question in a domain. It shouldn't have to. If the first three answers are all "yes, doing this easily," the remaining motor questions probably don't need asking. Fewer questions, less parent fatigue, same signal.
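The simplest version of that idea is an early-stop check. This is a sketch of planned behavior, not shipped code: the answer labels, the function name, and the streak length of three (taken from the example above) are all placeholders.

```typescript
// Hypothetical answer labels for a single probe.
type ProbeAnswer = "yes-easily" | "sometimes" | "not-yet";

// Returns true when the first `streak` answers in a domain are all
// confident yeses, suggesting the remaining questions can be skipped.
function shouldStopEarly(answers: ProbeAnswer[], streak = 3): boolean {
  return (
    answers.length >= streak &&
    answers.slice(0, streak).every((a) => a === "yes-easily")
  );
}
```

A real implementation would need to prove that skipping questions keeps the same signal, which is exactly what the rule simulator is for.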
NLP probe analysis
Parents don't think in yes/no checkboxes. "He kind of does it but only when he's in a good mood" has real signal in it. We want the engine to understand free-text responses and extract what matters without losing the nuance.
Longitudinal pattern recognition
One screening is a snapshot. Three screenings over six months is a trajectory. We want the engine to distinguish a late bloomer from a persistent delay, and catch regression patterns before a parent has to notice them on their own.
Recognized by
Google made a documentary about this
Back when MyChild App was live, Google filmed the story of how a kid with dyspraxia built a screening tool used in 100+ countries. This is where the engine came from.
Five minutes to your first screening
# Install
npm install mychild-engine
# Use
import { evaluate, computeChildAge, getDueQuestions } from 'mychild-engine';
// Get age-appropriate questions for a 7-month-old
const questions = getDueQuestions({ dob: new Date('2025-09-01') }, []);
// Returns 12 questions across Motor, Language, Cognitive, Social
Full API docs, architecture walkthrough, and integration guide at /docs. Package page on npm.
Where the questions come from
1. Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138. doi:10.1542/peds.2021-052138
2. Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449. doi:10.1542/peds.2019-3449
3. CDC "Learn the Signs. Act Early." Program. Public domain milestone checklists for caregiver-facing developmental monitoring. cdc.gov/act-early/milestones
4. RBSK (Rashtriya Bal Swasthya Karyakram), India's National Child Health Screening Program. Validated at 97% sensitivity, 96.4% specificity against ASQ-3. rbsk.mohfw.gov.in
5. "The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019. doi:10.1186/s12911-019-0793-0
6. "Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024. doi:10.1073/pnas.2414310121
7. "Synthetic Validation of Pediatric Trust Instruments Using LLMs." MedRxiv preprint. 2025. medrxiv.org/10.1101/2025.11.25.25340922v1
Please read this
This is not a diagnostic tool. It tracks developmental milestones and surfaces patterns. It cannot and does not diagnose any medical condition, developmental disorder, or disability. If something concerns you about your child's development, talk to your pediatrician. That conversation is the whole point.
We've verified the scoring logic against 108 synthetic scenarios, including borderline cases. Across 294 hand-labeled domain observations, the engine reached 98.6% agreement and flagged every known delay case in the set. That still does NOT make it clinically validated. Real-patient validation is the next milestone. The question bank draws from publicly available CDC milestone checklists, not copyrighted instruments like ASQ-3, M-CHAT-R/F, or Denver. Everything runs locally on your device. No child data leaves your machine.