Back after 10 years. The problem never went away.

Nobody built this. So we did.

An open-source screening engine that tracks developmental milestones across 8 domains, from birth to 36 months. Every question is weighted by clinical evidence, every result is traceable to the exact rule that produced it, and nothing ever leaves your device.

npm install @mychild/engine · Apache-2.0 (code) / CC BY-SA 4.0 (data)

The problem

The free checklists from the CDC don't score anything. The validated tools like ASQ-3 and M-CHAT-R/F do, but they're copyrighted and expensive. There's nothing in between.

MyChild Engine fills that gap — it tracks milestones over time, weights each observation by clinical evidence, and tells you exactly why it flagged something. Not a diagnosis. Just enough signal to know when to talk to your pediatrician.

Under the hood

Every question gets scored against the child's corrected age. Results aggregate across domains with evidence sufficiency gates, and every single decision is traceable. No black boxes.

131

Weighted questions

Not all observations are equal. Each question carries a weight: Low, Medium, High, or Red-Flag. Low-weight items need corroboration before they matter. Red-flags skip the line.
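
A minimal sketch of the tiering, with illustrative names rather than the engine's published API:

// Illustrative only: weight tiers and the corroboration rule for low-weight items.
type Weight = 'low' | 'medium' | 'high' | 'red-flag';

interface Observation {
  questionId: string;
  weight: Weight;
  missed: boolean; // true when the milestone was not observed
}

// Red-flag, high, and medium misses count on their own; a low-weight miss
// counts only once another independent miss corroborates it.
function countsTowardConcern(obs: Observation, all: Observation[]): boolean {
  if (!obs.missed) return false;
  if (obs.weight !== 'low') return true;
  return all.some((o) => o.missed && o.questionId !== obs.questionId);
}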

8

Independent domains

Gross Motor, Fine Motor, Receptive Language, Expressive Language, Social-Emotional, Cognitive, Self-Help, Vision/Hearing. Each one scored on its own. A delay in one doesn't contaminate the others.

10

Age-appropriate bands

Birth through 36 months, plus universal red flags that apply at any age. The engine only asks what's developmentally relevant right now. No premature questions, no wasted anxiety.

False alarm protection

The engine won't flag "high concern" from a single observation. It requires at least 2 independent data points before escalating. One bad day shouldn't send a parent into a spiral.
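
As a sketch, the gate reduces to something like the following (assuming red flags bypass it, per the weighting note above; that reading is ours, not documented behavior):

// Hypothetical escalation gate: never 'high' from a single ordinary observation.
type Concern = 'none' | 'monitor' | 'high';

function escalate(independentMisses: number, hasRedFlag: boolean): Concern {
  if (hasRedFlag) return 'high';            // red flags skip the line
  if (independentMisses >= 2) return 'high';
  return independentMisses === 1 ? 'monitor' : 'none';
}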

Preterm correction built in

If your child was born before 37 weeks, the engine automatically adjusts expectations until 24 months. Parents of preemies don't need to do mental math to figure out what's actually on track.
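
The correction itself is standard arithmetic: subtract the weeks born before a 40-week term. A sketch assuming the conventional cutoffs (the function name is ours, not the package's):

// Corrected age = chronological age minus the weeks born before 40-week term,
// applied only for births before 37 weeks and only until 24 months.
function correctedAgeMonths(chronologicalMonths: number, gestationalWeeks: number): number {
  if (gestationalWeeks >= 37 || chronologicalMonths >= 24) return chronologicalMonths;
  const weeksEarly = 40 - gestationalWeeks;
  return Math.max(0, chronologicalMonths - weeksEarly / 4.345); // ~4.345 weeks per month
}

// A 10-month-old born at 32 weeks is screened as roughly an 8-month-old.
console.log(correctedAgeMonths(10, 32).toFixed(1)); // "8.2"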

Show your work

Every result comes with a plain-English explanation of exactly what drove it. Parents deserve to know why. Clinicians need to verify the reasoning. Both get it.
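
The record behind each result looks roughly like this; the field names are illustrative assumptions, not the documented API:

// Every flag carries the rule that fired and the observations behind it.
interface DomainResult {
  domain: string;                      // e.g. 'expressive-language'
  concern: 'none' | 'monitor' | 'high';
  firedRule: string;                   // the exact rule that produced the result
  contributingQuestionIds: string[];   // the observations that drove it
  explanation: string;                 // plain-English summary for parents
}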

Rule simulator

Run synthetic child timelines against different threshold configurations. Change a weight, adjust a grace period, see exactly which alerts shift and by how much. We built this because no screening tool, open or proprietary, lets you actually test the rules before deploying them.
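
Conceptually the simulator is a diff over rule outputs: score the same timelines under two configurations and report what changed. A sketch with stand-in types (Config, Alert, and the injected runRules are assumptions, not the shipped interface):

interface Config { lowWeightCorroboration: number; graceWeeks: number; }
interface Alert { profileId: string; domain: string; }

// Alerts that appear under the candidate config but not the current one.
function newAlerts(runRules: (c: Config) => Alert[], current: Config, candidate: Config): Alert[] {
  const key = (a: Alert) => `${a.profileId}:${a.domain}`;
  const before = new Set(runRules(current).map(key));
  return runRules(candidate).filter((a) => !before.has(key(a)));
}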

Validation study

The numbers so far

0.906

Cohen's Kappa

Measures how much the engine and the evaluator agree, beyond what you'd expect from random chance. 1.0 is perfect agreement. Anything above 0.8 is considered "almost perfect" in clinical research. Ours is 0.906 across the combined set.

91.4%

Sensitivity

When there's actually a concern, how often does the engine catch it? 91.4% of the time. The remaining 8.6% are misses where the engine said "looks fine" but the evaluator disagreed. For a screening tool, you want this number high.

98.8%

Specificity

When a child is developing normally, how often does the engine correctly say so? 98.8% of the time. Only 1.2% false alarms. This matters because false flags cause unnecessary worry for parents.
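
All three headline numbers fall out of a single 2x2 confusion matrix. A worked sketch of the arithmetic (just the standard definitions, not our evaluation harness):

// Sensitivity, specificity, and Cohen's kappa from raw agreement counts.
interface Confusion { tp: number; fn: number; fp: number; tn: number; }

function metrics({ tp, fn, fp, tn }: Confusion) {
  const n = tp + fn + fp + tn;
  const sensitivity = tp / (tp + fn);   // concerns caught
  const specificity = tn / (tn + fp);   // typical development confirmed
  const po = (tp + tn) / n;             // observed agreement
  // Agreement expected by chance, from each rater's marginal rates
  const pe = ((tp + fp) / n) * ((tp + fn) / n)
           + ((tn + fn) / n) * ((tn + fp) / n);
  return { sensitivity, specificity, kappa: (po - pe) / (1 - pe) };
}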

71,939

Profiles tested

Synthetic developmental profiles generated by the autoresearch system. Not real children. Each profile simulates a child's milestone responses across multiple domains and age points.

How we tested it

We borrowed the idea from Karpathy's autoresearch. Hand-labeled profiles first to check basic logic. Then adversarial cases to see what breaks.

Phase 1

Hand-verified set

1,747

Profiles

1.000

Kappa

Curated profiles where the expected output is unambiguous. Perfect agreement by construction. This validates the engine's logic on clear-cut cases, not edge-case performance.

Phase 2

Adversarial stress test

70,192

Profiles

0.852

Kappa

70,192 profiles designed to break the engine: regression patterns, sparse data, contradictory answers, preterm edge cases. Specificity held at 98.0%. Of 781 disagreements, most were under-flags. The engine is conservative. That feels like the right failure mode for a screening tool, but it's still a miss.

Ground-truth labels came from an AI clinical evaluator, not licensed clinicians. This is not clinical validation. We publish it because we'd rather show our work than wait until it's perfect.

How it compares

There are a handful of established screening tools that pediatricians and public health programs use around the world. We compared our numbers against theirs. On specificity, the engine is ahead. Sensitivity is in the same range. But our numbers come from synthetic data, so this comparison is directional at best.

Every tool below has been validated with real children in clinical settings. Ours hasn't. You can't put them on equal footing. We include this comparison so you can see where the engine sits relative to the field, not to claim equivalence.

Ages and Stages Questionnaire (ASQ-3)

The most widely used parent-completed screening tool in the world. Clinically validated with ~18,000 children.

82-97%

Sensitivity

83-93%

Specificity

Clinical

Modified Checklist for Autism in Toddlers (M-CHAT-R/F)

Two-stage autism screening for toddlers 16-30 months. Recommended by the AAP for routine screening at 18 and 24 months.

85-95%

Sensitivity

93-99%

Specificity

Clinical

Parents' Evaluation of Developmental Status (PEDS)

10-question tool for birth to 8 years. Uses parent concerns as the signal. Validated with ~1,500 children.

74-79%

Sensitivity

70-80%

Specificity

Clinical

MyChild Engine (combined)

Our engine. 129 evidence-weighted questions across 8 developmental domains, for children 1-24 months. These numbers come from 71,939 synthetic profiles.

91.4%

Sensitivity

98.8%

Specificity

Synthetic

What this proves and what it doesn't

What it shows

  • 70,192 adversarial profiles tried to break it. Kappa held at 0.852.
  • 98.0% specificity on adversarial data. Low false alarm rate.
  • The autoresearch system shipped 14 fixes and evaluated 325 proposed threshold changes, all of which came back "keep current parameters."
  • Questions based on CDC 2022 developmental milestones and mapped to India's RBSK screening program.
  • Open source. Every rule inspectable, every result reproducible.

What it does not prove

  • Clinical validation with real children. That needs ethics board oversight and clinical partners we don't have yet.
  • Real-world sensitivity or specificity. Synthetic profiles are not real children.
  • Whether it works across languages, cultures, or care settings.
  • Whether the thresholds are right for clinical use. They're hypothesis-level, aligned to the 2022 developmental milestone guidelines.

A prospective clinical study is next. We're looking for clinical partners.

Reproduce it yourself

npm install @mychild/engine

npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown

India

Around 26 million babies are born in India every year. The screening infrastructure doesn't scale to that.

RBSK (Rashtriya Bal Swasthya Karyakram) is India's national child health screening program. It's been independently validated at 97% sensitivity and 96.4% specificity against ASQ-3. The problem isn't the tool. It's that there aren't enough trained health workers to run it at the scale India needs.

We've mapped our question bank domain-by-domain against RBSK screening items. The questions align. Community health workers running RBSK screenings could use the same questions with the engine's scoring logic on top, on a phone, offline, without a clinician standing next to them.

To be clear: mapping our questions to RBSK doesn't make this RBSK-validated. The scoring logic hasn't been tested against RBSK benchmarks or with Indian populations. This is alignment, not endorsement. The validation work is still ahead of us.

See the full RBSK alignment

What we're building next

The engine works. Now we're making it smarter.

1

Adaptive questioning

Right now the engine asks every question in a domain. It shouldn't have to. If the first three answers are all "yes, doing this easily," the remaining motor questions probably don't need asking. Fewer questions, less parent fatigue, same signal.
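
One possible shape for the early stop, sketched as a straw man rather than a committed design:

// Hypothetical: stop asking within a domain once the first N answers
// are all clear passes ("yes, doing this easily").
function shouldStopDomain(passes: boolean[], streak = 3): boolean {
  return passes.length >= streak && passes.slice(0, streak).every(Boolean);
}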

2

NLP probe analysis

Parents don't think in yes/no checkboxes. "He kind of does it but only when he's in a good mood" has real signal in it. We want the engine to understand free-text responses and extract what matters without losing the nuance.

3

Longitudinal pattern recognition

One screening is a snapshot. Three screenings over six months is a trajectory. We want the engine to distinguish a late bloomer from a persistent delay, and catch regression patterns before a parent has to notice them on their own.
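
None of this is built yet, but the primitive is simple: compare the same domain across screenings. A sketch with hypothetical types:

// Trend across repeated screenings of one domain (illustrative only).
interface ScreeningPoint { ageMonths: number; score: number; } // higher = more on track
type Trend = 'improving' | 'stable' | 'regressing' | 'insufficient-data';

function trend(history: ScreeningPoint[]): Trend {
  if (history.length < 2) return 'insufficient-data'; // one screening is a snapshot
  const delta = history[history.length - 1].score - history[0].score;
  return delta > 0 ? 'improving' : delta < 0 ? 'regressing' : 'stable';
}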

Google made a documentary about this

Back when MyChild App was live, Google filmed the story of how a kid with dyspraxia built a screening tool used in 100+ countries. This is where the engine came from.

Five minutes to your first screening

# Install

npm install @mychild/engine


// Use

import { evaluate, computeChildAge, getDueQuestions } from '@mychild/engine';


// Get age-appropriate questions for a 7-month-old

const questions = getDueQuestions({ dob: new Date('2025-09-01') }, []);

// Returns 12 questions across Motor, Language, Cognitive, Social

Full API docs, architecture walkthrough, and integration guide at /docs. Package page on npm.

Where the questions come from

  1. Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138. doi:10.1542/peds.2021-052138

  2. Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449. doi:10.1542/peds.2019-3449

  3. CDC "Learn the Signs. Act Early." Program: public domain milestone checklists for caregiver-facing developmental monitoring. cdc.gov/act-early/milestones

  4. RBSK (Rashtriya Bal Swasthya Karyakram), India's National Child Health Screening Program. Validated at 97% sensitivity, 96.4% specificity against ASQ-3. rbsk.mohfw.gov.in

  5. "The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019. doi:10.1186/s12911-019-0793-0

  6. "Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024. doi:10.1073/pnas.2414310121

  7. "Synthetic Validation of Pediatric Trust Instruments Using LLMs." MedRxiv preprint. 2025. medrxiv.org/10.1101/2025.11.25.25340922v1

Please read this

This is not a diagnostic tool. It tracks developmental milestones and surfaces patterns. It cannot and does not diagnose any medical condition, developmental disorder, or disability. If something concerns you about your child's development, talk to your pediatrician. That conversation is the whole point.

We've verified the scoring logic against 108 synthetic scenarios, including borderline cases. Across 294 hand-labeled domain observations, the engine reached 98.6% agreement and flagged every known delay case in the set. That still does NOT make it clinically validated. Real-patient validation is the next milestone.

The question bank draws from publicly available CDC milestone checklists, not copyrighted instruments like ASQ-3, M-CHAT-R/F, or Denver. Everything runs locally on your device. No child data leaves your machine.