71,939 synthetic profiles. Every result published.
We built an autonomous research system, inspired by Karpathy's autoresearch, that generates adversarial developmental profiles, runs them through the engine, labels the expected output, and measures where the two disagree. Four runs. 14 engine fixes shipped. 325 threshold experiments that changed nothing. Here's what we found.
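The core loop is easy to sketch: generate a profile, run the engine, have the evaluator label the expected output, record every disagreement. The TypeScript below is illustrative only; every name in it (generateAdversarialProfile, runEngine, evaluatorLabel) is a placeholder, not the actual mychild-engine API.

```typescript
// Illustrative sketch of the run-then-label autoresearch loop.
// Placeholder names, not the actual mychild-engine API.
interface Profile { id: string; answers: Record<string, string> }
interface Verdict { domain: string; band: "green" | "yellow" | "orange" | "red" }

declare function generateAdversarialProfile(category: string): Profile; // 1. generate
declare function runEngine(profile: Profile): Verdict[];                // 2. run
declare function evaluatorLabel(profile: Profile): Verdict[];           // 3. label expected output

function autoresearchRun(categories: string[], perCategory: number) {
  const disagreements: { profile: Profile; engine: Verdict; expected: Verdict }[] = [];
  for (const category of categories) {
    for (let i = 0; i < perCategory; i++) {
      const profile = generateAdversarialProfile(category);
      const engineOut = runEngine(profile);
      const expected = evaluatorLabel(profile);
      // 4. measure: record every domain where engine and evaluator disagree
      engineOut.forEach((verdict, idx) => {
        if (verdict.band !== expected[idx].band) {
          disagreements.push({ profile, engine: verdict, expected: expected[idx] });
        }
      });
    }
  }
  return disagreements; // triaged by hand; fixes ship, then the loop runs again
}
```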
None of this is clinical validation. Ground-truth labels came from an AI clinical evaluator, not licensed clinicians reviewing real children. We publish it anyway because we'd rather show our work than wait until it's perfect.
0.906
Cohen's Kappa
Combined set
91.4%
Sensitivity
Flags real concerns
98.8%
Specificity
Doesn't cry wolf
34,587
Observations
Domain-profile pairs scored
Two-phase validation
First, a hand-labeled verification set to prove the engine's logic works. Then, an adversarial stress test to find where it breaks.
Curated set
1,747
Profiles
12,991
Observations
1.000
Kappa
100%
Sensitivity & specificity
These profiles were promoted from the adversarial set after manual review confirmed unambiguous agreement. Kappa = 1.000 is expected by construction. This validates the engine's logic on clear-cut cases, not its edge-case performance.
Adversarial set
70,192
Profiles
21,596
Observations
0.852
Kappa
86.5%
Sensitivity
70,192 profiles designed to break the engine: regression patterns, sparse data, contradictory answers, preterm edge cases, single-domain flags. Specificity held at 98.0%. Of the 781 disagreements, most were under-flags. The engine is cautious. It misses some things, but it doesn't make things up.
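For readers who want the definitions pinned down: sensitivity, specificity, and Cohen's kappa here are the standard confusion-matrix statistics over engine-vs-evaluator labels, one per domain-profile observation. A minimal sketch, assuming a simple two-class "concern" / "clear" reduction; hypothetical types, not engine internals.

```typescript
// Standard two-class agreement metrics over domain-profile observations.
// "concern" = the domain is flagged; "clear" = it isn't.
type Label = "concern" | "clear";

function agreementMetrics(engine: Label[], evaluator: Label[]) {
  let tp = 0, tn = 0, fp = 0, fn = 0;
  engine.forEach((e, i) => {
    const truth = evaluator[i];
    if (truth === "concern" && e === "concern") tp++;
    else if (truth === "concern") fn++;
    else if (e === "concern") fp++;
    else tn++;
  });

  const n = tp + tn + fp + fn;
  const observed = (tp + tn) / n; // raw agreement
  // chance agreement: both say "concern" by chance, plus both say "clear" by chance
  const pConcern = ((tp + fp) / n) * ((tp + fn) / n);
  const pClear = ((tn + fn) / n) * ((tn + fp) / n);
  const expected = pConcern + pClear;

  return {
    sensitivity: tp / (tp + fn),                   // flags real concerns
    specificity: tn / (tn + fp),                   // doesn't cry wolf
    kappa: (observed - expected) / (1 - expected), // agreement beyond chance
  };
}
```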
What "adversarial" actually means
The system generated 18 categories of edge-case profiles. Not random data. Each category targets a specific way the engine could fail.
10,578
Regression
9,072
Sparse data
8,786
Single domain
6,500
Mutations
5,616
Exhaustive age
5,208
Contradictory
5,073
Clear delay
4,964
Clear typical
4,752
Preterm
2,897
Borderline
2,544
Combinatorial
1,914
Real-world
315 critical findings need human review.
Adversarial profiles where the engine and the AI evaluator disagreed on clinically meaningful classifications. Logged and triaged. Not hidden.
How it compares
On specificity, the engine beats the published numbers for comparable tools. Sensitivity is in the same range. But our numbers come from synthetic data, so this comparison is directional at best.
ASQ-3, M-CHAT-R/F, and PEDS are clinically validated with real patient data. Ours are synthetic. You can't put them on equal footing.
| Tool | Sensitivity | Specificity | Validation | Sample |
|---|---|---|---|---|
| ASQ-3 | 82–97% | 83–93% | Clinical | ~18,000 |
| M-CHAT-R/F | 85–95% | 93–99% | Clinical | — |
| PEDS | 74–79% | 70–80% | Clinical | ~1,500 |
| MyChild (adversarial) | 86.5% | 98.0% | Synthetic | 21,596 obs |
| MyChild (combined) | 91.4% | 98.8% | Synthetic | 34,587 obs |
14 improvements across 4 runs
Each run found failure modes. Each run shipped fixes. By Run 4, the system couldn't find anything left to change: all 325 threshold experiments came back "keep current parameters."
Run 1
Isolated single-flag downgrade
Cross-domain sparse correction
Regression bypasses evidence gate
Cross-domain regression protection
Run 2
Soft regression detection
Multi-regression escalation
Caregiver uncertainty detection
Developmental sequence anomaly
Asymmetric screening thresholds
+ 5 more fixes
RF scoring, label normalization, evidence gate tuning, low_concern handling, sparse precaution
Run 3
8 refinement commits
Polishing edge cases found in Run 2's improvements
Run 4
Zero changes
325 threshold experiments. All returned "keep." Current parameters are locally optimal.
We tried 325 ways to make it better. None worked.
The autoresearch system didn't just test the engine. It tried to improve it. It swept every tunable parameter: delta-not-yet interactions, delta-unsure values, infant and toddler grace periods, the yellow-orange-red thresholds. 325 experiments total.
Every single one came back "keep current parameters." The defaults we built from first principles against CDC 2022 guidance are locally optimal across the curated validation set. If we want the numbers to move from here, we need to change the algorithms, not the knobs.
T_yellow = 2
T_orange = 4
T_red = 6
Grace periods: 3w (infant) / 12w (toddler)
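To make the knobs concrete, here is a rough sketch of what a threshold experiment does, assuming a per-domain concern score compared against the yellow/orange/red cut-offs. Names and the scoring function are placeholders, not the engine's implementation; the real sweep also covered the delta-not-yet, delta-unsure, and grace-period parameters.

```typescript
// Illustrative only: hypothetical names, not mychild-engine internals.
// A domain accumulates a concern score; the T_* cut-offs map it to a band.
interface Thresholds { yellow: number; orange: number; red: number }
const DEFAULTS: Thresholds = { yellow: 2, orange: 4, red: 6 };

function band(score: number, t: Thresholds = DEFAULTS): "green" | "yellow" | "orange" | "red" {
  if (score >= t.red) return "red";
  if (score >= t.orange) return "orange";
  if (score >= t.yellow) return "yellow";
  return "green";
}

// A threshold experiment: re-score the curated set with candidate cut-offs
// and keep the defaults unless agreement with the labels actually improves.
declare function agreementOnCuratedSet(t: Thresholds): number; // e.g. Cohen's kappa

function sweep(candidates: Thresholds[]): { decision: "keep" | "change"; best: Thresholds } {
  const baseline = agreementOnCuratedSet(DEFAULTS);
  let best = DEFAULTS;
  let bestScore = baseline;
  for (const t of candidates) {
    const score = agreementOnCuratedSet(t);
    if (score > bestScore) { best = t; bestScore = score; }
  }
  return { decision: bestScore > baseline ? "change" : "keep", best };
}
```

In our runs, every candidate combination scored at or below the baseline, which is what "keep current parameters" means in the results above.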
RBSK alignment
Mapped against India's national screening program
RBSK (Rashtriya Bal Swasthya Karyakram) screens 270 million children across India. Independently validated at 97% sensitivity and 96.4% specificity against ASQ-3. We mapped our question bank domain-by-domain against their screening items.
That doesn't make us RBSK-validated. The scoring logic hasn't been tested with Indian populations. But the developmental constructs we measure are the same ones a national public health program considers worth screening for.
What MyChild adds
- Evidence-weighted escalation
- Probe-based false positive reduction
- Automated regression detection
- Corrected age for prematurity (sketched below)
What RBSK covers that we don't
- Ages 3 to 6 years
- India health infrastructure integration
- ABDM / Poshan Tracker connectivity
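One item from the list above, corrected age for prematurity, is simple enough to show concretely. It's the standard adjustment (chronological age minus the weeks a child was born before 40 weeks gestation); the sketch below illustrates the idea and is not the engine's actual code.

```typescript
// Standard corrected-age adjustment for preterm children (illustrative, not engine code).
// Corrected age = chronological age minus the weeks born before 40 weeks gestation;
// typically applied until around 24 months chronological age.
function correctedAgeWeeks(chronologicalWeeks: number, gestationalWeeksAtBirth: number): number {
  const weeksEarly = Math.max(0, 40 - gestationalWeeksAtBirth);
  return Math.max(0, chronologicalWeeks - weeksEarly);
}

// A child born at 32 weeks who is 20 weeks old is assessed against
// milestones for a 12-week-old: 20 - (40 - 32) = 12.
correctedAgeWeeks(20, 32); // 12
```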
The original verification
Before the autoresearch system existed, we ran 108 hand-labeled developmental scenarios through the engine. This is where it all started.
98.6%
Agreement
0.971
Cohen's Kappa
294
Labeled observations
4
Threshold cases
45
Abstentions
Per-domain agreement
Four domains had a single borderline over-flag. The other four matched exactly. When the engine gets it wrong, it errs toward concern. For a screening aid, that's the right kind of wrong.
| Domain | N | Agreement | Kappa | Notes |
|---|---|---|---|---|
| Gross Motor | 72 | 100% | 1.000 | No disagreements |
| Cognitive / Play | 22 | 100% | 1.000 | No disagreements |
| Self-Help / Adaptive | 14 | 100% | 1.000 | No disagreements |
| Vision / Hearing | 23 | 100% | 1.000 | No disagreements |
| Expressive Language | 67 | 98.5% | — | 1 borderline over-flag near threshold |
| Social-Emotional | 45 | 97.8% | — | 1 borderline over-flag on sparse data |
| Receptive Language | 34 | 97.1% | — | 1 borderline over-flag with too little evidence |
| Fine Motor | 17 | 94.1% | — | 1 borderline over-flag near threshold |
The honest version
What this shows
- 70,192 adversarial profiles tried to break it. Kappa held at 0.852.
- 98.0% specificity on adversarial data. Low false alarm rate.
- 14 algorithmic fixes shipped directly from autoresearch findings
- 325 threshold experiments all said "keep current parameters"
- Question bank maps to CDC 2022 milestones and RBSK screening items
- Open source. Every rule inspectable, every result reproducible.
What this does NOT prove
- Clinical validation with real patient data. That needs IRB oversight and clinical partners we don't have yet.
- Real-world sensitivity or specificity. Synthetic profiles are not real children.
- Whether it works across languages, cultures, or care settings
- Whether the thresholds are right for clinical use
- 315 critical adversarial findings still need a clinician to look at them
All thresholds are hypothesis-level, aligned to CDC 2022 guidance. A prospective clinical study is next. We're looking for clinical partners.
Don't trust us. Run it yourself.
Every number on this page is reproducible. Install the engine, run the validation, check the output against what we've published. If something doesn't match, open an issue.
```bash
# Install
npm install mychild-engine

# Run the full validation suite
npx mychild-engine validate --profiles data/synthetic-profiles.json --format markdown

# Run the autoresearch system
npx mychild-engine autoresearch --runs 4 --adversarial 70000
```
Where the questions come from
Every question in the engine traces back to peer-reviewed developmental science. The synthetic validation approach draws on published work about when and how synthetic data can stand in for clinical data.
Primary source: CDC 2022 Revised Developmental Milestones
Zubler JM, Wiggins LD, Macias MM, et al. "Evidence-Informed Milestones for Developmental Surveillance Tools." Pediatrics. 2022;149(3):e2021052138.
The 2022 revision moved milestones from the 50th to the 75th percentile. Each milestone is a skill 75% of children can demonstrate by the listed age.
doi:10.1542/peds.2021-052138
AAP Developmental Surveillance and Screening Policy
Lipkin PH, Macias MM; Council on Children with Disabilities. "Promoting Optimal Development: Identifying Infants and Young Children With Developmental Disorders Through Developmental Surveillance and Screening." Pediatrics. 2020;145(1):e20193449.
doi:10.1542/peds.2019-3449
CDC "Learn the Signs. Act Early." Program
Public domain milestone checklists that formed the basis for the question bank's caregiver-facing wording. Monitoring tools, not screening instruments.
cdc.gov/act-early/milestones
RBSK: India's National Child Health Screening Program
Rashtriya Bal Swasthya Karyakram. Independently validated at 97% sensitivity, 96.4% specificity against ASQ-3. Our questions are mapped domain-by-domain against RBSK screening items.
Our questions map to RBSK items, but the scoring logic hasn't been tested with Indian populations.
rbsk.mohfw.gov.in
Autoresearch methodology
Adapted from Karpathy's autoresearch pattern. The system autonomously generates adversarial profiles, labels expected outputs after running them (run-then-label), and iterates on the engine when it finds failures.
Synthetic Data Validation Methodology
"The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures." BMC Medical Informatics and Decision Making. 2019.
doi:10.1186/s12911-019-0793-0
Synthetic Data in Clinical Outcomes
"Synthetic data can aid the analysis of clinical outcomes: How much can it be trusted?" PNAS. 2024.
doi:10.1073/pnas.2414310121
Synthetic Validation of Pediatric Trust Instruments
"Synthetic Validation of Pediatric Trust Instruments Using LLMs." MedRxiv preprint. 2025.
medrxiv.org/10.1101/2025.11.25.25340922v1