What is the Collaborative Perinatal Project?

The Collaborative Perinatal Project (CPP), also known as the National Collaborative Perinatal Project (NCPP), was a landmark prospective birth cohort study conducted from 1959 to 1974 under the auspices of the National Institute of Neurological Diseases and Stroke (NINDS). Twelve university-affiliated medical centers across the United States enrolled approximately 60,000 pregnancies and followed children from the prenatal period through age 7–8. The study collected detailed information on maternal health, prenatal care, labor and delivery, neonatal outcomes, and child development, including Stanford-Binet IQ at age 4 and WISC IQ at age 7.

Which file should I start with?

For most analyses, start with cpp_clean_v1.csv (or .rds for R). This file contains 185 key variables, already cleaned and labeled, including all WISC subtests and the 7-year psychological battery variables. See the Getting Started page for code examples.

For everything in one file, use cpp_unified_wide.csv (64,834 rows × 4,862 columns). For the complete CPPVAR extraction with all 1,236 columns, use cppvar_all_columns.csv.

How should I cite this data release?

Please cite the companion data paper:

Lasker, J. (2026). The Collaborative Perinatal Project: A Modern Data Release. OSF Preprints. https://osf.io/4tna9

Does “breastfeeding” mean long-term breastfeeding?

No. The variables bf_days and bf_ever capture only in-hospital nursery feeding during the first few days of life. In the United States of the 1960s, nursery breastfeeding was more common among lower-SES women, the opposite of today’s pattern. Do not interpret these variables as measures of long-term breastfeeding duration.

What does sex = 3 mean?

The 804 records with sex = 3 are early fetal losses (mean birth weight 1,891 g, mean gestational age 16.3 weeks), not live births with ambiguous sex. To exclude them from live-birth analyses, keep only records with sex equal to 1 or 2, for example with sex %in% c(1, 2) in R.
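In Python, the same filter could be sketched as follows. This is a minimal illustration that assumes pandas and uses a few made-up rows in place of a real CPP file; only the column name sex and the codes 1, 2, and 3 come from the description above.

```python
import io
import pandas as pd

# Made-up rows standing in for a CPP file (ID values are invented).
sample_csv = io.StringIO(
    "case_id,sex\n"
    "010000011,1\n"
    "010000029,2\n"
    "010000037,3\n"
)
df = pd.read_csv(sample_csv, dtype={"case_id": str})

# Keep only records coded 1 or 2; rows with sex == 3 are dropped.
live_births = df[df["sex"].isin([1, 2])].copy()

print(len(df), len(live_births))
```

Comparing the row counts before and after the filter is a quick way to confirm how many records were removed.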

Why do I need to read case_id as a string?

case_id is a 9-digit identifier and mother_id is a 7-digit identifier. Identifiers from some sites begin with leading zeros, which are lost if the columns are read as integers. Always read them as character/string types.
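A minimal sketch of the difference in Python with pandas, using two made-up rows rather than a real CPP file (the ID values are invented for illustration):

```python
import io
import pandas as pd

# Two invented rows; the first ID in each column starts with a zero.
sample_csv = "case_id,mother_id\n012345678,0123456\n123456789,1234567\n"

# Default type inference parses the IDs as integers and drops the leading zero.
as_int = pd.read_csv(io.StringIO(sample_csv))

# Forcing a string dtype keeps every digit.
as_str = pd.read_csv(
    io.StringIO(sample_csv),
    dtype={"case_id": str, "mother_id": str},
)

print(as_int["case_id"].tolist())  # [12345678, 123456789]
print(as_str["case_id"].tolist())  # ['012345678', '123456789']
```

With a real file, the same dtype argument would be passed to pd.read_csv along with the file path. In R, the colClasses argument of read.csv plays the same role.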

What are the sentinel/missing codes?

Following 1960s conventions, the CPP uses all-9 sentinel values (9, 99, 999, 9999) to denote “unknown” or “not applicable” responses. The missing_codes column of CPP_Codebook.csv documents the specific codes for each variable. In the analysis-ready files (cpp_clean_v1), most sentinel values have already been recoded to NA.
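For files where sentinel codes have not been recoded, a per-variable replacement might be sketched as follows in Python with pandas. The column names, values, and column-to-code mapping below are hypothetical. In practice the codes would come from the missing_codes column of CPP_Codebook.csv, whose exact format is not described here.

```python
import pandas as pd

# Hypothetical columns and values, invented for illustration.
df = pd.DataFrame({
    "parity": [0, 9, 2, 99],               # assume 9 is a valid count, 99 means unknown
    "weight_g": [3200, 9999, 2950, 3410],  # assume 9999 means unknown
})

# Hypothetical mapping of column -> sentinel codes (would be built from the codebook).
missing_codes = {"parity": [99], "weight_g": [9999]}

cleaned = df.copy()
for column, codes in missing_codes.items():
    # Replace only this column's own sentinel codes with NaN.
    cleaned[column] = cleaned[column].mask(cleaned[column].isin(codes))

print(cleaned)
```

The mapping is per column because a value such as 9 can be a legitimate observation in one variable and a sentinel in another, so a blanket replacement of every all-9 value would risk discarding valid data.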

How were twin zygosities determined?

Twin zygosity uses a five-tier evidence schema:

  1. MZ_definite (189 pairs): Myrianthopoulos' blood group determinations (9-system panel)
  2. MZ_very_probable (9 pairs): Monochorionic membrane from pathology records
  3. MZ_probable (10 pairs): Bayesian developmental trajectory model (P(MZ) > 0.80)
  4. DZ_definite (315 pairs): Sex-discordant or blood-type-discordant
  5. DZ_probable and unresolved: Remaining pairs classified by biological markers

79% of twin pairs are resolved using the gold-standard serological method.

What survey weights should I use?

The file cpp_weights.csv provides three types of weights:

Which codebook file should I use?

CPP_Codebook.csv (1,239 entries) is the curated, publication-quality codebook; use this one. The file cppvar_codebook.csv (1,140 entries) is the raw auto-parsed version, retained for reproducibility only. The Codebook Browser on this site searches the curated codebook.

Are there follow-up data beyond age 7?

The public-use CPP data files cover only the prenatal period through the age 7–8 assessment. However, several site-specific follow-up studies were conducted on CPP sub-cohorts:

These follow-up datasets are not included in this release because they are held separately by the individual institutions and were never incorporated into the central NICHD public-use files. The Pathways to Adulthood data are available through ICPSR; for other follow-ups, contact the individual sites directly.

What are the known limitations of the CPP data?

How can I report errors or contribute?

If you find errors in the data or codebook, or have suggestions for improvement, please open an issue on the GitHub repository.