FAQ — CPP Data Repository

What is the Collaborative Perinatal Project?

The Collaborative Perinatal Project (CPP), also known as the National Collaborative Perinatal Project (NCPP), was a landmark prospective birth cohort study conducted from 1959 to 1974 under the auspices of the National Institute of Neurological Diseases and Stroke (NINDS). Twelve university-affiliated medical centers across the United States enrolled approximately 60,000 pregnancies and followed children from the prenatal period through age 7–8. The study collected detailed information on maternal health, prenatal care, labor and delivery, neonatal outcomes, and child development, including Stanford-Binet IQ at age 4 and WISC IQ at age 7.

Which file should I start with?

For most analyses, start with cpp_clean_v1.csv (or .rds for R). This file contains 314 key variables, already cleaned and labeled, including all WISC subtests and the 7-year psychological battery variables. See the Getting Started page for code examples.

For everything in one file, use cpp_unified_wide.csv (64,834 rows × 4,862 columns). For the complete CPPVAR extraction with all 1,236 columns, use cppvar_all_columns.csv.
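With several thousand columns, it is often convenient to load only the variables an analysis needs. The following is a minimal Python sketch of that idea using pandas; the in-memory CSV and the column names var_a and var_b are made up for illustration, and a real call would pass the path to cpp_unified_wide.csv instead.

```python
import io
import pandas as pd

# Made-up stand-in for a wide file; var_a and var_b are hypothetical columns.
wide_csv = "case_id,sex,var_a,var_b\n012345678,1,5,7\n987654321,2,6,8\n"

# Only the listed columns are parsed; the rest are skipped at read time.
wanted = ["case_id", "sex", "var_a"]
subset = pd.read_csv(io.StringIO(wide_csv), usecols=wanted, dtype={"case_id": str})

print(subset.shape)
```

Selecting columns at read time keeps memory use proportional to the subset rather than to the full file.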

How should I cite this data release?

Please cite the companion data paper:

Lasker, J. (2026). The Collaborative Perinatal Project: A Modern Data Release. OSF Preprints. https://osf.io/4tna9

Does “breastfeeding” mean long-term breastfeeding?

No. The variables bf_days and bf_ever capture only in-hospital nursery feeding during the first few days of life. In the 1960s U.S., nursery breastfeeding was more common among lower-SES women—the opposite of today’s pattern. Do not interpret these as measures of long-term breastfeeding duration.

What does sex = 3 mean?

The 804 records with sex=3 are early fetal losses (mean birth weight 1,891 g, mean gestational age 16.3 weeks), not live births with ambiguous sex. Exclude them from live-birth analyses by keeping only sex codes 1 and 2, for example sex %in% c(1, 2) in R.
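The same filter can be sketched in Python with pandas. The rows below are invented for illustration; only the sex column and its codes come from the description above.

```python
import pandas as pd

# Toy stand-in for the CPP file; real data would come from pd.read_csv.
df = pd.DataFrame({
    "case_id": ["010000011", "010000029", "010000037", "010000045"],
    "sex": [1, 2, 3, 2],
})

# Keep only records coded 1 or 2; sex == 3 marks early fetal losses.
live_births = df[df["sex"].isin([1, 2])].copy()

print(len(df) - len(live_births), "record(s) dropped")
```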

Why do I need to read case_id as a string?

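case_id is a 9-digit identifier and mother_id is a 7-digit identifier. IDs from some sites begin with zeros, which are lost if the columns are read as integers. Always read both as character/string types:

R: data.table::fread("cpp_clean_v1.csv", colClasses = c(case_id = "character", mother_id = "character"))
Python: pd.read_csv("cpp_clean_v1.csv", dtype={"case_id": str, "mother_id": str})
Stata (after import, restoring the zero padding): tostring case_id, replace format(%09.0f)
Stata (same for the mother ID): tostring mother_id, replace format(%07.0f)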
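To see what goes wrong, the short Python sketch below reads a made-up two-row file twice, once with default type inference and once with the ID columns declared as strings. The ID values are illustrative and are not real CPP records.

```python
import io
import pandas as pd

# Made-up two-row file; the IDs are illustrative only.
csv_text = "case_id,mother_id,sex\n012345678,0012345,1\n987654321,7654321,2\n"

# Default parsing infers integers and silently drops the leading zeros.
as_int = pd.read_csv(io.StringIO(csv_text))

# Declaring both ID columns as strings keeps every digit.
as_str = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"case_id": str, "mother_id": str},
)

print(as_int["case_id"].tolist())
print(as_str["case_id"].tolist())

# If a file was already read as integers, zero-padding to the fixed
# 9-digit width restores the original identifiers.
repaired = as_int["case_id"].astype(str).str.zfill(9)
```

Padding after the fact works only because the ID widths are fixed; reading the columns as strings in the first place is the safer habit.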

What are the sentinel/missing codes?

Following 1960s conventions, the CPP uses all-9 sentinel values (9, 99, 999, 9999) to denote “unknown” or “not applicable” responses. The CPP_Codebook.csv documents the specific missing codes for each variable in the missing_codes column. The analysis-ready files (cpp_clean_v1) have most sentinel values already recoded to NA.
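For the raw files, or for sentinel values that remain in the cleaned files, recoding can be done one variable at a time. The Python sketch below assumes the codes have already been looked up and written into a dictionary by hand; the variable names and codes shown are hypothetical, and the layout of the missing_codes column is not described here, so parsing it is left out.

```python
import pandas as pd

# Hypothetical variables and codes; look each variable up in CPP_Codebook.csv.
missing_codes = {
    "gest_weeks": [99],
    "birth_weight_g": [9999],
}

raw = pd.DataFrame({
    "gest_weeks": [39, 99, 41],
    "birth_weight_g": [3200, 2950, 9999],
})

def recode_sentinels(df, codes):
    """Return a copy of df with each variable's sentinel codes set to missing."""
    out = df.copy()
    for column, sentinels in codes.items():
        if column in out.columns:
            # mask() replaces values where the condition is True with NaN.
            out[column] = out[column].mask(out[column].isin(sentinels))
    return out

clean = recode_sentinels(raw, missing_codes)
print(clean.isna().sum())
```

Recoding per variable rather than replacing every 9 or 99 in the table avoids erasing legitimate values in variables where those numbers are valid.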

How were twin zygosities determined?

Twin zygosity uses a five-tier evidence schema:

MZ_definite (189 pairs): Myrianthopoulos' blood group determinations (9-system panel)
MZ_very_probable (9 pairs): Monochorionic membrane from pathology records
MZ_probable (10 pairs): Bayesian developmental trajectory model (P(MZ) > 0.80)
DZ_definite (315 pairs): Sex-discordant or blood-type-discordant
DZ_probable and unresolved: Remaining pairs classified by biological markers

79% of twin pairs are resolved using the gold-standard serological method.
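One way to use the tiers is to run an analysis on all classified pairs and then repeat it on the definite tiers only as a sensitivity check. The Python sketch below illustrates this on invented rows; the column names pair_id and zygosity_tier are assumptions, and only the tier labels come from the list above.

```python
import pandas as pd

# Invented pairs; pair_id and zygosity_tier are hypothetical column names.
pairs = pd.DataFrame({
    "pair_id": [1, 2, 3, 4, 5],
    "zygosity_tier": [
        "MZ_definite",
        "MZ_very_probable",
        "MZ_probable",
        "DZ_definite",
        "DZ_probable",
    ],
})

# Strict subset: only the definite tiers.
strict_tiers = {"MZ_definite", "DZ_definite"}
strict = pairs[pairs["zygosity_tier"].isin(strict_tiers)]

# Collapse each tier label to its zygosity prefix (MZ or DZ).
pairs["zygosity"] = pairs["zygosity_tier"].str.split("_").str[0]

print(strict["pair_id"].tolist())
```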

What survey weights should I use?

The file cpp_weights.csv provides three types of weights:

Attrition weights: Inverse-probability weights correcting for differential follow-up loss
Population weights: Raking weights calibrated to 1960–65 Vital Statistics marginals
Combined weights: The recommended column is weight_combined, which combines both adjustments for nationally representative estimation
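As a rough illustration of how weight_combined might be applied, the Python sketch below merges weights onto an analysis table and computes a weighted mean. It assumes cpp_weights.csv is keyed by case_id, which is not stated above; the outcome column wisc_fsiq and all values are invented.

```python
import pandas as pd

# Invented analysis rows; wisc_fsiq is a hypothetical outcome column.
data = pd.DataFrame({
    "case_id": ["000000001", "000000002", "000000003"],
    "wisc_fsiq": [90.0, 100.0, 110.0],
})

# Invented weights; the case_id key is an assumption about cpp_weights.csv.
weights = pd.DataFrame({
    "case_id": ["000000001", "000000002", "000000003"],
    "weight_combined": [1.0, 1.0, 2.0],
})

# validate= raises an error if either table has duplicate case_id values.
merged = data.merge(weights, on="case_id", how="inner", validate="one_to_one")

def weighted_mean(values, w):
    """Weighted mean over rows where both the value and the weight are present."""
    keep = values.notna() & w.notna()
    return (values[keep] * w[keep]).sum() / w[keep].sum()

unweighted = merged["wisc_fsiq"].mean()
weighted = weighted_mean(merged["wisc_fsiq"], merged["weight_combined"])
print(unweighted, weighted)
```

Point estimates like this are straightforward; standard errors under weighting need a survey-aware method and are not covered by this sketch.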

Which codebook file should I use?

CPP_Codebook.csv (1,239 entries) is the curated, publication-quality codebook—use this one. cppvar_codebook.csv (1,140 entries) is the raw auto-parsed version, retained for reproducibility only. The Codebook Browser on this site searches the curated codebook.

Are there follow-up data beyond age 7?

The public-use CPP data files cover only the prenatal period through the age 7–8 assessment. However, several site-specific follow-up studies were conducted on CPP sub-cohorts:

Pathways to Adulthood (Johns Hopkins): Janet Hardy and Sam Shapiro followed 2,694 members of the Baltimore cohort through age 27–33 (1992–94), plus their mothers (G1) and children (G3), creating a three-generation study. 82% of participants were located; 65% interviewed. Archived at ICPSR Study #2420.
New England Family Study (Boston + Providence): Stephen Buka (Brown), Jill Goldstein, and Larry Seidman (Harvard) have followed 17,741 individuals born at the Boston and Providence sites from the 1980s through the present day, with over 4,000 still active. The study has produced landmark findings on schizophrenia risk, ADHD, and adult cardiovascular health.
Philadelphia-Providence Intergenerational Study: Klebanoff et al. (1998) re-contacted 1,782 female CPP offspring from Philadelphia and Providence in 1987–91 to study intergenerational transmission of preterm birth and low birth weight.
Minnesota CPP 2.0: Logan Spector and Julia Steinberger (University of Minnesota) are conducting an ongoing follow-up of the Minnesota site cohort (now in their late 50s–60s), linking childhood exposures to adult cancer, heart disease, and diabetes via medical records and cancer registries.
NICHD Mortality Linkage: Edwina Yeung (NICHD) linked 44,174 CPP mothers to the National Death Index through 2016, enabling research on associations between pregnancy complications and cause-specific mortality across the lifespan.

These follow-up datasets are not included in this release because they are held separately by the individual institutions and were never incorporated into the central NICHD public-use files. The Pathways to Adulthood data is available through ICPSR; for other follow-ups, contact the individual sites directly.

What are the known limitations of the CPP data?

No maternal IQ: The CPP did not test mothers’ cognitive ability, limiting genetic inference. Maternal education and SEI are the best available proxies.
Nursery breastfeeding only: Feeding data covers only the first days of life, not long-term breastfeeding.
No data beyond age 7–8: The public-use files end at the 7-year assessment. Site-specific follow-ups exist but are not part of this release (see above).
Attrition varies by site: Follow-up rates at age 7 range from ~28% to ~91% across sites. Survey weights partially address this.
Historical period: The data reflect 1959–1974 conditions. Medical practices, demographics, and social norms have changed substantially.
Convenience sample: The CPP enrolled women at 12 university-affiliated hospitals, not a probability sample of U.S. births. Survey weights help but cannot fully correct this.
No genetic data: No DNA, genotype, or molecular genetic data was collected. Behavior genetic analyses rely on kinship structure (twins and siblings) rather than measured genotypes.
Historical race categories: The race variable uses 1960s classifications (White, Black, Puerto Rican, Oriental, Other) that do not map cleanly to modern OMB categories.
Performance subtests have lower coverage: WISC Performance subtests (Picture Arrangement, Block Design, Coding) come from CPPMASTER card 31300 with N ≈ 31,400, about 22% lower than the ~40,000 coverage of Verbal subtests from CPPVAR.
Income is categorical: Family income is recorded in broad categories, not as a continuous dollar amount. The SEI (Socioeconomic Index) provides a more granular continuous SES measure.

How can I report errors or contribute?

If you find errors in the data or codebook, or have suggestions for improvement, please open an issue on the GitHub repository.
