How to load, explore, and analyze the CPP data.
For most analyses, download cpp_clean_v1.csv (29 MB, 59,391 rows × 185 columns). This is the recommended analysis-ready dataset with cleaned values, meaningful variable names, all WISC subtests, and variables from the 7-year psychological battery.
For sibling and twin analyses, also download cpp_kinship_links.csv (14,208 sibling/twin pairs with relatedness coefficients). For extended-family analyses, add cpp_extended_kinship_links.csv (13,513 cousin and other cross-family pairs).
Here is how to load the analysis-ready dataset in R, Python, and Stata. Note that case_id and mother_id must always be read as strings to preserve leading zeros.
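# R (data.table)
library(data.table)
# Load analysis-ready dataset; read the IDs as character at import,
# because converting after fread() cannot restore leading zeros
d <- fread("cpp_clean_v1.csv",
           colClasses = c(case_id = "character", mother_id = "character")) # 59,391 x 185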
# Load kinship links for sibling/twin analyses
links <- fread("cpp_kinship_links.csv",
               colClasses = c(id_1 = "character", id_2 = "character"))
# Load precomputed g factor scores
g <- fread("cpp_g_factors.csv",
           colClasses = c(case_id = "character"))
d <- merge(d, g, by = "case_id", all.x = TRUE)
# Basic analysis: IQ by race
d[, .(mean_iq = mean(wisc_fsiq, na.rm = TRUE),
      sd_iq   = sd(wisc_fsiq, na.rm = TRUE),
      n       = sum(!is.na(wisc_fsiq))), by = race]
# Python (pandas)
import pandas as pd
d = pd.read_csv("cpp_clean_v1.csv",
                dtype={"case_id": str, "mother_id": str})
links = pd.read_csv("cpp_kinship_links.csv",
                    dtype={"id_1": str, "id_2": str})
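* Stata: import all columns as strings to keep leading zeros in the IDs, then destring the rest
import delimited "cpp_clean_v1.csv", stringcols(_all) clear
ds case_id mother_id, not
destring `r(varlist)', replace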
case_id uniquely identifies each pregnancy/child (a 7–8 digit number combining institution code, family number, and pregnancy sequence). mother_id identifies the biological mother; children sharing a mother_id are siblings. Read both as strings, not integers, so that leading zeros are preserved.
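A quick check after loading can confirm that the IDs came through intact. The sketch below is illustrative only: it uses a tiny made-up table in place of cpp_clean_v1.csv, and the ID values are invented. Only the column names case_id, mother_id, and wisc_fsiq come from the release.

```python
import io
import pandas as pd

# Toy stand-in for cpp_clean_v1.csv; the ID values below are invented.
toy_csv = io.StringIO(
    "case_id,mother_id,wisc_fsiq\n"
    "0512301,05123,98\n"
    "0512302,05123,104\n"
    "1098701,10987,91\n"
)

# Reading the IDs as str keeps leading zeros; by default pandas
# would parse these columns as integers and drop them.
d = pd.read_csv(toy_csv, dtype={"case_id": str, "mother_id": str})

# case_id should be unique per child.
ids_unique = d["case_id"].is_unique

# Count IDs that still carry a leading zero.
n_leading_zero = int(d["case_id"].str.startswith("0").sum())

# Children per mother; mothers with two or more children form sibships.
sibship_size = d.groupby("mother_id")["case_id"].transform("count")
siblings = d[sibship_size >= 2]
```

With the real file, the same checks apply after replacing toy_csv with the path to cpp_clean_v1.csv.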
The CPP was not a probability sample, so survey weights are provided to adjust for both attrition and population non-representativeness. Here is an example in R:
library(data.table)
# Load data and weights
d <- fread("cpp_clean_v1.csv", colClasses = c(case_id = "character"))
wt <- fread("cpp_weights.csv", colClasses = c(case_id = "character"))
d <- merge(d, wt, by = "case_id")
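# Weighted mean IQ (corrects for attrition + population non-representativeness).
# weighted.mean(na.rm = TRUE) drops missing x but not missing weights, so filter on both.
ok <- !is.na(d$wisc_fsiq) & !is.na(d$wt_recommended)
weighted.mean(d$wisc_fsiq[ok], d$wt_recommended[ok])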
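The same calculation can be sketched in Python. This assumes the column names used in the R example (wisc_fsiq, wt_recommended). The helper weighted_mean is a hypothetical name, and the toy values are invented for illustration.

```python
import numpy as np
import pandas as pd

def weighted_mean(values, weights):
    """Weighted mean over rows where both the value and the weight are present."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    ok = ~np.isnan(values) & ~np.isnan(weights)
    return float(np.sum(values[ok] * weights[ok]) / np.sum(weights[ok]))

# Toy rows standing in for the merged data and weights (values are invented).
d = pd.DataFrame({
    "wisc_fsiq": [90.0, 100.0, 110.0, np.nan],
    "wt_recommended": [1.0, 1.0, 2.0, 5.0],
})

# The row with a missing score is skipped even though it has a weight.
wm = weighted_mean(d["wisc_fsiq"], d["wt_recommended"])
```

With the real files, d would instead come from reading cpp_clean_v1.csv and cpp_weights.csv with case_id as a string and merging the two on case_id.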
The CPP is well-suited for within-family designs because 8,772 mothers contributed two or more children. The recommended workflow:
1. Load cpp_clean_v1.csv and restrict to the outcome of interest (e.g., !is.na(wisc_fsiq)).
2. Identify siblings by mother_id.
3. Use fixest::feols() in R or xtreg, fe in Stata with mother_id as the panel variable.
4. Use cpp_kinship_links.csv to obtain relatedness-classified sibling pairs.
Here is an example of a mother fixed-effects regression in R using the fixest package:
library(data.table)
library(fixest)
d <- fread("cpp_clean_v1.csv", colClasses = c(case_id = "character"))
# Mother fixed effects: breastfeeding and IQ
fe_model <- feols(wisc_fsiq ~ bf_ever + birth_wt_g + gest_age + sex | mother_id,
                  data = d)
summary(fe_model)
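To make explicit what the mother fixed effects do, the within-family point estimate can be obtained by subtracting each mother's mean from the outcome and the predictor, then regressing the deviations on each other. The Python sketch below shows this for a single predictor on invented toy data. The helper within_slope is a hypothetical name, the 0/1 coding of bf_ever is an assumption, and the sketch does not reproduce the standard errors or the additional covariates of the fixest model.

```python
import pandas as pd

def within_slope(df, y, x, group):
    """Slope of y on x after demeaning both within each group (within estimator)."""
    sub = df[[y, x, group]].dropna()
    y_dev = sub[y] - sub.groupby(group)[y].transform("mean")
    x_dev = sub[x] - sub.groupby(group)[x].transform("mean")
    return float((x_dev * y_dev).sum() / (x_dev ** 2).sum())

# Toy data: two sibling pairs and one only child (all values invented).
toy = pd.DataFrame({
    "mother_id": ["001", "001", "002", "002", "003"],
    "bf_ever":   [0, 1, 0, 1, 1],
    "wisc_fsiq": [95.0, 99.0, 105.0, 107.0, 120.0],
})

slope = within_slope(toy, "wisc_fsiq", "bf_ever", "mother_id")
```

The only child of mother 003 has zero deviation from her own family mean, so she contributes nothing to the estimate. Only families in which siblings differ on the predictor inform a within-family coefficient.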
Every file in the release has a companion codebook:
- CPP_Codebook.csv (1,239 entries): primary codebook covering every CPPVAR column. Also browseable on the Codebook page.
- cpp_unified_manifest.csv (4,862 entries): documentation for the unified wide file.
- master/parsed/card_XXXXX_fields.csv: per-card field documentation for CPPMASTER cards.
- cpp_item_scores_codebook.csv: documentation for all derived factor scores.
See the FAQ for common pitfalls, known limitations, and answers to frequently asked questions about the CPP data.