Getting Started — CPP Data Repository

Quick Start

For most analyses, download cpp_clean_v1.csv (29 MB, 59,391 rows × 314 columns). This is the recommended analysis-ready dataset with cleaned values, meaningful variable names, all WISC subtests, and variables from the 7-year psychological battery.

For sibling and twin analyses, also download cpp_kinship_links.csv (14,208 sibling/twin pairs with relatedness coefficients). For extended-family analyses, add cpp_extended_kinship_links.csv (13,513 cousin and other cross-family pairs).

Loading the Data

Here is how to load the analysis-ready dataset in R, Python, and Stata. Note that case_id and mother_id must always be read as strings to preserve leading zeros.

R (data.table)

library(data.table)

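# Load analysis-ready dataset; read the IDs as character at import,
# because converting after the fact cannot restore dropped leading zeros
d <- fread("cpp_clean_v1.csv",       # 59,391 x 314
           colClasses = c(case_id="character", mother_id="character"))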

# Load kinship links for sibling/twin analyses
links <- fread("cpp_kinship_links.csv",
               colClasses = c(id_1="character", id_2="character"))

# Load precomputed g factor scores
g <- fread("cpp_g_factors.csv",
           colClasses = c(case_id="character"))
d <- merge(d, g, by = "case_id", all.x = TRUE)

# Basic analysis: IQ by race
d[, .(mean_iq = mean(wisc_fsiq, na.rm=TRUE),
      sd_iq = sd(wisc_fsiq, na.rm=TRUE),
      n = sum(!is.na(wisc_fsiq))), by = race]

Python (pandas)

import pandas as pd

d = pd.read_csv("cpp_clean_v1.csv",
                 dtype={"case_id": str, "mother_id": str})
links = pd.read_csv("cpp_kinship_links.csv",
                     dtype={"id_1": str, "id_2": str})
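
The R steps above (attach the g factor scores, then summarize IQ by race) can be sketched in pandas as well. This is a minimal illustration rather than a reference implementation: the toy rows and the score column name g are made up, and validate="one_to_one" encodes the assumption that cpp_g_factors.csv holds at most one row per case_id. With the real files, pass the d loaded above and a g table read with dtype={"case_id": str}.

```python
import pandas as pd

def attach_g_scores(d: pd.DataFrame, g: pd.DataFrame) -> pd.DataFrame:
    """Left-join g factor scores onto the main table by case_id."""
    # Assumption: one row per case_id on both sides; merge raises otherwise
    return d.merge(g, on="case_id", how="left", validate="one_to_one")

def iq_by_race(d: pd.DataFrame) -> pd.DataFrame:
    """Mean, SD, and non-missing count of wisc_fsiq within each race group."""
    return (d.groupby("race")["wisc_fsiq"]
             .agg(mean_iq="mean", sd_iq="std", n="count")
             .reset_index())

# Made-up rows standing in for cpp_clean_v1.csv and cpp_g_factors.csv
toy = pd.DataFrame({
    "case_id": ["0100101", "0100102", "0200101", "0200201"],
    "race": ["A", "A", "B", "B"],
    "wisc_fsiq": [100.0, 110.0, 90.0, None],
})
toy_g = pd.DataFrame({"case_id": ["0100101", "0200101"],
                      "g": [0.5, -0.3]})  # "g" is a hypothetical column name

merged = attach_g_scores(toy, toy_g)
summary = iq_by_race(merged)
print(summary)
```

The left join keeps every child in the main table, so children without a g score simply carry a missing value, mirroring all.x = TRUE in the R version.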

Stata

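* Import every column as a string so the IDs keep their leading zeros,
* then convert everything except the two ID columns back to numeric
import delimited "cpp_clean_v1.csv", stringcols(_all) clear
ds case_id mother_id, not
destring `r(varlist)', replace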

Key Identifiers

case_id uniquely identifies each pregnancy/child (a 7–8 digit number combining institution code, family number, and pregnancy sequence). mother_id identifies the biological mother; children who share a mother_id are siblings. Read both as strings, not integers, so that leading zeros are preserved.
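
To see why the string rule matters, the following sketch reads the same small CSV twice, once with default type inference and once with explicit string dtypes. The three rows are made up for illustration and are not real CPP identifiers; only the column names come from this page.

```python
import io
import pandas as pd

# Made-up rows; the ID values are illustrative only
csv_text = (
    "case_id,mother_id,wisc_fsiq\n"
    "0100101,0100100,98\n"
    "0100102,0100100,104\n"
    "1200301,1200300,91\n"
)

as_int = pd.read_csv(io.StringIO(csv_text))
as_str = pd.read_csv(io.StringIO(csv_text),
                     dtype={"case_id": str, "mother_id": str})

print(as_int["case_id"].tolist())  # integers: the leading zeros are dropped
print(as_str["case_id"].tolist())  # strings: IDs match the file exactly

# Count children per mother; a count of 2 or more marks a sibling group
kids_per_mother = as_str.groupby("mother_id")["case_id"].transform("count")
siblings = as_str[kids_per_mother >= 2]
print(siblings)
```

Once an ID has been parsed as an integer, casting it back to a string does not restore the zeros, so the dtype has to be set at read time.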

Applying Survey Weights

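The CPP was not a probability sample, so cpp_weights.csv provides survey weights that adjust for both attrition and population non-representativeness. Merge the weights onto the analysis file by case_id and pass the wt_recommended column to any weighted estimator. Here is an example in R: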

library(data.table)

# Load data and weights
d <- fread("cpp_clean_v1.csv",
           colClasses = c(case_id = "character", mother_id = "character"))
wt <- fread("cpp_weights.csv", colClasses = c(case_id = "character"))
d <- merge(d, wt, by = "case_id")

# Weighted mean IQ (corrects for attrition + population non-representativeness)
weighted.mean(d$wisc_fsiq, d$wt_recommended, na.rm = TRUE)
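
A pandas/NumPy version of the same calculation might look like the sketch below. The helper name weighted_mean and the toy values are illustrative. One behavioral difference is deliberate and worth noting: R's weighted.mean with na.rm = TRUE drops only missing outcome values, so a missing weight still yields NA, whereas this helper drops rows where either the outcome or the weight is missing.

```python
import numpy as np
import pandas as pd

def weighted_mean(df: pd.DataFrame, value_col: str, weight_col: str) -> float:
    """Weighted mean over rows where both the value and the weight are present."""
    ok = df[value_col].notna() & df[weight_col].notna()
    return float(np.average(df.loc[ok, value_col],
                            weights=df.loc[ok, weight_col]))

# Made-up values standing in for the merged data and weights
toy = pd.DataFrame({
    "wisc_fsiq": [90.0, 100.0, 110.0, None],
    "wt_recommended": [1.0, 1.0, 2.0, 5.0],
})

result = weighted_mean(toy, "wisc_fsiq", "wt_recommended")
print(result)  # (90*1 + 100*1 + 110*2) / 4 = 102.5
```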

Within-Family Analyses

The CPP is well-suited for within-family designs because 8,772 mothers contributed two or more children. The recommended workflow:

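1. Load cpp_clean_v1.csv and restrict to the outcome of interest (e.g., !is.na(wisc_fsiq)).
2. Identify families with 2+ children using mother_id.
3. Use fixest::feols() in R or xtreg, fe in Stata with mother_id as the panel variable. Stata's xtset requires a numeric panel variable, so if mother_id was read as a string, create a numeric group ID first (e.g., egen mother_num = group(mother_id)).
4. For behavior genetic analyses, merge with cpp_kinship_links.csv to obtain relatedness-classified sibling pairs.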
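
The data-preparation steps in this workflow (restricting to the outcome, keeping families with 2+ children, and building a pair-level table from the kinship links) can be sketched in pandas as follows. The helper names and toy rows are made up, and relatedness is a hypothetical name for the coefficient column in cpp_kinship_links.csv; check the codebook for the actual name. Only case_id, mother_id, id_1, id_2, and wisc_fsiq are taken from this page.

```python
import pandas as pd

def sibling_sample(d: pd.DataFrame, outcome: str = "wisc_fsiq") -> pd.DataFrame:
    """Keep children with the outcome whose mother has 2+ such children."""
    have = d[d[outcome].notna()]
    n_kids = have.groupby("mother_id")["case_id"].transform("count")
    return have[n_kids >= 2]

def pair_table(links: pd.DataFrame, d: pd.DataFrame,
               outcome: str = "wisc_fsiq") -> pd.DataFrame:
    """One row per linked pair, with both members' outcomes side by side."""
    scores = d[["case_id", outcome]]
    first = scores.rename(columns={"case_id": "id_1", outcome: outcome + "_1"})
    second = scores.rename(columns={"case_id": "id_2", outcome: outcome + "_2"})
    out = links.merge(first, on="id_1").merge(second, on="id_2")
    # Drop pairs where either member lacks the outcome
    return out.dropna(subset=[outcome + "_1", outcome + "_2"])

# Made-up rows for illustration
toy = pd.DataFrame({
    "case_id":   ["0100101", "0100102", "0200101", "0200102", "0300101"],
    "mother_id": ["01001",   "01001",   "02001",   "02001",   "03001"],
    "wisc_fsiq": [100.0, 110.0, 95.0, None, 105.0],
})
toy_links = pd.DataFrame({
    "id_1": ["0100101", "0200101"],
    "id_2": ["0100102", "0200102"],
    "relatedness": [0.5, 0.5],  # hypothetical column name
})

sample = sibling_sample(toy)
pairs = pair_table(toy_links, toy)
print(sample)
print(pairs)
```

Counting children after the outcome filter, rather than before, means a family stays in the sample only if at least two of its children actually have the outcome.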

Here is an example of a mother fixed-effects regression in R using the fixest package:

library(data.table)
library(fixest)

d <- fread("cpp_clean_v1.csv",
           colClasses = c(case_id = "character", mother_id = "character"))

# Mother fixed effects: breastfeeding and IQ
fe_model <- feols(wisc_fsiq ~ bf_ever + birth_wt_g + gest_age + sex | mother_id,
                   data = d)
summary(fe_model)
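
For readers working in Python, the slope estimates from a mother fixed-effects model can be reproduced with a manual within transformation: subtract each mother's mean from the outcome and predictors, then run least squares on the demeaned values. The sketch below is not the fixest implementation. It returns point estimates only (no standard errors), assumes all predictors are numerically coded, and uses made-up toy data in which the true slopes are 3 and 0.01 by construction. The helper name mother_fe_coefs is illustrative.

```python
import numpy as np
import pandas as pd

def mother_fe_coefs(df, outcome, predictors, group="mother_id"):
    """Within-mother (fixed-effects) slopes via demeaning and least squares."""
    cols = [outcome] + list(predictors)
    cc = df.dropna(subset=cols + [group])  # complete cases only
    # Within transformation: subtract each mother's mean from every variable
    demeaned = cc[cols] - cc.groupby(group)[cols].transform("mean")
    X = demeaned[list(predictors)].to_numpy(dtype=float)
    y = demeaned[outcome].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return pd.Series(beta, index=list(predictors))

# Made-up data: outcome = 3*bf_ever + 0.01*birth_wt_g + a mother-specific offset
toy = pd.DataFrame({
    "mother_id":  ["01001", "01001", "02001", "02001", "03001", "03001"],
    "bf_ever":    [0, 1, 1, 1, 0, 1],
    "birth_wt_g": [3000, 3400, 3200, 3600, 2800, 2900],
})
offset = toy["mother_id"].map({"01001": 60.0, "02001": 70.0, "03001": 65.0})
toy["wisc_fsiq"] = 3 * toy["bf_ever"] + 0.01 * toy["birth_wt_g"] + offset

coefs = mother_fe_coefs(toy, "wisc_fsiq", ["bf_ever", "birth_wt_g"])
print(coefs)
```

Because the mother-specific offsets are removed by the demeaning step, the recovered slopes do not depend on them; mothers with a single child demean to zero and contribute nothing to the estimates.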

Codebooks

Every file in the release has a companion codebook:

CPP_Codebook.csv (1,239 entries) — primary codebook covering every CPPVAR column. Also browseable on the Codebook page.
cpp_unified_manifest.csv (4,862 entries) — documentation for the unified wide file.
master/parsed/card_XXXXX_fields.csv — per-card field documentation for CPPMASTER cards.
cpp_item_scores_codebook.csv — documentation for all derived factor scores.

See the FAQ for common pitfalls, known limitations, and answers to frequently asked questions about the CPP data.