# Homework 3: Returns to Education

_Due date:_ 

_Part 1: November 3, 2025, 11:59 pm_

_Part 2: November 10, 2025, 11:59 pm_

_Please hand in your completed .ipynb file on Gradescope (you can work in the same notebook for part 2). Moreover, for part 2b, please hand in a separate pdf document. Make sure you name the file with your and your teammates' names, e.g. "Anna_Amar_Mary_Guido_Assignment_3.ipynb" and "Anna_Amar_Mary_Guido_Policy_Memo_3.pdf".  Only one of you will need to hand in the file, but make sure you list everyone's name in the next cell._

# <font color='red'>TO DO:</font>

<font color='red'>Double-click on this box and fill in your names</font>

**Names:**

_Name 1_, _Name 2_, _Name 3_

## Overview

In the first part of our third class module, we discussed Mincer's model for choosing years of education to maximize future earnings. A key concern when estimating this model is omitted variable bias: our OLS treatment effect estimates are biased if an important confounder is excluded from the regression. A confounder must meet two criteria: it must be correlated with the treatment, and it must affect the outcome (so that, when omitted, it ends up in the error term of the outcome equation). In terms of the graph we introduced in the first week, this means there must be edges from the confounder to both the treatment and the outcome variable.

However, unlike last week where we simply adjusted for these confounders in our OLS regression, the challenge in this module is that we oftentimes do not observe the confounder. For instance, in the context of education, ability is often considered a confounder, but we lack a reliable measure to adjust for it directly.
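To make omitted variable bias concrete, here is a minimal simulation sketch. Everything in it is hypothetical: the data are simulated (not the homework data), the variable `ability` and all coefficients are made up, and the small `ols` helper is written just for this illustration. It compares a regression that omits the confounder with one that includes it, which is only possible here because we generated `ability` ourselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers are made up for illustration)
ability = rng.normal(size=n)                    # confounder we normally cannot observe
educ = 12 + 2 * ability + rng.normal(size=n)    # ability raises education
lwage = 1.0 + 0.10 * educ + 0.30 * ability + rng.normal(scale=0.5, size=n)

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) computed by least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_short = ols(lwage, educ)[1]           # ability omitted: picks up part of ability's effect
b_long = ols(lwage, educ, ability)[1]   # ability included: infeasible with real data

print(f"True effect of educ:        0.10")
print(f"OLS without ability (short): {b_short:.3f}")
print(f"OLS with ability (long):     {b_long:.3f}")
```

In this setup the short regression overstates the effect of education, because education also proxies for ability; the long regression recovers a value close to the true 0.10.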

This is where instrumental variables come into play. An instrumental variable must 1) be correlated with the explanatory variable of interest (relevance) and 2) affect the outcome of interest  _solely_ through its influence on the treatment (independence and exclusion). The primary goal of using instrumental variables is to obtain a causal treatment effect of an independent variable based only on variation introduced by the instrument.
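The idea of using only the variation introduced by the instrument can be sketched on simulated data. This is an illustration under stated assumptions, not the homework analysis: `near_college` is a hypothetical binary instrument that is randomly assigned by construction, `ability` is a made-up unobserved confounder, and there are no other covariates, so the IV estimate reduces to a ratio of two covariances.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Hypothetical data-generating process (all numbers are made up for illustration)
ability = rng.normal(size=n)                  # unobserved confounder
near_college = rng.integers(0, 2, size=n)     # instrument: random, so independent of ability
educ = 12 + 1.0 * near_college + 2 * ability + rng.normal(size=n)   # relevance holds
# The instrument does not appear in the wage equation (exclusion holds by construction)
lwage = 1.0 + 0.10 * educ + 0.30 * ability + rng.normal(scale=0.5, size=n)

def cov(a, b):
    """Sample covariance between two 1-d arrays."""
    return np.cov(a, b)[0, 1]

b_ols = cov(educ, lwage) / cov(educ, educ)                    # biased by ability
b_iv = cov(near_college, lwage) / cov(near_college, educ)     # uses only instrument-driven variation

print(f"True effect of educ: 0.10")
print(f"OLS estimate:        {b_ols:.3f}")
print(f"IV estimate:         {b_iv:.3f}")
```

Here the OLS estimate is pulled away from the truth by the omitted confounder, while the IV estimate is close to 0.10, although it is noisier than OLS because it relies on a smaller share of the variation in education.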


**What are we doing in this homework?**

In this HW, we will use data on college proximity, years of education and earnings to understand the returns to education. Specifically, we will re-analyze David Card's study on using geographic variation in college proximity to estimate the returns to education in terms of wages. Geographic variation here means whether or not a person grew up near a 4-year college.

In part 1 of this homework, we will again begin with an exploratory data analysis, including calculating summary statistics and running simple regressions. In part 2 of this homework, we will then make use of the new tool we learned in class to causally identify parameters: _instrumental variables_. Lastly, for part 2b, in the policy memo section, we will suggest and explore policy interventions based on our findings.


**Why are we doing this homework?**

1. Understanding the returns to education is important not just for guiding individual career choices, but also for informing policy, e.g. incentivizing young adults to go to college by eliminating college tuition.

2. Instrumental variables regression is commonly used in applied economics research, as standard linear regressions oftentimes fail to provide unbiased estimates due to unobserved confounders.

## About the data

The data set we are working with is drawn from the National Longitudinal Survey of Young Men (NLSYM). The first year of the survey was 1966 and there are data from follow-up surveys available until 1981. The sample contained 5525 men aged 14-24. The survey was quite comprehensive and asked a lot of questions about their personal characteristics (such as age, race, etc.), family background (including whether they lived near a 4-year college), questions from the "Knowledge of the World of Work" questionnaire, their educational attainment, the local labor market characteristics and their wages. We provide you with a subset of the data.

## References

You don't have to look at this paper, but if you're curious, here is a reference:

* Card, D. (1993). Using geographic variation in college proximity to estimate the return to schooling.

## Understand the overview

1. Write down the variables of interest in this setting: what is the outcome, what is the (potentially endogenous) independent variable of interest, what is the instrument, etc. in this case?
2. Describe the two assumptions underlying valid instruments in the context of this returns to education example.
3. Do you find those assumptions credible? Why or why not? What are ways we could assess the credibility of the assumptions?

## <font color='red'>TO DO</font>

<font color='red'>Double-click on the box below and fill in your answers to 1-3 above.</font>

**ANSWER**:
1. TODO
2. TODO
3. TODO


## Part 1: Exploratory Analysis

**Due:** November 3, 2025, 11:59 pm

In this section, we are focusing on familiarizing ourselves with the data set and doing some exploratory analysis. We will guide you through this step-by-step. Hopefully by now you know the drill!

### Part 1a: Loading the Data

Before you run the cell, make sure you saved the data in the same directory as your Jupyter Notebook.

Run the following cell (press Shift+Enter) to load the data. Recall from the Python tutorial that we are turning the data into a pandas dataframe to ease computations and visualizations later down the road.

In [None]:
# Import all of our favorite packages...
import pandas as pd
import csv

# Load the csv file (the with-block closes the file once all rows are read)
with open('school_data.csv', newline='', encoding='utf-8') as datafile:
    dataDict = csv.DictReader(datafile, delimiter=",")
    # Turn this into a pandas dataframe (all values are still strings here)
    school_data = pd.DataFrame(dataDict)

# Cast the values as numeric (entries that cannot be parsed become NaN)
for var in school_data.columns:
    school_data[var] = pd.to_numeric(school_data[var], errors='coerce')


print("Great! Data is loaded")

Let's see what the data set looks like.

In [None]:
school_data

At the bottom, you can see that our data set has 3010 rows (_read_ individuals) and 10 columns (_read_ characteristics) stored.

**Here's what each of these variables means:**
* id: sequential identification number for an individual
* nearc4: individual grew up near 4-yr college
* educ: years of education
* black: individual is black (1) or not (0)
* smsa: individual lives in a metropolitan area (1) or not (0)
* south: individual lives in the south (1) or not (0)
* exper: individual's potential years of experience (age - education - 6)
* lwage: log of hourly wages in 1976
* expersq: squared experience
* expersq_100: squared experience divided by 100

### Part 1b: Summary Statistics

We are starting off by computing summary statistics of the variables in the data set again. Recall that this means calculating, for example, the means and standard deviations and plotting histograms of the covariates.

### <font color="red">Coding time!</font>

<font color="red"> Provide some summary statistics for the covariates you observe in the data set.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> Do you find anything surprising in the data from the summary statistics? If yes, what is it and why is it surprising? If no, why not?</font>
    
<font color="red">**TODO:** Double-click the box below and enter your answer. </font>

**ANSWER:**

### Part 1c: Simple OLS

We start off by running a simple OLS regression for our observational sample to estimate the returns to education in terms of wages. We'll fit a simple linear model of the form: [lwage] = a + b_1 [educ].

### <font color="red">Coding time!</font>

<font color="red"> Write your code for the regression "lwage ~ educ" and print out the model summary.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> 1. How do you interpret the regression coefficient on education mechanically? (Your answer should be something like _"One more unit of X increases Y by b"_)</font>

<font color="red"> 2. How does the regression coefficient on education compare to other returns to education estimates that we have seen in the lecture?</font>
    
<font color="red"> 3. Should you interpret this regression coefficient causally? If yes, why? If not, why not? (Think about what would need to be true for you to interpret this causally - is this reasonable based on the data or your knowledge? Why or why not?)</font>

**ANSWER:**

### Part 1d: OLS with Covariates

As we saw in the last homework, when we deal with observational data, we hardly ever find balanced groups and need to properly adjust for confounders in our regression. Let's do that for our education regression! We'll fit a simple multivariate linear model of the form: [lwage] = a + b_1 [educ] + b_2 [exper] + b_3 [expersq_100] + b_4 [black] + b_5 [south] + b_6 [smsa].

### <font color="red">Coding time!</font>

<font color="red"> Write your code for the regression "lwage ~ educ + exper + expersq_100 + black + south + smsa" and print out the model summary.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> 1. How does the regression coefficient on education compare to the one you estimated in part 1c and to the other estimates we have seen in class?</font>
    
<font color="red"> 2. Should you interpret this regression coefficient causally? If yes, why? If not, why not? (Think about what would need to be true for you to interpret this causally - is this reasonable based on the data or your knowledge? Why or why not?)</font>

**ANSWER:**

### <font color="blue">!! STOP !!</font>

<font color="blue"> You have finished part 1 of the homework. This is the only part of the homework that is due on November 3, 2025, 11:59 pm. Save your progress and please upload your Jupyter notebook on Gradescope.</font>

## Part 2: Analysis

**Due:** November 10, 2025, 11:59 pm

In the second part of the homework, we will continue with our causal analysis.

### Part 2a: Is Education Really Exogenous?

Many people have argued over the last years that the level of education is **not** randomly assigned in the population and cannot be seen as an exogenous variable. Is all hope of us getting an estimate of the returns to education lost? Fortunately, it is not. In our second and third lecture in the returns to education unit, we learned about _instrumental variables_ (IV). Instrumental variables are variables that provide exogenous variation in education outcomes which we can use to obtain credible causal effects, even when our regressor is not exogenous. Recall that a good instrument satisfies two criteria: 1) it is correlated with the explanatory variable of interest (relevance) and 2) it only affects the outcome of interest through its correlation with the explanatory variable (independence and exclusion). Let's dive deeper into both of these and our instrumental variables analysis to see how much our estimates change from the OLS estimates.

In this part of the HW, we want to obtain the IV estimate for education on earnings using college proximity. As we learned in the lecture, we actually have all the tools we need for this readily available from our last unit. Specifically, we will run two regressions: 1) **first stage:** we regress education on college proximity and 2) **reduced form:** we regress earnings on college proximity. Taking the ratio of both of these regression coefficients will give us our IV estimate. 
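To see why the ratio recovers the effect of education, it can help to write the two regressions next to the wage equation. The notation below ($\pi$, $\gamma$, $c$, $X$, $u$, $v$, $e$) is introduced here for illustration only and may differ from the lecture notation; $X$ stands for the covariates (exper, expersq_100, black, south, smsa).

$$
\begin{aligned}
\text{wage equation:} \quad & \text{lwage} = a + b\,\text{educ} + X'c + u \\
\text{first stage:} \quad & \text{educ} = \pi_0 + \pi_1\,\text{nearc4} + X'\pi_2 + v \\
\text{reduced form:} \quad & \text{lwage} = \gamma_0 + \gamma_1\,\text{nearc4} + X'\gamma_2 + e
\end{aligned}
$$

Substituting the first stage into the wage equation gives

$$
\text{lwage} = (a + b\,\pi_0) + b\,\pi_1\,\text{nearc4} + X'(c + b\,\pi_2) + (u + b\,v),
$$

so the reduced-form coefficient on the instrument is $\gamma_1 = b\,\pi_1$, and therefore

$$
b_{IV} = \frac{\gamma_1}{\pi_1}.
$$

This sketch relies on the two instrument assumptions: $\pi_1 \neq 0$ (relevance), and nearc4 being unrelated to $u$ once we adjust for $X$ (independence and exclusion).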

### <font color="red">Coding time!</font>

<font color="red"> Write your code for first stage regression "educ ~ nearc4 + exper + expersq_100 + black + south + smsa" and print out the model summary.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Coding time!</font>

<font color="red"> Write your code for the reduced form regression "lwage ~ nearc4 + exper + expersq_100 + black + south + smsa" and print out the model summary.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Coding time!</font>

<font color="red"> Write your code to obtain the IV estimate.</font>

<font color="red"> Your code should print out "The IV estimate is: ____."</font>

<font color="red"> **Hint:** If you fit your regressions with statsmodels, remember that you can access the estimated parameters using model.params (if you named your fitted regression result "model"). If you assign that object to a variable, you can use square brackets with a variable name to extract just one of the parameters.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> 1. What conclusions can you draw from your IV analysis for the returns to education in terms of wages?</font>

<font color="red"> 2. How does your IV estimate compare to your OLS estimate? If you see a difference, what could be explaining the difference?</font>

<font color="red"> 3. Based on the analysis you have conducted so far, do you think "college proximity" is a good instrument? Why or why not? Does your answer differ from your answer to question 3 of "Understand the overview"?</font>

**ANSWER:**

### <font color="red">OPTIONAL: Coding time!</font>

<font color="red">Implement the two-stage least squares estimator. How does it compare to the IV estimate you obtained above in part 2a? (Hint: You already ran the first stage regression - all that's left is the second stage!)</font>

In [None]:
# OPTIONAL TODO: Your code here.

### Part 2b: Policy Memo

Imagine you have been hired as a policy analyst for the California Department of Higher Education, which is debating whether to open a new California State University (CSU) campus in Far Northern California to improve college access for residents of the region. The Governor’s office wants an evidence-based recommendation on whether expanding local college opportunities is a good investment for the state.

Draft a policy memo that summarizes what the evidence shows and makes a clear, actionable recommendation for the agency’s leadership team. The Governor and senior staff do not have the same technical background as you, so your memo should explain the findings and their implications in non-technical language.

* **State your recommendation clearly.** Should the state open a new CSU campus? If yes, where and for whom should access be prioritized? If not, explain why and suggest alternative ways to expand educational opportunity.
*  **Summarize the key evidence.** Highlight the most policy-relevant numbers on the returns to education from your IV analysis and other studies. 
*  **Discuss policy design.** If recommending a new campus, specify details such as location, admission criteria, outreach strategies, financial aid, and expected enrollment. If recommending a different policy (e.g., tuition subsidies, community-college transfers), describe the design and rationale.
*  **Confront limitations.**  Discuss what your analysis cannot capture—e.g., potential differences in returns for today’s students, non-monetary benefits of college, general-equilibrium effects, or the possibility that returns vary across subgroups.
*  **Propose next steps.** If you had more time, money, or data, what further research or pilot testing would most help the agency make a confident decision? 

_Deliverables:_
* A policy memo (max. 1,000 words), written for the Governor and the California Higher Education Coordinating Board, with 1-2 small exhibits maximum (a table or figure).
* A technical appendix (1-2 pages) with supporting tables and figures for readers who want more details.

_Suggested memo structure:_
1) Policy recommendation (2-3 sentences)
2) Evidence summary
3) Policy design/implementation plan
4) Cost-effectiveness (if applicable)
5) Limitations of the evidence
6) Next steps for research

### <font color="blue">!! DON'T FORGET !!</font>

<font color="blue"> Hand in your policy memo in a pdf format on Gradescope as well as your completed Jupyter notebook with your team's names on it, e.g. "Anna_Amar_Mary_Guido_Assignment_3.ipynb" and "Anna_Amar_Mary_Guido_Policy_Memo_3.pdf".</font>

In [None]:
##OPTIONAL TODO (policy memo): Your code here.

## Wrap-up and take-aways

In this homework, we learned about one way to estimate returns to college. In our analysis, we made use of a new tool we learned for estimating causal effects with observational data: instrumental variables! It turns out that IV is just combining regressions in a particular way! We also learned a way of using the data to (at least partly) assess whether we have a valid instrument.

After our returns to education unit, you should be able to answer the following questions:
* Why might we not be able to interpret our least squares estimator on our treatment variable causally, even if we adjusted for all observable covariates?
* What do we mean when we call a variable an instrument?
* How can we assess whether an instrument is good?
* How do we calculate the instrumental variables estimator?