# Homework 2: Universal Basic Income and Permanent Income Hypothesis

_Due date:_ 

_Part 1: October 20, 2025, 11:59 pm_

_Part 2: October 27, 2025, 11:59 pm_

_Please hand in your completed .ipynb file on Gradescope (you can work in the same notebook for part 2). Your file should have the output for each cell visible, so make sure you run the cell and then save it. Moreover, for part 2b, please hand in a separate pdf document. Make sure you name the file with your and your teammates' names, e.g. "Anna_Amar_Mary_Guido_Assignment_2.ipynb" and "Anna_Amar_Mary_Guido_Policy_Memo_2.pdf". Only one of you will need to hand in the file, but make sure you list everyone's name in the next cell._

# <font color='red'>TO DO:</font>

<font color='red'>Double-click on this box and fill in your names</font>

**Names:**

_Name 1_, _Name 2_, _Name 3_

## Overview

In the first part of our second class module, we discussed Friedman's permanent income hypothesis. Moreover, we explored some ideas of how we could empirically test this hypothesis and how we can obtain reliable causal estimates. One specific question we touched on was the impact of universal basic income on labor supply. Since running an experiment in this context is hard, we need alternative methods to establish causality. One alternative method is identifying natural experiments — policy changes or other events that we can argue are plausibly random. We refer to the variation this creates as _quasi-experimental variation_ because it is close to an actual experiment, but it was not randomized.

**What are we doing in this homework?**

In this HW, we will use lottery data to test Friedman's hypothesis in the context of universal basic income. Specifically, we will re-analyze the Imbens-Rubin-Sacerdote lottery study that estimates the effect of permanent income on labor supply decisions. The quasi-experimental variation is coming from the fact that the lottery is random.

In part 1 of this homework, we will again begin with an exploratory data analysis, including calculating summary statistics, balance tables and scatterplots. In part 2 of this homework, in addition to a policy memo, we will make use of the first tool we learned in class to causally identify parameters with observational data: _regressions_.


**Why are we doing this homework?**

1. Understanding citizens' reactions to extra income (either one-time payments or permanent payments) is crucial due to its significant policy implications, including its potential role in stimulus measures (think back to COVID policies) and as a permanent policy.

2. Linear regressions are one of the most widely used techniques for data analysis and decision-making not just by applied economists, but across various fields.

## About the data

As mentioned in the overview, we will give you data from the Imbens-Rubin-Sacerdote lottery study.

### "Winners" Sample
The winners sample consists of people who played the Megabucks lottery in Massachusetts during 1984-1988 and won a major prize. These major prizes mimic a universal income as they are paid out in installments over the next 20 years. Their dollar amount ranged from \\$22,000 to \\$9,696,000, with the sample mean and median equal to \\$1,104,000 and \\$635,000, respectively. The data for these winners was collected through a survey sent out to each individual. There were three sets of questions: 1) outcomes at time of survey, 2) economic behavior and background characteristics at time of winning and 3) earnings. The data set's earnings records are social security earnings records for the six years preceding their lottery win and the six years following the lottery win.
 

## References

You don't have to look at these, but if you are curious, here are some references.  The PDFs are linked from the assignment.

* Imbens, G. W., Rubin, D. B., & Sacerdote, B. I. (2001). Estimating the effect of unearned income on labor earnings, savings, and consumption: Evidence from a survey of lottery players. American Economic Review, 91(4), 778-794.

## Understand the overview

1. Use the potential outcome notation to write down the variables of interest in this setting: what are the potential outcomes, what is the outcome, what is the (binary) treatment, etc. in this case?
2. What is the ideal RCT we would like to run in this case to estimate the causal effect of universal basic income on labor supply?
3. How do the data and design at hand get close to mimicking this ideal experiment? 

## <font color='red'>TO DO</font>

<font color='red'>Double-click on the box below and fill in your answers to 1-3 above.</font>

**ANSWER**:
1. TODO
2. TODO
3. TODO


## Part 1: Exploratory Analysis

**Due:** October 20, 2025, 11:59 pm

In this section, we are focusing on familiarizing ourselves with the data set and doing some exploratory analysis. We will guide you through this step-by-step.

### Part 1a: Loading the Data

Before you run the cell, make sure you saved the data in the same directory as your Jupyter Notebook. 

Run the following cell (press Shift+Enter) to load the data. Recall from the Python tutorial that we are turning the data into a pandas dataframe to ease computations and visualizations later down the road. 

In [None]:
# Import all of our favorite packages...
import pandas as pd
import csv

# Load the csv file; the with-block closes it once the rows have been read
with open('lottery.csv', newline='') as datafile:
    dataDict = csv.DictReader(datafile, delimiter=",")
    # Turn this into a pandas dataframe
    winners = pd.DataFrame(dataDict)
    fieldnames = dataDict.fieldnames

# Cast the values as numeric
for var in fieldnames:
    winners[var] = pd.to_numeric(winners[var], errors='coerce')


print("Great! Data is loaded")

Let's see what the data set looks like.

In [None]:
winners

At the bottom, you can see that our data set has 496 rows (_read_ individuals) and 35 columns (_read_ characteristics) stored.

**Here's what each of these variables means:**
* age: age of the individual
* bigwinner: 1 if the subject won a big prize (> \\$100,000), 0 otherwise
* education: years of education
* male: 1 if the subject is male, 0 otherwise
* tixbot: number of tickets bought by the individual
* winner: 1 if the subject won a small prize (< \\$100,000), 0 otherwise
* workthen: 1 if the subject was working when they won the prize, 0 otherwise
* xearn1: earnings 6 years prior to winning (in thousands of \\$)
* xearn2,..., xearn6: earnings 5,..., 1 year prior to winning (in thousands of \\$)
* xearnp1: positive earnings 6 years prior to winning
* xearnp2,..., xearnp6: positive earnings 5,..., 1 year prior to winning
* yearlpr: yearly prize money that was paid out over 20 years
* yearn1: earnings 1 year after winning (in thousands of \\$)
* yearn2,..., yearn7: earnings 2,...,7 years after winning (in thousands of \\$)
* yearnp1: positive earnings 1 year after winning
* yearnp2,..., yearnp6: positive earnings 2,...,6 years after winning
* yearw: year that the prize was won

### Part 1b: Summary Statistics

We are starting off by computing summary statistics of the variables in the data set again. Recall that this means calculating, for example, the means and standard deviations and plotting histograms of the covariates.
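As a reminder of the mechanics, here is a minimal sketch on a small made-up table. The data frame `toy` and its values are illustrative assumptions and are not taken from the lottery data; the same calls should work on any numeric pandas dataframe.

```python
import pandas as pd

# Toy data: made-up values, for illustration only
toy = pd.DataFrame({
    "age": [25, 40, 31, 58, 46],
    "education": [12, 16, 14, 12, 18],
})

# Count, mean, standard deviation, min, quartiles, and max for each column
summary = toy.describe()
print(summary)

# Individual statistics are also available directly
age_mean = toy["age"].mean()
age_std = toy["age"].std()  # sample standard deviation (divides by n - 1)
print(age_mean, age_std)
```

Calling `toy.hist()` would additionally draw one histogram per column.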

### <font color="red">Coding time!</font>

<font color="red"> Provide some summary statistics for the covariates you observe in the data set.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> Do you find anything surprising in the data from the summary statistics? If yes, what is it and why is it surprising? If no, why not?</font>
    
<font color="red">**TODO:** Double-click the box below and enter your answer. </font>

**ANSWER:**

### Part 1c: Balance table

Next, we calculate a balance table again to see how balanced the covariates are for the treatment group, i.e. the winners, and the control group, i.e. the 'losers'.

**For the purposes of this assignment, think of people with "winner == 1" as winners and "winner == 0" as 'losers'!**

### <font color="red">Interpretation time!</font>

<font color="red"> Before computing anything, what do you expect the balance of the treatment and control group to look like for the different covariates?</font>
    
<font color="red">**TODO:** Double-click the box below and enter your answer. </font>

**ANSWER:**

### <font color="red">Coding time!</font>

<font color="red">Write code below to investigate whether or not the treatment and control groups in the observational data look about the same. Pick two to three covariates. Feel free to reuse any code from the last homework problem.</font>

**Note:** Make sure the covariates you use to assess balance are pre-treatment covariates. If you use post-treatment covariates, you should make sure that they are not affected by the treatment.
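To recall what a balance check can look like, here is a minimal sketch on made-up data. The column names `treated` and `age` and all values are hypothetical, and scaling the difference by a pooled standard deviation is one common convention that may differ from the one used in class.

```python
import pandas as pd

# Toy data: made-up values; "treated" plays the role of the treatment indicator
toy = pd.DataFrame({
    "treated": [1, 1, 1, 0, 0, 0],
    "age":     [30, 40, 50, 35, 45, 55],
})

# Mean and standard deviation of the covariate within each group
by_group = toy.groupby("treated")["age"].agg(["mean", "std"])

# Difference in means, scaled by a pooled standard deviation
diff = by_group.loc[1, "mean"] - by_group.loc[0, "mean"]
pooled_sd = ((by_group.loc[1, "std"] ** 2 + by_group.loc[0, "std"] ** 2) / 2) ** 0.5
std_diff = diff / pooled_sd

print(by_group)
print(diff, std_diff)
```

A scaled difference close to zero suggests the two groups look similar on that covariate; large values point to imbalance.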

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> After computing, what is your take-away?  Double-click on the box below and say whether you think that the observational data does a good job of getting similar-looking control and treatment groups. </font>

**ANSWER:**

### Part 1d: Graphical relationships

We will start off with investigating how earnings later in life (we will average all post-treatment earnings) are affected by whether or not an individual wins in the lottery. We will document whether the two variables of interest are correlated by creating a scatterplot and calculating the correlation coefficient.
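As a quick refresher on the correlation coefficient, the sketch below computes it for a binary variable and a continuous one. The columns `x` and `y` and their values are made up for illustration and have nothing to do with the lottery sample.

```python
import numpy as np
import pandas as pd

# Toy data: a 0/1 variable and a continuous variable (made-up values)
toy = pd.DataFrame({
    "x": [0, 0, 1, 1],
    "y": [1.0, 3.0, 2.0, 6.0],
})

# Pearson correlation coefficient, computed two equivalent ways
r_pandas = toy["x"].corr(toy["y"])
r_numpy = np.corrcoef(toy["x"], toy["y"])[0, 1]

print(r_pandas, r_numpy)
```

The coefficient always lies between -1 and 1; its sign gives the direction of the linear association, not its causal interpretation.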

First, we will create a variable for the average earnings post winning the prize.

Run the cell below to do that.

In [None]:
# Average all the post-treatment earnings that are stored in columns 20-26
winners["avg_earn"] = winners.iloc[:,20:27].mean(axis = 1) # axis = 1 is telling Python to compute the mean of each row

### <font color="red">Coding time!</font>

<font color="red"> Write code below to create a scatterplot of the treatment variable, _winner_, and the outcome variable, _avg_earn_.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Coding time!</font>

<font color="red"> Write code below to calculate the correlation coefficient of the treatment variable, _winner_, and the outcome variable, _avg_earn_.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> How can we interpret the correlation coefficient we obtained between treatment and outcome?</font>

**ANSWER:**

### <font color="blue">!! STOP !!</font>

<font color="blue"> You have finished part 1 of the homework. This is the only part of the homework that is due on October 20, 2025, 11:59 pm. Save your progress and please upload your Jupyter notebook on Gradescope.</font>

## Part 2: Analysis

**Due:** October 27, 2025, 11:59 pm

We will split our analysis into two parts: a) causal analysis (including treatment effect estimation and inference) and b) writing a policy memo.

### Part 2a: Causal Analysis

We will move to our causal analysis. We start off by running a regression for our observational sample.  We'll fit a simple linear model of the form: [avg_earn] = a + b[winner].

That is, we want to estimate the "best" values of a and b so that the above equation is as close to being satisfied as possible.
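In symbols, "as close as possible" is usually taken to mean minimizing the sum of squared errors. Writing $Y_i$ for avg_earn and $W_i$ for winner of individual $i$, with $n$ individuals in total (this notation is introduced here for illustration and may differ from the one used in class), the estimates solve:

$$
(\hat{a}, \hat{b}) = \arg\min_{a,\, b} \sum_{i=1}^{n} \left( Y_i - a - b\, W_i \right)^2
$$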

Run the following code to do this.  Here is what the arguments to the ols function mean:
* The first argument "avg_earn ~ winner" describes the model we are trying to fit.  It says that we want to explain the variable "avg_earn" in terms of the variable "winner".  We'll see later that if we want to explain some variable Y in terms of multiple other variables, we'd use a plus sign, like "avg_earn ~ winner + age".
* The second argument, winners, is our dataset
* The third argument, missing="drop", is optional, and tells Python what to do with data that have missing observations.  In this case, we're telling it to drop those data points.

**Note 1:** ols = *o*rdinary *l*east *s*quares, which is just a fancy word for regression!

**Note 2:** You might need to download the package "statsmodels" using pip install if it was not included when you downloaded Python. Check out the "WelcomeToPython" tutorial if you are not sure how to do that.  

In [None]:
# Load the necessary package
from statsmodels.formula.api import ols

model = ols("avg_earn ~ winner", winners, missing="drop").fit()
model.params

Above, you should have gotten a value for "Intercept" and a value for "winner".  The "Intercept" is the best estimate for "a" above; it's the y-intercept of the line.  The "winner" is the best estimate for "b" above; it's the coefficient on the "treatment" variable, in our case the 0/1 indicator for winning.

You can get more information about the regression by running the command that follows the rules of thumb below.  It tells you things like "R-squared", "standard errors", and "p-values", which quantify how precise the estimates are and how good a fit this linear model is.

We're not going to go into the details of how these are computed and what they mean, but here are some rules of thumb:

### Rules of Thumb for Deciding if Regression Results are "Important."
* **Look at the "std err" column for each coefficient.**  If a coefficient's standard error is small relative to its value, then the estimate is precise. If the standard error is large relative to its value, then the estimate is not precise.  
* **Look at the "p-value" in the column labeled "P>|t|".**  This translates the standard error into a statement about how surprised you should be.  It tells you the probability of seeing a coefficient at least this large in magnitude purely by chance if the true coefficient were zero.  Loosely: if there were no relationship and we repeatedly drew random samples from the same population, how often would we get a coefficient of this size?  If this probability is very small (a "standard" threshold is less than 0.05), then we call the relationship captured by that coefficient "statistically significant," i.e., it is unlikely to have shown up by random chance alone.  If this p-value is large, then we probably shouldn't read too much into the relationship captured by this coefficient. Note that if you look at 20 coefficients, then even if there is nothing going on, you should expect to be surprised by about 1 of them!
* **Look at the "R-squared" that appears at the top.**  This number tells us the share of the variance that is "explained" by the regressors.  If it is very close to zero, this means that most of the variance is *not* explained by the regression model, so we don't think that these factors do a good job of predicting/explaining the outcome.  If it is very close to 1, this means that most of the variance *is* predicted/explained by the regression model, so we do think that these factors do a good job of predicting/explaining the outcome.
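If you prefer to pull these numbers out individually rather than reading them off the summary table, a fitted statsmodels result exposes them as attributes. The sketch below uses simulated data; the names `toy`, `toy_model`, `x`, and `y` are made up for illustration and the numbers are unrelated to the lottery sample.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy data: simulated for illustration, with a true slope of 2
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.integers(0, 2, size=200)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(size=200)

toy_model = ols("y ~ x", toy).fit()

coef = toy_model.params["x"]   # point estimate ("coef" column)
se = toy_model.bse["x"]        # standard error ("std err" column)
pval = toy_model.pvalues["x"]  # p-value ("P>|t|" column)
r2 = toy_model.rsquared        # R-squared reported at the top

print(coef, se, pval, r2)
```

Here the standard error should be small relative to the coefficient and the p-value far below 0.05, because the simulated relationship is strong by construction.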

In [None]:
print(model.summary())

As we learned in class, a regression of this form is just finding a line of best fit.  Let's plot that and see what it looks like!

In [None]:
a = model.params["Intercept"]
b = model.params["winner"]
x = winners["winner"]
y = winners["avg_earn"]
linearX = [x/10 for x in range(0,11)]
linearY = [a + b*x for x in linearX]

In [None]:
# Import the matplotlib plotting library
from matplotlib import pyplot as plt
import numpy as np

plt.scatter(np.array(x),np.array(y))
plt.xlabel("Winner")
plt.ylabel("Average earnings after winning")
plt.plot(linearX, linearY,color="red")
plt.show()

## <font color="red">Interpretation time!</font>

1. <font color="red">How can we interpret the regression coefficient we obtained with respect to Friedman's permanent income hypothesis? Do they align or do they differ? </font>
2. <font color="red">Given our findings from the balance table, do we trust our current analysis and estimates we got to represent causal effects? Why or why not? </font>

**ANSWER:**

## <font color="red">OPTIONAL: Coding time!</font>

1. <font color="red">Implement the least squares estimator from scratch using the formula we learned in class and confirm that it matches your estimate from using the built-in function!</font>
2. <font color="red">Calculate the difference-in-means estimator. How does it compare to the least squares estimator?</font>

In [None]:
##OPTIONAL TODO: Your code here.

### Addressing potential confounders

Now we can throw a few more variables into the model and see how our results change.  More precisely, let's fit a model of the form:

[avg_earn] = a + b[winner] + c[age] + d[education] + ... + z[xearnp6]

(that is, we'll throw in all of the pre-treatment variables!)

Run the following cell to do that.

In [None]:
model = ols("avg_earn ~ winner + age + education + male + tixbot + workthen + xearn1 + xearn2 + xearn3 + xearn4 + xearn5 + xearn6 + xearnp1 + xearnp2 + xearnp3 + xearnp4 + xearnp5 + xearnp6", winners, missing="drop").fit()
print(model.summary())

As before, you should get a number in the "coef" column for "Intercept" -- that's the "a" in our equation -- and similarly coefficients on each of the other variables in our linear model.  You'll also get a lot of other information.  It's okay to ignore this for now since we haven't talked about what it all means, but take a look at them in light of the "rules of thumb" mentioned above.

## <font color="red">Interpretation time!</font>

<font color="red"> How should we interpret these numbers?  In particular, answer the following questions. (A sentence or two each is fine).</font>

   1. <font color="red">How would you interpret the number on "winner"?  Would you interpret this the same way as the number in our first regression, or differently?  Why or why not?  (In particular, is this number close to or far from the number that we got in the first regression?  Does that make sense? Do you draw the same conclusion about Friedman's permanent income hypothesis?)</font>
   2. <font color="red">How would you interpret the number on, say "education," or "male"?  Do you interpret those numbers the same way you interpret the number on "winner"?  Why or why not?</font>

**ANSWER:**

### Did we adjust for all the covariates?

When dealing with experimental data, randomization guarantees that we cut any link that confounding variables could create between our treatment and outcome of interest. When dealing with observational data, we do not have those guarantees. Hence, we need to make some _assumptions_ that get us to "as-if" randomization and give us some guarantees. In our regression case, we _assume_ that we can properly adjust for all the confounding variables that might come up in our problem at hand. 

Of course, it is natural to wonder how _plausible_ this assumption is. How do we really know whether we have sufficiently adjusted for the right covariates? Because we never observe the true treatment effect because of the fundamental problem of causal inference, we have no way of validating that we have sufficiently adjusted for the right covariates. Sometimes, however, we can get _close_ to testing our assumption when we have access to so-called pseudo-outcomes, which could be seen as proxies for the actual observed outcome. One natural choice for a pseudo-outcome is the outcome variable in a prior year, if available. The general idea behind this is that the treatment variable should have a zero effect on this pseudo-outcome, after adjusting for the right covariates (if treatment is not random). Why? If you use the outcome variable in a prior year for your pseudo-outcome, that variable was obviously measured before the treatment happened, which means, by default, the treatment should not affect it.
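To make the logic concrete, here is a small simulation sketch. Everything in it is made up for illustration: the variable names, the data-generating process, and the helper `ols_coefs`, which uses plain NumPy least squares instead of the statsmodels function used elsewhere in this notebook. Treatment depends on a confounder, and the pseudo-outcome is generated without any reference to treatment, so its true treatment effect is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Simulated data (all names and numbers are made up for illustration)
confounder = rng.normal(size=n)
# Treatment is more likely for units with a high confounder value
treated = (confounder + rng.normal(size=n) > 0).astype(float)
# Pseudo-outcome: depends on the confounder only, never on treatment
pre_outcome = 2.0 * confounder + rng.normal(size=n)

def ols_coefs(y, *regressors):
    """Least squares coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Without adjustment, the estimated "effect" on the pseudo-outcome is far from zero
unadjusted = ols_coefs(pre_outcome, treated)[1]

# After adjusting for the confounder, it is close to zero
adjusted = ols_coefs(pre_outcome, treated, confounder)[1]

print(unadjusted, adjusted)
```

A clearly non-zero coefficient on the pseudo-outcome is a warning sign that some confounding has not been adjusted for; a coefficient near zero is reassuring but does not prove that the adjustment is sufficient.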

In our data set, we do have access to multiple years of pre-treatment earnings. Thus, in this part, we will test our model. **We can pick one year of prior earnings (let's say earnings right before winning the lottery) as our pseudo outcome.** Let's run the same two regressions as we did with our actual observed outcome.

### <font color="red">Coding time!</font>

<font color="red"> Write your code for the regression "earnings prior to winning ~ winner" and print out the model summary.</font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> How does this regression output inform your decision on whether we can interpret the regression coefficient in our earlier regression causally? </font>

**ANSWER:**

### <font color="red">Coding time!</font>

<font color="red"> Redo the exercise, now conditioning on all the relevant covariates. </font>

In [None]:
#TODO: Your code here.

### <font color="red">Interpretation time!</font>

<font color="red"> How does this regression output inform your decision on whether we can interpret the regression coefficient in our earlier regression, where we included all covariates, causally? Is this different from your previous conclusion when we did not include the covariates? </font>

**ANSWER:**

### Part 2b: Policy Memo

Imagine you have been hired as a policy analyst for the State Department of Human Services, which is deciding whether to launch a Universal Basic Income (UBI) initiative.

Draft a policy memo that summarizes what the evidence shows and makes a clear, actionable recommendation for the agency’s leadership team. The commissioner and senior staff do not have the same technical background as you, so your memo should explain the findings and their implications in non-technical language.

Your memo should be rooted in your empirical findings and critical evaluation from earlier sections, but you are encouraged to conduct additional analyses as well or draw on other studies seen in class:

* **State your recommendation clearly.** Should the state adopt a UBI?*
*  **Summarize the key evidence.** Highlight the most policy-relevant results (e.g., estimated treatment effects, subgroup differences, cost per participant) in clear, non-technical terms and explain how the lottery design informs your conclusions. Here, you should include findings from earlier sections. You can also include results from any additional analyses you have conducted or discussion around findings from studies seen in class.
*  **Discuss policy design.** If recommending adoption, describe (some of) the program features you propose, e.g. amount, frequency of payments, eligibility rules, funding source, and administrative details. If discouraging adoption, suggest alternative policies the state might consider.
*  **Confront limitations.**  Use the framework from class to discuss what the evidence does and does not tell us (e.g., selection into the lottery, external validity, construct validity, spillover effects, ethical concerns).
*  **Propose next steps.** If you had more time, money, or data, what further research or pilot testing would most help the agency make a confident decision? 

_Deliverables:_
* A policy memo (max. 1,000 words), written for the Commissioner of Human Services and her senior staff, with 1-2 small exhibits maximum (a table or figure).
* A technical appendix (1-2 pages) with supporting tables and figures for readers who want more details.

_Suggested memo structure:_
1) Policy recommendation (2-3 sentences)
2) Evidence summary
3) Policy design/implementation plan
4) Cost-effectiveness (if applicable)
5) Limitations of the evidence
6) Next steps for research


*It is perfectly acceptable to recommend not adopting UBI. If you take this position, walk the policymakers through your reasoning and propose a clear plan for gathering better evidence. You can still use the points above and the suggested memo structure as guidance for your policy memo.

### <font color="blue">!! DON'T FORGET !!</font>

<font color="blue"> Hand in your policy memo in a pdf format on Gradescope as well as your completed Jupyter notebook with your team's names on it, e.g. "Anna_Amar_Mary_Guido_Assignment_2.ipynb" and "Anna_Amar_Mary_Guido_Policy_Memo_2.pdf".</font>

In [None]:
##OPTIONAL TODO (policy memo): Your code here.

## Wrap-up and take-aways

In this homework, we learned about people's possible responses to a universal basic income in terms of their labor market decisions as one specific example of how to combine economic models and data. Policymakers do not want to go in blind when issuing a new policy and, typically, rely on a combination of economic theory and data to evaluate the benefits of the new policy. In our analysis, we made use of a new tool we learned for estimating causal effects with observational data: regressions! We also learned a way of assessing whether we sufficiently adjusted for the right covariates.

After our universal basic income unit, you should be able to answer the following questions:
* What is a linear regression?
* How do we calculate the least squares estimator?
* How can we test its significance?
* What assumption do we need to make about the confounding variables to interpret the least squares estimator on our treatment variable causally?
* How can we assess the plausibility of this assumption?