# Welcome to Python!

This script accompanies the Python tutorial from our Week 1 review of Python. If you have never used Python before, it aims to serve as a guide that introduces you to the coding language and its basic functions.

## Part 00: Think like a Coder!

If you have never coded before, this course can come as quite a change, because learning to code also means learning a new way to structure your thinking. Computers are very powerful and can certainly do a lot of tasks much faster and more efficiently than humans (think about doing some complex matrix algebra!), but they cannot do that on their own (...yet). We use code to tell the computer, in the most precise words possible, _what_ we want it to do for us. That is why it is called a _coding language_. When we code, it is very important that we tell the computer every little thing, so it is prepared for any state the world could be in.

For example, think about people's attempts to automate driving. When your self-driving car enters an intersection, you need to tell it what to do, as it cannot think for itself (see earlier point). However, there are a lot of different scenarios that can occur: 1) there is a traffic light and it is green, 2) there is a traffic light and it is yellow, 3) there is a traffic light and it is red, 4) there is a traffic light that is yellow and a car is following you very closely, 5) there is no traffic light, but a stop sign, etc. If we don't want the car to freak out at the intersection (and potentially just brake and stop altogether), we need to tell it what to do in each of these scenarios. This means that you need to think through all the possible scenarios before you run your code and make sure the computer knows what to do in every one of them if you want it to succeed.

That is oftentimes easier said than done, and you will likely encounter a lot of error messages when you start coding. This is nothing to be ashamed of! Even after years of coding, I still get error messages like "You did not define _x_." when I know all too well that I need to initialize every variable in my code.

That being said, here are the two most valuable lessons I have learned about coding:

1. The computer is never wrong. If there is an error in your code, you messed up.
2. There is always a solution to your problem/ error message. With enough patience **and Googling/ChatGPT**, you will find the answer. 

## Part 0: Jupyter Notebooks

This tutorial and all of the homework assignments are written in so-called Jupyter notebooks. They provide a nice and structured way of interacting with your code while giving you the flexibility to add some narrative text (like this cell) to the code as well. Moreover, all the visualizations are in the same place too. In general, using Jupyter notebooks is a great place to start learning how to code. More advanced coders will oftentimes use so-called scripts, but maybe you'll learn (or have learned) about those in another class :) Before we dive into Python as a coding language, I want to touch upon a few basics you'll need for Jupyter notebooks.

### Part 0a: Cells

Jupyter notebooks consist of different **cells**. You can add cells by pressing the plus button at the top of this notebook, but for the purposes of this course, we have created cells for you wherever you have to code/write. Double-clicking on one of those cells lets you start typing in it. There are three different kinds of cells in Jupyter notebooks:

1. Code
2. Markdown
3. Raw

The "Code" cell is pretty self-explanatory. You will choose this format when you are writing your actual code. If you want to comment in this cell format, use the # before you start typing. When you want to write accompanying text, the "Markdown" cell is really nice (this is what we are using right now). The "Raw" cell is used to convert text to different outputs (e.g. HTML or Latex). We will not use this type of cell in any of our homeworks.

### Part 0b: How to run cells

As mentioned before, you can access "Markdown" and "Raw" cells by double-clicking on them and "Code" cells by simply clicking on them. In order to run them (i.e. make Python execute the code you wrote or convert the text you wrote into a nice Markdown cell), you can press **Shift+Enter** or **Ctrl+Enter** (or a Mac equivalent) or click the "Play" button (triangle pointing to the right) up on top. Once you have run a code cell, the bracket to the left of the cell will show a number. This can help you keep track of which cells you have already run and in which order. You can try it out below.

In [1]:
print("Hello world!")

Hello world!


### Part 0c: Saving the Jupyter Notebook

To save your Jupyter notebook (e.g. after you have completed your homework), simply hit the "Save" button up top. It looks like a floppy disk. Alternatively, the usual shortcut, **Ctrl+S** (or Mac equivalent), should work as well. You can also navigate to "File" up top and find "Save Notebook".

## Part 1: The Basics

In this part, we will go over the basics of Python and introduce the basic structure behind the coding language. I guess you can think of this as learning the alphabet - we will give you the necessary tools here that you will need later on to perform operations (such as adding up two numbers for example). 

### Part 1a: Scalars and Numeric Lists

We will start off with the most basic object in Python, which is a scalar number. The = operator assigns a value to an object. (Note: it does *not* check whether the variable has the value mentioned after the equality sign. We typically use two equality signs for that, but more on that later.)

Remember that you can run the cell by clicking Shift + Enter or Ctrl + Enter. 

In [2]:
x = 2

Python now knows that whenever you type "x" in your code, it will act as if "x" is 2. If you forget what value you assigned to the variable called "x", you can always ask Python to print out the value using the "print(_)" function as follows:

In [3]:
print(x)

2


Alternatively, we can also just type the variable alone into the cell and it will return the value as well:

In [4]:
x

2

Now you can use this scalar variable to do various tasks with it, e.g. figure out what x+3 is:

In [5]:
x+3

5

_Note_: See how I didn't have to tell Python to print out the value this time around? Generally, if you perform an operation and do **not** assign the result to a variable, Python will print out the result for you. If I assign the same operation a name instead (see below), it does not print out the result.

Moreover, once the result has a name, here _y_, we can access it later in the tutorial without typing in the same operation (_x+3_) again.

In [6]:
y = x+3

You can also alter existing variables by assigning them a new value. Let's say we actually want "x" to be 1 number higher. We can do that as follows:

In [7]:
x = x+1

Now let's print out what "x" is:

In [8]:
print(x)

3


### <font color="red">Coding time!</font>

<font color="red"> I want to assign the variable _l_ the value of 129874. Now I am interested in what _l_ + 4256 is. Write code below that will tell you the answer.</font>

In [10]:
#TODO: Your code here.
l = 129874
w = l+4256
w

134130

We can combine multiple scalars into a numeric list using square brackets.

In [11]:
a = [1,5,16,35,x/(5-2)]

In [12]:
a

[1, 5, 16, 35, 1.0]

We can access each element of the list using the list's name, in this case "a", and the position number in square brackets. **Note:** Python starts counting the positions at position 0.

In [13]:
a[0]

1

In [14]:
a[3]

35

We can perform a lot of other operations on lists too. Here are some examples:

In [15]:
# Get the number of elements in a list: length
a_length = len(a) 
print(a_length)

5


In [16]:
# Get the sum of all the list's elements: sum
a_sum = sum(a)
print(a_sum)

58.0


In [17]:
# Get the maximum value in the list: max (similarly with min, just replace max by min)
a_max = max(a)
print(a_max)

35


In [18]:
# Get the mean value of the list: mean
a_mean = sum(a)/len(a)
print(a_mean)

11.6


_Aside:_ Python distinguishes between numerical variables of the type "integer" (whole numbers without decimals) and "float" (numbers with decimals). We can easily convert between the types if needed by using the function "int()" to convert a number to an integer and "float()" to convert it to a float.
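
Here is a minimal sketch of these conversions. The variable names and values are made up for illustration and are not used elsewhere in the tutorial. One thing to watch for: "int()" cuts off the decimals instead of rounding.

```python
# Hypothetical values used only to illustrate type conversion
whole = 7        # an integer
decimal = 6.9    # a float

as_float = float(whole)   # 7 becomes 7.0
as_int = int(decimal)     # 6.9 becomes 6: int() drops the decimals, it does not round

print(type(whole), type(decimal))
print(as_float, as_int)
print(round(decimal))     # round() returns the nearest integer instead
```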

### <font color="red">Coding time!</font>

<font color="red"> I randomly ask 5 of my female friends to tell me their shoe size. They report the following shoe size: 7, 6.5, 8, 9, 7. What is the average shoe size amongst my friends? What is the maximum? Create new variables for both of those metrics.</font>

In [19]:
#TODO: Your code here.
b = [7,6.5,8,9,7]
b_mean = sum(b)/len(b) # average shoe size
print(b_mean)
b_max = max(b) # largest shoe size
print(b_max)

7.5
9


### Part 1b: Logical Lists

Sometimes all we care about knowing is whether a person has a certain characteristic or not, e.g. is this person male or female? These objects are what we refer to as "logical" objects in coding. They are typically coded up in one of two ways: 1) as a binary variable (i.e. 0 or 1, where 1 means that the person has the characteristic) or 2) as a logical variable (i.e. False or True, where True means that the person has the characteristic). Logical lists are particularly important when we want to restrict our analyses to particular subsets of the data.

In our example below, we are defining a logical variable by checking whether our variable _x_ that we defined earlier is bigger than 5. Note that we use the following operators to compare (do not copy the colon):

- $\gt$: greater than
- $\gt =$: greater than or equal to
- $\lt$: less than
- $\lt =$: less than or equal to
- $==$: equal (_Note:_ we **cannot** use = since that is the operator that we use to define variables!)
- $!=$: not equal

In [21]:
x_bigger_five = x>5

print(x_bigger_five)

False


These logical variables can be a list as well just like the numerical ones. Note that the logical variables will become important in Part 1d when we talk about if statements.
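
As a small sketch of what such a logical list can look like, the example below compares each entry of a made-up list of shoe sizes to 7. The names "sizes" and "above_seven" are illustrative only.

```python
# Hypothetical example: is each of five shoe sizes above 7?
sizes = [7, 6.5, 8, 9, 7]
above_seven = [sizes[0] > 7, sizes[1] > 7, sizes[2] > 7, sizes[3] > 7, sizes[4] > 7]
print(above_seven)

# True counts as 1 and False as 0, so sum() counts the True entries
print(sum(above_seven))
```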

### <font color="red">Coding time!</font>

<font color="red"> This exercise builds on the previous one about shoe sizes. My own shoe size is a 9. I am wondering if my shoe size is above average. Write code that returns "True" or "False", where "True" means my shoe size is above average for my friend group.</font>

In [22]:
#TODO: Write code here.
shoe_above_average = 9>b_mean
print(shoe_above_average)

True


### Part 1c: Character/Strings lists

Sometimes our data may contain words rather than numbers. We will generally refer to that type of data as "characters" (if it is only a single letter) or "strings" (if it is words). This can for example happen if we code up a person's country of birth or we think about having data available for different countries and one column called "country". Generally, we generate string objects the same way we do with numerical objects. We use the = operator to assign values, except we now put the string in **quotation marks (either " or ')**. We can print out the objects like usual. Note that now the object that is printed out is in quotation marks as well. 

In [23]:
state_abbreviation = "CA"

In [24]:
state_abbreviation

'CA'

If you are ever unsure about the type of object you are dealing with (numeric, integer, string, etc.), you can use the "type(_)" function to print out an object's type as follows:

In [25]:
type(state_abbreviation)

str

As with the last set of objects introduced, we can create lists with this type of object as well. Again, we use the square brackets to indicate the start and end of the list:

In [26]:
state_names = ["AL","AK","AZ","AR","AS",state_abbreviation,"CO"]

In [27]:
state_names

['AL', 'AK', 'AZ', 'AR', 'AS', 'CA', 'CO']

### Part 1d: if Statements

The nice thing about a coding language is that we can ask it to do multiple comparisons, calculations, etc. all in one go and in one platform. One slightly more advanced operation you will frequently encounter in code is the _if statement_. If statements are of the form: _if_ something is true, _then_ do this. "Do this" here can mean multiple things, oftentimes a manipulation of a variable. We are going to start with an easy statement first to give you an idea. Let's ask Python to let us know whether a variable is bigger than 10.

In [28]:
s = 5
if s > 10:
    print("The number is bigger than 10.")
else:
    print("The number is not bigger than 10.")

The number is not bigger than 10.


Now there are a few things going on here. First, we initialized a new variable "s" and assigned it a value. Then we asked Python to check if s is larger than 10 and print out an answer. Note the specific structure of how exactly we coded up the if statement: in general it will have the following structure:

if _condition_:<br>
&emsp;_action 1_<br>
else:<br>
&emsp;_action 2_<br>

This means we start with a line containing the if statement and the condition it needs to fulfill, and we end this line with a colon. In the line underneath it, we code the action. Note the indent? The indent is very important for Python to be able to execute the action. Then we ended our _if statement_ with an _else_ statement, where we put a colon behind the "else" and, in a new line, again with an indent, coded action 2. Remember how I was talking in the introduction about making sure you tell the computer what to do in every possible scenario? This is exactly an application of this! If we didn't have the else statement, the computer would just do nothing when the condition is false (which is of course okay if that's what you want, but now imagine your self-driving car at the intersection after having encountered a scenario where we didn't program what it should do in the "else"-state of the world!).
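
When there are more than two scenarios, Python's "elif" (short for "else if") lets you check further conditions in order before falling back to "else". The sketch below applies this to the traffic-light example from the introduction; the variable names and the actions are made up for illustration.

```python
light = "yellow"  # hypothetical state of the traffic light

if light == "green":
    action = "go"
elif light == "yellow":
    action = "slow down"
elif light == "red":
    action = "stop"
else:
    # covers every state we did not list explicitly, e.g. a broken light
    action = "stop and proceed with caution"

print(action)
```

Python checks the conditions from top to bottom and runs only the first branch whose condition is true.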

Now your _if statements_ can be as easy or as complex as you want (/need) them to be. Here are a few things you can do for the condition:

- Check if a number is equal to (==), greater than (>), greater than or equal to (>=), smaller than (<) or smaller than or equal to (<=) another number or variable.
- Check if a list has a desired length (e.g. len(a) == 5).
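- Check more than one condition at once: we can combine conditions with the keywords "and" and "or". For example, check whether a number is greater than 5 and below 10 (s > 5 and s < 10). (The symbols & and | are bitwise operators in Python; they are evaluated before the comparisons, so s > 5 & s < 10 does not check the two conditions separately.)

In [29]:
if s > 5 and s < 10: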
    print("The number is bigger than 5 and less than 10.")
else:
    print("The number is 5 or less or 10 or higher.")

The number is 5 or less or 10 or higher.
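
The following sketch shows a few ways of combining conditions and one pitfall. The value of "s" and the variable names are chosen only for illustration. Python's keywords for combining conditions are "and" and "or"; the symbols & and | are bitwise operators that are evaluated before comparisons, so they can give surprising answers when used without parentheses.

```python
s = 12  # hypothetical value, chosen to expose the pitfall below

# The keywords "and" / "or" combine two complete comparisons
with_and = s > 5 and s < 10      # False: 12 is not below 10
with_or = s < 5 or s > 10        # True: 12 is above 10

# Comparisons can also be chained, much like in math notation
chained = 5 < s < 10             # False

# Pitfall: & is evaluated before > and <, so this is read as s > (5 & s) < 10
with_ampersand = s > 5 & s < 10  # True, although 12 is not between 5 and 10

print(with_and, with_or, chained, with_ampersand)
```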


### <font color="red">Coding time!</font>

<font color="red"> Recall the last exercise where I asked you to print out a binary variable about whether my shoe size is above average. Now I'd like you to redo this exercise using _if statements_. If the statement is true, the statement should say "Lea's shoe size is above average." If it is not true, it should say "Lea's shoe size is not above average.".</font>

In [None]:
#TODO: Your code here.

### Part 1e: Loading packages

Python is an _open source_ programming language, which means that it is free to use and a lot of people have written their own functions (often bundled in what we call packages) that you can use for your analysis! **Note:** While it is nice to have so many packages at your disposal, I'd always be a bit wary of them because you don't entirely know what is going on under the hood. In general, if something is easy enough to code up yourself, doing so gives you the highest guarantee that the function actually does what you want it to do.

Sometimes, packages are necessary for you to perform operations (like arithmetic operations) or to create nice ways to store data, e.g. in data frames. Many Python installations already come with some of these packages, so you will only need to load them. The next cell shows how to load a package called "numpy", which is what we will use to perform arithmetic operations on the list objects introduced earlier. It will convert the lists to arrays that allow us to add, subtract, multiply, divide, etc.

If the following cell gives you an error because it cannot find a package called numpy, read on and you will find out how to install the package before you can import it.

In [30]:
import numpy as np

It is technically enough to simply write "import numpy". We add the "as np" part, so in the future, when we are referring to a function from the numpy package, we can write "np._function_" instead of having to write out numpy.

If the package you want is not already downloaded into the interface you are using, you can run the following cell to install it.

In [None]:
pip install numpy

Sometimes you may need to exit out and open up the notebook again to proceed to the previous step and import it. 

In general, every time you open up your Jupyter notebook you will need to import the packages again and tell Python that you want to use them. Similarly, when you start a new Jupyter notebook, you will need to import all the packages you need to use in that session. Hence, my Jupyter notebooks **always** start off with a cell called "Loading packages".

### Part 1f: Arithmetic Operations

While there are some things we can do with lists, when it comes to doing more sophisticated arithmetic operations, we typically work with what we call arrays. We need the numpy package to convert our lists into arrays. Once we have done that, there is again a range of things that we can do, a few of which are highlighted below:

In [31]:
import numpy as np # importing numpy - this line is redundant if you already ran the line earlier

x = 2 # this line is redundant if you already ran the line earlier
x = x+1 # this line is redundant if you already ran the line earlier
a = [1,5,16,35,x/(5-2)] # this line is redundant if you already ran the line earlier

a_as_array = np.array(a) # converting our list into an array

In [None]:
# Add a number to each element of the array

a_as_array+5

In [None]:
# Multiply each element in the array by a constant

a_as_array*3

In [32]:
# Take the mean of the array values
# Note that we did this before manually, but using the numpy package, we can simply use the np.mean(_) function

np.mean(a_as_array)

np.float64(11.6)

We can also add two arrays together or multiply their respective elements!

In [33]:
b_as_array = np.array([2,10,54,3,5])

In [34]:
# Add the arrays: make sure they have the same length!
a_as_array+b_as_array

array([ 3., 15., 70., 38.,  6.])

In [None]:
# Multiply the arrays: make sure they have the same length!
a_as_array*b_as_array
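
To see why the conversion to an array matters, the sketch below applies the same operators to a plain list and to an array. The values and names are made up for illustration.

```python
import numpy as np

# Hypothetical values, chosen only for illustration
values_list = [1, 5, 16]
values_array = np.array(values_list)

# With a plain list, * repeats the list and + glues two lists together
print(values_list * 2)        # [1, 5, 16, 1, 5, 16]
print(values_list + [2, 10])  # [1, 5, 16, 2, 10]

# With an array, the same operators work element by element
doubled = values_array * 2
summed = values_array + np.array([2, 10, 54])
print(doubled)
print(summed)
```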

### <font color="red">Coding time!</font>

<font color="red"> Redo the average shoe size computation from earlier using numpy arrays and the built-in function.</font>

In [None]:
#TODO: Your code here.

## Part 2: The Basics for This Course

### Part 2a: Pandas Dataframes

We are going to work a lot with Pandas dataframes as a format for our data. In this subsection, you'll get an introduction to what Pandas dataframes are and what we can do with them. The homeworks themselves will provide some more details, but I wanted to give you an idea of what to expect. In general, data frames are like matrices where we think of a row representing an observation and a column representing a specific characteristic of this observation (e.g. a person's earnings). 

We will go through an example using dummy data. **Your data needs to be saved in the same folder as your Jupyter notebook!** We open the csv file with the open(\_) function, read it with the csv package, and then use the pandas package to store it. Remember that we add the "as pd" when we import the pandas package, so we can refer to it as "pd" later, e.g. when we turn the loaded data file into a pandas dataframe using pd.DataFrame(\_).

**Note:** If you installed Python/Jupyter notebooks using Anaconda, you should already have the pandas package. If you installed using pip, you may or may not have it, but if you don't, recall from section 1e that you can get it with "pip install pandas" on the command line.

In [None]:
import pandas as pd # import the pandas package
import csv # import the csv package to load csv files

# Load the csv file
datafile = open('dummy_data.csv', 'r', encoding='utf-8-sig', newline='')
dataDict = csv.DictReader(datafile, delimiter=",")

# Turn this into a pandas dataframe
dataPd = pd.DataFrame(dataDict)

# Print out the pandas dataframe
dataPd

We have 7 rows in total and 4 variables, so our dataframe is 7x4. The leftmost (bold) column with no name indicates the row number. Note here again how Python starts counting at 0.
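
As an alternative sketch, pandas can also read a csv file directly with pd.read_csv(\_), which additionally tries to recognize numeric columns. The column names and values below are hypothetical stand-ins (only "Age" and "Gender" appear in this tutorial), and io.StringIO imitates a file so that the example runs without dummy_data.csv.

```python
import io
import pandas as pd

# Hypothetical csv content standing in for a file on disk
csv_text = "Name,Age,Gender\nAna,10,F\nBen,12,M\nCleo,11,F\n"

# pd.read_csv accepts a file path or a file-like object
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df)
print(example_df.dtypes)  # Age is read as a number, not as text
```

With a real file, the call would look like pd.read_csv('dummy_data.csv', encoding='utf-8-sig'), mirroring the encoding used above.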

Now that we have loaded our data set, we often want to start applying arithmetic operations to it. Typically, we want to access specific **rows** or **columns** instead of the entire data matrix. Accessing those is fairly straightforward. Generally, there are two different ways of accessing columns: 1) numerically or 2) by variable name.

Let us start off with the rows. Here, we use the .iloc feature to access a row numerically. Since we did not name the rows (other than the counts), there is no way to access them by variable. We use a colon for the second dimension (the columns) to indicate that we want to get the value for **every** variable for a specific row.

In [None]:
dataPd.iloc[0,:]

**Note:** The .loc feature works as well to access a particular row.

In [None]:
dataPd.loc[0,:]
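
The two features differ in what the row entry means: .iloc selects by position, while .loc selects by the row label. In dataPd the labels happen to equal the positions, which is why both cells return the same row. The sketch below uses a small made-up data frame whose row labels differ from the positions to show the difference.

```python
import pandas as pd

# Hypothetical data; the row labels 10, 20, 30 differ from the positions 0, 1, 2
people = pd.DataFrame({"Age": [10, 12, 11]}, index=[10, 20, 30])

print(people.iloc[0, :])  # first row, selected by position
print(people.loc[10, :])  # row with the label 10 (the same row here)
```

In this example, people.loc[0, :] would raise a KeyError because no row carries the label 0.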

Next, we will focus on accessing whole columns. Again, we start with how we can access one numerically. We can use the .iloc feature of Pandas again, but now leaving the first entry (which refers to the rows) with a colon, indicating we want to grab the whole column, and using the second entry to indicate which column, in this case column 0, we want to access.

In [None]:
dataPd.iloc[:,0]

Similarly, we can access the same column by referring to its name, where the name is in quotation marks (either ' or ").

In [None]:
dataPd['Age']

Now, what if we want to access a particular row **and** variable? We can use the techniques described above and simply replace the colon by the specific row/column/variable name. The following three cells demonstrate one purely numerical approach and two approaches that use the variable name for the column.

In [None]:
dataPd.iloc[0,1]

In [None]:
dataPd.loc[0,'Age']

In [None]:
dataPd['Age'][0]

**Note:** If we want to access multiple rows or columns, we can specify the range with a colon. For example, let's say we want to access the first three rows of the data set.

In [None]:
dataPd.iloc[0:3,:]

Next, we demonstrate how to grab the first three columns.

In [None]:
dataPd.iloc[:,0:3]

### Part 2b: Mean, Standard Deviation, Histograms

In this next part, I will cover the topics that you reviewed in the first week's statistics review lecture and show you how to code them up in Python for the pandas data frame.

I will focus on the _Age_ variable. Note that we need a numeric variable to perform arithmetic operations.  

In [None]:
# Transform the column into numeric if this hasn't been done right after loading the data file
dataPd['Age'] = pd.to_numeric(dataPd['Age'], errors='coerce')

# Calculate the mean using np.mean()
np.mean(dataPd['Age'])

Note that we can also decide to only use the data from a subgroup of people for the mean calculations.

**Note:** When you subset data in Python using square brackets, e.g. [0] to get the 0th entry or [0:3] to get the first three entries, the range includes the start position but excludes the end position. To access several consecutive elements, you therefore write [start:(start+number of elements you want)].
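
A short sketch of this rule on a plain list; the list "ages" is made up for illustration.

```python
# Hypothetical list used only to illustrate slicing
ages = [10, 12, 11, 13, 9]

first_three = ages[0:3]  # positions 0, 1, 2 (position 3 is excluded)
middle_two = ages[1:3]   # positions 1 and 2

print(first_three)
print(middle_two)
print(len(ages[0:3]))    # the number of elements equals stop minus start
```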

In [None]:
# Mean for the first 3 people
np.mean(dataPd['Age'][0:3])

In [None]:
# Mean for all girls
ind_f = np.where(dataPd['Gender']=='F') # Create a new variable with all the row numbers of the girls
np.mean(dataPd['Age'][ind_f[0]]) # Calculate their mean
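
Another common way to pick out a subgroup, sketched here under the assumption of a small made-up data frame standing in for dataPd, is to use the logical column itself as a filter instead of converting it to row numbers first.

```python
import pandas as pd

# Hypothetical data standing in for dataPd
kids = pd.DataFrame({"Age": [10, 12, 11, 13], "Gender": ["F", "M", "F", "M"]})

is_girl = kids["Gender"] == "F"               # logical column: True for girls
girls_mean = kids.loc[is_girl, "Age"].mean()  # keep only the rows where is_girl is True
print(girls_mean)
```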

Similarly, we can do the calculations for the standard deviation. We use the function "np.std(\_)" for that.

In [None]:
# Calculate the standard deviation using np.std()
np.std(dataPd['Age'])

In [None]:
# Std for the first 3 people
np.std(dataPd['Age'][0:3])

In [None]:
# Std for all girls
np.std(dataPd['Age'][ind_f[0]]) # Use the same indicator from before
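
One detail worth knowing: by default, np.std(\_) divides by the number of observations n (the population formula). If you need the sample formula that divides by n-1, you can pass ddof=1. The sketch below uses a made-up array to show both versions.

```python
import numpy as np

# Hypothetical ages, used only for illustration
ages = np.array([10, 12, 11, 13])

population_std = np.std(ages)      # divides by n (the default, ddof=0)
sample_std = np.std(ages, ddof=1)  # divides by n - 1

print(population_std, sample_std)
```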

Lastly, we plot some histograms. Again, we will do it for the full population and then for the subpopulations.

In [None]:
# Import the matplotlib library
from matplotlib import pyplot as plt

# Plot the histogram
print("")
plt.hist([dataPd['Age']], label=["Age"])
plt.legend(loc='upper right')
plt.show()

In [None]:
# Plot the histogram split by gender
ind_m = np.where(dataPd['Gender']=='M') # Create a new variable with all the row numbers of the boys


print("")
plt.hist([dataPd['Age'][ind_f[0]], dataPd['Age'][ind_m[0]]], label=["Age Girls", "Age Boys"])
plt.legend(loc='upper right')
plt.show()

### Part 2c: Loops and List Comprehensions

In this course, we will work quite a bit with specific subsets of the data, e.g. we want to compute the average age of all the girls in the data set separately. Generally speaking, there are two ways of doing so: 1) loops and 2) list comprehensions. 

The for-loop is pretty self-explanatory. We loop through every data row that is in our data frame and first check whether the person in this row is a boy or a girl. If the person is a boy, we don't do anything. If the person is a girl, however, we add the corresponding age to a final sum. We also have a count variable that adds 1 if the person is a girl. Lastly, to get the average, we divide our sum variable of ages by the number of girls in the data set. 

The list comprehension uses the same logic; it is just a slightly cleaner way of writing it. You will notice a lot of the same elements that you can see in the more detailed, explicit for-loop.

In [None]:
## Compute the average age among girls.

##########
#
# Here's the way to do it with basic Python and a for-loop:
#
##########

sum_age = 0 # create a variable to keep track of the sum of all the ages of the girls in the data set
num_girls = 0 # create a variable to keep track of the number of girls in the data set

num_rows = dataPd.shape[0] # obtain the number of rows -> if you change 0 to 1, you get the number of columns
for x in range(num_rows): # range(num_rows) indicates that we are looping through the values 0,...,num_rows-1 for x
    if dataPd.loc[x,'Gender'] == 'F': # if person is a girl, count her age towards the sum
        sum_age += dataPd.loc[x,'Age'] # the "+=" notation means that we add that amount
        num_girls += 1
print("Avg. Age among Girls:", sum_age/num_girls)

##########
#
#   Here's another way to do the same computation with a slightly fancier
#   tool, a list comprehension.  This version puts an "if"
#   statement at the end of the line to pick out the gender.
#   Slick!
#
######

print("AGAIN WITH A LIST COMPREHENSION")
age_girls = [dataPd.loc[x,'Age'] for x in range(num_rows) if dataPd.loc[x,'Gender'] == 'F']

print("Avg. Age among Girls:", np.mean(age_girls))
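
For completeness, pandas also has a built-in way to compute a statistic separately for each subgroup, the groupby(\_) method. The sketch below uses a small made-up data frame standing in for dataPd; it is an alternative to the loop and the list comprehension, not a replacement for understanding them.

```python
import pandas as pd

# Hypothetical data standing in for dataPd
kids = pd.DataFrame({"Age": [10, 12, 11, 13], "Gender": ["F", "M", "F", "M"]})

# Split the rows by Gender, then take the mean Age within each group
mean_age_by_gender = kids.groupby("Gender")["Age"].mean()

print(mean_age_by_gender)
print("Avg. Age among Girls:", mean_age_by_gender["F"])
```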

# References

_Brown, M_, R for Applied Economics: A Beginner’s Guide (2023)

_Dicken, B_, PyFlo - A Free, Interactive Guide to Python Programming (2023): https://pyflo.net/