Passion Driven Statistics: Downloading Data Files and Importing Data in Python
Last Friday, we started importing the Add Health data and selecting the two features of the world (variables) that students would study. There were some hurdles.
Many students have written their first Python program, which imports a comma-separated value (CSV) Add Health file into a pandas DataFrame and then creates a subset of the Add Health data containing their two chosen variables. We didn't make much progress in producing frequency tables. Here are some lessons:
Review, I repeat, review folders and file paths with students before you start any Python coding, because Python has to be told where the downloaded file lives. One day, this problem will be solved, but we are in transition and not there yet. As my colleague puts it, spending 15 minutes on this passes the benefit-cost test.
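One way to make the folder review concrete is to have Python report where it is looking before any data is loaded. The snippet below is only a sketch, not part of the PDS materials: the helper name locate_data_file is hypothetical, and it assumes the CSV is expected in the current working directory unless a folder is given.

```python
from pathlib import Path


def locate_data_file(filename, folder=None):
    """Return the full path to filename, or raise an error listing what is there.

    Hypothetical helper for illustration; folder defaults to the current
    working directory, which is where a bare file name is looked up.
    """
    folder = Path(folder) if folder is not None else Path.cwd()
    candidate = folder / filename
    if not candidate.is_file():
        # List the CSV files that are present so the mismatch is easy to spot
        csv_files = sorted(p.name for p in folder.glob("*.csv"))
        raise FileNotFoundError(
            f"{filename} not found in {folder}. CSV files here: {csv_files}"
        )
    return candidate


# Show the folder Python is currently working in
print("Python is looking in:", Path.cwd())
```

A student could call locate_data_file('addhealth_pds.csv') before pd.read_csv. If the file landed in a Downloads folder instead, the error message shows the folder Python searched and the CSV files it found there.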
Also, make sure you understand how students download the Add Health file from your LMS. We use Moodle, and the download setting you select for a file can cause significant problems. You will want to use the force download setting in the Moodle file activity's appearance attribute.
Consider creating a Google sign-up sheet for one-on-one help during office hours. Seeing a student smile when the code works is priceless.
Please keep it simple. You may be tempted to introduce some efficiency tricks. Resist and make it as simple as possible. PDS also recommends some data management tasks. Data management is essential, but given this week's experience, I fear it may lead us down a rabbit hole. We haven't yet made it to simple frequency tables, but that's ok.
#Import the Pandas library and give the library a shorter name to ease typing (pd)
import pandas as pd
#Load the full data set into a Pandas DataFrame. I have called the DataFrame "df" in this example. Choose what works best for you.
df = pd.read_csv('addhealth_pds.csv',low_memory=False)
#Create a subset of data in a second Pandas DataFrame that only contains the two variables. I called the new DataFrame "my_df"
my_df = df[['H1PF2','H1EE1']]
print("my project variables")
print(my_df.head())
Next week, we hope to move to frequencies for each variable. Then, we will explore ways to look at the two variables together. A snow event this past week has slowed us down.
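As a preview of where we are headed, here is a minimal sketch of a frequency table for one variable and a cross-tabulation of the two. It is not the PDS-recommended code. The small my_df below is made-up stand-in data so the example runs on its own; the values are not Add Health responses.

```python
import pandas as pd

# Made-up stand-in for my_df; these values are illustrative only
my_df = pd.DataFrame({
    "H1PF2": [1, 2, 2, 3, 2, 1],
    "H1EE1": [5, 4, 4, 3, 5, 4],
})

# Counts for one variable, ordered by response code.
# dropna=False keeps missing values in the table instead of hiding them.
counts = my_df["H1PF2"].value_counts(dropna=False).sort_index()
print("counts for H1PF2")
print(counts)

# The same table as proportions of all rows
percents = my_df["H1PF2"].value_counts(normalize=True, dropna=False).sort_index()
print("proportions for H1PF2")
print(percents)

# The two variables together: rows are H1PF2 codes, columns are H1EE1 codes
table = pd.crosstab(my_df["H1PF2"], my_df["H1EE1"])
print("H1PF2 by H1EE1")
print(table)
```

With the real data, the stand-in DataFrame would be dropped and the my_df created from the Add Health file used instead.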


