Data Management and Visualization: Data Management

Code: 

# Import data set libraries
import pandas
import numpy

# Import my data set
data = pandas.read_csv('addhealth_pds.csv', low_memory=False)

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x:'%f'%x)

# convert to numeric; convert_objects was removed from pandas, so use to_numeric
data['H1SE2'] = pandas.to_numeric(data['H1SE2'], errors='coerce')


print("counts of plan ahead for birth control use H1SE2 1=very sure")
c5 = data['H1SE2'].value_counts(sort=False)
print (c5)

print("percentage plan ahead for birth control use H1SE2 1=very sure")
p5 = data['H1SE2'].value_counts(sort=False, normalize=True)
print (p5)

# recode missing values to python missing (NaN)
data['H1SE2']=data['H1SE2'].replace(96, numpy.nan)
data['H1SE2']=data['H1SE2'].replace(97, numpy.nan)
data['H1SE2']=data['H1SE2'].replace(99, numpy.nan)

print("counts of plan ahead for birth control use - missing nan")
p5 = data['H1SE2'].value_counts(sort=False, dropna=False)
print (p5)

 


print("counts of resist sex if partner doesnt want bc H1SE3 1=very sure")
c6 = data["H1SE3"].value_counts(sort=False)
print (c6)

print("percentage resist sex if partner doesnt want bc H1SE3 1=very sure")
p6 = data["H1SE3"].value_counts(sort=False, normalize=True)
print (p6)

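# recode missing values to python missing (NaN)
data['H1SE3']=data['H1SE3'].replace([96, 97, 99], numpy.nan)

print("counts of resist sex if partner doesnt want bc - missing nan")
p6 = data['H1SE3'].value_counts(sort=False, dropna=False)
print (p6)

# recode missing values to python missing (NaN) for self-perceived intelligence
data['H1SE4']=data['H1SE4'].replace([96, 98], numpy.nan)

print("counts of self-perceived intelligence H1SE4 - missing nan")
p7 = data['H1SE4'].value_counts(sort=False, dropna=False)
print (p7)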
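
The recode steps above repeat the same pattern for each variable, so they could be folded into one loop. This is only a minimal sketch: the `toy` frame and its values are made up for illustration (they are not Add Health data), and the `missing_codes` mapping is a hypothetical helper built from the codes described in this post.

```python
import numpy
import pandas

# Made-up toy rows; the blank string mimics an unparsed cell in a raw CSV
toy = pandas.DataFrame({
    'H1SE2': ['1', '96', '97', ' ', '2'],
    'H1SE3': ['99', '3', '97', '4', '1'],
})

# Hypothetical mapping of each variable to its reserved (non-answer) codes
missing_codes = {'H1SE2': [96, 97, 99], 'H1SE3': [96, 97, 99]}

for column, codes in missing_codes.items():
    # Coerce to numbers first; anything unparseable becomes NaN
    toy[column] = pandas.to_numeric(toy[column], errors='coerce')
    # Then recode the reserved codes to NaN in a single call
    toy[column] = toy[column].replace(codes, numpy.nan)

# Number of NaN values per variable after recoding
nan_counts = toy.isna().sum()
print(nan_counts)
```

Keeping the codes in one mapping makes it harder to recode one variable and then print another by mistake.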

This week I focused on coding out missing data.

I faced some roadblocks trying to do some of the more complicated data management code snippets, so I'd like to revisit this lesson and try again. 

The nan values of each variable: 

Name: H1SE2, counts of plan ahead for birth control use
nan          2060

Name: H1SE3, counts of resist sex if partner doesn't want birth control
nan          2055

Name: H1SE4, counts of self-perceived intelligence
nan             5

 

The first two variables ask specific questions about the respondents' safe sex practices, or what they think they would be able to do. Both had about the same number of non-answers: refused (96), legitimate skip (97), or not applicable (99). 

What's interesting is that the third variable asks subjects to rate their general intelligence compared to their peers, and only 5 people opted out of answering (refusal [96] or don't know [98]). 

Conclusion: Teens are much more willing to rate their overall intelligence, but when it comes to specific topical questions, they are less likely to give an answer. 
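
One way to put this comparison on a common scale is the share of missing responses per variable rather than the raw NaN counts. A minimal sketch, assuming the columns have already been recoded to NaN; the `toy` values are made up for illustration and are not the survey results.

```python
import numpy
import pandas

# Made-up toy data, already recoded so that every non-answer is NaN
toy = pandas.DataFrame({
    'H1SE2': [1, numpy.nan, numpy.nan, 2],
    'H1SE4': [3, 4, numpy.nan, 5],
})

# isna() marks missing cells; the column mean is the fraction missing
missing_share = toy.isna().mean()
print(missing_share)
```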

Note: the last variable did not include the Legitimate Skip answer response. I am not quite sure what qualifies as a legitimate skip for these answers (I assume it's under 15 yrs old). I might need to re-do this comparison, eliminating those legitimate skip responses from the first two variables.
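
A possible way to redo the comparison is to count each reserved code separately before recoding, then drop the legitimate skips and look only at who was actually asked. This is a sketch under assumptions: the `toy` values are made up, the `reserved` labels are taken from the descriptions in this post, and `reserved_breakdown` is a hypothetical helper.

```python
import numpy
import pandas

# Made-up toy column standing in for H1SE2; not real survey values
toy = pandas.DataFrame({'H1SE2': [1, 2, 97, 97, 96, 3, 99, 97]})

# Labels for the reserved codes, as described in the write-up
reserved = {96: 'refused', 97: 'legitimate skip',
            98: "don't know", 99: 'not applicable'}

def reserved_breakdown(series):
    """Count how often each reserved code appears in one column."""
    hits = series[series.isin(list(reserved))]
    return hits.map(reserved).value_counts()

breakdown_se2 = reserved_breakdown(toy['H1SE2'])
print(breakdown_se2)

# Drop legitimate skips, then treat the remaining reserved codes as NaN
asked = toy[toy['H1SE2'] != 97].copy()
asked['H1SE2'] = asked['H1SE2'].replace([96, 98, 99], numpy.nan)
n_missing_asked = int(asked['H1SE2'].isna().sum())
print(n_missing_asked)
```

If most of the NaN values in the first two variables turn out to be legitimate skips, the gap with H1SE4 would say more about who was asked the question than about willingness to answer.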