Assessing SAT Scores for New York Schools

Exploratory data analysis of SAT scores across high schools

The SAT (Scholastic Aptitude Test) is a standardized test widely used for college admissions in the United States. The test is divided into 3 sections, each of which has a total score of 800. High schools are often ranked by their average SAT scores, and high SAT scores are considered a sign of how good a school district is.

This project utilizes different datasets taken from the NYC OpenData initiative. The main idea here was to combine data from multiple associated data sources and perform exploratory data analysis on it to understand:

  1. Are schools differentiated by SAT scores?
  2. The schools which are doing best and the ones which are doing worse, what are the key differentiating factors between these schools?
  3. Do demographics really play a role in SAT scores? If yes, then how and which demographic categories are doing better than others

Data pull and cleaning

For purpose of this analysis, we will be taking multiple open-source data sets available, perform some cleaning, and consolidate them into one dataset having one row per school, which would then allow us to do the rest of Exploratory Data Analysis pretty easily. As I would be using the NYC OpenData APIs to pull in the data, I have added the links to APIs for downloading the data directly into pandas dataframes.

Before we start with the datasets, I would like to take a moment and explain how the data hierarchy works in these datasets. I did some research and found out that New York is divided into districts (denoted by numbers like 01,02, etc.), which are then further divided into boroughs, which then have schools. Thus, each school is identified by its DBN (District Borough Number) which looks something like “01M015”.

The datasets that we would be using in this project are:

  • Sat Scores - SAT Scores by sections for each school. Download JSON here
  • Enrollment - Average attendance & Number of students enrolled at a district level. Download JSON here
  • High Schools - A plethora of details about every school such as location, address, website, etc. Download JSON here
  • Class Size - Details about school class sizes & education programs. Download JSON here
  • Advanced Placement - Summary of students taking advanced placement exams by school. Download JSON here
  • Graduation - Graduation related stats for each school by cohorts. Download JSON here
  • Demographics - Student demographics information for each school. Download JSON here
  • Districts - District coordinates and shape polygons. You will need to download this data in GeoJSON format in order to use it for plotting. Download them from here
  • Math test results - Math test results for every school. Download JSON here
  • Survey Results - Survey responses of students, teachers, and parents for each school providing overview of their perception of school quality. This is not available through APIs and has to be downloaded from here

Having gotten enough context, let’s get started with this project!!

First, we will import the required libraries using the below code:

#Importing all relevant libraries
import numpy as np
import pandas as pd
import os
import folium # For plotting chloropleth charts
from folium import plugins
import pygeoj # For reading in geoJSON District map file

Now, let’s read in all the different datasets that I described above:

#Using Department of Education APIs to read in all relevant data
sat_results = pd.read_json("https://data.cityofnewyork.us/resource/f9bf-2cp4.json")
enrollment = pd.read_json("https://data.cityofnewyork.us/resource/7z8d-msnt.json")
high_schools = pd.read_json("https://data.cityofnewyork.us/resource/n3p6-zve2.json")
class_size = pd.read_json("https://data.cityofnewyork.us/resource/urz7-pzb3.json")
ap2010 = pd.read_json("https://data.cityofnewyork.us/resource/itfs-ms3e.json")
graduation = pd.read_json("https://data.cityofnewyork.us/resource/vh2h-md7a.json")
demographics = pd.read_json("https://data.cityofnewyork.us/resource/ihfw-zy9j.json")
math_results = pd.read_json("https://data.cityofnewyork.us/resource/jufi-gzgp.json")
survey_1 = pd.read_csv("masterfile11_gened_final.txt",delimiter='\t',encoding='windows-1252')
survey_2 = pd.read_csv("masterfile11_d75_final.txt",delimiter='\t',encoding='windows-1252')

Having the data in place, its always good to check samples for each data set to understand it better

#Browsing the datasets one by one
sat_results.head()
dbn num_of_sat_test_takers sat_critical_reading_avg_score sat_math_avg_score sat_writing_avg_score school_name
0 01M292 29 355 404 363 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES
1 01M448 91 383 423 366 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL
2 01M450 70 377 402 370 EAST SIDE COMMUNITY SCHOOL
3 01M458 7 414 401 359 FORSYTH SATELLITE ACADEMY
4 01M509 44 390 433 384 MARTA VALLE HIGH SCHOOL
enrollment.head()
district ytd_attendance_avg_ ytd_enrollment_avg_
0 DISTRICT 01 91.18 12367
1 DISTRICT 02 89.01 60823
2 DISTRICT 03 89.28 21962
3 DISTRICT 04 91.13 14252
4 DISTRICT 05 89.08 13170
high_schools.head()
:@computed_region_92fq_4b7q :@computed_region_efsh_h5xi :@computed_region_f5dn_yrer :@computed_region_sbqj_enih :@computed_region_yeji_bk3q addtl_info1 addtl_info2 advancedplacement_courses bbl bin ... school_name school_sports school_type se_services start_time state_code subway total_students website zip
0 47 20529.0 51 59 3 Uniform Required: plain white collared shirt, ... Extended Day Program, Student Summer Orientati... Calculus AB, English Language and Composition,... 4.157360e+09 4300730.0 ... Frederick Douglass Academy VI High School Step Team, Modern Dance, Hip Hop Dance NaN This school will provide students with disabil... 2020-04-28 07:45:00 NY A to Beach 25th St-Wavecrest 412.0 http://schools.nyc.gov/schoolportals/27/Q260 11691
1 45 17616.0 21 35 2 Our school requires completion of a Common Cor... NaN NaN 3.068830e+09 3186454.0 ... Life Academy High School for Film and Music NaN NaN This school will provide students with disabil... 2020-04-28 08:15:00 NY D to 25th Ave ; N to Ave U ; N to Gravesend - ... 260.0 http://schools.nyc.gov/schoolportals/21/K559 11214
2 49 18181.0 69 52 2 Dress Code Required: solid white shirt/blouse,... Student Summer Orientation, Weekend Program of... English Language and Composition, United State... 3.016160e+09 3393805.0 ... Frederick Douglass Academy IV Secondary School Basketball Team NaN This school will provide students with disabil... 2020-04-28 08:00:00 NY J to Kosciusko St ; M, Z to Myrtle Ave 155.0 http://schools.nyc.gov/schoolportals/16/K393 11221
3 31 11611.0 58 26 5 All students are individually programmed (base... Extended Day Program Art History, English Language and Composition,... 2.036040e+09 2022205.0 ... Pablo Neruda Academy Baseball, Basketball, Flag Football, Soccer, S... NaN This school will provide students with disabil... 2020-04-28 08:00:00 NY N/A 335.0 www.pablonerudaacademy.org 10473
4 19 12420.0 20 12 4 Chancellor’s Arts Endorsed Diploma NaN Art History, Biology, Calculus AB, Calculus BC... 1.011560e+09 1030341.0 ... Fiorello H. LaGuardia High School of Music & A... NaN Specialized School This school will provide students with disabil... 2020-04-28 08:00:00 NY 1 to 66th St - Lincoln Center ; 2, 3 to 72nd S... 2730.0 www.laguardiahs.org 10023

5 rows × 69 columns

class_size.head()
average_class_size borough core_course_ms_core_and_9_12_only_ core_subject_ms_core_and_9_12_only_ csd data_source grade_ number_of_sections number_of_students_seats_filled program_type school_code school_name schoolwide_pupil_teacher_ratio service_category_k_9_only_ size_of_largest_class size_of_smallest_class
0 19.0 M - - 1 ATS 0K 1.0 19.0 GEN ED M015 P.S. 015 Roberto Clemente NaN - 19.0 19.0
1 21.0 M - - 1 ATS 0K 1.0 21.0 CTT M015 P.S. 015 Roberto Clemente NaN - 21.0 21.0
2 17.0 M - - 1 ATS 01 1.0 17.0 GEN ED M015 P.S. 015 Roberto Clemente NaN - 17.0 17.0
3 17.0 M - - 1 ATS 01 1.0 17.0 CTT M015 P.S. 015 Roberto Clemente NaN - 17.0 17.0
4 15.0 M - - 1 ATS 02 1.0 15.0 GEN ED M015 P.S. 015 Roberto Clemente NaN - 15.0 15.0
ap2010.head()
ap_test_takers_ dbn number_of_exams_with_scores_3_4_or_5 schoolname total_exams_taken
0 39.0 01M448 10.0 UNIVERSITY NEIGHBORHOOD H.S. 49.0
1 19.0 01M450 NaN EAST SIDE COMMUNITY HS 21.0
2 24.0 01M515 24.0 LOWER EASTSIDE PREP 26.0
3 255.0 01M539 191.0 NEW EXPLORATIONS SCI,TECH,MATH 377.0
4 NaN 02M296 NaN High School of Hospitality Management NaN
graduation.head()
advanced_regents_n advanced_regents_of_cohort advanced_regents_of_grads cohort dbn demographic dropped_out_n dropped_out_of_cohort local_n local_of_cohort ... regents_w_o_advanced_of_grads school_name still_enrolled_n still_enrolled_of_cohort total_cohort total_grads_n total_grads_of_cohort total_regents_n total_regents_of_cohort total_regents_of_grads
0 s NaN NaN 2003 01M292 Total Cohort s NaN s NaN ... NaN HENRY STREET SCHOOL FOR INTERNATIONAL s NaN 5 s NaN s NaN NaN
1 0 0.0 0.0 2004 01M292 Total Cohort 3 5.5 20 36.4 ... 45.9 HENRY STREET SCHOOL FOR INTERNATIONAL 15 27.3 55 37 67.3 17 30.9 45.9
2 0 0.0 0.0 2005 01M292 Total Cohort 9 14.1 16 25.0 ... 62.8 HENRY STREET SCHOOL FOR INTERNATIONAL 9 14.1 64 43 67.2 27 42.2 62.8
3 0 0.0 0.0 2006 01M292 Total Cohort 11 14.1 7 9.0 ... 83.7 HENRY STREET SCHOOL FOR INTERNATIONAL 16 20.5 78 43 55.1 36 46.2 83.7
4 0 0.0 0.0 2006 Aug 01M292 Total Cohort 11 14.1 7 9.0 ... 84.1 HENRY STREET SCHOOL FOR INTERNATIONAL 15 19.2 78 44 56.4 37 47.4 84.1

5 rows × 23 columns

demographics.head()
asian_num asian_per black_num black_per ctt_num dbn ell_num ell_percent female_num female_per ... male_per name prek schoolyear selfcontained_num sped_num sped_percent total_enrollment white_num white_per
0 10 3.6 74 26.3 25 01M015 36 12.8 123.0 43.8 ... 56.2 P.S. 015 ROBERTO CLEMENTE 15 20052006 9 57.0 20.3 281 5 1.8
1 18 7.4 68 28.0 19 01M015 38 15.6 103.0 42.4 ... 57.6 P.S. 015 ROBERTO CLEMENTE 15 20062007 15 55.0 22.6 243 4 1.6
2 16 6.1 77 29.5 20 01M015 52 19.9 118.0 45.2 ... 54.8 P.S. 015 ROBERTO CLEMENTE 18 20072008 14 60.0 23.0 261 7 2.7
3 16 6.3 75 29.8 21 01M015 48 19.0 103.0 40.9 ... 59.1 P.S. 015 ROBERTO CLEMENTE 17 20082009 17 62.0 24.6 252 7 2.8
4 16 7.7 67 32.2 14 01M015 40 19.2 84.0 40.4 ... 59.6 P.S. 015 ROBERTO CLEMENTE 16 20092010 14 46.0 22.1 208 6 2.9

5 rows × 38 columns

math_results.head()
category dbn grade level_1_1 level_1_2 level_2_1 level_2_2 level_3_1 level_3_2 level_3_4_1 level_3_4_2 level_4_1 level_4_2 mean_scale_score number_tested year
0 All Students 01M015 3 2.0 5.1 11.0 28.2 20.0 51.3 26.0 66.7 6.0 15.4 667.0 39 2006
1 All Students 01M015 3 2.0 6.5 3.0 9.7 22.0 71.0 26.0 83.9 4.0 12.9 672.0 31 2007
2 All Students 01M015 3 0.0 0.0 6.0 16.2 29.0 78.4 31.0 83.8 2.0 5.4 668.0 37 2008
3 All Students 01M015 3 0.0 0.0 4.0 12.1 28.0 84.8 29.0 87.9 1.0 3.0 668.0 33 2009
4 All Students 01M015 3 6.0 23.1 12.0 46.2 6.0 23.1 8.0 30.8 2.0 7.7 677.0 26 2010
survey_1.head()
dbn bn schoolname d75 studentssurveyed highschool schooltype rr_s rr_t rr_p ... s_N_q14e_3 s_N_q14e_4 s_N_q14f_1 s_N_q14f_2 s_N_q14f_3 s_N_q14f_4 s_N_q14g_1 s_N_q14g_2 s_N_q14g_3 s_N_q14g_4
0 01M015 M015 P.S. 015 Roberto Clemente 0 No 0.0 Elementary School NaN 88 60 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 01M019 M019 P.S. 019 Asher Levy 0 No 0.0 Elementary School NaN 100 60 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 01M020 M020 P.S. 020 Anna Silver 0 No 0.0 Elementary School NaN 88 73 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 01M034 M034 P.S. 034 Franklin D. Roosevelt 0 Yes 0.0 Elementary / Middle School 89.0 73 50 ... 20.0 16.0 23.0 54.0 33.0 29.0 31.0 46.0 16.0 8.0
4 01M063 M063 P.S. 063 William McKinley 0 No 0.0 Elementary School NaN 100 60 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 1942 columns

survey_2.head()
dbn bn schoolname d75 studentssurveyed highschool schooltype rr_s rr_t rr_p ... s_q14_2 s_q14_3 s_q14_4 s_q14_5 s_q14_6 s_q14_7 s_q14_8 s_q14_9 s_q14_10 s_q14_11
0 75K004 K004 P.S. K004 1 Yes 0.0 District 75 Special Education 38.0 90 72 ... 29.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 75K036 K036 P.S. 36 1 Yes NaN District 75 Special Education 70.0 69 44 ... 20.0 27.0 19.0 9.0 2.0 6.0 1.0 2.0 0.0 0.0
2 75K053 K053 P.S. K053 1 Yes NaN District 75 Special Education 94.0 97 53 ... 14.0 12.0 12.0 10.0 21.0 13.0 11.0 2.0 0.0 0.0
3 75K077 K077 P.S. K077 1 Yes NaN District 75 Special Education 95.0 65 55 ... 14.0 14.0 7.0 11.0 16.0 10.0 6.0 4.0 7.0 7.0
4 75K140 K140 P.S. K140 1 Yes 0.0 District 75 Special Education 77.0 70 42 ... 35.0 34.0 17.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 1773 columns

Things we can start noticing from previews:

  1. DBN is one common field that is present in majority of the datasets. We can combine these datasets on DBN
  2. All datasets do not have one unique row for each school
  3. All schools listed in these datasets are not high-schools
  4. We can choose the important fields from survey data and combine the two survey datasets into one

Okay, so having said that we need to create a DBN field in the class_size dataset. Looking at what DBN looks like, we can see it is a combination of district, borough, and school code. Fragments of dbn are already lying around in class_size, which we need to concatenate into one.

class_size["dbn"] = class_size.apply(lambda x: "{0:02d}{1}".format(x["csd"], x["school_code"]), axis=1)
class_size['dbn'].head(10)
0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
5    01M015
6    01M015
7    01M015
8    01M015
9    01M015
Name: dbn, dtype: object  

Perfect! This looks like a dbn code now.

Let’s now combine the two survey datasets and only keep the fields seeming important

survey = pd.concat([survey_1,survey_2],axis=0)
survey_fields = ["dbn", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_10", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,survey_fields]

Before joining the datasets, let’s reduce datasets to 1 row per school dbn. Starting with class_size

#Only the grade 9-12 is relevant to our analysis 
class_size['grade_'].unique()

array([‘0K’, ‘01’, ‘02’, ‘03’, ‘04’, ‘05’, ‘0K-09’, nan, ‘06’, ‘07’, ‘08’, ‘MS Core’, ‘09-12’], dtype=object)

#Majority of the grade 9-12 programs are general education
#Here CTT stands for Collabortive Team Teaching where both Gened and Special Ed teachers teach together in the class
#For now, let's keep only the gened program
class_size[class_size['grade_']=='09-12'].groupby('program_type').count()
average_class_size borough core_course_ms_core_and_9_12_only_ core_subject_ms_core_and_9_12_only_ csd data_source grade_ number_of_sections number_of_students_seats_filled school_code school_name schoolwide_pupil_teacher_ratio service_category_k_9_only_ size_of_largest_class size_of_smallest_class dbn
program_type
CTT 51 51 51 51 51 51 51 51 51 51 51 0 51 51 51 51
GEN ED 149 149 149 149 149 149 149 149 149 149 149 0 149 149 149 149
SPEC ED 16 16 16 16 16 16 16 16 16 16 16 0 16 16 16 16
class_size = class_size[class_size["grade_"] == "09-12"]
class_size = class_size[class_size["program_type"] == "GEN ED"]
class_size = class_size.groupby("dbn").agg(np.mean)
class_size.reset_index(inplace=True)
class_size.rename(columns = {'grade_':'grade'}, inplace = True)

Moving on to demographics

#Let's keep only the latest year
demographics['schoolyear'].unique()
array([20052006, 20062007, 20072008, 20082009, 20092010, 20102011,
       20112012], dtype=int64)  
demographics = demographics[demographics['schoolyear']==20112012]

#Ensuring there are no duplicate dbn rows
len(demographics)-demographics['dbn'].nunique()
0

Moving on to math results

#Taking results for the highest grade
math_results['grade'].unique()
array(['3', '4', '5', '6', 'All Grades', '7', '8'], dtype=object)
#Taking results for the latest year
math_results['year'].unique()
array([2006, 2007, 2008, 2009, 2010, 2011], dtype=int64)
math_results = math_results[(math_results['grade']=='8') & (math_results['year']==2011)]

#Ensuring there are no duplicate dbn rows
len(math_results)-math_results['dbn'].nunique()
0

Moving on to graduation

#Taking the latest 2006 cohort
graduation['cohort'].unique()
array(['2003', '2004', '2005', '2006', '2006 Aug', '2001', '2002'],
      dtype=object)
#Taking the overall cohort
graduation['demographic'].unique()
array(['Total Cohort', 'Asian', 'Male'], dtype=object)
graduation = graduation[(graduation['cohort']=='2006') & (graduation['demographic']=='Total Cohort')]

len(graduation) - graduation['dbn'].nunique()
0
#This one is already reduced
len(high_schools) - high_schools['dbn'].nunique()
0
#This one requires fixing
len(ap2010) - ap2010['dbn'].nunique()
1
ap2010 = ap2010.groupby('dbn').agg(np.mean)
ap2010.reset_index(inplace=True)

#Checking if it is fixed now
len(ap2010) - ap2010['dbn'].nunique()
0
#This one is already reduced
len(survey) - survey['dbn'].nunique()
0
#Ensuring this is reduced
len(sat_results) - sat_results['dbn'].nunique()
0

Okay, now that we are done cleaning the data sets, let’s proceed to creating few fields which do not exist and might be important later followed by combining the datasets.
Also, I noticed that there are some entries in sat_results which do not have any score associated. Let’s remove them from the analysis as there is no legit way to impute these values.

Combining the datasets

sat_results = sat_results[sat_results['sat_math_avg_score']!='s']

#Converting sat scores to numeric valu
cols = ['sat_math_avg_score', 'sat_critical_reading_avg_score', 'sat_writing_avg_score']
for c in cols:
    sat_results[c] = sat_results[c].apply(lambda x: int(x))

# Adding in total average SAT score as a new field
sat_results['sat_score'] = sat_results['sat_critical_reading_avg_score']+sat_results['sat_math_avg_score']+sat_results['sat_writing_avg_score']

# Extracting latitude and longitude out of the location information provided in high school data
high_schools['Latitude'] = high_schools['location_1'].apply(lambda x: x.get('latitude'))
high_schools['Longitude'] = high_schools['location_1'].apply(lambda x: x.get('longitude'))

# Checking the number of unique DBNs in each database
data_list = {'sat_results':sat_results,
             'enrollment':enrollment,
             'high_schools':high_schools,
             'class_size':class_size,
             'ap2010':ap2010,
             'graduation':graduation,
             'demographics':demographics,
             'districts':districts,
             'math_results':math_results,
             'survey':survey}

for k,v in data_list.items():
    print(k)
    try:
        print(v['dbn'].nunique())
    except Exception as e:
        print(e)
sat_results
421
enrollment
'dbn'
high_schools
435
class_size
15
ap2010
257
graduation
154
demographics
151
districts
'dbn'
math_results
12
survey
1702

Hmm, looks like survey leads in number of dbns by a huge margin. One join approach can be to use that as base and left join every other table to that. But that will result in a high number of blank rows (Not apt for our analysis) Let’s check how many DBNs we have in common between sat_results and high_schools. Maybe we could utilize that number as the base and fetch information for those from other tables

for k,v in data_list.items():
    print(k)
    try:
        print(len(set(sat_results['dbn']).intersection(v['dbn'])))
    except Exception as e:
        print("Exception {0} caught".format(e))
sat_results
421
enrollment
Exception 'dbn' caught
high_schools
339
class_size
10
ap2010
251
graduation
148
demographics
67
districts
Exception 'dbn' caught
math_results
3
survey
418

From what I observe above, I believe it would be best to take an outer join of sat_results, high_schools, ap2010, and graduation. Then we can left join other datasets to it.

full = pd.merge(sat_results,high_schools,on='dbn',how='outer')

full = pd.merge(full,ap2010,on='dbn',how='outer')

full = pd.merge(full,graduation,on='dbn',how='outer')

for k,v in data_list.items():
    if k not in ['sat_results','high_schools','ap2010','graduation','enrollment','districts']:
        full = pd.merge(full,v,on='dbn',how='left')

np.set_printoptions(threshold=200)

cols = full.columns.tolist()

full[['ap_test_takers_','number_of_exams_with_scores_3_4_or_5','total_exams_taken']]
['dbn',
 'num_of_sat_test_takers',
 'sat_critical_reading_avg_score',
 'sat_math_avg_score',
 'sat_writing_avg_score',
 'school_name_x',
 'sat_score',
 ':@computed_region_92fq_4b7q',
 ':@computed_region_efsh_h5xi',
 ':@computed_region_f5dn_yrer',
 ':@computed_region_sbqj_enih',
 ':@computed_region_yeji_bk3q',
 'addtl_info1',
 'addtl_info2',
 'advancedplacement_courses',
 'bbl',
 'bin',
 'boro',
 'building_code',
 'bus',
 'campus_name',
 'census_tract',
 'city',
 'community_board',
 'council_district',
 'ell_programs',
 'end_time',
 'expgrade_span_max',
 'expgrade_span_min',
 'extracurricular_activities',
 'fax_number',
 'grade_span_max',
 'grade_span_min',
 'language_classes',
 'location_1',
 'nta',
 'number_programs',
 'online_ap_courses',
 'online_language_courses',
 'overview_paragraph',
 'partner_cbo',
 'partner_corporate',
 'partner_cultural',
 'partner_financial',
 'partner_highered',
 'partner_hospital',
 'partner_nonprofit',
 'partner_other',
 'phone_number',
 'primary_address_line_1',
 'priority01',
 'priority02',
 'priority03',
 'priority04',
 'priority05',
 'priority06',
 'priority07',
 'priority08',
 'priority09',
 'priority10',
 'program_highlights',
 'psal_sports_boys',
 'psal_sports_coed',
 'psal_sports_girls',
 'school_accessibility_description',
 'school_name_y',
 'school_sports',
 'school_type',
 'se_services',
 'start_time',
 'state_code',
 'subway',
 'total_students',
 'website',
 'zip',
 'Latitude',
 'Longitude',
 'ap_test_takers_',
 'number_of_exams_with_scores_3_4_or_5',
 'total_exams_taken',
 'advanced_regents_n',
 'advanced_regents_of_cohort',
 'advanced_regents_of_grads',
 'cohort',
 'demographic',
 'dropped_out_n',
 'dropped_out_of_cohort',
 'local_n',
 'local_of_cohort',
 'local_of_grads',
 'regents_w_o_advanced_n',
 'regents_w_o_advanced_of_cohort',
 'regents_w_o_advanced_of_grads',
 'school_name',
 'still_enrolled_n',
 'still_enrolled_of_cohort',
 'total_cohort',
 'total_grads_n',
 'total_grads_of_cohort',
 'total_regents_n',
 'total_regents_of_cohort',
 'total_regents_of_grads',
 'average_class_size',
 'csd',
 'number_of_sections',
 'number_of_students_seats_filled',
 'schoolwide_pupil_teacher_ratio',
 'size_of_largest_class',
 'size_of_smallest_class',
 'asian_num',
 'asian_per',
 'black_num',
 'black_per',
 'ctt_num',
 'ell_num',
 'ell_percent',
 'female_num',
 'female_per',
 'fl_percent',
 'frl_percent',
 'grade1',
 'grade10',
 'grade11',
 'grade12',
 'grade2',
 'grade3',
 'grade4',
 'grade5',
 'grade6',
 'grade7',
 'grade8',
 'grade9',
 'hispanic_num',
 'hispanic_per',
 'k',
 'male_num',
 'male_per',
 'name',
 'prek',
 'schoolyear',
 'selfcontained_num',
 'sped_num',
 'sped_percent',
 'total_enrollment',
 'white_num',
 'white_per',
 'category',
 'grade',
 'level_1_1',
 'level_1_2',
 'level_2_1',
 'level_2_2',
 'level_3_1',
 'level_3_2',
 'level_3_4_1',
 'level_3_4_2',
 'level_4_1',
 'level_4_2',
 'mean_scale_score',
 'number_tested',
 'year',
 'rr_s',
 'rr_t',
 'rr_p',
 'N_s',
 'N_t',
 'N_p',
 'saf_p_11',
 'com_p_11',
 'eng_p_11',
 'aca_p_11',
 'saf_t_11',
 'com_t_11',
 'eng_t_10',
 'aca_t_11',
 'saf_s_11',
 'com_s_11',
 'eng_s_11',
 'aca_s_11',
 'saf_tot_11',
 'com_tot_11',
 'eng_tot_11',
 'aca_tot_11']
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Adding a field for school districs
full["school_dist"] = full["dbn"].apply(lambda x: x[:2])
full.fillna(full.mean(),inplace=True)

It would be beneficial to understand the correlation between sat_score and other columns to understand what fields influence sat_score the most

# Computing correlations between Sat_score and other columns
full.corr()['sat_score'].sort_values(ascending=False)
sat_score                               1.000000e+00
sat_writing_avg_score                   9.810159e-01
sat_critical_reading_avg_score          9.747583e-01
sat_math_avg_score                      9.530106e-01
ap_test_takers_                         5.103909e-01
total_exams_taken                       5.017174e-01
number_of_exams_with_scores_3_4_or_5    4.535005e-01
N_p                                     4.281443e-01
advanced_regents_of_cohort              4.253570e-01
N_s                                     4.221618e-01
advanced_regents_of_grads               3.924787e-01
total_regents_of_cohort                 3.873194e-01
total_students                          3.845674e-01
white_num                               3.777941e-01
total_grads_of_cohort                   3.403242e-01
white_per                               3.328702e-01
N_t                                     2.940519e-01
saf_t_11                                2.757786e-01
rr_s                                    2.736355e-01
asian_num                               2.729442e-01
total_regents_of_grads                  2.697793e-01
aca_s_11                                2.669456e-01
saf_tot_11                              2.608540e-01
saf_s_11                                2.569001e-01
male_num                                2.017470e-01
total_enrollment                        2.013726e-01
asian_per                               1.906019e-01
number_of_sections                      1.726816e-01
female_num                              1.711856e-01
number_of_students_seats_filled         1.708587e-01
total_cohort                            1.619401e-01
aca_tot_11                              1.611355e-01
eng_s_11                                1.551608e-01
com_s_11                                1.512602e-01
aca_t_11                                1.216712e-01
number_programs                         1.153913e-01
number_tested                           1.109474e-01
level_4_1                               1.108240e-01
level_4_2                               1.107934e-01
level_3_4_1                             1.104739e-01
mean_scale_score                        1.072120e-01
saf_p_11                                1.066251e-01
level_3_4_2                             1.027085e-01
rr_p                                    1.021401e-01
eng_tot_11                              8.574072e-02
com_t_11                                8.197490e-02
com_tot_11                              7.766472e-02
regents_w_o_advanced_of_cohort          7.265046e-02
female_per                              6.516539e-02
size_of_largest_class                   4.760754e-02
bin                                     4.132848e-02
census_tract                            3.914908e-02
level_3_1                               3.586515e-02
average_class_size                      3.438492e-02
bbl                                     3.429191e-02
aca_p_11                                3.004570e-02
eng_p_11                                2.891362e-02
:@computed_region_sbqj_enih             1.987861e-02
grade_span_max                          1.830755e-02
:@computed_region_efsh_h5xi             1.638056e-02
rr_t                                    1.164827e-02
expgrade_span_max                       7.847064e-04
expgrade_span_min                       4.104648e-27
size_of_smallest_class                 -2.323112e-02
grade_span_min                         -2.844718e-02
csd                                    -3.282872e-02
:@computed_region_f5dn_yrer            -4.702550e-02
level_3_2                              -4.776999e-02
community_board                        -5.675188e-02
black_num                              -6.104478e-02
zip                                    -6.147304e-02
male_per                               -6.516539e-02
:@computed_region_yeji_bk3q            -6.561082e-02
level_1_2                              -6.972787e-02
level_1_1                              -7.103664e-02
:@computed_region_92fq_4b7q            -7.145365e-02
council_district                       -7.446446e-02
com_p_11                               -8.715157e-02
hispanic_num                           -8.974585e-02
sped_num                               -9.696171e-02
level_2_2                              -1.105178e-01
level_2_1                              -1.109453e-01
regents_w_o_advanced_of_grads          -1.130022e-01
Latitude                               -1.155862e-01
Longitude                              -1.288842e-01
ell_percent                            -1.299355e-01
local_of_cohort                        -2.226132e-01
black_per                              -2.295139e-01
sped_percent                           -2.306069e-01
still_enrolled_of_cohort               -2.361166e-01
local_of_grads                         -2.697793e-01
hispanic_per                           -3.222427e-01
dropped_out_of_cohort                  -3.397260e-01
frl_percent                            -3.546909e-01
schoolwide_pupil_teacher_ratio                   NaN
schoolyear                                       NaN
year                                             NaN
eng_t_10                                         NaN
Name: sat_score, dtype: float64

Let’s take a moment here to note down anything interesting we can find from the above. We can use that later to create an analysis viewpoint.

  • Total number of students correlated with Sat Scores. Interesting, that would mean than large schools perform better than small schools
  • Number of parents, teachers, and students responding to the survey also highly correlates with sat scores
  • FRL (Free or Reduced Lunches) and ELL (English Language Learners) percent correlate strongly negatively with sat score
  • Female_num correlates positively whereas male_num correlates negatively with Sat scores
  • Racial inequality in sat scores can be identified easily from above (white_per, hispanic_per, black_per, asian_per)

Plotting stage

Now that we have prepared our data, drawn some insights from it as well. Let’s now do some exploratory charts and maps for a better understanding of the problem at hand

# Converting Latitude and Longitude to float values
full['Latitude'] = pd.to_numeric(full['Latitude'],errors='coerce')
full['Longitude'] = pd.to_numeric(full['Longitude'],errors='coerce')

schools_map = folium.Map(location=[full['Latitude'].mean(), full['Longitude'].mean()], zoom_start=10)
marker_cluster = plugins.MarkerCluster().add_to(schools_map)
for name, row in full.loc[~full['Latitude'].isnull()].iterrows():
    folium.Marker([row["Latitude"], row["Longitude"]], popup="{0}: {1}".format(row["dbn"], row["school_name"])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map

Let’s try plotting a heatmap to better visualize school density across NY

schools_heatmap = folium.Map(location=[full['Latitude'].mean(), full['Longitude'].mean()], zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["Latitude"], row["Longitude"]] for name, row in full.loc[~full['Latitude'].isnull()].iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap

Interesting, Let’s try assessing further if there is any difference in SAT scores between the school districs

# Aggregating data at district level
district_data = full.groupby("school_dist").agg(np.mean)
district_data.reset_index(inplace=True)
district_data['school_dist'] = district_data['school_dist'].apply(lambda x: str(int(x)))
districts_geojson = pygeoj.load(filepath='districts.geojson')

# Defining a function to overlay heatmap distribution of a metric over the district maps
def show_district_map(col,dist):
    districts = folium.Map(location=[full['Latitude'].mean(), full['Longitude'].mean()], zoom_start=10)
    districts.choropleth(
    geo_data=dist,
    data=district_data,
    columns=['school_dist', col],
    key_on='feature.properties.school_dist',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    )
    return districts

show_district_map("sat_score",districts_geojson).save("districts_sat.html")
show_district_map("sat_score",districts_geojson)

This looks great! Now that we have developed a basic understanding of the dataset and SAT score variation across districts. Let’s look further into the analysis angles that we had proposed earlier.

%matplotlib inline
full.plot.scatter(x='total_enrollment', y='sat_score')

SAT Score vs Total Enrollment

Hmm, There is not exactly a correlation here. Its just that there is a cluster of schools with low enrollment and low sat scores. Let’s try pulling up some names.

full.loc[(full['total_enrollment']<500)&(full['sat_score']<=1200),['name','sat_score']]
name sat_score
0 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 1122.0
1 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 1172.0
3 SATELLITE ACADEMY HS @ FORSYTHE STREET 1174.0
9 47 THE AMERICAN SIGN LANGUAGE AND ENGLISH DUAL L 1182.0
10 FOOD AND FINANCE HIGH SCHOOL 1194.0
11 HIGH SCHOOL FOR HISTORY & COMMUNICATION 1156.0
12 HIGH SCHOOL OF HOSPITALITY MANAGEMENT 1111.0
15 THE FACING HISTORY SCHOOL 1051.0
16 THE URBAN ASSEMBLY ACADEMY OF GOVERNMENT AND LAW 1148.0
17 LOWER MANHATTAN ARTS ACADEMY 1200.0
18 THE JAMES BALDWIN SCHOOL: A SCHOOL FOR EXPEDITIO 1188.0
19 THE URBAN ASSEMBLY SCHOOL OF BUSINESS FOR YOUNG 1127.0
20 GRAMERCY ARTS HIGH SCHOOL 1176.0
22 EMMA LAZARUS HIGH SCHOOL 1188.0
32 LANDMARK HIGH SCHOOL 1170.0
35 LEGACY SCHOOL FOR INTEGRATED STUDIES 1062.0
37 BAYARD RUSTIN EDUCATIONAL COMPLEX 1111.0
38 VANGUARD HIGH SCHOOL 1199.0
43 UNITY CENTER FOR URBAN TECHNOLOGIES 1070.0
49 NEW DESIGN HIGH SCHOOL 1168.0
50 INDEPENDENCE HIGH SCHOOL 1095.0
52 LIBERTY HIGH SCHOOL ACADEMY FOR NEWCOMERS 1156.0
56 SATELLITE ACADEMY HIGH SCHOOL 1032.0

After spending some time on Google to understand what is common between these schools, I have realized that most of these are schools where the minority enrollment and % of economically disadvantaged students is high. So its not actually low enrollment causing low sat scores but high % of minority students

full.plot.scatter(x='frl_percent', y='sat_score')

SAT Score vs FRL Percent

As expected, there is a negative correlation between FRL percent (indicative of % of minority & economically disadvantaged class) and sat scores. Higher the % of FRL in the school, lower is the sat score

full.plot.scatter(x='ell_percent', y='sat_score')

SAT Score vs ELL Percent

It looks like there are a group of schools with a high ell_percentage that also have low average SAT scores. We can investigate this at the district level, by figuring out the percentage of English language learners in each district, and seeing it if matches our map of SAT scores by district:

show_district_map('ell_percent',districts_geojson)

The distinction here is not very clear, probably due to the color palette chosen. We can for sure see that there is small district right in the middle which has very high ell_percent but not high sat scores.

Checking the correlation between survey responses and SAT Scores

full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()

SAT Score Correlation with Survey Responses

The only factors having a significant correlation with sat scores here are number of parents, students, and teachers responding to the survey, and the % of students responding to the survey. As these also correlate highly with total enrollment, it makes sense to say that schools with low frl percent, also have high survey response rates (that is not our concern at the moment though)

Exploring relationship between race and SAT scores

full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()

SAT Score correlation with Race

From above we can conclude that higher percentage of white and asian students lead to higher sat scores and the inverse holds true for black and hispanic students. It is likely that black and hispanic students are migrants who are also learning english language and are from economically disadvantaged community Let’s look at which schools have high white_per

full.plot.scatter(x='hispanic_per',y='sat_score')

SAT Score vs Hispanic Percentage

full.loc[(full['hispanic_per']<40)&(full['sat_score']>1400),['school_name','sat_score']]
school_name sat_score
6 NEW EXPLORATIONS INTO SCIENCE TECHNO 1621.0
8 BARD HIGH SCHOOL EARLY COLLEGE 1856.0
24 INSTITUTE FOR COLLABORATIVE EDUCATION 1424.0
25 PROFESSIONAL PERFORMING ARTS HIGH SCH 1522.0
26 BARUCH COLLEGE CAMPUS HIGH SCHOOL 1577.0
27 N.Y.C. LAB SCHOOL FOR COLLABORATIVE S 1677.0
28 SCHOOL OF THE FUTURE HIGH SCHOOL 1565.0
29 N.Y.C. MUSEUM SCHOOL 1419.0
30 ELEANOR ROOSEVELT HIGH SCHOOL 1758.0
31 MILLENNIUM HIGH SCHOOL 1614.0
41 STUYVESANT HIGH SCHOOL 2096.0
44 TALENT UNLIMITED HIGH SCHOOL 1416.0
51 HIGH SCHOOL FOR DUAL LANGUAGE AND ASI 1424.0
54 HIGH SCHOOL M560 - CITY AS SCHOOL 1415.0
55 URBAN ACADEMY LABORATORY HIGH SCHOOL 1547.0

These schools rank high in charts and host admission test and screening before admission, which is probably why hispanic_per ther is lower than others

Gender differences in SAT scores

full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()

SAT Score Correlation with Gender

AP test scores & SAT scores

One last angle that we would be exploring here is the Advance Placement test and how it affects SAT scores. We already saw that number of AP test takers correlates strongly with SAT scores

full['ap_per'] = full['ap_test_takers_']/full['total_enrollment']
full.plot.scatter(x='ap_per',y='sat_score')

SAT Score vs Advance Placement Percentage

Let’s try pulling some names from the cluster in top right (excluding outliers) where sat_scores are high

full.loc[(full['ap_per']>0.3)&(full['sat_score']>1400),['school_name','sat_score']]
school_name sat_score
30 ELEANOR ROOSEVELT HIGH SCHOOL 1758.0
41 STUYVESANT HIGH SCHOOL 2096.0
55 URBAN ACADEMY LABORATORY HIGH SCHOOL 1547.0
74 FIORELLO H. LAGUARDIA HIGH SCHOOL OF 1707.0
81 MANHATTAN CENTER FOR SCIENCE AND MATH 1430.0
175 NaN 1969.0
182 NaN 1920.0
220 NaN 1833.0
275 NaN 1436.0
313 NaN 1431.0
320 NaN 1473.0
323 NaN 1627.0
351 NaN 1910.0
355 NaN 1514.0
356 NaN 1474.0
358 NaN 1449.0
374 NaN 1407.0
379 NaN 1868.0
405 NaN 1418.0
409 NaN 1953.0

Some Google searching reveals that these are mostly highly selective schools where you need to take a test to get in. It makes sense that these schools would have high proportions of AP test takers.

THE END!

This is all that we are going to cover as part of this project. I would like to thank DataQuest.io for guiding me through the project.

Tags: SAT exam EDA
Share: Twitter Facebook LinkedIn