
Building a data science portfolio: storytelling through data

Translator's Preface

This translation joins the stream of other data science tutorials on Habr. :)
The original is written by Vik Paruchuri, founder of Dataquest.io, which offers interactive training in data science and preparation for real work in the field. There is no exclusive know-how here, but the process from data collection to initial conclusions is described in detail, which may be interesting not only to those building a data science portfolio, but also to those who simply want to try their hand at practical analysis and don't know where to start.


Data science companies are increasingly looking at portfolios when making hiring decisions. This is partly because a portfolio is the best way to judge practical skills. And the good news is that a portfolio is entirely within your control: with some effort, you can put together an excellent one that will impress many companies.


The first step in a high-quality portfolio is understanding what skills you need to demonstrate in it.


The main skills that companies want to see in a data scientist, and that a portfolio should accordingly demonstrate, are:


- the ability to communicate,
- the ability to collaborate with others,
- technical competence,
- the ability to reason about data,
- motivation and the ability to take initiative.

Every good portfolio contains several projects, each of which can demonstrate one or two of the points above. This is the first post in a series on building a well-rounded data science portfolio. We will look at how to make a first portfolio project and how to tell a good story through data. At the end you will have a project that helps demonstrate your ability to communicate and to reason about data.


Translator's note:

I definitely won't translate the entire series, but I do plan to cover an interesting machine learning tutorial from the same source.


Storytelling through data


At its core, data science is about communication. You see some pattern in the data, then you look for an effective way to explain that pattern to others, then you convince them to take the actions you consider necessary. One of the most important skills in data science is telling a story visually through data. A well-told story carries your insights further and helps others understand your ideas.


A story, in the data science context, is a summary of what you found and what it means. An example would be discovering that your company's profits have fallen by 20% over the last year. Simply stating that fact is not enough: you must explain why profits fell and what to do about it.


The main components of a story told through data are:


- understanding and setting up the context,
- exploring multiple angles,
- using compelling visualizations,
- using varied data sources,
- maintaining a consistent narrative.

The best way to tell a story clearly through data is a Jupyter notebook. If you are not familiar with it, here is a good tutorial. Jupyter notebook lets you explore data interactively and publish the result on various sites, including GitHub. Publishing your results is useful for collaboration — other people can extend your analysis.


In this post we will use Jupyter notebook together with Python libraries like pandas and matplotlib.


Choosing a topic for your data science project


The first step in creating a project is deciding on a topic. It is worth choosing something that interests you and that you genuinely want to explore. It is always obvious when people made a project just to have one, versus when they really enjoyed digging into the data. At this step it makes sense to spend some time finding something truly engaging.


A good way to find a topic is to browse different data repositories and see what catches your interest. Here are some good places to start:


- Data.gov — a catalog of US government data,
- /r/datasets — a subreddit with hundreds of interesting datasets,
- Awesome datasets — a list of datasets hosted on GitHub,
- rs.io — a blog post with a long list of interesting datasets.

In real data science it is often impossible to find a dataset fully prepared for your research. You may have to aggregate several data sources or seriously clean the data. If the topic really interests you, it is worth doing the same here: the result will be a better portfolio piece.
In this post we will use data on New York City secondary schools, which can be found here.


Translator's note:

Just in case, here are some examples of similar datasets closer to home (for Russian readers):



Limiting the scope


It is important to be able to take a project from start to finish. To do that, it helps to limit the field of study so that you know you can finish it. It is easier to add something to a completed project than to finish one you have grown tired of.


In our case, we will study the SAT scores of high school students, along with various demographic and other information about them. The SAT, or Scholastic Aptitude Test, is a test that high school students take before entering college. Colleges consider the scores when deciding whom to admit, so it pays to do well on it. The exam consists of three sections, each worth a maximum of 800 points. The total maximum score is 2400 (although this has changed back and forth over the years — in this dataset all scores are out of 2400). High schools are often ranked by their average SAT score, and a high average is usually taken as an indicator of how good a school district is.


There have been allegations that the SAT is unfair to certain racial groups in the United States, so an analysis of New York City data may help shed light on the fairness of the exam.


The dataset with SAT scores is here, and the dataset with information on each school is here. These will be the basis of our project, but we will need more data to make a full analysis.


Translator's note:

The exam in the original is the SAT — the Scholastic Aptitude Test. In the Russian translation it was rendered as the USE (ЕГЭ), its closest Russian equivalent.


Collecting the data


Once you have a good topic, it is worth looking for other datasets that can expand it or deepen the study. It is better to do this at the start, so that you have as much data as possible while building the project. If there is too little data, there is a chance you will give up too soon.


In our case, there are a few more datasets on this topic on the same site that cover demographic information and exam results.


Here are all of the datasets we will use:


- SAT results by school,
- AP test results,
- class size,
- demographics,
- graduation outcomes,
- math test results,
- the high school directory,
- school surveys.

All these data are interconnected, and we can combine them before we start the analysis.


Gathering background information


Before diving into analyzing the data, it helps to learn some background about the subject. In our case, a few facts will be useful:


- New York City is divided into five boroughs, each essentially its own region of the city.
- Schools in New York City are divided into several dozen school districts.
- Not every school in the datasets is a high school, so some filtering will be needed.
- Each school in New York City has a unique code called a DBN, or District Borough Number.
- By aggregating data by district, we can use the district map data to plot differences between districts.

Translator's note:

What I translated as "regions" are in fact New York City "boroughs", and the corresponding columns are accordingly named Borough.


Understanding the data


To really understand the context of the data, you need to spend time reading about it. In our case, each link above includes a description of every column. It appears we have data on the SAT scores of high school students, along with other datasets containing demographic and other information.


Let's run some code to read in the data. We will use Jupyter notebook for our exploration. The code below:


- loops through each data file we downloaded,
- reads each file into a pandas DataFrame,
- stores each DataFrame in a Python dictionary keyed by file name.

import pandas
import numpy as np

files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv",
         "hs_directory.csv", "math_test_results.csv", "sat_results.csv"]

data = {}
for f in files:
    d = pandas.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Once we've read everything in, we can use the head method on each DataFrame to print its first 5 rows:


for k, v in data.items():
    print("\n" + k + "\n")
    print(v.head())

We can already see some features of the datasets. Printing head for each one shows, in summary, the following columns:


math_test_results — DBN, Grade, Year, Category, Number Tested, Mean Scale Score, and counts and percentages of students at Levels 1-4 (plus the combined Level 3+4).

ap_2010 — DBN, SchoolName, AP Test Takers, Total Exams Taken, Number of Exams with scores 3 4 or 5.

sat_results — DBN, SCHOOL NAME, Num of SAT Test Takers, SAT Critical Reading Avg. Score, SAT Math Avg. Score, SAT Writing Avg. Score.

class_size — CSD, BOROUGH, SCHOOL CODE, SCHOOL NAME, GRADE, PROGRAM TYPE, core subject and course, service category, NUMBER OF STUDENTS / SEATS FILLED, NUMBER OF SECTIONS, AVERAGE CLASS SIZE, sizes of the smallest and largest classes, DATA SOURCE, SCHOOLWIDE PUPIL-TEACHER RATIO.

demographics — DBN, Name, schoolyear, free and reduced lunch percentages, total_enrollment, per-grade enrollment, and counts and percentages by ethnicity and gender.

graduation — Demographic, DBN, School name, Cohort, Total cohort, and counts and percentages for graduates, Regents diplomas, local diplomas, students still enrolled, and dropouts.

hs_directory — dbn, school_name, boro, building_code, phone and fax numbers, grade spans, admissions priorities, and Location 1 (street address plus coordinates).


Unifying the data


To make the data easier to work with, we need to combine all the datasets into one — this will let us quickly compare columns across datasets. To do that, we first need to find a common column to join on. Looking at the output above, it appears that column could be DBN, since it occurs in several datasets.


If we google "DBN New York City Schools", we land here, which explains that the DBN is a unique code for each school. When exploring datasets, especially government ones, you often have to do some detective work to figure out what each column means — and sometimes what the whole dataset means.


The problem now is that two of the datasets, class_size and hs_directory, do not contain a DBN column. In hs_directory it is called dbn, so we can simply rename it, or copy it into a column named DBN. For class_size, another approach will be needed.


The DBN column looks like this:


In [5]: data["demographics"]["DBN"].head()
Out[5]:
0    01M015
1    01M015
2    01M015
3    01M015
4    01M015
Name: DBN, dtype: object

If we look at class_size , this is what we will see in the first 5 lines:


In [4]: data["class_size"].head()
Out[4]:

   CSD BOROUGH SCHOOL CODE              SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY)  \
0    1       M        M015  PS 015 Roberto Clemente    0K       GEN ED                                    -
1    1       M        M015  PS 015 Roberto Clemente    0K          CTT                                    -
2    1       M        M015  PS 015 Roberto Clemente    01       GEN ED                                    -
3    1       M        M015  PS 015 Roberto Clemente    01          CTT                                    -
4    1       M        M015  PS 015 Roberto Clemente    02       GEN ED                                    -

  CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY (K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  \
0                                   -                            -                               19.0
1                                   -                            -                               21.0
2                                   -                            -                               17.0
3                                   -                            -                               17.0
4                                   -                            -                               15.0

   NUMBER OF SECTIONS  AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO
0                 1.0                19.0                    19.0                   19.0         ATS                             NaN
1                 1.0                21.0                    21.0                   21.0         ATS                             NaN
2                 1.0                17.0                    17.0                   17.0         ATS                             NaN
3                 1.0                17.0                    17.0                   17.0         ATS                             NaN
4                 1.0                15.0                    15.0                   15.0         ATS                             NaN

As you can see, the DBN is simply a combination of CSD, BOROUGH, and SCHOOL CODE. For those unfamiliar with New York: the city consists of five boroughs. Each borough is an organizational unit roughly the size of a fairly large US city. DBN stands for District Borough Number. It appears that CSD is the district, BOROUGH is the borough, and combined with SCHOOL CODE they form the DBN.


Now that we know how to make a DBN, we can add it to class_size and hs_directory .


In [ ]:
data["class_size"]["DBN"] = data["class_size"].apply(lambda x: "{0:02d}{1}".format(x["CSD"], x["SCHOOL CODE"]), axis=1)
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

Adding in the surveys


One of the most potentially interesting datasets is the survey of students, parents, and teachers about school quality. These surveys include information about the perceived safety of each school, academic standards, and more. Before combining our datasets, let's add the survey data. In real data science projects you will often stumble upon interesting data mid-analysis and want to pull it in. A flexible tool like Jupyter notebook lets you quickly add more code and rerun the analysis.


In our case, we will add the survey data to our data dictionary and then merge all the datasets. The survey data consists of two files, one for all schools and one for District 75 schools. To merge them, we need some code that:


- reads in the survey for all schools and the survey for District 75 schools,
- adds a d75 flag column indicating which survey each row came from,
- combines both into a single DataFrame using concat.

In [66]:
survey1 = pandas.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
survey2 = pandas.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey1["d75"] = False
survey2["d75"] = True
survey = pandas.concat([survey1, survey2], axis=0)

Once we've combined the surveys, there is an additional complication. We want to minimize the number of columns in our combined dataset so that we can easily compare columns and spot relationships. Unfortunately, the survey data contains many columns we don't need:


In [16]: survey.head()
Out[16]:

     N_p    N_s   N_t  aca_p_11  aca_s_11  aca_t_11  aca_tot_11  \
0   90.0    NaN  22.0       7.8       NaN       7.9         7.9
1  161.0    NaN  34.0       7.8       NaN       9.1         8.4
2  367.0    NaN  42.0       8.6       NaN       7.5         8.0
3  151.0  145.0  29.0       8.5       7.4       7.8         7.9
4   90.0    NaN  23.0       7.9       NaN       8.1         8.0

     bn  com_p_11  com_s_11  ...  t_q8c_1  t_q8c_2  t_q8c_3  t_q8c_4  \
0  M015       7.6       NaN  ...     29.0     67.0      5.0      0.0
1  M019       7.6       NaN  ...     74.0     21.0      6.0      0.0
2  M020       8.3       NaN  ...     33.0     35.0     20.0     13.0
3  M034       8.2       5.9  ...     21.0     45.0     28.0      7.0
4  M063       7.9       NaN  ...     59.0     36.0      5.0      0.0

   t_q9  t_q9_1  t_q9_2  t_q9_3  t_q9_4  t_q9_5
0   NaN     5.0    14.0    52.0    24.0     5.0
1   NaN     3.0     6.0     3.0    78.0     9.0
2   NaN     3.0     5.0    16.0    70.0     5.0
3   NaN     0.0    18.0    32.0    39.0    11.0
4   NaN    10.0     5.0    10.0    60.0    15.0

We can deal with this by consulting the data dictionary file that we downloaded along with the survey data; it tells us which fields are important.


We can then remove all irrelevant columns from the survey:


In [17]:
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11",
                 "com_p_11", "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11",
                 "eng_t_10", "aca_t_11", "saf_s_11", "com_s_11", "eng_s_11",
                 "aca_s_11", "saf_tot_11", "com_tot_11", "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:, survey_fields]
data["survey"] = survey
survey.shape

Out[17]: (1702, 23)

Understanding exactly what each dataset contains and which columns from it are important can save a lot of time and effort later.


Condensing the datasets


If we take a look at some of the datasets, such as class_size, we immediately see a problem:


In [18]: data["class_size"].head()
Out[18]:

   CSD BOROUGH SCHOOL CODE              SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY)  \
0    1       M        M015  PS 015 Roberto Clemente    0K       GEN ED                                    -
1    1       M        M015  PS 015 Roberto Clemente    0K          CTT                                    -
2    1       M        M015  PS 015 Roberto Clemente    01       GEN ED                                    -
3    1       M        M015  PS 015 Roberto Clemente    01          CTT                                    -
4    1       M        M015  PS 015 Roberto Clemente    02       GEN ED                                    -

  CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY (K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  NUMBER OF SECTIONS  AVERAGE CLASS SIZE  \
0                                   -                            -                               19.0                 1.0                19.0
1                                   -                            -                               21.0                 1.0                21.0
2                                   -                            -                               17.0                 1.0                17.0
3                                   -                            -                               17.0                 1.0                17.0
4                                   -                            -                               15.0                 1.0                15.0

   SIZE OF SMALLEST CLASS  SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO     DBN
0                    19.0                   19.0         ATS                             NaN  01M015
1                    21.0                   21.0         ATS                             NaN  01M015
2                    17.0                   17.0         ATS                             NaN  01M015
3                    17.0                   17.0         ATS                             NaN  01M015
4                    15.0                   15.0         ATS                             NaN  01M015

There are several rows for each school (as you can tell from the repeated DBN and SCHOOL NAME fields). However, if we look at sat_results, there is only one row per school:


In [21]: data["sat_results"].head()
Out[21]:

      DBN                                    SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES                     29                             355                 404                    363
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL                     91                             383                 423                    366
2  01M450                     EAST SIDE COMMUNITY SCHOOL                     70                             377                 402                    370
3  01M458                      FORSYTH SATELLITE ACADEMY                      7                             414                 401                    359
4  01M509                        MARTA VALLE HIGH SCHOOL                     44                             390                 433                    384

To combine these datasets, we need a way to condense datasets like class_size down to one row per high school. Otherwise, we won't be able to compare SAT scores against class size. We can achieve this by understanding the data better and then doing some aggregation.


Looking at class_size, GRADE and PROGRAM TYPE appear to take multiple values for each school. By restricting each field to a single value, we can filter out the duplicate rows. In the code below, we:


- select only the rows of class_size where GRADE is 09-12,
- select only the rows where PROGRAM TYPE is GEN ED,
- group class_size by DBN, taking the mean of each numeric column,
- reset the index so that DBN becomes a column again.

In [68]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

Condensing the other datasets


Next we need to condense demographics. The data was collected over several years for the same schools, so we select only the rows where the schoolyear field is the most recent:


In [69]:
demographics = data["demographics"]
demographics = demographics[demographics["schoolyear"] == 20112012]
data["demographics"] = demographics

Now we need to condense math_test_results. It is broken out by the Grade and Year values, so we can select a single grade from a single year:


In [70]:
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Year"] == 2011]
data["math_test_results"] = data["math_test_results"][data["math_test_results"]["Grade"] == "8"]

Finally, graduation also needs to be condensed:


In [71]:
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Cleaning and exploring the data is critical before working on the core of the project. A good, consistent, unified dataset will make the analysis much faster.


Computing combined variables


Computing new variables can speed up our analysis by letting us make comparisons faster, and it makes some comparisons possible that otherwise wouldn't be. The first thing we can do is compute a total SAT score from the individual columns SAT Math Avg. Score, SAT Critical Reading Avg. Score, and SAT Writing Avg. Score. In the code below, we:


- convert each of the SAT score columns from strings to numbers,
- add the three columns together to get sat_score, the total SAT score.

In [72]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = data["sat_results"][c].convert_objects(convert_numeric=True)

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

Next we need to parse out the coordinates of each school so that we can make maps marking each school's position. The coordinates are stored at the end of the Location 1 column and need to be extracted and converted to floats.



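A minimal sketch of that extraction, assuming the Location 1 field ends with a "(lat, lon)" pair as in the hs_directory rows shown earlier (the helper name extract_coord is hypothetical):

```python
import pandas

# Hypothetical miniature of hs_directory; in the real data "Location 1"
# holds a street address followed by a "(lat, lon)" pair on the last line.
hs_directory = pandas.DataFrame({
    "Location 1": ["883 Classon Avenue\nBrooklyn, NY 11225\n(40.6702, -73.9616)"],
})

def extract_coord(loc, index):
    # take the trailing "(lat, lon)" chunk and pick out lat (index 0) or lon (index 1)
    coords = loc.split("\n")[-1].replace("(", "").replace(")", "")
    return coords.split(", ")[index]

hs_directory["lat"] = hs_directory["Location 1"].apply(lambda x: extract_coord(x, 0))
hs_directory["lon"] = hs_directory["Location 1"].apply(lambda x: extract_coord(x, 1))

# convert from strings to floats so the values can be averaged and plotted
for c in ["lat", "lon"]:
    hs_directory[c] = hs_directory[c].astype(float)
```

The same transformation applied to the real hs_directory frame yields the lat and lon columns used for the maps later on.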
Let's print out our datasets and see what we have:


In [74]:
for k, v in data.items():
    print(k)
    print(v.head())

The condensed datasets are printed in turn:

math_test_results — now only grade 8 results from 2011, one row per school.

survey — the trimmed survey columns (response rates, N_s/N_t/N_p counts, and the saf/com/eng/aca scores), one row per DBN.

ap_2010 — AP Test Takers, Total exams taken, and Number of Exams with scores 3 4 or 5, unchanged.

sat_results — the three average section scores are now numeric, and the new sat_score column holds their sum (e.g. 355.0 + 404.0 + 363.0 = 1122.0 for the first school).

class_size — one aggregated row per DBN, with averaged student counts, section counts, and class sizes.

demographics — only the rows for the 20112012 school year remain.

graduation — only the 2006 Total cohort rows remain.

hs_directory — unchanged apart from the new DBN, lat, and lon columns.

Combining the datasets


After all this preparation, we can finally merge all the datasets on the DBN column. The result is a dataset with hundreds of columns drawn from all the originals. When merging, it is important to note that some datasets are missing schools that exist in sat_results. To get around this, we merge those datasets with an outer join, so that no data is lost. In real analysis, missing data is common; demonstrating that you can explore it and handle it gracefully is an important part of a portfolio.


You can read about different types of joins here .
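To illustrate the difference on hypothetical toy frames (the DBN values are just examples):

```python
import pandas as pd

# one school only in the left frame, one only in the right, one in both
left = pd.DataFrame({"DBN": ["01M292", "01M448"], "sat_score": [1122.0, 1172.0]})
right = pd.DataFrame({"DBN": ["01M448", "01M450"], "AP Test Takers": [39.0, 19.0]})

# inner join keeps only DBNs present in both frames
inner = left.merge(right, on="DBN", how="inner")

# outer join keeps every DBN from either frame, filling the gaps with NaN
outer = left.merge(right, on="DBN", how="outer")
```

Here the inner join produces one row (only 01M448 appears in both), while the outer join produces three, with NaN for the columns a school is missing — which is exactly why we use an outer join for the datasets that lack some sat_results schools.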


In the code below, we:


- loop through each key of the data dictionary,
- print the number of non-unique DBNs in each dataset,
- decide on a join strategy, inner or outer, per dataset,
- merge all the datasets into a single DataFrame, full, joining on the DBN column.

In [75]:
flat_data_names = [k for k, v in data.items()]
flat_data = [data[k] for k in flat_data_names]
full = flat_data[0]
for i, f in enumerate(flat_data[1:]):
    name = flat_data_names[i+1]
    print(name)
    print(len(f["DBN"]) - len(f["DBN"].unique()))
    join_type = "inner"
    if name in ["sat_results", "ap_2010", "graduation"]:
        join_type = "outer"
    if name not in ["math_test_results"]:
        full = full.merge(f, on="DBN", how=join_type)

full.shape

survey 0
ap_2010 1
sat_results 0
class_size 0
demographics 0
graduation 0
hs_directory 0

Out[75]: (374, 174)

Adding in values


Now that we have our full DataFrame, we have almost all the information needed for the analysis, although some pieces are still missing. We may want to correlate the Advanced Placement (AP) exam results with SAT scores, but first we need to convert those columns to numbers and fill in the missing values:


In [76]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
    full[col] = full[col].convert_objects(convert_numeric=True)
full[cols] = full[cols].fillna(value=0)
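Note that convert_objects was later removed from pandas; in current versions the usual replacement is pd.to_numeric with errors='coerce'. A sketch on a hypothetical frame (the "s" entry stands in for a non-numeric value in the source data):

```python
import pandas as pd

# hypothetical miniature of `full` with one non-numeric entry
full = pd.DataFrame({
    "AP Test Takers ": ["39", "s", "24"],
})

cols = ["AP Test Takers "]
for col in cols:
    # errors='coerce' turns entries like "s" into NaN instead of raising
    full[col] = pd.to_numeric(full[col], errors="coerce")
full[cols] = full[cols].fillna(value=0)
```

After this, the column is float-typed with the unparseable entry replaced by 0, matching what the convert_objects version produced.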

Next, we need to compute a school_dist column indicating the school district. It will let us compare school districts and draw district-level statistics using the district maps we downloaded:


 In [77]: full["school_dist"] = full["DBN"].apply(lambda x: x[:2]) 

Finally, we need to fill in all the missing values in full with the mean of the respective column so that we can compute correlations:


 In [79]: full = full.fillna(full.mean()) 

Computing correlations


A good way to get an overview of a dataset and see how its columns relate to the one you care about is to compute correlations. This shows which columns are associated with the column of interest. It can be done with the corr method on pandas DataFrames. The closer the correlation is to 0, the weaker the relationship; the closer to 1, the stronger the direct relationship; and the closer to -1, the stronger the inverse relationship:


In [80]: full.corr()['sat_score']
Out[80]:
Year                                              NaN
Number Tested                            8.127817e-02
rr_s                                     8.484298e-02
rr_t                                    -6.604290e-02
rr_p                                     3.432778e-02
N_s                                      1.399443e-01
N_t                                      9.654314e-03
N_p                                      1.397405e-01
saf_p_11                                 1.050653e-01
com_p_11                                 2.107343e-02
eng_p_11                                 5.094925e-02
aca_p_11                                 5.822715e-02
saf_t_11                                 1.206710e-01
com_t_11                                 3.875666e-02
eng_t_10                                          NaN
aca_t_11                                 5.250357e-02
saf_s_11                                 1.054050e-01
com_s_11                                 4.576521e-02
eng_s_11                                 6.303699e-02
aca_s_11                                 8.015700e-02
saf_tot_11                               1.266955e-01
com_tot_11                               4.340710e-02
eng_tot_11                               5.028588e-02
aca_tot_11                               7.229584e-02
AP Test Takers                           5.687940e-01
Total Exams Taken                        5.585421e-01
Number of Exams with scores 3 4 or 5     5.619043e-01
SAT Critical Reading Avg. Score          9.868201e-01
SAT Math Avg. Score                      9.726430e-01
SAT Writing Avg. Score                   9.877708e-01
...
SIZE OF SMALLEST CLASS                   2.440690e-01
SIZE OF LARGEST CLASS                    3.052551e-01
SCHOOLWIDE PUPIL-TEACHER RATIO                    NaN
schoolyear                                        NaN
frl_percent                             -7.018217e-01
total_enrollment                         3.668201e-01
ell_num                                 -1.535745e-01
ell_percent                             -3.981643e-01
sped_num                                 3.486852e-02
sped_percent                            -4.413665e-01
asian_num                                4.748801e-01
asian_per                                5.686267e-01
black_num                                2.788331e-02
black_per                               -2.827907e-01
hispanic_num                             2.568811e-02
hispanic_per                            -3.926373e-01
white_num                                4.490835e-01
white_per                                6.100860e-01
male_num                                 3.245320e-01
male_per                                -1.101484e-01
female_num                               3.876979e-01
female_per                               1.101928e-01
Total Cohort                             3.244785e-01
grade_span_max                          -2.495359e-17
expgrade_span_max                                 NaN
zip                                     -6.312962e-02
total_students                           4.066081e-01
number_programs                          1.166234e-01
lat                                     -1.198662e-01
lon                                     -1.315241e-01
Name: sat_score, dtype: float64

This data gives us a number of hints that need to be worked out:


- total_enrollment correlates strongly with sat_score, which is surprising: you would expect smaller schools, where students get more individual attention, to score higher.
- The percentage of females at a school (female_per) correlates positively with SAT score, while the percentage of males (male_per) correlates negatively.
- None of the survey responses correlate strongly with SAT scores.
- There is significant racial inequality in scores (white_per and asian_per correlate positively; black_per and hispanic_per negatively).
- ell_percent correlates strongly and negatively with SAT scores.


Each item is a potential angle for research and a story told through data.


Note trans.

Just in case, a reminder: correlation (and covariance) measure only linear dependence. If a relationship exists but is nonlinear, say, quadratic, the correlation will not show anything sensible.
And, of course, correlation in no way implies causation; it only says that two quantities tend to change in proportion. What follows is simply an example of searching for a valid pattern, which is then analyzed.


Defining the context


Before diving into the data, it is worth defining the context, both for ourselves and for those who will read the analysis later. A good way to do this is with exploratory charts or maps. Here we will plot the locations of our schools on a map, which will help readers understand the problem we are studying.


In the code below, we:


- set up a map centered on New York City;
- add a marker to the map for each high school in the city;
- display the map.

 In [82]:
 import folium
 from folium import plugins

 schools_map = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
 marker_cluster = folium.MarkerCluster().add_to(schools_map)
 for name, row in full.iterrows():
     folium.Marker([row["lat"], row["lon"]],
                   popup="{0}: {1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
 schools_map.create_map('schools.html')
 schools_map
 Out[82]:


The map is helpful, but with this many markers it is hard to see where most of the schools actually are. Instead, we will draw a heatmap:


 In [84]:
 schools_heatmap = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
 schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in full.iterrows()]))
 schools_heatmap.save("heatmap.html")
 schools_heatmap
 Out[84]:



Heatmaps are good for showing gradients, but we need something with more structure to plot differences in SAT scores across the city. School districts are a good unit for this: each district has its own administration, and New York City has a few dozen of them, each covering a small geographic area.


We can compute the average SAT score by school district and then plot it on a map. In the code below, we group the data by district and take the mean of each column:



 In [ ]:
 district_data = full.groupby("school_dist").agg(np.mean)
 district_data.reset_index(inplace=True)
 # Strip leading zeros so the values match the district ids in the GeoJSON data
 district_data["school_dist"] = district_data["school_dist"].apply(lambda x: str(int(x)))

Now we can plot the average SAT score in each school district. To do this, we read in the district shapes in GeoJSON format, match each shape to its row of district data via the school_dist column, and build a choropleth map:


 In [85]:
 def show_district_map(col):
     geo_path = 'schools/districts.geojson'
     districts = folium.Map(location=[full['lat'].mean(), full['lon'].mean()], zoom_start=10)
     districts.geo_json(
         geo_path=geo_path,
         data=district_data,
         columns=['school_dist', col],
         key_on='feature.properties.school_dist',
         fill_color='YlGn',
         fill_opacity=0.7,
         line_opacity=0.2,
     )
     districts.save("districts.html")
     return districts

 show_district_map("sat_score")
 Out[85]:



Now that we have set the context by plotting where the schools are and how SAT scores vary by district, readers of our analysis have a much better idea of what the dataset is about. With the stage set, we can move on to the angles we identified earlier. The first one is the relationship between the number of students enrolled in a school and its SAT score.


We can explore it with a scatter plot of enrollment against SAT score:


 In [87]:
 %matplotlib inline
 full.plot.scatter(x='total_enrollment', y='sat_score')
 Out[87]: <matplotlib.axes._subplots.AxesSubplot at 0x10fe79978>


As you can see, there is a cluster at the bottom left with low total enrollment and low SAT scores. Aside from this cluster, there is only a slight positive correlation between enrollment and SAT score. Plotting the data reveals patterns that raw correlation numbers hide.


We can dig into this by listing the names of the schools in the low-enrollment, low-score cluster:


 In [88]:
 full[(full["total_enrollment"] < 1000) & (full["sat_score"] < 1000)]["School Name"]
 Out[88]:
 34     INTERNATIONAL SCHOOL FOR LIBERAL ARTS
 143    NaN
 148    KINGSBRIDGE INTERNATIONAL HIGH SCHOOL
 203    MULTICULTURAL HIGH SCHOOL
 294    INTERNATIONAL COMMUNITY HIGH SCHOOL
 304    BRONX INTERNATIONAL HIGH SCHOOL
 314    NaN
 317    HIGH SCHOOL OF WORLD CULTURES
 320    BROOKLYN INTERNATIONAL HIGH SCHOOL
 329    INTERNATIONAL HIGH SCHOOL AT PROSPECT
 331    IT TAKES A VILLAGE ACADEMY
 351    PAN AMERICAN INTERNATIONAL HIGH SCHOO
 Name: School Name, dtype: object

Some searching on Google shows that most of these schools serve students who are learning English, which is why their enrollment is low. This exploration suggests that it is not total enrollment as such that drives the correlation with SAT score; it is whether the students at a school are learning English as a second language.
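This pattern, where a third variable (here, the share of English language learners) creates an apparent link between two others, is easy to illustrate with a tiny simulation. All numbers below are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical confounder: share of English language learners per school
ell = rng.uniform(0, 1, n)

# Both enrollment and SAT score depend on ell, not on each other
enrollment = 1000 - 600 * ell + rng.normal(scale=50, size=n)
sat = 1400 - 400 * ell + rng.normal(scale=40, size=n)

# The two variables still correlate strongly, purely through the confounder
r = np.corrcoef(enrollment, sat)[0, 1]
print(r > 0.5)  # True, although neither variable influences the other
```

That is why a surprising correlation, like enrollment versus SAT score here, is a starting point for digging, not a conclusion.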



Now that we suspect the percentage of English language learners matters, we can explore it directly. The ell_percent column is the percentage of students at each school who are learning English. We can make a scatter plot of this relationship:


 In [89]:
 full.plot.scatter(x='ell_percent', y='sat_score')
 Out[89]: <matplotlib.axes._subplots.AxesSubplot at 0x10fe824e0>


It looks like there is a group of schools with a high ell_percent and low average SAT scores. We can investigate this at the district level by mapping the percentage of English learners per district:


 In [90]:
 show_district_map("ell_percent")
 Out[90]:



Comparing this map with the SAT map, districts with a high proportion of English learners tend to have low SAT scores, and vice versa.



Students, parents, and teachers at New York schools fill out surveys about safety, communication, engagement, and academic expectations. It makes sense to check whether these survey metrics relate to SAT results. Let's plot their correlations with sat_score:


 In [91]:
 full.corr()["sat_score"][["rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p",
                           "saf_tot_11", "com_tot_11", "aca_tot_11", "eng_tot_11"]].plot.bar()
 Out[91]: <matplotlib.axes._subplots.AxesSubplot at 0x114652400>


Surprisingly, the two most correlated factors are N_p and N_s, the numbers of parents and students who responded to the survey. Both correlate strongly with total enrollment, so they are probably biased by the same ELL effect we saw above. The next most correlated metric is saf_t_11, which measures how safe people feel at the school. That makes sense: the safer a school, the more comfortable students are studying. None of the other factors, such as engagement, communication, and academic expectations, correlate with SAT scores. This may mean that NYC is asking the wrong questions in its surveys, or focusing on the wrong factors (if the goal is to raise SAT scores, that is; it may well not be).



Another angle involves race and SAT scores. A large difference in correlations showed up earlier, and plotting it will help us see it:


 In [92]:
 full.corr()["sat_score"][["white_per", "asian_per", "black_per", "hispanic_per"]].plot.bar()
 Out[92]: <matplotlib.axes._subplots.AxesSubplot at 0x108166ba8>


It looks like higher percentages of white and asian students go with higher SAT scores, while higher percentages of black and hispanic students go with lower scores. For hispanic students this may partly reflect recent immigrants who are English learners. We can map the hispanic percentage by district to eyeball the relationship:


 In [93]:
 show_district_map("hispanic_per")
 Out[93]:


There does appear to be some overlap with the ELL percentage map, but this and other racial differences in SAT scores would require more digging.



The last angle is gender differences. We noted earlier that a higher share of female students correlates with higher SAT scores. We can visualize this with a bar chart:


 In [94]:
 full.corr()["sat_score"][["male_per", "female_per"]].plot.bar()
 Out[94]: <matplotlib.axes._subplots.AxesSubplot at 0x10774d0f0>


To dig into the correlation, we can make a scatter plot of female_per against sat_score:


 In [95]:
 full.plot.scatter(x='female_per', y='sat_score')
 Out[95]: <matplotlib.axes._subplots.AxesSubplot at 0x104715160>


It looks like there is a cluster of schools with a high percentage of females and very high SAT scores (in the upper right). We can get the names of the schools in this cluster:


 In [96]:
 full[(full["female_per"] > 65) & (full["sat_score"] > 1400)]["School Name"]
 Out[96]:
 3      PROFESSIONAL PERFORMING ARTS HIGH SCH
 92     ELEANOR ROOSEVELT HIGH SCHOOL
 100    TALENT UNLIMITED HIGH SCHOOL
 111    FIORELLO H. LAGUARDIA HIGH SCHOOL OF
 229    TOWNSEND HARRIS HIGH SCHOOL
 250    FRANK SINATRA SCHOOL OF THE ARTS HIGH SCHOOL
 265    BARD HIGH SCHOOL EARLY COLLEGE
 Name: School Name, dtype: object

, , . . , , , , , .


Note trans.

Note that female_per is a percentage out of 100, not a fraction of 1 (so the filter above selects schools that are more than 65% female).



So far we have looked at demographic angles. Another angle the data supports is the link between taking Advanced Placement exams and SAT scores. It makes sense that they would be correlated, since strong academic achievers tend to do well on both. We can compute the share of students at each school who took at least one AP exam and plot it against SAT score:


 In [98]:
 full["ap_avg"] = full["AP Test Takers "] / full["total_enrollment"]
 full.plot.scatter(x='ap_avg', y='sat_score')
 Out[98]: <matplotlib.axes._subplots.AxesSubplot at 0x11463a908>


It looks like there is an interesting cluster in the upper right, with a high share of AP test takers and high SAT scores. We can get the names of its schools:


 In [99]:
 full[(full["ap_avg"] > .3) & (full["sat_score"] > 1700)]["School Name"]
 Out[99]:
 92     ELEANOR ROOSEVELT HIGH SCHOOL
 98     STUYVESANT HIGH SCHOOL
 157    BRONX HIGH SCHOOL OF SCIENCE
 161    HIGH SCHOOL OF AMERICAN STUDIES AT LE
 176    BROOKLYN TECHNICAL HIGH SCHOOL
 229    TOWNSEND HARRIS HIGH SCHOOL
 243    QUEENS HIGH SCHOOL FOR THE SCIENCES A
 260    STATEN ISLAND TECHNICAL HIGH SCHOOL
 Name: School Name, dtype: object

Some searching shows that these are mostly highly selective schools with entrance exams, so it makes sense that their students take many AP exams.



In data science, the story is never fully finished. By sharing your analysis with others, you let them extend it in whatever direction interests them. In this post, for example, there are quite a few angles that we explored only partially and could have dug into further.


One of the best ways to start telling stories through data is to extend or replicate an analysis someone else has done. If you decide to go that route, you are welcome to take the analysis from this post further and see what you can find. If you do, it would be interesting to see the result.


Note trans.

And if anyone does extend it, share a link in the comments; I would be curious to have a look too. :)


What's next


If you enjoyed this project, a good next step is a more technical one, such as a complete machine learning project for your portfolio.


At Dataquest, the interactive guided projects are built around exactly this kind of work, from raw data to a finished portfolio piece. If that sounds interesting, you can try them there.



Source: https://habr.com/ru/post/331528/

