William Ming

- William Ming (@CPWill)

from class Sat 7.00pm

Feedback from Instructor

- Well done! Comprehensive analysis that is insightful and well-substantiated. It was great to see how you applied a range of skills learnt from the DS102 in this project.
- Your project was an enjoyable read and I really applaud your effort in documenting your purpose-driven methodology and your code chunks. As a result, I found your notebook very coherent and easy to follow.
- The stacked bar plot under Step 3 looks great, but you might want to consider having the student progress line plot on another graph. It is mildly confusing for readers to see the plots superimposed, and the different scales on the left and right y-axes confound the analysis a little. I think it could suffice to plot the line plots separately and point out how these trends correlate with what's shown by the stacked bar plot.
- Do keep in mind that correlation does not imply causality - even when there is strong correlation between two variables, it would be challenging to even make inferences that imply cause-and-effect relationships, such as government education expenditure on literacy rate for instance. Literacy rates could very well be directly improved by government expenditure on supporting underprivileged groups (that improved overall standard of living), or even cultural factors that gave rise to such a trend.
- Overall great job on the project! :)

Component | Score |
---|---|
Executive Summary | 3/3 |
Problem Statement | 3/3 |
Methodology | 12.5/14 |
Total | 18.5/20 |

Singapore has a strong education system which has markedly improved over the decades, largely as a result of changes in the government's education policy. This study investigates how changes in the education system and policies have impacted 3 key educational outcomes: literacy rate, O Level pass percentage, and progression to higher levels of education. This was done by applying statistical and data visualisation methods from the matplotlib and statsmodels Python libraries, such as scatter plots and correlation analysis.

It was found that the government's increased spending on education is correlated with a higher literacy rate. From this, it can be inferred that basic reading education has been improved due to the increased expenditure. Another change in the education system - decreasing class size - has helped more O Level students pass English and Mathematics, but this effect is minimal for Mother Tongue. Hence, it is recommended that research is done to examine the difficulties faced in MTL so as to craft more effective MTL promotion campaigns. Finally, although more teachers have higher academic qualifications, the study on the impacts of this trend is inconclusive. Further research must be done to confirm the various inferences made from the findings of this study.

Education in Singapore has undeniably improved since the 1980s and this is due in large part to the education policies implemented by the Singapore government. This study aims to understand how different changes in the education policies and system have impacted key educational outcomes. This investigation is structured into 3 parts, each answering one of the following questions:

```
Part 1: How effective has the increasing education expenditure been in increasing our national literacy rate?
Part 2: Do smaller class sizes help students achieve better academic results?
Part 3: How have the academic qualifications of teachers changed and does this have any impact on students?
```

Part 1 looks at education at a national level, whereas Parts 2 and 3 zoom in specifically on secondary school education.

Part 1:

- `government-expenditure-on-education.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `literacy-rate-annual.csv` from Data.gov.sg, retrieved on 1 Jun 2019

Part 2:

- `pupils-per-teacher-in-secondary-schools.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `percentage-of-gce-o-level-students-who-passed-english-language.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `percentage-of-gce-o-level-students-who-passed-mathematics.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `percentage-of-gce-o-level-students-who-passed-mtl.csv` from Data.gov.sg, retrieved on 1 Jun 2019

Part 3:

- `teachers-in-schools-academic-qualification.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `percentage-of-o-level-cohort-that-progressed-to-post-secondary-education.csv` from Data.gov.sg, retrieved on 1 Jun 2019
- `percentage-of-n-level-cohort-that-progressed-to-post-secondary-education.csv` from Data.gov.sg, retrieved on 1 Jun 2019

In [1]:

```
#import the relevant libraries and modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from sklearn import linear_model
from functools import reduce
%matplotlib inline
```

**DATA CLEANING:**

In [2]:

```
#read relevant csv files into dataframes
govt_df = pd.read_csv('government-expenditure-on-education.csv')
literacy_df = pd.read_csv('literacy-rate-annual.csv')
display(govt_df.head(),govt_df.describe())
display(literacy_df.head(),literacy_df.describe())
```

In [3]:

```
#merge the 2 dataframes, only including years for which both expenditure and literacy data are available
govt_lit_df = govt_df.merge(literacy_df, how = 'inner', on = 'year')
#express expenditure in millions
govt_lit_df['expenditure_in_millions'] = govt_lit_df['total_expenditure_on_education']/1000000
#remove the unnecessary columns and rename the columns
govt_lit_df.drop(columns = ['level_1','total_expenditure_on_education'], inplace = True)
govt_lit_df.columns = ['year','literacy_rate','expenditure_in_millions']
govt_lit_df.head()
```


**DATA ANALYSIS**:

First, line charts are plotted for government expenditure on education vs year and literacy rate vs year to examine how each variable has changed over the years. By plotting on a shared x-axis, we can see how both variables change over the same period of time.

In [4]:

```
fig = plt.figure(figsize = (16,8))
ax1 = fig.add_subplot(111)
ax2 = ax1.twinx()
govt_lit_df.plot(kind='line', marker = 'o', color = 'r', x = 'year', y = 'expenditure_in_millions',
ax = ax2, grid = True, xticks = range(1981,2019,2), yticks = range (0,17, 2))
govt_lit_df.plot(kind='line', marker = 'x', color = 'b', x = 'year', y = 'literacy_rate', ax = ax1,
yticks = range(82,99,2), grid = True)
ax1.set_ylabel('Literacy Rate (%)', fontsize = 13, labelpad = 15)
ax2.set_ylabel('Total Expenditure on Education (million SGD)', fontsize = 13, rotation = 270, labelpad = 20)
ax1.set_xlabel('Year', fontsize = 13)
ax1.legend(['Literacy Rate (%)'], fontsize = 13, loc = (0.02,0.93))
ax2.legend(['Total Expenditure on Education (million SGD)'], fontsize = 13, loc = (0.02,0.88))
ax1.set_title('Change in Total Expenditure on Education and Literacy Rate from 1981 to 2017', fontsize = 16)
plt.show()
```

As shown in the graph, from 1981 to 2017, both government expenditure on education and literacy rate have increased. The former suggests that the government is putting a greater focus on improving education while the latter suggests that improvements in basic reading and writing education have been successful, translating to higher literacy rates.

Although both variables have increased in tandem over the years, there are exceptions to this pattern. For example, from 1983 to 1989, government expenditure on education remained relatively constant at around 1.8 million SGD, yet the literacy rate rose from around 84.5% to around 88.5% over the same period. A similar phenomenon is seen from 2000 to 2005. This suggests that factors other than expenditure also affect the literacy rate. Further research would be required to determine which factors are more important.

Nonetheless, there is still an obvious positive correlation between government expenditure on education and literacy rate. Therefore, we shall quantify the strength of this correlation by finding the correlation coefficient and visualise this by drawing the regression line.

In [5]:

```
#create input data for OLS model
X = govt_lit_df[['expenditure_in_millions']]
y = govt_lit_df['literacy_rate']
#statsmodel OLS model
X = sm.add_constant(X)
model = sm.OLS(y,X)
result = model.fit()
govt_lit_df['fitted_LiteracyRate'] = result.fittedvalues
display(result.summary())
print('The Pearson product-moment correlation coefficient is ',
str(np.corrcoef(govt_lit_df['expenditure_in_millions'],govt_lit_df['literacy_rate'])[1][0]))
```

An R-squared value of 0.859 and a correlation coefficient of +0.93 imply that there is a strong positive correlation between government expenditure on education and Singapore's literacy rate. The small (or 0) p-values for both the t-tests and the F-test suggest that this correlation is statistically significant.
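For a regression with a single predictor and an intercept, R-squared is the square of the Pearson correlation coefficient, which is why the two figures above agree (0.859 is roughly 0.927 squared). The sketch below checks this identity on made-up numbers rather than the project's data; the arrays `x` and `y` are illustrative assumptions.

```python
import numpy as np

# Illustrative values only (not the Data.gov.sg figures)
x = np.array([1.0, 2.0, 3.5, 5.0, 7.5, 10.0, 12.5])
y = np.array([84.0, 86.5, 89.0, 91.5, 94.0, 96.0, 97.0])

# Pearson correlation coefficient
r = np.corrcoef(x, y)[0, 1]

# Fit y = slope * x + intercept by ordinary least squares
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

# R-squared = 1 - SS_res / SS_tot
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"r = {r:.4f}, r^2 = {r**2:.4f}, R-squared = {r_squared:.4f}")
```

The identity holds only for one predictor plus a constant; with more predictors, R-squared is no longer the square of any single pairwise correlation.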

In [6]:

```
fig2 = plt.figure(figsize=(12,6))
ax3 = fig2.add_subplot(111)
govt_lit_df.plot(kind = 'scatter', x = 'expenditure_in_millions', y = 'literacy_rate', ax = ax3)
govt_lit_df.plot(kind = 'line', x = 'expenditure_in_millions', y = 'fitted_LiteracyRate', ax = ax3, color = 'r')
ax3.set_title('Literacy Rate(%) vs Expenditure on Education(in million SGD)')
ax3.set_xlabel('Government Expenditure on Education (in million SGD)')
ax3.set_ylabel('Literacy Rate(%)')
plt.show()
```

As expected from the summary statistics, there is a strong positive correlation between government expenditure on education and literacy rate. Most of the points in the scatter plot are evenly distributed around the line of best fit (coloured red in the figure above).
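One way to check how evenly points sit around a fitted line is to inspect the residuals. The sketch below does this on made-up data; in the notebook, the residuals would instead be available from the fitted statsmodels result (its `resid` attribute). The arrays here are assumptions for illustration only.

```python
import numpy as np

# Made-up data for illustration; not the project's dataset
x = np.array([1.0, 2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
y = np.array([85.0, 87.0, 90.5, 93.0, 94.5, 96.0, 96.5])

# Least-squares line and the residuals around it
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With an intercept, OLS residuals always sum to (numerically) zero,
# so the sum says nothing about fit quality; the sign pattern does.
print("sum of residuals:", round(float(residuals.sum()), 10))
print("signs along x:", np.sign(residuals).astype(int).tolist())
```

A long run of same-sign residuals at either end of the x range would hint at curvature, for example a literacy rate that levels off as it approaches 100%.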

As mentioned above, Part 2 focuses on secondary school education. In this section, average class size is approximated by the pupil-to-teacher ratio (PTTR). A smaller PTTR means that there are fewer students per teacher and hence implies a smaller average class size. Academic performance is measured by the percentage of O Level students who passed each subject. The subjects studied are English, Mathematics and Mother Tongue.

**DATA CLEANING:**

In [7]:

```
#read relevant csv files into dataframes
pttr_df = pd.read_csv('pupils-per-teacher-in-secondary-schools.csv')
english_df = pd.read_csv('percentage-of-gce-o-level-students-who-passed-english-language.csv')
math_df = pd.read_csv('percentage-of-gce-o-level-students-who-passed-mathematics.csv')
mtl_df = pd.read_csv('percentage-of-gce-o-level-students-who-passed-mtl.csv')
display(pttr_df.head(), pttr_df.describe(),
english_df.head())
#only english_df.head() is displayed as the other 2 subject dataframes are similar
```

In [8]:

```
#merges all three subject dataframes on the 'year' and 'race' columns
def merge_df(df1,df2):
    return df1.merge(df2, how = 'inner', on = ['year','race'])
subject_list = [english_df,math_df,mtl_df]
subject_dfs_list = [x[x['race']=='Overall'] for x in subject_list]
merged_subject_df = reduce(merge_df,subject_dfs_list)
display(merged_subject_df.head())
```
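The `reduce` pattern used in the cell above can be seen more easily on toy data. This is a minimal sketch with made-up frames and values; the names `el`, `ma`, `mt` and `merge_on_keys` are hypothetical stand-ins, not part of the notebook.

```python
from functools import reduce
import pandas as pd

# Toy stand-ins for the three subject dataframes (values are made up)
el = pd.DataFrame({"year": [2015, 2016], "race": ["Overall"] * 2, "el": [87.0, 88.0]})
ma = pd.DataFrame({"year": [2015, 2016], "race": ["Overall"] * 2, "math": [88.0, 89.0]})
mt = pd.DataFrame({"year": [2016, 2017], "race": ["Overall"] * 2, "mtl": [96.0, 96.5]})

def merge_on_keys(left, right):
    # Inner join keeps only (year, race) pairs present in both frames
    return left.merge(right, how="inner", on=["year", "race"])

# reduce applies the merge pairwise: (el merged with ma) merged with mt
merged = reduce(merge_on_keys, [el, ma, mt])
print(merged)
```

Because every join is an inner join, only years present in all three frames survive, so a single short dataset limits the span of the merged result.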

In [9]:

```
#merge the PTTR dataframe with the subjects dataframe and drop the unnecessary column
pttr_subject_df = pttr_df.merge(merged_subject_df,how='inner',on='year')
pttr_subject_df = pttr_subject_df.drop(columns='race')
display(pttr_subject_df.head())
```

**DATA ANALYSIS:**

A heatmap of the correlation between all pairs of variables in `pttr_subject_df` will be used to examine the strength of the correlation between PTTR and academic performance. Scatter plots are then used to further visualise this correlation.
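As a quick illustration of what the correlation matrix contains, the sketch below builds a toy frame and reads off one coefficient. The column names mirror those in the notebook, but the values are invented and deliberately lie on a straight line.

```python
import pandas as pd

# Made-up values, constructed so the pass rate rises linearly
# as PTTR falls; real data will not be this tidy
toy = pd.DataFrame({
    "sec_pupil_to_teacher": [20.0, 18.0, 16.0, 14.0, 12.0],
    "percentage_passed_olevel_el": [80.0, 82.0, 84.0, 86.0, 88.0],
})

corr = toy.corr()  # Pearson correlation by default
r = corr.loc["sec_pupil_to_teacher", "percentage_passed_olevel_el"]
print(round(r, 3))
```

A coefficient near -1 means the two columns move in opposite directions almost perfectly; the heatmap colours each such coefficient for every pair of columns.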

In [10]:

```
#get an array of the pearson correlation coefficients between all variables in the dataframe
correlation_df = pttr_subject_df.corr()
#plot a heatmap
fig3 = plt.figure(figsize=(14,14))
ax4 = fig3.add_subplot(211)
sns.heatmap(correlation_df, cmap = 'RdBu_r', ax = ax4)
ax4.set_title('Correlation between year, PTTR and percentage O level passes by subjects', fontsize = 16)
#plot the various scatter plots
ax5 = fig3.add_subplot(234)
pttr_subject_df.plot(kind='scatter', x='sec_pupil_to_teacher', y='percentage_passed_olevel_el', ax=ax5)
ax5.set_title('Percentage O Level EL pass vs PTTR')
ax6 = fig3.add_subplot(235)
pttr_subject_df.plot(kind='scatter', x='sec_pupil_to_teacher', y='percentage_passed_olevels_math', ax=ax6)
ax6.set_title('Percentage O Level Math pass vs PTTR')
ax7 = fig3.add_subplot(236)
pttr_subject_df.plot(kind='scatter', x='sec_pupil_to_teacher', y='percentage_passed_olevels_mtl', ax=ax7)
ax7.set_title('Percentage O Level MTL pass vs PTTR')
fig3.subplots_adjust(wspace = 0.5,hspace = 0.8)
plt.show()
#correlation coefficients for reference
print('Pearson Correlation Coefficient between:')
print('1) PTTR and Percentage of O Level Students who passed English: ',
correlation_df['sec_pupil_to_teacher']['percentage_passed_olevel_el'])
print('2) PTTR and Percentage of O Level Students who passed Math: ',
correlation_df['sec_pupil_to_teacher']['percentage_passed_olevels_math'])
print('3) PTTR and Percentage of O Level Students who passed MTL: ',
correlation_df['sec_pupil_to_teacher']['percentage_passed_olevels_mtl'])
```

This heatmap and the negative sign of the correlation coefficients imply that, overall, PTTR is inversely correlated with pass percentages for the three O Level subjects. The scatter plots for English and Mathematics support this, whereas the scatter plot for Mother Tongue Language (MTL) suggests that there is very little correlation between the O Level MTL pass percentage and PTTR.

Overall, the percentage of O Level students who passed English, Mathematics and MTL has increased as the PTTR in secondary schools decreased over the years. This may be because a smaller PTTR (implying a smaller class size) encourages student participation and ensures that each student gets more individual attention from teachers. This correlation is slightly stronger for Mathematics than for English, but much weaker for MTL. There are likely other factors impeding improvements in MTL, such as the quality of MTL education, teachers and teaching methods. Moreover, the strong correlation found for English and Mathematics does not necessarily mean that PTTR is the main cause of the better performance. Further research must be done to determine what has improved students' performance in these subjects.
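Because PTTR and pass rates both move steadily over the years, part of their correlation may simply reflect a shared time trend. A rough check, sketched below on synthetic series (the numbers are generated, not taken from the datasets), is to compare the correlation of the levels with the correlation of the year-to-year changes.

```python
import numpy as np

rng = np.random.default_rng(0)
years = np.arange(1981, 2018)
t = years - years[0]

# Two synthetic series that share nothing except a time trend
pttr = 25.0 - 0.3 * t + rng.normal(0, 0.5, t.size)       # falls over time
pass_rate = 70.0 + 0.5 * t + rng.normal(0, 0.8, t.size)  # rises over time

# Correlation of the raw levels is dominated by the common trend
r_levels = np.corrcoef(pttr, pass_rate)[0, 1]
# First differences remove the linear trend, leaving only the noise
r_changes = np.corrcoef(np.diff(pttr), np.diff(pass_rate))[0, 1]

print(f"levels:  r = {r_levels:.2f}")
print(f"changes: r = {r_changes:.2f}")
```

If the same comparison on the real data showed a strong correlation in levels but a weak one in changes, that would be a sign that the passage of time, rather than class size itself, accounts for much of the relationship.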

It is also worth noting that the increases in percentage pass for English and Mathematics over the years are much larger than the increase in percentage pass for MTL (seen in the heatmap). This may be due to the increasing focus on STEM (Science, Technology, Engineering and Mathematics) and the success of English-promoting campaigns such as the Speak Good English Movement. The improvements in English and Mathematics are particularly important as our increasingly globalised and digitalised economy necessitates better communication and analytical skills.

On the other hand, this observation also suggests that government efforts to promote the learning of MTL were not very effective. Further research could be done to examine how the learning of MTL could be improved. It is important to continue to protect our MTLs alongside English because of their cultural value, especially in Singapore's multi-racial society.

**DATA CLEANING:**

In [11]:

```
#read the relevant csv files into dataframes
teacher_acad_df = pd.read_csv('teachers-in-schools-academic-qualification.csv')
o_postsec_df = pd.read_csv('percentage-of-o-level-cohort-that-progressed-to-post-secondary-education.csv')
n_postsec_df = pd.read_csv('percentage-of-n-level-cohort-that-progressed-to-post-secondary-education.csv')
display(teacher_acad_df.head(), o_postsec_df.head())
#note MF means Male
```

In [12]:

```
#filter for secondary school teachers only
teacher_acad_df = teacher_acad_df[teacher_acad_df['level_of_school'] == 'SECONDARY'].drop(columns='level_of_school')
#filter for overall percentage rather than percentages for each race
#and merge for all years between the two datasets
o_postsec_df = o_postsec_df[o_postsec_df['race'] == 'Overall'].drop(columns='race')
n_postsec_df = n_postsec_df[n_postsec_df['race'] == 'Overall'].drop(columns='race')
postsec_df = n_postsec_df.merge(o_postsec_df, how = 'outer', on = 'year')
display(teacher_acad_df.head(), postsec_df)
```

In [13]:

```
#find the total number of teachers with each academic qualification in each year
teacher_acad_df = teacher_acad_df.groupby(['year','academic_qualification']).sum().reset_index()
teacher_acad_pivot = teacher_acad_df.pivot(index = 'year', columns = 'academic_qualification', values = 'number_of_teachers')
#flatten the multilevel index to prepare for plotting stacked bar graph
teacher_acad_flat = teacher_acad_pivot.rename_axis(None,axis=1).reset_index()
display(teacher_acad_flat.head())
#create a new dataframe to populate with percentage values for percentage stacked bar graph
teacher_acad_percent = pd.DataFrame(teacher_acad_flat['year'])
qual_list = list(teacher_acad_df['academic_qualification'].unique())
teacher_acad_flat['Total'] = 0
for qual in qual_list:
    teacher_acad_flat['Total'] += teacher_acad_flat[qual]
for qual in qual_list:
    teacher_acad_percent[qual + ' %'] = teacher_acad_flat[qual]/teacher_acad_flat['Total']*100
display(teacher_acad_percent.head())
```
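The row-wise percentage calculation in the cell above can also be written without explicit loops. The sketch below shows one possible vectorised form on made-up counts; the column labels are placeholders, not the actual qualification categories in the dataset.

```python
import pandas as pd

# Made-up counts; column names are placeholders for the qualification labels
counts = pd.DataFrame(
    {
        "year": [2016, 2017],
        "Pass Degree": [300, 280],
        "Honours": [600, 640],
        "Masters": [100, 80],
    }
).set_index("year")

# Divide each row by its row total, then scale to percentages
percent = counts.div(counts.sum(axis=1), axis=0) * 100
print(percent.round(1))
```

Each row of `percent` sums to 100, which is what a percentage stacked bar chart expects.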

**DATA ANALYSIS:**

In [14]:

```
fig4 = plt.figure(figsize = (20,14))
ax5 = fig4.add_subplot(111)
ax6 = ax5.twinx()
#percentage stacked bar graph to show change in academic qualifications of secondary school teachers
teacher_acad_percent[teacher_acad_percent['year'] >= 2007].plot(kind = 'bar', x = 'year', stacked=True, ax = ax5)
#line graphs to show change in percentage of O/N Level students who progressed to post secondary education
postsec_df.plot(kind='line', y = 'percentage_o_level_progressed_to_post_secondary_education',
ax = ax6, color = 'b', lw = 5)
postsec_df.plot(kind='line', y = 'percentage_progressed_to_post_sec_education',
ax = ax6, color = 'k', lw = 5)
#set various properties of the graph
ax5.set_ylim((0,120))
ax6.set_ylim((88,102))
ax5.set_title("Teachers' academic qualifications and Students' progress to post secondary education from 2007 to 2017",
fontsize = 18)
ax5.set_xlabel("Year", fontsize = 14)
ax5.set_ylabel("Percentage of Teachers with each academic qualification", fontsize = 14)
ax6.set_ylabel("Percentage of O/N Level Students", fontsize = 14, rotation = 270, labelpad = 20)
ax5.legend(loc=2, fontsize = 10)
ax6.legend(['% progressed to post secondary education (O LEVELS)',
'% progressed to post secondary education (N LEVELS)'],
loc = 1, fontsize = 10)
plt.show()
```

From 2007 to 2017, the academic qualifications of secondary school teachers have improved, with a larger percentage of teachers holding degrees. More specifically, the percentage of teachers with Masters and Honours degrees has increased. This has been accompanied by a minimal increase in the percentage of O Level students who went on to post-secondary education (97.8% to 98%), as well as a relatively sharp increase in the percentage of N Level students who went on to post-secondary education (88% to 96%).

The latter phenomenon could be due to more academically-qualified teachers being able to use their greater knowledge in helping students better grasp and maintain interest in school subjects, encouraging them to pursue post-secondary education. However, overall, this investigation does not seem to be very conclusive, and further research should be done to see how the academic qualifications of teachers might impact the students.
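The instructor feedback suggested placing the progression lines on their own graph rather than superimposing them on the stacked bars. A minimal sketch of that layout follows, using two stacked subplots; all values and column labels are placeholders chosen for illustration, not the project's data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only
years = [2015, 2016, 2017]
teachers = pd.DataFrame(
    {"Pass Degree %": [30, 29, 28], "Honours %": [60, 60, 61], "Masters %": [10, 11, 11]},
    index=years,
)
progress = pd.DataFrame(
    {"O Level %": [97.9, 98.0, 98.0], "N Level %": [94.0, 95.0, 96.0]},
    index=years,
)

# Two separate axes, so each plot keeps a single, unambiguous y-scale
fig, (ax_top, ax_bottom) = plt.subplots(2, 1, figsize=(10, 8))
teachers.plot(kind="bar", stacked=True, ax=ax_top)
ax_top.set_ylabel("Teachers by qualification (%)")
progress.plot(kind="line", marker="o", ax=ax_bottom, xticks=years)
ax_bottom.set_ylabel("Progressed to post-secondary (%)")
ax_bottom.set_xlabel("Year")
fig.tight_layout()
```

Keeping the two plots on separate axes removes the dual y-axis, so readers compare trends by year rather than by the visual overlap of bars and lines.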

To summarise, the following insights were drawn from the 3 part investigation:

- there is an increasing trend in Singapore's literacy rate
- government expenditure on education and Singapore's literacy rate are strongly positively correlated
- class sizes (measured by pupil-to-teacher ratio) and the percentage pass for English and Mathematics in secondary schools have a negative correlation
- class sizes do not have much correlation with percentage pass for Mother Tongue
- percentage pass for English and Mathematics has increased greatly over the years, whereas the increase for MTL is minimal
- the academic qualifications of secondary school teachers have improved
- the change in the percentage of students who progress to post-secondary education varies widely between streams

From these insights, the following inferences could be made, although more research would be required to confirm these inferences:

- the government's expenditure on education has likely been effective in improving basic reading and writing education, hence improving literacy rate
- smaller class sizes may result in better academic results due to greater attention being given to each student
- campaigns promoting the English Language and the focus on STEM have been successful, at least among students
- further improvements may be required for campaigns promoting MTLs
- the impact of teachers' academic qualifications on students is largely inconclusive

Based on this, further research in the following areas is recommended:

- the effect of smaller class sizes on academic results in other educational levels, and on other non-academic indicators
- the problems that students face in learning MTL (so as to craft more effective campaigns and policies to promote MTL)
- whether teachers' academic qualifications have any impact on students in other educational levels