Linear Regression in the context of m-lab data
24 Aug 2015
Continuing the Outreachy article series, this article describes the outcome of applying linear regression to the m-lab data and the first steps in establishing a model of correlation between the socio-economic factors (population and median income) and the internet speed and availability for the New England region.
We built on the previous work, meaning that we ran a linear regression analysis of the two most important socio-economic factors (population and median income) against MedianRTT, median upload speed and median download speed. We chose those features because both the PCA analysis and the k-means analysis indicated that they were the factors most likely to be correlated. The analysis was done in Python, using one of the linear regression libraries (statsmodels).
New England region - Results, Code and Discussion
Linear regression - how it works
Why are we using linear regression? It is an easy-to-use method (little tuning is required), it runs fast, it is highly interpretable, and it can be used as a basis for other methods (cross validation).
Simple linear regression predicts a quantitative response using a single feature (also called a predictor or input variable) and has the following form:
y = β0 + β1x
Where:
- y is the response
- x is the feature
- β0 is the intercept
- β1 is the coefficient for x
Together, β0 and β1 are called the model coefficients. To create our model, we must “learn” the values of those coefficients.
To learn the coefficients means to estimate them. In general they are estimated using the least squares method, which finds the line that minimizes the sum of squared residuals (also called the sum of squared errors).
Looking at Figure 1, we interpret it as follows: the black dots are the observed values of x and y, the blue line is the least squares line, and the red lines are the residuals (errors). Each residual is the vertical distance between an observed value and the fitted line.
How do our coefficients β0 and β1 relate to this line? β0 is the intercept, that is, the value of y when x=0, and β1 is the slope of the line.
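The estimation step described above can be sketched in a few lines of plain Python. This is a minimal illustration using the standard closed-form least squares estimates for a single feature, not the statsmodels code used for the analysis; the function names and the five data points are made up for the example.

```python
def fit_least_squares(x, y):
    """Return (beta0, beta1) minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


def sum_squared_residuals(x, y, beta0, beta1):
    """Sum of squared vertical distances between observations and the line."""
    return sum((yi - (beta0 + beta1 * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical observations, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_least_squares(x, y)
sse = sum_squared_residuals(x, y, beta0, beta1)
print(f"intercept={beta0:.3f} slope={beta1:.3f} SSE={sse:.4f}")
```

Moving the line away from the fitted intercept or slope in either direction can only increase the sum of squared residuals, which is what "least squares" refers to.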
Let’s take a look at the data, ask some questions about it and then answer those questions using linear regression.
The data set we used for this analysis consists of internet speed and availability characteristics for New England, collected with the piecewise tool from m-lab, and demographic data from the broadbandmap.gov site. We used the population and median income for each of the counties in New England from the census.
What are the features we are interested in?
- MedianRTT: the median round-trip time in ms for each county in New England
- download_median: the median download speed in mb/s for each county in New England
- upload_median: the median upload speed in mb/s for each county in New England
What are we interested in? How those features correlate with the population and the median income for each of those counties.
We can visualize the relationship between those features using scatter plots. In Figure 2 below, population is plotted against MedianRTT, upload_median and download_median, and the same is done for medianIncome.
Questions about the data:
- Is there a relationship between the socio-economic features and the internet features?
- How strong is that relationship?
- Do population and median income influence the RTT and internet speeds?
- What is the effect of each of those (population and median income) on the internet characteristics?
We used the following function to implement linear regression using the statsmodels library.
We are going to analyse the coefficients of each of our models and see what they are and what they represent. Before that, note that in the code above we made predictions for the smallest and largest observed values of x and then used the predicted values to plot the least squares line. Those least squares lines can be seen in Figure 3 for MedianRTT, Figure 4 for median download and Figure 5 for median upload.
Interpreting model coefficients: for the MedianRTT ~ population model, the population coefficient (β1) means that a one-unit decrease in population is associated with a 0.000002 ms decrease in median RTT. For the MedianRTT ~ medianIncome model, the medianIncome coefficient (β1) means that a one-dollar change in median income is associated with a 0.000479 ms change in median RTT. Put in more practical terms, a difference of $10,000 in median income corresponds to a difference of 4.79 ms in median RTT.
Here we can interpret the download_median ~ population model as follows: the population coefficient (β1) means that a one-unit increase in population is associated with a 0.000006 mb/s increase in median download speed. In other words, an increase of 1,000,000 in population is associated with an increase of 6 mb/s in median download speed.
Here we can interpret the upload_median ~ population model as follows: the population coefficient (β1) means that a one-unit increase in population is associated with a 0.000002 mb/s increase in median upload speed. In other words, an increase of 1,000,000 in population is associated with an increase of 2 mb/s in median upload speed.
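Because the coefficients are so small, it helps to rescale them to a more practical change in the feature. The sketch below does that arithmetic for two of the coefficient magnitudes quoted above; the helper names are made up for the example, and the units (dollars for median income, ms for RTT, people for population) are assumed from the feature descriptions.

```python
def response_change(coefficient, feature_change):
    """Change in the response associated with a given change in the feature."""
    return coefficient * feature_change


def feature_change_needed(coefficient, target_response_change):
    """Change in the feature associated with a target change in the response."""
    return target_response_change / abs(coefficient)


# Magnitude of the MedianRTT ~ medianIncome coefficient (ms per dollar).
rtt_per_dollar = 0.000479
rtt_change_per_10k = response_change(rtt_per_dollar, 10_000)
dollars_per_ms = feature_change_needed(rtt_per_dollar, 1.0)

# download_median ~ population coefficient (speed units per person).
download_per_person = 0.000006
download_change_per_million = response_change(download_per_person, 1_000_000)

print(f"$10,000 of median income <-> {rtt_change_per_10k:.2f} ms of median RTT")
print(f"1 ms of median RTT <-> about ${dollars_per_ms:,.0f} of median income")
print(f"1,000,000 people <-> {download_change_per_million:.1f} units of download speed")
```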
The linear regression model is a high bias/low variance model. This means that if we sample repeatedly, the fitted line will stay roughly in the same place (low variance), but if the true relationship is not linear, even the average of those fitted lines will not capture it (high bias).
Hypothesis testing and p-values
Using the models created, we tested the conventional hypotheses regarding the model coefficients:
- null hypothesis: there is no relationship between the population/median income feature and median rtt, upload and download (so β1 equals 0)
- alternative hypothesis: there is a relationship between the population/median income feature and median rtt, upload and download (so β1 is not equal to 0)
Hypothesis testing is closely related to confidence intervals. In statistics, a confidence interval is an interval estimate of a population parameter. It is calculated from the observed sample, so it differs from sample to sample. Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows: if the population from which this sample was drawn were sampled 100 times, approximately 95 of the resulting confidence intervals would contain the “true” coefficient. So, how do we relate this to hypothesis testing? We reject the null hypothesis and accept the alternative if the 95% confidence interval does not include zero. The p-value (in the data above) is the probability of observing a coefficient estimate at least as far from 0 as ours if the true coefficient were 0; it is not the probability that the coefficient is actually 0.
So, if the 95% confidence interval includes 0, the p-value for that coefficient will be greater than 0.05. The only instance where this is the case is the MedianRTT ~ population model, where the p-value for population is greater than 0.05, so we cannot reject the null hypothesis: the data do not show evidence of a relationship between population and median RTT. For the other models the p-values are below 0.05, so we reject the null hypothesis and conclude that there is a relationship between median RTT and median income, and between download/upload speed and median income/population.
The sources for this file, as well as sources for multivariable regression can be found here.
Using simple linear regression, we analysed the m-lab data set and the correlation between the socio-economic features and internet characteristics. The analysis suggests that there is a correlation between those characteristics and that the socio-economic features may influence the internet speed characteristics of the counties in New England, with the exception of population and median RTT, where no significant relationship was found.