This article is one of several written during my internship with the M-Lab organization link: http://measurementlab.net/ in the Outreachy Summer 2015 program. During this time I will work with data from broadbandmap.gov link: http://broadbandmap.gov/ and census.gov link: http://census.gov/ and probably other sites to try to establish a correlation between the broadband penetration rate in a community (how many people have high-speed internet) and the characteristics of that community. Basically, I want to characterize the communities that have an internet connection and those that do not, and see why the former are attractive to Internet Service Providers and the latter are not. (Maybe socio-economic factors such as income, education, age, or race influence the availability of an internet connection.)

Principal Component Analysis explained

What is Principal Component Analysis (PCA)? It is a technique that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. It is most commonly used as a first step in analysing large data sets. (Other applications include de-noising signals and data compression.)

PCA uses projection (a vector space transformation) to reduce the dimensionality of large data sets. The original data set can therefore be interpreted in terms of just a few variables (the principal components). I am using it on the broadbandmap.gov data set to see if I can spot any trends, patterns, and outliers in the data.
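To make the projection idea concrete, here is a minimal sketch, assuming NumPy and a synthetic matrix in place of the real broadbandmap.gov data. The function name `pca` is hypothetical, and whether the variables should also be scaled to unit variance is a choice this sketch leaves out.

```python
import numpy as np

def pca(data, n_components=2):
    """Project the rows of `data` onto the first `n_components` principal axes."""
    centered = data - data.mean(axis=0)            # centre every variable at zero
    # SVD of the centred matrix: the rows of vt are the principal directions
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    scores = centered @ components.T               # coordinates along the new axes
    return scores, components

# Synthetic stand-in with the same shape as the broadband table (not the real data)
rng = np.random.default_rng(0)
data = rng.normal(size=(67, 118))
scores, components = pca(data)
print(scores.shape, components.shape)
```

Each row of `scores` is one observation described by 2 numbers instead of 118.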

We are going to examine 3 data sets, all of them collected from the broadbandmap.gov site. The first data set holds broadband internet connection information for the counties in the New England region (there are 67 counties in the 6 New England states: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont). The second holds demographics data for the same counties. The third is a combination of the first two: for each county it contains both the broadband and the demographics information (basically we join the two data sets on the county id).
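The join on the county id could look roughly like the sketch below. It uses only the Python standard library; the column names (`county_id`, `maxDownload`, `population`), the ids, and the values are all invented for illustration and do not come from the real files. A library such as pandas offers the same operation through `DataFrame.merge`.

```python
import csv
import io

# Toy tables only: ids, column names, and values are made up for illustration.
broadband_csv = """county_id,maxDownload
c1,9
c2,8
c3,7
"""
demographics_csv = """county_id,population
c1,1500
c2,600
"""

def read_by_county(text):
    """Index the rows of a CSV text by their county_id column."""
    return {row["county_id"]: row for row in csv.DictReader(io.StringIO(text))}

broadband = read_by_county(broadband_csv)
demographics = read_by_county(demographics_csv)

# Inner join: keep only the counties that appear in both tables
combined = {
    county: {**broadband[county], **demographics[county]}
    for county in broadband.keys() & demographics.keys()
}
print(sorted(combined))
```

County `c3` is dropped because it has no demographics row, which is the behaviour one usually wants before running PCA on the combined table.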

The broadband internet connection data set has around 118 dimensions, so it would be impossible to see any trends just by looking at the raw numbers. The data set can be seen here link: https://github.com/elf11/Outreachy-Mlab/blob/master/code/broad_sum.csv . It has information about upload and download speeds and about wireless and wireline internet connections (specific speeds for each, as well as the percentage of the population that has such a connection).

To make sense of the data and see whether there are any trends that are not obvious at first glance, we could use a series of bivariate plots (scatter diagrams) and analyse them to determine any relationship between variables. But the number of such plots is O(n^2), where n is the number of variables: for 118 variables that is 118 × 117 / 2 = 6903 plots. Clearly, this is not feasible. With PCA we can look at all the variables simultaneously.

Broadband data

In the broadband example we have 118 variables (dimensions) for 67 counties (observations). If some of the observations (the counties) are similar to each other, this shows up in the 118-dimensional space as points clustered close together (but we are not able to visualise such a space, so we cannot see the clustering directly).
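One consequence of having fewer observations than variables is worth a quick check: once the data is centred, 67 points can span at most 66 directions, so at most 66 principal components can carry any variance. A small sketch, assuming NumPy and random numbers in place of the real table:

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.normal(size=(67, 118))        # synthetic stand-in, not the real data
centered = data - data.mean(axis=0)

# Centring removes one degree of freedom, so the rank is at most 67 - 1
rank = np.linalg.matrix_rank(centered)
print(rank)
```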

The first task of PCA is to identify a new set of orthogonal coordinate axes through the data. This is achieved by finding the direction of maximal variance through the points in the 118-dimensional space. It is equivalent to obtaining the line of best fit through the plotted data in the least-squares sense, where the distances are measured perpendicular to the line. We call this new axis the first principal component of the data. After this we can use orthogonal projection to map the points onto this axis.
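The claim that the first principal component is the direction of maximal variance can be checked numerically. The following is a sketch on a made-up 2-D cloud (so that it stays easy to picture), assuming NumPy; `variance_along` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 2-D cloud stretched along the diagonal
x = rng.normal(size=200)
points = np.column_stack([x, x + 0.3 * rng.normal(size=200)])
centered = points - points.mean(axis=0)

# Eigen-decomposition of the covariance matrix (eigh returns ascending eigenvalues)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
pc1 = eigenvectors[:, -1]                  # direction of largest variance

# Orthogonal projection of every point onto that axis gives the PC1 scores
pc1_scores = centered @ pc1

def variance_along(direction):
    """Sample variance of the points projected onto `direction`."""
    unit = direction / np.linalg.norm(direction)
    return np.var(centered @ unit, ddof=1)

print(variance_along(pc1), variance_along(np.array([1.0, 0.0])))
```

The variance along `pc1` equals the largest eigenvalue of the covariance matrix, and no other direction gives a larger value.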

Figure1

This type of diagram is known as a score plot. We can already see some clusters forming: the counties on the far right form a cluster (44007, 44005, 44001), then there is the central group of counties as a bulk, then another 4 counties of interest (23013, 23023, 33005, 25027), two smaller clusters, and another 2 counties that stand out (23021, 50009) at the opposite end of the axis.

Next we add another axis, the second principal component, which is orthogonal to the first PC and is the next best direction for approximating the original data (the direction of the second largest variance in the data). We project our points down onto the plane spanned by these two axes and obtain the second figure.
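One way to see why the second component is orthogonal to the first is deflation: remove from every point its part along PC1, then look for the direction of largest variance in what is left. This is only a sketch, assuming NumPy and synthetic data; `top_direction` is a hypothetical helper, and in practice a single SVD returns all components at once.

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data whose columns have clearly different spreads
data = rng.normal(size=(67, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])
centered = data - data.mean(axis=0)

def top_direction(matrix):
    """Unit vector along which the rows of `matrix` vary the most."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[0]

pc1 = top_direction(centered)
# Deflation: subtract each row's component along PC1, then search again
residual = centered - np.outer(centered @ pc1, pc1)
pc2 = top_direction(residual)

# Coordinates of every observation in the PC1/PC2 plane
plane_scores = centered @ np.column_stack([pc1, pc2])
print(abs(pc1 @ pc2))
```

The printed dot product is zero up to rounding, and `plane_scores` holds the two columns one would draw in a 2-D PCA plot.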

Figure2

The PCA method offers information about the contribution of each principal component to the total variance of the data. We can plot this as a graph of the eigenvalues (a scree plot). As can be seen from the graph, the first two components account for the majority of the variation in the data, which is what justifies the reduction from 118 dimensions to 2. (In practice, it is usually sufficient to include enough principal components so that somewhere in the region of 70-80% of the variation in the data is accounted for.)
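The 70-80% rule of thumb can be turned into a small calculation. The sketch below assumes NumPy and synthetic data; the function names are hypothetical, and the number it prints says nothing about the real broadband table.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of the total variance carried by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2 / (len(data) - 1)   # covariance eigenvalues
    return eigenvalues / eigenvalues.sum()

def components_needed(ratios, threshold=0.8):
    """Smallest number of leading components whose shares reach `threshold`."""
    cumulative = np.cumsum(ratios)
    count = int(np.searchsorted(cumulative, threshold)) + 1
    return min(count, len(ratios))

rng = np.random.default_rng(3)
data = rng.normal(size=(67, 118))      # synthetic stand-in, not the real data
ratios = explained_variance_ratio(data)
print(components_needed(ratios, threshold=0.8))
```

For shares of 0.6, 0.25, 0.1, and 0.05, two components are enough to pass the 80% mark, because 0.6 + 0.25 = 0.85.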

Figure3

We can also consider the influence of each of the original variables (all 118 of them) upon the principal components.

Figure4

Observe that there is a central group of variables around the middle of each principal component, with 4 variables on the periphery that do not seem to be part of the group (mostCommonUploadSpeed ~(-0.8, 0.0), mostCommonDownloadSpeed ~(-0.4, -0.2), greatestUploadSpeed ~(-0.2, 0.8), greatestDownloadSpeed ~(0.0, 0.4)). Perhaps there is some association to be made between the counties that were away from the clusters in the score plot and those 4 variables.
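Picking out the variables on the periphery does not have to be done by eye. A possible sketch, assuming NumPy, ranks the variables by how far their (PC1, PC2) loading pair lies from the origin. The function `peripheral_variables`, the variable names, and the data are all hypothetical.

```python
import numpy as np

def peripheral_variables(data, names, top=4):
    """Variables whose (PC1, PC2) loadings lie farthest from the origin."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    loadings = vt[:2].T                         # one (PC1, PC2) pair per variable
    distance = np.linalg.norm(loadings, axis=1)
    order = np.argsort(distance)[::-1][:top]    # largest distance first
    return [(names[i], loadings[i]) for i in order]

rng = np.random.default_rng(6)
names = ["a", "b", "c", "d", "e", "f"]          # hypothetical variable names
# Synthetic data: the first two columns vary far more than the rest
data = rng.normal(size=(67, 6)) * np.array([100.0, 50.0, 1.0, 1.0, 1.0, 1.0])
outliers = peripheral_variables(data, names, top=2)
print([name for name, _ in outliers])
```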

Next, I am going to add the same graphs for the demographics data set and the combination of the two data sets.

Demographics data

The demographics data set can be accessed by following the link: https://github.com/elf11/Outreachy-Mlab/blob/master/code/broad_sum.csv .

The score plot:

Figure5

We can observe a cluster forming on the far left, and then a series of isolated counties (insert_counties_ids).

The PCA in 2D:

Figure6

The contributions of each component to the total variance:

Figure7

We can observe that the first component contributes the most to the total variance, with the other components making a much smaller contribution.

Influence of each of the initial 20-something variables on the principal components:

Figure8

Again we can observe a central group of variables in the middle of each of the principal components (on the graph we can see only 2 of them, the other ones having values less than e^-04) and the variables on the periphery: medianIncome ~(0.0, -1.0), households ~(0.4, 0.5), population ~(1.0, -0.2).
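Variables measured in large units (counts of people, incomes in dollars) can dominate a PCA simply because their raw variance is large. The article does not say whether the variables were scaled beforehand, so the sketch below is only an illustration of the effect on made-up columns, assuming NumPy; `first_loading` is a hypothetical helper.

```python
import numpy as np

def first_loading(data, standardize):
    """Absolute loadings of each variable on the first principal component."""
    centered = data - data.mean(axis=0)
    if standardize:
        centered = centered / centered.std(axis=0, ddof=1)   # unit variance per column
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return np.abs(vt[0])

rng = np.random.default_rng(4)
# Made-up columns: one in the tens of thousands, two percentages that move together
big = rng.normal(50_000, 10_000, size=67)
shared = rng.normal(size=67)
pct_a = 50 + 10 * shared + rng.normal(size=67)
pct_b = 40 + 10 * shared + rng.normal(size=67)
data = np.column_stack([big, pct_a, pct_b])

raw = first_loading(data, standardize=False)
scaled = first_loading(data, standardize=True)
print(raw.round(3), scaled.round(3))
```

Without scaling, the first component is almost entirely the large-unit column; after scaling, it reflects the two columns that actually vary together.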

Combining the two data sets

Now the data set has around 144 dimensions (the variables) and 67 observations (the counties from the New England region).

The score plot:

Figure9

The county code for the point on the far right is 25017. Looking at the data for the 3 most influential variables for this county (medianIncome, households, and population), we can observe that all of them are high for this county.

Figure13

The PCA in 2D:

Figure10

The contributions of each component to the total variance:

Figure11

As before (in the demographics data set), the first component accounts for most of the variance.

Influence of each of the initial 140-something variables on the principal components:

Figure12

The 3 variables that are situated on the periphery are again population, medianIncome and households.

Conclusions

The data sets for broadband internet connection, as well as the combined data set, probably suffer from the curse of dimensionality. We can observe that the data spreads out away from the central point. The curse of dimensionality states that the higher the number of dimensions, the more the data will spread out away from the center, so the larger the number of variables, the more samples we need. To make those last two data sets usable we need either to increase the number of samples (at the moment we use the 67 counties in the New England region; we could go to the national level, but that is around 3200 counties, and such a graph might be difficult to read) or to decrease the number of variables (we would have to decide which variables to drop and which ones to keep).
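The spreading-out effect is easy to reproduce with random numbers. This sketch, assuming NumPy, draws 67 standard-normal points in spaces of growing dimension and measures their mean distance from the center; the data is synthetic and the function name is hypothetical.

```python
import numpy as np

def mean_distance_from_center(n_samples, n_dims, seed=0):
    """Average Euclidean distance of random normal points from their centroid."""
    rng = np.random.default_rng(seed)
    points = rng.normal(size=(n_samples, n_dims))
    centered = points - points.mean(axis=0)
    return float(np.linalg.norm(centered, axis=1).mean())

# Same number of samples as New England counties, increasing dimension
for dims in (2, 20, 118, 144):
    print(dims, round(mean_distance_from_center(67, dims), 2))
```

With the number of samples fixed, the distance grows roughly like the square root of the number of dimensions, so the same 67 points cover the space ever more thinly.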
