Men are 7 times more likely than women to die a gun-related death in the USA.

You are also more likely to die by suicide if you are a white person with only a high school or GED diploma.

. . .

Privacy in Location Based Services - Part 2

In a previous post we discussed Location Based Services (LBS): what they are, why they matter, what the early approaches to achieving privacy in LBS were, and how that privacy was broken. In this post we continue the discussion of LBS and, more specifically, look at a few algorithms that were used back in the day to achieve that privacy (back in the day meaning before they were proven to be totally useless garbage :p - just kidding, or am I?). Anyway, we are going to cover two different types of algorithms: one requires a trusted third-party architecture, the other does not.

. . .

Open-addressed double hashed hash table

While I was still in my undergrad program at the University Politehnica of Bucharest (great place, by the way - you should check out their courses at https://ocw.cs.pub.ro), I took some low-level programming classes, and one of the first classes that actually raised some difficulties was Operating Systems. The first assignment for that class was to write a multi-platform hash table, meaning your C code had to run on both Windows and Linux. I am not going to get into the details of how you make that work, or why it matters, but I kept thinking about why we got a hash table as a first assignment and about the importance of hash tables in computing in general. So I came up with the idea of writing a tutorial about how to write a hash table in C. What you will get from this is a deeper understanding of how the data structure works, how and when to use it, and why hash tables are sometimes a great choice and other times not.
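
To give a flavour of what the tutorial builds towards, here is a minimal sketch of the probing logic behind an open-addressed, double-hashed table. The two hash functions, the fixed table size and the string-only keys are illustrative assumptions of mine, not the code from the assignment or the tutorial itself.

```c
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 101            /* illustrative prime-sized table */

/* primary hash: picks the starting slot */
static unsigned long hash1(const char *key) {
    unsigned long h = 5381;
    while (*key)
        h = h * 33 + (unsigned char)*key++;
    return h % TABLE_SIZE;
}

/* secondary hash: picks the step size, must never be 0 */
static unsigned long hash2(const char *key) {
    unsigned long h = 0;
    while (*key)
        h = h * 31 + (unsigned char)*key++;
    return 1 + (h % (TABLE_SIZE - 1));
}

static const char *table[TABLE_SIZE];  /* NULL means empty slot */

/* insert with double hashing: the i-th probe lands at (h1 + i*h2) % size */
static int insert(const char *key) {
    unsigned long h1 = hash1(key), h2 = hash2(key);
    for (unsigned long i = 0; i < TABLE_SIZE; i++) {
        unsigned long idx = (h1 + i * h2) % TABLE_SIZE;
        if (table[idx] == NULL || strcmp(table[idx], key) == 0) {
            table[idx] = key;
            return 1;                /* inserted (or already present) */
        }
    }
    return 0;                        /* table is full */
}

/* lookup follows exactly the same probe sequence */
static const char *lookup(const char *key) {
    unsigned long h1 = hash1(key), h2 = hash2(key);
    for (unsigned long i = 0; i < TABLE_SIZE; i++) {
        unsigned long idx = (h1 + i * h2) % TABLE_SIZE;
        if (table[idx] == NULL)
            return NULL;             /* hit an empty slot: key not present */
        if (strcmp(table[idx], key) == 0)
            return table[idx];
    }
    return NULL;
}

int main(void) {
    insert("linux");
    insert("windows");
    printf("%s\n", lookup("linux") ? "found" : "missing");
    return 0;
}
```

Because the table size is prime and the second hash never returns zero, the probe sequence visits every slot before giving up, which is the property that makes double hashing attractive compared to plain linear probing.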

. . .

k-Anonymity and cluster based methods for privacy

Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. This data is collected both by governments and by private entities. The data and knowledge extracted through data mining techniques represent a key asset to society, revealing trends and patterns and helping formulate public policies. Ideally, we would want our personal data to stay off the internet, but laws and regulations require that some collected data be made public, for example census data. Besides that, there is a lot of data out there about each of us over which we have no power of decision. Health-care datasets might be made public for clinical studies, or simply because the state you live in has a policy of publishing its hospital discharge database. There is a new trend of finding out your ancestry by buying one of those DIY at-home kits, so there are huge genetic datasets - see 23andMe or HapMap - where your data is stored and owned by a company. Demographic datasets can offer a lot of information as well: the U.S. Census Bureau, or any sociology study you took part in, might make that data public and let the entire world know intimate aspects of your person. And let's not forget about all the data collected by search engines, social networks, Amazon and Netflix.

. . .

Privacy in Location Based Services - Part 1

In the mid 2000s (around 2006-2008) people started being concerned about their location privacy, so there was a boom of research papers on the subject. Increasingly many people had access to mobile communication devices (mobile phones, PDAs) with positioning capabilities (GPS), so the fear that your location data is public and that people can find out your shopping habits, schedule, health issues and so on became real. This is still a threat today. What this post (part one of a two-part series) tries to do is present and discuss some of the solutions used to mitigate the privacy risks introduced by running a location based service on a mobile device. Most of these approaches are outdated nowadays - the state of the art in privacy is differential privacy - but it is still interesting to see how far privacy has come, and it is especially interesting to me since I am a security/privacy person. So take this as a personal project on LBS privacy methods and their early days.

. . .

HBase Installation Guide

As discussed in the previous post, for my research we needed an open source solution that could deal with Big Data. We settled on the Hadoop Ecosystem and went through the components of HDFS and MapReduce, as well as through the steps needed to install Hadoop on a cluster. Now we are going to discuss another component of the Hadoop Ecosystem: HBase. Accessing data in HDFS is slow, so we needed a way to get to the particular data we were interested in - within the TBs of data saved on the cluster - in a fast manner. HBase, being a NoSQL database, allows random access to the data in reasonable time.

. . .

Hadoop Installation Guide

In my research I am working with a lot of networking data and trying to find ways to secure that data - more about the securing part in another post. The thing is, we needed a way to store NetFlow data, the data generated by routers and switches in a 100GB network. So we decided this problem was a great opportunity to try some Big Data solutions. Big Data is nowadays more of a buzzword with an unclear definition, but most people would consider dealing with terabytes of data or more as dealing with Big Data. When dealing with Big Data we are interested in the 3 V's: Volume, Variety and Velocity. We needed a solution that could handle all three: a lot of incoming data from different points in the network, data with some (though not much) variety, and a fairly high arrival rate - we were expected to handle anything between 1 and 3GB per hour.
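
To put that rate into perspective, a rough back-of-the-envelope estimate, assuming the data keeps flowing more or less continuously:

1 GB/hour ≈ 24 GB/day ≈ 8.8 TB/year
3 GB/hour ≈ 72 GB/day ≈ 26 TB/year

so even the low end of our range lands firmly in terabytes-of-data territory within a year, which is why a Big Data solution made sense.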

. . .

Analyse a tcpdump capture using libpcap in C

In the past I have taken some security courses, and during one of them we had an assignment to use libpcap to sniff and spoof DNS inside a network. That gave me the idea for this article, which is more of a gentle introduction to libpcap and how to use it to analyze the type and number of packets arriving on the network at a particular moment. This could be done very easily with Wireshark and a series of filters, but the purpose of this article is educational: a basis for understanding and developing greater and better things with libpcap.
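
As a taste of the kind of code the article walks through, here is a minimal sketch that opens an interface with libpcap and simply counts the packets it sees. The interface name ("eth0") and the 100-packet capture limit are illustrative assumptions; the article itself also looks at packet types, not just counts.

```c
#include <pcap.h>
#include <stdio.h>

/* called by pcap_loop() once per captured packet */
static void count_packet(u_char *user, const struct pcap_pkthdr *hdr,
                         const u_char *bytes) {
    (void)hdr; (void)bytes;
    (*(unsigned long *)user)++;      /* bump the packet counter */
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    unsigned long packets = 0;

    /* open "eth0" for live capture: 65535-byte snaplen, promiscuous mode, 1s read timeout */
    pcap_t *handle = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (handle == NULL) {
        fprintf(stderr, "pcap_open_live: %s\n", errbuf);
        return 1;
    }

    /* capture 100 packets (or stop early on error) */
    if (pcap_loop(handle, 100, count_packet, (u_char *)&packets) == -1)
        fprintf(stderr, "pcap_loop: %s\n", pcap_geterr(handle));

    printf("captured %lu packets\n", packets);
    pcap_close(handle);
    return 0;
}
```

Compile and link against libpcap (gcc sniff.c -lpcap) and run it with enough privileges to capture on the interface.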

. . .

NetFlow data generation with nfdump and softflowd

Recently I needed some NetFlow data samples. I looked all over the internet for some, but for obvious privacy reasons there were none - no one shares their NetFlow data, not even a little sample. So what could I do? I had no Cisco equipment to generate traffic on and then collect it as data flows. So I improvised: I used my laptop as a router in the campus network and collected the traffic that went through it as data flows. This post is about how to generate and collect NetFlow data on your own network.

. . .

Linear Regression in the context of m-lab data

Continuing the Outreachy article series, this article describes the outcome of applying linear regression to the m-lab data and the first steps in establishing a model of the correlation between socio-economic factors (population and median income) and internet speed and availability in the New England region.
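
Concretely, the model has the familiar least-squares form. Treating internet speed as the response and the two socio-economic factors as predictors is my shorthand framing here; the exact specification used in the analysis may differ:

speed_i = \beta_0 + \beta_1 \cdot \text{population}_i + \beta_2 \cdot \text{income}_i + \varepsilon_i

with the coefficients \beta chosen to minimize the sum of squared residuals \sum_i (speed_i - \widehat{speed}_i)^2 over the New England communities.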

. . .

Analysing internet speed trends using k-means

This article is the second one in the m-lab Outreachy series, and it describes the process we followed to obtain a cluster visualization and the subsequent analysis of internet speeds and characteristics for the USA using k-means. If you remember from the previous article, the purpose of this project is to state some facts about internet connection speed/availability in different parts of the United States and how those speeds correlate with socio-economic factors. The next step we decided on was a cluster analysis, to see whether any communities cluster together based on their internet speed (upload and download speed) and RTT (round-trip time), and then to observe whether any of those clusters share some socio-economic traits.
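
For reference, k-means with a chosen number of clusters k represents each community as a feature vector x_i (here: upload speed, download speed and RTT, the features named above) and picks centroids \mu_1, \dots, \mu_k that minimize the within-cluster sum of squares:

\min_{\mu_1, \dots, \mu_k} \sum_i \min_j \lVert x_i - \mu_j \rVert^2

The standard algorithm alternates between assigning every point to its nearest centroid and recomputing each centroid as the mean of its assigned points, until the assignments stop changing. The value of k itself has to be picked empirically, for example by watching how the objective drops as k grows.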

. . .

Principal Component Analysis on the broadbandmap.gov data

This article will be one of a few articles written during my internship with the M-Lab organization (http://measurementlab.net/) in the Outreachy Summer 2015 program. During this time I will work with data from broadbandmap.gov (http://broadbandmap.gov/), census.gov (http://census.gov/) and probably other sites, to try to establish a correlation between the internet broadband penetration rate in a community (how many people have high-speed internet) and the characteristics of that community. Basically, what I want to do is characterize the communities with an internet connection and those without, and see why the former are attractive to Internet Service Providers and the latter are not. (Maybe some socio-economic factors are influencing the availability of an internet connection: income, education, age, race.)
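
Since the title mentions it, a quick reminder of what PCA does. Stack the standardized community features into a matrix X with one row per community (which variables go in, and how they are cleaned, is whatever the broadbandmap.gov and census.gov data allows), form the covariance matrix and take its eigendecomposition:

\Sigma = \frac{1}{n-1} X^{\top} X, \qquad \Sigma v_j = \lambda_j v_j

Then project the data onto the eigenvectors v_j with the largest eigenvalues \lambda_j, so that a handful of principal components captures most of the variance in the original variables. Which components end up mattering for broadband penetration is exactly what the analysis has to work out.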

. . .

Short dive into Conditional Probability - refresher

I hope this will be one of many short and to-the-point posts about different aspects of data analysis and how to tackle data analysis problems. The purpose is to serve as a short refresher on core aspects of machine learning and data analysis. That being said, the first topic I tackle is conditional probability.
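
As a one-line preview, the definition everything else in the refresher builds on is

P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0

which rearranges into Bayes' theorem, P(A \mid B) = P(B \mid A) \, P(A) / P(B). A quick sanity check: roll a fair die and suppose you are told the result is even (event B); the probability that it is a 2 (event A) is P(A \cap B)/P(B) = (1/6)/(1/2) = 1/3.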

. . .