Blockchain and Online Identity

This article appeared in a modified form in the Winter issue of ACM XRDS magazine in 2018. There it was joint work with another author; here is my part of the article. All credit goes to me for this one. If you want to read the joint work, you can do so on the ACM XRDS website.

. . .

Applying Data Science for Anomaly and Change Point Detection

What do we mean when we say that we are trying to find anomalies in a data set? What are anomalies? How can we define the point at which the behavior of the data starts to become anomalous? Those are the questions we are going to try to answer in this introductory article about anomaly detection. With the help of a running example using network data, we are going to devise a couple of very simple algorithms for anomaly detection.
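
As a taste of what a very simple detector can look like, here is a minimal sketch that flags points lying far from the mean, measured in standard deviations (a z-score test). It is not necessarily one of the algorithms from the full article. The `traffic` values, the function name and the threshold of 2.5 are all assumptions made up for illustration.

```python
import statistics


def zscore_anomalies(values, threshold=3.0):
    """Return the indices of values whose z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        # All values are identical, so nothing can stand out.
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]


# Hypothetical packet counts per time window, with one obvious spike.
traffic = [120, 118, 125, 130, 122, 119, 900, 121, 124, 117]
print(zscore_anomalies(traffic, threshold=2.5))
```

Note that a single large spike also inflates the mean and the standard deviation, which is why a threshold lower than the usual 3 is used on such a short series.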

. . .

Classifying data with decision trees

What is a decision tree?

. . .

A primer on differential privacy ACM XRDS

I recently started collaborating with the ACM XRDS magazine and my first article is out. It’s a gentle introduction to differential privacy and you can read it here.

. . .

Fitting distributions to data in Python

These days I have been looking into fitting a Laplacian distribution to some data that I had. The data was presented as a histogram and I wanted to see how the Laplacian distribution looked over it. After some looking around and not finding many straightforward ways to do it, I figured it out. The code is below and you should get something similar to what can be seen in the picture.
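
Since the excerpt stops before the code, here is a minimal sketch of one way to do the fit, not necessarily the approach from the full post. It relies on the fact that the maximum-likelihood estimates for a Laplace distribution are the sample median (location) and the mean absolute deviation from that median (scale). The function names and the sample `data` are assumptions for illustration; `scipy.stats.laplace.fit` is a library alternative.

```python
import math
import statistics


def fit_laplace(samples):
    """Maximum-likelihood estimates (loc, scale) of a Laplace distribution."""
    loc = statistics.median(samples)
    scale = sum(abs(x - loc) for x in samples) / len(samples)
    return loc, scale


def laplace_pdf(x, loc, scale):
    """Density of the Laplace distribution, for overlaying on a histogram."""
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)


# Hypothetical raw samples; with histogram data you would first expand
# the bins back into (approximate) samples.
data = [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 9.0]
loc, scale = fit_laplace(data)
print(loc, scale, laplace_pdf(loc, loc, scale))
```

To draw the curve over a histogram, evaluate `laplace_pdf` on a grid of x values and plot it on top of a histogram normalized to unit area.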

. . .

Men are 7 times more likely to die from a gun death than women in the USA

Also, you are more likely to die by suicide if you are a white person with only a high school or GED diploma.

. . .

Privacy in Location Based Services - Part 2

In a previous post we discussed Location Based Services (LBS): what they are, why they are important, what initial ways there were to achieve privacy in LBS, and how that privacy was broken. In this post we are going to continue discussing LBS, and more specifically we are going to look at a few algorithms that were used back in the day for achieving privacy (back in the day means before they were proven to be totally useless garbage :p - just kidding, or am I?). Anyway, we are going to discuss two different types of algorithms for this. One of them requires a trusted third-party architecture; the other one doesn’t.

. . .

Open-addressed double hashed hash table

While I was still in my undergrad program at the University Politehnica of Bucharest (great place btw, you should check out their courses at https://ocw.cs.pub.ro), I took some low-level programming classes, and one of the first classes that actually raised some difficulties was the Operating Systems one. The first assignment for that class was to write a multi-platform hash table, meaning your C code had to run on both Windows and Linux. I am not going to get into details about how you make that work, or why it is important, but it got me thinking about why we got a hash table as a first assignment and about the importance of hash tables in computing in general. So I came up with the idea of writing a tutorial about how to write a hash table in C. What you will get from this is a deeper understanding of how the data structure works, how and when to use it, and why sometimes it’s great to use hash tables and other times it’s not.

. . .

k-Anonymity and cluster based methods for privacy

Society is experiencing exponential growth in the number and variety of data collections containing person-specific information. This data has been collected both by governments and by private entities. Data and knowledge extracted by data mining techniques represent a key asset to society, giving information about trends and patterns and helping formulate public policies. Ideally, we would want our personal data to be off the internet, but laws and regulations require that some collected data be made public, for example census data. Besides that, there is a lot of data out there about each one of us over which we don’t have any power of decision. Health-care datasets might be made public for clinical studies, or just because it’s the policy of the state where you live to make the hospital discharge database public. There is a new trend of finding out your ancestry by buying one of those DIY at-home kits, so there are huge genetic datasets (see 23andMe, HapMap) where your data is stored and a company owns it. Demographic datasets can offer a lot of information as well: the U.S. Census Bureau or any sociology study that you took part in might make that data public and let the entire world know intimate aspects of your person. And let’s not forget about all the data collected by search engines, social networks, Amazon, Netflix.

. . .

Privacy in Location Based Services - Part 1

In the mid 2000s (around 2006-2008) people started being concerned with their location privacy, so there was a boom of research papers coming out on this subject. Increasingly many people started having access to mobile communication devices (mobile phones, PDAs) and those devices had positioning capabilities (GPS, location). So the fear that your location data is public and people can find out your shopping habits, schedule, health issues and so on became real. This is still a threat today. What this post (part one of a two-part series) is trying to do is to present and discuss some of the solutions used to mitigate the privacy risks presented by the presence of a location based service in a mobile device. Most of those things are outdated nowadays, as the state of the art in privacy is differential privacy, but it’s still interesting to know the long way privacy has come, and it is especially interesting to me since I am a security/privacy person. So take this as a personal project about the study of LBS privacy methods and their early days.

. . .

HBase Installation Guide

As we discussed in the previous post, for my research we needed an open source solution that could deal with Big Data. We settled on the Hadoop Ecosystem and went through the components of HDFS and MapReduce, as well as through the steps needed to install Hadoop on a cluster. Now we are going to discuss another component of the Hadoop Ecosystem, HBase. Accessing data in HDFS is slow, so we needed a fast way to get to the particular data we were interested in among the TBs of data we had saved on the cluster. HBase, being a NoSQL database, allows random access to the data in a convenient time.

. . .

Hadoop Installation Guide

In my research I am working with a lot of networking data and trying to find ways to secure that data; more about the securing part later in another post. The thing is, we needed a way to store NetFlow data, the data generated by routers and switches in a 100GB network. So we decided that this problem was great for trying some Big Data solutions. Big Data is nowadays more of a buzzword. It has an unclear definition, but most people would consider dealing with TBs of data or more as dealing with Big Data. When dealing with Big Data we are interested in the 3 V’s: Volume, Variety and Velocity. We needed a solution that could deal well with those things: a lot of incoming data from different points in the 100GB network, data with some variety but not much, and a pretty high speed at which the data was coming to us. We were supposed to deal with anything between 1 and 3GB per hour.

. . .

Analyse a tcpdump capture using libpcap in C

In the past I have taken some security courses, and during one of them we had an assignment to use libpcap to sniff and spoof DNS inside a network. That gave me the idea for this article, which is more of a gentle introduction to libpcap and how to use it to analyze the type and number of packets you are getting in the network at a particular moment. This could be done very easily with Wireshark and a series of filters, but the purpose of this article is educational: to provide a basis for understanding libpcap and developing greater and better things with it.

. . .

NetFlow data generation with nfdump and softflowd

Recently I needed some NetFlow data samples. I looked all over the internet for some, but for obvious privacy reasons there were none. No one shares their NetFlow data. Not even a little sample. So what could I do? I had no Cisco equipment to generate traffic on and then collect it in data flows. So I improvised by using my laptop as a router in the campus network and collecting the traffic that went through it in data flows. This post is about how to generate and collect NetFlow data on your own network.

. . .

Short dive into Conditional Probability - refresher

I hope this will be the first of many short and to-the-point posts about different aspects of data analysis and how to tackle data analysis problems. The purpose is to offer a short refresher course on different core aspects of machine learning and data analysis. That being said, the first problem I tackle is conditional probability.
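
As a quick reminder of the definition, P(A | B) = P(A and B) / P(B): we restrict attention to the outcomes where B happened and ask how often A also happens among them. The sketch below checks this by brute-force enumeration on two fair dice. The dice example and all names in it are assumptions chosen for illustration, not taken from the full post.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def always_true(outcome):
    return True


def probability(event, given=always_true):
    """P(event | given), computed by counting equally likely outcomes."""
    conditioned = [o for o in outcomes if given(o)]
    favorable = [o for o in conditioned if event(o)]
    return Fraction(len(favorable), len(conditioned))


def sum_is_8(outcome):
    return outcome[0] + outcome[1] == 8


def first_is_even(outcome):
    return outcome[0] % 2 == 0


p_a = probability(sum_is_8)
p_a_given_b = probability(sum_is_8, given=first_is_even)
print(p_a, p_a_given_b)
```

Knowing that the first die is even changes the probability of a sum of 8, so the two events are not independent.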

. . .