Stanford InfoLab Publication Server

Data Analytics: Integration and Privacy

Whang, Steven Euijong (2012) Data Analytics: Integration and Privacy. PhD thesis, Stanford University.


PDF - Published Version


Data analytics has become an extremely important and challenging problem in disciplines like computer science, biology, medicine, finance, and homeland security. As massive amounts of data are available for analysis, scalable integration techniques become important. At the same time, new privacy issues arise where one's sensitive information can easily be inferred from the large amounts of data. In this thesis, we first cover the problem of entity resolution (ER), which identifies database records that refer to the same real-world entity. The recent explosion of data has now made ER a challenging problem in a wide range of applications. We propose scalable ER techniques and new ER functionalities that have not been studied in the past. We also view ER as a black-box operation and provide general techniques that can be used across applications. Next, we introduce the problem of managing information leakage, where one must try to prevent important bits of information from being resolved by ER, to guard against loss of data privacy. As more of our sensitive data gets exposed to a variety of merchants, health care providers, employers, social sites and so on, there is a higher chance that an adversary can "connect the dots" and piece together our information, leading to even more loss of privacy. We propose a measure for quantifying information leakage and use "disinformation" as a tool for containing information leakage.
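To make the definition of entity resolution above concrete, here is a minimal sketch of the idea: compare records pairwise with a match rule and group the records that match, directly or transitively. Everything in it is an illustrative assumption and not taken from the thesis: the record fields (name, email, phone), the is_match rule, and the resolve helper are hypothetical, and the all-pairs comparison is the naive quadratic baseline, not one of the scalable techniques the thesis proposes.

```python
from itertools import combinations


def normalize(value):
    """Lowercase and strip whitespace so trivially different strings compare equal."""
    return value.strip().lower()


def is_match(a, b):
    """Hypothetical match rule: same normalized name AND a shared email or phone."""
    if normalize(a["name"]) != normalize(b["name"]):
        return False
    same_email = a.get("email") and a.get("email") == b.get("email")
    same_phone = a.get("phone") and a.get("phone") == b.get("phone")
    return bool(same_email or same_phone)


def resolve(records, match=is_match):
    """Cluster record indices by the transitive closure of pairwise matches."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison: O(n^2) calls to the match function.
    for i, j in combinations(range(len(records)), 2):
        if match(records[i], records[j]):
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up sample data: the first three records describe the same person.
RECORDS = [
    {"name": "Alice Smith", "email": "alice@example.com"},
    {"name": "alice smith ", "email": "alice@example.com", "phone": "555-0100"},
    {"name": "Alice Smith", "phone": "555-0100"},
    {"name": "Bob Jones", "email": "bob@example.com"},
]

print(resolve(RECORDS))  # [[0, 1, 2], [3]]
```

Records 0 and 2 share no email or phone, yet they land in the same cluster because each matches record 1. That transitive step is also what the information-leakage discussion is concerned with: separately exposed fragments can be pieced together once some record links them.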

Item Type: Thesis (PhD)
ID Code: 1065
Deposited By: Steven Whang
Deposited On: 01 Feb 2013 01:20
Last Modified: 01 Feb 2013 01:20
