Sadikov, Eldar and Medina, Montserrat and Leskovec, Jure and Garcia-Molina, Hector (2010) Correcting for Missing Data in Information Cascades. Technical Report. Stanford InfoLab.
The transmission of infectious diseases, the propagation of information, and the spread of ideas and influence through social networks are all examples of diffusion. In such cases we say that a contagion spreads through the network, a process that can be modeled by a cascade graph. Studying cascades and network diffusion is challenging because of missing data: even a single missing observation in a sequence of propagation events can significantly alter our inferences about the diffusion process. We address the problem of missing data in information cascades. Specifically, given only a fraction C' of the complete cascade C, our goal is to estimate properties of C, such as its size or depth. To do so, we first formulate a k-tree model of cascades and analytically study its properties in the face of missing data. We then propose a numerical method that, given a cascade model and an observed cascade C', can estimate properties of the complete cascade C. We evaluate our methodology using information propagation cascades in the Twitter network (70 million nodes and 2 billion edges), as well as information cascades arising in the blogosphere. Our experiments show that the k-tree model is an effective tool for studying the effects of missing data in cascades. Most importantly, we show that our method (and the k-tree model) can accurately estimate properties of the complete cascade C even when 90% of the data is missing.
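The setting described in the abstract can be illustrated with a minimal sketch. It is not the report's numerical method. It assumes that the complete cascade C is a balanced k-ary tree, that each node is observed independently with probability p, and that an edge survives in C' only when both of its endpoints are observed. The function names (k_tree_parents, observe, summarize) and the naive size estimate (observed node count divided by p) are hypothetical and introduced here only for illustration.

```python
import random


def k_tree_parents(k, height):
    """Return a parent list for a balanced k-ary tree (root has parent None)."""
    # Number of nodes in a complete k-ary tree with the given height.
    n = (k ** (height + 1) - 1) // (k - 1) if k > 1 else height + 1
    # With breadth-first numbering, node i > 0 has parent (i - 1) // k.
    return [None] + [(i - 1) // k for i in range(1, n)]


def observe(parents, p, rng):
    """Keep each node with probability p (assumed sampling model).

    An edge is kept only if both endpoints are kept, so the observed
    cascade C' may split into several disconnected pieces.
    """
    kept = {i for i in range(len(parents)) if rng.random() < p}
    edges = [
        (parents[i], i)
        for i in sorted(kept)
        if parents[i] is not None and parents[i] in kept
    ]
    return kept, edges


def summarize(parents, p, seed=0):
    """Summarize the observed cascade C' and a naive size estimate for C."""
    rng = random.Random(seed)
    kept, edges = observe(parents, p, rng)
    return {
        "observed_nodes": len(kept),
        "observed_edges": len(edges),
        # A subgraph of a tree is a forest, so components = nodes - edges.
        "components": len(kept) - len(edges),
        # Naive scaling by the sampling rate, not the report's estimator.
        "naive_size_estimate": len(kept) / p,
    }


if __name__ == "__main__":
    complete = k_tree_parents(k=3, height=6)
    print("complete size:", len(complete))
    print(summarize(complete, p=0.1))
```

In this sketch, scaling the observed node count by 1/p gives a rough size estimate, but the observed cascade typically breaks into many small components, so structural properties such as depth cannot be read off C' directly. That gap is what a model-based correction like the one described in the abstract is meant to address.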
|Item Type:||Techreport (Technical Report)|
|Uncontrolled Keywords:||information cascades, social networks, missing data, sampling, Twitter, blogs, information diffusion, balanced trees, numerical methods, analytical models|
|Deposited By:||Eldar Sadikov|
|Deposited On:||23 Jul 2010 00:31|
|Last Modified:||14 Nov 2010 19:16|