Salihoglu, Semih (2015) Massive-scale Processing of Record-oriented and Graph Data. PhD thesis, Stanford University.
Many data-driven applications perform computations on volumes of data too large to fit on a single computer. These applications typically must use parallel, shared-nothing distributed software systems to perform their computations. This thesis addresses challenges in large-scale distributed data processing, with a particular focus on two primary areas: (i) theoretical foundations for understanding the costs of distribution; and (ii) processing large-scale graph data.

The first part of this thesis presents a theoretical framework for the MapReduce system, used to analyze the cost of distribution for different problem domains and to evaluate the "goodness" of different algorithms. We identify a fundamental tradeoff between the parallelism and communication costs of algorithms. We first study the setting in which computations are constrained to a single round of MapReduce. In this setting, we capture the cost of distributing a problem by deriving a lower-bound curve on the communication cost of any algorithm that solves the problem at different parallelism levels. We derive lower-bound curves for several problems, and prove that existing or new one-round algorithms solving these problems are optimal, i.e., they incur the minimum possible communication cost at each parallelism level. We then show that by allowing multiple rounds of MapReduce computation, we can solve problems more efficiently than any possible one-round algorithm.

The second part of this thesis addresses challenges in systems for processing large-scale graph data, with the goal of making graph computation more efficient and easier to program and debug. We focus on systems modeled after Google's Pregel framework for large-scale distributed graph processing. We begin by describing an open-source version of Pregel we developed, called GPS (for Graph Processing System).
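The parallelism-communication tradeoff studied in the first part can be illustrated with a toy one-round scheme for the "all pairs" problem, where every pair of inputs must meet at some reducer. This is a minimal sketch, not code or analysis from the thesis: the function name and the group-based replication scheme are illustrative. Each input is hashed into one of g groups, and reducer (i, j) receives groups i and j, so increasing the number of reducers to g(g+1)/2 increases the total communication to n·g messages.

```python
from itertools import combinations_with_replacement

def all_pairs_mapper(inputs, g):
    """One-round scheme for the 'all pairs' problem (illustrative).

    Hash each input into one of g groups; reducer (i, j) receives
    groups i and j, so every pair of inputs meets at exactly one
    reducer.  Returns the reducer assignment and the communication
    cost (total number of input copies sent to reducers).
    """
    # One reducer per unordered pair of groups: g*(g+1)/2 reducers.
    reducers = {key: [] for key in combinations_with_replacement(range(g), 2)}
    for x in inputs:
        grp = hash(x) % g
        # Each input is replicated to every reducer covering its group:
        # exactly g reducers, so total communication is n * g.
        for key in reducers:
            if grp in key:
                reducers[key].append(x)
    comm_cost = sum(len(bucket) for bucket in reducers.values())
    return reducers, comm_cost
```

Raising g increases parallelism (more reducers, each with a smaller input) but also raises per-input replication, and hence communication, from n to n·g; this is the shape of tradeoff that the lower-bound curves formalize.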
We then describe new static and dynamic schemes for partitioning graphs across machines, and we present experimental results on the performance effects of the different partitioning schemes. Next, we describe a set of algorithmic optimizations that address commonly appearing inefficiencies in algorithms programmed on Pregel-like systems. Because it can be very difficult to debug programs in Pregel-like systems, we developed a new replay-style debugger called Graft. In addition, we defined and implemented a set of high-level parallelizable graph primitives, called HelP (for High-level Primitives), as an alternative to programming graph algorithms with the low-level vertex-centric functions of existing systems. HelP primitives capture several commonly appearing operations in large-scale graph computations. We motivate and describe Graft and HelP using real-world applications and algorithms.
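To give a flavor of the vertex-centric programming model these systems expose, the sketch below computes connected components by min-label propagation, a commonly used vertex-centric algorithm. It is a minimal single-machine simulation, not code from GPS, Pregel, or HelP; each iteration of the loop plays the role of one superstep, with vertices exchanging messages and halting when their labels stop changing.

```python
def propagate_min_labels(edges, num_vertices):
    """Single-machine sketch of a vertex-centric (Pregel-style)
    computation: connected components by min-label propagation.

    Each superstep, active vertices send their current label to
    their neighbors; a vertex that receives a smaller label adopts
    it and stays active, otherwise it votes to halt.
    """
    adj = {v: [] for v in range(num_vertices)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    label = {v: v for v in range(num_vertices)}   # start with own id
    active = set(range(num_vertices))
    supersteps = 0
    while active:                                 # stop when all halt
        msgs = {}
        for u in active:                          # send phase
            for v in adj[u]:
                msgs[v] = min(msgs.get(v, label[u]), label[u])
        active = set()
        for v, m in msgs.items():                 # compute phase
            if m < label[v]:
                label[v] = m                      # smaller label seen:
                active.add(v)                     # update and stay active
        supersteps += 1
    return label, supersteps
```

The number of supersteps grows with the diameter of the largest component, which is one source of the inefficiencies that algorithmic optimizations in Pregel-like systems aim to reduce.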
Item Type: Thesis (PhD)
Deposited By: Semih Salihoglu
Deposited On: 07 Jul 2015 04:32
Last Modified: 05 Sep 2015 12:57