Babcock, Brian and Babu, Shivnath and Datar, Mayur and Motwani, Rajeev and Thomas, Dilys (2003) Operator Scheduling in Data Stream Systems. Technical Report. Stanford.
This is the latest version of this item.
In many applications involving continuous data streams, data arrival is bursty and the data rate fluctuates over time. Systems that seek to give rapid or real-time query responses in such an environment must be prepared to deal gracefully with bursts in data arrival without compromising system performance. We discuss one strategy for processing bursty streams: adaptive, load-aware scheduling of query operators to minimize resource consumption during times of peak load. We show that the choice of an operator scheduling strategy can have a significant impact on run-time system memory usage as well as on output latency. Our aim is to design a scheduling strategy that minimizes the maximum run-time system memory while keeping the output latency within prespecified bounds. We first present Chain scheduling, an operator scheduling strategy for data stream systems that is near-optimal in minimizing run-time memory usage for any collection of single-stream queries involving selections, projections, and foreign-key joins with stored relations. Chain scheduling also performs well for queries with sliding-window joins over multiple streams, and for multiple queries of the above types. However, during bursts in input streams, when there is a buildup of unprocessed tuples, Chain scheduling may lead to high output latency. We therefore study the online problem of minimizing maximum run-time memory subject to a constraint on maximum latency, and present preliminary observations, negative results, and heuristics for this problem. We provide a thorough experimental evaluation in which we demonstrate the potential benefits of Chain scheduling and its variants, compare them with competing scheduling strategies, and validate our analytical conclusions.
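The claim that the scheduling strategy affects run-time memory can be illustrated with a minimal sketch. Everything in it is a hypothetical toy setup assumed for illustration, not a workload described in this abstract: a chain of two operators, where O1 shrinks a unit-size tuple to size 0.2 and O2 emits it out of the system, each operator takes one time unit per tuple, and one tuple arrives per time step. The "greedy" policy below simply prefers the operator that frees the most memory per unit of time; it is a stand-in for a memory-aware scheduler and is not the Chain algorithm itself.

```python
def simulate(policy, steps):
    """Return the queued memory at each time step for a toy two-operator chain.

    Assumed (hypothetical) setup: O1 turns a size-1.0 tuple into a size-0.2
    tuple, O2 removes a size-0.2 tuple from the system, each operator needs
    one time unit per tuple, and one size-1.0 tuple arrives per time step.
    """
    q1 = 0  # tuples waiting for O1 (size 1.0 each)
    q2 = 0  # tuples waiting for O2 (size 0.2 each)
    memory = []
    for _ in range(steps):
        q1 += 1  # a new tuple arrives at the start of the step
        # Sum in integer tenths to avoid floating-point drift.
        memory.append((10 * q1 + 2 * q2) / 10)

        if policy == "fifo":
            # Finish the oldest tuple first; a tuple waiting for O2 is
            # always older than any tuple still waiting for O1.
            run_o2 = q2 > 0
        elif policy == "greedy":
            # O1 frees 0.8 per time unit, O2 only 0.2, so prefer O1.
            run_o2 = q1 == 0 and q2 > 0
        else:
            raise ValueError(f"unknown policy: {policy}")

        if run_o2:
            q2 -= 1
        elif q1 > 0:
            q1 -= 1
            q2 += 1
    return memory


if __name__ == "__main__":
    for name in ("fifo", "greedy"):
        trace = simulate(name, 7)
        print(f"{name:6s} peak={max(trace)} trace={trace}")
```

Over seven steps of sustained arrivals, the FIFO trace climbs to a peak of 4.0 while the greedy trace peaks at 2.2. The sketch ignores latency entirely: the greedy policy never runs O2 while input keeps arriving, which is the kind of output delay that motivates the latency-constrained variant of the problem.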
|Item Type:||Techreport (Technical Report)|
|Additional Information:||This paper is an extended version of our paper titled "Chain: Operator Scheduling for Memory Minimization in Data Stream Systems" that appeared in the proceedings of SIGMOD 2003. The basic Chain algorithm and its theoretical and experimental analysis were reported in the SIGMOD paper. The NP-completeness result showing the intractability of the problem of minimizing memory in Section 4, and the theoretical results and experiments for handling latency constraints in Sections 5 and 6.2 respectively are being presented for the first time in this paper.|
|Subjects:||Computer Science > Data Streams|
|Related URLs:||Project Homepage||http://infolab.stanford.edu/stream/|
|Deposited By:||Import Account|
|Deposited On:||21 Oct 2003 17:00|
|Last Modified:||24 Dec 2008 08:27|
Available Versions of this Item
- Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. (deposited 08 Mar 2003 16:00)
- Operator Scheduling in Data Stream Systems. (deposited 21 Oct 2003 17:00) [Currently Displayed]