Afrati, Foto N. and Ullman, Jeffrey D. (2009) Optimizing Joins in a Map-Reduce Environment. Technical Report. Stanford InfoLab. (Publication Note: To appear in EDBT 2010)
BibTeX | DublinCore | EndNote | HTML |
This is the latest version of this item.
| PDF 273Kb |
Abstract
Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is mistakenly included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
Item Type: | Techreport (Technical Report) |
---|---|
Uncontrolled Keywords: | map, reduce, join, optimization, lagrangean, fact table, dimension table, analytic query, chain join, multiway join |
ID Code: | 957 |
Deposited By: | Prof Jeffrey Ullman |
Deposited On: | 19 Jan 2010 11:19 |
Last Modified: | 19 Jan 2010 11:19 |
Available Versions of this Item
- Optimizing Joins in a Map-Reduce Environment. (deposited 24 Dec 2009 07:34)
- Optimizing Joins in a Map-Reduce Environment. (deposited 19 Jan 2010 11:19) [Currently Displayed]
Download statistics
Repository Staff Only: item control page