Stanford InfoLab Publication Server

Anchor Points Algorithms for Hamming and Edit Distance

Afrati, Foto and Das Sarma, Anish and Rajaraman, Anand and Rule, Pokey and Salihoglu, Semih and Ullman, Jeffrey (2014) Anchor Points Algorithms for Hamming and Edit Distance. In: International Conference on Database Theory, 24-28 March 2014, Athens, Greece.



Algorithms for computing similarity joins in MapReduce were offered in [2]. A similarity join asks for all pairs of inputs that are within a given distance d of each other according to some distance measure. Here we explore the "anchor-points algorithm" of [2]. We continue looking at Hamming distance and show that the method of that paper can be improved: if we want to find strings within Hamming distance d, and anchor points are chosen so that every possible input is within Hamming distance k of some anchor point, then it is sufficient to send each input to all anchor points within distance (d/2)+k, rather than d+k as was suggested in the earlier paper. This improves the communication cost of the MapReduce algorithm, i.e., it reduces the amount of data transmitted among machines. Further, the same holds for edit distance, provided all inputs have the same length n and either all anchor points have length n-k or all anchor points have length n+k. We then explore the problem of finding small sets of anchor points for edit distance, which also improves the communication cost. We give a close-to-optimal technique to extend anchor sets (called "covering codes") from the k=1 case to any k. We then give small covering codes that use either a single deletion or a single insertion, or, in one algorithm, two deletions. Discovering covering codes for edit distance is important in its own right, since very little work on the problem is known.
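
The Hamming-distance claim in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's implementation. The function names are hypothetical, the anchor set is built by a simple greedy scan rather than by any construction from the paper, and the radius uses ceil(d/2)+k so that odd d is handled (an assumption; the abstract writes (d/2)+k). Each anchor point plays the role of one reducer.

```python
from itertools import combinations, product
from math import ceil


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_anchors(n, k):
    """Illustrative anchor set: scan all n-bit strings and keep any string
    not yet within distance k of a chosen anchor. Every n-bit string ends up
    within distance k of some anchor. Not the paper's construction."""
    anchors = []
    for bits in product("01", repeat=n):
        s = "".join(bits)
        if all(hamming(s, a) > k for a in anchors):
            anchors.append(s)
    return anchors


def anchor_join(inputs, anchors, d, k):
    """Find all pairs of inputs within Hamming distance d.
    Map step: send each input to every anchor within ceil(d/2) + k.
    Reduce step: compare the inputs that met at the same anchor."""
    radius = ceil(d / 2) + k  # assumption: ceiling covers odd d
    buckets = {a: [] for a in anchors}
    for s in inputs:
        for a in anchors:
            if hamming(s, a) <= radius:
                buckets[a].append(s)
    result = set()
    for group in buckets.values():
        for s, t in combinations(group, 2):
            if hamming(s, t) <= d:
                result.add(tuple(sorted((s, t))))  # dedupe across anchors
    return result


def brute_force_join(inputs, d):
    """Reference answer: compare every pair directly."""
    return {
        tuple(sorted((s, t)))
        for s, t in combinations(inputs, 2)
        if hamming(s, t) <= d
    }
```

The smaller radius is enough because two strings at distance at most d have a "midpoint" string within ceil(d/2) of both, and that midpoint lies within k of some anchor, so both strings reach that anchor. Comparing `anchor_join(inputs, greedy_anchors(n, k), d, k)` against `brute_force_join(inputs, d)` on all strings of a small length n is one way to check the sketch.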

Item Type: Conference or Workshop Item (Paper)
ID Code: 1082
Deposited By: Semih Salihoglu
Deposited On: 11 Dec 2013 14:26
Last Modified: 11 Dec 2013 14:26
