Track Join – Distributed Joins with Minimal Network Traffic
Speaker: Prof. Kenneth A. Ross, Columbia University, http://www.cs.columbia.edu/~kar/
Date, Time, Venue: Thursday, August 21, 2014, 2-3 PM, POST 302
Abstract: Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less recent attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs, but could avoid redundant transfers of records across the network. We introduce track join, a novel distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Our evaluation based on real and synthetic data shows that track join adapts to diverse cases and degrees of locality. Considering both network traffic and execution time, even with no locality, track join outperforms hash join on the most expensive queries of real workloads.
This talk represents joint work with Orestis Polychroniou and Rajkumar Sen.
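To make the abstract's idea of a per-key transfer schedule concrete, the sketch below compares the bytes sent for one join key under a plain hash join with the bytes sent when the tuples of one table are copied only to the nodes that hold matching tuples of the other table. This is an illustrative cost model under stated assumptions, not the algorithm presented in the talk: the function names, the node-to-bytes dictionaries, and the choice to ignore the cost of exchanging key-location messages are all hypothetical simplifications.

```python
from typing import Dict, Tuple

# Hypothetical representation: node id -> bytes of tuples held for ONE join key.
Sizes = Dict[int, int]


def hash_join_cost(r: Sizes, s: Sizes, dest: int) -> int:
    """Bytes sent if every tuple for the key is shipped to a single node `dest`."""
    moved_r = sum(size for node, size in r.items() if node != dest)
    moved_s = sum(size for node, size in s.items() if node != dest)
    return moved_r + moved_s


def selective_broadcast_cost(src: Sizes, dst: Sizes) -> int:
    """Bytes sent if all `src` tuples are copied to each node holding `dst` tuples.

    A node that already stores part of `src` only receives the remainder.
    """
    total = sum(src.values())
    return sum(total - src.get(node, 0) for node, size in dst.items() if size > 0)


def per_key_schedule(r: Sizes, s: Sizes) -> Tuple[str, int]:
    """Pick the cheaper direction for this key (assumed simplification)."""
    r_to_s = selective_broadcast_cost(r, s)
    s_to_r = selective_broadcast_cost(s, r)
    return ("R->S", r_to_s) if r_to_s <= s_to_r else ("S->R", s_to_r)


# Example key: R has 100 bytes on node 0; S has 400 bytes on node 0 and 50 on node 1.
r_sizes = {0: 100}
s_sizes = {0: 400, 1: 50}

hash_cost = hash_join_cost(r_sizes, s_sizes, dest=2)   # key happens to hash to node 2
direction, tracked_cost = per_key_schedule(r_sizes, s_sizes)
print(hash_cost, direction, tracked_cost)              # 550 S->R 50
```

In this example the hash join moves all 550 bytes to node 2, while the per-key choice moves only the 50 bytes of S on node 1 to node 0, where the rest of the data already resides. The extra bookkeeping needed to know where each key lives is the CPU side of the trade-off the abstract mentions.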