Performance optimizations for distributed real-time text indexing

Authors: Ankur Narang, Karthik Swaminathan, Prashant Agrawal
DOI: 10.1109/HIPC.2009.5433188

DISC (Data-Intensive Super Computing) is gaining strong research momentum. DISC systems differ from conventional supercomputers in their focus on data: they acquire and maintain continually changing massive data sets, in addition to performing large-scale computations over the data. To this end, we consider the problem of real-time text indexing and search with high input data rates (10 GB/s or more) and a small index age-off time, while sustaining search response time. Load imbalance and communication bottlenecks make this problem particularly challenging. We present performance optimizations for distributed in-memory text indexing of massive input data sets on parallel systems with a large number of cores/processors, with sustained search performance. Our distributed indexing algorithm uses a hybrid group-based approach that enables scalable indexing and search over massively parallel systems. In addition, we designed and analyzed communication optimizations, including routing using Steiner nodes and topology mapping. Using theoretical analysis of the asymptotic parallel time complexity, we establish the scalability of our algorithm with |P| for small index-group sizes and with |C| for larger index-group sizes, where |P| and |C| are the number of Producers and Consumers, respectively, in the index-group. We have obtained an indexing throughput of 524 GB/min on 4K nodes of Blue Gene/L using actual IBM intranet data, which is 3.36x better than the previous best throughput and 10.3x better than typical indexing approaches such as CLucene on the same number of nodes. This gives an estimated throughput of 17 TB/min on 128K nodes with sustained search performance. We also demonstrate improvements in the strong and weak scalability of our distributed indexing algorithm over the previous best on Blue Gene/L. To the best of our knowledge, this is the highest indexing throughput ever published on a large-scale system with sustained search performance.
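The throughput figures in the abstract can be related to one another with a little arithmetic. The sketch below is illustrative only: it assumes the 128K-node estimate comes from scaling the measured 4K-node throughput in direct proportion to node count (the abstract does not say how the estimate was derived), and it uses decimal units (1 TB = 1000 GB). The baseline values it computes are implied by the quoted speedup ratios, not reported in the abstract, and all function and constant names are hypothetical.

```python
# Figures quoted in the abstract.
MEASURED_GB_PER_MIN = 524.0      # indexing throughput on 4K Blue Gene/L nodes
MEASURED_NODES = 4 * 1024        # "4K nodes"
TARGET_NODES = 128 * 1024        # "128K nodes"
SPEEDUP_VS_PREVIOUS_BEST = 3.36
SPEEDUP_VS_CLUCENE = 10.3


def extrapolate_linear(throughput, nodes, target_nodes):
    """Scale throughput in proportion to node count.

    Assumption: perfectly linear scaling. This is not stated in the abstract.
    """
    return throughput * target_nodes / nodes


def implied_baseline(throughput, speedup):
    """Throughput a baseline would need for the quoted speedup to hold."""
    return throughput / speedup


projected = extrapolate_linear(MEASURED_GB_PER_MIN, MEASURED_NODES, TARGET_NODES)
previous_best = implied_baseline(MEASURED_GB_PER_MIN, SPEEDUP_VS_PREVIOUS_BEST)
clucene = implied_baseline(MEASURED_GB_PER_MIN, SPEEDUP_VS_CLUCENE)

print(f"Projected at 128K nodes: {projected / 1000:.1f} TB/min")
print(f"Implied previous best on 4K nodes: {previous_best:.0f} GB/min")
print(f"Implied CLucene-style baseline on 4K nodes: {clucene:.0f} GB/min")
```

Under the linear-scaling assumption, the node count grows by a factor of 32, so the projection is 524 × 32 GB/min, which is consistent with the abstract's rounded estimate of 17 TB/min.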
Conference: High Performance Computing - HiPC, pp. 398-407, 2009