Efficient Similarity Search with Cache-Conscious Data Traversal
- Degree Grantor:
- University of California, Santa Barbara. Computer Science
- Degree Supervisor:
- Tao Yang
- Place of Publication:
- [Santa Barbara, Calif.]
- Publisher:
- University of California, Santa Barbara
- Creation Date:
- 2015
- Issued Date:
- 2015
- Topics:
- Computer Science
- Keywords:
- Tree-based ensemble model, Cache locality optimization, All-pairs similarity search, Load balance, Information retrieval, and Search result ranking
- Genres:
- Online resources and Dissertations, Academic
- Dissertation:
- Ph.D.--University of California, Santa Barbara, 2015
- Description:
Similarity search is important for many data-intensive applications that need to identify sets of similar objects. Examples include near-duplicate detection and clustering, collaborative filtering for similarity-based recommendations, search query suggestion, and data cleaning. Conducting similarity search is time-consuming, especially when a massive amount of data is involved and all pairs are compared. Previous work has used comparison filtering, inverted indexing, and parallel accumulation of partial intermediate results to expedite execution. However, shuffling intermediate results can incur significant communication overhead as data scales up.
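As illustration only (not text or code from the dissertation), the inverted-indexing-with-score-accumulation approach mentioned above can be sketched as follows; the function names, the dictionary-based sparse vectors, and the assumption of L2-normalized weights are illustrative choices.

```python
# Illustrative sketch: inverted indexing with score accumulation for
# cosine all-pairs similarity search. A generic textbook construction,
# not code from the dissertation; names and the normalization assumption
# (L2-normalized sparse vectors) are illustrative.
from collections import defaultdict


def build_index(docs):
    """docs: list of {term: weight} sparse vectors, assumed L2-normalized."""
    index = defaultdict(list)
    for doc_id, vec in enumerate(docs):
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def pairs_above(docs, threshold):
    """Accumulate cosine partial products through the index; emit pairs
    whose accumulated dot product reaches the threshold."""
    index = build_index(docs)
    results = []
    for i, vec in enumerate(docs):
        acc = defaultdict(float)
        for term, w_i in vec.items():
            for j, w_j in index[term]:
                if j < i:                    # count each unordered pair once
                    acc[j] += w_i * w_j
        results.extend((j, i, score) for j, score in acc.items()
                       if score >= threshold)
    return results


if __name__ == "__main__":
    docs = [{"a": 0.8, "b": 0.6}, {"a": 0.6, "b": 0.8}, {"c": 1.0}]
    print(pairs_above(docs, threshold=0.9))   # -> [(0, 1, ~0.96)]
```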
We have developed a fast two-stage partition-based approach for all-pairs similarity search that incorporates static partitioning, optimized load balancing, and cache-conscious data traversal. Static partitioning places dissimilar documents into different groups to eliminate unnecessary comparisons of their content. To overcome the challenges posed by the skewed distribution of partition sizes and the irregular dissimilarity relationships in large datasets, we balance the computation load of partitioned similarity search and provide a competitiveness analysis. These techniques can improve performance by one to two orders of magnitude by reducing unnecessary I/O and data communication and improving load balance. We also discuss how to further accelerate similarity search by incorporating incremental computing and approximation methods such as Locality Sensitive Hashing.
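Again as illustration only, the Locality Sensitive Hashing approximation mentioned above can be sketched with standard random-hyperplane (SimHash) bucketing for cosine similarity; this generic construction stands in for, and is not, the dissertation's static partitioning method, and the bit count, seed, and threshold are arbitrary illustrative values.

```python
# Illustrative sketch: approximate all-pairs similarity search with
# random-hyperplane LSH (SimHash) bucketing for cosine similarity.
# Generic construction, not the dissertation's partitioning scheme.
from collections import defaultdict
from itertools import combinations

import numpy as np


def cosine(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))


def candidate_pairs(vectors, n_bits=8, seed=0):
    """Bucket vectors by the signs of projections onto random hyperplanes;
    only vectors sharing a bucket are ever compared exactly."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, vectors.shape[1]))
    signatures = (vectors @ planes.T) > 0          # shape: (n_docs, n_bits)
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        buckets[sig.tobytes()].append(doc_id)
    for members in buckets.values():
        yield from combinations(members, 2)


def similar_pairs(vectors, threshold=0.8):
    """Verify LSH candidates with exact cosine similarity."""
    results = []
    for i, j in candidate_pairs(vectors):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            results.append((i, j, score))
    return results


if __name__ == "__main__":
    docs = np.random.default_rng(1).standard_normal((500, 32))
    print(len(similar_pairs(docs, threshold=0.5)))
```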
Because of data sparsity and irregularity, accessing feature vectors in memory for runtime comparison incurs significant overhead in the modern memory hierarchy. We have designed and implemented cache-conscious algorithms that improve the runtime efficiency of similarity search. The same idea of optimizing data layout and traversal patterns is also applied to runtime search result ranking with multi-tree ensemble models.
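As a final illustrative sketch (again, not the dissertation's implementation), one common cache-conscious layout for multi-tree ensemble scoring packs all tree nodes into flat parallel arrays so traversal reads contiguous memory rather than chasing pointers; the class name, field names, and sample trees below are assumptions.

```python
# Illustrative sketch: a cache-conscious flat-array (structure-of-arrays)
# layout for scoring a tree ensemble. Nodes of all trees live in contiguous
# parallel arrays instead of pointer-linked objects. Generic illustration,
# not the dissertation's exact blocking or layout scheme.
import numpy as np


class FlatForest:
    def __init__(self, feature, threshold, left, right, value, roots):
        # Parallel arrays over all nodes of all trees. Leaves are marked
        # with feature == -1 and carry their output in `value`.
        self.feature = np.asarray(feature, dtype=np.int32)
        self.threshold = np.asarray(threshold, dtype=np.float32)
        self.left = np.asarray(left, dtype=np.int32)
        self.right = np.asarray(right, dtype=np.int32)
        self.value = np.asarray(value, dtype=np.float32)
        self.roots = np.asarray(roots, dtype=np.int32)  # root index per tree

    def score(self, x):
        """Sum leaf values over all trees for one feature vector x."""
        total = 0.0
        for node in self.roots:
            while self.feature[node] != -1:          # descend to a leaf
                if x[self.feature[node]] <= self.threshold[node]:
                    node = self.left[node]
                else:
                    node = self.right[node]
            total += float(self.value[node])
        return total


if __name__ == "__main__":
    # Two tiny decision stumps: split on feature 0 and feature 1 respectively.
    forest = FlatForest(
        feature=[0, -1, -1, 1, -1, -1],
        threshold=[0.5, 0, 0, 0.3, 0, 0],
        left=[1, -1, -1, 4, -1, -1],
        right=[2, -1, -1, 5, -1, -1],
        value=[0, 1.0, 2.0, 0, 0.5, 1.5],
        roots=[0, 3],
    )
    print(forest.score(np.array([0.2, 0.9])))  # 1.0 + 1.5 = 2.5
```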
- Physical Description:
- 1 online resource (163 pages)
- Format:
- Text
- Collection(s):
- UCSB electronic theses and dissertations
- Other Versions:
- http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:3700215
- ARK:
- ark:/48907/f33f4msg
- ISBN:
- 9781321700671
- Catalog System Number:
- 990045119700203776
- Copyright:
- Xun Tang, 2015
- Rights:
- In Copyright
- Copyright Holder:
- Xun Tang
Access: Public access

File | Description
---|---
Tang_ucsb_0035D_12538.pdf | pdf (Portable Document Format)