Querying Patterns in High-Dimensional Heterogenous Datasets
- Degree Grantor:
- University of California, Santa Barbara. Computer Science
- Degree Supervisor:
- Ambuj K. Singh
- Place of Publication:
- [Santa Barbara, Calif.]
- Publisher:
- University of California, Santa Barbara
- Creation Date:
- 2012
- Issued Date:
- 2012
- Topics:
- Engineering, Computer, Computer Science, and Information Technology
- Keywords:
- Scalable,
Heterogenous,
Multi-Modal,
Search,
Patterns, and
High Dimensional - Genres:
- Online resources and Dissertations, Academic
- Dissertation:
- Ph.D.--University of California, Santa Barbara, 2012
- Description:
The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a word-histogram. Analyzing these datasets gives better insights into the processes generating the datasets, opens new frontiers of scientific research, and fuels development of life-changing products.
An effective mechanism for exploring these heterogenous datasets is querying. One such kind of query is a pattern query. Given a heterogenous dataset and a query, the task here is to find a set of objects which are constrained by a relationship and satisfy the query. For example, given a dataset of keyword-tagged objects, a useful pattern query is to find a set of similar objects that contains a given set of keywords.
Querying patterns in high-dimensional heterogenous datasets brings about a new set of computational challenges. High performance algorithms to efficiently and accurately query patterns are presented in this thesis. First, a scalable algorithm, SIMP, is described for accurately querying near neighbors in a high-dimensional dataset. SIMP significantly outperforms the state-of-the-art techniques. Next, a novel algorithm, ProMiSH, is proposed for efficiently querying patterns by keywords. ProMiSH has a speed-up of more than four orders over the state-of-the-art techniques. Then, an algorithm, QUIP, is described for querying patterns by example in a spatial dataset, e.g., geographical maps. QUIP offers an improvement of 87% in running time over the baseline approach. Next, an algorithm for querying patterns by example in a temporal dataset is described. It specifically solves the problem of finding duplicate videos. The proposed algorithm yields a practical query time for video duplicate detection. Finally, a scalable method to compute statistical significance of results of a multi-object query is discussed. Statistical significance or p-value provides a more useful criterion for ranking the results of a query.
- Physical Description:
- 1 online resource (280 pages)
- Format:
- Text
- Collection(s):
- UCSB electronic theses and dissertations
- Other Versions:
- http://gateway.proquest.com/openurl?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:dissertation&res_dat=xri:pqm&rft_dat=xri:pqdiss:3505307
- ARK:
- ark:/48907/f39z92t7
- ISBN:
- 9781267294876
- Catalog System Number:
- 990037519200203776
- Copyright:
- Vishwakarma Singh, 2011
- Rights:
- In Copyright
- Copyright Holder:
- Vishwakarma Singh
Access: This item is restricted to on-campus access only. Please check our FAQs or contact UCSB Library staff if you need additional assistance. |