Alexandria Digital Research Library

Querying Patterns in High-Dimensional Heterogenous Datasets

Author:
Singh, Vishwakarma
Degree Grantor:
University of California, Santa Barbara. Computer Science
Degree Supervisor:
Ambuj K. Singh
Place of Publication:
[Santa Barbara, Calif.]
Publisher:
University of California, Santa Barbara
Creation Date:
2012
Issued Date:
2012
Topics:
Engineering, Computer, Computer Science, and Information Technology
Keywords:
Scalable
Heterogenous
Multi-Modal
Search
Patterns
High Dimensional
Genres:
Online resources and Dissertations, Academic
Dissertation:
Ph.D.--University of California, Santa Barbara, 2012
Description:

The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a word-histogram. Analyzing these datasets gives better insights into the processes generating the datasets, opens new frontiers of scientific research, and fuels development of life-changing products.

An effective mechanism for exploring these heterogenous datasets is querying. One such kind of query is a pattern query. Given a heterogenous dataset and a query, the task here is to find a set of objects which are constrained by a relationship and satisfy the query. For example, given a dataset of keyword-tagged objects, a useful pattern query is to find a set of similar objects that contains a given set of keywords.

Querying patterns in high-dimensional heterogenous datasets brings about a new set of computational challenges. High performance algorithms to efficiently and accurately query patterns are presented in this thesis. First, a scalable algorithm, SIMP, is described for accurately querying near neighbors in a high-dimensional dataset. SIMP significantly outperforms the state-of-the-art techniques. Next, a novel algorithm, ProMiSH, is proposed for efficiently querying patterns by keywords. ProMiSH has a speed-up of more than four orders over the state-of-the-art techniques. Then, an algorithm, QUIP, is described for querying patterns by example in a spatial dataset, e.g., geographical maps. QUIP offers an improvement of 87% in running time over the baseline approach. Next, an algorithm for querying patterns by example in a temporal dataset is described. It specifically solves the problem of finding duplicate videos. The proposed algorithm yields a practical query time for video duplicate detection. Finally, a scalable method to compute statistical significance of results of a multi-object query is discussed. Statistical significance or p-value provides a more useful criterion for ranking the results of a query.

Physical Description:
1 online resource (280 pages)
Format:
Text
Collection(s):
UCSB electronic theses and dissertations
ARK:
ark:/48907/f39z92t7
ISBN:
9781267294876
Catalog System Number:
990037519200203776
Rights:
Inc.icon only.dark In Copyright
Copyright Holder:
Vishwakarma Singh
Access: This item is restricted to on-campus access only. Please check our FAQs or contact UCSB Library staff if you need additional assistance.