Querying Patterns in High-Dimensional Heterogenous Datasets

Author:

Singh, Vishwakarma

Degree Grantor:

University of California, Santa Barbara. Computer Science

Degree Supervisor:

Ambuj K. Singh

Place of Publication:

[Santa Barbara, Calif.]

Publisher:

University of California, Santa Barbara

Creation Date:

2012

Issued Date:

2012

Topics:

Engineering, Computer, Computer Science, and Information Technology

Keywords:

Scalable,
Heterogenous,
Multi-Modal,
Search,
Patterns, and
High Dimensional

Genres:

Online resources and Dissertations, Academic

Dissertation:

Ph.D.--University of California, Santa Barbara, 2012

Description:

The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a word-histogram. Analyzing these datasets gives better insights into the processes generating the datasets, opens new frontiers of scientific research, and fuels development of life-changing products.

An effective mechanism for exploring these heterogenous datasets is querying. One such kind of query is a pattern query. Given a heterogenous dataset and a query, the task here is to find a set of objects which are constrained by a relationship and satisfy the query. For example, given a dataset of keyword-tagged objects, a useful pattern query is to find a set of similar objects that contains a given set of keywords.

Querying patterns in high-dimensional heterogenous datasets brings about a new set of computational challenges. High performance algorithms to efficiently and accurately query patterns are presented in this thesis. First, a scalable algorithm, SIMP, is described for accurately querying near neighbors in a high-dimensional dataset. SIMP significantly outperforms the state-of-the-art techniques. Next, a novel algorithm, ProMiSH, is proposed for efficiently querying patterns by keywords. ProMiSH has a speed-up of more than four orders over the state-of-the-art techniques. Then, an algorithm, QUIP, is described for querying patterns by example in a spatial dataset, e.g., geographical maps. QUIP offers an improvement of 87% in running time over the baseline approach. Next, an algorithm for querying patterns by example in a temporal dataset is described. It specifically solves the problem of finding duplicate videos. The proposed algorithm yields a practical query time for video duplicate detection. Finally, a scalable method to compute statistical significance of results of a multi-object query is discussed. Statistical significance or p-value provides a more useful criterion for ranking the results of a query.

Physical Description: