Querying and Mining Chemical Databases for Drug Discovery

Author:

Ranu, Sayan

Degree Grantor:

University of California, Santa Barbara. Computer Science

Degree Supervisor:

Ambuj K. Singh

Place of Publication:

[Santa Barbara, Calif.]

Publisher:

University of California, Santa Barbara

Creation Date:

2012

Issued Date:

2012

Topics:

Computer Science

Keywords:

Machine Learning,
Indexing,
Databases,
Drug Discovery, and
Data Mining

Genres:

Online resources and Dissertations, Academic

Dissertation:

Ph.D.--University of California, Santa Barbara, 2012

Description:

Drug discovery and development has exploded into a multibillion-dollar industry. Unfortunately, despite a steady increase in pharmaceutical research, the number of new drugs discovered has been, at best, flat. The low productivity of current approaches to drug discovery has been ascribed to a number of factors, including a narrow focus on a single protein target and undesirable effects, such as toxicity, that are discovered too late in the discovery process. In this dissertation, I propose strategies to combat the low productivity of current drug-discovery techniques and show that by integrating the principles of statistical significance and diversity into the molecular analysis framework, we can accelerate the rate of drug discovery.

In the first part of my thesis, I explore the importance of mining statistically significant patterns from large collections of scientific data and demonstrate their utility in drug discovery. I show that over-represented subgraphs in molecular databases are correlated with biological activity and can be used to learn accurate classification models. Furthermore, statistically significant pharmacophoric patterns can be employed to predict the binding mechanisms between small molecules and protein targets. Finally, I show that mining discriminative subgraphs from protein-protein interaction networks allows us to learn the complex network-encoded logic functions that decide the clinical outcomes of diseases.

In the second part of my thesis, I explore the importance of structural diversity in top-k queries, and develop index structures to answer such queries in a scalable manner. First, I explore the importance of modeling attractive and repulsive dimensions in molecular analysis and demonstrate their utility in going beyond traditional similarity or distance measures. Next, I show that diversity-aware top-k answer sets are informationally denser than traditional top-k answer sets.

Overall, this thesis proposes core indexing and mining algorithms that extend the current state of the art in computer science research. Among the various applications of the developed algorithms, their impact on drug discovery is the unifying theme that binds the chapters together. However, these methods are also applicable in other scientific domains such as software bug mining, analysis of communication graphs, social networks, sensor networks, and transportation networks.

Physical Description: