UVM Theses and Dissertations

Format:

Online

Author:

Hanley, John P.

Title:

A New Evolutionary Algorithm for Mining Noisy, Epistatic, Geospatial Survey Data Associated With Chagas Disease

Dept./Program:

Civil and Environmental Engineering

Year:

2017

Degree:

PhD

Abstract:

The scientific community is just beginning to understand some of the profound effects that feature interactions and heterogeneity have on natural systems. Despite the belief that these nonlinear and heterogeneous interactions exist across numerous real-world systems (e.g., from the development of personalized drug therapies to market predictions of consumer behaviors), the tools for analysis have not kept pace. This research was motivated by the desire to mine data from large socioeconomic surveys aimed at identifying the drivers of household infestation by a Triatomine insect that transmits the life-threatening Chagas disease. To decrease the risk of transmission, our colleagues at the laboratory of applied entomology and parasitology have implemented mitigation strategies (known as Ecohealth interventions); however, limited resources necessitate the search for better risk models. Mining these complex Chagas survey data for potential predictive features is challenging due to imbalanced class outcomes, missing data, heterogeneity, and the non-independence of some features. We develop an evolutionary algorithm (EA) to identify feature interactions in "Big Datasets" with desired categorical outcomes (e.g., disease or infestation). The method is non-parametric and uses the hypergeometric PMF as a fitness function to tackle challenges associated with using p-values in Big Data (e.g., p-values tend to decrease as the size of the dataset grows). To demonstrate the EA's effectiveness, we first test the algorithm on three benchmark datasets. These include two classic Boolean classifier problems: (1) the 'majority-on' problem and (2) the multiplexer problem, as well as (3) a simulated single nucleotide polymorphism (SNP) disease dataset. Next, we apply the EA to real-world Chagas Disease survey data and successfully archive numerous high-order feature interactions associated with infestation that would not have been discovered using traditional statistics.
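As an illustration of how a hypergeometric PMF can score a candidate feature interaction, the following is a minimal sketch and not the dissertation's implementation. It assumes a binary outcome and treats a candidate interaction as a conjunction of feature=value conditions; the names `hypergeom_pmf`, `rule_fitness`, `rows`, `outcomes`, and `rule` are hypothetical. A lower value means the observed number of positive cases among the matching records is less likely under random sampling.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): exactly k positives among n records drawn without
    replacement from N records, K of which are positive."""
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def rule_fitness(rows, outcomes, rule):
    """Score a conjunction of feature=value conditions (assumed form).

    rows     : list of dicts mapping feature name -> value
    outcomes : list of 0/1 class labels, one per row
    rule     : dict of feature name -> required value
    """
    N = len(rows)
    K = sum(outcomes)
    # Outcomes of the records that satisfy every condition in the rule.
    matched = [y for row, y in zip(rows, outcomes)
               if all(row.get(f) == v for f, v in rule.items())]
    n = len(matched)
    k = sum(matched)
    return hypergeom_pmf(k, N, K, n)

# Toy data (invented for illustration): outcome is the XOR of a and b,
# so neither feature is informative on its own.
rows = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1)] * 5
outcomes = [row["a"] ^ row["b"] for row in rows]

single = rule_fitness(rows, outcomes, {"a": 1})
pair = rule_fitness(rows, outcomes, {"a": 1, "b": 0})
print(single, pair)
```

In this toy case the two-feature rule receives a much smaller PMF value than the single-feature rule, which is the kind of signal an evolutionary search over rules could exploit.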
These feature interactions are also explored using network analysis. The spatial autocorrelation of the genetic data (SNPs of Triatoma dimidiata) was captured using geostatistics. Specifically, a modified semivariogram analysis was performed to characterize the SNP data and help elucidate the movement of the vector within two villages. For both villages, the SNP information showed strong spatial autocorrelation, albeit with different geostatistical characteristics (sills, ranges, and nuggets). These metrics were leveraged to create risk maps that suggest the more forested village had a sylvatic source of infestation, while the other village had a domestic/peridomestic source. This initial exploration into using Big Data to analyze disease risk shows that novel and modified existing statistical tools can improve the assessment of risk at a fine scale.
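For readers unfamiliar with semivariograms, the sketch below computes the classical empirical estimator, gamma(h) = (1 / (2 N(h))) * sum of (z_i - z_j)^2 over the N(h) point pairs whose separation falls in the lag bin around h. This is the textbook form, not the modified analysis described in the abstract, and it assumes a single scalar value per location; the function name and parameters are hypothetical.

```python
from itertools import combinations
from math import hypot

def empirical_semivariogram(points, values, bin_width, n_bins):
    """Return a list of (lag_center, gamma, pair_count), one per distance bin.

    points : list of (x, y) coordinates
    values : list of scalar measurements, one per point
    gamma is None for bins that contain no pairs.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for (i, (xi, yi)), (j, (xj, yj)) in combinations(enumerate(points), 2):
        d = hypot(xi - xj, yi - yj)
        b = int(d // bin_width)  # index of the lag bin this pair falls in
        if b < n_bins:
            sums[b] += (values[i] - values[j]) ** 2
            counts[b] += 1
    result = []
    for b in range(n_bins):
        gamma = sums[b] / (2 * counts[b]) if counts[b] else None
        result.append(((b + 0.5) * bin_width, gamma, counts[b]))
    return result

# Toy transect (invented for illustration): values rise linearly with distance.
sv = empirical_semivariogram([(0, 0), (1, 0), (2, 0)], [0.0, 1.0, 2.0], 1.0, 3)
print(sv)
```

Fitting a model curve to such binned values is what yields the sill, range, and nugget parameters mentioned above; that fitting step is omitted here.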
