Drexel University
Event Details
PhD Dissertation Defense: Scalable Subset Selection with Filters and Its Applications
Start Date: 4/15/2015
Start Time: 3:00 PM
End Date: 4/15/2015
End Time: 5:00 PM

Event Description
Scalable Subset Selection with Filters and Its Applications
Ph.D. Dissertation Defense of Gregory Ditzler
 
Advisor: Dr. Gail Rosen
 
Abstract

Increasingly, applications of machine learning are encountering data at scales that were almost unimaginable just a few years ago, and many current algorithms cannot handle, i.e., do not scale to, today's extremely large volumes of data. Each observation is described by a large set of features, and the complexity of predictive models tends to increase not only with the number of observations but also with the number of features. Fortunately, not all of the features carry meaningful information for making predictions. Thus, irrelevant features should be filtered from the data prior to building a model. This process of removing features to produce a subset is commonly referred to as feature subset selection.

In this work, we present two new filter-based feature subset selection algorithms that scale to large data sets, addressing (i) potentially large and distributed data sets, and (ii) very large feature sets. Our first proposed algorithm, Neyman-Pearson Feature Selection (NPFS), uses a statistical hypothesis test derived from the Neyman-Pearson lemma to determine whether a feature is statistically relevant. The approach can be applied as a wrapper to any feature selection algorithm, regardless of the selection criterion used, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point, and it fits into a computationally attractive MapReduce model.

We also describe a sequential learning framework for feature subset selection (SLSS) that scales with both the number of features and the number of observations. SLSS uses bandit algorithms to process features and assign each feature a level of importance.
Feature selection is performed independently of the optimization of any classifier to reduce unnecessary complexity. We demonstrate the capabilities of NPFS and SLSS on synthetic and real-world data sets. We also present a new approach for classifier-dependent feature selection: an online learning algorithm that easily handles large amounts of missing feature values in a data stream.
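The NPFS idea described in the abstract -- wrapping a base feature selector in a statistical hypothesis test over repeated runs -- can be sketched as follows. This is an illustrative reconstruction, not the dissertation's implementation: the bootstrap-counting scheme, the correlation-based base selector `corr_topk`, and all parameter names and defaults here are assumptions made for the example.

```python
import math
import random

def binom_sf(z, n, p):
    # P[X >= z] for X ~ Binomial(n, p): the tail probability under the null
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(z, n + 1))

def abs_corr(col, y):
    # absolute Pearson correlation between one feature column and the labels
    mx, my = sum(col) / len(col), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    sx = sum((a - mx) ** 2 for a in col) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy + 1e-12))

def corr_topk(X, y, k):
    # a simple stand-in base selector: top-k features by |correlation|
    scores = [abs_corr([row[f] for row in X], y) for f in range(len(X[0]))]
    return sorted(range(len(X[0])), key=lambda f: scores[f], reverse=True)[:k]

def npfs(X, y, base_selector, k, n_bootstraps=50, alpha=0.01):
    # Run the base selector on bootstrap replicates, count how often each
    # feature is chosen, and keep features whose counts are improbably
    # large under the null hypothesis that selection is uniform at random.
    n_obs, n_feats = len(X), len(X[0])
    counts = [0] * n_feats
    for _ in range(n_bootstraps):
        idx = [random.randrange(n_obs) for _ in range(n_obs)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        for f in base_selector(Xb, yb, k):
            counts[f] += 1
    p0 = k / n_feats  # chance a given feature is picked under the null
    return [f for f in range(n_feats)
            if binom_sf(counts[f], n_bootstraps, p0) < alpha]
```

Because each bootstrap run is independent, the counting loop maps naturally onto MapReduce: mappers run the base selector on their own replicates and emit per-feature counts, and a reducer sums the counts and performs the hypothesis test.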
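The bandit-based importance scoring that the abstract attributes to SLSS might be sketched with a standard UCB1 strategy: treat each feature as an arm, and let "pulling" an arm score that feature on a fresh mini-batch from the stream. The choice of UCB1, the correlation-based reward, and the mini-batch interface are all assumptions for illustration; the dissertation's framework may differ in each respect.

```python
import math
import random

def abs_corr(col, y):
    # absolute Pearson correlation, used here as an illustrative reward
    mx, my = sum(col) / len(col), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    sx = sum((a - mx) ** 2 for a in col) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy + 1e-12))

def ucb_feature_importance(stream, n_features, rounds):
    # UCB1 bandit over features: informative features get examined more
    # often, while every feature keeps a running importance estimate.
    counts = [0] * n_features
    means = [0.0] * n_features
    for t in range(1, rounds + 1):
        if t <= n_features:
            f = t - 1  # pull every arm once before applying the UCB rule
        else:
            f = max(range(n_features),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        X, y = stream()  # next mini-batch of observations from the stream
        r = abs_corr([row[f] for row in X], y)
        counts[f] += 1
        means[f] += (r - means[f]) / counts[f]  # incremental mean update
    return means  # per-feature importance estimates
```

Note that the scoring touches one feature column per mini-batch, which is what lets a scheme like this scale in both the number of features and the number of observations.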

There are many real-world applications that can benefit from scalable feature subset selection algorithms; one such area is the study of the microbiome (i.e., the study of micro-organisms and their influence on the environments that they inhabit). Feature subset selection algorithms can sift through the massive amounts of data collected in the genomic sciences to help microbial ecologists understand microbes -- particularly to identify the micro-organisms that are the best indicators of a phenotype, such as healthy or unhealthy. In this work, we provide insights into data collected from the American Gut Project, and deliver open-source software implementations for feature selection with biological data formats.

Contact Information:
Name: Electrical and Computer Engineering Department
Phone: 215-895-2241
Email: ece@drexel.edu
Location:
ECE Conference Room 302
3rd Floor, Bossone Research Enterprise Center
Audience:
  • Graduate Students
  • Faculty
