Computer Science & Engineering

Research Experiences for Undergraduates

Projects for Summer 2016

Project #1: Data-centric Approaches to Modeling Individual Behavior in Large-scale Online Social Systems

Faculty: Sanmay Das

REU students will work on different aspects of a project that attempts to unify micro-modeling of agent behavior based on data and large-scale modeling of social systems in which these agents interact. We have developed a model that uses positive or negative pairwise interaction data to estimate a maximum likelihood model of point-of-view. We would now like to develop algorithms that can use prior information and other side information (for example, mined from the natural language aspects of what the user writes on a website) to build much richer models of opinion that can be applied to web-scale data such as that on Wikipedia, Reddit, or Yelp. Building such models will allow us to study the dynamics of opinion formation and change on the social network.

Skills Required: mathematical maturity for the theoretical project (familiarity with game theory and/or machine learning is a plus); proficiency with Java, C/C++, or Python for the simulation project.

Project #2: Big Data Analysis for Active Drug Discovery

Faculty: Roman Garnett and Benjamin Moseley

REU participants will design intelligent policies for actively querying a large, real-world database of compounds to quickly detect potential drugs. The database contains 120 different biological targets of relevance to humans and a background set of 1 million putative inactive compounds gathered from the ZINC database. Along with these, a baseline implementation of a state-of-the-art virtual screening system will be made available for comparison. There are numerous outstanding questions for students to pursue. (1) Previous work on active search assumed that the result of each experiment ("is this compound a potential drug?") would be made available immediately, before the next experiment was selected. Modern high-throughput screening devices, however, can process many compounds at once. How can active search policies best be adapted to the batch setting? (2) Previous work on active search showed that myopic policies can perform arbitrarily badly compared to the theoretically optimal policy. Is that still true in the batch setting? (3) Can we design feasible non-myopic search policies in the batch setting? (4) Can the search process be enhanced by encouraging diversity among the compounds in each batch?
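As a point of reference for question (1), here is a minimal sketch of what a myopic (greedy) batch policy might look like: given a model's estimated probability that each untested compound is active, it selects the highest-scoring compounds for the next batch. The function name, the probabilities, and the batch size are hypothetical illustrations and are not taken from the project's baseline system.

```python
import heapq


def greedy_batch(probabilities, batch_size):
    """Pick the indices of the `batch_size` compounds with the highest
    estimated probability of being active (a myopic, one-step policy).

    `probabilities` is a hypothetical list of model estimates, one per
    untested compound.
    """
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    # nlargest returns indices ordered from highest to lowest probability.
    return heapq.nlargest(
        batch_size,
        range(len(probabilities)),
        key=probabilities.__getitem__,
    )


# Hypothetical estimates for five untested compounds.
example_probabilities = [0.10, 0.80, 0.30, 0.65, 0.05]
example_batch = greedy_batch(example_probabilities, 2)
```

A policy like this ignores both the effect of the current batch on future choices and the diversity of the compounds within a batch, which is what questions (2) through (4) ask about.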

Skills Required: Familiarity with MATLAB and machine learning, and mathematical maturity. Familiarity with chemistry/biology is a plus but not required.

Project #3: Intelligently Segmenting the Long Tail

Faculty: Brendan Juba

Students will run a study using these new algorithms to investigate the quality of the models produced and the overall proportion of the population covered by the discovered segments in some real-world domain, as compared to standard clustering techniques. For example, we might consider the domain of personalized medicine: for a complex, heterogeneous disease like cancer, we might seek to use patient records to pick out subpopulations for which we can effectively model the risk factors or progression of the disease. Along the way, participants will learn to use standard data science tools such as Python, R, and/or MATLAB, and will gain experience in handling real datasets. Students will also be encouraged to experiment with variants of the proposed algorithms to try to improve the quality of the models and/or the coverage of the population achieved.

Skills Required: mathematical maturity for the theoretical project (familiarity with game theory and/or machine learning is a plus); proficiency with Java, C/C++, or Python for the simulation project.

I liked having close contact with Ph.D. students and talking to professors.

If there's some topic or aspect of the research process you want to do, don't be afraid to talk to your advisor about it. If you don't, you may lose out on a good opportunity.

I really liked seeing so many new applications of the things I learned in my coursework. As someone who is definitely interested in computer science but isn't sure about where exactly my interests lie, this summer was really helpful.

Interested in working with our faculty?

Find out more about their research!

See something that excites you?

Join us for a semester or a summer!

I enjoyed working on original research in a field about which I'd previously known nothing, with the chance to publish our findings.

I learned about new and developing technologies, which was exciting, and my experience will give me an edge over others entering the field.

My lab was a good environment for research, but the people were what made my summer really enjoyable.

Project #4: Executing Big Data Applications on Heterogeneous Architectures

Faculty: Roger Chamberlain and Ron Cytron

Students will implement a set of big data applications in the Auto-Pipe and ScalaPipe development environments, assessing (via measurement and modeling) the performance of these applications on a variety of heterogeneous computer architectures. The two development environments support streaming data computation on traditional multicores, graphics engines, and reconfigurable logic. Targeted applications include astrophysics [Tyson 2008], computational biology [Jacob 2008], and computational finance, each of which can be characterized by large data streams that must be considered in their entirety to answer the scientific question(s) of interest.

Skills Required: C++ and facility with basic algorithms and data structures (sorting, hashing, graph traversal, possibly dynamic programming). Familiarity with a scripting language such as Python or Perl is a plus. Prior biology background is not required.

Project #5: Systems-biology Approaches to Understanding Gene Expression Regulation Underlying Complex Traits

Faculty: Weixiong Zhang and Sharlee Climer

We focus on complex human diseases, such as psoriasis, an autoimmune skin disorder with no cure, and many types of cancer; most of these diseases are devastating or detrimental, leading to an enormous economic and societal burden. Understanding the genetic and molecular bases of disease mechanisms is the key to developing effective diagnostic and therapeutic means for such diseases. The REU students will be involved in developing machine learning and data mining approaches for analyzing biological data to understand the causal relationships between genotypic variations, gene expression changes, and disease phenotype.

Skills Required: Proficiency with Java, C, or Python

Project #6: Combinatorial Complexity, Heterogeneity, and Big Data: The Challenges of Mining Genetic Code

Faculty: Weixiong Zhang and Sharlee Climer

Students will take part in a full computational biology research experience. First they will analyze genome-wide genetic datasets using novel computational tools under development in our lab. While searching for significant associations, they will gain an appreciation for the challenges inherent in the analysis of big data. Students will periodically present their results, brainstorm, and critique each other’s projects. Once significant associations have been identified, the interns will use functional annotation and analysis tools and study relevant literature to explore the potential biological meaning underlying these associations. Finally, they will prepare manuscripts describing their results and critique each other’s manuscripts, with the ultimate goal of producing publishable work.

Skills Required: Proficiency with Java, C, or Python

Project #7: The global network of webcams

Faculty: Robert Pless

This project maintains an ongoing archive of images from 28,000 publicly available webcams, each capturing one image per half hour for the last 8 years. This creates a big data problem: calibrating the cameras, annotating the images by what they contain, and discovering trends and anomalies in how scenes are changing. Our current collaborations include biologists working on continental-scale estimates of tree-phenology patterns, large-scale urban re-forestation projects, and social scientists tracking how people use public spaces and how that use changes with climate and weather patterns.

Students will work to build: (1) automation of large-scale deployment of anomaly detection, clustering and visualization for long term webcam data, and (2) user-in-the-loop tools to make it easy to deploy automated detection and counting of objects of interest for particular cameras.
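To give a sense of the kind of anomaly detection involved, here is a minimal sketch under simple assumptions: each image from one camera has already been reduced to a single number (a hypothetical summary such as mean brightness), and an image is flagged when that number lies far from the camera's long-term average. The function name, the summary statistic, and the threshold are illustrative assumptions, not the project's actual method.

```python
from statistics import mean, stdev


def find_anomalies(values, threshold=3.0):
    """Return the indices of values lying more than `threshold` sample
    standard deviations from the mean of the series.

    `values` is a hypothetical per-image summary for one webcam,
    ordered by capture time.
    """
    if len(values) < 2:
        return []  # not enough data to estimate spread
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Twenty ordinary readings followed by one unusual one (made-up numbers).
example_series = [100] * 20 + [250]
example_flags = find_anomalies(example_series)
```

A single global threshold like this would not account for daily or seasonal cycles in webcam imagery, which is part of what makes the long-term setting interesting.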

Skills Required: Proficiency with Java, JavaScript, Python, C, COBOL, or MATLAB

Project #9: Accelerating Bioinformatics Computations on GPUs with MERCATOR

Faculty: Roger Chamberlain and Jeremy Buhler

In this project, students will implement core computational biology algorithms on NVIDIA GPUs using MERCATOR, a novel framework being developed by our group to help build complex GPU apps efficiently. Potential target applications include biosequence comparison, short-read mapping, and correlation clustering for SNPs and/or gene expression data. We'll focus on methods that are both algorithmically non-trivial (e.g. subquadratic-time correlation discovery and index-based mapping) and potentially parallelizable on a GPU. Through our connections with Washington University's School of Medicine, we'll obtain real data sets and databases on which to test new implementations.

Skills Required: C++ and facility with basic algorithms and data structures (sorting, hashing, graph traversal, possibly dynamic programming). Prior experience programming with CUDA is a strong plus but is not absolutely required. Familiarity with a scripting language such as Python or Perl is a plus. Prior biology background is not required.

Project #10: Design and Implementation of Language Constructs for Parallel Programming

Faculty: I-Ting Angelina Lee

Cilk is a C/C++-based multithreaded language that provides a high-level language abstraction for parallel execution. When writing a parallel program in Cilk, the programmer expresses the logical parallelism of the computation, and an underlying runtime scheduler schedules the computation in a way that respects the logical parallelism specified by the programmer while taking full advantage of the processors available at runtime. Reducer hyperobjects are a construct in Cilk that allows a program to perform parallel reductions. Students will work with the PI to design, implement, and evaluate different runtime strategies to support reducer hyperobjects efficiently.
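The idea behind a parallel reduction can be sketched in a few lines. The example below is written in Python rather than Cilk and is not how the Cilk runtime implements reducers; it only illustrates the general principle that each parallel piece of work updates its own private view, starting from an identity value, and that the views are then merged in order with an associative operation. The function and parameter names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_reduce(items, identity, combine, num_workers=4):
    """Reduce `items` with an associative `combine`, giving each chunk of
    work its own private view and merging the views in order at the end."""
    items = list(items)
    chunk = max(1, -(-len(items) // num_workers))  # ceiling division
    chunks = [items[i:i + chunk] for i in range(0, len(items), chunk)]

    def reduce_chunk(part):
        view = identity  # every private view starts from the identity
        for x in part:
            view = combine(view, x)
        return view

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map returns results in the order of the input chunks.
        views = list(pool.map(reduce_chunk, chunks))

    result = identity
    for view in views:  # merge left to right, preserving the serial order
        result = combine(result, view)
    return result


example_total = parallel_reduce(range(1, 101), 0, lambda a, b: a + b)
```

Because the views are merged in order, the operation only needs to be associative, not commutative: concatenating strings with this sketch gives the same result as a serial loop.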

Skills Required: Familiarity with C/C++; experience with parallel programming (in any language or any platform) is a plus but not required.

Project #11: Comprehensive Static Instrumentation for Dynamic-Analysis Tools

Faculty: I-Ting Angelina Lee

Key to understanding and improving the behavior of any system is visibility: the ability to know what is going on inside the system. Various dynamic-analysis tools, such as race detectors, memory checkers, call-graph generators, code-coverage analyzers, and performance profilers, rely on compiler instrumentation to gain visibility into program behavior during execution. With this approach, the tool writer modifies the compiler to insert instrumentation code into the program-under-test so that it can execute behind the scenes while the program-under-test runs. This approach, however, means that the development of new tools requires compiler work, which many potential tool writers are ill equipped to do, and thus raises the bar for building new and innovative tools. We are developing CSI, a comprehensive static instrumentation framework, which allows tool writers to easily develop analysis tools that require compiler instrumentation without actually doing the compiler work themselves. In this project, students will work with the PIs to develop the CSI framework and implement dynamic-analysis tools using the CSI framework.
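The separation between instrumentation and tool logic can be illustrated with a small analogy. In the Python sketch below, a decorator plays the role of the compiler pass by inserting entry and exit hooks around a function, and the tool writer supplies only the hooks. This is not the CSI API, and every name in it is a hypothetical stand-in.

```python
import functools


class CallCountTool:
    """A toy dynamic-analysis tool: counts how often each function is entered."""

    def __init__(self):
        self.counts = {}

    def on_entry(self, name):
        self.counts[name] = self.counts.get(name, 0) + 1

    def on_exit(self, name):
        pass  # this tool does not need exit events


def instrument(tool):
    """Stand-in for the compiler pass: wrap a function with the tool's hooks."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            tool.on_entry(func.__name__)
            try:
                return func(*args, **kwargs)
            finally:
                tool.on_exit(func.__name__)
        return wrapper
    return decorator


tool = CallCountTool()


@instrument(tool)
def fib(n):
    # Recursive calls resolve to the wrapped function, so they are counted too.
    return n if n < 2 else fib(n - 1) + fib(n - 2)
```

After a call such as fib(5), the tool's counts dictionary records how many times fib was entered, and writing a different tool means changing only the hook class.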

Skills Required: Familiarity with C/C++; basic understanding of how a compiler works (e.g. having taken a compiler course).

Project #12: Implementation of Speech-based Biometric System

Faculty: Shantanu Chakrabartty

The main goal of this project is to implement and optimize a biometric system that can recognize target speakers based on their speech samples. The student will implement a text-independent speaker recognition system in C or C++, starting by understanding an existing MATLAB implementation of such a system and then translating it into C or C++. During the translation process, the student will be involved in optimizing the algorithm and code for a back-end support vector machine classification engine and an auditory feature extraction module. The final objective is to make the software scalable, so that any number of target speakers (to be recognized) can be added at a later stage and the software can be optimized for different hardware platforms.

Skills Required: Working knowledge of MATLAB and proficiency in C or C++ are required. Prior experience in algorithm design and optimization will be beneficial.

Project #13: Creating an Example-Rich Programming Environment

Faculty: Caitlin Kelleher

Looking Glass is a 3D programming environment designed for kids, with an online community. With Looking Glass, kids can program their own 3D animated stories, remix other programs, and then share their creations with the community. We want to use the shared programs as examples to teach kids programming within Looking Glass. For this project you will work with the Looking Glass lab to make Looking Glass an example-rich programming environment. To make sure we pick exciting examples for each user, you will need to filter and download examples from the online community that are customized to each kid's preferences and expertise. Then, you'll design and implement several ways to incorporate the examples into Looking Glass and user test these changes to verify their effectiveness at helping kids learn new programming concepts.

Skills Required: Basic programming skills.

Project #14: Locality-aware Concurrency Platforms

Faculty: Kunal Agrawal and I-Ting Angelina Lee

We are building a platform for writing cache-efficient parallel programs. As part of that platform, we are designing compiler and runtime transformations that convert divide-and-conquer programs written without regard to cache efficiency into cache-efficient ones. To develop these transformations, we plan to conduct an algorithmic study to understand their potential to improve performance. Students will conduct an experimental study by implementing various algorithms and will work with the PIs to design compiler/runtime transformations.

Skills Required: Familiarity with C/C++; mathematical maturity including an undergraduate algorithms course.

Project #15: Designing a programmable RFID platform for Internet-of-things

Faculty: Shantanu Chakrabartty

The goal of this project is to implement a complete Gen-2 UHF RFID communication protocol stack on a Texas Instruments MSP microcontroller. The embedded platform can then be used to implement different types of passive IoT sensors and computing devices that are wirelessly powered using a commercial RFID reader. The student will first have to understand the architecture of the Intel WISP platform and then program an on-board MSP microcontroller (in C and some machine code) to communicate with the reader.

Skills Required: Proficiency in C and embedded programming. Some background on RFID tags would be beneficial.

Project #16: Designing a real-time user-interface for Multi-RFID tags

Faculty: Shantanu Chakrabartty

The goal of this project is to design a software user interface that can read and manipulate multiple RFID tags in real time. The student will investigate different multi-tag configurations using different commercial-grade RFID tags and will also work with a commercial RFID reader and its SDK to design the user interface. The end goal will be to explore different RFID algorithms based on the tags' received signal strength and their orientation with respect to each other and with respect to the reader's antenna.

Skills Required: Proficiency in C++ or C#. Good background in operating systems architecture. Some background on RFID operation would be beneficial.

Project #17: Service Differentiation in the Cloud

Faculty: Roch Guerin

The project's overall goal is to explore offering different levels of service guarantees in cloud systems. In a cloud system, most resources have been virtualized, so that interactions between virtual resources and physical resources are often difficult to predict with the level of accuracy that tight service guarantees call for. In particular, support for latency-sensitive applications is challenging in cloud systems. The project involves two possible targets, both of which would be carried out in collaboration with a Ph.D. student. The first involves implementing extensions to a thread-based system we have developed to offer latency guarantees to different virtual machines (VMs) in the Xen system. The work would target adding scheduling and/or traffic shaping mechanisms to protect service guarantees against misbehaving VMs. An alternative target is to implement a "scavenger" service for Xen, which would migrate Docker containers running low-priority applications across idle VMs, basically harnessing unused computational cycles. The challenge here is to realize a lightweight implementation that also ensures that resources are immediately freed when requested by their primary owner. In other words, the scavenging domains have no service guarantees, i.e., they can be preempted at any time, but at the same time, their ability to access unused resources should be transparent to the service guarantees of other users.
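One widely used traffic shaping mechanism is the token bucket, sketched below as an illustration of the kind of mechanism the first target might add. It is a generic textbook sketch, not the project's design: the class name, the rate and capacity parameters, and the choice to pass the current time explicitly (which keeps the example deterministic) are all assumptions made here.

```python
class TokenBucket:
    """Generic token-bucket shaper: tokens accrue at `rate` per second up to
    `capacity`, and a request is admitted only if enough tokens remain."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # time of the previous request, in seconds

    def allow(self, now, cost=1.0):
        """Return True and spend `cost` tokens if the request at time `now`
        conforms to the configured rate; otherwise return False."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A hypothetical VM limited to 1 request per second with bursts of up to 2.
bucket = TokenBucket(rate=1.0, capacity=2.0)
burst = [bucket.allow(0.0), bucket.allow(0.0), bucket.allow(0.0)]
```

In this example the first two requests at time 0 are admitted as a burst and the third is rejected, so a misbehaving sender is held to its configured rate once its burst allowance is spent.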

Skills Required: Proficiency with C/C++ and Java, and good knowledge of the Linux operating system and, if possible, its networking stack. Familiarity with network protocols and virtualization platforms such as Xen is a plus but not required. Above all, a willingness to learn new material and to dive into complex software systems.