Computer Science & Engineering

Research Experiences for Undergraduates

Projects for Summer 2017

Project #1: Data-centric Approaches to Modeling Individual Behavior in Large-scale Online Social Systems

Faculty: Sanmay Das

REU students will work on different aspects of a project that attempts to unify micro-modeling of agent behavior based on data and large-scale modeling of social systems in which these agents interact. We have developed a model that uses positive or negative pairwise interaction data to estimate a maximum likelihood model of point-of-view. We would now like to develop algorithms that can use prior information and other side information (for example, mined from the natural language aspects of what the user writes on a website) to build much richer models of opinion that can be applied to web-scale data such as that on Wikipedia, Reddit, or Yelp. Building such models will allow us to study the dynamics of opinion formation and change on the social network.

Skills Required: mathematical maturity for the theoretical project (familiarity with game theory and/or machine learning is a plus); proficiency with Java, C/C++, or Python for the simulation project.

Project #2: Big Data Analysis for Active Scientific Discovery

Faculty: Roman Garnett and Benjamin Moseley

REU participants will design intelligent policies for actively querying a large, real-world database of compounds to quickly detect potential drugs. The database contains 120 different biological targets of relevance to humans and a background set of 1 million putative inactive compounds gathered from the ZINC database. Along with these, a baseline implementation of a state-of-the-art virtual screening system will be made available for comparison. There are numerous outstanding questions for students to pursue. (1) Previous work on active search assumed that the result of each experiment ("is this compound a potential drug?") would be made available immediately, before selecting the next. Modern high-throughput screening devices, however, can process many compounds at once. How can active search policies best be adapted to the batch setting? (2) Previous work on active search showed that myopic policies can perform arbitrarily badly compared to the theoretically optimal policy. Is that still true in the batch setting? (3) Can we design feasible non-myopic search policies in the batch setting? (4) Can the search process be enhanced by encouraging diversity among the compounds in each batch?
We will also consider applying these policies in other settings, including the discovery of novel alloys with desirable properties.
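
To make the batch setting concrete, here is a minimal sketch of a myopic (one-step) batch policy, written in Python for illustration. It assumes some model has already produced an estimated probability that each compound is active; the function name `greedy_batch` and the toy numbers are hypothetical and are not taken from the project's baseline system. Under that assumption, the batch that maximizes the expected number of actives found in the next round is simply the untested compounds with the highest estimated probabilities.

```python
def greedy_batch(probabilities, tested, batch_size):
    """Myopic batch selection (illustrative sketch, not the project's code).

    probabilities: estimated probability that each compound is active
                   (assumed to come from some already-fitted model).
    tested:        set of compound indices that were already screened.
    batch_size:    number of compounds the screening device handles at once.
    """
    # Only compounds that have not been screened yet are candidates.
    candidates = [i for i in range(len(probabilities)) if i not in tested]
    # The expected number of actives in a batch is the sum of its
    # probabilities, so the one-step-optimal batch is the top-k by probability.
    candidates.sort(key=lambda i: probabilities[i], reverse=True)
    return candidates[:batch_size]


# Toy example with made-up probabilities; compound 1 was already tested.
toy_probabilities = [0.1, 0.9, 0.4, 0.8, 0.3]
next_batch = greedy_batch(toy_probabilities, tested={1}, batch_size=2)
print(next_batch)
```

A policy like this ignores how the results of one batch change later choices and makes no attempt to diversify the batch, which is exactly what questions (2) through (4) ask students to examine.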

Skills Required: Familiarity with MATLAB and machine learning, and mathematical maturity. Familiarity with chemistry or biology is a plus but not required.

I liked having close contact with Ph.D. students and talking to professors.

If there's some topic or aspect of the research process you want to do, don't be afraid to talk to your advisor about it. If you don't, you may lose out on a good opportunity.

I really liked seeing so many new applications of the things I learned in my coursework. As someone who is definitely interested in computer science but isn't sure about where exactly my interests lie, this summer was really helpful.

Interested in working with our faculty?

     Find out more about their research!

           See something that excites you?

               Join us for a semester or a summer!

I enjoyed working on original research in a field about which I'd previously known nothing, with the chance to publish our findings.

I learned about new and developing technologies, which was exciting, and my experience will give me an edge over others entering the field.

My lab was a good environment for research, but the people were what made my summer really enjoyable.

Project #3: Intelligently Segmenting the Long Tail

Faculty: Brendan Juba

Students will run a study using these new algorithms to investigate the quality of the models produced and the overall proportion of the population covered by the discovered segments in a real-world domain, as compared to standard clustering techniques. For example, we might consider personalized medicine: for a complex, heterogeneous disease like cancer, we might seek to use patient records to pick out subpopulations for which we can effectively model the risk factors or progression of the disease. Along the way, participants will learn to use standard data science tools such as Python, R, and/or MATLAB, and will gain experience in handling real datasets. Students will also be encouraged to experiment with variants of the proposed algorithms to try to improve the quality of the models and/or the coverage of the population achieved.

Skills Required: mathematical maturity for the theoretical project (familiarity with game theory and/or machine learning is a plus); proficiency with Java, C/C++, or Python for the simulation project.

Project #4: Executing Big Data Applications on Heterogeneous Architectures

Faculty: Roger Chamberlain and Ron Cytron

Students will implement a set of big data applications in the Auto-Pipe and ScalaPipe development environments, assessing (via measurement and modeling) the performance of these applications on a variety of heterogeneous computer architectures. The two development environments support streaming data computation on traditional multicores, graphics engines, and reconfigurable logic. Targeted applications include astrophysics [Tyson 2008], computational biology [Jacob 2008], and computational finance, each of which can be characterized by large data streams that must be considered in their entirety to answer the scientific question(s) of interest.

Skills Required: C++ and facility with basic algorithms and data structures (sorting, hashing, graph traversal, possibly dynamic programming). Familiarity with a scripting language such as Python or Perl is a plus. Prior biology background is not required.

Project #9: Accelerating Scientific Computations on GPUs with MERCATOR

Faculty: Roger Chamberlain and Jeremy Buhler

In this project, students will implement core computational biology algorithms on NVIDIA GPUs using MERCATOR, a novel framework being developed by our group to help build complex GPU apps efficiently. Potential target applications include DNA short-read mapping, correlation clustering for SNPs and/or gene expression data, random forest evaluation for machine learning, and N-body simulation. We'll focus on methods that are both algorithmically non-trivial (e.g. subquadratic-time correlation discovery and index-based mapping) and potentially parallelizable on a GPU.

Skills Required: C++ and facility with basic algorithms and data structures (sorting, hashing, graph traversal, possibly dynamic programming). Prior experience programming with CUDA is a strong plus but is not absolutely required. Familiarity with a scripting language such as Python or Perl is a plus. Prior biology or other application-specific background is not required.

Project #10: Design and Implementation of Language Constructs for Parallel Programming

Faculty: I-Ting Angelina Lee

Cilk is a C/C++-based multithreaded language that provides a high-level abstraction for parallel execution. When writing a parallel program in Cilk, the programmer expresses the logical parallelism of the computation, and an underlying runtime scheduler schedules the computation in a way that respects that logical parallelism while taking full advantage of the processors available at runtime. Reducer hyperobjects are a construct in Cilk that allows a program to perform parallel reductions. Students will work with the PI to design, implement, and evaluate different runtime strategies to support reducer hyperobjects efficiently.
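
For readers new to parallel reduction, the general idea can be sketched in a few lines. The Python below is only an illustration of the concept and is not Cilk or its runtime: each worker accumulates into its own private "view" starting from an identity value, and the views are then combined in order with an associative operator. The name `parallel_reduce` and its parameters are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def parallel_reduce(items, op, identity, num_workers=4):
    """Reduce `items` with an associative `op`, one private view per chunk.

    Conceptual sketch only: `op` must be associative and must return a new
    value rather than mutating its arguments, and `identity` must satisfy
    op(identity, x) == x.
    """
    if not items:
        return identity

    # Split the input into contiguous chunks, one per worker.
    chunk_size = math.ceil(len(items) / num_workers)
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

    def local_view(part):
        # Each chunk is reduced into its own view, so no locking is needed.
        view = identity
        for x in part:
            view = op(view, x)
        return view

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in chunk order, which keeps the final
        # answer correct even when `op` is not commutative.
        views = list(pool.map(local_view, chunks))

    # Combine the per-chunk views left to right.
    return reduce(op, views, identity)


total = parallel_reduce(list(range(1, 101)), add, 0)
joined = parallel_reduce(list("abcdef"), add, "", num_workers=3)
print(total, joined)
```

Because the views are merged in their original order, the result matches a serial left-to-right reduction whenever the operator is associative, which is why string concatenation works in the second call.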

Skills Required: Familiarity with C/C++; experience with parallel programming (in any language or any platform) is a plus but not required.

Project #13: Detecting Opportunities to Teach Problem Solving in Code Puzzles

Faculty: Caitlin Kelleher

Looking Glass is a 3D programming environment with an online community, designed for kids. With Looking Glass, kids can program their own 3D animated stories, remix other programs, and then share their creations with the community. Over the past couple of years, we've been exploring code puzzles as a way to help users learn new skills, first focusing on the design of puzzles and the interface support, and then on putting together personalized pathways of puzzles based on an individual's history. In doing that, we've identified some behavior patterns that suggest a need for problem-solving skills and metacognition. In this project, we're interested in using log data that we've collected from past learning pathways studies to develop new methods for detecting when students need help with problem solving.

Skills Required: Working knowledge of Java is required. Prior experience with statistics, data analysis, user-centered design and machine learning will be beneficial.

Project #15: Leveraging Eye-Tracking for Modeling Knowledge Discovery and Decision-Making with Visualizations

Faculty: Alvitta Ottley

When we read a body of text, process an image, or reason with a data visualization, our eyes constantly move. This pattern of movements can reveal important information about how we interpret visual designs and whether a specific visualization is effective at communicating data. The goal of this project is to explore how eye movements can be leveraged to understand how people use different visualizations, and ultimately to improve visualization design. The REU students will run a study to collect eye-tracking and mouse interaction data as users interact with visualizations. The students will then analyze the data using a variety of visualization, machine learning, and statistical analysis techniques.

Skills Required: Proficiency with web programming. Some background in Machine Learning or Statistics would be beneficial.