Large-scale machine learning for data-driven discovery (Dr. Boley).
Large-scale machine learning problems demand scalable algorithms to extract patterns. We investigate novel mathematical algorithms to solve large-scale optimization problems that arise in machine learning and graph analysis. Machine learning in the context of extremely large datasets requires distribution of data and/or computation. We use proximal splitting methods to split a general machine learning problem into separate intermediate steps, each of which can be easily optimized or solved in closed form. The result is a class of first-order methods capable of converging to the general optimal solution. These methods are scalable to very large sizes, but the convergence rate can be extremely variable in ways that can be hard to predict. Using spectral methods, we are able to speed up convergence substantially. The goal of this project is to gain a better understanding of the convergence behavior and to use this understanding to construct accelerated algorithms with more consistent convergence properties. This will allow the application of machine learning techniques to a much wider class of problems.
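As an illustration of the splitting idea described above, the following is a minimal sketch of one proximal splitting scheme, proximal gradient descent (forward-backward splitting), applied to the lasso problem min_x 0.5*||Ax - b||^2 + lam*||x||_1. The choice of the lasso, the function names, and the tiny test matrix are assumptions made for this example and are not taken from the project itself. Each iteration splits into a gradient step on the smooth term and a proximal step on the L1 term, and the proximal step has the closed form known as soft thresholding.

```python
def soft_threshold(v, t):
    """Closed-form proximal operator of t * ||.||_1, applied elementwise."""
    out = []
    for x in v:
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def matvec(A, x):
    """Dense matrix-vector product A x for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def rmatvec(A, r):
    """Transposed product A^T r."""
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(A[0]))]


def lasso_objective(A, b, x, lam):
    """0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    residual = [p - q for p, q in zip(matvec(A, x), b)]
    return 0.5 * sum(v * v for v in residual) + lam * sum(abs(v) for v in x)


def proximal_gradient(A, b, lam, step, iters=200):
    """Alternate a gradient step on the smooth term with the L1 proximal step.

    `step` is assumed to be at most 1 / L, where L is the largest
    eigenvalue of A^T A; this keeps the objective from increasing.
    """
    x = [0.0] * len(A[0])
    history = []
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), b)]
        grad = rmatvec(A, residual)                      # gradient of smooth part
        forward = [xi - step * gi for xi, gi in zip(x, grad)]
        x = soft_threshold(forward, step * lam)          # closed-form prox step
        history.append(lasso_objective(A, b, x, lam))
    return x, history


if __name__ == "__main__":
    # Hypothetical toy data: A^T A has largest eigenvalue 4, so step <= 0.25.
    A = [[2.0, 0.0], [0.0, 1.0]]
    b = [3.0, 0.5]
    x, history = proximal_gradient(A, b, lam=1.0, step=0.2)
    print(x, history[-1])
```

How quickly the objective settles depends strongly on the step size and on the spectrum of A^T A, which is one simple way to see the variable convergence behavior mentioned above.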
REU students will apply and evaluate these methods on data from current projects including scalable computation and analysis of elementary pathways through metabolic networks of single-cell organisms, a Markov model of evolution of the avian influenza virus, and scalable data mining algorithms for corpora of short text fragments. Students will explore how advanced mathematical concepts help solve large data mining problems and extract novel patterns from data. Spectral methods in optimization, clustering, shortest path routing, graph partitioning, and dimensionality reduction are examples of some of the tools that students will learn.
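To make the mention of spectral graph partitioning concrete, here is a small sketch of spectral bisection: a graph is split by the signs of the Fiedler vector, the eigenvector for the second-smallest eigenvalue of the graph Laplacian L. The function name, the plain power iteration on cI - L, and the six-node toy graph are assumptions chosen for this example; they are not the project's own code, and a practical implementation would use a sparse eigensolver instead.

```python
def fiedler_partition(n, edges, iters=300):
    """Bisect an undirected graph by the sign of an approximate Fiedler vector.

    Runs power iteration on (c*I - L), where c = 2 * max degree is an upper
    bound on the largest Laplacian eigenvalue, while projecting out the
    constant vector (the eigenvector for eigenvalue 0 of L).
    Assumes the graph is connected and has at least one edge.
    """
    neighbors = [[] for _ in range(n)]
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    degree = [len(adj) for adj in neighbors]
    c = 2.0 * max(degree)

    x = [float(i) for i in range(n)]  # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(x) / n
        x = [xi - mean for xi in x]   # remove the constant component
        # (L x)_i = degree_i * x_i - sum of x_j over neighbors j of i
        y = [
            c * x[i] - (degree[i] * x[i] - sum(x[j] for j in neighbors[i]))
            for i in range(n)
        ]
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]

    left = {i for i in range(n) if x[i] < 0}
    right = set(range(n)) - left
    return left, right


if __name__ == "__main__":
    # Hypothetical toy graph: two triangles joined by the single edge (2, 3).
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(fiedler_partition(6, edges))
```

On the toy graph the cut falls on the bridge between the two triangles, which is the partition that removes the fewest edges.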
Big data processing in mobile cloud platforms (Dr. Chandra and Dr. Weissman).
We are developing an intelligent ubiquitous cloud platform that can support end-users running latency-sensitive big data applications on low-powered, resource-limited edge devices such as mobile phones and Google Glass. The project has two primary components: the edge cloud Nebula, which can harness nearby edge resources to form a geographically dispersed cloud that has low latency to large data sources; and a smart middleware that uses machine learning to identify user similarities and opportunities for caching and speculative execution to optimize application performance. The big data aspects of this project include: (1) efficient storage and retrieval of large data streams from edge devices or external sources (e.g., websites, data repositories, and online services); (2) data analysis and mining of these data streams to identify optimization opportunities; and (3) scheduling of cloud computing resources for efficient and timely execution of data-intensive applications.
Towards these open research questions, REU students will implement and evaluate an augmented reality-based application for Android devices, which is both data-intensive and latency-sensitive. For the first component, students will design and implement this application on the Android/Dalvik platform, then use it to develop and evaluate outsourcing techniques that interface with the UMN Amazon EC2-based cloud middleware. For the second component, students will develop intelligent cloud-side optimization techniques that use machine learning to utilize cross-user outsourcing optimization opportunities.
Data-driven simulation and evaluation of virtual social behaviors (Dr. Guy).
Simulating crowds of virtual humans is a new and active research area with application in diverse fields, including evacuation planning, virtual reality, architectural design, and simulation and training. These domains require realistic interactions between virtual characters, which is a complex, multi-faceted simulation task. Virtual agents should respect social space, stay close to family and friends, avoid collision, choose meaningful destinations, and display emotional reactions, all authentic to real human behavior. To address the challenges of realistic interactions, we are capitalizing on recent technological advances in data collection that have produced large datasets of human motion, behavior, and interaction in both small and crowd settings. Student researchers will use these data sources to improve simulation of social interactions, and for comparison of crowd simulation to real human behavior.
Student projects will be in either of two broad research areas. (1) To address questions of architectural evaluation and pedestrian safety, students will focus on techniques that use large datasets to evaluate the accuracy of predicted human flows with respect to varying criteria. This work is in collaboration with faculty in biomechanics and architecture. (2) Students will be part of an ongoing collaboration with doctors in the UMN Otolaryngology Department developing new methods to mine large datasets to find key facial features that are most indicative of emotional state. This research can help develop new efficient methods of animating the emotional state of virtual agents and lead to new techniques for facial reconstructive surgery.
Improving metropolitan-scale transportation systems with data-driven cyber-control (Dr. He).
Under the Smart Cities Initiative from the White House, our work on data-driven cyber-physical systems for smart cities aims to address emerging urban mobility challenges through big-data-driven analytics. In particular, we combine data sources from various urban systems (e.g., phones, smartcards, taxis, buses, trucks, subways, bikes, and personal and electric vehicles) to improve existing data-driven modeling of urban phenomena (e.g., human mobility, traffic speed, passenger demand, congestion, energy consumption) and then design novel applications (e.g., ridesharing, taxi dispatching, and personalized navigation) that improve mobility efficiency by optimizing data-driven models. This project is uniquely built upon various urban infrastructure data, e.g., vehicle GPS data, smartcard transaction data, cellphone calling records, and bike rental transaction data.
After familiarizing themselves with heterogeneous urban data, REU students will have two assignments: (i) designing, implementing, and evaluating predictive data-driven models that capture urban mobility phenomena such as real-time passenger demand, i.e., how many people travel from one urban region to another in real time; and (ii) based on a combination of these models, designing and verifying novel applications, e.g., Uber-like dispatching and transit coordination, that improve urban mobility efficiency, for example by minimizing driving distance or travel time. The results will be shown through data-driven simulations and interactive visualization.
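The notion of passenger demand used above, how many people travel from one urban region to another in a given time window, can be sketched as a simple aggregation over trip records. The record layout `(timestamp_seconds, origin_region, destination_region)`, the 15-minute window, and both function names are hypothetical choices for this example; the project's actual data formats and models are not described here. The historical mean is shown only as a naive baseline against which a predictive model could be compared.

```python
from collections import Counter


def od_demand(trips, window_s=900):
    """Count trips per (time window, origin region, destination region).

    `trips` is an iterable of (timestamp_seconds, origin, destination)
    tuples; `window_s` is the window length in seconds.
    """
    counts = Counter()
    for timestamp, origin, destination in trips:
        window = int(timestamp // window_s)
        counts[(window, origin, destination)] += 1
    return counts


def mean_demand(counts, origin, destination, windows):
    """Naive baseline: average demand for one origin-destination pair."""
    windows = list(windows)
    total = sum(counts[(w, origin, destination)] for w in windows)
    return total / len(windows)


if __name__ == "__main__":
    # Hypothetical trip records between regions "A" and "B".
    trips = [(0, "A", "B"), (100, "A", "B"), (899, "B", "A"), (900, "A", "B")]
    counts = od_demand(trips)
    print(counts)
    print(mean_demand(counts, "A", "B", range(2)))
```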
Enhancing the realism of immersive virtual environments (Dr. Interrante).
Immersive virtual reality technology has the potential to enable fundamental and transformative advances in education, training, rehabilitation, architectural design, psychotherapy, and a wide range of other application areas. To help realize this potential, our lab is addressing key challenges in data acquisition and model building, 3D self-representation and body tracking, spatially aware locomotion, and multi-user interaction, in collaboration with colleagues from the Department of Architecture in the College of Design at UMN.
Two specific examples of Big Data-related projects that REU students could work on are (1) plausible full-body animation based on limited input data (e.g., head position and orientation plus environmental context); and (2) predictive hand tracking in the presence of occlusion to support seamless pseudo-haptic interactions utilizing tangible props in video see-through augmented reality.
Automated out-of-core execution of parallel message-passing applications (Dr. Karypis).
Big Data analytics represent a range of computational methods that are designed to harness the power of the vast amounts of data generated and collected in various scientific, engineering, military, commercial, and educational domains. The sheer size of the data to be analyzed and, in many cases, the complexity of the analysis have made out-of-core (OOC) distributed computing approaches, which primarily store their data on disks, the preferred computational methodology. Converting existing applications to operate efficiently in an OOC distributed computing fashion is a non-trivial task, as it requires a significant software re-engineering effort. Moreover, the current high-level frameworks for developing new OOC distributed computing applications support restrictive computational models, which limit the achievable performance for many classes of computations. This project is developing the methods and software tools to allow message-passing distributed applications to efficiently solve problems whose data and/or memory requirements far exceed the memory available in the underlying computer system. It will develop an OOC distributed computing framework that couples scalable distributed memory message-passing programs with a runtime system that facilitates OOC distributed execution.
As part of this project, students will work on two main tasks. (1) Develop BDMPI, an implementation of the Message Passing Interface (MPI) and its associated runtime system to enable the efficient and automated OOC execution of parallel distributed-memory programs written in MPI. This will enable a large number of existing MPI applications to operate on very large datasets without the need for any software and/or algorithmic re-engineering. (2) Develop methods to reduce the memory requirements of BDMPI programs and the costs associated with saving/restoring the process's address space. These methods will augment the standard message-passing model with shared memory constructs and optimize the interface between BDMPI and the virtual memory management system. This will further reduce the overhead associated with the OOC execution of message-passing programs and will expand the classes of problems that can be efficiently executed in an OOC fashion.
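The general out-of-core idea behind this project, keeping the data on disk and holding only a bounded piece of it in memory at any time, can be illustrated with a single-process sketch. This example is not BDMPI and does not use MPI; the file layout (raw 8-byte floats), the function names, and the chunk size are assumptions made purely for illustration.

```python
import array
import os
import tempfile


def write_values(path, values):
    """Store numbers on disk as raw double-precision floats."""
    with open(path, "wb") as f:
        array.array("d", values).tofile(f)


def out_of_core_sum(path, chunk_items=1024):
    """Sum a file of doubles while holding at most `chunk_items` in memory."""
    item_size = array.array("d").itemsize
    total = 0.0
    with open(path, "rb") as f:
        while True:
            raw = f.read(chunk_items * item_size)
            if not raw:
                break                      # end of file reached
            chunk = array.array("d")
            chunk.frombytes(raw)           # decode only this chunk
            total += sum(chunk)
    return total


if __name__ == "__main__":
    # Hypothetical data set written to a temporary file.
    path = os.path.join(tempfile.mkdtemp(), "values.bin")
    write_values(path, range(10000))
    print(out_of_core_sum(path, chunk_items=256))
```

Here the chunking is written by hand, which is the kind of re-engineering effort that the runtime system described above is intended to make unnecessary for existing MPI programs.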
Interactive and perceptually accurate visualization of multidimensional data (Dr. Keefe).
Emerging Big Data (data-intensive) science and engineering workflows require new data analysis tools to help researchers move from data to insight. Since the highest-bandwidth input to the human brain is through the visual system, it makes sense that new data analysis tools harness the power of data visualization. New research is needed to unlock the potential of visualization in this context.
REU students will investigate specific research questions including: (1) How can new computer graphics rendering algorithms be best utilized to turn data into a visual form that scientists can understand? (2) To what extent can we validate that the human visual system can correctly interpret these visual representations of data? (3) How can we build effective interactive data visualizations that support exploratory data analysis through interactive querying? (4) To what extent can emerging user interface hardware (e.g., multi-touch displays, 3D depth cameras, optical tracking of gestures, voice input) be utilized to make interactive data visualizations more effective for working with today's massive and complex real-world datasets? This research will be conducted and evaluated in the context of a number of real-world big data problems that have been a driving force for Dr. Keefe's research in recent years, including simulation-based medical device design, understanding the biomechanics of the human spine, and developing objective data-driven metrics for evaluating surgical skills.
Understanding recovery from addiction at scale (Dr. Yarosh).
People find support, advice, and a community in online spaces that focus on specific health conditions. We have been working with an online community for recovery from substance use disorders (intherooms.com) to understand how people find recovery in technology-mediated Alcoholics Anonymous and Narcotics Anonymous groups. Through this process, we have collected a data set of public interactions between the 300k+ members of this community. Analyzing this corpus of data can help us understand recovery on a larger scale than previously possible and provide implications for the design of such health communities in the future.
The REU student will be asked to investigate one of the following questions: What characterizes active membership in these communities? How do members give and receive support in these communities? What role does anonymity play online and how does this compare with in-person meetings? Students may use a variety of mixed methods approaches in answering these questions. Students with experience and interest in any or all of the following methods are particularly encouraged to apply: Natural Language Processing, Content Analysis and Grounded Theory, Machine Learning, Data Mining, and Social Network Analysis.
Understanding principles underlying large-scale online social systems (Dr. Zhu).
Today, social computing systems (e.g., Wikipedia, Facebook, and Airbnb) are a central part of our daily lives and have an important impact on our cultural, social, and economic experience of the world and each other. Our research combines social science theory, quantitative methods, and computational techniques (machine learning and statistics) to understand the principles underlying large-scale online social systems such as peer production communities (e.g., Wikipedia and StackOverflow), social networking sites (e.g., Facebook and Twitter), massive open online courses (e.g., Coursera), and sharing economy systems (e.g., Uber, Airbnb, and Couchsurfing).
REU students will (1) develop and improve methods for extracting data from the social computing sites, (2) contribute to the analysis of the social computing data to uncover the underlying principles of these social computing systems, and (3) apply the findings to develop prototypes of next-generation social computing systems.