The amount of data available in various research domains is increasing steadily, which creates new challenges for its scientific and industrial use. The data captured and managed in research environments is often enormous in size, available in different formats such as text files (e.g. Twitter messages) or binary data (e.g. cell phone data or floating car data), and produced continuously (data streams). This raises the need for an innovative eScience infrastructure that supports real-time data access and thereby enables interactive usage scenarios on top of complex data-intensive workflows.
Enabling real-time, large-scale analytical workflows for data-intensive science requires the integration of state-of-the-art technologies from several fields. We thus propose to bring together the latest developments from Big Data (NoSQL solutions, data-intensive programming, and streaming engines), HPC (parallel programming, multi-/many-core architectures, GPUs, and clusters), data analytics (analytical models and algorithms), and workflow management (definition and orchestration) in an innovative manner. Optimizing and tuning the execution strategy for data analysis jobs requires explicit knowledge about the system (data and compute resources) and the algorithms used inside the workflows. Even for experts, tuning complex workflows is becoming increasingly involved and time-consuming, and it is therefore often restricted to a single application or to the outer workflow level. There is thus an urgent need for adaptive mechanisms that automatically configure and tune workflows and their execution environment. Such mechanisms must take into account the characteristics of the data, the workflow modules (which exist in different implementation variants), and the available heterogeneous hardware infrastructure in order to achieve the required quality of service (QoS).
Realizing an eScience infrastructure that supports the adaptive configuration of analytical workflows will enable more researchers from various domains to use such complex environments efficiently. In this project we will focus on the transportation domain, where researchers have a strong interest in the optimal planning and management of transportation systems. This requires timely detection of, and fast reaction to, incidents in order to minimize their impact on the traffic system and on the accuracy of transport demand models. A promising approach is to enhance current transport demand modeling solutions (mainly based on cross-sectional household and counting surveys and mostly static sensor data) with novel data sources (e.g. mobile phone and social media data) and data analytics approaches, moving towards real-time traffic management. This necessitates a responsive system that enables the interactive visualization, exploration, and analysis of ubiquitously available real-time information together with historical data, the building of mathematical models covering all transportation modes, and the simulation of the future behavior of the traffic system.
The project will result in a real-time data analytics platform that enhances an existing data pipeline toolchain with novel Big Data and HPC technologies and with adaptive execution strategies for heterogeneous parallel architectures. Additionally, the project will use this framework to realize innovative transportation domain scenarios that integrate various data sources and interactively visualize the results in state-of-the-art transportation domain software solutions (PTV).