Distributed Stream Processing Systems are typically deployed within a single data center in order to achieve high performance and low-latency computation. The data streams analyzed by such systems are expected to be available in the same data center. Either the data streams are generated within the data center (e.g., logs, transactions, user clicks) or they are aggregated by external systems from various sources and buffered into the data center for processing (e.g., IoT, sensor data, traffic information).
The data center approach for stream processing analytics fits the requirements of the majority of the applications that exists today. However, for latency sensitive applications, such as real-time decision-making, which relies on analyzing geographically distributed data streams, a data center approach might not be sufficient. Aggregating data streams incurs high overheads in terms of latency and bandwidth consumption in addition to the overhead of sending the analysis outcomes back to where an action needs to be taken.
We propose a new stream processing architecture for efficiently analyzing geographically distributed data streams. Our approach utilizes emerging distributed virtualized environments, such as Mobile Edge Clouds, to extend stream processing systems outside the data center in order to push critical parts of the analysis closer to the data sources. This will enable real-time applications to respond faster to geographically distributed events.
In order to realize our approach, we have implemented a geo-aware scheduler plugin for the Apache Storm stream processing system. The scheduler takes as an input a Storm topology (Figure 1) to be scheduled and executed on the available datacenter/edge Cloud resources. The scheduler enables the developers of the topology to annotate groups of stream processing elements that can be replicated and pushed outside of the data center to an Edge Cloud with close proximity to the stream source. In order to operate, the scheduler requires knowledge of the available datacenter/edge Clouds, data stream types available at each Cloud, and latencies among clouds.
Publications
Source Code
Github: SpanEdge