In this paper we present PFDCMSS (Parallel Forward Decay Count-Min Space Saving) which, to the best of our knowledge, is the world first message-passing parallel algorithm for mining time-faded heavy hitters. The algorithm is a parallel version of the recently published FDCMSS (Forward Decay Count-Min Space Saving) sequential algorithm. We formally prove its correctness by showing that the underlying data structure, a sketch augmented with a Space Saving stream summary holding exactly two counters, is mergeable. Whilst mergeability of traditional sketches derives immediately from theory, we show that, instead, merging our augmented sketch is non trivial. Nonetheless, the resulting parallel algorithm is fast and simple to implement. The very large volumes of modern datasets in the context of Big Data present new challenges that current sequential algorithms can not cope with; on the contrary, parallel computing enables near real time processing of very large datasets, which are growing at an unprecedented scale. Our algorithm's implementation, taking advantage of the MPI (Message Passing Interface) library, is portable, reliable and provides cutting-edge performance. Extensive experimental results confirm that PFDCMSS retains the extreme accuracy and error bound provided by FDCMSS whilst providing excellent parallel scalability. Our contributions are three-fold: (i) we prove the non trivial mergeability of the augmented sketch used in the FDCMSS algorithm; (ii) we derive PFDCMSS, a novel message-passing parallel algorithm; (iii) we experimentally prove that PFDCMSS is extremely accurate and scalable, allowing near real time processing of large datasets. The result supports both casual users and seasoned, professional scientists working on expert and intelligent systems.
File in questo prodotto:
Non ci sono file associati a questo prodotto.