UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse two or more data streams (or datasets) by leveraging the mergeability of the UDDSKETCH data summaries. In general, two summaries on two data streams are said to be mergeable if there exists an algorithm that allows combining the two summaries into a single one related to the union of the two datasets, simultaneously preserving the error and size guarantees. The property of mergeability of a sketch enables the parallel and distributed processing of big volume data streams that can be compressed and fused by means of such mergeable data structures. Among the applications strictly related to accurate tracking of quantiles, requiring parallel and/or distributed processing we recall here estimating the latency of a web site, database query optimizers and the need of succinctly summarizing the distribution of values occurring over a sensor network. We prove that UDDSKETCH is fully mergeable and introduce PUDDSKETCH, a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm with regard to the overall accuracy in determining the quantiles. Moreover, we also design and implement parallel versions of both the state of the art KLL and REQ sequential algorithms in order to compare and contrast PUDDSKETCH versus the corresponding parallel algorithms. Our experiments clearly show that PUDDSKETCH is faster or on par with regard to parallel running time, whilst providing simultaneously greater accuracy.

Data stream fusion for accurate quantile tracking and analysis

Cafaro M.
Primo
Formal Analysis
;
Melle C.
Secondo
Software
;
Epicoco I.
Penultimo
Formal Analysis
;
Pulimeno M.
Ultimo
Formal Analysis
2023-01-01

Abstract

UDDSKETCH is a recent algorithm for accurate tracking of quantiles in data streams, derived from the DDSKETCH algorithm. UDDSKETCH provides accuracy guarantees covering the full range of quantiles independently of the input distribution and greatly improves the accuracy with regard to DDSKETCH. In this paper we show how to compress and fuse two or more data streams (or datasets) by leveraging the mergeability of the UDDSKETCH data summaries. In general, two summaries on two data streams are said to be mergeable if there exists an algorithm that allows combining the two summaries into a single one related to the union of the two datasets, simultaneously preserving the error and size guarantees. The property of mergeability of a sketch enables the parallel and distributed processing of big volume data streams that can be compressed and fused by means of such mergeable data structures. Among the applications strictly related to accurate tracking of quantiles, requiring parallel and/or distributed processing we recall here estimating the latency of a web site, database query optimizers and the need of succinctly summarizing the distribution of values occurring over a sensor network. We prove that UDDSKETCH is fully mergeable and introduce PUDDSKETCH, a parallel version of UDDSKETCH suitable for message-passing based architectures. We formally prove its correctness and compare it to a parallel version of DDSKETCH, showing through extensive experimental results that our parallel algorithm almost always outperforms the parallel DDSKETCH algorithm with regard to the overall accuracy in determining the quantiles. Moreover, we also design and implement parallel versions of both the state of the art KLL and REQ sequential algorithms in order to compare and contrast PUDDSKETCH versus the corresponding parallel algorithms. Our experiments clearly show that PUDDSKETCH is faster or on par with regard to parallel running time, whilst providing simultaneously greater accuracy.
File in questo prodotto:
File Dimensione Formato  
1-s2.0-S1566253522000975-main.pdf

solo utenti autorizzati

Descrizione: Articolo
Tipologia: Versione editoriale
Licenza: NON PUBBLICO - Accesso privato/ristretto
Dimensione 1.69 MB
Formato Adobe PDF
1.69 MB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11587/474490
Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 1
  • ???jsp.display-item.citation.isi??? 1
social impact