Scalable Distributed Data Anonymization for Large Datasets
This journal article investigate how it is possible to distribute the Mondrian algorithm to obtain K-Anonymity and L-diversity: two well-known privacy metrics that guarantee protection of the respondents of a dataset by obfuscating information that can disclose their identities and sensitive information. The motivation of this work is that existing solutions for enforcing them implicitly assume to operate in a centralized scenario, since they require complete visibility over the dataset to be anonymized, and can therefore have limited applicability in anonymizing large datasets. This paper propose a solution that extends Mondrian to allow its employment on large datasets in a distributed manner, leveraging the parallel computation of multiple workers. The presented approach efficiently distributes the computation among the workers, without requiring visibility over the dataset in its entirety. The data partitioning limits the need for workers to exchange data, so that each worker can independently anonymize a portion of the dataset. We implemented our approach providing parallel execution on a dynamically chosen number of workers. The experimental evaluation shows that our solution provides scalability, while not affecting the quality of the resulting anonymization.
Authors:
Sabrina De Capitani di Vimercati, Dario Facchinetti, Sara Foresti, Giovanni Livraga, Gianluca Oldani, Stefano Paraboschi, Matthew Rossi, Pierangela Samarati