
Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.

AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge discovery, while summarizing the computational tools and techniques useful in data analysis. This series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the series includes, but is not limited to, titles in the areas of data mining and knowledge discovery methods and applications, modeling, algorithms, theory and foundations, data and knowledge visualization, data mining systems and tools, and privacy and security issues.

PUBLISHED TITLES

UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS David Skillicorn

COMPUTATIONAL METHODS OF FEATURE SELECTION Huan Liu and Hiroshi Motoda

CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS Sugato Basu, Ian Davidson, and Kiri L. Wagstaff

KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT David Skillicorn

MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY Zhongfei Zhang and Ruofei Zhang

NEXT GENERATION OF DATA MINING Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar

DATA MINING FOR DESIGN AND MARKETING Yukio Ohsawa and Katsutoshi Yada

THE TOP TEN ALGORITHMS IN DATA MINING Xindong Wu and Vipin Kumar

GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION Harvey J. Miller and Jiawei Han

TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS Ashok N. Srivastava and Mehran Sahami

BIOLOGICAL DATA MINING Jake Y. Chen and Stefano Lonardi

INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS Vagelis Hristidis

TEMPORAL DATA MINING Theophano Mitsa

RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS Bo Long, Zhongfei Zhang, and Philip S. Yu

KNOWLEDGE DISCOVERY FROM DATA STREAMS João Gama

STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION George Fernandez

INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu

Chapman & Hall/CRC Data Mining and Knowledge Discovery Series

Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu

Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques



Chapman & Hall/CRC
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor and Francis Group, LLC
Chapman & Hall/CRC is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number: 978-1-4200-9148-9 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Introduction to privacy-preserving data publishing : concepts and techniques / Benjamin C.M. Fung … [et al.].

p. cm. -- (Data mining and knowledge discovery series)
Includes bibliographical references and index.
ISBN 978-1-4200-9148-9 (hardcover : alk. paper)
1. Database security. 2. Confidential communications. I. Fung, Benjamin C. M. II. Title. III. Series.

QA76.9.D314I58 2010
005.8--dc22 2010023845

Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com



To Akina, Cyrus, Daphne, and my parents - B.F.

To Lucy, Catherine, Caroline, and Simon - K.W.

To my family - A.F.

To my family - P.Y.

Contents

List of Figures xv

List of Tables xvii

List of Algorithms xxi

Preface xxiii

Acknowledgments xxix

About the Authors xxxi

I The Fundamentals 1

1 Introduction 3
  1.1 Data Collection and Data Publishing 4
  1.2 What Is Privacy-Preserving Data Publishing? 7
  1.3 Related Research Areas 9

2 Attack Models and Privacy Models 13
  2.1 Record Linkage Model 14
    2.1.1 k-Anonymity 15
    2.1.2 (X, Y)-Anonymity 18
    2.1.3 Dilemma on Choosing QID 18
  2.2 Attribute Linkage Model 19
    2.2.1 ℓ-Diversity 20
    2.2.2 Confidence Bounding 23
    2.2.3 (X, Y)-Linkability 23
    2.2.4 (X, Y)-Privacy 24
    2.2.5 (α, k)-Anonymity 24
    2.2.6 LKC-Privacy 25
    2.2.7 (k, e)-Anonymity 26
    2.2.8 t-Closeness 27
    2.2.9 Personalized Privacy 27
    2.2.10 FF-Anonymity 28
  2.3 Table Linkage Model 29
  2.4 Probabilistic Model 30
    2.4.1 (c, t)-Isolation 30
    2.4.2 ε-Differential Privacy 30
    2.4.3 (d, γ)-Privacy 31
    2.4.4 Distributional Privacy 31
  2.5 Modeling Adversary's Background Knowledge 32
    2.5.1 Skyline Privacy 32
    2.5.2 Privacy-MaxEnt 33
    2.5.3 Skyline (B, t)-Privacy 33

3 Anonymization Operations 35
  3.1 Generalization and Suppression 35
  3.2 Anatomization and Permutation 38
  3.3 Random Perturbation 41
    3.3.1 Additive Noise 41
    3.3.2 Data Swapping 42
    3.3.3 Synthetic Data Generation 42

4 Information Metrics 43
  4.1 General Purpose Metrics 43
    4.1.1 Minimal Distortion 43
    4.1.2 ILoss 44
    4.1.3 Discernibility Metric 44
    4.1.4 Distinctive Attribute 45
  4.2 Special Purpose Metrics 45
  4.3 Trade-Off Metrics 47

5 Anonymization Algorithms 49
  5.1 Algorithms for the Record Linkage Model 49
    5.1.1 Optimal Anonymization 49
    5.1.2 Locally Minimal Anonymization 51
    5.1.3 Perturbation Algorithms 54
  5.2 Algorithms for the Attribute Linkage Model 55
    5.2.1 ℓ-Diversity Incognito and ℓ+-Optimize 55
    5.2.2 InfoGain Mondrian 56
    5.2.3 Top-Down Disclosure 56
    5.2.4 Anatomize 57
    5.2.5 (k, e)-Anonymity Permutation 58
    5.2.6 Personalized Privacy 58
  5.3 Algorithms for the Table Linkage Model 59
    5.3.1 δ-Presence Algorithms SPALM and MPALM 59
  5.4 Algorithms for the Probabilistic Attack Model 59
    5.4.1 ε-Differential Additive Noise 61
    5.4.2 αβ Algorithm 61
  5.5 Attacks on Anonymous Data 61
    5.5.1 Minimality Attack 61
    5.5.2 deFinetti Attack 63
    5.5.3 Corruption Attack 64

II Anonymization for Data Mining 67

6 Anonymization for Classification Analysis 69
  6.1 Introduction 69
  6.2 Anonymization Problems for Red Cross BTS 74
    6.2.1 Privacy Model 74
    6.2.2 Information Metrics 75
    6.2.3 Problem Statement 81
  6.3 High-Dimensional Top-Down Specialization (HDTDS) 82
    6.3.1 Find the Best Specialization 83
    6.3.2 Perform the Best Specialization 84
    6.3.3 Update Score and Validity 87
    6.3.4 Discussion 87
  6.4 Workload-Aware Mondrian 89
    6.4.1 Single Categorical Target Attribute 89
    6.4.2 Single Numerical Target Attribute 90
    6.4.3 Multiple Target Attributes 91
    6.4.4 Discussion 91
  6.5 Bottom-Up Generalization 92
    6.5.1 The Anonymization Algorithm 92
    6.5.2 Data Structure 93
    6.5.3 Discussion 95
  6.6 Genetic Algorithm 95
    6.6.1 The Anonymization Algorithm 96
    6.6.2 Discussion 96
  6.7 Evaluation Methodology 96
    6.7.1 Data Utility 97
    6.7.2 Efficiency and Scalability 102
  6.8 Summary and Lesson Learned 103

7 Anonymization for Cluster Analysis 105
  7.1 Introduction 105
  7.2 Anonymization Framework for Cluster Analysis 105
    7.2.1 Anonymization Problem for Cluster Analysis 108
    7.2.2 Overview of Solution Framework 112
    7.2.3 Anonymization for Classification Analysis 114
    7.2.4 Evaluation 118
    7.2.5 Discussion 121
  7.3 Dimensionality Reduction-Based Transformation 124
    7.3.1 Dimensionality Reduction 124
    7.3.2 The DRBT Method 125
  7.4 Related Topics 126
  7.5 Summary 126

III Extended Data Publishing Scenarios 129

8 Multiple Views Publishing 131
  8.1 Introduction 131
  8.2 Checking Violations of k-Anonymity on Multiple Views 133
    8.2.1 Violations by Multiple Selection-Project Views 133
    8.2.2 Violations by Functional Dependencies 136
    8.2.3 Discussion 136
  8.3 Checking Violations with Marginals 137
  8.4 MultiRelational k-Anonymity 140
  8.5 Multi-Level Perturbation 140
  8.6 Summary 141

9 Anonymizing Sequential Releases with New Attributes 143
  9.1 Introduction 143
    9.1.1 Motivations 143
    9.1.2 Anonymization Problem for Sequential Releases 147
  9.2 Monotonicity of Privacy 151
  9.3 Anonymization Algorithm for Sequential Releases 153
    9.3.1 Overview of the Anonymization Algorithm 153
    9.3.2 Information Metrics 154
    9.3.3 (X, Y)-Linkability 155
    9.3.4 (X, Y)-Anonymity 158
  9.4 Extensions 159
  9.5 Summary 159

10 Anonymizing Incrementally Updated Data Records 161
  10.1 Introduction 161
  10.2 Continuous Data Publishing 164
    10.2.1 Data Model 164
    10.2.2 Correspondence Attacks 165
    10.2.3 Anonymization Problem for Continuous Publishing 168
    10.2.4 Detection of Correspondence Attacks 175
    10.2.5 Anonymization Algorithm for Correspondence Attacks 178
    10.2.6 Beyond Two Releases 180
    10.2.7 Beyond Anonymity 181
  10.3 Dynamic Data Republishing 181
    10.3.1 Privacy Threats 182
    10.3.2 m-invariance 183
  10.4 HD-Composition 185
  10.5 Summary 190

11 Collaborative Anonymization for Vertically Partitioned Data 193
  11.1 Introduction 193
  11.2 Privacy-Preserving Data Mashup 194
    11.2.1 Anonymization Problem for Data Mashup 199
    11.2.2 Information Metrics 201
    11.2.3 Architecture and Protocol 202
    11.2.4 Anonymization Algorithm for Semi-Honest Model 204
    11.2.5 Anonymization Algorithm for Malicious Model 210
    11.2.6 Discussion 217
  11.3 Cryptographic Approach 218
    11.3.1 Secure Multiparty Computation 218
    11.3.2 Minimal Information Sharing 219
  11.4 Summary and Lesson Learned 220

12 Collaborative Anonymization for Horizontally Partitioned Data 221
  12.1 Introduction 221
  12.2 Privacy Model 222
  12.3 Overview of the Solution 223
  12.4 Discussion 224

IV Anonymizing Complex Data 227

13 Anonymizing Transaction Data 229
  13.1 Introduction 229
    13.1.1 Motivations 229
    13.1.2 The Transaction Publishing Problem 230
    13.1.3 Previous Works on Privacy-Preserving Data Mining 231
    13.1.4 Challenges and Requirements 232
  13.2 Cohesion Approach 234
    13.2.1 Coherence 235
    13.2.2 Item Suppression 236
    13.2.3 A Heuristic Suppression Algorithm 237
    13.2.4 Itemset-Based Utility 240
    13.2.5 Discussion 241
  13.3 Band Matrix Method 242
    13.3.1 Band Matrix Representation 243
    13.3.2 Constructing Anonymized Groups 243
    13.3.3 Reconstruction Error 245
    13.3.4 Discussion 246
  13.4 km-Anonymization 247
    13.4.1 km-Anonymity 247
    13.4.2 Apriori Anonymization 248
    13.4.3 Discussion 250
  13.5 Transactional k-Anonymity 252
    13.5.1 k-Anonymity for Set Valued Data 252
    13.5.2 Top-Down Partitioning Anonymization 254
    13.5.3 Discussion 254
  13.6 Anonymizing Query Logs 255
    13.6.1 Token-Based Hashing 255
    13.6.2 Secret Sharing 256
    13.6.3 Split Personality 257
    13.6.4 Other Related Works 257
  13.7 Summary 258

14 Anonymizing Trajectory Data 261
  14.1 Introduction 261
    14.1.1 Motivations 261
    14.1.2 Attack Models on Trajectory Data 262
  14.2 LKC-Privacy 264
    14.2.1 Trajectory Anonymity for Maximal Frequent Sequences 265
    14.2.2 Anonymization Algorithm for LKC-Privacy 269
    14.2.3 Discussion 276
  14.3 (k, δ)-Anonymity 278
    14.3.1 Trajectory Anonymity for Minimal Distortion 278
    14.3.2 The Never Walk Alone Anonymization Algorithm 280
    14.3.3 Discussion 281
  14.4 MOB k-Anonymity 282
    14.4.1 Trajectory Anonymity for Minimal Information Loss 282
    14.4.2 Anonymization Algorithm for MOB k-Anonymity 284
  14.5 Other Spatio-Temporal Anonymization Methods 288
  14.6 Summary 289

15 Anonymizing Social Networks 291
  15.1 Introduction 291
    15.1.1 Data Models 292
    15.1.2 Attack Models 292
    15.1.3 Utility of the Published Data 297
  15.2 General Privacy-Preserving Strategies 297
    15.2.1 Graph Modification 298
    15.2.2 Equivalence Classes of Nodes 298
  15.3 Anonymization Methods for Social Networks 299
    15.3.1 Edge Insertion and Label Generalization 299
    15.3.2 Clustering Nodes for k-Anonymity 300
    15.3.3 Supergraph Generation 301
    15.3.4 Randomized Social Networks 302
    15.3.5 Releasing Subgraphs to Users: Link Recovery 302
    15.3.6 Not Releasing the Network 302
  15.4 Data Sets 303
  15.5 Summary 303

16 Sanitizing Textual Data 305
  16.1 Introduction 305
  16.2 ERASE 306
    16.2.1 Sanitization Problem for Documents 306
    16.2.2 Privacy Model: K-Safety 306
    16.2.3 Problem Statement 307
    16.2.4 Sanitization Algorithms for K-Safety 308
  16.3 Health Information DE-identification (HIDE) 309
    16.3.1 De-Identification Models 309
    16.3.2 The HIDE Framework 310
  16.4 Summary 311

17 Other Privacy-Preserving Techniques and Future Trends 313
  17.1 Interactive Query Model 313
  17.2 Privacy Threats Caused by Data Mining Results 315
  17.3 Privacy-Preserving Distributed Data Mining 316
  17.4 Future Directions 317

References 319

Index 341

List of Figures

1.1 Data collection and data publishing 5
1.2 Linking to re-identify record owner 8

2.1 Taxonomy trees for Job, Sex, Age 17

3.1 Taxonomy trees for Job, Sex, Age 36

6.1 Overview of the Red Cross BTS 70
6.2 Data flow in BTS 71
6.3 Taxonomy trees and QIDs 73
6.4 Taxonomy trees for Tables 6.3-6.7 76
6.5 Taxonomy tree for Work Hrs 80
6.6 The TIPS data structure 84
6.7 A complete TIPS data structure 85
6.8 Illustrate the flexibility of multidimensional generalization scheme 90
6.9 The TEA structure for QID = {Relationship, Race, Workclass} 94
6.10 Classification error on the Blood data set (C = 20%) 99
6.11 Discernibility ratio on the Blood data set (C = 20%) 99
6.12 Classification error on the Adult data set (C = 20%) 100
6.13 Classification error on the Adult data set (K = 100) 101
6.14 Discernibility ratio on the Adult data set (C = 20%) 101
6.15 Discernibility ratio on the Adult data set (K = 100) 102
6.16 Scalability (L = 4, K = 20, C = 100%) 103

7.1 Taxonomy trees 110
7.2 The framework 113

8.1 Views V3, V4, and V5 135
8.2 Views V3 and V4 136

9.1 The static Tree1 156
9.2 Evolution of Tree2 (y = HIV) 157
9.3 Evolution of X-tree 157

10.1 Taxonomy trees for Birthplace and Job 166
10.2 Correspondence attacks 167

11.1 Collaborative anonymization for vertically partitioned data 194
11.2 Architecture of privacy-preserving data mashup 195
11.3 Taxonomy trees and QIDs 197
11.4 Service-oriented architecture for privacy-preserving data mashup 203
11.5 The TIPS after the first specialization 206
11.6 The TIPS after the second specialization 206
11.7 Rational participation 212
11.8 Payoff matrix for prisoner's dilemma 214

13.1 MOLE-tree: k = 2, p = 2, h = 80% 240
13.2 MOLE-tree: k = 2, p = 2, h = 80% 240
13.3 A band matrix 244
13.4 Transformation using generalization rule {a1, a2} → A 248
13.5 Possible generalizations and cut hierarchy 249
13.6 Possible generalizations and cut hierarchy 250
13.7 Outliers e and i cause generalization to top item T 251

14.1 MVS-tree and MFS-tree for efficient Score updates 274
14.2 MVS-tree and MFS-tree after suppressing c4 274
14.3 Data flow in RFID system 277
14.4 Time and spatial trajectory volume 279
14.5 Graphical representation of MOD D 283
14.6 Attack graph of the unsafe 2-anonymity scheme of D in Table 14.8 285

15.1 A social network 293
15.2 A social network 295
15.3 Anonymization by addition and deletion of edges 298
15.4 Anonymization by clustering nodes 299

16.1 Health Information DE-identification (HIDE) 310

List of Tables

2.1 Privacy models 14
2.2 Original patient data 15
2.3 External data 15
2.4 3-anonymous patient data 16
2.5 4-anonymous external data 16
2.6 (7, 50)-anonymous group 27

3.1 Anatomy: original patient data 39
3.2 Anatomy: intermediate QID-grouped table 39
3.3 Anatomy: quasi-identifier table (QIT) for release 40
3.4 Anatomy: sensitive table (ST) for release 40

5.1 Anonymization algorithms for record linkage 50
5.2 Anonymization algorithms for attribute linkage 56
5.3 Anonymization algorithms for table linkage 59
5.4 Anonymization algorithms for probabilistic attack 60
5.5 Minimality attack: original patient data 62
5.6 Minimality attack: published anonymous data 63
5.7 Minimality attack: external data 63
5.8 deFinetti attack: quasi-identifier table (QIT) 64
5.9 deFinetti attack: sensitive table (ST) 65
5.10 Corruption attack: 2-diverse table 65

6.1 Raw patient data 72
6.2 Anonymous data (L = 2, K = 2, C = 50%) 72
6.3 Compress raw patient table 77
6.4 Statistics for the most masked table 77
6.5 Final masked table by InfoGain 78
6.6 Intermediate masked table by Score 78
6.7 Final masked table by Score 79
6.8 Compressed raw patient table for illustrating optimal split on numerical attribute 80
6.9 Attributes for the Adult data set 98

7.1 The labelled table 109
7.2 The masked table, satisfying {〈QID1, 4〉, 〈QID2, 11〉} 118
7.3 The masked labelled table for evaluation 119
7.4 The similarity of two cluster structures 119
7.5 The F-measure computed from Table 7.4 120
7.6 Matrix(C) of clusters {1, 2, 3} and {4, 5} 121
7.7 Matrix(Cg) of clusters {1, 2} and {3, 4, 5} 121
7.8 Match Point table for Matrix(C) and Matrix(Cg) 121

8.1 Multiple views publishing: T1 132
8.2 Multiple views publishing: T2 132
8.3 Multiple views publishing: the join of T1 and T2 132
8.4 Raw data table T 134
8.5 View V1 = ΠName,Job(T) 134
8.6 View V2 = ΠJob,Disease(T) 134
8.7 View V3 = ΠNameσAge>45(T) 134
8.8 View V4 = ΠNameσAge<55(T) 134
8.9 View V5 = ΠNameσ40<Age<55(T) 134
8.10 Raw data table R for illustrating privacy threats caused by functional dependency 137
8.11 View V6 = ΠName,Doctor(R) 137
8.12 View V7 = ΠDoctor,Disease(R) 137
8.13 Raw patient table 138
8.14 (2,3)-diverse patient table 138
8.15 Salary marginal 139
8.16 Zip code/disease marginal 139
8.17 Zip code/salary marginal 140

9.1 Raw data table T1 144
9.2 Raw data table T2 144
9.3 The join of T1 and T2 145
9.4 The patient data (T1) 148
9.5 The medical test data (T2) 149

10.1 Continuous data publishing: 2-diverse T1 162
10.2 Continuous data publishing: 2-diverse T2 after an insertion to T1 162
10.3 Continuous data publishing: 2-diverse T2 after a deletion from and an insertion to T1 163
10.4 5-anonymized R1 166
10.5 5-anonymized R2 166
10.6 Types of correspondence attacks 170
10.7 Raw data T1 182
10.8 2-anonymous and 2-diverse R1 182
10.9 Raw data T2 184
10.10 2-anonymous and 2-diverse R2 184
10.11 R2 by counterfeited generalization 185
10.12 Auxiliary table 185
10.13 Voter registration list (RL) 186
10.14 Raw data (T) 188
10.15 Published tables T∗ satisfying 3-invariance 189

11.1 Raw tables 197
11.2 Anonymous tables, illustrating a selfish Party A 211

12.1 Node 0 224
12.2 Node 1 224
12.3 Node 2 224
12.4 Node 3 224
12.5 Union of the generalized data 225

13.1 D, k = 2, p = 2, h = 0.8 236
13.2 Example of band matrix: original data 245
13.3 Example of band matrix: re-organized data 245
13.4 Example of band matrix: anonymized data 245
13.5 Original database 253
13.6 Generalized 2-anonymous database 253
13.7 Summary of approaches 258

14.1 Raw trajectory and health data 263
14.2 Traditional 2-anonymous data 263
14.3 Anonymous data with L = 2, K = 2, C = 50% 264
14.4 Counter example for monotonic property 270
14.5 Initial Score 273
14.6 Score after suppressing c4 275
14.7 Raw moving object database D 284
14.8 A 2-anonymity scheme of D that is not safe 284
14.9 Extreme union over Table 14.7 286
14.10 Symmetric anonymization over Table 14.7 286
14.11 Equivalence classes produced by extreme union and symmetric union over Table 14.7 287

17.1 Interactive query model: original examination data 314
17.2 Interactive query model: after adding one new record 314
17.3 3-anonymous patient data 316

List of Algorithms

6.3.1 High-Dimensional Top-Down Specialization (HDTDS) 82
6.5.2 Bottom-Up Generalization 92
6.5.3 Bottom-Up Generalization 93
7.2.4 Top-Down Refinement (TDR) 116
9.3.5 Top-Down Specialization for Sequential Anonymization (TDS4SA) 154
10.2.6 BCF-Anonymizer 179
11.2.7 PPMashup for Party A (Same as Party B) in Semi-Honest Model 205
11.2.8 PPMashup for Every Party in Malicious Model 217
13.2.9 The Greedy Algorithm 238
13.2.10 Identifying Minimal Moles 239
13.4.11 Apriori-Based Anonymization 249
13.5.12 Top-Down, Local Partitioning Algorithm 254
14.2.13 MVS Generator 271
14.2.14 Trajectory Data Anonymizer 273


Preface

Organization of the Book

The book has four parts. Part I discusses the fundamentals of privacy-preserving data publishing. Part II presents anonymization methods for preserving information utility for some specific data mining tasks. The data publishing scenarios discussed in Part I and Part II assume publishing a single data release from one data holder. In real-life data publishing, the scenario is more complicated. For example, the same data may be published several times. Each time, the data is anonymized differently for different purposes, or the data is published incrementally as new data are collected. Part III discusses the privacy issues, privacy models, and anonymization methods for these more realistic, yet more challenging, data publishing scenarios. All works discussed in the first three parts focus on anonymizing relational data. What about other types of data? Recent studies have shown that publishing transaction data, trajectory data, social networks data, and textual data may also result in privacy threats and sensitive information leakages. Part IV studies the privacy threats, privacy models, and anonymization methods for these complex data.

Part I: The Fundamentals

Chapter 1 provides an introduction to privacy-preserving data publishing. Motivated by the advancement of data collection and information sharing technologies, there is a clear demand for information sharing and data publication without compromising the individual privacy in the published data. This leads to the development of the research topic, privacy-preserving data publishing, discussed in this book. We define some desirable requirements and properties of privacy-preserving data publishing, followed by a general discussion on other closely related research topics.

Chapter 2 explains various types of attacks that can be performed on the published data, and the corresponding privacy models proposed for preventing such attacks. The privacy models are systematically grouped by their attack models. Industrial practitioners can use the privacy models discussed in the chapter to estimate the degree of privacy risk in their shared data. Researchers in the field of privacy protection may find this chapter a handy reference.
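To make the idea of estimating privacy risk concrete, the sketch below (not taken from the book; the records and quasi-identifier are hypothetical) measures the smallest quasi-identifier group in a table, which is the k in k-anonymity, the first privacy model covered in the chapter:

```python
# Minimal sketch (hypothetical data): the size of the smallest group of
# records sharing the same quasi-identifier (QID) values is the k for which
# the table is k-anonymous; a small k signals high re-identification risk.
from collections import Counter

def smallest_qid_group(records, qid):
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return min(groups.values())

records = [
    {"Job": "Engineer", "Sex": "M", "Age": 35, "Disease": "Flu"},
    {"Job": "Engineer", "Sex": "M", "Age": 35, "Disease": "HIV"},
    {"Job": "Lawyer",   "Sex": "F", "Age": 38, "Disease": "Flu"},
]

k = smallest_qid_group(records, qid=("Job", "Sex", "Age"))
print(f"The table is (at most) {k}-anonymous")  # prints 1: the Lawyer record is unique
```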


Chapter 3 discusses the anonymization operations employed to achieve the privacy models discussed in Chapter 2. The chapter compares the pros and cons of different anonymization operations.

Chapter 4 studies various types of information metrics that capture different information needs for different data analysis and data mining tasks. Data can be anonymized in different ways to serve different information needs. These "needs" are captured by an information metric that aims at maximizing the preservation of a certain type of information in the anonymization process.
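As a concrete taste of what such a metric looks like, the sketch below computes one of the general-purpose metrics listed in Chapter 4, the discernibility metric, which charges each record a penalty equal to the size of the QID group it falls into after anonymization. The toy records are hypothetical, and the extra penalty the metric assigns to fully suppressed records is ignored here:

```python
# Minimal sketch (hypothetical data): discernibility = sum of |group|^2 over
# all QID groups; larger groups mean less "discernible" records and hence
# more information loss.
from collections import Counter

def discernibility(records, qid):
    groups = Counter(tuple(r[a] for a in qid) for r in records)
    return sum(size * size for size in groups.values())

anonymized = [
    {"Job": "Professional", "Age": "[30-40)"},
    {"Job": "Professional", "Age": "[30-40)"},
    {"Job": "Artist",       "Age": "[30-40)"},
]
print(discernibility(anonymized, qid=("Job", "Age")))  # 2*2 + 1*1 = 5
```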

Chapter 5 studies some representative anonymization algorithms for achieving the privacy models presented in Chapter 2 and systematically classifies them by the privacy attacks they address, without considering any specific data mining tasks. In contrast, the anonymization algorithms discussed in Part II are designed for preserving some specific types of data mining information in the anonymous data.

Part II: Anonymization for Data Mining

Chapter 6 uses the Red Cross Blood Transfusion Service as a real-life case study to illustrate the requirements and challenges of the anonymization problem for classification analysis. We present a privacy model to overcome the challenges of anonymizing high-dimensional relational data without significantly compromising the data quality, followed by an efficient anonymization algorithm for achieving the privacy model with two different information needs. The first information need maximizes the information preserved for classification analysis; the second information need minimizes the distortion on the anonymous data for general data analysis. The chapter also studies other anonymization algorithms that address the anonymization problem for classification analysis, and presents a methodology to evaluate the data quality of the anonymized data with respect to the information needs. Researchers and industrial practitioners may adopt the methodology to evaluate their anonymized data.

Chapter 7 studies the anonymization problem for cluster analysis and presents a framework to tackle the problem. The framework transforms the anonymization problem for cluster analysis to the anonymization problem for classification analysis discussed in Chapter 6. The framework includes an evaluation phase that allows the data holder to evaluate the quality of their anonymized data with respect to the information need on cluster analysis.

Part III: Extended Data Publishing Scenarios

Chapter 8 studies the scenario of multiple views publishing in which each data release is a view of an underlying data table serving different information needs. Even if each release is individually anonymized, an adversary may be able to crack the anonymity by comparing different anonymized views. Often, additional statistics are published together with the anonymized data. This chapter also studies the privacy threats caused by the published data together with the additional statistical information.
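As a toy illustration of such a cross-view attack (the table, views, and attribute names below are hypothetical, not an example taken from the book), joining two separately released projections on a shared attribute can re-link identities to sensitive values:

```python
# Minimal sketch (hypothetical data): neither view alone reveals who has
# which disease, but joining them on the shared Job attribute does.
v1 = [("Alice", "Lawyer"), ("Bob", "Engineer")]   # projection on (Name, Job)
v2 = [("Lawyer", "HIV"), ("Engineer", "Flu")]     # projection on (Job, Disease)

linked = [(name, disease)
          for name, job in v1
          for job2, disease in v2
          if job == job2]
print(linked)  # [('Alice', 'HIV'), ('Bob', 'Flu')]
```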

Chapter 9 studies the scenario of sequential data publishing in which each data release is a vertical partition of the underlying raw data table and may contain some new attributes. Since the releases could be overlapping, an adversary may be able to crack the anonymity by comparing multiple releases. This chapter provides a detailed analysis of the privacy threats in this scenario and presents an anonymization method to thwart the potential privacy threats.

Chapter 10 studies two scenarios of incremental data publishing. The first scenario is called continuous data publishing, in which each data release includes new data records together with all previously published data records. In other words, each data release is a history of all previously occurred events. We illustrate the potential privacy threats in this data publishing model, succinctly model the privacy attacks, and present an anonymization algorithm to thwart them. The second scenario is called dynamic data republishing, in which each data release is a snapshot of the current database. The privacy model for dynamic data republishing takes both record insertions and deletions into consideration when assessing potential privacy threats.

Chapter 11 studies the scenario of collaborative data publishing in which multiple data holders want to integrate their data together and publish the integrated data to each other or to a third party without disclosing specific details of their data to each other. We study the problem of collaborative anonymization for vertically partitioned data in the context of a data mashup application, and present a web service architecture together with a secure protocol to achieve the privacy and information utility requirements agreed upon by all participating data holders.

Chapter 12 studies a similar scenario of collaborative data publishing, but for the problem of collaborative anonymization of horizontally partitioned data. Each data holder owns a different set of data records on the same data schema, and would like to integrate their data together to achieve some common data mining task.

Part IV: Anonymizing Complex Data

Chapter 13 defines the anonymization problem for transaction data, discusses the challenges, and presents various anonymization methods for addressing the challenges. Transaction data can be found in many daily applications; examples include supermarket data, credit card data, and medical history. The major challenge of transaction data anonymization is to overcome the problem of high dimensionality. The high dimensionality of transaction data comes from the fact that most transaction databases, for example a supermarket database, contain many distinct items. Every distinct item constitutes a dimension in the transaction data.
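A minimal sketch of this view of transaction data (the transactions below are hypothetical, not drawn from the book): treating each distinct item as a binary dimension makes the dimensionality explicit.

```python
# Minimal sketch (hypothetical data): one binary column per distinct item,
# so the dimensionality of the table equals the size of the item universe.
transactions = [
    {"milk", "bread"},
    {"bread", "beer", "diapers"},
    {"milk", "beer"},
]

items = sorted(set().union(*transactions))                    # the item universe
matrix = [[int(i in t) for i in items] for t in transactions]

print(items)    # ['beer', 'bread', 'diapers', 'milk']
print(matrix)   # [[0, 1, 0, 1], [1, 1, 1, 0], [1, 0, 0, 1]]
print(f"{len(items)} dimensions for {len(transactions)} transactions")
```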

Chapter 14 defines the anonymization problem for trajectory data, and presents various privacy models and anonymization methods to address the problem. In addition to the challenge of high dimensionality discussed above, anonymizing trajectory data is even more complicated due to the presence of sequences. The chapter also discusses how to preserve and evaluate the information utility in the anonymized trajectory data.

Chapter 15 discusses different data models and attack models on social network data. Social network applications are among the fastest growing web applications. In social network applications, e.g., Facebook and LinkedIn, participants share their personal information, preferences, and opinions with their friends and with the social groups they participate in. This valuable information precisely captures the lifestyle of individuals as well as the trends in some particular social groups. This chapter concerns the privacy threats and information utility if such data are shared with a third party for data mining.

Chapter 16 studies some sanitization methods for textual data. Textual data can be found everywhere, from text documents on the Web to patients' medical histories written by doctors. Sanitization of textual data refers to the procedure of removing personally identifiable and/or sensitive information from the text. Unlike the structured relational data and transaction data discussed earlier, textual data is unstructured, making the anonymization problem much more complicated because the anonymization method has to first determine the identifiable and/or sensitive information in the textual data and then apply the anonymization. This research topic is still in its infancy. This chapter discusses two recently proposed methods.
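As a rough sketch of the two-step idea described above, detect first, then remove (the term list and document are hypothetical; the ERASE and HIDE approaches covered in this chapter use far more sophisticated detection than a fixed list):

```python
# Minimal sketch (hypothetical data): mask a fixed list of identifying or
# sensitive terms in a document; real sanitization systems must first infer
# which terms are identifying or sensitive.
import re

sensitive_terms = ["John Smith", "HIV", "Montreal General Hospital"]

def sanitize(text: str) -> str:
    for term in sensitive_terms:
        text = re.sub(re.escape(term), "[REMOVED]", text, flags=re.IGNORECASE)
    return text

note = "John Smith was treated for HIV at Montreal General Hospital."
print(sanitize(note))  # [REMOVED] was treated for [REMOVED] at [REMOVED].
```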

Chapter 17 briefly discusses other privacy-preserving techniques that are orthogonal to privacy-preserving data publishing, and concludes the book with future trends in privacy-preserving data publishing.

To the Instructor

This book is designed to provide a detailed overview of the field of privacy-preserving data publishing. The materials presented are suitable for an advanced undergraduate course or a graduate course. Alternatively, privacy-preserving data publishing can be one of the topics in a database security or a data mining course. This book can serve as a textbook or a supplementary reference for these types of courses.

If you intend to use this book to teach an introductory course in privacy-preserving data publishing, you should start from Part I, which contains the essential information of privacy-preserving data publishing. Students with some data mining knowledge will probably find Part II interesting because the anonymization problems are motivated by a real-life case study, and the chapter presents both data mining and privacy requirements in the context of the case study.

Part III covers more advanced topics in privacy-preserving data publishing. For an undergraduate course, you may want to skip this part. For a graduate course, you may consider covering some of the selected chapters in this part. For this part, the students are expected to have some basic knowledge of database operations, such as selection, projection, and join. Chapters in this part are standalone, so you may selectively skip some chapters and sections, or change the order.

Part IV covers some recent works on addressing the anonymization problem for different types of data. The data models and challenges are explained in detail. The materials are suitable for a graduate course.

To the Student

This book is a good entry point to the research field of privacy-preserving data publishing. If you have some basic knowledge in computer science and information technology, then you should have no problem understanding the materials in Part I, which contains the essential information of the research topic. It will be beneficial if you already have some basic knowledge of data mining, such as classification analysis, cluster analysis, and association rules mining, before proceeding to Part II and Part IV. Han and Kamber [109] have written an excellent textbook on data mining. Alternatively, you can easily find lots of introductory information on these general data mining topics on the Web. In order to understand the materials in Part III, you should have some basic knowledge of database operations.

To the Researcher

If you are a beginner in the field of privacy-preserving data publishing, then this book will provide you with a broad yet detailed overview of the research topic. If you are already a researcher in the field, then you may find Part I a good reference, and proceed directly to the more advanced topics in subsequent parts. Despite the considerable effort the research community has spent on this topic, there are many challenging problems remaining to be solved. We hope this book will spark some new research problems and ideas for the field.


To the Industrial Practitioner

Information sharing has become a daily operation of many businesses in today's information age. This book is suitable for industrial IT practitioners, such as database administrators, database architects, information engineers, and chief information officers, who manage a large volume of person-specific sensitive data and are looking for methods to share the data with business partners or with the public without compromising the privacy of their clients. Some chapters present the privacy-preserving data publishing problems in the context of industrial collaborative projects and reflect the authors' experience in these projects.

We recommend starting from Part I to gain some basic knowledge of privacy-preserving data publishing. If you intend to publish the data for some specific data mining tasks, you may find Part II useful. Chapter 6 describes a data publishing scenario in the healthcare sector. Though the data mining task and the sector may not be directly relevant to yours, the materials are easy to generalize and adopt for other data mining tasks in other sectors. If you intend to make multiple releases of your continuously evolving data, then some of the data publishing scenarios described in Part III may match yours. Part IV describes the privacy models and anonymization algorithms for different types of data. Due to the complexity of real-life data, you may need to employ different anonymization methods depending on the types of data you have at hand.

The book discusses not only the privacy and information utility issues, but also the efficiency and scalability issues in privacy-preserving data publishing. In many chapters, we highlight the efficient and scalable methods and provide an analytical discussion to compare the strengths and weaknesses of different solutions. We would be glad to hear from you if you have applied privacy-preserving data publishing methods in real-life projects. Our e-mail address is [email protected].


Acknowledgments

This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 356065-2008, RGC Research Grant Direct Allocation 2050421 of the Chinese University of Hong Kong, and NSF Grant IIS-0914934.

We would like to express our sincere thanks to our current and past research collaborators in the fields of privacy-preserving data publishing, privacy-preserving data mining, statistical disclosure control, and secure multiparty computation. These include Ming Cao, Rui Chen, Mourad Debbabi, Rachida Dssouli, Bipin C. Desai, Guozhu Dong, Amin Hammad, Patrick C. K. Hung, Cheuk-kwong Lee, Noman Mohammed, Lalita Narupiyakul, Jian Pei, Harpreet Sandhu, Qing Shi, Jarmanjit Singh, Thomas Trojer, Lingyu Wang, Wendy Hui Wang, Yiming Wang, Raymond Chi-Wing Wong, Li Xiong, Heng Xu, and Yabo Xu. We also specially thank the graduate students Khalil Al-Hussaeni, Rui Chen, Noman Mohammed, and Minu Venkat in the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University for their contributions to this book.

We would also like to thank Randi Cohen, our editor at CRC Press/Taylor and Francis Group, for her patience and support during the writing of this book. Finally, we thank our families for their continuous support throughout this project.


About the Authors

Benjamin C. M. Fung received a Ph.D. degree in computing science from Simon Fraser University, Canada, in 2007. He is currently an assistant professor in the Concordia Institute for Information Systems Engineering (CIISE) at Concordia University, Montreal, Quebec, Canada, and a research scientist and the treasurer of the National Cyber-Forensics and Training Alliance Canada (NCFTA Canada). His current research interests include data mining, databases, privacy preservation, information security, and digital forensics, as well as their interdisciplinary applications on current and emerging technologies. His publications span the research forums of data mining, privacy protection, cyber forensics, and communications, including ACM SIGKDD, ACM Computing Surveys, ACM TKDD, ACM CIKM, IEEE TKDE, IEEE ICDE, IEEE ICDM, EDBT, SDM, IEEE RFID, and Digital Investigation. He also consistently serves as a program committee member and reviewer for these prestigious forums. His research has been supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), Le Fonds québécois de la recherche sur la nature et les technologies (FQRNT), the NCFTA Canada, and Concordia University.

Before pursuing his academic career, Dr. Fung worked at SAP Business Objects, where he designed reporting systems for various Enterprise Resource Planning (ERP) and Customer Relationship Management (CRM) systems. He is a licensed professional engineer in software engineering and is currently affiliated with the Computer Security Lab in CIISE.

Ke Wang received a Ph.D. degree from the Georgia Institute of Technology. He is currently a professor at the School of Computing Science, Simon Fraser University, Burnaby, British Columbia, Canada. He has taught in the areas of database and data mining. His research interests include database technology, data mining and knowledge discovery, machine learning, and emerging applications, with recent interests focusing on the end use of data mining. This includes explicitly modeling the business goal and exploiting user prior knowledge (such as extracting unexpected patterns and actionable knowledge). He is interested in combining the strengths of various fields, such as database, statistics, machine learning, and optimization, to provide actionable solutions to real-life problems. He has published in database, information retrieval, and data mining conferences, including ACM SIGMOD, ACM SIGKDD, ACM SIGIR, ACM PODS, VLDB, IEEE ICDE, IEEE ICDM, EDBT, and SDM. He was an associate editor of the IEEE Transactions on Knowledge and Data Engineering and an editorial board member of the Journal of Data Mining and Knowledge Discovery.


Ada Wai-Chee Fu received her B.Sc. degree in computer science from the Chinese University of Hong Kong in 1983, and her M.Sc. and Ph.D. degrees in computer science from Simon Fraser University, Canada, in 1986 and 1990, respectively. She worked at Bell Northern Research in Ottawa, Canada, from 1989 to 1993 on a wide-area distributed database project and joined the Chinese University of Hong Kong in 1993. Her research interests include topics in data mining, privacy-preserving data publishing, and database management systems. Her work has been published in journals and conference proceedings in both the data mining and database areas.

Philip S. Yu received Ph.D. and M.S. degrees in electrical engineering from Stanford University, an M.B.A. degree from New York University, and a B.S. degree in electrical engineering from National Taiwan University. His main research interests include data mining, privacy-preserving publishing and mining, data streams, database systems, Internet applications and technologies, multimedia systems, parallel and distributed processing, and performance modeling. He is a professor in the Department of Computer Science at the University of Illinois at Chicago and also holds the Wexler Chair in Information and Technology. He was a manager of the Software Tools and Techniques group at the IBM Thomas J. Watson Research Center. Dr. Yu has published more than 560 papers in refereed journals and conferences. He holds or has applied for more than 300 US patents.

Dr. Yu is a Fellow of the ACM and of the IEEE. He is an associate editor of ACM Transactions on Internet Technology and ACM Transactions on Knowledge Discovery from Data. He is on the steering committee of the IEEE International Conference on Data Mining and was a member of the IEEE Data Engineering steering committee. He was the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (2001-2004), an editor, advisory board member, and guest co-editor of the special issue on mining of databases. He has also served as an associate editor of Knowledge and Information Systems. He has received several IBM honors, including two IBM Outstanding Innovation Awards, an Outstanding Technical Achievement Award, two Research Division Awards, and the 93rd plateau of Invention Achievement Awards. He was an IBM Master Inventor. Dr. Yu received a Research Contributions Award from the IEEE International Conference on Data Mining in 2003 and an IEEE Region 1 Award for "promoting and perpetuating numerous new electrical engineering concepts" in 1999.

Page 28: Introduction to Privacy-Preserving Data Publishingdl.booktolearn.com › ...to...data_publishing_4f0b.pdf · Data Mining and Knowledge Discovery Series UNDERSTANDING COMPLEX DATASETS:

Part I

The Fundamentals


Chapter 1

Introduction

Data mining is the process of extracting useful, interesting, and previously unknown information from large data sets. The success of data mining relies on the availability of high quality data and effective information sharing. The collection of digital information by governments, corporations, and individuals has created an environment that facilitates large-scale data mining and data analysis. Moreover, driven by mutual benefits, or by regulations that require certain data to be published, there is a demand for sharing data among various parties. For example, licensed hospitals in California are required to submit specific demographic data on every patient discharged from their facility [43]. In June 2004, the Information Technology Advisory Committee released a report entitled Revolutionizing Health Care Through Information Technology [190]. One key point was to establish a nationwide system of electronic medical records that encourages sharing of medical knowledge through computer-assisted clinical decision support. Data publishing is equally ubiquitous in other domains. For example, Netflix, a popular online movie rental service, recently published a data set containing movie ratings of 500,000 subscribers, in a drive to improve the accuracy of movie recommendations based on personal preferences (New York Times, Oct. 2, 2006); AOL published a release of query logs but quickly removed it due to the re-identification of a searcher [27].

Information sharing has a long history in information technology. Traditional information sharing refers to exchanges of data between a data holder and a data recipient. For example, Electronic Data Interchange (EDI) is a successful implementation of electronic data transmission between organizations, with an emphasis on the commercial sector. The development of EDI began in the late 1970s, and it remains in use today. Nowadays, the terms "information sharing" and "data publishing" refer not only to the traditional one-to-one model, but also to more general models with multiple data holders and data recipients. The recent standardization of information sharing protocols, such as the eXtensible Markup Language (XML), the Simple Object Access Protocol (SOAP), and the Web Services Description Language (WSDL), is a catalyst for the recent development of information sharing technology.

This standardization enables different electronic devices to communicate with each other, sometimes even without the intervention of the device owner. For example, the so-called intelligent fridge is capable of detecting the RFID-tagged food items inside, displaying their ingredients and nutritional data, suggesting recipes, retrieving the latest product information from the producers, and allowing users to browse the web for further information. The advancement of information technology has improved our standard of living. Though not everyone agrees that an intelligent fridge is useful, having such a fridge will at least make the kitchen more fun. Yet, detailed data in its original form often contain sensitive information about individuals, and sharing such data could potentially violate individual privacy. For example, one may not want to share her web browsing history and the list of items in her intelligent fridge with a third party. A data recipient having access to such information could potentially infer the fridge owner's current health status and predict her future status. The general public expresses serious concerns about their privacy and the consequences of sharing their person-specific information.

The current privacy protection practice primarily relies on policies and guidelines to restrict the types of publishable data, and on agreements on the use and storage of sensitive data. The limitation of this approach is that it either distorts data excessively or requires a trust level that is impractically high in many data-sharing scenarios. Also, policies and guidelines cannot prevent adversaries who do not follow the rules in the first place. Contracts and agreements cannot guarantee that sensitive data will not be carelessly misplaced and end up in the wrong hands. For example, in 2007, two computer disks containing names, addresses, birth dates, and national insurance numbers for 25 million people went missing while being sent from one British government department to another.

A task of the utmost importance is to develop methods and tools for publishing data in a hostile environment so that the published data remain practically useful while individual privacy is preserved. This undertaking is called privacy-preserving data publishing (PPDP), which can be viewed as a technical response that complements privacy policies. In the past few years, research communities have responded to this challenge and proposed many approaches. While the research field is still rapidly developing, it is a good time to discuss the assumptions and desirable properties for PPDP, clarify the differences and requirements that distinguish PPDP from other related problems, and systematically summarize and evaluate different approaches to PPDP. This book aims to achieve these goals.

1.1 Data Collection and Data Publishing

A typical scenario of data collection and publishing is described in Figure 1.1. In the data collection phase, the data holder collects data from record owners (e.g., Alice and Bob). In the data publishing phase, the data holder releases the collected data to a data miner or to the public, called the data recipient, who will then conduct data mining on the published data. In this book, data mining has a broad sense, not necessarily restricted to pattern mining or model building. For example, a hospital collects data from patients and publishes the patient records to an external medical center. In this example, the hospital is the data holder, patients are record owners, and the medical center is the data recipient. The data mining conducted at the medical center could be any analysis task, from a simple count of the number of men with diabetes to a sophisticated cluster analysis.

[FIGURE 1.1: Data collection and data publishing ([92]). Record owners (Alice, Bob, Cathy, Doug) provide their data to the data publisher in the data collection phase; the data publisher releases the collected data to the data recipient in the data publishing phase.]

There are two models of data holders [99]. In the untrusted model, the data holder is not trusted and may attempt to identify sensitive information from record owners. Various cryptographic solutions [258], anonymous communications [45, 126], and statistical methods [242] were proposed to collect records anonymously from their owners without revealing the owners' identity. In the trusted model, the data holder is trustworthy and record owners are willing to provide their personal information to the data holder; however, the trust is not transitive to the data recipient. In this book, we assume the trusted model of data holders and consider privacy issues in the data publishing phase. Every data publishing scenario in practice has its own assumptions and requirements on the data holder, the data recipients, and the data publishing purpose. The following are different assumptions and properties that may be adopted in practical data publishing depending on the application:

The non-expert data holder. The data holder is not required to have the knowledge to perform data mining on behalf of the data recipient. Any data mining activities have to be performed by the data recipient after receiving the data from the data holder. Sometimes, the data holder does not even know who the recipients are at the time of publication, or has no interest in data mining. For example, the hospitals in California publish patient records on the web [43]. The hospitals do not know who the recipients are or how the recipients will use the data. They publish patient records because it is required by regulations [43] or because doing so supports general medical research, not because they need the results of data mining. Therefore, it is not reasonable to expect the data holder to do more than anonymizing the data for publication in such a scenario.

In other scenarios, the data holder is interested in the data mining result, but lacks the in-house expertise to conduct the analysis and, therefore, outsources the data mining activities to some external data miners. In this case, the data mining task performed by the recipient is known in advance. In an effort to improve the quality of the data mining result, the data holder could release a customized data set that preserves specific types of patterns for such a data mining task. Still, the actual data mining activities are performed by the data recipient, not by the data holder.

The data recipient could be an adversary. In PPDP, one assumption is that the data recipient could also be an adversary. For example, the data recipient, say a drug research company, is a trustworthy entity; however, it is difficult to guarantee that every staff member in the company is trustworthy as well. This assumption makes PPDP problems and solutions very different from encryption and cryptographic approaches, in which only authorized and trustworthy recipients are given the private key for accessing the cleartext. A major challenge in PPDP is to simultaneously preserve both the privacy and the information utility in the anonymous data.

Publish data, not data mining results. PPDP emphasizes publishing data records about individuals (i.e., microdata). This requirement is more stringent than publishing data mining results, such as classifiers, association rules, or statistics about groups of individuals. For example, in the case of the Netflix data release, useful information may be some type of associations of movie ratings. Netflix decided to publish data records instead of publishing such associations because, with data records, the participants have greater flexibility to perform the required analysis and data exploration, such as mining patterns in one partition but not in other partitions, visualizing the transactions containing a specific pattern, trying different modeling methods and parameters, and so forth. The assumption of publishing data, not data mining results, is also closely related to the assumption of a non-expert data holder.

Truthfulness at the record level. In some data publishing scenarios, it is important that each published record corresponds to an existing individual in real life. Consider the example of patient records. The pharmaceutical researcher (the data recipient) may need to examine the actual patient records to discover some previously unknown side effects of the tested drug [79]. If a published record does not correspond to an existing patient in real life, it is difficult to deploy the data mining results in the real world. Randomized and synthetic data do not meet this requirement. Also, randomized and synthetic data become meaningless to data recipients who may want to manually interpret individual data records.

Encryption is another commonly employed technique for privacy protection. Although an encrypted record corresponds to a real-life patient, the encryption hides the semantics required for acting on the represented patient. Encryption aims to prevent an unauthorized party from accessing the data, while enabling an authorized party to have full access to the data. In PPDP, it is the authorized party who may also play the role of the adversary with the goal of inferring sensitive information from the received data. Thus, encryption may not be directly applicable to some PPDP problems.

1.2 What Is Privacy-Preserving Data Publishing?

In the most basic form of privacy-preserving data publishing (PPDP), the data holder has a table of the form

D(Explicit Identifier, Quasi Identifier, Sensitive Attributes, Non-Sensitive Attributes),

where Explicit Identifier is a set of attributes, such as name and social security number (SSN), containing information that explicitly identifies record owners; Quasi Identifier is a set of attributes that could potentially identify record owners; Sensitive Attributes consist of sensitive person-specific information such as disease, salary, and disability status; and Non-Sensitive Attributes contain all attributes that do not fall into the previous three categories [40]. Most works assume that the four sets of attributes are disjoint and that each record in the table represents a distinct record owner.

Anonymization [52, 56] refers to the PPDP approach that seeks to hide the identity and/or the sensitive data of record owners, assuming that sensitive data must be retained for data analysis. Clearly, explicit identifiers of record owners must be removed. Even with all explicit identifiers removed, Sweeney [216] demonstrated a real-life privacy threat to William Weld, former governor of the state of Massachusetts. In Sweeney's example, an individual's name in a public voter list was linked with his record in a published medical database through the combination of zip code, date of birth, and sex, as shown in Figure 1.2. Each of these attributes does not uniquely identify a record owner, but their combination, called the quasi-identifier [56], often singles out a unique or a small number of record owners. Research showed that 87% of the U.S. population had reported characteristics that made them unique based on only such quasi-identifiers [216].

[FIGURE 1.2: Linking to re-identify record owner ([217] ©2002 World Scientific Publishing). The Medical Data (Ethnicity, Visit date, Diagnosis, Procedure, Medication, Total charge) and the Voter List (Name, Address, Date registered, Party affiliation, Date last voted) are linked through the shared attributes ZIP, Date of Birth, and Sex.]

In the above example, the owner of a record is re-identified by linking his quasi-identifier. To perform such linking attacks, the adversary needs two pieces of prior knowledge: the victim's record in the released data and the quasi-identifier of the victim. Such knowledge can be obtained by observation. For example, the adversary noticed that his boss was hospitalized and, therefore, knew that his boss's medical record would appear in the released patient database. Also, it is not difficult for an adversary to obtain his boss's zip code, date of birth, and sex, which could serve as the quasi-identifier in linking attacks.

To prevent linking attacks, the data holder publishes an anonymous table

T(QID′, Sensitive Attributes, Non-Sensitive Attributes),

where QID′ is an anonymous version of the original QID, obtained by applying anonymization operations to the attributes in QID in the original table D. Anonymization operations hide some detailed information so that multiple records become indistinguishable with respect to QID′. Consequently, if a person is linked to a record through QID′, the person is also linked to all other records that have the same value for QID′, making the linking ambiguous. Alternatively, anonymization operations could generate a synthetic data table T based on the statistical properties of the original table D, or add noise to the original table D. The anonymization problem is to produce an anonymous T that satisfies a given privacy requirement, determined by the chosen privacy model, and to retain as much data utility as possible. An information metric is used to measure the utility of an anonymous table. Note that the Non-Sensitive Attributes are published if they are important to the data mining task.


1.3 Related Research Areas

Several polls [41, 144] show that the public has an increased sense of privacy loss. Since data mining is often a key component of information systems, homeland security systems [208], and monitoring and surveillance systems [88], it gives the wrong impression that data mining is a technique for privacy intrusion. This lack of trust has become an obstacle to realizing the benefits of the technology. For example, the potentially beneficial data mining research project, Terrorism Information Awareness (TIA), was terminated by the U.S. Congress due to its controversial procedures for collecting, sharing, and analyzing the trails left by individuals [208].

Motivated by the privacy concerns about data mining tools, a research area called privacy-preserving data mining (PPDM) emerged in 2000 [19, 50]. The initial idea of PPDM was to extend traditional data mining techniques to work with data modified to mask sensitive information. The key issues were how to modify the data and how to recover the data mining result from the modified data. The solutions were often tightly coupled with the data mining algorithms under consideration. In contrast, privacy-preserving data publishing (PPDP) is not necessarily tied to a specific data mining task, and the data mining task is sometimes unknown at the time of data publishing. Furthermore, some PPDP solutions emphasize preserving data truthfulness at the record level as discussed earlier, but PPDM solutions often do not preserve this property.

PPDP differs from PPDM in several major ways.

1. PPDP focuses on techniques for publishing data, not techniques for data mining. In fact, it is expected that standard data mining techniques are applied on the published data. In contrast, the data holder in PPDM needs to randomize the data in such a way that data mining results can be recovered from the randomized data. To do so, the data holder must understand the data mining tasks and algorithms involved. This level of involvement is not expected of the data holder in PPDP, who usually is not an expert in data mining.

2. Neither randomization nor encryption preserves the truthfulness of values at the record level; therefore, the released data are basically meaningless to the recipients. In such a case, the data holder in PPDM may consider releasing the data mining results rather than the scrambled data.

3. PPDP primarily "anonymizes" the data by hiding the identity of record owners, whereas PPDM seeks to directly hide the sensitive data. Excellent surveys and books on randomization [3, 19, 21, 81, 158, 212, 232] and cryptographic techniques [50, 187, 230] for PPDM can be found in the existing literature. In this book, we focus on techniques for PPDP.


A family of research work [69, 71, 89, 133, 134, 228, 229, 259] called privacy-preserving distributed data mining (PPDDM) [50] aims at performing some data mining task on a set of private databases owned by different parties. It follows the principle of Secure Multiparty Computation (SMC) [260, 261] and prohibits any data sharing other than the final data mining result. Clifton et al. [50] present a suite of SMC operations, such as secure sum, secure set union, secure size of set intersection, and scalar product, that are useful for many data mining tasks. In contrast, PPDP does not perform the actual data mining task, but is concerned with how to publish the data so that the anonymous data are useful for data mining. We can say that PPDP protects privacy at the data level, while PPDDM protects privacy at the process level. They address different privacy models and data mining scenarios. PPDDM is briefly discussed in Chapter 17.3. Refer to [50, 187, 232] for more discussions on PPDDM.

In the field of statistical disclosure control (SDC) [3, 36], the research works focus on privacy-preserving publishing methods for statistical tables. SDC focuses on three types of disclosures, namely identity disclosure, attribute disclosure, and inferential disclosure [51]. Identity disclosure occurs if an adversary can identify a respondent from the published data. Revealing that an individual is a respondent of a data collection may or may not violate confidentiality requirements. Attribute disclosure occurs when confidential information about a respondent is revealed and can be attributed to the respondent. Attribute disclosure is the primary concern of most statistical agencies in deciding whether to publish tabular data [51]. Inferential disclosure occurs when individual information can be inferred with high confidence from statistical information in the published data.

Some other works in SDC focus on the non-interactive query model, in which the data recipients can submit one query to the system. This type of non-interactive query model may not fully address the information needs of data recipients because, in some cases, it is very difficult for a data recipient to accurately construct a query for a data mining task in one shot. Consequently, there is a series of studies on the interactive query model [32, 62, 76], in which the data recipients, including adversaries, can submit a sequence of queries based on previously received query results. The database server is responsible for keeping track of all queries of each user and determining whether or not the currently received query, together with all previous queries, violates the privacy requirement.

One limitation of any interactive privacy-preserving query system is that it can only answer a sublinear number of queries in total; otherwise, an adversary (or a group of corrupted data recipients) will be able to reconstruct a 1 − o(1) fraction of the original data [33], which is a very strong violation of privacy. When the maximum number of queries is reached, the query service must be closed to avoid privacy leaks. In the case of the non-interactive query model, the adversary can issue only one query and, therefore, the non-interactive query model cannot achieve the same degree of privacy defined by the interactive model. One may consider privacy-preserving data publishing to be a special case of the non-interactive query model. The interactive query model is briefly discussed in Chapter 17.1.

In this book, we review recent works on privacy-preserving data publishing (PPDP) and provide our insights into this topic. There are several fundamental differences between the recent works on PPDP and the previous works proposed by the statistics community. Recent works on PPDP consider background attacks, inference of sensitive attributes, generalization, and various notions of information metrics, which the works in the statistics community do not. The term "privacy-preserving data publishing" has been widely adopted by the computer science community to refer to the recent works discussed in this book. SDC is an important topic for releasing statistics on person-specific sensitive information. In this book, we do not intend to provide detailed coverage of the statistical methods because there are already many decent surveys [3, 63, 173, 267] on that topic.


Chapter 2

Attack Models and Privacy Models

What is privacy preservation? In 1977, Dalenius [55] provided a very stringent definition:

access to the published data should not enable the adversary to learn anything extra about any target victim compared to no access to the database,

even in the presence of any background knowledge the adversary may have obtained from other sources.

In real-life applications, such absolute privacy protection is impossible due to the presence of the adversary's background knowledge [74]. Suppose the age of an individual is sensitive information. Assume an adversary knows that Alice's age is 5 years less than the average age of American women, but does not know the average age of American women. If the adversary has access to a statistical database that discloses the average age of American women, then Alice's privacy is considered to be compromised according to Dalenius' definition, regardless of the presence of Alice's record in the database. Most literature on privacy-preserving data publishing considers a more relaxed, more practical notion of privacy protection by assuming that the adversary has limited background knowledge. Below, the term "victim" refers to the record owner targeted by the adversary. We can broadly classify privacy models into two categories based on their attack principles.

The first category considers that a privacy threat occurs when an adversary is able to link a record owner to a record in a published data table, to a sensitive attribute in a published data table, or to the published data table itself. We call these record linkage, attribute linkage, and table linkage, respectively. In all three types of linkages, we assume that the adversary knows the QID of the victim. In record and attribute linkages, we further assume that the adversary knows that the victim's record is in the released table, and seeks to identify the victim's record and/or sensitive information from the table. In table linkage, the attack seeks to determine the presence or absence of the victim's record in the released table. A data table is considered to be privacy-preserving if it can effectively prevent the adversary from successfully performing these linkages. Chapters 2.1-2.3 study this category of privacy models.

The second category aims at achieving the uninformative principle [160]: the published table should provide the adversary with little additional information beyond the background knowledge. If the adversary has a large variation between the prior and posterior beliefs, we call it the probabilistic attack. Many privacy models in this family do not explicitly classify the attributes in a data table into QID and Sensitive Attributes, but some of them can also thwart the sensitive linkages in the first category, so the two categories overlap. Chapter 2.4 studies this family of privacy models. Table 2.1 summarizes the attack models addressed by the privacy models.

Table 2.1: Privacy models
                                              Attack Model
Privacy Model                  Record     Attribute   Table      Probabilistic
                               linkage    linkage     linkage    attack
k-Anonymity [201, 217]           ✓
MultiR k-Anonymity [178]         ✓
ℓ-Diversity [162]                ✓           ✓
Confidence Bounding [237]                    ✓
(α, k)-Anonymity [246]           ✓           ✓
(X, Y)-Privacy [236]             ✓           ✓
(k, e)-Anonymity [269]                       ✓
(ε, m)-Anonymity [152]                       ✓
Personalized Privacy [250]                   ✓
t-Closeness [153]                            ✓                        ✓
δ-Presence [176]                                          ✓
(c, t)-Isolation [46]            ✓                                    ✓
ε-Differential Privacy [74]                               ✓           ✓
(d, γ)-Privacy [193]                                      ✓           ✓
Distributional Privacy [33]                               ✓           ✓

2.1 Record Linkage Model

In the attack of record linkage, some value qid on QID identifies a small number of records in the released table T, called a group. If the victim's QID matches the value qid, the victim is vulnerable to being linked to the small number of records in the group. In this case, the adversary faces only a small number of possibilities for the victim's record, and with the help of additional knowledge, there is a chance that the adversary could uniquely identify the victim's record from the group.

Example 2.1
Suppose that a hospital wants to publish the patient records in Table 2.2 to a research center. Suppose that the research center has access to the external table, Table 2.3, and knows that every person with a record in Table 2.3 has a record in Table 2.2. Joining the two tables on the common attributes Job, Sex, and Age may link the identity of a person to his/her Disease. For example, Doug, a 38-year-old male lawyer, is identified as an HIV patient by qid = 〈Lawyer, Male, 38〉 after the join.

Table 2.2: Original patient data
Job       Sex     Age  Disease
Engineer  Male    35   Hepatitis
Engineer  Male    38   Hepatitis
Lawyer    Male    38   HIV
Writer    Female  30   Flu
Writer    Female  30   HIV
Dancer    Female  30   HIV
Dancer    Female  30   HIV

Table 2.3: External data
Name    Job       Sex     Age
Alice   Writer    Female  30
Bob     Engineer  Male    35
Cathy   Writer    Female  30
Doug    Lawyer    Male    38
Emily   Dancer    Female  30
Fred    Engineer  Male    38
Gladys  Dancer    Female  30
Henry   Lawyer    Male    39
Irene   Dancer    Female  32
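
To make the linking attack concrete, the following Python sketch (an illustration, not code from the book; the table contents are transcribed from Tables 2.2 and 2.3) joins the two tables on the quasi-identifier {Job, Sex, Age} and reports the individuals whose qid matches exactly one published record.

```python
# A minimal sketch of the record linkage attack in Example 2.1 (illustration
# only, not code from the book). Table contents follow Tables 2.2 and 2.3.
from collections import defaultdict

patients = [  # Table 2.2: (Job, Sex, Age, Disease)
    ("Engineer", "Male", 35, "Hepatitis"), ("Engineer", "Male", 38, "Hepatitis"),
    ("Lawyer", "Male", 38, "HIV"), ("Writer", "Female", 30, "Flu"),
    ("Writer", "Female", 30, "HIV"), ("Dancer", "Female", 30, "HIV"),
    ("Dancer", "Female", 30, "HIV"),
]
external = [  # Table 2.3: (Name, Job, Sex, Age)
    ("Alice", "Writer", "Female", 30), ("Bob", "Engineer", "Male", 35),
    ("Cathy", "Writer", "Female", 30), ("Doug", "Lawyer", "Male", 38),
    ("Emily", "Dancer", "Female", 30), ("Fred", "Engineer", "Male", 38),
    ("Gladys", "Dancer", "Female", 30), ("Henry", "Lawyer", "Male", 39),
    ("Irene", "Dancer", "Female", 32),
]

# Group the published records by their qid value on QID = {Job, Sex, Age}.
groups = defaultdict(list)
for job, sex, age, disease in patients:
    groups[(job, sex, age)].append(disease)

# Join: anyone whose qid matches exactly one published record is re-identified.
for name, job, sex, age in external:
    matches = groups.get((job, sex, age), [])
    if len(matches) == 1:
        print(name, "is linked to a unique record with Disease =", matches[0])
# Prints re-identifications for Bob, Doug, and Fred; in particular,
# Doug (Lawyer, Male, 38) is linked to HIV, as described in Example 2.1.
```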

2.1.1 k-Anonymity

To prevent record linkage through QID, Samarati and Sweeney [201, 202, 203, 217] propose the notion of k-anonymity: if one record in the table has some value qid, at least k − 1 other records also have the value qid. In other words, the minimum equivalence group size on QID is at least k. A table satisfying this requirement is called k-anonymous. In a k-anonymous table, each record is indistinguishable from at least k − 1 other records with respect to QID. Consequently, the probability of linking a victim to a specific record through QID is at most 1/k.
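
The definition translates directly into a simple check: count the records in each qid group and verify that every group has size at least k. The sketch below is an illustration in Python (not from the book).

```python
# A minimal sketch (not from the book) of verifying k-anonymity: a table is
# k-anonymous on a chosen QID if every qid group contains at least k records.
from collections import Counter

def is_k_anonymous(records, qid_indexes, k):
    """records: list of tuples; qid_indexes: positions of the QID attributes."""
    group_sizes = Counter(tuple(r[i] for i in qid_indexes) for r in records)
    return all(size >= k for size in group_sizes.values())

# A fragment of Table 2.2: each of these qid groups has only one record,
# so the fragment is not even 2-anonymous on QID = {Job, Sex, Age}.
raw = [("Engineer", "Male", 35, "Hepatitis"),
       ("Lawyer", "Male", 38, "HIV"),
       ("Writer", "Female", 30, "Flu")]
print(is_k_anonymous(raw, qid_indexes=(0, 1, 2), k=2))   # False
```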

k-anonymity cannot be replaced by the privacy models in attribute linkage discussed in Chapter 2.2. Consider a table T that contains no sensitive attributes (such as the voter list in Figure 1.2). An adversary could possibly use the QID in T to link to the sensitive information in an external source. A k-anonymous T can still effectively prevent this type of record linkage without considering the sensitive information. In contrast, the privacy models in attribute linkage assume the existence of sensitive attributes in T.


Table 2.4: 3-anonymous patient data
Job           Sex     Age      Disease
Professional  Male    [35-40)  Hepatitis
Professional  Male    [35-40)  Hepatitis
Professional  Male    [35-40)  HIV
Artist        Female  [30-35)  Flu
Artist        Female  [30-35)  HIV
Artist        Female  [30-35)  HIV
Artist        Female  [30-35)  HIV

Table 2.5: 4-anonymous external data
Name    Job           Sex     Age
Alice   Artist        Female  [30-35)
Bob     Professional  Male    [35-40)
Cathy   Artist        Female  [30-35)
Doug    Professional  Male    [35-40)
Emily   Artist        Female  [30-35)
Fred    Professional  Male    [35-40)
Gladys  Artist        Female  [30-35)
Henry   Professional  Male    [35-40)
Irene   Artist        Female  [30-35)

Example 2.2
Table 2.4 shows a 3-anonymous table obtained by generalizing QID = {Job, Sex, Age} from Table 2.2 using the taxonomy trees in Figure 2.1. It has two distinct groups on QID, namely 〈Professional, Male, [35-40)〉 and 〈Artist, Female, [30-35)〉. Since each group contains at least 3 records, the table is 3-anonymous. If we link the records in Table 2.3 to the records in Table 2.4 through QID, each record is linked to either no record or at least 3 records in Table 2.4.
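
The generalization in Example 2.2 can be sketched in a few lines of Python (an illustration, not code from the book): each QID value of Table 2.2 is replaced by its parent in the Figure 2.1 taxonomy, and the resulting group sizes are checked against k = 3.

```python
# A minimal sketch (not from the book) of the generalization in Example 2.2:
# map each QID value of Table 2.2 to its parent in the Figure 2.1 taxonomy
# trees and verify that every resulting qid group has at least k = 3 records.
from collections import Counter

JOB_PARENT = {"Engineer": "Professional", "Lawyer": "Professional",
              "Dancer": "Artist", "Writer": "Artist"}

def generalize(record):
    job, sex, age, disease = record
    gen_age = "[30-35)" if 30 <= age < 35 else "[35-40)"  # Age leaves -> intervals
    return (JOB_PARENT[job], sex, gen_age, disease)       # Sex is left unchanged

table_2_2 = [
    ("Engineer", "Male", 35, "Hepatitis"), ("Engineer", "Male", 38, "Hepatitis"),
    ("Lawyer", "Male", 38, "HIV"), ("Writer", "Female", 30, "Flu"),
    ("Writer", "Female", 30, "HIV"), ("Dancer", "Female", 30, "HIV"),
    ("Dancer", "Female", 30, "HIV"),
]

table_2_4 = [generalize(r) for r in table_2_2]
group_sizes = Counter(r[:3] for r in table_2_4)
print(group_sizes)   # (Professional, Male, [35-40)): 3 and (Artist, Female, [30-35)): 4
print(all(n >= 3 for n in group_sizes.values()))   # True -> Table 2.4 is 3-anonymous
```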

The k-anonymity model assumes that QID is known to the data holder. Most works consider a single QID containing all attributes that can potentially be used in the quasi-identifier. The more attributes included in QID, the more protection k-anonymity provides. On the other hand, this also implies that more distortion is needed to achieve k-anonymity, because the records in a group have to agree on more attributes. To address this issue, Fung et al. [95, 96] allow the specification of multiple QIDs, assuming that the data holder knows the potential QIDs for record linkage. The following example illustrates the use of this specification.

[FIGURE 2.1: Taxonomy trees for Job, Sex, Age. Job: ANY → Professional (Engineer, Lawyer) and Artist (Dancer, Writer). Sex: ANY → Male, Female. Age: [30-40) → [30-35) and [35-40), with [30-35) further divided into [30-33) and [33-35).]

Example 2.3
The data holder wants to publish a table T(A, B, C, D, S), where S is the sensitive attribute, and knows that the data recipient has access to previously published tables T1(A, B, X) and T2(C, D, Y), where X and Y are attributes not in T. To prevent linking the records in T to the information on X or Y, the data holder can specify k-anonymity on QID1 = {A, B} and QID2 = {C, D} for T. This means that each record in T is indistinguishable from a group of at least k records with respect to QID1 and is indistinguishable from a group of at least k records with respect to QID2. The two groups are not necessarily the same. Clearly, this requirement is implied by k-anonymity on QID = {A, B, C, D}, but having k-anonymity on both QID1 and QID2 does not imply k-anonymity on QID.

The second scenario is that there are two adversaries, one with access to T1 and one with access to T2. Since neither of them has access to all of A, B, C, and D in one table, considering a single QID = {A, B, C, D} would result in over-protection and, therefore, excessive data distortion.

Specifying multiple QIDs is practical only if the data holder knows how the adversary might perform the linking. Nevertheless, the data holder often does not have such information. A wrong decision may cause higher privacy risks or higher information loss. Later, we discuss the dilemma and implications of choosing attributes in QID. In the presence of multiple QIDs, some QIDs may be redundant and can be removed by the following subset property:

Observation 2.1.1 (Subset property) Let QID′ ⊆ QID. If a table T is k-anonymous on QID, then T is also k-anonymous on QID′. In other words, QID′ is covered by QID, so QID′ can be removed from the privacy requirement [95, 96, 148].
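
As an illustration (not from the book), the subset property suggests a simple pruning step over a list of specified QIDs: any QID that is a strict subset of another specified QID is already covered and can be dropped.

```python
# A minimal sketch (not from the book) of pruning redundant QIDs with the
# subset property of Observation 2.1.1: a QID that is a strict subset of
# another specified QID is covered by it and can be removed.
def prune_redundant_qids(qids):
    """qids: list of attribute sets; returns only the QIDs not covered by others."""
    qid_sets = [frozenset(q) for q in qids]
    return [q for q in qid_sets
            if not any(q < other for other in qid_sets)]   # strict subset test

qids = [{"A", "B"}, {"A", "B", "C"}, {"C", "D"}]
print(prune_redundant_qids(qids))   # {A, B} is dropped; it is covered by {A, B, C}
```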

The k-anonymity model assumes that each record represents a distinct individual. If several records in a table represent the same record owner, a group of k records may represent fewer than k record owners, and the record owner may be underprotected. The following example illustrates this point.

Example 2.4
A record in the table Inpatient(Pid, Job, Sex, Age, Disease) represents that a patient identified by Pid has Job, Sex, Age, and Disease. A patient may have several records, one for each disease. In this case, QID = {Job, Sex, Age} is not a key, and k-anonymity on QID fails to ensure that each group on QID contains at least k (distinct) patients. For example, if each patient has at least 3 diseases, a group of k records will involve no more than k/3 patients.

2.1.2 (X, Y)-Anonymity

To address the shortcoming of k-anonymity discussed in Example 2.4, Wang and Fung [236] propose the notion of (X, Y)-anonymity, where X and Y are disjoint sets of attributes. For a table T, Π(T) and σ(T) denote the projection and selection over T, att(T) denotes the set of attributes in T, and |T| denotes the number of distinct records in T.

DEFINITION 2.1 (X, Y)-anonymity [236] Let x be a value on X. The anonymity of x with respect to Y, denoted by aY(x), is the number of distinct values on Y that co-occur with x, i.e., |ΠY σx(T)|. If Y is a key in T, then aY(x), also written as a(x), is equal to the number of records containing x. Let AY(X) = min{aY(x) | x ∈ X}. A table T satisfies (X, Y)-anonymity for some specified integer k if AY(X) ≥ k.

(X, Y)-anonymity specifies that each value on X is linked to at least k distinct values on Y. k-anonymity is the special case where X is the QID and Y is a key in T that uniquely identifies record owners. (X, Y)-anonymity provides a uniform and flexible way to specify different types of privacy requirements. If each value on X describes a group of record owners (e.g., X = {Job, Sex, Age}) and Y represents the sensitive attribute (e.g., Y = {Disease}), this means that each group is associated with a diverse set of sensitive values, making it difficult to infer a specific sensitive value. The next example shows the usefulness of (X, Y)-anonymity for modeling k-anonymity in the case that several records may represent the same record owner.

Example 2.5
Continuing from Example 2.4, with (X, Y)-anonymity we specify k-anonymity with respect to patients by letting X = {Job, Sex, Age} and Y = {Pid}. That is, each X group is linked to at least k distinct patient IDs and, therefore, to k distinct patients.
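
The quantity AY(X) in Definition 2.1 can be computed by collecting, for each X value, the set of distinct Y values that co-occur with it. The following Python sketch is an illustration on hypothetical Inpatient records (not from the book).

```python
# A minimal sketch (not from the book) of checking (X, Y)-anonymity per
# Definition 2.1: AY(X) is the minimum, over all values x on X, of the number
# of distinct Y values co-occurring with x; the table satisfies the requirement
# for a given k if AY(X) >= k.
from collections import defaultdict

def min_y_anonymity(records, x_indexes, y_indexes):
    co_occurring = defaultdict(set)
    for r in records:
        x = tuple(r[i] for i in x_indexes)
        y = tuple(r[i] for i in y_indexes)
        co_occurring[x].add(y)                  # distinct Y values seen with this x
    return min(len(ys) for ys in co_occurring.values())

# Hypothetical Inpatient records (Pid, Job, Sex, Age): patient 1 appears twice
# because she has two diseases, so the group has 4 records but only 3 patients.
inpatient = [(1, "Writer", "Female", 30), (1, "Writer", "Female", 30),
             (2, "Writer", "Female", 30), (3, "Writer", "Female", 30)]
print(min_y_anonymity(inpatient, x_indexes=(1, 2, 3), y_indexes=(0,)))   # 3, not 4
```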

2.1.3 Dilemma on Choosing QID

One challenge faced by a data holder is how to classify the attributes in a data table into three disjoint sets: QID, Sensitive Attributes, and Non-Sensitive Attributes. In principle, QID should contain an attribute A if the adversary could potentially obtain A from other external sources. After the QID is determined, the remaining attributes are grouped into Sensitive Attributes and Non-Sensitive Attributes based on their sensitivity. There is no definite answer to the question of how a data holder can determine whether or not an adversary can obtain an attribute A from some external source, but it is important to understand the implications of a misclassification: misclassifying an attribute A into Sensitive Attributes or Non-Sensitive Attributes may compromise another sensitive attribute S, because an adversary may obtain A from other sources and then use A to perform record linkage or attribute linkage on S. On the other hand, misclassifying a sensitive attribute S into QID may directly compromise the sensitive attribute S of some target victim, because an adversary may use the attributes in QID − S to perform attribute linkage on S. Furthermore, incorrectly including S in QID causes unnecessary information loss due to the curse of dimensionality [6].

Motwani and Xu [174] present a method to determine the minimal set of quasi-identifiers for a data table T. The intuition is to identify a minimal set of attributes from T that has the ability to (almost) distinctly identify a record and the ability to separate two data records. Nonetheless, the minimal set of QID does not imply the most appropriate privacy protection setting, because the method does not consider what attributes the adversary could potentially have. If the adversary can obtain a bit more information about the target victim beyond the minimal set, then he may be able to conduct a successful linking attack. The choice of QID remains an open issue.

k-anonymity and (X, Y)-anonymity prevent record linkage by hiding the record of a victim in a large group of records with the same QID. However, if most records in a group have similar values on a sensitive attribute, the adversary can still associate the victim with her sensitive value without having to identify her record. This situation is illustrated in Table 2.4, which is 3-anonymous. For a victim matching qid = 〈Artist, Female, [30-35)〉, the confidence of inferring that the victim has HIV is 75% because 3 out of the 4 records in the group have HIV. Though (X, Y)-anonymity requires that each X group be linked to at least k distinct Y values, if some Y values occur more frequently than others, there is a higher confidence of inferring the more frequent values. This leads us to the next family of privacy models for preventing this type of attribute linkage.

2.2 Attribute Linkage Model

In the attack of attribute linkage, the adversary may not precisely identify the record of the target victim, but could infer his/her sensitive values from the published data T, based on the set of sensitive values associated with the group that the victim belongs to. If some sensitive values predominate in a group, a successful inference becomes relatively easy even if k-anonymity is satisfied.

Example 2.6
Consider the 3-anonymous data in Table 2.4. Suppose the adversary knows that the target victim Emily is a 30-year-old female dancer and has a record in the table. The adversary may infer that Emily has HIV with 75% confidence because 3 out of the 4 female artists with age [30-35) have HIV. Regardless of the correctness of the inference, Emily's privacy has been compromised.

Clifton [49] suggests eliminating attribute linkages by limiting the released data size. Limiting the data size may not be desirable if the data records, such as HIV patients' data, are valuable and difficult to obtain. Below we discuss several other approaches proposed to address this type of threat. The general idea is to diminish the correlation between the QID attributes and the sensitive attributes.

2.2.1 ℓ-Diversity

Machanavajjhala et al. [160, 162] propose the diversity principle, called ℓ-diversity, to prevent attribute linkage. ℓ-Diversity requires every qid group to contain at least ℓ "well-represented" sensitive values. A similar idea was previously discussed in [181]. There are several instantiations of this principle, which differ in what it means to be well-represented. The simplest interpretation of "well-represented" is to ensure that there are at least ℓ distinct values for the sensitive attribute in each qid group, where a qid group is the set of records having the value qid on QID. This distinct ℓ-diversity privacy model (also known as p-sensitive k-anonymity [226]) automatically satisfies k-anonymity, where k = ℓ, because each qid group contains at least ℓ records. Distinct ℓ-diversity cannot prevent probabilistic inference attacks because some sensitive values are naturally more frequent than others in a group, enabling an adversary to conclude that a record in the group is very likely to have those values. For example, Flu is more common than HIV. This motivates the following two stronger notions of ℓ-diversity.

A table is entropy ℓ-diverse if for every qid group

    −∑_{s∈S} P(qid, s) log(P(qid, s)) ≥ log(ℓ),    (2.1)

where S is a sensitive attribute and P(qid, s) is the fraction of records in the qid group having the sensitive value s. The left-hand side, called the entropy of the sensitive attribute, has the property that more evenly distributed sensitive values in a qid group produce a larger value. Therefore, a larger threshold value ℓ implies less certainty of inferring a particular sensitive value in a group. Note that the inequality does not depend on the choice of the log base.

Example 2.7
Consider Table 2.4. For the first group 〈Professional, Male, [35-40)〉,

    −(2/3) log(2/3) − (1/3) log(1/3) ≈ log(1.9),

and for the second group 〈Artist, Female, [30-35)〉,

    −(3/4) log(3/4) − (1/4) log(1/4) ≈ log(1.8).

So the table satisfies entropy ℓ-diversity for ℓ ≤ 1.8.
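
The entropy computation in Example 2.7 can be reproduced with a few lines of Python (an illustration, not code from the book); exponentiating the entropy gives the largest ℓ for which a group is entropy ℓ-diverse.

```python
# A minimal sketch (not from the book) of the entropy l-diversity check of
# Equation 2.1, reproducing the numbers in Example 2.7 for Table 2.4.
import math
from collections import Counter

def max_entropy_l(sensitive_values):
    """Largest l such that the group is entropy l-diverse, i.e., exp(entropy)."""
    counts = Counter(sensitive_values)
    n = len(sensitive_values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return math.exp(entropy)

group1 = ["Hepatitis", "Hepatitis", "HIV"]   # <Professional, Male, [35-40)>
group2 = ["Flu", "HIV", "HIV", "HIV"]        # <Artist, Female, [30-35)>
print(round(max_entropy_l(group1), 2))       # ~1.89, i.e., about log(1.9) in Example 2.7
print(round(max_entropy_l(group2), 2))       # ~1.75, i.e., about log(1.8)
# The whole table is entropy l-diverse only for l up to the smaller value (~1.8).
```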

To achieve entropy ℓ-diversity for every qid group, the entropy of the sensitive attribute in the table as a whole must be at least log(ℓ), since the entropy of a qid group is always greater than or equal to the minimum entropy of the subgroups {qid1, . . . , qidn} into which it is partitioned, where qid = qid1 ∪ · · · ∪ qidn; that is,

entropy(qid) ≥ min(entropy(qid1), . . . , entropy(qidn)).

This requirement is hard to achieve, especially if some sensitive value occurs very frequently in S.

One limitation of entropy ℓ-diversity is that it does not provide a probability-based risk measure, which tends to be more intuitive to the human data holder. For example, being entropy 1.8-diverse in Example 2.7 does not convey the risk level that the adversary has a 75% probability of success in inferring HIV, since 3 out of the 4 record owners in the qid group of Table 2.4 have HIV. Also, it is difficult to specify different protection levels based on the varied sensitivity and frequency of sensitive values.

It is interesting to note that entropy was also used to measure the anonymity of senders in anonymous communication systems [61, 209], where an adversary employs techniques such as traffic analysis to identify the potential sender (or receiver) of a message. The goal of anonymous communication is to hide the identity of senders during data transfer by using techniques such as mix networks [45] and crowds [197]. The general idea is to blend the senders into a large and characteristically diverse crowd of senders; the crowd collectively sends messages on behalf of its members. Diaz et al. [61], and Serjantov and Danezis [209], employ entropy to measure the diversity of a crowd.

DEFINITION 2.2 recursive (c, ℓ)-diverse [162] Let c > 0 be a constant and let S be a sensitive attribute. Let s1, . . . , sm be the values of S that appear in the qid group, let f1, . . . , fm be their corresponding frequency counts in qid, and let f(1), . . . , f(m) be those counts sorted in non-increasing order. A table is recursive (c, ℓ)-diverse if every qid group satisfies f(1) < c ∑_{i=ℓ}^{m} f(i).


Recursive (c, ℓ)-diversity makes sure that the most frequent sensitive value does not appear too frequently and that the less frequent values do not appear too rarely. A qid group is recursive (c, ℓ)-diverse if the frequency of the most frequent sensitive value is less than the sum of the frequencies of the m − ℓ + 1 least frequent sensitive values multiplied by the publisher-specified constant c, i.e., f(1) < c ∑_{i=ℓ}^{m} f(i). The intuition is that even if the adversary excludes some possible sensitive values of a victim by applying background knowledge, the inequality continues to hold for the remaining values; therefore, the remaining values remain hard to infer. A table has recursive (c, ℓ)-diversity if all of its qid groups have recursive (c, ℓ)-diversity.
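
The test is mechanical once the sensitive-value frequencies of a group are sorted; the following Python sketch is an illustration (not from the book).

```python
# A minimal sketch (not from the book) of the recursive (c, l)-diversity test
# of Definition 2.2: sort the sensitive-value frequencies of a qid group in
# non-increasing order and check f(1) < c * (f(l) + f(l+1) + ... + f(m)).
from collections import Counter

def is_recursive_cl_diverse(sensitive_values, c, l):
    freqs = sorted(Counter(sensitive_values).values(), reverse=True)  # f(1) >= f(2) >= ...
    if len(freqs) < l:
        return False            # fewer than l distinct sensitive values in the group
    return freqs[0] < c * sum(freqs[l - 1:])

group2 = ["Flu", "HIV", "HIV", "HIV"]   # <Artist, Female, [30-35)> of Table 2.4
print(is_recursive_cl_diverse(group2, c=2, l=2))   # False: 3 < 2 * 1 fails
print(is_recursive_cl_diverse(group2, c=4, l=2))   # True:  3 < 4 * 1 holds
```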

This instantiation is less restrictive than entropy ℓ-diversity because a larger c, which is a parameter independent of the frequencies of sensitive values, relaxes the requirement. However, if a sensitive value occurs very frequently in S, this requirement is still hard to satisfy. For example, if p% of the records in the table contain the most frequent sensitive value s, then at least one qid group will have |qid ∧ s|/|qid| ≥ p%, where |qid| denotes the number of records containing the qid value and |qid ∧ s| denotes the number of records containing both the qid and s values. This qid group could easily violate recursive (c, ℓ)-diversity if c is small.

Machanavajjhala et al. [160, 162] also present two other instantiations, called positive disclosure-recursive (c, ℓ)-diversity and negative/positive disclosure-recursive (c, ℓ)-diversity, to capture the adversary's background knowledge. Suppose a victim is in a qid group that contains three different sensitive values {Flu, Cancer, HIV}, and suppose the adversary knows that the victim has no symptom of a flu. Given this piece of background knowledge, the adversary can eliminate Flu from the set of candidate sensitive values of the victim. Martin et al. [165] propose a language to capture this type of background knowledge and to represent the knowledge as k units of information. Furthermore, the language can capture implication knowledge. For example, given that Alice, Bob, and Cathy have flu, the adversary infers that Doug is very likely to have flu, too, because all four of them live together. This implication is considered to be one unit of information. Given an anonymous table T and k units of background knowledge, Martin et al. [165] estimate the maximum disclosure risk of T, which is the probability of the most likely predicted sensitive value assignment of any record owner in T.

ℓ-diversity has the limitation of implicitly assuming that each sensitive attribute takes values uniformly over its domain. When the frequencies of the sensitive values are not similar, achieving ℓ-diversity may cause a large data utility loss. Consider a data table containing data of 1000 patients on some QID attributes and a single sensitive attribute Disease with two possible values, HIV or Flu. Assume that there are only 5 patients with HIV in the table. To achieve 2-diversity, at least one patient with HIV is needed in each qid group; therefore, at most 5 groups can be formed [66], resulting in high information loss in this case.


A common view in the literature is that ℓ-diversity should replace k-anonymity. In fact, it depends on the data publishing scenario. Usually, a linking attack involves data from two sources: one table T1 containing the names and identities of individuals (e.g., a voter list), and one table T2 containing sensitive attributes (e.g., medical data), with both containing the QID attributes. k-anonymity is suitable for anonymizing T1 and ℓ-diversity is suitable for anonymizing T2. In this sense, these two privacy notions are not competitors, but rather are different tools to be used under different scenarios.

2.2.2 Confidence Bounding

Wang et al. [237] consider bounding the confidence of inferring a sensitive value from a qid group by specifying one or more privacy templates of the form 〈QID → s, h〉, where s is a sensitive value, QID is a quasi-identifier, and h is a threshold. Let Conf(QID → s) be max{conf(qid → s)} over all qid groups on QID, where conf(qid → s) denotes the percentage of records containing s in the qid group. A table satisfies 〈QID → s, h〉 if Conf(QID → s) ≤ h. In other words, 〈QID → s, h〉 bounds the adversary's confidence of inferring the sensitive value s in any group on QID to the maximum h. Note, confidence bounding is also known as ℓ+-diversity in [157].

For example, with QID = {Job, Sex, Age}, 〈QID → HIV, 10%〉 states that the confidence of inferring HIV from any group on QID is no more than 10%. For the data in Table 2.4, this privacy template is violated because the confidence of inferring HIV is 75% in the group {Artist, Female, [30-35)}.
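A minimal sketch of checking a privacy template 〈QID → s, h〉; the records below are hypothetical and merely echo the 75% group described above:

```python
from collections import defaultdict

def conf_bound_satisfied(records, qid_attrs, s_attr, s_value, h):
    """Check the privacy template <QID -> s, h>: the maximum, over all qid
    groups, of conf(qid -> s) must be at most h."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[a] for a in qid_attrs)].append(r[s_attr])
    conf = max(vals.count(s_value) / len(vals) for vals in groups.values())
    return conf <= h, conf

records = [
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "HIV"},
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "HIV"},
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "HIV"},
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "Flu"},
    {"Job": "Lawyer", "Sex": "Male",   "Age": "[35-40)", "Disease": "Flu"},
]
print(conf_bound_satisfied(records, ["Job", "Sex", "Age"], "Disease", "HIV", 0.10))
# (False, 0.75): the template <QID -> HIV, 10%> is violated
```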

The confidence measure has two advantages over recursive (c, ℓ)-diversity and entropy ℓ-diversity. First, the confidence measure is more intuitive because the risk is measured by the probability of inferring a sensitive value. The data holder relies on this intuition to specify the acceptable maximum confidence threshold. Second, it allows the flexibility for the data holder to specify a different threshold h for each combination of QID and s according to the perceived sensitivity of inferring s from a group on QID. The recursive (c, ℓ)-diversity cannot be used to bound the frequency of sensitive values that are not the most frequent. For example, if Flu has a frequency of 90%, HIV has 5%, and the others have 3% and 2%, it is difficult to satisfy (c, ℓ)-diversity. Even if it is satisfied, it does not protect the less frequent HIV, which needs a much stronger (c, ℓ)-diversity than Flu. Confidence bounding provides greater flexibility than ℓ-diversity in this aspect. However, recursive (c, ℓ)-diversity can still prevent attribute linkages even in the presence of background knowledge, as discussed earlier. Confidence bounding does not share the same merit.

2.2.3 (X, Y )-Linkability

(X, Y )-anonymity in Chapter 2.1.2 states that each group on X has at least k distinct values on Y (e.g., diseases). However, being linked to k persons does not imply that the probability of being linked to any of them is 1/k. If some Y values occur more frequently than others, the probability of inferring a particular Y value can be higher than 1/k. The (X, Y )-linkability below addresses this issue.

DEFINITION 2.3 (X, Y )-linkability [236] Let x be a value on X and y be a value on Y. The linkability of x to y, denoted by ly(x), is the percentage of the records that contain both x and y among those that contain x, i.e., a(y, x)/a(x), where a(x) denotes the number of records containing x and a(y, x) denotes the number of records containing both y and x. Let Ly(X) = max{ly(x) | x ∈ X} and LY(X) = max{Ly(X) | y ∈ Y}. We say that T satisfies the (X, Y )-linkability for some specified real 0 < h ≤ 1 if LY(X) ≤ h.

In words, (X, Y )-linkability limits the confidence of inferring a value on Y from a value on X. With X and Y describing individuals and sensitive properties, any such inference with a high confidence is a privacy breach. Often, not all but some values y on Y are sensitive, in which case Y can be replaced with a subset of yi values on Y, written Y = {y1, . . . , yp}, and a different threshold h can be specified for each yi. More generally, we can allow multiple Yi, each representing a subset of values on a different set of attributes, with Y being the union of all Yi. For example, Y1 = {HIV} on Test and Y2 = {Banker} on Job. Such a "value-level" specification provides great flexibility essential for minimizing the data distortion.
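A minimal sketch of computing LY(X) when only a subset of Y values is treated as sensitive; the attribute names and records are illustrative:

```python
from collections import defaultdict

def xy_linkability(records, x_attrs, y_attrs, sensitive_y_values):
    """Compute L_Y(X) = max over x and over sensitive y of a(y, x) / a(x)."""
    a_x = defaultdict(int)
    a_xy = defaultdict(int)
    for r in records:
        x = tuple(r[a] for a in x_attrs)
        y = tuple(r[a] for a in y_attrs)
        a_x[x] += 1
        if y in sensitive_y_values:
            a_xy[(x, y)] += 1
    return max(a_xy[(x, y)] / a_x[x] for (x, y) in a_xy) if a_xy else 0.0

# X = {Job, Sex}, Y = {Disease}, with only HIV treated as sensitive.
records = [
    {"Job": "Artist", "Sex": "Female", "Disease": "HIV"},
    {"Job": "Artist", "Sex": "Female", "Disease": "Flu"},
    {"Job": "Lawyer", "Sex": "Male",   "Disease": "Flu"},
]
print(xy_linkability(records, ["Job", "Sex"], ["Disease"], {("HIV",)}))  # 0.5
```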

2.2.4 (X, Y )-Privacy

Wang and Fung [236] propose a general privacy model, called (X, Y )-privacy, which combines both (X, Y )-anonymity and (X, Y )-linkability. The general idea is to require each group x on X to contain at least k records and the confidence of inferring any y ∈ Y from any x ∈ X to be limited to a maximum confidence threshold h. Note, the notion of (X, Y )-privacy is not only applicable to a single table, but is also applicable to the scenario of multiple releases. Chapter 9 discusses (X, Y )-privacy in the context of sequential anonymization in detail.

2.2.5 (α, k)-Anonymity

Wong et al. [246] propose a similar integrated privacy model called (α, k)-anonymity, requiring every qid in a table T to be shared by at least k records and conf(qid → s) ≤ α for any sensitive value s, where k and α are data holder-specified thresholds. Nonetheless, both (X, Y )-privacy and (α, k)-anonymity may result in high distortion if the sensitive values are skewed.


2.2.6 LKC-Privacy

In most previously studied record and attribute linkage models, the usual approach is to generalize the records into equivalence groups so that each group contains at least k records with respect to some QID attributes, and the sensitive values in each qid group are diversified enough to disorient confident inferences. However, Aggarwal [6] has shown that when the number of QID attributes is large, that is, when the dimensionality of the data is high, most of the data have to be suppressed in order to achieve k-anonymity. Applying k-anonymity to high-dimensional patient data would significantly degrade the data quality. This problem is known as the curse of high dimensionality on k-anonymity [6].

To overcome this problem, Mohammed et al. [171] exploit the limited prior knowledge of the adversary: in real-life privacy attacks, it is very difficult for an adversary to acquire all the information in the QID of a target victim because it requires non-trivial effort to gather each piece of prior knowledge from so many possible values. Thus, it is reasonable to assume that the adversary's prior knowledge is bounded by at most L values of the QID attributes of the target victim. Based on this assumption, Mohammed et al. [171] propose a privacy model called LKC-privacy, which will be further discussed in a real-life application in Chapter 6, for anonymizing high-dimensional data. The assumption of limited adversary knowledge has also been previously applied for anonymizing high-dimensional transaction data [255, 256], which will be discussed in Chapter 13.

The general intuition of LKC-privacy is to ensure that every combination of values in QIDj ⊆ QID with maximum length L in the data table T is shared by at least K records, and the confidence of inferring any sensitive values in S is not greater than C, where L, K, C are thresholds and S is a set of sensitive values specified by the data holder (the hospital). LKC-privacy bounds the probability of a successful record linkage to be ≤ 1/K and the probability of a successful attribute linkage to be ≤ C, provided that the adversary's prior knowledge does not exceed L values.

DEFINITION 2.4 LKC-privacy Let L be the maximum number of values of the adversary's prior knowledge. Let S be a set of sensitive values. A data table T satisfies LKC-privacy if and only if for any qid with |qid| ≤ L,

1. |T[qid]| ≥ K, where K > 0 is an integer anonymity threshold and T[qid] denotes the set of records containing qid in T, and

2. conf(qid → s) ≤ C for any s ∈ S, where 0 < C ≤ 1 is a real number confidence threshold.

The data holder specifies the thresholds L, K, and C. The maximum length L reflects the assumption of the adversary's power. LKC-privacy guarantees the probability of a successful record linkage to be ≤ 1/K and the probability of a successful attribute linkage to be ≤ C. LKC-privacy has several nice properties that make it suitable for anonymizing high-dimensional data.

• LKC-privacy requires only a subset of QID attributes to be shared by at least K records. This is a major relaxation from traditional k-anonymity, based on a very reasonable assumption that the adversary has limited power.

• LKC-privacy generalizes several traditional privacy models. k-anonymity is a special case of LKC-privacy with L = |QID|, K = k, and C = 100%, where |QID| is the number of QID attributes in the data table. Confidence bounding is also a special case of LKC-privacy with L = |QID| and K = 1. (α, k)-anonymity is also a special case of LKC-privacy with L = |QID|, K = k, and C = α. ℓ-diversity is also a special case of LKC-privacy with L = |QID|, K = 1, and C = 1/ℓ. Thus, the hospital can still achieve the traditional models if needed.

• LKC-privacy is flexible to adjust the trade-off between data privacy and data utility. Increasing L and K, or decreasing C, improves the privacy at the expense of data utility.

• LKC-privacy is a general privacy model that thwarts both record linkage and attribute linkage, i.e., the privacy model is applicable to anonymizing data with or without sensitive attributes.
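To make Definition 2.4 concrete, here is a minimal brute-force sketch (not the published anonymization algorithm) that enumerates every combination of at most L QID attributes; the records and thresholds are illustrative:

```python
from collections import defaultdict
from itertools import combinations

def satisfies_lkc(records, qid_attrs, s_attr, sensitive_values, L, K, C):
    """Check LKC-privacy: every combination of at most L QID values must be
    shared by >= K records and infer any sensitive value in S with
    confidence <= C."""
    for size in range(1, L + 1):
        for attrs in combinations(qid_attrs, size):
            groups = defaultdict(list)
            for r in records:
                groups[tuple(r[a] for a in attrs)].append(r[s_attr])
            for vals in groups.values():
                if len(vals) < K:
                    return False
                for s in sensitive_values:
                    if vals.count(s) / len(vals) > C:
                        return False
    return True

records = [
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "HIV"},
    {"Job": "Artist", "Sex": "Female", "Age": "[30-35)", "Disease": "Flu"},
    {"Job": "Artist", "Sex": "Male",   "Age": "[30-35)", "Disease": "Flu"},
    {"Job": "Artist", "Sex": "Male",   "Age": "[30-35)", "Disease": "Fever"},
]
print(satisfies_lkc(records, ["Job", "Sex", "Age"], "Disease", {"HIV"},
                    L=2, K=2, C=0.5))   # True for this toy table
```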

2.2.7 (k, e)-Anonymity

Most works on k-anonymity and its extensions assume categorical sensitive attributes. Zhang et al. [269] propose the notion of (k, e)-anonymity to address numerical sensitive attributes such as salary. The general idea is to partition the records into groups so that each group contains at least k different sensitive values with a range of at least e. However, (k, e)-anonymity ignores the distribution of sensitive values within the range λ. If some sensitive values occur frequently within a subrange of λ, then the adversary could still confidently infer the subrange in a group. This type of attribute linkage attack is called the proximity attack [152]. Consider a qid group of 10 data records with 7 different sensitive values, where 9 records have sensitive values between 30 and 35, and 1 record has value 80. As shown in Table 2.6, the group is (7, 50)-anonymous because 80 − 30 = 50. Still, the adversary can infer that a victim inside the group has a sensitive value falling into [30-35] with 90% confidence because 9 out of the 10 records contain values in the sensitive range. Li et al. [152] propose an alternative privacy model, called (ε, m)-anonymity. Given any numerical sensitive value s in T, this privacy model bounds the probability of inferring [s − ε, s + ε] to be at most 1/m.


Table 2.6: (7, 50)-anonymous group

Job      Sex      Salary (sensitive)   Comment
Artist   Female   30                   sensitive
Artist   Female   31                   sensitive
Artist   Female   30                   sensitive
Artist   Female   32                   sensitive
Artist   Female   35                   sensitive
Artist   Female   34                   sensitive
Artist   Female   33                   sensitive
Artist   Female   32                   sensitive
Artist   Female   35                   sensitive
Artist   Female   80                   not sensitive
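A minimal sketch of the proximity attack on the group in Table 2.6: the adversary's confidence for a subrange is simply the fraction of group values that fall into it.

```python
def proximity_confidence(sensitive_values, low, high):
    """Adversary's confidence that a group member's numerical sensitive
    value falls in [low, high]: the fraction of group values in that range."""
    in_range = sum(low <= v <= high for v in sensitive_values)
    return in_range / len(sensitive_values)

# Salaries of the (7, 50)-anonymous group in Table 2.6.
salaries = [30, 31, 30, 32, 35, 34, 33, 32, 35, 80]
print(max(salaries) - min(salaries))            # 50, hence (7, 50)-anonymous
print(proximity_confidence(salaries, 30, 35))   # 0.9 -> 90% confidence for [30-35]
```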

2.2.8 t-Closeness

In a spirit similar to the uninformative principle discussed earlier, Li et al. [153] observe that when the overall distribution of a sensitive attribute is skewed, ℓ-diversity does not prevent attribute linkage attacks. Consider a patient table where 95% of the records have Flu and 5% of the records have HIV. Suppose that a qid group has 50% of Flu and 50% of HIV and, therefore, satisfies 2-diversity. However, this group presents a serious privacy threat because any record owner in the group could be inferred as having HIV with 50% confidence, compared to 5% in the overall table.

To prevent the skewness attack, Li et al. [153] propose a privacy model, called t-closeness, which requires the distribution of a sensitive attribute in any group on QID to be close to the distribution of the attribute in the overall table. t-closeness uses the Earth Mover's Distance (EMD) function to measure the closeness between two distributions of sensitive values, and requires the closeness to be within t. t-closeness has several limitations and weaknesses. First, it lacks the flexibility of specifying different protection levels for different sensitive values. Second, the EMD function is not suitable for preventing attribute linkage on numerical sensitive attributes [152]. Third, enforcing t-closeness would greatly degrade the data utility because it requires the distribution of sensitive values to be the same in all qid groups. This would significantly damage the correlation between QID and the sensitive attributes. One way to decrease the damage is to relax the requirement by adjusting the thresholds, at the cost of an increased risk of skewness attack [66], or to employ the probabilistic privacy models in Chapter 2.4.
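A minimal sketch of the closeness check under the simplifying assumption that any two categorical sensitive values are at ground distance 1, in which case the EMD reduces to the total variation distance (numerical and hierarchical domains require the full EMD formulation):

```python
def emd_equal_distance(group_dist, table_dist):
    """EMD between two distributions over the same categorical domain when
    all pairwise ground distances are 1; this equals the total variation
    distance: half the sum of absolute differences."""
    domain = set(group_dist) | set(table_dist)
    return 0.5 * sum(abs(group_dist.get(v, 0.0) - table_dist.get(v, 0.0))
                     for v in domain)

overall = {"Flu": 0.95, "HIV": 0.05}       # skewed table-level distribution
group = {"Flu": 0.50, "HIV": 0.50}         # 2-diverse but risky group
print(emd_equal_distance(group, overall))  # 0.45 -> violates, e.g., t = 0.1
```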

2.2.9 Personalized Privacy

Xiao and Tao [250] propose the notion of personalized privacy to allow each record owner to specify her own privacy level. This model assumes that each sensitive attribute has a taxonomy tree and that each record owner specifies a guarding node in this tree. The record owner's privacy is violated if an adversary is able to infer any domain sensitive value within the subtree of her guarding node with a probability, called the breach probability, greater than a certain threshold. For example, suppose HIV and SARS are child nodes of Infectious Disease in the taxonomy tree. An HIV patient Alice can set the guarding node to Infectious Disease, meaning that she allows people to infer that she has some infectious disease, but not any specific type of infectious disease. Another HIV patient, Bob, does not mind disclosing his medical information, so he does not set any guarding node for this sensitive attribute.

Although both confidence bounding and personalized privacy take the approach of bounding the confidence or probability of inferring a sensitive value from a qid group, they have differences. In the confidence bounding approach, the data holder imposes a universal privacy requirement on the entire data set, so the minimum level of privacy protection is the same for every record owner. In the personalized privacy approach, a guarding node is specified for each record by its owner. The advantage is that each record owner may specify a guarding node according to her own tolerance of sensitivity. Experiments show that this personalized privacy requirement could result in lower information loss than the universal privacy requirement [250]. In practice, however, it is unclear how individual record owners would set their guarding nodes. Often, a reasonable guarding node depends on the distribution of sensitive values in the whole table or in a group. For example, knowing that her disease is very common, a record owner may set a more specific (less protective) guarding node for her record. Nonetheless, record owners usually have no access to the distribution of sensitive values in their qid group or in the whole table before the data is published. Without such information, the tendency is to play safe by setting a more general (more protective) guarding node, which may negatively affect the utility of the data.

2.2.10 FF-Anonymity

Most previously discussed works assume that the data table can be divided into quasi-identifying (QID) attributes and sensitive attributes. Yet, this assumption does not hold when an attribute contains both sensitive values and quasi-identifying values. Wong et al. [238] identify a class of freeform attacks of the form X → s, where s and the values in X can be any values of any attributes in the table T. X → s is a privacy breach if any record in T matching X allows the adversary to infer the sensitive value s with a high probability. The privacy model, FF-anonymity, bounds the probability of all potential privacy breaches of the form X → s to be below a given threshold.

The key idea of FF-anonymity is that it distinguishes between observable values and sensitive values at the value level, instead of at the attribute level. Every attribute potentially contains both types of values. For example, some diseases are sensitive and some are not. Thus, non-sensitive diseases become observable to an adversary and may be used in X for linking attacks. There is no notion of a fixed QID as in the other works.

2.3 Table Linkage Model

Both record linkage and attribute linkage assume that the adversary already knows the victim's record is in the released table T. However, in some cases, the presence (or the absence) of the victim's record in T already reveals the victim's sensitive information. Suppose a hospital releases a data table of patients with a particular type of disease. Identifying the presence of the victim's record in the table is already damaging. A table linkage occurs if an adversary can confidently infer the presence or the absence of the victim's record in the released table. The following example illustrates the privacy threat of a table linkage.

Example 2.8
Suppose the data holder has released a 3-anonymous patient table T (Table 2.4). To launch a table linkage on a target victim, for instance, Alice, on T, the adversary is presumed to also have access to an external public table E (Table 2.5), where T ⊆ E. The probability that Alice is present in T is 4/5 = 0.8 because there are 4 records in T (Table 2.4) and 5 records in E (Table 2.5) containing 〈Artist, Female, [30-35)〉. Similarly, the probability that Bob is present in T is 3/4 = 0.75.

δ-Presence

To prevent table linkage, Ercan Nergiz et al. [176] propose to bound the probability of inferring the presence of any potential victim's record within a specified range δ = (δmin, δmax). Formally, given an external public table E and a private table T, where T ⊆ E, a generalized table T′ satisfies (δmin, δmax)-presence if δmin ≤ P(t ∈ T | T′) ≤ δmax for all t ∈ E. δ-presence can indirectly prevent record and attribute linkages because if the adversary has at most δ% of confidence that the target victim's record is present in the released table, then the probability of a successful linkage to her record and sensitive attribute is at most δ%. Though δ-presence is a relatively "safe" privacy model, it assumes that the data holder has access to the same external table E as the adversary does. This may not be a practical assumption in some situations.
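A minimal sketch of the presence probabilities in Example 2.8, estimated as the ratio of matching records in T to matching records in E; Bob's qid below is hypothetical and chosen only to reproduce the 3/4 ratio:

```python
from collections import Counter

def presence_probabilities(published_qids, external_qids):
    """For each generalized qid, estimate P(t in T | T') as
    (# records in the published table T with that qid) /
    (# records in the external table E with that qid)."""
    t_counts = Counter(published_qids)
    e_counts = Counter(external_qids)
    return {qid: t_counts[qid] / e_counts[qid] for qid in t_counts}

T = [("Artist", "Female", "[30-35)")] * 4 + [("Lawyer", "Male", "[35-40)")] * 3
E = [("Artist", "Female", "[30-35)")] * 5 + [("Lawyer", "Male", "[35-40)")] * 4
print(presence_probabilities(T, E))   # 0.8 for Alice's qid, 0.75 for Bob's
```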


2.4 Probabilistic Model

There is another family of privacy models that does not focus on exactly what records, attribute values, and tables the adversary can link to a target victim, but focuses on how the adversary would change his/her probabilistic belief on the sensitive information of a victim after accessing the published data. In general, this group of privacy models aims at achieving the uninformative principle [160], whose goal is to ensure that the difference between the prior and posterior beliefs is small.

2.4.1 (c, t)-Isolation

Chawla et al. [46] suggest that having access to the published anonymous data table should not enhance an adversary's power of isolating any record owner. Consequently, they propose a privacy model to prevent (c, t)-isolation in a statistical database. Suppose p is a data point of a target victim v in a data table, and q is the adversary's inferred data point of v obtained by using the published data and the background information. Let δp be the distance between p and q. We say that point q (c, t)-isolates point p if B(q, cδp) contains fewer than t points in the table, where B(q, cδp) is a ball of radius cδp centered at point q. Preventing (c, t)-isolation can be viewed as preventing record linkages. Their model considers distances among data records and, therefore, is more suitable for numerical attributes in statistical databases.

2.4.2 ε-Differential Privacy

Dwork [74] proposes an insightful privacy notion: the risk to the record owner's privacy should not substantially increase as a result of participating in a statistical database. Instead of comparing the prior probability and the posterior probability before and after accessing the published data, Dwork [74] proposes to compare the risk with and without the record owner's data in the published data. Consequently, the privacy model called ε-differential privacy ensures that the removal or addition of a single database record does not significantly affect the outcome of any analysis. It follows that no risk is incurred by joining different databases. Based on the same intuition, if a record owner does not provide his/her actual information to the data holder, it will not make much difference in the result of the anonymization algorithm.

The following is a more formal definition of ε-differential privacy [74]: a randomized function F ensures ε-differential privacy if for all data sets T1 and T2 differing on at most one record,

|ln(P[F(T1) = S] / P[F(T2) = S])| ≤ ε    (2.2)


for all S ∈ Range(F), where Range(F) is the set of possible outputs of the randomized function F. Although ε-differential privacy does not prevent the record and attribute linkages studied in earlier chapters, it assures record owners that they may submit their personal information to the database securely in the knowledge that nothing, or almost nothing, can be discovered from the database with their information that could not have been discovered without their information. Dwork [74] formally proves that ε-differential privacy can provide a guarantee against adversaries with arbitrary background knowledge. This strong guarantee is achieved by comparison with and without the record owner's data in the published data. Dwork [75] proves that if the number of queries is sub-linear in n, the noise to achieve differential privacy is bounded by o(√n), where n is the number of records in the database. Dwork [76] further shows that the notion of differential privacy is applicable to both interactive and non-interactive query models, discussed in Chapters 1.2 and 17.1. Refer to [76] for a survey on differential privacy.
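Although the mechanism itself is not discussed in this chapter, the standard Laplace mechanism is a common way to realize ε-differential privacy for count queries; a minimal sketch:

```python
import numpy as np

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Release a count under epsilon-differential privacy by adding Laplace
    noise with scale sensitivity/epsilon; a counting query has sensitivity 1
    because adding or removing one record changes the count by at most 1."""
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# E.g., "how many patients have HIV?" answered with epsilon = 0.1.
print(laplace_count(42, epsilon=0.1))   # 42 plus noise with std. dev. about 14.1
```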

2.4.3 (d, γ)-Privacy

Rastogi et al. [193] present a probabilistic privacy definition called (d, γ)-privacy. Let P(r) and P(r|T) be the prior probability and the posterior probability of the presence of a victim's record r before and after examining the published table T. (d, γ)-privacy bounds the difference between the prior and posterior probabilities and provides a provable guarantee on privacy and information utility, while most previous work lacks such a formal guarantee. Rastogi et al. [193] show that a reasonable trade-off between privacy and utility can be achieved only when the prior belief is small. Nonetheless, (d, γ)-privacy is designed to protect only against attacks that are d-independent: an attack is d-independent if the prior belief P(r) satisfies the conditions P(r) = 1 or P(r) ≤ d for all records r, where P(r) = 1 means that the adversary already knows that r is in T and no protection on r is needed. However, this d-independence assumption may not hold in some real-life applications [161]. Differential privacy, in comparison, does not have to assume that records are independent or that an adversary has a prior belief bounded by a probability distribution.

2.4.4 Distributional Privacy

Motivated by learning theory, Blum et al. [33] present a privacy model called distributional privacy for a non-interactive query model. The key idea is that when a data table is drawn from a distribution, the table should reveal only information about the underlying distribution, and nothing else. Distributional privacy is a strictly stronger privacy notion than differential privacy, and it can answer all queries over a discretized domain in a concept class of polynomial VC-dimension, where the Vapnik-Chervonenkis (VC) dimension is a measure of the capacity of a statistical classification algorithm. Yet, the algorithm has a high computational cost. Blum et al. [33] present an efficient algorithm specifically for simple interval queries with limited constraints. The problems of developing efficient algorithms for more complicated queries remain open.

2.5 Modeling Adversary’s Background Knowledge

Most privacy models, such as k-anonymity, ℓ-diversity, confidence bounding, (α, k)-anonymity, and t-closeness, assume the adversary has only very limited background knowledge. Specifically, they assume that the adversary's background knowledge is limited to knowing the quasi-identifier. Yet, recent work has shown the importance of integrating an adversary's background knowledge in privacy quantification. A robust privacy notion has to take background knowledge into consideration. Since an adversary can easily learn background knowledge from various sources, such as common sense, demographic statistics, social networks, and other individual-level information, a common challenge faced by all research on integrating background knowledge is to determine what and how much knowledge should be considered.

2.5.1 Skyline Privacy

Chen et al. [48] argue that since it is infeasible for a data publisher to anticipate the background knowledge possessed by an adversary, the interesting research direction is to consider only the background knowledge that arises naturally in practice and can be efficiently handled. In particular, three types of background knowledge, known as three-dimensional knowledge, are considered in [48]:

• knowledge about the target victim,

• knowledge about other record owners in the published data, and

• knowledge about the group of record owners having the same sensitive value as that of the target victim.

Background knowledge is expressed by conjunctions of literals that are in the form of either s ∈ t[S] or s ∉ t[S], denoting that the individual t has a sensitive value s or does not have a sensitive value s, respectively. The three-dimensional knowledge is quantified as an (ℓ, k, m) triplet, which indicates that an adversary knows:

1. ℓ sensitive values that the target victim t does not have,

2. the sensitive values of k other individuals, and


3. m individuals having the same sensitive value as that of t.

Then, the authors propose the skyline privacy criterion, which allows the data publisher to specify a set of incomparable (ℓ, k, m) triplets, called a skyline, along with a set of confidence thresholds for a sensitive value σ, in order to provide more precise and flexible privacy quantification. Under the skyline privacy criterion, an anonymized data set is safe if and only if, for all triplets in the skyline, the adversary's maximum confidence of inferring the sensitive value σ is less than the corresponding threshold specified by the data publisher.

2.5.2 Privacy-MaxEnt

The expressive power of the background knowledge in [48] is limited. For example, it fails to express probabilistic background knowledge. Du et al. [70] specifically address background knowledge in the form of probabilities, for example, P(Ovarian Cancer | Male) = 0. The primary privacy risks in PPDP come from the linkages between the sensitive attributes (S) and the quasi-identifiers (QID). Quantifying privacy is, therefore, to derive P(S|QID) for any instance of S and QID under the probabilistic background knowledge. Du et al. [70] formulate the derivation of P(S|QID) as a non-linear programming problem. P(s|qid) is considered as a variable for each combination of s ∈ S and qid ∈ QID. To assign probability values to these variables, both the background knowledge and the anonymized data set are integrated as constraints guiding the assignment. The meaning of each variable is actually the inference on P(s|qid). The authors deem that such inference should be as unbiased as possible. This thought leads to the use of the maximum entropy principle, which states that when the entropy of these variables is maximized, the inference is the most unbiased [70]. Thus, the problem is re-formulated as finding the maximum-entropy assignment for the variables while satisfying the constraints. Since it is impossible for a data publisher to enumerate all probabilistic background knowledge, a bound of knowledge is used instead. In [70], the bound is specified by both the Top-K positive association rules and the Top-K negative association rules derived from the published data set. Currently, the approach is limited to handling only equality background knowledge constraints.

2.5.3 Skyline (B, t)-Privacy

Since both [48] and [70] are unaware of the exact background knowledge possessed by an adversary, Li and Li [154] propose a generic framework to systematically model different types of background knowledge an adversary may possess. Yet, the paper narrows its scope to background knowledge that is consistent with the original data set T. Modeling background knowledge amounts to estimating the adversary's prior belief of the sensitive attribute values over all possible QID values. This can be achieved by identifying the underlying prior belief function Ppri that best fits T using a kernel regression estimation method. One of the components of kernel estimation, the bandwidth B, is used to measure how much background knowledge an adversary has. Based on the background knowledge and the anonymized data set, the paper proposes an approximate inference method called Ω-estimate to efficiently calculate the adversary's posterior belief of the sensitive values. Consequently, the skyline (B, t)-privacy principle is defined based on the adversary's prior and posterior beliefs. Given a skyline {(B1, t1), (B2, t2), . . . , (Bn, tn)}, an anonymized data set satisfies the skyline (B, t)-privacy principle if and only if, for i = 1 to n, the maximum difference between the adversary's prior and posterior beliefs over all tuples in the data set is at most ti. Li and Li [154] point out that slight changes of B do not incur significant changes in the corresponding privacy risks. Thus, the data publisher only needs to specify a set of well-chosen B values to protect privacy against all adversaries.


Chapter 3

Anonymization Operations

The raw data table usually does not satisfy a specified privacy requirement, and the table must be modified before being published. The modification is done by applying a sequence of anonymization operations to the table. Anonymization operations come in several flavors: generalization, suppression, anatomization, permutation, and perturbation. Generalization and suppression replace values of specific description, typically in the QID attributes, with less specific description. Anatomization and permutation de-associate the correlation between QID and sensitive attributes by grouping and shuffling sensitive values within a qid group. Perturbation distorts the data by adding noise, aggregating values, swapping values, or generating synthetic data based on some statistical properties of the original data. Below, we discuss these anonymization operations in detail.

3.1 Generalization and Suppression

Each generalization or suppression operation hides some details in QID. For a categorical attribute, a specific value can be replaced with a general value according to a given taxonomy. In Figure 3.1, the parent node Professional is more general than the child nodes Engineer and Lawyer. The root node, ANY Job, represents the most general value in Job. For a numerical attribute, exact values can be replaced with an interval that covers the exact values. If a taxonomy of intervals is given, the situation is similar to categorical attributes. More often, however, no pre-determined taxonomy is given for a numerical attribute. Different classes of anonymization operations have different implications for privacy protection, data utility, and search space. But they all result in a less precise but consistent representation of the original data.

A generalization replaces some values with a parent value in the taxonomy of an attribute. The reverse operation of generalization is called specialization. A suppression replaces some values with a special value, indicating that the replaced values are not disclosed. The reverse operation of suppression is called disclosure. Below, we summarize five generalization schemes.

FIGURE 3.1: Taxonomy trees for Job, Sex, Age. (Job: ANY Job splits into Professional, covering Engineer and Lawyer, and Artist, covering Dancer and Writer. Sex: ANY Sex splits into Male and Female. Age: [30-40) splits into [30-35), covering [30-33) and [33-35), and [35-40).)

Full-domain generalization scheme [148, 201, 217]. In this scheme, all values in an attribute are generalized to the same level of the taxonomy tree. For example, in Figure 3.1, if Lawyer and Engineer are generalized to Professional, then it also requires generalizing Dancer and Writer to Artist. The search space for this scheme is much smaller than the search space for the other schemes below, but the data distortion is the largest because of the same granularity level requirement on all paths of a taxonomy tree.

Subtree generalization scheme [29, 95, 96, 123, 148]. In this scheme, at a non-leaf node, either all child values or none are generalized. For example, in Figure 3.1, if Engineer is generalized to Professional, this scheme also requires the other child node, Lawyer, to be generalized to Professional, but Dancer and Writer, which are child nodes of Artist, can remain ungeneralized. Intuitively, a generalized attribute has values that form a "cut" through its taxonomy tree. A cut of a tree is a subset of values in the tree that contains exactly one value on each root-to-leaf path.

Sibling generalization scheme [148]. This scheme is similar to the subtree generalization, except that some siblings may remain ungeneralized. A parent value is then interpreted as representing all missing child values. For example, in Figure 3.1, if Engineer is generalized to Professional, and Lawyer remains ungeneralized, Professional is interpreted as all jobs covered by Professional except for Lawyer. This scheme produces less distortion than subtree generalization schemes because it only needs to generalize the child nodes that violate the specified threshold.

Cell generalization scheme [148, 246, 254]. In all of the above schemes, if a value is generalized, all its instances are generalized. Such schemes are called global recoding. In cell generalization, also known as local recoding, some instances of a value may remain ungeneralized while other instances are generalized. For example, in Table 2.2 the Engineer in the first record is generalized to Professional, while the Engineer in the second record can remain ungeneralized. Compared with global recoding schemes, this scheme is more flexible; therefore, it produces a smaller data distortion. Nonetheless, it is important to note that the utility of data is adversely affected by this flexibility, which causes a data exploration problem: most standard data mining methods treat Engineer and Professional as two independent values, but, in fact, they are not. For example, building a decision tree from such a generalized table may result in two branches, Professional → class2 and Engineer → class1. It is unclear which branch should be used to classify a new engineer. Though very important, this aspect of data utility has been ignored by all works that employed the local recoding scheme. Data produced by global recoding does not suffer from this data exploration problem.

Multidimensional generalization [149, 150]. Let DAi be the domain of an attribute Ai. A single-dimensional generalization, such as full-domain generalization and subtree generalization, is defined by a function fi : DAi → D′ for each attribute Ai in QID. In contrast, a multidimensional generalization is defined by a single function f : DA1 × · · · × DAn → D′, which is used to generalize qid = 〈v1, . . . , vn〉 to qid′ = 〈u1, . . . , un〉, where for every vi, either vi = ui or vi is a child node of ui in the taxonomy of Ai. This scheme flexibly allows two qid groups, even having the same value on some vi and ui, to be independently generalized into different parent groups. For example, 〈Engineer, Male〉 can be generalized to 〈Engineer, ANY Sex〉 while 〈Engineer, Female〉 can be generalized to 〈Professional, Female〉. The generalized table contains both Engineer and Professional. This scheme produces less distortion than the full-domain and subtree generalization schemes because it needs to generalize only the qid groups that violate the specified threshold. Note, in this multidimensional scheme, all records in a qid group are generalized to the same qid′, but cell generalization does not have such a constraint. Both schemes, though, suffer from the data exploration problem discussed above. Ercan Nergiz and Clifton [177] further evaluate a family of clustering-based algorithms that even attempt to improve data utility by ignoring the restrictions of the given taxonomies.

There are also different suppression schemes. Record suppression [29, 123, 148, 201] refers to suppressing an entire record. Value suppression [237] refers to suppressing every instance of a given value in a table. Cell suppression (or local suppression) [52, 168] refers to suppressing some instances of a given value in a table.

In summary, the choice of generalization and suppression operations has an implication on the search space of anonymous tables and the data distortion. The full-domain generalization has the smallest search space but the largest distortion, and the local recoding scheme has the largest search space but the least distortion. For a categorical attribute with a taxonomy tree H, the number of possible cuts in subtree generalization, denoted by C(H), is equal to C(H1) × · · · × C(Hu) + 1, where H1, . . . , Hu are the subtrees rooted at the children of the root of H, and 1 is for the trivial cut at the root of H. The number of potential modified tables is equal to the product of such numbers for all the attributes in QID. The corresponding number is much larger if a local recoding scheme is adopted because any subset of values can be generalized while the rest remains ungeneralized for each attribute in QID.
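A minimal sketch of counting the cuts C(H) of a taxonomy tree under subtree generalization, taking C(leaf) = 1 as the base case; the tree encoding is illustrative:

```python
def count_cuts(tree):
    """Number of possible cuts C(H) of a taxonomy tree under subtree
    generalization: C(H) = C(H1) * ... * C(Hu) + 1, with C(leaf) = 1."""
    children = tree.get("children", [])
    if not children:
        return 1
    product = 1
    for child in children:
        product *= count_cuts(child)
    return product + 1

# The Job taxonomy of Figure 3.1.
job = {"label": "ANY Job", "children": [
    {"label": "Professional", "children": [{"label": "Engineer"}, {"label": "Lawyer"}]},
    {"label": "Artist", "children": [{"label": "Dancer"}, {"label": "Writer"}]},
]}
print(count_cuts(job))   # (1*1 + 1) * (1*1 + 1) + 1 = 5 possible cuts
```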

A table is minimally anonymous if it satisfies the given privacy requirement and its sequence of anonymization operations cannot be reduced without violating the requirement. A table is optimally anonymous if it satisfies the given privacy requirement and contains the most information according to the chosen information metric among all satisfying tables. See Chapter 4 for different types of information metrics. Various works have shown that finding the optimal anonymization is NP-hard: Samarati [201] shows that the optimal k-anonymity by full-domain generalization is very costly. Meyerson and Williams [168] and Aggarwal et al. [12] prove that the optimal k-anonymity by cell suppression, value suppression, and cell generalization is NP-hard. Wong et al. [246] prove that the optimal (α, k)-anonymity by cell generalization is NP-hard. In most cases, finding a minimally anonymous table is a reasonable solution and can be done efficiently.

3.2 Anatomization and Permutation

Unlike generalization and suppression, anatomization (a.k.a. bucketization) [249] does not modify the quasi-identifier or the sensitive attribute, but de-associates the relationship between the two. Precisely, the method releases the data on QID and the data on the sensitive attribute in two separate tables: a quasi-identifier table (QIT) contains the QID attributes, a sensitive table (ST) contains the sensitive attributes, and both QIT and ST have one common attribute, GroupID. All records in the same group will have the same value on GroupID in both tables and, therefore, are linked to the sensitive values in the group in the exact same way. If a group has ℓ distinct sensitive values and each distinct value occurs exactly once in the group, then the probability of linking a record to a sensitive value by GroupID is 1/ℓ. The attribute linkage attack can be weakened by increasing ℓ.

Example 3.1
Suppose that the data holder wants to release the patient data in Table 3.1, where Disease is a sensitive attribute and QID = {Age, Sex}. First, partition (or generalize) the original records into qid groups so that, in each group, at most 1/ℓ of the records contain the same Disease value. This intermediate Table 3.2 contains two qid groups: 〈[30-35), Male〉 and 〈[35-40), Female〉. Next, create QIT (Table 3.3) to contain all records from the original Table 3.1, but replace the sensitive values by the GroupIDs, and create ST (Table 3.4) to contain the count of each Disease for each qid group. QIT and ST satisfy the privacy requirement with ℓ = 2 because each qid group in QIT infers any associated Disease in ST with probability at most 1/ℓ = 1/2 = 50%.

Table 3.1: Anatomy: original patient data

Age   Sex      Disease (sensitive)
30    Male     Hepatitis
30    Male     Hepatitis
30    Male     HIV
32    Male     Hepatitis
32    Male     HIV
32    Male     HIV
36    Female   Flu
38    Female   Flu
38    Female   Heart
38    Female   Heart

Table 3.2: Anatomy: intermediate QID-grouped table

Age       Sex      Disease (sensitive)
[30-35)   Male     Hepatitis
[30-35)   Male     Hepatitis
[30-35)   Male     HIV
[30-35)   Male     Hepatitis
[30-35)   Male     HIV
[30-35)   Male     HIV
[35-40)   Female   Flu
[35-40)   Female   Flu
[35-40)   Female   Heart
[35-40)   Female   Heart

The major advantage of anatomy is that the data in both QIT and ST are unmodified. Xiao and Tao [249] show that the anatomized tables can more accurately answer aggregate queries involving domain values of the QID and sensitive attributes than the generalization approach. The intuition is that in a generalized table domain values are lost and, without additional knowledge, the uniform distribution assumption is the best that can be used to answer a query about domain values. In contrast, all domain values are retained in the anatomized tables, which give the exact distribution of domain values. For instance, suppose that the data recipient wants to count the number of patients of age 38 having heart disease.


Table 3.3: Anatomy: quasi-identifier table (QIT) for release

Age   Sex      GroupID
30    Male     1
30    Male     1
30    Male     1
32    Male     1
32    Male     1
32    Male     1
36    Female   2
38    Female   2
38    Female   2
38    Female   2

Table 3.4: Anatomy: sensitive table (ST) for release

GroupID   Disease (sensitive)   Count
1         Hepatitis             3
1         HIV                   3
2         Flu                   2
2         Heart                 2

The correct count from the original Table 3.1 is 2. The expected count from the anatomized Table 3.3 and Table 3.4 is 3 × 2/4 = 1.5, since 2 out of the 4 records in GroupID = 2 in Table 3.4 have heart disease. This count is more accurate than the expected count 2 × 1/5 = 0.4 from the generalized Table 3.2, where the 1/5 comes from the fact that the 2 patients with heart disease have an equal chance to be at any age in {35, 36, 37, 38, 39}.
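A minimal sketch of how a recipient can compute such an expected count from the anatomized tables; the rows below encode Tables 3.3 and 3.4:

```python
from collections import Counter

# QIT rows: (Age, Sex, GroupID); ST rows: (GroupID, Disease, Count).
qit = [(30, "Male", 1)] * 3 + [(32, "Male", 1)] * 3 \
    + [(36, "Female", 2)] + [(38, "Female", 2)] * 3
st = [(1, "Hepatitis", 3), (1, "HIV", 3), (2, "Flu", 2), (2, "Heart", 2)]

def expected_count(qit, st, age, disease):
    """Estimate COUNT(Age = age AND Disease = disease): each matching QIT
    record contributes P(disease | its GroupID) taken from ST."""
    group_total = Counter()
    group_disease = Counter()
    for gid, d, cnt in st:
        group_total[gid] += cnt
        if d == disease:
            group_disease[gid] += cnt
    return sum(group_disease[gid] / group_total[gid]
               for a, _, gid in qit if a == age)

print(expected_count(qit, st, age=38, disease="Heart"))   # 1.5, as in the text
```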

Yet, with the data published in two tables, it is unclear how standard data mining tools, such as classification, clustering, and association mining tools, can be applied to the published data, and new tools and algorithms need to be designed. Also, anatomy is not suitable for incremental data publishing, which will be further discussed in Chapter 10. The generalization approach does not suffer from the same problem because all attributes are released in the same table.

Sharing the same spirit of anatomization, Zhang et al. [269] propose an approach called permutation. The idea is to de-associate the relationship between a quasi-identifier and a numerical sensitive attribute by partitioning a set of data records into groups and shuffling their sensitive values within each group.


3.3 Random Perturbation

Random perturbation has a long history in statistical disclosure control [3, 242, 252] due to its simplicity, efficiency, and ability to preserve statistical information. The general idea is to replace the original data values with some synthetic data values so that the statistical information computed from the perturbed data does not differ significantly from the statistical information computed from the original data. Depending on the degree of randomization, the perturbed data records may or may not correspond to real-world record owners, so the adversary cannot perform the sensitive linkage attacks or recover sensitive information from the published data with high confidence.

Compared to the other anonymization operations discussed earlier, one limitation of the random perturbation approach is that the published records are "synthetic" in that they do not correspond to the real-world entities represented by the original data; therefore, individual records in the perturbed data are basically meaningless to the human recipients. Randomized data, however, can still be very useful if aggregate properties (such as the frequency distribution of diseases) are the target of data analysis. In fact, one can accurately reconstruct such a distribution if the probability distribution used for perturbation is known. Yet, one may argue that, in such a case, the data holder may consider releasing the statistical information or the data mining results rather than the perturbed data [64].

In contrast, generalization and suppression make the data less precise than, but semantically consistent with, the raw data and, therefore, preserve the truthfulness of the data. For example, after analyzing the statistical properties of a collection of perturbed patient records, a drug company wants to focus on a small number of patients for further analysis. This stage requires the truthful record information instead of perturbed record information. However, the generalized or suppressed data may not be able to preserve the desired statistical information. Below, we discuss several commonly used random perturbation methods, including additive noise, data swapping, and synthetic data generation.

3.3.1 Additive Noise

Additive noise is a widely used privacy protection method in statistical disclosure control [3, 36]. It is often used for hiding sensitive numerical data (e.g., salary). The general idea is to replace the original sensitive value s with s + r, where r is a random value drawn from some distribution. Privacy was measured by how closely the original values of a modified attribute can be estimated [16]. Fuller [90] and Kim and Winkler [140] show that some simple statistical information, like means and correlations, can be preserved by adding random noise. Experiments in [19, 72, 84] further suggest that some data mining information can be preserved in the randomized data. However, Kargupta et al. [136] point out that some reasonably close sensitive values can be recovered from the randomized data when the correlation among attributes is high, but the noise is not. Huang et al. [118] present an improved randomization method to limit this type of privacy breach. Some representative privacy models and statistical disclosure control methods that employ additive noise are discussed in Chapters 2.4 and 5.4.
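A minimal sketch of additive noise, assuming zero-mean Gaussian noise (one common choice): aggregate statistics such as the mean survive, while individual values are masked.

```python
import numpy as np

rng = np.random.default_rng(0)

salaries = rng.normal(loc=50_000, scale=12_000, size=10_000)   # original sensitive values
noise = rng.normal(loc=0.0, scale=8_000, size=salaries.size)   # zero-mean additive noise
perturbed = salaries + noise

# Aggregate statistics are approximately preserved...
print(round(salaries.mean()), round(perturbed.mean()))
# ...but an individual perturbed value is a poor estimate of the original one.
print(round(salaries[0]), round(perturbed[0]))
```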

3.3.2 Data Swapping

The general idea of data swapping is to anonymize a data table by exchanging the values of sensitive attributes among individual records while the swaps maintain the low-order frequency counts or marginals for statistical analysis. It can be used to protect numerical attributes [196] and categorical attributes [195]. An alternative swapping method is rank swapping: first rank the values of an attribute A in ascending order; then, for each value v ∈ A, swap v with another value u ∈ A, where u is randomly chosen within a restricted range p% of v. Rank swapping can better preserve statistical information than ordinary data swapping [65].
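A minimal sketch of rank swapping, under the interpretation that the p% window is measured in ranks over the whole attribute; published variants differ in the details.

```python
import random

def rank_swap(values, p, seed=None):
    """Rank swapping sketch: sort record indices by value, then walk the
    ranked list and swap each unswapped value with a partner chosen among
    the next records whose rank lies within a window of p% of the data size."""
    rnd = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = list(values)
    window = max(1, round(len(values) * p / 100))
    used = set()
    for pos, idx in enumerate(order):
        if idx in used:
            continue
        candidates = [j for j in order[pos + 1: pos + 1 + window] if j not in used]
        if candidates:
            partner = rnd.choice(candidates)
            out[idx], out[partner] = out[partner], out[idx]
            used.update({idx, partner})
    return out

salaries = [52, 31, 80, 33, 34, 35, 30, 45, 50, 32]
print(rank_swap(salaries, p=20, seed=1))   # same multiset, values swapped by rank
```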

3.3.3 Synthetic Data Generation

Many statistical disclosure control methods use synthetic data generation to preserve record owners' privacy and retain useful statistical information [200]. The general idea is to build a statistical model from the data and then to sample points from the model. These sampled points form the synthetic data for data publication instead of the original data. An alternative synthetic data generation approach is condensation [9, 10]. The idea is to first condense the records into multiple groups. For each group, extract some statistical information, such as sum and covariance, that suffices to preserve the mean and correlations across the different attributes. Then, based on the statistical information, generate points for each group following the statistical characteristics of the group and publish these generated points instead of the original data.


Chapter 4

Information Metrics

Privacy preservation is one side of anonymization. The other side is retaining information so that the published data remains practically useful. There are broad categories of information metrics for measuring the data usefulness. A data metric measures the data quality in the entire anonymous table with respect to the data quality in the original table. A search metric guides each step of an anonymization (search) algorithm to identify an anonymous table with maximum information or minimum distortion. Often, this is achieved by ranking a set of possible anonymization operations and then greedily performing the "best" one at each step in the search. Since the anonymous table produced by a search metric is eventually evaluated by a data metric, the two types of metrics usually share the same principle of measuring data quality.

Alternatively, information metrics can be categorized by their information purposes, including general purpose, special purpose, and trade-off purpose. Below, we discuss some commonly used data and search metrics by their purposes.

4.1 General Purpose Metrics

In many cases, the data holder does not know how the published data will be analyzed by the recipient. This is very different from privacy-preserving data mining (PPDM), which assumes that the data mining task is known. In PPDP, for example, the data may be published on the web and a recipient may analyze the data according to her own purpose. An information metric good for one recipient may not be good for another recipient. In such scenarios, a reasonable information metric is to measure the "similarity" between the original data and the anonymous data, which underpins the principle of minimal distortion [201, 215, 217].

4.1.1 Minimal Distortion

In the minimal distortion metric or MD, a penalty is charged to each instance of a value generalized or suppressed. For example, generalizing 10 instances of Engineer to Professional causes 10 units of distortion, and further generalizing these instances to ANY Job causes another 10 units of distortion. This metric is a single-attribute measure, and it was previously used in [201, 216, 217, 236] as a data metric and search metric.

4.1.2 ILoss

ILoss is a data metric proposed in [250] to capture the information loss of generalizing a specific value to a general value vg:

ILoss(vg) = (|vg| − 1) / |DA|,

where |vg| is the number of domain values that are descendants of vg, and |DA| is the number of domain values in the attribute A of vg. This data metric requires all original data values to be at the leaves in the taxonomy. ILoss(vg) = 0 if vg is an original data value in the table. In words, ILoss(vg) measures the fraction of domain values generalized by vg. For example, generalizing one instance of Dancer to Artist in Figure 3.1 has ILoss(Artist) = (2 − 1)/4 = 0.25. The loss of a generalized record r is given by:

ILoss(r) = ∑_{vg ∈ r} (wi × ILoss(vg)),    (4.1)

where wi is a positive constant specifying the penalty weight of attribute Ai of vg. The overall loss of a generalized table T is given by:

ILoss(T) = ∑_{r ∈ T} ILoss(r).    (4.2)
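A minimal sketch of Equations (4.1)-(4.2) for a single attribute, encoding a taxonomy as a map from each general value to the leaf domain values it covers; the names are illustrative:

```python
def iloss_value(vg, taxonomy_leaves, domain_size):
    """ILoss(vg) = (|vg| - 1) / |D_A|, where |vg| counts the leaf domain
    values covered by vg (1 if vg is itself a domain value)."""
    return (len(taxonomy_leaves.get(vg, [vg])) - 1) / domain_size

def iloss_record(record, taxonomy_leaves, domain_sizes, weights):
    """ILoss(r): weighted sum of ILoss over the (generalized) values in r."""
    return sum(weights[attr] * iloss_value(v, taxonomy_leaves, domain_sizes[attr])
               for attr, v in record.items())

# Job taxonomy of Figure 3.1 encoded as "general value -> leaves it covers".
taxonomy_leaves = {
    "Professional": ["Engineer", "Lawyer"],
    "Artist": ["Dancer", "Writer"],
    "ANY Job": ["Engineer", "Lawyer", "Dancer", "Writer"],
}
domain_sizes = {"Job": 4}
weights = {"Job": 1.0}

print(iloss_value("Artist", taxonomy_leaves, 4))                                 # 0.25
print(iloss_record({"Job": "ANY Job"}, taxonomy_leaves, domain_sizes, weights))  # 0.75
```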

4.1.3 Discernibility Metric

The discernibility metric or DM [213] addresses the notion of loss by charging a penalty to each record for being indistinguishable from other records with respect to QID. If a record belongs to a qid group of size $|T[qid]|$, the penalty for the record will be $|T[qid]|$. Thus, the penalty on a group is $|T[qid]|^2$. The overall penalty cost of a generalized table $T$ is given by

$$DM(T) = \sum_{qid_i} |T[qid_i]|^2 \qquad (4.3)$$

over all $qid_i$. This data metric, used in [29, 149, 160, 162, 234, 254], works exactly against k-anonymization, which seeks to make records indistinguishable with respect to QID.

Another version of DM is used as a search metric to guide the search of an anonymization algorithm. For example, a bottom-up generalization (or a top-down specialization) anonymization algorithm chooses the generalization $child(v) \rightarrow v$ (or the specialization $v \rightarrow child(v)$) that minimizes (maximizes) the value of

$$DM(v) = \sum_{qid_v} |T[qid_v]|^2 \qquad (4.4)$$


over all $qid_v$ containing $v$.

Let us compare DM and MD. MD charges a penalty for generalizing a value in a record independently of other records. For example, in MD, generalizing 99 instances of Engineer and 1 instance of Lawyer to Professional has the same penalty as generalizing 50 instances of Dancer and 50 instances of Writer to Artist. In both cases, 100 instances are made indistinguishable; therefore, the costs of both generalizations are the same. The difference is that before the generalization, 99 instances are already indistinguishable in the first case, whereas only 50 instances are indistinguishable in the second case. Therefore, the second case makes more originally distinguishable records become indistinguishable. In contrast, DM can differentiate the two cases:

$$DM(Professional) = 99^2 + 1^2 = 9802$$
$$DM(Artist) = 50^2 + 50^2 = 5000$$

DM can determine that generalizing Dancer and Writer to Artist costs less than generalizing Engineer and Lawyer to Professional.
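The arithmetic above is easy to reproduce; the short Python sketch below simply implements the sum of squared qid-group sizes from Equations 4.3 and 4.4, using the group sizes from the book's example (illustrative only):

```python
def discernibility(group_sizes):
    """Discernibility penalty: sum of squared qid-group sizes (Eq. 4.3 / 4.4)."""
    return sum(size ** 2 for size in group_sizes)


# Groups merged by Engineer/Lawyer -> Professional vs. Dancer/Writer -> Artist
print(discernibility([99, 1]))    # 99^2 + 1^2  = 9802
print(discernibility([50, 50]))   # 50^2 + 50^2 = 5000
```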

4.1.4 Distinctive Attribute

A simple search metric, called distinctive attribute or DA, was employed in [215] to guide the search for a minimally anonymous table in a full-domain generalization scheme. The heuristic selects the attribute having the largest number of distinct values in the data for generalization. Note that this type of simple heuristic only serves the purpose of guiding the search; it does not quantify the utility of an anonymous table.

4.2 Special Purpose Metrics

If the purpose of the data is known at the time of publication, the purpose can be taken into account during anonymization to better retain information. For example, if the data is published for modeling the classification of a target attribute in the table, then it is important not to generalize the values whose distinctions are essential for discriminating the class labels in the target attribute. A frequently asked question is [177]: if the purpose of the data is known, why not extract and publish a data mining result for that purpose (such as a classifier) instead of the data? The answer is that publishing a data mining result is a commitment at the algorithmic level, which is neither practical for the non-expert data holder nor desirable for the data recipient. In practice, there are many ways to mine the data even for a given purpose, and typically it is unknown which one is the best until the data is received and different ways are tried. A real-life example is the release of the Netflix data (New York Times, Oct. 2, 2006) discussed in Chapter 1. Netflix wanted to provide the participants the greatest flexibility to perform their desired analysis, instead of limiting them to a specific type of analysis.

For concreteness, let us consider the classification problem where the goal is to classify future cases into some pre-determined classes, drawn from the same underlying population as the training cases in the published data. The training cases contain both the useful classification information that can improve the classification model, and the useless noise that can degrade the classification model. Specifically, the useful classification information is the information that can differentiate the target classes, and holds not only on training cases, but also on future cases. In contrast, the useless noise holds only on training cases. Clearly, only the useful classification information that helps classification should be retained. For example, a patient's birth year is likely to be part of the information for classifying Lung Cancer if the disease occurs more frequently among elderly people, but the exact birth date is likely to be noise. In this case, generalizing birth date to birth year in fact helps classification because it eliminates the noise. This example shows that simply minimizing the distortion to the data, as adopted by all general purpose metrics and optimal k-anonymization, is not addressing the right problem.

To address the classification goal, the distortion should be measured by the classification error on future cases. Since future data is not available in most scenarios, most developed methods [95, 96, 123] measure the accuracy on the training data. Research results in [95, 96] suggest that the useful classification knowledge is captured by different combinations of attributes. Generalization and suppression may destroy some of these useful "classification structures," but other useful structures may emerge to help. In some cases, generalization and suppression may even improve the classification accuracy because some noise has been removed.

Iyengar [123] presents the first work on PPDP for classification. He proposes the classification metric or CM to measure the classification error on the training data. The idea is to charge a penalty for each record suppressed or generalized to a group in which the record's class is not the majority class. The intuition is that a record having a non-majority class in a group will be classified as the majority class, which is an error because it disagrees with the record's original class. The classification metric CM is defined as follows:

$$CM = \frac{\sum_{r \in T} penalty(r)}{|T|}, \qquad (4.5)$$

where $r$ is a record in data table $T$, and

$$penalty(r) = \begin{cases} 1 & \text{if } r \text{ is suppressed} \\ 1 & \text{if } class(r) \neq majority(G(r)) \\ 0 & \text{otherwise} \end{cases} \qquad (4.6)$$

where $class(r)$ is the class value of record $r$, $G(r)$ is the qid group containing $r$, and $majority(G(r))$ is the majority class in $G(r)$.
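As a quick illustration (not from the book; the toy records and group labels are hypothetical), the following Python sketch computes CM for a table in which each record carries its qid group, class label, and a suppression flag:

```python
from collections import Counter


def classification_metric(records):
    """
    CM (Equations 4.5 and 4.6). Each record is a (qid_group, cls, suppressed)
    triple; a record is penalized if it is suppressed or if its class differs
    from the majority class of its qid group.
    """
    by_group = {}
    for group, cls, _ in records:
        by_group.setdefault(group, Counter())[cls] += 1
    majority = {g: c.most_common(1)[0][0] for g, c in by_group.items()}

    penalty = sum(1 for group, cls, suppressed in records
                  if suppressed or cls != majority[group])
    return penalty / len(records)


# One qid group with classes [Y, Y, N] plus one suppressed record
records = [("g1", "Y", False), ("g1", "Y", False),
           ("g1", "N", False), ("g1", "Y", True)]
print(classification_metric(records))   # 2 penalties / 4 records = 0.5
```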

CM is a data metric and, thus, penalizes any modification to the training data. This does not quite address the classification goal, which may in fact be better served by generalizing useless noise into useful classification information. For classification, a more relevant approach is searching for a "good" anonymization according to some heuristics. In other words, instead of optimizing a data metric, this approach employs a search metric to rank anonymization operations at each step in the search. An anonymization operation is ranked high if it retains useful classification information. The search metric could be adopted by different anonymization algorithms. For example, a greedy algorithm or a hill climbing optimization algorithm can be used to identify a minimal sequence of anonymization operations for a given search metric. We will discuss anonymization algorithms in Chapter 5.

Neither a data metric nor a search metric guarantees a good classification on future cases. It is essential to experimentally evaluate the impact of anonymization by building a classifier from the anonymous data and seeing how it performs on testing cases. Few works [95, 96, 123, 150, 239] have actually conducted such experiments, although many, such as [29], adopt CM in an attempt to address the classification problem.

4.3 Trade-Off Metrics

The special purpose information metrics aim at preserving data usefulness for a given data mining task. The catch is that the anonymization operation that gains maximum information may also lose so much privacy that no other anonymization operation can be performed. The idea of trade-off metrics is to consider both the privacy and information requirements at every anonymization operation and to determine an optimal trade-off between the two requirements.

Fung et al. [95, 96] propose a search metric based on the principle of information/privacy trade-off. Suppose that the anonymous table is searched by iteratively specializing a general value into child values. Each specialization operation splits each group containing the general value into a number of groups, one for each child value. Each specialization operation $s$ gains some information, denoted by $IG(s)$, and loses some privacy, denoted by $PL(s)$. This search metric prefers the specialization $s$ that maximizes the information gained per unit of privacy loss:

$$IGPL(s) = \frac{IG(s)}{PL(s) + 1}. \qquad (4.7)$$

The choice of $IG(s)$ and $PL(s)$ depends on the information metric and privacy model.


For example, in classification analysis, $IG(s)$ could be the information gain $InfoGain(v)$ [191], defined as the decrease of the class entropy [210] after specializing a general group into several specialized groups. $InfoGain(v)$ measures the goodness of a specialization on $v$.

$InfoGain(v)$: Let $T[x]$ denote the set of records in $T$ generalized to the value $x$. Let $freq(T[x], cls)$ denote the number of records in $T[x]$ having the class $cls$. Note that $|T[v]| = \sum_{c} |T[c]|$, where $c \in child(v)$.

$$InfoGain(v) = E(T[v]) - \sum_{c} \frac{|T[c]|}{|T[v]|} E(T[c]), \qquad (4.8)$$

where $E(T[x])$ is the entropy of $T[x]$ [191, 210]:

$$E(T[x]) = -\sum_{cls} \frac{freq(T[x], cls)}{|T[x]|} \times \log_2 \frac{freq(T[x], cls)}{|T[x]|}. \qquad (4.9)$$

Intuitively, $E(T[x])$ measures the entropy, or the impurity, of classes for the records in $T[x]$ by estimating the average minimum number of bits required to encode a string of symbols. The more dominating the majority class in $T[x]$ is, the smaller $E(T[x])$ is, due to less entropy in $T[x]$. $InfoGain(v)$ is the reduction of the impurity by specializing $v$ into $c \in child(v)$. $InfoGain(v)$ is non-negative.

Alternatively, $IG(s)$ could be the decrease of distortion, measured by MD as described in Chapter 4.1, after performing $s$. For k-anonymity, Fung et al. [95, 96] measure the privacy loss $PL(s)$ by the average decrease of anonymity over all $QID_j$ that contain the attribute of $s$, that is,

$$PL(s) = avg\{A(QID_j) - A_s(QID_j)\}, \qquad (4.10)$$

where $A(QID_j)$ and $A_s(QID_j)$ denote the anonymity of $QID_j$ before and after the specialization. One variant is to maximize the gain of information by setting $PL(s)$ to zero. The catch is that the specialization that gains maximum information may also lose so much privacy that no other specializations can be performed. Note that the principle of information/privacy trade-off can also be used to select a generalization $g$, in which case it will minimize

$$ILPG(g) = \frac{IL(g)}{PG(g)}, \qquad (4.11)$$

where $IL(g)$ denotes the information loss and $PG(g)$ denotes the privacy gain by performing $g$.
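The sketch below shows how the entropy, InfoGain, and IGPL quantities of Equations 4.7-4.9 can be computed for a candidate specialization; the class labels and the privacy-loss value are hypothetical, and the code is only an illustration of the formulas, not the authors' implementation:

```python
import math
from collections import Counter


def entropy(class_labels):
    """E(T[x]) in Equation 4.9: class entropy of a group of records."""
    total = len(class_labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(class_labels).values())


def info_gain(parent_labels, child_label_lists):
    """InfoGain(v) in Equation 4.8: entropy reduction of splitting v."""
    total = len(parent_labels)
    return entropy(parent_labels) - sum(
        (len(child) / total) * entropy(child) for child in child_label_lists)


def igpl(ig, pl):
    """IGPL(s) in Equation 4.7: information gained per unit of privacy lost."""
    return ig / (pl + 1)


# Specializing Professional into Engineer and Lawyer (hypothetical labels)
parent = ["Cancer"] * 5 + ["Flu"] * 5
children = [["Cancer"] * 5, ["Flu"] * 5]       # Engineers vs. Lawyers
ig = info_gain(parent, children)               # 1.0 bit: a fully informative split
print(ig, igpl(ig, pl=2))                      # pl=2: anonymity drops by 2 on average
```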


Chapter 5

Anonymization Algorithms

In this chapter, we examine some representative anonymization algorithms. Refer to Tables 5.1, 5.2, 5.3, and 5.4 for a characterization based on privacy model (Chapter 2), anonymization operation (Chapter 3), and information metric (Chapter 4). Our presentation of algorithms is organized according to linkage models. Finally, we discuss privacy threats that may remain even after a data table has been anonymized.

5.1 Algorithms for the Record Linkage Model

We broadly classify record linkage anonymization algorithms into three families: the first two, optimal anonymization and minimal anonymization, use generalization and suppression methods; the third family uses perturbation methods.

5.1.1 Optimal Anonymization

The first family finds an optimal k-anonymization, for a given data metric, by limiting itself to full-domain generalization and record suppression. Since the search space for the full-domain generalization scheme is much smaller than for other schemes, finding an optimal solution is feasible for small data sets. This type of exhaustive search, however, is not scalable to large data sets, especially if a more flexible anonymization scheme is employed.

5.1.1.1 MinGen

Sweeney [217]'s MinGen algorithm exhaustively examines all potential full-domain generalizations to identify the optimal generalization, measured in MD. Sweeney acknowledged that this exhaustive search is impractical even for modest-sized data sets, motivating the second family of k-anonymization algorithms to be discussed later. Samarati [201] proposes a binary search algorithm that first identifies all minimal generalizations, and then finds the optimal generalization measured in MD. Enumerating all minimal generalizations is an expensive operation and, therefore, not scalable for large data sets.


Table 5.1: Anonymization algorithms for record linkage

Algorithm                               Operation  Metric      Optimality
Binary Search [201]                     FG,RS      MD          yes
MinGen [217]                            FG,RS      MD          yes
Incognito [148]                         FG,RS      MD          yes
K-Optimize [29]                         SG,RS      DM,CM       yes
μ-argus [120]                           SG,CS      MD          no
Datafly [215]                           FG,RS      DA          no
Genetic Algorithm [123]                 SG,RS      CM          no
Bottom-Up Generalization [239]          SG         ILPG        no
Top-Down Specialization (TDS) [95, 96]  SG,VS      IGPL        no
TDS for Cluster Analysis [94]           SG,VS      IGPL        no
Mondrian Multidimensional [149]         MG         DM          no
Bottom-Up & Top-Down Greedy [254]       CG         DM          no
TDS for 2-Party [172]                   SG         IGPL        no
Condensation [9, 10]                    CD         heuristics  no
r-Gather Clustering [13]                CL         heuristics  no

FG=Full-domain Generalization, SG=Subtree Generalization, CG=Cell Generalization, MG=Multidimensional Generalization, RS=Record Suppression, VS=Value Suppression, CS=Cell Suppression, CD=Condensation, CL=Clustering

5.1.1.2 Incognito

LeFevre et al. [148] present a suite of optimal bottom-up generalization algorithms, called Incognito, to generate all possible k-anonymous full-domain generalizations. The algorithms exploit the rollup property for computing the size of qid groups.

Observation 5.1.1 (Rollup property) If $qid$ is a generalization of $\{qid_1, \ldots, qid_c\}$, then $|qid| = \sum_{i=1}^{c} |qid_i|$.

The rollup property states that the parent group size $|qid|$ can be directly computed from the sum of all child group sizes $|qid_i|$, implying that the group size $|qid|$ of all possible generalizations can be incrementally computed in a bottom-up manner. This property not only allows efficient computation of group sizes, but also provides a terminating condition for further generalizations, leading to the generalization property:


Observation 5.1.2 (Generalization property) Let $T'$ be a table not more specific than table $T$ on all attributes in QID. If $T$ is k-anonymous on QID, then $T'$ is also k-anonymous on QID.

The generalization property provides the basis for effectively pruning the search space of generalized tables. This property is essential for efficiently determining an optimal k-anonymization [148, 201]. Consider a $qid$ in a table $T$. If $qid'$ is a generalization of $qid$ and $|qid| \geq k$, then $|qid'| \geq k$. Thus, if $T$ is k-anonymous, there is no need to further generalize $T$ because any further generalization of $T$ must also be k-anonymous but with higher distortion and, therefore, not optimal according to, for example, the minimal distortion metric MD. Although Incognito significantly outperforms the binary search [201] in efficiency, the complexity of all three algorithms, namely MinGen, binary search, and Incognito, increases exponentially with the size of QID.
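The rollup computation itself can be sketched in a few lines of Python (an illustration of Observation 5.1.1 under simplified assumptions, not the Incognito implementation): given the group sizes at one level of generalization, the sizes at the next level are obtained by summing, without rescanning the raw records.

```python
from collections import Counter


def group_sizes(table, qid):
    """qid-group sizes of a table; qid is a tuple of attribute names."""
    return Counter(tuple(record[a] for a in qid) for record in table)


def rollup(child_sizes, generalize):
    """
    Rollup property (Observation 5.1.1): every generalized group size is the
    sum of the sizes of the child qid groups that map onto it.
    """
    parent_sizes = Counter()
    for child_qid, size in child_sizes.items():
        parent_sizes[generalize(child_qid)] += size
    return parent_sizes


# Hypothetical example: generalize Job to Professional, keep Sex unchanged
table = [{"Job": "Engineer", "Sex": "M"}, {"Job": "Lawyer", "Sex": "M"},
         {"Job": "Engineer", "Sex": "F"}]
job_parent = {"Engineer": "Professional", "Lawyer": "Professional"}
child = group_sizes(table, ("Job", "Sex"))
parent = rollup(child, lambda qid: (job_parent[qid[0]], qid[1]))
print(parent)   # ('Professional', 'M'): 2, ('Professional', 'F'): 1
```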

5.1.1.3 K-Optimize

Another algorithm, called K-Optimize [29], effectively prunes non-optimal anonymous tables by modeling the search space using a set enumeration tree. Each node represents a k-anonymous solution. The algorithm assumes a totally ordered set of attribute values, examines the tree in a top-down manner starting from the most general table, and prunes a node in the tree when none of its descendants could be a global optimal solution based on the discernibility metric DM and classification metric CM. Unlike the above algorithms, K-Optimize employs the subtree generalization and record suppression schemes. It is the only efficient optimal algorithm that uses the flexible subtree generalization.

5.1.2 Locally Minimal Anonymization

The second family of algorithms produces a minimal k-anonymous table by employing a greedy search guided by a search metric. Being heuristic in nature, these algorithms may not find the optimal anonymous solution, but they are more scalable than the previous family. Since the privacy guarantee is not compromisable, the algorithms typically preserve the required privacy but may not reach optimal utility in the anonymization. Often, local minimality in data distortion is achieved when a greedy search stops at a point within a local search space where no further distortion is required or where any further distortion may compromise the privacy. We sometimes refer to such locally minimal anonymization simply as minimal anonymization.

5.1.2.1 μ-argus

The μ-argus algorithm [120] computes the frequency of all 3-value combinations of domain values, then greedily applies subtree generalizations and cell suppressions to achieve k-anonymity. Since the method limits the size of attribute combinations, the resulting data may not be k-anonymous when more than 3 attributes are considered.

5.1.2.2 Datafly

Sweeney [215]'s Datafly system is the first k-anonymization algorithm scalable to handle real-life large data sets. It achieves k-anonymization by generating an array of qid group sizes and greedily generalizing those combinations with fewer than k occurrences based on the heuristic search metric DA, which selects the attribute having the largest number of distinct values. Datafly employs the full-domain generalization and record suppression schemes.
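The two ingredients Datafly relies on are easy to sketch (an illustration under simplified assumptions, not Sweeney's implementation): pick the attribute to generalize by the DA heuristic, and find the qid combinations that occur fewer than k times.

```python
from collections import Counter


def choose_attribute_da(table, qid_attributes):
    """DA search metric: the QID attribute with the most distinct values."""
    return max(qid_attributes,
               key=lambda a: len({record[a] for record in table}))


def groups_below_k(table, qid_attributes, k):
    """qid combinations occurring fewer than k times (generalization candidates)."""
    counts = Counter(tuple(r[a] for a in qid_attributes) for r in table)
    return [qid for qid, n in counts.items() if n < k]


table = [{"Job": "Engineer", "Sex": "M"}, {"Job": "Lawyer", "Sex": "M"},
         {"Job": "Dancer", "Sex": "F"}]
print(choose_attribute_da(table, ["Job", "Sex"]))   # 'Job' has 3 distinct values
print(groups_below_k(table, ["Job", "Sex"], k=2))   # every combination occurs once
```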

5.1.2.3 Genetic Algorithm

Iyengar [123] was among the first to aim at preserving classification information in k-anonymous data, by employing a genetic algorithm with an incomplete stochastic search based on the classification metric CM and a subtree generalization scheme. The idea is to encode each state of generalization as a "chromosome" and encode data distortion by a fitness function. The search process is a genetic evolution that converges to the fittest chromosome. Iyengar's experiments suggest that, by considering the classification purpose, a classifier built from anonymous data generated with a classification purpose produces a lower classification error than a classifier built from anonymous data generated with a general purpose. However, the experiments also showed that this genetic algorithm is inefficient for large data sets.

5.1.2.4 Bottom-Up Generalization

To address the efficiency issue in k-anonymization, a bottom-up generalization algorithm was proposed in [239] for finding a minimal k-anonymization for classification. The algorithm starts from the original data that violates k-anonymity, and greedily selects a generalization operation at each step according to a search metric similar to ILPG in Equation 4.11. Each operation increases the group size according to the rollup property in Observation 5.1.1. The generalization process is terminated as soon as all groups have the minimum size k. To select a generalization operation, it first considers those that will increase the minimum group size, called critical generalizations, with the intuition that a loss of information should trade for some gain in privacy. When there are no critical generalizations, it considers other generalizations. Wang et al. [239] show that this heuristic significantly reduces the search space. Chapter 6.5 discusses this method in detail.

5.1.2.5 Top-Down Specialization

Instead of working bottom-up, the top-down specialization (TDS) method [95, 96] anonymizes a table by specializing it from the most general state, in which all values are generalized to the most general values of their taxonomy trees. At each step, TDS selects the specialization according to the search metric IGPL in Equation 4.7. The specialization process terminates if no specialization can be performed without violating k-anonymity. The data on termination is a minimal k-anonymization according to the generalization property in Observation 5.1.2. TDS handles both categorical attributes and numerical attributes in a uniform way, except that the taxonomy tree for a numerical attribute is grown on the fly as specializations are searched at each step.

Fung et al. [94] further extend the k-anonymization algorithm to preserve the information needed for cluster analysis. The major challenge of anonymizing data for cluster analysis is the lack of class labels that could be used to guide the anonymization process. Their solution is to first partition the original data into clusters, convert the problem into the counterpart problem for classification analysis where class labels encode the cluster information in the data, and then apply TDS to preserve k-anonymity and the encoded cluster information.

In contrast to the bottom-up approach [148, 201, 239], the top-down approach has several advantages. First, the user can stop the specialization process at any time and have a k-anonymous table. In fact, every step in the specialization process produces a k-anonymous solution. Second, TDS handles multiple QIDs, which is essential for avoiding the excessive distortion suffered by a single high-dimensional QID. Third, the top-down approach is more efficient by going from the most generalized table to a more specific table. Once a group cannot be further specialized, all data records in the group can be discarded. In contrast, the bottom-up approach has to keep all data records until the end of the computation. However, data holders employing TDS may encounter the dilemma of choosing (multiple) QIDs discussed in Chapter 2.1.1.

In terms of efficiency, TDS is an order of magnitude faster than the Genetic Algorithm [123]. For instance, TDS required only 7 seconds to anonymize the benchmark Adult data set [179], whereas the genetic algorithm required 18 hours for the same data using the same parameter k. They produce comparable classification quality [95, 96].

5.1.2.6 Mondrian Multidimensional

LeFevre et al. [149] present a greedy top-down specialization algorithm for finding a minimal k-anonymization in the case of the multidimensional generalization scheme. This algorithm is very similar to TDS. Both algorithms perform a specialization on a value v one at a time. The major difference is that TDS performs a specialization on all qid groups containing v. In other words, a specialization is performed only if each specialized qid group contains at least k records. In contrast, Mondrian performs a specialization on one qid group if each of its specialized qid groups contains at least k records. Due to this relaxed constraint, the resulting anonymous data in multidimensional generalization usually has better quality than in single-dimensional generalization. The trade-off is that multidimensional generalization is less scalable than other schemes due to the increased search space. Xu et al. [254] show that employing cell generalization could further improve the data quality. Though the multidimensional and cell generalization schemes cause less information loss, they suffer from the data exploration problem discussed in Chapter 3.1.

5.1.3 Perturbation Algorithms

This family of anonymization methods employs perturbation to de-associate the linkage between a target victim and a record while preserving some statistical information.

5.1.3.1 Condensation

Aggarwal and Yu [8, 9, 10] present a condensation method to thwart record linkages. The method first assigns records into multiple non-overlapping groups in which each group has a size of at least k records. For each group, it extracts statistical information, such as the sum and covariance, that suffices to preserve the mean and the correlations across the different attributes. Then, for publishing, it generates synthetic points for each group following the statistical characteristics of the group. This method does not require the use of taxonomy trees and can be effectively used in situations with dynamic data updates, as in the case of data streams. As each new data record is received, it is added to the nearest group, as determined by the distance to each group centroid. As soon as the number of data records in a group reaches 2k, the group is split into two groups of k records each. The statistical information of the new group is then incrementally computed from the original group.
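A minimal sketch of the per-group statistics involved is shown below; it tracks only per-attribute means and variances (rather than the full covariance matrix used in [9, 10]) and is purely illustrative, not the authors' implementation.

```python
import random


class CondensedGroup:
    """Counts, sums, and sums of squares for one condensation group."""

    def __init__(self, dim):
        self.n = 0
        self.sum = [0.0] * dim
        self.sumsq = [0.0] * dim

    def add(self, point):
        """Incrementally absorb one numeric record (a list of dim values)."""
        self.n += 1
        for i, x in enumerate(point):
            self.sum[i] += x
            self.sumsq[i] += x * x

    def mean(self):
        return [s / self.n for s in self.sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.sum, self.sumsq)]

    def synthesize(self):
        """Generate one synthetic point matching the group's means/variances."""
        return [random.gauss(m, max(v, 0.0) ** 0.5)
                for m, v in zip(self.mean(), self.variance())]

# In the streaming setting, each new record is added to the group with the
# nearest centroid (mean); once a group reaches size 2k it is split into two
# groups of k records, and statistics like the above are updated incrementally.
```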

5.1.3.2 r-Gather Clustering

In a similar spirit, Aggarwal et al. [13] propose a perturbation method called r-gather clustering for anonymizing numerical data. This method partitions records into several clusters such that each cluster contains at least r data points (i.e., records). Instead of generalizing individual records, this approach releases the cluster centers, together with their size, radius, and a set of associated sensitive values. To eliminate the impact of outliers, they relax this requirement to (r, ε)-gather clustering so that at most an ε fraction of the data records in the data set can be treated as outliers and removed from the released data.

5.1.3.3 Cross-Training Round Sanitization

Recall from Chapter 2.4.1 that a point q (c, t)-isolates a point p if B(q, cδp) contains fewer than t points in the table, where B(q, cδp) is a ball of radius cδp centered at point q. Chawla et al. [46] propose two sanitization (anonymization) techniques, recursive histogram sanitization and density-based perturbation, to prevent (c, t)-isolation.

Recursive histogram sanitization recursively divides the original data into a set of subcubes according to local data density until no subcube has more than 2t data points. The method outputs the boundaries of the subcubes and the number of points in each subcube. However, this method cannot handle high-dimensional spheres and balls. Chawla et al. [47] propose an extension to handle high dimensionality. Density-based perturbation is a variant of the method proposed by Agrawal and Srikant [19]; whereas the magnitude of the added noise is relatively fixed in [19], density-based perturbation takes into consideration the local data density near the point that needs to be perturbed. Points in dense areas are perturbed much less than points in sparse areas. Although the privacy of the perturbed points is protected, the privacy of the points in the t-neighborhood of the perturbed points could be compromised because the sanitization radius itself could leak information about these points. To prevent such privacy leakage from t-neighborhood points, Chawla et al. [46] further suggest a cross-training round sanitization method that combines recursive histogram sanitization and density-based perturbation. In cross-training round sanitization, a data set is randomly divided into two subsets, A and B. B is sanitized using only recursive histogram sanitization, while A is perturbed by adding Gaussian noise generated according to the histogram of B.

5.2 Algorithms for the Attribute Linkage Model

The following algorithms anonymize the data to prevent attribute linkages. They use the privacy models discussed in Chapter 2.2. Though their privacy models are different from those of record linkage, many algorithms for attribute linkage are simple extensions of algorithms for record linkage.

5.2.1 ℓ-Diversity Incognito and ℓ+-Optimize

Machanavajjhala et al. [160, 162] modify the bottom-up Incognito [148] to identify an optimal ℓ-diverse table. Recall that ℓ-diversity requires every qid group to contain at least ℓ "well-represented" sensitive values. The ℓ-Diversity Incognito operates based on a generalization property, similar to Observation 5.1.2, that ℓ-diversity is non-decreasing with respect to generalization. In other words, generalizations help to achieve ℓ-diversity, just as generalizations help achieve k-anonymity. Therefore, k-anonymization algorithms that employ full-domain and subtree generalization can also be extended into ℓ-diversity algorithms.


Table 5.2: Anonymization algorithms for attribute linkage

Algorithm                           Operation  Metric      Optimality
ℓ-Diversity Incognito [162]         FG,RS      MD,DM       yes
InfoGain Mondrian [150]             MG         IG          no
Top-Down Disclosure [237]           VS         IGPL        no
ℓ+-Optimize [157]                   VS         MD,DM,CM    yes
Anatomize [249]                     AM         heuristics  no
(k, e)-Anonymity Permutation [269]  PM         min. error  yes
Greedy Personalized [250]           SG,CG      ILoss       no
Progressive Local Recoding [246]    CG         MD          no
t-Closeness Incognito [153]         FG,RS      DM          yes

FG=Full-domain Generalization, SG=Subtree Generalization, CG=Cell Generalization, MG=Multidimensional Generalization, RS=Record Suppression, VS=Value Suppression, AM=Anatomization, PM=Permutation

Most works discussed in Chapter 5.2 employ some heuristic to achieve minimal anonymization. Recently, Liu and Wang [157] present a subtree generalization algorithm, called ℓ+-optimize, to achieve optimal anonymization for ℓ+-diversity, which is the same as the confidence bounding studied in Chapter 2.2.2. ℓ+-optimize organizes all possible cuts into a cut enumeration tree such that the nodes are ranked by a cost function that measures the information loss. Consequently, the optimal solution is obtained by effectively pruning the non-optimal nodes (subtrees) that have higher cost than the currently examined candidate. The efficiency of the optimal algorithm relies on the monotonic property of the cost function. It is a scalable method for finding an optimal ℓ+-diversity solution by pruning a large search space.

5.2.2 InfoGain Mondrian

LeFevre et al. [150] propose a suite of greedy algorithms to identify a minimally anonymous table satisfying k-anonymity and/or entropy ℓ-diversity with the consideration of a specific data analysis task, such as classification modeling of multiple target attributes, regression analysis on numerical attributes, or query answering with minimal imprecision. Their top-down algorithms are similar to TDS [95], but LeFevre et al. [150] employ multidimensional generalization.

5.2.3 Top-Down Disclosure

Recall that a privacy template has the form 〈QID → s, h〉 and states that the confidence of inferring the sensitive value s from any group on QID is no more than h. Wang et al. [237] propose an efficient algorithm to minimally suppress a table to satisfy a set of privacy templates. Their algorithm, called Top-Down Disclosure (TDD), iteratively discloses domain values, starting from the table in which all domain values are suppressed. In each iteration, it discloses the suppressed domain value that maximizes the search metric IGPL in Equation 4.7, and it terminates the iterative process when a further disclosure leads to a violation of some privacy template. This approach is based on the following key observation.

Observation 5.2.1 (Disclosure property) Consider a privacy template 〈QID → s, h〉. If a table violates the privacy template, so does any table obtained by disclosing a suppressed value. [237]

This property ensures that the algorithm finds a locally minimal suppressed table. This property, and therefore the algorithm, is extendable to the full-domain, subtree, and sibling generalization schemes, with the disclosure operation being replaced by the specialization operation. The basic observation is that the confidence in at least one of the specialized groups will be as large as the confidence in the general group. Based on a similar idea, Wong et al. [246] employ the cell generalization scheme and propose some greedy top-down and bottom-up methods to identify a locally minimal anonymous solution that satisfies (α, k)-anonymity.

5.2.4 Anatomize

Anatomization (also known as bucketization) provides an alternative way to achieve ℓ-diversity. Refer to Chapter 3.2 for the notion of anatomization. The problem can be described as follows: given a person-specific data table T and a parameter ℓ, we want to obtain a quasi-identifier table (QIT) and a sensitive table (ST) such that an adversary can correctly infer the sensitive value of any individual with probability at most 1/ℓ. The Anatomize algorithm [249] first partitions the data table into buckets and then separates the quasi-identifiers from the sensitive attribute by randomly permuting the sensitive attribute values in each bucket. The anonymized data consists of a set of buckets with permuted sensitive attribute values.

Anatomize starts by initializing an empty QIT and ST. Then, it hashes the records of T into buckets by their sensitive values, so that each bucket includes the records with the same sensitive value. The subsequent execution involves a group-creation step and a residue-assignment step.

The group-creation step iteratively yields a new qid group as follows. First, Anatomize obtains a set S consisting of the ℓ hash buckets that currently have the largest number of records. Then, a record is randomly chosen from each bucket in S and is added to the new qid group. As a result, the qid group contains ℓ records with distinct sensitive values. This group-creation step is repeated while there are at least ℓ non-empty buckets. The term residue record refers to a record remaining in a bucket at the end of the group-creation phase. There are at most ℓ − 1 residue records. For each residue record r, the residue-assignment step collects a set S′ of qid groups produced in the previous step in which no record has the same sensitive value as r. Then, the record r is assigned to an arbitrary group in S′.
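The group-creation step is easy to express in Python; the sketch below illustrates that step only (the residue-assignment step and the construction of the QIT/ST output tables are omitted), and it is not the authors' code.

```python
import heapq
from collections import defaultdict


def anatomize_groups(records, sensitive_attr, ell):
    """
    Sketch of Anatomize's group-creation step: bucket records by sensitive
    value, then, while at least `ell` buckets are non-empty, draw one record
    from each of the `ell` largest buckets to form a qid group with `ell`
    distinct sensitive values.
    """
    buckets = defaultdict(list)
    for r in records:
        buckets[r[sensitive_attr]].append(r)

    groups = []
    while sum(1 for b in buckets.values() if b) >= ell:
        largest = heapq.nlargest(ell, buckets.values(), key=len)
        groups.append([b.pop() for b in largest])

    # Leftover records become the residue records (as noted above, at most
    # ell - 1 of them) to be handled by the residue-assignment step.
    residue = [r for b in buckets.values() for r in b]
    return groups, residue
```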

5.2.5 (k, e)-Anonymity Permutation

To achieve (k, e)-anonymity, Zhang et al. [269] propose an optimal permutation method to assign data records into groups so that the sum of error E is minimized, where E, for example, could be measured by the range of sensitive values in each group. The optimal algorithm has time and space complexity $O(n^2)$, where n is the number of data records. (k, e)-anonymity is also closely related to the range coding technique, which is used in both process control [199] and statistics [113]. In process control, range coding (also known as coarse coding) permits generalization by allowing the whole numerical area to be mapped to a set of groups defined by a set of boundaries, which is similar to the idea of grouping data records by ranges and keeping the boundaries of each group for fast computation in (k, e)-anonymity. Hegland et al. [113] also suggest handling large data sets, such as population census data, by dividing them into generalized groups (blocks) and applying a computational model to each group. Any aggregate computation can hence be performed based on manipulation of individual groups. Similarly, (k, e)-anonymity exploits the group boundaries to efficiently answer aggregate queries.

5.2.6 Personalized Privacy

Refer to the requirement of personalized privacy discussed in Chapter 2.2.9. Xiao and Tao [250] propose a greedy algorithm to achieve every record owner's privacy requirement, expressed in terms of a guarding node, as follows: initially, all QID attributes are generalized to the most general values, and the sensitive attributes remain ungeneralized. At each iteration, the algorithm performs a top-down specialization on a QID attribute and, for each qid group, performs cell generalization on the sensitive attribute to satisfy the personalized privacy requirement; that is, the breach probability of inferring any domain sensitive value within the subtree of a guarding node is below a certain threshold. Since the breach probability is non-increasing with respect to generalization on the sensitive attribute, and the sensitive values could possibly be generalized to the most general values, the generalized table found at every iteration is publishable without violating the privacy requirement, although a table with lower information loss ILoss, measured by Equation 4.2, is preferable. When no better solution with lower ILoss is found, the greedy algorithm terminates and outputs a locally minimal anonymization. Since this approach generalizes the sensitive attribute, ILoss is measured on both the QID and sensitive attributes.


Table 5.3: Anonymization algorithms for table linkage

Algorithm    Operation  Metric      Optimality
SPALM [176]  FG         DM          yes
MPALM [176]  MG         heuristics  no

FG=Full-domain Generalization, MG=Multidimensional Generalization

5.3 Algorithms for the Table Linkage Model

The following algorithms aim at preventing table linkages, that is, preventing adversaries from determining the presence or absence of a target victim's record in a released table.

5.3.1 δ-Presence Algorithms SPALM and MPALM

Recall that a generalized table T′ satisfies (δmin, δmax)-presence (or simply δ-presence) with respect to an external table E if $\delta_{min} \leq P(t \in T \mid T') \leq \delta_{max}$ for all $t \in E$. To achieve δ-presence, Ercan Nergiz et al. [176] present two anonymization algorithms, SPALM and MPALM. SPALM is an optimal algorithm that employs a full-domain single-dimensional generalization scheme. Ercan Nergiz et al. [176] prove the anti-monotonicity property of δ-presence with respect to full-domain generalization: if table T is δ-present, then a generalized version T′ of T is also δ-present. SPALM is a top-down specialization approach and exploits the anti-monotonicity property of δ-presence to prune the search space effectively. MPALM is a heuristic algorithm that employs a multidimensional generalization scheme, with complexity $O(|C||E|\log_2|E|)$, where |C| is the number of attributes in the private table T and |E| is the number of records in the external table E. Their experiments showed that MPALM usually results in much lower information loss than SPALM because MPALM employs a more flexible generalization scheme.

5.4 Algorithms for the Probabilistic Attack Model

Many algorithms for achieving the probabilistic privacy models studied in Chapter 2.4 employ random perturbation methods, so they do not suffer from the problem of minimality attacks. The random perturbation algorithms are non-deterministic; therefore, the anonymization operations are non-reversible. The random perturbation algorithms for the probabilistic attack model can be divided into two groups.


Table 5.4: Anonymization algorithms for probabilistic attack

Algorithm                                   Operation  Metric
Cross-Training Round Sanitization [46]      AN         statistical
ε-Differential Privacy Additive Noise [74]  AN         statistical
αβ Algorithm [193]                          AN,SP      statistical

AN=Additive Noise, SP=Sampling

The first group is local perturbation [21], which assumes that a record owner does not trust anyone except himself and perturbs his own data record by adding noise before submission to the untrusted data holder. The second group perturbs all records together at a trusted data holder, which is the data publishing scenario studied in this book. Although the methods in the first group are also applicable to the second by adding noise to each individual record, Rastogi et al. [193] and Dwork [75] demonstrate that the information utility can be improved, with stronger lower bounds, by assuming a trusted data holder who has the capability to access all records and exploit the overall distribution to perturb the data, rather than perturbing the records individually.

A number of PPDP methods [19, 268] have been proposed for preserving classification information with randomization. Agrawal and Srikant [19] present a randomization method for decision tree classification with the use of aggregate distributions reconstructed from the randomized data. The general idea is to reconstruct the distribution separately for each class. Then, a special decision tree algorithm is developed to determine the splitting conditions based on the relative presence of the different classes, derived from the aggregate distributions. Zhang et al. [268] present a randomization method for the naive Bayes classifier. The major shortcoming of this approach is that ordinary classification algorithms will not work on the randomized data. There is a large family of works on randomization perturbation for data mining and data publishing [20, 21, 72, 83, 119, 193, 198].

The statistics community conducts substantial research on the disclosure control of statistical information and aggregate query results [46, 52, 73, 166, 184]. The goal is to prevent adversaries from obtaining sensitive information by correlating different published statistics. Cox [52] proposes the k%-dominance rule, which suppresses a sensitive cell if the values of two or three entities in the cell contribute more than k% of the corresponding SUM statistic. The proposed mechanisms include query size and query overlap control, aggregation, data perturbation, and data swapping. Nevertheless, such techniques are often complex and difficult to implement [87], or address privacy threats that are unlikely to occur. There are some decent surveys [3, 63, 173, 267] in the statistics community.


5.4.1 ε-Differential Additive Noise

One representative work to thwart probabilistic attacks is differential privacy [74]; its definition can be found in Chapter 2.4. Dwork [74] proposes an additive noise method to achieve ε-differential privacy. The added noise is chosen over a scaled symmetric exponential distribution with variance $\sigma^2$ in each component, with $\sigma \geq \Delta f / \varepsilon$, where $\Delta f$ is the maximum difference in the output of a query f caused by the removal or addition of a single data record. Machanavajjhala et al. [161] propose a revised version of differential privacy, called probabilistic differential privacy, that yields a practical privacy guarantee for synthetic data generation. The idea is to first build a model from the original data and then sample points from the model to substitute the original data, filtering unrepresentative data and shrinking the domain. Other algorithms [32, 62, 77, 78] have been proposed to achieve differential privacy. Refer to [76] for a decent survey of the recent developments in this line of privacy models.
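A minimal sketch of the basic additive-noise mechanism for a single numeric query is shown below; it illustrates only the scale-with-sensitivity idea described above (a counting query with sensitivity 1 is assumed in the usage line), not the full framework of [74].

```python
import random


def noisy_answer(true_answer, sensitivity, epsilon):
    """
    Answer one query under epsilon-differential privacy by adding symmetric
    exponential (Laplace) noise with scale sensitivity / epsilon, where
    `sensitivity` is the maximum change in the query answer caused by adding
    or removing one record (1 for a counting query).
    """
    scale = sensitivity / epsilon
    # A Laplace sample is an exponential magnitude with a random sign.
    noise = random.choice((-1, 1)) * random.expovariate(1 / scale)
    return true_answer + noise


print(noisy_answer(true_answer=1000, sensitivity=1, epsilon=0.1))
```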

5.4.2 αβ Algorithm

Recall that (d, γ)-privacy in Chapter 2.4.3 bounds the difference between P(r) and P(r|T), where P(r) and P(r|T) are the prior and posterior probabilities, respectively, of the presence of a victim's record before and after examining the published table T. To achieve (d, γ)-privacy, Rastogi et al. [193] propose a perturbation method called the αβ algorithm, consisting of two steps. The first step is to select a subset of records from the original table D, each with probability α + β, and insert them into the data table T, which is to be published. The second step is to generate some counterfeit records from the domain of all attributes. If a counterfeit record is not in the original table D, then it is inserted into T with probability β. Hence, the resulting perturbed table T consists of both records randomly selected from the original table and counterfeit records from the domain. The number of records in the perturbed data could be larger than in the original data table, in contrast to FRAPP [21], which has a fixed table size. The drawback of inserting counterfeits is that the released data no longer preserves the truthfulness of the original data at the record level, which is important in some applications, as explained in Chapter 1.1.
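The two steps can be sketched as follows; the code is purely illustrative (a real implementation would sample counterfeits rather than enumerate the whole attribute domain, and the α, β values in the usage line are arbitrary).

```python
import random
from itertools import product


def alpha_beta_perturb(original, domains, alpha, beta):
    """
    Sketch of the two steps: keep each original record with probability
    alpha + beta, and insert each counterfeit record (a domain combination
    not present in the original table) with probability beta.
    `domains` maps each attribute to its list of values; records are
    tuples of (attribute, value) pairs in sorted attribute order.
    """
    original_set = set(original)
    published = [r for r in original if random.random() < alpha + beta]

    attrs = sorted(domains)
    for values in product(*(domains[a] for a in attrs)):
        candidate = tuple(zip(attrs, values))
        if candidate not in original_set and random.random() < beta:
            published.append(candidate)
    return published


domains = {"Job": ["Engineer", "Lawyer"], "Sex": ["Male", "Female"]}
original = [(("Job", "Engineer"), ("Sex", "Male"))]
print(alpha_beta_perturb(original, domains, alpha=0.6, beta=0.2))
```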

5.5 Attacks on Anonymous Data

5.5.1 Minimality Attack

Most privacy models assume that the adversary knows the QID of a target victim and/or the presence of the victim's record in the published data. In addition to this background knowledge, the adversary can possibly determine the privacy requirement (e.g., 10-anonymity or 5-diversity), the anonymization operations (e.g., the subtree generalization scheme) used to achieve the privacy requirement, and the detailed mechanism of an anonymization algorithm. The adversary can possibly determine the privacy requirement and anonymization operations by examining the published data or its documentation, and learn the mechanism of the anonymization algorithm by, for example, reading research papers. Wong et al. [245] point out that such additional background knowledge can lead to extra information that facilitates an attack to compromise data privacy. This is called the minimality attack.

Many anonymization algorithms discussed in this chapter follow an implicit minimality principle. For example, when a table is generalized bottom-up to achieve k-anonymity, the table is not further generalized once it minimally meets the k-anonymity requirement. The minimality attack exploits this minimality principle to reverse the anonymization operations and filter out impossible versions of the original table [245]. The following example illustrates a minimality attack on confidence bounding [237].

Example 5.1
Consider the original patient Table 5.5, the anonymous Table 5.6, and an external Table 5.7 in which each record has a corresponding original record in Table 5.5. Suppose the adversary knows that the confidence bounding requirement is 〈{Job, Sex} → HIV, 60%〉. With the minimality principle, the adversary can infer that Andy and Bob have HIV based on the following reasoning. From Table 5.5, qid = 〈Lawyer, Male〉 has 5 records, and qid = 〈Engineer, Male〉 has 2 records. Thus, 〈Lawyer, Male〉 in the original table must already satisfy 〈{Job, Sex} → HIV, 60%〉 because even if both records with HIV had 〈Lawyer, Male〉, the confidence for inferring HIV would be only 2/5 = 40%. Since a subtree generalization has been performed, 〈Engineer, Male〉 must be the qid that violated the 60% confidence requirement on HIV, and that is possible only if both records with 〈Engineer, Male〉 have a disease value of HIV.

Table 5.5: Minimality attack: original patient data

Job       Sex   Disease
Engineer  Male  HIV
Engineer  Male  HIV
Lawyer    Male  Flu
Lawyer    Male  Flu
Lawyer    Male  Flu
Lawyer    Male  Flu
Lawyer    Male  Flu


Table 5.6: Minimality attack: published anonymous data

Job           Sex   Disease
Professional  Male  HIV
Professional  Male  Flu
Professional  Male  Flu
Professional  Male  Flu
Professional  Male  Flu
Professional  Male  Flu
Professional  Male  HIV

Table 5.7: Minimality attack: external data

Name     Job       Sex
Andy     Engineer  Male
Calvin   Lawyer    Male
Bob      Engineer  Male
Doug     Lawyer    Male
Eddy     Lawyer    Male
Fred     Lawyer    Male
Gabriel  Lawyer    Male

To thwart the minimality attack, Wong et al. [245] propose a privacy model called m-confidentiality that limits the probability of the linkage from any record owner to any sensitive value set in the sensitive attribute. The minimality attack is applicable to both optimal and minimal anonymization algorithms that employ generalization, suppression, anatomization, or permutation to achieve privacy models including, but not limited to, ℓ-diversity [162], (α, k)-anonymity [246], (k, e)-anonymity [269], personalized privacy [250], anatomy [249], t-closeness [153], m-invariance [251], and (X, Y)-privacy [236]. To avoid the minimality attack on ℓ-diversity, Wong et al. [245] propose to first k-anonymize the table. Then, for each qid group in the k-anonymous table that violates ℓ-diversity, their method distorts the sensitive values to satisfy ℓ-diversity.

5.5.2 deFinetti Attack

Kifer [138] presents a novel attack on partition-based anonymous data obtained by generalization, suppression, or anatomization. Though the general methodology of the attack is applicable to any partition-based algorithm, we illustrate a specific attack against Anatomy (Chapter 3.2), which is one kind of partition-based algorithm. Refer to [138] for details on how this algorithm can be modified to attack other partition-based algorithms, such as generalization. We illustrate the attack with the following example.


Table 5.8: deFinetti attack: quasi-identifier table (QIT)

Rec#  Smoker?  GroupID
1     Y        1
2     Y        1
3     N        2
4     N        2
5     Y        3
6     N        3
7     Y        4
8     Y        4
9     N        5
10    N        5
11    Y        6
12    N        6

Source: [138] ©2009 Association for Computing Machinery, Inc. Reprinted by permission.

Example 5.2
Tables 5.8 and 5.9 show a pair of anatomized tables, where the quasi-identifier (Smoker?) and the sensitive attribute (Disease) are de-associated by the GroupID. Each group has size 2. Let's assume an adversary wants to predict the disease of Rec#11. According to the random worlds model (which is the way the adversary is usually assumed to reason), the probability of Rec#11 having cancer is 50% because group 6 contains two records and one of them has the cancer disease. However, Kifer argues that it is not appropriate to model the adversary using the random worlds model. By observing the anonymized data, an adversary can learn that whenever a group contains a smoker (groups 1, 3, 4, and 6), the group also contains a cancer. Also, a group with no smoker (groups 2 and 5) has no cancer. Given the correlation between smoker and cancer in the tables, the probability that Rec#12 has cancer should be less than 0.5; therefore, Rec#11 is more likely to have cancer. The probability of Rec#12 having cancer and Rec#11 having no disease is approximately 0.16, which is significantly lower than the random worlds estimate.

Kifer calls this class of attacks the deFinetti attack because the attacking algorithm is based on de Finetti's representation theorem to predict the sensitive attribute of a victim.

5.5.3 Corruption Attack

Many previously discussed privacy models that thwart record linkages and attribute linkages in Chapters 5.1 and 5.2 assume that the background knowledge of an adversary is limited to the QID attributes, and that the adversary uses such background knowledge to link the sensitive attribute of a target victim.


Table 5.9: deFinetti attack: sensitive table (ST)

GroupID  Disease
1        Cancer
1        Flu
2        Flu
2        No Disease
3        Cancer
3        No Disease
4        Cancer
4        No Disease
5        Flu
5        No Disease
6        Cancer
6        No Disease

Source: [138] ©2009 Association for Computing Machinery, Inc. Reprinted by permission.

Table 5.10: Corruption attack: 2-diverse table

Job       Sex     Disease
Engineer  Male    HIV
Engineer  Male    Flu
Lawyer    Female  Diabetes
Lawyer    Female  Flu

What if an adversary has additional background knowledge? For example, an adversary may acquire additional knowledge by colluding with a record owner (a patient) or corrupting a data holder (an employee of a hospital). By using the sensitive values of some records, an adversary may be able to compromise the privacy of other record owners in the anonymous data.

Example 5.3
Suppose an adversary knows that Bob is a male engineer appearing in Table 5.10; therefore, Bob has either HIV or Flu. We further suppose that the adversary "colludes" with another patient, John, who is also a male engineer appearing in the table, and John discloses to the adversary that he has Flu. Due to this additional knowledge, the adversary is now certain that Bob has HIV. Note that John has not done anything wrong because he has just disclosed his own disease.

To guarantee privacy against this kind of additional adversarial knowledge, Tao et al. [218] propose an anonymization algorithm that integrates generalization with perturbation and stratified sampling. The proposed perturbed generalization algorithm provides a strong privacy guarantee even if the adversary has corrupted an arbitrary number of record owners in the table. In general, the randomization-based approaches are less vulnerable to the above types of attacks, due to their random nature, than the deterministic partition-based approaches.


Part II

Anonymization for Data Mining


Chapter 6

Anonymization for Classification Analysis: A Case Study on the Red Cross

6.1 Introduction

This chapter uses the Red Cross Blood Transfusion Service (BTS) as a case study to illustrate how the components studied in previous chapters work together to solve a real-life privacy-preserving data publishing problem. Chapter 2 studies the privacy threats caused by data publishing and the privacy models that thwart the privacy threats. Chapter 3 discusses some commonly employed anonymization operations for achieving the privacy models. Chapter 4 studies various types of information metrics that can guide the anonymization operations for preserving the required information depending on the purpose of the data publishing. Chapter 5 provides an overview of different anonymization algorithms for achieving the privacy requirements and preserving the information utility.

Gaining access to high-quality health data is a vital requirement to informeddecision making for medical practitioners and pharmaceutical researchers.Driven by mutual benefits and regulations, there is a demand for healthcareinstitutes to share patient data with various parties for research purposes.However, health data in its raw form often contains sensitive informationabout individuals, and publishing such data will violate their privacy.

Health Level 7 (HL7) is a not-for-profit organization involved in develop-ment of international healthcare standards. Based on an extended data schemafrom the HL7 framework, this chapter exploits a real-life information sharingscenario in the Hong Kong Red Cross BTS to bring out the challenges ofpreserving both individual privacy and data mining quality in the context ofhealthcare information systems. This chapter employs the information systemof Red Cross BTS as a use case to motivate the problem and to illustrate theessential steps of the anonymization methods for classification analysis. Thenext chapter presents a unified privacy-preserving data publishing frameworkfor cluster analysis.

Figure 6.1 illustrates an overview of different parties in the Red Cross BTS system. After collecting and examining the blood collected from donors, the BTS distributes the blood to different public hospitals. The hospitals collect and maintain the health records of their patients and transfuse the blood to the patients if necessary. The blood transfusion information, such as the patient data, type of surgery, names of medical practitioners in charge, and reason for transfusion, is clearly documented and is stored in the database owned by each individual hospital.

[FIGURE 6.1: Overview of the Red Cross BTS ([171] ©2009 Association for Computing Machinery, Inc. Reprinted by permission.)]

The public hospitals are required to submit the blood usage data, together with the patient-specific surgery data, to the Government Health Agency. Periodically, the Government Health Agency submits the data to the BTS for the purpose of data analysis and auditing. The objectives of the data mining and auditing procedures are to improve the estimation of future blood consumption in different hospitals and to make recommendations on blood usage in future medical cases. In the final step, the BTS submits a report to the Government Health Agency. Under the privacy regulations, such reports must keep patients' privacy protected while preserving useful patterns and structures. Figure 6.2 provides a more detailed illustration of the information flow among the hospitals, the Government Health Agency, and the Red Cross BTS.

The focus of this chapter is to study a data anonymizer in the privacy-preserving healthcare data publishing service so that both information sharing and privacy protection requirements can be satisfied. This BTS case illustrates a typical dilemma between information sharing and privacy protection faced by many health institutes. For example, licensed hospitals in California are also required to submit specific demographic data on every discharged patient [43]. The solution studied in this chapter, designed for the BTS case, will also benefit other health institutes that face similar challenges in information sharing. We summarize the concerns and challenges of the BTS case as follows.

[FIGURE 6.2: Data flow in the Red Cross BTS]

Privacy concern: Giving the Red Cross BTS access to blood transfusion data for data analysis is clearly legitimate. However, it raises some concerns about patients' privacy. The patients are willing to submit their data to a hospital because they consider the hospital a trustworthy entity. Yet, the trust in the hospital may not necessarily be transitive to a third party. Many agencies and institutes consider the released data privacy-preserved if explicit identifying information, such as name, social security number, address, and telephone number, is removed. However, as discussed at the beginning of this book, substantial research has shown that simply removing explicit identifying information is insufficient for privacy protection. Sweeney [216] shows that an individual can be re-identified by simply matching other attributes, called quasi-identifiers (QID), such as gender, date of birth, and postal code. Below, we illustrate the privacy threats with a simplified Red Cross example.

Example 6.1
Consider the raw patient data in Table 6.1, where each record represents a surgery case with the patient-specific information. Job, Sex, and Age are quasi-identifying attributes. The hospital wants to release Table 6.1 to the Red Cross for the purpose of classification analysis on the class attribute, Transfuse, which has two values, Y and N, indicating whether or not the patient has received a blood transfusion. Without loss of generality, we assume that the only sensitive value in Surgery is Transgender. The Government Health Agency and hospitals express concern about the privacy threats caused by record linkages and attribute linkages discussed in Chapter 2.1 and Chapter 2.2, respectively. For instance, the hospital wants to avoid record linkage to record #3 via qid = 〈Mover, 34〉 and attribute linkage 〈M, 34〉 → Transgender.

Table 6.1: Raw patient data

       Quasi-identifier (QID)          Class       Sensitive
  ID   Job          Sex   Age          Transfuse   Surgery
  1    Janitor      M     34           Y           Transgender
  2    Doctor       M     58           N           Plastic
  3    Mover        M     34           Y           Transgender
  4    Lawyer       M     24           N           Vascular
  5    Mover        M     58           N           Urology
  6    Janitor      M     44           Y           Plastic
  7    Doctor       M     24           N           Urology
  8    Lawyer       F     58           N           Plastic
  9    Doctor       F     44           N           Vascular
  10   Carpenter    F     63           Y           Vascular
  11   Technician   F     63           Y           Plastic

Table 6.2: Anonymous data (L = 2, K = 2, C = 50%)

       Quasi-identifier (QID)              Class       Sensitive
  ID   Job             Sex   Age           Transfuse   Surgery
  1    Non-Technical   M     [30-60)       Y           Transgender
  2    Professional    M     [30-60)       N           Plastic
  3    Non-Technical   M     [30-60)       Y           Transgender
  4    Professional    M     [1-30)        N           Vascular
  5    Non-Technical   M     [30-60)       N           Urology
  6    Non-Technical   M     [30-60)       Y           Plastic
  7    Professional    M     [1-30)        N           Urology
  8    Professional    F     [30-60)       N           Plastic
  9    Professional    F     [30-60)       N           Vascular
  10   Technical       F     [60-99)       Y           Vascular
  11   Technical       F     [60-99)       Y           Plastic

Patient data is usually high-dimensional, i.e., it contains many attributes. Applying traditional privacy models, such as k-anonymity, ℓ-diversity, and confidence bounding, suffers from the curse of high dimensionality and often results in useless data. To overcome this problem, Mohammed et al. [171] employ the LKC-privacy model discussed in Chapter 2.2.6 in this Red Cross BTS problem. Recall that the general intuition of LKC-privacy is to ensure that every combination of values in QIDj ⊆ QID with maximum length L in the data table T is shared by at least K records, and the confidence of inferring any sensitive values in S is not greater than C, where L, K, C are thresholds and S is a set of sensitive values specified by the data holder (the hospital). LKC-privacy bounds the probability of a successful record linkage to be ≤ 1/K and the probability of a successful attribute linkage to be ≤ C, provided that the adversary's prior knowledge does not exceed L values.

[FIGURE 6.3: Taxonomy trees and QIDs (taxonomy trees for Job, Age, and Sex; QID1 = {Job, Sex}, QID2 = {Job, Age}, QID3 = {Sex, Age})]

Example 6.2
Table 6.2 shows an example of an anonymous table that satisfies (2, 2, 50%)-privacy, where L = 2, K = 2, and C = 50%, by generalizing all the values from Table 6.1 according to the taxonomies in Figure 6.3 (ignore the dashed curve for now). Every possible value of QIDj with maximum length 2 in Table 6.2 (namely, QID1, QID2, and QID3 in Figure 6.3) is shared by at least 2 records, and the confidence of inferring the sensitive value Transgender is not greater than 50%. In contrast, enforcing traditional 2-anonymity will require further generalization. For example, in order to make 〈Professional, M, [30-60)〉 satisfy traditional 2-anonymity, we need to further generalize [1-30) and [30-60) to [1-60), resulting in much higher utility loss.
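To make the LKC-privacy condition concrete, the following is a minimal Python sketch of a brute-force checker (an illustration, not the authors' implementation; the record layout and the helper name check_lkc_privacy are assumptions). It enumerates every combination of at most L QID attributes, groups the records on those attributes, and verifies the group-size threshold K and the confidence threshold C; run on the records of Table 6.2, it confirms the (2, 2, 50%) requirement.

from itertools import combinations
from collections import Counter, defaultdict

def check_lkc_privacy(records, qid_attrs, sensitive_attr, sensitive_vals, L, K, C):
    """Brute-force LKC-privacy check: every qid of length <= L must satisfy the K and C thresholds."""
    for size in range(1, L + 1):
        for qid_j in combinations(qid_attrs, size):
            groups = defaultdict(list)
            for rec in records:
                groups[tuple(rec[a] for a in qid_j)].append(rec[sensitive_attr])
            for svals in groups.values():
                if len(svals) < K:                    # record linkage: require |T[qid]| >= K
                    return False
                counts = Counter(svals)
                if any(counts[s] / len(svals) > C for s in sensitive_vals):
                    return False                      # attribute linkage: require P(s|qid) <= C
    return True

table_6_2 = [
    {"Job": "Non-Technical", "Sex": "M", "Age": "[30-60)", "Surgery": "Transgender"},
    {"Job": "Professional",  "Sex": "M", "Age": "[30-60)", "Surgery": "Plastic"},
    {"Job": "Non-Technical", "Sex": "M", "Age": "[30-60)", "Surgery": "Transgender"},
    {"Job": "Professional",  "Sex": "M", "Age": "[1-30)",  "Surgery": "Vascular"},
    {"Job": "Non-Technical", "Sex": "M", "Age": "[30-60)", "Surgery": "Urology"},
    {"Job": "Non-Technical", "Sex": "M", "Age": "[30-60)", "Surgery": "Plastic"},
    {"Job": "Professional",  "Sex": "M", "Age": "[1-30)",  "Surgery": "Urology"},
    {"Job": "Professional",  "Sex": "F", "Age": "[30-60)", "Surgery": "Plastic"},
    {"Job": "Professional",  "Sex": "F", "Age": "[30-60)", "Surgery": "Vascular"},
    {"Job": "Technical",     "Sex": "F", "Age": "[60-99)", "Surgery": "Vascular"},
    {"Job": "Technical",     "Sex": "F", "Age": "[60-99)", "Surgery": "Plastic"},
]
print(check_lkc_privacy(table_6_2, ["Job", "Sex", "Age"], "Surgery",
                        {"Transgender"}, L=2, K=2, C=0.5))   # True: Table 6.2 is (2,2,50%)-private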

Information needs: The Red Cross BTS wants to perform two types of data analysis on the blood transfusion data collected from the hospitals. First, it wants to employ the surgery information as training data for building a classification model on blood transfusion. Second, it wants to obtain some general count statistics.

One frequently raised question is: to avoid the privacy concern, why does the hospital not simply release a classifier or some statistical information to the Red Cross? The Red Cross wants to have access to the blood transfusion data, not only the statistics, from the hospitals for several reasons. First, the practitioners in the hospitals and the Government Health Agency have neither the expertise nor the interest in doing the data mining. They simply want to share the patient data with the Red Cross, who needs the health data for legitimate reasons. Second, having access to the data, the Red Cross has much better flexibility to perform the required data analysis. It is impractical to continuously request practitioners in a hospital to produce different types of statistical information and fine-tune the data mining results for research purposes.

The rest of the chapter is organized as follows.

• In Chapter 6.2, we present the anonymization problem for classification analysis with the privacy and information requirements.

• In Chapter 6.3, we study an efficient anonymization algorithm, called High-Dimensional Top-Down Specialization (HDTDS) [171], for achieving LKC-privacy with two different information needs. The first information need maximizes the information preserved for classification analysis; the second information need minimizes the distortion on the anonymous data for general data analysis. Minimizing distortion is useful when the particular information requirement is unknown during information sharing or the shared data is used for various kinds of data mining tasks.

• In Chapters 6.4, 6.5, and 6.6, we study three methods that also address the anonymization problem for classification analysis, but with the privacy model of k-anonymity instead of LKC-privacy. Yet, they can be easily modified to achieve LKC-privacy.

• In Chapter 6.7, we present an evaluation methodology for measuring the data quality of the anonymous data with respect to the information needs. Experiments on real-life data suggest that the HDTDS algorithm is flexible and scalable enough to handle large volumes of data that include both categorical and numerical attributes. Scalability is a basic requirement specified by the Red Cross due to the large volume of records.

6.2 Anonymization Problems for Red Cross BTS

We first describe the privacy and information requirements of the Red Cross Blood Transfusion Service (BTS), followed by a problem statement.

6.2.1 Privacy Model

Suppose a health agency wants to publish a health data table

    T(ID, D1, . . . , Dm, Class, Sens)   (e.g., Table 6.1)

to some recipient (e.g., the Red Cross) for data analysis. The ID number is an explicit identifier and it should be removed before publication. Each Di is either a categorical or a numerical attribute. Sens is a sensitive attribute. A record has the form 〈v1, . . . , vm, cls, s〉, where vi is a domain value of Di, cls is a class value of Class, and s is a sensitive value of Sens. The data holder wants to protect against linking an individual to a record or some sensitive value in T through some subset of attributes called a quasi-identifier or QID, where QID ⊆ {D1, . . . , Dm}.

One recipient, who is an adversary, seeks to identify the record or sensitive values of some target victim patient V in T. We assume that the adversary knows at most L values of QID attributes of the victim patient. We use qid to denote such prior known values, where |qid| ≤ L. Based on the prior knowledge qid, the adversary could identify a group of records, denoted by T[qid], that contains qid. |T[qid]| denotes the number of records in T[qid]. Suppose qid = 〈Janitor, M〉. Then T[qid] = {ID#1, 6} and |T[qid]| = 2 in Table 6.1. The adversary could launch two types of privacy attacks, namely record linkages and attribute linkages, based on such prior knowledge.

To thwart the record and attribute linkages on any patient in the table T, every qid with a maximum length L in the anonymous table is required to be shared by at least a certain number of records, and the ratio of sensitive value(s) in any group cannot be too high. The LKC-privacy model discussed in Chapter 2.2.6 reflects this intuition.

6.2.2 Information Metrics

The measure of data utility varies depending on the data analysis task to be performed on the published data. Based on the information requirements specified by the Red Cross, two information metrics are defined. The first information metric is to preserve the maximal information for classification analysis. The second information metric is to minimize the overall data distortion when the data analysis task is unknown.

In this Red Cross project, a top-down specialization algorithm called HDTDS is proposed to achieve LKC-privacy on high-dimensional data by subtree generalization (Chapter 3.1). The general idea is to anonymize a table by a sequence of specializations starting from the topmost general state, in which each attribute has the topmost value of its taxonomy tree [95, 96]. We assume that a taxonomy tree is specified for each categorical attribute in QID. A leaf node represents a domain value and a parent node represents a less specific value. For a numerical attribute in QID, a taxonomy tree can be grown at runtime, where each node represents an interval, and each non-leaf node has two child nodes representing some optimal binary split of the parent interval. Figure 6.3 shows a dynamically grown taxonomy tree for Age.

Suppose a domain value d has been generalized to a value v in a record. A specialization on v, written v → child(v), where child(v) denotes the set of child values of v, replaces the parent value v with the child value that generalizes the domain value d. A specialization is valid if the specialization results in a table satisfying the LKC-privacy requirement after the specialization. A specialization is performed only if it is valid. The specialization process can be viewed as pushing the "cut" of each taxonomy tree downwards. A cut of the taxonomy tree for an attribute Di, denoted by Cuti, contains exactly one value on each root-to-leaf path. Figure 6.3 shows a solution cut, indicated by the dashed curve, representing the anonymous Table 6.2. The specialization starts from the topmost cut and pushes down the cut iteratively by specializing some value in the current cut until violating the LKC-privacy requirement. In other words, the specialization process pushes the cut downwards until no valid specialization is possible. Each specialization tends to increase data utility and decrease privacy because records become more distinguishable by specific values. Two information metrics are defined, depending on the information requirement, to evaluate the "goodness" of a specialization.

[FIGURE 6.4: Taxonomy trees for Tables 6.3-6.7 (taxonomy trees for Education, Work_Hrs, and Sex)]

Case 1: Score for Classification Analysis

For the requirement of classification analysis, information gain, denoted by InfoGain(v), can be used to measure the goodness of a specialization on v. One possible information metric, Score(v), is to favor the specialization v → child(v) that has the maximum InfoGain(v) [210]:

    Score(v) = InfoGain(v) = E(T[v]) − Σ_c (|T[c]| / |T[v]|) × E(T[c]),   (6.1)

where E(T[x]) is the entropy of T[x] and c ranges over child(v). Refer to Equation 4.8 in Chapter 4.3 for more details on InfoGain(v).

Equation 6.1 uses InfoGain alone, that is, it maximizes the information gain produced by a specialization without considering the loss of anonymity. This Score function may pick a candidate that has a large reduction in anonymity, which may lead to a quick violation of the anonymity requirement, thereby prohibiting refining the data to a lower granularity. A better information metric is to maximize the information gain per unit of anonymity loss. The next example compares the two Score functions and illustrates that using InfoGain alone may lead to a quick violation.


Table 6.3: Compressed raw patient table

  Education   Sex   Work_Hrs   Class    # of Recs.
  10th        M     40         20Y0N    20
  10th        M     30         0Y4N     4
  9th         M     30         0Y2N     2
  9th         F     30         0Y4N     4
  9th         F     40         0Y6N     6
  8th         F     30         0Y2N     2
  8th         F     40         0Y2N     2
                               Total: 20Y20N   40

Table 6.4: Statistics for the most masked table

  Candidate   InfoGain   AnonyLoss      Score
  ANY_Edu     0.6100     40 − 4 = 36    0.6100/(36 + 1) = 0.0165
  ANY_Sex     0.4934     40 − 14 = 26   0.4934/(26 + 1) = 0.0183
  [1-99)      0.3958     40 − 12 = 28   0.3958/(28 + 1) = 0.0136

Example 6.3
Consider Table 6.3, the taxonomy trees in Figure 6.4, the 4-anonymity requirement with QID = {Education, Sex, Work_Hrs}, the most masked table containing one row 〈ANY_Edu, ANY_Sex, [1-99)〉 with the class frequency 20Y20N, and three candidate specializations:

    ANY_Edu → {8th, 9th, 10th},
    ANY_Sex → {M, F}, and
    [1-99) → {[1-40), [40-99)}.

The class frequencies after each candidate specialization are:

    Education: 0Y4N (8th), 0Y12N (9th), 20Y4N (10th)
    Sex:       20Y6N (M), 0Y14N (F)
    Work_Hrs:  0Y12N ([1-40)), 20Y8N ([40-99))

Table 6.4 shows the calculated InfoGain, AnonyLoss, and Score of the three candidate refinements. The following shows the calculations of InfoGain(ANY_Edu), InfoGain(ANY_Sex), and InfoGain([1-99)):

    E(T[ANY_Edu]) = −(20/40) log2(20/40) − (20/40) log2(20/40) = 1
    E(T[8th])  = −(0/4) log2(0/4) − (4/4) log2(4/4) = 0
    E(T[9th])  = −(0/12) log2(0/12) − (12/12) log2(12/12) = 0
    E(T[10th]) = −(20/24) log2(20/24) − (4/24) log2(4/24) = 0.6500
    InfoGain(ANY_Edu) = E(T[ANY_Edu]) − ((4/40) × E(T[8th]) + (12/40) × E(T[9th])
                        + (24/40) × E(T[10th])) = 0.6100

    E(T[ANY_Sex]) = −(20/40) log2(20/40) − (20/40) log2(20/40) = 1
    E(T[M]) = −(20/26) log2(20/26) − (6/26) log2(6/26) = 0.7793
    E(T[F]) = −(0/14) log2(0/14) − (14/14) log2(14/14) = 0
    InfoGain(ANY_Sex) = E(T[ANY_Sex]) − ((26/40) × E(T[M]) + (14/40) × E(T[F])) = 0.4934

    E(T[[1-99)])  = −(20/40) log2(20/40) − (20/40) log2(20/40) = 1
    E(T[[1-40)])  = −(0/12) log2(0/12) − (12/12) log2(12/12) = 0
    E(T[[40-99)]) = −(20/28) log2(20/28) − (8/28) log2(8/28) = 0.8631
    InfoGain([1-99)) = E(T[[1-99)]) − ((12/40) × E(T[[1-40)]) + (28/40) × E(T[[40-99)])) = 0.3958

Table 6.5: Final masked table by InfoGain

  Education   Sex       Work_Hrs   Class    # of Recs.
  10th        ANY_Sex   [1-99)     20Y4N    24
  9th         ANY_Sex   [1-99)     0Y12N    12
  8th         ANY_Sex   [1-99)     0Y4N     4

Table 6.6: Intermediate masked table by Score

  Education   Sex   Work_Hrs   Class    # of Recs.
  ANY_Edu     M     [1-99)     20Y6N    26
  ANY_Edu     F     [1-99)     0Y14N    14

According to the InfoGain criterion, ANY_Edu will be specialized first because it has the highest InfoGain. The result is shown in Table 6.5, with A(QID) = 4. After that, there is no further valid specialization because specializing either ANY_Sex or [1-99) would result in a violation of 4-anonymity. Note that the first 24 records in the table fail to separate the 4N from the other 20Y.

In contrast, according to the Score criterion, ANY_Sex will be refined first. Below, we show the calculation of AnonyLoss for the three candidate refinements:

    AnonyLoss(ANY_Edu) = A(QID) − A_ANY_Edu(QID)
                       = a(〈ANY_Edu, ANY_Sex, [1-99)〉) − a(〈8th, ANY_Sex, [1-99)〉)
                       = 40 − 4 = 36,

    AnonyLoss(ANY_Sex) = A(QID) − A_ANY_Sex(QID)
                       = a(〈ANY_Edu, ANY_Sex, [1-99)〉) − a(〈ANY_Edu, F, [1-99)〉)
                       = 40 − 14 = 26, and

    AnonyLoss([1-99))  = A(QID) − A_[1-99)(QID)
                       = a(〈ANY_Edu, ANY_Sex, [1-99)〉) − a(〈ANY_Edu, ANY_Sex, [1-40)〉)
                       = 40 − 12 = 28.

Table 6.7: Final masked table by Score

  Education   Sex   Work_Hrs   Class    # of Recs.
  ANY_Edu     M     [40-99)    20Y0N    20
  ANY_Edu     M     [1-40)     0Y6N     6
  ANY_Edu     F     [40-99)    0Y8N     8
  ANY_Edu     F     [1-40)     0Y6N     6

Table 6.4 shows the calculation of Score using InfoGain and AnonyLoss. The result is shown in Table 6.6, with A(QID) = 14. Subsequently, further refinement on ANY_Edu is invalid because it would result in a(〈9th, M, [1-99)〉) = 2 < k, but the refinement on [1-99) is valid because it results in A(QID) = 6 ≥ k. The final masked table is shown in Table 6.7, where the information for separating the two classes is preserved. Thus, by considering the information/anonymity trade-off, the Score criterion produces a more desirable sequence of refinements for classification.
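The figures in Table 6.4 can be reproduced with a few lines of Python. The sketch below is illustrative only (the compressed-row representation and the helper names are assumptions, not the book's code): it computes the entropy-based InfoGain of each candidate specialization from the class frequencies in Table 6.3, derives AnonyLoss from the resulting group sizes, and combines them as Score = InfoGain / (AnonyLoss + 1).

from math import log2

# Compressed rows of Table 6.3: (Education, Sex, Work_Hrs, #Y, #N)
rows = [("10th", "M", 40, 20, 0), ("10th", "M", 30, 0, 4), ("9th", "M", 30, 0, 2),
        ("9th", "F", 30, 0, 4), ("9th", "F", 40, 0, 6), ("8th", "F", 30, 0, 2),
        ("8th", "F", 40, 0, 2)]

def entropy(y, n):
    total = y + n
    return sum(-c / total * log2(c / total) for c in (y, n) if c > 0)

def info_gain(partition):
    """partition maps each child value to its (#Y, #N) class counts."""
    tot_y = sum(y for y, _ in partition.values())
    tot_n = sum(n for _, n in partition.values())
    total = tot_y + tot_n
    weighted = sum((y + n) / total * entropy(y, n) for y, n in partition.values())
    return entropy(tot_y, tot_n) - weighted

def split_by(key):
    part = {}
    for edu, sex, hrs, y, n in rows:
        k = key(edu, sex, hrs)
        py, pn = part.get(k, (0, 0))
        part[k] = (py + y, pn + n)
    return part

candidates = {
    "ANY_Edu": split_by(lambda e, s, h: e),
    "ANY_Sex": split_by(lambda e, s, h: s),
    "[1-99)":  split_by(lambda e, s, h: "[1-40)" if h < 40 else "[40-99)"),
}
A_before = 40                      # the most masked table is one group of 40 records
for name, part in candidates.items():
    ig = info_gain(part)
    anony_loss = A_before - min(y + n for y, n in part.values())
    print(name, round(ig, 4), anony_loss, round(ig / (anony_loss + 1), 4))
# Prints the InfoGain/AnonyLoss/Score rows of Table 6.4: 0.61/36/0.0165, 0.4934/26/0.0183, 0.3958/28/0.0136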

An alternative information metric is to heuristically maximize the information gain, denoted by InfoGain(v), for the classification goal and minimize the anonymity loss, denoted by AnonyLoss(v), for the privacy goal. A value v is a good candidate for specialization if InfoGain(v) is large and AnonyLoss(v) is small. This alternative information metric, Score(v), chooses the candidate v for the next specialization that has the maximum information-gain/anonymity-loss trade-off, defined as

    Score(v) = IGPL(v),   (6.2)

where IGPL(v) is defined in Equation 4.7 in Chapter 4.3. Each choice of InfoGain(v) and AnonyLoss(v) gives a trade-off between classification and anonymization.

AnonyLoss(v): This is the average loss of anonymity caused by specializing v over all QIDj that contain the attribute of v:

    AnonyLoss(v) = avg{A(QIDj) − A_v(QIDj)},   (6.3)

where A(QIDj) and A_v(QIDj) represent the anonymity before and after specializing v. Note that AnonyLoss(v) does not depend only on the attribute of v; it depends on all QIDj that contain the attribute of v. Hence, avg{A(QIDj) − A_v(QIDj)} is the average loss over all QIDj that contain the attribute of v.

For a numerical attribute, no prior taxonomy tree is given, and the taxonomy tree has to be grown dynamically in the process of specialization. The specialization of an interval refers to the optimal binary split that maximizes information gain on the Class attribute. Initially, the interval that covers the full range of the attribute forms the root. The specialization on an interval v, written v → child(v), refers to the optimal split of v into two child intervals child(v) that maximizes the information gain. The anonymity is not used for finding a split good for classification. This is similar to defining a taxonomy tree, where the main consideration is how the taxonomy best describes the application. Due to this extra step of identifying the optimal split of the parent interval, numerical attributes have to be treated differently from categorical attributes with taxonomy trees. The following example illustrates how to find an optimal split on a numerical interval.

Table 6.8: Compressed raw patient table for illustrating optimal split on numerical attribute

  Education   Sex   Work_Hrs   Class    # of Recs.
  9th         M     30         0Y3N     3
  10th        M     32         0Y4N     4
  11th        M     35         2Y3N     5
  12th        F     37         3Y1N     4
  Bachelors   F     42         4Y2N     6
  Bachelors   F     44         4Y0N     4
  Masters     M     44         4Y0N     4
  Masters     F     44         3Y0N     3
  Doctorate   F     44         1Y0N     1
                               Total: 21Y13N   34

[FIGURE 6.5: Taxonomy tree for Work_Hrs (dynamically grown: [1-99) → {[1-37), [37-99)}; [1-37) → {[1-35), [35-37)})]

Example 6.4
Consider the data in Table 6.8. The table has 34 records in total. Each row represents one or more records, with the Class column containing the class frequency of the records represented: Y for "income >50K" and N for "income ≤50K." For example, the third row represents 5 records having Education = 11th, Sex = Male, and Work_Hrs = 35. The value 2Y3N in the Class column conveys that 2 records have the class Y and 3 records have the class N. Semantically, this compressed table is equivalent to the table containing 34 rows, with each row representing one record.

For the numerical attribute Work_Hrs, the topmost value is the full-range interval of domain values, [1-99). To determine the split point of [1-99), we evaluate the information gain at each of the five possible split points induced by the six distinct values 30, 32, 35, 37, 42, and 44. The following is the calculation for the split point at 37:


    InfoGain(37) = E(T[[1-99)]) − ((12/34) × E(T[[1-37)]) + (22/34) × E(T[[37-99)]))
                 = 0.9597 − ((12/34) × 0.6500 + (22/34) × 0.5746) = 0.3584.

As InfoGain(37) is the highest, we grow the taxonomy tree for Work_Hrs by adding two child intervals, [1-37) and [37-99), under the interval [1-99). The taxonomy tree is illustrated in Figure 6.5.
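The optimal split in Example 6.4 can be found mechanically by scanning every candidate split point and keeping the one with the highest information gain. The following Python sketch is an illustration under assumed names (not the book's implementation); on the class counts of Table 6.8 it reports 37 as the winning split with InfoGain ≈ 0.3584.

from math import log2

# (Work_Hrs, #Y, #N) class counts from the compressed rows of Table 6.8
rows = [(30, 0, 3), (32, 0, 4), (35, 2, 3), (37, 3, 1),
        (42, 4, 2), (44, 4, 0), (44, 4, 0), (44, 3, 0), (44, 1, 0)]

def entropy(y, n):
    total = y + n
    return sum(-c / total * log2(c / total) for c in (y, n) if c > 0)

def info_gain(split):
    """Information gain of splitting [1-99) into [1, split) and [split, 99)."""
    left = [(y, n) for v, y, n in rows if v < split]
    right = [(y, n) for v, y, n in rows if v >= split]
    total = sum(y + n for _, y, n in rows)
    parent = entropy(sum(y for _, y, n in rows), sum(n for _, y, n in rows))
    children = 0.0
    for side in (left, right):
        if not side:
            continue
        sy, sn = sum(y for y, _ in side), sum(n for _, n in side)
        children += (sy + sn) / total * entropy(sy, sn)
    return parent - children

candidates = sorted({v for v, _, _ in rows})[1:]   # the five split points: 32, 35, 37, 42, 44
best = max(candidates, key=info_gain)
print(best, round(info_gain(best), 4))             # expected: 37 0.3584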

Case 2: Score for General Data Analysis

Sometimes, the data is shared without a specific task. In this case of general data analysis, we use the discernibility cost [213] to measure the data distortion in the anonymous data table. The discernibility cost charges a penalty to each record for being indistinguishable from other records. For each record in an equivalence group qid, the penalty is |T[qid]|. Thus, the penalty on a group is |T[qid]|². To minimize the discernibility cost, we choose the specialization v → child(v) that maximizes the value of

    Score(v) = Σ_{qid_v} |T[qid_v]|²

over all qid_v containing v. The Discernibility Metric (DM) is discussed in Equation 4.4 in Chapter 4.1. Example 6.6 shows the computation of Score(v).
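As a quick illustration of this metric, the following tiny helper (a hypothetical function, not the authors' code) sums the squared sizes of the qid groups produced by a specialization; with the 6 Blue-collar and 5 White-collar records of Example 6.6 it returns 6² + 5² = 61.

def discernibility_score(group_sizes):
    """Score(v) = sum of |T[qid_v]|^2 over the qid groups containing v."""
    return sum(size * size for size in group_sizes)

# Specializing ANY_Job in Table 6.1 yields groups of 6 Blue-collar and 5 White-collar records.
print(discernibility_score([6, 5]))   # 61, as computed in Example 6.6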

6.2.3 Problem Statement

The goal in this Red Cross problem is to transform a given data set T into an anonymous version T′ that satisfies a given LKC-privacy requirement and preserves as much information as possible for the intended data analysis task. Based on the information requirements specified by the Red Cross, we define the problems as follows.

DEFINITION 6.1 Anonymization for data analysis  Given a data table T, an LKC-privacy requirement, and a taxonomy tree for each categorical attribute contained in QID, the anonymization problem for classification analysis is to generalize T on the attributes QID to satisfy the LKC-privacy requirement while preserving as much information as possible for the classification analysis. The anonymization problem for general analysis is to generalize T on the attributes QID to satisfy the LKC-privacy requirement while minimizing the overall discernibility cost.

Computing the optimal LKC-privacy solution is NP-hard. Given a QID, there are (|QID| choose L) combinations of decomposed QIDj with maximum size L. For any value of K and C, each combination of QIDj in LKC-privacy is an instance of the (α, k)-anonymity problem with α = C and k = K. Wong et al. [246] have proven that computing the optimal (α, k)-anonymous solution is NP-hard; therefore, computing the optimal LKC-privacy solution is also NP-hard. Below, we provide a greedy approach to efficiently identify a sub-optimal solution.


6.3 High-Dimensional Top-Down Specialization (HDTDS)

Algorithm 6.3.1 High-Dimensional Top-Down Specialization (HDTDS)
1: Initialize every value in T to the topmost value;
2: Initialize Cuti to include the topmost value;
3: while some candidate v ∈ ∪Cuti is valid do
4:   Find the Best specialization from ∪Cuti;
5:   Perform Best on T and update ∪Cuti;
6:   Update Score(x) and validity for x ∈ ∪Cuti;
7: end while;
8: Output T and ∪Cuti;

Algorithm 6.3.1 provides an overview of the High-Dimensional Top-Down Specialization (HDTDS) algorithm [171] for achieving LKC-privacy.¹ Initially, all values in QID are generalized to the topmost value in their taxonomy trees, and Cuti contains the topmost value for each attribute Di. At each iteration, HDTDS performs the Best specialization, which has the highest Score among the candidates that are valid specializations in ∪Cuti (Line 4). Then, it applies Best to T and updates ∪Cuti (Line 5). Finally, it updates the Score of the affected candidates due to the specialization (Line 6). The algorithm terminates when there are no more valid candidates in ∪Cuti. In other words, the algorithm terminates if any further specialization would lead to a violation of the LKC-privacy requirement. An important property of HDTDS is that LKC-privacy is anti-monotone with respect to a specialization: if a generalized table violates LKC-privacy before a specialization, it remains violated after the specialization because a specialization never increases |T[qid]| and never decreases the maximum P(s|qid). This anti-monotonic property guarantees that the final solution cut is a sub-optimal solution. HDTDS is a modified version of TDS [96].

¹The source code and the executable program are available on the web: http://www.ciise.concordia.ca/∼fung/pub/RedCrossKDD09/
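To show the overall control flow in code, the following is a deliberately simplified, self-contained Python sketch of top-down specialization on a toy version of Table 6.1 (only the Job and Sex attributes). The taxonomy is an assumption loosely based on Figure 6.3, the validity test checks only the K condition on the full QID rather than full LKC-privacy, the Score is the total discernibility of the resulting table, and no TIPS-style indexing is used, so this is not the released HDTDS implementation.

from collections import Counter

# Toy input: the Job and Sex columns of Table 6.1 (record IDs 1-11).
RECORDS = [("Janitor", "M"), ("Doctor", "M"), ("Mover", "M"), ("Lawyer", "M"),
           ("Mover", "M"), ("Janitor", "M"), ("Doctor", "M"), ("Lawyer", "F"),
           ("Doctor", "F"), ("Carpenter", "F"), ("Technician", "F")]
ATTRS = ["Job", "Sex"]

# Assumed toy taxonomy (parent -> children), loosely based on Figure 6.3.
CHILDREN = {
    "Job": {"ANY_Job": ["Blue-collar", "Professional"],
            "Blue-collar": ["Non-Technical", "Technical"],
            "Non-Technical": ["Janitor", "Mover"],
            "Technical": ["Carpenter", "Technician"],
            "Professional": ["Doctor", "Lawyer"]},
    "Sex": {"ANY_Sex": ["M", "F"]},
}
PARENT = {a: {c: p for p, cs in CHILDREN[a].items() for c in cs} for a in ATTRS}
TOP = {"Job": "ANY_Job", "Sex": "ANY_Sex"}

def ancestor_in_cut(value, attr, cut):
    """Generalize a raw value upward until it matches a value on the current cut."""
    while value not in cut[attr]:
        value = PARENT[attr][value]
    return value

def groups_under(cut):
    """Group sizes of the table generalized according to the cut."""
    return Counter(tuple(ancestor_in_cut(v, a, cut) for v, a in zip(rec, ATTRS))
                   for rec in RECORDS)

def specialize(cut, attr, value):
    new_cut = {a: set(vals) for a, vals in cut.items()}
    new_cut[attr].remove(value)
    new_cut[attr].update(CHILDREN[attr][value])
    return new_cut

def top_down_specialization(k=2):
    cut = {a: {TOP[a]} for a in ATTRS}                    # most general state
    while True:
        candidates = []
        for a in ATTRS:
            for v in cut[a]:
                if v not in CHILDREN[a]:
                    continue                              # leaf value, cannot be specialized
                child_cut = specialize(cut, a, v)
                sizes = groups_under(child_cut)
                if min(sizes.values()) < k:
                    continue                              # invalid: violates the K requirement
                score = sum(n * n for n in sizes.values())  # discernibility-style Score
                candidates.append((score, a, v, child_cut))
        if not candidates:
            return cut, groups_under(cut)                 # no valid specialization left
        cut = max(candidates, key=lambda t: t[0])[3]      # perform the Best specialization

print(top_down_specialization(k=2))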

Example 6.5
Consider Table 6.1 with L = 2, K = 2, C = 50%, and QID = {Job, Sex, Age}. The QIDi with maximum size 2 are:

    QID1 = {Job},        QID2 = {Sex},        QID3 = {Age},
    QID4 = {Job, Sex},   QID5 = {Job, Age},   QID6 = {Sex, Age}.

Initially, all data records are generalized to 〈ANY_Job, ANY_Sex, [1-99)〉, and ∪Cuti = {ANY_Job, ANY_Sex, [1-99)}. To find the Best specialization among the candidates in ∪Cuti, we compute Score(ANY_Job), Score(ANY_Sex), and Score([1-99)).

A simple yet inefficient implementation of Lines 4-6 is to scan all data records and recompute Score(x) for all candidates in ∪Cuti. The key to the efficiency of the algorithm is having direct access to the data records to be specialized, and updating Score(x) based on some statistics maintained for candidates in ∪Cuti, instead of scanning all data records. In the rest of Chapter 6.3, we explain a scalable implementation and its data structures in detail.

6.3.1 Find the Best Specialization

Initially, the algorithm computes the Score for all candidates x in ∪Cuti. For each subsequent iteration, the information needed to calculate Score comes from the update of the previous iteration (Line 7). Finding the best specialization Best involves at most |∪Cuti| computations of Score without accessing data records. The procedure for updating Score will be discussed in Chapter 6.3.3.

Example 6.6
Continuing from Example 6.5, we show the computation of Score(ANY_Job) for the specialization

    ANY_Job → {Blue-collar, White-collar}.

For general data analysis, Score(ANY_Job) = 6² + 5² = 61. For classification analysis,

    E(T[ANY_Job])      = −(6/11) log2(6/11) − (5/11) log2(5/11) = 0.994
    E(T[Blue-collar])  = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.6499
    E(T[White-collar]) = −(5/5) log2(5/5) − (0/5) log2(0/5) = 0.0
    InfoGain(ANY_Job)  = E(T[ANY_Job]) − ((6/11) × E(T[Blue-collar])
                         + (5/11) × E(T[White-collar])) = 0.6396

    Score(ANY_Job) = InfoGain(ANY_Job) = 0.6396.


[FIGURE 6.6: The TIPS data structure (root 〈ANY_Job, ANY_Sex, [1-99)〉 with 11 records, specialized into 〈Blue-collar, ANY_Sex, [1-99)〉 (6 records) and 〈White-collar, ANY_Sex, [1-99)〉 (5 records), and further into 〈Blue-collar, ANY_Sex, [1-60)〉 (4), 〈Blue-collar, ANY_Sex, [60-99)〉 (2), and 〈White-collar, ANY_Sex, [1-60)〉 (5); the leaf partitions are threaded by Link_x for each x in ∪Cuti)]

6.3.2 Perform the Best Specialization

Consider a specialization Best → child(Best), where Best ∈ Di and Di ∈ QID. First, we replace Best with child(Best) in ∪Cuti. Then, we need to retrieve T[Best], the set of data records generalized to Best, to tell the child value in child(Best) for individual data records. A data structure, called Taxonomy Indexed PartitionS (TIPS) [96], is employed to facilitate this operation. This data structure is also crucial for updating Score(x) for candidates x. The general idea is to group data records according to their generalized records on QID.

DEFINITION 6.2 TIPS  Taxonomy Indexed PartitionS (TIPS) is a tree structure with each node representing a generalized record over QID, and each child node representing a specialization of the parent node on exactly one attribute. Stored with each leaf node is the set of data records having the same generalized record, called a leaf partition. For each x in ∪Cuti, Px denotes a leaf partition whose generalized record contains x, and Linkx denotes the link of all Px, with the head of Linkx stored with x.

At any time, the generalized data is represented by the leaf partitions of TIPS, but the original data records remain unchanged. Linkx provides direct access to T[x], the set of data records generalized to the value x. Initially, TIPS has only one leaf partition containing all data records, generalized to the topmost value on every attribute in QID. In each iteration, we perform the best specialization Best by refining the leaf partitions on LinkBest.
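A bare-bones rendering of Definition 6.2 in Python might look as follows. The class and field names are assumptions chosen for illustration, the relinking is simplified, and the real implementation additionally stores the count statistics described later in this section, so treat this strictly as a sketch of the idea.

from collections import defaultdict

class TipsNode:
    """One TIPS node: a generalized record over QID; leaves also hold their data records."""
    def __init__(self, generalized_record):
        self.generalized_record = generalized_record   # e.g. ("ANY_Job", "ANY_Sex", "[1-99)")
        self.children = []                             # specializations on exactly one attribute
        self.records = []                              # the leaf partition (empty for internal nodes)

class Tips:
    def __init__(self, top_record, all_records):
        self.root = TipsNode(top_record)
        self.root.records = list(all_records)          # initially one leaf holding every record
        self.links = defaultdict(list)                 # Link_x: value x -> list of leaf partitions P_x
        for x in top_record:
            self.links[x].append(self.root)

    def specialize_leaf(self, leaf, attr_index, child_values, child_of):
        """Split one leaf partition P_Best on attribute attr_index into one child per value in child(Best).
        child_of(record, attr_index) must return the child value that generalizes the record's raw value."""
        for c in child_values:
            rec = list(leaf.generalized_record)
            rec[attr_index] = c
            child = TipsNode(tuple(rec))
            child.records = [r for r in leaf.records if child_of(r, attr_index) == c]
            if not child.records:
                continue                               # an empty P_c is removed (never created here)
            leaf.children.append(child)
            for x in child.generalized_record:         # simplified relinking; the real algorithm also
                self.links[x].append(child)            # drops the stale links of the old leaf P_Best
        leaf.records = []                              # the old leaf is no longer a leaf partition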

Updating TIPS: The algorithm refines each leaf partition PBest found on LinkBest as follows. For each value c in child(Best), a child partition Pc is created under PBest, and the data records in PBest are split among the child partitions: Pc contains a data record in PBest if c generalizes the corresponding domain value in the record. An empty Pc is removed. Linkc is created to link up all Pc's for the same c. Also, Pc is linked to every Linkx to which PBest was previously linked, except for LinkBest. This is the only operation in the whole algorithm that requires accessing data records. The overhead of maintaining Linkx is small. For each attribute in ∪QIDj and each leaf partition on LinkBest, there are at most |child(Best)| "relinkings," or at most |∪QIDj| × |LinkBest| × |child(Best)| "relinkings" in total for applying Best.

[FIGURE 6.7: A complete TIPS data structure]

Example 6.7
Initially, TIPS has only one leaf partition containing all data records and representing the generalized record 〈ANY_Job, ANY_Sex, [1-99)〉. Let the best specialization be ANY_Job → {White-collar, Blue-collar} on Job. Two child partitions,

    〈White-collar, ANY_Sex, [1-99)〉
    〈Blue-collar, ANY_Sex, [1-99)〉,

are created under the root partition as in Figure 6.6, and the data records are split between them. Both child partitions are on LinkANY_Sex and Link[1-99). ∪Cuti is updated into {White-collar, Blue-collar, ANY_Sex, [1-99)}. Suppose that the next best specialization is [1-99) → {[1-60), [60-99)}, which specializes the two leaf partitions on Link[1-99), resulting in three child partitions

    〈White-collar, ANY_Sex, [1-60)〉
    〈Blue-collar, ANY_Sex, [1-60)〉
    〈Blue-collar, ANY_Sex, [60-99)〉

as shown in Figure 6.6.

Figure 6.7 shows a complete TIPS data structure with the corresponding

    ∪Cuti = {Blue-collar, White-collar, ANY_Sex, [1-60), [60-99)},

where LinkBlue-collar links up 〈Blue-collar, ANY_Sex, [1-60)〉 and 〈Blue-collar, ANY_Sex, [60-99)〉; LinkWhite-collar links up 〈White-collar, ANY_Sex, [1-60)〉; LinkANY_Sex links up 〈White-collar, ANY_Sex, [1-60)〉, 〈Blue-collar, ANY_Sex, [1-60)〉, and 〈Blue-collar, ANY_Sex, [60-99)〉; Link[1-60) links up 〈White-collar, ANY_Sex, [1-60)〉 and 〈Blue-collar, ANY_Sex, [1-60)〉; and Link[60-99) links up 〈Blue-collar, ANY_Sex, [60-99)〉.

A scalable feature of the algorithm is maintaining some statistical information for each candidate x in ∪Cuti for updating Score(x) without accessing data records. For each new value c in child(Best) added to ∪Cuti in the current iteration, we collect the following count statistics of c while scanning the data records in PBest for updating TIPS: (1) |T[c]|, |T[d]|, freq(T[c], cls), and freq(T[d], cls), where d ∈ child(c) and cls is a class label; and (2) |Pd|, where Pd is a child partition under Pc as if c were specialized, kept together with the leaf node for Pc. This information will be used in Chapter 6.3.3.

TIPS has several useful properties. First, all data records in the same leaf partition have the same generalized record, although they may have different raw values. Second, every data record appears in exactly one leaf partition. Third, each leaf partition Px has exactly one generalized qid on QID and contributes the count |Px| towards |T[qid]|. Later, the algorithm uses the last property to extract |T[qid]| from TIPS.


6.3.3 Update Score and Validity

This step updates Score(x) and the validity for candidates x in ∪Cuti to reflect the impact of the Best specialization. The key to the scalability of the algorithm is updating Score(x) using the count statistics maintained in Chapter 6.3.2 without accessing raw records again.

6.3.3.1 Updating Score

The procedure for updating Score is different depending on the information requirement.

Case 1 (classification analysis): An observation is that InfoGain(x) is not affected by Best → child(Best), except that we need to compute InfoGain(c) for each newly added value c in child(Best). InfoGain(c) can be computed from the count statistics for c collected in Chapter 6.3.2.

Case 2 (general data analysis): Each leaf partition Pc keeps the count |T[qidc]|. By following Linkc from TIPS, we can compute Σ_{qidc} |T[qidc]|² over all the qidc on Linkc.

6.3.3.2 Validity Check

A specialization Best → child(Best) may change the validity status of other candidates x ∈ ∪Cuti if Best and x are contained in the same qid with size not greater than L. Thus, in order to check the validity, we need to keep track of the count of every qid with |qid| = L. Note that we can ignore qid with size less than L because if a table satisfies LKC-privacy, then it must also satisfy L′KC-privacy where L′ < L.

We explain an efficient method for checking the validity of a candidate. First, given a QID in T, we identify all QIDj ⊆ QID with size L. Then, for each QIDj, we use a data structure, called QIDTreej, to index all qidj on QIDj. QIDTreej is a tree, where each level represents one attribute in QIDj. Each root-to-leaf path represents an existing qidj on QIDj in the generalized data, with |T[qidj]| and |T[qidj ∧ s]| for every s ∈ S stored at the leaf node. A candidate x ∈ ∪Cuti is valid if, for every c ∈ child(x), every qidj containing c has |T[qidj]| ≥ K and P(s|qidj) ≤ C for any s ∈ S. If x is invalid, it is removed from ∪Cuti.
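Because each QIDTreej only needs the group size and the per-sensitive-value counts at its leaves, a keyed dictionary is enough to sketch it. The Python fragment below is an illustrative approximation (the class layout and names are assumptions, not the book's data structure): it indexes every qidj of one fixed QIDj and answers the K and C checks used above.

from collections import defaultdict

class QIDTree:
    """Index of all qid_j values on one QID_j of size L, with the counts needed for validity checks."""
    def __init__(self, qid_j, sensitive_attr, sensitive_vals):
        self.qid_j = qid_j
        self.sensitive_attr = sensitive_attr
        self.sensitive_vals = sensitive_vals
        self.size = defaultdict(int)                 # |T[qid_j]|
        self.sens_count = defaultdict(int)           # |T[qid_j ∧ s]|

    def add(self, record):
        key = tuple(record[a] for a in self.qid_j)   # root-to-leaf path, flattened to a tuple
        self.size[key] += 1
        s = record[self.sensitive_attr]
        if s in self.sensitive_vals:
            self.sens_count[(key, s)] += 1

    def violates(self, K, C):
        """True if some qid_j breaks |T[qid_j]| >= K or P(s|qid_j) <= C."""
        for key, n in self.size.items():
            if n < K:
                return True
            for s in self.sensitive_vals:
                if self.sens_count[(key, s)] / n > C:
                    return True
        return False

# Usage sketch: index two generalized records of Table 6.2 on QID_5 = {Job, Age} and check (K=2, C=0.5).
tree = QIDTree(("Job", "Age"), "Surgery", {"Transgender"})
for rec in [{"Job": "Non-Technical", "Age": "[30-60)", "Surgery": "Transgender"},
            {"Job": "Non-Technical", "Age": "[30-60)", "Surgery": "Plastic"}]:
    tree.add(rec)
print(tree.violates(K=2, C=0.5))    # False for this tiny two-record fragment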

6.3.4 Discussion

Let T[Best] denote the set of records containing the value Best in a generalized table. Each iteration involves two types of work. The first type accesses data records in T[Best] for updating TIPS and the count statistics in Chapter 6.3.2. If Best is an interval, an extra step is required for determining the optimal split for each child interval c in child(Best). This requires making a scan on the records in T[c], which is a subset of T[Best]. To determine a split, T[c] has to be sorted, which can be an expensive operation. Fortunately, re-sorting T[c] is unnecessary for each iteration because its superset T[Best] is already sorted. Thus, this type of work involves one scan of the records being specialized in each iteration.

The second type computes Score(x) for the candidates x in ∪Cuti without accessing data records, as described in Chapter 6.3.3. For a table with m attributes and each taxonomy tree with at most p nodes, the number of such x is at most m × p. This computation makes use of the maintained count statistics rather than accessing data records. In other words, each iteration accesses only the records being specialized. Since m × p is a small constant, independent of the table size, the HDTDS algorithm is linear in the table size. This feature makes the approach scalable.

The current implementation of HDTDS [171] assumes that the data table fits in memory. Often, this assumption is valid because the qid groups are much smaller than the original table. If the qid groups do not fit in memory, we can store the leaf partitions of TIPS on disk if necessary. Favorably, the memory is used to keep only leaf partitions that are smaller than the page size to avoid fragmentation of disk pages. A nice property of HDTDS is that leaf partitions that cannot be further specialized (i.e., on which there is no candidate specialization) can be discarded, and only some statistics for them need to be kept. This likely applies to small partitions in memory and, therefore, the memory demand is unlikely to build up.

Compared to iteratively generalizing the data bottom-up starting from domain values, the top-down specialization is more natural and efficient for handling numerical attributes. To produce a small number of intervals for a numerical attribute, the top-down approach needs only a small number of interval splittings, whereas the bottom-up approach needs many interval mergings starting from many domain values. In addition, the top-down approach can discard data records that cannot be further specialized, whereas the bottom-up approach has to keep all data records until the end of the computation.

Experimental results on real-life data sets suggest that HDTDS [171] can effectively preserve both privacy and data utility in the anonymous data for a wide range of LKC-privacy requirements. There is a trade-off between data privacy and data utility with respect to K and L, but the trend is less obvious on C because eliminating record linkages is the primary driving force for generalization. The LKC-privacy model clearly retains more information than the traditional k-anonymity model and provides the flexibility to adjust privacy requirements according to the assumption of the adversary's background knowledge. Finally, HDTDS is highly scalable for large data sets. These characteristics make HDTDS a promising component in the Red Cross blood transfusion information system.


6.4 Workload-Aware Mondrian

LeFevre et al. [150] present a suite of greedy algorithms to address the k-anonymization problem for various data analysis tasks, including classification analysis on single/multiple categorical target attribute(s) and regression analysis on single/multiple numerical target attribute(s). The greedy algorithms recursively partition the QID domain space. The Mondrian method is flexible enough to adopt k-anonymity (Chapter 2.1.1), ℓ-diversity (Chapter 2.2.1), and squared-error diversity [150]. The general idea of the anonymization is somewhat similar to the top-down specialization approaches, such as HDTDS (Chapter 6.3). One major difference is that LeFevre et al. [150] employ a multidimensional generalization scheme, which is more flexible than the subtree generalization scheme used in HDTDS. Although Mondrian is designed for achieving k-anonymity, it can be easily modified to adopt the LKC-privacy model in order to accommodate high-dimensional data.

In the rest of Chapter 6.4, we briefly discuss Workload-Aware Mondrian on different classification and regression analysis tasks.

6.4.1 Single Categorical Target Attribute

Recall the description of the single-dimensional and multidimensional generalization schemes from Chapter 3.1: Let DAi be the domain of an attribute Ai. A single-dimensional generalization, such as full-domain generalization and subtree generalization, is defined by a function fi : DAi → D′ for each attribute Ai in QID. In contrast, a multidimensional generalization is defined by a single function f : DA1 × · · · × DAn → D′, which is used to generalize qid = 〈v1, . . . , vn〉 to qid′ = 〈u1, . . . , un〉, where for every vi, either vi = ui or vi is a child node of ui in the taxonomy of Ai. This scheme flexibly allows two qid groups, even having the same value v, to be independently generalized into different parent groups.

Every single-dimensional generalized table can be achieved by a sequence of multidimensional generalization operations. Yet, a multidimensional generalized table cannot, in general, be achieved by a sequence of single-dimensional generalization operations because the possible generalization space in the single-dimensional generalization scheme is a subset of the possible generalization space in the multidimensional scheme. Figure 6.8 visualizes this point with respect to classification analysis. Consider a data table with two quasi-identifying attributes, Age and Job, and two classes Y and N. Suppose the data holder wants to release a 3-anonymous table for classification analysis. Using single-dimensional generalization, the class labels cannot be clearly separated due to the privacy constraint, as two Ys fall in the wrong group. In contrast, using multidimensional generalization, the classes can be clearly separated.
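To make the multidimensional scheme concrete, here is a minimal Mondrian-style recursive partitioner in Python. It is a generic sketch of the median-split idea (the attribute names, the widest-attribute heuristic, and the numeric-only assumption are mine, and it is not the workload-aware variant of LeFevre et al.): each region is split at the median of one attribute as long as both halves still hold at least k records, so different regions can end up generalized to different granularities.

def mondrian_partition(records, attrs, k):
    """Recursively split a region at the median of one attribute while both halves keep >= k records."""
    for attr in sorted(attrs, key=lambda a: -value_spread(records, a)):   # try the widest attribute first
        values = sorted(r[attr] for r in records)
        median = values[len(values) // 2]
        left = [r for r in records if r[attr] < median]
        right = [r for r in records if r[attr] >= median]
        if len(left) >= k and len(right) >= k:                            # allowable cut
            return mondrian_partition(left, attrs, k) + mondrian_partition(right, attrs, k)
    return [records]                                                      # no allowable cut: one region

def value_spread(records, attr):
    vals = [r[attr] for r in records]
    return max(vals) - min(vals)

# Toy usage with numeric QID attributes (hypothetical data).
data = [{"Age": a, "Hours": h} for a, h in
        [(25, 40), (27, 38), (29, 45), (52, 60), (55, 58), (58, 62), (60, 20), (62, 25)]]
regions = mondrian_partition(data, ["Age", "Hours"], k=3)
for region in regions:
    ages = [r["Age"] for r in region]
    print(len(region), f"Age in [{min(ages)}-{max(ages)}]")   # each region is generalized independently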

[FIGURE 6.8: Illustration of the flexibility of the multidimensional generalization scheme: under a 3-anonymity constraint, single-dimensional generalization leaves Y and N records mixed in the same group, while multidimensional generalization separates the two classes]

Suppose there is only one categorical target attribute for classification modeling. To preserve the information utility for classification analysis, we can employ InfoGain Mondrian, which is a modified version of the Mondrian algorithm [149] but with a different information metric [150]:

    Entropy(P, C) = Σ_{partitions P′} (|P′| / |P|) Σ_c −p(c|P′) log p(c|P′),   (6.4)

where P and P′ denote the partitions before and after the Best specialization, p(c|P′) is the percentage of records in P′ with class label c, and c ranges over the class labels of the target attribute C.

6.4.2 Single Numerical Target Attribute

Most methods that address the anonymization problem for classification analysis, e.g., TDR [95, 96], TDD (Chapter 5.2.3), HDTDS (Chapter 6.3), and the Genetic Algorithm (Chapter 5.1.2.3), assume the target attribute is a categorical attribute. What method should be used if the target attribute is a numerical attribute? In other words, the data holder would like to release an anonymous table for regression analysis. Inspired by the CART algorithm for regression trees, which recursively selects the split that minimizes the weighted sum of the mean squared errors (MSE) over the set of resulting partitions, LeFevre et al. [150] propose the Least Squared Deviance (LSD) Mondrian algorithm, which minimizes the following information metric. The general idea of MSE is to measure the impurity of the target numerical attribute Z within a candidate partition P′:

    Error²(P, Z) = Σ_{partitions P′} Σ_{i ∈ P′} (vi − v̄(P′))²,   (6.5)

where vi denotes a record value and v̄(P′) denotes the mean value of Z in P′.
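The two workload-aware metrics are easy to state in code. The helpers below are an illustrative sketch (function and variable names are assumptions): weighted_entropy scores a candidate cut for classification (Equation 6.4) and squared_error scores it for regression (Equation 6.5); a Mondrian-style algorithm would pick the allowable cut minimizing the relevant score.

from math import log2
from collections import Counter

def weighted_entropy(partitions, class_attr):
    """Equation 6.4: size-weighted class entropy over the candidate partitions P'."""
    total = sum(len(p) for p in partitions)
    score = 0.0
    for part in partitions:
        counts = Counter(r[class_attr] for r in part)
        ent = sum(-(c / len(part)) * log2(c / len(part)) for c in counts.values())
        score += len(part) / total * ent
    return score

def squared_error(partitions, target_attr):
    """Equation 6.5: sum of squared deviations of the numerical target within each partition."""
    score = 0.0
    for part in partitions:
        mean = sum(r[target_attr] for r in part) / len(part)
        score += sum((r[target_attr] - mean) ** 2 for r in part)
    return score

# Toy usage: compare two candidate 2-way cuts of the same six records (hypothetical data).
recs = [{"cls": "Y", "z": 10}, {"cls": "Y", "z": 12}, {"cls": "Y", "z": 11},
        {"cls": "N", "z": 30}, {"cls": "N", "z": 32}, {"cls": "N", "z": 31}]
clean_cut = [recs[:3], recs[3:]]
mixed_cut = [recs[0:2] + recs[3:4], recs[2:3] + recs[4:]]
print(weighted_entropy(clean_cut, "cls"), weighted_entropy(mixed_cut, "cls"))  # 0.0 vs ~0.918
print(squared_error(clean_cut, "z"), squared_error(mixed_cut, "z"))            # 4.0 vs much larger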


6.4.3 Multiple Target Attributes

In some privacy-preserving data publishing scenarios, a data holder may want to release an anonymous table for classification and/or regression analysis on multiple target attributes.

For classification analysis on multiple categorical target attributes, LeFevre et al. [150] discuss two possible extensions. In the first approach, the data recipient could build a single classifier to predict the combination of class labels 〈C1, . . . , Cm〉, which has domain DC1 × · · · × DCm. The data holder can transform this multi-target-attribute problem into a single-target-attribute problem and employ the information metric presented in Chapter 6.4.1 to greedily minimize the entropy. The catch is that the domain grows exponentially with the number of target attributes, resulting in a large combination of classes and, thereby, a poor classification result.

The second approach is to simplify the problem by assuming independence among the target attributes; then we can employ a greedy information metric that minimizes the sum of weighted entropies [150]:

    Σ_{i=1}^{m} Entropy(P, Ci).   (6.6)

For regression analysis on multiple numerical target attributes, we can treat the set of target attributes independently and greedily minimize the sum of squared errors [150]:

    Σ_{i=1}^{m} Error²(P, Zi).   (6.7)

6.4.4 Discussion

The suite of Mondrian algorithms studied above provides a very flexible framework for anonymizing different attributes with respect to multiple and different types of target attributes. Compared to the single-dimensional generalization scheme, the employed multidimensional generalization scheme can help reduce information loss with respect to classification analysis and regression analysis. Yet, multidimensional generalization also implies higher computational cost and suffers from the data exploration problem discussed in Chapter 3.1. The data exploration problem may render the classifier built from the anonymous data unusable. Furthermore, the Mondrian algorithms do not handle multiple QIDs and they cannot achieve LKC-privacy. Enforcing traditional k-anonymity and ℓ-diversity may result in high information loss due to the curse of high dimensionality [6].


6.5 Bottom-Up Generalization

Algorithm 6.5.2 Bottom-Up Generalization
1: while T does not satisfy a given k-anonymity requirement do
2:   for all generalizations g do
3:     compute ILPG(g);
4:   end for
5:   find the Best generalization;
6:   generalize T by Best;
7: end while
8: output T;

Wang et al. [239] present an effective bottom-up generalization approach to achieve k-anonymity. They employed the subtree generalization scheme (Chapter 3.1). A generalization g : child(v) → v replaces all instances of every child value c in child(v) with the parent value v. Although this method is designed for achieving k-anonymity, it can be easily modified to adopt the LKC-privacy model in order to accommodate high-dimensional data.

6.5.1 The Anonymization Algorithm

Algorithm 6.5.2 presents the general idea of their bottom-up generalization method. It begins the generalization from the raw data table T. At each iteration, the algorithm greedily selects the Best generalization g that minimizes the information loss and maximizes the privacy gain. This intuition is captured by the information metric ILPG(g) = IL(g)/PG(g), which has been discussed in Chapter 4.3. Then, the algorithm performs the generalization child(Best) → Best on the table T, and repeats the iteration until the table T satisfies the given k-anonymity requirement.

Let A(QID) and Ag(QID) be the minimum anonymity counts in T before and after the generalization g. Given a data table T, there are many possible generalizations that can be performed. Yet, most generalizations g in fact do not affect the minimum anonymity count; in other words, A(QID) = Ag(QID). Thus, to facilitate efficiently choosing a generalization g, there is no need to consider all generalizations. Indeed, we can focus only on the "critical generalizations."

DEFINITION 6.3 A generalization g is critical if Ag(QID) > A(QID).

Wang et al. [239] make several observations to optimize the efficiency of Algorithm 6.5.2: a critical generalization g has a positive PG(g) and a finite ILPG(g), whereas a non-critical generalization g has PG(g) = 0 and infinite ILPG(g). Therefore, if at least one generalization is critical, all non-critical generalizations will be ignored by the ILPG(g) information metric. If all generalizations are non-critical, the ILPG(g) metric will select the one with minimum IL(g). In both cases, Ag(QID) is not needed for a non-critical generalization g. Based on this observation, Lines 2-3 in Algorithm 6.5.2 can be optimized as illustrated in Algorithm 6.5.3.

Algorithm 6.5.3 Bottom-Up Generalization
1: while T does not satisfy a given k-anonymity requirement do
2:   for all critical generalizations g do
3:     compute Ag(QID);
4:   end for
5:   find the Best generalization;
6:   generalize T by Best;
7: end while
8: output T;
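The selection rule behind Algorithms 6.5.2-6.5.3 can be sketched as a small helper. The following Python fragment is illustrative only (the candidate representation and its il / pg fields are hypothetical): if any candidate is critical (PG(g) > 0), the one with the smallest IL(g)/PG(g) wins; otherwise the candidate with the smallest IL(g) wins, so Ag(QID) never has to be computed for non-critical candidates.

def choose_best_generalization(candidates):
    """candidates: list of dicts with keys 'name', 'il' (information loss), 'pg' (privacy gain)."""
    critical = [g for g in candidates if g["pg"] > 0]          # PG(g) > 0 iff g is critical
    if critical:
        # finite ILPG(g) = IL(g)/PG(g); non-critical candidates have infinite ILPG and are ignored
        return min(critical, key=lambda g: g["il"] / g["pg"])
    return min(candidates, key=lambda g: g["il"])              # all non-critical: minimize IL(g)

# Hypothetical candidate generalizations for one iteration.
candidates = [{"name": "Mover,Janitor->Non-Technical", "il": 2.0, "pg": 3},
              {"name": "9th,10th->Junior Sec.",         "il": 1.5, "pg": 0},
              {"name": "M,F->ANY_Sex",                  "il": 4.0, "pg": 5}]
print(choose_best_generalization(candidates)["name"])   # picks the critical candidate with the lowest IL/PG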

6.5.2 Data Structure

To further improve the efficiency of the generalization operation, Wang et al. [239] propose a data structure, called the Taxonomy Encoded Anonymity (TEA) index, for QID = {D1, . . . , Dm}. TEA is a tree of m levels, where the ith level represents the current value for Di. Each root-to-leaf path represents a qid value in the current data table, with a(qid) stored at the leaf node. In addition, the TEA index links up the qids according to the generalizations that generalize them. When a generalization g is applied, the TEA index is updated by adjusting the qids linked to the generalization of g. The purpose of this index is to prune the number of candidate generalizations to no more than |QID| at each iteration, where |QID| is the number of attributes in QID. For a generalization g: child(v) → v, a segment of g is a maximal set of sibling nodes, {s1, . . . , st}, such that {s1, . . . , st} ⊆ child(v), where t is the size of the segment. All segments of g are linked up. A qid is generalized by a segment if the qid contains a value in the segment.

A segment of g represents a set of sibling nodes in the TEA index that will be merged by applying g. To apply generalization g, we follow the links of the segments of g and merge the nodes in each segment of g. Merging sibling nodes implies inserting the new node into a proper segment and recursively merging the child nodes having the same value if their parents are merged. Merging leaf nodes requires adding up the a(qid) counts stored at those leaf nodes. The cost is proportional to the number of qids generalized by g.
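The following Python fragment sketches only the count-merging part of applying a generalization, using a plain dictionary of a(qid) counts instead of the linked segments of the TEA index; the individual counts in the usage example are illustrative and chosen only so that the merged sums match Example 6.8 below.

```python
from collections import Counter

def apply_generalization(qid_counts, attr_index, child_to_parent):
    """Merge a(qid) counts when child values of one QID attribute are
    generalized to their parent. A real TEA index would locate the affected
    qids via its segment links instead of scanning every qid as done here."""
    merged = Counter()
    for qid, count in qid_counts.items():
        value = qid[attr_index]
        new_value = child_to_parent.get(value, value)   # generalize if value is a child
        new_qid = qid[:attr_index] + (new_value,) + qid[attr_index + 1:]
        merged[new_qid] += count                        # add up a(qid) of merged leaves
    return merged

# Example 6.8-style usage: apply {c2, d2} -> f2 on the second QID attribute.
# The four counts below are hypothetical; only their sums (7 and 4) come from Example 6.8.
counts = Counter({("d1", "c2", "b3"): 3, ("d1", "d2", "b3"): 4,
                  ("d1", "c2", "e3"): 2, ("d1", "d2", "e3"): 2})
counts = apply_generalization(counts, 1, {"c2": "f2", "d2": "f2"})
# counts[("d1", "f2", "b3")] == 7 and counts[("d1", "f2", "e3")] == 4
```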


FIGURE 6.9: The TEA structure for QID = {Relationship, Race, Workclass} ([239] ©2004 IEEE)

Example 6.8
Figure 6.9 depicts three taxonomy trees for the QID attributes {Relationship, Race, Workclass} and the TEA index for the qids:

〈c1, b2, a3〉, 〈c1, b2, c3〉, 〈c1, b2, d3〉, 〈c1, c2, a3〉, 〈c1, c2, b3〉, 〈d1, c2, b3〉, 〈d1, c2, e3〉, 〈d1, d2, b3〉, 〈d1, d2, e3〉.

A rectangle represents a segment, and a dashed line links up the segments of the same generalization. For example, the left-most path represents the qid = 〈c1, b2, a3〉, and a(〈c1, b2, a3〉) = 4. {c1, d1} at level 1 is a segment of f1 because it forms a maximal set of siblings that will be merged by f1. {c1c2} and {d1c2, d1d2} at level 2 are two segments of f2. {c1b2c3, c1b2d3} at level 3 is a segment of f3. 〈d1, d2, e3〉 and 〈d1, c2, e3〉, in bold face, are the anonymity qids.

Consider applying {c2, d2} → f2. The first segment of f2 contains only one sibling node, {c1c2}, so we simply relabel the sibling by f2. This creates new qids 〈c1, f2, a3〉 and 〈c1, f2, b3〉. The second segment of f2 contains two sibling nodes, {d1c2, d1d2}. We merge them into a new node labeled by f2, and merge their child nodes having the same label. This creates new qids 〈d1, f2, b3〉 and 〈d1, f2, e3〉, with a(〈d1, f2, b3〉) = 7 and a(〈d1, f2, e3〉) = 4.

6.5.3 Discussion

Bottom-Up Generalization is an efficient k-anonymization method. Experimental results [239] suggest that it can effectively identify a k-anonymous solution that preserves information utility for classification analysis. Yet, it has several limitations that make it inapplicable to the anonymization problem in the Red Cross BTS:

1. Enforcing traditional k-anonymity on the Red Cross BTS data would result in high utility loss due to the problem of high dimensionality, making the data not useful in practice.

2. The data structure TEA proposed in [239] can handle only a single QID. A new structure is required if the data holder wants to achieve LKC-privacy using the Bottom-Up Generalization method, because LKC-privacy in effect is equivalent to breaking a single QID into multiple QIDi ⊆ QID, where |QIDi| ≤ L.

3. Bottom-Up Generalization can handle categorical attributes only. Compared to iteratively masking the data bottom-up starting from domain values, top-down specialization is more natural and efficient for handling numerical attributes. To produce a small number of intervals for a numerical attribute, the top-down approach needs only a small number of interval splits, whereas the bottom-up approach requires many interval merges starting from the numerous domain values.

4. The initial step of Bottom-Up Generalization has to identify all combinations of qid values in the raw data table T and keep track of their anonymity counts a(qid). In case T is large, the number of distinct combinations could be huge, and keeping all these combinations and counts in memory may not be feasible. In contrast, the top-down approach can discard data records that cannot be further specialized, whereas the bottom-up approach has to keep all data records until the end of the computation.

6.6 Genetic Algorithm

Iyengar [123] pioneered the anonymization problem for classification analysis and proposed a genetic algorithmic solution to achieve traditional k-anonymity with the goal of preserving data utility.

6.6.1 The Anonymization Algorithm

Iyengar [123] modifies the genetic algorithm from [243]. The algorithm has two parts: choosing an attribute for subtree generalization and choosing a record for entire record suppression. The chromosome in the genetic algorithm framework is a bit string that represents the chosen generalizations. The algorithm uses the classification metric CM in Equation 4.5 (Chapter 4.2) to determine the goodness of a k-anonymous solution. Intuitively, the metric charges a penalty for each record suppressed or generalized to a group in which the record's class is not the majority class. It is based on the idea that a record having a non-majority class in a group will be classified as the majority class, which is an error because it disagrees with the record's original class. The general idea is to encode each state of generalization as a "chromosome" and the data distortion into the fitness function. Then the algorithm employs genetic evolution to converge to the fittest chromosome.
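As a rough illustration of the encoding and fitness idea (not Iyengar's actual implementation), the sketch below represents a chromosome as a bit string, scores it with a CM-style penalty, and runs one generation of selection, single-point crossover, and mutation; the decode() helper, which would turn a chromosome into a k-anonymous table and report its class-count groups and suppressed records, is an assumption of the sketch.

```python
import random

def cm_penalty(groups, suppressed_count):
    """CM-style penalty: one unit for every suppressed record and for every
    record whose class is not the majority class of its generalized group."""
    penalty = suppressed_count
    for class_counts in groups:              # e.g. [12, 3] = class sizes in one group
        penalty += sum(class_counts) - max(class_counts)
    return penalty

def random_chromosome(num_bits):
    """A bit string encoding which generalizations are switched on."""
    return [random.randint(0, 1) for _ in range(num_bits)]

def evolve(population, decode, elite=2):
    """One generation: rank by fitness, keep the elite, breed the rest."""
    scored = sorted(population, key=lambda c: cm_penalty(*decode(c)))
    next_gen = scored[:elite]                # keep the fittest chromosomes
    while len(next_gen) < len(population):
        a, b = random.sample(scored[: len(scored) // 2], 2)
        cut = random.randrange(1, len(a))    # single-point crossover
        child = a[:cut] + b[cut:]
        if random.random() < 0.01:           # occasional mutation
            i = random.randrange(len(child))
            child[i] ^= 1
        next_gen.append(child)
    return next_gen
```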

6.6.2 Discussion

Iyengar [123] identifies a new direction for privacy-preserving data publishing. Yet, his proposed solution has several limitations. Similar to Mondrian and Bottom-Up Generalization, the Genetic Algorithm presented in [123] can achieve only k-anonymity. It does not address the privacy threats caused by attribute linkages, nor the high-dimensionality problem.

The major drawback of the Genetic Algorithm is its inefficiency: it requires 18 hours to transform 30K records [123]. In contrast, the TDD algorithm (Chapter 5.2.3) takes only 7 seconds to produce a comparable accuracy on the same data, and the HDTDS algorithm (Chapter 6.3) takes only 30 seconds to produce even better results than the Genetic Algorithm. For large databases, Iyengar suggested running his algorithm on a sample. However, a small sampling error could mean failed protection on the entire data. The other methods presented in this chapter do not suffer from this efficiency and scalability problem.

6.7 Evaluation Methodology

In Part I, we have studied the four major components in privacy-preserving data publishing, namely privacy models, anonymization operations, information metrics, and anonymization algorithms. After choosing the appropriate components to address a specific privacy-preserving data publishing problem, the data holder often would like to know the impact on the data quality. The objective of this section is to present an evaluation methodology for measuring the impact of imposing a privacy requirement on the data quality with respect to some data analysis tasks. Data holders can employ this evaluation methodology to objectively measure the data quality with respect to the privacy protection level before publishing the data. Current or prospective researchers may also find this evaluation methodology beneficial to their future works on privacy-preserving data publishing.

We use the LKC-privacy model (Chapter 6.3) and the anonymization algorithm HDTDS (Chapter 6.3) as examples to illustrate the evaluation methodology. Specifically, we study the impact of enforcing various LKC-privacy requirements on the data quality in terms of classification error and discernibility cost, and evaluate the efficiency and scalability of HDTDS by varying the thresholds of maximum adversary's knowledge L, minimum anonymity K, and maximum confidence C.

Two real-life data sets, Blood and Adult, are employed. Blood is a real-life blood transfusion data set owned by an anonymous health institute. Blood contains 10,000 blood transfusion records in 2008. Each record represents one incident of blood transfusion. Blood has 62 attributes after removing explicit identifiers; 41 of them are QID attributes. Blood Group represents the Class attribute with 8 possible values. Diagnosis Codes, representing 15 categories of diagnosis, is considered to be the sensitive attribute. The remaining attributes are neither quasi-identifiers nor sensitive.

The publicly available Adult data set [179] is a de facto benchmark for testing anonymization algorithms [29, 96, 123, 162, 172, 236, 237] in the research area of privacy-preserving data publishing. Adult has 45,222 census records on 6 numerical attributes, 8 categorical attributes, and a binary Class column representing two income levels, ≤50K or >50K. Table 6.9 describes the attributes of Adult. Divorced and Separated in the attribute Marital-status are considered sensitive, and the remaining 13 attributes are considered as QID. All experiments were conducted on an Intel Core2 Quad Q6600 2.4GHz PC with 2GB RAM.

6.7.1 Data Utility

To evaluate the impact on classification quality (Case 1 in Chapter 6.2.2), all records are used for generalization, a classifier is built on 2/3 of the generalized records as the training set, and the classification error (CE) is measured on 1/3 of the generalized records as the testing set. Alternatively, one can employ the 10-fold cross validation method to measure the classification error. For classification models, the experiments use the well-known C4.5 classifier [191]. To better visualize the cost and benefit of the approach, we measure two additional errors. Baseline Error (BE) is the error measured on the raw data without generalization; BE − CE represents the cost in terms of classification quality for achieving a given LKC-privacy requirement. A naive method to avoid record and attribute linkages is to simply remove all QID attributes. Thus, we also measure the upper bound error (UE), which is the error on the raw data with all QID attributes removed; UE − CE represents the benefit of the method over the naive approach.

Table 6.9: Attributes for the Adult data set

Attribute            Type         Numerical Range    # of Leaves  # of Levels
Age (A)              numerical    17 - 90
Capital-gain (Cg)    numerical    0 - 99999
Capital-loss (Cl)    numerical    0 - 4356
Education-num (En)   numerical    1 - 16
Final-weight (Fw)    numerical    13492 - 1490400
Hours-per-week (H)   numerical    1 - 99
Education (E)        categorical                     16           5
Marital-status (M)   categorical                     7            4
Native-country (N)   categorical                     40           5
Occupation (O)       categorical                     14           3
Race (Ra)            categorical                     5            3
Relationship (Re)    categorical                     6            3
Sex (S)              categorical                     2            2
Work-class (W)       categorical                     8            5
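The protocol can be mirrored in a few lines of Python, assuming the generalized and raw tables have already been encoded as numeric feature matrices (e.g., one-hot encoded) and using scikit-learn's DecisionTreeClassifier as a stand-in for C4.5; the variable names are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def classification_error(X, y, seed=0):
    """Train on 2/3 of the records and measure the error on the remaining 1/3."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    return 1.0 - clf.score(X_te, y_te)

# CE = classification_error(X_generalized, y)      # error on the anonymized data
# BE = classification_error(X_raw, y)              # baseline error on the raw data
# UE = classification_error(X_raw_without_qid, y)  # error with all QID attributes removed
# cost = CE - BE; benefit = UE - CE
```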

To evaluate the impact on general analysis quality (Case 2 in Chapter 6.2.2), we use all records for generalization and measure the discernibility ratio (DR) on the final anonymous data.

DR = (Σqid |T[qid]|²) / |T|².  (6.8)

DR is the normalized discernibility cost, with 0 ≤ DR ≤ 1. A lower DR means higher data quality.
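Equation 6.8 amounts to a one-line computation once each record is reduced to the tuple of its (generalized) QID values, as in this small sketch:

```python
from collections import Counter

def discernibility_ratio(qid_tuples):
    """Normalized discernibility cost (Equation 6.8): the sum over qid groups
    of |T[qid]| squared, divided by |T| squared."""
    n = len(qid_tuples)
    group_sizes = Counter(qid_tuples).values()      # |T[qid]| for each qid group
    return sum(size * size for size in group_sizes) / (n * n)

# Example: five records falling into groups of sizes 3 and 2.
# discernibility_ratio([("a", 1)] * 3 + [("b", 2)] * 2) == (9 + 4) / 25
```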

6.7.1.1 The Blood Data Set

Figure 6.10 depicts the classification error CE with adversary's knowledge L = 2, 4, 6, anonymity threshold 20 ≤ K ≤ 100, and confidence threshold C = 20% on the Blood data set. This setting allows us to measure the performance of the algorithm against record linkages for a fixed C. CE generally increases as K or L increases. However, the increase is not monotonic. For example, the error drops slightly when K increases from 20 to 40 for L = 4. This is due to the fact that generalization has removed some noise from the data, resulting in a better classification structure in a more general state. For the same reason, some test cases on L = 2 and L = 4 have CE < BE, implying that generalization not only achieves the given LKC-privacy requirement but sometimes may also improve the classification quality. BE = 22.1% and UE = 44.1%.


FIGURE 6.10: Classification error on the Blood data set (C = 20%). [Plot of classification error CE (%) versus anonymity threshold K (20-100) for L = 2, 4, 6, with reference values BE = 22.1% and UE = 44.1%.]

FIGURE 6.11: Discernibility ratio on the Blood data set (C = 20%). [Plot of discernibility ratio DR versus anonymity threshold K (20-100) for L = 2, 4, 6 and for traditional K-anonymity.]


FIGURE 6.12: Classification error on the Adult data set (C = 20%). [Plot of classification error CE (%) versus anonymity threshold K (20-100) for L = 2, 4, 6, with reference values BE = 14.7% and UE = 24.6%.]

For L = 2 and L = 4, CE − BE spans from -2.9% to 5.2% and UE − CE spans from 16.8% to 24.9%, suggesting that the cost for achieving LKC-privacy is small, but the benefit is large when L is not large. However, as L increases to 6, CE quickly increases to about 40%, the cost increases to about 17%, and the benefit decreases to 5%. For a greater value of L, the difference between LKC-privacy and k-anonymity is very small in terms of classification error since more generalized data does not necessarily worsen the classification error. This result confirms that the assumption of an adversary's prior knowledge has a significant impact on the classification quality. It also indirectly confirms the curse of high dimensionality [6].

Figure 6.11 depicts the discernibility ratio DR with adversary's knowledge L = 2, 4, 6, anonymity threshold 20 ≤ K ≤ 100, and a fixed confidence threshold C = 20%. DR generally increases as K increases, so it exhibits some trade-off between data privacy and data utility. As L increases, DR increases quickly because more generalization is required to ensure each equivalence group has at least K records. To illustrate the benefit of the proposed LKC-privacy model over the traditional k-anonymity model, we measure the discernibility ratio, denoted by DRTradK, on traditional K-anonymous solutions produced by the TDR method in [96]. DRTradK − DR, representing the benefit of the model, spans from 0.1 to 0.45. This indicates a significant improvement on data quality by making a reasonable assumption of limiting the adversary's knowledge to at most L known values. Note that the solutions produced by TDR do not prevent attribute linkages although they have a higher discernibility ratio.

6.7.1.2 The Adult Data Set

Figure 6.12 depicts the classification error CE with adversary's knowledge L = 2, 4, 6, anonymity threshold 20 ≤ K ≤ 100, and confidence threshold C = 20% on the Adult data set. BE = 14.7% and UE = 24.5%.


FIGURE 6.13: Classification error on the Adult data set (K = 100). [Plot of classification error CE (%) versus confidence threshold C (5%-30%) for L = 2, 4, 6, with reference values BE = 14.7% and UE = 24.6%.]

FIGURE 6.14: Discernibility ratio on the Adult data set (C = 20%). [Plot of discernibility ratio DR versus anonymity threshold K (20-100) for L = 2, 4, 6 and for traditional K-anonymity.]

For L = 2, CE − BE is less than 1% and UE − CE spans from 8.9% to 9.5%. For L = 4 and L = 6, CE − BE spans from 1.1% to 4.1%, and UE − CE spans from 5.8% to 8.8%. These results suggest that the cost for achieving LKC-privacy is small, while the benefit of the LKC-privacy anonymization method over the naive method is large.

Figure 6.13 depicts the CE with adversary's knowledge L = 2, 4, 6, confidence threshold 5% ≤ C ≤ 30%, and anonymity threshold K = 100. This setting allows us to measure the performance of the algorithm against attribute linkages for a fixed K. The result suggests that CE is insensitive to the change of confidence threshold C. CE slightly increases as the adversary's knowledge L increases.

Figure 6.14 depicts the discernibility ratio DR with adversary's knowledge L = 2, 4, 6, anonymity threshold 20 ≤ K ≤ 100, and confidence threshold C = 20%.


FIGURE 6.15: Discernibility ratio on the Adult data set (K = 100). [Plot of discernibility ratio DR versus confidence threshold C (5%-30%) for L = 2, 4, 6.]

DR sometimes drops when K increases. This is due to the fact that the presented greedy algorithm identifies only a sub-optimal solution. DR is insensitive to the increase of K and stays close to 0 for L = 2. As L increases to 4, DR increases significantly and finally equals traditional k-anonymity when L = 6 because Adult has relatively fewer attributes than Blood. Yet, k-anonymity does not prevent attribute linkages, while LKC-privacy provides this additional privacy guarantee.

Figure 6.15 depicts the DR with adversary's knowledge L = 2, 4, 6, confidence threshold 5% ≤ C ≤ 30%, and anonymity threshold K = 100. In general, DR increases as L increases due to a more restrictive privacy requirement. Similar to Figure 6.13, DR is insensitive to the change of confidence threshold C. This implies that the primary driving forces for generalization are L and K, not C.

6.7.2 Efficiency and Scalability

High-Dimensional Top-Down Specialization (HDTDS) is an efficient and scalable algorithm for achieving LKC-privacy on high-dimensional relational data. Every previous test case can finish the entire anonymization process within 30 seconds. We further evaluate the scalability of HDTDS with respect to data volume by blowing up the size of the Adult data set. First, we combined the training and testing sets, giving 45,222 records. For each original record r in the combined set, we created α − 1 "variations" of r, where α > 1 is the blowup scale. Together with all original records, the enlarged data set has α × 45,222 records. In order to provide a more precise evaluation, the runtime reported below excludes the time for loading data records from disk and the time for writing the generalized data to disk.
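A minimal sketch of this blowup procedure, assuming a perturb() helper that returns a slightly modified copy of a record (the exact variation scheme is not specified here):

```python
def blow_up(records, alpha, perturb):
    """Enlarge a data set to alpha times its original size by creating
    alpha - 1 'variations' of every original record."""
    enlarged = list(records)
    for r in records:
        enlarged.extend(perturb(r) for _ in range(alpha - 1))
    return enlarged

# e.g. alpha = 22 enlarges the 45,222 Adult records to roughly one million.
```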

Figure 6.16 depicts the runtime from 200,000 to 1 million records for L = 4, K = 20, and C = 100%.


FIGURE 6.16: Scalability (L = 4, K = 20, C = 100%). [Plot of runtime in seconds versus the number of records in thousands (200-1,000), with separate series for reading, anonymization, writing, and total time.]

The total runtime for anonymizing 1 million records is 107 seconds, of which 50 seconds are spent on reading raw data, 33 seconds on anonymization, and 24 seconds on writing the anonymous data. The algorithm is scalable because it uses count statistics to update the Score, and thus takes only one scan of data per iteration to anonymize the data. As the number of records increases, the total runtime increases linearly.

6.8 Summary and Lesson Learned

This chapter uses the Red Cross Blood Transfusion Service (BTS) as a real-life example to motivate the anonymization problem for classification analysis. We have studied the challenge of anonymizing high-dimensional data and presented the LKC-privacy model [171] for addressing the challenge. Furthermore, we have studied four privacy-preserving data publishing methods to address the anonymization problem for classification analysis.

The High-Dimensional Top-Down Specialization (HDTDS) algorithm [171] is extended from TDS [95, 96]. HDTDS is flexible enough to adopt different information metrics in order to accommodate the two different information requirements specified by BTS, namely classification analysis and general counting. The experimental results on the two real-life data sets suggest the following.

• HDTDS can effectively preserve both privacy and data utility in the anonymous data for a wide range of LKC-privacy requirements. There is a trade-off between data privacy and data utility with respect to K and L, but the trend is less obvious on C.

• The proposed LKC-privacy model retains more information than the traditional k-anonymity model and provides the flexibility to adjust privacy requirements according to the assumed adversary's background knowledge.

• HDTDS and its predecessor TDS [95, 96] are highly scalable for large data sets.

These characteristics make HDTDS a promising component for anonymizing high-dimensional relational data. The proposed solution could serve as a model for data sharing in the healthcare sector.

We have also studied three other solutions, namely, Workload-Aware Mondrian, Bottom-Up Generalization, and the Genetic Algorithm. All these methods aim at achieving traditional k-anonymity and/or ℓ-diversity with an information metric that guides the anonymization to preserve the data utility for classification analysis. Yet, these methods cannot be directly applied to the Red Cross BTS problem, which involves high-dimensional data, because applying k-anonymity on high-dimensional data would suffer from high information loss.

The solutions presented in this chapter are very different from privacy-preserving data mining (PPDM), whose goal is to share the data mining result. In contrast, the goal of privacy-preserving data publishing (PPDP) is to share the data. This is an essential requirement for the Red Cross since they require the flexibility to perform various data analysis tasks.

Health data are complex, often a combination of relational data, transaction data, and textual data. Some recent works [97, 101, 220, 256] that will be discussed in Chapters 13 and 16 are applicable to solving the privacy problem on transaction and textual data in the Red Cross case. Besides the technical issues, it is equally important to educate health institute management and medical practitioners about the latest privacy-preserving technology. When management encounters the problem of privacy-preserving data publishing as presented in this chapter, their initial response is often to set up a traditional role-based secure access model. In fact, alternative techniques, such as privacy-preserving data mining and data publishing [11, 92], are available to them provided that the data mining quality does not significantly degrade.


Chapter 7

Anonymization for Cluster Analysis

7.1 Introduction

Substantial research has been conducted on k-anonymization and its extensions as discussed in Chapter 2, but only a few prior works have considered releasing data for some specific purpose of data mining, which is also known as workload-aware anonymization [150]. Chapter 6 presents a practical data publishing framework for generating an anonymized version of data that preserves both individual privacy and information utility for classification analysis. This chapter aims at preserving the information utility for cluster analysis. Experiments on real-life data suggest that by focusing on preserving the cluster structure in the anonymization process, the cluster quality is significantly better than the cluster quality of data anonymized without such focus [94]. The major challenge of anonymizing data for cluster analysis is the lack of class labels that could be used to guide the anonymization process. The approach presented in this chapter converts the problem into the counterpart problem for classification analysis, wherein class labels encode the cluster structure in the data, and presents a framework to evaluate the cluster quality on the anonymized data.

This chapter is organized as follows. Chapter 7.2 presents the anonymization framework that addresses the anonymization problem for cluster analysis. Chapter 7.3 discusses an alternative solution in a different data model. Chapter 7.4 discusses some privacy topics that are orthogonal to the anonymization problem of cluster analysis studied in this chapter. Chapter 7.5 summarizes the chapter.

7.2 Anonymization Framework for Cluster Analysis

Fung et al. [94] define the data publishing scenario as follows. Consider a person-specific data table T with patients' information on Zip code, Birthplace, Gender, and Disease. The data holder wants to publish T to some recipient for cluster analysis. However, if a set of attributes, called a Quasi-Identifier or a QID, on {Zip code, Birthplace, Gender} is so specific that few people match it, publishing the table will lead to linking a unique or small number of individuals with the sensitive information on Disease. Even if the currently published table T does not contain sensitive information, individuals in T can be linked to the sensitive information in some external source by a join on the common attributes [201, 216]. These types of privacy attacks are known as record linkage and attribute linkage, which have been extensively discussed in Chapter 2. The problem studied in this chapter is to generate an anonymized version of T that satisfies both the anonymity requirement and the clustering requirement.

Anonymity Requirement: To thwart privacy threats caused by record and attribute linkages, instead of publishing the raw table T(QID, Sensitive attribute), the data holder publishes an anonymized table T′, where QID is a set of quasi-identifying attributes masked to some general concept. Note, we use the term "mask" to refer to the operation of "generalization" or "suppression." In general, the anonymity requirement can be any privacy model that can thwart record linkages (Chapter 2.1) and attribute linkages (Chapter 2.2).

Clustering Requirement: The data holder wants to publish a masked version of T to a recipient for the purpose of cluster analysis. The goal of cluster analysis is to group similar objects into the same cluster and group dissimilar objects into different clusters. We assume that the Sensitive attribute is important for the task of cluster analysis; otherwise, it should be removed. The recipient may or may not be known at the time of data publication.

We study the anonymization problem for cluster analysis: for a given anonymity requirement and a raw data table T, a data holder wants to generate an anonymous version of T, denoted by T′, that preserves as much of the information as possible for cluster analysis, and then publish T′ to a data recipient. The data holder, for example, could be a hospital that wants to share its patients' information with a drug company for pharmaceutical research.

There are many possible masked versions of T′ that satisfy the anonymity requirement. The challenge is how to identify the appropriate one for cluster analysis. An inappropriately masked version could put originally dissimilar objects into the same cluster, or put originally similar objects into different clusters because other masked objects become more similar to each other. Therefore, a quality-guided masking process is crucial. Unlike the anonymization problem for classification analysis studied in Chapter 6, the anonymization problem for cluster analysis does not have class labels to guide the masking. Another challenge is that it is not even clear what "information for cluster analysis" means, nor how to evaluate the cluster quality of generalized data. In this chapter, we define the anonymization problem for cluster analysis and present a solution framework to address the challenges in the problem. This chapter answers the following key questions:

1. Can a masked table simultaneously satisfy both privacy and clustering requirements? The insight is that the two requirements are indeed dealing with two types of information: the anonymity requirement aims at masking identifying information that specifically describes individuals, while the clustering requirement aims at extracting general structures that capture patterns. Sensitive information tends to be overly specific, thus of less utility, to clustering. Even if masking sensitive information eliminates some useful structures, alternative structures in the data emerge to help. If masking is carefully performed, identifying information can be masked while still preserving the patterns for cluster analysis. Experimental results [94] on real-life data sets support this insight.

2. What information should be preserved for cluster analysis in the masked data? This chapter presents a framework to convert the anonymization problem for cluster analysis to the counterpart problem for classification analysis. The idea is to extract the cluster structure from the raw data, encode it in the form of class labels, and preserve such class labels while masking the data. The framework also permits the data holder to evaluate the cluster quality of the anonymized data by comparing the cluster structures before and after the masking. This evaluation process is important for data publishing in practice, but very limited study has been conducted in the context of privacy preservation and cluster analysis.

3. Can cluster-quality guided anonymization improve the cluster quality in anonymous data? A naive solution to the studied privacy problem is to ignore the clustering requirement and employ some general-purpose anonymization algorithm, e.g., Incognito [148], to mask data for cluster analysis. Extensive experiments [94] suggest that by focusing on preserving the cluster structure in the masking process, the cluster quality outperforms the cluster quality on masked data without such focus. In general, the cluster quality on the masked data degrades as the anonymity threshold k increases.

4. Can the specification of multiple quasi-identifiers improve the cluster quality in anonymous data? The classic notion of k-anonymity assumes that a single united QID contains all quasi-identifying attributes, but research shows that it often leads to substantial loss of data quality as the QID size increases [6]. The insight is that, in practice, an adversary is unlikely to know all identifying attributes of a target victim (the person being identified), so the data is over-protected by a single QID. The studied method allows the specification of multiple QIDs, each of which has a smaller size, and therefore avoids over-masking and improves the cluster quality. The idea of having multiple QIDs is similar to the notion of assuming that an adversary's background knowledge is limited to length L in LKC-privacy, studied in Chapter 6.

5. Can we efficiently mask different types of attributes in real-life databases? Typical relational databases contain both categorical and numerical attributes. Taxonomy trees like the ones shown in Figure 7.1 are pre-specified for some attributes. The taxonomy trees are part of the domain knowledge that allows the data holder to specify a generalization path so that a data value can be masked to a less specific description. However, taxonomy trees may or may not be available in real-life databases. The anonymization algorithm studied in this chapter can effectively mask all these variations of attributes by generalizing categorical attributes with pre-specified taxonomy trees, suppressing categorical attributes without taxonomy trees, and dynamically discretizing numerical attributes.

Given that the clustering task is known in advance, one may ask why not publish the analysis result instead of the data records? Unlike classification trees and association rules, publishing the cluster statistics (e.g., cluster centers, together with their size and radius) usually cannot fulfill the information needs for cluster analysis. Often, data recipients want to browse into the clustered records to gain more knowledge. For example, a medical researcher may browse into some clusters of patients and examine their common characteristics. Publishing data records not only fulfills this vital requirement for cluster analysis, but also increases the availability of information for the recipients.

We first formally define the anonymization problem for cluster analysis, followed by a solution framework.

7.2.1 Anonymization Problem for Cluster Analysis

A labelled table discussed in Chapter 6 has the form T(D1, . . . , Dm, Class) and contains a set of records of the form 〈v1, . . . , vm, cls〉, where vj, for 1 ≤ j ≤ m, is a domain value of attribute Dj, and cls is a class label of the Class attribute. Each Dj is either a categorical or a numerical attribute. An unlabelled table has the same form as a labelled table but without the Class attribute.

7.2.1.1 Privacy Model

In order to provide a concrete explanation of the problem and solution of anonymity for cluster analysis, we specify the privacy model to be k-anonymity with multiple QIDs in the rest of this chapter. The problem, as well as the solution, can be generalized to achieve other privacy models discussed in Chapter 2.

DEFINITION 7.1 Anonymity requirement Consider p quasi-identifiers QID1, . . . , QIDp on T′, where QIDi ⊆ {D1, . . . , Dm} for 1 ≤ i ≤ p. a(qidi) denotes the number of data records in T′ that share the value qidi on QIDi. The anonymity of QIDi, denoted by A(QIDi), is the minimum a(qidi) for any value qidi on QIDi. A table T′ satisfies the anonymity requirement {〈QID1, h1〉, . . . , 〈QIDp, hp〉} if A(QIDi) ≥ hi for 1 ≤ i ≤ p, where QIDi and the anonymity thresholds hi are specified by the data holder.

Table 7.1: The labelled table

Rec ID  Education   Gender  Age  ...  Class      Count
1-3     9th         M       30        0C1 3C2    3
4-7     10th        M       32        0C1 4C2    4
8-12    11th        M       35        2C1 3C2    5
13-16   12th        F       37        3C1 1C2    4
17-22   Bachelors   F       42        4C1 2C2    6
23-26   Bachelors   F       44        4C1 0C2    4
27-30   Masters     M       44        4C1 0C2    4
31-33   Masters     F       44        3C1 0C2    3
34      Doctorate   F       44        1C1 0C2    1
                               Total: 21C1 13C2  34

If some QIDj could be "covered" by another QIDi, then QIDj can be removed from the anonymity requirement. This observation is stated as follows:

Observation 7.2.1 (Cover) Suppose QIDj ⊆ QIDi and hj ≤ hi, where j ≠ i. If A(QIDi) ≥ hi, then A(QIDj) ≥ hj. We say that QIDj is covered by QIDi; therefore, QIDj is redundant and can be removed.

Example 7.1
Consider the data in Table 7.1 and the taxonomy trees in Figure 7.1. Ignore the dashed line in Figure 7.1 for now. The table has 34 records, with each row representing one or more raw records that agree on (Education, Gender, Age). The Class column stores a count for each class label. The anonymity requirement

{〈QID1 = {Education, Gender}, 4〉, 〈QID2 = {Gender}, 4〉}

states that every existing qid1 and qid2 in the table must be shared by at least 4 records. Therefore, 〈9th, M〉, 〈Masters, F〉, and 〈Doctorate, F〉 violate this requirement. To make the "female doctor" less unique, we can generalize Masters and Doctorate to Grad School. As a result, "she" becomes less identifiable by being one of the four females who have a graduate degree in the masked table T′. Note that QID2 is covered by QID1, so QID2 can be removed.
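Checking an anonymity requirement of this form is a direct count over qid groups. The sketch below assumes each record is a Python dictionary keyed by attribute name; the function names are illustrative.

```python
from collections import Counter

def anonymity(records, qid_attrs):
    """A(QID): the minimum a(qid) over the qid values present in the table."""
    counts = Counter(tuple(r[a] for a in qid_attrs) for r in records)
    return min(counts.values())

def satisfies(records, requirement):
    """requirement: a list of (QID_i, h_i) pairs as in Definition 7.1."""
    return all(anonymity(records, qid) >= h for qid, h in requirement)

# Example 7.1-style usage, with records as dictionaries:
# satisfies(records, [(("Education", "Gender"), 4), (("Gender",), 4)])
```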

Definition 7.1 generalizes the classic notion of k-anonymity [202] by allowing multiple QIDs with different anonymity thresholds. The specification of multiple QIDs is based on an assumption that the data holder knows exactly what external information source is available for sensitive record linkage. The assumption is realistic in some data publishing scenarios.

FIGURE 7.1: Taxonomy trees. [Education: ANY → {Secondary, University}, Secondary → {Junior Sec., Senior Sec.}, Junior Sec. → {9th, 10th}, Senior Sec. → {11th, 12th}, University → {Bachelors, Grad School}, Grad School → {Masters, Doctorate}. Gender: ANY → {Male, Female}. Age (grown dynamically): [1-99) → {[1-37), [37-99)}, [1-37) → {[1-35), [35-37)}.]

Suppose that the data holder wants to release a table T′(A, B, C, D, S), where A, B, C, D are identifying attributes and S is a sensitive attribute, and knows that the recipient has access to previously released tables T1*(A, B, X) and T2*(C, D, Y), where X and Y are attributes not in T. To prevent linking the records in T to X or Y, the data holder only has to specify the anonymity requirement on QID1 = {A, B} and QID2 = {C, D}. In this case, enforcing anonymity on QID = {A, B, C, D} would distort the data more than is necessary. The experimental results in [94] confirm that the specification of multiple QIDs can reduce masking and, therefore, improve the data quality.

7.2.1.2 Masking Operations

To transform a table T to satisfy an anonymity requirement, one can apply the following three types of masking operations on every attribute Dj in ∪QIDi: if Dj is a categorical attribute with a pre-specified taxonomy tree, then we generalize Dj. Specifying taxonomy trees, however, requires expert knowledge of the data. In case the data holder lacks such knowledge or, for any reason, does not specify a taxonomy tree for the categorical attribute Dj, then we suppress Dj. If Dj is a numerical attribute without a pre-discretized taxonomy tree, then we discretize Dj. (A numerical attribute with a pre-discretized taxonomy tree is equivalent to a categorical attribute with a pre-specified taxonomy tree.) These three types of masking operations are formally described as follows:

1. Generalize Dj if it is a categorical attribute with a taxonomy tree specified by the data holder. Figure 7.1 shows the taxonomy trees for the categorical attributes Education and Gender. A leaf node represents a domain value and a parent node represents a less specific value. A generalized Dj can be viewed as a "cut" through its taxonomy tree. A cut of a tree is a subset of values in the tree, denoted by Cutj, that contains exactly one value on each root-to-leaf path. Figure 7.1 shows a cut on Education and Gender, indicated by the dashed line. If a value v is generalized to its parent, all siblings of v must also be generalized to its parent. This property ensures that a value and its ancestor values will not coexist in the generalized table T′. This subtree generalization scheme is discussed in Chapter 3.1.

2. Suppress Dj if it is a categorical attribute without a taxonomy tree. Suppressing a value on Dj means replacing all occurrences of the value with the special value ⊥j. All suppressed values on Dj are represented by the same value ⊥j. We use Supj to denote the set of values suppressed by ⊥j. This type of suppression is performed at the value level, in that Supj in general contains a subset of the values in the attribute Dj. A clustering algorithm treats ⊥j as a new value. Suppression can be viewed as a special case of sibling generalization by considering ⊥j to be the root of a taxonomy tree and child(⊥j) to contain all domain values of Dj. Refer to Chapter 3.1 for a detailed discussion of the sibling generalization scheme. In this suppression scheme, we could selectively suppress some values in child(⊥j) to ⊥j while other values in child(⊥j) remain intact.

3. Discretize Dj if it is a numerical attribute. Discretizing a value v on Dj means replacing all occurrences of v with an interval containing the value. The presented algorithm dynamically grows a taxonomy tree for intervals at runtime. Each node represents an interval. Each non-leaf node has two child nodes representing some optimal binary split of the parent interval. Figure 7.1 shows such a dynamically grown taxonomy tree for Age, where [1-99) is split into [1-37) and [37-99). More details will be discussed in Chapter 7.2.3.1. A discretized Dj can be represented by the set of intervals, denoted by Intj, corresponding to the leaf nodes in the dynamically grown taxonomy tree of Dj.

A masked table T can be represented by 〈∪Cutj, ∪Supj, ∪Intj〉, where Cutj, Supj, and Intj are defined above. If the masked table T′ satisfies the anonymity requirement, then 〈∪Cutj, ∪Supj, ∪Intj〉 is called a solution set. Generalization, suppression, and discretization have their own merits and flexibility; therefore, the unified framework presented in this chapter employs all of them.
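To show how the three operations and the solution set fit together, here is an illustrative Python sketch (not the book's implementation) that masks a single attribute value according to the current 〈Cutj, Supj, Intj〉; the ancestors() helper, which returns a value followed by its ancestors in the attribute's taxonomy tree, is an assumption of the sketch.

```python
def mask_value(value, attr, cuts, sups, ints, ancestors):
    """Mask one attribute value according to the current solution set.

    cuts[attr]      -- the current cut of the taxonomy tree (generalization)
    sups[attr]      -- the currently suppressed values Sup_j (suppression)
    ints[attr]      -- the current list of (low, high) intervals Int_j (discretization)
    ancestors(a, v) -- assumed helper: v and its ancestors, most specific first
    """
    if attr in cuts:                                  # categorical, with taxonomy tree
        for v in ancestors(attr, value):
            if v in cuts[attr]:
                return v                              # generalize up to the cut
    if attr in sups:                                  # categorical, no taxonomy tree
        return "⊥" if value in sups[attr] else value  # suppress only listed values
    if attr in ints:                                  # numerical
        for low, high in ints[attr]:
            if low <= value < high:
                return f"[{low}-{high})"              # replace by the covering interval
    return value
```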

7.2.1.3 Problem Statement

What kind of information should be preserved for cluster analysis? Unlike classification analysis, wherein the information utility of attributes can be measured by their power of identifying class labels [29, 95, 123, 150], no class labels are available for cluster analysis. One natural approach is to preserve the cluster structure in the raw data. Any loss of structure due to the anonymization is measured relative to such a "raw cluster structure." We define the anonymization problem for cluster analysis as follows to reflect this natural choice of approach.


DEFINITION 7.2 Anonymization problem for cluster analysis Given an unlabelled table T, an anonymity requirement {〈QID1, h1〉, . . . , 〈QIDp, hp〉}, and an optional taxonomy tree for each categorical attribute in ∪QIDi, the anonymization problem for cluster analysis is to mask T on the attributes ∪QIDi such that the masked table T′ satisfies the anonymity requirement and has a cluster structure as similar as possible to the cluster structure in the raw table T.

Intuitively, two cluster structures, before and after masking, are similar if the following two conditions are generally satisfied:

1. two objects that belong to the same cluster before masking remain in the same cluster after masking, and

2. two objects that belong to different clusters before masking remain in different clusters after masking.

A formal measure for the similarity of two structures will be discussed in Chapter 7.2.4.

7.2.2 Overview of Solution Framework

Now we explain an algorithmic framework to generate a masked table T′, represented by a solution set 〈∪Cutj, ∪Supj, ∪Intj〉, that satisfies a given anonymity requirement and preserves as much as possible of the raw cluster structure.

Figure 7.2 provides an overview of the proposed framework. First, we generate the cluster structure in the raw table T and label each record in T by a class label. This labelled table, denoted by Tl, has a Class attribute that contains a class label for each record. Essentially, preserving the raw cluster structure amounts to preserving the power of identifying such class labels during masking. Masking that diminishes the difference among records belonging to different clusters (classes) is penalized. As the requirement is the same as in the anonymization problem for classification analysis, one can apply existing anonymization algorithms for classification analysis (Chapter 6) to achieve the anonymity, although none of them in practice can perform all three types of masking operations discussed in Chapter 7.2.1. We explain each step in Figure 7.2 as follows; a minimal code sketch of the overall pipeline is given after the step list.

1. Convert T to a labelled table Tl. Apply a clustering algorithm to T to identify the raw cluster structure, and label each record in T by its class (cluster) label. The resulting labelled table Tl has a Class attribute containing the labels.

2. Mask the labelled table Tl. Employ an anonymization algorithm for classification analysis to mask Tl. The masked T*l satisfies the given anonymity requirement.


FIGURE 7.2: The framework. [Diagram: on the data holder's side, Step 1 (clustering and labelling) converts the raw table T into a raw labelled table Tl; Step 2 (masking) produces the masked labelled table T*l; Step 3 (clustering and labelling) re-clusters the masked table; Step 4 compares the two cluster structures; Step 5 releases the masked table to the data recipient.]

3. Clustering on the masked T*l. Remove the labels from the masked T*l and then apply a clustering algorithm to the masked T*l, where the number of clusters is the same as in Step 1. By default, the clustering algorithm in this step is the same as the clustering algorithm in Step 1, but it can be replaced with the recipient's choice if this information is available. See more discussion below.

4. Evaluate the masked T*l. Compute the similarity between the cluster structure found in Step 3 and the raw cluster structure found in Step 1. The similarity measures the loss of cluster quality due to masking. If the evaluation is unsatisfactory, the data holder may repeat Steps 1-4 with a different specification of taxonomy trees, choice of clustering algorithms, masking operations, number of clusters, and anonymity thresholds if possible.

5. Release the masked T*l. If the evaluation in Step 4 is satisfactory, the data holder can release the masked T*l together with some optional supplementary information: all the taxonomy trees (including those generated at runtime for numerical attributes), the solution set, the similarity score computed in Step 4, and the class labels generated in Step 1.
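Under the assumption that masked values are re-encoded numerically before re-clustering, Steps 1-4 can be sketched as a short pipeline. KMeans is used here purely as an example clustering algorithm, and mask_for_classification() and similarity() stand in for a classification-oriented anonymizer (such as TDR) and the evaluation of Chapter 7.2.4; both helpers are assumptions of this sketch.

```python
from sklearn.cluster import KMeans

def anonymize_for_clustering(X, requirement, n_clusters,
                             mask_for_classification, similarity, seed=0):
    """Steps 1-4 of the framework (illustrative sketch), for a numeric matrix X."""
    # Step 1: cluster the raw data and use the cluster ids as class labels.
    raw_labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    # Step 2: mask the labelled table under the anonymity requirement.
    X_masked = mask_for_classification(X, raw_labels, requirement)
    # Step 3: re-cluster the masked data with the same number of clusters.
    masked_labels = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X_masked)
    # Step 4: compare the two cluster structures (e.g., F-measure or match point).
    score = similarity(raw_labels, masked_labels)
    return X_masked, score   # Step 5: release X_masked if the score is acceptable
```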

In some data publishing scenarios, the data holder does not even know who the prospective recipients are and, therefore, does not know how the recipients will cluster the published data. For example, when the Census Bureau releases data on the World Wide Web, how should the bureau set the parameters, such as the number of clusters, for the clustering algorithm in Step 1? In this case, we suggest releasing one version for each reasonable cluster number so that the recipient can make the choice based on her desired number of clusters, but this will cause a potential privacy breach because an adversary can further narrow down a victim's record by comparing different releases. A remedy is to employ the multiple views publishing and incremental data publishing methods, which guarantee privacy even in the presence of multiple releases. Chapters 8 and 10 discuss these extended publishing scenarios in detail.

7.2.3 Anonymization for Classification Analysis

To effectively mask both categorical and numerical attributes, we present an anonymization algorithm called top-down refinement (TDR) [96] that can perform all three types of masking operations in a unified fashion. TDR shares a structure similar to that of HDTDS discussed in Chapter 6.3, but HDTDS cannot perform suppression and, therefore, cannot handle categorical attributes without taxonomy trees.

TDR takes a labelled table and an anonymity requirement as inputs. The main idea of TDR is to perform maskings that preserve the information for identifying the class labels. The next example illustrates this point.

Example 7.2
Suppose that the raw cluster structure produced by Step 1 has the class (cluster) labels given in the Class attribute in Table 7.1. In Example 7.1 we generalize Masters and Doctorate into Grad School to make linking through (Education, Gender) more difficult. No information is lost in this generalization because the class label C1 does not depend on the distinction between Masters and Doctorate. However, further generalizing Bachelors and Grad School to University makes it harder to separate the two class labels involved.

Instead of masking a labelled table T*l starting from the most specific domain values, TDR masks T*l by a sequence of refinements starting from the most masked state, in which each attribute is generalized to the topmost value, suppressed to the special value ⊥, or represented by a single interval. TDR iteratively refines a masked value selected from the current set of cuts, suppressed values, and intervals, and stops if any further refinement would violate the anonymity requirement. A refinement is valid (with respect to T*l) if T*l satisfies the anonymity requirement after the refinement.


We formally describe the different types of refinements in Chapter 7.2.3.1, define the information metric for a single refinement in Chapter 7.2.3.2, and provide the anonymization algorithm TDR in Chapter 7.2.3.3.

7.2.3.1 Refinement

• Refinement for generalization. Consider a categorical attribute Dj with a pre-specified taxonomy tree. Let T*l[v] denote the set of generalized records that currently contain a generalized value v in the table T*l. Let child(v) be the set of child values of v in the pre-specified taxonomy tree of Dj. A refinement, denoted by v → child(v), replaces the parent value v in all records in T*l[v] with the child value c ∈ child(v), where c is either a domain value d in the raw record or a generalized value of d. For example, suppose a raw data record r contains the value Masters, and the value has been generalized to University in a masked table T*l. The refinement University → {Bachelors, Grad School} replaces University in r by Grad School because Grad School is a generalized value of Masters.

• Refinement for suppression. For a categorical attribute Dj without a taxonomy tree, a refinement ⊥j → {v, ⊥j} refers to disclosing one value v from the set of suppressed values Supj. Let T*l[⊥j] denote the set of suppressed records that currently contain ⊥j in the table T*l. Disclosing v means replacing ⊥j with v in all records in T*l[⊥j] that originally contain v.

• Refinement for discretization. For a numerical attribute, refinement is similar to that for generalization, except that no prior taxonomy tree is given and the taxonomy tree has to be grown dynamically in the process of refinement. Initially, the interval that covers the full range of the attribute forms the root. The refinement on an interval v, written v → child(v), refers to the optimal split of v into two child intervals child(v), which maximizes the information gain. Suppose there are i distinct values in an interval; then there are i − 1 possible splits. The optimal split can be efficiently identified by computing the information gain of each possible split in one scan of the data records containing such an interval of values. See Chapter 7.2.3.2 for the definition of information gain. Due to this extra step of identifying the optimal split of the parent interval, we treat numerical attributes separately from categorical attributes with taxonomy trees. A small sketch of this split search follows the list.
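The sketch below assumes the records' numerical values and class labels are available as two parallel lists; it scans the candidate boundaries between distinct values and returns the split point with the highest information gain (a straightforward, unoptimized version of the split search).

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy E(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Return (split_point, info_gain) for the best binary split of an interval."""
    pairs = sorted(zip(values, labels))
    n, base = len(pairs), entropy(labels)
    best_gain, best_split = -1.0, None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # only boundaries between distinct values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_gain = gain
            best_split = (pairs[i - 1][0] + pairs[i][0]) / 2   # midpoint split point
    return best_split, best_gain
```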

7.2.3.2 Information Metric

Algorithm 7.2.4 Top-Down Refinement (TDR)
1: initialize every value of Dj to the topmost value, or suppress every value of Dj to ⊥j, or include every continuous value of Dj into the full range interval, where Dj ∈ ∪QIDi.
2: initialize Cutj of Dj to include the topmost value, Supj of Dj to include all domain values of Dj, and Intj of Dj to include the full range interval, where Dj ∈ ∪QIDi.
3: while some candidate v in 〈∪Cutj, ∪Supj, ∪Intj〉 is valid do
4:   find the Best refinement from 〈∪Cutj, ∪Supj, ∪Intj〉.
5:   perform Best on T*l and update 〈∪Cutj, ∪Supj, ∪Intj〉.
6:   update Score(x) and validity for x ∈ 〈∪Cutj, ∪Supj, ∪Intj〉.
7: end while
8: return masked T*l and 〈∪Cutj, ∪Supj, ∪Intj〉.

Each refinement increases the information utility and decreases the anonymity of the table because records become more distinguishable by the refined values. The key is to select the best refinement at each step with both impacts considered. At each iteration, TDR greedily selects the refinement on value v that has the highest score, in terms of the information gain per unit of anonymity loss:

Score(v) = IGPL(v),  (7.1)

which has been discussed in Chapter 4.1. Refer to Equation 4.7 for a detailed discussion of InfoGain(v), AnonyLoss(v), and Score(v) if Dj is a numerical attribute or a categorical attribute with a taxonomy tree.

If Dj is a categorical attribute without a taxonomy tree, the refinement ⊥j → {v, ⊥j} means refining T′[⊥j] into T′[v] and T*′[⊥j], where T′[⊥j] denotes the set of records containing ⊥j before the refinement, and T′[v] and T*′[⊥j] denote the sets of records containing v and ⊥j after the refinement, respectively. We employ the same Score(v) function to measure the goodness of the refinement ⊥j → {v, ⊥j}, except that InfoGain(v) is now defined as:

InfoGain(v) = E(T′[⊥j]) − (|T′[v]| / |T′[⊥j]|) E(T′[v]) − (|T*′[⊥j]| / |T′[⊥j]|) E(T*′[⊥j]).  (7.2)

7.2.3.3 The Anonymization Algorithm (TDR)

Algorithm 7.2.4 summarizes the conceptual algorithm. All attributes notin ∪QIDi are removed from T ∗

l , and duplicates are collapsed into a singlerow with the Class column storing the count for each class label. Initially,Cutj contains only the topmost value for a categorical attribute Dj with ataxonomy tree, Supj contains all domain values of a categorical attributeDj without a taxonomy tree, and Intj contains the full range interval fora numerical attribute Dj . The valid refinements in 〈∪Cutj ,∪Supj,∪Intj〉form the set of candidates. At each iteration, we find the candidate of thehighest Score, denoted by Best (Line 4), apply Best to T ′ and update〈∪Cutj ,∪Supj,∪Intj〉 (Line 5), and update Score and the validity of thecandidates in 〈∪Cutj ,∪Supj,∪Intj〉 (Line 6). The algorithm terminates when


The algorithm terminates when there is no more candidate in 〈∪Cutj, ∪Supj, ∪Intj〉, in which case it returns the masked table together with the solution set 〈∪Cutj, ∪Supj, ∪Intj〉.
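The greedy loop of Algorithm 7.2.4 can be sketched as follows; the helper callables (is_valid, score, apply_refinement) are hypothetical stand-ins for the subroutines described above, so this is a structural outline rather than the book's implementation.

def top_down_refinement(masked_table, candidates, is_valid, score, apply_refinement):
    # candidates stands for the union of Cut_j, Sup_j, and Int_j
    while True:
        valid = [v for v in candidates if is_valid(v, masked_table)]
        if not valid:
            break                                                         # Line 3: no valid candidate left
        best = max(valid, key=lambda v: score(v, masked_table))           # Line 4: highest Score (IGPL)
        masked_table, candidates = apply_refinement(best, masked_table, candidates)  # Lines 5-6
    return masked_table, candidates                                       # Line 8: masked table and solution set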

The following example illustrates how to achieve a given anonymity requirement by performing a sequence of refinements, starting from the most masked table.

Example 7.3
Consider the labelled table in Table 7.1, where Education and Gender have pre-specified taxonomy trees and the anonymity requirement:

{〈QID1 = {Education, Gender}, 4〉, 〈QID2 = {Gender, Age}, 11〉}.

Initially, all data records are masked to

〈ANY Edu, ANY Gender, [1-99)〉,

and

∪Cuti = {ANY Edu, ANY Gender, [1-99)}.

To find the next refinement, we compute the Score for each of ANY Edu, ANY Gender, and [1-99). Table 7.2 shows the masked data after performing the following refinements in order:

[1-99) → {[1-37), [37-99)}
ANY Edu → {Secondary, University}
Secondary → {Junior Sec., Senior Sec.}
Senior Sec. → {11th, 12th}
University → {Bachelors, Grad School}.

After performing the refinements, the a(qidi) counts in the masked table are:

a(〈Junior Sec., ANY Gender〉) = 7
a(〈11th, ANY Gender〉) = 5
a(〈12th, ANY Gender〉) = 4
a(〈Bachelors, ANY Gender〉) = 10
a(〈Grad School, ANY Gender〉) = 8
a(〈ANY Gender, [1-37)〉) = 7 + 5 = 12
a(〈ANY Gender, [37-99)〉) = 4 + 10 + 8 = 22.

The solution set ∪Cuti is:

{Junior Sec., 11th, 12th, Bachelors, Grad School, ANY Gender, [1-37), [37-99)}.


Table 7.2: The masked table, satisfying {〈QID1, 4〉, 〈QID2, 11〉}

Rec ID   Education     Gender   Age       ...   Count
1-7      Junior Sec.   ANY      [1-37)    ...   7
8-12     11th          ANY      [1-37)    ...   5
13-16    12th          ANY      [37-99)   ...   4
17-26    Bachelors     ANY      [37-99)   ...   10
27-34    Grad School   ANY      [37-99)   ...   8
                                Total:    ...   34

7.2.4 Evaluation

This step compares the raw cluster structure found in Step 1 in Chapter 7.2.2, denoted by C, with the cluster structure found in the masked data in Step 3, denoted by Cg. Both C and Cg are extracted from the same set of records, so we can evaluate their similarity by comparing their record groupings. We present two evaluation methods: F-measure [231] and match point.

7.2.4.1 F-measure

F-measure [231] is a well-known evaluation method for cluster analysis with known cluster labels. The idea is to treat each cluster in C as the relevant set of records for a query, and treat each cluster in Cg as the result of a query. The clusters in C are called "natural clusters," and those in Cg are called "query clusters."

For a natural cluster Ci in C and a query cluster Kj in Cg, let |Ci| and |Kj| denote the number of records in Ci and Kj respectively, let nij denote the number of records contained in both Ci and Kj, and let |T| denote the total number of records in T′. The recall, precision, and F-measure for Ci and Kj are calculated as follows:

Recall(Ci, Kj) = nij / |Ci|, (7.3)

read as the fraction of relevant records retrieved by the query.

Precision(Ci, Kj) = nij / |Kj|, (7.4)

read as the fraction of relevant records among the records retrieved by the query.

F(Ci, Kj) = 2 × Recall(Ci, Kj) × Precision(Ci, Kj) / (Recall(Ci, Kj) + Precision(Ci, Kj)). (7.5)

F(Ci, Kj) measures the quality of query cluster Kj in describing the natural cluster Ci, by the harmonic mean of Recall and Precision.


Table 7.3: The masked labelled table for evaluation

Rec ID   Education     Gender   Age       ...   Class    Count
1-7      Junior Sec.   ANY      [1-37)    ...   K1       7
8-12     11th          ANY      [1-37)    ...   K1       5
13-16    12th          ANY      [37-99)   ...   K2       4
17-26    Bachelors     ANY      [37-99)   ...   K2       10
27-34    Grad School   ANY      [37-99)   ...   K2       8
                                                Total:   34

Table 7.4: The similarity of two cluster structures

Clusters in   Clusters in Table 7.3
Table 7.1     K1    K2
C1            2     19
C2            10    3

The success of preserving a natural cluster Ci is measured by the "best" query cluster Kj for Ci, i.e., the Kj that maximizes F(Ci, Kj). We measure the quality of Cg using the weighted sum of such maximum F-measures for all natural clusters. This measure is called the overall F-measure of Cg, denoted by F(Cg):

F(Cg) = Σ_{Ci ∈ C} (|Ci| / |T|) max_{Kj ∈ Cg} {F(Ci, Kj)}. (7.6)

Note that F(Cg) is in the range [0,1]. A larger value indicates a higher similarity between the two cluster structures generated from the raw data and the masked data, i.e., better preserved cluster quality.
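A minimal sketch of Equations 7.3-7.6, assuming both cluster structures are given as dictionaries mapping each record id to its cluster label (a hypothetical input format):

from collections import defaultdict

def overall_f_measure(natural, query):
    # natural: record id -> cluster label in C; query: record id -> cluster label in Cg
    C, K = defaultdict(set), defaultdict(set)
    for rec, label in natural.items():
        C[label].add(rec)
    for rec, label in query.items():
        K[label].add(rec)
    total = len(natural)
    overall = 0.0
    for ci in C.values():
        best_f = 0.0
        for kj in K.values():
            nij = len(ci & kj)                      # records in both Ci and Kj
            if nij == 0:
                continue
            recall, precision = nij / len(ci), nij / len(kj)
            best_f = max(best_f, 2 * recall * precision / (recall + precision))
        overall += len(ci) / total * best_f         # weighted by |Ci| / |T|
    return overall

Applied to the groupings of Example 7.4 below, this returns approximately 0.85, matching the overall F-measure computed there.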

Example 7.4
Table 7.3 shows a cluster structure with k = 2 produced from the masked Table 7.2. The first 12 records are grouped into K1, and the rest are grouped into K2. By comparing with the raw cluster structure in Table 7.1, we can see that, among the 21 records in C1, 19 remain in the same cluster K2 and only 2 are sent to a different cluster. C2 has a similar pattern. Table 7.4 shows the comparison between the clusters of the two structures. The calculations of Recall(Ci, Kj), Precision(Ci, Kj), and F(Ci, Kj) are illustrated below.

Recall(C1, K1) = 2/21 = 0.10
Precision(C1, K1) = 2/12 = 0.17
F(C1, K1) = 2 × Recall(C1, K1) × Precision(C1, K1) / (Recall(C1, K1) + Precision(C1, K1)) = 2 × 0.10 × 0.17 / (0.10 + 0.17) = 0.12

Recall(C1, K2) = 19/21 = 0.90
Precision(C1, K2) = 19/22 = 0.86
F(C1, K2) = 2 × Recall(C1, K2) × Precision(C1, K2) / (Recall(C1, K2) + Precision(C1, K2)) = 2 × 0.90 × 0.86 / (0.90 + 0.86) = 0.88

Recall(C2, K1) = 10/13 = 0.77
Precision(C2, K1) = 10/12 = 0.83
F(C2, K1) = 2 × Recall(C2, K1) × Precision(C2, K1) / (Recall(C2, K1) + Precision(C2, K1)) = 2 × 0.77 × 0.83 / (0.77 + 0.83) = 0.8

Recall(C2, K2) = 3/13 = 0.23
Precision(C2, K2) = 3/22 = 0.14
F(C2, K2) = 2 × Recall(C2, K2) × Precision(C2, K2) / (Recall(C2, K2) + Precision(C2, K2)) = 2 × 0.23 × 0.14 / (0.23 + 0.14) = 0.17

Table 7.5: The F-measure computed from Table 7.4

F(Ci, Kj)   K1     K2
C1          0.12   0.88
C2          0.8    0.17

Table 7.5 shows the F-measure. The overall F-measure is:

F(Cg) = (|C1| / |T|) × F(C1, K2) + (|C2| / |T|) × F(C2, K1) = (21/34) × 0.88 + (13/34) × 0.8 = 0.85.

F-measure is an efficient evaluation method, but it considers only the best query cluster Kj for each natural cluster Ci; therefore, it does not capture the quality of other query clusters and may not provide a full picture of the similarity between two cluster structures. An alternative evaluation method, called match point, can directly measure the preserved cluster structure.

7.2.4.2 Match Point

Intuitively, two cluster structures C and Cg are similar if two objects that belong to the same cluster in C remain in the same cluster in Cg, and if two objects that belong to different clusters in C remain in different clusters in Cg. To reflect the intuition, the method builds two square matrices Matrix(C) and Matrix(Cg) to represent the grouping of records in cluster structures C and Cg, respectively. The square matrices are |T|-by-|T|, where |T| is the total number of records in table T. The (i, j)th element in Matrix(C) (or Matrix(Cg)) has value 1 if the ith record and the jth record in the raw table T (or the masked table T′) are in the same cluster; 0 otherwise. Then, match point is the percentage of matched values between Matrix(C) and Matrix(Cg):

Match Point(Matrix(C), Matrix(Cg)) = (Σ_{1≤i,j≤|T|} Mij) / |T|², (7.7)


Table 7.6: Matrix(C) of clusters {1, 2, 3} and {4, 5}

    1  2  3  4  5
1   1  1  1  0  0
2   1  1  1  0  0
3   1  1  1  0  0
4   0  0  0  1  1
5   0  0  0  1  1

Table 7.7: Matrix(Cg) of clusters {1, 2} and {3, 4, 5}

    1  2  3  4  5
1   1  1  0  0  0
2   1  1  0  0  0
3   0  0  1  1  1
4   0  0  1  1  1
5   0  0  1  1  1

Table 7.8: Match Point table for Matrix(C) and Matrix(Cg)

    1  2  3  4  5
1   1  1  0  1  1
2   1  1  0  1  1
3   0  0  1  0  0
4   1  1  0  1  1
5   1  1  0  1  1

where Mij is 1 if the (i, j)th elements in Matrix(C) and Matrix(Cg) have the same value; 0 otherwise. Note that match point is in the range of [0,1]. A larger value indicates a higher similarity between the two cluster structures generated from the raw data and the masked data, i.e., better preserved cluster quality.

Example 7.5
Let C = {{1, 2, 3}, {4, 5}} be the clusters before anonymization. Let Cg = {{1, 2}, {3, 4, 5}} be the clusters after anonymization. C and Cg are plotted in Table 7.6 and Table 7.7, respectively. Table 7.8 is the Match Point table, which has a value 1 in a cell if the corresponding cells in Table 7.6 and Table 7.7 have the same value. Match Point = 17/5² = 17/25 = 0.68.
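A minimal sketch of Equation 7.7, assuming each cluster structure is given as a dictionary from record id to cluster label (a hypothetical input format); the pairwise comparison below is equivalent to comparing the two co-membership matrices cell by cell.

def match_point(before, after):
    # before, after: record id -> cluster label in C and in Cg, respectively
    ids = sorted(before)          # assumes both structures cover the same record ids
    n = len(ids)
    matches = 0
    for i in ids:
        for j in ids:
            same_before = before[i] == before[j]    # Matrix(C)[i][j]
            same_after = after[i] == after[j]       # Matrix(Cg)[i][j]
            matches += (same_before == same_after)  # M_ij
    return matches / (n * n)

For the clusterings of Example 7.5, match_point({1: "a", 2: "a", 3: "a", 4: "b", 5: "b"}, {1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}) returns 17/25 = 0.68.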

7.2.5 Discussion

We discuss some open issues and possible improvements in the studied privacy framework for cluster analysis.


7.2.5.1 Recipient Oriented vs. Structure Oriented

Refer to Figure 7.2. One open issue is the choice of clustering algorithms employed by the data holder in Step 1. Each clustering algorithm has its own search bias or preference. Experimental results in [94] suggest that if the same clustering algorithm is employed in Step 1 and Step 3, then the cluster structure from the masked data is very similar to the raw cluster structure; otherwise, the cluster structure in the masked data could not even be extracted. There are two methods for choosing clustering algorithms.

Recipient oriented. This approach minimizes the difference generated if the recipient had applied her clustering algorithm to both the raw data and the masked data. It requires the clustering algorithm in Step 1 to be the same, or to use the same bias, as the recipient's algorithm. One can implement this approach in a similar way as for determining the cluster number: either the recipient provides her clustering algorithm information, or the data holder releases one version of masked data for each popular clustering algorithm, leaving the choice to the recipient. Refer to Chapter 10 for handling potential privacy breaches caused by multiple releases.

Structure oriented. This approach focuses on preserving the "true" cluster structure in the data instead of matching the recipient's choice of algorithms. Indeed, if the recipient chooses a bad clustering algorithm, matching her choice may minimize the difference but is not helpful for cluster analysis. This approach aims at preserving the "truthful" cluster structure by employing a robust clustering algorithm in Step 1 and Step 3. Dave and Krishnapuram [57] specify a list of requirements in order for a clustering algorithm to be robust. The principle is that "the performance of a robust clustering algorithm should not be affected significantly by small deviations from the assumed model and it should not deteriorate drastically due to noise and outliers." If the recipient employs a less robust clustering algorithm, it may not find the "true" cluster structure. This approach is suitable for the case in which the recipient's preference is unknown at the time of data release, and the data holder wants to publish only one or a small number of versions. Optionally, the data holder may release the class labels in Step 1 as a sample clustering solution.

7.2.5.2 Summary of Empirical Study

Experiments on real-life data [94] have verified the claim that the proposed approach of converting the anonymity problem for cluster analysis to the counterpart problem for classification analysis is effective. This is demonstrated by the preservation of most of the cluster structure in the raw data after masking identifying information for a broad range of anonymity requirements. The experimental results also suggest that the cluster quality-guided anonymization can preserve better cluster structure than the general purpose anonymization.

The experiments demonstrate the cluster quality with respect to the variation of anonymity thresholds, QID size, and number of clusters.


In general, the cluster quality degrades as the anonymity threshold increases. This trend is more obvious if the data set size is small or if the anonymity threshold is large. The cluster quality degrades as the QID size increases. The cluster quality exhibits no obvious trend with respect to the number of clusters, as the natural number of clusters is data dependent.

The experiments confirm that the specification of the multi-QID anonymity requirement helps avoid unnecessary masking and, therefore, preserves more of the cluster structure. However, if the data recipient and the data holder employ different clustering algorithms, then there is no guarantee that the encoded raw cluster structure can be extracted. Thus, in practice, it is important for the data holder to validate the cluster quality, using the evaluation methods proposed in the framework, before releasing the data. Finally, experiments suggest that the proposed anonymization approach is highly efficient and scalable for single QID, but less efficient for multi-QID.

7.2.5.3 Extensions

Chapter 7.2 presents a flexible framework that makes use of existing solutions as "plug-in" components. These include the cluster analysis in Steps 1 and 3, the anonymization in Step 2, and the evaluation in Step 4. For example, instead of using the proposed TDR algorithm, the data holder has the option to perform the anonymization by employing any one of the anonymization methods discussed in Chapter 6 with some modification.

The solution presented above focuses on preventing the privacy threats caused by record linkages, but the framework is extendable to thwart attribute linkages by adopting different anonymization algorithms and achieving other privacy models, such as ℓ-diversity and confidence bounding, discussed in Chapter 2. The extension requires modification of the Score or cost functions in these algorithms to bias on refinements or maskings that can distinguish class labels. The framework can also adopt other evaluation methods, such as entropy [210], or any ad-hoc methods defined by the data holder.

The study in TDR focuses mainly on single-dimensional global recoding. Alternative masking operations, such as local recoding and multidimensional recoding, for achieving k-anonymity and its extended privacy notions are also applicable to the problem. Nonetheless, it is important to note that local recoding and multidimensional recoding may suffer from the data exploration problem discussed in Chapter 3.1.

One useful extension of privacy-preserving data publishing for cluster analysis is to build a visualization tool that allows the data holder to adjust the parameters, such as the number of clusters and anonymity thresholds, and visualize their influence on the clusters interactively.


7.3 Dimensionality Reduction-Based Transformation

Oliveira and Zaiane [182] introduce an alternative data anonymization approach, called Dimensionality Reduction-Based Transformation (DRBT), to address the problem of privacy-preserving clustering (PPC) defined as follows.

DEFINITION 7.3 Problem of privacy-preserving clustering (PPC)
Let T be a relational database and C be a set of clusters generated from T. The goal is to transform T into T′ so that the following restrictions hold:

• The transformed T ′ conceals the values of the sensitive attributes.

• The similarity between records in T′ has to be approximately the same as that in T. The clusters in T and T′ should be as close as possible.

The general principle of the PPC problem [182] is similar to the anonymization problem for cluster analysis discussed in Definition 7.2. The major differences are in the privacy model. First, the problem of PPC aims at anonymizing the sensitive attributes, while most of the k-anonymization related methods aim at anonymizing the quasi-identifying attributes. Second, PPC assumes that all the attributes to be transformed are numerical. The general idea of DRBT is to transform m-dimensional objects into r-dimensional objects, where r is much smaller than m. The privacy is guaranteed by ensuring that the transformation is non-invertible.

7.3.1 Dimensionality Reduction

DRBT assumes that the data records are represented as points (vectors) in a multidimensional space. Each dimension represents a distinct attribute of the individual. The database is represented as an n × m matrix with n records and m columns. The goal of dimensionality reduction methods is to map m-dimensional records into r-dimensional records, where r ≪ m [145]. Dimensionality reduction methods map each object to a point in an r-dimensional space with the goal of minimizing the stress function, which measures the average relative error that the distances in the r-dimensional space suffer from:

stress = Σ_{i,j} (d̂ij − dij)² / Σ_{i,j} dij², (7.8)

where d̂ij is the dissimilarity measure between records i and j in the r-dimensional space, and dij is the dissimilarity measure between records i and j in the m-dimensional space [182].
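A minimal sketch of the stress function in Equation 7.8, assuming Euclidean distance as the dissimilarity measure and NumPy arrays as input (both are assumptions for illustration, not requirements of DRBT):

import numpy as np

def pairwise_distances(X):
    # Euclidean distance between every pair of rows of X
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(original, reduced):
    # original: n-by-m data set; reduced: n-by-r projected data set
    d = pairwise_distances(original)      # d_ij in the m-dimensional space
    d_hat = pairwise_distances(reduced)   # d^_ij in the r-dimensional space
    return ((d_hat - d) ** 2).sum() / (d ** 2).sum()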

Random projection is a linear transformation represented by an m × r matrix R.


The transformation is achieved by first setting each entry of the matrix to a value drawn from an independent and identically distributed N(0, 1) distribution with zero mean and unit variance, and then normalizing the columns to unit length. Given an m-dimensional data set represented as an n × m matrix T, the mapping T × R results in a reduced-dimension data set:

T′_{n×r} = T_{n×m} R_{m×r}. (7.9)

In the reduced space, the distance between two m-dimensional records Xi and Xj can be approximated by the scaled-down Euclidean distance of these records, as shown in Equation 7.10:

√(m/r) ‖Xi − Xj‖, (7.10)

where the scaling term √(m/r) takes the decrease in the dimensionality of the data into consideration [182]. The key point in this method is to choose a random matrix R. DRBT employs the following simple distribution [2] for each element eij in R:

eij = √3 × { +1 with probability 1/6;  0 with probability 2/3;  −1 with probability 1/6 }. (7.11)
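A minimal sketch of the random projection step, supporting both the Gaussian construction of R described above and the sparse distribution of Equation 7.11; the function name, the seed parameter, and the use of NumPy are assumptions for illustration only.

import numpy as np

def random_projection(T, r, sparse=True, seed=0):
    # T: n-by-m data matrix; returns the n-by-r projected data T' = T R
    rng = np.random.default_rng(seed)
    m = T.shape[1]
    if sparse:
        # Equation 7.11: entries are sqrt(3) * {+1, 0, -1} with probabilities 1/6, 2/3, 1/6
        R = np.sqrt(3) * rng.choice([1.0, 0.0, -1.0], size=(m, r), p=[1/6, 2/3, 1/6])
    else:
        # N(0, 1) entries, columns normalized to unit length
        R = rng.standard_normal((m, r))
        R /= np.linalg.norm(R, axis=0)
    return T @ R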

7.3.2 The DRBT Method

The DRBT method addresses the problem of privacy-preserving clustering in three steps [182].

1. Attribute suppression: Suppress the attributes that are irrelevant to the clustering task.

2. Dimension reduction: Transform the original data set T into the anonymized data set T′ using random projection. We apply Equation 7.9 to reduce the dimension of T with m dimensions to T′ with r dimensions. The random matrix R is computed by first setting each entry of the matrix to a value drawn from an independent and identically distributed N(0, 1) distribution and then normalizing the columns to unit length. Alternatively, each element eij in matrix R can be computed by using Equation 7.11.

3. Cluster evaluation: Evaluate the cluster quality of the anonymized data set T′ with respect to the original data set T using the stress function in Equation 7.8.


7.4 Related Topics

We discuss some privacy works that are orthogonal to the anonymization problem of cluster analysis studied in this chapter. There is a family of anonymization methods [8, 9, 10, 13] that achieves privacy by clustering similar data records together. Their objective is very different from the studied problem in this chapter, which is publishing data for cluster analysis. Aggarwal and Yu [8] propose an anonymization approach, called condensation, to first condense the records into multiple non-overlapping groups in which each group has a size of at least h records. Then, for each group, the method extracts some statistical information, such as sum and covariance, that suffices to preserve the mean and correlation across different attributes. Finally, based on the statistical information, the method generates synthetic data records for each group. Refer to Chapter 5.1.3.1 for more details on condensation.

In a similar spirit, r-gather clustering [13] partitions records into several clusters such that each cluster contains at least r data points. Then the cluster centers, together with their size, radius, and a set of associated sensitive values, are released. Compared to the masking approach, one limitation of the clustering approach is that the published records are "synthetic" in that they may not correspond to the real world entities represented by the raw data. As a result, the analysis result is difficult to justify if, for example, a police officer wants to determine the common characteristics of some criminals from the data records. Refer to Chapter 5.1.3.2 for more details on r-gather clustering.

Many secure protocols have been proposed for distributed computation among multiple parties. For example, Vaidya and Clifton [229] and Inan et al. [121] present secure protocols to generate a clustering solution from vertically and horizontally partitioned data owned by multiple parties. In their model, accessing data held by other parties is prohibited, and only the final cluster solution is shared among participating parties. We consider a completely different problem, of which the goal is to share data that is immunized against privacy attacks.

7.5 Summary

We have studied the problem of releasing person-specific data for cluster analysis while protecting privacy. The top-down-specialization solution discussed in Chapter 7.2 is to mask unnecessarily specific information into a less specific but semantically consistent version, so that person-specific identifying information is masked but essential cluster structure remains [94].


The major challenge is the lack of class labels that could be used to guide the masking process. This chapter illustrates a general framework for converting this problem into the counterpart problem for classification analysis so that the masking process can be properly guided. The key idea is to encode the original cluster structure into the class label of data records and subsequently preserve the class labels for the corresponding classification problem. The experimental results verified the effectiveness of this approach.

We also studied several practical issues arising from applying this approach in a real-life data publishing scenario. These include how the choices of clustering algorithms, number of clusters, anonymity threshold, and size and type of quasi-identifiers can affect the effectiveness of this approach, and how to evaluate the effectiveness in terms of cluster quality. These studies lead to the recommendation of two strategies for choosing the clustering algorithm in the masking process, each having a different focus. The materials provide a useful framework of secure data sharing for the purpose of cluster analysis.

In Chapter 7.3, we have studied an efficient anonymization algorithm called Dimensionality Reduction-Based Transformation (DRBT) [182]. The general idea is to consider the raw data table as an m-dimensional matrix and reduce it to an r-dimensional matrix by using random projection. The data model considered in DRBT has several limitations. First, the distortion is applied to the sensitive attributes, which could be important for the task of cluster analysis. If the sensitive attributes are not important for cluster analysis, they should be removed first. Second, the DRBT method is applicable only to numerical attributes. Yet, real-life databases often contain both categorical and numerical attributes.


Part III

Extended Data Publishing Scenarios


Chapter 8

Multiple Views Publishing

8.1 Introduction

So far, we have considered the simplest case where the data is published to a single recipient. In practice, the data is often published to multiple recipients, and different data recipients may be interested in different attributes. Suppose there is a person-specific data table T(Job, Sex, Age, Race, Disease, Salary). A data recipient (for example, a pharmaceutical company) is interested in classification modeling of the target attribute Disease with attributes {Job, Sex, Age}. Another data recipient (such as a social service department) is interested in clustering analysis on {Job, Age, Race}.

One approach is to publish a single view on {Job, Sex, Age, Race} for both purposes. A drawback is that information is unnecessarily released in that neither of the two purposes needs all four attributes; it is more vulnerable to attacks. Moreover, if the information needed in the two cases is different, the data anonymized in a single view may not be good for either of the two cases. A better approach is to anonymize and publish a tailored view for each data mining purpose; each view is anonymized to best address the specific need of that purpose.

Suppose a data holder has released multiple views of the same underlying raw data. Even if the data holder releases one view to each data recipient based on their information needs, it is difficult to prevent them from colluding with each other behind the scenes. Thus, some recipient may have access to multiple or even all views. In particular, an adversary can combine attributes from the two views to form a sharper QID that contains attributes from both views. The following example illustrates the join attack in multiple views.

Example 8.1
Consider the data in Table 8.1 and Table 8.2. Suppose that the data holder releases one projection view T1 to one data recipient and releases another projection view T2 to another data recipient. Both views are from the same underlying patient table. Further suppose that the data holder does not want {Age, Birthplace} to be linked to Disease.


Table 8.1: Multiple views publishing: T1

Age   Job           Class
30    Lawyer        c1
30    Lawyer        c1
40    Carpenter     c2
40    Electrician   c3
50    Engineer      c4
50    Clerk         c4

Table 8.2: Multiple views publishing: T2

Job           Birthplace   Disease
Lawyer        US           Cancer
Lawyer        US           Cancer
Carpenter     France       HIV
Electrician   UK           Cancer
Engineer      France       HIV
Clerk         US           HIV

Table 8.3: Multiple views publishing: the join of T1 and T2

Age   Job           Birthplace   Disease   Class
30    Lawyer        US           Cancer    c1
30    Lawyer        US           Cancer    c1
40    Carpenter     France       HIV       c2
40    Electrician   UK           Cancer    c3
50    Engineer      France       HIV       c4
50    Clerk         US           HIV       c4
30    Lawyer        US           Cancer    c1
30    Lawyer        US           Cancer    c1

When T1 and T2 are examined separately, the Age = 40 group and the Birthplace = France group have size 2. However, by joining T1 and T2 using T1.Job = T2.Job (see Table 8.3), an adversary can uniquely identify the record owner in the {40, France} group, thus linking {Age, Birthplace} to Disease without difficulty. Moreover, the join reveals the inference {30, US} → Cancer with 100% confidence for the record owners in the {30, US} group. Such inference cannot be made when T1 and T2 are examined separately [236].
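A minimal sketch of the join attack in Example 8.1; the list encodings of Tables 8.1 and 8.2 below are hypothetical, and the group counts reproduce the size-1 {40, France} group.

from collections import Counter

# Hypothetical encodings of Table 8.1 (Age, Job, Class) and Table 8.2 (Job, Birthplace, Disease)
T1 = [(30, "Lawyer", "c1"), (30, "Lawyer", "c1"), (40, "Carpenter", "c2"),
      (40, "Electrician", "c3"), (50, "Engineer", "c4"), (50, "Clerk", "c4")]
T2 = [("Lawyer", "US", "Cancer"), ("Lawyer", "US", "Cancer"),
      ("Carpenter", "France", "HIV"), ("Electrician", "UK", "Cancer"),
      ("Engineer", "France", "HIV"), ("Clerk", "US", "HIV")]

# Natural join on Job, as in Table 8.3
join = [(age, job, bp, dis, cls)
        for (age, job, cls) in T1
        for (job2, bp, dis) in T2 if job == job2]

# Group sizes on the cross-view quasi-identifier {Age, Birthplace}
groups = Counter((age, bp) for (age, _, bp, _, _) in join)
print(groups[(40, "France")])   # 1 -> unique re-identification across the two views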

In this chapter, we study several works that measure information disclosure arising from linking two or more views. Chapter 8.2 studies Yao et al. [262]'s method for detecting k-anonymity violations on a set of views, each view obtained from a projection and selection query. Kifer and Gehrke [139] propose to increase the utility of published data by releasing additional marginals that are essentially duplicate-preserving projection views.


Chapter 8.3 studies the privacy threats caused by the marginals and a detection method to check the violations of some given k-anonymity and ℓ-diversity requirements.

8.2 Checking Violations of k-Anonymity on Multiple Views

We first illustrate violations of k-anonymity in the data publishing scenario where data in a raw data table T are being released in the form of a view set. A view set is a pair (V, v), where V is a list of selection-projection queries (q1, . . . , qn) on T, and v is a list of relations (r1, . . . , rn) without duplicate records [262]. Then, we also consider the privacy threats caused by functional dependency as prior knowledge, followed by a discussion on the violation detection methods. For a data table T, Π(T) and σ(T) denote the projection and selection over T.

8.2.1 Violations by Multiple Selection-Project Views

Yao et al. [262] assume that privacy violation takes the form of linkages, that is, pairs of values appearing in the same data record. For example, neither Calvin nor HIV in Table 8.4 alone is sensitive, but the linkage of the two values is. Instead of publishing the raw data table that contains sensitive linkages, Yao et al. consider publishing the materialized views. For example, given raw data table T in Table 8.4, sensitive linkages can be defined by ΠName,Disease(T), while the published data are two views, V1 = ΠName,Job(T) and V2 = ΠJob,Disease(T). Given V1 and V2 as shown in Table 8.5 and Table 8.6, an adversary can easily deduce that Calvin has HIV because Calvin is a lawyer in V1 and the lawyer in V2 has HIV, assuming that the adversary has the knowledge that V1 and V2 are projections of the same underlying raw data table.

Checking k-anonymity on multiple views could be a complicated task if the views are generated based on some selection conditions. The following example illustrates this point.

Example 8.2
Tables 8.7-8.9 show three views that are generated based on different projection and selection conditions of Table 8.4. By performing an intersection between V3 and V4, an adversary knows that Bob is the only person in the raw table T who has an age between 45 and 55. By further observing V5, the adversary can infer that Bob has Diabetes because the selection condition of V5 satisfies the selection condition of the intersection of V3 and V4.


Table 8.4: Raw data table T

Name     Job      Age   Disease
Alice    Cook     40    Flu
Bob      Cook     50    Diabetes
Calvin   Lawyer   60    HIV

Table 8.5: View V1 = ΠName,Job(T)

Name     Job
Alice    Cook
Bob      Cook
Calvin   Lawyer

Table 8.6: View V2 = ΠJob,Disease(T)

Job      Disease
Cook     SARS
Cook     Diabetes
Lawyer   HIV

Example 8.2 illustrates that an adversary may be able to infer the sensitive value of a target victim from multiple selection-project views. Yao et al. [262] present a comprehensive study to detect violations of k-anonymity on multiple views. The following example shows the intuition of the detection method. Refer to [262] for details of the violation detection algorithm.

Example 8.3
Consider views V3, V4, and V5 in Table 8.7, Table 8.8, and Table 8.9, respectively.

Table 8.7: View V3 = ΠName σAge>45(T)

Name
Bob
Calvin

Table 8.8: View V4 = ΠName σAge<55(T)

Name
Alice
Bob

Table 8.9: View V5 = ΠDisease σ40<Age<55(T)

Disease
Diabetes


FIGURE 8.1: Views V3, V4, and V5. The figure depicts the record sets of V3 ((Bob, *, Age>45, *) and (Calvin, *, Age>45, *)), V4 ((Alice, *, Age<55, *) and (Bob, *, Age<55, *)), and V5 ((*, *, 45<Age<55, Diabetes)), together with their complements ((*, *, Age<=45, *), (*, *, Age>=55, *), and (*, *, Age<=45 or Age>=55, *)) and their intersections; all intersections are empty except (Bob, *, 45<Age<55, Diabetes), which contains a single record.

We use (Name, Job, Age, Disease) to denote the set of records matching the selection conditions. For example, (Alice, ∗, Age < 55, ∗) denotes the set of records that have Name = Alice and Age < 55, while Job and Disease can be any values. The complement record set of (Alice, ∗, Age < 55, ∗) is (∗, ∗, Age ≥ 55, ∗).

The general idea of the k-anonymity violation detection algorithm is to intersect the views that are generated based on the selection conditions, and to check whether or not every non-empty intersection contains at least k records. In Figure 8.1, we plot the three sets of record sets of V3, V4, and V5, and the complement record sets. In the upper portion of Figure 8.1, we illustrate a record set for the projection fact of Diabetes in V5, which is (Bob, ∗, 45 < Age < 55, Diabetes). Since the record set represented by the intersection contains only one record, the three views violate 2-anonymity. Figure 8.2 further illustrates the intersection of two sets of record sets of V3 and V4.


FIGURE 8.2: Views V3 and V4. The intersections of the record sets of V3 (rows) and V4 (columns) are:

                         (Alice, *, Age<55, *)     (Bob, *, Age<55, *)       (*, *, Age>=55, *)
(Bob, *, Age>45, *)      Empty                     (Bob, *, 45<Age<55, *)    (Bob, *, Age>=55, *)
(Calvin, *, Age>45, *)   Empty                     Empty                     (Calvin, *, Age>=55, *)
(*, *, Age<=45, *)       (Alice, *, Age<=45, *)    (Bob, *, Age<=45, *)      Empty
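A minimal sketch of the record-set intersection idea behind this detection method (not the algorithm of [262] itself); record sets are represented here by a hypothetical (name, age_lo, age_hi) triple, where name None stands for "any".

def intersect(rs1, rs2):
    # Intersect two record sets of the form (name, age_lo, age_hi);
    # returns None if the intersection is empty.
    n1, lo1, hi1 = rs1
    n2, lo2, hi2 = rs2
    if n1 is not None and n2 is not None and n1 != n2:
        return None
    name = n1 if n1 is not None else n2
    lo, hi = max(lo1, lo2), min(hi1, hi2)
    return None if lo >= hi else (name, lo, hi)

# V3 says Bob and Calvin have Age > 45; V4 says Alice and Bob have Age < 55.
v3 = [("Bob", 45, 999), ("Calvin", 45, 999)]
v4 = [("Alice", 0, 55), ("Bob", 0, 55)]

survivors = []
for a in v3:
    for b in v4:
        r = intersect(a, b)
        if r is not None:
            survivors.append(r)

print(survivors)   # [('Bob', 45, 55)]: the 45 < Age < 55 group has size 1,
                   # so observing the Diabetes fact of V5 violates 2-anonymity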

8.2.2 Violations by Functional Dependencies

Another type of violation is caused by the presence of functional dependencies as the adversary's background knowledge. A functional dependency (FD) is a constraint between two sets of attributes in a table from a database. Given a data table T, a set of attributes X in T is said to functionally determine another attribute Y, also in T, denoted by (X → Y), if and only if each X value is associated with precisely one Y value. Even though the data holder does not release such functional dependencies to the data recipients, the data recipients often can obtain such knowledge by interpreting the semantics of the attributes. For example, the functional dependency Name → Age can be derived by common sense. Thus, it is quite possible that the adversary could use functional dependencies as background knowledge. The following example illustrates the privacy threats caused by functional dependency.

Example 8.4
Consider the raw data R in Table 8.10, and two projection views V6 and V7 in Table 8.11 and Table 8.12, respectively. Every record in V6 has a corresponding record in V7. Suppose an adversary knows there is a functional dependency Name → Disease, meaning that every patient has only one disease. The adversary then can infer that Alice has Cancer because Alice has two records in V6 and Cancer is the only value appearing twice in V7.

8.2.3 Discussion

Yao et al. [262] present a comprehensive study on k-anonymity with multiple views. First, they proposed a violation detection algorithm for k-anonymity without functional dependencies.


Table 8.10: Raw data table R for illustrating privacy threats caused by functional dependency

Name    Doctor    Disease
Alice   Smith     Cancer
Alice   Thomson   Cancer
Brian   Smith     Flu
Cathy   Thomson   Diabetes

Table 8.11: View V6 = ΠName,Doctor(R)

Name    Doctor
Alice   Smith
Alice   Thomson
Brian   Smith
Cathy   Thomson

Table 8.12: View V7 = ΠDoctor,Disease(R)

Doctor    Disease
Smith     Cancer
Thomson   Cancer
Smith     Flu
Thomson   Diabetes

The data complexity of the detection algorithm is polynomial, O(max(|Vi|)^N), where |Vi| is the size of the view Vi and N is the number of dimensions. Second, the knowledge of functional dependency is available to the adversary. Yao et al. [262] suggest a conservative checking algorithm that always catches k-anonymity violations, but it may make mistakes when a view set actually does not violate k-anonymity.

In summary, the k-anonymity violation detection algorithm presented in [262] has several limitations. First, it does not consider the case that different views could have been generalized to different levels according to some taxonomy trees, which is very common in multiple views publishing. Second, it does not consider attribute linkages with confidence as discussed in Chapter 2.2. Third, Yao et al. [262] present algorithms only for detecting violations, not for anonymizing the views to achieve k-anonymity. Subsequent chapters in Part III address these limitations.

8.3 Checking Violations with Marginals

In addition to the anonymous base table, Kifer and Gehrke [139] propose to increase the utility of published data by releasing several anonymous marginals that are essentially duplicate-preserving projection views.


Table 8.13: Raw patient table

quasi-identifier     sensitive
Zip Code   Salary    Disease
25961      43K       HIV
25964      36K       Anthrax
25964      32K       Giardiasis
25964      42K       Cholera
25961      58K       Flu
25961      57K       Heart
25961      56K       Flu
25964      63K       Cholera
25964      61K       Heart
24174      34K       Flu
24174      48K       Cholera
24179      48K       Cholera
24179      40K       Heart

For example, Table 8.15 is the Salary marginal, Table 8.16 is the Zip/Disease marginal, and Table 8.17 is the Zip/Salary marginal for the (2, 3)-diverse base Table 8.14. The availability of additional marginals (views) provides additional information for data mining, but also poses new privacy threats. For example, if an adversary knows that a patient who lives in 25961 and has salary under 50K is in the raw patient Table 8.13, then the adversary can join Table 8.14 and Table 8.16 to infer that the patient has HIV because {Anthrax, Cholera, Giardiasis, HIV} ∩ {Flu, Heart, HIV} = {HIV}.

Table 8.14: (2,3)-diverse patient table

quasi-identifier     sensitive
Zip Code   Salary    Disease
2596*      ≤50K      Anthrax
2596*      ≤50K      Cholera
2596*      ≤50K      Giardiasis
2596*      ≤50K      HIV
2596*      >50K      Cholera
2596*      >50K      Flu
2596*      >50K      Flu
2596*      >50K      Heart
2596*      >50K      Heart
2417*      ≤50K      Cholera
2417*      ≤50K      Cholera
2417*      ≤50K      Flu
2417*      ≤50K      Heart


Table 8.15: Salary marginal

Salary      Count
[31K-35K]   2
[36K-40K]   2
[41K-45K]   2
[46K-50K]   2
[51K-55K]   0
[56K-60K]   3
[61K-65K]   2

Table 8.16: Zip code/disease marginal

Zip Code   Disease      Count
25961      Flu          2
25961      Heart        1
25961      HIV          1
25964      Cholera      2
25964      Heart        1
25964      Anthrax      1
25964      Giardiasis   1
24174      Flu          1
24174      Cholera      1
24179      Cholera      1
24179      Heart        1

Kifer and Gehrke [139] extend k-anonymity and ℓ-diversity for marginals and present a method to check whether published marginals violate the privacy requirement on the anonymous base table.
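The join attack described above boils down to intersecting the candidate sensitive values that are compatible with each published table; a minimal sketch with hypothetical set encodings of Table 8.14 and Table 8.16:

# Candidate diseases for the victim's equivalence class in the (2,3)-diverse table
diverse_group = {"Anthrax", "Cholera", "Giardiasis", "HIV"}   # QID group (2596*, <=50K)
# Diseases recorded for zip code 25961 in the Zip/Disease marginal
zip_diseases = {"Flu", "Heart", "HIV"}

print(diverse_group & zip_diseases)   # {'HIV'}: the sensitive value is disclosed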

Barak et al. [26] also study the privacy threats caused by marginals along the line of differential privacy [74] studied in Chapter 2.4.2. Their primary contribution is to provide a formal guarantee to preserve all of privacy, accuracy, and consistency in the published marginals. Accuracy bounds the difference between the original marginals and the published marginals. Consistency ensures that there exists a contingency table whose marginals equal the published marginals. Instead of adding noise to the original data records at the cost of accuracy, or adding noise to the published marginals at the cost of consistency, they have proposed to transform the original data into the Fourier domain, apply differential privacy to the transformed data by perturbation, and employ linear programming to obtain a non-negative contingency table based on the given Fourier coefficients.


Table 8.17: Zip code/salary marginal

Zip Code   Salary      Count
25961      [41K-50K]   1
25961      [51K-60K]   3
25964      [31K-40K]   2
25964      [41K-50K]   1
25964      [61K-70K]   2
24174      [31K-40K]   1
24174      [41K-50K]   1
24179      [31K-40K]   1
24179      [41K-50K]   1

8.4 MultiRelational k-Anonymity

Most works on k-anonymity focus on anonymizing a single data table. Ercan Nergiz et al. [178] propose a privacy model called MultiR k-anonymity to ensure k-anonymity on multiple relational tables. Their model assumes that a relational database contains a person-specific table PT and a set of tables T1, . . . , Tn, where PT contains a person identifier Pid and some sensitive attributes, and Ti, for 1 ≤ i ≤ n, contains some foreign keys, some attributes in QID, and sensitive attributes. The general privacy notion is to ensure that for each record owner o contained in the join of all tables PT ⋈ T1 ⋈ · · · ⋈ Tn, there exists at least k − 1 other record owners who share the same QID with o. It is important to emphasize that the k-anonymization is applied at the record owner level, not at the record level as in traditional k-anonymity. This idea is similar to (X, Y)-anonymity discussed in Chapter 2.1.2, where X = QID and Y = {Pid}.

8.5 Multi-Level Perturbation

Recently, Xiao et al. [252] study the problem of multi-level perturbation, whose objective is to publish multiple anonymous versions of the same underlying raw data set. Each version can be anonymized at a different privacy level depending on the trustability levels of the data recipients. The major challenge is that data recipients may collude and share their data to infer sensitive information beyond their permitted levels. Xiao et al.'s randomization method overcomes the challenge by ensuring that the colluding data recipients cannot learn any information that is more than the information known by the most trustworthy data recipient alone.


Another major strength of the randomization approach is that each data recipient can utilize the received data for privacy-preserving data mining as the output of conventional uniform randomization. Similar to other randomization approaches, this method does not preserve the data truthfulness at the record level.

8.6 Summary

In this chapter, we have studied several methods for detecting violations of privacy requirements on multiple views. These detection methods are applicable for detecting data view publications and database queries that violate privacy requirements. What can we do if some combinations of views or database queries violate the privacy requirement? A simple yet non-user-friendly solution is to reject the queries, or not to release the views. Obviously, this approach defeats the goal of data publishing and answering database queries. Thus, simply detecting violations is insufficient. The recent works on MultiR k-anonymity (Chapter 8.4) and multi-level perturbation (Chapter 8.5) present effective methods for anonymizing multiple relational databases and views at different privacy protection levels for different recipients.

The works on multiple views publishing share a common assumption: the entire raw data set is static and is available to the data provider at the time of anonymization and publication. Yet, in practice, this assumption may not hold, as new data arrive and a new release has to be published subsequently together with some previously published data. The adversary may combine multiple releases to sharpen the identification of a record owner and his sensitive information. In Chapters 9 and 10, we study several methods that anonymize multiple data releases and views without compromising the desirable property of preserving data truthfulness.


Chapter 9

Anonymizing Sequential Releases with New Attributes

9.1 Introduction

An organization makes a new release as new information becomes available, releases a tailored view for each data request, or releases sensitive information and identifying information separately. The availability of related releases sharpens the identification of individuals by a global quasi-identifier consisting of attributes from related releases. Since it is not an option to anonymize previously released data, the current release must be anonymized to ensure that a global quasi-identifier is not effective for identification.

In multiple views publishing studied in Chapter 8, several tables, for different purposes, are published at one time. In other words, a data holder already has all the raw data when the views are published. Since all the data are known, the data holder has the flexibility to anonymize different views in order to achieve the given privacy and information requirements on the views. In this chapter, we will study a more general data publishing scenario called sequential anonymization [236], in which the data holder knows only part of the raw data when a release is published, and has to ensure that subsequent releases do not violate a privacy requirement even if an adversary has access to all releases. A key question is how to anonymize the current release so that it cannot be linked to previous releases yet remains useful for its own release purpose. We will study an anonymization method [236] for this sequential release scenario. The general idea is to employ the lossy join, a negative property in relational database design, as a way to hide the join relationship among releases.

9.1.1 Motivations

k-anonymity addresses the problem of reducing the risk of record linkages in a person-specific table. Refer to Chapter 2.1.1 for details of k-anonymity and the notion of quasi-identifier. In the case of a single release, the notion of QID is restricted to the current table, and the database is made anonymous to itself.


Table 9.1: Raw data table T1

Pid   Job        Disease
1     Banker     Cancer
2     Banker     Cancer
3     Clerk      HIV
4     Driver     Cancer
5     Engineer   HIV

Table 9.2: Raw data table T2

Pid   Name    Job        Class
1     Alice   Banker     c1
2     Alice   Banker     c1
3     Bob     Clerk      c2
4     Bob     Driver     c3
5     Cathy   Engineer   c4

In many scenarios, however, related data were released previously: an organization makes a new release as new information becomes available, releases a separate view for each data sharing purpose (such as classifying a different target variable as studied in the Red Cross case in Chapter 6), or makes separate releases for personally-identifiable data (e.g., names) and sensitive data (e.g., DNA sequences) [164]. In such scenarios, the QID can be a combination of attributes from several releases, and the database must be made anonymous to the combination of all releases thus far.

The next example illustrates a scenario of sequential release: T2 was unknown when T1 was released, and T1, once released, cannot be modified when T2 is considered for release. This scenario is different from the multiple views publishing scenario studied in Chapter 8, where both T1 and T2 are parts of a view and can be modified before the release, which means more "room" to satisfy a privacy and information requirement. In the sequential release, each release has its own information need, and the join that enables a global identifier should be prevented. In the view release, however, all data in the views may serve the information need collectively, possibly through the join of all tables.

Example 9.1
Consider the data in Tables 9.1-9.3. Pid is the person identifier and is included only for discussion, not for release. Suppose the data holder has previously released T1 and now wants to release T2 for classification analysis of the Class column. Essentially T1 and T2 are two projection views of the patient records. T2.Job is a discriminator of Class, but this may not be known to the data holder.


Table 9.3: The join of T1 and T2 on T1.Job = T2.Job

Pid   Name    Job        Disease   Class
1     Alice   Banker     Cancer    c1
2     Alice   Banker     Cancer    c1
3     Bob     Clerk      HIV       c2
4     Bob     Driver     Cancer    c3
5     Cathy   Engineer   HIV       c4
-     Alice   Banker     Cancer    c1
-     Alice   Banker     Cancer    c1

The data holder does not want Name to be linked to Disease in the join of the two releases; in other words, the join should be k-anonymous on {Name, Disease}. Below are several observations that motivate the approach presented in this chapter.

1. Join sharpens identification: after the natural join (based on equality on all common attributes), the adversary can uniquely identify the individuals in the {Bob, HIV} group through the combination {Name, Disease} because this group has size 1. When T1 and T2 are examined separately, both the Bob group and the HIV group have size 2. Thus, the join may increase the risk of record linkages.

2. Join weakens identification: after the natural join, the {Alice, Cancer} group has size 4 because the records for different persons are matched (i.e., the last two records in the join table). When T1 and T2 are examined separately, both the Alice group and the Cancer group have a smaller size. In database terminology, the join is lossy. Since the join attack depends on matching the records for the same person, a lossy join can be used to combat the join attack. Thus, the join may decrease the risk of record linkages. Note that the ambiguity caused by the lossy join is not considered information loss here because the problem is to weaken the linkages of records among the releases.

3. Join enables inferences across tables: the natural join reveals the inference Alice → Cancer with 100% confidence for the individuals in the Alice group. Thus, the join may increase the risk of attribute linkages.

This sequential anonymization problem was briefly discussed in some pioneering works on k-anonymity, but they did not provide a practical solution. For example, Samarati and Sweeney [203] suggest to k-anonymize all potential join attributes as the QID in the next release Tp.


Sweeney [216] suggests to generalize Tp based on the previous releases T1, . . . , Tp−1 to ensure that all values in Tp are not more specific than in any of T1, . . . , Tp−1. Both solutions suffer from monotonically distorting the data in a later release: as more and more releases are published, a later release is more and more constrained by the previous releases, and thus more and more distorted. The third solution is to release a "complete" cohort in which all potential releases are anonymized at one time, after which no additional mechanism is required. This requires predicting future releases. "Under-prediction" means no room for additional releases and "over-prediction" means unnecessary data distortion. Also, this solution does not accommodate new data added at a later time.

The problem of sequential anonymization considers three motivating scenarios:

1. make a new release when new attributes arrive,

2. release some selected attributes for each data request, and

3. make separate releases for personally-identifiable attributes and sensitive attributes.

The objective is to anonymize the current release T2 in the presence of a previous release T1, assuming that T1 and T2 are projections of the same underlying table. The release of T2 must satisfy a given information requirement and privacy requirement. The information requirement could include such criteria as minimum classification error (Chapter 4.3) and minimum data distortion (Chapter 4.1). The privacy requirement states that, even if the adversary joins T1 with T2, he/she will not succeed in linking individuals to sensitive properties. This requirement is formalized as limiting the linking between two attribute sets X and Y over the join of T1 and T2. The privacy notion (X, Y)-privacy (Chapter 2.2.4) generalizes k-anonymity (Chapter 2.1.1) and confidence bounding (Chapter 2.2.2).

The basic idea is generalizing the current release T2 so that the join with the previous release T1 becomes lossy enough to disorient the adversary. Essentially, a lossy join hides the true join relationship to cripple a global quasi-identifier. In Chapter 9.1.2, we first formally define the problem, and show that sequential anonymization subsumes k-anonymization, thus the optimal solution is NP-hard. Then, we study a greedy method for finding a minimally generalized T2. To ensure the minimal generalization, the lossy join responds dynamically to each generalization step. Therefore, one challenge is checking the privacy violation over such a dynamic join because a lossy join can be extremely large. Another challenge is pruning, as early as possible, unpromising generalization steps that lead to privacy violation. To address these challenges, we will study a top-down approach, introduced by Wang and Fung [236], to progressively specialize T2 starting from the most generalized state in Chapter 9.3. It checks the privacy violation without executing the join and prunes unpromising specializations based on a proven monotonicity of (X, Y)-privacy in Chapter 9.2. Finally, the extension to more than one previous release is discussed in Chapter 9.4.


9.1.2 Anonymization Problem for Sequential Releases

For a table T, Π(T) and σ(T) denote the projection and selection over T, att(T) denotes the set of attributes in T, and |T| denotes the number of distinct records in T.

9.1.2.1 Privacy Model

In this problem, X and Y are assumed to be disjoint sets of attributes that describe individuals and sensitive properties in any order. An example is X = {Name, Job} and Y = {Disease}. There are two ways to limit the linking between X and Y: record linkage and attribute linkage.

To thwart record linkages, sequential anonymization employs the notion of (X, Y)-anonymity. Recall from Chapter 2.1.2 that (X, Y)-anonymity states that each value on X is linked to at least k distinct values on Y. k-anonymity is the special case where X serves as the QID and Y is a key in T. The next example shows the usefulness of (X, Y)-anonymity where Y is not a key in T and k-anonymity fails to provide the required anonymity.

Example 9.2
Consider the data table

Inpatient(Pid, Job, Zip, PoB, Test).

A record in the table represents that a patient identified by Pid has Job, Zip, PoB (place of birth), and Test. In general, a patient can have several tests, thus several records. Since QID = {Job, Zip, PoB} is not a key in the table, the k-anonymity on QID fails to ensure that each value on QID is linked to at least k (distinct) patients. For example, if each patient has at least 3 tests, it is possible that the k records matching a value on QID may involve no more than k/3 patients. With (X, Y)-anonymity, we can specify the anonymity with respect to patients by letting X = {Job, Zip, PoB} and Y = Pid, that is, each X group must be linked to at least k distinct values on Pid. If X = {Job, Zip, PoB} and Y = Test, each X group is required to be linked to at least k distinct tests.
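To make the measure concrete, here is a minimal sketch (not the method of [236]) that computes AY(X), the minimum number of distinct Y values linked to any single value on X, for a few hypothetical Inpatient-style records; the record values and the helper name a_y are assumptions for illustration.

from collections import defaultdict

def a_y(records, x_attrs, y_attr):
    """Return A_Y(X): the minimum number of distinct Y values
    linked to any single value on the attribute set X."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[a] for a in x_attrs)].add(r[y_attr])
    return min(len(ys) for ys in groups.values())

# Hypothetical Inpatient records: one row per test, so Pid is not a key.
inpatient = [
    {"Pid": "p1", "Job": "Nurse", "Zip": "Z3", "PoB": "UK", "Test": "HIV"},
    {"Pid": "p1", "Job": "Nurse", "Zip": "Z3", "PoB": "UK", "Test": "Eye"},
    {"Pid": "p1", "Job": "Nurse", "Zip": "Z3", "PoB": "UK", "Test": "Allergy"},
    {"Pid": "p2", "Job": "Nurse", "Zip": "Z3", "PoB": "UK", "Test": "Eye"},
]
X = ["Job", "Zip", "PoB"]
print(a_y(inpatient, X, "Pid"))    # 2: four records, but only 2 distinct patients
print(a_y(inpatient, X, "Test"))   # 3: the same X group is linked to 3 distinct tests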

Being linked to k persons or tests does not imply that the probability of being linked to any of them is 1/k if some person or test occurs more frequently than others. Thus, a large k does not necessarily limit the linking probability. The (X, Y)-linkability addresses this issue. Recall from Chapter 2.2.3 that (X, Y)-linkability limits the confidence of inferring a value on Y from a value on X. With X and Y describing individuals and sensitive properties, any such inference with a high confidence is a privacy breach.

Example 9.3
Suppose that (j, z, p) on X = {Job, Zip, PoB} occurs with the HIV test in 9 records and occurs with the Diabetes test in 1 record. The confidence of (j, z, p) → HIV is 90%. With Y = Test, the (X, Y)-linkability states that no test can be inferred from a value on X with a confidence higher than a given threshold.

Table 9.4: The patient data (T1)

PoB  Sex  Zip
UK   M    Z3
UK   M    Z3
UK   F    Z5
FR   F    Z5
FR   M    Z3
FR   M    Z3
US   M    Z3

Often, not all but only some values y on Y are sensitive, in which case Y can be replaced with a subset of yi values on Y, written Y = {y1, . . . , yp}, and a different threshold k can be specified for each yi. More generally, the problem of sequential anonymization allows multiple Yi, each representing a subset of values on a different set of attributes, with Y being the union of all Yi. For example, Y1 = {HIV} on Test and Y2 = {Banker} on Job. Such a "value-level" specification provides a great flexibility essential for minimizing the data distortion.

Example 9.4
Consider the medical test data and the patient data in Table 9.5 and Table 9.4. Suppose the data holder wants to release Table 9.5 for classification analysis on the Class attribute. Europeans (i.e., UK and FR) have the class label Yes and US has the class label No. Suppose that the data holder has previously released the patient data (T1) and now wants to release the medical test data (T2) for classification analysis on Class, but wants to prevent any inference of the value HIV using the combination {Job, PoB, Zip} in the join of T1 and T2. This requirement is specified as the (X, Y)-linkability, where

X = {Job, PoB, Zip} and Y = {y = HIV}.

Here PoB in X refers to {T1.PoB, T2.PoB} since PoB is a common attribute. In the join, the linkability from X to HIV is Ly(X) = 100% because all 4 joined records containing {Banker, UK, Z3} contain the value HIV. If the data holder can tolerate at most 90% linkability, Table 9.5 (T2) without modification is not safe for release.
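As a rough check of Example 9.4, the following sketch materializes the join of Tables 9.4 and 9.5 on the common attributes PoB and Sex (treated here as exact matches, since the released values are at the leaf level) and computes the linkability of HIV from X = {Job, PoB, Zip}; the function name and layout are illustrative, not the incremental method developed later in this chapter.

from collections import defaultdict

# Table 9.4, the patient data T1: (PoB, Sex, Zip)
T1 = [("UK","M","Z3"), ("UK","M","Z3"), ("UK","F","Z5"), ("FR","F","Z5"),
      ("FR","M","Z3"), ("FR","M","Z3"), ("US","M","Z3")]
# Table 9.5, the medical test data T2: (Test, Job, PoB, Sex, Class)
T2 = [("HIV","Banker","UK","M","Yes"), ("HIV","Banker","UK","M","Yes"),
      ("Eye","Banker","UK","F","Yes"), ("Eye","Clerk","FR","F","Yes"),
      ("Allergy","Driver","US","M","No"), ("Allergy","Engineer","US","M","No"),
      ("Allergy","Engineer","FR","M","Yes"), ("Allergy","Engineer","FR","M","Yes")]

def linkability(y_value):
    """Max confidence of inferring y_value on Test from a value on
    X = {Job, T2.PoB, T1.PoB, Zip} over the join of T1 and T2 on PoB and Sex."""
    a, a_y = defaultdict(int), defaultdict(int)
    for pob1, sex1, zip1 in T1:
        for test, job, pob2, sex2, _cls in T2:
            if pob1 == pob2 and sex1 == sex2:        # join predicate on PoB and Sex
                x = (job, pob2, pob1, zip1)           # the value on X in the join
                a[x] += 1
                a_y[x] += (test == y_value)
    return max(a_y[x] / a[x] for x in a)

print(linkability("HIV"))   # 1.0: all 4 joined records with {Banker, UK, Z3} are HIV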

Table 9.5: The medical test data (T2)

Test     Job       PoB  Sex  Class
HIV      Banker    UK   M    Yes
HIV      Banker    UK   M    Yes
Eye      Banker    UK   F    Yes
Eye      Clerk     FR   F    Yes
Allergy  Driver    US   M    No
Allergy  Engineer  US   M    No
Allergy  Engineer  FR   M    Yes
Allergy  Engineer  FR   M    Yes

When no distinction is necessary, we use the term (X, Y)-privacy to refer to either (X, Y)-anonymity or (X, Y)-linkability. The following corollary can be easily verified.

COROLLARY 9.1
Assume that X ⊆ X′ and Y′ ⊆ Y. For the same threshold k, if (X′, Y′)-privacy is satisfied, (X, Y)-privacy is satisfied.

9.1.2.2 Generalization and Specialization

One way to look at an (X, Y)-privacy requirement is that Y serves as the "reference point" with respect to which the privacy is measured and X is a set of "grouping attributes." For example, with Y = Test each test in Y serves as a reference point, AY(X) measures the minimum number of tests associated with a group value on X, and LY(X) measures the maximum confidence of inferring a test from X. To satisfy an (X, Y)-privacy requirement, Wang and Fung [236] employ the subtree generalization scheme to generalize X while fixing the reference point Y. The purpose is to increase AY(X) or decrease LY(X) as the values on X become more general. The subtree generalization scheme is more general than the full-domain generalization scheme, where all generalized values must be on the same level of the taxonomy tree. Refer to Chapter 3.1 for different generalization schemes.

A generalized table can be obtained by a sequence of specializations starting from the most generalized table. Each specialization is denoted by v → {v1, . . . , vc}, where v is the parent value and v1, . . . , vc are the child values of v. It replaces the value v in every record containing v with the child value vi that is consistent with the original domain value in the record. A specialization for a numerical attribute has the form v → {v1, v2}, where v1 and v2 are two sub-intervals of the larger interval v. Instead of being pre-determined, the splitting point of the two sub-intervals is chosen on-the-fly to maximize information utility. Refer to Example 6.4 in Chapter 6.2.2 for the technique of dynamically growing a taxonomy tree for a numerical attribute.
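A minimal sketch of applying one specialization v → {v1, . . . , vc}, assuming each record keeps both its original (leaf) value and its currently generalized value for the attribute; the taxonomy and field names below are illustrative and are not the data structures of [236].

# Toy taxonomy for the subtree generalization scheme: parent -> child values.
TAXONOMY = {"ANY_PoB": ["Europe", "US"], "Europe": ["UK", "FR"]}

def on_path(general, leaf):
    """True if `leaf` lies in the subtree rooted at `general`."""
    if general == leaf:
        return True
    return any(on_path(child, leaf) for child in TAXONOMY.get(general, []))

def specialize(records, attr, v):
    """Replace the generalized value v by the child value that is consistent
    with each record's original domain value."""
    for r in records:
        if r["gen"][attr] == v:
            r["gen"][attr] = next(c for c in TAXONOMY[v] if on_path(c, r["raw"][attr]))

records = [{"raw": {"PoB": "UK"}, "gen": {"PoB": "ANY_PoB"}},
           {"raw": {"PoB": "US"}, "gen": {"PoB": "ANY_PoB"}}]
specialize(records, "PoB", "ANY_PoB")
print([r["gen"]["PoB"] for r in records])   # ['Europe', 'US']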

9.1.2.3 Sequential Releases

Consider a previously released table T1 and the current table T2, where T1 and T2 are projections of the same underlying table and contain some common attributes. T1 may have been generalized. The problem is to generalize T2 to satisfy a given (X, Y)-privacy requirement. To preserve information, T2's generalization is not necessarily based on T1, that is, T2 may contain values more specific than in T1. Given T1 and T2, the adversary may apply prior knowledge to match the records in T1 and T2. Entity matching has been studied in the database, data mining, AI, and web communities for information integration, natural language processing, and the Semantic Web. See [211] for a list of works. We cannot consider a priori every possible way of matching. Thus, the problem of sequential anonymization primarily considers the matching based on the following prior knowledge available to both the data holder and the adversary: the schema information of T1 and T2, the taxonomies for categorical attributes, and the following inclusion-exclusion principle for matching the records. Assume that t1 ∈ T1 and t2 ∈ T2.

• Consistency Predicate: for every common categorical attribute A, t1.A matches t2.A if they are on the same generalization path in the taxonomy tree for A. Intuitively, this says that t1.A and t2.A can possibly be generalized from the same domain value. For example, Male matches Single Male. This predicate is implicit in the taxonomies for categorical attributes.

• Inconsistency Predicate: for two distinct categorical attributes T1.A and T2.B, t1.A matches t2.B only if t1.A and t2.B are not semantically inconsistent according to "common sense." This predicate excludes impossible matches. If not specified, "not semantically inconsistent" is assumed. If two values are semantically inconsistent, so are their specialized values. For example, Male and Pregnant are semantically inconsistent, and so are Married Male and 6 Month Pregnant.

Numerical attributes are not considered in the predicates of the join attributes because their taxonomies may be generated differently for T1 and T2. Both the data holder and the adversary use these predicates to match records from T1 and T2. The data holder can "catch up with" the adversary by incorporating the adversary's knowledge into such "common sense." We assume that a match function tests whether (t1, t2) is a match: (t1, t2) is a match if both the consistency predicate and the inconsistency predicate hold. The join of T1 and T2 is a table on att(T1) ∪ att(T2) that contains all matches (t1, t2). The join attributes refer to all attributes that occur in either predicate. Note that every common attribute A has two columns, T1.A and T2.A, in the join. The following observation says that generalizing the join attributes produces more matches, thereby making the join more lossy. The approach presented in this chapter exploits this property to hide the original matches.
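A sketch of one possible match function built from the two predicates; the small taxonomy and the inconsistency list stand in for the taxonomies and the "common sense" knowledge and are assumptions for illustration only.

TAXONOMY = {"ANY_Sex": ["Male", "Female"], "Europe": ["UK", "FR"]}
INCONSISTENT = {("Male", "Pregnant")}   # illustrative "common sense" exclusions

def on_path(u, v):
    """Consistency: u and v are on the same generalization path of a taxonomy."""
    def descends(anc, leaf):
        return anc == leaf or any(descends(c, leaf) for c in TAXONOMY.get(anc, []))
    return descends(u, v) or descends(v, u)

def match(t1, t2, common_attrs):
    # Consistency predicate on every common categorical attribute.
    if not all(on_path(t1[a], t2[a]) for a in common_attrs):
        return False
    # Inconsistency predicate across the values of the two records.
    for u in t1.values():
        for v in t2.values():
            if (u, v) in INCONSISTENT or (v, u) in INCONSISTENT:
                return False
    return True

t1 = {"PoB": "Europe", "Sex": "Male"}
t2 = {"PoB": "UK", "Sex": "ANY_Sex"}
print(match(t1, t2, ["PoB", "Sex"]))   # True: Europe/UK and Male/ANY_Sex share paths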

Observation 9.1.1 (Join preserving) If (t1, t2) is a match and if t′1 is a generalization of t1, (t′1, t2) is a match. (Join relaxing) If (t1, t2) is not a match and if t′1 is a generalization of t1 on some join attribute A, (t′1, t2) is a match if and only if t′1.A and t2.A are on the same generalization path and t′1.A is not semantically inconsistent with any value in t2.

Consider an (X, Y)-privacy requirement. We generalize T2 on the attributes X ∩ att(T2), called the generalization attributes. Corollary 9.1 implies that including more attributes in X makes the privacy requirement stronger. Observation 9.1.1 implies that including more join attributes in X (for generalization) makes the join more lossy. Therefore, from the privacy point of view it is a good practice to include all join attributes in X for generalization. Moreover, if X contains a common attribute A from T1 and T2, under the matching predicate, one of T1.A and T2.A could be more specific (and so reveal more information) than the other. To ensure privacy, X should contain both T1.A and T2.A in the (X, Y)-privacy specification, so that they can be generalized.

DEFINITION 9.1 Sequential anonymization The data holder has previously released a table T1 and wants to release the next table T2, where T1 and T2 are projections of the same underlying table and contain some common attributes. The data holder wants to ensure an (X, Y)-privacy requirement on the join of T1 and T2. The sequential anonymization is to generalize T2 on X ∩ att(T2) so that the join of T1 and T2 satisfies the (X, Y)-privacy requirement and T2 remains as useful as possible.

THEOREM 9.1
The sequential anonymization is at least as hard as the k-anonymization problem.

PROOF The k-anonymization of T2 on QID is the special case of sequential anonymization with (X, Y)-anonymity, where X is QID and Y is a common key of T1 and T2 and the only join attribute. In this case, the join trivially appends the attributes of T1 to T2 according to the common key, after which the appended attributes are ignored.

9.2 Monotonicity of Privacy

To generalize T2, the anonymization algorithm presented in [236] specializes T2 starting from the most generalized state. A main reason for this approach is the following anti-monotonicity of (X, Y)-privacy with respect to specialization: if (X, Y)-privacy is violated, it remains violated after a specialization. Therefore, the anonymization algorithm can stop further specialization whenever the (X, Y)-privacy is violated for the first time. This is a highly desirable property for pruning unpromising specializations. We first show this property for a single table.

THEOREM 9.2
On a single table, the (X, Y)-privacy is anti-monotone with respect to specialization on X.

PROOF See [236].

However, on the join of T1 and T2, in general, (X, Y)-anonymity is not anti-monotone with respect to a specialization on X ∩ att(T2). To see this, let T1(D, Y) = {d3y3, d3y2, d1y1} and T2(C, D) = {c1d3, c2d}, where ci, di, yi are domain values and d is a generalized value of d1 and d2. AY(X) is the minimum number of distinct values on Y associated with a group value on X. The join based on D contains 3 matches (c1d3, d3y2), (c1d3, d3y3), (c2d, d1y1), and AY(X) = AY(c2dd1) = 1, where X = {C, T1.D, T2.D}. After specializing the record c2d in T2 into c2d2, the join contains only two matches (c1d3, d3y2) and (c1d3, d3y3), and AY(X) = AY(c1d3d3) = 2. Thus, AY(X) increases after the specialization.
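The counterexample can be reproduced numerically; the following is only a sketch with assumed names that materializes the join on D and computes AY(X) before and after the specialization of c2d into c2d2.

from collections import defaultdict

def on_path(u, v):
    # d is the parent of d1 and d2; all other values are leaves of their own.
    children = {"d": {"d1", "d2"}}
    return u == v or v in children.get(u, set()) or u in children.get(v, set())

def A_Y(T1, T2):
    """A_Y(X) over the join of T1(D, Y) and T2(C, D) on D,
    with X = {C, T1.D, T2.D} and Y the Y column of T1."""
    ys = defaultdict(set)
    for d1_val, y in T1:
        for c, d2_val in T2:
            if on_path(d1_val, d2_val):           # lossy join on D
                ys[(c, d1_val, d2_val)].add(y)    # group by the value on X
    return min(len(s) for s in ys.values())

T1 = [("d3", "y3"), ("d3", "y2"), ("d1", "y1")]
print(A_Y(T1, [("c1", "d3"), ("c2", "d")]))    # 1: (c2, d1, d) is linked only to y1
print(A_Y(T1, [("c1", "d3"), ("c2", "d2")]))   # 2: c2d2 becomes dangling, so A_Y rises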

The above situation arises because the specialized record c2d2 matches no record in T1, that is, it becomes dangling. However, this situation does not arise for the T1 and T2 encountered in the sequential anonymization. Two tables are population-related if every record in each table has at least one matching record in the other table. Essentially, this property says that T1 and T2 are about the same "population" and there is no dangling record. Clearly, if T1 and T2 are projections of the same underlying table, as assumed in the problem setting of sequential anonymization, T1 and T2 are population-related. Observation 9.1.1 implies that generalizing T2 preserves the population-relatedness.

Observation 9.2.1 If T1 and T2 are population-related, so are they after generalizing T2.

LEMMA 9.1
If T1 and T2 are population-related, AY(X) does not increase after a specialization of T2 on X ∩ att(T2).

Now, we consider (X, Y)-linkability on the join of T1 and T2. It is not immediately clear how a specialization on X ∩ att(T2) will affect LY(X) because the specialization will reduce the matches, and therefore both a(y, x) and a(x) in ly(x) = a(y, x)/a(x) will decrease. The next lemma shows that LY(X) does not decrease after a specialization on X ∩ att(T2).

LEMMA 9.2
If Y contains attributes from T1 or T2, but not from both, LY(X) does not decrease after a specialization of T2 on the attributes X ∩ att(T2).

COROLLARY 9.2
The (X, Y)-anonymity on the join of T1 and T2 is anti-monotone with respect to a specialization of T2 on X ∩ att(T2). Assume that Y contains attributes from either T1 or T2, but not both. Then the (X, Y)-linkability on the join of T1 and T2 is anti-monotone with respect to a specialization of T2 on X ∩ att(T2).

COROLLARY 9.3
Let T1, T2 and the (X, Y)-privacy requirement be as in Corollary 9.2. There exists a generalized T2 that satisfies the (X, Y)-privacy if and only if the most generalized T2 does.

Remarks. Lemma 9.1 and Lemma 9.2 can be extended to multiple previous releases. Thus, the anti-monotonicity of (X, Y)-privacy holds for one or more previous releases. The extension in Chapter 9.4 makes use of this observation.

9.3 Anonymization Algorithm for Sequential Releases

We explain the algorithm for generalizing T2 to satisfy the given (X, Y)-privacy on the join of T1 and T2. Corollary 9.3 is first applied to test whether a solution exists. In the rest of Chapter 9.3, we assume a solution exists. Let Xi denote X ∩ att(Ti), Yi denote Y ∩ att(Ti), and Ji denote the join attributes in Ti, where i = 1, 2.

9.3.1 Overview of the Anonymization Algorithm

The algorithm, called Top-Down Specialization for Sequential Anonymization (TDS4SA), is given in Algorithm 9.3.5. The input consists of T1, T2, the (X, Y)-privacy requirement, and the taxonomy tree for each categorical attribute in X2. Starting from the most generalized T2, the algorithm iteratively specializes the attributes Aj in X2. T2 contains the current set of generalized records and Cutj contains the current set of generalized values for Aj. In each iteration, if some Cutj contains a "valid" candidate for specialization, it chooses the winner w that maximizes Score. A candidate is valid if the join specialized by the candidate does not violate the privacy requirement. The algorithm then updates Score(v) and the status for the candidates v in ∪Cutj. This process is repeated until there is no more valid candidate. On termination, Corollary 9.2 implies that a further specialization produces no solution, so T2 is a maximally specialized state satisfying the given privacy requirement.

Algorithm 9.3.5 Top-Down Specialization for Sequential Anonymization (TDS4SA)
Input: T1, T2, an (X, Y)-privacy requirement, a taxonomy tree for each categorical attribute in X2.
Output: a generalized T2 satisfying the privacy requirement.

1: generalize every value of Aj to ANYj where Aj ∈ X2;
2: while there is a valid candidate in ∪Cutj do
3:    find the winner w of highest Score(w) from ∪Cutj;
4:    specialize w on T2 and remove w from ∪Cutj;
5:    update Score(v) and the valid status for all v in ∪Cutj;
6: end while
7: output the generalized T2 and ∪Cutj;

Below, we focus on the three key steps in Lines 3 to 5.
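The high-level loop of Algorithm 9.3.5 might be sketched as follows; Score, validity checking, and the specialize step are passed in as callbacks because their efficient realization (via Tree1, Tree2, and the X-tree) is what the rest of this chapter develops. This is a sketch with assumed names, not the implementation of [236].

def tds4sa(cuts, score, is_valid, specialize):
    """Skeleton of the TDS4SA loop.
    cuts:          a set containing the union of Cut_j, i.e., the current candidates
    score(v):      the information/privacy trade-off of specializing v
    is_valid(v):   True if specializing v keeps the join privacy-safe
    specialize(v): performs v -> {v1, ..., vc} on T2 and returns the child values
    """
    while True:
        valid = [v for v in cuts if is_valid(v)]
        if not valid:
            break                           # no valid candidate left: stop (Line 2)
        w = max(valid, key=score)           # Line 3: pick the winner
        children = specialize(w)            # Line 4: specialize w on T2
        cuts.remove(w)
        cuts.update(children)               # the child values become new candidates
        # Line 5: Score(v) and validity are re-evaluated through the callbacks
    return cuts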

9.3.2 Information Metrics

TDS4SA employs the information metric Score(v) = IGPL(v) presented in Chapter 4.2 to evaluate the "goodness" of a specialization v for preserving privacy and information. Each specialization gains some "information," InfoGain(v), and loses some "privacy," PrivLoss(v). TDS4SA chooses the specialization that maximizes the trade-off between the gain of information and the loss of privacy, which has been studied in Chapter 6.2.2.

InfoGain(v) is measured on T2 whereas PrivLoss(v) is measured on the join of T1 and T2. Consider a specialization v → {v1, . . . , vc}. For a numerical attribute, c = 2, and v1 and v2 represent the binary split of the interval v that maximizes InfoGain(v). Before the specialization, T2[v] denotes the set of generalized records in T2 that contain v. After the specialization, T2[vi] denotes the set of records in T2 that contain vi, 1 ≤ i ≤ c.

The choice of InfoGain(v) and PrivLoss(v) depends on the information requirement and the privacy requirement. If T2 is released for classification on a specified class column, InfoGain(v) could be the reduction of the class entropy defined by Equation 4.8 in Chapter 4.3. The computation depends only on the class frequency and some count statistics of v and vi in T2[v] and T2[v1] ∪ · · · ∪ T2[vc]. Another choice of InfoGain(v) could be the notion of minimal distortion MD or ILoss discussed in Chapter 4.1. If generalizing a child value vi to the parent value v costs one unit of distortion, the information gained by the specialization is InfoGain(v) = |T2[v]|. A third choice is the discernibility metric (DM) in Chapter 4.1.

The above choices are by no means exhaustive. As far as TDS4SA is concerned, what is important is that InfoGain(v) depends only on the single column att(v) and its class frequency, and that the specialization of v only affects the records in T2[v] and the column att(v). We will exploit this property to maintain Score(v) efficiently for all candidates.

For (X, Y)-privacy, PrivLoss(v) is measured by the decrease of AY(X) or the increase of LY(X) due to the specialization of v: AY(X) − AY(Xv) for (X, Y)-anonymity, and LY(Xv) − LY(X) for (X, Y)-linkability, where X and Xv represent the attributes before and after specializing v, respectively. Computing PrivLoss(v) involves the count statistics about X and Y over the join of T1 and T2, before and after the specialization of v, which can be expensive. The data holder may choose to ignore the privacy aspect by letting PrivLoss(v) = 0 in Score(v).
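A minimal sketch of one possible Score(v), assuming the IGPL trade-off takes the form InfoGain(v)/(PrivLoss(v) + 1) (see Chapter 4.2 for the exact definition) and that InfoGain(v) is the reduction of class entropy; the class labels below are illustrative.

from collections import Counter
from math import log2

def entropy(labels):
    """Class entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(parent_labels, child_label_lists):
    """Reduction of class entropy when T2[v] is split into T2[v1], ..., T2[vc]."""
    n = len(parent_labels)
    return entropy(parent_labels) - sum(len(ls) / n * entropy(ls)
                                        for ls in child_label_lists)

def score(parent_labels, child_label_lists, priv_loss):
    # priv_loss would be AY(X) - AY(Xv) or LY(Xv) - LY(X), computed from the join.
    return info_gain(parent_labels, child_label_lists) / (priv_loss + 1)

# Class labels of T2[v] and of the two child partitions after v -> {v1, v2}:
parent = ["Yes"] * 6 + ["No"] * 2
children = [["Yes"] * 6, ["No"] * 2]
print(score(parent, children, priv_loss=0.0))   # the pure information gain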

Challenges. Though Algorithm 9.3.5 has a simple high-level structure, several computational challenges must be resolved for an efficient implementation. First, each specialization of the winner w affects the matching of the join and, hence, the checking of the privacy requirement (i.e., the status on Line 5). It is extremely expensive to rejoin the two tables for each specialization performed. Second, it is inefficient to "perform" every candidate specialization v just to update Score(v) on Line 5 (note that AY(Xv) and LY(Xv) are defined for the join assuming the specialization of v is performed). Moreover, materializing the join is impractical because a lossy join can be very large. A key contribution of Wang and Fung's work [236] is an efficient solution that incrementally maintains some count statistics without executing the join. We consider the two types of privacy separately.

9.3.3 (X, Y )-Linkability

Two expensive operations in performing the winner specialization w are accessing the records in T2 containing w and matching the records in T2 with the records in T1. To support these operations efficiently, we organize the records in T1 and T2 into two tree structures. Recall that X1 = X ∩ att(T1) and X2 = X ∩ att(T2), and that J1 and J2 denote the join attributes in T1 and T2.

Tree1 and Tree2. In Tree2, we partition the T2 records by the attributes X2 and J2 − X2 in that order, one level per attribute. Each root-to-leaf path represents a generalized record on X2 ∪ J2, with the partition of the original records generalized to it stored at the leaf node. For each generalized value v in Cutj, Link[v] links up all nodes for v at the attribute level of v. Therefore, Link[v] provides a direct access to all T2 partitions generalized to v. Tree2 is updated upon performing the winner specialization w in each iteration. In Tree1, we partition the T1 records by the attributes J1 and X1 − J1 in that order. No specialization is performed on T1, so Tree1 is static. Some count statistics are stored for each partition in Tree1 and Tree2 to facilitate efficient computation of Score(v).

FIGURE 9.1: The static Tree1 (T1 partitioned by PoB, Sex, and Zip, with partition sizes at the leaves)
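One way to realize the leaf partitions of Tree1 and Tree2 is a dictionary keyed by the path of attribute values, holding the partition size and, for T2, the count of the sensitive value y; this is only a sketch of the idea, and it omits the Link[v] lists that the real index of [236] maintains.

from collections import defaultdict

def build_leaves(records, attrs, y_attr=None, y_value=None):
    """Partition `records` by `attrs` (one level per attribute) and return
    {path: [partition_size, count_of_y_value]} for the leaf partitions."""
    leaves = defaultdict(lambda: [0, 0])
    for r in records:
        path = tuple(r[a] for a in attrs)
        leaves[path][0] += 1
        if y_attr is not None and r[y_attr] == y_value:
            leaves[path][1] += 1
    return leaves

# Tree1 leaves for Table 9.4: partition T1 by J1 = {PoB, Sex}, then X1 - J1 = {Zip}.
T1 = [{"PoB": p, "Sex": s, "Zip": z} for p, s, z in
      [("UK","M","Z3"), ("UK","M","Z3"), ("UK","F","Z5"), ("FR","F","Z5"),
       ("FR","M","Z3"), ("FR","M","Z3"), ("US","M","Z3")]]
tree1 = build_leaves(T1, ["PoB", "Sex", "Zip"])
print(tree1[("UK", "M", "Z3")])   # [2, 0]: the left-most partition holds 2 records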

Specialize w (Line 4). This step performs the winner specialization w → {w1, . . . , wc}, similar to the TDS algorithm for a single release in Chapter 6.3. It follows Link[w] to specialize the partitions containing w. For each partition P2 on the link, TDS4SA performs two steps:

1. Refine P2 into the specialized partitions for wi, and link them into Link[wi]. The specialized partitions remain on the other links of P2. This step scans the raw records in P2 and collects some count statistics to facilitate the computation of Score(v) in subsequent iterations.

2. Probe the matching partitions in Tree1. Match the last |J2| attributes in P2 with the first |J1| attributes in Tree1. For each matching node at the level |J1| in Tree1, scan all partitions P1 below the node. If x is the value on X represented by the pair (P1, P2), increment a(x) by |P1| × |P2|, and increment a(x, y) by |P2[y]| × |P1| if Y is in T2, or by |P2| × |P1[y]| if Y is in T1, where y is a value on Y. We employ an "X-tree" to keep a(x) and a(x, y) for the values x on X. In the X-tree, the x values are partitioned by the attributes X, one level per attribute, and are represented by leaf nodes. a(x) and a(x, y) are kept at the leaf node for x. Note that ly(x) = a(x, y)/a(x) and Ly(X) = max{ly(x)} over all the leaf nodes x in the X-tree.

Remarks. This step (Line 4) is the only time that raw records are accessed in the algorithm.

Update Score(v) (Line 5). After specializing w, for each candidate v in ∪Cutj, the new InfoGain(v) is obtained from the count statistics stored together with the partitions on Link[v]. LY(X) is updated to LY(Xw), which was computed in the previous iteration. If LY(Xv) ≤ k, mark v as valid.

Refer to [236] for the details of each step.

Example 9.5
Continue with Example 9.4. Recall that T1 denotes the patient data, T2 denotes the medical test data, X = {Job, PoB, Zip}, Y = {y = HIV}, and the join attributes are J = {PoB, Sex}. X1 = {PoB, Zip} and X2 = {Job, PoB}.

FIGURE 9.2: Evolution of Tree2 (y = HIV): the initial Tree2 and Tree2 after the specialization on ANY_PoB

FIGURE 9.3: Evolution of X-tree: the initial X-tree and the X-tree after the specialization on ANY_PoB

Initially, all values for X2 are generalized to ANY_Job and ANY_PoB, and ∪Cutj contains these values. To find the winner w, we compute Score(v) for every v in ∪Cutj. Figure 9.1 shows the (static) Tree1 for T1, grouped by X1 ∪ J, and Figure 9.2 shows on the left the initial Tree2 for the most generalized T2, grouped by X2 ∪ J. For example, in Tree2 the partition generalized to {ANY_Job, ANY_PoB, M} has 6 records, and 2 of them have the value HIV.

Figure 9.3 shows the initial X-tree on the left. a(x) and a(x, y) in the X-tree are computed from Tree1 and Tree2. For example,

a(ANY_Job, ANY_PoB, UK, Z3) = 6 × 2 = 12,
a(ANY_Job, ANY_PoB, UK, Z3, HIV) = 2 × 2 = 4,

where the ×2 comes from matching the left-most path in Tree2 with the left-most path in Tree1 on the join attributes J. In this initial X-tree, Ly(X) = 4/12 = 33%.

On specializing ANY_PoB → {Europe, US}, three partitions are created in Tree2 as depicted in Figure 9.2 on the right. The X-tree is updated accordingly as depicted in Figure 9.3 on the right. To compute a(x) and a(x, y) for these updated x's, we access all partitions in one scan of Link[Europe] and Link[US] in Tree2 and match them with the partitions in Tree1, e.g.,

a(ANY_Job, Europe, UK, Z3) = 4 × 2 = 8,
a(ANY_Job, Europe, UK, Z3, HIV) = 2 × 2 = 4.

In the updated X-tree, Ly(X) = 4/8 = 50%.
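The counts in this example can be reproduced directly from the partition sizes, without materializing the join; the following sketch assumes that leaf-level join values must match exactly and that Europe generalizes UK and FR (names and layout are illustrative).

from collections import defaultdict

# Tree1 leaf partitions of T1 (PoB, Sex, Zip) -> |P1|, taken from Table 9.4.
tree1 = {("UK","M","Z3"): 2, ("UK","F","Z5"): 1, ("FR","F","Z5"): 1,
         ("FR","M","Z3"): 2, ("US","M","Z3"): 1}

def covers(general, leaf):
    # Subtree generalization for PoB: ANY_PoB covers everything, Europe covers UK/FR.
    return (general == leaf or general == "ANY_PoB"
            or (general == "Europe" and leaf in ("UK", "FR")))

def max_linkability(tree2):
    """tree2: (Job, PoB, Sex) -> (|P2|, |P2[HIV]|).  Accumulate a(x) and a(x, HIV)
    for x = (T2.Job, T2.PoB, T1.PoB, T1.Zip) and return Ly(X)."""
    a, a_y = defaultdict(int), defaultdict(int)
    for (job, pob2, sex2), (p2, p2y) in tree2.items():
        for (pob1, sex1, zip1), p1 in tree1.items():
            if sex1 == sex2 and covers(pob2, pob1):    # probe matching partitions
                x = (job, pob2, pob1, zip1)
                a[x] += p2 * p1
                a_y[x] += p2y * p1
    return max(a_y[x] / a[x] for x in a)

# Initial, most generalized Tree2 (Figure 9.2, left):
print(max_linkability({("ANY_Job","ANY_PoB","M"): (6, 2),
                       ("ANY_Job","ANY_PoB","F"): (2, 0)}))   # 0.33..., i.e., 4/12
# After specializing ANY_PoB -> {Europe, US} (Figure 9.2, right):
print(max_linkability({("ANY_Job","Europe","M"): (4, 2),
                       ("ANY_Job","Europe","F"): (2, 0),
                       ("ANY_Job","US","M"): (2, 0)}))        # 0.5, i.e., 4/8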

We provide an analysis of this algorithm:

1. The records in T1 and T2 are stored only once in Tree1 and Tree2. For the static Tree1, once it is created, the data records can be discarded.

2. On specializing the winner w, Link[w] provides a direct access to the records involved in T2 and Tree1 provides a direct access to the matching partitions in T1. Since the matching is performed at the partition level, not the record level, it scales up with the size of the tables.

3. The cost of each iteration has two parts. The first part involves scanning the affected partitions on Link[w] for specializing w in Tree2 and maintaining the count statistics. This is the only operation that accesses records. The second part involves using the count statistics to update the score and status of the candidates.

4. In the whole computation, each record in T2 is accessed at most |X ∩ att(T2)| × h times because a record is accessed only if it is specialized on some attribute from X ∩ att(T2), where h is the maximum height of the taxonomies for the attributes in X ∩ att(T2).

5. The algorithm can be terminated at any time with a generalized T2 satisfying the given privacy requirement.

9.3.4 (X, Y )-Anonymity

As in the case of (X, Y)-linkability, we use Tree1 and Tree2 to find the matching partitions (P1, P2), and performing the winner specialization and updating Score(v) is similar to Chapter 9.3.3. But now, we use the X-tree to update aY(x) for the values x on X, and there is one important difference in the update of aY(x). Recall that aY(x) is the number of distinct values y on Y associated with the value x. Since the same (x, y) value may be found in more than one matching (P1, P2) pair, we cannot simply sum up the counts extracted from all pairs. Instead, we need to keep track of the distinct Y values for each x value to update aY(x). In general, this is a time-consuming operation, e.g., requiring sorting/hashing/scanning. Refer to [236] for the cases in which aY(x) can be updated efficiently.
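A sketch of the difference: for (X, Y)-anonymity the X-tree entry for each x must hold a set of distinct Y values rather than a count, because the same (x, y) pair can arise from several matching (P1, P2) pairs (the structure below is illustrative only).

from collections import defaultdict

def update_a_y(xtree, contributions):
    """contributions: iterable of (x, y_values) produced by matching (P1, P2) pairs.
    Distinct Y values are unioned per x; summing counts would overstate aY(x)."""
    for x, y_values in contributions:
        xtree[x] |= set(y_values)
    return min(len(ys) for ys in xtree.values())   # AY(X)

xtree = defaultdict(set)
# Two matching partition pairs contribute overlapping Y values for the same x:
print(update_a_y(xtree, [(("c1", "d3"), {"y2", "y3"}),
                         (("c1", "d3"), {"y3"})]))   # 2, not 3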

9.4 Extensions

TDS4SA can be extended to anonymize the current release Tp in the presence of multiple previous releases T1, . . . , Tp−1. One solution is to first join all previous releases T1, . . . , Tp−1 into one "history table" and then apply the proposed method for two releases. This history table is likely to be extremely large because all of T1, . . . , Tp−1 are generalized versions and there may be no join attributes between them. A preferred solution should deal with all releases at their original size. As remarked at the end of Chapter 9.2, Lemmas 9.1-9.2 can be extended to this general case. Below, we briefly describe the modified problem definition. Refer to [236] for the modification required for TDS4SA in Chapter 9.3.

Let ti be a record in Ti. The Consistency Predicate states that, for all releases Ti that have a common attribute A, the ti.A's are on the same generalization path in the taxonomy tree for A. The Inconsistency Predicate states that, for distinct attributes Ti.A and Tj.B, ti.A and tj.B are not semantically inconsistent according to "common sense." (t1, t2, . . . , tp) is a match if it satisfies both predicates. The join of T1, T2, . . . , Tp is a table that contains all matches (t1, t2, . . . , tp). For an (X, Y)-privacy requirement on the join, X and Y are disjoint subsets of att(T1) ∪ att(T2) ∪ · · · ∪ att(Tp), and if X contains a common attribute A, X contains all Ti.A such that Ti contains A.

DEFINITION 9.2 Sequential anonymization Suppose that tables T1, . . . , Tp−1 were previously released. The data holder wants to release a table Tp, but wants to ensure an (X, Y)-privacy requirement on the join of T1, T2, . . . , Tp. The sequential anonymization is to generalize Tp on the attributes in X ∩ att(Tp) such that the join satisfies the privacy requirement and Tp remains as useful as possible.

9.5 Summary

Previous discussions on k-anonymization focused on a single release of data. In reality, data is not released in one shot, but is released continuously to serve various information purposes. The availability of related releases enables sharper identification attacks through a global quasi-identifier made up of the attributes across releases. In this chapter, we have studied the anonymization problem of the current release under this assumption, called sequential anonymization [236]. The privacy notion (X, Y)-privacy has been extended to address the privacy issues in this case. We have also studied the notion of a lossy join as a way to hide the join relationship among releases, and a scalable anonymization solution to the sequential anonymization problem.

An empirical study in [236] reports extensive experiments on TDS4SA to evaluate the impact of achieving (X, Y)-privacy on the data quality in terms of classification error and distortion per record. Experimental results on real-life data suggest that only a small data penalty is required to achieve a wide range of (X, Y)-privacy requirements in the scenario of sequential releases. The method is superior to several obvious solution candidates, such as k-anonymizing the current release, removing the join attributes, and removing the sensitive attributes. None of these alternative solutions responds dynamically to the (X, Y)-privacy specification and the generalization of the join. The experiments show that the dynamic response to the generalization of the join helps achieve the specified privacy with less data distortion. The index structure presented in Figures 9.1-9.3 is highly scalable for anonymizing large data sets.

Chapter 10

Anonymizing Incrementally Updated Data Records

10.1 Introduction

k-anonymization has been primarily studied for a single static release of data in Parts I and II. In practice, however, new data arrives continuously and up-to-date data has to be published to researchers in a timely manner. In the model of incremental data publishing, the data holder has previously published T1, . . . , Tp−1 and now wants to publish Tp, where Ti is an updated release of Ti−1 with record insertions and/or deletions. The problem assumes that all records for the same individual remain the same in all releases. Unlike the sequential anonymization problem studied in Chapter 9, which assumes all releases are projections of the same underlying data table, this problem assumes all releases share the same database schema.

The problem of incremental data publishing has two challenges. (1) Even though each release T1, . . . , Tp is individually anonymous, the privacy requirement could be compromised by comparing different releases and eliminating some possible sensitive values for a victim. (2) One approach to solve the problem of incremental data publishing is to anonymize and publish new records separately each time they arrive. This naive approach suffers from severe data distortion because small increments are anonymized independently. Moreover, it is difficult to analyze a collection of independently anonymized data sets. For example, if the country name (e.g., Canada) is used in the release for first-month records and the city name (e.g., Toronto) is used in the release for second-month records, counting the total number of persons born in Toronto is not possible. Another approach is to enforce the later release to be no more specialized than the previous releases [217]. The major drawback is that each subsequent release gets increasingly distorted, even if more new data are available.

The incremental data anonymization problem assumes that the adversary knows the timestamp and QID of the victim, so the adversary knows exactly which releases contain the victim's data record. The following example shows the privacy threats caused by record insertions and deletions.

Table 10.1: Continuous data publishing: 2-diverse T1

Job           Sex     Disease
Professional  Female  Cancer
Professional  Female  Diabetes
Artist        Male    Fever
Artist        Male    Cancer

Table 10.2: Continuous data publishing: 2-diverse T2 after an insertion to T1

Job           Sex     Disease
Professional  Female  Cancer
Professional  Female  Diabetes
Professional  Female  HIV
Artist        Male    Fever
Artist        Male    Cancer

Example 10.1
Let Table 10.1 be the first release T1. Let Table 10.2 be the second release T2 after inserting a new record to T1. Both T1 and T2 satisfy 2-diversity independently. Suppose the adversary knows that a female lawyer, Alice, has a record in T2 but not in T1, based on the timestamp at which Alice was admitted to a hospital. From T2, the adversary can infer that Alice must have contracted either Cancer, Diabetes, or HIV. By comparing T2 with T1, the adversary can identify that the first two records in T2 must be old records from T1 and, thus, infer that Alice must have contracted HIV.

Example 10.2
Let Table 10.1 be the first release T1. Let Table 10.3 be the second release T2 after deleting the record 〈Professional, Female, Diabetes〉 from and inserting a new record 〈Professional, Female, Fever〉 to T1. Both T1 and T2 satisfy 2-diversity independently. Suppose the adversary knows that a female engineer, Beth, must be in both T1 and T2. From T1, the adversary can infer that Beth must have contracted either Cancer or Diabetes. Since T2 contains no Diabetes, the adversary can infer that Beth must have contracted Cancer.
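The inferences in Examples 10.1 and 10.2 amount to simple set operations over the candidate sensitive values of the victim's qid group in each release; the following sketch hard-codes the tables and the group extraction purely for illustration.

from collections import Counter

def candidates(release, qid):
    """Sensitive values of the records in `release` whose QID matches the victim."""
    return [d for (job, sex, d) in release if (job, sex) == qid]

T1 = [("Professional","Female","Cancer"), ("Professional","Female","Diabetes"),
      ("Artist","Male","Fever"), ("Artist","Male","Cancer")]
T2_ins = T1[:2] + [("Professional","Female","HIV")] + T1[2:]          # Table 10.2
T2_del = [("Professional","Female","Cancer"), ("Professional","Female","Fever"),
          ("Artist","Male","Fever"), ("Artist","Male","Cancer")]      # Table 10.3

qid = ("Professional", "Female")
# Example 10.1: Alice is in T2 but not T1, so her value lies in the multiset difference.
print(list((Counter(candidates(T2_ins, qid)) - Counter(candidates(T1, qid))).keys()))
# ['HIV']
# Example 10.2: Beth is in both releases, so her value lies in the intersection.
print(set(candidates(T1, qid)) & set(candidates(T2_del, qid)))        # {'Cancer'}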

Table 10.3: Continuous data publishing: 2-diverse T2 after a deletion from and an insertion to T1

Job           Sex     Disease
Professional  Female  Cancer
Professional  Female  Fever
Artist        Male    Fever
Artist        Male    Cancer

Byun et al. [42] are the pioneers who propose an anonymization technique that enables privacy-preserving incremental data publishing after new records have been inserted. Specifically, it guarantees every release to satisfy ℓ-diversity, which requires each qid group to contain at least ℓ distinct sensitive values. Since this instantiation of ℓ-diversity does not consider the frequencies of sensitive values, an adversary could still confidently infer a sensitive value of a victim if the value occurs frequently in a qid group. Thus, the instantiation employed by Byun et al. [42] cannot prevent attribute linkage attacks.

Byun et al. [42] attempt to address the privacy threats caused by record insertions but not deletions, so the current release Tp contains all records in previous releases. The algorithm inserts new records into the current release Tp only if two privacy requirements remain satisfied after the insertion: (1) Tp is ℓ-diverse. (2) Given any previous release Ti and the current release Tp together, there are at least ℓ distinct sensitive values in the remaining records that could potentially be the victim's record. This requirement can be verified by comparing the difference and intersection of the sensitive values in any two "comparable" qid groups in Ti and Tp. The algorithm prefers to specialize Tp as much as possible to improve the data quality, provided that the two privacy requirements are satisfied. If the insertion of some new records would violate any of the privacy requirements, even after generalization, the insertions are delayed until later releases. Nonetheless, this strategy sometimes may run into a situation in which no new data can be released. Also, it requires a very large memory buffer to store the delayed data records.

In this chapter, we study two scenarios of privacy-preserving incremental data publishing: continuous data publishing [93] in Chapter 10.2 and dynamic data republishing [251] in Chapter 10.3. In continuous data publishing, every release is an "accumulated" version of the data at each time instance, which contains all records collected so far. Thus, every data release publishes the "history," i.e., the events that happened up to the time of the new release. The model captures the scenario in which, once a record is collected, it cannot be deleted. In contrast, dynamic data republishing assumes that previously collected raw data records can be inserted, deleted, and updated. A new release contains some new records and some updated records. Every release is the data collected at each time instance. In both cases, the adversary has access to all releases published at each time instance.

Some works misinterpret Fung et al.'s work [93] as allowing only record insertions but not record deletions. In fact, T1 and T2 can be arbitrary sets of records, not necessarily the result of inserting or deleting records from the other table. The difference is that the work in [93] anonymizes (T1 ∪ T2) as one release to allow data analysis on the whole data set. In contrast, all other works [251] anonymize each Ti independently, so the publishing model in [251] does not benefit from new data because each Ti is small, resulting in large distortion.

10.2 Continuous Data Publishing

k-anonymization is an important privacy protection mechanism in data publishing. While there has been a great deal of work in recent years, almost all of it considered a single static release. Such mechanisms only protect the data up to the first release or first recipient. In practical applications, data is published continuously as new data arrive; the same data may be anonymized differently for a different purpose or a different recipient. In such scenarios, even when all releases are properly k-anonymized, the anonymity of an individual may be unintentionally compromised if the recipient cross-examines all the releases received or colludes with other recipients. Preventing such attacks, called correspondence attacks, faces major challenges.

Fung et al. [93] show a method to systematically quantify the exact number of records that can be "cracked" by comparing all k-anonymous releases. A record in a k-anonymous release is "cracked" if it is impossible for it to be a candidate record of the target victim. After excluding the cracked records from a release, a table may no longer be k-anonymous. In some cases, data records with sensitive information of some victims can even be uniquely identified from the releases. Fung et al. [93] propose a privacy requirement, called BCF-anonymity, to measure the true anonymity in a release after excluding the cracked records, and present a generalization method to achieve BCF-anonymity without delaying record publication or inserting counterfeit records. Note that the traditional k-anonymity does not consider the sensitive values, but BCF-anonymity relies on the sensitive values to determine the cracked records. In Chapter 10.2, we study the correspondence attacks and BCF-anonymity in detail.

10.2.1 Data Model

Suppose that the data holder previously collected a set of records T1 timestamped t1, and published a k-anonymized version of T1, denoted by release R1. Then the data holder collects a new set of records T2 timestamped t2 and wants to publish a k-anonymized version of all records collected so far, T1 ∪ T2, denoted by release R2. Note that Ti contains the "events" that happened at time ti. An event, once it has occurred, becomes part of the history and therefore cannot be deleted. This publishing scenario is different from the update scenario in standard data management, where deletion of records can occur. Ri simply publishes the "history," i.e., the events that happened up to time ti. A real-life example can be found in California, where the hospitals are required to submit specific demographic data of all discharged patients every six months.1 The above publishing model directly serves the following scenarios.

Continuous data publishing. Publishing the release R2 for T1 ∪ T2 would permit an analysis of the data over the combined time period of t1 and t2. It also takes advantage of data abundance over a longer period of time to reduce the data distortion required by anonymization.

Multi-purpose publishing. With T2 being empty, R1 and R2 can be two releases of T1 anonymized differently to serve different information needs, such as correlation analysis vs. clustering analysis, or different recipients, such as a medical research team vs. a health insurance company. These recipients may collude by sharing their received data.

We first describe the publishing model with two releases and then show the extension beyond two releases and beyond k-anonymity in Chapter 10.2.6. Following the convention of k-anonymity [201, 217], we assume that each individual has at most one record in T1 ∪ T2. This assumption holds in many real-life databases. For example, in a normalized customer data table, each customer has only one profile. In the case that an individual has a record in both T1 and T2, there will be two duplicates in T1 ∪ T2 and one of them can be removed in a preprocessing step.

10.2.2 Correspondence Attacks

We use the following example to illustrate the idea of a correspondence attack and show that the traditional k-anonymization is insufficient for preventing correspondence attacks.

Example 10.3
Consider Tables 10.4-10.5 with the taxonomy trees in Figure 10.1, where QID is [Birthplace, Job] and the sensitive attribute is Disease. The data holder (e.g., a hospital) published the 5-anonymized R1 for 5 records a1-a5 collected in the previous month (i.e., timestamp t1). The anonymization was done by generalizing UK and France into Europe; the original values in the brackets are not released. In the current month (i.e., timestamp t2), the data holder collects 5 new records (i.e., b6-b10) and publishes the 5-anonymized R2 for all 10 records collected so far. Records are shuffled to prevent mapping between R1 and R2 by their order. The recipients know that every record in R1 has a "corresponding record" in R2 because R2 is a release for T1 ∪ T2. Suppose that one recipient, the adversary, tries to identify his neighbor Alice's record from R1 or R2, knowing that Alice was admitted to the hospital, as well as Alice's QID and timestamp. Consider the following scenarios with Figure 10.2.

1http://www.oshpd.ca.gov/HQAD/PatientLevel/

Table 10.4: 5-anonymized R1

RID   Birthplace       Job     Disease
(a1)  Europe (UK)      Lawyer  Flu
(a2)  Europe (UK)      Lawyer  Flu
(a3)  Europe (UK)      Lawyer  Flu
(a4)  Europe (France)  Lawyer  HIV
(a5)  Europe (France)  Lawyer  HIV

Table 10.5: 5-anonymized R2

RID    Birthplace  Job                    Disease
(b1)   UK          Professional (Lawyer)  Flu
(b2)   UK          Professional (Lawyer)  Flu
(b3)   UK          Professional (Lawyer)  Flu
(b4)   France      Professional (Lawyer)  HIV
(b5)   France      Professional (Lawyer)  HIV
(b6)   France      Professional (Lawyer)  HIV
(b7)   France      Professional (Doctor)  Flu
(b8)   France      Professional (Doctor)  Flu
(b9)   UK          Professional (Doctor)  HIV
(b10)  UK          Professional (Lawyer)  HIV

Scenario I. Alice has QID = [France, Lawyer] and timestamp t1. The adversary seeks to identify Alice's record in R1. Examining R1 alone, Alice's QID matches all 5 records in R1. However, examining R1 and R2 together, the adversary learns that the records a1, a2, a3 cannot all originate from Alice's QID; otherwise R2 would have contained at least 3 records of [France, Professional, Flu] since every record in R1 has a corresponding record in R2. Consequently, the adversary excludes one of a1, a2, a3 as a possibility; the choice among a1, a2, a3 does not matter as they are identical. The crossed-out record a2 in Figure 10.2 represents the excluded record.

Scenario II. Alice has QID = [France, Lawyer] and timestamp t1. Knowing that R2 contains all records at t1 and t2, the adversary seeks to identify Alice's record in R2. The adversary infers that, among the matching records b4-b8 in R2, at least one of b4, b5, b6 must have timestamp t2; otherwise b4, b5, b6 would have timestamp t1, in which case there would have been at least 3 (corresponding) records of the form [Europe, Lawyer, HIV] in R1. In this case, the adversary excludes at least one of b4, b5, b6 since Alice has timestamp t1. The crossed-out record b6 in Figure 10.2 represents the excluded record.

FIGURE 10.1: Taxonomy trees for Birthplace (Europe over UK and France) and Job (Professional over Lawyer and Doctor)

FIGURE 10.2: Correspondence attacks (Scenario I: F-attack excludes a2; Scenario II: C-attack excludes b6; Scenario III: B-attack excludes b2)

Scenario III. Alice has QID = [UK, Lawyer] and timestamp t2 and the adversary seeks to identify Alice's record in R2. The adversary infers that, among the matching records b1, b2, b3, b9, b10 in R2, at least one of b1, b2, b3 must have timestamp t1; otherwise one of a1, a2, a3 would have no corresponding record in R2. In this case, at least one of b1, b2, b3 is excluded since Alice has timestamp t2. The crossed-out record b2 in Figure 10.2 represents the excluded record.

In each scenario, at least one matching record is excluded, so the 5-anonymity of Alice is compromised.

All these attacks "crack" some matching records in R1 or R2 by inferring that they either do not originate from Alice's QID or do not have Alice's timestamp. Such cracked records are not related to Alice; thus, excluding them allows the adversary to focus on a smaller set of candidates. Since cracked records are identified by cross-examining R1 and R2 and by exploiting the knowledge that every record in R1 has a "corresponding record" in R2, such attacks are called correspondence attacks.
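To illustrate the counting behind Scenario I, here is a simplified sketch (my own illustration, not the BCF formulation of [93]): for each (qid, sensitive value) group of R1 matching the target, the number of R2 records compatible with the target's QID and the same sensitive value bounds how many group members can originate from the target, and any excess is cracked.

def person_matches(person_qid, record_qid, covers):
    """The person's (leaf) QID matches a generalized record if each person value
    equals or specializes the corresponding record value."""
    return all(covers(r, p) for p, r in zip(person_qid, record_qid))

def f_attack_cracked(R1, R2, person_qid, covers):
    groups = {}
    for qid, s in R1:
        if person_matches(person_qid, qid, covers):
            groups[(qid, s)] = groups.get((qid, s), 0) + 1
    cracked = 0
    for (qid, s), size in groups.items():
        bound = sum(1 for qid2, s2 in R2
                    if s2 == s and person_matches(person_qid, qid2, covers))
        cracked += max(0, size - bound)       # excess records cannot be the target's
    return cracked

def covers(general, leaf):   # taxonomy of Figure 10.1
    return general == leaf or (general, leaf) in {
        ("Europe", "UK"), ("Europe", "France"),
        ("Professional", "Lawyer"), ("Professional", "Doctor")}

R1 = [(("Europe", "Lawyer"), "Flu")] * 3 + [(("Europe", "Lawyer"), "HIV")] * 2
R2 = ([(("UK", "Professional"), "Flu")] * 3 + [(("France", "Professional"), "HIV")] * 3
      + [(("France", "Professional"), "Flu")] * 2 + [(("UK", "Professional"), "HIV")] * 2)
print(f_attack_cracked(R1, R2, ("France", "Lawyer"), covers))   # 1, as in Scenario I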

Having access to only R1 and R2, not T1 and T2, cracking a record is not straightforward. For example, to crack a record in R1 for Alice having QID = [France, Lawyer], the adversary must show that the original birthplace in the record is not France, whereas the published Europe may or may not originate from France. Similarly, it is not straightforward to infer the timestamp of a record in R2. For example, any three of b1, b2, b3, b7, b8 can be the corresponding records of a1, a2, a3, so none of them must have timestamp t1. In fact, observing only the published records, there are many possible assignments of corresponding records between R1 and R2. For example, one assignment is

(a1, b1), (a2, b2), (a3, b3), (a4, b4), (a5, b5),

where the original record represented by (a1, b1) is [UK, Lawyer, Flu]. Another assignment is

(a1, b7), (a2, b2), (a3, b8), (a4, b6), (a5, b9).

In this assignment, the original record represented by (a1, b7) is [France, Lawyer, Flu]. All such assignments are possible to the adversary because they all produce the same "view," i.e., R1 and R2. Detecting correspondence attacks assuming this view of the adversary is non-trivial.

In Chapter 10.2, we formalize the notion of correspondence attacks and present an approach to prevent such attacks. We focus on answering several key questions:

• Given that there are many possible ways of assigning corresponding pairs, and each may lead to a different inference, what should the adversary assume while cracking a record? We present a model of correspondence attacks to address this issue and show that the continuous data publishing problem subsumes the case with multiple colluding recipients (Chapter 10.2.3).

• What exactly are the records that can be cracked based on R1 and R2? We systematically characterize the set of records cracked by correspondence attacks and propose the notion of BCF-anonymity to measure anonymity assuming this power of the adversary (Chapter 10.2.4).

• Can R2 be anonymized such that R2 satisfies BCF-anonymity yet remains useful? We show that the optimal BCF-anonymization is NP-hard. Then, we develop a practically efficient algorithm to determine a BCF-anonymized R2 (Chapter 10.2.5), and extend the proposed approach to deal with more than two releases (Chapter 10.2.6).

10.2.3 Anonymization Problem for Continuous Publishing

To generalize a table, Fung et al. [93] employ the subtree generalization scheme discussed in Chapter 3.1. A generalized attribute Aj can be represented by a "cut" through its taxonomy tree. R1 and R2 in Tables 10.4-10.5 are examples. There are some other generalization schemes, such as multidimensional and local recoding, that cause less data distortion, but these schemes make data analysis difficult and suffer from the data exploration problem discussed in Chapter 3.1.

In a k-anonymized table, records are partitioned into equivalence classes of size (i.e., number of records) at least k. Each equivalence class contains all records having the same value on QID. qid denotes both a value on QID and the corresponding equivalence class. |qid| denotes the size of the equivalence class. A group g in an equivalence class qid consists of the records in qid that have the same value on the sensitive attribute. In other words, a group contains all records in the table that are indistinguishable with respect to QID and the sensitive attribute. Note that this group notion is different from that of QID groups in the literature, which require only being identical on QID. A person matches a (generalized) record in a table if her QID is either equal to or more specific than the record on every attribute in QID.

The data holder previously collected some data T1 timestamped t1 and published a k-anonymized version of T1, called release R1. Then the data holder collects new data T2 timestamped t2 and publishes a k-anonymized version of T1 ∪ T2, called release R2. An adversary, one of the recipients of R1 and R2, attempts to identify the record of some target person, denoted by P, from R1 or R2. The problem assumes that the adversary is aware of P's QID and timestamp. In addition, the adversary has the following correspondence knowledge:

1. Every record timestamped t1 (i.e., from T1) has a record in R1 and a record in R2, called corresponding records.

2. Every record timestamped t2 (i.e., from T2) has a record in R2, but not in R1.

Below is an intuition of the three possible attacks based on such knowledge.

Forward-attack, denoted by F-attack(R1, R2). P has timestamp t1 and the adversary tries to identify P's record in the cracking release R1 using the background release R2. Since P has a record in R1 and a record in R2, if a matching record r1 in R1 represents P, there must be a corresponding record in R2 that matches P's QID and agrees with r1 on the sensitive attribute. If r1 fails to have such a corresponding record in R2, then r1 does not originate from P's QID, and therefore r1 can be excluded as a possibility for P's record. Scenario I is an example.

Cross-attack, denoted by C-attack(R1, R2). P has timestamp t1 and the adversary tries to identify P's record in the cracking release R2 using the background release R1. Similar to the F-attack, if a matching record r2 in R2 represents P, there must be a corresponding record in R1 that matches P's QID and agrees with r2 on the sensitive attribute. If r2 fails to have such a corresponding record in R1, then r2 either has timestamp t2 or does not originate from P's QID, and therefore r2 can be excluded as a possibility for P's record. Scenario II is an example.

Backward-attack, denoted by B-attack(R1, R2). P has timestamp t2 and the adversary tries to identify P's record in the cracking release R2 using the background release R1. In this case, P has a record in R2, but not in R1. Therefore, if a matching record r2 in R2 has to be the corresponding record of some record in R1, then r2 has timestamp t1, and therefore r2 can be excluded as a possibility for P's record. Scenario III is an example. Note that it is impossible to single out the matching records in R2 that have timestamp t2 but do not originate from P's QID, since all records at t2 have no corresponding record in R1.

Table 10.6 summarizes all four possible combinations of cracking release (R1 or R2) and target P's timestamp (t1 or t2).


Table 10.6: Types of correspondence attacks

              Target P's timestamp   Cracking release   Background release
  F-attack    t1                     R1                 R2
  C-attack    t1                     R2                 R1
  B-attack    t2                     R2                 R1
  No attack   t2                     R1                 R2

Note that if a target P has timestamp t2, P does not have a record in R1, so it is impossible to crack P's record in R1 in such a case and there are only three types of attacks.

All these attacks are based on making some inferences about corresponding records. There are many possible assignments of corresponding records that are consistent with the published view R1 and R2, and the adversary's knowledge. Each assignment implies a possibly different underlying data (D′1, D′2), not necessarily the actual underlying data (T1, T2) collected by the data holder. Since all such underlying data (D′1, D′2) generate the same published R1 and R2, they are all possible to the adversary who knows about the data only through the published R1 and R2. This observation suggests that we should consider only the inferences that do not depend on a particular choice of a candidate (D′1, D′2). First, let us define the space of such candidate underlying data for the published R1 and R2.

Consider a record r in R1 or R2. An instantiation of r is a raw record that agrees with r on the sensitive attribute and specializes r or agrees with r on QID. A generator of (R1, R2) is an assignment, denoted by I, from the records in R1 ∪ R2 to their instantiations such that: for each record r1 in R1, there is a distinct record r2 in R2 such that I(r1) = I(r2); (r1, r2) are called buddies under I. Duplicate records are treated as distinct records. The buddy relationship is injective: no two records have the same buddy. Every record in R1 has a buddy in R2 and exactly |R1| records in R2 have a buddy in R1. If r2 in R2 has a buddy in R1, I(r2) has timestamp t1; otherwise I(r2) has timestamp t2. Intuitively, a generator represents an underlying data for (R1, R2) and each pair of buddies represents corresponding records in the generator.

Example 10.4
Refer to R1 and R2 in Table 10.4 and Table 10.5. One generator has the buddies:

(a1, b1), assigned to [UK, Lawyer, Flu]
(a2, b2), assigned to [UK, Lawyer, Flu]
(a3, b3), assigned to [UK, Lawyer, Flu]
(a4, b4), assigned to [France, Lawyer, HIV]
(a5, b5), assigned to [France, Lawyer, HIV]

I(b1)-I(b5) have timestamp t1. b6-b10 can be assigned to any instantiation.


Another generator has the buddies:

(a1, b7), assigned to [France, Lawyer, Flu]
(a2, b2), assigned to [UK, Lawyer, Flu]
(a3, b8), assigned to [France, Lawyer, Flu]
(a4, b9), assigned to [UK, Lawyer, HIV]
(a5, b10), assigned to [UK, Lawyer, HIV]

I(b2), I(b7)-I(b10) have timestamp t1.

Another generator has the buddies:

(a1, b7), assigned to [France, Lawyer, Flu]
(a2, b2), assigned to [UK, Lawyer, Flu]
(a3, b8), assigned to [France, Lawyer, Flu]
(a4, b6), assigned to [France, Lawyer, HIV]
(a5, b9), assigned to [UK, Lawyer, HIV]

I(b2), I(b6)-I(b9) have timestamp t1. Note that these generators give different underlying data.

Consider a record r in R1 or R2. Suppose that for some generator I, the instantiation I(r) matches P's QID and timestamp. In this case, excluding r means information loss for the purpose of attack because there is some underlying data for (R1, R2) (given by I) in which r is P's record. On the other hand, suppose that for no generator I the instantiation I(r) can match P's QID and timestamp. Then r definitely cannot be P's record, so excluding r loses no information to the adversary. The attack model presented in [93] is based on excluding such non-representing records.

For a target P with timestamp t1, if P has a record in R1, P must match some qid1 in R1 and some qid2 in R2. Therefore, we assume such a matching pair (qid1, qid2) for P. Recall that a group gi in an equivalence class qidi consists of the records in qidi that agree on the sensitive attribute. For a generator I, I(gi) denotes the set of records {I(ri) | ri ∈ gi}. Below, we present the formal definition for each type of attack. ri, gi, and qidi refer to records, groups, and equivalence classes from Ri, i = 1, 2.

10.2.3.1 F-attack

The F-attack seeks to identify as many records as possible in an equivalence class qid1 that do not represent P in any choice of the generator. Such cracked records definitely cannot be P's record, and therefore can be excluded. Since all records in a group are identical, the choice of cracked records in a group does not make a difference, and determining the number of cracked records (i.e., the crack size) is sufficient to define the attack.


DEFINITION 10.1 Crack size  Assume that a target P has timestamp t1 and matches (qid1, qid2). A group g1 in qid1 has crack size c with respect to P if c is maximal such that for every generator I, at least c records in I(g1) do not match P's QID.

If g1 has crack size c, at least c records in g1 can be excluded from the possibility of P's record. On the other hand, with c being maximal, excluding more than c records will result in excluding some record that can possibly be P's record. Therefore, the crack size is both the minimum and the maximum number of records that can be excluded from g1 without any information loss for the purpose of attack.

Example 10.5
Consider Scenario I in Example 10.3. Alice has QID = [France, Lawyer] and matches (qid1, qid2), where

qid1 = [Europe, Lawyer] = {a1, a2, a3, a4, a5}
qid2 = [France, Professional] = {b4, b5, b6, b7, b8}.

qid1 has two groups: g1 = {a1, a2, a3} for Flu and g′1 = {a4, a5} for HIV. g1 has crack size 1 with respect to Alice because, for any generator I, at least one of I(a1), I(a2), I(a3) does not match Alice's QID: if all of I(a1), I(a2), I(a3) matched Alice's QID, R2 would have contained three buddies of the form [France, Professional, Flu]. The crack size is maximal since the second generator I in Example 10.4 shows that only one of I(a1), I(a2), I(a3) (e.g., I(a2) in Figure 10.2) does not match Alice's QID.

Definition 10.1 does not explain how to effectively determine the crack size, which is the topic in Chapter 10.2.4. For now, assuming that the crack size is known, we want to measure the anonymity after excluding the cracked records from each equivalence class. The F-anonymity below measures the minimum size of an equivalence class in R1 after excluding all records cracked by F-attack.

DEFINITION 10.2 F-anonymity  Let F(P, qid1, qid2) be the sum of the crack sizes for all groups in qid1 with respect to P. F(qid1, qid2) denotes the maximum F(P, qid1, qid2) for any target P that matches (qid1, qid2). F(qid1) denotes the maximum F(qid1, qid2) for all qid2 in R2. The F-anonymity of (R1, R2), denoted by FA(R1, R2) or FA, is the minimum (|qid1| − F(qid1)) for all qid1 in R1.


10.2.3.2 C-attack

DEFINITION 10.3 Crack size  Assume that a target P has timestamp t1 and matches (qid1, qid2). A group g2 in qid2 has crack size c with respect to P if c is maximal such that for every generator I, at least c records in I(g2) do not match either P's timestamp or P's QID.

Example 10.6
Consider Scenario II in Example 10.3. Alice has QID = [France, Lawyer] and matches (qid1, qid2), where

qid1 = [Europe, Lawyer] = {a1, a2, a3, a4, a5}
qid2 = [France, Professional] = {b4, b5, b6, b7, b8}.

qid2 has two groups: g2 = {b7, b8} for Flu, and g′2 = {b4, b5, b6} for HIV. g′2 has crack size 1 with respect to Alice since, for any generator I, at least one of I(b4), I(b5), I(b6) does not match Alice's timestamp or QID; otherwise, R1 would have contained three buddies of the form [Europe, Lawyer, HIV]. The crack size is maximal since the first generator I in Example 10.4 shows that I(b4) and I(b5) match Alice's timestamp and QID. b6 is excluded in Figure 10.2.

DEFINITION 10.4 C-anonymity  Let C(P, qid1, qid2) be the sum of crack sizes of all groups in qid2 with respect to P. C(qid1, qid2) denotes the maximum C(P, qid1, qid2) for any target P that matches (qid1, qid2). C(qid2) denotes the maximum C(qid1, qid2) for all qid1 in R1. The C-anonymity of (R1, R2), denoted by CA(R1, R2) or CA, is the minimum (|qid2| − C(qid2)) for all qid2 in R2.

10.2.3.3 B-attack

A target P for B-attack has timestamp t2 and thus does not have to match any qid1 in R1.

DEFINITION 10.5 Crack size  Assume that a target P has timestamp t2 and matches qid2 in R2. A group g2 in qid2 has crack size c with respect to P if c is maximal such that for every generator I, at least c records in I(g2) have timestamp t1.

Example 10.7
Consider Scenario III in Example 10.3. Alice has timestamp t2 and QID = [UK, Lawyer]. qid2 = [UK, Professional] consists of g2 = {b1, b2, b3} for Flu and g′2 = {b9, b10} for HIV. g2 has crack size 1 with respect to Alice. For every generator I, at least one of I(b1), I(b2), I(b3) has timestamp t1; otherwise, one of a1, a2, a3 would have no buddy in R2. The crack size is maximal since the second generator I in Example 10.4 shows that only I(b2) in Figure 10.2, among I(b1), I(b2), I(b3), has timestamp t1.

DEFINITION 10.6 B-anonymity  Let B(P, qid2) be the sum of the crack sizes of all groups in qid2 with respect to P. B(qid2) denotes the maximum B(P, qid2) for any target P that matches qid2. The B-anonymity of (R1, R2), denoted by BA(R1, R2) or BA, is the minimum (|qid2| − B(qid2)) for all qid2 in R2.

10.2.3.4 Detection and Anonymization Problems

A BCF-anonymity requirement states that all of BA, CA, and FA are equal to or larger than some data-holder-specified threshold. We study two problems. The first problem checks whether a BCF-anonymity requirement is satisfied, assuming the input (i.e., R1 and R2) as viewed by the adversary.

DEFINITION 10.7 Detection  Given R1 and R2, as described above, the BCF-detection problem is to determine whether a BCF-anonymity requirement is satisfied.

The second problem is to produce a generalized R2 that satisfies a given BCF-anonymity requirement and remains useful. This problem assumes the input as viewed by the data holder, that is, R1, T1, and T2, and uses an information metric to measure the usefulness of the generalized R2. Examples are discernibility cost [213] and data distortion [203].

DEFINITION 10.8 Anonymization  Given R1, T1, and T2, as described above, the BCF-anonymization problem is to generalize R2 = T1 ∪ T2 so that R2 satisfies a given BCF-anonymity requirement and remains as useful as possible with respect to a specified information metric.

In the special case of empty T1, F-attack and C-attack do not happen and B-anonymity coincides with k-anonymity of R2 for T2. Since the optimal k-anonymization is NP-hard [168], the optimal BCF-anonymization is NP-hard.

So far, the problem assumes that both R1 and R2 are received by one recipient. In the special case of empty T2, R1 and R2 are two different generalized versions of the same data T1 to serve different information requirements or different recipients. In this case, there are potentially multiple adversaries. What happens if the adversaries collude together? The collusion problem may seem to be very different. Indeed, Definitions 10.7-10.8 subsume the collusion problem. Consider the worst-case collusion scenario in which all recipients collude together by sharing all of their received data. This scenario is equivalent to publishing all releases to one adversary.

10.2.4 Detection of Correspondence Attacks

The key to the BCF-detection problem is computing the crack size in Definitions 10.1, 10.3, and 10.5. Fung et al. [93] present a method for computing the crack size of a group. Their insight is that if a record r represents the target P for some generator, its buddy in the other release (i.e., the corresponding record) must satisfy some conditions. If such conditions fail, r does not represent P for that generator. One of the conditions is the following "comparable" relationship.

DEFINITION 10.9  For qid1 in R1 and qid2 in R2, (qid1, qid2) are comparable if for every attribute A in QID, qid1[A] and qid2[A] are on the same path in the taxonomy of A. For a record r1 (or a group) in qid1 and a record r2 (or a group) in qid2, (r1, r2) are comparable if they agree on the sensitive attribute and (qid1, qid2) are comparable. For comparable (qid1, qid2), CG(qid1, qid2) denotes the set of group pairs {(g1, g2)}, where g1 and g2 are groups in qid1 and qid2 for the same sensitive value and there is one pair (g1, g2) for each sensitive value (unless both g1 and g2 are empty).
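
To make the "on the same path" condition concrete, the following is a minimal sketch of a comparability test (an illustration only, not the implementation of [93]); it assumes each taxonomy is given as a hypothetical child-to-parent dictionary:

# A minimal sketch of Definition 10.9, assuming each taxonomy is a
# hypothetical child->parent dictionary (None marks the root).
def ancestors(value, parent):
    """Return the set containing value and all of its ancestors."""
    result = set()
    while value is not None:
        result.add(value)
        value = parent.get(value)
    return result

def on_same_path(v1, v2, parent):
    """True if v1 and v2 lie on one root-to-leaf path of the taxonomy."""
    return v1 in ancestors(v2, parent) or v2 in ancestors(v1, parent)

def comparable(qid1, qid2, taxonomies):
    """qid1, qid2: dicts mapping attribute -> (generalized) value."""
    return all(on_same_path(qid1[a], qid2[a], taxonomies[a]) for a in taxonomies)

# The example values discussed below (hypothetical taxonomies):
location = {"UK": "Europe", "France": "Europe", "Canada": "America",
            "Europe": None, "America": None}
job = {"Lawyer": "Professional", "Accountant": "Professional", "Professional": None}
taxo = {"Loc": location, "Job": job}
print(comparable({"Loc": "Europe", "Job": "Lawyer"},
                 {"Loc": "UK", "Job": "Professional"}, taxo))      # True
print(comparable({"Loc": "Europe", "Job": "Lawyer"},
                 {"Loc": "Canada", "Job": "Professional"}, taxo))  # False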

Essentially, being comparable means sharing a common instantiation. For example, qid1 = [Europe, Lawyer] and qid2 = [UK, Professional] are comparable, but qid1 = [Europe, Lawyer] and qid2 = [Canada, Professional] are not. It is easy to see that if a target P matches (qid1, qid2), (qid1, qid2) are comparable; if (r1, r2) are buddies (for some generator), (r1, r2) are comparable; comparable (r1, r2) can be assigned to be buddies (because of sharing a common instantiation). The following fact can be verified:

THEOREM 10.1
Suppose that P matches (qid1, qid2) and that (r1, r2) are buddies for a generator I. If I(r1) and I(r2) match P's QID, then r1 is in g1 if and only if r2 is in g2, where (g1, g2) is in CG(qid1, qid2).

Theorem 10.1 follows because buddies agree on the sensitive attribute and I(r1) and I(r2) matching P's QID implies that r1 is in qid1 and r2 is in qid2. The next two lemmas are used to derive an upper bound on crack size. The first states some transitivity of the "comparable" relationship, which will be used to construct a required generator in the upper bound proof.

LEMMA 10.1 3-hop transitivity
Let r1, r′1 be in R1 and r2, r′2 be in R2. If each of (r′1, r2), (r2, r1), and (r1, r′2) is comparable, then (r′1, r′2) is comparable.

PROOF See [93].

The following is the key lemma for proving the upper bound of crack size.

LEMMA 10.2
Let (g1, g2) be in CG(qid1, qid2). There exists a generator in which exactly min(|g1|, |g2|) records in g1 have a buddy in g2.

PROOF See [93].

Below, we show how to determine the crack size for F-attack. For C-attack and B-attack, we use examples to illustrate the general idea. Refer to [93] for the details of determining the crack sizes of C-attack and B-attack.

10.2.4.1 F-attack

Assume that P matches (qid1, qid2). Consider a group pair (g1, g2) in CG(qid1, qid2). Since the buddy relationship is injective, if g1 contains more records than g2, i.e., |g1| > min(|g1|, |g2|), at least |g1| − min(|g1|, |g2|) records in g1 do not have a buddy in g2 for any generator. According to Theorem 10.1, these records do not originate from P's QID for any generator. Thus the crack size of g1 is at least |g1| − min(|g1|, |g2|) (i.e., a lower bound). On the other hand, according to Lemma 10.2, there exists some generator in which exactly |g1| − min(|g1|, |g2|) records in g1 do not have a buddy in g2; according to Theorem 10.1, these records do not originate from P's QID for any generator. By Definition 10.1, |g1| − min(|g1|, |g2|) is the crack size of g1.

THEOREM 10.2
Suppose that a target P matches (qid1, qid2). Let (g1, g2) be in CG(qid1, qid2). (1) g1 has crack size c with respect to P, where c = |g1| − min(|g1|, |g2|). (2) F(qid1, qid2) = Σ c, where the sum is over (g1, g2) in CG(qid1, qid2) and g1 has the crack size c determined in (1).

Remarks. F(P, qid1, qid2) is the same for all targets P that match (qid1, qid2), i.e., it equals F(qid1, qid2) computed by Theorem 10.2. To compute FA, we compute F(qid1, qid2) for all comparable (qid1, qid2). This requires partitioning the records into equivalence classes and groups, which can be done by sorting the records on all attributes. The F-attack happens when |g1| > min(|g1|, |g2|), that is, g1 contains too many records for their buddies to be contained in g2. This could be the case if R2 has less generalization due to additional records at timestamp t2.


Example 10.8
Continue with Example 10.5. Alice matches qid1 = [Europe, Lawyer] and qid2 = [France, Professional]. qid1 consists of g1 = {a1, a2, a3} for Flu and g′1 = {a4, a5} for HIV. qid2 consists of g2 = {b7, b8} for Flu and g′2 = {b4, b5, b6} for HIV. CG(qid1, qid2) = {(g1, g2), (g′1, g′2)}. |g1| = 3, |g2| = 2, |g′1| = 2 and |g′2| = 3. So, g1 has crack size 1 and g′1 has crack size 0.
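
The group-size arithmetic of Theorem 10.2 is easy to automate. Below is a minimal sketch (an illustration, not the detection algorithm of [93]) that computes the per-group crack sizes and F(qid1, qid2) from the group sizes of a comparable pair, using the numbers of Example 10.8:

# Crack sizes for F-attack (Theorem 10.2), given the sizes of the
# sensitive-value groups of a comparable pair (qid1, qid2).
def f_crack_sizes(groups1, groups2):
    """groups1, groups2: dicts mapping sensitive value -> group size in qid1 / qid2."""
    crack = {}
    for s in set(groups1) | set(groups2):
        g1 = groups1.get(s, 0)
        g2 = groups2.get(s, 0)
        crack[s] = g1 - min(g1, g2)   # records in g1 with no possible buddy in g2
    return crack

# Example 10.8: qid1 = [Europe, Lawyer], qid2 = [France, Professional]
groups1 = {"Flu": 3, "HIV": 2}   # g1 = {a1,a2,a3}, g1' = {a4,a5}
groups2 = {"Flu": 2, "HIV": 3}   # g2 = {b7,b8},    g2' = {b4,b5,b6}
crack = f_crack_sizes(groups1, groups2)
print(crack)                # {'Flu': 1, 'HIV': 0}
print(sum(crack.values()))  # F(qid1, qid2) = 1

Following Definition 10.2, FA would then be obtained by taking, for every qid1 in R1, the maximum F(qid1, qid2) over all comparable qid2 and forming the minimum of |qid1| − F(qid1).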

10.2.4.2 C-attack

By a similar argument, at least |g2| − min(|g1|, |g2|) records in g2 do not have a buddy in g1 in any generator, so the crack size of g2 is at least |g2| − min(|g1|, |g2|).

THEOREM 10.3
Suppose that a target P matches (qid1, qid2). Let (g1, g2) be in CG(qid1, qid2). (1) g2 has crack size c with respect to P, where c = |g2| − min(|g1|, |g2|). (2) C(qid1, qid2) = Σ c, where the sum is over (g1, g2) in CG(qid1, qid2) and g2 has the crack size c determined in (1).

Example 10.9
Continue with Example 10.6. Alice matches qid1 = [Europe, Lawyer] and qid2 = [France, Professional]. qid1 consists of g1 = {a1, a2, a3} and g′1 = {a4, a5}. qid2 consists of g2 = {b7, b8} and g′2 = {b4, b5, b6}. CG(qid1, qid2) = {(g1, g2), (g′1, g′2)}. |g2| = 2, |g1| = 3, |g′2| = 3 and |g′1| = 2. Thus g2 has crack size 0 and g′2 has crack size 1.
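
Because Theorem 10.3 merely swaps the roles of g1 and g2, the helper sketched after Example 10.8 can be reused directly (again only an illustration):

# C-attack crack sizes (Theorem 10.3): swap the two releases.
c_crack = f_crack_sizes(groups2, groups1)
print(c_crack)                # {'Flu': 0, 'HIV': 1}  -- matches Example 10.9
print(sum(c_crack.values()))  # C(qid1, qid2) = 1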

10.2.4.3 B-attack

Suppose that P matches some qid2 in R2. Let g2 be a group in qid2. The crack size of g2 is related to the number of records in g2 that have a buddy in R1 (thus timestamp t1). Let G1 denote the set of records in R1 comparable to g2. So G1 contains all the records in R1 that can have a buddy in g2. Let G2 denote the set of records in R2 comparable to some record in G1. The next lemma implies that all records in G1 can have a buddy in G2 − g2.

LEMMA 10.3
Every record in G2 is comparable to all records in G1 and only those records in G1.

From Lemma 10.3, all records in G1 and only those records in G1 can have a buddy in G2. Each record in G1 has its buddy either in g2 or in G2 − g2, but not in both. If |G1| > |G2| − |g2|, the remaining c = |G1| − (|G2| − |g2|) records in G1 must have their buddies in g2, or equivalently, c records in g2 must have their buddies in R1 (therefore, timestamp t1). The next theorem follows from this observation.

THEOREM 10.4
Suppose that a target P has timestamp t2 and matches qid2 in R2. Let g2 be in qid2. (1) If |G2| < |g2|, g2 has crack size 0 with respect to P. (2) If |G2| ≥ |g2|, g2 has crack size c, where c = max(0, |G1| − (|G2| − |g2|)). (3) B(qid2) = Σ c, where the sum is over g2 in qid2 and g2 has the crack size c determined in (1) and (2).

Example 10.10
Continue with Example 10.7. Alice has [UK, Lawyer] and timestamp t2. qid2 = [UK, Professional] has g2 for Flu and g′2 for HIV:

g2 = {b1, b2, b3}
G1 = {a1, a2, a3}
G2 = {b1, b2, b3, b7, b8}
|G1| − (|G2| − |g2|) = 3 − (5 − 3) = 1

g′2 = {b9, b10}
G1 = {a4, a5}
G2 = {b4, b5, b6, b9, b10}
|G1| − (|G2| − |g′2|) = 2 − (5 − 2) = −1

Thus g2 has crack size 1 and g′2 has crack size 0.
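
A minimal sketch of the B-attack crack size of Theorem 10.4 (again my own illustration, not the detection method of [93]), checked against the sizes of Example 10.10:

# B-attack crack size (Theorem 10.4), from the sizes |g2|, |G1|, |G2|.
def b_crack_size(g2_size, G1_size, G2_size):
    if G2_size < g2_size:                            # case (1) of Theorem 10.4
        return 0
    return max(0, G1_size - (G2_size - g2_size))     # case (2)

# Example 10.10: qid2 = [UK, Professional]
print(b_crack_size(3, 3, 5))  # Flu group g2:  crack size 1
print(b_crack_size(2, 2, 5))  # HIV group g2': crack size 0
# B(qid2) = 1 + 0 = 1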

10.2.4.4 Equivalence of F-attack and C-attack

F-attack and C-attack are motivated under different scenarios and have a different characterization of crack size. Despite such differences, we show that these attacks are not independent of each other at all; in fact, FA = CA. Refer to [93] for a formal proof.

10.2.5 Anonymization Algorithm for Correspondence Attacks

An algorithm called BCF-anonymizer for anonymizing R2 to satisfy all of FA ≥ k, CA ≥ k, BA ≥ k can be found in [93]. BCF-anonymizer iteratively specializes R2 starting from the most generalized state of R2. In the most generalized state, all values for each attribute Aj ∈ QID are generalized to the top most value in the taxonomy. Each specialization, for some attribute in QID, replaces a parent value with an appropriate child value in every record containing the parent value. The top-down specialization approach relies on the anti-monotonicity property of BCF-anonymity: FA, CA, and BA are non-increasing in this specialization process.


Algorithm 10.2.6 BCF-Anonymizer
Input: R1, R2 = T1 ∪ T2, k, taxonomy for each Aj ∈ QID.
Output: a BCF-anonymized R2.
1:  generalize every value for Aj ∈ QID in R2 to ANYj;
2:  let candidate list = ∪Cut2j containing all ANYj;
3:  sort candidate list by Score in descending order;
4:  while the candidate list is not empty do
5:    if the first candidate w in candidate list is valid then
6:      specialize w into w1, . . . , wz in R2;
7:      compute Score for all wi;
8:      remove w from ∪Cut2j and the candidate list;
9:      add w1, . . . , wz to ∪Cut2j and the candidate list;
10:     sort the candidate list by Score in descending order;
11:   else
12:     remove w from the candidate list;
13:   end if
14: end while
15: output R2 and ∪Cut2j;

Therefore, all further specializations can be pruned once any of the above requirements are violated.

THEOREM 10.5
Each of FA, CA and BA is non-increasing with respect to a specialization on R2.

According to Theorem 10.5, if the most generalized R2 does not satisfy FA ≥ k, CA ≥ k, and BA ≥ k, no generalized R2 does. Therefore, we can first check if the most generalized R2 satisfies this requirement before searching for a less generalized R2 satisfying the requirement.

COROLLARY 10.1
For a given requirement on BCF-anonymity, there exists a generalized R2 that satisfies the requirement if and only if the most generalized R2 does.

Below, we study the general idea of an efficient algorithm for producing a locally maximal specialized R2.

Finding an optimal BCF-anonymized R2 is NP-hard. BCF-anonymizer, summarized in Algorithm 10.2.6, aims at producing a maximally specialized (suboptimal) BCF-anonymized R2 for which any further specialization leads to a violation. It starts with the most generalized R2. At any time, R2 contains the generalized records of T1 ∪ T2 and Cut2j gives the generalization cut for Aj ∈ QID. Each equivalence class qid2 in R2 is associated with a set of groups g2 with stored |g2|. Each group g2 is associated with the set of raw records in T1 ∪ T2 generalized to the group. R1 is represented similarly with |qid1| and |g1| stored, except that no raw record is kept for g1. Cut1j contains the generalization cut for Aj ∈ QID in R1. Cut1j never changes once created.

Initially, ∪Cut2j and the candidate list contain the most general value ANYj for every Aj ∈ QID (Lines 1-3). In each iteration, BCF-anonymizer examines the first valid candidate specialization ranked by a criterion Score. If the candidate w is valid, that is, not violating the BCF-anonymity after its specialization, we specialize w on R2 (Lines 6-10); otherwise, we remove w from the candidate list (Line 12). This iteration is repeated until there is no more candidate. From Theorem 10.5, the returned R2 is maximal (suboptimal). Similar to the top-down specialization approaches studied in previous chapters, Score ranks the candidates by their "information worth." We employ the discernibility cost DM (Chapter 4.1) which charges a penalty to each record for being indistinguishable from other records.
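
To make the control flow of Algorithm 10.2.6 concrete, here is a hedged Python skeleton of the top-down specialization loop. The hooks is_valid (the BCF-anonymity check after a candidate specialization), specialize, and score (e.g., the reduction in discernibility cost) are hypothetical placeholders, not the authors' implementation:

# Skeleton of the loop of Algorithm 10.2.6. R2 is assumed to already be in
# its most generalized state (Line 1); `roots` holds the ANYj values.
def bcf_anonymizer(R2, roots, is_valid, specialize, score):
    cut = set(roots)                       # current generalization cut (∪Cut2j)
    candidates = sorted(cut, key=score, reverse=True)
    while candidates:
        w = candidates.pop(0)              # first (highest-Score) candidate
        if not is_valid(w, R2):            # specializing w would break FA/CA/BA >= k
            continue                       # Line 12: drop w permanently
        children = specialize(w, R2)       # Lines 6-7: replace w by its child values
        cut.discard(w)                     # Line 8
        cut.update(children)               # Line 9
        candidates.extend(children)
        candidates.sort(key=score, reverse=True)   # Line 10
    return R2, cut

Dropping an invalid candidate permanently is sound because of the anti-monotonicity stated in Theorem 10.5.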

In general, Lines 5-7 require scanning all pairs (qid1, qid2) and all records in R2, which is highly inefficient for a large data set. Fung et al. [93] present an incremental computation of FA, CA, and BA that examines only comparable pairs (qid1, qid2) and raw records in R2 that are involved in the current specialization.

10.2.6 Beyond Two Releases

Fung et al. [93] also extend the two-release case to the general case involving more than two releases. Consider the raw data T1, . . . , Tn collected at timestamps t1, . . . , tn. Let Ri denote the release for T1 ∪ · · · ∪ Ti, 1 ≤ i ≤ n. All records in Ri have the special timestamp, denoted by T∗i, that matches any timestamp from t1, . . . , ti. The correspondence knowledge now has the form that every record in Ri (except the last one) has a corresponding record in all releases Rj such that j > i. The notion of "generators" can take this into account. Given more releases, the adversary can conduct two additional types of correspondence attacks described below.

Optimal micro attacks: The general idea is to choose the "best" background release, yielding the largest possible crack size, individually to crack each group.

Composition of micro attacks: Another type of attack is to "compose" multiple micro attacks together (apply one after another) in order to increase the crack size of a group. Composition is possible only if all the micro attacks in the composition assume the same timestamp for the target and the correspondence knowledge required for the next attack holds after applying previous attacks.

The anonymization algorithm can be extended as follows to handle multiple releases.


1. The notion of BCF-anonymity should be defined based on the optimal crack size of a group with respect to micro attacks as well as composed attacks on the group.

2. Each time we anonymize the next release Rn for T1 ∪ · · · ∪ Tn, we assume that R1, . . . , Rn−1 satisfy BCF-anonymity. Hence, the anonymization of Rn only needs to ensure that BCF-anonymity is not violated by any attack that involves Rn.

3. The anti-monotonicity of BCF-anonymity, in the spirit of Theorem 10.5, remains valid in this general case. These observations are crucial for maintaining the efficiency.

As the number of releases increases, the constraints imposed on the next release Rn become increasingly restrictive. However, this does not necessarily require more distortion because the new records Tn may help reduce the need of distortion. In case the distortion becomes too severe, the data holder may consider starting a new chain of releases without including previously published records.

10.2.7 Beyond Anonymity

The presented detection and anonymization methods can be extended to thwart attribute linkages by incorporating other privacy models, such as entropy ℓ-diversity and (c, ℓ)-diversity in Chapter 2.2.1, confidence bounding in Chapter 2.2.2, and (α, k)-anonymity in Chapter 2.2.5. The first modification is to take the privacy requirements into account in the exclusion of cracked records. In this case, the crack size of each group gives all the information needed to exclude sensitive values from an equivalence class. For the extension of the anonymization algorithm, anonymity is a necessary privacy property because identifying the exact record of an individual from a small set of records is too easy. Thus, BCF-anonymity is required even if other privacy requirements are desired. Under this assumption, the proposed approach is still applicable to prune unpromising specializations based on the anti-monotonicity of BCF-anonymity.

10.3 Dynamic Data Republishing

Xiao and Tao [251] study the privacy issues in the dynamic data republishing model, in which both record insertions and deletions may be performed before every release. They also present a privacy model and an anonymization method for the dynamic data republishing model. The following example [251] illustrates the privacy threats in this scenario.


Table 10.7: Raw data T1

  Not released   Quasi-identifier (QID)   Sensitive
  Rec ID         Age     Salary           Surgery
  1              31      22K              Transgender
  2              32      24K              Plastic
  3              34      28K              Vascular
  4              33      35K              Urology
  5              51      30K              Vascular
  6              46      37K              Urology
  7              47      43K              Transgender
  8              50      45K              Vascular
  9              53      36K              Urology
  10             62      43K              Transgender
  11             66      44K              Urology

Table 10.8: 2-anonymous and 2-diverse R1

  Not released   Quasi-identifier (QID)   Sensitive
  Rec ID         Age        Salary        Surgery
  1              [31-32]    [22K-24K]     Transgender
  2              [31-32]    [22K-24K]     Plastic
  3              [33-34]    [28K-35K]     Vascular
  4              [33-34]    [28K-35K]     Urology
  5              [46-51]    [30K-37K]     Vascular
  6              [46-51]    [30K-37K]     Urology
  7              [47-53]    [36K-45K]     Transgender
  8              [47-53]    [36K-45K]     Vascular
  9              [47-53]    [36K-45K]     Urology
  10             [62-66]    [43K-44K]     Transgender
  11             [62-66]    [43K-44K]     Urology

10.3.1 Privacy Threats

Example 10.11
A hospital wants to periodically, say every three months, publish the current patients' records. Let T1 (Table 10.7) be the raw data of the first release. Rec ID is for discussion only, not for release. QID = {Age, Salary} and Surgery is sensitive. The hospital generalizes T1 and publishes R1 (Table 10.8). Then, during the next three months, records # 2, 3, 6, 8, and 10 are deleted, and new records # 12-16 are added. The updated raw data table T2, including the new records # 12-16, is shown in Table 10.9. The hospital then generalizes T2 and publishes a new release R2 (Table 10.10). Though both R1 and R2 are individually 2-anonymous and 2-diverse, an adversary can still identify the sensitive value of some patients.


Suppose an adversary has targeted a victim, Annie, and knows that Annie is 31 years old and has salary 22K as background knowledge. Based on generalized R1, the adversary can determine that Annie's received surgery is either Transgender or Plastic because records # 1 and 2 are the only records matching the background knowledge. Based on the generalized R2, the adversary can determine that Annie's received surgery is either Transgender or Urology. Thus, by combining the aforementioned knowledge, the adversary can correctly infer that Annie's received surgery is Transgender.

The privacy threat illustrated above cannot be resolved by generalization because the value Plastic is absent in T2; therefore, the adversary can infer that Annie's received surgery is Transgender no matter how T2 is generalized. This problem is called critical absence [251].

One naive solution to tackle the problem of critical absence is to let the deleted records remain in the subsequent releases and use them to disorient the inferences by the adversary. Yet, this strategy implies that the number of records increases monotonically in every release while many records are no longer relevant for that period of time. Also, keeping the deleted records may not necessarily provide the privacy guarantee as expected if an adversary is aware of the records' deletion timestamps [251].

10.3.2 m-invariance

To address both record insertions and deletions in this dynamic data republishing model, Xiao and Tao [251] propose a privacy model called m-invariance. A sequence of releases R1, . . . , Rp is m-invariant if (1) every qid group in any Ri contains at least m records and all records in qid have different sensitive values, and (2) for any record r with published lifespan [x, y] where 1 ≤ x, y ≤ p, qidx, . . . , qidy have the same set of sensitive values, where qidx, . . . , qidy are the generalized qid groups containing r in Rx, . . . , Ry. The rationale of m-invariance is that, if a record r has been published in Rx, . . . , Ry, then all qid groups containing r must have the same set of sensitive values. This will ensure that the intersection of sensitive values over all such qid groups does not reduce the set of sensitive values compared to each qid group. Thus, the above mentioned critical absence does not occur.
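
A minimal sketch of an m-invariance check over a sequence of releases is given below; it is an illustration of the two conditions above, not the algorithm of [251], and assumes a hypothetical representation of each release as a list of (record id, group id, sensitive value) triples:

from collections import defaultdict

# releases: one list of (rec_id, group_id, sensitive_value) triples per
# release R1, ..., Rp (a hypothetical, simplified representation).
def is_m_invariant(releases, m):
    signatures = defaultdict(list)            # rec_id -> signatures over its lifespan
    for release in releases:
        groups = defaultdict(list)
        for rec_id, gid, s in release:
            groups[gid].append((rec_id, s))
        for members in groups.values():
            values = [s for _, s in members]
            # condition (1): at least m records, all with distinct sensitive values
            if len(members) < m or len(set(values)) != len(values):
                return False
            for rec_id, _ in members:
                signatures[rec_id].append(frozenset(values))
    # condition (2): each record keeps the same signature in every release
    # of its (assumed contiguous) lifespan
    return all(len(set(sigs)) == 1 for sigs in signatures.values())

# Tiny usage example with m = 2 and two releases (hypothetical data):
R1 = [(1, "g1", "Flu"), (2, "g1", "HIV")]
R2 = [(1, "g1", "Flu"), (2, "g1", "HIV"), (3, "g2", "Flu"), (4, "g2", "Fever")]
print(is_m_invariant([R1, R2], 2))   # True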

Note that condition (2) does not guarantee privacy if the lifespan is "broken," i.e., not continuous. In this case, the two broken lifespans are never checked against each other for the privacy guarantee. In other words, the model assumes that a record cannot reappear after its first lifespan; however, this assumption may not be realistic. For example, a patient may have some diseases over several releases, recover, and later develop other diseases. Consequently, his records will appear in multiple, broken time intervals.

Given a sequence of m-invariant R1, . . . , Rp−1, Xiao and Tao [251] maintain a sequence of m-invariant R1, . . . , Rp by minimally adding counterfeit data records and generalizing the current release Rp.


Table 10.9: Raw data T2

  Not released   Quasi-identifier (QID)   Sensitive
  Rec ID         Age     Salary           Surgery
  1              31      22K              Transgender
  4              33      35K              Urology
  12             25      31K              Vascular
  7              47      43K              Transgender
  9              53      36K              Urology
  5              51      30K              Vascular
  13             56      40K              Urology
  14             64      41K              Transgender
  11             66      44K              Urology
  15             70      54K              Urology
  16             75      46K              Vascular

Table 10.10: 2-anonymous and 2-diverse R2

  Not released   Quasi-identifier (QID)   Sensitive
  Rec ID         Age        Salary        Surgery
  1              [31-33]    [22K-35K]     Transgender
  4              [31-33]    [22K-35K]     Urology
  12             [35-53]    [31K-43K]     Vascular
  7              [35-53]    [31K-43K]     Transgender
  9              [35-53]    [31K-43K]     Urology
  5              [51-56]    [30K-40K]     Vascular
  13             [51-56]    [30K-40K]     Urology
  14             [64-66]    [41K-44K]     Transgender
  11             [64-66]    [41K-44K]     Urology
  15             [70-75]    [46K-54K]     Urology
  16             [70-75]    [46K-54K]     Vascular

The following example illustrates the general idea of counterfeited generalization.

Example 10.12
Let us revisit Example 10.11, where the hospital has published R1 (Table 10.8) and wants to anonymize T2 for the second release R2. Following the method of counterfeited generalization [251], R2 contains two tables:

1. The generalized data table (Table 10.11) with 2 counterfeits c1 and c2. An adversary cannot tell whether a record is counterfeit or real.

2. The auxiliary table (Table 10.12), indicating that the qid group 〈[31-32], [22K-24K]〉 contains 1 counterfeit and the qid group 〈[47-53], [36K-43K]〉 contains 1 counterfeit. The objective of providing such additional information is to improve the effectiveness of data analysis (a small sketch of such a correction appears after this example).


Table 10.11: R2 by counterfeited generalization

  Not released   Quasi-identifier (QID)   Sensitive
  Rec ID         Age        Salary        Surgery
  1              [31-32]    [22K-24K]     Transgender
  c1             [31-32]    [22K-24K]     Plastic
  4              [33-35]    [31K-35K]     Urology
  12             [33-35]    [31K-35K]     Vascular
  7              [47-53]    [36K-43K]     Transgender
  c2             [47-53]    [36K-43K]     Vascular
  9              [47-53]    [36K-43K]     Urology
  5              [51-56]    [30K-40K]     Vascular
  13             [51-56]    [30K-40K]     Urology
  14             [64-66]    [41K-44K]     Transgender
  11             [64-66]    [41K-44K]     Urology
  15             [70-75]    [46K-54K]     Urology
  16             [70-75]    [46K-54K]     Vascular

Table 10.12: Auxiliary table

  qid group                 Number of counterfeits
  〈[31-32], [22K-24K]〉      1
  〈[47-53], [36K-43K]〉      1

Now, suppose the adversary has a target victim, Annie, with background knowledge Age = 31 and Salary = 22K. By observing R1 in Table 10.8 and R2 with counterfeits in Table 10.11, the adversary can use the background knowledge to identify the qid groups 〈[31-32], [22K-24K]〉 in both releases. Since the qid groups in both releases share the same set of sensitive values {Transgender, Plastic}, the adversary can only conclude that Annie's received surgery is either Transgender or Plastic, with a 50% chance for each. Note that the auxiliary information in Table 10.12 does not provide any additional information to the adversary to crack Annie's sensitive value.
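
As a small hedged illustration (not from [251]) of how a recipient can use the auxiliary table for data analysis, a count query over a qid group can be corrected by subtracting the published number of counterfeits in that group:

# Correcting per-group counts using the auxiliary table of counterfeit
# counts (an illustration, assuming the simple data layout shown below).
def corrected_group_counts(release, auxiliary):
    """release: list of (qid_group, sensitive_value); auxiliary: qid_group -> #counterfeits."""
    counts = {}
    for qid, _ in release:
        counts[qid] = counts.get(qid, 0) + 1
    return {qid: n - auxiliary.get(qid, 0) for qid, n in counts.items()}

# The first qid group of Table 10.11 has 2 published records, 1 counterfeit.
release = [("<[31-32],[22K-24K]>", "Transgender"), ("<[31-32],[22K-24K]>", "Plastic")]
auxiliary = {"<[31-32],[22K-24K]>": 1}
print(corrected_group_counts(release, auxiliary))  # {'<[31-32],[22K-24K]>': 1}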

From the above example, we notice that the key to privacy-preserving dynamic data re-publication is to ensure certain "invariance" in all qid groups that a record is generalized to in different releases [251]. The privacy model m-invariance captures this notion.

10.4 HD-Composition

With m-invariance, we assume that the QID attribute values and the sensitive attribute values of each individual do not change over time.


Table 10.13: Voter registration list (RL)

  RL1                       RL2                       RL3
  PID     Age   Zip         PID     Age   Zip         PID     Age   Zip
  p1      23    16355       p1      23    16355       p1      23    16355
  p2      22    15500       p2      22    15500       p2      22    15500
  p3      21    12900       p3      21    12900       p3      21    12900
  p4      26    18310       p4      26    18310       p4      26    18310
  p5      25    25000       p5      25    25000       p5      25    15000
  p6      20    29000       p6      20    29000       p6      20    29000
  p7      24    33000       p7      24    33000       p7      24    33000
  ...     ...   ...         ...     ...   ...         ...     ...   ...
  p|RL|   31    31000       p|RL|   31    31000       p|RL|   31    31000


However, in realistic applications, both the QID value and the sensitive value of an individual can change over time, while some special sensitive values should remain unchanged. For example, after a move, the postal code of an individual changes. That is, an external table such as a voter registration list can have multiple releases and changes from time to time. Also, a patient may recover from one disease but develop another disease, meaning that the sensitive value can change over time. Bu et al. [38] propose a method called HD-composition to deal with this scenario.

The motivating example described by [42] and [251] is that the adversary may notice a neighbor being sent to hospital, from which s/he knows that a record for the neighbor must exist in two or more consecutive releases. They further assume that the disease attribute of the neighbor must remain the same in these releases. However, the presence of the neighbor in multiple data releases does not imply that the records for the neighbor will remain the same in terms of the sensitive value.

At the same time, some sensitive values that once linked to a record owner can never be unlinked. For instance, in medical records, sensitive diseases such as HIV, diabetes, and cancers are to this date incurable, and therefore they are expected to persist. These are the permanent sensitive values. Permanent sensitive values can be found in many domains of interest. Some examples are "having a pilot's qualification" and "having a criminal record."

Let us illustrate the problem with an example. In Table 10.13, RL1, RL2, and RL3 are snapshots of a voter registration list at times 1, 2, and 3, respectively. The raw data in Table 10.14, T1, T2, and T3, are to be anonymized at times 1, 2, and 3, respectively. In Table 10.15, three tables T∗1, T∗2, and T∗3 are published serially at times 1, 2, and 3, respectively. It is easy to see that T∗1, T∗2, and T∗3 satisfy 3-invariance. This is because in any release, for each individual, the set of 3 distinct sensitive values that the individual is linked to in the corresponding qid group remains unchanged. Note that HIV is a permanent disease but Flu and Fever are transient diseases. Furthermore, assume that from the registration lists, one can determine that p1, p2, . . . , p6 are the only individuals who satisfy the QID conditions for the groups with GID = 1 and GID = 2 in all three tables T∗1, T∗2, and T∗3. Then, surprisingly, the adversary can determine that p4 has HIV with 100% probability. The reason is based on possible world exclusion from all published releases.

It can be shown that p1 and p6 cannot be linked to HIV. Suppose that p1 suffers from HIV. In T∗1, since p1, p2, and p3 form a qid group containing one HIV value, we deduce that both p2 and p3 are not linked to HIV. Similarly, in T∗2, since p1, p4, and p5 form a qid group containing one HIV value, p4 and p5 are non-HIV carriers. Similarly, from T∗3, we deduce that p4 and p6 are not linked to HIV. Then, we conclude that p2, p3, p4, p5, and p6 do not contract HIV. However, in each of the releases T∗1, T∗2, and T∗3, we know that there are two HIV values. This leads to a contradiction. Thus, p1 cannot be linked to HIV. Similarly, by the same inductions, p6 cannot be an HIV carrier. Finally, from the qid group with GID = 2 in T∗3, we figure out that p4 must be an HIV carrier!


Table 10.14: Raw data (T)

  T1                  T2                  T3
  PID   Disease       PID   Disease       PID   Disease
  p1    Flu           p1    Flu           p1    Flu
  p2    HIV           p2    HIV           p2    HIV
  p3    Fever         p3    Flu           p3    Flu
  p4    HIV           p4    HIV           p4    HIV
  p5    Flu           p5    Fever         p5    Fever
  p6    Fever         p6    Fever         p6    Fever

No matter how large m is, this kind of possible world exclusion can appear after several publishing rounds. Note that even if the registration list remains unchanged, the same problem can occur since the six individuals can be grouped in the same way as in T∗1, T∗2, and T∗3 at 3 different times, according to the algorithm proposed by [251].

The anonymization mechanism for serial publishing should provide individual-based protection. Yet, Byun et al. [42] and Xiao and Tao [251] focus on record-based protection. In m-invariance, each record rj is associated with a lifespan of contiguous releases and a signature, which is an invariant set of sensitive values linked to rj in the published table. If a record rj for individual pi appears at time j, disappears at time j + 1 (e.g., pi may discontinue treatment or may switch to another hospital), and reappears at time j + 2, the appearance at time j + 2 is treated as a new record rj+2 in the anonymization process adopted by [251].


Table 10.15: Published tables T∗ satisfying 3-invariance

  First Publication T∗1
  PID   GID   Age        Zip          Disease
  p1    1     [21, 23]   [12k, 17k]   Flu
  p2    1     [21, 23]   [12k, 17k]   HIV
  p3    1     [21, 23]   [12k, 17k]   Fever
  p4    2     [20, 26]   [18k, 29k]   HIV
  p5    2     [20, 26]   [18k, 29k]   Flu
  p6    2     [20, 26]   [18k, 29k]   Fever

  Second Publication T∗2
  PID   GID   Age        Zip          Disease
  p2    1     [20, 22]   [12k, 29k]   HIV
  p3    1     [20, 22]   [12k, 29k]   Flu
  p6    1     [20, 22]   [12k, 29k]   Fever
  p1    2     [23, 26]   [16k, 25k]   Flu
  p4    2     [23, 26]   [16k, 25k]   HIV
  p5    2     [23, 26]   [16k, 25k]   Fever

  Third Publication T∗3
  PID   GID   Age        Zip          Disease
  p2    1     [21, 25]   [12k, 16k]   HIV
  p3    1     [21, 25]   [12k, 16k]   Flu
  p5    1     [21, 25]   [12k, 16k]   Fever
  p1    2     [20, 26]   [16k, 29k]   Flu
  p4    2     [20, 26]   [16k, 29k]   HIV
  p6    2     [20, 26]   [16k, 29k]   Fever

There is no memory of the previous signature for rj, and a new signature is created for rj+2. Let us take a look at T∗1 in Table 10.15. From T∗1, we can find that, by 3-invariance, the signature of the records for p1 and p3 in T∗1 is {Flu, HIV, Fever}. If p1 and p3 recover from Flu and Fever at time 2 (so they are not in T2) and reappear due to other diseases at time 3 (in T3), the reappearance of p1 and p3 in T3 is treated as new records r′1, r′3, and by m-invariance, there is no constraint on their signatures. Thus, at time 3, if the signatures for r′1 and r′3 do not contain HIV, p1 and p3 will be excluded from HIV. Consequently, p2 will be found to have HIV!

To handle the above challenges, Bu et al. [38] propose an anonymization method called HD-composition which protects individual privacy for permanent sensitive values. The method involves two major roles, namely holder and decoy. The objective is to bound the probability of linkage between any individual and any permanent sensitive value by a given threshold, e.g., 1/ℓ. Suppose an individual pi has a permanent sensitive value s in the microdata. One major technique used for anonymizing static data is to form a qid group mixing pi and other individuals whose sensitive values are not s. Merely having the published qid groups, the adversary cannot establish strong linkage from pi to s. The anonymization also follows this basic principle, where the individual to be protected is named as a holder and some other individuals for protection are named as decoys.

There are two major principles for partitioning: role-based partition and cohort-based partition. By role-based partition, in every qid group of the published data, for each holder of a permanent sensitive value s, ℓ − 1 decoys which are not linked to s can be found. Thus, each holder is masked by ℓ − 1 decoys. By cohort-based partition, for each permanent sensitive value s, the proposed method constructs ℓ cohorts, one for holders and the other ℓ − 1 for decoys, and disallows decoys from the same cohort to be placed in the same partition. The objective is to imitate the properties of the true holders.
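
As an illustration of the role-based principle only (a sketch under stated assumptions, not the HD-composition algorithm of [38]), a check that every holder of a permanent sensitive value is masked by at least ℓ − 1 decoys in its published qid group could look like this:

# Role-based partition check: in every qid group, each holder of a
# permanent value s must be accompanied by >= l-1 decoys not linked to s.
# groups: dict group_id -> list of (person_id, role, permanent_value_or_None),
# where role is "holder" or "decoy" (a hypothetical layout).
def satisfies_role_based_partition(groups, l):
    for members in groups.values():
        for _, role, s in members:
            if role != "holder" or s is None:
                continue
            decoys = sum(1 for _, r, v in members if r == "decoy" and v != s)
            if decoys < l - 1:
                return False
    return True

# Example with l = 3: one holder of HIV masked by two decoys.
groups = {"g1": [("p4", "holder", "HIV"),
                 ("p1", "decoy", None),
                 ("p6", "decoy", None)]}
print(satisfies_role_based_partition(groups, 3))  # True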

10.5 Summary

In this chapter, we have studied two scenarios of incremental data publishing. In Chapter 10.2, we have discussed the anonymization problem for a scenario where the data are continuously collected and published. Each release contains the new data as well as previously collected data. Even if each release is k-anonymized, the anonymity of an individual can be compromised by cross-examining multiple releases. We formalized this notion of attacks and presented a detection method and an anonymization algorithm to prevent such attacks. Finally, we showed that both the detection and the anonymization methods are extendable to deal with multiple releases and other privacy requirements.

Recall that the anatomy approach discussed in Chapter 3.2 publishes the exact QID and the sensitive attribute in two separate tables, QIT and ST, linked by a common GroupID. In this continuous data publishing scenario, however, publishing the exact QID allows the adversary to isolate the new records added in a later release by comparing the difference between the old (QIT, ST) and the new (QIT, ST). Once new records are isolated, the attack can focus on the usually small increment, which increases the re-identification risk. Thus, anatomy is not suitable for continuous data publishing.

In Chapter 10.3, we have studied the privacy threats caused by dynamic data republishing, in which both record insertions and deletions may be performed before every release, and discussed a privacy model called m-invariance to ensure the invariance of sensitive values in the intersection of qid groups across all releases. m-invariance is achieved by adding counterfeit records and generalizations. Adding counterfeits is necessary in order to address the problem of critical absence. However, a table with counterfeit records could no longer preserve the data truthfulness at the record level, which is important in some applications, as explained in Chapter 1.1. Bu et al. [38] further relax the PPDP scenario and assume that the QID and sensitive values of a record owner could change in subsequent releases.

With m-invariance, we assume that the QID attribute values and the sensitive attribute values of each individual do not change over time. This assumption, however, may not hold in real-life data publishing. In Chapter 10.4, we have studied another approach that addresses these issues. Pei et al. [186] also consider privacy threats in the incremental data publishing scenario in which the Case-ID of records must be published. Case-IDs are unique identifiers associated with an entity, e.g., a patient. Most works on incremental data publishing consider the scenario in which the data holder has removed the Case-ID of records, so the attack based on Case-ID does not occur. Publishing Case-ID gives the adversary the very powerful knowledge of locating the corresponding published records in two releases. This additional threat can be removed by not publishing Case-ID. In fact, most aggregate data analysis (such as count queries) does not depend on the Case-ID. Iwuchukwu and Naughton [122] propose an efficient index structure to incrementally k-anonymize each individual release, but it does not address the attack models studied in this chapter.


Chapter 11

Collaborative Anonymization for Vertically Partitioned Data

11.1 Introduction

Nowadays, one-stop service has been a trend followed by many competitive business sectors, where a single location provides multiple related services. For example, financial institutions often provide all of daily banking, mortgage, investment, and insurance services in one location. Behind the scenes, this usually involves information sharing among multiple companies. However, a company cannot indiscriminately open up its database to other companies because privacy policies [221] place a limit on information sharing. Consequently, there is a dual demand for information sharing and information protection, driven by trends such as one-stop service, end-to-end integration, outsourcing, simultaneous competition and cooperation, and privacy and security.

A typical scenario is that two parties wish to integrate their private databases to achieve a common goal beneficial to both, provided that their privacy requirements are satisfied. In this chapter, we consider the goal of achieving some common data mining tasks over the integrated data while satisfying the k-anonymity privacy requirement. The k-anonymity requirement states that domain values are generalized so that each value of some specified attributes identifies at least k records. The generalization process must not leak any information more specific than the final integrated data. In this chapter, we study some solutions to this problem.

So far, we have considered only a single data holder. In real-life data publishing, a single organization often does not hold the complete data. Organizations need to share data for mutual benefits or for publishing to a third party. For example, two credit card companies want to integrate their customer data for developing a fraud detection system, or for publishing to a bank. However, the credit card companies do not want to indiscriminately disclose their data to each other or to the bank for reasons such as privacy protection and business competitiveness. Figure 11.1 depicts this scenario, called collaborative anonymization for vertically partitioned data [172]: several data holders own different sets of attributes on the same set of records and want to publish the integrated data on all attributes.


[Figure: two data publishers (Alice and Bob) each collect data and jointly publish integrated anonymized data to a data recipient.]

FIGURE 11.1: Collaborative anonymization for vertically partitioned data

Say, publisher 1 owns {RecID, Job, Sex, Age} and publisher 2 owns {RecID, Salary, Disease}, where RecID, such as the SSN, is the record identifier shared by all data holders. They want to publish an integrated k-anonymous table on all attributes. Also, no data holder should learn more specific information, owned by the other data holders, than the information that appears in the final integrated table.

In Chapter 11.2, we study the problem of collaborative anonymization for vertically partitioned data in the context of a data mashup application and use a real-life scenario to motivate the information and privacy requirements as well as the solution. In Chapter 11.3, we study the cryptographic approaches to the collaborative anonymization problem.

11.2 Privacy-Preserving Data Mashup

The problem of collaborative anonymization for vertically partitioned data has been applied to the scenario of a distributed data mashup application [172]. In Chapter 11.2, we use this real-life data mashup application to illustrate the problem and the solution.

Mashup is a web technology that combines information and services from more than one source into a single web application. It was first discussed in a 2005 issue of Business Week [117] on the topic of integrating real estate information into Google Maps. Since then, web giants like Amazon, Yahoo!, and Google have been actively developing mashup applications. Mashup has created a new horizon for service providers to integrate their data and expertise to deliver highly customizable services to their customers.

Data mashup is a special type of mashup application that aims at integrating data from multiple data holders depending on the user's service request.


[Figure: a web application server (the service provider) hosts the privacy-preserving data mashup algorithm and a data mining module, and communicates through web services with Data Holder A and Data Holder B, each keeping its own private database.]

FIGURE 11.2: Architecture of privacy-preserving data mashup

Figure 11.2 illustrates a typical architecture of the data mashup technology. A service request could be a general data exploration or a sophisticated data mining task such as classification analysis. Upon receiving a service request, the data mashup web application dynamically determines the data holders, collects information from them through their web service application programming interface (API),1 and then integrates the collected information to fulfill the service request. Further computation and visualization can be performed at the user's site (e.g., a browser or an applet). This is very different from the traditional web portal which simply divides a web page or a website into independent sections for displaying information from different sources.

A typical application of data mashup is to implement the concept of one-stop service. For example, a single health mashup application could provide a patient all of her health history, doctor's information, test results, appointment bookings, insurance, and health reports. This concept involves information sharing among multiple parties, e.g., hospital, drug store, and insurance company.

To service providers, data mashup provides a low-cost solution to integrate their services with their partners and broaden their market. To users, data mashup provides a flexible interface to obtain information from different service providers. However, to adversaries, data mashup could be a valuable tool for identifying sensitive information. A data mashup application can help ordinary users explore new knowledge. Nevertheless, it could also be misused by adversaries to reveal sensitive information that was not available before the data integration.

In this chapter, we study the privacy threats caused by data mashup and discuss a privacy-preserving data mashup (PPMashup) algorithm to securely

1Authentication may be required to ensure that the user has access rights to the requested data.


integrate person-specific sensitive data from different data holders, whereas the integrated data still retains the essential information for supporting general data exploration or a specific data mining task, such as classification analysis. The following real-life scenario illustrates the simultaneous need of information sharing and privacy preservation in the financial industry.

This data mashup problem was discovered in a collaborative project [172] with a provider of unsecured loans in Sweden. Their problem can be generalized as follows: A loan company A and a bank B observe different sets of attributes about the same set of individuals identified by the common key SSN,2 e.g., TA(SSN, Age, Balance) and TB(SSN, Job, Salary). These companies want to implement a data mashup application that integrates their data to support better decision making such as loan or credit limit approval, which is basically a data mining task on classification analysis. In addition to companies A and B, their partnered credit card company C also has access to the data mashup application, so all three companies A, B, and C are data recipients of the final integrated data. Companies A and B have two privacy concerns. First, simply joining TA and TB would reveal the sensitive information to the other party. Second, even if TA and TB individually do not contain person-specific or sensitive information, the integrated data can increase the possibility of identifying the record of an individual. Their privacy concerns are reasonable because Sweden has a population of only 9 million people. Thus, it is possible to identify the record of an individual by collecting information from other data sources. The next example illustrates this point.

Example 11.1
Consider the data in Table 11.1 and the taxonomy trees in Figure 11.3. Party A (the loan company) and Party B (the bank) own TA(SSN, Sex, . . . , Class) and TB(SSN, Job, Salary, . . . , Class), respectively. Each row represents one or more raw records, and Class contains the distribution of class labels Y and N, representing whether or not the loan has been approved. After integrating the two tables (by matching the SSN field), the female lawyer on (Sex, Job) becomes unique and therefore vulnerable to being linked to sensitive information such as Salary. In other words, record linkage is possible on the fields Sex and Job. To prevent such linkage, we can generalize Accountant and Lawyer to Professional so that this individual becomes one of many female professionals. No information is lost as far as classification is concerned because Class does not depend on the distinction of Accountant and Lawyer.

The privacy-preserving data mashup problem is defined as follows. Given multiple private tables for the same set of records on different sets of attributes (i.e., vertically partitioned tables), we want to efficiently produce an integrated table on all attributes for releasing it to both parties or even to a third

2SSN is called “personnummer” in Sweden.


Table 11.1: Raw tables

SSN (Shared)   Class (Shared)   Sex (Party A) ...   Job (Party B)   Salary (Party B) ...
1-3            0Y3N             Male                Janitor         30K
4-7            0Y4N             Male                Mover           32K
8-12           2Y3N             Male                Carpenter       35K
13-16          3Y1N             Female              Technician      37K
17-22          4Y2N             Female              Manager         42K
23-25          3Y0N             Female              Manager         44K
26-28          3Y0N             Male                Accountant      44K
29-31          3Y0N             Female              Accountant      44K
32-33          2Y0N             Male                Lawyer          44K
34             1Y0N             Female              Lawyer          44K

FIGURE 11.3: Taxonomy trees and QIDs. Job: ANY_Job → {Blue-collar, White-collar}; Blue-collar → {Non-Technical, Technical}, Non-Technical → {Janitor, Mover}, Technical → {Carpenter, Technician}; White-collar → {Manager, Professional}, Professional → {Accountant, Lawyer}. Salary: [1-99) → {[1-37), [37-99)}, [1-37) → {[1-35), [35-37)}. Sex: ANY_Sex → {Male, Female}. QIDs: 〈QID1 = {Sex, Job}, 4〉, 〈QID2 = {Sex, Salary}, 11〉.

party. The integrated table must satisfy both the following anonymity and information requirements:

Anonymity Requirement: The integrated table has to satisfy k-anonymity as discussed in previous chapters: A data table T satisfies k-anonymity if every combination of values on QID is shared by at least k records in T, where the quasi-identifier (QID) is a set of attributes in T that could potentially identify an individual in T, and k is a user-specified threshold. k-anonymity can be satisfied by generalizing domain values into higher level concepts. In addition, at any time in the procedure of generalization, no party should learn more detailed information about the other party than what appears in the final integrated table. For example, Lawyer is more detailed than Professional. In other words, the generalization process must not leak more specific information beyond the final integrated data.


Information Requirement: The generalized data should be as useful as possible to classification analysis. Generally speaking, the privacy goal requires masking sensitive information that is specific enough to identify individuals, whereas the classification goal requires extracting trends and patterns that are general enough to predict new cases. If generalization is carefully performed, it is possible to mask identifying information while preserving patterns useful for classification.

In addition to the privacy and information requirements, the data mashup application is an online web application. The user dynamically specifies their requirements, and the system is expected to be efficient and scalable to handle high volumes of data.

There are two obvious yet incorrect approaches. The first one is "integrate-then-generalize": first integrate the two tables and then generalize the integrated table using some single table anonymization methods discussed in Chapters 5 and 6, such as K-Optimize [29], Top-Down Specialization (TDS) [95, 96], HDTDS [171], Genetic Algorithm [123], and InfoGain Mondrian [150]. Unfortunately, these approaches do not preserve privacy in the studied scenario because any party holding the integrated table will immediately know all private information of both parties. The second approach is "generalize-then-integrate": first generalize each table locally and then integrate the generalized tables. This approach does not work for a quasi-identifier that spans multiple tables. In the above example, the k-anonymity on (Sex, Job) cannot be achieved by the k-anonymity on each of Sex and Job separately.
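To make the failure of the "generalize-then-integrate" approach on a global QID concrete, the following small Python sketch (with hypothetical toy records and a helper is_k_anonymous of our own, not part of the book's algorithms) checks k-anonymity per attribute and then on the joint QID:

from collections import Counter

# Hypothetical toy records (not Table 11.1); k = 2.
# Each attribute is 2-anonymous on its own, but the joint QID {Sex, Job} is not.
records = [
    ("Male", "Lawyer"), ("Male", "Janitor"),
    ("Female", "Lawyer"), ("Female", "Janitor"),
]
k = 2

def is_k_anonymous(rows, k):
    # True if every value combination is shared by at least k rows.
    counts = Counter(rows)
    return all(c >= k for c in counts.values())

print(is_k_anonymous([(sex,) for sex, _ in records], k))  # True: Sex alone is fine
print(is_k_anonymous([(job,) for _, job in records], k))  # True: Job alone is fine
print(is_k_anonymous(records, k))                         # False: each (Sex, Job) pair is unique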

The rest of this Chapter 11.2 is organized as follows. In Chapter 11.2.1, we illustrate a real-life privacy problem in the financial industry and generalize their requirements to formulate the privacy-preserving data mashup problem. The goal is to allow data sharing for classification analysis in the presence of privacy concerns. This problem is very different from the cryptographic approach [128, 259], which will be discussed in Chapter 11.3, that allows "result sharing" (e.g., the classifier in this case) but completely prohibits data sharing. In some applications, data sharing gives greater flexibility than result sharing because data recipients can perform their required analysis and data exploration, such as mining patterns in a specific group of records, visualizing the transactions containing a specific pattern, and trying different modeling methods and parameters.

In Chapter 11.2.3, we illustrate a service-oriented architecture for the privacy-preserving data mashup problem. The architecture defines the communication paths of all participating parties, and defines the role of the mashup coordinator who is responsible for initializing the protocol execution and presenting the final integrated data set to the user. The architecture does not require the mashup coordinator to be a trusted entity.

In Chapters 11.2.4-11.2.5, we discuss two algorithms to securely integrate private data from multiple parties for two different adversary models. The first algorithm assumes that parties are semi-honest. In the semi-honest adversarial


model, it is assumed that parties do follow the protocol but may try to deduce additional information. This is the common security definition adopted in the Secure Multiparty Computation (SMC) literature [128]. The second algorithm further addresses the data integration model with malicious parties. We show that a party may deviate from the protocol for its own benefit. To overcome this malicious behavior, we discuss a game-theoretic approach to combine incentive compatible strategies with the anonymization technique.

The studied algorithm, PPMashup, can effectively achieve an anonymity requirement without compromising the useful data for classification, and the methods are scalable to handle large data sets. The algorithm for the semi-honest model produces the same final anonymous table as the integrate-then-generalize approach, and only reveals local data that has satisfied a given k-anonymity requirement. Moreover, the algorithm for the malicious model provides additional security by ensuring fair participation of the data holders.

In Chapter 11.2.4.4, we discuss the possible extensions to thwart attribute linkages by achieving other privacy requirements, such as ℓ-diversity [162], (α,k)-anonymity [246], and confidence bounding [237].

11.2.1 Anonymization Problem for Data Mashup

To ease explanation, we assume the privacy requirement to be the anonymity requirement, QID1, . . . , QIDp, discussed in Definition 7.1, which is in effect k-anonymity with multiple QIDs. Yet, the privacy requirement is not limited to k-anonymity and can be other privacy models in practice.

Example 11.2
〈QID1 = {Sex, Job}, 4〉 states that every qid on QID1 in T must be shared by at least 4 records in T. In Table 11.1, the following qids violate this requirement: 〈Male, Janitor〉, 〈Male, Accountant〉, 〈Female, Accountant〉, 〈Male, Lawyer〉, 〈Female, Lawyer〉.

The example in Figure 11.3 specifies the anonymity requirement with two QIDs.

Consider n data holders {Party 1, . . . , Party n}, where each Party y owns a private table Ty(ID, Attribsy, Class) over the same set of records. ID and Class are shared attributes among all parties. Attribsy is a set of private attributes, and Attribsy ∩ Attribsz = ∅ for any 1 ≤ y ≠ z ≤ n. These parties agree to release "minimal information" to form an integrated table T (by matching the ID) for conducting a joint classification analysis. The notion of minimal information is specified by the joint anonymity requirement


{〈QID1, k1〉, . . . , 〈QIDp, kp〉} on the integrated table. QIDj is local if it contains only attributes from one party, and global otherwise.

DEFINITION 11.1 Privacy-preserving data mashup Given multiple private tables T1, . . . , Tn, a joint anonymity requirement {〈QID1, k1〉, . . . , 〈QIDp, kp〉}, and a taxonomy tree for each categorical attribute in ∪QIDj, the problem of privacy-preserving data mashup is to efficiently produce a generalized integrated table T such that

1. T satisfies the joint anonymity requirement,

2. contains as much information as possible for classification, and

3. each party learns nothing about the other party more specific than whatis in the final generalized T . We assume that the data holders are semi-honest, meaning that they will follow the protocol but may attempt toderive sensitive information from the received data.

The problem requires achieving anonymity in the final integrated table as well as in any intermediate table. For example, if a record in the final T has values Female and Professional on Sex and Job, and if Party A learns that Professional in this record comes from Lawyer, condition (3) is violated.

11.2.1.1 Challenges

In case all QIDs are local, we can generalize each table TA and TB independently, and join the generalized tables to produce the integrated data. However, if there are global QIDs, this approach ignores them. Further generalizing the integrated table using global QIDs does not work because requirement (3) is violated by the intermediate table, which contains more specific information than the final table.

It may seem that local QIDs can be generalized beforehand. However, if a local QIDl shares some attributes with a global QIDg, the local generalization ignores the chance of getting a better result by generalizing QIDg first, which leads to a sub-optimal solution. A better strategy is generalizing shared attributes in the presence of both QIDl and QIDg. Similarly, the generalization of shared attributes will affect the generalization of other attributes in QIDl, and thus affect other local QIDs that share an attribute with QIDl. As a result, all local QIDs reachable by a path of shared attributes from a global QID should be considered in the presence of the global QID.

11.2.1.2 General Join

So far, we have assumed that the join between TA and TB is through the common key ID. If the join attributes are not keys in TA or TB, a preprocessing step is required to convert the problem to that defined in Definition 11.1.


The two parties first perform a join on their join attributes, but not the person-specific data, then assign a unique ID to each joint record, and use these joint records to join the local table TA or TB. Both TA and TB now contain the common key column ID. In the rest of this chapter, we consider only the join of TA and TB through the common key ID. Note, this model assumes that the join attributes are not sensitive.

In this case, we have TA(IDA, JA, D1, . . . , Dt) and TB(IDB, JB, Dt+1, . . . , Dm), where IDA and IDB are the record identifiers in TA and TB, and JA and JB are the join columns in TA and TB. The two parties first compute the equijoin of ΠIDA,JA(TA) and ΠIDB,JB(TB) based on JA = JB. Let ΠID,IDA,IDB be the projection of the join onto IDA, IDB, where ID is a new identifier for the records on IDA, IDB. Each party replaces IDA and IDB with ID.
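The bookkeeping of this preprocessing step can be sketched in a few lines of Python; the sketch runs on a single machine and does not model the distributed, privacy-preserving exchange, and the table contents and attribute names are purely illustrative:

# Toy vertically partitioned tables with non-key join attributes.
ta = [{"IDA": 1, "JA": "x", "Age": 34},
      {"IDA": 2, "JA": "y", "Age": 51}]
tb = [{"IDB": 7, "JB": "x", "Salary": "30K"},
      {"IDB": 8, "JB": "y", "Salary": "44K"}]

# Step 1: equijoin of the projections on (IDA, JA) and (IDB, JB) with JA = JB.
pairs = [(a["IDA"], b["IDB"]) for a in ta for b in tb if a["JA"] == b["JB"]]

# Step 2: assign a fresh shared ID to every joint record.
id_map = {pair: new_id for new_id, pair in enumerate(pairs, start=1)}

# Step 3: each party replaces its local identifier with the shared ID.
ta_joined = [{"ID": new_id, **{key: val for key, val in a.items() if key not in ("IDA", "JA")}}
             for (ida, idb), new_id in id_map.items()
             for a in ta if a["IDA"] == ida]
tb_joined = [{"ID": new_id, **{key: val for key, val in b.items() if key not in ("IDB", "JB")}}
             for (ida, idb), new_id in id_map.items()
             for b in tb if b["IDB"] == idb]

print(ta_joined)  # [{'ID': 1, 'Age': 34}, {'ID': 2, 'Age': 51}]
print(tb_joined)  # [{'ID': 1, 'Salary': '30K'}, {'ID': 2, 'Salary': '44K'}]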

11.2.2 Information Metrics

To generalize T, a taxonomy tree is specified for each categorical attribute in ∪QIDj. For a numerical attribute in ∪QIDj, a taxonomy tree can be grown at runtime, where each node represents an interval, and each non-leaf node has two child nodes representing some optimal binary split of the parent interval. The algorithm generalizes a table T by a sequence of specializations starting from the top most general state in which each attribute has the top most value of its taxonomy tree. A specialization, written v → child(v), where child(v) denotes the set of child values of v, replaces the parent value v with the child value that generalizes the domain value in a record. In other words, the algorithms employ the subtree generalization scheme. Refer to Chapter 3.1 for the details of generalization and specialization.

A specialization is valid if it results in a table that satisfies the anonymity requirement after the specialization. A specialization is beneficial if more than one class is involved in the records containing v; otherwise, the specialization does not provide any helpful information for classification. Thus, a specialization is performed only if it is both valid and beneficial. The notions of specialization, InfoGain(v), AnonyLoss(v), Score(v), and the procedure of dynamically growing a taxonomy tree on numerical attributes have been studied in Chapter 6.2.2.

Example 11.3
The specialization ANY_Job refines the 34 records into 16 records for Blue-collar and 18 records for White-collar. Score(ANY_Job) is calculated as follows.

E(T[ANY_Job]) = −(21/34) × log2(21/34) − (13/34) × log2(13/34) = 0.9597
E(T[Blue-collar]) = −(5/16) × log2(5/16) − (11/16) × log2(11/16) = 0.8960
E(T[White-collar]) = −(16/18) × log2(16/18) − (2/18) × log2(2/18) = 0.5033
InfoGain(ANY_Job) = E(T[ANY_Job]) − ((16/34) × E(T[Blue-collar]) + (18/34) × E(T[White-collar])) = 0.2716
AnonyLoss(ANY_Job) = avg{A(QID1) − A_{ANY_Job}(QID1)} = (34 − 16)/1 = 18
Score(ANY_Job) = 0.2716/(18 + 1) = 0.0143.
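The arithmetic in Example 11.3 can be reproduced with a short Python sketch; the helper entropy and the variable names are ours, and the Score form InfoGain/(AnonyLoss + 1) is the one used in this example:

from math import log2

def entropy(class_counts):
    # Entropy of a class distribution given as a list of counts.
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)

# Class distributions (Y, N) taken from Table 11.1.
any_job      = (21, 13)    # all 34 records
blue_collar  = (5, 11)     # 16 records
white_collar = (16, 2)     # 18 records

info_gain = entropy(any_job) - (16/34 * entropy(blue_collar)
                                + 18/34 * entropy(white_collar))

# AnonyLoss: only QID1 = {Sex, Job} is affected; A(QID1) drops from 34 to 16.
anony_loss = (34 - 16) / 1

score = info_gain / (anony_loss + 1)
print(round(info_gain, 4), anony_loss, round(score, 4))   # 0.2716 18.0 0.0143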

In practice, the data holders can define their information metrics. The methods studied in this chapter are also applicable to achieve other information requirements, not limited to classification analysis. Here, we assume the goal of classification analysis in order to illustrate a concrete scenario.

11.2.3 Architecture and Protocol

In this Chapter 11.2.3, we discuss the technical architecture [225] shown in Figure 11.4 with the communication paths of all participating parties, followed by a privacy-preserving data mashup protocol in Chapter 11.2.4. Referring to the architecture, the mashup coordinator plays the central role in initializing the protocol execution and presenting the final integrated data set to the user. The architecture does not require the mashup coordinator to be a trusted entity. This makes the architecture practical because a trusted party is often not available in real-life scenarios.

The communication protocol of the mashup coordinator separates the architecture into two phases. In Phase I, the mashup coordinator receives requests from users and establishes connections with the data holders who contribute their data in a privacy-preserving manner. In Phase II, the mashup coordinator manages the privacy-preserving data mashup algorithm (PPMashup) among the data holders for a particular client request.

11.2.3.1 Phase I: Session Establishment

The objective of Phase I is to establish a common session context among the contributing data holders and the user. An operational context is successfully established by proceeding through the steps of user authentication, contributing data holder identification, session initialization, and common requirements negotiation.

Authenticate user: The mashup coordinator first authenticates a user to the requested service, generates a session token for the current user interaction, and then identifies the data holders accessible by the user. Some data holders are public and are accessible by any user.

Identify contributing data holders: Next, the mashup coordinator queries the data schema of the accessible data holders to identify the data holders that can contribute data for the requested service. To facilitate more efficient queries, the mashup coordinator could pre-fetch data schema from the data holders (i.e., the pull model), or the data holders could update their data schema periodically (i.e., the push model).


FIGURE 11.4: Service-oriented architecture for privacy-preserving data mashup (the user's browser view and data mining module exchange data requests and responses with the mashup service provider, which maintains web service sessions linked to Data Holder 1 through Data Holder N; each data holder exposes a web service in front of its private database, and the data holders execute the privacy-preserving data mashup algorithm among themselves)

Initializing session context: Then, the mashup coordinator notifies all contributing data holders with the session identifier. All prospective data holders share a common session context, which represents a stateful presentation of information related to a specific execution of the privacy-preserving data mashup algorithm PPMashup, which will be discussed in Chapter 11.2.3.2. Due to the fact that multiple parties are involved and the flow of multiple protocol messages is needed in order to fulfill the data mashup, we can use the Web Service Resource Framework (WSRF) to keep stateful information along an initial service request. An established session context, stored as a single web service resource, contains several attributes to identify a PPMashup process: a unique session identifier (making use of an end-point reference (EPR), which is built from the service address and identifiers of the resource in use), the client address, the data holder addresses and their certificates, an authentication token (containing the user certificate), as well as additional status information.
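As a rough illustration of the session resource just described, the following Python dataclass lists the attributes mentioned above; the field names and types are our own shorthand, not a WSRF schema:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SessionContext:
    # Illustrative shape of the stateful session resource; names are ours.
    session_epr: str                       # end-point reference used as the unique session identifier
    client_address: str
    data_holder_addresses: List[str]
    data_holder_certificates: List[str]
    auth_token: str                        # carries the user certificate
    status: str = "INITIALIZED"
    anonymity_requirement: Optional[dict] = None   # filled in during negotiation (Phase I)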


Negotiating privacy and information requirements: The mashup coordinator is responsible for coordinating the negotiation of privacy and information requirements among the data holders and the user. Specifically, this step involves negotiating the price, the anonymity requirement, and the expected information quality. For example, in the case of classification analysis, information quality can be estimated by the classification error on some testing data.

11.2.3.2 Phase II: Initiating Privacy-Preserving Protocol

After a common session has been established among the data holders, the mashup coordinator initiates PPMashup and stays back. Upon the completion of the protocol, the mashup coordinator will receive an integrated table that satisfies both the information and anonymity requirements. There are two advantages to the mashup coordinator not participating in the PPMashup protocol. First, the architecture does not require the mashup coordinator to be a trusted entity. The mashup coordinator only has access to the final integrated k-anonymous data. Second, this setup removes the computation burden from the mashup coordinator, and frees up the coordinator to handle other requests. Chapter 11.2.4 and Chapter 11.2.5 discuss anonymization algorithms for the semi-honest model and for the malicious model, respectively.

11.2.4 Anonymization Algorithm for Semi-Honest Model

In Chapter 6.3, we have studied a top-down specialization (TDS) [95, 96] approach to generalize a single table T. One non-privacy-preserving approach to the problem of data mashup is to first join the multiple private tables into a single table T and then generalize T to satisfy a k-anonymity requirement using TDS. Though this approach does not satisfy the privacy requirement (3) in Definition 11.1 (because the party that generalizes the joint table knows all the details of the other parties), the integrated table produced satisfies requirements (1) and (2). Therefore, it is helpful to first have an overview of TDS: Initially, all values are generalized to the top most value in its taxonomy tree, and Cuti contains the top most value for each attribute Di. At each iteration, TDS performs the best specialization, which has the highest Score among the candidates that are valid, beneficial specializations in ∪Cuti, and then updates the Score of the affected candidates. The algorithm terminates when there is no more valid and beneficial candidate in ∪Cuti. In other words, the algorithm terminates if any further specialization would lead to a violation of the anonymity requirement. An important property of TDS is that the anonymity requirement is anti-monotone with respect to a specialization: If it is violated before a specialization, it remains violated after the specialization. This is because a specialization never increases the anonymity count a(qid).

Now, we consider that the table T is given by two tables (n = 2), TA and TB, with a common key ID, where Party A holds TA and Party B holds TB. At first glance, it seems that the change from one party to two parties is trivial


Algorithm 11.2.7 PPMashup for Party A (Same as Party B) in Semi-Honest Model
1: initialize Tg to include one record containing top most values;
2: initialize ∪Cuti to include only top most values;
3: while some candidate v ∈ ∪Cuti is valid do
4:   find the local candidate x of highest Score(x);
5:   communicate Score(x) with Party B to find the winner;
6:   if the winner w is local then
7:     specialize w on Tg;
8:     instruct Party B to specialize w;
9:   else
10:    wait for the instruction from Party B;
11:    specialize w on Tg using the instruction;
12:  end if
13:  replace w with child(w) in the local copy of ∪Cuti;
14:  update Score(x) and beneficial/validity status for candidates x ∈ ∪Cuti;
15: end while
16: return Tg and ∪Cuti;

because the change of Score due to specializing a single attribute depends only on that attribute and Class, and each party knows about Class and the attributes they have. This observation is wrong because the change of Score involves the change of A(QIDj) that depends on the combination of the attributes in QIDj. In PPMashup, each party keeps a copy of the current ∪Cuti and generalized T, denoted by Tg, in addition to the private TA or TB. The nature of the top-down approach implies that Tg is more general than the final answer and, therefore, does not violate requirement (3) in Definition 11.1. At each iteration, the two parties cooperate to perform the same specialization as identified in TDS by communicating certain information in a way that satisfies requirement (3) in Definition 11.1. Algorithm 11.2.7 describes the procedure at Party A (same for Party B).

First, Party A finds the local best candidate using the specialization criteria presented in Chapter 11.2.2 and communicates with Party B to identify the overall global winner candidate, say w. To protect the input score, the secure multiparty maximum protocol [260] can be used. Suppose that w is local to Party A (otherwise, the discussion below applies to Party B). Party A performs w → child(w) on its copy of ∪Cuti and Tg. This means specializing each record t ∈ Tg containing w into those t′1, . . . , t′z containing child values in child(w). Similarly, Party B updates its ∪Cuti and Tg, and partitions TB[t] into TB[t′1], . . . , TB[t′z]. Since Party B does not have the attribute for w, Party A needs to instruct Party B how to partition these records in terms of IDs.


FIGURE 11.5: The TIPS after the first specialization (the root 〈ANY_Sex, ANY_Job, [1-99)〉 with 34 records is split by [1-99) → {[1-37), [37-99)} into the leaf partitions 〈ANY_Sex, ANY_Job, [1-37)〉 with 12 records and 〈ANY_Sex, ANY_Job, [37-99)〉 with 22 records)

FIGURE 11.6: The TIPS after the second specialization (after ANY_Job → {Blue-collar, White-collar}, the leaf partitions are 〈ANY_Sex, Blue-collar, [1-37)〉 with 12 records, 〈ANY_Sex, Blue-collar, [37-99)〉 with 4 records, and 〈ANY_Sex, White-collar, [37-99)〉 with 18 records)

Example 11.4
Consider Table 11.1 and the joint anonymity requirement {〈QID1 = {Sex, Job}, 4〉, 〈QID2 = {Sex, Salary}, 11〉}. Initially,

Tg = {〈ANY_Sex, ANY_Job, [1-99)〉} and ∪Cuti = {ANY_Sex, ANY_Job, [1-99)},

and all specializations in ∪Cuti are candidates. To find the candidate, Party A computes Score(ANY_Sex), and Party B computes Score(ANY_Job) and Score([1-99)).

Algorithm 11.2.7 makes no claim on efficiency. In a straightforward method, Lines 4, 7, and 11 require scanning all data records and recomputing Score for all candidates in ∪Cuti. The key to the efficiency of the algorithm is directly accessing the data records to be specialized, and updating Score based on some statistics maintained for candidates in ∪Cuti, instead of accessing data records. Below, we briefly describe the key steps: find the winner candidate (Lines 4-5), perform the winning specialization (Lines 7-11), and update the score and status of candidates (Line 14). For Party A (or Party B), a local attribute refers to an attribute from TA (or TB), and a local specialization refers to that of a local attribute. Refer to [172] for details.


11.2.4.1 Find the Winner Candidate

Party A first finds the local candidate x of highest Score(x), by making use of the computed InfoGain(x), Ax(QIDj), and A(QIDj), and then communicates with Party B (using the secure multiparty max algorithm in [260]) to find the winner candidate. InfoGain(x), Ax(QIDj), and A(QIDj) come from the update in the previous iteration or the initialization prior to the first iteration. This step does not access data records. Updating InfoGain(x), Ax(QIDj), and A(QIDj) is considered in Chapter 11.2.4.3.

11.2.4.2 Perform the Winner Candidate

Suppose that the winner candidate w is local at Party A (otherwise, replace Party A with Party B). For each record t in Tg containing w, Party A accesses the raw records in TA[t] to tell how to specialize t. To facilitate this operation, we represent Tg by the data structure called Taxonomy Indexed PartitionS (TIPS), which has been discussed in Chapter 6.3.2.

With the TIPS, we can find all raw records generalized to x by following Linkx for a candidate x in ∪Cuti. To ensure that each party has access only to its own raw records, a leaf partition at Party A contains only raw records from TA and a leaf partition at Party B contains only raw records from TB. Initially, the TIPS has only the root node representing the most generalized record and all raw records. In each iteration, the two parties cooperate to perform the specialization w by refining the leaf partitions Pw on Linkw in their own TIPS.

Example 11.5
Continue with Example 11.4. Initially, TIPS has the root representing the most generalized record 〈ANY_Sex, ANY_Job, [1-99)〉, TA[root] = TA and TB[root] = TB. The root is on LinkANY_Sex, LinkANY_Job, and Link[1-99). See the root in Figure 11.5. The shaded field contains the number of raw records generalized by a node. Suppose that the winning candidate w is

[1-99) → {[1-37), [37-99)} (on Salary).

Party B first creates two child nodes under the root and partitions TB[root] between them. The root is deleted from all the Linkx, the child nodes are added to Link[1-37) and Link[37-99), respectively, and both are added to LinkANY_Job and LinkANY_Sex. Party B then sends the following instruction to Party A:

IDs 1-12 go to the node for [1-37).
IDs 13-34 go to the node for [37-99).

On receiving this instruction, Party A creates the two child nodes under the root in its copy of TIPS and partitions TA[root] similarly. Suppose that the next winning candidate is


ANY_Job → {Blue-collar, White-collar}.

Similarly, the two parties cooperate to specialize each leaf node on LinkANY_Job, resulting in the TIPS in Figure 11.6.

We summarize the operations at the two parties, assuming that the winner w is local at Party A.

Party A. Refine each leaf partition Pw on Linkw into child partitions Pc. Linkc is created to link up the new Pc's for the same c. Mark c as beneficial if the records on Linkc have more than one class. Also, add Pc to every Linkx other than Linkw to which Pw was previously linked. While scanning the records in Pw, Party A also collects the following information.

• Instruction for Party B. If a record in Pw is specialized to a child value c, collect the pair (id, c), where id is the ID of the record. This information will be sent to Party B to refine the corresponding leaf partitions there.

• Count statistics. Some count statistics for updating the Score. See [172] for details.

Party B. On receiving the instruction from Party A, Party B creates child partitions Pc in its own TIPS. At Party B, Pc's contain raw records from TB. Pc's are obtained by splitting Pw among Pc's according to the (id, c) pairs received.

Note, updating TIPS is the only operation that accesses raw records. Subsequently, updating Score(x) makes use of the count statistics without accessing raw records anymore. The overhead of maintaining Linkx is small. For each attribute in ∪QIDj and each leaf partition on Linkw, there are at most |child(w)| "relinkings." Therefore, there are at most |∪QIDj| × |Linkw| × |child(w)| "relinkings" for performing w.
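A minimal Python sketch of this instruction exchange (partitions as plain dictionaries keyed by record ID; the helper names are ours, and the network transfer between the parties is elided):

def build_instruction(owner_partition, child_of):
    # At the winner party: map each raw record in P_w to a child value of w
    # and emit the (id, child value) pairs to be sent to the other party.
    return [(rid, child_of(rec)) for rid, rec in owner_partition.items()]

def apply_instruction(other_partition, instruction):
    # At the receiving party: split its own copy of P_w into child partitions
    # using only the (id, child value) pairs, never the other party's values.
    children = {}
    for rid, child in instruction:
        if rid in other_partition:
            children.setdefault(child, {})[rid] = other_partition[rid]
    return children

# Toy run mirroring the first specialization of Example 11.5: [1-99) -> {[1-37), [37-99)}.
owner_partition = {1: {"Salary": 30}, 13: {"Salary": 37}}       # party owning Salary
other_partition = {1: {"Sex": "Male"}, 13: {"Sex": "Female"}}   # the other party
instr = build_instruction(owner_partition,
                          lambda rec: "[1-37)" if rec["Salary"] < 37 else "[37-99)")
print(instr)                                    # [(1, '[1-37)'), (13, '[37-99)')]
print(apply_instruction(other_partition, instr))
# {'[1-37)': {1: {'Sex': 'Male'}}, '[37-99)': {13: {'Sex': 'Female'}}}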

11.2.4.3 Update the Score

The key to the scalability of the PPMashup algorithm is updating Score(x) using the maintained count statistics without accessing raw records again. Score(x) depends on InfoGain(x), Ax(QIDj), and A(QIDj). The updated A(QIDj) is obtained from Aw(QIDj), where w is the specialization just performed.

11.2.4.4 Beyond k-Anonymity

k-anonymity is an effective privacy requirement that prevents record linkages. However, if some sensitive values occur very frequently within a qid group, the adversary could still confidently infer the sensitive value of an individual from his/her qid value. This type of attribute linkage attack was studied in Chapter 2.2. The proposed approach in this chapter can be extended to


incorporate other privacy models, such as ℓ-diversity [162], confidence bounding [237], and (α,k)-anonymity [246], to thwart attribute linkages.

To adopt these privacy requirements, we make three changes.

1. The notion of valid specialization has to be redefined depending on the privacy requirement. The PPMashup algorithm guarantees that the identified solution is locally optimal if the privacy measure holds the (anti-)monotonicity property with respect to specialization. ℓ-diversity (Chapter 2.2.1), confidence bounding (Chapter 2.2.2), and (α,k)-anonymity (Chapter 2.2.5) hold such an (anti-)monotonicity property.

2. The AnonyLoss(v) function in Chapter 11.2.2 has to be modified in order to reflect the loss of privacy with respect to a specialization on value v. We can, for example, adopt the PrivLoss(v) function in [237] to capture the increase of confidence on inferring a sensitive value by a qid.

3. To check the validity of a candidate, the party holding the sensitive attributes has to first check the distribution of sensitive values in a qid group before actually performing the specialization. Suppose Party B holds a sensitive attribute SB. Upon receiving a specialization instruction on value v from Party A, Party B has to first verify whether specializing v would violate the privacy requirement. If there is a violation, Party B rejects the specialization request and both parties have to redetermine the next candidate; otherwise, the algorithm proceeds with the specialization as in Algorithm 11.2.7. A small sketch of such a validity check follows this list.
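The validity check in change 3 can be illustrated for confidence bounding with a short Python sketch; the group contents, the threshold h, and the helper name are hypothetical:

from collections import Counter

def violates_confidence_bound(child_groups, h):
    # child_groups: the lists of sensitive values that the qid groups would
    # contain after specializing v; returns True if any sensitive value would
    # be inferable with confidence greater than h.
    for values in child_groups:
        if values and max(Counter(values).values()) / len(values) > h:
            return True
    return False

# Party B's hypothetical check before accepting a specialization (h = 0.6):
groups_after = [["Flu", "Flu", "HIV"], ["HIV", "HIV", "HIV", "Flu"]]
print(violates_confidence_bound(groups_after, h=0.6))  # True: 2/3 > 0.6 already in the first group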

11.2.4.5 Analysis

PPMashup in Algorithm 11.2.7 produces the same integrated table as the single party algorithm TDS in Chapter 6.3 on a joint table, and ensures that no party learns more detailed information about the other party than what they agree to share. This claim follows from the fact that PPMashup performs exactly the same sequence of specializations as TDS, in a distributed manner where TA and TB are kept locally at the sources. The only information revealed to each other is that in ∪Cuti and Tg at each iteration. However, such information is more general than the final integrated table that the two parties agree to share.

PPMashup is extendable for multiple parties with minor changes: In Line 5, each party should communicate with all the other parties for determining the winner. Similarly, in Line 8, the party holding the winner candidate should instruct all the other parties, and in Line 10, a party should wait for the instruction from the winner party.

The cost of PPMashup can be summarized as follows. Each iteration involves the following work:


1. Scan the records in TA[w] and TB[w] for updating TIPS and maintaining count statistics (Chapter 11.2.4.2).

2. Update QIDTreej, InfoGain(x), and Ax(QIDj) for affected candidates x (Chapter 11.2.4.3).

3. Send the "instruction" to the remote party. The instruction contains only the IDs of the records in TA[w] or TB[w] and the child values c in child(w), and therefore is compact.

Only the work in (1) involves accessing data records; the work in (2) makes use of the count statistics without accessing data records and is restricted to only affected candidates. For the communication cost (3), each party communicates (Line 5 of Algorithm 11.2.7) with others to determine the global best candidate. Thus, each party sends n − 1 messages, where n is the number of parties. Then, the winner party (Line 8) sends the instruction to the other parties. This communication process continues for at most s times, where s is the number of valid specializations, which is bounded by the number of distinct values in ∪QIDj. Hence, for a given data set, the total communication cost is s{n(n − 1) + (n − 1)} = s(n² − 1) ≈ O(n²). If n = 2, then the total communication cost is 3s.

In the special case that the anonymity requirement contains only local QIDs, one can shrink down the TIPS to include only local attributes. Parties do not have to pass around the specialization array because each party specializes only local attributes. A party has to keep track of QIDTreej only if QIDj is a local QID. The memory requirement and network traffic can be further reduced, and the efficiency can be further improved. In the special case that there is only a single QID, each root-to-leaf path in TIPS represents a qid. One can store a(qid) directly at the leaf partitions in TIPS without QIDTrees. The single QID setting arises when the QID contains all potentially identifying attributes to be used for linking the table to an external source. PPMashup can be more efficient in this special case.

PPMashup presented in Algorithm 11.2.7 is based on the assumption that all the parties are semi-honest. An interesting extension is to consider the presence of malicious and selfish parties [180]. In such a scenario, the algorithm has to be not only secure, but also incentive compatible to ensure fair contributions. We study this scenario next.

11.2.5 Anonymization Algorithm for Malicious Model

The PPMashup algorithm presented in Algorithm 11.2.7 satisfies all the conditions of Definition 11.1 only if all the parties follow the defined protocol. However, a malicious party can easily cheat others by under-declaring its Score value, thus avoiding sharing its data with others. Suppose Party A is malicious. During the anonymization process, Party A can always send 0 as its Score value to Party B (Line 5 of Algorithm 11.2.7) for determining


Table 11.2: Anonymous tables, illustrating a selfish Party A

SSN (Shared)   Class (Shared)   Sex (Party A) ...   Job (Party B)    Salary (Party B) ...
1-7            0Y7N             ANY                 Non-Technical    [1-35)
8-16           5Y4N             ANY                 Technical        [35-37)
17-25          7Y2N             ANY                 Manager          [37-99)
26-34          9Y0N             ANY                 Professional     [37-99)

the global winner candidate. Hence, Party A indirectly forces Party B to specialize its attributes in every round. This can continue as long as Party B has a valid candidate. Thus, the malicious Party A successfully obtains the locally anonymized data of Party B while sharing no data of its own. Table 11.2 is an example of an integrated anonymous table where Party A does not participate in the anonymization process. Moreover, this gives Party A a global data set, which is less anonymous than if it had cooperated with Party B.

Next, we provide a solution to prevent parties from reporting fake Score values. We assume that a party exhibits its malicious behavior only by reporting a fake Score value. It does not, however, provide wrong data to other parties (Line 8 of Algorithm 11.2.7). Preventing malicious parties from sharing fake data is difficult since data is private information of a party, which is not verifiable. Further investigation is needed to thwart this kind of misbehavior. In this regard, mechanism design theory [180] could be a potential tool to motivate parties to share their real data.

11.2.5.1 Rational Participation

To generate the integrated anonymous table, each party specializes its own attributes, which can be considered as a contribution. The contribution can be measured by the attribute's Score value. Thus, the total contribution of Party A, denoted by μA, is the summation of all the Score values from its attribute specializations. We use ϕA to denote the contributions of all other parties excluding Party A. This is the ultimate value that each party wants to maximize from the integrated anonymous table.

The Score function in Equation 6.1 uses information gain (InfoGain) to identify the next candidate for specialization. Yet, InfoGain favors the candidate value that has a larger number of child values in the taxonomy tree [191]. If we compare μi across different parties, parties having values with a larger number of child values tend to have higher Scores, thereby resulting in unfair contributions. To avoid this problem, a better Score function is to use GainRatio, which normalizes the InfoGain by SplitInfo:

Score(v) = GainRatio(v) = InfoGain(v) / SplitInfo(v),    (11.1)


FIGURE 11.7: Rational participation (Party A's ϕ plotted against Party B's ϕ; the feasible outcomes lie on the line joining (0, p) and (p, 0), and the rational participation point is (p/2, p/2))

where

SplitInfo(v) = −Σ_c (|T[c]| / |T[v]|) × log2(|T[c]| / |T[v]|).    (11.2)

Example 11.6
Consider the raw data in Table 11.1. Initially, ∪Cuti = {ANY_Sex, ANY_Job, [1-99)}. The specialization ANY_Job refines the 34 records into 16 records for Blue-collar and 18 records for White-collar. Score(ANY_Job) is calculated as follows.

E(ANY_Job) = −(21/34) × log2(21/34) − (13/34) × log2(13/34) = 0.960
E(Blue-collar) = −(5/16) × log2(5/16) − (11/16) × log2(11/16) = 0.896
E(White-collar) = −(16/18) × log2(16/18) − (2/18) × log2(2/18) = 0.503
InfoGain(ANY_Job) = E(ANY_Job) − ((16/34) × E(Blue-collar) + (18/34) × E(White-collar)) = 0.272
SplitInfo(ANY_Job) = −(16/34) × log2(16/34) − (18/34) × log2(18/34) = 0.998
GainRatio(ANY_Job) = 0.272 / 0.998 = 0.272.
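The GainRatio of Example 11.6 can be checked with the following short Python sketch; SplitInfo is simply the entropy of the split sizes, and the helper name is ours:

from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

# ANY_Job -> {Blue-collar, White-collar} on the 34 records of Table 11.1.
info_gain = entropy((21, 13)) - (16/34 * entropy((5, 11)) + 18/34 * entropy((16, 2)))
split_info = entropy((16, 18))             # entropy of the split sizes 16 and 18
gain_ratio = info_gain / split_info
print(round(info_gain, 3), round(split_info, 3), round(gain_ratio, 3))   # 0.272 0.998 0.272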

The value of ϕ for each party is a function of the anonymization process participated in by the different parties. Since all the parties are rational, their actions are driven by self interest. Hence, they may want to deviate from the algorithm to maximize their ϕ while minimizing μ as much as possible. To overcome this problem, the presented algorithm has to be such that

1. following the algorithm, parties will reach a fair distribution of ϕ, and

2. deviating from the algorithm will eventually decrease the value of ϕ.


An anonymous integrated table can be achieved in two ways. Firstly, both parties (n = 2) can specialize their attributes. Certainly, there can be many different choices for attribute selection; the heuristic function presented in Equation 11.1 is one of them, and the data holder may apply other heuristic functions. Secondly, only one party can specialize its attributes (based on only local QIDs), while the other party's attributes are generalized to the top most values. In this case, part of the table is locally anonymized and part of the table is completely generalized to the top most values. As an example, consider Party A and Party B, having the same number of local attributes and being equally capable of contributing to the anonymization process. Let's assume that the integrated table cannot be specialized further after s attribute specializations. Each specialization has the same Score value and the sum of the Score values is p. Figure 11.7 shows the possible values of ϕ for both parties. The line joining points (0, p) and (p, 0) shows the different choices of ϕ values. Both extreme points represent the participation of one party only, while the points in between are the different levels of contributions from the parties. Each party cannot increase its ϕ value without decreasing the other party's ϕ. It is now easy to find the unique operating point from these possible alternatives. Rationality suggests that the only point that can be accepted by both parties is (p/2, p/2). The proposed algorithm ensures that, by following the algorithm, the parties will reach this rational participation point.

In reality, each party holds different attributes and some attributes are more informative than others for classification analysis. Thus, all the parties are not equally capable of contributing to the anonymization process. Based on the contribution capability, we assume that parties are divided into y different classes, where parties belonging to class 1 are able to contribute the most and parties belonging to class y are the least capable ones. If there are different numbers of parties from different classes, then, extending the concept of rationality, we can conclude that the interaction between them will be dominated by the party of the least capable class. For example, if one party of class 1 and two parties of class 2 participate to form an integrated anonymous table, then the party of class 1 will behave as if it belongs to class 2. This is because, by contributing more than the class 2 parties, the party of class 1 will not receive any additional contribution from them.

The problem of ensuring honest participation falls into the framework of non-cooperative game theory. To make the chapter self-contained, we provide some background information on game theory below. A more general overview of game theory can be found in [24, 183].

11.2.5.2 Game Theory

A game can be defined as an interaction model among players,3 where each player has its own strategies and possible payoffs. Game theory is the tool

3We use the word player to refer to a party.


                          Player 2: Cooperate    Player 2: Defect
Player 1: Cooperate       (5, 5)                 (-5, 10)
Player 1: Defect          (10, -5)               (0, 0)

FIGURE 11.8: Payoff matrix for prisoner's dilemma (each entry lists Player 1's payoff followed by Player 2's payoff)

that studies this interaction, which is established and broken in the course of cooperation and competition between the players. Game theory can be classified into two categories: cooperative game theory and non-cooperative game theory. Below we introduce a well known non-cooperative game with an example and illustrate how this model can be used to derive a solution for the aforementioned problem.

Basic Definitions
A normal game consists of a finite set of players P = {1, 2, . . . , n}, a strategy set Si for each player i, and a set of outcomes, O. Figure 11.8 shows the different elements of a game where two players have two strategies each. The columns represent the strategies of player 2 and the rows represent the strategies of player 1. The intersections between the rows and the columns represent the payoffs of players 1 and 2. For example, the payoff of both players is 5 when they both choose to "cooperate."

Strategy: Each player i can select its strategy from a strategy set Si. For example, the strategy set for each player is S1 = S2 = {Cooperate, Defect}. A strategy profile s = {s1, s2, . . . , sn} is the vector of strategies. It is the set of all the strategies chosen by the players, whereas s−i = {s1, . . . , si−1, si+1, . . . , sn} denotes the strategies of all the players except player i. A strategy profile determines the outcome, o(s) ∈ O, of the game. For example, the possible outcomes of the game are (Cooperate, Cooperate), (Cooperate, Defect), (Defect, Cooperate), and (Defect, Defect). Each player chooses its strategy in such a way that its preferred outcome occurs, and the preference over the outcome of the game is expressed by the utility function.

Utility Function: Each player has preferences over the outcomes of the game. Preferences over outcomes are represented through a utility function, ui. The utility function of a player i can be considered as a transformation of the outcome to a real number. It is expressed formally as:

ui : O → ℝ    (11.3)

A player prefers outcome o1 over outcome o2 if ui(o1) > ui(o2). A rational player always wants to maximize its utility. Thus, it chooses a strategy which


will increase its expected utility given the preferences of the outcomes, the structure of the game, and the belief of others' strategies.

Solution Concepts: Game theory uses different techniques to determine the outcome of the game. These outcomes are the stable or the equilibrium points of the game. The most well known equilibrium concept is the Nash equilibrium. It states that each player plays its best strategy to maximize the utility, given the strategies of the other players.

DEFINITION 11.2 Nash Equilibrium A strategy profile s* = {s*1, s*2, . . . , s*n} is a Nash equilibrium if this strategy profile maximizes the utility of every player i. Formally,

∀i: ui(o(s*i, s*−i)) ≥ ui(o(si, s*−i)), ∀si    (11.4)

The Nash equilibrium is the point where no player can take advantage of the other players' strategies to improve its own position.

For the given example, both players will play the strategy "defect" since that ensures a better payoff regardless of what the other player chooses.

Iterated Prisoner's Dilemma
Prisoner's Dilemma (PD) is a classical example of a non-cooperative game, which can be used to describe the decision making process regarding the contribution of the rational parties. In PD, two players can choose to cooperate with or defect one another. If both players cooperate, they receive some benefits. If both defect, they receive punishments. If exactly one cooperates, then the cooperator gets the punishment while the other player receives the benefit. The payoff matrix of Figure 11.8 is a canonical example of the prisoner's dilemma.

Certainly, the mutual benefit of both players is to cooperate. However, from each player's perspective, defect is the best strategy. For example, Player 1 notices that it always gets a higher payoff by choosing defect rather than cooperate, irrespective of the strategy of Player 2. Thus, the Nash equilibrium of the PD given in Figure 11.8 is to defect for both players. However, cooperation can emerge as the Nash equilibrium if this one shot game is played repeatedly. Such a repeated game is known as the iterated prisoner's dilemma. In an iterated PD game, each player selects its strategy based on the strategy of the other player in the previous games. This allows players to cooperate with each other. Generous Tit for Tat (GTFT) is a well known strategy that enables cooperation to emerge as the Nash equilibrium. In the GTFT strategy, each player behaves as the other player does. However, occasionally a player is generous by cooperating even if its opponent defects. It has been proven that the GTFT strategy enforces all the players to cooperate in an iterated prisoner's dilemma [24].
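A tiny Python simulation, using the payoffs of Figure 11.8, shows two GTFT players settling into sustained cooperation; the generosity parameter and the function names are illustrative:

import random

PAYOFF = {("C", "C"): (5, 5), ("C", "D"): (-5, 10),
          ("D", "C"): (10, -5), ("D", "D"): (0, 0)}

def gtft(opponent_last, generosity=0.1):
    # Generous tit-for-tat: copy the opponent's last move, but occasionally
    # cooperate even after a defection.
    if opponent_last is None or opponent_last == "C":
        return "C"
    return "C" if random.random() < generosity else "D"

def play(rounds=100, seed=0):
    random.seed(seed)
    last1 = last2 = None
    total1 = total2 = 0
    for _ in range(rounds):
        m1, m2 = gtft(last2), gtft(last1)
        p1, p2 = PAYOFF[(m1, m2)]
        total1, total2 = total1 + p1, total2 + p2
        last1, last2 = m1, m2
    return total1, total2

print(play())   # (500, 500): both players cooperate in every round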


11.2.5.3 Participation Strategy

The privacy-preserving data mashup problem can be modeled as an iterated prisoner's dilemma game. If both (n = 2) parties cooperate, then they can form a joint integrated table and, thus, both receive benefits. If both choose to defect, then it is not possible to form an integrated table, which can be considered as a loss or punishment. However, if only one cooperates (the party that contributes), the cooperator gets no benefit, while the other receives more benefit.

To ensure that both parties will contribute to form the joint integrated table, we can employ a GTFT strategy and adopt it in the context of the privacy-preserving data mashup problem. GTFT helps the parties reach the rational participation point, and therefore rationality dictates following the strategy. In the revised version of the PPMashup algorithm, each party keeps track of the μ and ϕ values. These values indicate the contributions of the parties. In each iteration, each party decides whether or not to contribute based on these two variables. For two parties, the decision strategy works as follows:

If (μ > ϕ + ε) then Not-contribute else Contribute (11.5)

where ε is a small positive number. The general intuition is that each party participates in the anonymization process if it has not contributed more than the other party. However, each party is a bit generous in the sense that it contributes as long as the difference is small. In the case of multiple parties, we can generalize the decision strategy as follows:

If (μ > ϕ/(n − 1) + ε) then Not-contribute else Contribute (11.6)

where n is the number of parties. We incorporate the decision strategy in Algorithm 11.2.7 and present Algorithm 11.2.8 for integrating private tables from multiple malicious parties. The extended algorithm has the following key differences. First, in Lines 5-6, a party does not further contribute if its contributions are larger than the contributions of the other parties plus some generosity. Second, in Lines 7-9, the algorithm terminates and returns the current generalized table if all other parties also do not contribute. Finally, in Lines 16 and 20, every party keeps track of its own contributions and the other parties' contributions.

11.2.5.4 Analysis

Algorithm 11.2.8 has some nice properties. First, each party only requires tokeep two extra variables disregarding the number of participating parties. Thismakes the decision algorithm scalable. Second, each party decides whether ornot to contribute based on the locally generated information. Thus, a partycannot be exploited by others in the decision making process. Third, althoughthe values of μ are not exactly the same for all the parties when the algorithm


Algorithm 11.2.8 PPMashup for Every Party in Malicious Model
1: initialize Tg to include one record containing top most values;
2: initialize ∪Cuti to include only top most values;
3: while some candidate v ∈ ∪Cuti is valid do
4:   find the local candidate x of highest Score(x);
5:   if μ > ϕ/(n − 1) + ε or the party has no valid candidate then
6:     send Not-contribute;
7:     if receive Not-contribute from every other party then
8:       break;
9:     end if
10:  else
11:    communicate Score(x) with all other parties to find the winner;
12:  end if
13:  if the winner w is local then
14:    specialize w on Tg;
15:    instruct all other parties to specialize w;
16:    μ := μ + Score(w);
17:  else
18:    wait for the instruction from the party who owns w;
19:    specialize w on Tg using the instruction;
20:    ϕ := ϕ + Score(w);
21:  end if
22:  replace w with child(w) in the local copy of ∪Cuti;
23:  update Score(x) and beneficial/validity status for candidates x ∈ ∪Cuti;
24: end while
25: return Tg and ∪Cuti;

terminates, the algorithm progresses by ensuring an almost even contribution from all the parties, with a maximum difference of ε. Since parties are unable to determine the last iteration of the algorithm, they will cooperate until the anonymization finishes. Finally, the computational and communication cost of the algorithm remains the same as Algorithm 11.2.7 discussed in Chapter 11.2.4.5.

11.2.6 Discussion

We have presented a service-oriented architecture for the privacy-preserving data mashup algorithm. The architecture clearly separates the requesting consumer of the mashup application from the backend process. Due to issues of convenience and control, a mashup coordinator represents a static point of connection between clients and providers with a high rate of availability. A mashup coordinator would also be able to cache frequently requested data tables during the period in which they are valid. Requests are attached to a session token identifying a kind of contract between a user and several data holders and are maintained by the mashup coordinator, who can also be a generic service provider. Another benefit is that the mashup coordinator is able to handle and unify several service level agreements among different data holders and queue service requests according to the workload of individual data holders.

In real-life collaborative anonymization problems, different parties agree to share their data when they have mutual trust and benefits. However, if the parties do not want to contribute more than the others, then Algorithm 11.2.8 provides an appropriate solution. The participating data holders can decide whether to employ Algorithm 11.2.7 or Algorithm 11.2.8.

Experiments on real-life data verified several claims about the PPMashup algorithms [94]. First, data integration does lead to improved data analysis. Second, PPMashup achieves a broad range of anonymity requirements without significantly sacrificing the usefulness of the data for classification. The data quality is identical or comparable to the results produced by single-party anonymization methods [95, 96, 123]. This study suggests that classification analysis has a high tolerance towards data generalization, thereby enabling data mashup across multiple data holders even under a broad range of anonymity requirements. Third, PPMashup is scalable for large data sets and different single QID anonymity requirements. It provides a practical solution to data mashup where there is a dual need for information sharing and privacy protection.

11.3 Cryptographic Approach

11.3.1 Secure Multiparty Computation

Information integration has been an active area of database research [58, 244]. This literature typically assumes that all information in each database can be freely shared [17]. Secure multiparty computation (SMC), on the other hand, allows sharing of the computed result (e.g., a classifier), but completely prohibits sharing of the data [260]; data sharing, in contrast, is a primary goal of the problem studied in this chapter. An example is the secure multiparty computation of classifiers [50, 69, 71, 259].

Yang et al. [258] propose several cryptographic solutions to collect information from a large number of data owners. Yang et al. [259] develop a cryptographic approach to learn classification rules from a large number of data owners while their sensitive attributes are protected. The problem can be viewed as a horizontally partitioned data table in which each transaction is owned by a different data owner. The model studied in this chapter can be viewed as a vertically partitioned data table, which is completely different from [258, 259]. More importantly, the output of their method is a classifier, but the output of PPMashup is integrated anonymous data that supports classification analysis. Having access to the data, the data recipient has the freedom to apply her own classifier and parameters.

Vaidya and Clifton [228, 229] propose techniques to mine association rules and to compute k-means clustering in a distributed setting without sharing data. Compared to sharing data mining results, data sharing offers the data recipients more freedom to apply their own classifiers and parameters. Refer to [227, 230] for more details on privacy-preserving distributed data mining (PPDDM).

Jiang and Clifton [127, 128] propose a cryptographic approach. First, each data holder determines a locally k-anonymous table. Then, the intersection of RecIDs for the qid groups in the two locally k-anonymous tables is determined. If the intersection size of each pair of qid groups is at least k, then the algorithm returns the join of the two locally k-anonymous tables, which is globally k-anonymous; otherwise, further generalization is performed on both tables and the RecID comparison procedure is repeated. To prevent the other data holder from learning more specific information than that appearing in the final integrated table through RecIDs, a commutative encryption scheme [189] is employed to encrypt the RecIDs for comparison. This scheme ensures the equality of two values encrypted in a different order with the same set of keys, i.e., EKey1(EKey2(RecID)) = EKey2(EKey1(RecID)).
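To see why commutativity allows the RecIDs to be compared without being revealed, here is a minimal Python sketch of one classic commutative cipher, exponentiation modulo a shared prime (Pohlig-Hellman/SRA style). The scheme actually used in [189] may differ, and the small prime below is for readability only, not for real security.

from math import gcd
import secrets

P = 2**61 - 1   # a Mersenne prime; far too small for real use, fine for a demo

def keygen():
    """Pick a private exponent k with gcd(k, P-1) = 1, so encryption is a bijection."""
    while True:
        k = secrets.randbelow(P - 2) + 2
        if gcd(k, P - 1) == 1:
            return k

def enc(key, value):
    """E_key(value) = value^key mod P. Commutative because
    (x^k1)^k2 = x^(k1*k2) = (x^k2)^k1 (mod P)."""
    return pow(value, key, P)

k1, k2 = keygen(), keygen()          # one key per data holder
rec_id = 123456                      # a RecID mapped to an integer in [2, P-1]
assert enc(k1, enc(k2, rec_id)) == enc(k2, enc(k1, rec_id))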

Their methods, however, have several limitations. This model is limited to only two parties, whereas the technique presented in this chapter is applicable to multiple parties. The data model in [128] assumes the participating parties are semi-honest, meaning that the data holders follow the secure protocol but may attempt to derive additional (sensitive) information from their collected data. Their solution cannot guarantee security and fair contributions in the presence of malicious participants, as discussed in Chapter 11.2.5. While determining a locally k-anonymous table, each party does not communicate with the other parties. As a result, each generalizes the data according to a different set of QID attributes. The locally optimized anonymization may lead to very different qid groups. Consequently, the second phase tends to produce a very small intersection of record IDs; therefore, the data will be excessively generalized locally in order to reach the minimum intersection size k. In contrast, PPMashup always builds the same grouping across all parties by sharing the grouping information. Thus, PPMashup does not generalize the data excessively because of inconsistent groupings at different parties.

11.3.2 Minimal Information Sharing

Agrawal et al. [17] propose the notion of minimal information sharing for computing queries spanning private databases. They considered computing intersection, intersection size, equijoin and equijoin size, assuming that certain metadata, such as the cardinality of the databases, can be shared between the parties. Besides, there exists an extensive literature on inference control in multilevel secure databases [87, 104, 115, 116, 125]. All these works prohibit the sharing of databases.

11.4 Summary and Lesson Learned

We have studied the problem of collaborative anonymization for vertically partitioned data, motivated the problem with a real-life scenario, and generalized the privacy and information requirements from the financial industry to the problem of privacy-preserving data mashup for the purpose of joint classification analysis. This problem has been formalized as achieving k-anonymity on the integrated data without revealing more detailed information in the process. We have studied several secure protocols for the semi-honest and malicious models. Compared to classic secure multiparty computation, a unique feature of collaborative anonymization is to allow data sharing instead of only result sharing. This feature is especially important for data analysis, where the process is rarely a pure input/output black-box mapping, and where user interaction and knowledge about the data often lead to superior results. Being able to share data records permits such exploratory data analysis and explanation of results.

In general, the financial sector prefers simple privacy requirements. Despite some criticisms of k-anonymity, as we have discussed in previous chapters, the financial sector (and probably some other sectors) finds that k-anonymity is an ideal privacy requirement due to its intuitiveness. Their primary concern is whether they can still effectively perform the task of data analysis on the anonymous data. Therefore, solutions that solely satisfy some privacy requirements are insufficient for them. They demand anonymization methods that can preserve information for various data analysis tasks.


Chapter 12

Collaborative Anonymization for Horizontally Partitioned Data

12.1 Introduction

In the previous chapter, we studied the data publishing model in which multiple data holders want to collaboratively anonymize their vertically partitioned data. In this chapter, we study the problem of collaborative anonymization for horizontally partitioned data, where multiple data holders own sets of person-specific data records on the same set of attributes. The model assumes that the sets of records are disjoint, meaning that a record owner appears in at most one record set. Often there is a strong urge to integrate scattered data owned by different parties for greater benefits [131]. A good example is the Shared Pathology Informatics Network (SPIN) initiated by the National Cancer Institute.1 The objective is to create a virtual database combining data from different healthcare institutions to facilitate research investigations. Though the virtual database is an integration of different databases, in reality the data should remain physically in different locations under the complete control of the local healthcare institutions.

There are several practical challenges in such a scenario. First, according to the Health Insurance Portability and Accountability Act (HIPAA), it is not allowed to share patients' records directly without de-identification. Second, the institutions cannot share their patients' records among themselves due to the confidentiality of the data. Hence, similar to the problem of privacy-preserving data mashup studied in Chapter 11.2, the data integration should take place in such a way that the final integrated table satisfies some jointly agreed privacy requirements, such as k-anonymity and ℓ-diversity, and no data holder learns more detailed information than what is in the final integrated table. This particular scenario can be generalized to the problem of collaborative anonymization for horizontally partitioned data.

Different approaches can be taken to enable data anonymization for distributed databases. Let us first try the same two naive approaches mentioned in the previous chapter: “generalize-then-integrate” and “integrate-then-generalize.”

1 Shared Pathology Informatics Network. http://www.cancerdiagnosis.nci.nih.gov/spin/



In the “generalize-then-integrate” approach, each party first locally anonymizes its data and then integrates. Unlike the case of privacy-preserving data mashup, this solution is correct since all the parties own the same set of attributes. However, it suffers from over-generalization because each party has a smaller data set, thereby losing data utility. On the other hand, the “integrate-then-generalize” approach is incorrect since the party holding the integrated table has access to all the private data, thereby violating the privacy requirement. An easy solution is to assume the existence of a trusted third party, where all the data can be integrated before anonymization. Needless to mention, this assumption is not practical since it is not always feasible to find such a trusted third party. Moreover, a third party-based solution is risky because any failure of the trusted third party will compromise the complete privacy of all the participating parties.

In this chapter, we describe a fully distributed solution proposed by Jurczyk and Xiong in [131] for horizontally partitioned data. The privacy model and proposed solution are presented briefly, followed by a discussion.

12.2 Privacy Model

There are two different privacy requirements: privacy for record owners and privacy for data holders.

1. Privacy for record owners requires that the integrated data should not contain any identifiable information of any record owner. To protect the privacy of the record owners, the final integrated table should satisfy both the k-anonymity (Chapter 2.1.1) and ℓ-diversity (Chapter 2.2.1) privacy models.

2. Privacy for data holders requires that a data holder should not reveal any extra information to others in the process of anonymization, nor the ownership of its data.

The second goal of the privacy model is to protect the privacy of the data holders. This requires that each data holder should not reveal to other data holders any information beyond what is in the final integrated table. This requirement is similar to secure multiparty computation (SMC) protocols, where no participant learns more information than the outcome of the function. To achieve this privacy, the data holders are considered to be semi-honest. Refer to Chapter 11.2 for the definition of semi-honest.

The privacy model further requires that the ownership of a particular record in the integrated table should also be concealed. For example, given a particular record, an adversary with some background knowledge should not be able to identify that this record is from a particular hospital. To overcome this problem, the ℓ-diversity privacy principle can be used, considering location as sensitive information. Thus, for each equivalence class, the records should come from ℓ different locations. This privacy requirement is known as ℓ-site diversity.

12.3 Overview of the Solution

Initially, the data is located at L different locations, where L > 2. Thus, the minimum number of data holders is three. Each data holder Party_i owns a private database d_i over the same set of attributes. The integrated database is the anonymous union of the local databases, denoted by d = ∪_{1≤i≤L} d_i. Note that the quasi-identifiers are uniform across all the local databases. Data holders participate in the distributed protocol and produce a local anonymous database which itself may not be k-anonymous and ℓ-diverse; however, the union of the local anonymous databases is guaranteed to be k-anonymous and ℓ-diverse. Note that the solution is not applicable to two parties (L = 2) because the secure sum protocol, which will be discussed later, only works for L > 2.

The distributed anonymization algorithm is based on the top-down Mondrian algorithm [149]. It is worth mentioning that, for distributed anonymization, a top-down approach is better than bottom-up anonymization algorithms since anything revealed during the protocol execution is a more general view than the final result. The original Mondrian algorithm was designed for a single party; now it has to be decomposed, and SMC protocols have to be used, to make a secure distributed algorithm. Refer to Chapter 5.1.2.6 for more details on the Mondrian algorithm.

The distributed anonymization algorithm executes its three phases among the different parties securely with the help of SMC protocols. Initially, the data holders are divided into leading and non-leading nodes. One node among all the parties acts as the leading node and guides the anonymization process. The leading node first determines the attribute to specialize. To do this, it needs to calculate the range of all the candidates in d = ∪_{1≤i≤L} d_i. A secure kth element protocol can be used to securely compute the minimum (k = 1) and maximum values of each attribute across the databases [14].

Once it determines the split attribute, the leading node instructs the other data holders to do the specialization. After the partitioning, the leading node recursively checks whether or not further specialization would violate the specified privacy requirements. In order to determine the validity of a specialization, a secure sum protocol [207] is used to determine the number of tuples across the whole database. This secure sum protocol is secure when there are more than two data holders with a leading node. The leading node is the only node that is able to


Table 12.1: Node 0
  Original data                Generalized data
  Rec ID  Age  Salary          Rec ID  Age    Salary
  1       30   41K             1       30-36  41K-42K
  2       33   42K             2       30-36  41K-42K

Table 12.2: Node 1
  Original data                Generalized data
  Rec ID  Age  Salary          Rec ID  Age    Salary
  3       45   55K             3       37-56  42K-55K
  4       56   42K             4       37-56  42K-55K

Table 12.3: Node 2
  Original data                Generalized data
  Rec ID  Age  Salary          Rec ID  Age    Salary
  5       30   32K             5       30-36  32K-40K
  6       53   32K             6       37-56  32K-41K

Table 12.4: Node 3
  Original data                Generalized data
  Rec ID  Age  Salary          Rec ID  Age    Salary
  7       38   41K             7       37-56  32K-41K
  8       33   40K             8       30-36  32K-40K

determine the value of the summation. This imposes some limitations on the proposed distributed algorithm, which will be discussed in Chapter 12.4.
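The secure sum step can be illustrated with the textbook ring protocol: the leading node masks its local count with a random value, every other node adds its own count to the running total, and the leading node removes the mask at the end. The Python sketch below is a simplification (no communication layer, fixed modulus) and not necessarily the exact protocol of [207].

import secrets

MOD = 2**64   # arithmetic is done modulo a value larger than any possible sum

def secure_sum(local_counts):
    """Ring-based secure sum; local_counts[0] belongs to the leading node and
    the protocol assumes at least three parties."""
    r = secrets.randbelow(MOD)                # random mask known only to the leader
    running = (r + local_counts[0]) % MOD     # leader starts the ring
    for count in local_counts[1:]:
        running = (running + count) % MOD     # each node sees only a masked partial sum
    return (running - r) % MOD                # leader removes the mask

# Counts of tuples in one candidate subgroup at three data holders:
print(secure_sum([2, 1, 3]))                  # -> 6; only the leader learns the total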

Example 12.1
Tables 12.1-12.4 show an example scenario of distributed anonymization. The union of the generalized data in Table 12.5 satisfies 2-anonymity and 1-site-diversity. Note that although the local anonymous data at nodes 2 and 3 are not 2-anonymous individually, the union of all the local generalized data satisfies the 2-anonymity requirement. The union satisfies only 1-site-diversity because the two records in qid = ⟨30−36, 41K−42K⟩ come from the same node.
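The claims of Example 12.1 can be verified mechanically. The Python sketch below encodes the union of the generalized records from Table 12.5 and checks the size of each qid group and the number of distinct source nodes per group; the tuple layout is just one possible encoding.

from collections import defaultdict

# (node_id, rec_id, age_range, salary_range) as in Table 12.5
UNION = [
    (0, 1, "30-36", "41K-42K"), (0, 2, "30-36", "41K-42K"),
    (1, 3, "37-56", "42K-55K"), (1, 4, "37-56", "42K-55K"),
    (2, 5, "30-36", "32K-40K"), (2, 6, "37-56", "32K-41K"),
    (3, 7, "37-56", "32K-41K"), (3, 8, "30-36", "32K-40K"),
]

def k_anonymity(records):
    """Smallest qid-group size in the union."""
    groups = defaultdict(list)
    for node, _, age, salary in records:
        groups[(age, salary)].append(node)
    return min(len(members) for members in groups.values())

def site_diversity(records):
    """Smallest number of distinct source nodes within any qid group."""
    groups = defaultdict(set)
    for node, _, age, salary in records:
        groups[(age, salary)].add(node)
    return min(len(nodes) for nodes in groups.values())

print(k_anonymity(UNION))     # -> 2: the union is 2-anonymous
print(site_diversity(UNION))  # -> 1: qid <30-36, 41K-42K> comes from a single node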

12.4 Discussion

The aforementioned distributed anonymization algorithm is simple, secure, and privacy preserving. One of the advantages of the proposed model is its


Table 12.5: Union of the generalized data
  Node ID  Rec ID  Age    Salary
  0        1       30-36  41K-42K
  0        2       30-36  41K-42K
  1        3       37-56  42K-55K
  1        4       37-56  42K-55K
  2        5       30-36  32K-40K
  2        6       37-56  32K-41K
  3        7       37-56  32K-41K
  3        8       30-36  32K-40K

flexibility to adopt different anonymization algorithms. Though it is built on the multi-dimensional top-down Mondrian algorithm, alternative top-down anonymization algorithms, such as TDS, can also be used with little modification. The overall complexity of the distributed algorithm is O(n log2 n), where n is the number of records of the integrated data table.

One weakness of the proposed method is that the distributed anonymization algorithm can only be applied when there are more than two data holders. This is due to the limitation of the secure sum protocol, which is used to determine whether any further split of a particular subgroup is possible or not. Besides, though the proposed architecture is distributed, one node acts as a leading node. This leading node controls the entire anonymization process and decides which subgroup to split and when to stop the anonymization process. This architecture works perfectly under the assumption of the semi-honest adversary model; however, if the leading node turns out to be malicious, then the privacy of both the data holders and the record owners is jeopardized. Thus, further investigation is needed to devise an alternative to the secure sum protocol so that the architecture can be used for two data holders without the need of a leading node.

The method guarantees that the integrated data is k-anonymous. However, a data holder can always identify its own data records and remove them from the integrated data. The remaining data records are no longer k-anonymous. This remains an open issue for discussion.


Part IV

Anonymizing Complex Data


Chapter 13

Anonymizing Transaction Data

13.1 Introduction

So far, we have considered relational data where all records have a fixed set of attributes. In real-life scenarios, there is often a need to publish unstructured data. In this chapter, we examine anonymization techniques for one type of unstructured data, transaction data. Like relational data, transaction data D consists of a set of records, t1, . . . , tn. Unlike relational data, each record ti, called a transaction, is an arbitrary set of items drawn from a universe I. For example, a transaction can be a web query containing several query terms, a basket of purchased items in a shopping transaction, a click stream in an online session, or an email or a text document containing several text terms. Transaction data is a rich source for data mining [108]. Examples are association rule mining [18], user behavior prediction [5], recommender systems (www.amazon.com), information retrieval [53] and personalized web search [68], and many other web based applications [253].

13.1.1 Motivations

Detailed transaction data concerning individuals often contains sensitive personal information, and publishing such data could lead to serious privacy breaches. America Online (AOL) recently released a database of query logs to the public for research purposes [28]. For privacy protection, all explicit identifiers of searchers had been removed from the query logs. However, by examining the query terms contained in a query, searcher No. 4417749 was traced back to Thelma Arnold, a 62-year-old widow who lives in Lilburn. Essentially, the re-identification of the searcher was made possible by matching certain “background knowledge” about a searcher with query terms in a query. According to [108], this scandal led not only to the disclosure of private information of AOL users, but also to damage to data publishers' enthusiasm for offering anonymized data to researchers.

The retailer example. To have a closer look at how background knowledge is used in such attacks, let us consider a toy example. Suppose that a web-based retailer released online shopping data to a marketing company for customer behavior analysis. Albert, who works in the marketing company, learned that Jane, a colleague, purchased “printer,” “frame,” and “camera” from this web site some days ago. Such background knowledge could be acquired from an office conversation, for instance. Having access to the transaction data, Albert matched these items against all transaction records and surprisingly found only three matching transactions. Furthermore, out of these three transactions, two contain “adult toy.” Albert then concluded, with 67% certainty, that Jane bought “adult toy,” an item that Jane does not intend to reveal to anybody.

13.1.2 The Transaction Publishing Problem

The problem we consider here, called privacy-preserving transaction publishing, can be described as follows. We assume that a set of original transactions D can be collected by a publisher, where each transaction corresponds to an individual (a searcher, a patient, a customer, etc.). The publisher wants to produce and publish an anonymized version D′ of D, such that (i) an adversary with background knowledge on an individual cannot link the individual to his transaction in D′, or to a sensitive item in his transaction, with a high certainty, and (ii) D′ retains as much information of D as possible for research. A record linkage attack refers to linking an individual to a specific transaction, and an attribute linkage attack refers to linking an individual to a specific item. This problem has two emphases that distinguish it from previous works on privacy-preserving data mining.

First, it emphasizes publishing the data, instead of data mining results. There are several reasons why the researcher must receive the actual data, instead of data mining results. First, the researcher wants to have control over how to mine the data. Indeed, there are many ways that the data can be analyzed, and the researcher may want to try more than one of them. Second, data mining is exploratory in nature in that it is often hard to know exactly what patterns will be searched for in advance. Instead, the goal of the search is refined during the mining process as the data miner gains more and more knowledge about the data. In such cases, publishing the data is essential.

Another emphasis is dealing with the background knowledge of an adversary. Background knowledge, also called external knowledge, refers to knowledge that comes from a source other than the published data. In the above retailer example, from an office conversation Albert acquired the background knowledge that Jane purchased “printer,” “frame,” and “camera.” Typical background knowledge includes geographical and demographic information and other not so sensitive information, such as items purchased and query terms in a web query. Such knowledge may be acquired either from public sources (such as voter registration lists) or from close interaction with the target individual. Modeling such background knowledge and preventing it from being linked to a transaction is the key.


13.1.3 Previous Works on Privacy-Preserving Data Mining

Similar to the data publishing problem, privacy-preserving data mining considers hiding certain sensitive information in transactions. Unlike the data publishing problem, however, these works either do not consider publishing the data, or do not consider background knowledge for linking attacks. In this sense, these works are only loosely related to our data publishing problem; therefore, we only briefly review them. The details can be found in the given references.

Several early works consider publishing data mining results, instead of publishing data, e.g., [233, 206, 35, 22]. As we discussed above, we consider the scenario where publishing data is essential. Therefore, these works are not directly applicable to our problem. The synthetic data approach [241] advocates publishing synthetic data that has similar statistical characteristics but no real semantics associated with it. A drawback of this approach is that the lack of data semantics prevents the use of a human's domain knowledge to guide the search in the data mining process. There is a similar problem with the encryption approach [146]. Verykios et al. [233] consider hiding a set of sensitive association rules in a data set of transactions. The original transactions are altered by adding or removing items, in such a way that the sensitive rules do not have the minimum support or the minimum confidence. Note that this approach uses a small support as a means of protection. The exact opposite is true in our problem: a small support means that an individual will be linked to a small number of transactions, which is a privacy breach.

Randomization [84] is another approach to control breaches that arise from the published transactions. However, this approach does not attempt to control breaches that arise from background knowledge beyond the transactions. In addition, this approach considers the data collection scenario (instead of the data publishing scenario). In the data collection scenario, there is one server and many clients. Each client has a set of items. The clients want the server to gather statistical information about associations among items. However, the clients do not want the server to know with certainty who has got which items. When a client sends its set of items to the server, it modifies the set according to some specific randomization operators. The server then gathers statistical information from the modified sets of items (transactions) and recovers from it the actual associations. For example, the server wants to learn itemsets that occur frequently within transactions.

The intuition of randomization is that, in addition to replacing some of the items, we shall insert so many “false” items into a transaction that one is as likely to see a “false” itemset as a “true” one [84]. Take the select-a-size randomization operator as an example. This operator has a parameter 0 ≤ ρ ≤ 1 and probabilities p[0], p[1], . . . , p[m]. Given a transaction t of m items, the operator generates another transaction t′ in three steps:

1. The operator selects an integer j at random from the set {0, 1, . . . , m} so that P[j is chosen] = p[j];


2. It selects j items from t, uniformly at random (without replacement). These items, and no other items of t, are placed into t′;

3. It considers each item a ∉ t in turn and tosses a coin with probability ρ of “heads” and 1 − ρ of “tails.” All those items for which the coin faces “heads” are added to t′.
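The three steps above translate directly into code. The following Python sketch implements a select-a-size operator over a small item universe; the particular choices of ρ and the p[j] values are arbitrary and only for illustration.

import random

def select_a_size(t, universe, rho, p):
    """Randomize transaction t (a set of items drawn from 'universe').
    rho -- probability of inserting each item that is not in t
    p   -- probabilities p[0..m], m = len(t), of keeping exactly j original items
    """
    m = len(t)
    j = random.choices(range(m + 1), weights=p)[0]      # step 1: pick j with P = p[j]
    t_prime = set(random.sample(sorted(t), j))          # step 2: keep j original items
    for a in universe - t:                               # step 3: each item outside t
        if random.random() < rho:                        # is inserted with probability rho
            t_prime.add(a)
    return t_prime

universe = {"printer", "frame", "camera", "adult toy", "tv", "dvd"}
t = {"printer", "frame", "camera"}
print(select_a_size(t, universe, rho=0.3, p=[0.1, 0.2, 0.3, 0.4]))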

ρ determines the amount of new items added, and p[0], . . . , p[m] determine the amount of original items deleted. Given ρ, one heuristic is to set the p[j]'s so that as many original items as possible make it into the randomized transaction, i.e., to maximize ∑_{j=0}^{m} j × p[j]. In [83], the authors consider ρ1-to-ρ2 privacy: there is a ρ1-to-ρ2 privacy breach with respect to a property Q(t) if, for some randomized transaction t′, P[Q(t)] ≤ ρ1 and P[Q(t) | t′] ≥ ρ2, where 0 < ρ1 < ρ2 < 1. In other words, a privacy breach occurs if the posterior P[Q(t) | t′] has significantly increased compared to the prior P[Q(t)]. Evfimievski et al. [83] present a method for finding the p[j]'s that maximizes the utility objective ∑_{j=0}^{m} j × p[j] while eliminating all ρ1-to-ρ2 privacy breaches.

13.1.4 Challenges and Requirements

At first glance, the transaction publishing problem is similar to the publishing problem for relational data. However, the unstructured nature of transaction data presents some unique challenges in modeling and eliminating the attacks. Below is a summary of these challenges.

High dimensionality. Since each transaction is an arbitrary set of items, transaction data does not have a natural notion of “attributes,” and thus no notion of a quasi-identifier (QID) as found in relational data. Typically, the item universe I is very large and a transaction contains a small fraction of the items in I. In the above retailer example, I may have 10,000 items and a transaction typically contains a tiny fraction (say 1% or less) of all items. In the relational representation, the QID would contain one binary attribute for each item in I and thus has an extremely high dimensionality. Forming an equivalence class on this QID means suppressing or generalizing nearly all items [6].

Itemset based background knowledge. Another implication of the high dimensionality of transaction data is that it is hard to know in advance what sets of items might be used as background knowledge. In the above retailer example, the subset {printer, frame, camera} is used as background knowledge. In principle, any subset of items from I can be background knowledge, provided that it can be acquired by the adversary on an individual. On the other hand, a “realistic” adversary is frequently limited by the effort required to observe an item on an individual. For example, to find that Jane purchased “printer,” “frame,” and “camera” from a particular place, Albert has to conduct some investigation. This requires effort and time on the adversary's side and is not always successful.

Itemset based utility. Though it is often unknown exactly what data mining tasks the data will be used for, certain elements in the data are generally useful for a wide range of data mining tasks and thus should be preserved as much as possible. One type of such elements is frequent itemsets [18], i.e., the items that co-occur frequently in transactions. Co-occurrences of items potentially represent interesting relationships among items and thus are excellent candidates for many data mining applications, including association rule mining [18], classification or prediction [156], correlation analysis [37], emerging patterns [67], recommender systems (www.amazon.com), and many web based applications [253]. Existing utility metrics designed for relational data fail to capture such itemset based utilities because they measure information loss for each attribute independently.

Besides addressing the above challenges, several other considerations are important for ensuring the practical usefulness of the anonymized data.

Limitation on background knowledge. An adversary uses her background knowledge on the individual to extract relevant transactions. More background knowledge implies more accurate extraction. From the privacy perspective, it is always safer to assume an adversary with as much background knowledge as possible. However, this assumption renders the data less useful. In practice, a “realistic” adversary is often bounded by the amount of background knowledge that can be acquired. Therefore, it makes sense to consider a bounded adversary that is limited by the maximum number of items that can be acquired as background knowledge in an attack. An unbounded adversary does not have this restriction and can obtain background knowledge on any subset of items.

Truthfulness of results. When the original data is modified to satisfy the privacy constraint, it is essential that the analysis results obtained from the modified data hold on the original data. For example, suppose that in the original data all customers who bought cream and meat have also bought a pregnancy test with 100% certainty, but in the modified data only half of all customers who bought cream and meat have bought a pregnancy test. In this sense, the results derived from the modified data do not hold on the original data. Such results can be misleading and hard to use because the analyst cannot tell whether the 50% is the original certainty or the modified one.

Permitting inferences. The ℓ-diversity principle prevents linking attacks by enforcing that the certainty of sensitive inferences α → s (s being a sensitive item) is no more than 1/ℓ, where ℓ is typically 4-10. On the other hand, with all inferences being modified to have a certainty below 1/ℓ, the data becomes less useful, because in most cases, such as prediction and classification, it is the inferences with a high certainty that are most interesting. For a bounded adversary, it is possible to retain all inferences that require background knowledge beyond the power of a bounded adversary. For example, if a bounded adversary can only acquire background knowledge on no more than 10 items, then an inference α → s with α containing more than 10 items cannot be exploited by such adversaries. On the other hand, this inference can be used by the researcher for predicting the disease s.

Value exclusiveness. A largely ignored utility aspect is value exclusiveness, that is, the items retained in the modified data are exclusive of each other. This property has a significant impact on data usefulness in practice. Unfortunately, the local recoding transformation [149] does not have this property. Suppose that “Canada” and “USA” are two child items of “North America.” Local recoding would allow generalizing x out of y (x < y) occurrences of “Canada” and x′ out of y′ (x′ < y′) occurrences of “USA,” leaving y − x “Canada,” y′ − x′ “USA,” and x + x′ “North America” in the modified data. Now the published count y − x on “Canada” and y′ − x′ on “USA” are misleading about the original data. In fact, it is not possible to count the number of transactions containing “Canada” (or “USA”) from the modified data. Although local recoding has the flexibility of generalizing fewer occurrences of detailed items than global recoding, this example shows that the retained detailed items have little utility. In addition, most data mining algorithms assume value exclusiveness and rely on counting queries. Global recoding scores better in this aspect because it provides value exclusiveness, which enables standard algorithms on the anonymized data.

The rest of this chapter focuses on several recent works on privacy-preserving transaction publishing and query log publishing. The organization is as follows. Chapters 13.2 and 13.3 present two approaches to preventing attribute linkage attacks, i.e., the coherence approach [256, 255] and the band matrix approach [101]. Chapters 13.4 and 13.5 discuss two approaches to preventing record linkage attacks, i.e., the km-anonymity approach [220] and the transactional k-anonymity approach [112]. Chapter 13.6 studies the anonymization problem for query logs. Chapter 13.7 summarizes the chapter.

13.2 Cohesion Approach

Xu et al. [256] present an approach called coherence for eliminating both record linkage attacks and attribute linkage attacks. For any background knowledge on a subset of items, this approach guarantees some minimum number of transactions in the anonymized data such that (i) these transactions match the subset and (ii) no sensitive information in the matching transactions can be inferred with a high certainty. (i) ensures that no specific transaction can be linked and (ii) ensures that no specific sensitive information can be linked. In this Chapter 13.2, we examine this approach in more detail.

One assumption in this approach is that the items in I are classified into public items and private items. Public items correspond to potential information on which background knowledge may be acquired by an adversary. Private items correspond to sensitive information that should be protected. For specialized applications such as health care, the financial sector, and the insurance industry, well-defined guidelines for classifying public/private items often exist.

To launch an attack on a target individual who has a transaction in D, the adversary has the background knowledge that the transaction contains some public items denoted by β (called a public itemset below). An attack is described by β → e, where e is a private item the adversary tries to infer about the target individual. The adversary applies the background knowledge β to focus on the transactions that contain all the items in β. Sup(β), called the support of β, denotes the number of such transactions.

P(β → e) = Sup(β ∪ {e}) / Sup(β)

is the probability that a transaction contains e given that it contains β. The breach probability of β, denoted by Pbreach(β), is defined as the maximum P(β → e) over all private items e, i.e., max_e {P(β → e) | e is a private item}.

In the retailer example, the adversary has the background knowledge β = {“printer,” “frame,” “camera”}, and finds that, out of the three transactions that contain β, two contain “adult toy.” So Sup(β) = 3, Sup(β ∪ {adult toy}) = 2, and P(β → adult toy) = 2/3. The adversary then infers that Jane bought “adult toy” with probability P(β → adult toy) = 2/3 = 67%.

13.2.1 Coherence

Given the large size of I, a realistic adversary is limited by a maximum size |β| (the number of items in β) of the background knowledge β. An adversary has power p if he/she can only acquire background knowledge of up to p public items, i.e., |β| ≤ p. For such β, if Sup(β) < k, the adversary is able to link a target individual to a transaction with more than 1/k probability; if Pbreach(β) > h, the adversary is able to link a target individual to a private item with more than h probability. This motivates the following privacy notion.

DEFINITION 13.1 Coherence A public itemset β with |β| ≤ p and Sup(β) > 0 is called a mole with respect to (h, k, p) if either Sup(β) < k or Pbreach(β) > h. The data D is (h, k, p)-coherent if D contains no moles with respect to (h, k, p), that is, for all public itemsets β with |β| ≤ p and Sup(β) > 0, Sup(β) ≥ k and Pbreach(β) ≤ h.


Table 13.1: D, k = 2, p = 2, h = 0.8
  TID  Activities  Medical History
  T1   a,c,d,f,g   Diabetes
  T2   a,b,c,d     Hepatitis
  T3   b,d,f,x     Hepatitis
  T4   b,c,g,y,z   HIV
  T5   a,c,f,g     HIV

Sup(β) < k and Pbreach(β) > h correspond to record linkage attacks and attribute linkage attacks, respectively. A mole is any background knowledge (constrained by the maximum size p) that can lead to a linking attack. The goal of coherence is to eliminate all moles.

Example 13.1
Suppose that a healthcare provider publishes the data D in Table 13.1 for research on life styles and illnesses. “Activities” refer to the activities a person engages in (e.g., drinking, smoking) and are public. “Medical History” refers to the person's major illnesses and is considered private. Each person can have a number of activities and illnesses chosen from a universe I. Consider k = 2, p = 2, h = 80%. D violates (h, k, p)-coherence because only one transaction, T2, contains ab (we use ab for {a, b}). So an adversary acquiring the background knowledge ab can uniquely identify T2 as the transaction of the target individual. For the background knowledge bf, transactions T2 and T3 contain bf. However, both transactions also contain “Hepatitis.” Therefore, an adversary with the background knowledge bf can infer “Hepatitis” with 100% probability.
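These definitions can be computed directly. The Python sketch below encodes Table 13.1, computes Sup(β) and Pbreach(β), and enumerates all moles of size at most p by brute force; this is fine for a toy table but is not how the algorithms later in this chapter work at scale.

from itertools import combinations

# Table 13.1: (public activities, private medical history) per transaction.
TRANSACTIONS = [
    ({"a", "c", "d", "f", "g"}, {"Diabetes"}),
    ({"a", "b", "c", "d"},      {"Hepatitis"}),
    ({"b", "d", "f", "x"},      {"Hepatitis"}),
    ({"b", "c", "g", "y", "z"}, {"HIV"}),
    ({"a", "c", "f", "g"},      {"HIV"}),
]
PUBLIC = set().union(*(pub for pub, _ in TRANSACTIONS))
PRIVATE = set().union(*(priv for _, priv in TRANSACTIONS))

def sup(beta):
    """Sup(beta): number of transactions whose public part contains beta."""
    return sum(1 for pub, _ in TRANSACTIONS if beta <= pub)

def p_breach(beta):
    """Pbreach(beta) = max over private items e of Sup(beta ∪ {e}) / Sup(beta)."""
    matches = [priv for pub, priv in TRANSACTIONS if beta <= pub]
    return max(sum(1 for priv in matches if e in priv) / len(matches) for e in PRIVATE)

def moles(k, p, h):
    """All public itemsets beta with |beta| <= p and Sup(beta) > 0 that violate
    the (h, k, p)-coherence conditions."""
    found = []
    for size in range(1, p + 1):
        for beta in map(set, combinations(sorted(PUBLIC), size)):
            if sup(beta) > 0 and (sup(beta) < k or p_breach(beta) > h):
                found.append(beta)
    return found

print(moles(k=2, p=2, h=0.8))   # includes {'a','b'} and {'b','f'}, among others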

13.2.2 Item Suppression

A mole can be removed by suppressing any item in the mole. Global suppression of an item refers to deleting the item from all transactions containing it. Local suppression of an item refers to deleting the item from some of the transactions containing it. Global suppression has two nice properties. First, it guarantees to eliminate all moles containing the suppressed item. Second, it leaves any remaining itemset with a support equal to its support in the original data. The second property implies that any result derived from the modified data also holds on the original data. Local suppression does not have these properties. For this reason, we shall consider global suppression of items.

Example 13.2
Consider three transactions {a, b, HIV}, {a, b, d, f, HIV}, and {b, d, f, Diabetes}. The global suppression of a and f transforms these transactions into {b, HIV}, {b, d, HIV}, and {b, d, Diabetes}. After the suppression, the remaining itemsets bd and bdHIV have supports of 2 and 1, which are the same as in the original data. Instead, the local suppression of b from only the transaction {a, b, d, f, HIV} leaves the itemsets bd and bdHIV with supports of 1 and 0, which differ from their supports in the original data.
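A small sketch makes the contrast concrete: global suppression deletes the chosen items everywhere, so the support of every surviving itemset is exactly what it was in the original data.

def global_suppress(transactions, items_to_remove):
    """Delete every item in items_to_remove from every transaction."""
    return [t - items_to_remove for t in transactions]

def support(transactions, itemset):
    return sum(1 for t in transactions if itemset <= t)

D = [{"a", "b", "HIV"}, {"a", "b", "d", "f", "HIV"}, {"b", "d", "f", "Diabetes"}]
D2 = global_suppress(D, {"a", "f"})

# Supports of the surviving itemsets are unchanged by the global suppression.
assert support(D, {"b", "d"}) == support(D2, {"b", "d"}) == 2
assert support(D, {"b", "d", "HIV"}) == support(D2, {"b", "d", "HIV"}) == 1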

Let IL(e) denote the information loss caused by suppressing an item e. There are many metrics for information loss. A simple metric is a constant penalty associated with each item e. Another metric is the number of occurrences of e suppressed, which is equal to Sup(e) under global suppression. More sophisticatedly, IL(e) can be defined as the number of patterns, such as frequent itemsets, eliminated by the suppression of e. We will consider this metric in Chapter 13.2.4. We assume that private items are important and will not be suppressed. Suppose that D is transformed to D′ by suppressing zero or more public items. We define IL(D, D′) = ∑_e IL(e) to be the information loss of the transformation, where e ranges over the suppressed items.

DEFINITION 13.2 Let D′ be a transformation of D by suppressing public items. D′ is a (h, k, p)-cohesion of D if D′ is (h, k, p)-coherent. D′ is an optimal (h, k, p)-cohesion of D if D′ is a (h, k, p)-cohesion of D and, for every other (h, k, p)-cohesion D′′ of D, IL(D, D′′) ≥ IL(D, D′). The optimal cohesion problem is to find an optimal (h, k, p)-cohesion of D.

It can be shown that D has no (h, k, p)-cohesion if and only if the empty itemset is a mole with respect to (h, k, p). We assume that the empty set is not a mole. Xu et al. [256] show, by a reduction from the vertex cover problem, that the optimal cohesion problem is NP-hard, even for the special case of k = 2, p = 2, IL(e) = 1.

13.2.3 A Heuristic Suppression Algorithm

Since the optimal cohesion problem is NP-hard, a heuristic solution was proposed in [256]. The algorithm greedily suppresses the next item e with the maximum Score(e) until all moles are removed. The algorithm focuses on minimal moles, i.e., those moles that contain no proper subset that is a mole, as removing all such moles is sufficient for removing all moles. Let MM(e) denote the set of minimal moles containing the public item e and |MM(e)| denote the number of minimal moles in MM(e). The score

Score(e) = |MM(e)| / IL(e)    (13.1)

measures the number of minimal moles eliminated per unit of information loss. Algorithm 13.2.9 shows this greedy algorithm.


Algorithm 13.2.9 The Greedy Algorithm
1: suppress all size-1 moles from D;
2: while there are minimal moles in D do
3:   suppress a remaining public item e with the maximum Score(e) from D;
4: end while;
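Algorithm 13.2.9 can be prototyped in a few lines if the minimal moles are simply recomputed in every round, which is acceptable only for toy-sized data; Algorithm 13.2.10 and the MOLE-tree described below are what make the approach scale. The Python sketch below is self-contained and takes the same (public, private) transaction pairs as the earlier coherence sketch.

from itertools import combinations

def greedy_suppress(transactions, k, p, h, il=None):
    """Greedy item suppression in the spirit of Algorithm 13.2.9 (brute force).
    transactions -- list of (public_itemset, private_itemset) pairs
    il           -- information-loss function IL(e); defaults to Sup({e})
    """
    def sup(beta, data):
        return sum(1 for pub, _ in data if beta <= pub)

    def is_mole(beta, data):
        matching = [priv for pub, priv in data if beta <= pub]
        if not matching:
            return False
        if len(matching) < k:
            return True
        privates = set().union(*matching)
        return any(sum(1 for priv in matching if e in priv) / len(matching) > h
                   for e in privates)

    def minimal_moles(data, items):
        found = []
        for size in range(1, p + 1):
            for beta in map(frozenset, combinations(sorted(items), size)):
                if is_mole(beta, data) and not any(m < beta for m in found):
                    found.append(beta)
        return found

    data = list(transactions)
    items = set().union(*(pub for pub, _ in data))
    il = il or (lambda e: sup({e}, data))
    while True:
        mm = minimal_moles(data, items)
        if not mm:
            return data
        candidates = {x for mole in mm for x in mole}
        # suppress the item that removes the most minimal moles per unit of IL
        e = max(candidates, key=lambda x: sum(x in mole for mole in mm) / il(x))
        items.discard(e)
        data = [(pub - {e}, priv) for pub, priv in data]

# e.g. greedy_suppress(TRANSACTIONS, k=2, p=2, h=0.8) with the data of Table 13.1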

Two questions remain. The first question is how to tell whether there are minimal moles in D at line 2. The second question is how to identify the public item e with the maximum Score(e) at line 3. Recomputing all minimal moles in each iteration is not scalable because the number of minimal moles is large. Another approach is to find all minimal moles in a preprocessing step and maintain the minimal moles in each iteration. Below, we describe the details of finding minimal moles and maintaining minimal moles.

13.2.3.1 Finding Minimal Moles

A minimal mole contains no proper subset that is a mole. Thus, one strategy for finding all minimal moles is to examine i-itemsets in increasing size i until an itemset becomes a mole for the first time, at which time it must be a minimal mole. If an examined i-itemset β is not a mole, we then extend β by one more item; such a β is called an extendible non-mole. The benefit of this strategy is that all examined candidates of size i + 1 can be constructed from extendible non-moles of size i, so that we can limit the search to extendible non-moles in each iteration.

Let Mi denote the set of minimal moles of size i and let Fi denote the set of extendible non-moles of size i. For every β = ⟨e1, . . . , ei−1, ei, ei+1⟩ in Mi+1 or in Fi+1, by definition, no i-subset of β should be in Mi, and both ⟨e1, . . . , ei−1, ei⟩ and ⟨e1, . . . , ei−1, ei+1⟩ should be in Fi. In other words, each β = ⟨e1, . . . , ei−1, ei, ei+1⟩ in Mi+1 and Fi+1 can be constructed from a pair ⟨e1, . . . , ei−1, ei⟩ and ⟨e1, . . . , ei−1, ei+1⟩ in Fi. This computation is given in Algorithm 13.2.10.

This algorithm shares some similarity with Apriori [18] for mining frequent itemsets. Apriori exploits the subset property that every proper subset of a frequent itemset is a frequent itemset. A minimal mole has the property that every proper subset of a minimal mole (and of an extendible non-mole) is an extendible non-mole. To exploit this subset property, we have to construct Mi+1 and Fi+1 in parallel.
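The pairwise construction just described is the same join used by Apriori. A minimal Python sketch of the candidate-generation step (itemsets represented as sorted tuples) could look as follows; counting Sup and Pbreach over D, as done in Algorithm 13.2.10, is omitted here.

from itertools import combinations

def next_level(F_i):
    """Build size-(i+1) candidates from the extendible non-moles F_i by joining
    pairs that share the same (i-1)-item prefix, then prune any candidate that
    has an i-subset outside F_i."""
    F_i = sorted(F_i)
    candidates = set()
    for a, b in combinations(F_i, 2):
        if a[:-1] == b[:-1]:                     # same prefix e1 .. e(i-1)
            candidates.add(a + (b[-1],))
    F_set = set(F_i)
    return {c for c in candidates
            if all(c[:j] + c[j + 1:] in F_set for j in range(len(c)))}

F_2 = {("a", "c"), ("a", "d"), ("c", "d"), ("c", "g")}
print(next_level(F_2))    # -> {('a', 'c', 'd')}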

13.2.3.2 Maintaining Minimal Moles

After suppressing an item e in the current iteration of Algorithm 13.2.9, we shall maintain the set of remaining minimal moles, M∗, and Score(e′) for each remaining public item e′. Note that IL(e′), defined by Sup(e′), is unaffected by


Algorithm 13.2.10 Identifying Minimal Moles
1: find M1 and F1 in one scan of D;
2: while i < p and Fi is not empty do
3:   generate the candidate set Ci+1 from Fi as described above;
4:   scan D to count Sup(β) and Pbreach(β) for all β in Ci+1;
5:   for all β in Ci+1 do
6:     if Sup(β) < k or Pbreach(β) > h then
7:       add β to Mi+1;
8:     else
9:       add β to Fi+1;
10:    end if
11:  end for
12:  i++;
13: end while

the suppression of e. |MM(e′)| will be decreased by the number of minimal moles that contain e′e. To compute this number, we can store all minimal moles in M∗ in the following MOLE-tree. The MOLE-tree contains a root labeled "null" and a root-to-leaf path for each minimal mole in M∗. Each non-root node has three fields: label, the item at this node; mole-num, the number of minimal moles that pass through this node; and node-link, the link pointing to the next node with the same label. In addition, the Score table contains three fields for each remaining public item e: |MM(e)|, IL(e), and head-of-link(e), which points to the first node on the node-link for e. On suppressing e, all minimal moles containing e can be found and deleted by following the node-link for e. The next example illustrates this update.

Example 13.3
Figure 13.1 shows the MOLE-tree for the seven minimal moles M∗ = {db, da, dg, dc, ba, bg, bf}, where items are arranged in descending order of |MM(e)|. Assume IL(e) = Sup(e). The node ⟨b : 3⟩ indicates that 3 minimal moles pass through the node, i.e., ba, bg, bf. The entry ⟨b : 4, 3⟩ in the Score table indicates that |MM(b)| = 4 and IL(b) = 3. Since the item d has the maximum Score = |MM|/IL, we first suppress d by deleting all minimal moles passing through the (only) node for d. This is done by traversing the subtree at the node for d and decreasing |MM| for b, a, g, and c by 1, and decreasing |MM| for d by 4. Now |MM(d)| and |MM(c)| become 0, so the entries for d and c are deleted from the Score table. The new MOLE-tree and Score table are shown in Figure 13.2. Next, the item b has the maximum |MM|/IL and is suppressed. At this point, all remaining moles are deleted and the Score table becomes empty.


FIGURE 13.1: MOLE-tree, k = 2, p = 2, h = 80%. The root has two subtrees: d:4 with children b:1, a:1, g:1, c:1, and b:3 with children a:1, g:1, f:1. Score table (Item: |MM|, IL): d: 4,2; b: 4,3; a: 2,3; g: 2,3; c: 1,4; f: 1,4.

FIGURE 13.2: MOLE-tree after suppressing d, k = 2, p = 2, h = 80%. The root has a single subtree b:3 with children a:1, g:1, f:1. Score table (Item: |MM|, IL): b: 3,3; a: 1,3; g: 1,3; f: 1,4.

The above approach has been evaluated on a variety of data sets in [256]. The main findings are that, for dense data, it suffices to suppress a small number of low-support items and the distortion is low (usually less than 10%). For sparse data, the distortion is larger (usually 15%) because of the large number of moles to eliminate. Also, the study shows that more public items and a larger adversary power lead to more distortion.

13.2.4 Itemset-Based Utility

The aforementioned information loss IL(e) is measured by considering only the item e itself (e.g., Sup(e)). In some applications, certain co-occurrences of items are the source of utility. For example, frequent itemset mining [18] looks for the itemsets that have support no less than some threshold. If an item e occurs in no frequent itemset, suppressing the item incurs no information loss. On the other hand, if an item e occurs in many frequent itemsets, suppressing the item incurs a large information loss because all frequent itemsets containing e are removed from the data.

To model the above itemset based utility, Xu et al. [255] consider an itemset α to be a nugget with respect to (k′, p′) if |α| ≤ p′ and Sup(α) ≥ k′, where k′ and p′ are user-specified parameters. In other words, a nugget is an itemset that has a large support and a size bounded by a maximum length. Let N(D) denote the set of nuggets in D with respect to (k′, p′) and |N(D)| denote the number of nuggets in D. Note that, for any D′ obtained from D by global suppression of items, N(D′) ⊆ N(D).

The goal is to find a (h, k, p)-cohesion D′ with the maximum |N(D′)|. Let M(e) and N(e) denote the sets of moles and nuggets containing the item e. To retain nuggets, we can suppress a remaining public item e with the maximum Score(e) = |M(e)|/|N(e)|, which maximizes the number of moles eliminated per nugget lost. For each remaining public item e′, |M(e′)| and |N(e′)| will be decreased by the number of moles and nuggets containing ee′. In Chapter 13.2.3.2, we maintain all minimal moles because eliminating minimal moles is sufficient for eliminating all moles. This approach is not applicable to nuggets because we must compute the actual count |N(e)| for preserving nuggets. However, maintaining all nuggets is not an option because the number of nuggets grows exponentially.

Xu et al. [255] propose a border approach for updating |M(e′)| and |N(e′)|. The observation is that M(D) can be represented by a border [U, L]: U is the collection of minimal itemsets in M(D) (i.e., those that have no subset in M(D)) and L is the collection of maximal itemsets in M(D) (i.e., those that have no superset in M(D)). Similarly, N(D) can be represented by a border. Xu et al. [255] present an algorithm for maintaining these borders and a method for computing |M(e′)| and |N(e′)| using these borders. Since borders are much smaller than the full sets of itemsets that they represent, maintaining borders requires much less space and time.

13.2.5 Discussion

Terrovitis et al. [220] propose the notion of k^m-anonymity to prevent record linkage attacks: D satisfies k^m-anonymity if, for any itemset β with |β| ≤ m, at least k transactions in D contain all the items in β. We can show that k^m-anonymity is a special case of (h, k, p)-coherence. To model k^m-anonymity, we consider all the items in I as public items and a new item universe I′ = I ∪ {e}, where e is a new item and is the only private item. We add the new item e to every transaction in D. Now each attack under the coherence model has the form β → e, where β is an itemset with items drawn from I and e is the new item. By setting h = 100%, the condition Pbreach(β) ≤ h always holds and (h, k, p)-coherence degenerates into the requirement that, for all itemsets β drawn from I with |β| ≤ p and Sup(β) > 0, Sup(β) ≥ k. This requirement is exactly k^m-anonymity with m = p.


PROPOSITION 13.1
Let D and D′ be the transaction data defined above. D is k^m-anonymous if and only if D′ is (h, k, p)-coherent with h = 100% and p = m.

Another interesting property of (h, k, p)-coherence is that it permits data mining rules β → s for a private item s with |β| > p. In Table 13.1, for p = 1, k = 2, h = 80%, D is (h, k, p)-coherent, even though bf → Hepatitis has 100% probability. This means that a data recipient with the power p = 1 will be able to extract this rule for research purposes, but is not able to link this rule to an individual. This property is particularly interesting, as the research in [156] shows that accurate rules β → s usually involve a long antecedent β. A long antecedent sets up a high bar for the adversary under our model but enables data mining for a genuine researcher. In contrast, ℓ-diversity forces all rules β → s for a private item s to have no more than 1/ℓ certainty, where ℓ is usually 4 or larger, as required for privacy protection. Such rules are less useful for data mining because the certainty is too low.

In Chapter 13.2.2, we discussed that total item suppression retains the original support of the remaining itemsets. This property guarantees that rules extracted from the modified data hold on the original data. For example, the usefulness of the association rule bd → HIV depends on the confidence Sup(bdHIV)/Sup(bd). A small change in the support Sup(bdHIV) or Sup(bd) could lead to a significant difference in the confidence and, thus, to unexpected or even invalid data mining results. In Example 13.2, the global suppression of a and f leaves the rule bd → HIV with the same confidence as in the original data, i.e., 50%, but the local suppression yields a confidence of 0%, which does not hold on the original data. Unfortunately, this issue has not received much attention in the literature. For example, local recoding has a problem similar to local suppression and, thus, does not retain the original occurrence count of attribute/value combinations.

13.3 Band Matrix Method

Ghinita et al. [101] present another method, referred to below as the band matrix method, for preventing attribute linkage attacks. Given a set of transactions D containing items from I, this method determines a partitioning P of D into anonymized groups with "privacy degree" at least p, such that the "reconstruction error" of the resulting groups is minimized. Like the coherence approach, this method classifies the items in I into sensitive items (i.e., private items), S, and non-sensitive items (i.e., public items), Q = I − S. We call such items S-items and Q-items, respectively. A transaction containing no S-item is a non-sensitive transaction; otherwise, it is a sensitive transaction. We explain the


notion of privacy degree, the reconstruction error, and the algorithm for grouping transactions.

A transformation of the transaction data D has privacy degree p if the probability of associating any transaction t ∈ D with a particular sensitive item does not exceed 1/p. To enforce this privacy requirement, D is partitioned into disjoint sets of transactions, called anonymized groups. Like Anatomy [249], for each group G, the method publishes the exact Q-items, together with a summary of the frequencies of the S-items contained in G. Let f1, . . . , fm be the numbers of occurrences of the sensitive items s1, . . . , sm in G. Then group G offers privacy degree pG = min_{i=1..m} |G|/fi. The privacy degree of an entire partitioning P of D is the minimum pG over all groups G.
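As a small illustration of this definition (a sketch only; in practice the group summaries would be read from the published frequency counts rather than recomputed from the transactions):

from collections import Counter

def privacy_degree(groups, s_items):
    """Privacy degree of a partitioning: the minimum, over all groups G,
    of |G| / f_i for every sensitive item s_i occurring in G."""
    degrees = []
    for g in groups:                       # g is a list of transactions (sets of items)
        freq = Counter(i for t in g for i in t if i in s_items)
        if freq:                           # a group with no S-item imposes no constraint
            degrees.append(min(len(g) / f for f in freq.values()))
    return min(degrees) if degrees else float('inf')

# e.g., the two groups of Table 13.4 have degrees 2/1 = 2 and 3/1 = 3,
# so the partitioning offers privacy degree 2.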

13.3.1 Band Matrix Representation

Another criterion for grouping transactions is to preserve the correlation between items as much as possible. In other words, transactions that share many items in common should be assigned to the same group. To identify such transactions, D is represented as a band matrix [103, 194]. In a band matrix, rows correspond to transactions t and columns correspond to Q-items i, with the 0/1 value in each entry (t, i) indicating whether t contains i. A band matrix has the general form shown in Figure 13.3, where all entries of the matrix are 0 except for the main diagonal d0, a number of U upper diagonals (d1, . . . , dU), and L lower diagonals (d−1, . . . , d−L). The objective is to minimize the total bandwidth B = U + L + 1 by rearranging the order of rows and columns in the matrix, thus bringing transactions that share items in common close to each other. Finding an optimal band matrix, i.e., one with minimum B, is NP-complete. Multiple heuristics have been proposed to obtain band matrices with low bandwidth. The most prominent is the Reverse Cuthill-McKee algorithm, a variation of the Cuthill-McKee algorithm [54].
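SciPy ships an implementation of Reverse Cuthill-McKee for square sparse graphs. One way to apply it to the (generally non-square) transaction-by-item matrix is to run it on the symmetric bipartite adjacency matrix and split the resulting permutation into a row order and a column order; the sketch below takes this route and is only illustrative, not the exact procedure used in [101].

import numpy as np
from scipy.sparse import csr_matrix, bmat
from scipy.sparse.csgraph import reverse_cuthill_mckee

def band_reorder(A):
    """Reorder rows (transactions) and columns (Q-items) of the 0/1 matrix A
    so that the 1-entries concentrate around the main diagonal."""
    A = csr_matrix(A)
    n, _ = A.shape
    bipartite = bmat([[None, A], [A.T, None]], format='csr')   # symmetric block matrix
    perm = reverse_cuthill_mckee(bipartite, symmetric_mode=True)
    row_order = [p for p in perm if p < n]
    col_order = [p - n for p in perm if p >= n]
    return A[row_order, :][:, col_order], row_order, col_order

A = np.array([[1, 0, 0, 1],      # a toy transaction-by-item matrix
              [0, 1, 1, 0],
              [1, 0, 0, 1],
              [0, 1, 1, 0]])
B, rows, cols = band_reorder(A)
print(B.toarray())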

13.3.2 Constructing Anonymized Groups

Once the data is transformed into a band matrix with a narrow bandwidth, the next step is to create anonymized groups of transactions. To satisfy the privacy requirement, each sensitive transaction is grouped with non-sensitive transactions or with sensitive transactions that have different sensitive items. A greedy algorithm called Correlation-aware Anonymization of High-dimensional Data (CAHD) is presented in [101]. This algorithm adopts the "one-occurrence-per-group" heuristic, which allows only one occurrence of each sensitive item in a group. It is shown that if a solution to the anonymization problem with privacy degree p exists, then this heuristic will always find one.

CAHD works as follows: given a sensitive transaction t0, it forms a candidate list CL(t0) by including all the non-conflicting transactions in a window centered at t0. A transaction is non-conflicting with t0 if it is either non-sensitive or sensitive with a different S-item. The window contains 2αp − 1 non-conflicting transactions, where α is a system parameter.


FIGURE 13.3: A band matrix ([101] ©2008 IEEE)

Then, out of the transactions in CL(t0), the p − 1 transactions that have the largest number of Q-items in common with t0 are chosen to form the anonymized group with t0. The intuition is that the more transactions share the same Q-items, the smaller the reconstruction error is (see the discussion below). All selected transactions are then removed from D, and the process continues with the next sensitive transaction in the order.

It is important that, at any time, forming a group does not leave a remaining set of transactions that cannot be anonymized (for instance, one in which all remaining transactions share one common sensitive item). To ensure this, a histogram of the number of remaining occurrences of each sensitive item is maintained every time a new group is formed. If forming a group would leave a remaining set of transactions that violates the privacy requirement, the current group is rolled back and a new group formation is attempted starting from the next sensitive transaction in the sequence.
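The following simplified sketch of the group-formation step reflects the description above but leaves out the histogram bookkeeping and the rollback, and it only checks conflicts against t0 (not among the selected candidates themselves); the function and variable names are illustrative.

def cahd_group(transactions, idx, p, alpha=2):
    """Form one anonymized group around the sensitive transaction at position idx.

    transactions: list of (q_items, s_items) pairs in band-matrix row order.
    Returns the indices of the transactions chosen for the group."""
    t0_q, t0_s = transactions[idx]
    target = 2 * alpha * p - 1
    cl, offset = [], 1
    # candidate list: nearest non-conflicting transactions around idx
    while len(cl) < target and (idx - offset >= 0 or idx + offset < len(transactions)):
        for j in (idx - offset, idx + offset):
            if 0 <= j < len(transactions) and len(cl) < target:
                q_j, s_j = transactions[j]
                if not (s_j & t0_s):          # non-sensitive, or different S-items
                    cl.append(j)
        offset += 1
    # keep the p-1 candidates sharing the most Q-items with t0
    cl.sort(key=lambda j: len(transactions[j][0] & t0_q), reverse=True)
    return [idx] + cl[:p - 1]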

Example 13.4
Table 13.2 shows the original matrix with the sensitive items. Table 13.3 shows the matrix after the rearrangement of rows and columns, where the non-sensitive items are more closely clustered around the main diagonal. Table 13.4 shows two groups that have privacy degree 2. From the original data in Table 13.2, we can infer that all customers who bought cream but not meat have also bought a pregnancy test (with 100% certainty). From the anonymized data in Table 13.4, the adversary can only infer that half of such customers have bought a pregnancy test. This example is borrowed from [101].


Table 13.2: Example of band matrix: original data (customers Bob, David, Claire, Andrea, Ellen; Q-items Wine, Meat, Cream, Strawberries; sensitive items Pregnancy Test, Viagra)

Source: [101] ©2008 IEEE

Table 13.3: Example of band matrix: re-organized data (rows reordered to Bob, David, Ellen, Andrea, Claire, so that the non-sensitive items cluster around the main diagonal)

Source: [101] ©2008 IEEE

Table 13.4: Example of band matrix: anonymized data (Q-items Wine, Meat, Cream, Strawberries published exactly; group {Bob, David} with the sensitive-item summary "Viagra: 1"; group {Ellen, Andrea, Claire} with the summary "Pregnancy Test: 1")

Source: [101] ©2008 IEEE

13.3.3 Reconstruction Error

The information loss of the anonymized groups is measured by the reconstruction error, in terms of the KL-divergence of the distribution of S-items in the result of answering a query. A query has the form

SELECT COUNT(*)
FROM D
WHERE (Sensitive Item s is present) AND Cell

Cell is a condition on Q-items q1, . . . , qr of the form q1 = val1 ∧ · · · ∧ qr = valr. Such queries can be modeled using a probability distribution function (pdf) of


an S-item s over the space defined by all cells for q1, . . . , qr. The total number of such cells is 2^r, corresponding to all combinations of presence and absence of these Q-items in a transaction. The actual pdf value of an S-item s for a cell C is

Act^s_C = (occurrences of s in C) / (total occurrences of s in D)    (13.2)

In other words, Act^s_C is the proportion of the occurrences of s that fall in C. When the query is applied to a group G, the query result for the cell C on G

is estimated as follows. Denote by a the number of occurrences of S-item s in G, and by b the number of transactions in G that match the Q-item selection predicate Cell of the query. Then the estimated result of the query for G is a·b/|G|. Intuitively, b/|G| is the probability that a transaction in G matches the Q-item selection predicate Cell. For the estimated pdf Est^s_C, we replace the numerator "occurrences of s in C" of Equation 13.2 with a·b/|G| summed over all groups G that intersect C. The utility of the anonymized data is then measured as the distance between Act^s_C and Est^s_C over all cells C, measured by the KL-divergence:

Σ_C Act^s_C log(Act^s_C / Est^s_C).    (13.3)
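A brute-force computation of Equations 13.2 and 13.3 for one S-item could look like the sketch below (hypothetical helper; cells are encoded as presence/absence tuples over the chosen Q-items, and cells where Act or Est is zero are skipped):

from itertools import product
from math import log

def kl_reconstruction_error(D, groups, s, q_attrs):
    """KL-divergence between Act and Est for S-item s over the 2^r cells
    defined by the Q-items in q_attrs (Equation 13.3).

    D: the original transactions (sets of items); groups: the partitioning of D."""
    def cell_of(t):
        return tuple(q in t for q in q_attrs)   # presence/absence pattern of q_1..q_r

    total_s = sum(1 for t in D if s in t)
    err = 0.0
    for c in product([False, True], repeat=len(q_attrs)):
        act = sum(1 for t in D if s in t and cell_of(t) == c) / total_s
        est_num = 0.0
        for g in groups:
            a = sum(1 for t in g if s in t)            # occurrences of s in G
            b = sum(1 for t in g if cell_of(t) == c)   # transactions of G matching cell C
            est_num += a * b / len(g)
        est = est_num / total_s
        if act > 0 and est > 0:                        # 0·log 0 is treated as 0
            err += act * log(act / est)
    return err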

13.3.4 Discussion

Since the exact Q-items are published, the band matrix method cannot be used to prevent record linkage attacks. Indeed, in this case, each anonymized group is a set of original transactions, and publishing such groups leads to publishing the exact original database.

Like ℓ-diversity [160] and Anatomy [249], the anonymized data produced by the band matrix method could yield untruthful analysis results, i.e., results that do not hold on the original data. Example 13.4 illustrates this point. In the original data, all customers who bought cream but not meat have also bought a pregnancy test with 100% certainty, whereas in the anonymized data, only half of the customers who bought cream but not meat have bought a pregnancy test. The 50% certainty derived from the modified data does not hold on the original data. Since the analyst cannot tell whether the 50% certainty is the original certainty or a modified one, such analysis results are hard to use because they may or may not hold on the original data.

As dictated by the privacy degree p, all correlations involving S-items have a low certainty, i.e., no more than 1/p. Typically, p is 4 or larger. With the certainty being so low, such correlations are less useful for data mining. On the other hand, this method loses no information on correlations involving only Q-items because the exact Q-items are published.


13.4 k^m-Anonymization

Now we turn to two methods that focus on record linkage attacks (note that the coherence approach also handles record linkage attacks). We discuss the k^m-anonymity method proposed in [220], and in Chapter 13.5, we discuss the k-anonymity method for transaction data proposed in [112]. Both methods assume that any subset of items can be used as background knowledge and that an item taxonomy is available. Both thwart record linkage attacks by generalizing detailed items to general items according to the given item taxonomy. They differ mainly in the notion of privacy.

13.4.1 k^m-Anonymity

Like the coherence approach, the k^m-anonymity approach assumes that an adversary is limited by the maximum number m of items that can be acquired as background knowledge in an attack.

DEFINITION 13.3 k^m-anonymity A transaction database D is said to be k^m-anonymous if no adversary that has background knowledge of up to m items of a transaction t ∈ D can use these items to identify fewer than k transactions from D. The k^m-anonymization problem is to find a k^m-anonymous transformation D′ of D with minimum information loss.

In other words, k^m-anonymity means that, for any subset of up to m items, there are at least k transactions that contain all the items in the subset. As discussed in Chapter 13.2, this privacy notion coincides with the special case of (h, k, p)-coherence with h = 100% and p = m. In this case, a subset of items that causes a violation of k^m-anonymity is exactly a mole under the (h, k, p)-coherence model.

To transform D into a k^m-anonymous D′, Terrovitis et al. [220] employ an item taxonomy to generalize precise items to general items, in contrast to the suppression operation used for coherence. Terrovitis et al. [220] adopt a global recoding scheme: if any child node is generalized to its parent node, all its sibling nodes are also generalized to the parent node, and the generalization always applies to all transactions. Figure 13.4 shows a sample database and its transformation using the generalization rule {a1, a2} → A. Each possible transformation under the global recoding scheme corresponds to a possible horizontal cut of the taxonomy tree. The set of possible cuts also forms a hierarchy lattice, based on the generalization relationship among cuts. A set of possible cuts is shown in Figure 13.5 for the small taxonomy, and the hierarchy lattice for these cuts is shown in Figure 13.6.

The information loss of a cut can be measured in various ways.


FIGURE 13.4: Transformation using generalization rule {a1, a2} → A ([220])

For concreteness, consider the loss metric proposed in [123]. For each generalized item i, this metric charges an information loss that is proportional to (i) the percentage of leaf nodes under i in the item taxonomy and (ii) the number of occurrences of i in the generalized data. Therefore, if a cut c′ is more general than a cut c, both (i) and (ii) of c′ will be larger than those of c, so c′ has a higher cost (information loss) than c.
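As an illustration, a loss-metric-style cost of a generalized database can be computed as below; this is a sketch in the spirit of the metric just described, not the exact formula of [123].

def cut_information_loss(generalized_db, leaves_under, total_leaves):
    """Charge each occurrence of a generalized item i a cost proportional to the
    fraction of taxonomy leaves it covers: leaf items cost 0, the top item costs 1.

    generalized_db: list of generalized transactions (sets of items).
    leaves_under:   dict item -> number of leaf items covered by it (1 for a leaf)."""
    return sum((leaves_under[i] - 1) / (total_leaves - 1)
               for t in generalized_db for i in t)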

The set of possible cuts satisfies the following monotonicity property: if the hierarchy cut c results in a k^m-anonymous database, then every cut c′ that is more general than c also results in a k^m-anonymous database. Under the above cost metric for information loss, this monotonicity property implies that, as soon as we find a cut c that satisfies the k^m-anonymity constraint, we do not have to search among c's ancestors for a better cut.

13.4.2 Apriori Anonymization

Enumerating the entire cut lattice is not scalable. In addition, validating a cut requires checking whether any subset of up to m items causes a violation of the anonymity requirement, and the number of such subsets can be very large. A greedy algorithm called Apriori Anonymization (AA) is proposed in [220] to find a good cut. This algorithm is based on the apriori principle: if an i-itemset α causes an anonymity violation, so does each superset of α. Thus it explores the space of itemsets in an apriori, bottom-up fashion. It first identifies and eliminates the anonymity violations caused by (l − 1)-itemsets before checking l-itemsets, l = 2, . . . , m. By operating in such a bottom-up fashion, the number of itemsets that have to be checked at a higher level is greatly reduced, as detailed items may already have been generalized to more general ones. This algorithm is given in Algorithm 13.4.11.


FIGURE 13.5: Possible generalizations and cut hierarchy

Algorithm 13.4.11 Apriori-Based Anonymization AA(D, k, m)
1: initialize GenRules to the empty set;
2: for i := 1 to m do
3:   initialize a new count-tree;
4:   for all t ∈ D do
5:     extend t according to GenRules;
6:     add all i-subsets of the extended t to the count-tree;
7:   end for;
8:   run DA on the count-tree for m = i and update GenRules;
9: end for;

First, the algorithm initializes the set of generalization rules GenRules to the empty set. In the ith iteration, a count-tree data structure is constructed to keep track of all i-itemsets and their supports. This structure is very similar to the MOLE-tree used for counting moles in Chapter 13.2: each root-to-leaf path in the count-tree represents an i-itemset. Line 5 generalizes each transaction t in D according to the generalization rules in GenRules. For each generalized transaction t, line 6 adds all i-subsets of t to the new count-tree and increases the support of each i-subset. Line 8 identifies all i-subsets with a support below k and finds a set of generalization rules to eliminate all such i-subsets. This step runs Direct Anonymization (DA) on the count-tree, another heuristic algorithm proposed in [220] that operates directly on the i-itemsets violating the anonymity requirement.
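The count-tree only makes this counting step efficient; a brute-force version of the per-iteration check behind lines 6 and 8 could be written as follows (find_violations is a hypothetical helper, not code from [220]):

from collections import Counter
from itertools import combinations

def find_violations(D, k, i):
    """Return the i-itemsets with support below k in the (already generalized)
    database D, i.e., the itemsets that the DA step must eliminate."""
    support = Counter()
    for t in D:                                   # t is a set of items
        for subset in combinations(sorted(t), i):
            support[subset] += 1
    return [itemset for itemset, sup in support.items() if sup < k]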


FIGURE 13.6: Possible generalizations and cut hierarchy

After the mth iteration, all i-subsets of items with support below k, i ≤ m, have been eliminated. The final GenRules is the set of generalization rules that transforms D into a k^m-anonymous database.

The benefit of this algorithm is that it exploits the generalizations performed in iteration i to reduce the search space in iteration i + 1. This is because earlier generalizations reduce the size of the taxonomy, pruning detailed items, for later iterations. As the algorithm progresses to larger values of i, the effect of pruning detailed items increases because (i) more generalization rules are expected to be in GenRules and (ii) the total number of i-itemsets increases exponentially with i. Terrovitis et al. [220] also present Optimal Anonymization (OA), which explores all possible cuts in a bottom-up fashion and picks the best one. Although optimal, OA cannot be applied to realistic databases because of the very high computational cost of its exhaustive search.

13.4.3 Discussion

The k^m-anonymity approach handles only record linkage attacks, not attribute linkage attacks. For record linkage attacks, the privacy notion is the same as the notion of coherence when all items are public items, as shown in Proposition 13.1. Both models consider a bounded adversary whose background knowledge is limited by a maximum number of items. The main difference lies in the anonymization operator: the coherence approach uses total item suppression, whereas the k^m-anonymization approach uses global item generalization. Another difference is that the coherence approach also handles attribute linkage attacks, but the k^m-anonymity approach does not.

Item suppression and item generalization each have their own strengths and weaknesses. Item suppression is effective in dealing with "outliers" that cause violations of the privacy requirement, but may incur a large information loss if the data is sparse. In contrast, global generalization is vulnerable to "outlier" items because the generalization of such items causes the generalization of other items.


FIGURE 13.7: Outliers e and i cause generalization to top item T

We show this point in Figure 13.7. The dashed curve represents a cut through the taxonomy, which corresponds to a solution of the generalization approach. Suppose that the items e and i are contained in fewer than k transactions; then global generalization will cause all items to be generalized to the top level, even though the other items do not cause a violation. In contrast, suppressing the items e and i is all we need, and it incurs less information loss. In general, global generalization incurs a high information loss for a "shallow" and "wide" taxonomy because each generalization step tends to generalize many items. Moreover, the generalization approach is applicable only when an item taxonomy is available.

Fewer items will be generalized if local generalization is used instead of global generalization, because local generalization does not have to generalize all occurrences of an item and all of its sibling items. However, like total item suppression, global generalization preserves the truthfulness of analysis results, i.e., the results derived from the generalized data also hold on the original data. Partial generalization does not have this property, as discussed in Chapter 13.2.5. For example, suppose that five transactions contain both "cream" and "chicken," which causes a violation of 6^2-anonymity. Suppose that global generalization removes this violation by generalizing "cream" (and all its siblings) to "dairy product," and that, afterwards, ten transactions contain both "dairy product" and "chicken," so the violation is eliminated. The analysis on the generalized data remains truthful to the original data. For example, the count 10 of the itemset {"dairy product," "chicken"} indicates that ten transactions contain some item from the category "dairy product" and the item "chicken," which is true of the original data.


13.5 Transactional k-Anonymity

A key assumption made in (h, k, p)-coherence and k^m-anonymity is the maximum number of items that an adversary can acquire as background knowledge, which is p for (h, k, p)-coherence and m for k^m-anonymity. There are scenarios where it may not be possible to determine this bound in advance. If all background knowledge is limited to the presence of items, this case may be addressed by setting p or m to the maximum transaction length in the database, since any subset of items longer than this length does not match any transaction. However, if background knowledge may concern the absence of items, the adversary may exclude transactions using this knowledge and focus on fewer than k transactions. For example, an adversary may know that Alice has purchased "milk," "beer," and "diapers," but has not purchased "snow tires." Suppose that three transactions contain "milk," "beer," and "diapers," but only two of them contain "snow tires." Then the adversary could exclude the two transactions containing "snow tires" and link the remaining transaction to Alice. In this example, the privacy intended by k^m-anonymity with k = 2 and m = 4 is violated, even though m is the maximum transaction length.

13.5.1 k-Anonymity for Set Valued Data

One way to address these scenarios is to make each transaction indistinguishable from at least k − 1 other transactions. The following transactional k-anonymity, proposed in [112], is an adaptation to set-valued data of the k-anonymity originally proposed for relational data.

DEFINITION 13.4 Transactional k-anonymity A transaction database D is k-anonymous if every transaction in D occurs at least k times. A transaction database D′, with some instances of items generalized from D using a taxonomy, is a k-anonymization of D if D′ is k-anonymous.

Intuitively, a transaction database is k-anonymous if each transaction is identical to at least k − 1 others in the database. This partitions the transactions into equivalence classes, where all transactions in the same equivalence class are exactly identical. This model is different from the (h, k, p)-coherence and k^m-anonymity models, which state that, for any set of up to m (or p) items that occur in some transaction, there are at least k transactions that contain these items, but these transactions are not required to be identical to each other. Also, the k-anonymity model does not have the parameter m because it requires all transactions in the same equivalence class to be identical.

It follows that every database D that satisfies k-anonymity also satisfies k^m-anonymity for all m.


Table 13.5: Original database
Owner   TID   Items Purchased
Alice   T1    {a1, b1}
Bob     T2    {a2, b1, b2}
Chris   T3    {a1, a2, b2}
Dan     T4    {a1, a2, b1, b2}

Table 13.6: Generalized 2-anonymous database
Owner   TID   Items Purchased
Alice   T1    {A, B}
Bob     T2    {A, B}
Chris   T3    {a1, a2, B}
Dan     T4    {a1, a2, B}

Indeed, if any n-itemset, where n ≤ m, is contained in some transaction t ∈ D, the k-anonymity of D implies that there are at least k − 1 other transactions identical to t; thus, this n-itemset has a support of at least k, so D is k^m-anonymous. The next example from [112] shows that there exists a database D such that, for any m, D satisfies k^m-anonymity but not k-anonymity.

Example 13.5
The original database D in Table 13.5 is not 2-anonymous, but satisfies 2^2-anonymity. Assuming that the taxonomy in Figure 13.4 is used, Table 13.6 gives a possible 2-anonymization. To further illustrate that there exists a database that satisfies k^m-anonymity for any m but not k-anonymity, add to Table 13.5 an additional transaction T5 that is identical to T4. This new database is 2^m-anonymous for any m, given the existence of T4 and T5, but not 2-anonymous due to T1, T2, and T3.
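Transactional k-anonymity is straightforward to verify, since it only asks that identical transactions occur in sufficient multiplicity. A small sketch, using the data of Tables 13.5 and 13.6:

from collections import Counter

def is_k_anonymous(D, k):
    """True if every transaction in D occurs at least k times."""
    counts = Counter(frozenset(t) for t in D)
    return all(c >= k for c in counts.values())

original = [{'a1','b1'}, {'a2','b1','b2'}, {'a1','a2','b2'}, {'a1','a2','b1','b2'}]
generalized = [{'A','B'}, {'A','B'}, {'a1','a2','B'}, {'a1','a2','B'}]
print(is_k_anonymous(original, 2), is_k_anonymous(generalized, 2))   # False True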

He and Naughton [112] present a generalization approach to transform D into a k-anonymization, assuming that a taxonomy of items is available. If several items are generalized to the same item, only one occurrence of the generalized item is kept in the generalized transaction. Thus the information loss now comes in two forms: a detailed child item is replaced with a general parent item, and duplicate general items are eliminated. To see the second form, note that, in the above example, from a generalized transaction we know that the transaction originally contained some items from the A category, but we do not know how many such items the original transaction contained. As a result, count queries pertaining to the number of items in an original transaction, such as the average number of items per transaction, are not supported by this type of generalization. This second form of information loss is not measured by the usual information loss metrics for relational data, where no attribute value is eliminated by generalization.


Algorithm 13.5.12 Top-Down, Local Partitioning Algorithm
Anonymize(partition)
  if no further drill down is possible for partition then
    return and put partition in the global list of returned partitions;
  else
    expandNode ← pickNode(partition);
  end if
  for all transactions t in partition do
    distribute t to the proper subpartition induced by expandNode;
  end for
  merge small subpartitions;
  for all subpartitions do
    Anonymize(subpartition);
  end for

13.5.2 Top-Down Partitioning Anonymization

A greedy top-down partitioning algorithm, presented in [112], is described in Algorithm 13.5.12. This algorithm is essentially Mondrian [149] extended to transaction data. It starts with a single partition containing all transactions, with all items generalized to the topmost item ALL. Since duplicate items are removed, each transaction at this level contains the single item ALL. The algorithm then recursively splits a partition by specializing a node in the taxonomy for all the transactions in the partition. For each partition, there is a choice of which node to specialize; this choice is determined by the pickNode subroutine. All the transactions in the partition with the same specialized item are then distributed to the same subpartition. At the end of the distribution phase, small subpartitions with fewer than k transactions are merged into a special leftover subpartition and, if necessary, some transactions from subpartitions with more than k transactions are redistributed to the leftover subpartition to make sure that it contains at least k transactions. The partitioning stops when no further split can be made without violating k-anonymity. Since the specialization is determined independently for each partition, the recoding for generalization is local.

13.5.3 Discussion

Like the k^m-anonymity approach, the k-anonymity approach deals only with record linkage attacks. He and Naughton [112] show that k-anonymization has a small information loss. This is because local generalization is employed: the partitioning decision is made independently for each partition without forcing it on other partitions. This flexibility, however, produces anonymized data that does not have the value exclusiveness property discussed in Chapter 13.1.4. For example, it is possible to have two partitions such that one has the item "North America" and the other has the item "Canada."


The first partition is obtained by expanding a node other than "North America," and the second partition is produced by expanding the node for "North America." With the data containing both "North America" and "Canada" (in different partitions), it is not possible to count the number of transactions containing "Canada" because "North America" covers "Canada." On the other hand, the k^m-anonymity approach and the coherence approach preserve value exclusiveness, thanks to global recoding and total item suppression, respectively. In summary, which anonymization scheme retains more data utility depends on the actual use of the data and the data mining algorithms available.

13.6 Anonymizing Query Logs

As discussed in Chapter 13.1, query log anonymization is an important problem, and several recent works from the web community have studied it. In many cases, these studies identify the attacks arising from publishing query logs and motivate the need for a solution, but there is a lack of a formal problem statement and privacy guarantee. One solution to this problem is to model a query log as a transaction database where each query is a transaction and each query term is an item. Since each query is usually very short, it makes sense to merge all queries by the same user into one transaction. Each transaction then corresponds to a user, instead of a query, and such transactions can be used to analyze users' behavior. We can then anonymize query logs by applying the methods discussed previously. Below we discuss several works on query log anonymization from the web community.

13.6.1 Token-Based Hashing

The work in [146] is one of the first to consider attacks on query logs. It studies a natural anonymization technique called token-based hashing and shows that serious privacy leaks are possible under this technique. First, a query log is anonymized by tokenizing each query term and securely hashing each token to an identifier. It is assumed that an unanonymized "reference" query log has been released previously and is available to the adversary. The adversary first employs the reference query log to extract statistical properties of the query terms in the log file, and then processes the anonymized log to invert the hash function based on co-occurrences of tokens within queries. Their study shows that serious leaks are possible in token-based hashing even when the order of the underlying tokens is hidden. For example, suppose that the term "Tom" has a frequency of 23% in the unanonymized reference log. The adversary may find a small number of hashes in the anonymized log with a frequency similar to 23%. If "Tom" also occurs in the anonymized log, the


adversary can infer that one of these hashes corresponds to "Tom," thereby inverting the hashing with high probability. A more powerful attack can make use of combinations of several terms. Their study focuses on the accuracy of such attacks; no prevention solution is proposed.
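The core of the frequency-matching step can be sketched in a few lines (the names and the tolerance threshold are illustrative):

def candidate_inversions(hashed_freq, reference_freq, tolerance=0.02):
    """For each hashed token, list the reference terms whose relative frequency
    in the unanonymized reference log is close to the hash's frequency."""
    return {h: [term for term, f in reference_freq.items() if abs(f - fh) <= tolerance]
            for h, fh in hashed_freq.items()}

# a hash seen in 23% of the anonymized queries maps to reference terms near 23%
print(candidate_inversions({'0x9f3a': 0.23}, {'Tom': 0.23, 'jazz': 0.08}))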

13.6.2 Secret Sharing

A standard logging system for enforcing k-anonymity requires buffering all queries until there are k users issuing the same query, so the queries cannot be anonymized in real time. This means that the logs are held as unencrypted data for some period, which may be undesirable in certain scenarios. To solve this problem, a secret sharing scheme is proposed in [4]. The main idea is to split a query into k random shares and publish a new share for each distinct user issuing the same query. Once all k shares for the same query are present in the published log, which happens when at least k distinct users have issued the query, the query can be decoded by summing up all of its k shares. This technique ensures k-anonymity.

More specifically, a secret S corresponding to a query is split into k random shares using some predefined function. Each share is useless on its own, and all k shares are required to decode the secret. The scheme works essentially by generating k − 1 random numbers S1, . . . , Sk−1 in the range 0 to m − 1. A final share, Sk, is generated as:

Sk = S − Σ_{i=1}^{k−1} Si (mod m).    (13.4)

A query must appear k times (issued by k unique users) before S can be decoded using

S = Σ_{i=1}^{k} Si (mod m).    (13.5)

For each query qi, we need to keep track of the share Sji for each user j issuing the query. To achieve this, k hash buckets are created and the user ID is mapped to one of these buckets (e.g., UserID mod k). The bucket indicates which secret share is to be used for the present user uj, and the query qi is replaced with the appropriate share Sji and a query ID. Thus, the string for any given query qi by user uj is replaced with Sji, the share for user j for query i:

〈uj, qi〉 = 〈uj, H(qi), Sji〉.

Here H(qi) is a hash that identifies the query. If all the shares of a particular query are present in the log, they can be combined to recover the exact query. More precisely, to decode the query qi, we compute Σ Sji over the entries 〈uj, H(qi), Sji〉. At least k entries for the same H(qi) but different uj must be


present. To avoid the need to remember all the previously generated shares Sji

for a query qi and user uj, it is possible to seed the random number generator deterministically for qi and uj. A disadvantage of this scheme is that, initially, more than k unique users might be required to issue the query before it can be decoded, because more than one unique user might be hashed into the same bucket. One solution is to keep a distinct hash bucket for each of the first k users of each query. Once k users have issued the same query, the query is decoded; therefore, there is no need to retrieve shares for additional users of the same query.
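A toy version of the bucket-based scheme, seeded deterministically per query as suggested above, is sketched below. The modulus, the use of SHA-256 both as H(q) and to derive the secret S, and the function names are illustrative assumptions, not the exact construction of [4].

import hashlib
import random

M = 2**64   # modulus m for the share arithmetic (an arbitrary choice here)

def secret_of(query):
    """Encode the query string as the secret S in the range [0, M)."""
    return int.from_bytes(hashlib.sha256(query.encode()).digest()[:8], 'big')

def share_for(query, user_id, k):
    """Return the log entry (H(q), bucket, share) for one user issuing the query."""
    h = hashlib.sha256(query.encode()).hexdigest()
    rng = random.Random(h)                               # deterministic per-query seed
    shares = [rng.randrange(M) for _ in range(k - 1)]
    shares.append((secret_of(query) - sum(shares)) % M)  # Equation 13.4
    bucket = user_id % k                                 # which share this user reveals
    return h, bucket, shares[bucket]

def decode(entries, k):
    """Recover S from entries with the same H(q); needs k distinct buckets (users)."""
    by_bucket = {b: s for _, b, s in entries}
    if len(by_bucket) < k:
        return None                                      # fewer than k distinct users so far
    return sum(by_bucket.values()) % M                   # Equation 13.5

entries = [share_for("privacy preserving publishing", uid, k=3) for uid in (11, 7, 27)]
print(decode(entries, 3) == secret_of("privacy preserving publishing"))   # True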

An alternative suggested in the same paper is a threshold scheme, in which the secret is split into n pieces such that any k of those pieces can be combined to decode the secret. By choosing a sufficiently large n, the probability of obtaining k distinct shares can be made very high, thus reducing the likelihood of the above disadvantage.

13.6.3 Split Personality

This technique, also proposed by Adar [4], splits the log of each user on the basis of "interests" so that users become dissimilar to themselves, thus reducing the possibility of reconstructing a full user trace (i.e., the search history of a user) and of finding subsets of the data that can be used to identify the user. Using a similarity measure, a number of profiles are built for each user. Each user profile is then given a different user ID, so that two profiles cannot be linked to the same individual. For example, if a particular user is interested in tennis and music, two different profiles would be created for him, one containing the queries concerning tennis and the other containing the queries related to music. This reduces the probability that an adversary can find the complete search history of an individual, thus increasing privacy. However, the privacy gained by making it difficult for an adversary to relate multiple profiles to the same person is also a utility loss, because it limits us to one specific facet of an individual. A researcher who is interested in correlating different facets will not be able to use a data set encoded in this way.

13.6.4 Other Related Works

Finally, Jones et al. [129] use classifiers to map a sequence of queries into demographic information and show that candidate users can be identified using query logs. Korolova [142] proposes the concept of minimizing the increase in privacy risk that results from using search engines, by adding random noise and suppressing sensitive information. Instead of privacy for the users, Poblete et al. [188] deal with privacy for the web sites that are clicked through search engines and whose URLs are mentioned in the query logs. Xiong and Agichtein [253] put forward the need for query log publishing and the challenges that must be faced.


Table 13.7: Summary of approaches

                      coherence    band matrix   k^m-anonymity   k-anonymity
                      [255, 256]   [101]         [220]           [112]
Record linkage        Yes          No            Yes             Yes
Attribute linkage     Yes          Yes           No              No
Adversary type        bounded      unbounded     bounded         unbounded
Truthfulness          Yes          No            Yes             No
Inferences            Yes          No            N/A             N/A
Itemset utility       Yes          Yes           No              No
Value exclusiveness   Yes          Yes           Yes             No

13.7 Summary

We summarize the transaction anonymization works in Table 13.7. We characterize them according to several properties discussed in Chapter 13.1.4: the type of attacks addressed (record linkage or attribute linkage), the type of adversary (bounded or unbounded background knowledge), truthfulness with respect to the original data, whether inferences on sensitive items are permitted, whether itemset-based utility is modeled, and whether value exclusiveness is guaranteed. A detailed discussion of these properties can be found in the Discussion section for each of these approaches. Among the approaches in Table 13.7, the coherence approach is the only one that can be used to prevent both record linkage and attribute linkage attacks, and it is the only one that permits inferences on sensitive items. The coherence approach and the k^m-anonymity approach guarantee truthful analysis with respect to the original data.

Regarding information loss, the item suppression of the coherence approach deals better with "outliers," but may suffer a large information loss if the data is too sparse. The item generalization of the k^m-anonymization approach and of the transactional k-anonymity approach can work better if the data is sparse and the taxonomy is "slim" and "tall," but incurs a large information loss if the taxonomy is "short" and "wide." The local generalization of the transactional k-anonymity approach has a smaller information loss than global generalization; however, the anonymized data does not have value exclusiveness, a property assumed by most existing data mining algorithms. This means that either existing algorithms must be modified or new algorithms must be designed to analyze such data.

The works on query log anonymization focus mostly on features specific to query logs, such as query history, temporal information like query time, terms of certain types such as those carrying demographic information, "vanity" queries [4], the privacy of URLs mentioned in queries, and reducing privacy risk through search engines.


Unlike the works on transaction anonymization, this body of work lacks a formal notion of privacy and a formal problem formulation. The difficulty comes from factors such as the lack of a generally agreeable taxonomy, the extreme sparseness of the data, and the lack of practical utility notions. So far, the research from the two communities has taken different approaches, although the problems considered share much structural similarity. It would be interesting to see how these fields could benefit from taking an integrated approach.


Chapter 14

Anonymizing Trajectory Data

14.1 Introduction

Location-aware devices are used extensively in many network systems, such as mass transportation, car navigation, and healthcare management. The collected spatio-temporal data capture the detailed movement information of the tagged objects, offering tremendous opportunities for mining useful knowledge; yet, publishing the raw data for data mining would reveal specific sensitive information about the tagged objects or individuals. In this chapter, we study the privacy threats in trajectory data publishing, show that traditional anonymization methods are not applicable to trajectory data due to its challenging properties (it is high-dimensional, sparse, and sequential), and examine several trajectory data anonymization methods that address these challenges.

14.1.1 Motivations

In recent years, there has been an explosive growth of location-aware devices such as RFID tags, GPS-based devices, cell phones, and PDAs. The use of these devices facilitates new and exciting location-based applications that consequently generate a huge collection of trajectory data. Recent research reveals that these trajectory data can be used for various data analysis purposes to improve current systems, such as city traffic control, mobility management, urban planning, and location-based service advertisements. Clearly, publication of these trajectory data threatens individuals' privacy since the raw trajectory data provide location information that identifies individuals and, potentially, their sensitive information. Below, we present some real-life applications of publishing trajectory data.

• Transit company: Transit companies have started to use smart cards for passengers, such as the Oyster travel card in London. The company says that it does not associate journey data with named passengers, although it does provide such data to government agencies on request [222].

• Hospital: Some hospitals have adopted Radio Frequency IDentification (RFID) sensor systems to track the positions of patients, doctors, and medical equipment inside the hospital, with the goals of minimizing life-threatening medical errors and improving the management of patients and resources [240].


Analyzing trajectory data, however, is a non-trivial task. Hospitals often do not have the expertise to perform the analysis themselves but outsource this process and, therefore, need to grant a third party access to the patient-specific location and health data.

• LBS provider: Many companies provide location-based services (LBS) for mobile devices. With the help of triangulation and GPS devices, the location information of users can be precisely determined. Various data mining tasks can be performed on these trajectory data for different applications, such as traffic analysis and location-based advertisements. However, these trajectory data contain people's visited locations and thus reveal identifiable sensitive information such as social customs, religious preferences, and sexual preferences.

In this chapter, we study privacy threats in the data publishing phase and define a practical privacy model to accommodate the special challenges of anonymizing trajectory data. We illustrate an anonymization algorithm from [91, 170] that transforms the underlying raw data into a version that is immunized against privacy attacks but still useful for effective data mining tasks. Data "publishing" includes sharing the data with specific recipients and releasing the data for public download; the recipient could potentially be an adversary who attempts to associate sensitive information in the published data with a target victim.

14.1.2 Attack Models on Trajectory Data

We use an example to illustrate the privacy threats and challenges of publishing trajectory data.

Example 14.1
A hospital wants to release the patient-specific trajectory and health data (Table 14.1) to a data miner for research purposes. Each record contains a path and some patient-specific information, where the path is a sequence of pairs (loci ti) indicating that the patient visited location loci at time ti. For example, ID#2 has the path 〈f6 → c7 → e8〉, meaning that the patient visited locations f, c, and e at times 6, 7, and 8, respectively. Without loss of generality, we assume in this example that each record contains only one sensitive attribute, namely, the diagnosis. We address two types of privacy threats:

• Record linkage: If a path in the table is so specific that not many patients match it, releasing the trajectory data may lead to linking the victim's record and, therefore, her diagnosed disease. Suppose the adversary knows that the data record of a target victim, Alice, is in Table 14.1, and that Alice has visited b2 and d3.


Table 14.1: Raw trajectory and health data
ID   Path                          Diagnosis   ...
1    〈b2 → d3 → c4 → f6 → c7〉      AIDS        ...
2    〈f6 → c7 → e8〉                Flu         ...
3    〈d3 → c4 → f6 → e8〉           Fever       ...
4    〈b2 → c5 → c7 → e8〉           Flu         ...
5    〈d3 → c7 → e8〉                Fever       ...
6    〈c5 → f6 → e8〉                Diabetes    ...
7    〈b2 → f6 → c7 → e8〉           Diabetes    ...
8    〈b2 → c5 → f6 → c7〉           AIDS        ...

Table 14.2: Traditional 2-anonymous data
ID   Path                 Diagnosis   ...
1    〈f6 → c7〉            AIDS        ...
2    〈f6 → c7 → e8〉       Flu         ...
3    〈f6 → e8〉            Fever       ...
4    〈c7 → e8〉            Flu         ...
5    〈c7 → e8〉            Fever       ...
6    〈f6 → e8〉            Diabetes    ...
7    〈f6 → c7 → e8〉       Diabetes    ...
8    〈f6 → c7〉            AIDS        ...

Alice's record, together with her sensitive value (AIDS in this case), can then be uniquely identified, because ID#1 is the only record that contains both b2 and d3. Moreover, the adversary can also determine Alice's other visited locations, such as c4, f6, and c7.

• Attribute linkage: If a sensitive value occurs frequently together with some sequence of pairs, then the sensitive information can be inferred from that sequence even though the exact record of the victim cannot be identified. Suppose the adversary knows that Bob has visited b2 and f6. Since two out of the three records containing b2 and f6 (ID#1, 7, 8) have the sensitive value AIDS, the adversary can infer that Bob has AIDS with 2/3 = 67% confidence.

Many privacy models, such as k-anonymity [201, 217] and its extensions [162, 247, 250], have been proposed to thwart privacy threats caused by record and attribute linkages in the context of relational databases. These models are based on the notion of a quasi-identifier (QID), which is a set of attributes that may be used for linkages. The basic idea is to disorient potential linkages by generalizing the records into equivalence groups that share values on the QID. These privacy models are effective for anonymizing relational data, but they are not applicable to trajectory data due to two special challenges.


Table 14.3: Anonymous data with L = 2, K = 2, C = 50%
ID   Path                 Diagnosis   ...
1    〈d3 → f6 → c7〉       AIDS        ...
2    〈f6 → c7 → e8〉       Flu         ...
3    〈d3 → f6 → e8〉       Fever       ...
4    〈c5 → c7 → e8〉       Flu         ...
5    〈d3 → c7 → e8〉       Fever       ...
6    〈c5 → f6 → e8〉       Diabetes    ...
7    〈f6 → c7 → e8〉       Diabetes    ...
8    〈c5 → f6 → c7〉       AIDS        ...

1. High dimensionality: Consider a hospital with 200 rooms functioning 24 hours a day. There are 200 × 24 = 4800 possible combinations (dimensions) of locations and timestamps. Each dimension could be a potential QID attribute used for record and attribute linkages. Traditional k-anonymity would require every path to be shared by at least k records. Due to the curse of high dimensionality [6], most of the data have to be suppressed in order to achieve k-anonymity. For example, to achieve 2-anonymity on the path data in Table 14.1, all instances of {b2, d3, c4, c5} have to be suppressed, as shown in Table 14.2, even though k is small.

2. Data sparseness: Consider the patients in a hospital or the passengers in a public transit system. They usually visit only a few locations compared to all available locations, so each trajectory path is relatively short. Anonymizing these short, little-overlapping paths in a high-dimensional space poses a significant challenge for traditional anonymization techniques because it is difficult to identify and group the paths together. Enforcing traditional k-anonymity on such high-dimensional and sparse data would render the data useless.

14.2 LKC-Privacy

Traditional k-anonymity and its extended privacy models assume that an adversary could potentially use any or even all of the QID attributes as background knowledge to perform record or attribute linkages. However, in real-life privacy attacks, it is very difficult for an adversary to acquire all the visited locations and timestamps of a victim, because it requires non-trivial effort to gather each piece of background knowledge from so many possible locations at different times. Thus, it is reasonable to assume that the


adversary's background knowledge is bounded by at most L pairs (loci ti) that the victim has visited.

Based on this assumption, we can employ the LKC-privacy model studied in Chapter 6 for anonymizing high-dimensional and sparse spatio-temporal data. The general intuition is to ensure that every sequence q of maximum length L in any path of a data table T is shared by at least K records in T, and that the confidence of inferring any sensitive value in S from q is not greater than C, where L and K are positive integer thresholds, C is a positive real number threshold, and S is a set of sensitive values specified by the data holder. LKC-privacy bounds the probability of a successful record linkage to be ≤ 1/K and the probability of a successful attribute linkage to be ≤ C. Table 14.3 shows an example of an anonymous table that satisfies (2, 2, 50%)-privacy, obtained by suppressing b2 and c4 from Table 14.1. Every possible sequence q of maximum length 2 in Table 14.3 is shared by at least 2 records, and the confidence of inferring the sensitive value AIDS from q is not greater than 50%.

While protecting privacy is a critical element in data publishing, it is equally important to preserve the utility of the published data, because this is the primary reason for publication. In this chapter, we aim at preserving the maximal frequent sequences (MFS), because the MFS often serve as the information basis for different primitive data mining tasks on sequential data. MFS are useful for trajectory pattern mining [102] and workflow mining [105], and they also capture the major paths of moving objects in the trajectory data [30].

In Chapter 14.2.1, we explain the special challenges of anonymizing high-dimensional, sparse, and sequential trajectory data. In Chapter 14.2.2, we present an efficient anonymization algorithm to achieve LKC-privacy on trajectory data while preserving the maximal frequent sequences in the anonymous trajectory data.

14.2.1 Trajectory Anonymity for Maximal Frequent Sequences

We first describe the trajectory database and then formally define the privacy and utility requirements.

14.2.1.1 Trajectory Data

A trajectory data table T is a collection of records in the form

〈(loc1 t1) → · · · → (locn tn)〉 : s1, . . . , sp : d1, . . . , dm,

where 〈(loc1 t1) → · · · → (locn tn)〉 is the path, si ∈ Si are the sensitive values, and di ∈ Di are the quasi-identifying (QID) values of an object. A pair (loci ti) represents the visited location loci of an object at time ti. An object may revisit the same location at different times. At any time, an object can appear


at only one location, so 〈a1 → b1〉 is not a valid sequence, and the timestamps in a path increase monotonically.

The sensitive and QID values are the object-specific data in the form of relational data. Record and attribute linkages via the QID attributes can be avoided by applying existing anonymization methods for relational data [96, 148, 151, 162, 237]. In this chapter, we focus on eliminating record and attribute linkages via trajectory data as illustrated in Example 14.1.

14.2.1.2 Privacy Model

Suppose a data holder wants to publish the trajectory data table T (e.g., Table 14.1) to recipients for data mining. Explicit identifiers, e.g., name and SSN, are removed. (Note, we keep the ID in our examples for discussion purposes only.)

We assume that the adversary knows at most L pairs of location and timestamp that V has previously visited. We use q to denote such an a priori known sequence of pairs, where |q| ≤ L. T(q) denotes the group of records that contain q. A record in T contains q if q is a subsequence of the path in the record. For example, in Table 14.1, ID#1, 2, 7, 8 contain q = 〈f6 → c7〉, written as T(q) = {ID#1, 2, 7, 8}. Based on background knowledge q, the adversary could launch record and attribute linkage attacks. To thwart the record and attribute linkages, we require that every sequence with a maximum length L in the trajectory data table has to be shared by at least a certain number of records, and the ratio of sensitive value(s) in every group cannot be too high. The presented privacy model, LKC-privacy [91, 170], reflects this intuition.

DEFINITION 14.1 LKC-privacy  Let L be the maximum length of the background knowledge. Let S be a set of sensitive values. A trajectory data table T satisfies LKC-privacy if and only if for any sequence q with |q| ≤ L,

1. |T (q)| ≥ K, where K > 0 is an integer anonymity threshold, and

2. P(s|q) ≤ C for any s ∈ S, where 0 < C ≤ 1 is a real number confidence threshold.
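For concreteness, the two conditions can be checked by brute force. The following Python sketch (illustrative helper names only, and a naive enumeration that is exponential in L, so suitable only for small examples) tests whether a table of (path, sensitive value) records satisfies a given LKC-privacy requirement; the is_subsequence helper is reused in later sketches.

from itertools import combinations

def is_subsequence(q, path):
    # q is contained in path if its pairs appear in path in the same order
    it = iter(path)
    return all(pair in it for pair in q)

def satisfies_lkc(records, L, K, C, sensitive):
    # records: list of (path, diag) where path is a tuple of (loc, t) pairs
    candidates = set()
    for path, _ in records:
        for n in range(1, L + 1):
            candidates.update(combinations(path, n))   # all subsequences with |q| <= L
    for q in candidates:
        group = [diag for path, diag in records if is_subsequence(q, path)]
        if len(group) < K:                             # condition 1: |T(q)| >= K
            return False
        for s in sensitive:                            # condition 2: P(s|q) <= C
            if group.count(s) / len(group) > C:
                return False
    return True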

LKC-privacy has several nice properties that make it suitable for anonymizing high-dimensional sparse trajectory data. First, it only requires subsequences of a path to be shared by at least K records. This is a major relaxation from traditional k-anonymity, based on a very reasonable assumption that the adversary has limited power. Second, LKC-privacy generalizes several traditional privacy models, such as k-anonymity (Chapter 2.1.1), confidence bounding (Chapter 2.2.2), (α, k)-anonymity (Chapter 2.2.5), and ℓ-diversity (Chapter 2.2.1). Refer to Chapter 6.2 for a discussion on the generalization of these privacy models. Third, it is flexible to adjust the trade-off between data privacy and data utility, and between an adversary’s power and data utility.


Increasing L and K, or decreasing C, would improve the privacy at the expense of data utility. Finally, LKC-privacy is a general privacy model that thwarts both identity linkage and attribute linkage, i.e., the privacy model is applicable to anonymize trajectory data with or without sensitive attributes.

In the attack models studied in this chapter, we assume that the adversary can acquire both the time and locations of a target victim as background knowledge. In a real-life attack, it is possible that the adversary’s background knowledge q′ contains only the location loc_i or only the timestamp t_i. This type of attack is obviously weaker than the attack based on background knowledge q containing (loc_i, t_i) because the identified group |T(q′)| ≥ |T(q)|. Thus, an LKC-privacy preserved table that can thwart linkages on q can also thwart linkages on q′.

14.2.1.3 Utility Measure

The measure of data utility varies depending on the data mining task to be performed on the published data. In this chapter, we aim at preserving the maximal frequent sequences. A sequence q = 〈(loc_1, t_1) → . . . → (loc_n, t_n)〉 is an ordered set of locations. A sequence q is frequent in a trajectory data table T if |T(q)| ≥ K′, where T(q) is the set of records containing q and K′ is a minimum support threshold. Frequent sequences (FS) capture the major paths of the moving objects [30], and often form the information basis for different primitive data mining tasks on sequential data, e.g., association rule mining. In the context of trajectories, association rules can be used to determine the subsequent locations of the moving object given the previously visited locations. This knowledge is important for traffic analysis.

There is no doubt that FS are useful. Yet, mining all FS is a computationally expensive operation. When the data volume is large and FS are long, it is infeasible to identify all FS because all subsequences of an FS are also frequent. Since trajectory data is high-dimensional and in large volume, a more feasible solution is to preserve only the maximal frequent sequences (MFS).

DEFINITION 14.2 Maximal frequent sequence  For a given minimum support threshold K′ > 0, a sequence x is maximal frequent in a trajectory data table T if x is frequent and no super sequence of x is frequent in T.

The set of MFS in T, denoted by U(T), is much smaller than the set of FS in T given the same K′. MFS still contain the essential information for different kinds of data analysis [155]. For example, MFS capture the longest frequently visited paths. Any subsequence of an MFS is also an FS. Once all the MFS have been determined, the support counts of any particular FS can be computed by scanning the data table once. The data utility goal is to preserve as many MFS as possible, i.e., maximize |U(T)|, in the anonymous trajectory data table.
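As a small illustration of the last two points, the support count of any particular sequence can be obtained with a single scan of the table, and its frequency can be decided against U(T) without rescanning the data (a sketch reusing the is_subsequence helper above; function names are illustrative):

def support(q, records):
    # |T(q)|: number of records whose path contains q
    return sum(1 for path, _ in records if is_subsequence(q, path))

def is_frequent_via_mfs(q, mfs_set):
    # any subsequence of an MFS is frequent, so q is frequent iff some MFS contains it
    return any(is_subsequence(q, m) for m in mfs_set)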


One frequently raised question is: why do we want to publish sensitive attributes at all when the goal is to preserve maximal frequent sequences? It is because some applications may require publishing the sensitive attributes for data mining purposes, such as finding association rules between frequent sequences and sensitive attributes. However, if there is no such data mining purpose, the sensitive attributes should be removed. LKC-privacy, together with the anonymization algorithm presented in Chapter 14.2.2, is flexible enough to handle trajectory data with or without sensitive attributes.

14.2.1.4 Problem Statement

LKC-privacy can be achieved by performing a sequence of suppressions on selected pairs from T. In this chapter, we employ global suppression, meaning that if a pair p is chosen to be suppressed, all instances of p in T are suppressed. For example, Table 14.3 is the result of suppressing b2 and c4 from Table 14.1. Global suppression offers several advantages over generalization and local suppression. First, suppression does not require a predefined taxonomy tree for generalization, which often is unavailable in real-life databases. Second, trajectory data could be extremely sparse. Enforcing global generalization on trajectory data will result in generalizing many sibling location or time values even if there is only a small number of outlier pairs, such as c4 in Table 14.1. Suppression offers the flexibility of removing those outliers without affecting the rest of the data. Note, we do not intend to claim that global suppression is always better than other schemes. For example, LeFevre et al. [148] present some local generalization schemes that may result in less data loss depending on the utility measure. Third, global suppression retains exactly the same support counts of the preserved MFS in the anonymous trajectory data as there were in the raw data. In contrast, a local suppression scheme may delete some instances of the chosen pair and, therefore, change the support counts of the preserved MFS. The property of data truthfulness is vital in some data analysis, such as traffic analysis.

DEFINITION 14.3 Trajectory anonymity for MFS  Given a trajectory data table T, a LKC-privacy requirement, a minimum support threshold K′, and a set of sensitive values S, the problem of trajectory anonymity for maximal frequent sequences (MFS) is to identify a transformed version of T that satisfies the LKC-privacy requirement while preserving the maximum number of MFS with respect to K′.

Theorem 14.2, given in Chapter 14.2.2.3, proves that finding an optimum solution for LKC-privacy is NP-hard. Thus, we propose a greedy algorithm to efficiently identify a reasonably “good” sub-optimal solution.


14.2.2 Anonymization Algorithm for LKC-Privacy

Given a trajectory data table T, the first step is to identify all sequences that violate the given LKC-privacy requirement. Chapter 14.2.2.1 describes a method to identify violating sequences efficiently. Chapter 14.2.2.2 presents a greedy algorithm to eliminate the violating sequences with the goal of preserving as many maximal frequent sequences as possible.

14.2.2.1 Identifying Violating Sequences

An adversary may use any sequence with length not greater than L as background knowledge to launch a linkage attack. Thus, any non-empty sequence q with |q| ≤ L in T is a violating sequence if its group T(q) does not satisfy condition 1, condition 2, or both in LKC-privacy in Definition 14.1.

DEFINITION 14.4 Violating sequence  Let q be a sequence of a path in T with |q| ≤ L. q is a violating sequence with respect to a LKC-privacy requirement if (1) q is non-empty, and (2) |T(q)| < K or P(s|q) > C for any sensitive value s ∈ S.

Example 14.2
Let L = 2, K = 2, C = 50%, and S = {AIDS}. In Table 14.1, a sequence q1 = 〈b2 → c4〉 is a violating sequence because |T(q1)| = 1 < K. A sequence q2 = 〈b2 → f6〉 is a violating sequence because P(AIDS|q2) = 67% > C. However, a sequence q3 = 〈b2 → c5 → f6 → c7〉 is not a violating sequence even though |T(q3)| = 1 < K and P(AIDS|q3) = 100% > C, because |q3| > L.

A trajectory data table satisfies a given LKC-privacy requirement if all violating sequences with respect to the privacy requirement are removed, because all possible channels for record and attribute linkages are eliminated. A naive approach is to first enumerate all possible violating sequences and then remove them. This approach is infeasible because of the huge number of violating sequences. Consider a violating sequence q with |T(q)| < K. Any super sequence of q, denoted by q″, in the data table T is also a violating sequence because |T(q″)| ≤ |T(q)| < K.

Another inefficient and incorrect approach to achieve LKC-privacy is to ignore the sequences with size less than L and assume that if a table T satisfies LKC-privacy, then T satisfies L′KC-privacy where L′ < L. Unfortunately, this monotonic property with respect to L does not hold in LKC-privacy.

THEOREM 14.1
LKC-privacy is not monotonic with respect to the adversary’s knowledge L.


PROOF  To prove that LKC-privacy is not monotonic with respect to L, it is sufficient to prove that one of the conditions of LKC-privacy in Definition 14.1 is not monotonic.

Condition 1: Anonymity threshold K is monotonic with respect to L. If L′ ≤ L and C = 100%, a data table T satisfying LKC-privacy must satisfy L′KC-privacy because |T(q′)| ≥ |T(q)| ≥ K, where q′ is a subsequence of q.

Condition 2: Confidence threshold C is not monotonic with respect to L. If q is a non-violating sequence with P(s|q) ≤ C and |T(q)| ≥ K, its subsequence q′ may or may not be a non-violating sequence. We use a counter example to show that P(s|q′) ≤ P(s|q) ≤ C does not always hold. In Table 14.4, the sequence q = 〈a1 → b2 → c3〉 satisfies P(AIDS|q) = 50% ≤ C. However, its subsequence q′ = 〈a1 → b2〉 does not satisfy the condition because P(AIDS|q′) = 100% > C.

To satisfy condition 2 in Definition 14.1, it is insufficient to ensure that every sequence q with length exactly L in T satisfies P(s|q) ≤ C. Instead, we need to ensure that every sequence q with length not greater than L in T satisfies P(s|q) ≤ C. To overcome this bottleneck of violating sequence enumeration, the insight is that there exist some “minimal” violating sequences among the violating sequences, and it is sufficient to achieve LKC-privacy by removing only the minimal violating sequences.

DEFINITION 14.5 Minimal violating sequence  A violating sequence q is a minimal violating sequence (MVS) if every proper subsequence of q is not a violating sequence.

Example 14.3
In Table 14.1, given L = 3, K = 2, C = 50%, S = {AIDS}, the sequence q = 〈b2 → d3〉 is a MVS because 〈b2〉 and 〈d3〉 are not violating sequences. The sequence q = 〈b2 → d3 → c4〉 is a violating sequence but not a MVS because its subsequence 〈b2 → d3〉 is a violating sequence.

Every violating sequence is either a MVS or it contains a MVS. Thus, if T contains no MVS, then T contains no violating sequences.

Observation 14.2.1  A trajectory data table T satisfies LKC-privacy if and only if T contains no MVS.

Table 14.4: Counter example for monotonic property

ID   Path                  Diagnosis   ...
1    〈a1 → b2〉             AIDS        ...
2    〈a1 → b2 → c3〉        AIDS        ...
3    〈a1 → b2 → c3〉        Fever       ...


Algorithm 14.2.13 MVS Generator
Input: Raw trajectory data table T
Input: Thresholds L, K, and C
Input: Sensitive values S
Output: Minimal violating sequences V(T)
 1: X1 ← set of all distinct pairs in T;
 2: i = 1;
 3: while i ≤ L and Xi ≠ ∅ do
 4:   Scan T to compute |T(q)| and P(s|q), for ∀q ∈ Xi, ∀s ∈ S;
 5:   for ∀q ∈ Xi where |T(q)| > 0 do
 6:     if |T(q)| < K or P(s|q) > C then
 7:       Add q to Vi;
 8:     else
 9:       Add q to Wi;
10:     end if
11:   end for
12:   Xi+1 ← Wi ⋈ Wi;
13:   for ∀q ∈ Xi+1 do
14:     if q is a super sequence of any v ∈ Vi then
15:       Remove q from Xi+1;
16:     end if
17:   end for
18:   i++;
19: end while
20: return V(T) = V1 ∪ · · · ∪ Vi−1;

Next, we present an algorithm to efficiently identify all MVS in T with respect to a LKC-privacy requirement. Based on Definition 14.5, we generate all MVS of size i + 1, denoted by Vi+1, by incrementally extending a non-violating sequence of size i, denoted by Wi, with an additional pair.

Algorithm 14.2.13 presents a method to efficiently generate all MVS. Line 1 puts all the size-1 sequences, i.e., all distinct pairs, as candidates X1 of MVS. Line 4 scans T once to compute |T(q)| and P(s|q) for each sequence q ∈ Xi and for each sensitive value s ∈ S. If the sequence q violates the LKC-privacy requirement in Line 6, then we add q to the MVS set Vi (Line 7); otherwise, we add q to the non-violating sequence set Wi (Line 9) for generating the next candidate set Xi+1, which is a self-join of Wi (Line 12). Two sequences qx = 〈(loc^x_1, t^x_1) → . . . → (loc^x_i, t^x_i)〉 and qy = 〈(loc^y_1, t^y_1) → . . . → (loc^y_i, t^y_i)〉 in Wi can be joined only if the first i−1 pairs of qx and qy are identical and t^x_i < t^y_i. The joined sequence is 〈(loc^x_1, t^x_1) → . . . → (loc^x_i, t^x_i) → (loc^y_i, t^y_i)〉. Lines 13-17 remove a candidate q from Xi+1 if q is a super sequence of any sequence in Vi because any proper subsequence of a MVS cannot be a violating sequence. The set of MVS, denoted by V(T), is the union of all Vi.
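A compact Python rendering of this level-wise generation is sketched below. It follows the structure of Algorithm 14.2.13 but recomputes |T(q)| and P(s|q) with the naive scan from the earlier sketch; variable names are illustrative.

def mvs_generator(records, L, K, C, sensitive):
    # X1: all distinct (loc, t) pairs in T
    candidates = [(p,) for p in sorted({pair for path, _ in records for pair in path})]
    V = []                                   # minimal violating sequences found so far
    i = 1
    while i <= L and candidates:
        W = []                               # non-violating sequences of size i
        for q in candidates:
            group = [diag for path, diag in records if is_subsequence(q, path)]
            if not group:
                continue
            conf = max((group.count(s) / len(group) for s in sensitive), default=0)
            if len(group) < K or conf > C:
                V.append(q)                  # an MVS: all its proper subsequences were non-violating
            else:
                W.append(q)
        # self-join of W: extend a non-violating sequence by one pair with a later timestamp
        candidates = []
        for qx in W:
            for qy in W:
                if qx[:-1] == qy[:-1] and qx[-1][1] < qy[-1][1]:
                    joined = qx + (qy[-1],)
                    if not any(is_subsequence(v, joined) for v in V):
                        candidates.append(joined)    # prune super sequences of known MVS
        i += 1
    return V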


Example 14.4
Consider Table 14.1 with L = 2, K = 2, C = 50%, and S = {AIDS}.

X1 = {b2, d3, c4, c5, f6, c7, e8}.

After scanning T, we divide X1 into V1 = ∅ and

W1 = {b2, d3, c4, c5, f6, c7, e8}.

Next, from W1 we generate the candidate set

X2 = {b2d3, b2c4, b2c5, b2f6, b2c7, b2e8, d3c4, d3c5, d3f6, d3c7, d3e8, c4c5, c4f6, c4c7, c4e8, c5f6, c5c7, c5e8, f6c7, f6e8, c7e8}.

We scan T again to determine

V2 = {b2d3, b2c4, b2f6, c4c7, c4e8}.

We do not further generate X3 because L = 2.

14.2.2.2 Eliminating Violating Sequences

We present a greedy algorithm to transform the raw trajectory data table T to an anonymous table T′ with respect to a given LKC-privacy requirement by a sequence of suppressions. In each iteration, the algorithm selects a pair p for suppression based on a greedy selection function. In general, a suppression on a pair p in T increases privacy because it removes minimal violating sequences (MVS), and decreases data utility because it eliminates maximal frequent sequences (MFS) in T. Therefore, we define the greedy function, Score(p), to select a suppression on a pair p that maximizes the number of MVS removed but minimizes the number of MFS removed in T. Score(p) is defined as follows:

Score(p) = PrivGain(p) / (UtilityLoss(p) + 1)    (14.1)

where PrivGain(p) and UtilityLoss(p) are the number of minimal violating sequences (MVS) and the number of maximal frequent sequences (MFS) containing the pair p, respectively. A pair p may not belong to any MFS, resulting in |UtilityLoss(p)| = 0. To avoid dividing by zero, we add 1 to the denominator. The pair p with the highest Score(p) is called the winner pair, denoted by w.
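Equation 14.1 translates directly into code. The sketch below (illustrative names) computes Score(p) from the current sets of MVS and MFS and selects the winner pair; with V(T) and U(T) of Example 14.5, it would rank c4 first with Score = 3/(1 + 1) = 1.5, matching Table 14.5.

def score(pair, mvs, mfs):
    # Score(p) = PrivGain(p) / (UtilityLoss(p) + 1)
    priv_gain = sum(1 for q in mvs if pair in q)       # MVS containing p
    utility_loss = sum(1 for q in mfs if pair in q)    # MFS containing p
    return priv_gain / (utility_loss + 1)

def pick_winner(mvs, mfs):
    # candidate pairs are those appearing in at least one MVS
    candidates = {pair for q in mvs for pair in q}
    return max(candidates, key=lambda p: score(p, mvs, mfs))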

Algorithm 14.2.14 summarizes the anonymization algorithm that removes all MVS. Line 1 calls Algorithm 14.2.13 to identify all MVS, denoted by V(T), and then builds a MVS-tree with a PG table that keeps track of the PrivGain of all candidate pairs for suppression. Line 2 calls a maximal frequent sequence mining algorithm (see Chapter 14.2.2.3) to identify all MFS, denoted by U(T), and then builds a MFS-tree with a UL table that keeps track of the UtilityLoss of all candidate pairs. At each iteration in Lines 3-9, the algorithm selects the winner pair w that has the highest Score(w) from


Algorithm 14.2.14 Trajectory Data Anonymizer
Input: Raw trajectory data table T
Input: Thresholds L, K, C, and K′
Input: Sensitive values S
Output: Anonymous T′ that satisfies LKC-privacy
 1: generate V(T) by Algorithm 14.2.13 and build MVS-tree;
 2: generate U(T) by MFS algorithm and build MFS-tree;
 3: while PG table is not empty do
 4:   select a pair w that has the highest Score to suppress;
 5:   delete all MVS and MFS containing w from MVS-tree and MFS-tree;
 6:   update the Score(p) if both w and p are contained in the same MVS or MFS;
 7:   remove w from PG table;
 8:   add w to Sup;
 9: end while
10: for ∀w ∈ Sup, suppress all instances of w from T;
11: return the suppressed T as T′;

Table 14.5: Initial Score

                    b2     d3     c4     f6     c7     e8
PrivGain             3      1      3      1      1      1
UtilityLoss (+1)     4      4      2      5      6      5
Score             0.75   0.25    1.5    0.2   0.16    0.2

the PG table, removes all the MVS and MFS that contain w, incrementally updates the Score of the affected candidates, and adds w to the set of suppressed values, denoted by Sup. Values in Sup are collectively suppressed in Line 10 in one scan of T. Finally, Algorithm 14.2.14 returns the anonymized T as T′. The most expensive operations are identifying the MVS and MFS containing w and updating the Score of the affected candidates. Below, we propose two tree structures to efficiently perform these operations.

DEFINITION 14.6 MVS-tree  MVS-tree is a tree structure that represents each MVS as a tree path from root to leaf. Each node keeps track of a count of MVS sharing the same prefix. The count at the root is the total number of MVS. MVS-tree has a PG table that maintains every candidate pair p for suppression, together with its PrivGain(p). Each candidate pair p in the PG table has a link, denoted by Link_p, that links up all the nodes in an MVS-tree containing p. PrivGain(p) is the sum of the counts of MVS on Link_p.


FIGURE 14.1: MVS-tree and MFS-tree for efficient Score updates

FIGURE 14.2: MVS-tree and MFS-tree after suppressing c4

DEFINITION 14.7 MFS-tree  MFS-tree is a tree structure that represents each MFS as a tree path from root to leaf. Each node keeps track of a count of MFS sharing the same prefix. The count at the root is the total number of MFS. MFS-tree has a UL table that keeps the UtilityLoss(p) for every candidate pair p. Each candidate pair p in the UL table has a link, denoted by Link_p, that links up all the nodes in MFS-tree containing p. UtilityLoss(p) is the sum of the counts of MFS on Link_p.
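A minimal rendering of the two tree structures is sketched below (an assumed node layout, not the book's implementation): each MVS or MFS is inserted as a root-to-leaf path with per-node counts, and the links table plays the role of the PG/UL table by recording, for every candidate pair, the nodes on its Link_p.

from collections import defaultdict

class PrefixTree:
    # shared skeleton for the MVS-tree and the MFS-tree
    def __init__(self):
        self.root = {"count": 0, "children": {}}
        self.links = defaultdict(list)            # pair -> nodes on Link_pair

    def insert(self, sequence):
        node = self.root
        node["count"] += 1                        # root count = total number of sequences
        for pair in sequence:
            child = node["children"].get(pair)
            if child is None:
                child = {"count": 0, "children": {}}
                node["children"][pair] = child
                self.links[pair].append(child)    # extend Link_pair
            child["count"] += 1
            node = child

    def weight(self, pair):
        # PrivGain(p) on an MVS-tree, UtilityLoss(p) on an MFS-tree
        return sum(node["count"] for node in self.links[pair])

For instance, inserting the five MVS of Example 14.5 and calling weight(('c', 4)) would return 3, i.e., PrivGain(c4) in Table 14.5.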

Example 14.5
Figure 14.1 depicts both an MVS-tree and an MFS-tree generated from Table 14.1, where

V(T) = {b2d3, b2c4, b2f6, c4c7, c4e8} and
U(T) = {b2c5c7, b2f6c7, b2c7e8, d3c4f6, f6c7e8, c5f6, c5e8, d3c7, d3e8}

with L = 2, K = 2, C = 50%, and K′ = 2. Each root-to-leaf path represents one sequence of MVS or MFS. To find all the MVS (or MFS) containing c4, follow Link_c4 starting from the PG (or UL) table. For illustration purposes, we show PG and UL as a single table.

Table 14.5 shows the initial Score(p) of every candidate in the PG table in the MVS-tree. Identify the winner pair c4 from the PG table. Then traverse Link_c4 to identify all MVS and MFS containing c4 and delete them from the MVS-tree and MFS-tree accordingly. These links are the key to efficient Score updates and suppressions. When a winner pair w is suppressed


Table 14.6: Score after suppressing c4

                    b2     d3     f6
PrivGain             2      1      1
UtilityLoss (+1)     4      3      4
Score              0.5   0.33   0.25

from the trees, the entire branch of w is trimmed. The trees provide an efficient structure for updating the counts of MVS and MFS. For example, when c4 is suppressed, all its descendants are removed as well. The counts of c4’s ancestor nodes are decremented by the counts of the deleted c4 node. If a candidate pair p and the winner pair w are contained in some common MVS or MFS, then UtilityLoss(p), PrivGain(p), and Score(p) have to be updated by adding up the counts on Link_p. A pair p is removed from the PG table if PrivGain(p) = 0. The shaded blocks in Figure 14.1 represent the nodes to be deleted after suppressing c4. The resulting MVS-tree and MFS-tree are shown in Figure 14.2. Table 14.6 shows the updated Score of the remaining candidate pairs. In the next iteration, b2 is suppressed and thus all the remaining MVS are removed. Table 14.3 shows the resulting anonymized table T′ for (2, 2, 50%)-privacy.
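Ignoring the tree-based bookkeeping, the overall greedy procedure of Algorithm 14.2.14 condenses to a few lines (a flat sketch reusing score and pick_winner from above; the trees only make the same deletions and Score updates incremental):

def anonymize(records, mvs, mfs):
    suppressed = set()
    while mvs:                                     # loop until no MVS remains
        w = pick_winner(mvs, mfs)
        suppressed.add(w)
        mvs = [q for q in mvs if w not in q]       # all MVS containing w disappear
        mfs = [q for q in mfs if w not in q]       # MFS containing w are sacrificed
    # one scan of T: global suppression of every winner pair
    return [(tuple(p for p in path if p not in suppressed), diag)
            for path, diag in records]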

14.2.2.3 Analysis

THEOREM 14.2
Given a trajectory data table T and a LKC-privacy requirement, it is NP-hard to find the optimum anonymous solution.

PROOF  The problem of finding the optimum anonymous solution can be converted into the vertex cover problem. The vertex cover problem is a well-known problem in which, given an undirected graph G = (V, E), it is NP-hard to find the smallest set of vertices S such that each edge has at least one endpoint in S. To reduce the problem into the vertex cover problem, we consider the set of candidate pairs as the set of vertices V. The set of MVS, denoted by V(T), is analogous to the set of edges E. Hence, the optimum vertex cover, S, means finding the smallest set of candidate pairs that must be suppressed to obtain the optimum anonymous data set T′. Given that it is NP-hard to determine the smallest set of vertices S, it is also NP-hard to find the optimum set of candidate pairs for suppression.

The presented trajectory anonymization method has two steps. In the first step, we determine the set of MVS and the set of MFS. In the second step, we build the MVS-tree and MFS-tree, and suppress the winner pairs iteratively according to their Score. We modified MAFIA [39], which is originally designed for mining maximal frequent itemsets, to mine MFS. Any alternative MFS algorithm can be used as a plug-in to the presented anonymization method. The most expensive operation of the method is scanning the raw trajectory data table T once to compute |T(q)| and P(s|q) for all sequences q in the candidate set Xi. This operation takes place during MVS generation. The cost of this operation is approximated as

Cost = Σ_{i=1}^{L} m_i · i, where m_i = |Xi|.

Note that the searching cost depends on the value of L and the size of the candidate set. When i = 1, the candidate set Xi is the set of all distinct pairs in T. Hence, the upper limit of m_1 is |d|, where |d| is the number of dimensions. It is unlikely to have any single pair violating the LKC-privacy; therefore, m_2 = |d|(|d| − 1)/2. In practice, most of the candidate sets are of size 2; therefore, the cost is approximately bounded by m_1 + 2m_2 = |d|². Finally, including the dependence on the data size, the time complexity of the presented anonymization algorithm is O(|d|²n).

In the second step, we insert the MVS and MFS into the respective trees and delete them iteratively afterward. This operation is proportional to the number of MVS and thus in the order of O(|V(T)|). Due to the MVS-tree and MFS-tree data structures, the anonymization method can efficiently calculate and update the score of the candidate pairs.

14.2.3 Discussion

14.2.3.1 Applying LKC-Privacy on RFID Data

Radio Frequency IDentification (RFID) is a technology of automatic object identification. Figure 14.3 illustrates a typical RFID information system, which consists of a large number of tags and readers and an infrastructure for handling a high volume of RFID data. A tag is a small device that can be attached to a person or a manufactured object for the purpose of unique identification. A reader is an electronic device that communicates with the RFID tag. A reader broadcasts a radio signal to the tag, which then transmits its information back to the reader [204]. Streams of RFID data records, in the format of 〈EPC, loc, t〉, are then stored in an RFID database, where EPC (Electronic Product Code) is a unique identifier of the tagged person or object, loc is the location of the reader, and t is the time of detection. The path of an object, like the path defined in Chapter 14.2.1, is a sequence of pairs that can be obtained by first grouping the RFID records by EPC and then sorting the records in each group by timestamps. A data recipient (or a data analysis module) could obtain the information on either specific tagged objects or general workflow patterns [105] by submitting data requests to the query engine. The query engine then responds to the requests by joining the RFID data with some object-specific data.

Retailers and manufacturers have created compelling business cases for deploying RFID in their supply chains, from reducing out-of-stocks at Wal-Mart to up-selling consumers in Prada. Yet, the uniquely identifiable objects pose a privacy threat to individuals, such as tracing a person’s movements, and


FIGURE 14.3: Data flow in RFID system

profiling individuals becomes possible. Most previous work on privacy-preserving RFID technology [204] focuses on the threats caused by the physical RFID tags. Techniques like EPC re-encryption and killing tags [130] have been proposed to address the privacy issues in the data collection phase, but these techniques cannot address the privacy threats in the data publishing phase, when a large volume of RFID data is released to a third party.

Fung et al. [91] present the first work to study the privacy threats in the data publishing phase and define a practical privacy model to accommodate the special challenges of RFID data. Their privacy model, LKC-privacy, which is also studied in this chapter, ensures that every RFID moving path with length not greater than L is shared by at least K−1 other moving paths, and that the confidence of inferring any pre-specified sensitive values is not greater than C. Fung et al. [91] propose an anonymization algorithm for the query engine (see Figure 14.3) to transform the underlying raw object-specific RFID data into a version that is immunized against privacy attacks. The general assumption is that the recipient could be an adversary, who attempts to associate some target victim(s) to his/her sensitive information from the published data. The general idea is very similar to the privacy model and anonymization method studied in this chapter.


14.2.3.2 Comparing with Anonymizing Transaction Data

Chapter 13 discusses some works on anonymizing high-dimensional transaction data [101, 220, 255, 256]. Ghinita et al. [101] divide the transaction data into public and private items. Then the public items are grouped together based on similarity. Each group is then associated with a set of private items such that probabilistically there are no linkages between the public and private items. The idea is similar to the privacy model of Anatomy [249] in Chapter 3.2, which disorients the linkages between QID and sensitive attributes by putting them into two tables.

The methods presented in [220, 255, 256] model the adversary’s power by a maximum number of known items as background knowledge. This assumption is similar to LKC-privacy studied above, but with two major differences. First, a transaction is a set of items, but a moving object’s path is a sequence of visited location-time pairs. Sequential data drastically increases the computational complexity for counting the support counts as compared to transaction data because 〈a → b〉 is different from 〈b → a〉. Hence, their proposed models are not applicable to the spatio-temporal data studied in this chapter. Second, the privacy and utility measures are different. Terrovitis et al.’s privacy model [220] is based on only k-anonymity and does not consider attribute linkages. Xu et al. [255, 256] measure their data utility in terms of preserved item instances and frequent itemsets, respectively. In contrast, the method studied in Chapter 14.2 aims at preserving frequent sequences.

14.3 (k, δ)-Anonymity

Many mobile devices, such as GPS navigation devices on cell phones, have very limited energy. To reduce energy consumption, some methods [124] have been developed to predict the location of a mobile device at a given time. A mobile device has to report its new location only if its actual location differs by more than an uncertainty threshold δ from the predicted location [1]. Abul et al. [1] propose a new privacy model called (k, δ)-anonymity that exploits the inherent uncertainty of moving objects’ locations. A trajectory can be considered as a polyline in a cylindrical volume with some uncertainty; therefore, anonymity is achieved if k different trajectories co-exist within the radius δ of any trajectory, as depicted in Figure 14.4. δ could be the result of inaccuracy in a positioning device.

14.3.1 Trajectory Anonymity for Minimal Distortion

Enforcing (k, δ)-anonymity on a data set of paths D requires every path in D to be (k, δ)-anonymous; otherwise, we need to transform D into another


FIGURE 14.4: Time and spatial trajectory volume ([1] ©2008 IEEE)

version D′ that satisfies the given (k, δ)-anonymity requirement. Let τ be a trajectory. Let (x, y, t) ∈ τ be the (x, y) position of τ at time t. The problem of (k, δ)-anonymity is formally defined as follows.

DEFINITION 14.8 Co-localization  Two trajectories τ1 and τ2 defined in time interval [t1, tn] co-localize, written as Coloc^δ_[t1,tn](τ1, τ2), with respect to an uncertainty threshold δ if and only if for each point (x1, y1, t) in τ1 and (x2, y2, t) in τ2 with t ∈ [t1, tn], it holds that Dist((x1, y1), (x2, y2)) ≤ δ [1].

Dist can be any function that measures the distance between two points. For example, the Euclidean distance:

Dist((x1, y1), (x2, y2)) = √((x1 − x2)² + (y1 − y2)²)    (14.2)

Intuitively, a trajectory is anonymous if it shares a similar path with other trajectories.

DEFINITION 14.9 Anonymity group of trajectories  Given an uncertainty threshold δ and an anonymity threshold k, a group G of trajectories is (k, δ)-anonymous if and only if |G| ≥ k and, for all τi, τj ∈ G, Coloc^δ_[t1,tn](τi, τj) [1].
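Definitions 14.8 and 14.9 can be checked directly in code. The sketch below assumes a trajectory is represented as a dict mapping each timestamp in [t1, tn] to an (x, y) point; the names are illustrative.

import math

def dist(p1, p2):
    # Euclidean distance, Equation 14.2
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def colocalize(tau1, tau2, delta, timestamps):
    # Coloc over [t1, tn]: same-time points are never more than delta apart
    return all(dist(tau1[t], tau2[t]) <= delta for t in timestamps)

def is_k_delta_anonymous(group, k, delta, timestamps):
    # Definition 14.9: at least k trajectories, pairwise co-localized
    return len(group) >= k and all(
        colocalize(t1, t2, delta, timestamps)
        for i, t1 in enumerate(group) for t2 in group[i + 1:])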

The trajectory anonymity problem is to transform a data set such that every group of trajectories is (k, δ)-anonymous with minimal distortion.


DEFINITION 14.10 Trajectory anonymity for minimal distortion  Given a data set of trajectory paths D, an uncertainty threshold δ, and an anonymity threshold k, the problem of trajectory anonymity for minimal distortion is to transform D into a (k, δ)-anonymous data set D′, such that for every trajectory τ ∈ D′, there exists a (k, δ)-anonymous group G ⊆ D′ with τ ∈ G, and the distortion between D and D′ is minimized [1].

14.3.2 The Never Walk Alone Anonymization Algorithm

The Never Walk Alone (NWA) anonymization algorithm in [1] can be summarized in three phases. The first phase, preprocessing, trims the starting and ending portions of the trajectories to ensure that they share the same time span. The second phase, clustering, groups nearby trajectories together into clusters. The third phase, space translation, “pushes” locations of a trajectory that fall outside the cylindrical volume into the cluster. Each phase is described in detail below.

14.3.2.1 Preprocessing

In real-world moving object databases, trajectories are very unlikely to have the same starting and ending points. Consider a city transport system. Many passengers travel from their homes in the uptown area to their workplaces in the downtown area in the morning and go back home in the evening. Though they share a similar path and direction during the same period of time, their homes and workplaces are different. Enforcing (k, δ)-anonymity on these scattered trajectories will result in poor data quality. Thus, the first step of NWA is to partition the input trajectories into groups of trajectories that have the same starting time and the same ending time.

The preprocessing is controlled by an integer parameter π: only one timestamp every π can be the starting or ending point of a trajectory [1]. For example, if the original data was sampled at a frequency of one minute, and π = 60, all trajectories are preprocessed in such a way that their starting and ending timestamps are mapped to full hours. We can first determine the starting timestamp ts and the ending timestamp te of each trajectory, and then all the points of a trajectory that do not fall between ts and te are discarded. After this preprocessing step, trajectories are partitioned into equivalence classes with respect to their new starting and ending timestamps.
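A sketch of this preprocessing step, under the assumption that a trajectory is a dict from integer timestamps to points and that at least one timestamp of each kept trajectory is a multiple of π:

def preprocess(trajectory, pi):
    # trim the trajectory so that it starts and ends on timestamps divisible by pi
    aligned = sorted(t for t in trajectory if t % pi == 0)
    if not aligned:
        return {}                                  # nothing to keep
    ts, te = aligned[0], aligned[-1]
    return {t: p for t, p in trajectory.items() if ts <= t <= te}

def partition_by_span(trajectories, pi):
    # group trimmed trajectories into equivalence classes by (start, end)
    classes = {}
    for tau in (preprocess(t, pi) for t in trajectories):
        if tau:
            classes.setdefault((min(tau), max(tau)), []).append(tau)
    return classes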

14.3.2.2 Clustering

Next, NWA clusters the trajectories into groups. First, NWA selects a sequence of pivot trajectories as cluster centers. The first pivot trajectory chosen is the farthest one from the center of the entire data set D. Then, the next pivot trajectory chosen is the farthest trajectory from the previous pivot. After a pivot trajectory is determined, a cluster is formed by taking the (k − 1)-nearest neighbor trajectories as its elements. Then, the remaining trajectories are assigned to the closest pivot. Yet, NWA enforces an additional constraint: the radius of every cluster must not be larger than a threshold max_radius. In case a cluster cannot be formed around a pivot due to a lack of nearby trajectories, the trajectory will not be used as a pivot, but it can be used in subsequent iterations as a member of some other cluster. The remaining trajectories that cannot be added to any cluster without violating the max_radius constraint are discarded as outliers. If there are too many outlier trajectories, NWA restarts the clustering procedure with an increased max_radius until not too many trajectories are discarded.
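The clustering phase can be sketched as follows. This is a simplified rendering, not the NWA implementation: tdist stands for any distance between whole trajectories (e.g., summed point-wise Euclidean distance), k is assumed to be at least 2, and the restart-with-larger-max_radius logic is left to the caller.

def nwa_cluster(trajectories, k, max_radius, tdist):
    remaining = list(trajectories)
    if len(remaining) < k:
        return [], remaining
    clusters, tried = [], set()
    # first pivot: the trajectory farthest from the rest of the data set
    pivot = max(remaining, key=lambda t: sum(tdist(t, o) for o in remaining))
    while len(remaining) >= k:
        tried.add(id(pivot))
        neighbors = sorted((t for t in remaining if t is not pivot),
                           key=lambda t: tdist(t, pivot))[:k - 1]
        if tdist(pivot, neighbors[-1]) <= max_radius:      # radius constraint holds
            clusters.append([pivot] + neighbors)
            for t in [pivot] + neighbors:
                remaining.remove(t)
        # next pivot: the untried trajectory farthest from the previous pivot
        candidates = [t for t in remaining if id(t) not in tried]
        if not candidates:
            break
        pivot = max(candidates, key=lambda t: tdist(t, pivot))
    # leftovers: attach to the closest pivot if max_radius allows, otherwise mark as outliers
    outliers = []
    for t in remaining:
        best = min(clusters, key=lambda c: tdist(t, c[0]), default=None)
        if best is not None and tdist(t, best[0]) <= max_radius:
            best.append(t)
        else:
            outliers.append(t)
    return clusters, outliers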

14.3.2.3 Space Translation

Space translation is the last phase to achieve (k, δ)-anonymity. This operation involves moving some trajectory points from their original locations to other locations. The objective is to achieve (k, δ)-anonymity while minimizing the distortion on the original routes. For each cluster formed in the clustering phase, compute the cluster center, form a cylindrical volume based on the cluster center, and move points lying outside of the cylindrical volume onto the perimeter of the cylindrical volume, from the original location towards the cluster center with minimal distortion. A natural choice of distortion measure on trajectories is the sum of point-wise distances between the original and translated trajectories. The problem of space translation is to achieve a given (k, δ)-anonymity requirement with the goal of minimizing the total translation distortion cost defined below.

DEFINITION 14.11 Space translation distortion  Let τ′ ∈ D′_T be the translated version of raw trajectory τ ∈ D_T. The translation distortion cost of τ′ with respect to τ is:

DistortCost(τ, τ′) = Σ_{t∈T} Dist(τ[t], τ′[t])    (14.3)

where τ[t] is the location of τ in the form of (x, y) at time t. The total translation distortion cost of the translated data set D′_T with respect to the raw data set D_T is:

TotalDistortCost(D_T, D′_T) = Σ_{τ∈D_T} DistortCost(τ, τ′).    (14.4)
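Equations 14.3 and 14.4 in code, reusing the dist function from the co-localization sketch and assuming the raw and translated trajectories share the same timestamps:

def distort_cost(tau, tau_prime):
    # DistortCost (Equation 14.3): sum of point-wise translation distances
    return sum(dist(tau[t], tau_prime[t]) for t in tau)

def total_distort_cost(raw, translated):
    # TotalDistortCost (Equation 14.4), with raw[i] translated into translated[i]
    return sum(distort_cost(tau, tau_p) for tau, tau_p in zip(raw, translated))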

14.3.3 Discussion

Abul et al. [1] present an interesting trajectory anonymization algorithm, Never Walk Alone (NWA), by exploiting the inherent uncertainty of mobile devices. Some new RFID devices, however, are highly accurate, with errors of less than a couple of centimeters. In that case, the inherent uncertainty may not be present. Also, NWA relies on the basic assumption that every trajectory is


continuous. Though this assumption is valid for GPS-like devices where the object can be traced all the time, it does not hold for RFID-based moving objects. For example, when a passenger uses his smart card in a subway station, the smart card or RFID tag is detected once at the entrance. The system may not know the next location of the passenger until hours later, so the time on an RFID path may not be continuous.

The clustering approach employed in NWA has to restart every time too many trajectories are discarded as outliers. As a result, it is very difficult to predict when the algorithm will terminate. Furthermore, the anonymity is achieved by space translation, which changes the actual location of an object, resulting in the publication of untruthful data.

14.4 MOB k-Anonymity

In contrast to (k, δ)-anonymity, which bypasses the fundamental concept of QID and relies on the concept of co-localization to achieve k-anonymity, Yarovoy et al. [263] present a novel notion of k-anonymity in the context of moving object databases (MOD) based on the assumption that different moving objects may have different QIDs.

14.4.1 Trajectory Anonymity for Minimal Information Loss

Unlike in relational data, where all tuples share the same set of quasi-identifiers (QIDs), Yarovoy et al. argue that in a MOD there does not exist a fixed set of QID attributes for all the MOBs. Hence, it is important to model the concept of quasi-identifier on an individual basis. Specifically, Yarovoy et al. [263] consider timestamps as the QIDs, with MOBs’ positions forming their values. A group of moving objects with the same QID values is called an anonymization group. In a MOD, the anonymization groups associated with different moving objects may not be disjoint.

Example 14.6
Consider a raw moving object database (Table 14.7) with 4 moving objects, in which the explicit identifiers (e.g., I1, I2, I3, I4) have been suppressed. Each row represents the trajectory of a moving object. Suppose k = 2 and QID(O1) = QID(O4) = {t1}, QID(O2) = QID(O3) = {t2}. Clearly the best anonymization group for O1 with respect to its QID {t1} is AS(O1) = {O1, O3}, as O3 is closest to O1 at time t1. Similarly, the best anonymization group for O2 and O3 with respect to their QID {t2} is {O2, O3}, and for O4 with respect to its QID {t1} is {O2, O4}. The anonymization groups are illustrated in Figure 14.5 by dark rectangles. Obviously, the anonymization groups of O1 and O3 as well as O2 and O4 overlap.


FIGURE 14.5: Graphical representation of MOD D

Due to this fact, the most obvious but incorrect way of defining k-anonymity is that an anonymous version D∗ of a MOD D satisfies k-anonymity provided that ∀O ∈ D and ∀t ∈ QID(O), there are at least k − 1 other distinct MOBs in D∗ indistinguishable from O at time t. Under such a definition, an adversary is able to conduct a privacy attack based on an attack graph, formalized below.

DEFINITION 14.12 Attack graph [263]  An attack graph G associated with a MOD D and its anonymous version D∗ is a bipartite graph that consists of nodes for every individual I in D, called I-nodes, and nodes for every MOB id O in D∗, called O-nodes. There is an edge (I, O) in G if and only if ∀t ∈ QID(I), D(I, t) ⊑ D∗(O, t), where ⊑ denotes spatial containment.

Example 14.7
Table 14.8 provides an incorrect 2-anonymity release of the MOD D (Table 14.7) based on space generalization. For example, all objects in AS(O1) = {O1, O3} are generalized to the smallest common region [(2, 1), (2, 2)] at time t1. Figure 14.6 presents the corresponding attack graph. An adversary can re-identify both I1 and I4 because there exist perfect matchings in the attack graph, that is, there are some O-nodes with degree 1. As a result, the adversary knows that O1 must map to I1 and O4 must map to I4.

In some subtler cases, the adversary can iteratively identify and prune edges that cannot be part of any perfect matching to discover perfect matchings,


Table 14.7: Raw moving object database D

MOB   t1       t2
O1    (2, 1)   (6, 2)
O2    (3, 6)   (6, 6)
O3    (2, 2)   (7, 7)
O4    (4, 5)   (8, 4)

Table 14.8: A 2-anonymity scheme of D that is not safe

MOB   t1                  t2
O1    [(2, 1), (2, 2)]    (6, 2)
O2    [(3, 5), (4, 6)]    [(6, 6), (7, 7)]
O3    [(2, 1), (2, 2)]    [(6, 6), (7, 7)]
O4    [(3, 5), (4, 6)]    (8, 4)

and finally intrude on a record owner’s privacy. To avoid privacy attacks based on the notion of attack graph, MOB k-anonymity is formally defined as follows.

DEFINITION 14.13 MOB k-anonymity [263]  Let D be a MOD and D∗ its anonymized version. Given a set of QIDs for the MOBs in D, let G be the attack graph with respect to D and D∗. D∗ satisfies MOB k-anonymity provided that: (i) every I-node in G has at least degree k; and (ii) G is symmetric, i.e., whenever G contains an edge (Ii, Oj), it also contains the edge (Ij, Oi).

For MOB k-anonymity, the information loss is measured as the reduction in the probability of accurately determining the position of an object over all timestamps between the raw MOD D and its anonymous version D∗. It is formally defined as follows:

IL(D, D∗) = Σ_{i=1}^{n} Σ_{j=1}^{m} (1 − 1/area(D∗(Oi, tj))),

where area(D∗(Oi, tj)) denotes the area of the region D∗(Oi, tj). For example, in Table 14.8, area(D∗(O1, t1)) is 2 while area(D∗(O2, t2)) is 4.
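The information loss measure in code, under the assumption that the anonymized MOD D* is a dict mapping (MOB, t) to an axis-aligned rectangle ((x1, y1), (x2, y2)) on the grid, with an exact point stored as a degenerate rectangle of area 1:

def area(region):
    # number of grid cells covered; e.g., [(2, 1), (2, 2)] covers 2 cells
    (x1, y1), (x2, y2) = region
    return (abs(x2 - x1) + 1) * (abs(y2 - y1) + 1)

def information_loss(d_star):
    # IL(D, D*): summed over every object O_i and timestamp t_j
    return sum(1 - 1 / area(region) for region in d_star.values())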

14.4.2 Anonymization Algorithm for MOB k-Anonymity

The MOB anonymization algorithm in [263] is composed of two steps: identifying anonymization groups and generalizing the groups to common regions according to the QIDs while achieving minimal information loss.


FIGURE 14.6: Attack graph of the unsafe 2-anonymity scheme of D in Table 14.8

14.4.2.1 Identifying anonymization groups

The first step of identifying anonymization groups is, for each MOB O in a given MOD D, to find the top-K MOBs whose aggregate distance (e.g., sum of distances, average distance) from O over all QID(O) timestamps is minimum. The distance between two MOBs is measured by the Hilbert index, which makes sure that points close in the multi-dimensional space remain close in the linear Hilbert ordering. Given a MOB O, its Hilbert index at time t is denoted as H_t(O). The list of MOBs and their Hilbert indexes at time t is referred to as the Hilbert list of MOBs at time t, denoted by L_t. The distance between two MOBs, O and O′, can therefore be approximated as |H_t(O) − H_t(O′)|. Thus, the problem of finding the top-K MOBs with the closest distance from O is to find the top-K MOBs with the lowest overall score,

Σ_{t∈QID(O)} |H_t(O) − H_t(O′)|.

Yarovoy et al. [263] adapt the Threshold Algorithm (TA) [85] by keeping the list L_t in ascending order of the Hilbert index. For every MOB O, we consider the top-K MOBs as its anonymization group AG(O).
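Leaving the Hilbert-index computation and the TA optimization aside, the top-K selection reduces to the following brute-force sketch (H is assumed to be a dict of per-timestamp Hilbert lists, H[t][O] = H_t(O); names are illustrative):

def hilbert_score(o, other, qid_times, H):
    # aggregate approximate distance over the QID timestamps of o
    return sum(abs(H[t][o] - H[t][other]) for t in qid_times)

def top_k_mobs(o, qid_times, H, mobs, k):
    # the k MOBs (other than o) with the lowest overall score
    others = [m for m in mobs if m != o]
    return sorted(others, key=lambda m: hilbert_score(o, m, qid_times, H))[:k]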

However, simply generalizing a moving object O with the MOBs in AG(O) with respect to QID(O) may not ensure k-anonymity. Two algorithms, extreme union and symmetric anonymization, are consequently proposed to expand the anonymization groups in order to ensure k-anonymity. In extreme union, for every moving object O in D, we take the union of the QIDs of all MOBs in AG(O) and then generalize all of them with respect to every time point in this union in the later stage (Chapter 14.4.2.2). For example, in the running example, AG(O1) with respect to QID(O1) = {t1} is {O1, O3}. It is insufficient to guarantee k-anonymity by generalizing only on t1. In contrast, extreme union will generalize O1 and O3 together with respect to the union of their QIDs, {t1, t2}. Since the distance between a moving object O and another object O′ ∈ AG(O) for the timestamps outside QID(O) but in the timestamp union (e.g., O1 and O3 at time t2) could be arbitrarily far, the generalizations over such timestamps could incur significant information loss. Table 14.9 shows the timestamp unions of the MOBs in Table 14.7.

An alternative approach, symmetric anonymization, keeps the timestamps


Table 14.9: Extreme union over Table 14.7

MOB   QID   AG(O)     Timestamp Union
O1    t1    O1, O3    t1, t2
O2    t2    O2, O3    t2
O3    t2    O2, O3    t2
O4    t1    O2, O4    t1, t2

Table 14.10: Symmetric anonymization over Table 14.7

MOB   QID   HS(O)
O1    t1    O1, O3
O2    t2    O2, O3, O4
O3    t2    O1, O2, O3
O4    t1    O2, O4

of an object O w.r.t. which all objects in AG(O) are generalized together fixed to QID(O) only, instead of the union of the QIDs of all objects in AG(O). To enforce symmetry, it controls the composition of the anonymization groups. Thus, the anonymization groups generated by symmetric anonymization may not be the same as the ones produced by top-K MOBs, and will be referred to as hiding sets (HS) for clarity. To compute the HS of a moving object Oi: first, we add Oi itself to HS(Oi); second, if the number of distinct objects in HS(Oi) is less than k, we add the top-(k−|HS(Oi)|) MOBs to HS(Oi); third, to enforce symmetry, for each object Oj in HS(Oi), where i ≠ j, we further add Oi to HS(Oj). For example, assume we consider 2-anonymity in the MOD given in Table 14.7. HS(O1) is initialized to {O1}; then O3, the top-1 MOB with respect to O1 at time t1, is added to HS(O1). To enforce symmetry, O1 is added to HS(O3). Now both O1 and O3 have 2 objects in their hiding sets, so we can move to O2 and set HS(O2) = {O2, O3}. Due to the symmetry requirement, we need to add O2 to HS(O3) in spite of the fact that there are already 2 objects in HS(O3). Finally, we compute HS(O4) = {O2, O4}, which requires adding O4 to HS(O2) due to the symmetry. Table 14.10 shows the hiding sets of the MOBs in Table 14.7.
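The hiding-set construction follows this description almost literally; a sketch, reusing top_k_mobs from above and assuming qid maps every MOB to its set of QID timestamps:

def hiding_sets(mobs, qid, H, k):
    HS = {o: {o} for o in mobs}                    # step 1: every object hides itself
    for o in mobs:
        needed = k - len(HS[o])                    # step 2: top up to k distinct objects
        if needed <= 0:
            continue
        for other in top_k_mobs(o, qid[o], H, mobs, len(mobs)):
            if other not in HS[o]:
                HS[o].add(other)
                HS[other].add(o)                   # step 3: enforce symmetry
                needed -= 1
                if needed == 0:
                    break
    return HS

Running this on Table 14.7 with k = 2 (and Hilbert indexes that rank O3 closest to O1 and to O2, and O2 closest to O4) reproduces the hiding sets of Table 14.10.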

14.4.2.2 Generalizing anonymization groups

The last step of the anonymization algorithm is to perform the space generalization over either all objects in AG(O) together with respect to every timestamp in its timestamp union for every object O in the MOD D (for extreme union) or all objects in HS(O) together with respect to QID(O) for every object O in D (for symmetric anonymization). The main challenge is that overlapping anonymization groups can force us to revisit earlier generalizations [263]. The concept of equivalence class is consequently proposed.


Table 14.11: Equivalence classes produced by extreme union and symmetric union over Table 14.7

t    EC_t by Extreme Union    EC_t by Symmetric Anonymization
t1   {O1, O3}, {O2, O4}       {O1, O3}, {O2, O4}
t2   {O2, O3}                 {O1, O2, O3, O4}

DEFINITION 14.14 Equivalence class  Two MOBs Oi and Oj are equivalent at time t, denoted by Oi ≡t Oj, if and only if:

1. t ∈ QID(Oj) and Oi ∈ S(Oj), or

2. t ∈ QID(Oi) and Oj ∈ S(Oi), or

3. there exists a MOB Ok ≠ Oi, Oj such that Oi ≡t Ok and Ok ≡t Oj, where S(O) is either AG(O) or HS(O) [263].

For all Oi ∈ D, let S(Oi) be the anonymization group or the hiding set of Oi, let T(Oi) be the set of timestamps associated with S(Oi), and let QIT be the union of all timestamps in D. The equivalence classes with respect to time t, denoted by EC_t, can be computed by going through every MOB Oi with t ∈ QID(Oi) and adding S(Oi) to the collection EC_t. When a new S(Oi) is added to EC_t, if it overlaps with an existing set in EC_t, we merge it with the existing set; otherwise, we make a new set. Table 14.11 presents the equivalence classes produced by both extreme union and symmetric anonymization on Table 14.7. Achieving a k-anonymous D∗ of D therefore amounts to, for each time t ∈ QIT and for each equivalence class set C ∈ EC_t, generalizing the position of every MOB O ∈ C to the smallest common region in order to minimize the information loss.
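The merge-on-overlap rule is plain set merging; a sketch for the symmetric-anonymization case, where S maps each MOB to its hiding set and qid maps each MOB to its QID timestamps:

def equivalence_classes(mobs, qid, S):
    ec = {}                                        # t -> list of disjoint classes
    for o in mobs:
        for t in qid[o]:
            classes = ec.setdefault(t, [])
            group = set(S[o])
            for c in [c for c in classes if c & group]:
                group |= c                         # merge overlapping classes
                classes.remove(c)
            classes.append(group)
    return ec

On the hiding sets of Table 14.10 this yields {O1, O3}, {O2, O4} at t1 and {O1, O2, O3, O4} at t2, as in Table 14.11.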

Though symmetric anonymization overcomes the aforementioned disadvantage of extreme union, from Table 14.11 we can observe that its resulting equivalence classes could be larger than those produced by extreme union, depending on the relative positions of the moving objects. Typically, the smaller the classes, the less general the regions. Thus, theoretically extreme union and symmetric anonymization are incomparable in terms of the information loss they result in [263]. However, the experimental results in [263] indicate that symmetric anonymization outperforms extreme union in terms of both efficiency and efficacy.

14.4.2.3 Discussion

An underlying assumption of MOB k-anonymity is that the data publisher must be aware of the QIDs of all moving objects in the MOD to publish, that is, all adversaries’ possible background knowledge. However, the paper leaves the acquisition of QIDs by a data publisher unsolved. An attacker may obtain a moving object’s QIDs from various sources beyond the data


publisher’s perception. Recall that usually data publishers are not experts, so this fact may hinder the application of MOB k-anonymity.

The approach proposed in [263] is limited to protection from only identity linkage attacks (k-anonymity). However, in the context of MODs, there are also requirements of protecting against attribute linkage attacks (see Chapter 14.1.2); thus the approach is not applicable to such scenarios. In addition, space generalization based on coordinates sometimes may impose additional effort on the data publisher. For example, a transit company needs preprocessing to get all stations’ coordinates.

14.5 Other Spatio-Temporal Anonymization Methods

Different solutions have been proposed to protect the privacy of location-based service (LBS) users. The anonymity of a user in LBS is achieved by mixing the user’s identity and request with other users. Examples of such techniques are Mix Zones [31], cloaking [106], and location-based k-anonymity [98]. The objective of these techniques is very different from the problem studied in this chapter. First, their goal is to anonymize an individual user’s identity resulting from a set of LBS requests, but the problem studied in this chapter is to anonymize high-dimensional trajectory data. Second, they deal with small dynamic groups of users at a time, but we anonymize a large static data set. Hence, their problem is very different from the spatio-temporal data publishing studied in this chapter.

The privacy model proposed by Terrovitis and Mamoulis [219] assumes that different adversaries have different background knowledge about the trajectories, and thus their objective is to prevent adversaries from gaining any further information from the published data. They consider the locations in a trajectory as sensitive information and assume that the data holder has the background knowledge of all the adversaries. In reality, such information is difficult to obtain.

Papadimitriou et al. [185] study the privacy issue of publishing time series data and examine the trade-offs between time series compressibility and partial information hiding, and their fundamental implications on how one should introduce uncertainty about individual values by perturbing them. The study found that by making the perturbation “similar” to the original data, we can both preserve the structure of the data better and simultaneously make breaches harder. However, as data become more compressible, a fraction of the uncertainty can be removed if true values are leaked, revealing how they were perturbed. Malin and Airoldi [163] study the privacy threats in location-based data in the environment of hospitals.


14.6 Summary

We have studied the problem of anonymizing high-dimensional trajectory data and shown that traditional QID-based anonymization methods, such as k-anonymity and its variants, are not suitable for anonymizing trajectories, due to the curse of high dimensionality. Applying k-anonymity to high-dimensional data would result in a high utility loss. To overcome the problem, we discussed the LKC-privacy model [91, 170], where privacy is ensured by assuming that an adversary has limited background knowledge about the victim. We have also presented an efficient algorithm for achieving LKC-privacy with the goal of preserving maximal frequent sequences, which serve as the basis of many data mining tasks on sequential data.

The Never Walk Alone (NWA) [1] algorithm studied in Chapter 14.3 employs clustering and space translation to achieve anonymity. A drawback of the clustering phase in NWA is that it may have to restart a number of times in order to find a suitable solution. Another drawback is that the space translation produces untruthful data. In contrast, the LKC-privacy approach presented in Chapter 14.2 does not require continuous data and employs suppression for anonymity. Thus, the LKC-privacy approach preserves the data truthfulness and the maximal frequent sequences with true support counts. Preserving data truthfulness is important if the data will be examined by human users for the purposes of auditing and data interpretation. Moreover, NWA does not address the privacy threats caused by attribute linkages. Yarovoy et al. [263] consider time as a QID attribute. However, there is no fixed set of times for all moving objects; rather, each trajectory has its own set of times as its QID. It is unclear how the data holder can determine the QID attributes for each trajectory.


Chapter 15

Anonymizing Social Networks

15.1 Introduction

In 2001, Enron Corporation filed for bankruptcy. During the related legal investigation into the accounting fraud and corruption, the Federal Energy Regulatory Commission made public a large set of email messages concerning the corporation. This data set is known as the Enron corpus, and contains over 600,000 messages that belong to 158 users, mostly senior management of Enron. After removing duplicates, there are about 200,000 messages. This data set is valuable for researchers interested in how emails are used in an organization and in better understanding organizational structure. If we represent each user as a node, and create an edge between two nodes when there exists sufficient email correspondence between the two corresponding individuals, then we arrive at a data graph, or a social network.

It is natural that when such data are made public, the involved individuals will be concerned about the disclosure of their personal information, which should be kept private. This data set is quoted in [111] as a motivating example for the study of privacy in social networks. Such a set of data can be visualized as a graph as shown in Figure 15.1.

Social networks are abundant: on the Internet, in corporate correspondence, in networks of collaborating authors, and so on. Social networking websites such as Friendster, MySpace, Facebook, Cyworld, and others have become very popular in recent years. The information in social networks has become an important data source, and sometimes it is necessary or beneficial to release such data to the public. Other than the above utilities of the Enron email data set, there can be other kinds of utility for social network data. Kumar et al. [147] study the structure and evolution of such networks. Getoor and Diehl [100] prepare a survey on link mining for data sets that resemble social networks. McCallum et al. [167] consider topic and role discovery in social networks.

To date, not many data sets have been made public because of the associated privacy issues. It is obvious that no employee of Enron would like his/her identity in the data set to be disclosed, since it means that their messages to others, which were written only for the recipient(s), would be disclosed. Anonymous web browsing is an example where people want to conceal any behavior that is deemed unethical or disapproved of by society. It is therefore important to ensure that any published data about social networks does not violate individual privacy. In this chapter we examine the related issues and explore some of the existing mechanisms.

15.1.1 Data Models

Different models have been proposed with respect to social network data that are published or made available to the public. The following are some of these models:

• A simple model: The data set is given by a graph G = (V, E), where V is a set of nodes and E is a set of edges. Each node is an entity of interest. In social networks, each node typically corresponds to an individual or a group of individuals. An edge represents a relationship between two individuals.

• Labeled nodes: In [271], the graph G = (V, E) is enriched with labels: given a set of labels L, a mapping function assigns to each node in V a label from L. For example, the occupation of an individual can be used as a label. Figure 15.2 shows a social network with labeled nodes, where A, B, C are the labels.

• Nodes with attached attributes or data: Each node typically is an individual, so each node is associated with some personal information. In the Enron data set, the emails sent or received by an individual can be seen as data attached to the corresponding node.

• Sensitive edges: The data set is a multi-graph G = (V, E1, E2, . . . , Es), where V is a set of nodes and the Ei are sets of edges. Es corresponds to the sensitive relationships. There can be different sets of edges, where each set belongs to a particular class. Edges can also be labeled with more information about the relationships.

15.1.2 Attack Models

The anonymization problem of social networks depends on the background knowledge of the adversary and the model of privacy attack, which can be the identification of individuals in the network or the relationships among individuals.

Li et al. [153] suggest that there are two types of privacy attacks on social networks:

• identity disclosure: With a social network, this is to link the nodes in the anonymized network to real-world individuals or entities. Conceptually, the system may choose a value of k and make sure that in the released social network G′, which can be different from the original data G, no node v in G′ can be re-identified with a probability greater than 1/k, based on an assumption about the adversary's background knowledge. Most existing mechanisms try to ensure that at least k different nodes are similar in terms of the adversary's knowledge about the target of attack, so that each of the k nodes has the same chance of being the culprit.

FIGURE 15.1: A social network

• attribute disclosure: With a social network, each node represents an individual or a group of individuals. There can be attributes attached to each node, such as personal or group information. Sometimes, such information can be considered sensitive or private to some individuals.

With graph data, another type of attack is identified:

• link re-identification: Links or edges in a social network can represent sensitive relationships. In cases where the identities of nodes are released to some user groups, some individuals may not want their relationships to some other individuals to be discovered.

15.1.2.1 Passive Attacks

The terms passive and active attacks are used to differentiate between cases where the adversary only observes the data and does not tamper with it, and cases where the adversary may change the data for attack purposes.


Basis of Attacks: Background Knowledge of Adversary

In a social network, an individual may be identified based on the attribute values and also the relationships with other individuals [235]. Before we introduce the kinds of adversary knowledge, we need some definitions.

DEFINITION 15.1 Induced subgraph Given a graph G = (V, E), with node set V and edge set E, let S be a set of nodes in V. The induced subgraph of G on S is G(S) = (S, E′), where E′ = {(u, v) | (u, v) ∈ E, u ∈ S, v ∈ S}.

DEFINITION 15.2 1-neighborhood Given a graph G = (V, E), if (u, v) ∈ E, then u and v are neighbors of each other. Suppose u ∈ V and let W be the set of neighbors of u. The 1-neighborhood of u ∈ V is the induced subgraph of G on {u} ∪ W.
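
To make Definitions 15.1 and 15.2 concrete, the following Python sketch computes an induced subgraph and a 1-neighborhood over a graph stored as an adjacency dictionary. The helper names and the toy edge set (our reconstruction from the textual description of Figure 15.2) are illustrative assumptions, not part of the original algorithms.

    # Minimal sketch of Definitions 15.1 and 15.2. The graph is an
    # undirected graph stored as {node: set_of_neighbors}; the helper
    # names and the toy edge set are illustrative only.

    def induced_subgraph(adj, S):
        """Induced subgraph of G on the node set S (Definition 15.1)."""
        S = set(S)
        return {u: adj[u] & S for u in S}

    def one_neighborhood(adj, u):
        """1-neighborhood of u: induced subgraph on {u} union u's neighbors."""
        return induced_subgraph(adj, {u} | adj[u])

    adj = {
        "x1": {"x2", "x3"},       "x2": {"x1", "x4", "x5", "x6"},
        "x3": {"x1", "x6"},       "x4": {"x2", "x5"},
        "x5": {"x2", "x4"},       "x6": {"x2", "x3", "x7"},
        "x7": {"x6"},
    }

    print(one_neighborhood(adj, "x1"))
    # keeps x1, x2, x3 and only the edges among them that are in the graph
    # (here x1-x2 and x1-x3)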

Hi: Neighborhood Knowledge

Hay et al. [111, 110] consider a class of queries, of increasing power, which report on the local structure of the graph around a target node. For example, Hay et al. [111, 110] define that

• H0(x) returns the label of node x (for an unlabeled social network, H0(x) = ε),

• H1(x) returns the degree of node x, and

• H2(x) returns the multiset of the degrees of each neighbor of x.

For example, consider the node x1 in Figure 15.2.

H0(x1) = A
H1(x1) = 2
H2(x1) = {2, 4}

The queries can be defined iteratively, where Hi(x) returns the multiset of values obtained by evaluating Hi−1 on the set of nodes adjacent to x. For example,

• Hi(x) = {Hi−1(z1), Hi−1(z2), . . . , Hi−1(zm)}, where z1, . . . , zm are the neighbors of x.
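
A small Python sketch of these queries, over the same kind of adjacency dictionary, may help; the function names and the partial labeling are our illustrative assumptions.

    # Sketch of the H_i background-knowledge queries. The graph is
    # {node: set_of_neighbors}; labels maps nodes to labels.

    def H(i, x, adj, labels=None):
        """H_0(x) = label of x (epsilon if unlabeled);
        H_i(x) = multiset of H_{i-1}(z) over the neighbors z of x."""
        if i == 0:
            return labels.get(x, "") if labels else ""
        return sorted(H(i - 1, z, adj, labels) for z in adj[x])

    adj = {"x1": {"x2", "x3"}, "x2": {"x1", "x4", "x5", "x6"},
           "x3": {"x1", "x6"}, "x4": {"x2", "x5"}, "x5": {"x2", "x4"},
           "x6": {"x2", "x3", "x7"}, "x7": {"x6"}}
    labels = {"x1": "A", "x2": "A", "x3": "A"}     # partial labeling

    print(H(0, "x1", adj, labels))                     # 'A'
    # For an unlabeled network H_0 = epsilon, so |H_1(x)| is the degree and
    # the sizes of the inner multisets of H_2(x) are the neighbor degrees.
    print(len(H(1, "x1", adj)))                        # 2       = H_1(x1)
    print(sorted(len(m) for m in H(2, "x1", adj)))     # [2, 4]  = H_2(x1)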

Subgraph Knowledge

Hay et al. [110] suggest that, other than Hi, a stronger and more realistic class of query is the subgraph query, which asserts the existence of a subgraph around the target node. The descriptive power of the query is measured by the number of edges in the subgraph. Note that this class of adversary knowledge also covers the 1-neighborhood, since the 1-neighborhood is one possible subgraph around the target node.

FIGURE 15.2: A social network

In Figure 15.2, if the adversary knows that the target individual is at the center of a star with 4 neighbors, then the adversary can pinpoint the node x2 in the network as the target, since x2 is the only node with 4 neighbors. However, if the adversary knows only that the target is labeled A and has a neighbor with label A, then nodes x1, x2, and x3 all qualify as possible nodes for the target, and the adversary cannot succeed in a sure attack.

Zhou and Pei [271] consider an adversary's background knowledge of the 1-neighborhood of the target node. Backstrom et al. [25] describe a type of passive attack, in which a small group of colluding social network users discover their nodes in the anonymized network by utilizing the knowledge of the network structure around them. This attack is feasible, but only works on a small scale because the colluding users can only compromise the privacy of some of the users who are already their friends.

For link re-identification, Korolova et al. [143] assume that each user can have knowledge of the neighborhood within a distance ℓ, and with this knowledge, an adversary can try to bribe other nodes in the network to gain information about a significant portion of the links in the network.

Recently, Zou et al. [272] propose a k-automorphism model that can prevent any type of structural attack. The intuition of the method is as follows. Assume that there are k − 1 automorphic functions Fa, for 1 ≤ a ≤ k − 1, in the published network, and that for each vertex v, Fax(v) ≠ Fay(v) whenever ax ≠ ay. Then, for each vertex v in the published network, there are always k − 1 other symmetric vertices. This means that there are no structural differences between v and each of its k − 1 symmetric vertices, so the adversary cannot distinguish v from the other k − 1 symmetric vertices using any structural information. Therefore, the target victim cannot be identified with a probability higher than 1/k.


15.1.2.2 Active Attacks

Backstrom et al. [25] study the problem of active attacks aiming at identity disclosure and node re-identification, with the assumption that the adversary has the ability to modify the network prior to its release. The general idea is to first choose an arbitrary set of target victim users from the social network, create a small number of new "sybil" social network user accounts and link these new accounts with the target victims, and create a pattern of links among the new accounts such that they can be uniquely identified in the anonymized social network structure. The attack involves creating O(log N) sybil accounts, where N is the total number of users in the social network.

Alternatively, an adversary could find or create k nodes in the network, namely {x1, . . . , xk}, and create each edge (xi, xj) independently with probability 1/2. This produces a random graph H. The graph is embedded into the given social network, forming a graph G. The first requirement on H is that there is no other subgraph S in G that is isomorphic to H (meaning that H can be obtained from S by relabeling the nodes); in this way H can be uniquely identified in G. The second requirement on H is that it has no nontrivial automorphism, where an automorphism is an isomorphism from H to itself. With H, the adversary can link a target node w to a subset of nodes N in H so that once H is located, w is also located. Results from random graph theory [34] show that the chance of satisfying the two requirements above using this random generation method is very high.
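
A small sketch of the attacker-side construction of H, using the networkx library, is shown below; it only checks the "no nontrivial automorphism" requirement, omits the expensive check that H appears uniquely inside G, and all names are our own illustrative assumptions.

    # Attacker-side sketch: sample H = G(k, 1/2) until H has no nontrivial
    # automorphism. Uniqueness of H as a subgraph of G is NOT checked here.
    import random
    import networkx as nx
    from networkx.algorithms.isomorphism import GraphMatcher

    def sample_attack_subgraph(k, rng=random):
        while True:
            H = nx.gnp_random_graph(k, 0.5, seed=rng.randrange(10**9))
            # count isomorphisms from H to itself; exactly one (the identity)
            # means H is asymmetric
            autos = sum(1 for _ in GraphMatcher(H, H).isomorphisms_iter())
            if autos == 1:
                return H

    random.seed(0)
    H = sample_attack_subgraph(10)   # k = O(log N) sybil nodes in the attack
    print(H.number_of_nodes(), H.number_of_edges())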

Fortunately, active attacks are not practical for large-scale privacy attacks due to three limitations suggested by Narayanan and Shmatikov [175].

1. Online social networks only: Active attacks are restricted to online social networks because the adversary has to create a large number of fake user accounts in the social network before its release. Many social network providers, such as Facebook, ensure that each social network account is associated with at least one e-mail account, making the creation of a large number of fake social network user accounts difficult.

2. Sybil accounts have low in-degree: The adversary may be able to plant the fake user accounts and link the fake accounts to the target victims. Yet, the adversary has little control over the in-degree of the fake user accounts. As a result, sybil accounts may be identified by their low in-degree. Yu et al. [266] develop techniques to identify sybil attacks in social networks.

3. Mutual links: Many online social network providers consider two users to be linked only if both users mutually agree to be "friends." For example, in Facebook, user A initiates a friendship with user B, and the friendship is established only if B confirms the relationship. Due to the restriction of mutual links, the target victims will not agree to link back to the fake accounts, so the links are not established and therefore do not appear in the released network.


Note that since the attack is based on a subgraph in the network, it can be defended against in the same manner as a passive attack based on subgraph knowledge.

15.1.3 Utility of the Published Data

For privacy-preserving data publishing, the other side of the coin is the utility of the published data. Most of the utility measures for social networks are related to graph properties. The following are some examples.

• Properties of interest in network data: Hay et al. [110] use the following properties for the utility measure:

– Degree: distribution of degrees of all nodes in the graph

– Path length: distribution of the lengths of the shortest paths between 500 randomly sampled pairs of nodes in the network

– Transitivity: distribution of the size of the connected component that a node belongs to

– Network resilience: number of nodes in the largest connected component of the graph as nodes are removed

– Infectiousness: proportion of nodes infected by a hypothetical disease, simulated by randomly picking a node for the first infection and then spreading the disease at a specified rate

• Aggregate query answering: Korolova et al. [143] consider queries of the following form: for two labels l1, l2, what is the average distance from a node with label l1 to its nearest node with label l2.

15.2 General Privacy-Preserving Strategies

In order to avoid privacy breaches, different strategies can be adopted. The first and most widely adopted approach is to publish an anonymized network. In other cases, such as Facebook, the data holder may choose to allow only subgraphs to be known to individual users; for example, the system may let each user see one level of neighbors around the user's own node. Another approach is not to release the network at all, but to return an approximate answer to any query that is issued about the network.

To preserve privacy, we can release a data set that deviates from the original data set, but which still contains useful information for different usages of the data. In Chapter 15.2, we introduce two different approaches: graph modification and equivalence classes of nodes.


FIGURE 15.3: Anonymization by addition and deletion of edges

15.2.1 Graph Modification

One way to anonymize a social network is to modify the network by adding or deleting nodes, adding or deleting edges, and changing (generalizing) the labels of nodes. The network is released only after sufficient changes have been made to satisfy certain privacy criteria.

In Figure 15.3, we show an anonymization of Figure 15.1 by adding and deleting edges only. The aim is to resist attacks from adversaries with knowledge of the 1-neighborhood of some target node. We would like the resulting graph to be 2-anonymous with respect to this neighborhood knowledge, meaning that for any 1-neighborhood N, there are at least 2 nodes with a 1-neighborhood of N. In the anonymization, one edge (x2, x3) is added, while 4 edges are deleted, namely, (x2, x4), (x2, x5), (x2, x6), (x3, x6). If the target victim node is x1, which is labeled A and is connected to 2 neighbors also labeled A, we find that in the resulting graph in Figure 15.3, there are 3 nodes, namely x1, x2, and x3, that have the same 1-neighborhood. If the target node is x7, with label B, we find that both x7 and x6 have the same 1-neighborhood. The graph is therefore 2-anonymous with respect to attacks by 1-neighborhood knowledge.
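
This 2-anonymity claim can be checked mechanically by grouping nodes according to the isomorphism class of their labeled 1-neighborhoods and verifying that every class has at least k members. The sketch below uses the networkx library; the edge list and the labels of x4, x5, and x6 are our reconstruction of the anonymized graph of Figure 15.3 from the description above, so this is an illustration rather than the book's procedure.

    # Group nodes by the isomorphism class of their labeled 1-neighborhoods
    # and check that every class has size >= k.
    import networkx as nx
    from networkx.algorithms import isomorphism as iso

    def neighborhood_classes(G):
        egos = {}
        for v in G:
            e = nx.ego_graph(G, v, radius=1)             # returns a copy
            nx.set_node_attributes(e, {u: (u == v) for u in e}, "is_ego")
            egos[v] = e
        match = iso.categorical_node_match(["label", "is_ego"], [None, False])
        classes = []
        for v, e in egos.items():
            for cls in classes:
                if nx.is_isomorphic(e, egos[cls[0]], node_match=match):
                    cls.append(v)
                    break
            else:
                classes.append([v])
        return classes

    G = nx.Graph()
    labels = {"x1": "A", "x2": "A", "x3": "A", "x4": "C",
              "x5": "C", "x6": "B", "x7": "B"}
    G.add_nodes_from((v, {"label": l}) for v, l in labels.items())
    G.add_edges_from([("x1", "x2"), ("x1", "x3"), ("x2", "x3"),
                      ("x4", "x5"), ("x6", "x7")])       # anonymized graph

    classes = neighborhood_classes(G)
    print(classes)                            # [['x1','x2','x3'], ['x4','x5'], ['x6','x7']]
    print(all(len(c) >= 2 for c in classes))  # True: 2-anonymous for 1-neighborhoods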

15.2.2 Equivalence Classes of Nodes

With this method, clusters of nodes are formed and selected linkages can be hidden, or the linkages among nodes of two different clusters can be "generalized" to links among the clusters instead. The clusters form equivalence classes of nodes, so that within any cluster or equivalence class, the nodes are indistinguishable from each other. For example, Figure 15.4 shows three clusters formed from Figure 15.2. An earlier work that proposes such a method is [270]; its aim is to protect against the re-identification of links. Some links are classified as sensitive and need to be protected.


FIGURE 15.4: Anonymization by clustering nodes

15.3 Anonymization Methods for Social Networks

In the rest of the chapter, we describe in more detail some of the existing anonymization methods.

15.3.1 Edge Insertion and Label Generalization

Zhou and Pei [271] consider the 1-neighborhood of a node as possible adversary background knowledge. The aim of protection is to prevent node re-identification. Here, graph changes by edge addition and changes of labels are allowed in the algorithm.

There are two main steps in the anonymization:

1. Find the 1-neighborhood of each node in the network.

2. Group the nodes by similarity and modify the neighborhoods of the nodes, with the goal of having at least k nodes with the same 1-neighborhood in the anonymized graph.

The second step above faces the difficult problem of graph isomorphism, which is one of the few problems in NP that is neither known to be solvable in polynomial time nor known to be NP-complete; hence only exponential-time algorithms are available. To handle this problem, Zhou and Pei [271] use a minimum depth-first-search coding for each node, adopted from [257].


15.3.2 Clustering Nodes for k-Anonymity

Hay et al. [110] propose to form clusters, or equivalence classes, of nodes. Clusters of nodes of size at least k are formed. The aim is to make the nodes inside each cluster indistinguishable, so that node re-identification is not possible. This method can resist attacks based on different kinds of adversary knowledge, including Hi, subgraph knowledge, and the 1-neighborhood.

In the terminology of [110], each cluster of nodes is called a supernode, and an edge between two supernodes is called a superedge. The superedges include self-loops and are labeled with a non-negative weight, which indicates the density of edges within and across supernodes.

In the generalized graph G = (V, E) in Figure 15.4, there are 3 supernodes, S1, S2, S3; that is, V = {S1, S2, S3}. The superedge between S1 and S2 has a weight of 2, since there are 2 edges in the original graph of Figure 15.1 linking nodes in S1 to nodes in S2. We say that d(S1, S2) = 2. Similarly, the weight of the superedge between S1 and S3, given by d(S1, S3), is 2. Each supernode can also have a self-loop. The self-loop of S1 has a weight of 2 since, within S1, there are two edges in the original graph among the nodes of S1. In a similar manner, the self-loops of S2 and S3 are given weights d(S2, S2) = 1 and d(S3, S3) = 1.

Since each supernode in Figure 15.4 contains at least 2 nodes of the original graph, and the nodes inside each supernode are indistinguishable from each other, this graph is 2-anonymous.

In order to preserve the utility, we examine the likelihood of recovering the original graph given the anonymized graph. If the graph G = (V, E) in Figure 15.4 is published instead of that in Figure 15.1, there will be multiple possibilities for the original graph, each being a possible world. Let W(G) be the set of all possible worlds. If we assume equal probability for all possible worlds, then the probability of the original graph, which is one of the possible worlds, is given by 1/|W(G)|. This probability should be high in order to achieve high utility of the published data.

We can formulate |W (G)| by

|W(G)| = ∏_{X∈V} C( ½|X|(|X|−1), d(X,X) ) · ∏_{X,Y∈V, X≠Y} C( |X||Y|, d(X,Y) )     (15.1)

where C(n, m) denotes the binomial coefficient "n choose m."

For example, in Figure 15.4,

|W(G)| = C(½·3·2, 2) · C(½·2·1, 1) · C(½·2·1, 1) · C(3·2, 2) · C(3·2, 2) = 3 · 1 · 1 · 15 · 15 = 675
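
Equation 15.1 is easy to evaluate for Figure 15.4; the short Python sketch below (a sanity check only, with variable names of our choosing) reproduces the value 675.

    # Number of possible worlds |W(G)| for the generalized graph of
    # Figure 15.4, following Equation 15.1. The supernode sizes and
    # superedge weights are those of the worked example above.
    from math import comb

    sizes      = {"S1": 3, "S2": 2, "S3": 2}          # |X|
    self_loops = {"S1": 2, "S2": 1, "S3": 1}          # d(X, X)
    superedges = {("S1", "S2"): 2, ("S1", "S3"): 2}   # d(X, Y) for X != Y

    W = 1
    for X, n in sizes.items():
        W *= comb(n * (n - 1) // 2, self_loops[X])    # edge choices inside X
    for (X, Y), d in superedges.items():
        W *= comb(sizes[X] * sizes[Y], d)             # edge choices across X, Y
    print(W)                                          # 675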

The algorithm in [110] follows a simulated annealing method to search for a good solution. Starting with a graph containing a single supernode that holds all the original nodes, the algorithm repeatedly tries to find alternatives by splitting a supernode, merging two supernodes, or moving an original node from one supernode to another. If the probability 1/|W(G)| increases with the newly formed alternative, the alternative is accepted for exploration. The termination criterion is that fewer than 10% of the candidate alternatives are accepted.

15.3.3 Supergraph Generation

In the model proposed by Liu and Terzi [159], the adversary attack is based on the degree of the target node; that is, the knowledge of the adversary is given by H1. Both edge insertion and deletion are allowed to modify the given graph to form an anonymized graph. Given a graph G = (V, E), if we sort the degrees of all nodes, the sorted list is the degree sequence of G. Anonymization is based on the construction of a graph that follows a given degree sequence.

DEFINITION 15.3 k-anonymous graph The degree sequence d of a graph G is k-anonymous if and only if each degree value in d appears at least k times in d.

For example, d = {5, 5, 3, 3, 2, 2, 2} is 2-anonymous, but not 3-anonymous. The degree sequence of Figure 15.1 is {4, 3, 2, 2, 2, 2, 1}, which is not k-anonymous for any k greater than 1.

In Figure 15.1, if we delete edge (x2, x6) and add edge (x3, x7), then the degree sequence of the modified graph becomes {3, 3, 2, 2, 2, 2, 2}, which is 2-anonymous. Obviously, if an adversary only knows the degree of a target node, then there are always at least 2 nodes that have the same degree, and they are not distinguishable.
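
Definition 15.3 can be checked directly; the sketch below (function name is ours) tests the sequences discussed above.

    # Check whether a degree sequence is k-anonymous (Definition 15.3):
    # every degree value must occur at least k times.
    from collections import Counter

    def is_k_anonymous(degrees, k):
        return all(count >= k for count in Counter(degrees).values())

    print(is_k_anonymous([5, 5, 3, 3, 2, 2, 2], 2))    # True
    print(is_k_anonymous([5, 5, 3, 3, 2, 2, 2], 3))    # False
    print(is_k_anonymous([4, 3, 2, 2, 2, 2, 1], 2))    # False (original sequence)
    print(is_k_anonymous([3, 3, 2, 2, 2, 2, 2], 2))    # True  (after modification)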

To preserve the utility of the published data, we measure the distance of a new degree sequence d̂ from the original sequence d by the following:

L1(d̂ − d) = Σ_i |d̂(i) − d(i)|     (15.2)

where d(i) refers to the i-th value in the list d. There are two main steps in the anonymization:

1. Given the degree sequence d, construct a new degree sequence d̂ that is k-anonymous and such that the degree-anonymization cost

DA(d̂, d) = L1(d̂ − d)     (15.3)

is minimized.

2. With d̂, try to construct a graph Ĝ = (V, Ê) such that dĜ = d̂ and Ê ∩ E = E, that is, Ĝ contains all the edges of the original graph G.

A dynamic programming algorithm is proposed for Step 1, which solves the problem in polynomial time. For the second step, an algorithm called ConstructGraph is taken from [80] as the backbone; this algorithm constructs a graph from a given degree sequence. After Step 1, we obtain a k-anonymous degree sequence d̂, which is input to ConstructGraph. If the resulting graph G1 is a supergraph of G, then we can return G1. However, Step 2 may fail, so that some edges of G are not in G1. In that case, the degree sequence can be relaxed by increasing the values of some degrees while maintaining k-anonymity, and ConstructGraph is run again on the new sequence. If it is still not successful, deletion of the edges that violate the supergraph condition is considered.

15.3.4 Randomized Social Networks

In the report [111], an anonymization technique based on random edge deletions and insertions is proposed, which can resist H1 attacks. However, the utility degradation from this anonymization is steep.

Ying and Wu [264] quantify the relationship between the amount of randomization and the protection against link re-identification. A randomization strategy is proposed that preserves the spectral properties of the graph. With this method, the utility of the published graph is enhanced.

15.3.5 Releasing Subgraphs to Users: Link Recovery

Korolova et al. [143] consider link privacy: the goal of an adversary is to find a fraction of the links in the network. The links represent relationships among users and are deemed sensitive. Here the graph is not released, and the owner of the network would like to hide the links. Each user can see his/her own neighborhood in the network. An adversary can bribe a number of users to gain information about other links.

For example, on Facebook, each user can determine whether to allow his/her friends, and possibly friends of friends, to see his/her friend list.

15.3.6 Not Releasing the Network

Rastogi et al. [192] consider a different scenario that does not publish the social network. Instead, from the system point of view, the input is the network I and a query q on I, and the output is an approximation A(I, q) of the query answer q(I). Here I is treated as a relational database containing tuples. The adversary can revise its estimate of any tuple in I given the values of A(I, q). The aim is to resist attacks in which the estimate becomes close to the actual values.


15.4 Data Sets

In almost all existing works, the proposed ideas or methods have been tested on some data sets. It is important to show that the methods work on some real data. Here we list some of the data sets that have been used.

In [111], three sets of real data are tested:

1. Hep-Th: data from the arXiv archive and the Stanford Linear Accelerator Center SPIRES-HEP database. It describes papers and authors in theoretical high-energy physics.

2. Enron: we have introduced this data set in the beginning of the chapter.

3. Net-trace: this data set is derived from an IP-level network trace collected at a major university, with 201 internal addresses from a single campus and over 400 external addresses. Another set, Net-common, is derived from Net-trace with only the internal nodes.

Zhou and Pei [271] adopt a co-authorship data set from the KDD Cup 2003 for a set of papers in high-energy physics. In the co-authorship graph, each node is an author and an edge between two authors represents co-authorship of one or more papers. There are over 57,000 nodes and 120,000 edges in the graph, with an average degree of about 4. There is in fact no privacy issue with this data set, but it is a real social network data set.

In addition to real data sets, synthetic data from the R-MAT graph model has been used, which follows the power law [86] for the node degree distribution and has small-world characteristics [224].

15.5 Summary

We have seen a number of techniques for privacy-preserving publishing of social networks in this chapter. The research is still at an early stage, and there are different issues to be investigated in the future; we list some such directions in the following.

• Enhancing the utility of the published social networks

With known methods, privacy is achieved at a trade-off against the utility of the published data. Since the problems are typically very hard and an optimal solution is not attainable, there is room to find better solutions that incur less information loss.


• Multiple releases for updated social networks

Social networks typically evolve continuously and quickly, so that they keep changing each day, with new nodes added, new edges added, or old edges deleted. It would be interesting to consider how privacy may be breached, and how it can be protected, when we release the social networks more than once.

• Privacy preservation on attributes of individuals

So far, most of the existing works assume a simple network model, where there are no attributes attached to a node. It is interesting to consider the case where additional attributes are attached and such attributes can be sensitive.

• k-anonymity is not sufficient to protect privacy in some cases. Suppose a published network is k-anonymous, so that there are k nodes that can be the target node. However, if all of these k nodes have the same sensitive property, then the adversary still succeeds in breaching the privacy.

In conclusion, privacy-preserving publishing of social networks remains a challenging problem, since graph problems are typically difficult and there can be many different kinds of adversary attacks. It is an important problem, and it will be interesting to look for new solutions to the open issues.


Chapter 16

Sanitizing Textual Data

16.1 Introduction

All the works studied in previous chapters focus on anonymizing structured relational and transaction data. What about the sensitive, person-specific information in unstructured text documents?

Sanitization of text documents involves removing sensitive information and potential linking information that can associate an individual person with the sensitive information. Documents have to be sanitized for a variety of reasons. For example, government agencies have to remove the sensitive information and/or person-specific identifiable information from some classified documents before making them available to the public, so that secrecy and privacy are protected. Hospitals may need to sanitize some sensitive information in patients' medical reports before sharing them with other healthcare agencies, such as government health departments, drug companies, and research institutes.

To guarantee that the privacy and secrecy requirements are met, the process of document sanitization is still performed manually in most government agencies and healthcare institutes. Automatic sanitization is still a research area in its infancy. An earlier work called Scrub [214] finds identifying information of individuals, such as names, locations, and medical terms, and replaces it with other terms of a similar type, such as fake names and locations. In this chapter, we focus on some recently developed text sanitization techniques for privacy protection. We can broadly divide the literature on text sanitization into two categories:

1. Association with structured or semi-structured data. Textual data itself is unstructured, but some of its information could be associated with structured or semi-structured data, such as a relational database. The database could keep track of a set of individuals or entities. The general idea of document sanitization in this scenario is to remove some terms in the documents so that the adversary cannot link a document to an entity in the structured data. This problem is studied in Chapter 16.2.

2. No association with structured or semi-structured data. This family of sanitization methods relies on information extraction tools to identify entities, such as names, phone numbers, and diseases, from the textual documents, and then suppresses or generalizes the terms if they contain identifiable or sensitive information. The reliability of the sanitization and the quality of the resulting documents depend largely on the quality of the information extraction modules. This problem is studied in Chapter 16.3.

16.2 ERASE

Chakaravarthy et al. [44] introduce the ERASE system to sanitize documents with the goal of minimal distortion. External knowledge is required to associate a database of entities with their context. ERASE prevents disclosure of protected entities by removing certain terms of their context so that no protected entity can be inferred from the remaining document text. K-safety, in the same spirit as k-anonymity, is then defined: a set of terms is K-safe if, for every protected entity, its intersection with that entity's context is contained in the contexts of at least K other entities. The proposed problem is then to find the maximum cardinality subset of a document satisfying K-safety. Chakaravarthy et al. [44] propose and evaluate both a globally optimal algorithm and an efficient greedy algorithm to achieve K-safety.

16.2.1 Sanitization Problem for Documents

Chakaravarthy et al. [44] model public knowledge as a database of entities, denoted by E, such as persons, products, and diseases. Each entity e ∈ E is associated with a set of terms, which forms the context of e, denoted by C(e). For instance, the context of a person entity could include his/her name, age, city of birth, and job. The database can be in the form of structured relational data or unstructured textual data, for example, an employee list in a company or Wikipedia [44]. The database can be composed manually or extracted automatically using an information extraction method [15].

16.2.2 Privacy Model: K-Safety

The data holder specifies some entities P ⊆ E to be protected. These are the entities that need to be protected against identity linkage. For instance, in a database of diseases, certain diseases, e.g., AIDS, can be marked as protected. Let D be the input document to be sanitized. A document contains a set of terms. We assume that D contains only terms that are in the context of some protected entity, i.e., D ⊆ ∪e∈P C(e). Any term not in ∪e∈P C(e) has no effect on privacy and therefore need not be removed.


Suppose a document D contains a set of terms S that appears in the context of some protected entity e, that is, S ⊆ D and S ⊆ C(e) for some e ∈ P. The adversary may use S to link the protected entity e with the document D. The likelihood of a successful identity linkage depends on how many entities in E − {e} contain S in their context as well. Chakaravarthy et al.'s privacy model [44] assumes that the adversary has no other external knowledge about the target victim; therefore, the adversary cannot confirm whether or not the linkage is correct as long as the number of entities in E − {e} containing S in their context is large. The intuition of K-safety is to ensure that there are at least K entities in E, other than e, that can link to D, where K > 0 is an anonymity threshold specified by the data holder.

DEFINITION 16.1 K-safety Let E be a set of entities. Let T ⊆ D be a set of terms. Let P ⊆ E be a set of protected entities. For a protected entity e ∈ P, let AT(e) be the number of entities other than e that contain C(e) ∩ T in their context. The set of terms T is K-safe with respect to a protected entity e if AT(e) ≥ K. The set T is said to be K-safe if T is K-safe with respect to every protected entity [44].

16.2.3 Problem Statement

The sanitization problem for K-safety can be defined as follows.

DEFINITION 16.2 Sanitization problem for K-safety Given a set of entities E, where each entity e ∈ E is associated with a set of terms C(e), a set of protected entities P ⊆ E, a document D, and an anonymity threshold K, the sanitization problem for K-safety is to find the maximum cardinality subset of D that is K-safe.

The following example, borrowed from [44], illustrates this problem.

Example 16.1
Consider a set of entities E = {e1, e2, e3, e4, e5, e6, e7}, where P = {e1, e2, e3} is the set of protected entities. Suppose their contexts contain the following terms:

C(e1) = {ta, tb, tc}
C(e2) = {tb, td, te, tf}
C(e3) = {ta, td, tg}
C(e4) = {ta, tb, td, tg}
C(e5) = {tc, te, tf}
C(e6) = {tb, tg}
C(e7) = {ta, tb, tc, te, tf, tg}


Suppose the data holder wants to achieve 2-safety on a document D = {ta, tb, td, te, tf, tg}. The subset T1 = {tb, td, tg} is not 2-safe because C(e2) ∩ T1 = {tb, td} is contained in the context of only one other entity, e4, and therefore AT1(e2) = 1.

The subset T2 = {ta, te, tf, tg} is 2-safe because:

e1: C(e1) ∩ T2 = {ta} is contained in e3, e4, e7
e2: C(e2) ∩ T2 = {te, tf} is contained in e5, e7
e3: C(e3) ∩ T2 = {ta, tg} is contained in e4, e7

Since AT2(ei) ≥ 2 for every ei ∈ P, T2 is 2-safe with respect to P. It can be verified that any Tx ⊃ T2 is not 2-safe, so T2 is an optimal solution.
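
The example can be verified programmatically; the minimal Python sketch below (function names are ours) implements the test of Definition 16.1 on the contexts of Example 16.1.

    # K-safety check (Definition 16.1): a term set T is K-safe if, for every
    # protected entity e, at least K *other* entities contain C(e) ∩ T in
    # their context.

    contexts = {
        "e1": {"ta", "tb", "tc"},
        "e2": {"tb", "td", "te", "tf"},
        "e3": {"ta", "td", "tg"},
        "e4": {"ta", "tb", "td", "tg"},
        "e5": {"tc", "te", "tf"},
        "e6": {"tb", "tg"},
        "e7": {"ta", "tb", "tc", "te", "tf", "tg"},
    }
    protected = {"e1", "e2", "e3"}

    def A(T, e):
        """Number of entities other than e whose context contains C(e) ∩ T."""
        s = contexts[e] & T
        return sum(1 for f, C in contexts.items() if f != e and s <= C)

    def is_k_safe(T, K):
        return all(A(T, e) >= K for e in protected)

    print(is_k_safe({"tb", "td", "tg"}, 2))        # False (T1 in the example)
    print(is_k_safe({"ta", "te", "tf", "tg"}, 2))  # True  (T2 in the example)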

16.2.4 Sanitization Algorithms for K-Safety

Chakaravarthy et al. [44] present several sanitization algorithms to achieve K-safety on a given document D.

16.2.4.1 Levelwise Algorithm

A set of terms T is K-safe only if all of its subsets are K-safe. The Levelwise algorithm [44] exploits this Apriori-like property to enumerate the maximal K-safe subsets of the given document and find the maximum cardinality K-safe subset. The algorithm proceeds in a level-wise manner, starting with safe subsets of cardinality r = 1. In each iteration, the algorithm generates the candidate subsets of cardinality r + 1 based on the K-safe subsets of cardinality r generated by the previous iteration. A subset of cardinality r + 1 is a candidate only if all its subsets of cardinality r are K-safe. The algorithm terminates when none of the subsets considered is K-safe, or when r = |D|. The idea of the algorithm is very similar to the Apriori algorithm for mining frequent itemsets [18].

This Apriori-like approach computes all the maximal K-safe subsets, which is in fact unnecessary. For document sanitization, it is sufficient to find one of the maximal safe subsets. Thus, Chapter 16.2.4.2 presents a more efficient branch-and-bound strategy, called Best-First [44], that systematically prunes many unnecessary choices and converges on an optimal solution quickly.
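
The sketch below illustrates the level-wise idea on the same example; it is an illustration of the Apriori-style candidate generation under our own naming, not the authors' implementation.

    # Level-wise search for a maximum-cardinality K-safe subset of D.
    # Candidates of size r+1 are generated only from K-safe subsets of
    # size r (Apriori-style pruning).
    from itertools import combinations

    contexts = {
        "e1": {"ta", "tb", "tc"}, "e2": {"tb", "td", "te", "tf"},
        "e3": {"ta", "td", "tg"}, "e4": {"ta", "tb", "td", "tg"},
        "e5": {"tc", "te", "tf"}, "e6": {"tb", "tg"},
        "e7": {"ta", "tb", "tc", "te", "tf", "tg"},
    }
    protected = {"e1", "e2", "e3"}

    def is_k_safe(T, K):
        return all(sum(1 for f, C in contexts.items()
                       if f != e and (contexts[e] & T) <= C) >= K
                   for e in protected)

    def levelwise(D, K):
        best = frozenset()
        safe = {frozenset([t]) for t in D if is_k_safe({t}, K)}
        while safe:
            best = max(safe, key=len)
            # candidates of size r+1 whose size-r subsets are all K-safe
            cands = {s | {t} for s in safe for t in D if t not in s
                     if all(frozenset(c) in safe
                            for c in combinations(s | {t}, len(s)))}
            safe = {c for c in cands if is_k_safe(c, K)}
        return best

    D = {"ta", "tb", "td", "te", "tf", "tg"}
    print(sorted(levelwise(D, 2)))      # ['ta', 'te', 'tf', 'tg']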

16.2.4.2 Best-First Algorithm

Consider a document D with n terms {t1, . . . , tn}. We can build a binary tree of depth n such that each level represents a term and each root-to-leaf path represents a subset of D. The Best-First algorithm [44] performs a pruned search over this binary tree by expanding the most promising branch in each iteration.

For i ≤ j, we use D[i,j] to denote the substring ti ti+1 . . . tj. The Best-First algorithm maintains a collection C of K-safe subsets of the prefixes of D, where each element in C is a pair 〈s, r〉 such that s ⊆ D[1,r] and r ≤ n. A K-safe set T ⊆ D extends 〈s, r〉 if T ∩ D[1,r] = s. For each pair 〈s, r〉, the algorithm keeps track of an upper bound UB(〈s, r〉) on the size of the largest K-safe subset extending 〈s, r〉. Best-First iteratively expands the pair in C with the maximum UB(〈s, r〉).

16.2.4.3 Greedy Algorithm

Chakaravarthy et al. [44] also suggest a greedy algorithm, called Fast-BTop, to accommodate the requirement of sanitizing a large document. The heuristic aims to remove only a minimum number of terms from the document, but may occasionally remove a larger number of terms. The general idea is to iteratively delete terms from the document until it is K-safe. In each iteration, the algorithm selects a term for deletion by estimating the amount of progress made towards the K-safety goal. Refer to [44] for the different heuristic functions.

16.3 Health Information DE-identification (HIDE)

Gardner and Xiong [97] introduce a prototype system, called Health Information DE-identification (HIDE), for removing personal identifying information from healthcare-related text documents. We first discuss their privacy models, followed by the framework of HIDE.

16.3.1 De-Identification Models

According to the Health Insurance Portability and Accountability Act (HIPAA) in the United States, identifiable information refers to data explicitly linked to a particular individual and data that enables individual identification. These notions correspond to the notions of explicit identifiers, such as names and SSNs, and quasi-identifiers, such as age, gender, and postal code. The framework of HIDE [97] considers three types of de-identification models:

• Full de-identification. According to HIPAA, information is considered fully de-identified if all the explicit (direct) identifiers and quasi- (indirect) identifiers have been removed. Enforcing full de-identification often renders the data useless for most data mining tasks.

• Partial de-identification. According to HIPAA, information is considered partially de-identified if all the explicit (direct) identifiers have been removed. This model yields better data utility in the resulting anonymized data.


FIGURE 16.1: Health Information DE-identification (HIDE)

• Statistical de-identification. This type of de-identification model guarantees that the probability of identifying an individual from the released information is below a certain threshold, while also aiming to preserve as much information as possible for data mining tasks.

16.3.2 The HIDE Framework

Figure 16.1 shows an overview of the HIDE framework in three phases: attribute extraction, attribute linking, and anonymization.

Phase 1: Attribute extraction. Use a statistical learning approach for extracting identifying and sensitive information from the text documents. To facilitate the overall attribute extraction process, HIDE uses an iterative process of classifying and retagging, which allows the construction of a large training data set. Specifically, this attribute extraction process consists of four steps:

• The user tags the identifying information and sensitive attributes for building the training data set.

• Extract features from text documents for the classifier.

• Classify terms extracted from the text documents into multiple classes. Different types of identifiers and sensitive attributes could be classified into different classes.

• Feed the classified data back to the tagging step for retagging and corrections.

Phase 2: Attribute linking. Link the extracted identifying and sensitive information to an individual. This step is challenging for text documents because, in most cases, there does not exist a unique identifier that can link all the relevant information to an individual. To improve the accuracy of linking relevant information to an individual, HIDE employs an iterative two-step solution involving attribute linking and attribute extraction. The extraction component (Phase 1) extracts relevant attributes from the text and links or adds them to existing or new entities in the database. The linking component (Phase 2) links and merges the records based on the extracted attributes using existing record linkage techniques [107].

Phase 3: Anonymization. After the first two phases, HIDE has transformed the unstructured textual data into a structured identifier view, which facilitates the application of privacy models for relational data, such as k-anonymity [201, 217] and ℓ-diversity, in this final phase. Refer to Chapter 2 for details of the different privacy models. Finally, the information in the text documents is sanitized according to the anonymized identifier view.

16.4 Summary

Text sanitization is a challenging problem due to the unstructured nature of text. In this chapter, we have studied two major categories of text sanitization methods.

The first model assumes that the information in text documents is associated with some entities in a structured or semi-structured database. The goal is to sanitize some terms in the documents so that the adversary can no longer accurately associate the documents with any entities in the associated database.

The second model does not rely on an associated database; it depends on information extraction methods to retrieve entities from the text documents and sanitize them accordingly. The second category has more general applicability because it does not assume the presence of an associated database, but its reliability depends on the quality of the information extraction tools. In contrast, sanitization is more reliable in the first model because the privacy guarantee can be measured based on the number of associated entities in a well-structured database.

There are several other sanitization methods illustrated on real-life medical text documents. Kokkinakis and Thurin [141] implement a system for automatically anonymizing hospital discharge letters by identifying and deliberately removing from the clinical text all phrases that match some pre-defined types of sensitive entities. The identification phase is carried out by an underlying generic Named Entity Recognition (NER) system. Saygin et al. [205] describe implicit and explicit privacy threats in text document repositories.


Chapter 17

Other Privacy-PreservingTechniques and Future Trends

17.1 Interactive Query Model

Closely related, but orthogonal to PPDP, is the extensive literature on inference control in multilevel secure databases [87, 125]. Attribute linkages are identified and eliminated either at the database design phase [104, 115, 116], by modifying the schemas and meta-data, or at interactive query time [59, 223], by restricting and modifying queries. These techniques, which focus on database query answering, are not readily applicable to PPDP, where the data holder may not have sophisticated database management knowledge or does not want to provide an interface for database queries. A data holder, such as a hospital, has no intention of being a database server; answering database queries is not part of its normal business. Therefore, query answering is quite different from the PPDP scenarios studied in this book. Here, we briefly discuss the interactive query model.

In the interactive query model, the user can submit a sequence of queries based on previously received query results. Although this query model could improve the satisfaction of the data recipients' information needs [77], the dynamic nature of queries makes the returned results even more vulnerable to attacks, as illustrated in the following example. Refer to [32, 33, 76, 62] for more privacy-preserving techniques for the interactive query model.

Example 17.1
Suppose that an examination center allows a data miner to access its database, Table 17.1, for research purposes. The attribute Score is sensitive. An adversary wants to identify the Score of a target victim, Bob, who is a student from the computer science department at Illinois. The adversary can first submit the query

Q1: COUNT (University = Illinois) AND (Department = CS)

Since the count is 1, the adversary can determine Bob's Score = 96 by the following query

Q2: AVERAGE Score WHERE (University = Illinois) AND (Department = CS).


Table 17.1: Interactive query model: original examination data

    ID  University    Department  Score
    1   Concordia     CS          92
    2   Simon Fraser  EE          91
    3   Concordia     CS          97
    4   Illinois      CS          96

Table 17.2: Interactive query model: after adding one new record

    ID  University    Department  Score
    1   Concordia     CS          92
    2   Simon Fraser  EE          91
    3   Concordia     CS          97
    4   Illinois      CS          96
    5   Illinois      CS          99

Suppose that the data holder has inserted a new record, as shown in Table 17.2. Now, the adversary tries to identify another victim by re-submitting query Q1. Since the answer is 2, the adversary knows that another student from the computer science department at Illinois took this exam and can then submit the query

Q3: SUM Score WHERE (University = Illinois) AND (Department = CS)

Benefiting from this update, the adversary can learn the Score of the new record by calculating Q3 − Q2 = 99.
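
The differencing attack of Example 17.1 can be reproduced with a few lines of Python over in-memory records mirroring Tables 17.1 and 17.2; the record layout and the query helpers are our own illustrative choices.

    # Reproducing the differencing attack of Example 17.1 on the data of
    # Tables 17.1 and 17.2. The aggregate "queries" are plain functions.

    def records(rows):
        return [dict(zip(("id", "university", "department", "score"), r))
                for r in rows]

    t1 = records([(1, "Concordia", "CS", 92), (2, "Simon Fraser", "EE", 91),
                  (3, "Concordia", "CS", 97), (4, "Illinois", "CS", 96)])

    def match(db):
        return [r for r in db
                if r["university"] == "Illinois" and r["department"] == "CS"]

    q1 = len(match(t1))                              # COUNT   -> 1
    q2 = sum(r["score"] for r in match(t1)) / q1     # AVERAGE -> 96.0

    # One record is inserted (Table 17.2) and the adversary queries again.
    t2 = t1 + records([(5, "Illinois", "CS", 99)])
    q3 = sum(r["score"] for r in match(t2))          # SUM     -> 195

    print(q1, q2, q3, q3 - q2)   # 1 96.0 195 99.0 -> the new student's Score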

Query auditing has a long history in statistical disclosure control. It can be broadly divided into two categories: online auditing and offline auditing.

Online Auditing: The objective of online query auditing is to detect and deny queries that violate privacy requirements. Miklau and Suciu [169] measure the information disclosure of a view set V with respect to a secret view S: S is secure if publishing V does not alter the probability of inferring the answer to S. Deutsch and Papakonstantinou [60] study whether a new view discloses more information than the existing views with respect to a secret view. To put the data publishing scenario considered in this book into their terms, superficially the anonymous release can be considered the "view" and the underlying data the "secret query." However, the two problems have two major differences. First, the anonymous release is obtained by anonymization operations, not by conjunctive queries as in [60, 169]. Second, the publishing scenarios employ anonymity as the privacy measure, whereas [169] and [60] adopt perfect secrecy as the security measure. The released data satisfies perfect secrecy if the probability that the adversary finds the original data after observing the anonymous data is the same as the probability of getting the original data before observing the anonymous data.

Kenthapadi et al. [137] propose another privacy model, called simulatable auditing, for an interactive query model. If the adversary has access to all previous query results, the method denies a new query if it leaks any information beyond what the adversary already knows. Although this "detect and deny" approach is practical, Kenthapadi et al. [137] point out that the denials themselves may implicitly disclose sensitive information, making the privacy protection problem even more complicated. This motivates offline query auditing.

Offline Auditing: In offline query auditing [82], the data recipients submit their queries and receive their results. The auditor checks whether a privacy requirement has been violated after the queries have been executed. The data recipients have no access to the audit results and, therefore, the audit results do not trigger extra privacy threats as in the online mode. The objective of offline query auditing is to check for compliance with the privacy requirement, not to prevent adversaries from accessing the sensitive information.

17.2 Privacy Threats Caused by Data Mining Results

The release of data mining results or patterns could pose privacy threats. There are two broad research directions in this family.

The first direction is to anonymize the data so that sensitive data mining patterns cannot be generated. Aggarwal et al. [7] point out that simply suppressing the sensitive values chosen by individual record owners is insufficient because an adversary can use association rules learnt from the data to estimate the suppressed values. They propose a heuristic algorithm to suppress a minimal set of values to combat such attacks. Vassilios et al. [233] propose algorithms for hiding sensitive association rules in a transaction database. The general idea is to hide one rule at a time by either decreasing its support or its confidence, achieved by removing items from transactions. Rules satisfying a specified minimum support and minimum confidence are removed. However, from the viewpoint of anonymity, a rule applying to a small group of individuals (i.e., with low support) presents a more serious threat, because record owners in a small group are more identifiable.

The second direction is to anonymize the data mining patterns directly. Atzori et al. [22] make the insightful suggestion that if the goal is to release data mining results, such as frequent patterns, then it is sufficient to anonymize the patterns rather than the data. Their study suggests that anonymizing the patterns yields much better information utility than performing data mining on anonymous data. This opens up a new research direction for privacy-preserving pattern publishing. Kantarcioglu et al. [135] define an evaluation method to measure the loss of privacy due to releasing data mining results.
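
A minimal sketch of pattern-level anonymization (the threshold k and the pattern representation are our assumptions; [22] also handles subtler inference channels between the released patterns): a frequent itemset is released only if its support is at least k, so no published pattern describes a group of fewer than k record owners.

    def k_anonymous_patterns(frequent_patterns, k):
        """frequent_patterns maps an itemset (frozenset) to its support,
        i.e., the number of supporting records.  Keep only patterns whose
        support is at least k."""
        return {p: s for p, s in frequent_patterns.items() if s >= k}

    patterns = {frozenset({"Artist", "HIV"}): 3,          # supports as in Table 17.3
                frozenset({"Professional", "HIV"}): 1}
    print(k_anonymous_patterns(patterns, k=3))
    # only the support-3 pattern is released; the support-1 pattern would
    # point to a single, highly identifiable record owner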


Table 17.3: 3-anonymous patient data

    Job            Sex     Age      Disease
    Professional   Male    [35-40)  Hepatitis
    Professional   Male    [35-40)  Hepatitis
    Professional   Male    [35-40)  HIV
    Artist         Female  [30-35)  Flu
    Artist         Female  [30-35)  HIV
    Artist         Female  [30-35)  HIV
    Artist         Female  [30-35)  HIV


The classifier attack is a variant of the attribute linkage attack. Suppose a data miner has released a classifier, not the data, that models a sensitive attribute. An adversary can make use of the classifier to infer sensitive values of individuals. This scenario presents a dilemma between privacy protection and data mining because the target attribute to be modeled is also the sensitive attribute to be protected. A naive solution is to suppress sensitive classification rules, but it may defeat the data mining goal. Another possible solution is to build a classifier from the anonymous data that has bounded confidence or breach probability on some selected sensitive values. Alternatively, record owners may specify some guarding nodes on their own records, as discussed in Chapter 2.2.

Example 17.2
A data recipient has built a classifier on the target attribute Disease from Table 17.3, and then released two classification rules:

〈Professional, Male, [35-40)〉 → Hepatitis
〈Artist, Female, [30-35)〉 → HIV

The adversary can use these rules to infer that Emily, who is a female artist of age 30, has HIV. Even though the data recipient has not released any data, the adversary can confidently make such an inference if the adversary knows the qid of Emily, who comes from the same population from which the classifier was built. Even if the inference is not correct, the adversary can still make decisions based on it.
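
As a minimal sketch of the “bounded confidence” remedy mentioned above (the rule format and the bound of 0.6 are hypothetical choices): before releasing classification rules, the publisher withholds any rule that predicts a sensitive value with confidence above the bound, since such a rule lets an adversary who knows a matching qid infer the sensitive value with exactly that confidence.

    SENSITIVE = {"HIV"}

    def releasable(rules, max_conf=0.6):
        """Each rule is (qid, predicted_value, confidence).  Keep only rules
        whose prediction is non-sensitive or whose confidence stays below
        the bound max_conf."""
        return [r for r in rules if r[1] not in SENSITIVE or r[2] < max_conf]

    rules = [(("Professional", "Male", "[35-40)"), "Hepatitis", 2 / 3),
             (("Artist", "Female", "[30-35)"), "HIV", 3 / 4)]     # confidences from Table 17.3
    print(releasable(rules))
    # the HIV rule (confidence 0.75) is withheld; the Hepatitis rule is released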

17.3 Privacy-Preserving Distributed Data Mining

Privacy-preserving distributed data mining (PPDDM) is a cousin research topic of privacy-preserving data publishing (PPDP). PPDDM assumes a scenario in which multiple data holders want to collaboratively perform data mining on the union of their data without revealing their sensitive information. PPDDM usually employs cryptographic solutions. Although the ultimate goal of both PPDDM and PPDP is to perform data mining, they have very different assumptions on data ownership, attack models, privacy models, and solutions, so PPDDM is out of the scope of this book. We refer readers interested in PPDDM to [50, 132, 187, 227, 248].

17.4 Future Directions

Information sharing has become part of the routine activities of many individuals, companies, organizations, and government agencies. Privacy-preserving data publishing is a promising approach to information sharing that preserves individual privacy and protects sensitive information. In this book, we reviewed the recent developments in the field. The general objective is to transform the original data into some anonymous form to prevent inferring its record owners' sensitive information. We presented our views on the difference between privacy-preserving data publishing and privacy-preserving data mining, and a list of desirable properties of a privacy-preserving data publishing method. We reviewed and compared existing methods in terms of privacy models, anonymization operations, information metrics, and anonymization algorithms. Most of these approaches assumed a single release from a single publisher, and thus protected the data only up to the first release or the first recipient. We also reviewed several works on more challenging publishing scenarios, including multiple release publishing, sequential release publishing, continuous data publishing, and collaborative data publishing.

Privacy protection is a complex social issue that involves policy making, technology, psychology, and politics. Privacy protection research in computer science can provide only technical solutions to the problem. Successful application of privacy-preserving technology will rely on the cooperation of policy makers in governments and decision makers in companies and organizations. Unfortunately, while the deployment of privacy-threatening technology, such as RFID and social networks, grows quickly, the implementation of privacy-preserving technology in real-life applications remains very limited. As the gap widens, we foresee that the number of incidents and the scope of privacy breaches will increase in the near future. Below, we identify a few potential research directions in privacy preservation, together with some desirable properties that could help the general public, decision makers, and systems engineers adopt privacy-preserving technology.

• Privacy-preserving tools for individuals. Most previous privacy-preserving techniques were proposed for data holders, but individual record owners should also have the rights and responsibilities to protect their own private information. There is an urgent need for personalized privacy-preserving tools, such as a privacy-preserving web browser and a minimal information disclosure protocol for e-commerce activities. It is important that the privacy-preserving notions and tools developed be intuitive for novice users. Xiao and Tao's work [250] on “Personalized Privacy Preservation” provides a good start, but little work has been conducted in this direction.

• Privacy protection in emerging technologies. Emerging technologies, like location-based services [23, 114, 265], RFID [240], bioinformatics, and mashup web applications, enhance our quality of life. These new technologies allow corporations and individuals to access previously unavailable information and knowledge; however, they also bring up many new privacy issues. Nowadays, once a new technology has been adopted by a small community, it can become very popular in a short period of time. A typical example is the social network application Facebook. Since its deployment in 2004, it has acquired 70 million active users. Due to the massive number of users, the harm could be extensive if the new technology is misused. One research direction is to customize existing privacy-preserving models for emerging technologies.

• Incorporating privacy protection into the engineering process. The issue of privacy protection is often considered only after the deployment of a new technology. Typical examples are the deployments of mobile devices with location-based services [1, 23, 114, 265], sensor networks, and social networks. Preferably, privacy should be treated as a primary requirement in the engineering process of developing a new technology. This involves formal specification of privacy requirements and formal verification tools to prove the correctness of a privacy-preserving system.

Finally, we emphasize that privacy-preserving technology solves only one side of the problem. It is equally important to identify and overcome the non-technical difficulties faced by decision makers when they deploy a privacy-preserving technology. Their typical concerns include the degradation of data/service quality, loss of valuable information, increased costs, and increased complexity. We believe that cross-disciplinary research is the key to removing these obstacles, and we urge computer scientists in the privacy protection field to conduct cross-disciplinary research with social scientists in sociology, psychology, and public policy studies. A better understanding of the privacy problem from different perspectives can help realize successful applications of privacy-preserving technology.


References

[1] O. Abul, F. Bonchi, and M. Nanni. Never walk alone: Uncertainty for anonymity in moving objects databases. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE), pages 376–385, April 2008.

[2] D. Achlioptas. Database-friendly random projections. In Proc. of the20th ACM Symposium on Principles of Database Systems (PODS),pages 274–281, May 2001.

[3] N. R. Adam and J. C. Wortman. Security control methods for statisticaldatabases. ACM Computer Surveys, 21(4):515–556, December 1989.

[4] E. Adar. User 4xxxxx9: Anonymizing query logs. In Proc. of the 16thInternational Conference on World Wide Web (WWW), 2007.

[5] E. Adar, D. S. Weld, B. N. Bershad, and S. D. Gribble. Why we search:Visualizing and predicting user behavior. In Proc. of the 16th Interna-tional Conference on World Wide Web (WWW), pages 161–170, 2007.

[6] C. C. Aggarwal. On k-anonymity and the curse of dimensionality.In Proc. of the 31st Very Large Data Bases (VLDB), pages 901–909,Trondheim, Norway, 2005.

[7] C. C. Aggarwal, J. Pei, and B. Zhang. On privacy preservation against adversarial data mining. In Proc. of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, August 2006.

[8] C. C. Aggarwal and P. S. Yu. A condensation approach to privacypreserving data mining. In Proc. of the International Conference onExtending Database Technology (EDBT), pages 183–199, 2004.

[9] C. C. Aggarwal and P. S. Yu. A framework for condensation-basedanonymization of string data. Data Mining and Knowledge Discovery(DMKD), 13(3):251–275, February 2008.

[10] C. C. Aggarwal and P. S. Yu. On static and dynamic methods forcondensation-based privacy-preserving data mining. ACM Transactionson Database Systems (TODS), 33(1), March 2008.

[11] C. C. Aggarwal and P. S. Yu. Privacy-Preserving Data Mining: Modelsand Algorithms. Springer, March 2008.


[12] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy,D. Thomas, and A. Zhu. Anonymizing tables. In Proc. of the 10thInternational Conference on Database Theory (ICDT), pages 246–258,Edinburgh, UK, January 2005.

[13] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy,D. Thomas, and A. Zhu. Achieving anonymity via clustering. In Proc. ofthe 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principlesof Database Systems (PODS), Chicago, IL, June 2006.

[14] G. Aggarwal, N. Mishra, and B. Pinkas. Secure computation of thekth-ranked element. In Proc. of the Eurocyrpt, 2004.

[15] E. Agichtein, L. Gravano, J. Pavel, V. Sokolova, and A. Voskoboynik.Snowball: A prototype system for extracting relations from large textcollections. ACM SIGMOD Record, 30(2):612, 2001.

[16] D. Agrawal and C. C. Aggarwal. On the design and quantification of privacy preserving data mining algorithms. In Proc. of the 20th ACM Symposium on Principles of Database Systems (PODS), pages 247–255, Santa Barbara, CA, May 2001.

[17] R. Agrawal, A. Evfimievski, and R. Srikant. Information sharing acrossprivate databases. In Proc. of ACM International Conference on Man-agement of Data (SIGMOD), San Diego, CA, 2003.

[18] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rulesbetween sets of items in large databases. In Proc. of ACM InternationalConference on Management of Data (SIGMOD), 1993.

[19] R. Agrawal and R. Srikant. Privacy preserving data mining. In Proc.of ACM International Conference on Management of Data (SIGMOD),pages 439–450, Dallas, Texas, May 2000.

[20] R. Agrawal, R. Srikant, and D. Thomas. Privacy preserving olap. InProc. of ACM International Conference on Management of Data (SIG-MOD), pages 251–262, 2005.

[21] S. Agrawal and J. R. Haritsa. A framework for high-accuracy privacy-preserving mining. In Proc. of the 21st IEEE International Conferenceon Data Engineering (ICDE), pages 193–204, Tokyo, Japan, April 2005.

[22] M. Atzori, F. Bonchi, F. Giannotti, and D. Pedreschi. Anonymity preserving pattern discovery. International Journal on Very Large Data Bases (VLDBJ), 17(4):703–727, July 2008.

[23] M. Atzori, F. Bonchi, F. Giannotti, D. Pedreschi, and O. Abul. Privacy-aware knowledge discovery from location data. In Proc. of the International Workshop on Privacy-Aware Location-based Mobile Services (PALMS), pages 283–287, May 2007.


[24] R. Axelrod. The Evolution of Cooperation. Basic Books, New York,1984.

[25] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou r3579x?Anonymized social networks, hidden patterns and structural steganog-raphy. In Proc. of the International World Wide Web Conference(WWW), Banff, Alberta, May 2007.

[26] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Tal-war. Privacy, accuracy, and consistency too: A holistic solution to con-tingency table release. In Proc. of the 26th ACM Symposium on Princi-ples of Database Systems (PODS), pages 273–282, Beijing, China, June2007.

[27] M. Barbaro and T. Zeller. A face is exposed for aol searcher no. 4417749.New York Times, August 9, 2006.

[28] M. Barbaro, T. Zeller, and S. Hansell. A face is exposed for aol searcherno. 4417749. New York Times, August 2006.

[29] R. J. Bayardo and R. Agrawal. Data privacy through optimal k-anonymization. In Proc. of the 21st IEEE International Conferenceon Data Engineering (ICDE), pages 217–228, Tokyo, Japan, 2005.

[30] Z. Berenyi and H. Charaf. Retrieving frequent walks from trackingdata in RFID-equipped warehouses. In Proc. of Conference on HumanSystem Interactions, pages 663–667, May 2008.

[31] A. R. Beresford and F. Stajano. Location privacy in pervasive comput-ing. IEEE Pervasive Computing, 1:46–55, 2003.

[32] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: Thesulq framework. In Proc. of the 24th ACM Symposium on Principles ofDatabase Systems (PODS), pages 128–138, Baltimore, MD, June 2005.

[33] A. Blum, K. Ligett, and A. Roth. A learning theory approach tonon-interactive database privacy. In Proc. of the 40th annual ACMSymposium on Theory of Computing (STOC), pages 609–618, Victoria,Canada, 2008.

[34] B. Bollobas. Random Graphs. Cambridge, 2001.

[35] F. Bonchi, F. Giannotti, and D. Pedreschi. Blocking anonymity threatsraised by frequent itemset mining. In Proc. of the 5th IEEE Interna-tional Conference on Data Mining (ICDM), 2005.

[36] R. Brand. Microdata protection through noise addition. In InferenceControl in Statistical Databases, From Theory to Practice, pages 97–116, London, UK, 2002.


[37] S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Gener-alizing association rules to correlations. In Proc. of ACM InternationalConference on Management of Data (SIGMOD), 1997.

[38] Y. Bu, A. W. C. Fu, R. C. W. Wong, L. Chen, and J. Li. Privacy pre-serving serial data publishing by role composition. Proc. of the VLDBEndowment, 1(1):845–856, August 2008.

[39] D. Burdick, M. Calimlim, J. Flannick, J. Gehrke, and T. Yiu. MAFIA: Amaximal frequent itemset algorithm. IEEE Transactions on Knowledgeand Data Engineering (TKDE), 17(11):1490–1504, 2005.

[40] L. Burnett, K. Barlow-Stewart, A. Pros, and H. Aizenberg. The genetrustee: A universal identification system that ensures privacy and con-fidentiality for human genetic databases. Journal of Law and Medicine,10:506–513, 2003.

[41] Business for Social Responsibility. BSR Report on Privacy, 1999.http://www.bsr.org/.

[42] J.-W. Byun, Y. Sohn, E. Bertino, and N. Li. Secure anonymization forincremental datasets. In Proc. of the VLDB Workshop on Secure DataManagement (SDM), 2006.

[43] D. M. Carlisle, M. L. Rodrian, and C. L. Diamond. California inpatientdata reporting manual, medical information reporting for california, 5thedition. Technical report, Office of Statewide Health Planning and De-velopment, July 2007.

[44] V. T. Chakaravarthy, H. Gupta, P. Roy, and M. Mohania. Efficient tech-niques for documents sanitization. In Proc. of the ACM 17th Conferenceon Information and Knowledge Management (CIKM), 2008.

[45] D. Chaum. Untraceable electronic mail, return addresses, and digitalpseudonyms. Communications of the ACM, 24(2):84–88, 1981.

[46] S. Chawla, C. Dwork, F. McSherry, A. Smith, and H. Wee. Towardprivacy in public databases. In Proc. of Theory of Cryptography Con-ference (TCC), pages 363–385, Cambridge, MA, February 2005.

[47] S. Chawla, C. Dwork, F. McSherry, and K. Talwar. On privacy-preserving histograms. In Proc. of Uncertainty in Artificial Intelligence(UAI), Edinburgh, Scotland, July 2005.

[48] B.-C. Chen, R. Ramakrishnan, and K. LeFevre. Privacy skyline: Pri-vacy with multidimensional adversarial knowledge. In Proc. of the 33rdInternational Conference on Very Large Databases (VLDB), 2007.

[49] C. Clifton. Using sample size to limit exposure to data mining. Journalof Computer Security, 8(4):281–307, 2000.


[50] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Y. Zhu. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter, 4(2):28–34, December 2002.

[51] Confidentiality and Data Access Committee. Report on statistical dis-closure limitation methodology. Technical Report 22, Office of Manage-ment and Budget, December 2005.

[52] L. H. Cox. Suppression methodology and statistical disclosure control.Journal of the American Statistical Association, 75(370):377–385, June1980.

[53] H. Cui, J. Wen, J. Nie, and W. Ma. Probabilistic query expansion usingquery logs. In Proc. of the 11th International Conference on World WideWeb (WWW), 2002.

[54] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetricmatrices. In 4th National ACM Conference, 1969.

[55] T. Dalenius. Towards a methodology for statistical disclosure control.Statistik Tidskrift, 15:429–444, 1977.

[56] T. Dalenius. Finding a needle in a haystack - or identifying anonymouscensus record. Journal of Official Statistics, 2(3):329–336, 1986.

[57] R. N. Dave and R. Krishnapuram. Robust clustering methods: A unifiedview. IEEE Transactions on Fuzzy Systems, 5(2):270–293, May 1997.

[58] U. Dayal and H. Y. Hwang. View definition and generalization fordatabase integration in a multidatabase system. IEEE Transactions onSoftware Engineering (TSE), 10(6):628–645, 1984.

[59] D. E. Denning. Commutative filters for reducing inference threats inmultilevel database systems. In Proc. of the IEEE Symposium on Se-curity and Privacy (S&P), Oakland, CA, April 1985.

[60] A. Deutsch and Y. Papakonstantinou. Privacy in database publishing. In Proc. of the 10th International Conference on Database Theory (ICDT), pages 230–245, Edinburgh, UK, January 2005.

[61] C. Diaz, S. Seys, J. Claessens, and B. Preneel. Towards measuring anonymity. In Proc. of the 2nd International Workshop on Privacy Enhancing Technologies (PET), pages 54–68, San Francisco, CA, April 2002.

[62] I. Dinur and K. Nissim. Revealing information while preserving pri-vacy. In Proc. of the 22nd ACM Symposium on Principles of DatabaseSystems (PODS), pages 202–210, San Diego, June 2003.

[63] J. Domingo-Ferrer. Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, chapter Disclosure Control Methods and Information Loss for Microdata, pages 91–110. 2001.

[64] J. Domingo-Ferrer. Privacy-Preserving Data Mining: Models and Al-gorithms, chapter A Survey of Inference Control Methods for Privacy-Preserving Data Mining, pages 53–80. Springer, 2008.

[65] J. Domingo-Ferrer and V. Torra. Theory and Practical Applications forStatistical Agencies, chapter A Quantitative Comparison of DisclosureControl Methods for Microdata, Confidentiality, Disclosure and DataAccess, pages 113–134. North-Holland, Amsterdam, 2002.

[66] J. Domingo-Ferrer and V. Torra. A critique of k-anonymity and someof its enhancements. In Proc. of the 3rd International Conference onAvailability, Reliability and Security (ARES), pages 990–993, 2008.

[67] G Dong and J. Li. Efficient mining of emerging patterns: discoveringtrends and differences. In Proc. of 5th ACM International Conferenceon Knowledge Discovery and Data Mining (SIGKDD), 1999.

[68] Z. Dou, R. Song, and J. Wen. A large-scale evaluation and analysis of personalized search strategies. In Proc. of the 16th International Conference on World Wide Web (WWW), 2007.

[69] W. Du, Y. S. Han, and S. Chen. Privacy-preserving multivariate statis-tical analysis: Linear regression and classification. In Proc. of the SIAMInternational Conference on Data Mining (SDM), Florida, 2004.

[70] W. Du, Z. Teng, and Z. Zhu. Privacy-MaxEnt: Integrating backgroundknowledge in privacy quantification. In Proc. of the 34th ACM Inter-national Conference on Management of Data (SIGMOD), 2008.

[71] W. Du and Z. Zhan. Building decision tree classifier on private data.In Workshop on Privacy, Security, and Data Mining at the 2002 IEEEInternational Conference on Data Mining, Maebashi City, Japan, De-cember 2002.

[72] W. Du and Z. Zhan. Using randomized response techniques for privacypreserving data mining. In Proc. of the 9the ACM International Con-ference on Knowledge Discovery and Data Mining (SIGKDD), pages505–510, 2003.

[73] G. Duncan and S. Fienberg. Obtaining information while preservingprivacy: A Markov perturbation method for tabular data. In StatisticalData Protection, pages 351–362, 1998.

[74] C. Dwork. Differential privacy. In Proc. of the 33rd International Col-loquium on Automata, Languages and Programming (ICALP), pages1–12, Venice, Italy, July 2006.


[75] C. Dwork. Ask a better question, get a better answer: A new approachto private data analysis. In Proc. of the Interntaional Conference onDatabase Theory (ICDT), pages 18–27, Barcelona, Spain, January 2007.

[76] C. Dwork. Differential privacy: A survey of results. In Proc. of the5th International Conference on Theory and Applications of Models ofComputation (TAMC), pages 1–19, Xian, China, April 2008.

[77] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise tosensitivity in private data analysis. In Proc. of the 3rd Theory of Cryp-tography Conference (TCC), pages 265–284, New York, March 2006.

[78] C. Dwork and K. Nissim. Privacy-preserving data mining on verticallypartitioned databases. In Proceedings of the 24th International Cryp-tology Conference (CRYPTO), pages 528–544, Santa Barbara, August2004.

[79] K. E. Emam. Data anonymization practices in clinical research: A de-scriptive study. Technical report, Access to Information and PrivacyDivision of Health Canada, May 2006.

[80] P. Erdos and T. Gallai. Graphs with prescribed degrees of vertices.Mat. Lapok, 1960.

[81] A. Evfimievski. Randomization in privacy-preserving data mining.ACM SIGKDD Explorations Newsletter, 4(2):43–48, December 2002.

[82] A. Evfimievski, R. Fagin, and D. P. Woodruff. Epistemic privacy. In Proc. of the 27th ACM Symposium on Principles of Database Systems (PODS), pages 171–180, Vancouver, Canada, 2008.

[83] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breachesin privacy preserving data mining. In Proc. of the 22nd ACMSIGMOD-SIGACT-SIGART Symposium on Principles of DatabaseSystems (PODS), pages 211–222, 2002.

[84] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy preserv-ing mining of association rules. In Proc. of 8th ACM International Con-ference on Knowledge Discovery and Data Mining (SIGKDD), pages217–228, Edmonton, AB, Canada, July 2002.

[85] R. Fagin, Amnon Lotem, and Moni Naor. Optimal aggregation algo-rithms for middleware. In Proc. of the 20th ACM Symposium on Princi-ples of Database Systems (PODS), pages 102–113, Santa Barbara, CA,2001.

[86] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relation-ships of the internet topology. In Proc. of the Conference on Appli-cations, Technologies, Architectures, and Protocols for Computer Com-munication (SIGCOMM), pages 251–262, 1999.


[87] C. Farkas and S. Jajodia. The inference problem: A survey. ACMSIGKDD Explorations Newsletter, 4(2):6–11, 2003.

[88] T. Fawcett and F. Provost. Activity monitoring: Noticing interestingchanges in behavior. In Proc. of the 5th ACM International Conferenceon Knowledge Discovery and Data Mining (SIGKDD), pages 53–62, SanDiego, CA, 1999.

[89] A. W. C. Fu, R. C. W. Wong, and K. Wang. Privacy-preserving frequentpattern mining across private databases. In Proc. of the 5th IEEE Inter-national Conference on Data Mining (ICDM), pages 613–616, Houston,TX, November 2005.

[90] W. A. Fuller. Masking procedures for microdata disclosure limitation.Official Statistics, 9(2):383–406, 1993.

[91] B. C. M. Fung, M. Cao, B. C. Desai, and H. Xu. Privacy protection forRFID data. In Proc. of the 24th ACM SIGAPP Symposium on AppliedComputing (SAC), pages 1528–1535, Honolulu, HI, March 2009.

[92] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preservingdata publishing: A survey on recent developments. ACM ComputingSurveys, 42(4), December 2010.

[93] B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Anonymity for con-tinuous data publishing. In Proc. of the 11th International Conferenceon Extending Database Technology (EDBT), pages 264–275, Nantes,France, March 2008.

[94] B. C. M. Fung, K. Wang, L. Wang, and P. C. K. Hung. Privacy-preserving data publishing for cluster analysis. Data & Knowledge En-gineering (DKE), 68(6):552–575, June 2009.

[95] B. C. M. Fung, K. Wang, and P. S. Yu. Top-down specialization forinformation and privacy preservation. In Proc. of the 21st IEEE In-ternational Conference on Data Engineering (ICDE), pages 205–216,Tokyo, Japan, April 2005.

[96] B. C. M. Fung, Ke Wang, and P. S. Yu. Anonymizing classification datafor privacy preservation. IEEE Transactions on Knowledge and DataEngineering (TKDE), 19(5):711–725, May 2007.

[97] J. Gardner and L. Xiong. An integrated framework for de-identifyingheterogeneous data. Data and Knowledge Engineering (DKE),68(12):1441–1451, December 2009.

[98] B. Gedik and L. Liu. Protecting location privacy with personalized k-anonymity: Architecture and algorithms. IEEE Transactions on MobileComputing (TMC), 7(1):1–18, January 2008.


[99] J. Gehrke. Models and methods for privacy-preserving data publishingand analysis. In Tutorial at the 12th ACM International Conference onKnowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA,August 2006.

[100] L. Getor and C. P. Diehl. Link mining: A survey. ACM SIGKDDExplorations Newsletter, 7(2):3–12, 2005.

[101] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In Proc. of the 24th IEEE International Conferenceon Data Engineering (ICDE), pages 715–724, April 2008.

[102] F. Giannotti, M. Nanni, D. Pedreschi, and F. Pinelli. Trajectory pat-tern mining. In Proc. of the 13th ACM International Conference onKnowledge Discovery and Data Mining (SIGKDD), 2007.

[103] N. Gibbs, W. Poole, and P. Stockmeyer. An algorithm for reducing thebandwidth and profile of a sparse matrix. SIAM Journal on NumericalAnalysis (SINUM), 13:236–250, 1976.

[104] J. Goguen and J. Meseguer. Unwinding and inference control. In Proc.of the IEEE Symposium on Security and Privacy (S&P), Oakland, CA,1984.

[105] H. Gonzalez, J. Han, and X. Li. Mining compressed commodity work-flows from massive RFID data sets. In Proc. of the 16th ACM Confer-ence on Information and Knowledge Management (CIKM), 2006.

[106] M. Gruteser and D. Grunwald. Anonymous usage of location-basedservices through spatial and temporal cloaking. In Proc. of the 1stInternational Conference on Mobile Systems, Applications, and Services(MobiSys), 2003.

[107] L. Gu, R. Baxter, D. Vickers, and C. Rainsford. Record linkage: Currentpractice and future directions. Technical report, CSIRO Mathematicaland Information Sciences, 2003.

[108] K. Hafner. Researchers yearn to use aol logs, but they hesitate. NewYork Times, 2006.

[109] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Tech-niques. Morgan Kaufmann, 2 edition, November 2005.

[110] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resistingstructural re-identification in anonymized social networks. In Proc. ofthe Very Large Data Bases (VLDB), Auckland, New Zealand.

[111] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizingsocial networks. Technical report, University of Massachusetts Amherst,2007.


[112] Y. He and J. Naughton. Anonymization of set valued data via top down,local generalization. In Proc. of the Very Large Databases (VLDB),2009.

[113] M. Hegland, I. Mcintosh, and B. A. Turlach. A parallel solver for gen-eralized additive models. Computational Statistics & Data Analysis,31(4):377–396, 1999.

[114] U. Hengartner. Hiding location information from location-based services. In Proc. of the International Workshop on Privacy-Aware Location-based Mobile Services (PALMS), pages 268–272, Mannheim, Germany, May 2007.

[115] T. Hinke. Inference aggregation detection in database managementsystems. In Proc. of the IEEE Symposium on Security and Privacy(S&P), pages 96–107, Oakland, CA, April 1988.

[116] T. Hinke, H. Degulach, and A. Chandrasekhar. A fast algorithm fordetecting second paths in database inference analysis. Journal of Com-puter Security, 1995.

[117] R. D. Hof. Mix, match, and mutate. Business Week, July 2005.

[118] Z. Huang, W. Du, and B. Chen. Deriving private information fromrandomized data. In Proc. of ACM International Conference on Man-agement of Data (SIGMOD), pages 37–48, Baltimore, ML, 2005.

[119] Z. Huang and W. Du. OptRR: Optimizing randomized response schemesfor privacy-preserving data mining. In Proc. of the 24th IEEE Interna-tional Conference on Data Engineering (ICDE), pages 705–714, 2008.

[120] A. Hundepool and L. Willenborg. μ- and τ -argus: Software for statis-tical disclosure control. In Proc. of the 3rd International Seminar onStatistical Confidentiality, Bled, 1996.

[121] Ali Inan, Selim V. Kaya, Yucel Saygin, Erkay Savas, Ayca A. Hintoglu,and Albert Levi. Privacy preserving clustering on horizontally parti-tioned data. Data & Knowledge Engineering (DKE), 63(3):646–666,2007.

[122] T. Iwuchukwu and J. F. Naughton. K-anonymization as spatial in-dexing: Toward scalable and incremental anonymization. In Proc. ofthe 33rd International Conference on Very Large Data Bases (VLDB),2007.

[123] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc.of the 8th ACM International Conference on Knowledge Discovery andData Mining (SIGKDD), pages 279–288, Edmonton, AB, Canada, July2002.


[124] A. Jain, E. Y. Chang, and Y.-F. Wang. Adaptive stream resource man-agement using Kalman filters. In Proc. of 2004 ACM InternationalConference on Management of Data (SIGMOD), pages 11–22, 2004.

[125] S. Jajodia and C. Meadows. Inference problems in multilevel databasemanagement systems. IEEE Information Security: An Integrated Col-lection of Essays, pages 570–584, 1995.

[126] M. Jakobsson, A. Juels, and R. L. Rivest. Making mix nets robust forelectronic voting by randomized partial checking. In Proc. of the 11thUSENIX Security Symposium, pages 339–353, 2002.

[127] W. Jiang and C. Clifton. Privacy-preserving distributed k-anonymity.In Proc. of the 19th Annual IFIP WG 11.3 Working Conference on Dataand Applications Security, pages 166–177, Storrs, CT, August 2005.

[128] W. Jiang and C. Clifton. A secure distributed framework for achievingk-anonymity. Very Large Data Bases Journal (VLDBJ), 15(4):316–333,November 2006.

[129] R. Jones, R. Kumar, B. Pang, and A. Tomkins. I know what you didlast summer. In Proc. of the 16th ACM Conference on Information andKnowledge Management (CIKM), 2007.

[130] A. Juels. RFID security and privacy: A research survey. 2006.

[131] P. Jurczyk and L. Xiong. Distributed anonymization: Achieving privacyfor both data subjects and data providers. In Proc. of the 23rd AnnualIFIP WG 11.3 Working Conference on Data and Applications Security(DBSec), 2009.

[132] M. Kantarcioglu. Privacy-Preserving Data Mining: Models and Algorithms, chapter A Survey of Privacy-Preserving Methods Across Horizontally Partitioned Data, pages 313–335. Springer, 2008.

[133] M. Kantarcioglu and C. Clifton. Privacy preserving data mining ofassociation rules on horizontally partitioned data. IEEE Transactionson Knowledge and Data Engineering (TKDE), 16(9):1026–1037, 2004.

[134] M. Kantarcioglu and C. Clifton. Privately computing a distributed k-nnclassifier. In Proc. of the 8th European Conference on Principles andPractice of Knowledge Discovery in Databases (PKDD), volume LNCS3202, pages 279–290, Pisa, Italy, September 2004.

[135] M. Kantarcioglu, J. Jin, and C. Clifton. When do data mining results violate privacy? In Proc. of the 2004 ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 599–604, Seattle, WA, 2004.

[136] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar. On the privacy preserving properties of random data perturbation techniques. In Proc. of the 3rd IEEE International Conference on Data Mining (ICDM), pages 99–106, Melbourne, FL, 2003.

[137] K. Kenthapadi, N. Mishra, and K. Nissim. Simulatable auditing. In Proc. of the 24th ACM Symposium on Principles of Database Systems (PODS), pages 118–127, Baltimore, MD, June 2005.

[138] D. Kifer. Attacks on privacy and deFinetti’s theorem. In Proc. ofthe 35th ACM International Conference on Management of Data (SIG-MOD), pages 127–138, Providence, RI, 2009.

[139] D. Kifer and J. Gehrke. Injecting utility into anonymized datasets.In Proc. of ACM International Conference on Management of Data(SIGMOD), Chicago, IL, June 2006.

[140] J. Kim and W. Winkler. Masking microdata files. In Proc. of the ASASection on Survey Research Methods, pages 114–119, 1995.

[141] D. Kokkinakis and A. Thurin. Anonymisation of Swedish clinical data.In Proc. of the 11th Conference on Artificial Intelligence in Medicine(AIME), pages 237–241, 2007.

[142] A. Korolova. Releasing search queries and clicks privately. In Proc. ofthe 18th International Conference on World Wide Web (WWW), 2009.

[143] A. Korolova, R. Motwani, U.N. Shubha, and Y. Xu. Link privacy insocial networks. In Proc. of IEEE 24th International Conference onData Engineering (ICDE), April 2008.

[144] B. Krishnamurthy. Privacy vs. security in the aftermathof the September 11 terrorist attacks, November 2001.http://www.scu.edu/ethics/publications/briefings/privacy.html.

[145] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publica-tions, Beverly Hills, CA, USA, 1978.

[146] R. Kumar, J. Novak, B. Pang, and A. Tomkins. On anonymizing querylogs via token-based hashing. In Proc. of the 16th International Con-ference on World Wide Web (WWW), pages 628–638, May 2007.

[147] R. Kumar, K. Punera, and A. Tomkins. Hierarchical topic segmenta-tion of websites. In Proc. of the 12th ACM International Conferenceon Knowledge Discovery and Data Mining (SIGKDD), pages 257–266,2006.

[148] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Incognito: Efficientfull-domain k-anonymity. In Proc. of ACM International Conference onManagement of Data (SIGMOD), pages 49–60, Baltimore, ML, 2005.

[149] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Mondrian multidi-mensional k-anonymity. In Proc. of the 22nd IEEE International Con-ference on Data Engineering (ICDE), Atlanta, GA, 2006.


[150] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Workload-awareanonymization. In Proc. of the 12th ACM International Conferenceon Knowledge Discovery and Data Mining (SIGKDD), Philadelphia,PA, August 2006.

[151] K. LeFevre, D. J. DeWitt, and R. Ramakrishnan. Workload-awareanonymization techniques for large-scale data sets. ACM Transactionson Database Systems (TODS), 33(3), 2008.

[152] J. Li, Y. Tao, and X. Xiao. Preservation of proximity privacy in pub-lishing numerical sensitive data. In Proc. of the ACM InternationalConference on Management of Data (SIGMOD), pages 437–486, Van-couver, Canada, June 2008.

[153] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: Privacy beyondk-anonymity and �-diversity. In Proc. of the 21st IEEE InternationalConference on Data Engineering (ICDE), Istanbul, Turkey, April 2007.

[154] T. Li and N. Li. Modeling and integrating background knowledge. InProc. of the 23rd IEEE International Conference on Data Engineering(ICDE), 2009.

[155] G. Liang and S. S. Chawathe. Privacy-preserving inter-database op-erations. In Proc. of the 2nd Symposium on Intelligence and SecurityInformatics (ISI), Tucson, AZ, June 2004.

[156] B. Liu, W. Hsu, and Y. Ma. Integrating classification and associationrule mining. In Proc. of 4th ACM International Conference on Knowl-edge Discovery and Data Mining (SIGKDD), 1998.

[157] J. Liu and K. Wang. On optimal anonymization for �+-diversity. InProc. of the 26th IEEE International Conference on Data Engineering(ICDE), 2010.

[158] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplica-tive perturbation for privacy preserving distributed data mining. IEEETransactions on Knowledge and Data Engineering (TKDE), 18(1):92–106, January 2006.

[159] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proc.of ACM International Conference on Management of Data (SIGMOD),Vancouver, Canada, 2008.

[160] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam.�-diversity: Privacy beyond k-anonymity. In Proc. of the 22nd IEEEInternational Conference on Data Engineering (ICDE), Atlanta, GA,2006.

[161] A. Machanavajjhala, D. Kifer, J. M. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE), pages 277–286, 2008.

[162] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam.�-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowl-edge Discovery from Data (TKDD), 1(1), March 2007.

[163] B. Malin and E. Airoldi. The effects of location access behavior on re-identification risk in a distributed environment. In Proc. of 6th Work-shop on Privacy Enhancing Technologies (PET), pages 413–429, 2006.

[164] B. Malin and L. Sweeney. How to protect genomic data privacy in adistributed network. In Journal of Biomed Info, 37(3): 179-192, 2004.

[165] D. Martin, D. Kifer, A. Machanavajjhala, J. Gehrke, and J. Halpern.Worst-case background knowledge in privacy-preserving data publish-ing. In Proc. of the 23rd IEEE International Conference on Data En-gineering (ICDE), April 2007.

[166] N. S. Matloff. Inference control via query restriction vs. data modifica-tion: A perspective. In Database Security: Status and Prospects, pages159–166, Annapolis, ML, 1988.

[167] A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and rolediscovery in social networks. In Proc. of International Joint Conferenceon Artificial Intelligence (IJCAI), 2005.

[168] A. Meyerson and R. Williams. On the complexity of optimal k-anonymity. In Proc. of the 23rd ACM SIGMOD-SIGACT-SIGARTPODS, pages 223–228, Paris, France, 2004.

[169] G. Miklau and D. Suciu. A formal analysis of information disclosure in data exchange. In Proc. of ACM International Conference on Management of Data (SIGMOD), pages 575–586, Paris, France, 2004.

[170] N. Mohammed, B. C. M. Fung, and M. Debbabi. Walking in the crowd:Anonymizing trajectory data for pattern analysis. In Proc. of the 18thACM Conference on Information and Knowledge Management (CIKM),pages 1441–1444, Hong Kong, November 2009.

[171] N. Mohammed, B. C. M. Fung, P. C. K. Hung, and C. K. Lee.Anonymizing healthcare data: A case study on the blood transfusion ser-vice. In Proc. of the 15th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (SIGKDD), pages 1285–1294,Paris, France, June 2009.

[172] N. Mohammed, B. C. M. Fung, K. Wang, and P. C. K. Hung. Privacy-preserving data mashup. In Proc. of the 12th International Confer-ence on Extending Database Technology (EDBT), pages 228–239, Saint-Petersburg, Russia, March 2009.


[173] R. A. Moore, Jr. Controlled data-swapping techniques for maskingpublic use microdata sets. Statistical Research Division Report SeriesRR 96-04, U.S. Bureau of the Census, Washington, DC., 1996.

[174] R. Motwani and Y. Xu. Efficient algorithms for masking and findingquasi-identifiers. In Proc. of SIAM International Workshop on PracticalPrivacy-Preserving Data Mining (P3DM), Atlanta, GA, April 2008.

[175] A. Narayanan and V. Shmatikov. De-anonymizing social networks. InProc. of the IEEE Symposium on Security and Privacy (S&P), 2009.

[176] M. Ercan Nergiz, M. Atzori, and C. W. Clifton. Hiding the presence ofindividuals from shared databases. In Proc. of ACM International Con-ference on Management of Data (SIGMOD), pages 665–676, Vancouver,Canada, 2007.

[177] M. Ercan Nergiz and C. Clifton. Thoughts on k-anonymization. Data & Knowledge Engineering, 63(3):622–645, December 2007.

[178] M. Ercan Nergiz, C. Clifton, and A. Erhan Nergiz. Multirelationalk-anonymity. In Proc. of the 23rd International Conference on DataEngineering (ICDE), pages 1417–1421, Istanbul, Turkey, 2007.

[179] D. J. Newman, S. Hettich, C. L. Blake, and C. J. Merz. UCI repositoryof machine learning databases, 1998.

[180] N. Nisan. Algorithms for selfish agents. In Proc. of the 16th Symposiumon Theoretical Aspects of Computer Science, Trier, Germany, March1999.

[181] A. Ohrn and L. Ohno-Machado. Using Boolean reasoning to anonymizedatabases. Artificial Intelligence in Medicine, 15:235–254, 1999.

[182] S. R. Oliveira and O. R. Zaiane. A privacy-preserving clustering ap-proach toward secure and effective data analysis for business collabora-tion. Journal on Computers and Security, 26(1):81–93, 2007.

[183] M. J. Osborne and A. Rubinstein. A Course in Game Theory. Cam-bridge, MA: The MIT Press, 1994.

[184] G. Ozsoyoglu and T. Su. On inference control in semantic data modelsfor statistical databases. Journal of Computer and System Sciences,40(3):405–443, 1990.

[185] S. Papadimitriou, F. Li, G. Kollios, and P. S. Yu. Time series com-pressibility and privacy. In Proc. of the 33rd International Conferenceon Very Large Data Bases (VLDB), pages 459–470, Vienna, Austria,September 2007.

[186] J. Pei, J. Xu, Z. Wang, W. Wang, and K. Wang. Maintaining k-anonymity against incremental updates. In Proc. of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), Banff, Canada, 2007.

[187] B. Pinkas. Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explorations Newsletter, 4(2):12–19, January 2002.

[188] B. Poblete, M. Spiliopoulou, and R. Baeza-Yates. Website privacypreservation for query log publishing. Privacy, Security, and Trust inKDD. Springer Berlin / Heidelberg, 2007.

[189] S. C. Pohlig and M. E. Hellman. An improved algorithm for computinglogarithms over gf(p) and its cryptographic significance. IEEE Trans-actions on Information Theory (TIT), IT-24:106–110, 1978.

[190] President Information Technology Advisory Committee. Revolution-izing health care through information technology. Technical report,Executive Office of the President of the United States, June 2004.

[191] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kauf-mann, 1993.

[192] V. Rastogi, M. Hay, G. Miklau, and D. Suciu. Relationship pri-vacy: Output perturbation for queries with joins. In Proc. of the28th ACM SIGMOD-SIGACT-SIGART Symposium on Principles ofDatabase Systems (PODS), Providence, Rhode Island, USA, July 2009.

[193] V. Rastogi, D. Suciu, and S. Hong. The boundary between privacy andutility in data publishing. In Proc. of the 33rd International Conferenceon Very Large Data Bases (VLDB), pages 531–542, Vienna, Austria,September 2007.

[194] K. Reid and J. A. Scott. Reducing the total bandwidth of a sparseunsymmetric matrix. In SIAM Journal on Matrix Analysis and Appli-cations, volume 28, pages 805–821, 2006.

[195] S. P. Reiss. Practical data-swapping: The first steps. ACM Transactionson Database Systems (TODS), 9(1):20–37, 1984.

[196] Steven P. Reiss, Mark J. Post, and Tore Dalenius. Non-reversible pri-vacy transformations. In Proc. of the 1st ACM Symposium on Principlesof Database Systems (PODS), pages 139–146, 1982.

[197] M. K. Reiter and A. D. Rubin. Crowds: Anonymity for web transac-tions. ACM Transactions on Information and System Security (TIS-SEC), 1(1):66–92, November 1998.

[198] S. Rizvi and J. R. Harista. Maintaining data privacy in association rulemining. In Proc. of the 28th International Conference on Very LargeData Bases (VLDB), pages 682–693, 2002.

[199] B. E. Rosen, J. M. Goodwin, and J. J. Vidal. Process control withadaptive range coding. Biological Cybernetics, 67:419–428, 1992.


[200] D. B. Rubin. Discussion statistical disclosure limitation. Journal ofOfficial Statistics, 9(2).

[201] P. Samarati. Protecting respondents’ identities in microdata release.IEEE Transactions on Knowledge and Data Engineering (TKDE),13(6):1010–1027, 2001.

[202] P. Samarati and L. Sweeney. Generalizing data to provide anonymitywhen disclosing information. In Proc. of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems(PODS), page 188, Seattle, WA, June 1998.

[203] P. Samarati and L. Sweeney. Protecting privacy when disclosing infor-mation: k-anonymity and its enforcement through generalization andsuppression. Technical report, SRI International, March 1998.

[204] S. E. Sarma, S. A. Weis, and D. W. Engels. RFID systems and securityand privacy implications. In Proc. of the 4th International Workshopon Cryptographic Hardware and Embedded Systems (CHES), pages 454–469, 2003.

[205] Y. Saygin, D. Hakkani-Tur, and G. Tur. Web and Information Secu-rity, chapter Sanitization and Anonymization of Document Reposito-ries, pages 133–148. IRM Press, 2006.

[206] Y. Saygin, V. S. Verykios, and C. Clifton. Using unknowns to preventdiscovery of association rules. In Conference on Research Issues in DataEngineering, 2002.

[207] B. Schneier. Applied Cryptography. John Wiley & Sons, 2 edition, 1996.

[208] J. W. Seifert. Data mining and homeland security: Anoverview. CRS Report for Congress, (RL31798), January 2006.http://www.fas.org/sgp/crs/intel/RL31798.pdf.

[209] A. Serjantov and G. Danezis. Towards an information theoretic metric for anonymity. In Proc. of the 2nd International Workshop on Privacy Enhancing Technologies (PET), pages 41–53, San Francisco, CA, April 2002.

[210] C. E. Shannon. A mathematical theory of communication. The BellSystem Technical Journal, 27:379 and 623, 1948.

[211] W. Shen, X. Li, and A. Doan. Constraint-based entity matching. InProc. of the 20th National Conference on Artificial Intelligence (AAAI),2005.

[212] A. Shoshani. Statistical databases: Characteristics, problems and somesolutions. In Proc. of the 8th Very Large Data Bases (VLDB), pages208–213, September 1982.


[213] A. Skowron and C. Rauszer. Intelligent Decision Support: Handbookof Applications and Advances of the Rough Set Theory, chapter TheDiscernibility Matrices and Functions in Information Systems. 1992.

[214] L. Sweeney. Replacing personally-identifying information in medicalrecords, the scrub system. In Proc. of the AMIA Annual Fall Sympo-sium, pages 333–337, 1996.

[215] L. Sweeney. Datafly: A system for providing anonymity in medical data.In Proc. of the IFIP TC11 WG11.3 11th International Conference onDatabase Securty XI: Status and Prospects, pages 356–381, August 1998.

[216] L. Sweeney. Achieving k-anonymity privacy protection using generaliza-tion and suppression. International Journal of Uncertainty, Fuzziness,and Knowledge-based Systems, 10(5):571–588, 2002.

[217] L. Sweeney. k-Anonymity: A model for protecting privacy. Interna-tional Journal of Uncertainty, Fuzziness and Knowledge-based Systems,10(5):557–570, 2002.

[218] Y. Tao, X. Xiao, J. Li, and D. Zhang. On anti-corruption privacy pre-serving publication. In Proc. of the 24th IEEE International Conferenceon Data Engineering (ICDE), pages 725–734, April 2008.

[219] M. Terrovitis and N. Mamoulis. Privacy preservation in the publicationof trajectories. In Proc. of the 9th International Conference on MobileData Management (MDM), pages 65–72, April 2008.

[220] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preservinganonymization of set-valued data. Proc. of the VLDB Endowment,1(1):115–125, August 2008.

[221] The House of Commons in Canada. The personal information protectionand electronic documents act, April 2000. http://www.privcom.gc.ca/.

[222] The RFID Knowledgebase. Oyster transport for London TfL cardUK, January 2007. http://www.idtechex.com/knowledgebase/en/ cas-estudy.asp?casestudyid=227.

[223] B. M. Thuraisingham. Security checking in relational database man-agement systems augmented with inference engines. Computers andSecurity, 6:479–492, 1987.

[224] J. Travers and S. Milgram. An experimental study of the small worldproblem. Sociometry, 32(4):425–443, 1969.

[225] T. Trojer, B. C. M. Fung, and P. C. K. Hung. Service-oriented archi-tecture for privacy-preserving data mashup. In Proc. of the 7th IEEEInternational Conference on Web Services (ICWS), pages 767–774, LosAngeles, CA, July 2009.


[226] T. M. Truta and V. Bindu. Privacy protection: p-sensitive k-anonymityproperty. In Proc. of the Workshop on Privacy Data Management(PDM), page 94, April 2006.

[227] J. Vaidya. Privacy-Preserving Data Mining: Models and Algorithms, chapter A Survey of Privacy-Preserving Methods Across Vertically Partitioned Data, pages 337–358. Springer, 2008.

[228] J. Vaidya and C. Clifton. Privacy preserving association rule mining invertically partitioned data. In Proc. of the 8th ACM International Con-ference on Knowledge Discovery and Data Mining (SIGKDD), pages639–644, Edmonton, AB, Canada, 2002.

[229] J. Vaidya and C. Clifton. Privacy-preserving k-means clustering oververtically partitioned data. In Proc. of the 9th ACM International Con-ference on Knowledge Discovery and Data Mining (SIGKDD), pages206–215, Washington, DC, 2003.

[230] J. Vaidya, C. W. Clifton, and M. Zhu. Privacy Preserving Data Mining.Springer, 2006.

[231] C. J. van Rijsbergen. Information Retrieval, 2nd edition. London, But-terworths, 1979.

[232] V. S. Verykios, E. Bertino, I. N. Fovino, L. P. Provenza, Y. Saygin,and Y. Theodoridis. State-of-the-art in privacy preserving data mining.ACM SIGMOD Record, 3(1):50–57, March 2004.

[233] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni. Association rule hiding. IEEE Transactions on Knowledge and Data Engineering (TKDE), 16(4):434–447, 2004.

[234] S. A. Vinterbo. Privacy: A machine learning view. IEEE Transactionson Knowledge and Data Engineering (TKDE), 16(8):939–948, August2004.

[235] D.-W. Wang, C.-J. Liau, and T.-S. Hsu. Privacy protection in socialnetwork data disclosure based on granular computing. In Proc. of IEEEInternational Conference on Fuzzy Systems, Vancouver, BC, July 2006.

[236] K. Wang and B. C. M. Fung. Anonymizing sequential releases. In Proc.of the 12th ACM SIGKDD International Conference on Knowledge Dis-covery and Data Mining (SIGKDD), pages 414–423, Philadelphia, PA,August 2006.

[237] K. Wang, B. C. M. Fung, and P. S. Yu. Handicapping attacker’s confi-dence: An alternative to k-anonymization. Knowledge and InformationSystems (KAIS), 11(3):345–368, April 2007.

[238] K. Wang, Y. Xu, A. W. C. Fu, and R. C. W. Wong. ff -anonymity:When quasi-identifiers are missing. In Proc. of the 25th IEEE Interna-tional Conference on Data Engineering (ICDE), March 2009.


[239] K. Wang, P. S. Yu, and S. Chakraborty. Bottom-up generalization:A data mining solution to privacy protection. In Proc. of the 4thIEEE International Conference on Data Mining (ICDM), pages 249–256, November 2004.

[240] S.-W. Wang, W.-H. Chen, C.-S. Ong, L. Liu, and Y. W. Chuang. RFID applications in hospitals: A case study on a demonstration RFID project in a Taiwan hospital. In Proc. of the 39th Hawaii International Conference on System Sciences, 2006.

[241] T. Wang and X. Wu. Approximate inverse frequent itemset mining:Privacy, complexity, and approximation. In Proc. of the 5th IEEE In-ternational Conference on Data Mining (ICDM), 2005.

[242] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association, 60(309):63–69, 1965.

[243] D. Whitley. The GENITOR algorithm and selection pressure: Why rank-based allocation of reproductive trials is best. In Proc. of the 3rd International Conference on Genetic Algorithms, pages 116–121, 1989.

[244] G. Wiederhold. Intelligent integration of information. In Proc. of ACM International Conference on Management of Data (SIGMOD), pages 434–437, 1993.

[245] R. C. W. Wong, A. W. C. Fu, K. Wang, and J. Pei. Minimality attack in privacy preserving data publishing. In Proc. of the 33rd International Conference on Very Large Data Bases (VLDB), pages 543–554, Vienna, Austria, 2007.

[246] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In Proc. of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 754–759, Philadelphia, PA, 2006.

[247] R. C. W. Wong, J. Li, A. W. C. Fu, and K. Wang. (α,k)-anonymous data publishing. Journal of Intelligent Information Systems, in press.

[248] R. N. Wright, Z. Yang, and S. Zhong. Distributed data mining protocols for privacy: A review of some recent results. In Proc. of the Secure Mobile Ad-hoc Networks and Sensors Workshop (MADNES), 2005.

[249] X. Xiao and Y. Tao. Anatomy: Simple and effective privacy preservation. In Proc. of the 32nd International Conference on Very Large Data Bases (VLDB), Seoul, Korea, September 2006.

[250] X. Xiao and Y. Tao. Personalized privacy preservation. In Proc. of ACM International Conference on Management of Data (SIGMOD), Chicago, IL, 2006.

[251] X. Xiao and Y. Tao. m-invariance: Towards privacy preserving re-publication of dynamic datasets. In Proc. of ACM International Conference on Management of Data (SIGMOD), Beijing, China, June 2007.

[252] X. Xiao, Y. Tao, and M. Chen. Optimal random perturbation at multiple privacy levels. In Proc. of the 35th International Conference on Very Large Data Bases (VLDB), pages 814–825, 2009.

[253] L. Xiong and E. Agichtein. Towards privacy-preserving query log publishing. In Query Logs Workshop at the 16th International Conference on World Wide Web, 2007.

[254] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W. C. Fu. Utility-based anonymization using local recoding. In Proc. of the 12th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Philadelphia, PA, August 2006.

[255] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In Proc. of the 8th IEEE International Conference on Data Mining (ICDM), Pisa, Italy, December 2008.

[256] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In Proc. of the 14th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), August 2008.

[257] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. of the 2nd IEEE International Conference on Data Mining (ICDM), 2002.

[258] Z. Yang, S. Zhong, and R. N. Wright. Anonymity-preserving data collection. In Proc. of the 11th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 334–343, 2005.

[259] Z. Yang, S. Zhong, and R. N. Wright. Privacy-preserving classification of customer data without loss of accuracy. In Proc. of the 5th SIAM International Conference on Data Mining (SDM), pages 92–102, Newport Beach, CA, 2005.

[260] A. C. Yao. Protocols for secure computations. In Proc. of the 23rd IEEE Symposium on Foundations of Computer Science, pages 160–164, 1982.

[261] A. C. Yao. How to generate and exchange secrets. In Proc. of the 27th Annual IEEE Symposium on Foundations of Computer Science, pages 162–167, 1986.

[262] C. Yao, X. S. Wang, and S. Jajodia. Checking for k-anonymity violation by views. In Proc. of the 31st International Conference on Very Large Data Bases (VLDB), pages 910–921, Trondheim, Norway, 2005.

[263] R. Yarovoy, F. Bonchi, L. V. S. Lakshmanan, and W. H. Wang. Anonymizing moving objects: How to hide a MOB in a crowd? In Proc. of the 12th International Conference on Extending Database Technology (EDBT), pages 72–83, 2009.

[264] X. Ying and X. Wu. Randomizing social networks: A spectrum preserving approach. In Proc. of SIAM International Conference on Data Mining (SDM), Atlanta, GA, April 2008.

[265] T.-H. You, W.-C. Peng, and W.-C. Lee. Protect moving trajectories with dummies. In Proc. of the International Workshop on Privacy-Aware Location-based Mobile Services (PALMS), pages 278–282, May 2007.

[266] H. Yu, P. Gibbons, M. Kaminsky, and F. Xiao. SybilLimit: A near-optimal social network defense against Sybil attacks. In Proc. of the IEEE Symposium on Security and Privacy, 2008.

[267] L. Zayatz. Disclosure avoidance practices and research at the U.S. Census Bureau: An update. Journal of Official Statistics, 23(2):253–265, 2007.

[268] P. Zhang, Y. Tong, S. Tang, and D. Yang. Privacy-preserving naive Bayes classifier. Lecture Notes in Computer Science, 3584, 2005.

[269] Q. Zhang, N. Koudas, D. Srivastava, and T. Yu. Aggregate query answering on anonymized tables. In Proc. of the 23rd IEEE International Conference on Data Engineering (ICDE), April 2007.

[270] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In Proc. of the 1st ACM SIGKDD International Workshop on Privacy, Security and Trust in KDD, San Jose, CA, August 2007.

[271] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In Proc. of the 24th IEEE International Conference on Data Engineering (ICDE), April 2008.

[272] L. Zou, L. Chen, and M. T. Özsu. K-automorphism: A general framework for privacy preserving network publication. Proc. of the VLDB Endowment, 2(1):946–957, 2009.