Scalability of Privacy-Preserving Linear Regression in Epidemiological Studies

In many hospitals, data related to patients are observed and collected to a central database for medical research. For instance, DPC dataset, which stands for Disease, Procedure and Combination, covers medical records for more than 7 million patients in more than 1000 hospitals. Using the distributed DPC data set, a number of epidemiological studied are feasible to reveal useful knowledge on medical treatments. Hence, cryptography helps to preserve the privacy of personal data. The study called as Privacy-Preserving Data Mining (PPDM) aims to perform a data mining algorithm with preserving confidentiality of datasets.

This paper studies the scalability of privacy-preserving data mining in epidemiological study. As for the data-mining algorithm, we focus to a linear regression since it is used in many applications and simple to be evaluated. We try to identify the linear model to estimate a length of hospital stay from distributed dataset related to the patient and the disease information. Our contributions of this paper include (1) to propose privacy-preserving protocols for linear regression with horizontally or vertically partitioned datasets, and (2) to clarify the limitation of size of problem to be performed. These information are useful to determine the dominant element in PPDM and to figure out the direction of study for further improvement.

You might also like