While some privacy advocates claim that federal agencies are not taking the privacy rule seriously enough, regulators state that they are prepared to impose civil monetary penalties where appropriate. According to the Department of Health and Human Services' (HHS) Office for Civil Rights, voluntary compliance has worked well thus far in dealing with complaints.1 Very few would disagree, however, that the unintentional disclosure of personal health information can have serious consequences, including monetary penalties, litigation and a public relations failure.
Some of the data produced by health care organizations, while deemed necessary for making business decisions, may be based on personal health information that is reasonably identifiable, thereby placing the organization at risk. Thus, a balance must be found between business needs and the mitigation of disclosure risk related to HIPAA.
Regulatory and Legal Background
HIPAA legislation has had a profound impact on the current regulatory and legal landscape. The Department of Health and Human Services issued a final Rule on December 28, 2000, establishing "Standards for Privacy of Individually Identifiable Health Information" (Privacy Rule) that took effect in April 2003. Under the Privacy Rule, health care organizations must guard against the misuse of individuals' identifiable health information. The Privacy Rule does allow businesses to create de-identified health information in order to meet their needs. However, the de-identification procedures must be sufficient to eliminate a reasonable basis to believe that the information can be used to identify an individual.
HHS is responsible for the enforcement of HIPAA. In addition to enforcement,
"... the Privacy Rule has the potential to dramatically increase the number of lawsuits against health care providers and payers for wrongful disclosures. Some experts believe that private litigation will play a more significant role in enforcement than government agencies will. Some even consider the potential magnitude of HIPAA litigation to be in the same class with tobacco litigation, breast implant litigation and asbestos litigation. In any event, HIPAA expands opportunities for civil lawsuits ..."2
Under HIPAA, health care organizations must render anonymous any individually identifiable health information, must establish privacy practices, and are advised to seek third-party review.
Research and Guidelines
Many studies on de-identifying information assume that the data are a sample drawn from a population. Sophisticated mathematical models have been developed to assign the probability of correctly identifying an individual.
Often, however, health care data contained in tables represent the entire population and not a sample. Hence, the previously developed theory does not necessarily apply to these types of data. Moreover, our review of the underlying statistical theory indicates that extensions of these models are often nonsensical for population data.
Although there are differences in the statistical applications for de-identifying individuals for a population and a sample, statistical theory can be used to calculate probabilities of identifying individuals in a population. This theory relies on the number of individuals in a cell and the number of individuals in a table. It assumes implicitly that no other information is known about an individual. These calculations indicate that small cell sizes are optimal: that is, if a cell, defined by a row and column, has a small number of observations, the probability of identifying correctly at least one individual in that cell is small. Moreover, when a cell has a large number of observations, the probability of identifying at least one individual in that cell is large.
These results run counter to what occurs in practice, where small cell sizes and large cell sizes can both be problematic. We therefore suggest general guidelines for table structures that will mitigate the risk of re-identification or disclosure. More specifically:
- Maintain a minimum number of individuals per cell. The Federal Committee on Statistical Methodology, Subcommittee on Disclosure Limitation Methodology, reports the use of a threshold rule; that is, a cell in a table of frequencies is defined to be sensitive if the number of individuals is less than some specified number.3 Some Federal agencies require at least five in a cell, others require three.
- Apply the concentration rule. Regardless of the number of individuals in a cell, if a small number (n or fewer) of individuals contribute a large percentage (k percent or more) of the total cell value, then the n-respondent, k-percent rule of cell dominance defines the cell as sensitive. Many statistical agencies use an (n, k) rule with n = 1 or 2 and k = 60 to 90 percent.4 In practice, we suggest that no one individual represent more than 70 percent of a cell total, and that no individual cell value lie more than two standard deviations from the mean.
- Display data in percentages rather than actual counts.
- Report either the totals or averages (without counts) to display cost or account information.
- Reduce the number of rows that display demographic characteristics (for example, the age distribution). Distributions displayed in groups of 10 or 20 percent of the population should be considered.
- Combine complementary information: Sensitive data can be easily masked by combining these data with neighboring cells or complementary tables. Alternatively, the frequency of reporting (e.g., quarterly instead of monthly) could be changed so that minimum cell size can be achieved.
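As an illustration only, the threshold rule, the (n, k) concentration rule, and the combining of neighboring cells described above can be sketched in a few lines of Python. The function names, the age-band labels, and the expenditure figures are hypothetical and are not taken from the guidelines. The defaults (a minimum of five individuals, n = 1, k = 70 percent) are one choice within the ranges quoted above, and the sketch assumes non-negative cell values such as medical expenditures.

```python
from typing import Dict, List, Tuple


def is_sensitive_by_threshold(count: int, min_count: int = 5) -> bool:
    """Threshold rule: sensitive if the cell holds fewer than min_count individuals."""
    return count < min_count


def is_sensitive_by_dominance(values: List[float], n: int = 1, k: float = 70.0) -> bool:
    """(n, k) rule: sensitive if the n largest values make up k percent or more of the total."""
    total = sum(values)
    if total <= 0:
        # Assumption: an empty or zero-total cell has no dominant contributor.
        return False
    top_n = sum(sorted(values, reverse=True)[:n])
    return 100.0 * top_n / total >= k


def flag_sensitive_cells(cells: Dict[str, List[float]], min_count: int = 5,
                         n: int = 1, k: float = 70.0) -> List[str]:
    """Return the labels of cells that fail either rule, in table order."""
    return [
        label
        for label, values in cells.items()
        if is_sensitive_by_threshold(len(values), min_count)
        or is_sensitive_by_dominance(values, n, k)
    ]


def merge_small_cells(counts: List[Tuple[str, int]],
                      min_count: int = 5) -> List[Tuple[str, int]]:
    """Combine each undersized cell with its next neighbor until counts reach min_count."""
    merged: List[Tuple[str, int]] = []
    for label, count in counts:
        if merged and merged[-1][1] < min_count:
            prev_label, prev_count = merged[-1]
            merged[-1] = (prev_label + "+" + label, prev_count + count)
        else:
            merged.append((label, count))
    # A trailing undersized cell has no next neighbor, so fold it into the previous one.
    if len(merged) > 1 and merged[-1][1] < min_count:
        last_label, last_count = merged.pop()
        prev_label, prev_count = merged[-1]
        merged[-1] = (prev_label + "+" + last_label, prev_count + last_count)
    return merged


# Hypothetical expenditures per individual, grouped by age band.
cells = {
    "18-29": [120.0, 80.0, 95.0, 110.0, 100.0, 90.0],     # 6 people, no dominant value
    "30-39": [500.0, 450.0],                              # only 2 people
    "40-49": [9000.0, 100.0, 150.0, 200.0, 250.0, 300.0], # one person holds 90 percent
}
flagged = flag_sensitive_cells(cells)
merged = merge_small_cells([(label, len(v)) for label, v in cells.items()])
```

With these made-up figures, `flagged` contains "30-39" (too few individuals) and "40-49" (one dominant contributor), and `merged` combines the two-person band with its neighbor. Merging on counts alone does not remove a dominant contributor, so the concentration rule would still need to be rechecked on the combined cell.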
Common Privacy Concerns for Population Data
To safeguard and protect health information, health care organizations are interested in de-identifying tabular population data. Figure 1 presents an example of magnitude and count data requiring de-identification. Attributes include age; population counts and/or population percentages; and magnitude data including averages and sums of medical expenditures. These data can be described by any distribution and are often skewed.









