
How to manage disclosure risk with Roseman Labs

Written by Roseman Labs | Nov 1, 2024 11:00:00 AM

Performing analyses on data from multiple parties

With Roseman Labs, organizations can run statistical analyses on combined sensitive data sets, even if the participating organizations do not fully trust each other. This is possible because Multi-Party Computation (the technological foundation of our platform) allows several participants to securely encrypt, link and analyze data sets without revealing the underlying records.

When working with sensitive data, information related to individuals or data contributors should always remain private. Statistics computed over large datasets typically hide information about individuals, provided the output is sufficiently aggregated. However, this is not always the case for datasets that contain several small subgroups, or subgroups with highly imbalanced values.

Even when analyses focus on aggregates such as averages or other descriptive statistics, it is important to consider that certain results may still indirectly disclose information about individuals or single organizations.


Mitigating risks through disclosure control

The potential to reveal any information beyond what was agreed between project participants is known as disclosure risk. Disclosure risks can arise by accident, or from a data analyst's active attempt to access more information than they are supposed to. By inspecting the questions that an analyst wants to ask of a data set, data owners and approvers of analyses can see which results the analyst is trying to access – and whether those answers are in line with the disclosure risks the approvers are willing to accept.

As a guiding principle, disclosure risks should always be minimized. In many cases a thorough review of a query is sufficient, but not always: for some code, whether it will disclose more information than expected can only be determined while it is already running on the data.

To illustrate a disclosure risk, suppose that a few supermarkets are investigating their combined sales and revenue figures for different product categories. If one supermarket had significantly higher sales than the others, or some supermarkets hardly sold any products in a particular category, the analysis would disproportionately reveal information about the well-performing companies.

The risk here is that the numbers contributed by one of the supermarkets are significantly higher than those of the others, making the output especially disclosive (i.e. leaking sensitive data) about that particular supermarket within the group. For instance, if one supermarket had 90% of the sales in a given product category, the sum of all sales would be approximately equal to the contribution of this single supermarket. The other supermarkets could make this approximation even more accurate by simply subtracting their own contributions from the total.
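
To make the arithmetic concrete, here is a minimal sketch in plain Python; the supermarket names and sales figures are made up purely for illustration:

    # Hypothetical sales figures for one product category.
    sales = {"supermarket_a": 900, "supermarket_b": 60, "supermarket_c": 40}

    total = sum(sales.values())  # 1000: the only value that is published

    # An outsider already knows the total is close to A's contribution:
    # A holds 90% of the sales, so the total overestimates it by ~11%.
    print(total / sales["supermarket_a"])  # ~1.11

    # Supermarket B can do better by subtracting its own contribution.
    b_estimate_of_a = total - sales["supermarket_b"]  # 940
    error = (b_estimate_of_a - sales["supermarket_a"]) / sales["supermarket_a"]
    print(f"B estimates A's sales to within {error:.0%}")  # ~4%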


What happens to the data?

In the above scenario, the combined data is vertically concatenated: multiple datasets with the same variables (i.e. the sales and revenue of certain product categories) from the different supermarkets are combined into one dataset. This involves stacking the tables from each of the supermarkets, thereby extending the total number of rows for those variables. After this, magnitude tables containing the aggregated values are produced on the newly combined dataset.
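
A minimal sketch of this stacking step, using pandas in the clear for illustration (in the actual platform the tables remain encrypted throughout; the column names and figures here are made up):

    import pandas as pd

    # Hypothetical per-supermarket tables with identical columns.
    a = pd.DataFrame({"category": ["dairy", "bakery"], "sales": [900, 120]})
    b = pd.DataFrame({"category": ["dairy", "bakery"], "sales": [60, 300]})
    c = pd.DataFrame({"category": ["dairy", "bakery"], "sales": [40, 250]})

    # Vertical concatenation: same variables, more rows.
    combined = pd.concat([a, b, c], ignore_index=True)

    # Magnitude table: aggregated values per product category.
    print(combined.groupby("category")["sales"].sum())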

Using the p%-rule, it can be ascertained how dominant each contributor's numbers are within the aggregated total, and whether disclosing the output would pose a risk to that participant. In short, the rule flags a total as unsafe if one contributor's value could be estimated to within p percent by the other contributors. With Roseman Labs, this metric can be evaluated in the blind, during the encrypted Multi-Party Computation. This is particularly powerful, because if the outcome does not satisfy the p%-rule, we can prevent participants from learning the actual input and output values.
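
For illustration, the rule itself can be sketched in a few lines of plain Python. This cleartext version only shows the logic; in the platform the same check is evaluated on encrypted values:

    def passes_p_percent_rule(contributions, p=10):
        # A total is unsafe if the second-largest contributor, after
        # subtracting its own value from the total, could estimate the
        # largest contribution to within p percent.
        xs = sorted(contributions, reverse=True)
        if len(xs) < 3:
            return False  # too few contributors to hide anyone
        return sum(xs) - xs[0] - xs[1] >= (p / 100) * xs[0]

    print(passes_p_percent_rule([900, 60, 40]))    # False: one dominant party
    print(passes_p_percent_rule([350, 330, 320]))  # True: well balanced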

If the analysis shows that one contributor is in fact dominant, the results can be withheld from publication to avoid disclosing information outside the intended scope of the original analysis.


Controlling and preventing disclosure risks

Disclosure risks are a common phenomenon, inherent to all forms of data analysis. In fact, there is an entire field of research dedicated to this topic, known as Statistical Disclosure Control, largely driven by national statistical institutes that want to minimize disclosure risks in any analysis results they publish. In particular, it is important to balance the risks involved in publishing the results of an analysis against the usefulness of those results.

The application of Statistical Disclosure Control methodology in the context of Multi-Party Computation has not been extensively explored before now. The particular complexity is that disclosure risk must be assessed in the blind, during the computation itself. Rather than manually inspecting the output of an analysis and deciding on disclosure risks before publication, Multi-Party Computation requires the methodology to assess and limit disclosure risks automatically, before anything is published; otherwise, the intermediate output might still leak unwanted information to the collaborating parties. This contrasts with, for example, a large demographic study in which the results are extensively assessed by hand before being published.
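
Conceptually, the resulting flow looks like the sketch below. This is plain Python standing in for the encrypted computation: release_total is a hypothetical name, and in the real setting both the check and the sum are evaluated on encrypted values so that nobody sees the intermediate numbers:

    def release_total(contributions, p=10):
        # Evaluate the p%-rule before anything is revealed; in the
        # platform this check runs inside the Multi-Party Computation.
        xs = sorted(contributions, reverse=True)
        safe = len(xs) >= 3 and sum(xs) - xs[0] - xs[1] >= (p / 100) * xs[0]
        if not safe:
            raise PermissionError("Output withheld: p%-rule not satisfied")
        return sum(xs)  # safe: the aggregate may be revealed

    print(release_total([350, 330, 320]))  # 1000 is released
    try:
        release_total([900, 60, 40])       # blocked: dominant contributor
    except PermissionError as e:
        print(e)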


What's the value-add?

Disclosure risks are inherent in collaborative data analysis. If they are insufficiently examined before publication, sensitive information might be exposed, and the value of private computing would be partially offset. This can happen either accidentally, when results are particularly disclosive about some individuals, or as part of a dedicated attempt by an analyst or data provider to access additional information.

That is why Roseman Labs offers a set of tools to assess and limit these risks during execution of a script, flagging and halting the disclosure of unwanted information and protecting the privacy of the data and organizations involved. For further information, read our:

Developer documentation on statistical disclosure.


About the author

Guus Vogelaar is a Solution Engineer at Roseman Labs. As part of the Product team, he supports customers to ensure they can use the platform seamlessly. Guus has been working with the company since 2022, starting as a working student; last year he joined full time after finishing his Master's degree in Data Science and Entrepreneurship at JADS in Den Bosch. His final thesis developed the first framework for applying Disclosure Control in the context of Multi-Party Computation.

At Roseman Labs, Guus continues his work on Statistical Disclosure Control, finding new scenarios in which an active adversary could use the Roseman Labs platform to derive additional information from the sensitive source data of another organization. This includes creating supporting documentation for users to validate whether an analysis is secure and compliant.

Guus continues to work on an extension of the disclosure control features that currently exist in our Python package, crandas. These would help find potential risks in the code while it is being executed, making the package even more secure than it already is.


Generate new insights on sensitive data with Roseman Labs’ secure Multi-Party Computation technology. Want to find out how your organization can do that? Contact us using the form below.