The virtual data lake: Work together without revealing your sensitive data
It is widely accepted that data sharing is a key component in unlocking business value and solving social issues. According to McKinsey, data sharing unlocks about 3 trillion dollars of value annually [1].
However, as soon as sensitive data is involved, data sharing becomes difficult and often even impossible. For security and privacy reasons, access to this data is restricted and sharing is discouraged, which preserves data silos.
A single dataset can be of great value in different contexts, and this value grows further once datasets are combined. The real value of data is therefore only unlocked once the silos are broken down. The Virtual Data Lake (VDL) enables this by allowing organisations to extract insights from virtually combined datasets without the need to reveal their sensitive input data.
The product has several benefits over traditional data collaboration:
- Input data is never revealed – Sensitive data is encrypted locally at each party. As part of the encryption process, the data is partitioned into parts that do not disclose any information. These parts are then shared with the different nodes in the virtual data lake.
- Data is only combined virtually; underlying data is not disclosed – Because of how MPC works, the data is only combined virtually, as part of the execution of a query. Only the outcome of this query is presented to the designated parties; no underlying data is disclosed.
- Data analysts can interact with the VDL as if it were a normal database – The VDL is an abstraction layer on top of our state-of-the-art MPC engine (Cranmera). This lets data scientists interact with the VDL as they would with any other database, giving them access to state-of-the-art cryptography without the need for any additional training or resources.
- Types of queries, output and viewers are pre-agreed and signed off – A collaboration based on the VDL is governed by a group of trustees who agree upon the queries that can be executed, the output that can be generated and the individuals who can access the VDL. This ensures that no data is extracted beyond the scope of the collaboration.
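To make the first two benefits more concrete, here is a minimal Python sketch of additive secret sharing, one common building block of MPC. It is an illustration under stated assumptions, not a description of how Cranmera works internally: the modulus, the three-node setup, the function names and the example figures are all hypothetical. Each party splits its private value into random-looking parts, each node only ever adds up the parts it holds, and only the final aggregate is reconstructed.

```python
import secrets

# Assumptions for illustration only: a public modulus and three nodes.
# The article does not state which parameters the real engine uses.
MODULUS = 2**61 - 1
NUM_NODES = 3


def split_into_shares(value, num_nodes=NUM_NODES, modulus=MODULUS):
    """Split value into num_nodes random parts that sum to value mod modulus."""
    shares = [secrets.randbelow(modulus) for _ in range(num_nodes - 1)]
    last = (value - sum(shares)) % modulus
    return shares + [last]


def reconstruct(shares, modulus=MODULUS):
    """Recombine all parts; this only works when every part is present."""
    return sum(shares) % modulus


def secure_sum(private_inputs):
    """Sum private inputs without any node seeing an individual input."""
    node_totals = [0] * NUM_NODES
    for value in private_inputs:
        # Each party shares its value locally and sends one part to each node.
        for node, share in enumerate(split_into_shares(value)):
            # Each node adds up the parts it holds, nothing more.
            node_totals[node] = (node_totals[node] + share) % MODULUS
    # Only the per-node totals are combined, so only the aggregate is revealed.
    return reconstruct(node_totals)


# Hypothetical private inputs held by three different organisations.
private_counts = [120, 45, 310]
print(secure_sum(private_counts))  # 475
```

In this sketch the "query" is a simple sum. The inputs are combined only in shared form during the computation, and the designated party learns the total of 475 but none of the three individual values.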
Deployment of the VDL
The security and privacy of the data processed in the privacy engine rely on the segregation of duties between the administrators who manage the individual servers. There are three deployment scenarios:
- Centralized – All servers deployed in a single organization; segregation of duties implemented between administrators.
- Distributed – All servers deployed in logically and legally separated environments.
- SaaS – All servers deployed in a SaaS environment that is managed by Roseman Labs; segregation of duties implemented between administrators (and cloud providers).
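The role of segregation of duties can be illustrated with a small Python sketch. It assumes simple additive secret sharing over three servers, where all parts are needed to recover a value; the article does not state which sharing scheme or collusion threshold the actual engine uses, so the scheme, the modulus and the names below are hypothetical. The point is that administrators who see only some of the parts learn nothing, because every possible secret is equally consistent with what they see.

```python
import secrets

# Assumption for illustration only: additive sharing with a public modulus.
MODULUS = 2**61 - 1


def split_into_shares(value, num_nodes=3):
    """Split value into num_nodes random parts that sum to value mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(num_nodes - 1)]
    return shares + [(value - sum(shares)) % MODULUS]


def missing_share_for(candidate, known_shares):
    """Part the unseen server would have to hold if candidate were the secret."""
    return (candidate - sum(known_shares)) % MODULUS


secret = 52_000  # hypothetical sensitive value
shares = split_into_shares(secret)

# Two colluding administrators see only two of the three parts.
seen_by_two_admins = shares[:2]

# For every candidate secret there is a valid third part, so the two
# parts they hold do not narrow down the secret at all.
candidates = (0, 1, 52_000, 99_999)
all_consistent = all(
    (sum(seen_by_two_admins) + missing_share_for(c, seen_by_two_admins)) % MODULUS == c
    for c in candidates
)
print(all_consistent)

# Only when all three parts come together can the value be recovered.
print(sum(shares) % MODULUS == secret)
```

Under these assumptions, the data is only exposed if the administrators of all servers pool what they hold, which is why each deployment scenario keeps those duties separated.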
Secure multi-party computation
At the core of the VDL is a technique called Secure Multi-Party Computation (MPC), a mature cryptographic technique that enables analyses on joint data without any party having to reveal its input data. Please refer to the video below for an explanation of MPC.
We believe this solution could revolutionize how organizations create exponential value from data. Please do not hesitate to reach out to discuss how it could help to accelerate your data collaborations.
CEO at Roseman Labs
Published on: 14 December 2021
1. Collaborating for the common good: Navigating public-private data partnerships. https://www.mckinsey.com/business-functions/mckinsey-analytics/our-insights/collaborating-for-the-common-good