Virtual data lake: Accelerate intra-company analytics


Big data silos and data lakes are becoming the norm in large organizations. However, collaboration across business units and/or countries is often inhibited, and significantly delayed, by the requirement to transfer the data to one central location.

Roseman Labs has developed a virtual data lake solution that enables organizations to virtually combine data silos and run queries against the joined data, with strong input privacy and without the need to set up a new compliant environment.

At its core is Secure Multi-Party Computation (MPC), a mature cryptographic technique that enables analyses on joint data without revealing any party's input data to the other participants.
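
To make the idea concrete, here is a minimal sketch of additive secret sharing, one of the textbook building blocks of MPC. It is an illustration only and not a description of the Roseman Labs protocol; the names `share`, `reconstruct` and `secure_sum`, the modulus `PRIME`, and the example values are all assumptions made for this sketch.

```python
import secrets

# Illustrative modulus for share arithmetic (an assumption of this sketch).
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a non-negative value into n_parties additive shares modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the value modulo PRIME.
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine additive shares into the original value."""
    return sum(shares) % PRIME


def secure_sum(private_inputs):
    """Compute the sum of the inputs while each party only handles shares."""
    n = len(private_inputs)
    # Each data silo splits its input; share j is sent to party j.
    distributed = [share(value, n) for value in private_inputs]
    # Each party adds up the shares it received. An individual share is a
    # uniformly random-looking number and reveals nothing about the input.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    # Only the combined partial sums reveal the result: the total.
    return reconstruct(partial_sums)
```

In this sketch, `secure_sum([120, 75, 310])` returns the total of the three inputs, while no single party ever holds another party's value in the clear. Production MPC systems add much more, such as secure multiplication and protection against misbehaving parties.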

Example: Eight data lakes in one organization

We illustrate our solution with an example.

Bank A is a large financial services group with eight data sources, several in each large business unit and at corporate headquarters. Bank A constantly strives to gain more insight from its local anti-fraud and customer analytics efforts. In the process, several data lakes were set up, joining data at regional levels.

With our virtual data lake solution, Bank A continues to combine insights from across its data silos, but setting up the data collaboration takes a few months instead of one to two years.

Concrete steps

Instead of embarking on the significant task of mandating a new team, centralizing and standardizing the data, sourcing and hardening the new IT environment, and passing all compliance requirements, the virtual data lake takes an incremental, less “big bang” approach.

The setup phase works as follows: A steering group is formed with members from all participating data silos. The steering group approves the types of analyses and the freedom to operate for data scientists. The owners of the various data sources can opt in to a query set. Then, the virtual data lake software is deployed on each data silo, and computations can be run.

Deployment lead time is an order of magnitude shorter than for a central data lake, and costs are typically significantly lower due to reduced technical and legal complexity.

Purpose binding, privacy, fair use

Purpose binding is enforced by requiring the steering group to sign off on a whitelist of statistics. Input privacy is guaranteed by the MPC technique, which ensures that no input data is revealed to any of the participants (other than what the output reveals). The issue of output privacy is addressed by defining a rigid set of queries that protects (differential) privacy. (For example, shifting windows over the data is not allowed.)
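
The following sketch illustrates two of these points under stated assumptions: a whitelist check that only lets approved statistics through, and a small arithmetic example of why shifting windows are a risk to output privacy. The whitelist entries, column names, and function names are hypothetical and chosen only for illustration.

```python
# Hypothetical whitelist: (statistic, column) pairs signed off by the
# steering group. The entries are made up for this sketch.
APPROVED_QUERIES = {
    ("count", "transactions"),
    ("mean", "transaction_amount"),
}


def is_approved(statistic, column):
    """Return True only if this exact statistic on this column is whitelisted."""
    return (statistic, column) in APPROVED_QUERIES


def differencing_leak(values):
    """Show how two overlapping window sums expose a single record.

    The first window covers all records and the second is shifted by one.
    Their difference is exactly the first record, even though each query
    on its own only returns an aggregate.
    """
    full_window = sum(values)
    shifted_window = sum(values[1:])
    return full_window - shifted_window
```

For example, `differencing_leak([120, 75, 310, 42])` recovers the first record from two aggregate answers, which is why a rigid query set that rules out such shifted windows helps protect individual records.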

Fair use is further enforced by monitoring queries retrospectively at the end of a period, e.g. each month. These strong privacy properties make the solution ideal for collaboration within organizations, and also well suited for external collaboration, as data sovereignty is ensured.
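
As a rough sketch of what a retrospective review could look like, the function below counts queries per analyst in a given period and lists anyone above an agreed limit. The log format (dictionaries with `analyst` and `period` keys), the limit, and the function name `review_period` are assumptions for illustration; the actual review criteria would be defined by the steering group.

```python
from collections import Counter


def review_period(query_log, period, max_queries_per_analyst):
    """Return the analysts whose query count in `period` exceeds the limit.

    query_log is a list of dicts with at least the keys "analyst" and
    "period" (a hypothetical log format assumed for this sketch).
    """
    counts = Counter(
        entry["analyst"] for entry in query_log if entry["period"] == period
    )
    return sorted(
        analyst
        for analyst, count in counts.items()
        if count > max_queries_per_analyst
    )
```

A reviewer might call `review_period(log, "2021-03", 100)` at the end of March and follow up with the analysts it returns.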

Background to MPC and Roseman Labs

Secure Multi-Party Computation (MPC) is a cryptographic technique that dates back to the 1980s. It has recently gained attention from industry because the technology has matured significantly.

Roseman Labs was founded in 2020 with a focus on MPC. Its core team consists of developers and scientists who specialize in this technique.

To conclude

We believe this solution could revolutionize how organizations set up intra-company analytics. Please do not hesitate to reach out to discuss how it could help to accelerate your data collaborations.

Contact us

Toon Segers

CPO at Roseman Labs

Published on: 11 March 2021