Merge, join, and concatenate#

Crandas is a tool that allows for secure data- collaboration. It allows combining data from multiple sources to do computations. This section will show you exactly that. Here you will learn how to can combine CDataFrames in different ways.

There are multiple ways to combine tables in crandas, and here we will see them all:

Concatenating tables#

The simplest way to concatenate tables is by just stacking them together, either on top of each other or side by side. In order to do this we must be sure that the tables have the same structure, otherwise we will not be able to concatenate them. In order to do this you can use crandas.concat().

When you have multiple tables with the same structure that you want to combine (like when you are working with multiple data sources that do the same thing, like for example hospitals), that is when you use vertical concatenation (axis=0).

Warning

The tables you want to join must have the same number of columns, the same column names and the same types. It is also important to know that data is formatted in the same way before doing this.

import crandas as cd

# Creating sample CDataFrames
a = cd.DataFrame({"id": [2, 1, 3, 4], "id2": [5, 1, 4, 0]})
b = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})
c = cd.DataFrame({"id3": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
d = cd.DataFrame({"id2": [3, 0, 1, 2], "id": [54, 1, 42, 1]})
e = cd.DataFrame({"id": [3, 0, 1, 2], "id2": [54, 1, 42, 1]})


# Simple vertical concatenation
t1 = cd.concat([a, b])

# Vertical concatenation, equivalent to the above
t2 = cd.concat([a, b], axis=0)

# Concatenate only common columns, ignore index
t3 = cd.concat([a, c], join="inner")

# Concatenate CDataFrames with columns in a different order
t4 = cd.concat([a, d])

# Concatenate multiple CDataFrames at once
t4 = cd.concat([a, b, e])

It is also possible to concatenate tables horizontally, that is, next to each other. You can do this by using axis=1, however this is highly discouraged when working with private data, as it is impossible to check whether the two tables “match”. Thankfully, there are smarter ways that we can do that.

Merging/Joining CDataFrames#

In order to properly concatenate the rows of two tables, we need to have a key (or multiple) that identifies which rows should be joined together. When we have this, we can join the tables together using crandas.merge().

The main way we can join two tables is as an inner join. This means that given a key, only rows where that key appears in both tables will appear in the joined table. This type of join is very useful as this way we avoid carrying missing values when one value appears in one table and not the other.

left_df = cd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Eve"]})
right_df = cd.DataFrame({"id": [4, 2, 1], "salary": [1234, 1111, 98765], "name": ["staple", "horse", "correct"]})

# Inner join
merged = cd.merge(left_df, right_df, on="id")
>>> print(merged.open())
       id name_x  salary  name_y
    0   1  Alice   98765  correct
    1   2    Bob    1111   horse

The new table combined the two tables on id regardless of the original order and did not include the values that were not there. Additionally, it added suffixes _x and _y to the columns that had a repeated name.

Note that the name of the key column names do not need to match, in that case you can use left_on and right_on instead of on. Additionally, you can perform an outer join or join using multiple keys.

left_df = cd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Eve"],
        "good/evil": [1, 1, 0],
        "color": ["green", "black", "green"],
    }
)
right_df = cd.DataFrame(
    {
        "key": [4, 2, 1],
        "salary": [1234, 1111, 98765],
        "name": ["staple", "horse", "correct"],
        "good/evil": [1, 1, 0],
        "color": ["green", "black", "green"],
    }
)

# Outer join, joining on different named keys
merged_outer = cd.merge(left_df, right_df, how="outer", left_on="id", right_on="key")

# Inner join over multiple keys
merged_mult_keys = cd.merge(left_df, right_df, on=["good/evil", "color"])

Note

Currently, joins only work if the keys are unique in both tables (except for one exception that we present below).

Sometimes, you may wish to join two datasets that have different column names that you want to join on. In this case, you can specify left_on and right_on (left referring to the first dataset mentioned and right referring to the second).

“Left” join#

We said before that merging tables requires all keys to be unique. The one exception to this is the following situation:

One database consists of purchases by a group of people. The second database contains information about people, where each row corresponds to a person. As a person might make more than one purchase (or none), there will likely be not exactly one row per person in the purchases database.

We want to join both tables, which in this case effectively means adding the information of each person to each purchase that that person did. In other words, we want to fill the left table with information from the right one. This is why we call this a left join. [1]

We are able to do this, but because there are some additional cryptographic implications, it is not as straightforward as before.

First we have to create a crandas.groupby object in the following way:

purchases = cd.DataFrame({"price": [11,22,33,44,55,66,77,88,99], "buyer_id": [1,2,3,3,5,5,5,5,6]})

grouped = purchases.groupby("buyer_id")

After doing this we can join it with our people database

people = cd.DataFrame({"names": ["Alice", "Bob", "Crow", "Dave", "Eve"], "id": [1,2,3,4,5], "age": [30,32,29,45,55]})

# Note how `left_on` has the grouped object and not the name of the column
merged = cd.merge(purchases, people, left_on=grouped, right_on="id", how="left")
>>> print(merged.open())
       price  buyer_id  names  age
    0     11         1  Alice   30
    1     22         2    Bob   32
    2     33         3   Crow   29
    3     44         3   Crow   29
    4     55         5    Eve   55
    5     66         5    Eve   55
    6     77         5    Eve   55
    7     88         5    Eve   55
    8     99         6           0

This is one of the clearest examples of how crandas is different to pandas. Unfortunately in order to maintain the cryptographic guarantees for the data, we have to diverge how things are done in pandas. However, the fact that we can create such a table without leaking information is worth it!

Warning

The left join does not fill the missing entries with null values. Instead, it adds a particular value depending of the datatype. See the above example: missing entries in the varchar column names in the people database are represented by the empty string "". On the other hand, missing values for the integer column age are filled by zeros.