Joining Data - [Party 2]#
In this tutorial, we will demonstrate how the second party in a collaboration would upload their data to the VDL analysis to be performed by Party 1.
Step 1: Start the VDL#
In this step, we import the necessary libraries, including crandas and pandas.
import crandas as cd
import pandas as pd
from pathlib import Path
We then set up the VDL session by specifying the location of the VDL certificates and the VDL endpoint URL. These certificates are essential to authenticate the connection between our code and the VDL.
# Update your path to where you have stored your VDL certificates
from crandas.base import session
force_reupload = False
session.base_path = Path('../../../../vdl-instance/vdl/secrets')
session.endpoint = 'https://localhost:9820/api/v1'
Step 2: Input data#
In this step, we define some parameters to limit the amount of data and set a table name as a reference for the other party (same as for party 1). We also specify the relevant columns that we will use in our analysis.
#Set this to limit the amount of data
rows_per_dataset = 1000
# Use this name as a reference for the other party
nutrition_table_name = 'nutrition'
# Select relevant columns
relevant_columns = ['ean_code',
'SPP_DESCRIPTION',
'ENERGY_VALUE_IN_KJ',
'ENERGY_VALUE_IN_KCAL',
'SODIUM_IN_MG',
'SATURATED_FATTY_ACIDS_IN_G',
'TOTAL_PROTEIN_IN_G',
'MONO_AND_DISACCHARIDES_IN_G',
'DIETARY_FIBER_IN_G',
'STANDARD_PORTION_SIZE',
'NUMBER_OF_PORTIONS_PER_PACKAGE']
Read the local file and upload it:#
Now, we read the local CSV file containing the nutrition data using pandas and limit the number of rows based on the rows_per_dataset we set earlier. We then convert the ean_code
column to a string data type.
file_path = '../../data/sales_nutrition_data/nutrition_data.csv'
# Read the local csv using pandas
nutrition_table_data = pd.read_csv(file_path, nrows=rows_per_dataset)
# Convert 'ean_code' to string data type
nutrition_table_data['ean_code'] = nutrition_table_data["ean_code"].astype(str)
# Upload nutrition data to the VDL
nutrition_table = cd.upload_pandas_dataframe(nutrition_table_data[relevant_columns], name=nutrition_table_name)
# show metadata for the table (column titles and field types, i=integer, s=string)
print("Table meta-data:", repr(nutrition_table))
Reading data...
Uploading data...
Table meta-data: Name: 94EDD8E55C616AEB5C734AE212064DA5DFB4D59018BD8DCC8DC44C3F71BF610F
Size: 1000 rows x 11 columns
CIndex([Col("ean_code", "s", 14), Col("SPP_DESCRIPTION", "s", 14), Col("ENERGY_VALUE_IN_KJ", "i", 1), Col("ENERGY_VALUE_IN_KCAL", "i", 1), Col("SODIUM_IN_MG", "i", 1), Col("SATURATED_FATTY_ACIDS_IN_G", "i", 1), Col("TOTAL_PROTEIN_IN_G", "i", 1), Col("MONO_AND_DISACCHARIDES_IN_G", "i", 1), Col("DIETARY_FIBER_IN_G", "i", 1), Col("STANDARD_PORTION_SIZE", "i", 1), Col("NUMBER_OF_PORTIONS_PER_PACKAGE", "i", 1)])
Now our data is in the database, ready to be used by another party by simply retrieving it by name.