Joining Data - [Party 1]#
In this tutorial, we will walk through a step-by-step process of analyzing sales and nutrition data using the crandas library.
Step 1: Start the VDL#
First, we import the necessary libraries, including crandas (cd
), pandas (pd
) and other necessary packages.
import crandas as cd
import pandas as pd
import plotly.express as px
from pathlib import Path
Next, we update the base path to where the VDL (Virtual Data Lake) certificates are stored, and set the VDL endpoint.
# Update your path to where you have stored your VDL certificates
from crandas.base import session
force_reupload = False
session.base_path = Path('../../../../vdl-instance/vdl/secrets')
session.endpoint = 'https://localhost:9820/api/v1'
Step 2: Input data#
We set the number of rows to read for each dataset and define the table name for the food sales data.
#Set this to limit the amount of data
rows_per_dataset = 1000
food_sales_table_name = 'food_sales'
We then select the relevant columns that we need for our analysis.
# Select relevant columns
relevant_columns = ['Year',
'ProductNumber',
'ProductName',
'EAN_Number',
'QuantitySold']
Read the local file and upload it#
We read the food sales data from a local CSV file and limit the number of rows to our specified value (1000). After that, we upload the data to the VDL using cd.upload_pandas_dataframe()
# Read in the csv to pandas
food_sales_data = pd.read_csv('../../data/sales_nutrition_data/food_sales_data.csv', nrows=rows_per_dataset)
# Make the column 'EAN_Number' a string
food_sales_data['EAN_Number'] = food_sales_data["EAN_Number"].astype(str)
# Upload the pandas dataframe to the VDL
food_sales_table = cd.upload_pandas_dataframe(food_sales_data[relevant_columns], name=food_sales_table_name)
# Show metadata for the table (column titles and field types, i=integer, s=string)
>>> print("Table meta-data:", repr(food_sales_table))
Table meta-data: Name: 4D172B6A3F7412FE967F17D1B5BB98755EAB558B81EB76137849AB20E0F2C6F4
Size: 1000 rows x 5 columns
CIndex([Col("Jaar", "i", 1), Col("ProductNummer", "i", 1), Col("ProductNaam", "s", 14), Col("EAN_Nummer", "s", 14), Col("AantalVerkocht", "i", 1)])
Step 3: Access the Other Party’s Data#
Note
We assume that a different party upload their data in a table called nutrition
according to the steps in Joining Data - [Party 2].
Now we can access and connect to the second table containing nutrition data.
# Show that we can access and connect to the second table: nutrition_table_name
nutrition_table_name = 'nutrition'
# Access the table
nutrition_table = cd.get_table(nutrition_table_name)
# Print the meta-data for the table we just accessed
>>> print(repr(nutrition_table))
Name: BE7D6667201FFE6DD2352FB926FC1A06062853662B1348DD037837F47F65A69D
Size: 1000 rows x 11 columns
CIndex([Col("ean_code", "s", 14), Col("SPP_OMSCHRIJVING", "s", 14), Col("ENERGIEWAARDEINKJ_KJ", "i", 1), Col("ENERGIEWAARDEINKCAL_KCAL", "i", 1), Col("NATRIUM_MG", "i", 1), Col("VETZURENVERZADIGD_G", "i", 1), Col("EIWITTOTAAL_G", "i", 1), Col("MONOENDISACCHARIDEN_G", "i", 1), Col("VOEDINGSVEZEL_G", "i", 1), Col("STANDAARD_PORTIEGROOTTE", "i", 1), Col("AANTAL_PORTIES_PER_VERPAKKING", "i", 1)])
Step 4: Join tables#
We perform an inner join on the food sales table and the nutrition table using the EAN code as the key, and account for the fact that they have different column names in each table.
#Join the two datasets using an inner join on EAN code
joined_table = cd.merge(food_sales_table, nutrition_table, how="inner", left_on='EAN_Number', right_on='ean_code')
Step 5: Run the analyses#
We compute the average number of products sold and the number of products with zero sales for some simple data quality checks.
avg_sales = joined_table['QuantitySold'].mean()
zero_sales = sum(joined_table['QuantitySold'] == 0)
>>> print('Average number of products sold:', "%.0f" % avg_sales)
Average number of products sold: 23909
>>> print('Number of products with zero sales:', zero_sales)
Number of products with zero sales: 2
In the next step, we define a list of labels that represent different nutritional elements in the datasets.
labels_list = ['ENERGY_VALUE_IN_KJ',
'ENERGY_VALUE_IN_KCAL',
'SODIUM_IN_MG',
'SATURATED_FATTY_ACIDS_IN_G',
'TOTAL_PROTEIN_IN_G',
'MONO_AND_DISACCHARIDES_IN_G',
'DIETARY_FIBER_IN_G']
Now, we compute the total nutritional values for each product by multiplying the nutritional element columns by the number of items sold (AantalVerkocht
) and store these values in new columns with the suffix _tot
.
# For each label (k) listed above, create a "totals" column that multiplies column k times "AantalVerkocht"
merged = joined_table.assign(**{k + "_tot":(joined_table[k] * joined_table['QuantitySold']) for k in labels_list})
After this, we set the year to 2019, define a new list of labels without ENERGY_VALUE_IN_KJ
(calories in kilojoules), and create a dictionary coicop
to map product numbers to their names. We also create an empty list named table
to store the data.
year = 2019
labels_list = ['ENERGY_VALUE_IN_KCAL',
'SODIUM_IN_MG',
'SATURATED_FATTY_ACIDS_IN_G',
'TOTAL_PROTEIN_IN_G',
'MONO_AND_DISACCHARIDES_IN_G',
'DIETARY_FIBER_IN_G']
coicop = {1113: 'Bread',
1117: 'Breakfast cereals',
1175: 'Chips',
1183: 'Chocolate',
1222: 'Soft drinks',
1184: 'Candy'}
table = []
In this loop, we iterate through the coicop
dictionary and labels_list
, calculating the total nutritional values for each product and nutritional element in the specified year. We then append this data to the table
list and create a DataFrame
df
with the appropriate column names.
for key,val in coicop.items():
for label in labels_list:
sub = []
sub.append(key)
sub.append(val)
sub.append(label)
sub.append(sum(merged[(merged["ProductNumber"]==key) & (merged["Year"]==year)][label+"_tot"]))
table.append(sub)
df = pd.DataFrame(table, columns =['ProductNumber', 'ProductName', 'Element', 'Totaal'])
df.head()
Finally, we create a bar chart using the data from the df
DataFrame
. The x-axis represents the nutritional elements, the y-axis represents the total nutritional value, and the bars are colored by product names. The chart’s title displays the year of analysis (2019).
fig = px.bar(df, x="Element", y="Totaal", color="ProductName", text_auto=False, title='Total intake for: '+str(year), height=800)
fig.show()
