Pharma Party 1 [Design]#
In this scenario, two pharmaceutical companies are collaborating on a joint analysis using crandas and plotly to visualize the results.
One pharmaceutical company will upload their dataset via the web portal and the other (this script) will upload via crandas and also perform the analysis.
They want to merge and analyze this data to gain insights. The script performs data merging, computes conditional values, and visualizes the results using bar plots.
This shows how we can record scripts for analysis on production data. Notice how this script sets the analyst key (downloaded from the web portal) and then uses the script.record()
function to record the script for approval (creates a JSON that can be uploaded to the portal - script.save()
).
Let’s get started by importing the necessary libraries and set the analyst key that we downloaded in the portal.
import crandas as cd
import pandas as pd
from crandas.base import session
from pathlib import Path
import plotly.express as px
import crandas.script
# On a jupyter environment provided by Roseman Labs, session variables are set automatically in the background - unless you wish to test the approval flow within jupyter (shown below)
# Set the session base_path and session.endpoint manually when executing this notebook in any other environment (commented out below)
# To test the approval flow in Jupyter, you will need the following lines to set your analyst key (downloaded from the portal). This must be referenced correctly.
# Replace with correct path and name
cd.base.session.analyst_key = cd.base.session.base_path / '../private_key.sk'
# session.endpoint = 'https://localhost:9820/api/v1' # If in jupyter you will not need this even to test approval
# session.base_path = Path('base/path/to/vdl/secrets')
Now we want to select which dummy data we want to use to test our script (this dummy data has same column names, data types and bounds).
The handle in dummy_for
will need to be replaced with the handle from the production environment when you upload ‘party2_prod.csv’ to the web portal (different each time).
This ensures that when executing in the production environment, it uses the production data.
party2_table = cd.read_csv('../../data/pharma_data/party2_dummy.csv', auto_bounds=True, dummy_for='D751A877E6FDBA5CB00D870B77D88C33742F0CFE0DC0B3202DB52D9C1980B027')
After this, we can start recording the script:
# Record the script from here
script = cd.script.record()
Note
If for some reason it says that a script is already running, just execute cd.script.reset()
Next, lets load the dummy CSV file we will use, this is also has an identical structure to the production data (allowing us to use the production data when executing in the production environment).
party1_table = cd.read_csv('../../data/pharma_data/party1_dummy.csv', auto_bounds=True)
Next, we need to access the data that party 2 uploaded in the production web portal (remember to replace the handle here, with the same one as you put for dummy_for
above).
Since this is executed in design, it will use the dummy data csv for party 2 above. In production it will use the table specified by the handle below.
party2_table = cd.get_table('D751A877E6FDBA5CB00D870B77D88C33742F0CFE0DC0B3202DB52D9C1980B027')
Option 1: Join tables and retrieve statistics#
Now we can merge the two tables with an inner join on a number of columns/keys shown below:
# Option 1 - a simple join; that requires specific requires for queries afterwards
merged = cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
print(repr(merged))
>>> Handle: DA3EB48E7A505A751A83C234A19276782FAB06441C0849E718C71DC513E190A7
Size: 130 rows x 14 columns
CIndex([Col("Unnamed: 0_x", "i", 1), Col("year", "i", 1), Col("month", "i", 1), Col("day", "i", 1), Col("article_nr", "i", 1), Col("condition_x", "i", 1), Col("category", "s", 8), Col("batch_nr", "i", 1), Col("smokes", "i", 1), Col("prescription", "i", 1), Col("quantity", "i", 1), Col("packaging", "i", 1), Col("Unnamed: 0_y", "i", 1), Col("condition_y", "i", 1)])
Retrieve statistics#
Then we can use the merged table to gain some insights from our dataset.
merged['condition_y'].mean()
0.49230769230769234
merged[(merged["smokes"]==1).with_threshold(10)]["condition_y"].mean()
0.44642857142857145
def compute_sum(month):
try:
result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
return result # computes means only if there are 3 cases for the month
except:
return None # if not, no output is given
dic={"Month": ['Jan', 'Feb', 'March'], "Sum of Astma": [
compute_sum(1),compute_sum(2),compute_sum(3)]}
plot=px.bar(dic, x="Month", y="Sum of Astma")
plot.show()

Option 2: Join tables and create a restricted output#
Here we can merge the tables and create a restricted output (i.e. an output with only a number of the original columns)
with cd.Transaction():
joined=cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
joined_restricted = joined[['year', 'month', 'condition_x', 'smokes', 'condition_y']]
opened_result = joined_restricted.open()
print(joined_restricted.result)
8B9C4B195E0F18DE30A4AFD9F9922EDA...[130x5]
opened_result.result
year | month | condition_x | smokes | condition_y | |
---|---|---|---|---|---|
0 | 2021 | 5 | 51783558 | 1 | 1 |
1 | 2021 | 5 | 62723894 | 0 | 1 |
2 | 2021 | 7 | 66361162 | 0 | 1 |
3 | 2021 | 11 | 89719651 | 1 | 0 |
4 | 2021 | 11 | 35151117 | 1 | 0 |
... | ... | ... | ... | ... | ... |
125 | 2021 | 5 | 93370742 | 0 | 0 |
126 | 2021 | 7 | 79925901 | 0 | 1 |
127 | 2021 | 7 | 32720617 | 0 | 0 |
128 | 2021 | 7 | 66140578 | 0 | 0 |
129 | 2021 | 1 | 39935444 | 0 | 1 |
130 rows × 5 columns
Finally, we can use script.save()
in order to save the recorded script as a JSON that we can upload into the portal.
# This saves the script (everything from record till here) as a json file for you to download
script = cd.script.save('Pharma-Party1.json')