Pharma Party 1 [Production]#
Please refer to the [Design] script before this one.
This script would be run on the real/production data and it shows how you would execute the script that you recorded using the _des
script (in the design environment).
You upload the approved script that has been downloaded from the web portal using script.load
.
When running in production, changing any part of the script from the original recording will result in an error.
Let’s get started by importing the necessary libraries and set the analyst key that we downloaded in the portal.
import crandas as cd
import pandas as pd
from crandas.base import session
from pathlib import Path
import plotly.express as px
import crandas.script
# On a jupyter environment provided by Roseman Labs, session variables are set automatically in the background - unless you wish to test the approval flow within jupyter (shown below)
# Set the session base_path and session.endpoint manually when executing this notebook in any other environment (commented out below)
# To test the approval flow in Jupyter, you will need the following lines to set your analyst key (downloaded from the portal). This must be referenced correctly.
# Replace with correct path and name
cd.base.session.analyst_key = cd.base.session.base_path / '../private_key.sk'
# session.endpoint = 'https://localhost:9820/api/v1' # If in jupyter you will not need this even to test approval
# session.base_path = Path('base/path/to/vdl/secrets')
Now we are working in production and have uploaded the analysis to the portal and downloaded the approved file. Now we can load in the .approved
file from the portal.
cd.script.reset()
script = cd.script.load('PharmaParty1.approved')
Next, lets load the CSV file. Note that this time we use party1_prod.csv
rather than the dummy data (identical structure).
party1_table = cd.read_csv('../../data/pharma_data/party1_prod.csv', auto_bounds=True)
In design, this would have used the csv we uploaded before cd.script.record()
. Since this is executed in production, it will now use the data corresponding to the handle!
party2_table = cd.get_table('D751A877E6FDBA5CB00D870B77D88C33742F0CFE0DC0B3202DB52D9C1980B027')
Option 1: Join tables and retrieve statistics#
Now we can merge the two tables with an inner join on a number of columns/keys shown below:
# Option 1 - a simple join; that requires specific requires for queries afterwards
merged = cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
print(repr(merged))
>>> Handle: DA3EB48E7A505A751A83C234A19276782FAB06441C0849E718C71DC513E190A7
Size: 130 rows x 14 columns
CIndex([Col("Unnamed: 0_x", "i", 1), Col("year", "i", 1), Col("month", "i", 1), Col("day", "i", 1), Col("article_nr", "i", 1), Col("condition_x", "i", 1), Col("category", "s", 8), Col("batch_nr", "i", 1), Col("smokes", "i", 1), Col("prescription", "i", 1), Col("quantity", "i", 1), Col("packaging", "i", 1), Col("Unnamed: 0_y", "i", 1), Col("condition_y", "i", 1)])
Retrieve statistics#
Then we can use the merged table to gain some insights from our dataset.
merged['condition_y'].mean()
0.49230769230769234
merged[(merged["smokes"]==1).with_threshold(10)]["condition_y"].mean()
0.44642857142857145
def compute_sum(month):
try:
result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
return result # computes means only if there are 3 cases for the month
except:
return None # if not, no output is given
dic={"Month": ['Jan', 'Feb', 'March'], "Sum of Astma": [
compute_sum(1),compute_sum(2),compute_sum(3)]}
plot=px.bar(dic, x="Month", y="Sum of Astma")
plot.show()

Option 2: Join tables and create a restricted output#
Here we can merge the tables and create a restricted output (i.e. an output with only a number of the original columns)
with cd.Transaction():
joined=cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
joined_restricted = joined[['year', 'month', 'condition_x', 'smokes', 'condition_y']]
opened_result = joined_restricted.open()
print(joined_restricted.result)
8B9C4B195E0F18DE30A4AFD9F9922EDA...[130x5]
opened_result.result
year | month | condition_x | smokes | condition_y | |
---|---|---|---|---|---|
0 | 2021 | 5 | 51783558 | 1 | 1 |
1 | 2021 | 5 | 62723894 | 0 | 1 |
2 | 2021 | 7 | 66361162 | 0 | 1 |
3 | 2021 | 11 | 89719651 | 1 | 0 |
4 | 2021 | 11 | 35151117 | 1 | 0 |
... | ... | ... | ... | ... | ... |
125 | 2021 | 5 | 93370742 | 0 | 0 |
126 | 2021 | 7 | 79925901 | 0 | 1 |
127 | 2021 | 7 | 32720617 | 0 | 0 |
128 | 2021 | 7 | 66140578 | 0 | 0 |
129 | 2021 | 1 | 39935444 | 0 | 1 |
130 rows × 5 columns
To show that queries that were not in the original recording cannot be executed, we can add a new query and you will see we get an error:
merged.open()
---------------------------------------------------------------------------
NoMatchingAuthorizationError Traceback (most recent call last)
Cell In [14], line 1
----> 1 merged.open()
File /opt/conda/lib/python3.10/site-packages/crandas/crandas.py:1052, in CDataFrame.open(self, **query_args)
1050 def open(self, **query_args):
1051 query_args["mode"] = "open"
-> 1052 return self.get(**query_args)
File /opt/conda/lib/python3.10/site-packages/crandas/crandas.py:1055, in CDataFrame.get(self, **query_args)
1054 def get(self, **query_args): # useful to go from deferred table to real table (with dimensions)
-> 1055 return vdl_query({"get": self.reference}, CDataFrame.expect(self.columns), **query_args)
File /opt/conda/lib/python3.10/site-packages/crandas/query.py:357, in vdl_query(cmd, handler, name, _retrieve_by_name, dummy_for, mode, add_nonce, bitlength, _keep, print_json, authorization_file, dry_run, session)
354 script.known_handles[handle] = script_step, None
356 handler.handle_hooks.append(_add_handle)
--> 357 return vdl_query_execute(
358 cmd, handler, print_json, authorization_file, authorizations, dry_run, session=session
359 )
File /opt/conda/lib/python3.10/site-packages/crandas/query.py:239, in vdl_query_execute(cmd, handler, print_json, authorization_file, authorizations, dry_run, session, run_tag)
237 return handler.get_dry_run_result(cmd, crandas.base.random_handle())
238 else:
--> 239 cmd_authorized = authorize_query(authorizations, cmd, session, run_tag)
240 return handler.parse_response(cmd, session.execute(cmd_authorized))
File /opt/conda/lib/python3.10/site-packages/crandas/auth.py:147, in authorize_query(templates_signed, json_query, session, run_tag)
145 raise NoMatchingAuthorizationError(f"You are trying to execute a different query from the one that was approved. Are you executing the script in exactly the same order as when you recorded it?\nThe mismatch is {last_error}.\nThe approved query is {template}.\nThe executed query is {json_query}.")
146 else:
--> 147 raise NoMatchingAuthorizationError("The current query is not among the set of approved queries.")
NoMatchingAuthorizationError: The current query is not among the set of approved queries.
Finally we can close the script that we loaded in:
# Here we just close the script
script.close()