Pharma Party 1 [Production]#

Please refer to the [Design] script before this one.

This script would be run on the real/production data and it shows how you would execute the script that you recorded using the _des script (in the design environment). You upload the approved script that has been downloaded from the web portal using script.load. When running in production, changing any part of the script from the original recording will result in an error.

Let’s get started by importing the necessary libraries and set the analyst key that we downloaded in the portal.

import crandas as cd
import pandas as pd
from crandas.base import session
from pathlib import Path
import plotly.express as px
import crandas.script

# On a jupyter environment provided by Roseman Labs, session variables are set automatically in the background - unless you wish to test the approval flow within jupyter (shown below)
# Set the session base_path and session.endpoint manually when executing this notebook in any other environment (commented out below)

# To test the approval flow in Jupyter, you will need the following lines to set your analyst key (downloaded from the portal). This must be referenced correctly.

# Replace with correct path and name
cd.base.session.analyst_key = cd.base.session.base_path / '../private_key.sk'

# session.endpoint = 'https://localhost:9820/api/v1' # If in jupyter you will not need this even to test approval
# session.base_path = Path('base/path/to/vdl/secrets')

Now we are working in production and have uploaded the analysis to the portal and downloaded the approved file. Now we can load in the .approved file from the portal.

cd.script.reset()
script = cd.script.load('PharmaParty1.approved')

Next, lets load the CSV file. Note that this time we use party1_prod.csv rather than the dummy data (identical structure).

party1_table = cd.read_csv('../../data/pharma_data/party1_prod.csv', auto_bounds=True)

In design, this would have used the csv we uploaded before cd.script.record(). Since this is executed in production, it will now use the data corresponding to the handle!

party2_table = cd.get_table('D751A877E6FDBA5CB00D870B77D88C33742F0CFE0DC0B3202DB52D9C1980B027')

Option 1: Join tables and retrieve statistics#

Now we can merge the two tables with an inner join on a number of columns/keys shown below:

# Option 1 - a simple join; that requires specific requires for queries afterwards
merged = cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
print(repr(merged))
>>> Handle: DA3EB48E7A505A751A83C234A19276782FAB06441C0849E718C71DC513E190A7
Size: 130 rows x 14 columns
CIndex([Col("Unnamed: 0_x", "i", 1), Col("year", "i", 1), Col("month", "i", 1), Col("day", "i", 1), Col("article_nr", "i", 1), Col("condition_x", "i", 1), Col("category", "s", 8), Col("batch_nr", "i", 1), Col("smokes", "i", 1), Col("prescription", "i", 1), Col("quantity", "i", 1), Col("packaging", "i", 1), Col("Unnamed: 0_y", "i", 1), Col("condition_y", "i", 1)])

Retrieve statistics#

Then we can use the merged table to gain some insights from our dataset.

merged['condition_y'].mean()
0.49230769230769234
merged[(merged["smokes"]==1).with_threshold(10)]["condition_y"].mean()
0.44642857142857145
def compute_sum(month):
    try:
        result = merged[((merged["condition_y"]==1) & (merged['month']==month)).with_threshold(3)]["condition_y"].sum()
        return result             # computes means only if there are 3 cases for the month
    except:
        return None                # if not, no output is given

dic={"Month": ['Jan', 'Feb', 'March'], "Sum of Astma": [
     compute_sum(1),compute_sum(2),compute_sum(3)]}

plot=px.bar(dic, x="Month", y="Sum of Astma")
plot.show()
../../_images/pharma_party1_des.png

Option 2: Join tables and create a restricted output#

Here we can merge the tables and create a restricted output (i.e. an output with only a number of the original columns)

with cd.Transaction():
    joined=cd.merge(party1_table, party2_table, how='inner', on=['year', 'month', 'day', 'article_nr', 'batch_nr', 'smokes'])
    joined_restricted = joined[['year', 'month', 'condition_x', 'smokes', 'condition_y']]
    opened_result = joined_restricted.open()
print(joined_restricted.result)
8B9C4B195E0F18DE30A4AFD9F9922EDA...[130x5]
opened_result.result
year month condition_x smokes condition_y
0 2021 5 51783558 1 1
1 2021 5 62723894 0 1
2 2021 7 66361162 0 1
3 2021 11 89719651 1 0
4 2021 11 35151117 1 0
... ... ... ... ... ...
125 2021 5 93370742 0 0
126 2021 7 79925901 0 1
127 2021 7 32720617 0 0
128 2021 7 66140578 0 0
129 2021 1 39935444 0 1

130 rows × 5 columns

To show that queries that were not in the original recording cannot be executed, we can add a new query and you will see we get an error:

merged.open()
---------------------------------------------------------------------------

NoMatchingAuthorizationError              Traceback (most recent call last)

Cell In [14], line 1
----> 1 merged.open()


File /opt/conda/lib/python3.10/site-packages/crandas/crandas.py:1052, in CDataFrame.open(self, **query_args)
   1050 def open(self, **query_args):
   1051     query_args["mode"] = "open"
-> 1052     return self.get(**query_args)


File /opt/conda/lib/python3.10/site-packages/crandas/crandas.py:1055, in CDataFrame.get(self, **query_args)
   1054 def get(self, **query_args):  # useful to go from deferred table to real table (with dimensions)
-> 1055     return vdl_query({"get": self.reference}, CDataFrame.expect(self.columns), **query_args)


File /opt/conda/lib/python3.10/site-packages/crandas/query.py:357, in vdl_query(cmd, handler, name, _retrieve_by_name, dummy_for, mode, add_nonce, bitlength, _keep, print_json, authorization_file, dry_run, session)
    354         script.known_handles[handle] = script_step, None
    356     handler.handle_hooks.append(_add_handle)
--> 357 return vdl_query_execute(
    358     cmd, handler, print_json, authorization_file, authorizations, dry_run, session=session
    359 )


File /opt/conda/lib/python3.10/site-packages/crandas/query.py:239, in vdl_query_execute(cmd, handler, print_json, authorization_file, authorizations, dry_run, session, run_tag)
    237     return handler.get_dry_run_result(cmd, crandas.base.random_handle())
    238 else:
--> 239     cmd_authorized = authorize_query(authorizations, cmd, session, run_tag)
    240     return handler.parse_response(cmd, session.execute(cmd_authorized))


File /opt/conda/lib/python3.10/site-packages/crandas/auth.py:147, in authorize_query(templates_signed, json_query, session, run_tag)
    145     raise NoMatchingAuthorizationError(f"You are trying to execute a different query from the one that was approved. Are you executing the script in exactly the same order as when you recorded it?\nThe mismatch is {last_error}.\nThe approved query is {template}.\nThe executed query is {json_query}.")
    146 else:
--> 147     raise NoMatchingAuthorizationError("The current query is not among the set of approved queries.")


NoMatchingAuthorizationError: The current query is not among the set of approved queries.

Finally we can close the script that we loaded in:

# Here we just close the script
script.close()