crandas.crandas#

Main crandas functionality: dataframes (CDataFrame), series (CSeries), and analysis operations (e.g., merge())

class crandas.crandas.CDataFrame(columns, nrows=None, **kwargs)#

Bases: StateObject

Dataframe stored in the VDL

CDataFrame provides access to tables stored in the VDL using an API modeled on pandas' DataFrame. A CDataFrame may be obtained in one of the following ways:

  • By uploading data into the VDL using read_csv() or upload_pandas_dataframe()

  • By accessing an earlier uploaded table using get_table()

__getitem__(key)#

Implements df[key]

  • If key is a CSeries or a function, call CDataFrame.filter.

  • If key is a list, call CDataFrame.project

  • If key is a str, return a CSeries representing the column with the given name

  • If key is a slice, call CDataFrame.slice

Raises:

TypeError – the key must be one of the accepted types
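The dispatch rules above can be sketched in plain Python. This is only a model of the type-based routing and not the crandas implementation; `FakeCSeries` and `dispatch_getitem` are hypothetical names introduced for illustration.

```python
class FakeCSeries:
    """Hypothetical stand-in for a CSeries, used only for type dispatch."""


def dispatch_getitem(key):
    """Name the operation that df[key] maps to, following the rules above."""
    if isinstance(key, FakeCSeries) or callable(key):
        return "filter"
    if isinstance(key, list):
        return "project"
    if isinstance(key, str):
        return "column"
    if isinstance(key, slice):
        return "slice"
    # Any other key type is rejected, mirroring the documented TypeError.
    raise TypeError(f"unsupported key type: {type(key).__name__}")
```

Under this model, `dispatch_getitem(["a", "b"])` gives `"project"`, while an integer key raises `TypeError`.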

add_prefix(prefix)#

Implements pandas.DataFrame.add_prefix

Parameters:

prefix (str) – prefix to be added

Returns:

Copy of CDataFrame where prefix is added to all column names

Return type:

CDataFrame

add_suffix(suffix)#

Implements pandas.DataFrame.add_suffix

Parameters:

suffix (str) – suffix to be added

Returns:

Copy of CDataFrame where suffix is added to all column names

Return type:

CDataFrame

append(other, ignore_index=True)#

Implements pandas.DataFrame.append by calling crandas.concat. This method is to be deprecated.

Parameters:
  • other (DataFrame or CDataFrame) – The data to append.

  • ignore_index (bool, optional) – If True, the resulting axis will be labeled 0, 1, …, n - 1. Currently only True is allowed, by default True

Returns:

The concatenated table

Return type:

CDataFrame

assign(**query_args)#

Implements pandas.DataFrame.assign. Assigns new columns to a CDataFrame, and outputs a new CDataFrame with the new columns.

Assigned values need to be CSeries (or callables providing a CSeries); assignment of clear Series, scalars, or arrays is not supported.

The keyword names are used as column names. Column names cannot coincide with vdl_query arguments; these are passed on to vdl_query unchanged.

describe()#

Generate descriptive statistics

filter(key, **query_args)#

Filter table

Returns table with all rows of the original table satisfying the criterion represented by key.

key can be a CSeries representing a table column or a computation on table row(s). In this case, the CSeries values need to be 1 (indicating that the corresponding row will be selected) or 0 (indicating that the row will not be selected).

If the CSeries used for indexing has a threshold (see CSeries.with_threshold), the filtered result is only returned if it has the minimum number of rows as indicated by the threshold.

Alternatively, key can be a function to be applied to the table columns. The function is called with one argument representing the table, the fields of which correspond to the columns. E.g., the key lambda x: x.col1==1 represents the function that checks whether the value of the column named “col1” equals one.

See function_to_json for more information.

Parameters:
  • key (CSeries or callable) – Filter criterion

  • query_args (query arguments, see vdl_query) –

Returns:

The filtered table

Return type:

CDataFrame
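The 0/1 selection and the threshold rule described for filter can be modelled on plain Python lists. This is a sketch under assumptions, not crandas code: `filter_rows` is a hypothetical helper, and raising `ValueError` when the threshold is not met is an assumption, since this reference does not state what happens in that case.

```python
def filter_rows(rows, mask, threshold=None):
    """Keep rows whose mask entry is 1; enforce an optional minimum row count."""
    if len(rows) != len(mask):
        raise ValueError("mask must have one entry per row")
    if any(bit not in (0, 1) for bit in mask):
        raise ValueError("mask entries must be 0 or 1")
    selected = [row for row, bit in zip(rows, mask) if bit == 1]
    if threshold is not None and len(selected) < threshold:
        # Assumption: the real error type is not given in this reference.
        raise ValueError("filtered result has fewer rows than the threshold")
    return selected


rows = [{"col1": 1}, {"col1": 2}, {"col1": 1}]
# Models the key `lambda x: x.col1 == 1` evaluated row by row.
mask = [1 if row["col1"] == 1 else 0 for row in rows]
result = filter_rows(rows, mask, threshold=2)
```

Here two rows are selected, which meets the threshold of 2; with `threshold=3` the model refuses to return the result.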

groupby(col, **query_args)#

Computes a grouping of the table by the values of a given column. Returns a grouping object that can be used in aggregation (see CSeriesGroupBy) or as an argument to crandas.merge().

Parameters:

col (str) – Name of column to group by

Returns:

Grouping object

Return type:

CDataFrameGroupBy

classmethod json_to_closed(deferred, json_response)#

Returns an instance of the class corresponding to the provided JSON representation.

If the instance comes from a transaction, then deferred is the deferred object originally returned by vdl_query. This function should then check that the returned answer complies with the expected deferred. Otherwise, deferred is None.

classmethod json_to_opened(deferred, json_response, masker)#

Returns an opened object (e.g. a Pandas dataframe) corresponding to the provided JSON representation.

If the instance comes from a transaction, then deferred is the deferred object originally returned by vdl_query. This function should then check that the returned answer complies with the expected deferred. Otherwise, deferred is None.

max(axis=0)#

Computes the maximum of each (numeric) column

Parameters:

axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0

min(axis=0)#

Computes the minimum of each (numeric) column

Parameters:

axis (int, optional) – Which axis of the dataframe, only 0 is implemented, by default 0

open_dry_run_result()#

Run on a deferred instance of the class as returned by cls.expect(). Should return an example opened object (e.g., DataFrame) of the same type as would be obtained by opening the object.

project(cols, **query_args)#

Project table

Returns table with same rows but a selection of columns

Parameters:
  • cols (list of str) – Columns to select. Can be empty. Columns can occur multiple times.

  • query_args (query arguments, see vdl_query) –

Returns:

The projected table

Return type:

CDataFrame
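As a rough illustration of the column-selection rules (empty selections and repeated columns are allowed), the following plain-Python model treats a table as a dict of column lists. `project_columns` is a hypothetical helper; how crandas names repeated columns in the result is not stated here, so the model simply returns (name, values) pairs.

```python
def project_columns(table, cols):
    """Select columns by name, in the requested order; repeats are allowed."""
    return [(name, list(table[name])) for name in cols]


table = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
reordered = project_columns(table, ["c", "a"])
repeated = project_columns(table, ["a", "a"])
empty = project_columns(table, [])
```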

rename(columns, **query_args)#

Implements pandas.DataFrame.rename. Only renaming of columns via the columns argument is supported.

Parameters:

columns (dict) – dictionary of columns to be renamed of the form {“oldname”: “newname”}

Returns:

CDataFrame with updated column names

Return type:

CDataFrame

sample(*, n=None, frac=None, random_state=None, **query_args)#

Samples rows from the dataframe.

The number of rows can be specified either as an integer n or a fraction frac. The case frac==1 corresponds to returning a shuffling of the table and is equivalent to CDataFrame.shuffle.

If a random_state is given, the sampling is performed in a deterministic way and according to a public selection (i.e., known to the servers and predictable to the client); otherwise, the sampling is non-deterministic and private (not known to the client and servers). See also CDataFrame.shuffle.

Parameters:
  • n (integer, default: None) – Number of rows to sample

  • frac (floating-point, default: None) – Proportion of rows (between 0.0 and 1.0, inclusive) to sample

  • random_state (long integer, default: None) – Seed for deterministic sampling (otherwise the sampling is non-deterministic)

Returns:

ret – Copy of the table with rows sampled

Return type:

CDataFrame

shuffle(*, random_state=None, **query_args)#

Return table with rows shuffled. If a random_state is given, the shuffle is deterministic and performed according to a public permutation (i.e., known to the servers and predictable to the client); otherwise, the shuffle is non-deterministic and private (not known to the client and servers).

Parameters:

random_state (long integer, default: None) – Seed for deterministic shuffle (otherwise the shuffle is non-deterministic)

Returns:

ret – Copy of the table with rows shuffled

Return type:

CDataFrame
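The relationship between seeded and unseeded shuffling, and between sample and shuffle, can be sketched with the standard library. This is a local model only: it says nothing about how the VDL keeps an unseeded permutation private. The helper names, the requirement that exactly one of n and frac is given, and the truncating conversion of frac to a row count are all assumptions.

```python
import random


def shuffle_rows(rows, random_state=None):
    """Return a shuffled copy; a seed makes the permutation reproducible."""
    if random_state is not None:
        rng = random.Random(random_state)  # deterministic, "public" permutation
    else:
        rng = random.SystemRandom()  # fresh randomness on every call
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


def sample_rows(rows, n=None, frac=None, random_state=None):
    """Sample n rows, or a fraction frac of the rows, without replacement."""
    if (n is None) == (frac is None):
        raise ValueError("specify exactly one of n and frac")  # assumption
    if frac is not None:
        if not 0.0 <= frac <= 1.0:
            raise ValueError("frac must lie between 0.0 and 1.0")
        n = int(frac * len(rows))  # assumption: truncate to a whole row count
    return shuffle_rows(rows, random_state)[:n]


rows = list(range(10))
```

With the same seed, `sample_rows(rows, frac=1.0, random_state=7)` and `shuffle_rows(rows, random_state=7)` produce the same permutation in this model, matching the statement that frac==1 is equivalent to a shuffle.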

slice(key, **query_args)#

Slice table

Returns table with same columns but a selection of rows

Parameters:
  • key (slice) – Python slice object representing rows to select

  • query_args (query arguments, see vdl_query) –

Returns:

The sliced table

Return type:

CDataFrame

validate(*validations, **query_args)#

Applies input validation to the table.

Input validation leads to a table that has the validations as constraints on the respective columns (e.g., checking that a column contains values in [0,2] leads to a column with values constrained to that domain). These constraints can be inspected by accessing tab.columns.cols[i].constraints.

Validations are instances of the Validation class and can be set by calling validation functions such as CSeriesColRef.in_range() and CSeriesColRef.sum_in_range().

Parameters:

*validations (list of Validation objects) – Validations to apply to the table

Returns:

If all validations have succeeded: copy of the table having the validations as constraints

Return type:

CDataFrame
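A minimal plain-Python model of the two validation kinds named above (value range and sum range) might look as follows. It is a sketch, not the crandas API: the functions here are local stand-ins that happen to share names with the documented methods, the model does not record constraints on the result, and raising `ValueError` on failure is an assumption because the failure behaviour is not described here.

```python
def in_range(column, minval, maxval):
    """Check that every value of `column` lies in [minval, maxval]."""
    def check(table):
        return all(minval <= value <= maxval for value in table[column])
    return check


def sum_in_range(column, minval, maxval):
    """Check that the sum of `column` lies in [minval, maxval]."""
    def check(table):
        return minval <= sum(table[column]) <= maxval
    return check


def validate(table, *validations):
    """Return a copy of the table if every validation succeeds."""
    for check in validations:
        if not check(table):
            raise ValueError("validation failed")  # assumption: error type
    return {name: list(values) for name, values in table.items()}


table = {"score": [1, 2, 0]}
validated = validate(table, in_range("score", 0, 2), sum_in_range("score", 0, 5))
```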

class crandas.crandas.CIndex(cols, **kwargs)#

Bases: object

Index (set of columns) of a CDataFrame

For a regular CDataFrame, this represents the columns (name and type) of the CDataFrame.

For a deferred CDataFrame (in a transaction, or resulting from a dry run), this represents the columns (name and type) that the result of an operation is expected to have based on its inputs. For such an expected column, the name is set, but the type and size (“elements per value”) may be undefined.

__eq__(other)#

Checks equality with input

__getitem__(ix)#

Returns name of column ix

__len__()#

Returns number of columns

__repr__()#

Returns printable representation

classmethod from_json(json)#

Constructor from a JSON

get_loc(name)#

Get integer location for requested label

Parameters:

name (str) – column name label;

Returns:

index of column with name name

Return type:

int

Raises:

KeyError – value not found

matches_template(expected)#

Checks whether the number and names of columns fit a template

to_dict()#

Returns column names in dictionary form

class crandas.crandas.CSeries(**kwargs)#

Bases: Summable

One dimensional array which represents either the column of a CDataFrame or the result of applying a rowwise function to one or more columns of a CDataFrame

as_table(*, column_name='', **query_args)#

Outputs CDataFrame having the CSeries as column

Parameters:

column_name (str, optional) – name for the column in the resulting CDataFrame

Returns:

CDataFrame having the expected CSeries as its only column

Return type:

CDataFrame

astype(ctype, validate=False)#

Converts output to a specific type

Parameters:
  • ctype (Ctypes type specification (see ctypes.Ctype.from_spec())) – Type to convert to

  • validate (bool, default False) – If set, validate that the resulting column is of the correct type, e.g., is an 8-bit integer when ctype specifies uint8.

Returns:

CSeries converted to given type

Return type:

CSeries

Raises:

ServerError – Conversion failed or not supported

get(*, name='', **query_args)#

Deprecated. Use CSeries.as_table() instead.

if_else(ifval, elseval)#

Allows values to be assigned with an if-else statement where self is the guard and has to be a column of bits; the value from ifval is selected for rows of self that have the value one and the value from elseval is selected for rows of self that have the value zero

Parameters:
  • ifval (int) – Value if true

  • elseval (int) – Value otherwise
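The row-wise selection performed by if_else can be modelled on plain lists. This is an illustrative sketch with a hypothetical free function; in crandas the guard is the CSeries on which the method is called.

```python
def if_else(guard, ifval, elseval):
    """Pick ifval where the guard bit is 1 and elseval where it is 0."""
    if any(bit not in (0, 1) for bit in guard):
        raise ValueError("guard must be a column of bits")
    return [ifval if bit == 1 else elseval for bit in guard]


selected = if_else([1, 0, 1], 10, 20)
```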

inner(other)#

Inner product of two vectors

isna()#

Returns whether respective values are NULL, boolean inverse of notna

isnull()#

Returns whether respective values are NULL, boolean inverse of notna

len()#

Returns the character length of each element of the CSeries (only works for CSeries of type string)

Returns:

CSeries of character lengths

Return type:

CSeriesFun

lower()#

Returns string values in lowercase

notna()#

Returns whether respective values are not NULL, boolean inverse of isna

notnull()#

Alias for notna

vsum()#

Sum the elements of a vector

with_threshold(threshold)#

Adds a threshold to the CSeries. When the column is used as a filtering column, this threshold indicates the minimum number of items that need to be in the filtering result.

Parameters:

threshold (int) – minimum number of elements for operation to be allowed

class crandas.crandas.CSeriesColRef(table, name, **kwargs)#

Bases: CSeries

Subclass of CSeries. Represents a column of a CDataFrame df as accessed via df[“colname”]

as_table(*, column_name='', **query_args)#

Outputs CDataFrame having the CSeries as column

Parameters:

column_name (str, optional) – name for the column in the resulting CDataFrame

Returns:

CDataFrame having the expected CSeries as its only column

Return type:

CDataFrame

count(*, as_table=False, threshold=None, **query_args)#

Computes the count (number of not-NULL elements) of the series

See CSeriesColRef.sum() for a description of the arguments.

function_to_json(tablist)#

Creates a JSON query for the series

get(*, name='', **query_args)#

Deprecated. Use CSeriesColRef.as_table() instead.

in_range(minval, maxval)#

Validation that column values lie in specified range

Applies to integer/integer vector columns only

Parameters:
  • minval (int) – minimum (inclusive);

  • maxval (int) – maximum (inclusive)

Returns:

Validator for use in CDataFrame.validate

Return type:

Validation

max(*, as_table=False, threshold=None, **query_args)#

Computes the maximum of the series

See CSeriesColRef.sum() for a description of the arguments.

mean(*, as_table=False, threshold=None, **query_args)#

Computes the mean of the elements of the series.

See CSeriesColRef.sum() for arguments. Note: this leaks the number of not-NULL elements.

min(*, as_table=False, threshold=None, **query_args)#

Computes the minimum of the series

See CSeriesColRef.sum() for a description of the arguments.

sum(*, as_table=False, threshold=None, **query_args)#

Computes the sum of the elements of the series

Parameters:
  • as_table (boolean, default: False) – if True, result is returned as DataFrame instead of value

  • threshold (int, default None) – if given, only return value as long as the number of not-NULL elements is above the minimum threshold of elements for the operation

Returns:

Result of applicable type, depending on as_table and mode

Return type:

int/Deferred/DataFrame/CDataFrame

sum_in_range(minval, maxval)#

Validation that sum of column values lies in specified range

Applies to integer/integer vector columns only

Parameters:
  • minval (int) – minimum (inclusive);

  • maxval (int) – maximum (inclusive)

Returns:

Validator for use in CDataFrame.validate

Return type:

Validation

sum_squares(*, as_table=False, threshold=None, **query_args)#

Computes the sum of squares of the elements of the series.

See CSeriesColRef.sum() for a description of the arguments.

var(*, as_table=False, threshold=None, **query_args)#

Computes the variance of the series.

See CSeriesColRef.sum() for a description of the arguments.

class crandas.crandas.CSeriesFun(op, vals, args={}, **kwargs)#

Bases: CSeries

Subclass of CSeries representing the result of applying a function to one or more CSeries

class crandas.crandas.Col(name, type, elperv, nullable=False, constraints=None, **kwargs)#

Bases: object

Represents the type of a column.

The type and elperv fields can be equal to “?” and -1, respectively, to indicate that these are not known (e.g., for columns in an expected specification to vdl_query).

__eq__(other)#

Checks structural equality between columns

__repr__()#

Returns printable representation

classmethod from_json(json)#

Constructs a column from a JSON-deserialized dict

matches_except_size(other)#

Checks structural equality between columns, ignoring number of field elements per value

matches_template(template)#

Checks whether the column fits a template

renamed(name)#

Return copy of the Col with a different name

crandas.crandas.DataFrame(*args, ctype=None, auto_bounds=False, **query_args)#

Creates a crandas dataframe. This function calls the pandas DataFrame constructor, and uploads the resulting table using upload_pandas_dataframe(). If a name is given as one of the keyword arguments, it is passed on to upload_pandas_dataframe().

Parameters:
  • ctype (dict, default: None) – explicitly given types for columns

  • auto_bounds (bool, default: False) – if given, do not warn about automatically derived column bounds

Returns:

uploaded table

Return type:

CDataFrame

class crandas.crandas.ReturnValue(type, elperv, is_series, num_rows=None, *, name, **kwargs)#

Bases: StateObject, CSeries

Represent a value or series of values computed by the VDL

Various VDL commands, e.g., CSeries.sum(), return values or series of values, as opposed to returning a DataFrame. This class is the analogue of CDataFrame that is used to represent such remote values.

A ReturnValue can be used as a CSeries, making it possible e.g. to filter on a value computed by the VDL without having to open it. For example, the following filters all maximum elements without revealing the maximum: tab[tab["col"]==tab["col"].max(mode="regular")].

To obtain the value/series in the clear, call ReturnValue.open(). This returns a single value, unless .is_series is set, in which case it returns a Pandas series, which needs to have .num_rows rows if set.

get(**query_args)#

Deprecated. Use CSeries.as_table() instead.

classmethod json_to_closed(deferred, json_response)#

Returns an instance of the class corresponding to the provided JSON representation.

If the instance comes from a transaction, then deferred is the deferred object originally returned by vdl_query. This function should then check that the returned answer complies with the expected deferred. Otherwise, deferred is None.

classmethod json_to_opened(deferred, json_response, masker)#

Returns an opened object (e.g. a Pandas dataframe) corresponding to the provided JSON representation.

If the instance comes from a transaction, then deferred is the deferred object originally returned by vdl_query. This function should then check that the returned answer complies with the expected deferred. Otherwise, deferred is None.

open(**query_args)#

Open value

Parameters:

query_args (query arguments) –

Returns:

Value represented by remote object, see main class documentation

Return type:

int/…/pd.Series

open_dry_run_result()#

Run on a deferred instance of the class as returned by cls.expect(). Should return an example opened object (e.g., DataFrame) of the same type as would be obtained by opening the object.

crandas.crandas.Series(*args, **kwargs)#

Alias for pd.Series to allow easier conversion between pandas and crandas code.

class crandas.crandas.Validation(table, col, json_desc)#

Bases: object

Represents a validation that can be applied to a column.

Returned by functions like CSeriesColRef.in_range, etc. Used as an argument to CDataFrame.validate.

crandas.crandas.concat(tables_, *, ignore_index=True, axis=0, join='outer', **query_args)#

Table concatenation

Performs horizontal/vertical concatenation of tables, modelled on pandas pd.concat. Currently, only inner joins are supported for vertical concatenation. The first table defines the set of columns that the resulting table has. If join="inner", only columns common to all tables are included. Otherwise, the remaining tables need to have the same set of columns as the first table (up to ordering), or an error is raised.

Parameters:
  • tables_ (list of CDataFrames) – DataFrames to be concatenated

  • ignore_index (bool, optional) – does nothing, but is used in crandas.append, by default True

  • axis (int, optional) – Concatenation axis, 0=vertical, 1=horizontal, by default 0

  • join (str, optional) – type of join (currently only inner join is supported for vertical join), by default “outer”

Returns:

mode-dependent return table representing vertical/horizontal join

Return type:

CDataFrame

Raises:
  • RuntimeError – Received wrong inputs

  • NotImplementedError – Limited vertical concatenation is allowed, there must be a matching column on both tables to be concatenated

  • ValueError – Limited vertical concatenation is allowed, number of columns should be the same in all tables

  • RuntimeError – Horizontal join would create table with duplicate column names
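The column rules for vertical concatenation described above can be sketched as a function over lists of column names. This is a plain-Python model, not crandas code: `vertical_concat_columns` is a hypothetical helper, the result order following the first table is an assumption, and the model raises `ValueError` for every mismatch rather than reproducing the different exception types listed above.

```python
def vertical_concat_columns(column_lists, join="outer"):
    """Work out the result columns of a vertical concatenation."""
    first = column_lists[0]
    if join == "inner":
        # Keep only columns present in every table, in the first table's order.
        common = set(first).intersection(*map(set, column_lists[1:]))
        return [name for name in first if name in common]
    for other in column_lists[1:]:
        # Same set of columns as the first table, up to ordering.
        if set(other) != set(first):
            raise ValueError("tables must have the same columns as the first table")
    return list(first)


inner_cols = vertical_concat_columns([["a", "b", "c"], ["c", "a", "d"]], join="inner")
same_cols = vertical_concat_columns([["a", "b"], ["b", "a"]])
```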

crandas.crandas.cut(series, bins, *, labels, right=True, add_inf=False)#

Bin values into discrete intervals (aka quantization)

Bins values into discrete intervals, similar to pandas.cut. Quantizes series into bins [bins[0],bins[1]), [bins[1],bins[2]), etc., and returns the corresponding bin labels (so labels[0] for bin [bins[0],bins[1]), labels[1] for bin [bins[1],bins[2]), etc.). The bins include the left edge and exclude the right edge.

The first bin should have -np.inf as left edge and the last bin should have np.inf as its right edge. If the argument add_inf is set to true, these edges are automatically added and do not need to be given as arguments.

Parameters:
  • series (CSeries) – series to apply quantization to

  • bins (CSeries) – series defining the bin edges

  • labels (CSeries) – series defining the bin labels

  • right (bool) – specifies whether bins include their right edges

  • add_inf – when set to False, bins should include -np.inf and np.inf; when set to True they are automatically added

Returns:

CSeries representing the result of the quantization
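The left-closed binning described above, including the effect of add_inf, can be modelled with the standard bisect module. This sketch works on plain lists of finite numbers rather than CSeries, uses the hypothetical name `cut_left_closed`, and deliberately ignores the right parameter, modelling only the [left, right) behaviour spelled out in the description.

```python
import bisect
import math


def cut_left_closed(values, bins, labels, add_inf=False):
    """Map each finite value to the label of the bin [edge_i, edge_{i+1}) holding it."""
    edges = list(bins)
    if add_inf:
        edges = [-math.inf] + edges + [math.inf]
    if edges[0] != -math.inf or edges[-1] != math.inf:
        raise ValueError("outer edges must be -inf and inf unless add_inf is set")
    if len(labels) != len(edges) - 1:
        raise ValueError("need exactly one label per bin")
    # bisect_right(edges, v) - 1 is the index i with edges[i] <= v < edges[i + 1].
    return [labels[bisect.bisect_right(edges, v) - 1] for v in values]


binned = cut_left_closed([-5, 0, 9, 10, 25], [0, 10], ["neg", "low", "high"], add_inf=True)
```

A value equal to an edge, such as 10 here, falls into the bin that starts at that edge.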

crandas.crandas.dataframe_to_command(df, ctype, *, name=None, auto_bounds=False)#

Turns DataFrame into a VDL “new” command

Parameters:
  • df (DataFrame) – table to be turned into command

  • name (string, optional) – if supplied, the server will attach this name to the uploaded table

  • auto_bounds (bool, default: False) – if given, do not warn about automatically derived column bounds

Returns:

  • (cmd, null_assignments) (tuple)

  • cmd ((JSON-serializable) dict) – VDL command to generate table

  • null_assignments (List[(value_column_name : str, null_column_name : str) | (not_null_column_name : str)])

crandas.crandas.demo_table(number_of_rows=1, number_of_columns=1, **query_args)#

Create demo table.

Creates a demo table with the given number of rows and columns. The columns are respectively named “col1”, “col2”, … and have sequential integer values 1, 2, …

A nonce is included in the command so that every time this command is called, it receives a fresh table handle.

Parameters:
  • number_of_rows (int, optional) – Number of rows of resulting table, by default 1

  • number_of_columns (int, optional) – Number of columns of resulting table, by default 1

Returns:

A demo table with a fresh name

Return type:

CDataFrame

crandas.crandas.function_to_json(fun, tablist, table=None)#

Turns a rowwise function into a JSON string

Parameters:
  • fun (CSeries/callable function/CDataFrame/constant value) – The function to be converted. Can be of multiple forms:

    - a CSeries obtained by taking a column of a table, e.g., tab["col"], or performing operations on it, e.g., tab["col"].lower(), tab["col"]+1, tab["col1"]+tab["col2"], etc.

    - a callable (e.g. a lambda function) that represents the function and takes as single argument an object x. The object x corresponds to the table on which the operation (e.g. merge or assign) is being applied, and its fields correspond to the columns of the table. E.g., tab.assign(newcol=lambda x: x.oldcol+1) is equivalent to tab.assign(newcol=tab["oldcol"]+1).

    - a CDataFrame representing a table that holds a single value, e.g., the result of tab["col"].sum(mode="regular") is a table having a single row and column that can then be used for filtering, e.g., tab[tab["col"]*10<tab["col"].sum(mode="regular")]. (CURRENTLY UNUSED)

    - a constant value, e.g., for tab.assign(newcol=5)

  • tablist (list of DataFrame) – List of tables, may be initially empty

  • table (DataFrame, optional) – Associated table, by default None

Returns:

JSON representing the function fun

Return type:

(JSON-serializable) object

Raises:

ValueError – This function is only callable in a table context
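To illustrate why a callable and a column expression are interchangeable as the fun argument, the sketch below evaluates both forms against a toy table. It is a plain-Python model that computes values directly instead of producing JSON; `Column` and `evaluate` are hypothetical names and not part of crandas.

```python
from types import SimpleNamespace


class Column:
    """Hypothetical stand-in for a CSeries: a list with elementwise operators."""

    def __init__(self, values):
        self.values = list(values)

    def _lift(self, other, op):
        # Pair values with another Column, or repeat a constant for each row.
        others = other.values if isinstance(other, Column) else [other] * len(self.values)
        return Column(op(a, b) for a, b in zip(self.values, others))

    def __add__(self, other):
        return self._lift(other, lambda a, b: a + b)

    def __eq__(self, other):
        # Comparisons yield a column of bits, as used for filtering.
        return self._lift(other, lambda a, b: int(a == b))


def evaluate(fun, table):
    """Evaluate fun in the context of table (a dict of name -> Column)."""
    if callable(fun):
        # The callable receives one object whose fields are the columns.
        return fun(SimpleNamespace(**table))
    if isinstance(fun, Column):
        return fun
    # A constant is repeated for every row.
    nrows = len(next(iter(table.values())).values)
    return Column([fun] * nrows)


tab = {"oldcol": Column([1, 2, 3])}
via_callable = evaluate(lambda x: x.oldcol + 1, tab)
via_column = evaluate(tab["oldcol"] + 1, tab)
```

Both forms yield the same values in this model, mirroring the equivalence stated for tab.assign above.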

crandas.crandas.get2(**query_args)#

Provides the VDL query for a table with two int columns for test_transaction

Returns:

A test table with a fresh name

Return type:

CDataFrame

crandas.crandas.get_table(id_, *, schema=None, **query_args)#

Access table by name

Access a previously uploaded table by its handle or name.

Parameters:
  • id_ (str) – Handle (hex-encoded string) or name

  • schema (CIndex/list of column names/DataFrame/any valid argument to pandas.read_csv, optional) – represents the structure of the table to be added. Needed if get_table is called from a Transaction, or if it is desired to check that the table corresponds to the given schema, by default None

Returns:

The table with id id_

Return type:

CDataFrame

Raises:

ValueError – Schema not specified for importing from a transaction

crandas.crandas.merge(tab1, tab2, how='inner', on=None, left_on=None, right_on=None, validate='one_to_one', **query_args)#

Merge tables using a database-style join. Implements pandas.merge. The following types of merge are supported:

  • inner join: returns only the rows where the join columns match; requires join column values to be unique in both tables

  • outer join: returns rows from both tables, matched where possible; requires join column values to be unique in both tables

  • left join: return rows of left table in original order, matched with a row of the right table where possible; requires join column values to be unique in right table

Columns to join on are given either by a common on argument, or separate left_on and right_on arguments for the left and right tables. To perform a left join where the join column values are not unique, provide a CDataFrameGroupBy object (as returned by CDataFrame.groupby()) as left_on. Currently, this is only possible with a single join column.

Parameters:
  • tab1 (CDataFrame) – Left table to be joined

  • tab2 (CDataFrame) – Right table to be joined

  • how (“inner” (default), “outer”, or “left”, optional) – Type of join

  • on (str or list of str, optional) – Column(s) to join on; must be common to both tables

  • left_on (str, list of str, or CDataFrameGroupBy, optional) – Column(s) of tab1 to join on

  • right_on (str or list of str, optional) – Column(s) of tab2 to join on, by default None

  • validate – Type of validation; currently, only join with one_to_one validation is supported, by default “one_to_one”

  • query_args – VDL query arguments

Returns:

Result of the merging operation

Return type:

CDataFrame

Raises:

MergeError – Values of the join columns are not unique
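The three join types and their uniqueness requirements can be modelled on lists of row dicts. This is a simplified sketch, not crandas code: `merge_rows` is a hypothetical helper with a single join column, it raises `ValueError` where crandas documents MergeError, and unmatched rows simply lack the other table's columns instead of holding NULL values.

```python
def merge_rows(left, right, on, how="inner"):
    """Join two lists of row dicts on the column `on`."""
    if how not in ("inner", "outer", "left"):
        raise ValueError(f"unsupported join type: {how}")

    def index_unique(rows, side):
        index = {}
        for row in rows:
            if row[on] in index:
                raise ValueError(f"join column values are not unique in {side} table")
            index[row[on]] = row
        return index

    # All three join types need unique join values in the right table.
    right_index = index_unique(right, "right")
    # Inner and outer joins also need unique join values in the left table.
    left_index = index_unique(left, "left") if how != "left" else {}

    result = []
    for row in left:  # left rows keep their original order
        match = right_index.get(row[on])
        if match is not None:
            result.append({**match, **row})
        elif how in ("left", "outer"):
            result.append(dict(row))
    if how == "outer":
        result += [dict(r) for key, r in right_index.items() if key not in left_index]
    return result


left = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
right = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
inner = merge_rows(left, right, on="id")
```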

crandas.crandas.read_csv(file, *, name=None, auto_bounds=False, **query_args)#

Upload the given CSV file to the VDL

Parameters:
  • file (str) – name of the file

  • name (str, optional) – name for the table; passed on to upload_pandas_dataframe() if given, by default None

  • auto_bounds (bool, default: False) – if given, do not warn about automatically derived column bounds

Returns:

uploaded table

Return type:

CDataFrame

crandas.crandas.series_max(col1, col2)#

Compute the maximum of two CSeries

Parameters:
  • col1 (CSeries) – integer series;

  • col2 (CSeries) – integer series;

Returns:

the maximum of the values of col1 and col2

Return type:

integer CSeries

crandas.crandas.series_min(col1, col2)#

Compute the minimum of two CSeries

Parameters:
  • col1 (CSeries) – integer series;

  • col2 (CSeries) – integer series;

Returns:

the minimum of the values of col1 and col2

Return type:

integer CSeries

crandas.crandas.upload_pandas_dataframe(df, ctype=None, auto_bounds=False, **query_args)#

Uploads an existing pandas DataFrame into the VDL

Parameters:
  • df (pandas.DataFrame) – DataFrame to upload

  • name (str, optional) – name for the table, by default None

  • auto_bounds (bool, default: False) – if given, do not warn about automatically derived column bounds

Returns:

the uploaded DataFrame

Return type:

CDataFrame