# Working with numeric data

crandas offers a wide range of tools for working with numeric data. These tools allow you to perform various operations on columns, such as arithmetic operations and aggregations, to analyze and process your data. In this guide, we will provide you with an introduction to working with these various operations. First, we will familiarize you with the particularities of numerical types in the VDL.

Warning

Currently, arithmetic functionality is only available for integers, not for fixed-point numbers. Always make sure that data is in the right format when working with it in the VDL.

## Integer data types and conversion

crandas supports signed and unsigned integer numbers with several bit lengths: 8, 16, 24, and 32. For example, a 16-bit unsigned integer is referred to as an `uint16`; a 24-bit signed integer is referred to as an `int24`. It is possible to specify the type of integer as a ctype (see ctypes), e.g.:

```
import crandas as cd
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int16"})
```

crandas does not allow computations to overflow. For example, the following does not work:

```
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int16"})
cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this overflows
```

In this case, because the column `vals` is 16-bit, its cube will be 48-bit, so the above computation fails with the error message `Integer operation overflow: value does not fit in 32 bits`.

Note

In many cases, the cube of a 16-bit integer easily fits in 16 bits; clearly 3^3 = 27 can be expressed with 16 bits (in fact, with only five). However, because the values in the VDL are secret-shared, it is very costly to preemptively check whether the result of an operation would fit the current type. Therefore, the VDL assumes that it does not fit, in order to prevent overflow errors.
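To see why the 16-bit cube is rejected, the worst-case rule can be sketched in plain Python (this is an illustration of the sizing logic, not crandas code): the product of an a-bit and a b-bit integer can require up to a + b bits, so result widths are computed from operand widths rather than from the secret-shared values themselves.

```python
def product_bits(a_bits: int, b_bits: int) -> int:
    """Worst-case bit width of the product of two unsigned integers:
    (2**a - 1) * (2**b - 1) < 2**(a + b), so a + b bits always suffice."""
    return a_bits + b_bits

# Cubing a 16-bit column: 16 + 16 = 32 bits for the square,
# then 32 + 16 = 48 bits for the cube, which exceeds the 32-bit limit.
cube_bits = product_bits(product_bits(16, 16), 16)
assert cube_bits == 48 and cube_bits > 32

# An 8-bit column is fine: 8 + 8 + 8 = 24 bits, within the limit.
assert product_bits(product_bits(8, 8), 8) == 24
```

This is why converting `vals` to `int8` first, as shown next, allows the cube to be computed.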

In this example, it is possible to convert the column `vals` to another integer type so that the computation of the cube can succeed, e.g.:

```
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int16"})
cdf = cdf.assign(vals=lambda x: x.vals.astype("int8"))
cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works
```

Note that the above conversion does not actually check that the column values lie in the specified range. If this is not the case, then the result of computing on the column is undefined. To check that the conversion succeeds, the following can be used:

```
cdf = cd.DataFrame({"vals": [1,2,3]}, ctype={"vals": "int16"})
cdf = cdf.assign(vals=lambda x: x.vals.astype("int8", validate=True))
cdf = cdf.assign(cube=lambda x: x.vals*x.vals*x.vals) # this works
```

### Automatic type detection

If no ctype is specified for an integer column, the smallest integer type that is valid for the column is selected, e.g.:

```
cdf = cd.DataFrame({"vals": [1,2,3]})
```

In this example, the column `vals` will be derived to be of type `uint8` (unsigned 8-bit integer) because all values lie in the range from 0 to 255 (inclusive).
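The selection logic can be sketched in plain Python with a hypothetical helper (`derive_ctype` is not part of the crandas API; it only illustrates picking the smallest of the supported widths 8, 16, 24, and 32 that covers the data):

```python
def derive_ctype(values):
    """Pick the smallest supported integer ctype that covers all values."""
    lo, hi = min(values), max(values)
    if lo >= 0:
        # Unsigned: a uintN holds 0 .. 2**N - 1
        for bits in (8, 16, 24, 32):
            if hi < 2 ** bits:
                return f"uint{bits}"
    else:
        # Signed: an intN holds -(2**(N-1)) .. 2**(N-1) - 1
        for bits in (8, 16, 24, 32):
            if -(2 ** (bits - 1)) <= lo and hi < 2 ** (bits - 1):
                return f"int{bits}"
    raise ValueError("values do not fit in 32 bits")

assert derive_ctype([1, 2, 3]) == "uint8"    # all values in 0..255
assert derive_ctype([0, 70000]) == "uint24"  # 70000 exceeds uint16
assert derive_ctype([-1, 2, 3]) == "int8"    # negative value forces a signed type
```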

Note that, because the column type is derived from the data, this potentially leaks information. For example, if one party uploads a list of salaries that are all smaller than `65536` and another party uploads a list of salaries that contains higher values, then the first column will have column type `uint16` and the second column will have column type `uint24`. The first party can prevent this by explicitly assigning the ctype `uint24` to the uploaded column. When a column type is derived automatically, the user gets a warning about this: `ColumnBoundDerivedWarning: Column "vals" was automatically derived to be of type uint8`. These warnings can be suppressed with the standard Python mechanism, e.g.,

```
import warnings

# Suppress warnings about automatically derived column types
warnings.simplefilter("ignore", category=cd.ctypes.ColumnBoundDerivedWarning)

cdf = cd.DataFrame({"vals": [1,2,3]})
```

### Boolean columns

As a special case, the ctype `bool` will be used for columns that contain the boolean values `True` and `False`, and for columns for which this ctype is explicitly specified:

```
cdf = cd.DataFrame({"col1": [True,True,False],  # automatically a bool
                    "col2": [1,0,1]},           # specified using ctype
                   ctype={"col2": "bool"})
```

Certain operations, such as filtering on a column and using binary operators `&`, `|`, work only on boolean columns. For example, the following does not work and results in the error message `Filter column should be boolean; try converting to bool`:

```
cdf = cd.DataFrame({"vals": [1,2,3], "filter": [1,0,1]})
cdf = cdf[cdf["filter"]] # does not work: cdf["filter"] is uint8, not bool
```

To fix this, the filter column needs to be converted to a boolean column:

```
cdf = cd.DataFrame({"vals": [1,2,3], "filter": [1,0,1]}) # both of type uint8
cdf = cdf[cdf["filter"].astype(bool)] # converted to bool, now it works
```

As above, specify `validate=True` to check that the column is actually boolean.

### Inspecting the integer type

In the current version of crandas, it is not possible to check the integer type directly. This feature is coming soon and will be accessible through `repr`.

## Elementwise Operations and Comparisons

In this section, we will demonstrate how to perform basic arithmetic and comparison operations on columns in crandas. Each of these operations outputs a `CSeries` with the result of applying the operation to each row of the columns.

```
df = cd.DataFrame({"a":[1,2,3,4], "b":[5,6,7,8]})

# Equal to
df_equal_to = df["a"] == df["b"]

# Not equal to
df_not_equal_to = df["a"] != df["b"]

# Less than
df_less_than = df["a"] < df["b"]

# Less than or equal to
df_less_than_equal = df["a"] <= df["b"]

# Greater than
df_greater_than = df["a"] > df["b"]

# Greater than or equal to
df_greater_than_equal = df["a"] >= df["b"]

# Subtract
df_subtract = df["a"] - df["b"]

# Multiply
df_multiply = df["a"] * df["b"]

# Find the absolute values of a column
df_abs = df["a"].abs()
```

Beyond these common operations, crandas has functions that compare two columns and return a column with either the maximum or the minimum in each row:

```
# Calculate the minimum between "a" and "b" - returns a series
abmin = cd.series_min(df["a"], df["b"])

# Calculate the maximum between "a" and "b" - returns a series
abmax = cd.series_max(df["a"], df["b"])
```

## Aggregation operations

In addition, crandas provides several aggregation operations that can be applied to columns.

```
df = cd.DataFrame({"a":[1,2,3,4], "b":[5,6,7,8]})

# Sum of a column
a_sum = df["a"].sum()

# Mean of a column
a_mean = df["a"].mean()

# Variance of a column
a_var = df["a"].var()

# Count number of items of a column (not NULL elements)
a_count = df["a"].count()

# Maximum of a column
a_max = df["a"].max()

# Minimum of a column
a_min = df["a"].min()

# Sum of squares of a column
a_sum_of_squares = df["a"].sum_squares()
```

Maximum and minimum can also be applied over a whole table, computing the maximum/minimum per (numeric) column.

```
# Compute the maximum of each (numeric) column in df
df_maximums = df.max()

# Compute the minimum of each (numeric) column in df
df_minimums = df.min()
```

Note

Note the difference between the elementwise functions that compare the minimum/maximum of two columns (`crandas.series_min()` / `crandas.series_max()`) and the ones that compute the minimum/maximum of the elements in one column (`CSeriesColRef.min()` / `CSeriesColRef.max()`).
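The contrast is easy to see in plain Python (an illustration of the semantics, not crandas code): one operation is row-wise over two columns and yields a column, the other aggregates a single column to one value.

```python
a = [1, 2, 3, 4]
b = [5, 0, 7, 2]

# Row-wise minimum of two columns, analogous to cd.series_min(df["a"], df["b"]):
# one result per row.
rowwise_min = [min(x, y) for x, y in zip(a, b)]

# Aggregate minimum of one column, analogous to df["a"].min():
# a single result.
column_min = min(a)

assert rowwise_min == [1, 0, 3, 2]
assert column_min == 1
```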

## Working with Vectors

crandas allows for vector operations on integer vectors.

```
df = cd.DataFrame({
    "col1": [[22,3,11], [76,99,0], [37, 358, 48]],
    "col2": [[66, 81, 17], [53, 11, 99], [63, 471, 23]],
})
```

Addition and multiplication work the same way as if we were working with scalars:

```
# Addition
df["col1"] + df["col2"]

# Multiplication
df["col1"] * df["col2"]
```

Moreover, we also provide vector operations, such as the inner product and the sum of the elements of a vector. These two functions (`CSeries.inner()` and `CSeries.vsum()`) are implemented as functions of the column, so the notation might be slightly unnatural (in particular for the inner product).

```
# Assign a new column "inner" with the inner product of "col1" and "col2"
df = df.assign(inner=df["col1"].inner(df["col2"]))

# For each row in "col1", sum all the elements in that vector
df["col1"].vsum()
```
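To make the semantics concrete, here is what these two operations compute per row, sketched in plain Python (not crandas code) using the example vectors from above:

```python
col1 = [[22, 3, 11], [76, 99, 0], [37, 358, 48]]
col2 = [[66, 81, 17], [53, 11, 99], [63, 471, 23]]

# inner(): per row, multiply elementwise and sum -- one scalar per row
inner = [sum(x * y for x, y in zip(u, v)) for u, v in zip(col1, col2)]

# vsum(): per row, sum the elements of the vector -- one scalar per row
vsum = [sum(u) for u in col1]

assert inner[0] == 22 * 66 + 3 * 81 + 11 * 17  # 1882
assert vsum == [36, 175, 443]
```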

Hint

Whenever you want to access the result of a column operation without adding it to a table, use `as_table()`, like this: `(df["col1"]+df["col2"]).as_table()`.

Now that we know how to work with numeric columns and the operations we can perform on them, in the next section we will learn how to combine multiple tables. This feature is one of the most important for data sharing, as it allows us to consolidate data from multiple sources into one.