Data types#
This section provides an overview of which data types the Virtual Data Lake (VDL) supports and how we can convert between data types using crandas.
The supported data types include:
int8, int16, int24, int32
uint8, uint16, uint24, uint32
vector of integers
string
bytes
bool (as integer)
fixed-point numbers [1]
For specific details about numeric values, see the next section.
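To make the integer widths in the list above concrete, here is a minimal local sketch that computes the value range of each listed integer type. It assumes the usual two's-complement range for signed types and a range starting at 0 for unsigned types; the exact bounds enforced by the VDL may differ. `int_range` and `smallest_int_type` are hypothetical helpers for illustration, not part of crandas.

```python
# Assumption: signed types use the usual two's-complement range and
# unsigned types start at 0. The VDL's actual bounds may differ.
INT_BITS = (8, 16, 24, 32)


def int_range(bits, signed=True):
    """Return (lowest, highest) value for an integer of the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def smallest_int_type(values):
    """Hypothetical helper: name of the narrowest listed type that fits all values."""
    lo, hi = min(values), max(values)
    signed = lo < 0  # only pick a signed type when a negative value is present
    for bits in INT_BITS:
        low, high = int_range(bits, signed)
        if low <= lo and hi <= high:
            return f"int{bits}" if signed else f"uint{bits}"
    raise ValueError("values do not fit in a 32-bit integer type")


print(smallest_int_type([1, 2, 3]))       # uint8
print(smallest_int_type([-1, 200]))       # int16
print(smallest_int_type([10**8, 10**9]))  # uint32
```

For non-negative data the helper returns an unsigned name, although a signed type of sufficient width would hold the same values.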
Warning
When boolean values are uploaded into the VDL, they are transformed to integers and therefore take the values 0 and 1 instead of False and True respectively.
Integer and string columns allow for missing values, but this must be specified (see Working with missing values below).
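The boolean mapping described in the warning can be illustrated locally in plain Python. This sketch only mimics the 0/1 encoding; it does not call crandas or connect to the VDL.

```python
flags = [True, False, True]

# bool is a subclass of int in Python, so int() maps True -> 1 and False -> 0.
as_uploaded = [int(flag) for flag in flags]
print(as_uploaded)  # [1, 0, 1]

# Values read back as 0/1 can be turned into booleans again with bool().
restored = [bool(value) for value in as_uploaded]
print(restored)  # [True, False, True]
```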
Data Type Conversion#
crandas provides a method for converting data types, CSeries.astype(). The following example shows how to convert a column of integers to a column of strings.
import crandas as cd
import numpy as np
# Generate data for 100 rows
n = 100
np.random.seed(1)
vals = np.random.randint(10**8, 10**9 - 1, size=n)
# Create a crandas DataFrame with an int column
uploaded = cd.DataFrame({"vals": vals})
# Convert the int column to a string column
uploaded = uploaded.assign(vals2=uploaded["vals"].astype(str))
The above example converts the integer column vals to a new string column called vals2.
Note
It is also possible to specify the desired type while uploading the data, using ctype={"vals": "varchar[9]"}.
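The length in a varchar[n] ctype has to be large enough for the longest string in the column. As a minimal sketch, the required length can be computed locally before uploading. It assumes that n counts characters, which holds for the ASCII digits used here; the sample values are made up for illustration.

```python
# Sample values in the same range as the example above (9 decimal digits each)
vals = [123456789, 100000000, 999999998]

# Longest decimal representation among the values
max_len = max(len(str(v)) for v in vals)

# Build the ctype mapping in the "varchar[n]" form used in the note above
ctype = {"vals": f"varchar[{max_len}]"}
print(ctype)  # {'vals': 'varchar[9]'}
```

The resulting dictionary has the shape expected by the ctype argument shown in the note.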
Converting a string column to an integer column#
Similarly, you can use the CSeries.astype() method to convert string columns back to integer columns. Using the same values as above:
# Create a crandas DataFrame with a string column
uploaded = cd.DataFrame({"vals": [str(val) for val in vals]})
# Convert the string column to an int column
uploaded = uploaded.assign(vals2=uploaded["vals"].astype(int))
For a more in-depth look at specifying integer values, go to the next section.
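This section does not say how the conversion behaves when a string is not a valid integer, so it can be useful to check the data locally before uploading. The sketch below uses Python's own int() parser as a stand-in; `non_integer_entries` is a hypothetical helper, and Python's parser may accept inputs (such as surrounding whitespace) that the VDL treats differently.

```python
def non_integer_entries(strings):
    """Return the entries that Python's int() cannot parse as a base-10 integer."""
    bad = []
    for s in strings:
        try:
            int(s)
        except ValueError:
            bad.append(s)
    return bad


print(non_integer_entries(["123", "-45", "12a", ""]))  # ['12a', '']
```

An empty result means every entry parsed as an integer locally.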
ctypes#
Because the VDL uses highly specialized algorithms to compute on secret data, it uses a specialized typing system that is more fine-grained than what pandas uses.
crandas implements this type system, which we call ctypes (similar to pandas’ dtypes).
In certain situations, it is important to specify the exact type of our data:
import crandas as cd
import pandas as pd
from crandas import ctypes
# Specify data types for the DataFrame
table = cd.DataFrame(
    {"ints": [1, 2, 3], "strings": ["a", "bb", "ccc"]},
    ctype={"ints": "int8", "strings": "varchar[5]"},
)
In the above example, we define the ints column with the int8 ctype, a NonNullableInteger data type (crandas.ctypes.NonNullableInteger()), and the strings column with the varchar[5] data type (variable length, up to 5 characters).
Note
If the column contains missing/null values, this can be specified by adding ? after the ctype (e.g. int8?).
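As a minimal sketch of the ? suffix, the helper below builds a ctype mapping in which integer columns that contain None get the nullable form. `with_nullability` is a hypothetical helper, not part of crandas, and it only produces the dictionary locally; nothing is uploaded.

```python
def with_nullability(columns, base_ctypes):
    """Hypothetical helper: append "?" to the ctype of columns containing None."""
    result = {}
    for name, ctype in base_ctypes.items():
        has_null = any(value is None for value in columns[name])
        result[name] = ctype + "?" if has_null else ctype
    return result


data = {"a": [1, None, 3], "b": [4, 5, 6]}
print(with_nullability(data, {"a": "int8", "b": "int8"}))
# {'a': 'int8?', 'b': 'int8'}
```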
crandas also supports custom data types for other Python data types, such as byte arrays:
from uuid import uuid4
# Create a DataFrame with UUIDs stored as bytes
cd.DataFrame({"uuids": [uuid4().bytes for _ in range(5)]}, ctype="bytes")
You can also specify types through pandas’ typing, known as dtypes. Note that not every dtype has an equivalent ctype.
# Create a DataFrame with multiple data types
df = cd.DataFrame(
    {
        "strings": pd.Series(["test", "hoi", "ok"], dtype="string"),
        "int": pd.Series([1, 2, 3], dtype="int"),
        "int64": pd.Series([23, 11, 91238], dtype="int64"),
        "int32": pd.Series([12831, 1231, -1231], dtype="int32"),
    }
)
Working with missing values#
crandas can work with null values, although this requires extra care. Columns do not allow null values by default, but this can be changed in multiple ways. Whenever a column with missing values is added, the VDL determines that the column can have null values. Additionally, it is possible to specify that a column allows null values when uploading it, even if the column does not currently contain any.
Warning
When uploading an integer column with missing values, it is necessary to specify the ctype of the column. If the column already contains missing values, it is not necessary to specify ctypes.NullableInteger(); specifying the size of the integer is enough, e.g. ctype={"int_column": "int8"}.
from crandas import ctypes
table = cd.DataFrame(
    {"ints": [1, 2, 3], "strings": ["a", "bb", None]},
    ctype={"ints": ctypes.NullableInteger(), "strings": "varchar[5]"},
)
Both columns created in this example allow for null values: the first because it was explicitly specified, and the second because it contains a null value.
Currently, there is no direct way to turn a nullable column into a non-nullable one, although it is possible to do so by creating a new column using CSeries.astype().
Numeric types have additional particularities that are important to know, both in the typing system and because we can do arithmetic operations over them. The next section deals with these types.