Working with missing data

Sometimes data has missing values but we still need to work with it. Thankfully, crandas allows us to do this.

Uploading data with missing values

Uploading data with missing values does not always work automatically. By default, crandas assumes that every value in a column is present (not null). To allow a column to hold missing values, it must be explicitly marked as nullable using the crandas.ctypes system.

Note

For illustration, assume we have a CSV file called test.csv with the following data:

name,age,city
Alice,25,
Bob,,New York
Charlie,30,Los Angeles
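
To follow along, the file can be created with a few lines of standard-library Python. The sketch below also reports which columns contain empty fields; the helper name columns_with_missing is hypothetical and not part of crandas. Whether an empty text field counts as missing is up to the library, and in this walkthrough only age is marked as nullable.

```python
import csv
from pathlib import Path

# The example data from the text, written out verbatim.
CSV_TEXT = """name,age,city
Alice,25,
Bob,,New York
Charlie,30,Los Angeles
"""


def columns_with_missing(path):
    """Return, in header order, the columns that have at least one empty field.

    Hypothetical helper for illustration; it only inspects the raw CSV text.
    """
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        seen = set()
        for row in reader:
            for name, value in row.items():
                if value == "":
                    seen.add(name)
        return [name for name in reader.fieldnames if name in seen]


Path("test.csv").write_text(CSV_TEXT)
print(columns_with_missing("test.csv"))  # ['age', 'city']
```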

The following will not work:

>>> cd.read_csv("test.csv")
[...]
NotImplementedError: FixedPoint ctype does not support nullable columns yet

The above exception was the direct cause of the following exception:
[...]
SeriesEncodingError: Could not encode column age

The problem is that the age column gets interpreted as a fixed-point column, which does not support missing values.

We need to manually set the ctypes. For age an appropriate ctype would be uint8? – this is shorthand for int[nullable,min=0,max=255].

Note

The ? after the data type indicates that there are missing/null values in that column.

>>> people = cd.read_csv("test.csv", ctype={"age": "uint8?"})
>>> people
Name: 618568F443ED8F6E0EC2B4FB225C628BA75781B2834854F40B8FA687BCA3E9EA
Size: 3 rows x 3 columns
CIndex([Col("name", "s", 8), Col("age", "i", 1, nullable=True), Col("city", "s", 12)])
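
Before settling on uint8?, it can help to confirm locally that every non-missing value fits the range 0 to 255 that the shorthand implies. The following is a minimal plain-Python sketch of that check; fits_nullable_uint8 is a hypothetical helper, not a crandas function, and None stands in for a missing value.

```python
# Range implied by int[nullable,min=0,max=255], as stated in the text.
UINT8_MIN, UINT8_MAX = 0, 2**8 - 1


def fits_nullable_uint8(values):
    """Return True if every non-missing value is an integer within 0..255.

    Missing values are represented as None and are always accepted.
    """
    return all(
        v is None or (isinstance(v, int) and UINT8_MIN <= v <= UINT8_MAX)
        for v in values
    )


ages = [25, None, 30]  # the age column from test.csv, None marks the gap
print(fits_nullable_uint8(ages))       # True
print(fits_nullable_uint8([25, 300]))  # False: 300 exceeds 255
```

If the check fails, a wider nullable integer ctype would be needed instead.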

Finding missing values

Once the data is uploaded, we often need to know which values are actually present. A column of a CDataFrame has two methods for checking whether its values are missing: isna() and notna().

Each is the negation of the other. Both return a column (a CSeriesFun) of zeros and ones: isna() marks the null values with a one, and notna() marks the values that are present.

# Returns whether the respective values are NULL
missing_age = people["age"].isna()

# Returns whether the respective values are not NULL
not_missing_age = people["age"].notna()
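
The relationship between the two results can be illustrated without crandas. In the sketch below, plain Python lists stand in for the columns and None stands in for a null value; this only mirrors the zero/one semantics described above and does not show how crandas stores or computes them.

```python
# The age column from test.csv, with None for Bob's missing age.
ages = [25, None, 30]

# Analogue of isna(): 1 where the value is missing, 0 otherwise.
is_missing = [1 if age is None else 0 for age in ages]

# Analogue of notna(): the element-wise negation of the mask above.
not_missing = [1 - flag for flag in is_missing]

print(is_missing)   # [0, 1, 0]
print(not_missing)  # [1, 0, 1]
```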