Group by: split-apply-combine#

Using the CDataFrame.groupby() functionality, we can partition the rows of a table based on the value of one of the columns, and perform subsequent operations on the resulting parts. The grouping functionality is similar to the Pandas library, and it is strongly related to the SQL GROUP BY clause. The process involves splitting the data into groups, applying an aggregation function to each group, and then combining the results.

Note

Currently, the key column to compute the groupby on, should be an integer column.

To group data in crandas, you can use the CDataFrame.groupby() method. The CDataFrameGroupBy.col() and CSeriesGroupBy.agg() methods are also available to work with the grouped data.

Here is a list of aggregation operations that can be performed on grouped data:

Below is an example that demonstrates how to use these operations with crandas:

import numpy as np
import pandas as pd

import crandas as cd

# Create a dataframe using pandas
ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
"Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
"Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
"Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}

df = pd.DataFrame(ipl_data)

After creating the dataframe, we are able to use crandas.upload_pandas_dataframe() and then group by Year to execute different operations.

# Upload the pandas dataframe to the VDL (CDataFrame), then group by year and perform aggregation operations
cdf = cd.upload_pandas_dataframe(df)
cdgrouped = cdf.groupby("Year")
cdsum = cdgrouped["Points"].agg(cd.groupby.sum)
cdmax = cdgrouped["Points"].agg(cd.groupby.max)
cdmin = cdgrouped["Points"].agg(cd.groupby.min)

Note

For reference, this is slightly different to how one would use the GroupBy method in pandas. The difference can be seen below:

df = pd.DataFrame(ipl_data)

pdgrouped = df.groupby("Year")
pdsum = pdgrouped["Points"].agg(np.sum)
pdmax = pdgrouped["Points"].agg(np.max)
pdmin = pdgrouped["Points"].agg(np.min)

We can also look at the size of each group that we create using the CDataFrame.groupby() method. Below we create a pandas dataframe which we upload to the VDL. We then group on column a (leading to 3 groups - 1, 2 and 3). We can then use the CDataFrameGroupBy.size() to compute the size of each groups.

pa = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3]})
ca = cd.upload_pandas_dataframe(pa)
cgrouping = ca.groupby("a")
csizes = cgrouping.size()
>>>print(csizes)

a

1 3 2 4 3 2