# Group by: split-apply-combine#

Using the `CDataFrame.groupby()`

functionality, we can partition the rows of a table based on the value of one of the columns, and perform subsequent operations on the resulting parts.
The grouping functionality is similar to the Pandas library, and it is strongly related to the SQL `GROUP BY`

clause.
The process involves splitting the data into groups, applying an aggregation function to each group, and then combining the results.

Note

Currently, the key column to compute the groupby on, should be an *integer* column.

To group data in crandas, you can use the `CDataFrame.groupby()`

method. The `CDataFrameGroupBy.col()`

and `CSeriesGroupBy.agg()`

methods are also available to work with the grouped data.

Here is a list of aggregation operations that can be performed on grouped data:

`crandas.groupby.sum`

: Compute the sum of each group.`crandas.groupby.max`

: Find the maximum value of each group.`crandas.groupby.min`

: Find the minimum value of each group.

Below is an example that demonstrates how to use these operations with crandas:

```
import numpy as np
import pandas as pd
import crandas as cd
# Create a dataframe using pandas
ipl_data = {"Team": ["Riders", "Riders", "Devils", "Devils", "Kings", "kings", "Kings", "Kings", "Riders", "Royals", "Royals", "Riders"],
"Rank": [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1, 2],
"Year": [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015, 2017],
"Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804, 690]}
df = pd.DataFrame(ipl_data)
```

After creating the dataframe, we are able to use `crandas.upload_pandas_dataframe()`

and then group by `Year`

to execute different operations.

```
# Upload the pandas dataframe to the VDL (CDataFrame), then group by year and perform aggregation operations
cdf = cd.upload_pandas_dataframe(df)
cdgrouped = cdf.groupby("Year")
cdsum = cdgrouped["Points"].agg(cd.groupby.sum)
cdmax = cdgrouped["Points"].agg(cd.groupby.max)
cdmin = cdgrouped["Points"].agg(cd.groupby.min)
```

Note

For reference, this is slightly different to how one would use the `GroupBy`

method in pandas. The difference can be seen below:

```
df = pd.DataFrame(ipl_data)
pdgrouped = df.groupby("Year")
pdsum = pdgrouped["Points"].agg(np.sum)
pdmax = pdgrouped["Points"].agg(np.max)
pdmin = pdgrouped["Points"].agg(np.min)
```

We can also look at the size of each group that we create using the `CDataFrame.groupby()`

method.
Below we create a pandas dataframe which we upload to the VDL. We then group on column `a`

(leading to 3 groups - 1, 2 and 3).
We can then use the `CDataFrameGroupBy.size()`

to compute the size of each groups.

```
pa = pd.DataFrame({"a": [1, 1, 1, 2, 2, 2, 2, 3, 3]})
ca = cd.upload_pandas_dataframe(pa)
cgrouping = ca.groupby("a")
csizes = cgrouping.size()
```

- >>>print(csizes)
a

1 3 2 4 3 2