Commit b67e56b

initial @mmcky review, in-work
1 parent ea139df commit b67e56b

2 files changed

+42
-16
lines changed

lectures/pandas.md

Lines changed: 1 addition & 0 deletions
@@ -78,6 +78,7 @@ You can think of a `Series` as a "column" of data, such as a collection of obser
7878

7979
A `DataFrame` is a two-dimensional object for storing related columns of data.
8080

81+
(pandas:series)=
8182
## Series
8283

8384
```{index} single: Pandas; Series

lectures/polars.md

Lines changed: 41 additions & 16 deletions
@@ -37,7 +37,7 @@ In addition to what's in Anaconda, this lecture will need the following librarie
3737

3838
## Overview
3939

40-
[Polars](https://pola.rs/) is a lightning-fast data manipulation library for Python written in Rust.
40+
[Polars](https://pola.rs/) is a fast data manipulation library for Python written in Rust.
4141

4242
Polars has gained significant popularity in recent years due to its superior performance
4343
compared to traditional data analysis tools, making it an excellent choice for modern
@@ -58,7 +58,7 @@ Just as [NumPy](https://numpy.org/) provides the basic array data type plus core
5858
* adjusting indices
5959
* working with dates and time series
6060
* sorting, grouping, re-ordering and general data munging [^mung]
61-
* dealing with missing values, etc., etc.
61+
* dealing with missing values, etc.
6262

6363
More sophisticated statistical functionality is left to other packages, such
6464
as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with polars DataFrames through their interoperability with pandas.
@@ -70,6 +70,7 @@ place
7070

7171
```{code-cell} ipython3
7272
import polars as pl
73+
import pandas as pd
7374
import numpy as np
7475
import matplotlib.pyplot as plt
7576
import requests
@@ -96,11 +97,16 @@ s = pl.Series(name='daily returns', values=np.random.randn(4))
9697
s
9798
```
9899

99-
Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
100-
companies, and the values being daily returns on their shares.
100+
```{note}
101+
You may notice the above series has no indexes, unlike in [](pandas:series).
102+
103+
This is because Polars is column-centric, and accessing data is predominantly managed through filtering and boolean masks.
104+
105+
Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
106+
```
101107

102108
Polars `Series` are built on top of Apache Arrow arrays and support many similar
103-
operations
109+
operations to Pandas `Series`.
104110

105111
```{code-cell} ipython3
106112
s * 100
@@ -112,16 +118,27 @@ s.abs()
112118

113119
But `Series` provide more than basic arrays.
114120

115-
Not only do they have some additional (statistically oriented) methods
121+
For example, they have some additional (statistically oriented) methods
116122

117123
```{code-cell} ipython3
118124
s.describe()
119125
```
120126

121-
But they can also be used with custom indices
127+
However, a Polars `Series` cannot be used in the same way as a Pandas `Series` when pairing data with indices.
128+
129+
For example, using a Pandas `Series` you can do the following:
130+
131+
```{code-cell} ipython3
132+
s = pd.Series(np.random.randn(4), name='daily returns')
133+
s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
134+
s
135+
```
136+
137+
However, in Polars you will need to use the `DataFrame` object to do the same task.
138+
139+
Essentially, any column in a Polars `DataFrame` can be used as an index through the `filter` method.
122140

123141
```{code-cell} ipython3
124-
# Create a new series with custom index using a DataFrame
125142
df_temp = pl.DataFrame({
126143
'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
127144
'daily returns': s.to_list()
@@ -149,7 +166,7 @@ df_temp
149166

150167
```{code-cell} ipython3
151168
# Check if AAPL is in the companies
152-
'AAPL' in df_temp.get_column('company').to_list()
169+
'AAPL' in df_temp.get_column('company')
153170
```
154171

155172
## DataFrames
@@ -161,7 +178,7 @@ While a `Series` is a single column of data, a `DataFrame` is several columns, o
161178

162179
In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel spreadsheet.
163180

164-
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
181+
Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
165182

166183
Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
167184

@@ -293,16 +310,14 @@ Polars provides powerful methods for applying functions to data.
293310

294311
Instead of pandas' `apply` method, polars uses expressions within `select`, `with_columns`, or `filter` methods.
295312

296-
Here is an example using built-in functions
313+
Here is an example using built-in functions to find the `max` value for each column
297314

298315
```{code-cell} ipython3
299316
df.select([
300317
pl.col(['year', 'POP', 'XRAT', 'tcgdp', 'cc', 'cg']).max().name.suffix('_max')
301318
])
302319
```
303320

304-
This line of code applies the `max` function to all selected columns.
305-
306321
For more complex operations, we can use `map_elements` (similar to pandas' apply):
307322

308323
```{code-cell} ipython3
@@ -314,6 +329,16 @@ df.with_row_index().select([
314329
])
315330
```
316331

332+
However, as you can see from the warning issued by Polars, there is often a better way to achieve this using the Polars API.
333+
334+
```{code-cell} ipython3
335+
df.with_row_index().select([
336+
pl.col('index'),
337+
pl.col('country'),
338+
(pl.col('POP') * 2).alias('POP_doubled')
339+
])
340+
```
341+
317342
We can use complex filtering conditions with boolean logic:
318343

319344
```{code-cell} ipython3
@@ -362,7 +387,7 @@ df.with_columns([
362387
])
363388
```
364389

365-
**4.** We can use `map_elements` to modify all individual entries in specific columns.
390+
**4.** We can use built-in functions to modify all individual entries in specific columns.
366391

367392
```{code-cell} ipython3
368393
# Round all decimal numbers to 2 decimal places in numeric columns
@@ -408,7 +433,7 @@ For example, we can use forward fill, backward fill, or interpolation
408433
```{code-cell} ipython3
409434
# Fill with column means for numeric columns
410435
df_filled = df_with_nulls.with_columns([
411-
pl.col(pl.Float64, pl.Int64).fill_null(pl.col(pl.Float64, pl.Int64).mean())
436+
pl.col(pl.Float64).fill_null(pl.col(pl.Float64).mean())
412437
])
413438
df_filled
414439
```
@@ -860,4 +885,4 @@ plt.tight_layout()
860885
```{solution-end}
861886
```
862887

863-
[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
888+
[^mung]: Wikipedia defines munging as cleaning data from one raw form into a structured, purged one.
