lectures/polars.md
Lines changed: 41 additions & 16 deletions
@@ -37,7 +37,7 @@ In addition to what's in Anaconda, this lecture will need the following librarie
 
 ## Overview
 
-[Polars](https://pola.rs/) is a lightning-fast data manipulation library for Python written in Rust.
+[Polars](https://pola.rs/) is a fast data manipulation library for Python written in Rust.
 
 Polars has gained significant popularity in recent years due to its superior performance
 compared to traditional data analysis tools, making it an excellent choice for modern
@@ -58,7 +58,7 @@ Just as [NumPy](https://numpy.org/) provides the basic array data type plus core
 * adjusting indices
 * working with dates and time series
 * sorting, grouping, re-ordering and general data munging [^mung]
-* dealing with missing values, etc., etc.
+* dealing with missing values, etc.
 
 More sophisticated statistical functionality is left to other packages, such
 as [statsmodels](https://www.statsmodels.org/) and [scikit-learn](https://scikit-learn.org/), which can work with polars DataFrames through their interoperability with pandas.
@@ -70,6 +70,7 @@ place
 
 ```{code-cell} ipython3
 import polars as pl
+import pandas as pd
 import numpy as np
 import matplotlib.pyplot as plt
 import requests
@@ -96,11 +97,16 @@ s = pl.Series(name='daily returns', values=np.random.randn(4))
 s
 ```
 
-Here you can imagine the indices `0, 1, 2, 3` as indexing four listed
-companies, and the values being daily returns on their shares.
+```{note}
+You may notice the above series has no index, unlike in [](pandas:series).
+
+This is because Polars is column-centric, and data access is predominantly managed through filtering and boolean masks.
+
+Here is [an interesting blog post discussing this in more detail](https://medium.com/data-science/understand-polars-lack-of-indexes-526ea75e413).
+```
 
 Polars `Series` are built on top of Apache Arrow arrays and support many similar
-operations
+operations to pandas `Series`.
 
 ```{code-cell} ipython3
 s * 100
@@ -112,16 +118,27 @@ s.abs()
 
 But `Series` provide more than basic arrays.
 
-Not only do they have some additional (statistically oriented) methods
+For example, they have some additional (statistically oriented) methods
 
 ```{code-cell} ipython3
 s.describe()
 ```
 
-But they can also be used with custom indices
+However, a Polars `Series` cannot be used in the same way as a pandas `Series` when pairing data with indices.
+
+For example, using a pandas `Series` you can do the following:
+
+```{code-cell} ipython3
+s = pd.Series(np.random.randn(4), name='daily returns')
+s.index = ['AMZN', 'AAPL', 'MSFT', 'GOOG']
+s
+```
+
+However, in Polars you will need to use the `DataFrame` object to do the same task.
+
+Essentially, any column in a Polars `DataFrame` can serve as an index through the `filter` method.
 
 ```{code-cell} ipython3
-# Create a new series with custom index using a DataFrame
 df_temp = pl.DataFrame({
     'company': ['AMZN', 'AAPL', 'MSFT', 'GOOG'],
     'daily returns': s.to_list()
@@ -149,7 +166,7 @@ df_temp
 
 ```{code-cell} ipython3
 # Check if AAPL is in the companies
-'AAPL' in df_temp.get_column('company').to_list()
+'AAPL' in df_temp.get_column('company')
 ```
 
 ## DataFrames
@@ -161,7 +178,7 @@ While a `Series` is a single column of data, a `DataFrame` is several columns, o
 
 In essence, a `DataFrame` in polars is analogous to a (highly optimized) Excel spreadsheet.
 
-Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns, often with descriptive indexes for individual rows and individual columns.
+Thus, it is a powerful tool for representing and analyzing data that are naturally organized into rows and columns.
 
 Let's look at an example that reads data from the CSV file `pandas/data/test_pwt.csv`, which is taken from the [Penn World Tables](https://www.rug.nl/ggdc/productivity/pwt/pwt-releases/pwt-7.0).
 
@@ -293,16 +310,14 @@ Polars provides powerful methods for applying functions to data.
 
 Instead of pandas' `apply` method, polars uses expressions within `select`, `with_columns`, or `filter` methods.
 
-Here is an example using built-in functions
+Here is an example using built-in functions to find the `max` value for each column