Learn Python Series (#54) - Pandas in 2026: What's Changed

Repository

What will I learn

  • You will learn what changed in Pandas since we last covered it in episodes #30-33;
  • how copy-on-write semantics eliminate the SettingWithCopyWarning nightmare;
  • why the PyArrow backend matters for performance and memory;
  • method chaining patterns that make data pipelines readable;
  • nullable dtypes and why they fix Pandas' oldest design flaw;
  • practical examples applying modern Pandas to real-world data analysis.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • Familiarity with Pandas basics from episodes #30-33;
  • The ambition to learn Python programming.

Difficulty

  • Intermediate

Curriculum (of the Learn Python Series):

GitHub Account

https://github.com/realScipio

Back in episodes #30-33, we covered Pandas fundamentals - DataFrames, Series, indexing, groupby, merging, and the basic workflow of loading, transforming, and analyzing tabular data. That was 2018. Pandas has evolved significantly since then, and not just in minor ways - the internal architecture got a major overhaul.

This episode isn't a rehash. If you need a refresher on the basics, go back to #30-33. What we're covering here is what's new, what's better, and what you should be doing differently in 2026. Some of your old Pandas habits are now actively counterproductive.

Nota bene: Pandas 2.x (released 2023) was a seismic shift. Copy-on-write semantics, Apache Arrow backend, nullable dtypes, the removal of append() - if you learned Pandas before 2.0, parts of your muscle memory are obsolete. Let's fix that.

The SettingWithCopyWarning Is Dead - And Good Riddance

If you've used Pandas for any length of time, you've seen this:

SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.

This warning haunted every Pandas user, from beginners to experts. The underlying problem was that Pandas couldn't always tell whether an indexing operation returned a view (shared memory with the original DataFrame) or a copy (independent memory). Modifying a view would silently modify the original. Modifying a copy would silently not modify the original. The warning was Pandas saying "I don't know which one this is, and neither do you."

The community's workaround was defensive .copy() calls everywhere:

# Old Pandas - the paranoid approach
subset = df[df['amount'] > 100].copy()   # .copy() just in case
subset['tax'] = subset['amount'] * 0.21  # now safe to modify

Pandas 2.0 introduced copy-on-write (CoW). In Pandas 3.0, it's the default behavior. The semantics are clean:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
subset = df[df['a'] > 1]

# In old Pandas: is this a view or copy? Nobody knows!
# In Pandas 3.0 (CoW): always independent on modification
subset['b'] = 99  # Only modifies subset, NEVER df

print(df)
#    a  b
# 0  1  4
# 1  2  5
# 2  3  6    ← unchanged, as you'd expect

Under the hood, CoW doesn't eagerly copy data. When you create subset, it shares memory with df (like a view). Only when you modify subset does Pandas create a copy of the affected data. This means read-only operations stay fast and memory-efficient, while modifications are always safe.

If you're still on Pandas 2.x, enable it explicitly:

pd.set_option('mode.copy_on_write', True)

The practical impact is enormous: you can delete every defensive .copy() call in your codebase. The mental overhead of "is this a view or a copy?" is gone. Code becomes simpler, and an entire category of subtle bugs disappears.

The PyArrow Backend - A New Engine Under the Hood

Old Pandas stored data using NumPy arrays internally. This worked but had fundamental limitations:

  1. No native support for missing values in integer columns (NaN is a float, so integer columns with missing values got cast to float64)
  2. Strings stored as Python objects (slow, memory-hungry, no vectorized operations)
  3. No support for complex types (nested structs, lists, maps)

Pandas 2.0+ can use Apache Arrow as the backend storage:

# Read directly with Arrow backend
df = pd.read_csv('large_file.csv', dtype_backend='pyarrow')

# Or convert an existing DataFrame
df = df.convert_dtypes(dtype_backend='pyarrow')

print(df.dtypes)
# id           int64[pyarrow]
# name    large_string[pyarrow]
# active       bool[pyarrow]

What does this buy you? Let me be specific.

Memory savings - dramatic ones. Arrow stores strings in contiguous UTF-8 buffers rather than as individual Python objects, and can additionally dictionary-encode repeated values. Imagine a column with 1 million rows containing 100 unique city names. NumPy backend: each row points to a separate Python string object (roughly 50 bytes of object overhead per string, plus the string data itself). Arrow with dictionary encoding: the 100 unique strings are stored once, and each row holds a compact integer code. Real-world savings: 50-80% for string-heavy datasets.

import pandas as pd

# Compare memory usage
df_numpy = pd.read_csv('users.csv')
df_arrow = pd.read_csv('users.csv', dtype_backend='pyarrow')

print(f"NumPy backend: {df_numpy.memory_usage(deep=True).sum() / 1e6:.1f} MB")
print(f"Arrow backend: {df_arrow.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# NumPy backend: 42.3 MB
# Arrow backend: 11.7 MB  ← 72% less memory
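The users.csv numbers above are illustrative. You can reproduce the effect of dictionary encoding without any file (or even pyarrow installed) using pandas' own categorical dtype, which works on the same principle: store each unique value once, reference it by integer code.

```python
import pandas as pd

# Synthetic data: 1 million rows, only 100 unique city names
cities = pd.Series([f"city_{i % 100}" for i in range(1_000_000)])

# object dtype: one full Python string object per row
object_bytes = cities.memory_usage(deep=True)

# categorical dtype: 100 unique strings stored once, plus compact integer codes
category_bytes = cities.astype("category").memory_usage(deep=True)

print(f"object:   {object_bytes / 1e6:.1f} MB")
print(f"category: {category_bytes / 1e6:.1f} MB")
```

The exact megabyte figures depend on your Python build, but the categorical version typically comes out an order of magnitude smaller.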

Missing value support everywhere. Arrow has native null for every type. The ancient Pandas pain of "integer column becomes float because it has one missing value" is gone:

# Old Pandas (NumPy backend):
s_old = pd.Series([1, 2, None])
print(s_old.dtype)  # float64 (!)  - because NaN is a float

# Arrow backend:
s_new = pd.Series([1, 2, None], dtype='int64[pyarrow]')
print(s_new.dtype)  # int64[pyarrow] - null is null, integer stays integer

String performance. Arrow strings are stored as contiguous UTF-8 byte buffers, not individual Python heap objects. String operations (contains, replace, lower, split) on Arrow-backed columns run 2-10x faster because they operate on contiguous memory instead of chasing Python object pointers through the heap.

Nullable Dtypes Without Arrow

Even without the full Arrow backend, Pandas 2.x introduced nullable extension types that solve the NaN problem:

df = pd.DataFrame({
    'id': pd.array([1, 2, None], dtype='Int64'),           # Capital I!
    'active': pd.array([True, None, False], dtype='boolean'),
    'name': pd.array(['Alice', None, 'Charlie'], dtype='string'),
    'score': pd.array([3.14, None, 2.72], dtype='Float64'),
})

print(df.dtypes)
# id          Int64
# active    boolean
# name       string
# score     Float64
# dtype: object

These use pd.NA instead of np.nan. The distinction matters: np.nan is a float that infects every column it touches - an integer column with one np.nan becomes float64. pd.NA propagates properly through any operation without type coercion.

The easiest migration path for existing code:

# Upgrade all columns at once
df = pd.read_csv('data.csv').convert_dtypes()

This automatically picks the best nullable dtype for each column. Integers stay integers, booleans stay booleans, strings get proper string dtype. It's the single most impactful line you can add to existing Pandas code.
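Here's a minimal before/after sketch (with made-up data simulating what read_csv typically produces) of what convert_dtypes() does:

```python
import pandas as pd

# Simulated read_csv output: NaN forced the ints to float, text landed as object
df = pd.DataFrame({
    'id': [1.0, 2.0, None],            # wanted integers, got float64
    'name': ['Alice', None, 'Carol'],  # object dtype
    'active': [True, False, None],     # object dtype
})

converted = df.convert_dtypes()
print(converted.dtypes)  # id: Int64, name: string, active: boolean
```

Integers are recovered as Int64 (the float values were all integral), and the object columns become proper string and boolean dtypes.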

How pd.NA Propagation Differs from np.nan

The behavioral difference between pd.NA and np.nan is subtle but important in practice:

import numpy as np

# np.nan: comparison returns False (NaN != NaN)
print(np.nan == np.nan)    # False
print(np.nan > 5)          # False
print(np.nan + 10)         # nan

# pd.NA: comparison returns pd.NA (unknown, not False)
print(pd.NA == pd.NA)      # <NA>
print(pd.NA > 5)           # <NA>
print(pd.NA + 10)          # <NA>

This three-valued logic (True / False / NA) is mathematically correct - if a value is unknown, a comparison with it should be unknown, not False. It also means pd.NA works properly in boolean contexts: if pd.NA: raises a TypeError instead of silently evaluating to False. This catches bugs that np.nan would let slip through.
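A short sketch of how this plays out with a real Series (toy data):

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype='Int64')

mask = s > 2
print(mask.tolist())     # [False, <NA>, True] - the missing row is "unknown"

# In boolean indexing, NA is treated as False: the unknown row is excluded
print(s[mask].tolist())  # [3]

# But NA refuses to be coerced to a plain bool - this catches bugs early
try:
    bool(pd.NA)
except TypeError as exc:
    print(f"TypeError: {exc}")
```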

Parquet - The File Format That Matches Arrow

If you're using the Arrow backend (and you should be for anything non-trivial), Parquet is the natural file format. Parquet is a columnar binary format that preserves dtypes, handles nulls natively, and compresses dramatically:

# Write - preserves all dtype information, no conversion needed
df.to_parquet('data.parquet', engine='pyarrow')

# Read - types come back exactly as they were, no inference
df = pd.read_parquet('data.parquet', dtype_backend='pyarrow')

CSV round-trips are lossy: you write an integer column, it comes back as float (because of NaN). You write dates, they come back as strings. You write booleans, they might come back as strings "True"/"False". Parquet round-trips are lossless: what you write is exactly what you get back.
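The lossiness is easy to demonstrate in-memory, using io.StringIO as a stand-in for a file on disk:

```python
import io
import pandas as pd

df = pd.DataFrame({'id': pd.array([1, 2, None], dtype='Int64')})
print(df['id'].dtype)    # Int64

# Round-trip through CSV text
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
back = pd.read_csv(buffer)

print(back['id'].dtype)  # float64 - the nullable integer dtype is lost
```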

The size difference is striking too:

# Typical file sizes for a 1M-row, 20-column dataset:
# CSV:     ~450 MB (text, no compression)
# Parquet: ~35 MB  (columnar, compressed with snappy by default)

For any data pipeline that reads and writes intermediate results, switching from CSV to Parquet is free performance. Read times are faster (less data to parse), write times are faster (no string conversion), and disk usage drops by 10-15x.

Method Chaining - The Modern Pandas Style

The most important stylistic shift in modern Pandas isn't a new feature - it's a pattern. Method chaining transforms multi-step data operations from imperative mutation to declarative pipelines:

# Old style - mutate in place, hard to follow, lots of intermediate variables
df = pd.read_csv('sales.csv')
df = df[df['amount'] > 0]
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
df['month'] = df.index.month
result = df.groupby('month')['amount'].sum()

# Modern style - chained, reads top to bottom like a recipe
result = (
    pd.read_csv('sales.csv')
    .query('amount > 0')
    .assign(date=lambda df: pd.to_datetime(df['date']))
    .set_index('date')
    .assign(month=lambda df: df.index.month)
    .groupby('month')['amount']
    .sum()
)

The chained version is not just shorter - it's safer. Each step produces a new DataFrame (especially with CoW). No accidental mutation, no wondering which version of df you're looking at, no variable name reuse bugs.

The key methods that enable chaining:

  • .query() - filter rows with a string expression
  • .assign() - add or modify columns, returns new DataFrame
  • .pipe() - apply any function that takes and returns a DataFrame
  • .rename() - rename columns inline
  • .sort_values() - sort without inplace=True

.pipe() for Custom Transformations

.pipe() is the escape hatch that makes chaining work for any operation:

def remove_outliers(df, column, n_std=3):
    """Remove rows where column value is more than n_std from mean."""
    mean, std = df[column].mean(), df[column].std()
    return df[df[column].between(mean - n_std * std, mean + n_std * std)]

def add_rolling_stats(df, column, window=30):
    """Add rolling mean and std columns."""
    return df.assign(
        **{
            f'{column}_ma{window}': df[column].rolling(window).mean(),
            f'{column}_std{window}': df[column].rolling(window).std(),
        }
    )

result = (
    pd.read_csv('measurements.csv')
    .pipe(remove_outliers, 'temperature')
    .pipe(add_rolling_stats, 'temperature', window=7)
    .groupby('station')['temperature']
    .mean()
)

Each .pipe() call is a named, reusable, testable transformation. You can unit test remove_outliers independently with a tiny test DataFrame. Try doing that with imperative mutation code.

.query() and .eval() - SQL-Like Syntax

For complex filtering conditions, .query() is dramatically cleaner than bracket notation:

# Bracket notation - verbose, easy to mess up parentheses
df[(df['age'] > 25) & (df['city'] == 'Amsterdam') & (df['salary'] > 50000)]

# Query - reads like SQL WHERE clause
df.query('age > 25 and city == "Amsterdam" and salary > 50000')

Reference Python variables with @:

min_age = 25
target_city = 'Amsterdam'
min_salary = 50_000

filtered = df.query('age > @min_age and city == @target_city and salary > @min_salary')

.eval() does the same for column expressions - creating new columns or computing values without intermediate arrays:

# These are equivalent, but eval() avoids creating temporary arrays
df['bonus'] = df['salary'] * 0.1
df = df.eval('bonus = salary * 0.1')

# Multi-line eval
df = df.eval('''
    bonus = salary * 0.1
    total_comp = salary + bonus
    tax = total_comp * 0.37
    net = total_comp - tax
''')

For DataFrames with more than ~10,000 rows, .eval() and .query() use numexpr under the hood, which can be faster than standard NumPy operations because it avoids allocating intermediate arrays.
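Whichever engine runs the expression, the results are identical; if numexpr isn't installed, pandas silently falls back to a pure-Python evaluator. A quick equivalence check on synthetic data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({'x': rng.normal(size=50_000), 'y': rng.normal(size=50_000)})

via_brackets = df[(df['x'] > 0) & (df['y'] > 0)]
via_query = df.query('x > 0 and y > 0')

print(via_brackets.equals(via_query))  # True - same rows, same order
```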

Practical Example: Analyzing Time-Series Data

Let's put it all together with a realistic pipeline:

import pandas as pd

def load_and_prepare(filepath):
    return (
        pd.read_csv(filepath, dtype_backend='pyarrow')
        .assign(date=lambda df: pd.to_datetime(df['date']))
        .set_index('date')
        .sort_index()
    )

def add_technical_indicators(df, price_col='close'):
    return df.assign(
        ma_7=lambda d: d[price_col].rolling(7).mean(),
        ma_30=lambda d: d[price_col].rolling(30).mean(),
        volatility_30=lambda d: d[price_col].rolling(30).std(),
        daily_return=lambda d: d[price_col].pct_change(),
        cumulative_return=lambda d: (1 + d[price_col].pct_change()).cumprod() - 1,
    )

def monthly_summary(df):
    return (
        df.resample('ME')
        .agg(
            open=('open', 'first'),
            high=('high', 'max'),
            low=('low', 'min'),
            close=('close', 'last'),
            volume=('volume', 'sum'),
            avg_volatility=('volatility_30', 'mean'),
        )
        .assign(monthly_return=lambda d: d['close'].pct_change() * 100)
    )

# The full pipeline - readable, testable, composable
report = (
    load_and_prepare('btc_daily.csv')
    .pipe(add_technical_indicators)
    .pipe(monthly_summary)
    .query('monthly_return > 0')   # Only positive months
    .round(2)
)

Each function is independently testable. The pipeline reads like a description of what it does. Adding a new step (say, filtering to a date range) is one extra .query() or .pipe() call.

The Migration Cheat Sheet

If you learned Pandas before 2.0, here's what to change:

Old habit → modern approach:

  • .copy() everywhere → copy-on-write handles it (Pandas 3.0 default)
  • np.nan for missing values → pd.NA with nullable dtypes
  • Integer columns becoming float with NaN → use Int64 (capital I) or Arrow int64[pyarrow]
  • Strings as object dtype → use the string dtype or the Arrow backend
  • Imperative mutation (df = df[...]) → method chaining with .query(), .assign(), .pipe()
  • inplace=True → don't: it breaks chaining, is deprecated, and was never faster
  • df.append() → use pd.concat([df1, df2]); append() was removed in 2.0
  • pd.read_csv() defaults → add dtype_backend='pyarrow' for large files
  • df.groupby(...).agg({'col': 'sum'}) → named aggregation: .agg(total=('col', 'sum'))
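Of these, the append() removal is the one that breaks old code outright. The replacement, sketched with two hypothetical frames:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# df1.append(df2)  # AttributeError on pandas >= 2.0 - the method is gone
combined = pd.concat([df1, df2], ignore_index=True)
print(combined['a'].tolist())  # [1, 2, 3, 4]
```

Note ignore_index=True, which rebuilds a fresh RangeIndex instead of keeping the duplicate 0, 1 labels from both inputs.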

One thing to note about inplace=True: it was never actually faster. People assumed it avoided copying, but internally Pandas still created a new object and then swapped the reference. It just made code harder to chain and reason about. Its removal from the API is overdue.

What to remember from this one

In this episode, we caught up with modern Pandas - and there's a lot:

  • Copy-on-write eliminates the SettingWithCopyWarning and the view-vs-copy confusion entirely
  • The PyArrow backend brings 50-80% memory savings on string-heavy data, native null handling for every type, and faster string operations
  • Nullable dtypes (Int64, boolean, string, Float64) fix the NaN-infects-everything problem - use convert_dtypes() to upgrade existing DataFrames
  • Method chaining with .query(), .assign(), and .pipe() makes data pipelines readable, testable, and composable
  • .pipe() is the key to reusable, unit-testable data transformations
  • .eval() and .query() provide SQL-like syntax and better performance on large DataFrames via numexpr
  • inplace=True is going away and was never actually faster - embrace chaining
  • df.append() is gone - use pd.concat() instead

If you haven't upgraded your Pandas habits since episodes #30-33, now's the time. The library is fundamentally better - faster, safer, and more ergonomic. Next episode: the messy reality of data cleaning and transformation, where we'll deal with duplicates, inconsistent formats, missing patterns, and all the ugly problems that real-world datasets throw at you.

Thanks, and until next time!

@scipio


