Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World

What will I learn
- You will learn that everything a machine learning model sees is a number;
- how real-world things like height, color, and "yes/no" get turned into numbers;
- the spreadsheet mental model -- rows are samples, columns are features;
- how to load and explore a small dataset in NumPy;
- what a "feature" actually means and why it matters;
- the difference between numerical and categorical data.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World (this post)
Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
How does a computer understand a photo? Or a sentence? Or a patient's medical history?
The answer is disarmingly simple: it doesn't. Not in any way you'd recognize as "understanding." What it does is much more humble and (honestly) much more powerful: it works with numbers. Only numbers. Every input, every output, every intermediate calculation -- numbers. That's it.
Your job as an ML practitioner is to translate the messy, rich, complicated real world into numbers that a model can crunch. And I'll tell you from experience -- this translation step (often called feature engineering) is where the actual craft lives. More important than knowing fancy algorithms. More important than having the latest GPU. If you get this wrong, no amount of computational power will save your model. It's where we start today ;-)
The first principle: everything becomes a number
A machine learning model is, at its core, a mathematical function. Functions take numbers in and produce numbers out. So no matter what you're working with -- images, text, audio, customer records, sensor readings -- it must become a number before the model can touch it.
Some things are already numbers:
- Temperature: 22.5 degrees Celsius
- Price: 347,000 euros
- Height: 1.83 meters
- Age: 34 years
These are numerical features. They slot directly into a model with no conversion needed. Easy.
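Just to make that concrete, here's what those example values look like once they sit in a NumPy array (nothing clever is happening yet -- they're already numbers):
import numpy as np
# The example values from the list above, already model-ready
readings = np.array([22.5, 347000.0, 1.83, 34.0])
print(readings.dtype)  # float64 -- no conversion step needed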
Other things are not numbers, but they need to become numbers:
- Color: "red", "blue", "green"
- Country: "Netherlands", "Belgium", "Germany"
- Spam or not spam: "yes" or "no"
- Day of the week: "Monday", "Tuesday", ...
These are categorical features. They require a translation step. And this is where beginners (and quite a few experienced practitioners, honestly) make mistakes. Let's look at the different approaches.
Turning categories into numbers
Binary categories (yes/no)
The easiest case. Two options become 0 and 1:
import numpy as np
# Is it spam? No = 0, Yes = 1
spam_labels = np.array([0, 1, 0, 0, 1, 1, 0])
# Is the customer active? No = 0, Yes = 1
active = np.array([1, 1, 0, 1, 0])
This is natural and intuitive. The model sees 0 and 1 as positions on a number line -- "not spam" at one end, "spam" at the other. It works because there IS a natural ordering: more "spammy" features should push the value toward 1. (You could argue whether 0 and 1 is the best choice, or maybe -1 and 1 -- we'll revisit that when we get to neural networks. For now, 0/1 is the convention.)
Multiple categories: label encoding
What about more than two categories? The tempting approach is to assign a number to each:
# Colors as numbers?
# red = 0, blue = 1, green = 2
colors = np.array([0, 1, 2, 0, 2, 1])
But wait -- do you see the problem? By assigning red = 0, blue = 1, green = 2, we've told the model that blue is "between" red and green, and that green is "bigger" than red. That's nonsense -- colors don't have a numeric order. The model might learn fake patterns from this invented ordering.
There ARE cases where label encoding makes sense: ordinal categories like "low"/"medium"/"high" or "small"/"medium"/"large" genuinely have an order. But for unordered categories, you need something else.
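As a quick illustration, here's what label encoding could look like for an ordinal feature. The mapping below (low = 0, medium = 1, high = 2) is just an example I'm making up, but the point is that the numeric order now genuinely mirrors the real-world order:
# Ordinal categories: the numeric order matches the real order
# Example mapping: low = 0, medium = 1, high = 2
priority = ["low", "high", "medium", "low", "high"]
mapping = {"low": 0, "medium": 1, "high": 2}
priority_encoded = np.array([mapping[p] for p in priority])
print(priority_encoded)  # [0 2 1 0 2]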
Multiple categories: one-hot encoding
The cleaner solution for unordered categories is one-hot encoding: create a separate column for each category, with a 1 where the category applies and 0 elsewhere.
# Three colors: red, blue, green
# Instead of one column, we use three:
# [is_red, is_blue, is_green]
red = [1, 0, 0]
blue = [0, 1, 0]
green = [0, 0, 1]
colors_onehot = np.array([
    [1, 0, 0],  # red
    [0, 1, 0],  # blue
    [0, 0, 1],  # green
    [1, 0, 0],  # red
    [0, 0, 1],  # green
    [0, 1, 0],  # blue
])
print(colors_onehot.shape) # (6, 3)
Now each color is equidistant from every other color. No fake ordering. No hidden bias. This is the standard approach, and we'll use it extensively when we get to data preparation later in the series.
The downside: if you have 100 categories, you get 100 columns. That's a real concern for things like country codes or product IDs, and there are smarter approaches (embeddings, target encoding, frequency encoding -- we'll cover those later in the series). But one-hot encoding is the foundation you need to understand first.
Quick aside: there's a related trick called dummy encoding where you drop one of the one-hot columns (since the last category is implied when all others are 0). So for three colors, you'd use just two columns: is_red and is_blue. If both are 0, it must be green. This avoids a subtle mathematical problem called multicollinearity in some models. You don't need to worry about it now, but you'll see drop='first' in scikit-learn's OneHotEncoder -- and now you know why it's there.
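If you already have integer codes (like the label-encoded colors array from earlier), one common NumPy trick -- just one possible way to do it -- is to index into an identity matrix. The dummy-encoded version then simply drops the first column:
# One possible way to build one-hot vectors from integer codes
codes = np.array([0, 1, 2, 0, 2, 1])  # red=0, blue=1, green=2
n_categories = 3
onehot = np.eye(n_categories)[codes]  # pick the matching row of the identity matrix
print(onehot.shape)  # (6, 3)
# Dummy encoding: drop the first column; the dropped category is implied
dummy = onehot[:, 1:]
print(dummy.shape)   # (6, 2)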
The spreadsheet mental model
Now let's zoom out. In machine learning, your data almost always looks like a table -- a spreadsheet, if you will. Each row is one sample (one example, one observation, one data point). Each column is one feature (one measurement, one property).
# 5 apartments for sale
# Columns: square_meters, floor, has_elevator, rooms, price_eur
apartments = np.array([
    [65, 3, 1, 2, 185000],
    [82, 1, 0, 3, 210000],
    [45, 5, 1, 1, 145000],
    [120, 2, 1, 4, 320000],
    [55, 4, 0, 2, 165000],
], dtype=np.float64)
print(f"Samples (apartments): {apartments.shape[0]}") # 5
print(f"Features (columns): {apartments.shape[1]}") # 5
In this table:
- Row 0 is a 65m2 apartment on the 3rd floor with elevator, 2 rooms, priced at 185,000 euros
- Column 0 is the square_meters feature across all apartments
- The convention is always (n_samples, n_features) -- rows are samples, columns are features
This convention is not just a suggestion -- it's a hard requirement. It matters because every ML library expects this exact layout. When scikit-learn says a model takes X with shape (n_samples, n_features), it means exactly this. Get it backwards (features as rows, samples as columns) and you'll get cryptic errors at best, silently wrong results at worst. I've debugged this exact mistake more times than I'd like to admit, sometimes spending half an hour before realizing the data was transposed. Check your .shape obsessively. It costs nothing and saves everything.
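To make that concrete, here's a tiny sanity-check sketch (the array below is made up) showing what a transposed dataset looks like and how .T puts it right:
# Hypothetical example: 3 samples with 2 features each, stored the wrong way round
X_wrong = np.array([[1.0, 2.0, 3.0],
                    [10.0, 20.0, 30.0]])  # oops: features as rows
print(X_wrong.shape)  # (2, 3) -- looks like 2 samples with 3 features!
X_right = X_wrong.T   # transpose: rows become samples again
print(X_right.shape)  # (3, 2) -- 3 samples, 2 features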
Features: what makes a good one?
A feature is any measurable property that you believe might be relevant for your prediction. The word "believe" is doing heavy lifting there -- choosing which features to include (and which to ignore) is an art that separates good ML practitioners from everybody else.
Let's think about predicting apartment prices:
Strong features (probably very relevant):
- Square meters -- bigger apartments cost more
- Number of rooms -- same logic
- Location (encoded somehow) -- a studio in Amsterdam costs more than a villa in Drenthe
Weak features (might help a little):
- Floor number -- penthouse premium, ground floor convenience
- Has elevator -- matters more for higher floors
Irrelevant features (noise):
- The color of the front door
- The name of the current owner
- The day of the week you viewed the listing
Why should you care about this distinction? Because including irrelevant features doesn't just waste memory -- it can actively harm your model by introducing noise that looks like signal. The model doesn't know the front door color is irrelevant; it'll try to find patterns there anyway, and sometimes (by pure chance) it will find a spurious correlation. We call that overfitting (remember episode #1?), and it's the number one killer of otherwise solid models.
We'll cover this in depth in the feature engineering episode, but the instinct starts here: think about whether a feature could plausibly help make the prediction. If you can't explain why it would matter, leave it out.
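To see how easily noise can masquerade as signal, here's a small illustration using the apartments array from above and a pile of made-up random "features": with only five samples, at least one of a thousand purely random columns will correlate very strongly with the price, by pure chance.
# Spurious correlation demo: random noise vs. the 5 apartment prices
rng = np.random.default_rng(42)
prices = apartments[:, -1]
noise_features = rng.normal(size=(1000, 5))  # 1000 random "features", 5 samples each
correlations = [abs(np.corrcoef(f, prices)[0, 1]) for f in noise_features]
print(f"Best spurious correlation: {max(correlations):.3f}")  # suspiciously high, despite being pure noise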
Exploring data with NumPy
Once your data is in a NumPy array, you can immediately learn a lot about it:
# Using our apartments data
data = apartments
# Basic statistics per column
print("Mean per feature:", data.mean(axis=0))
print("Std per feature: ", data.std(axis=0))
print("Min per feature: ", data.min(axis=0))
print("Max per feature: ", data.max(axis=0))
The axis=0 parameter means "operate along rows" -- give me one result per column. This is confusing at first (trust me, everyone struggles with it) but becomes second nature: axis=0 collapses rows (gives per-column stats), axis=1 collapses columns (gives per-row stats). Still confused? Think of it as "the axis that disappears." axis=0 makes the row dimension disappear, leaving one value per column.
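A tiny illustration of that rule, using a made-up 2x3 array:
# axis demo on a small 2x3 array
demo = np.array([[1, 2, 3],
                 [4, 5, 6]])
print(demo.sum(axis=0))  # [5 7 9]  -- the row dimension disappears: one sum per column
print(demo.sum(axis=1))  # [ 6 15]  -- the column dimension disappears: one sum per row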
# The features (everything except the last column)
X = data[:, :-1]
print(f"Features shape: {X.shape}") # (5, 4)
# The target (what we want to predict - the price)
y = data[:, -1]
print(f"Target shape: {y.shape}") # (5,)
print(f"\nAverage price: EUR {y.mean():,.0f}")
print(f"Price range: EUR {y.min():,.0f} - EUR {y.max():,.0f}")
This split -- X for features and y for the target -- is the universal ML convention. You'll see it in every tutorial, every library, every paper. X is what the model sees, y is what it tries to predict. Get used to writing X, y in your code because you'll be typing it a thousand times ;-)
Building a dataset from scratch
Let's do something practical. Imagine you're collecting weather data -- you go outside every day for two weeks, measure the temperature, humidity, and wind speed, and note whether it rained. Here's how you'd turn that into an ML-ready dataset:
import numpy as np
# 14 days of weather observations
# Columns: temperature_c, humidity_pct, wind_kmh, rain (0=no, 1=yes)
weather = np.array([
    [18.2, 72, 15, 1],
    [22.1, 55, 8, 0],
    [15.8, 88, 22, 1],
    [25.3, 42, 5, 0],
    [19.7, 65, 12, 0],
    [14.1, 91, 28, 1],
    [23.8, 48, 6, 0],
    [20.5, 70, 18, 1],
    [26.2, 38, 4, 0],
    [17.3, 78, 20, 1],
    [21.9, 52, 9, 0],
    [16.5, 85, 25, 1],
    [24.7, 45, 7, 0],
    [19.0, 73, 16, 1],
], dtype=np.float64)
# Features = temperature, humidity, wind speed
X = weather[:, :3]
# Target = did it rain?
y = weather[:, 3]
# Quick sanity check
print(f"Dataset: {X.shape[0]} days, {X.shape[1]} features")
print(f"Rainy days: {int(y.sum())} out of {len(y)}")
# What's the average temperature on rainy vs dry days?
rainy_mask = y == 1
dry_mask = y == 0
print(f"Avg temp (rainy): {X[rainy_mask, 0].mean():.1f} C")
print(f"Avg temp (dry): {X[dry_mask, 0].mean():.1f} C")
print(f"Avg humidity (rainy): {X[rainy_mask, 1].mean():.1f}%")
print(f"Avg humidity (dry): {X[dry_mask, 1].mean():.1f}%")
Even with this tiny dataset, you can already see patterns: rainy days are cooler, more humid, and windier. A machine learning model will find exactly these correlations -- just across many more dimensions and many more data points. What you did by eyeballing the averages is (conceptually) what the model does mathematically. The model just does it faster, across more features, and with more precision.
A complete example: the iris dataset
Let's work with a real (well, classic) dataset. The Iris dataset is the "Hello, World" of machine learning -- 150 flowers, 4 features each (sepal length/width, petal length/width), classified into 3 species. It was published by Ronald Fisher in 1936, and almost a century later it's still the first dataset every ML student touches ;-)
import numpy as np
# Iris dataset - first 5 samples of each species
# Features: sepal_length, sepal_width, petal_length, petal_width
setosa = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
    [5.0, 3.6, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
])
versicolor = np.array([
    [7.0, 3.2, 4.7, 1.4],
    [6.4, 3.2, 4.5, 1.5],
    [6.9, 3.1, 4.9, 1.5],
    [5.5, 2.3, 4.0, 1.3],
    [6.5, 2.8, 4.6, 1.5],
])
virginica = np.array([
    [6.3, 3.3, 6.0, 2.5],
    [5.8, 2.7, 5.1, 1.9],
    [7.1, 3.0, 5.9, 2.1],
    [6.3, 2.9, 5.6, 1.8],
    [6.5, 3.0, 5.8, 1.8],
])
# Stack them into one dataset
X = np.vstack([setosa, versicolor, virginica])
y = np.array([0]*5 + [1]*5 + [2]*5) # 0=setosa, 1=versicolor, 2=virginica
print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {np.unique(y)}")
# Compare species by their average petal length (column 2)
for species, name in [(0, "Setosa"), (1, "Versicolor"), (2, "Virginica")]:
    mask = y == species
    avg_petal = X[mask, 2].mean()
    print(f" {name:12s} avg petal length: {avg_petal:.2f} cm")
Look at that: setosa has tiny petals (~1.4 cm), versicolor has medium petals (~4.5 cm), and virginica has long petals (~5.7 cm). Just by looking at one feature, you can already separate the species pretty well. Isn't that remarkable? A machine learning model does exactly this -- but across all features simultaneously, finding the combination that separates the classes best.
Pretty cool, right? This is also a good example of why feature selection matters. If we only had sepal width, the species would be much harder to tell apart (versicolor and virginica overlap a lot there). But petal length? Almost perfect separation with just that one measurement.
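To drive that home, here's a toy "model" -- just a hand-picked threshold on petal length, nothing learned yet -- that picks out setosa from the other two species:
# A hand-picked rule, not a trained model: petal length below 2.5 cm => setosa
is_setosa_pred = X[:, 2] < 2.5
is_setosa_true = y == 0
accuracy = (is_setosa_pred == is_setosa_true).mean()
print(f"Setosa vs. rest, one-feature rule: {accuracy:.0%} correct")  # 100% on these 15 samples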
Combining features: when one column isn't enough
Sometimes individual features don't tell the full story, but combinations do. Let's look at our iris data again:
# Ratio: petal_length / sepal_length
ratio = X[:, 2] / X[:, 0]
for species, name in [(0, "Setosa"), (1, "Versicolor"), (2, "Virginica")]:
    mask = y == species
    print(f" {name:12s} petal/sepal ratio: {ratio[mask].mean():.3f}")
# Even better separation! Setosa ~0.29, Versicolor ~0.70, Virginica ~0.89
Creating new features from existing ones (like this ratio) is called feature engineering, and it's one of the most creative parts of ML. You're not just feeding raw data -- you're encoding your domain knowledge into the data itself. A botanist would tell you "the petal-to-sepal ratio is characteristic of the species." We just translated that expert knowledge into a number. (And before you ask -- yes, there are methods that discover these combinations automatically. But knowing how to do it by hand gives you an edge, especially when the automatic methods miss something obvious.)
A word about scales
Look at our apartments dataset again. Square meters range from 45 to 120. Number of rooms: 1 to 4. Price: 145,000 to 320,000. These are wildly different scales, and most ML algorithms are sensitive to that.
Why? Because a model that measures "distance" between data points (and many do) would think the price dimension is overwhelmingly important -- not because price matters more, but because price has bigger numbers. A difference of 100,000 euros would dwarf a difference of 50 square meters, even though both might be equally informative.
The fix is normalization -- scaling all features to comparable ranges. The simplest approach:
# Min-max normalization: scale everything to [0, 1]
mins = apartments.min(axis=0)
maxs = apartments.max(axis=0)
normalized = (apartments - mins) / (maxs - mins)
print("Original first apartment:", apartments[0])
print("Normalized first apartment:", normalized[0].round(3))
After normalization, every feature lives in the [0, 1] range. Now the model can compare them fairly. We'll explore more sophisticated normalization techniques (like standardization, which centers data at mean 0 with standard deviation 1) in later episodes. For now, just know that raw numbers often need rescaling before feeding them to a model.
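Here's a quick illustration of why this matters for anything distance-based: compare the Euclidean distance between the first two apartments before and after scaling (using the apartments and normalized arrays from above).
# Distance between apartment 0 and apartment 1, raw vs. normalized
raw_dist = np.linalg.norm(apartments[0] - apartments[1])
norm_dist = np.linalg.norm(normalized[0] - normalized[1])
print(f"Raw distance:        {raw_dist:,.1f}")  # roughly 25,000 -- the price difference dominates everything
print(f"Normalized distance: {norm_dist:.3f}")  # all features now contribute on a comparable footing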
What we haven't converted yet
There's a whole universe of data types we haven't touched: text, images, audio, time series. Each of these requires more sophisticated conversion to numbers -- embeddings for text (where words become dense vectors of 100+ dimensions), pixel arrays for images (where a 28x28 grayscale image becomes 784 numbers), spectrograms for audio (frequency intensity over time), and rolling windows for time series.
Each of those topics deserves (and will get) its own episode. For now, the essential insight is this: no matter how complex the input, the model only ever sees arrays of numbers. A photo of a cat? 2,359,296 floating-point numbers (1024x768 pixels times 3 color channels). A sentence? A sequence of integers representing tokens, later transformed into dense vectors. A song? Amplitude samples at 44,100 per second, reshaped into overlapping frequency windows.
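If you want to see the "images are just numbers" claim with your own eyes, here's a tiny sketch using a made-up 4x4 grayscale "image" (real photos work exactly the same way, just with far more pixels):
# A made-up 4x4 grayscale "image": each value is one pixel's brightness (0-255)
image = np.array([[  0,  50, 100, 150],
                  [ 50, 100, 150, 200],
                  [100, 150, 200, 250],
                  [150, 200, 250, 255]], dtype=np.float64)
flat = image.reshape(-1)  # flatten the grid into a plain vector of numbers
print(flat.shape)         # (16,) -- this vector is all the model ever sees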
Your job is to make the translation as faithful as possible -- to preserve the information that matters and discard the noise. And that, more than any algorithm or framework, is what makes a good ML practitioner.
Before you close this tab...
- Machine learning models only work with numbers -- everything must be converted;
- Numerical features (temperature, price, age) need no conversion;
- Binary categories become 0/1; multi-class categories need one-hot encoding to avoid fake ordering;
- Data is organized as a table: rows are samples, columns are features -- shape (n_samples, n_features);
- The X/y split is universal: X = features the model sees, y = target it predicts;
- Good features plausibly relate to the prediction; irrelevant features add harmful noise (overfitting);
- Combining features (feature engineering) can reveal patterns that individual columns miss;
- NumPy's axis=0 gives per-column stats, axis=1 gives per-row stats -- you'll use this constantly.