Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy

What will I learn
- how to install Python and set up an isolated environment for AI work;
- why NumPy is the foundation of all machine learning in Python;
- how to create, inspect, and manipulate your first NumPy arrays;
- the critical difference between Python lists and NumPy arrays;
- how to experiment interactively with the Python REPL;
- how to verify that your setup works with a quick sanity check.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- A Python 3 installation (3.10 or newer - we'll set one up below);
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy (this post)
Before we can teach a machine to learn, we need a workbench. A carpenter doesn't start with the blueprint -- they set up the shop first. Saw? Check. Workbench? Check. Good lighting? Check. Only then does building begin.
If you've been following the Learn Python Series, you already know Python inside and out (and you know NumPy from episodes #11 and #56). But even if you haven't -- no worries, we'll set everything up from scratch here. Our workbench is Python, and our most important power tool is NumPy. Everything we build in this entire series -- from simple linear regression to transformer neural networks -- starts right here. So let's get this right ;-)
Why Python?
Machine learning has been done in many languages: R, Julia, MATLAB, C++, Java. But Python won. Not because it's the fastest language (it really isn't) or the most elegant (arguable, and I've had that argument more than once). Python won because of its ecosystem.
Every major ML library is Python-first: NumPy, pandas, scikit-learn, PyTorch, TensorFlow, Hugging Face. Every tutorial, every research paper with code, every production ML pipeline - Python. When you learn ML in Python, you're learning in the language the entire field speaks.
The other reason: Python is readable. ML is already complex enough without fighting your language's syntax. Python gets out of your way and lets you focus on the ideas.
Installing Python
If you're on macOS or Linux, Python might already be installed - but it's likely an older version. We want Python 3.10 or newer. Here's the clean way to get it:
# Check your current version
python3 --version
# macOS (using Homebrew - install from https://brew.sh if you don't have it)
brew install python@3.12
# Ubuntu/Debian
sudo apt update && sudo apt install python3.12 python3.12-venv
# Windows: download from https://python.org and check "Add Python to PATH"
On Windows, you may need to type python instead of python3. For the rest of this series, I'll use python3 - adjust accordingly.
Virtual environments: keeping things clean
A virtual environment is an isolated Python installation. It means the packages you install for this AI series won't interfere with other projects on your machine, and vice versa. This matters more than you think - version conflicts are a real productivity killer.
# Create a virtual environment called 'ai-env'
python3 -m venv ai-env
# Activate it
# macOS/Linux:
source ai-env/bin/activate
# Windows:
ai-env\Scripts\activate
When activated, your terminal prompt changes to show (ai-env). Everything you install with pip now goes into this environment only. To deactivate later, just type deactivate.
Rule of thumb: always work inside a virtual environment. Always. It costs nothing and prevents hours of debugging.
Installing NumPy
With your environment activated:
pip install numpy
That's it. NumPy is now available. Let's verify:
python3 -c "import numpy; print(numpy.__version__)"
# Should print something like 2.1.x or 2.2.x
Why NumPy? The speed that matters
You might wonder: Python already has lists. Why do we need a separate library for working with numbers?
The short answer: speed. The long answer is more interesting.
Python lists are flexible - you can mix types, append items, nest arbitrarily. This flexibility comes at a cost: each element is a full Python object, stored separately in memory, accessed through pointers. When you loop over a list of a million numbers, Python has to unwrap each object individually.
NumPy arrays are different. They store raw numbers in a contiguous block of memory, like a C array. When you perform an operation, NumPy doesn't loop through elements in Python - it calls optimized C code that processes the entire block at once.
import numpy as np
import time
size = 1_000_000
# Python list approach
py_list = list(range(size))
start = time.time()
result = [x * 2 for x in py_list]
py_time = time.time() - start
# NumPy approach
np_array = np.arange(size)
start = time.time()
result = np_array * 2
np_time = time.time() - start
print(f"Python list: {py_time:.4f}s")
print(f"NumPy array: {np_time:.4f}s")
print(f"NumPy is {py_time/np_time:.0f}x faster")
On most machines, NumPy will be 20-100x faster for this simple operation. For matrix multiplication - the backbone of neural networks - the difference is even larger because NumPy uses BLAS (Basic Linear Algebra Subprograms), libraries that are optimized down to the CPU instruction level.
This is why we use NumPy: when you're training a model on millions of data points, the speed difference is the difference between "runs in seconds" and "runs all night."
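To see the BLAS effect directly, here's a small sketch (sizes picked just to keep the pure-Python version bearable) comparing a naive triple-loop matrix multiplication with NumPy's `@` operator:

```python
import numpy as np
import time

# Two random 100x100 matrices - 1,000,000 multiply-adds per product
n = 100
A = np.random.randn(n, n)
B = np.random.randn(n, n)

# Naive approach: triple loop written as nested comprehensions
start = time.time()
C_loops = [[sum(A[i, k] * B[k, j] for k in range(n)) for j in range(n)]
           for i in range(n)]
loop_time = time.time() - start

# NumPy approach: one BLAS-backed call
start = time.time()
C_numpy = A @ B
numpy_time = time.time() - start

print(f"Loops: {loop_time:.3f}s, NumPy: {numpy_time:.5f}s")
print(np.allclose(C_loops, C_numpy))  # same result, wildly different speed
```

Exact timings depend on your machine, but the gap here is typically several orders of magnitude - much bigger than in the element-wise example above.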
Your first NumPy arrays
Let's get hands-on. Open a Python shell (just type python3 in your terminal) or create a file called playground.py.
import numpy as np
# Creating arrays from Python lists
a = np.array([1, 2, 3, 4, 5])
print(a) # [1 2 3 4 5]
print(type(a)) # <class 'numpy.ndarray'>
# Arrays have a shape and a data type
print(a.shape) # (5,) - a 1D array with 5 elements
print(a.dtype) # int64
Shape and dtype are the two most important properties of any array. Shape tells you the dimensions: (5,) means one dimension with 5 elements. Dtype tells you what kind of numbers are stored: int64 means 64-bit integers.
# 2D array (a matrix)
m = np.array([[1, 2, 3],
              [4, 5, 6]])
print(m.shape) # (2, 3) - 2 rows, 3 columns
print(m.dtype) # int64
# Floating point - what we'll use for ML
m_float = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])
print(m_float.dtype) # float64
In ML, we almost always work with floating-point numbers (float32 or float64). Integer arrays are rare because gradient descent requires fractional adjustments.
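To see why integers and fractional adjustments don't mix, here's a small sketch (the "weight" values are made up for illustration): writing a fractional result back into an integer array silently truncates it.

```python
import numpy as np

w_int = np.array([1, 2, 3])           # dtype int64
w_float = np.array([1.0, 2.0, 3.0])   # dtype float64

# A gradient-style update of -0.5 per element:
w_int[:] = w_int - 0.5     # in-place assignment truncates to the int dtype
w_float = w_float - 0.5    # floats keep the fractional part

print(w_int)    # [0 1 2] - the .5 is silently gone
print(w_float)  # [0.5 1.5 2.5]
```

No error, no warning - just quietly wrong numbers. That's why float dtypes are the default in ML.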
Useful array creators
You don't always want to type arrays by hand. NumPy has shortcuts:
# All zeros
z = np.zeros((3, 4)) # 3 rows, 4 columns, all 0.0
print(z.shape) # (3, 4)
# All ones
o = np.ones((2, 5)) # 2 rows, 5 columns, all 1.0
# A sequence
seq = np.arange(0, 10, 2) # [0, 2, 4, 6, 8] - like range() but returns array
# Evenly spaced between two values
lin = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
# Random numbers (we'll use this A LOT)
r = np.random.randn(3, 3) # 3x3 matrix, random from normal distribution
print(r)
That last one - np.random.randn - creates random numbers from a normal distribution (bell curve centered at 0). We'll use this for weight initialization in neural networks. Get used to seeing it.
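One related habit worth picking up early: seeding the random generator so your "random" numbers are reproducible across runs. A minimal sketch, using NumPy's newer `Generator` API (`default_rng` / `standard_normal`, which plays the same role as `np.random.randn`):

```python
import numpy as np

# Same seed -> same sequence of random numbers, every run
rng = np.random.default_rng(seed=42)
w = rng.standard_normal((3, 3))   # 3x3 matrix from a normal distribution

print(w.shape)   # (3, 3)
print(w.dtype)   # float64
```

Reproducible randomness sounds like a contradiction, but it's essential for debugging: rerun the script, get identical weight initializations, and compare apples to apples.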
Data types matter more than you think
I mentioned dtype earlier, but let me expand on this because it bites people. When you create an array from integers, NumPy defaults to int64. From floats, it defaults to float64. You can force a specific type:
# Explicit dtype
a = np.array([1, 2, 3], dtype=np.float32)
print(a.dtype) # float32
# Convert existing array
b = np.array([1, 2, 3]) # int64
c = b.astype(np.float32) # now float32
print(c) # [1. 2. 3.]
Why does this matter for ML? Two reasons. First, most ML frameworks (PyTorch, TensorFlow) expect float32 by default. If you feed them float64 data, they'll either complain or silently convert -- wasting memory. Second, float32 uses half the memory of float64. When your dataset has millions of rows, that's the difference between fitting in RAM and crashing.
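You can check the memory claim yourself with the `.nbytes` attribute - a quick sketch:

```python
import numpy as np

n = 1_000_000
a64 = np.zeros(n, dtype=np.float64)
a32 = a64.astype(np.float32)

print(a64.nbytes)  # 8000000 bytes - 8 bytes per float64
print(a32.nbytes)  # 4000000 bytes - exactly half
```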
A common pattern you'll see in ML code:
# Load data as whatever type it comes in
data = np.loadtxt("dataset.csv", delimiter=",")
# Immediately cast to float32 for ML
X = data[:, :-1].astype(np.float32) # features
y = data[:, -1].astype(np.float32) # target
Get used to checking .dtype obsessively. It's the NumPy equivalent of checking your types in a statically typed language -- except here, the compiler won't help you. You'll just get wrong results silently. Fun times ;-)
Reshaping: same data, different view
This is one of the most important NumPy operations for ML, and I promise you'll use it in almost every episode from here on out. Reshaping lets you change the dimensions of an array without copying the data:
a = np.arange(12) # [0, 1, 2, ..., 11] - shape (12,)
print(a.shape) # (12,)
# Reshape to 3 rows x 4 columns
b = a.reshape(3, 4)
print(b)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(b.shape) # (3, 4)
# Reshape to 4 rows x 3 columns
c = a.reshape(4, 3)
print(c)
# [[ 0 1 2]
# [ 3 4 5]
# [ 6 7 8]
# [ 9 10 11]]
The total number of elements must stay the same. You can't reshape 12 elements into a 5x3 matrix (that would need 15). NumPy will throw a ValueError if you try.
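You can see that failure mode safely with a `try`/`except` - a quick sketch:

```python
import numpy as np

a = np.arange(12)
try:
    a.reshape(5, 3)   # 5 * 3 = 15, but we only have 12 elements
except ValueError as e:
    print(f"Reshape failed: {e}")
```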
There's a convenient shortcut: use -1 for one dimension and NumPy figures out the size automatically:
a = np.arange(12)
b = a.reshape(3, -1) # "3 rows, figure out the columns"
print(b.shape) # (3, 4) - NumPy calculated 12/3 = 4
c = a.reshape(-1, 6) # "figure out rows, 6 columns"
print(c.shape) # (2, 6)
You'll see reshape(-1, 1) everywhere in ML code. It turns a 1D array into a column vector:
prices = np.array([100, 200, 150, 300])
print(prices.shape) # (4,)
prices_col = prices.reshape(-1, 1)
print(prices_col.shape) # (4, 1)
print(prices_col)
# [[100]
# [200]
# [150]
# [300]]
Why would you want a column vector instead of a flat array? Because scikit-learn and most ML tools expect input features as a 2D array with shape (n_samples, n_features). Even if you have just one feature, it needs to be (n, 1) not (n,). This catches people off-guard constantly. Now you know.
Broadcasting: NumPy's superpower
Broadcasting is the mechanism that lets NumPy perform operations between arrays of different shapes. It's weird when you first see it, but once it clicks, it's incredibly powerful.
Simple example first:
a = np.array([1, 2, 3, 4])
b = a * 10
print(b) # [10, 20, 30, 40]
Wait -- we multiplied an array by a scalar. How? NumPy "broadcast" the scalar 10 to match the shape of a, as if it were [10, 10, 10, 10]. It didn't actually create that array in memory (that would be wasteful), it just applied the operation as if it had.
This extends to higher dimensions:
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
row_means = m.mean(axis=1, keepdims=True)
print(row_means)
# [[2.]
# [5.]
# [8.]]
centered = m - row_means
print(centered)
# [[-1. 0. 1.]
# [-1. 0. 1.]
# [-1. 0. 1.]]
We subtracted a (3, 1) array from a (3, 3) matrix. NumPy broadcast the column vector across all three columns. This "center the data by subtracting the mean" operation is one of the most common preprocessing steps in ML -- and broadcasting makes it a one-liner instead of a loop.
The rule is: dimensions are compared from right to left, and they're compatible if they're equal or one of them is 1. If one array has fewer dimensions, it's padded with 1s on the left. Don't memorize that rule -- just experiment with it. You'll build intuition pretty fast.
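Here's one experiment to start with - a sketch of the rule in action, combining a `(3, 1)` column with a `(3,)` row (which gets padded to `(1, 3)`):

```python
import numpy as np

col = np.arange(3).reshape(3, 1)   # shape (3, 1): [[0], [1], [2]]
row = np.arange(3)                 # shape (3,) -> treated as (1, 3)

# (3, 1) + (1, 3) broadcasts to (3, 3): every row value meets every column value
grid = col + row
print(grid.shape)   # (3, 3)
print(grid)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
```

Compare the dimensions right to left: 1 vs 3 (one of them is 1: compatible), 3 vs 1 (compatible), so the result is (3, 3).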
Boolean indexing: filtering data
This is the NumPy way to filter data, and you'll use it constantly:
temps = np.array([18.5, 22.1, 35.8, 15.2, 28.9, 31.4, 19.7])
# Which temperatures are above 25?
mask = temps > 25
print(mask) # [False False True False True True False]
# Use the mask to select only those values
hot_days = temps[mask]
print(hot_days) # [35.8 28.9 31.4]
# Or in one line
print(temps[temps > 25]) # [35.8 28.9 31.4]
You can combine conditions with & (and), | (or), and ~ (not). Important: you need parentheses around each condition because of Python's operator precedence:
# Temperatures between 20 and 30
comfortable = temps[(temps >= 20) & (temps <= 30)]
print(comfortable) # [22.1 28.9]
In ML, you'll use this for things like "show me all samples where feature X exceeds threshold Y" or "split the dataset into subsets based on class labels." It's fast because it's vectorized -- no Python loop involved.
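Here's what that class-label split looks like in practice - a sketch with a made-up toy dataset (one feature per sample, plus a 0/1 label):

```python
import numpy as np

features = np.array([1.2, 3.4, 0.5, 2.8, 4.1])
labels   = np.array([0,   1,   0,   1,   1])

# Boolean masks split the features by class, no loop needed
class_0 = features[labels == 0]
class_1 = features[labels == 1]

print(class_0)   # [1.2 0.5]
print(class_1)   # [3.4 2.8 4.1]
```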
Element-wise operations
This is where NumPy shines and where it diverges from how you'd think about list operations:
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])
print(a + b) # [11 22 33 44] - element-wise addition
print(a * b) # [10 40 90 160] - element-wise multiplication
print(a / b) # [10. 10. 10. 10.]
print(a ** 2) # [100 400 900 1600] - element-wise square
print(np.sqrt(a)) # [3.16... 4.47... 5.47... 6.32...]
Every operation happens to all elements simultaneously. No loops. No list comprehensions. Just say what you want, and NumPy does it to every element at once. This paradigm - called vectorized operations - is the key to writing fast numerical code.
When you see ML code that looks like predictions = X @ weights + bias, this is what's happening: matrix multiplication and addition, applied to entire datasets at once. Clean, fast, and readable.
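Here's a tiny sketch of that line in action (the numbers are made up; a real model would learn `weights` and `bias` rather than have them hard-coded):

```python
import numpy as np

# 3 samples, 2 features each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
weights = np.array([0.5, -1.0])   # one weight per feature
bias = 2.0

# Matrix-vector product, then broadcast the scalar bias over all samples
predictions = X @ weights + bias
print(predictions)   # [ 0.5 -0.5 -1.5]
```

One line, three predictions - and the same line works unchanged for three million samples.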
Indexing and slicing
Accessing parts of an array works like Python lists, with extra power for multiple dimensions:
a = np.array([10, 20, 30, 40, 50])
print(a[0]) # 10 - first element
print(a[-1]) # 50 - last element
print(a[1:4]) # [20 30 40] - slice
m = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(m[0, 0]) # 1 - row 0, column 0
print(m[1, :]) # [4 5 6] - entire row 1
print(m[:, 2]) # [3 6 9] - entire column 2
print(m[0:2, 1:3]) # [[2 3] [5 6]] - submatrix
The row, column indexing is fundamental. In ML, you'll constantly slice datasets: "give me the first 100 samples" (data[:100, :]), "give me just the target column" (data[:, -1]), "give me features 2 through 5" (data[:, 2:6]).
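Those slicing idioms combine naturally into a crude train/test split - a sketch on a fake dataset (the 80/20 row split is my choice for illustration):

```python
import numpy as np

# Fake dataset: 10 samples, 3 columns (2 features + 1 target in the last column)
data = np.arange(30).reshape(10, 3).astype(np.float32)

X = data[:, :-1]   # all rows, every column except the last -> features
y = data[:, -1]    # all rows, last column only -> target

train_X, test_X = X[:8], X[8:]   # simple 80/20 split by row slicing
print(X.shape, y.shape)              # (10, 2) (10,)
print(train_X.shape, test_X.shape)  # (8, 2) (2, 2)
```

(Real splits should shuffle first - we'll get to that when we work with actual data.)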
Sanity check: is everything working?
Let's put it all together with a quick test that touches everything we've set up:
import numpy as np
# Create some fake "data" - 5 houses with 3 features each
# [square_meters, bedrooms, age_years]
houses = np.array([
    [80, 2, 15],
    [120, 3, 5],
    [60, 1, 30],
    [200, 5, 2],
    [95, 2, 10]
], dtype=np.float64)
print(f"Dataset shape: {houses.shape}") # (5, 3)
print(f"Number of houses: {houses.shape[0]}") # 5
print(f"Number of features: {houses.shape[1]}") # 3
# Basic statistics per feature (column)
print(f"Average square meters: {houses[:, 0].mean():.1f}") # 111.0
print(f"Max bedrooms: {houses[:, 1].max():.0f}") # 5
print(f"Min age: {houses[:, 2].min():.0f}") # 2
# Normalize: scale each feature to 0-1 range
mins = houses.min(axis=0)
maxs = houses.max(axis=0)
normalized = (houses - mins) / (maxs - mins)
print(f"\nNormalized first house: {normalized[0]}")
# [0.142... 0.25 0.464...]
print("\nSetup complete. You're ready for AI.")
If this runs without errors and produces reasonable output, your workbench is ready. We have Python, we have NumPy, and we know how to create, manipulate, and inspect arrays.
Take a moment to appreciate what just happened in that normalization step. We took raw data with wildly different scales (square meters in the hundreds, bedrooms in single digits, age in double digits) and squeezed everything into the 0-1 range. Why bother? Because most ML algorithms are sensitive to the scale of features. If one feature ranges from 0 to 200 and another from 0 to 5, the algorithm will think the first feature is more "important" just because it has bigger numbers. Normalization puts all features on equal footing. We did it in two lines. No loops. No list comprehensions. Just NumPy.
This pattern -- load data, inspect shape, compute statistics, transform -- is the skeleton of every ML preprocessing pipeline you'll ever write. The specific transforms get fancier (standardization, log transforms, one-hot encoding), but the shape is always the same: raw data in, transformed data out, all vectorized.
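As a taste of one of those fancier transforms, here's a sketch of standardization (zero mean, unit variance per feature), applied to the same houses array from the sanity check:

```python
import numpy as np

houses = np.array([
    [80, 2, 15],
    [120, 3, 5],
    [60, 1, 30],
    [200, 5, 2],
    [95, 2, 10]
], dtype=np.float64)

# Standardization: subtract each column's mean, divide by its std deviation
mean = houses.mean(axis=0)
std = houses.std(axis=0)
standardized = (houses - mean) / std

print(standardized.mean(axis=0).round(10))  # ~[0. 0. 0.]
print(standardized.std(axis=0).round(10))   # ~[1. 1. 1.]
```

Same skeleton as the min-max normalization: compute column statistics, broadcast-subtract, broadcast-divide.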
A note on the REPL
Throughout this series, I'll sometimes show code as if typed into a Python shell (the REPL -- Read-Eval-Print Loop) and sometimes as a .py script. Both work. The REPL is great for experimenting -- type a line, see the result immediately, adjust, repeat. Scripts are better when your code gets longer than a dozen lines.
My recommendation: use the REPL for exploration and a .py file for anything you want to keep. And keep playground.py around -- we'll use it as a scratch pad throughout this series.
Next time, we'll start using these tools on actual data - and you'll see why "everything is just numbers" is the first principle of machine learning.
So, what do you know now?
- Python is the language of ML because of its ecosystem (NumPy, PyTorch, scikit-learn, Hugging Face);
- Virtual environments isolate your project's packages - always use one;
- NumPy arrays store numbers in contiguous memory and operate 20-100x faster than Python lists through vectorized operations;
- Shape and dtype are the two essential properties of any array - check them constantly;
- Element-wise operations apply to all elements at once, no loops needed;
- Indexing with [row, column] lets you slice datasets into samples, features, or subsets;
- Your setup is complete and verified - ready for episode #3.