Learn Creative Coding (#92) - ml5.js: Machine Learning in the Browser

Last episode wrapped up the data art arc -- thirteen episodes from raw API calls to ethical responsibility. We built tools for fetching, parsing, mapping, laying out, interacting with, sonifying, and critically examining data as creative material. Every episode in that arc treated data as something static that YOU interpret. You decide how to map temperature to color. You decide which data goes on which axis. You choose the visual vocabulary. The data just sits there waiting for your creative decisions.
This episode flips that relationship. What if the data isn't static numbers from a CSV? What if the input is a live webcam feed, and a neural network is the one doing the interpreting? Instead of you deciding "this pixel is blue, therefore draw a circle," a model says "I see a cat -- confidence 87%" and your code responds to that interpretation. The model becomes a creative collaborator. Its perception drives your art.
ml5.js makes this accessible. It's a JavaScript library built on top of TensorFlow.js that wraps pre-trained machine learning models in a friendly API. No Python. No GPU servers. No training datasets. Everything runs in the browser. You load a model, feed it an image or a video frame, and get back structured results -- labels, coordinates, confidence scores -- that you can map to visuals using every technique we've built over ninety-one episodes. The ml5 team built it specifically for creative coders, artists, and students. It's not trying to compete with PyTorch. It's trying to make ML usable for people who want to make art, not publish papers.
What ml5 gives you
ml5 wraps several categories of pre-trained models. Each one takes a different kind of input and produces a different kind of structured output. Here's what's available:
Image classification -- feed it an image, get back labels and confidence scores. "This looks like a tabby cat (87%), or maybe a tiger cat (6%), or possibly an Egyptian cat (3%)." The model sees the world through about 1000 categories from the ImageNet dataset.
Object detection -- not just "what's in the image" but "where in the image." Returns bounding boxes with coordinates, labels, and confidence. Multiple objects per frame.
Pose detection -- track the human body. Returns 17 keypoints: nose, eyes, ears, shoulders, elbows, wrists, hips, knees, ankles. Each keypoint has x, y coordinates and a confidence score. Works in real time on webcam video.
Hand tracking -- 21 keypoints per hand. Fingertips, joints, palm center. Fine-grained enough to detect gestures.
Face mesh -- 468 points on the face. Eyebrows, nose bridge, lip contour, jaw line, cheek surface. Enough detail to track expressions.
Sound classification -- feed it microphone audio, get back labels. "speech," "music," "clapping," "whistle." The audio equivalent of image classification.
Sentiment analysis -- give it text, get back a score from negative to positive.
All of these ship pre-trained. You don't build the model. You don't collect training data. You load it, call a function, and get results. The creative work is deciding what to DO with those results. And that's where everything we've learned comes in.
Loading your first model
ml5 works in the browser. Include it from a CDN (or install via npm if you're using a bundler), and you're ready to go. Here's the minimal HTML you need:
<!DOCTYPE html>
<html>
<head>
<script src="https://cdn.jsdelivr.net/npm/[email protected]/lib/p5.min.js"></script>
<script src="https://unpkg.com/ml5@latest/dist/ml5.min.js"></script>
</head>
<body>
<script src="sketch.js"></script>
</body>
</html>
Two script tags. p5.js for the canvas and drawing. ml5 for the machine learning models. That's your entire dependency list. Let's start with image classification using MobileNet -- a lightweight model designed to run on mobile devices.
// include ml5 in your HTML:
// <script src="https://unpkg.com/ml5@latest/dist/ml5.min.js"></script>
let classifier;
let isModelReady = false;
function setup() {
  // load MobileNet -- downloads ~5MB of model weights
  classifier = ml5.imageClassifier('MobileNet', function() {
    console.log('model loaded');
    isModelReady = true;
  });
}
// no manual setup() call here: p5.js (loaded in the HTML above) runs setup() once by itself,
// and calling it again would load the model twice
That's it. One function call. The model downloads (about 5MB the first time -- it gets cached after that), initializes, and calls your callback when it's ready. The first time you load MobileNet on a slow connection, there's a noticeable delay. After that, the browser cache makes subsequent loads almost instant.
The ml5.imageClassifier call creates a classifier object. You feed it images. It returns predictions. Let's classify a static image first to see what comes back.
// classify a single image
// assume you have an <img id="testImage" src="cat.jpg"> in your HTML
function classifyImage() {
  if (!isModelReady) {
    console.log('model still loading..');
    return;
  }
  const img = document.getElementById('testImage');
  classifier.classify(img, function(results) {
    console.log('predictions:', results);
    // results is an array of { label, confidence } objects
    // sorted by confidence, highest first
    for (const r of results) {
      console.log(r.label + ': ' + (r.confidence * 100).toFixed(1) + '%');
    }
  });
}
The results come back as an array of objects, each with a label (string) and confidence (float between 0 and 1). MobileNet returns the top 3 predictions by default. For a photo of a cat, you might get:
// what the results array looks like
[
  { label: 'tabby, tabby cat', confidence: 0.723 },
  { label: 'tiger cat', confidence: 0.141 },
  { label: 'Egyptian cat', confidence: 0.086 }
]
Those labels come from ImageNet's 1000 categories. They're specific and sometimes weird -- ImageNet distinguishes between "tabby," "tiger_cat," "Persian_cat," and "Egyptian_cat" because the dataset was built for scientific classification, not casual use. But for creative coding, the specificity is a feature, not a bug. Each label is a data point you can map to a visual response.
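Because each prediction is just a plain object, it's easy to reshape the array before mapping it to visuals. Here's a minimal sketch, assuming the results have the { label, confidence } shape shown above. The helper name summarizeResults is made up for this example and is not part of ml5.

```javascript
// Hypothetical helper (not part of ml5): tidy up a results array.
// Assumes each entry looks like { label: 'tabby, tabby cat', confidence: 0.723 }.
function summarizeResults(results) {
  return results.map(function (r) {
    return {
      // ImageNet labels often list synonyms separated by commas; keep the first one
      name: r.label.split(',')[0].trim(),
      // confidence as a whole-number percentage
      percent: Math.round(r.confidence * 100)
    };
  });
}

const sample = [
  { label: 'tabby, tabby cat', confidence: 0.723 },
  { label: 'tiger cat', confidence: 0.141 },
  { label: 'Egyptian cat', confidence: 0.086 }
];
console.log(summarizeResults(sample));
```

The short name is handy as a lookup key, and the percentage is ready for an on-screen overlay.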
Webcam input: real-time classification
Static images are fine for testing. The real magic happens with a live webcam feed. p5.js makes webcam access trivial with createCapture(VIDEO). Feed each frame to the classifier and you get real-time labels for whatever the camera sees.
// p5.js sketch with webcam classification
let video;
let classifier;
let currentLabel = 'loading...';
let currentConfidence = 0;
function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide(); // hide the raw video element, we'll draw it ourselves
  classifier = ml5.imageClassifier('MobileNet', video, function() {
    console.log('model ready');
    classifyFrame();
  });
}
function classifyFrame() {
  classifier.classify(function(results) {
    currentLabel = results[0].label;
    currentConfidence = results[0].confidence;
    // classify again immediately -- creates a loop
    classifyFrame();
  });
}
function draw() {
  image(video, 0, 0);
  // overlay the classification result
  fill(0, 0, 0, 180);
  noStroke();
  rect(10, height - 60, 400, 50, 5);
  fill(255);
  textSize(14);
  textFont('monospace');
  text(currentLabel, 20, height - 35);
  // confidence bar
  const barWidth = currentConfidence * 380;
  fill(100, 200, 150, 200);
  rect(15, height - 22, barWidth, 8, 3);
}
The recursive classifyFrame() creates a continuous classification loop. Each time the model finishes classifying a frame, it immediately starts classifying the next one. On modern hardware, MobileNet runs at about 15-30 classifications per second -- not quite 60fps, but fast enough to feel responsive. You see the webcam image with the current classification overlaid in real time. Hold up your phone: "cellular telephone: 91%." Hold up a coffee mug: "cup: 78%." Wave your hand: the label flickers between categories as the model tries to make sense of a moving target.
The flickering is interesting actually. When the model is confident, the label stays stable. When it's uncertain, the label changes rapidly -- "water bottle" one frame, "vase" the next, "wine bottle" after that. That uncertainty is data. We can use it.
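One way to turn that flicker into a number is to count how often the top label changed over the last few classifications. The sketch below is an assumption about how you might do it: flickerRate, rememberLabel and the window size of 20 are all invented for illustration, not something ml5 provides.

```javascript
// Hypothetical helper: measure how often the top label changed recently.
// Returns 0 when the label never changed and 1 when it changed every time.
function flickerRate(labels) {
  if (labels.length < 2) return 0;
  let changes = 0;
  for (let i = 1; i < labels.length; i++) {
    if (labels[i] !== labels[i - 1]) changes++;
  }
  return changes / (labels.length - 1);
}

// Keep a short rolling history of labels (the window size is an arbitrary choice).
const recentLabels = [];
function rememberLabel(label, windowSize = 20) {
  recentLabels.push(label);
  if (recentLabels.length > windowSize) recentLabels.shift();
  return flickerRate(recentLabels);
}
```

Call rememberLabel(results[0].label) inside the classification callback and you get a 0-to-1 "restlessness" value to map to jitter, glitch intensity, or anything else.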
Confidence as creative parameter
The confidence score (0 to 1) is the most underrated output from a classification model. Most tutorials ignore it -- they just show the label. But for creative coding, confidence is a rich signal. It tells you how sure the model is about what it sees. And that certainty maps beautifully to visual properties.
High confidence = the model is sure = strong, clear, stable visuals.
Low confidence = the model is uncertain = glitchy, ambiguous, transitional visuals.
let video;
let classifier;
let confidence = 0;
let label = '';
function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();
  classifier = ml5.imageClassifier('MobileNet', video, function() {
    classifyLoop();
  });
}
function classifyLoop() {
  classifier.classify(function(results) {
    label = results[0].label;
    confidence = results[0].confidence;
    classifyLoop();
  });
}
function draw() {
  // confidence drives the visual treatment
  if (confidence > 0.7) {
    // high confidence: clean, sharp rendering
    tint(255, 255);
    image(video, 0, 0);
  } else if (confidence > 0.3) {
    // medium confidence: pixelated, desaturated
    const blockSize = map(confidence, 0.3, 0.7, 20, 4);
    video.loadPixels();
    for (let y = 0; y < height; y += blockSize) {
      for (let x = 0; x < width; x += blockSize) {
        const idx = (Math.floor(y) * video.width + Math.floor(x)) * 4;
        const r = video.pixels[idx];
        const g = video.pixels[idx + 1];
        const b = video.pixels[idx + 2];
        // desaturate based on uncertainty
        const gray = (r + g + b) / 3;
        const sat = map(confidence, 0.3, 0.7, 0, 1);
        const cr = lerp(gray, r, sat);
        const cg = lerp(gray, g, sat);
        const cb = lerp(gray, b, sat);
        fill(cr, cg, cb);
        noStroke();
        rect(x, y, blockSize, blockSize);
      }
    }
  } else {
    // low confidence: abstract noise, model is lost
    for (let i = 0; i < 200; i++) {
      const x = random(width);
      const y = random(height);
      const s = random(5, 30);
      fill(random(255), random(255), random(255), 30);
      noStroke();
      ellipse(x, y, s);
    }
  }
  // label overlay
  fill(0, 0, 0, 160);
  rect(10, 10, 300, 30, 4);
  fill(255);
  textSize(12);
  textFont('monospace');
  text(label + ' (' + (confidence * 100).toFixed(0) + '%)', 20, 30);
}
When the model is confident (above 70%), you see a clean webcam image. As confidence drops, the image pixelates and loses color -- the visual literally becomes uncertain as the model becomes uncertain. Below 30%, the visual dissolves into abstract colored noise. The model can't figure out what it's looking at, so the art becomes formless. The model's internal state drives the aesthetic.
This is the key insight for this whole arc: ML model outputs aren't just answers to be displayed in a text box. They're continuous signals that can drive any visual parameter. Confidence to pixelation. Label changes to color shifts. Bounding box coordinates to particle emitter positions. The model's perception becomes creative input.
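To make the "confidence as signal" idea reusable, it can help to separate the decision from the drawing. Here's a rough sketch using the same 0.3 and 0.7 thresholds as the sketch above. Both confidenceToStyle and mapClamped are hypothetical helpers written in plain JavaScript (mapClamped stands in for p5's map() so the snippet runs without p5).

```javascript
// Plain-JS stand-in for p5's map(), clamped to the output range.
function mapClamped(v, inMin, inMax, outMin, outMax) {
  const t = Math.min(1, Math.max(0, (v - inMin) / (inMax - inMin)));
  return outMin + t * (outMax - outMin);
}

// Hypothetical helper mirroring the thresholds used in the sketch above:
// turns one confidence value into a small bundle of visual parameters.
function confidenceToStyle(confidence) {
  if (confidence > 0.7) {
    return { mode: 'clean', blockSize: 1, saturation: 1 };
  }
  if (confidence > 0.3) {
    return {
      mode: 'pixelated',
      blockSize: mapClamped(confidence, 0.3, 0.7, 20, 4),
      saturation: mapClamped(confidence, 0.3, 0.7, 0, 1)
    };
  }
  return { mode: 'noise', blockSize: 20, saturation: 0 };
}
```

With the mapping isolated like this, swapping pixelation for blur, line weight, or particle speed only means changing what draw() does with the returned object.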
Transfer learning: teach the model your categories
MobileNet knows 1000 ImageNet categories. But what if you want the model to recognize things that aren't in ImageNet? Like your specific hand gestures, or whether you're holding a red object vs a blue one, or whether you're smiling or frowning?
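Transfer learning lets you keep MobileNet's general visual understanding but retrain just the last layer to recognize YOUR categories. ml5 wraps this in ml5.featureExtractor. You provide a few examples of each category (as few as 10-20 images), and the model learns to distinguish them. One caveat: featureExtractor comes from ml5's older 0.x releases, so if ml5.featureExtractor is undefined when you load @latest, pin a 0.12.x build in your script tag instead.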
let featureExtractor;
let classifier;
let video;
let label = 'training...';
function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();
  // extract features from MobileNet, then train a custom classifier on top
  featureExtractor = ml5.featureExtractor('MobileNet', function() {
    console.log('feature extractor ready');
    classifier = featureExtractor.classification(video);
  });
}
// call these from buttons in your HTML
function addExample(categoryLabel) {
  // add current video frame as a training example for this category
  classifier.addImage(categoryLabel, function() {
    console.log('added example for: ' + categoryLabel);
  });
}
function trainModel() {
  classifier.train(function(lossValue) {
    if (lossValue) {
      console.log('training.. loss: ' + lossValue.toFixed(4));
    } else {
      console.log('training done');
      classifyLoop();
    }
  });
}
function classifyLoop() {
  classifier.classify(function(results) {
    label = results[0].label;
    classifyLoop();
  });
}
function draw() {
  image(video, 0, 0);
  fill(0, 0, 0, 180);
  rect(10, 10, 250, 30, 4);
  fill(255);
  textSize(14);
  textFont('monospace');
  text(label, 20, 30);
}
The workflow: hold up a red object, click "add example" for category "red" a dozen times from different angles. Hold up a blue object, add examples for "blue". Hold up nothing, add examples for "empty". Click train. The model learns in seconds -- it's only training one small layer, not the whole network. After training, point the camera at a red thing and the label reads "red." Point at blue and it switches. The model learned your specific categories from your specific examples. Pretty cool, right? :-)
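When you're clicking "add example" a lot, it's easy to lose track of how many samples each category has. Below is a small bookkeeping sketch that sits next to the ml5 calls rather than inside them. The names exampleCounts, recordExample and readyToTrain are invented for this example, and the default minimum of 10 is simply the low end of the 10-20 range mentioned earlier.

```javascript
// Hypothetical bookkeeping, separate from ml5: count examples per category
// so you know when each class has enough samples to train on.
const exampleCounts = {};

function recordExample(categoryLabel) {
  exampleCounts[categoryLabel] = (exampleCounts[categoryLabel] || 0) + 1;
  return exampleCounts[categoryLabel];
}

// minPerClass defaults to 10 (an assumption, not an ml5 requirement)
function readyToTrain(counts, minPerClass = 10) {
  const labels = Object.keys(counts);
  // need at least two categories, each with enough examples
  return labels.length >= 2 && labels.every(function (l) {
    return counts[l] >= minPerClass;
  });
}
```

You could call recordExample(categoryLabel) inside addExample and only enable the train button once readyToTrain(exampleCounts) returns true.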
For creative coding, this is powerful. You could train it to recognize three hand positions and map each one to a different generative visual. Or train it on mood-related objects (coffee mug = working, book = relaxing, keys = leaving) and build art that responds to your daily routine. The categories are yours to define. The model adapts to your creative intent.
The creative exercise: a magic mirror
Allez, let's build something. A "magic mirror" that classifies whatever you hold up to the webcam and responds with a unique generative visual for each recognized category. This combines MobileNet classification with the particle systems from episode 11, the color palettes from episode 28, and the noise fields from episode 12. The AI's perception drives the aesthetics.
let video;
let classifier;
let currentLabel = '';
let currentConf = 0;
let particles = [];
// visual rules per category
const categoryVisuals = {
  'default': { hue: 200, particleCount: 50, speed: 1, shape: 'circle' },
  'book': { hue: 45, particleCount: 80, speed: 0.5, shape: 'rect' },
  'cup': { hue: 30, particleCount: 120, speed: 1.5, shape: 'circle' },
  'phone': { hue: 260, particleCount: 200, speed: 2.0, shape: 'triangle' },
  'banana': { hue: 55, particleCount: 60, speed: 0.8, shape: 'circle' },
  'plant': { hue: 120, particleCount: 150, speed: 0.3, shape: 'circle' },
  'hand': { hue: 0, particleCount: 300, speed: 3.0, shape: 'circle' }
};
function setup() {
  createCanvas(800, 600);
  colorMode(HSB, 360, 100, 100, 100);
  video = createCapture(VIDEO);
  video.size(160, 120); // small for performance
  video.hide();
  classifier = ml5.imageClassifier('MobileNet', video, function() {
    classifyLoop();
  });
}
function classifyLoop() {
  classifier.classify(function(results) {
    currentLabel = results[0].label.split(',')[0].trim().toLowerCase();
    currentConf = results[0].confidence;
    classifyLoop();
  });
}
function getVisualConfig() {
  // check if any key in categoryVisuals matches the current label
  for (const key of Object.keys(categoryVisuals)) {
    if (currentLabel.indexOf(key) !== -1) {
      return categoryVisuals[key];
    }
  }
  return categoryVisuals['default'];
}
function draw() {
  background(0, 0, 5, 15); // slow fade
  const config = getVisualConfig();
  // spawn particles based on config
  while (particles.length < config.particleCount) {
    particles.push({
      x: width / 2 + random(-50, 50),
      y: height / 2 + random(-50, 50),
      vx: random(-1, 1) * config.speed,
      vy: random(-1, 1) * config.speed,
      size: random(3, 12),
      life: 1.0,
      hue: config.hue + random(-20, 20)
    });
  }
  // trim excess particles
  while (particles.length > config.particleCount) {
    particles.shift();
  }
  // update and draw
  for (let i = particles.length - 1; i >= 0; i--) {
    const p = particles[i];
    // noise-based movement (episode 12)
    const n = noise(p.x * 0.005, p.y * 0.005, frameCount * 0.005);
    const angle = n * TWO_PI * 2;
    p.vx += cos(angle) * 0.1 * config.speed;
    p.vy += sin(angle) * 0.1 * config.speed;
    // damping
    p.vx *= 0.98;
    p.vy *= 0.98;
    p.x += p.vx;
    p.y += p.vy;
    p.life -= 0.003;
    if (p.life <= 0 || p.x < -20 || p.x > width + 20 ||
        p.y < -20 || p.y > height + 20) {
      particles.splice(i, 1);
      continue;
    }
    const alpha = p.life * currentConf * 80;
    fill(p.hue % 360, 60, 70, alpha);
    noStroke();
    if (config.shape === 'circle') {
      ellipse(p.x, p.y, p.size);
    } else if (config.shape === 'rect') {
      // center the square on the particle by hand instead of calling rectMode(CENTER):
      // rectMode persists across frames and would shift the preview border drawn below
      rect(p.x - p.size / 2, p.y - p.size / 2, p.size, p.size);
    } else if (config.shape === 'triangle') {
      const s = p.size / 2;
      triangle(
        p.x, p.y - s,
        p.x - s, p.y + s,
        p.x + s, p.y + s
      );
    }
  }
  // small webcam preview in the corner
  image(video, 10, height - 90, 120, 80);
  stroke(200, 20, 80, 30);
  noFill();
  rect(10, height - 90, 120, 80);
  // label + confidence
  fill(0, 0, 100, 80);
  noStroke();
  textSize(12);
  textFont('monospace');
  textAlign(LEFT);
  text(currentLabel + ' ' + (currentConf * 100).toFixed(0) + '%', 10, height - 95);
}
Hold up a book and the background fills with warm, slow-moving rectangles drifting through a noise field. Hold up a coffee cup and amber circles swirl faster. Show your phone and the screen explodes with purple triangles. Wave your hand and red particles scatter in all directions. Put everything down and the default blue circles drift calmly. The AI's interpretation of what it sees directly controls the generative system -- color palette, particle count, movement speed, shape. Each object creates a visually distinct response.
The currentConf multiplier on particle alpha means uncertain classifications produce faded, ghost-like particles. When the model is confident, the visuals are vivid. When the model hesitates, the art hesitates too. That uncertainty feedback loop adds a quality that purely rule-based generative art doesn't have.
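One thing to watch with the indexOf lookup in getVisualConfig: it matches substrings, so a label like "bookcase" also triggers the 'book' visuals. If that bothers you, here's a possible alternative that compares whole words instead. matchCategory and the demoVisuals object are hypothetical and only for illustration. Note the trade-off: keys then have to be full words, so you'd write 'telephone' rather than 'phone'.

```javascript
// Hypothetical alternative to indexOf matching: compare whole words only.
// Returns the matching key, or 'default' when nothing matches.
function matchCategory(label, visuals) {
  // split on anything that isn't a letter: "cellular telephone" -> ["cellular", "telephone"]
  const words = label.toLowerCase().split(/[^a-z]+/).filter(Boolean);
  for (const key of Object.keys(visuals)) {
    if (key !== 'default' && words.includes(key)) {
      return key;
    }
  }
  return 'default';
}

// demo table with empty configs, just to show the matching behaviour
const demoVisuals = { default: {}, book: {}, cup: {}, telephone: {} };
console.log(matchCategory('bookcase', demoVisuals));           // default
console.log(matchCategory('comic book', demoVisuals));         // book
console.log(matchCategory('cellular telephone', demoVisuals)); // telephone
```

Which behaviour you want is an artistic choice. Loose substring matching gives you more happy accidents, whole-word matching gives you more control.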
Understanding model limitations
MobileNet was trained on ImageNet, which is a dataset with serious biases worth knowing about. ImageNet was built primarily from photos posted to Flickr and scraped from Google Images, which means it overrepresents objects and contexts common in Western, English-speaking internet culture. It knows 120 dog breeds but only a handful of tropical fruits. It recognizes an espresso machine but might struggle with a Korean rice cooker.
For creative coding, this bias can actually be interesting material. Build a piece where you test what the model recognizes and doesn't recognize. Show it objects from different cultures and visualize the confidence gap. The model's blind spots become visible art -- a visualization of what a particular dataset considers "normal."
// tracking what the model recognizes well vs poorly
let recognitionLog = [];
function logClassification(label, confidence) {
  recognitionLog.push({
    label: label,
    confidence: confidence,
    timestamp: Date.now()
  });
  // keep last 100 classifications
  if (recognitionLog.length > 100) {
    recognitionLog.shift();
  }
}
function drawRecognitionHistory() {
  // draw as a timeline: x = time, y = confidence
  // high confidence = model "gets it", low = confused
  const canvas = document.createElement('canvas');
  canvas.width = 800;
  canvas.height = 200;
  document.body.appendChild(canvas);
  const ctx = canvas.getContext('2d');
  ctx.fillStyle = '#0a0a1a';
  ctx.fillRect(0, 0, 800, 200);
  for (let i = 0; i < recognitionLog.length; i++) {
    const entry = recognitionLog[i];
    const x = (i / 100) * 780 + 10;
    const y = 190 - entry.confidence * 170;
    // color: green for confident, red for confused
    const hue = entry.confidence * 120; // 0=red, 120=green
    ctx.fillStyle = `hsla(${hue}, 60%, 50%, 0.6)`;
    ctx.beginPath();
    ctx.arc(x, y, 4, 0, Math.PI * 2);
    ctx.fill();
    // label for low-confidence entries (the interesting ones)
    if (entry.confidence < 0.3) {
      ctx.fillStyle = 'rgba(255, 120, 100, 0.5)';
      ctx.font = '8px monospace';
      ctx.textAlign = 'center';
      ctx.fillText(entry.label.split(',')[0], x, y - 8);
    }
  }
}
The timeline becomes a map of the model's competence. Green dots high up = objects it knows well. Red dots low down = things it's confused by. The labels on the confused entries tell you exactly what stumped it. Show the model a succession of objects and the timeline becomes a visual record of ImageNet's training data distribution. Common Western household items cluster at the top. Unusual or culturally specific objects drop to the bottom. It's a portrait of the dataset, not just the objects.
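The same log can also be boiled down to numbers. The sketch below averages confidence per label and sorts the least confident first, assuming entries shaped like the ones recognitionLog stores. averageConfidenceByLabel is a hypothetical helper, and the sample data is made up purely to show the output shape.

```javascript
// Hypothetical summary of a log shaped like recognitionLog above:
// average confidence per label, least confident first.
function averageConfidenceByLabel(log) {
  const totals = {};
  for (const entry of log) {
    const name = entry.label.split(',')[0];
    if (!totals[name]) totals[name] = { sum: 0, count: 0 };
    totals[name].sum += entry.confidence;
    totals[name].count += 1;
  }
  return Object.keys(totals)
    .map(function (name) {
      const t = totals[name];
      return { label: name, average: t.sum / t.count, count: t.count };
    })
    .sort(function (a, b) { return a.average - b.average; });
}

// invented sample entries, only to illustrate the shape of the result
const sampleLog = [
  { label: 'cup', confidence: 0.8 },
  { label: 'cup', confidence: 0.6 },
  { label: 'vase, jar', confidence: 0.2 }
];
console.log(averageConfidenceByLabel(sampleLog));
```

The top of that sorted list is where the model struggles most, which is usually the most interesting material for a piece about the dataset's blind spots.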
Performance: what to expect
ML models in the browser are not free. MobileNet is designed to be lightweight, but "lightweight" for a neural network still means millions of operations per frame. Here's what to expect:
MobileNet classification: 15-30 fps on a modern laptop. Acceptable for real-time creative work. On an older machine or a phone, expect 8-15 fps. Still usable but noticeably laggy.
Pose detection (we'll cover this next episode): 10-20 fps. Heavier than classification because it outputs 17 keypoints with coordinates, not just a label.
Face mesh: 8-15 fps. 468 points per face. The most computationally expensive model in ml5's lineup.
The key optimization: don't classify every frame. If your creative coding sketch runs at 60fps for smooth animation, you don't need 60 classifications per second. Classify every 3rd or 4th frame and interpolate between results. The human eye won't notice the classification lag if the animation remains smooth.
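let classifyEvery = 4; // classify every 4th frame
let framesSinceLastClassify = 0;
let classifying = false; // true while a classification is still in flight
function draw() {
  // always draw smoothly at full framerate
  drawBackground(); // background first, so it doesn't paint over the particles
  drawParticles();
  // only classify periodically, and never start a new run before the last one finished
  framesSinceLastClassify++;
  if (framesSinceLastClassify >= classifyEvery && !classifying) {
    framesSinceLastClassify = 0;
    classifying = true;
    // trigger async classification
    classifier.classify(function(results) {
      classifying = false;
      updateVisualsFromResults(results);
    });
  }
}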
Smooth 60fps animation with 15fps classification underneath. The animation responds to classification changes with a slight delay (at most 4 frames = 66ms) that's imperceptible to the viewer. This is the standard pattern for any ML-driven creative coding project: decouple the render loop from the inference loop.
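The "interpolate between results" part can be as simple as easing a displayed value toward the latest classification every frame. Here's a minimal sketch of that idea. smoothToward, onClassification and stepAnimation are hypothetical names, and the 0.1 easing amount is an arbitrary starting point to tune by eye.

```javascript
// Hypothetical smoothing helper: ease a displayed value toward the latest
// classification result instead of jumping to it.
function smoothToward(current, target, amount) {
  // amount is the fraction of the remaining gap closed per call (0..1)
  return current + (target - current) * amount;
}

let displayedConfidence = 0; // what the visuals actually use
let latestConfidence = 0;    // updated whenever a classification arrives

// call this from the classification callback
function onClassification(results) {
  latestConfidence = results[0].confidence;
}

// call this once per frame, e.g. at the top of draw()
function stepAnimation() {
  displayedConfidence = smoothToward(displayedConfidence, latestConfidence, 0.1);
  return displayedConfidence;
}
```

In a p5 sketch you could use lerp() in place of smoothToward. Either way, the visuals glide between classification results instead of snapping, which hides the lower inference rate.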
Privacy: a note on webcam-based art
When your art uses a webcam, you're capturing images of real people. ml5 runs entirely in the browser -- no data is sent to any server. The video frames never leave the user's machine. That's good. But it's not the whole story.
If you're building an installation in a public space, people who walk past the camera become data subjects. They didn't consent to being classified by a neural network. Even if the classification results are used purely for art and never stored, the person might not know that. A camera pointing at people triggers assumptions about surveillance.
Best practices: if the art is webcam-based and in a public setting, display a visible notice that a camera is active. Process everything locally (ml5 already does this). Don't store video frames or classification results. And give people a way to walk away from the camera's field of view. For personal projects in your own space, obviously this is less of a concern -- but it's good to build the awareness early, because the models we're about to explore in the next few episodes (body tracking, face mesh, hand gestures) get much more personal.
Where this goes next
This episode gave you the foundation: loading models, classifying images, using webcam input, working with confidence scores, transfer learning, performance optimization, and the ethical considerations that come with putting ML in creative coding. MobileNet is the simplest model in ml5's lineup. It sees the world as 1000 flat categories. No spatial awareness, no body understanding, no facial expression reading.
The next episodes open that up. Pose detection tracks your skeleton in real time -- 17 keypoints that you can map to particle emitters, line drawings, physics simulations. Hand tracking gives you 21 points per hand, enough to detect gestures and finger positions. Face mesh provides 468 facial landmarks that respond to every eyebrow raise and lip movement. Each model gives your creative code a richer understanding of the human body as input. The body becomes a controller, an instrument, a brush. That's a different kind of creative coding from anything we've done so far.
't Komt erop neer...
- ml5.js wraps pre-trained machine learning models in a browser-friendly JavaScript API built on TensorFlow.js. No Python, no GPU servers, no training required. Load a model, feed it data, get structured results. Designed for creative coders and artists, not ML researchers
- MobileNet image classification takes an image or video frame and returns labels with confidence scores from 1000 ImageNet categories. First load downloads ~5MB of model weights (cached after that). Runs at 15-30 fps on modern hardware
- Real-time webcam classification uses a recursive callback loop -- each classification triggers the next. p5.js createCapture(VIDEO) provides the video feed. The continuous loop gives you a live stream of labels and confidence values that update as the camera sees different things
- Confidence (0 to 1) is the most creative output from a classifier. High confidence = strong, clear visuals. Low confidence = glitchy, uncertain, transitional aesthetics. The model's internal uncertainty becomes a visual quality that purely rule-based generative art can't produce
- Transfer learning via ml5.featureExtractor lets you retrain MobileNet's last layer on your own categories with as few as 10-20 examples per class. Train the model to recognize your specific hand gestures, objects, or scenes. The categories you define become the vocabulary of your interactive art
- The magic mirror pattern: classify the webcam feed continuously, map each recognized category to a different generative visual configuration (color palette, particle count, movement speed, shape type). Hold up different objects and the art transforms in response. The AI's perception drives the aesthetic
- ImageNet bias is real and worth knowing. The dataset overrepresents Western household objects and underrepresents objects from other cultures. For creative coding, this bias can become material -- visualize what the model recognizes vs what confuses it and you've made a portrait of the dataset's worldview
- Performance optimization: don't classify every frame. Run your animation at 60fps but classify every 3rd or 4th frame. Interpolate between results. The render loop and the inference loop should be decoupled so animation stays smooth regardless of classification speed
- Webcam-based ML art raises privacy concerns. ml5 processes everything locally (no data leaves the browser), but public installations should display notices about active cameras. Don't store video frames or classification results. Give people the ability to walk away
- This is the start of the ML + creative coding arc. MobileNet gives you flat labels. The models coming next -- pose detection, hand tracking, face mesh -- give you spatial, bodily, expressive data. The human body becomes a real-time controller for generative art
This starts a new arc -- ML as creative input. The data isn't in a file anymore. It's in the room with you, interpreted in real time by models that see, track, and classify. Every technique from the last ninety-one episodes applies. The difference is where the data comes from.
Sallukes! Thanks for reading.