Learn Creative Coding (#96) - Image Classification for Art

about 12 hours ago

Four episodes ago we loaded our first ML model -- MobileNet image classification through ml5.js. Feed a video frame, get back a label and a confidence score. "Tabby cat, 87%." Simple, useful, and we moved on quickly to body tracking, hand tracking, and face mesh because those models gave us spatial data -- keypoints we could map to particles, trails, colors, meshes. Classification seemed flat by comparison. One label per frame. Where's the creative depth in that?

Turns out: a lot deeper than it looks. We barely scratched MobileNet in episode 92. We used it as a label printer -- show an object, read the label, move on. But a classification model doesn't just output a label. It outputs 1000 confidence scores simultaneously, one for every category it knows. The label is just the highest-scoring one. The full distribution of scores is where the interesting stuff lives. A photo of a coffee mug might score 72% on "cup", 8% on "vase", 5% on "pitcher" -- the model is simultaneously seeing all three and being most convinced by one. That distribution is data. Rich, continuous, creative data.

This episode takes image classification seriously as a creative tool. Not just "what is it" but "what does the model see across the whole image", "where does it see it", "what confuses it", and "how can we use its internal representations for things it was never designed to do." We're going to turn a webcam classifier into a creative instrument.

The confidence distribution: beyond the top label

When we classified webcam frames in episode 92, we only used results[0] -- the top prediction. But ml5 returns multiple predictions, and MobileNet internally computes scores for all 1000 categories. The full distribution tells you more than any single label.

let video;
let classifier;
let allResults = [];

function preload() {
  classifier = ml5.imageClassifier('MobileNet');
}

function setup() {
  createCanvas(800, 500);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();
  classifyLoop();
}

function classifyLoop() {
  classifier.classify(video, 10, function(results) {
    allResults = results;
    classifyLoop();
  });
}

function draw() {
  background(10, 12, 18);

  if (allResults.length === 0) return;

  // draw top 10 predictions as horizontal bars
  for (let i = 0; i < allResults.length; i++) {
    const r = allResults[i];
    const y = 30 + i * 45;
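    // scale to 600px so the percentage text still fits on the 800px canvas
    const barW = r.confidence * 600;

    // color based on rank: switch to HSB for the bar fill only,
    // then back to RGB so the other fill() calls keep their meaning
    colorMode(HSB, 360, 100, 100, 100);
    const hue = map(i, 0, 9, 160, 0);
    fill(hue, 60, 55, 50);
    colorMode(RGB, 255);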
    noStroke();
    rect(130, y, barW, 30, 3);

    // label
    fill(160, 170, 190);
    textSize(9);
    textFont('monospace');
    textAlign(RIGHT);
    text(r.label.split(',')[0].substring(0, 16), 125, y + 20);

    // confidence value
    textAlign(LEFT);
    fill(120, 130, 150);
    text((r.confidence * 100).toFixed(1) + '%', 135 + barW, y + 20);
  }
}

Point the camera at a coffee mug. The top bar says "cup" at maybe 65%. But below it you see "pitcher" at 8%, "vase" at 6%, "water_jug" at 4%. The model is seeing the cylindrical shape and open top -- features shared by cups, vases, pitchers, and jugs. The gap between the top score and the second score tells you how "certain" the model is. A 90% top score with 2% second means the model is very sure. A 30% top score with 25% second means it's basically guessing between two options. That gap is a number you can map to visuals.

function getConfidenceGap() {
  if (allResults.length < 2) return 1;
  return allResults[0].confidence - allResults[1].confidence;
}

function draw() {
  const gap = getConfidenceGap();

  // certain = calm, clean visuals
  // uncertain = jittery, noisy visuals
  if (gap > 0.4) {
    // model is sure -- draw clean circles
    background(10, 12, 18, 30);
    fill(120, 200, 180, 40);
    noStroke();
    for (let i = 0; i < 5; i++) {
      circle(
        width / 2 + random(-10, 10),
        height / 2 + random(-10, 10),
        80 + random(-5, 5)
      );
    }
  } else {
    // model is confused -- draw noisy scatter
    background(10, 12, 18, 15);
    for (let i = 0; i < 30; i++) {
      fill(random(150, 255), random(80, 180), random(100, 200), 25);
      noStroke();
      circle(random(width), random(height), random(3, 20));
    }
  }
}

Show the camera something it knows well -- a keyboard, a monitor, a phone -- and the visual stays calm and ordered. Cover the lens partially, hold up something weird, or show it abstract art and the visual explodes into scattered noise. The model's internal certainty drives the aesthetic directly. It's a kind of neural barometer.
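The gap only looks at the top two scores. If you want a certainty number that uses the whole list, one option is the entropy of the distribution. The sketch below is my own addition, not something ml5 provides: `normalizedEntropy` is a hypothetical helper that takes the same `{ label, confidence }` objects we stored in allResults and returns 0 when one label holds all the confidence and 1 when every label is tied.

```javascript
// Normalized Shannon entropy of a list of { label, confidence } results.
// 0 = one label holds all the confidence, 1 = all labels are tied.
// (Hypothetical helper, not part of ml5.)
function normalizedEntropy(results) {
  if (results.length < 2) return 0;

  // a top-k list does not sum to exactly 1, so renormalize first
  const total = results.reduce((sum, r) => sum + r.confidence, 0);
  if (total === 0) return 1; // no information at all: treat as fully uncertain

  let h = 0;
  for (const r of results) {
    const p = r.confidence / total;
    if (p > 0) h -= p * Math.log(p);
  }

  // divide by the maximum possible entropy for this many options
  return h / Math.log(results.length);
}
```

In the draw loop you could swap it in for the gap, for example `map(normalizedEntropy(allResults), 0, 1, 5, 30)` as the number of noisy circles per frame. It reacts more gradually than the two-way gap because a three-way or four-way tie also counts as confusion.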

Scanning the image: confidence landscapes

Here's where it gets genuinely interesting. Instead of classifying the whole frame once, what if you classify different regions of the image? Slide a virtual window across the frame, classify each patch, and map the confidence for a specific category to color or brightness at that position. You get a heatmap of where the model "sees" a particular thing.

let heatmapCanvas;
let gridSize = 40;
let heatData = [];
let targetCategory = 'cup';

function setup() {
  createCanvas(640, 480);
  heatmapCanvas = createGraphics(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();

  classifier = ml5.imageClassifier('MobileNet');

  // initialize grid
  for (let y = 0; y < height; y += gridSize) {
    for (let x = 0; x < width; x += gridSize) {
      heatData.push({ x: x, y: y, conf: 0 });
    }
  }

  scanNext(0);
}

function scanNext(idx) {
  if (idx >= heatData.length) {
    // restart scan
    scanNext(0);
    return;
  }

  const cell = heatData[idx];

  // crop a region from the video
  const region = video.get(cell.x, cell.y, gridSize, gridSize);

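  // ask for the top 10 instead of the default few results, so the
  // target category has a better chance of appearing in the list
  classifier.classify(region, 10, function(results) {
    // find confidence for our target category (stays 0 if it is not listed)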
    let conf = 0;
    for (const r of results) {
      if (r.label.toLowerCase().indexOf(targetCategory) !== -1) {
        conf = r.confidence;
        break;
      }
    }
    cell.conf = lerp(cell.conf, conf, 0.4);

    scanNext(idx + 1);
  });
}

function draw() {
  image(video, 0, 0);

  // overlay heatmap
  for (const cell of heatData) {
    const alpha = cell.conf * 200;
    fill(255, 80, 60, alpha);
    noStroke();
    rect(cell.x, cell.y, gridSize, gridSize);
  }

  // label
  fill(220);
  noStroke();
  textSize(11);
  textFont('monospace');
  text('scanning for: ' + targetCategory, 10, height - 10);
}

This is slow -- classifying 192 patches takes several seconds for a full scan. But the result is fascinating. Set targetCategory to "cup" and hold a mug in frame. The grid cells covering the mug glow red. Move the mug and the glow follows. The model's confidence is spatially distributed -- it's hottest right at the center of the mug and fades at the edges where the patch only partially contains it.

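Change the target to "book" and hold up a book. Different region lights up. The target does have to be a word from MobileNet's vocabulary: as far as I can tell ImageNet has no plain "hand" category, so a target like "hand" only matches labels that happen to contain the word ("hand blower", "handkerchief"). You're building an attention map -- a visualization of where the model's "gaze" falls for any given category. The scan is too slow for real-time interaction, but as a generative process it produces beautiful, ghostly overlays that reveal how the model parses space.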
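The grid above uses 40-pixel cells that sit side by side. A real sliding window usually overlaps, and bigger patches give the model more context per classification. Here is a minimal sketch of generating such windows. `makeWindows` and its `stride` parameter are names I made up for illustration; the returned objects have the same shape as the heatData cells, plus a `size` field.

```javascript
// Generate square scan windows that stay fully inside the frame.
// stride < size gives overlapping windows, stride === size gives a plain grid.
// (Hypothetical helper; the cells match the heatData entries plus a size field.)
function makeWindows(frameW, frameH, size, stride) {
  const windows = [];
  for (let y = 0; y + size <= frameH; y += stride) {
    for (let x = 0; x + size <= frameW; x += stride) {
      windows.push({ x: x, y: y, size: size, conf: 0 });
    }
  }
  return windows;
}

// the grid from the sketch above: 16 columns x 12 rows
const plainGrid = makeWindows(640, 480, 40, 40);

// 80px patches that overlap by half: fewer, larger windows
const overlapping = makeWindows(640, 480, 80, 40);
```

With 640x480 video the plain grid has 192 windows and the overlapping one 165, so the overlapping scan is not slower even though each patch is four times the area. When drawing, overlapping windows stack their alpha, which softens the blocky look of the heatmap.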

The webcam oracle: labels as live text art

The simplest creative use of continuous classification is treating the labels themselves as visual material. The model generates a stream of interpretations -- one per frame, each a word or phrase. That stream of words is itself interesting. Point the camera at your room and the model narrates what it sees in its own limited vocabulary: "monitor", "desk", "keyboard", "mouse", "coffee mug", "water bottle", "remote control." Pan slowly and the narrative changes. The gap between what's actually there and what the model calls it is often funny or poetic.

let labelHistory = [];
let maxLabels = 60;

function classifyLoop() {
  classifier.classify(video, function(results) {
    const label = results[0].label.split(',')[0].trim();
    const conf = results[0].confidence;

    labelHistory.push({ label: label, conf: conf, time: millis() });

    if (labelHistory.length > maxLabels) {
      labelHistory.shift();
    }

    classifyLoop();
  });
}

function draw() {
  background(8, 10, 16);

  for (let i = 0; i < labelHistory.length; i++) {
    const entry = labelHistory[i];
    const age = (millis() - entry.time) / 1000;
    const alpha = map(age, 0, 10, 220, 0, true);
    const size = map(entry.conf, 0.1, 0.9, 8, 22, true);

    // position in a flowing column
    const x = 40 + (i % 4) * 180;
    const y = 20 + Math.floor(i / 4) * 32;

    fill(180, 190, 210, alpha);
    noStroke();
    textSize(size);
    textFont('monospace');
    textAlign(LEFT);
    text(entry.label, x, y);
  }
}

A flowing wall of words. Confident labels appear large, uncertain ones small. Brightness tracks age rather than confidence: fresh labels are bright and old ones fade out. New labels are appended at the end of the grid, and once the history is full the oldest one drops off the front so everything shifts up a slot. Point the camera at different things and watch the vocabulary shift. The model's interpretation of your physical space becomes a concrete poem -- a text artwork generated by a neural network that's trying to name everything it sees. The mistakes are the best part. It sees a lamp and calls it a "table lamp" one frame, then "spotlight" the next, then "lampshade." Each label is technically defensible but collectively they reveal the model groping for the right word.
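Because the classifier runs many times per second, the raw history is mostly the same word repeated. If you want the poem to read as a sequence of distinct words, one approach is to collapse consecutive repeats into runs and use the run length as another visual parameter. This is a small sketch with a hypothetical helper name, `collapseRuns`, working on plain label strings.

```javascript
// Collapse consecutive duplicate labels into { label, count } runs.
// (Hypothetical helper for post-processing the label history.)
function collapseRuns(labels) {
  const runs = [];
  for (const label of labels) {
    const last = runs[runs.length - 1];
    if (last && last.label === label) {
      last.count++;          // same word again: extend the current run
    } else {
      runs.push({ label: label, count: 1 });  // a new word starts a new run
    }
  }
  return runs;
}

const stream = ['table lamp', 'table lamp', 'spotlight',
                'lampshade', 'lampshade', 'lampshade'];
const poem = collapseRuns(stream);
// poem has three entries: table lamp (2), spotlight (1), lampshade (3)
```

In the sketch you would feed it `labelHistory.map(e => e.label)` and, for example, map `count` to letter spacing or line height so that words the model held on to for longer take up more room.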

Feature extraction: seeing with the model's eyes

Classification is the model's final opinion. But inside the neural network, before it arrives at that opinion, it builds a rich internal representation of the image. These intermediate representations -- called features or activations -- capture visual properties at different scales: edges, textures, shapes, objects. The early layers see low-level stuff (edges, gradients). The later layers see high-level stuff (wheels, faces, fur textures). You can extract these features and use them directly, bypassing the classification entirely.

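ml5's featureExtractor gives you access to the layer just before classification. (It belongs to the older ml5 0.x API; as far as I know it is not part of the 1.x releases that provide bodyPose and handPose, so the sketches in this section and in the training section need the older library.) The output is a vector of numbers -- typically 1024 values, though the exact length depends on the MobileNet variant -- that represent the model's "understanding" of the image without committing to a category. Two images with similar feature vectors look visually similar to the model, even if they'd get different labels.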

let featureExtractor;
let features = [];
let prevFeatures = [];

function setup() {
  createCanvas(800, 400);
  video = createCapture(VIDEO);
  video.size(320, 240);
  video.hide();

  featureExtractor = ml5.featureExtractor('MobileNet', function() {
    console.log('feature extractor ready');
    extractLoop();
  });
}

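function extractLoop() {
  // assumes ml5 0.x, where infer() returns the activations of the
  // layer before classification as a TensorFlow.js tensor
  const activations = featureExtractor.infer(video);
  features = Array.from(activations.dataSync());  // copy to a plain array
  activations.dispose();                          // free the tensor's memory
  setTimeout(extractLoop, 100);                   // roughly 10 extractions per second
}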

function draw() {
  background(10, 12, 18);

  // visualize features as a bar chart
  // each feature is a floating point activation value
  if (features.length === 0) return;

  const barW = width / features.length;

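  // the hue mapping needs HSB; switch for the bars, then restore RGB
  colorMode(HSB, 360, 100, 100, 100);
  for (let i = 0; i < features.length; i++) {
    const v = features[i];
    const h = v * 300;
    const hue = map(v, 0, 1, 220, 0, true);

    fill(hue, 50, 60, 60);
    noStroke();
    rect(i * barW, height - h, barW, h);
  }
  colorMode(RGB, 255);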
}

The feature vector is a fingerprint of the image. Show the camera your face and you get one pattern. Show it a coffee mug and the pattern changes completely. Show it a slightly different angle of the same mug and the pattern stays mostly the same -- because features capture WHAT things look like, not their exact pixel arrangement. That stability is what makes features useful for similarity comparisons.
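One practical snag with the bar chart: a vector of around a thousand values on an 800-pixel canvas leaves less than a pixel per bar. A simple workaround is to average neighbouring values into a smaller number of bins before drawing. This is a sketch of that idea, with `binFeatures` as a made-up helper name. Averaging blurs individual activations, so treat the result as a visual summary, not as the feature vector itself.

```javascript
// Average a long feature vector down to binCount values for display.
// (Hypothetical helper; use the full vector for similarity, not this one.)
function binFeatures(features, binCount) {
  const sums = new Array(binCount).fill(0);
  const counts = new Array(binCount).fill(0);

  for (let i = 0; i < features.length; i++) {
    // which bin does index i fall into?
    const b = Math.min(binCount - 1,
                       Math.floor(i * binCount / features.length));
    sums[b] += features[i];
    counts[b]++;
  }

  // empty bins (more bins than features) stay at 0
  return sums.map((sum, b) => (counts[b] > 0 ? sum / counts[b] : 0));
}

const summary = binFeatures([1, 3, 5, 7], 2);
// summary is [2, 6]
```

In draw() you could loop over `binFeatures(features, 100)` instead of `features`, which gives 8-pixel bars on the 800-pixel canvas.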

Similarity: finding visual neighbors

If you extract features from multiple images, you can compare them to find which images "look alike" according to the model. The mathematical tool is cosine similarity -- it measures how closely two vectors point in the same direction, regardless of their magnitude.

function cosineSimilarity(a, b) {
  let dot = 0;
  let magA = 0;
  let magB = 0;

  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }

  magA = Math.sqrt(magA);
  magB = Math.sqrt(magB);

  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}

// compare current frame to a stored reference
let referenceFeatures = null;
let similarity = 0;

function keyPressed() {
  if (key === 's') {
    // store current frame as reference
    referenceFeatures = features.slice();
    console.log('reference stored');
  }
}

function draw() {
  background(10, 12, 18);
  image(video, 0, 0, 320, 240);

  fill(170, 180, 200);
  noStroke();
  textSize(11);
  textFont('monospace');
  // show the hint all the time, not only after a reference exists
  text('press S to store reference', 350, 160);

  if (referenceFeatures && features.length > 0) {
    similarity = cosineSimilarity(referenceFeatures, features);

    // draw similarity meter (hue needs HSB, so switch just for this fill)
    const barW = map(similarity, 0, 1, 0, 400, true);
    const hue = map(similarity, 0, 1, 0, 120, true);

    colorMode(HSB, 360, 100, 100, 100);
    fill(hue, 55, 60, 70);
    colorMode(RGB, 255);
    rect(350, 100, barW, 30, 3);

    fill(170, 180, 200);
    text('similarity: ' + similarity.toFixed(3), 350, 90);
  }
}

Press S to store the current view as a reference. Then move things around. The similarity bar responds in real time -- show the same scene and it stays high. Move the camera or swap objects and it drops. Point back at the original scene and it climbs again. The model compares the structural content of the images, not the raw pixels. Slight lighting changes barely affect the similarity. Moving an object out of frame drops it significantly.

This is the basis for all sorts of creative applications. You could sort a photo collection by visual similarity -- images of skies cluster near each other, indoor scenes form another cluster. You could build a "visual memory" that remembers snapshots and triggers events when the current view matches a stored one. The feature space is a landscape, and navigation through it (by moving the camera, changing the scene) is a creative act.
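The "visual memory" idea can be sketched in a few lines: keep a list of named snapshots and, for each new feature vector, return the closest one if it clears a threshold. Everything here is illustrative. `findBestMatch`, the `memory` array layout and the 0.8 threshold are my assumptions, and the three-value vectors stand in for real feature vectors. The similarity function is repeated so the block runs on its own.

```javascript
// same cosine similarity as above, repeated so this block is self-contained
function cosineSimilarity(a, b) {
  let dot = 0, magA = 0, magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  magA = Math.sqrt(magA);
  magB = Math.sqrt(magB);
  if (magA === 0 || magB === 0) return 0;
  return dot / (magA * magB);
}

// Return the stored snapshot most similar to `features`,
// or null if nothing reaches the threshold. (Hypothetical helper.)
function findBestMatch(memory, features, threshold) {
  let best = null;
  for (const entry of memory) {
    const score = cosineSimilarity(entry.features, features);
    if (score >= threshold && (best === null || score > best.score)) {
      best = { name: entry.name, score: score };
    }
  }
  return best;
}

// toy memory with made-up 3-value "features"
const memory = [
  { name: 'desk',   features: [1, 0, 0] },
  { name: 'window', features: [0, 1, 0] }
];

const hit = findBestMatch(memory, [0.9, 0.1, 0], 0.8);   // matches 'desk'
const miss = findBestMatch(memory, [0, 0, 1], 0.8);      // null, nothing is close
```

In the sketch, pressing a key would push `{ name: ..., features: features.slice() }` into the memory, and draw() would call `findBestMatch(memory, features, 0.8)` each frame to decide which event to trigger. The threshold is something you tune by eye; with real feature vectors most scenes share a lot of structure, so expect to need a fairly high value.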

Training your own categories

We touched on transfer learning briefly in episode 92. Let me expand on that here because it connects directly to the custom classifier idea. ml5's featureExtractor lets you train a new classifier on top of MobileNet's features with just a handful of examples.

let featureExtractor;
let customClassifier;
let currentLabel = 'gathering examples...';

function setup() {
  createCanvas(640, 480);
  video = createCapture(VIDEO);
  video.size(640, 480);
  video.hide();

  featureExtractor = ml5.featureExtractor('MobileNet', function() {
    customClassifier = featureExtractor.classification(video);
    console.log('ready to collect examples');
  });
}

// html buttons call these functions
function addSample(label) {
  customClassifier.addImage(label, function() {
    console.log('added: ' + label);
  });
}

function trainAndClassify() {
  customClassifier.train(function(loss) {
    if (loss === null) {
      console.log('training done');
      classifyCustom();
    }
  });
}

function classifyCustom() {
  customClassifier.classify(function(err, results) {
    if (!err && results.length > 0) {
      currentLabel = results[0].label + ' (' +
        (results[0].confidence * 100).toFixed(0) + '%)';
    }
    classifyCustom();
  });
}

function draw() {
  image(video, 0, 0);

  fill(0, 0, 0, 160);
  noStroke();
  rect(10, height - 40, 300, 30, 4);

  fill(220);
  textSize(12);
  textFont('monospace');
  text(currentLabel, 20, height - 20);
}

The workflow: define your categories (say, "cluttered desk", "clean desk", "nobody here"). Collect 15-20 examples of each by clicking a button while pointing the camera at the right scene. Hit train. Seconds later you have a custom classifier that responds to YOUR specific environment. The creative applications are wide open. Train it on hand gestures unique to your art practice. Train it on different lighting moods in your studio. Train it to distinguish between "audience engaged" and "audience bored" for an installation. The categories you define are the vocabulary of your interactive piece.

The model doesn't need much data because it's not learning to see from scratch. MobileNet already knows what edges, textures, shapes, and objects look like. Transfer learning just teaches it your specific groupings on top of that existing knowledge. Twenty images per category is usually enough for a clear distinction.
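To make "teaching groupings on top of existing features" a bit more concrete, here is a deliberately simple stand-in. It is not what ml5 does when you call train(); ml5 fits its own small network on the features. The sketch swaps in a nearest-centroid rule instead: average the feature vectors per label, then label a new vector by whichever average it is closest to. `trainCentroids`, `classifyByCentroid` and the two-value vectors are all made up for illustration.

```javascript
// Average the feature vectors of each label into one centroid per label.
// examples: [{ label: string, features: number[] }]
function trainCentroids(examples) {
  const sums = {};
  const counts = {};
  for (const ex of examples) {
    if (!sums[ex.label]) {
      sums[ex.label] = new Array(ex.features.length).fill(0);
      counts[ex.label] = 0;
    }
    ex.features.forEach((v, i) => { sums[ex.label][i] += v; });
    counts[ex.label]++;
  }

  const centroids = {};
  for (const label of Object.keys(sums)) {
    centroids[label] = sums[label].map(v => v / counts[label]);
  }
  return centroids;
}

// Pick the label whose centroid has the smallest squared distance.
function classifyByCentroid(centroids, features) {
  let bestLabel = null;
  let bestDist = Infinity;
  for (const label of Object.keys(centroids)) {
    let d = 0;
    centroids[label].forEach((c, i) => { d += (c - features[i]) ** 2; });
    if (d < bestDist) {
      bestDist = d;
      bestLabel = label;
    }
  }
  return bestLabel;
}

// toy data: two made-up feature values per example
const deskCentroids = trainCentroids([
  { label: 'clean desk',     features: [0.1, 0.9] },
  { label: 'clean desk',     features: [0.2, 0.8] },
  { label: 'cluttered desk', features: [0.9, 0.1] },
  { label: 'cluttered desk', features: [0.8, 0.3] }
]);
const guess = classifyByCentroid(deskCentroids, [0.15, 0.7]);  // 'clean desk'
```

The point of the toy version is the division of labour: the hard part (turning pixels into useful numbers) is already done by MobileNet, so the part you train can be tiny, and a handful of examples per category is enough to place the groups.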

Multi-model layering

One thing we haven't tried yet: running multiple models simultaneously. Classification gives you what's in the scene. Pose detection gives you where the body is. Hand tracking gives you finger positions. Face mesh gives you expressions. Each model reads a different aspect of the same video feed. Combining them creates a multi-layered reading of the scene.

let classifier, bodyPose, handPose;
let classLabel = '';
let classConf = 0;
let poses = [];
let hands = [];

function preload() {
  classifier = ml5.imageClassifier('MobileNet');
  bodyPose = ml5.bodyPose('MoveNet', { flipped: true });
  handPose = ml5.handPose({ flipped: true });
}

function setup() {
  createCanvas(800, 600);
  colorMode(HSB, 360, 100, 100, 100);
  video = createCapture(VIDEO, { flipped: true });
  video.size(320, 240);
  video.hide();

  // start all detection loops
  classifyLoop();
  bodyPose.detectStart(video, function(r) { poses = r; });
  handPose.detectStart(video, function(r) { hands = r; });
}

function classifyLoop() {
  classifier.classify(video, function(results) {
    classLabel = results[0].label.split(',')[0];
    classConf = results[0].confidence;
    classifyLoop();
  });
}

function draw() {
  background(0, 0, 5, 20);

  // classification drives background hue
  const bgHue = (classLabel.length * 27) % 360;
  fill(bgHue, 15, 12, 10);
  rect(0, 0, width, height);

  // body skeleton drives structural lines
  if (poses.length > 0) {
    const pose = poses[0];
    stroke(bgHue, 30, 50, 30);
    strokeWeight(2);

    for (const kp of pose.keypoints) {
      if (kp.confidence > 0.3) {
        const px = map(kp.x, 0, 320, 0, width);
        const py = map(kp.y, 0, 240, 0, height);
        point(px, py);
      }
    }
  }

  // hand tracking drives particle emission
  if (hands.length > 0) {
    for (const hand of hands) {
      for (let i = 0; i < 21; i += 4) {
        const kp = hand.keypoints[i];
        const px = map(kp.x, 0, 320, 0, width);
        const py = map(kp.y, 0, 240, 0, height);

        noStroke();
        fill(bgHue + 60, 50, 65, 30);
        circle(px, py, random(5, 15));
      }
    }
  }

  // classification confidence drives overall opacity
  const textAlpha = map(classConf, 0.1, 0.9, 15, 60, true);
  fill(0, 0, 80, textAlpha);
  noStroke();
  textSize(10);
  textFont('monospace');
  textAlign(LEFT);
  text(classLabel + ' ' + (classConf * 100).toFixed(0) + '%', 10, height - 10);
}

Three models running concurrently. Classification picks the color palette: the hue is derived from the length of the label, so most objects land on a different hue, though two labels of equal length share one. Pose detection draws the body keypoints as structural points. Hand tracking emits particles from the wrist and fingertips. The scene is read three ways simultaneously, and each reading drives a different visual layer. Hold up a book and the palette shifts while your body keypoints draw in the background and particles spray from your fingertips. Swap the book for a phone and the palette shifts again but your body and hand contributions stay consistent. It's a rich, layered interpretation of one camera feed through three neural lenses.

Performance takes a hit with three models -- expect maybe 5-10 fps total. But for a generative art piece that accumulates over time, that's fine. The slow update rate actually gives the visuals a dreamlike quality where changes drift in rather than snapping.
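One detail in the sketch above: the background hue comes from `classLabel.length`, so "cup" and "mug" (both three letters) both end up at hue 81. If you want labels to spread out more, a small string hash does the job. `labelToHue` is a hypothetical helper of mine, and the multiply-by-31 hash is just one common choice among many.

```javascript
// Map a label to a hue in 0..359 using a simple string hash.
// Same label always gives the same hue. (Hypothetical helper.)
function labelToHue(label) {
  let h = 0;
  for (let i = 0; i < label.length; i++) {
    // classic "times 31" hash, kept in unsigned 32-bit range
    h = (h * 31 + label.charCodeAt(i)) >>> 0;
  }
  return h % 360;
}

const cupHue = labelToHue('cup');  // 238
const mugHue = labelToHue('mug');  // 119
```

In draw() this would replace the length-based line with `const bgHue = labelToHue(classLabel);`. The hues are arbitrary (a book is not guaranteed a warm tone), but they are stable per label, which is what matters for recognizing an object by its color over time.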

The politics of the 1000 categories

I mentioned ImageNet bias back in episode 92. It's worth digging a bit deeper here because classification categories ARE the model's worldview, and using them creatively means working with their limitations.

MobileNet recognizes 1000 categories from ImageNet. Among them: 120 dog breeds. 37 cat-like animals. Hundreds of Western consumer products. Very few objects from non-Western cultures. Very few foods that aren't European or American. The model can distinguish a "Pembroke Welsh corgi" from a "Cardigan Welsh corgi" but can't tell you what a "jollof rice" is. Its vocabulary reflects who built the dataset and what they found interesting.

For creative coding, this bias is material. Build a piece where you deliberately show the model objects from different cultural contexts. Track which ones it recognizes confidently and which ones confuse it. The confidence gap between "laptop" (95%) and "djembe drum" (maybe 12% as "drum" if you're lucky) is itself a commentary on representation in training data. The model's blind spots are as creatively interesting as its confident recognitions.

You don't have to agree with the model's worldview to use it creatively. You just have to be aware of it.

The creative exercise: object-controlled visual synthesizer

Allez, let's tie it all together. A visual synth controlled entirely by what the camera sees. Each recognized category maps to a unique generative visual -- different particle behavior, color palette, and motion pattern. The confidence controls intensity. The confidence gap controls visual stability. Switching between objects creates smooth transitions between visual worlds.

let video, classifier;
let currentLabel = '';
let currentConf = 0;
let prevLabel = '';
let transition = 0;
let particles = [];

const visualModes = {
  monitor:  { hue: 210, speed: 0.5, gravity: -0.02, shape: 'rect', count: 80 },
  cup:      { hue: 30,  speed: 1.2, gravity: 0.03,  shape: 'circle', count: 120 },
  phone:    { hue: 270, speed: 2.5, gravity: 0,      shape: 'circle', count: 200 },
  keyboard: { hue: 160, speed: 0.3, gravity: 0.01,   shape: 'rect', count: 60 },
  book:     { hue: 45,  speed: 0.8, gravity: -0.01,  shape: 'rect', count: 90 },
  mouse:    { hue: 320, speed: 1.5, gravity: 0,      shape: 'circle', count: 150 },
  bottle:   { hue: 190, speed: 1.0, gravity: 0.05,   shape: 'circle', count: 100 },
  hand:     { hue: 5,   speed: 3.0, gravity: -0.03,  shape: 'circle', count: 250 },
  _default: { hue: 200, speed: 0.6, gravity: 0,      shape: 'circle', count: 50 }
};

function preload() {
  classifier = ml5.imageClassifier('MobileNet');
}

function setup() {
  createCanvas(800, 600);
  colorMode(HSB, 360, 100, 100, 100);
  video = createCapture(VIDEO);
  video.size(160, 120);
  video.hide();
  classifyLoop();
}

function classifyLoop() {
  classifier.classify(video, function(results) {
    const newLabel = results[0].label.split(',')[0].trim().toLowerCase();
    currentConf = results[0].confidence;

    if (newLabel !== currentLabel) {
      prevLabel = currentLabel;
      currentLabel = newLabel;
      transition = 0;
    }

    classifyLoop();
  });
}

function getMode(label) {
  for (const key of Object.keys(visualModes)) {
    if (key === '_default') continue;
    if (label.indexOf(key) !== -1) return visualModes[key];
  }
  return visualModes._default;
}

function draw() {
  background(0, 0, 4, 18);

  transition = min(transition + 0.02, 1);
  const mode = getMode(currentLabel);

  // emit particles
  const emitCount = Math.floor(mode.count * currentConf * 0.15);
  for (let i = 0; i < emitCount; i++) {
    particles.push({
      x: random(width),
      y: height / 2 + random(-100, 100),
      vx: random(-1, 1) * mode.speed,
      vy: random(-1, 1) * mode.speed,
      hue: mode.hue + random(-25, 25),
      size: random(3, 10),
      life: 1.0,
      shape: mode.shape,
      // remember the gravity of the mode that spawned this particle, so
      // old particles keep their own physics after the label changes
      gravity: mode.gravity
    });
  }

  // update and draw particles
  for (let i = particles.length - 1; i >= 0; i--) {
    const p = particles[i];
    p.vy += p.gravity;
    p.x += p.vx;
    p.y += p.vy;
    p.life -= 0.005;

    if (p.life <= 0 || p.x < -20 || p.x > width + 20 ||
        p.y < -20 || p.y > height + 20) {
      particles.splice(i, 1);
      continue;
    }

    noStroke();
    fill(p.hue % 360, 55, 65, p.life * 40);

    if (p.shape === 'rect') {
      rectMode(CENTER);
      rect(p.x, p.y, p.size * p.life, p.size * p.life);
    } else {
      circle(p.x, p.y, p.size * p.life);
    }
  }

  // cap particle count
  while (particles.length > 800) {
    particles.shift();
  }

  // info overlay
  fill(0, 0, 70, 40);
  noStroke();
  textSize(10);
  textFont('monospace');
  textAlign(LEFT);
  text(currentLabel + ' ' + (currentConf * 100).toFixed(0) + '%', 10, height - 10);
  text('particles: ' + particles.length, 10, height - 24);
}

Show a coffee mug and warm amber circles drift downward like falling embers. Switch to your phone and purple particles explode outward with high energy. Put down everything and show just your hand -- red particles fountain upward. Show a book and warm rectangles float slowly upward like pages caught in a draft. Each object creates a distinct visual world, and switching between them creates smooth crossfades as the old particles die out and new ones take over. The confidence modulates intensity -- hold the object clearly and the visual is dense, partially occlude it and the emission thins out.

The piece is a performance instrument. What you show the camera IS the music. Each object is a note. The transitions between objects are the melody. The confidence is the volume. You compose the visual by arranging objects in front of the lens. Makes sense, right? :-)
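The synth code tracks a `transition` value that climbs from 0 to 1 after every label change, but nothing reads it yet. One way to put it to work is to blend the emission parameters of the previous and current mode, so new particles ease from one world into the next. This is a sketch under my own assumptions: `lerpHue` and `blendModes` are hypothetical helpers, the hue is blended along the shorter way around the color wheel, and the shape simply flips at the halfway point.

```javascript
// Interpolate between two hues (degrees) along the shorter arc.
function lerpHue(a, b, t) {
  // signed shortest difference, in the range [-180, 180)
  const diff = ((b - a + 540) % 360) - 180;
  return (a + diff * t + 360) % 360;
}

// Blend two visual modes; t = 0 gives `from`, t = 1 gives `to`.
// (Hypothetical helper; mode objects look like the visualModes entries.)
function blendModes(from, to, t) {
  return {
    hue: lerpHue(from.hue, to.hue, t),
    speed: from.speed + (to.speed - from.speed) * t,
    gravity: from.gravity + (to.gravity - from.gravity) * t,
    count: from.count + (to.count - from.count) * t,
    shape: t < 0.5 ? from.shape : to.shape  // shapes can't blend, so switch halfway
  };
}

const cupMode   = { hue: 30,  speed: 1.2, gravity: 0.03, shape: 'circle', count: 120 };
const phoneMode = { hue: 270, speed: 2.5, gravity: 0,    shape: 'circle', count: 200 };
const halfway = blendModes(cupMode, phoneMode, 0.5);
// halfway.hue is 330: amber passes through red on its way to purple
```

In draw() the mode line would become `const mode = blendModes(getMode(prevLabel), getMode(currentLabel), transition);`. Since transition rises by 0.02 per frame, a switch takes about 50 frames to complete.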

Practical considerations

Classification speed vs render speed. Classification runs at maybe 15-20 fps but your render loop should run at 60fps for smooth particle animation. We already covered this decoupling in episode 92 -- the classification runs asynchronously in its own loop, updating shared variables that the draw loop reads. Don't tie them together or your animation stutters at classification speed.

// the async pattern: classification and rendering are decoupled
let sharedLabel = '';
let sharedConf = 0;

// this loop runs at ML speed (~15fps)
function classifyLoop() {
  classifier.classify(video, function(results) {
    sharedLabel = results[0].label;
    sharedConf = results[0].confidence;
    classifyLoop();  // immediately queue next classification
  });
}

// this loop runs at render speed (60fps)
function draw() {
  // read shared state, never wait for classification
  background(10);
  fill(200);
  text(sharedLabel + ' ' + (sharedConf * 100).toFixed(0) + '%', 20, 30);
  // ... smooth animation code here
}
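Decoupling solves the stutter, but the shared label can still flicker when the model flips between two guesses from one classification to the next. A common remedy is to debounce it: only accept a new label after it has shown up a few times in a row. The helper below is a sketch of that idea. `makeLabelDebouncer` and the choice of three repeats are my assumptions, not something from ml5.

```javascript
// Returns an update(label) function that only switches to a new label
// after it has been seen `required` times in a row. (Hypothetical helper.)
function makeLabelDebouncer(required) {
  let stable = '';      // the label the rest of the sketch should use
  let candidate = '';   // a new label that is trying to take over
  let streak = 0;       // how many times in a row we have seen the candidate

  return function update(label) {
    if (label === stable) {
      // the current label is confirmed again: drop any challenger
      candidate = '';
      streak = 0;
    } else {
      if (label !== candidate) {
        candidate = label;
        streak = 0;
      }
      streak++;
      if (streak >= required) {
        stable = label;
        candidate = '';
        streak = 0;
      }
    }
    return stable;
  };
}

const debounce = makeLabelDebouncer(3);
```

Inside the classification callback you would write `sharedLabel = debounce(results[0].label);`. At roughly 15 classifications per second, three repeats add about a fifth of a second of lag, which is usually a fair price for a label that stops jittering.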

Category matching is fuzzy. MobileNet's labels are specific ("studio couch", "chiffonier", "washbasin") but your visual mode keys are generic ("cup", "book"). The indexOf check handles this by matching substrings, but only substrings: "coffee_mug" does not contain "cup", so it only triggers the cup mode if you also check for "mug". You'll want to tune the keyword list for whatever objects you actually have around. I spent a surprising amount of time discovering what MobileNet calls things in my apartment -- the desk lamp is consistently "table_lamp" which is helpful, but the headphones alternate between "headphone" and "radio" depending on angle.
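One way to keep that tuning manageable is to give every mode a list of keywords instead of relying on the key itself. The sketch below assumes a separate lookup table; `modeKeywords` and `matchModeKey` are hypothetical names, and the keyword lists are examples to replace with whatever the model calls your own objects.

```javascript
// keyword lists per visual mode (example values, tune for your own room)
const modeKeywords = {
  cup:   ['cup', 'mug'],
  lamp:  ['lamp', 'spotlight'],
  phone: ['phone']
};

// Return the first mode key whose keyword list matches the label,
// or '_default' when nothing matches. (Hypothetical helper.)
function matchModeKey(label, keywords) {
  // normalize: lower case, and treat underscores like spaces
  const clean = label.toLowerCase().replace(/_/g, ' ');
  for (const key of Object.keys(keywords)) {
    for (const word of keywords[key]) {
      if (clean.indexOf(word) !== -1) return key;
    }
  }
  return '_default';
}

const mugKey = matchModeKey('coffee_mug', modeKeywords);   // 'cup'
const oddKey = matchModeKey('chiffonier', modeKeywords);   // '_default'
```

getMode() could then return `visualModes[matchModeKey(label, modeKeywords)]`. Substring matching cuts both ways, though: "phone" also matches "headphone" and "microphone", so check what your keywords catch by accident.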

Combine with the previous models. Nothing stops you from running classification alongside pose detection or hand tracking. Classification picks the visual theme, hand tracking controls particle emission points, face mesh modulates parameters through expressions. Layer the readings for richer interaction.

What it comes down to...

MobileNet doesn't just output a single label -- it computes 1000 confidence scores simultaneously. The top label is the most confident category, but the full distribution tells you how the model distributes its "attention" across all known categories. The gap between the top two scores measures certainty: wide gap = confident, narrow gap = confused. Both the distribution and the gap are mappable creative parameters
Confidence landscapes: classify patches of the image (sliding a window across the frame) and map the confidence for a specific category to color at each position. The result is a heatmap showing WHERE the model sees a particular thing. Slow but visually fascinating -- the model's spatial attention becomes visible as a ghostly overlay
The label stream from continuous classification is itself creative material. Treat the model as a narrator that names what it sees. Confident labels appear large, uncertain ones small. Old labels fade, new ones appear. The gap between reality and the model's vocabulary creates accidental poetry
Feature extraction bypasses classification entirely. MobileNet's internal representation (a 1024-value vector) captures visual properties without committing to a category. Two images with similar feature vectors look alike to the model. Extract, compare with cosine similarity, and you can build visual memory, sort photo collections, or detect when the scene matches a stored reference
Transfer learning via ml5.featureExtractor lets you train custom categories with 15-20 examples each. The model already knows how to see -- you just teach it your groupings. Train it on your specific environment, gestures, moods, or objects. The categories you choose define the vocabulary of your interactive piece
Multi-model layering: run classification, pose detection, and hand tracking on the same video feed simultaneously. Each model provides a different reading. Classification picks colors, body pose drives structure, hands emit particles. Three neural lenses on one scene. Performance drops to 5-10 fps but creates a slow, dreamlike aesthetic
ImageNet's 1000 categories reflect Western consumer culture: 120 dog breeds, very few non-Western objects. The model's vocabulary IS its limitation. Deliberately showing it objects from different cultural contexts and visualizing the confidence gap becomes a commentary on representation in training data
The object-controlled visual synthesizer maps each recognized category to a unique generative visual: different particle behavior, color, and physics. Confidence modulates intensity. Object switches create smooth visual transitions. The camera becomes a performance instrument -- each object is a note, each transition is part of the composition
Decoupling classification speed (15-20 fps) from render speed (60fps) is essential. Classification runs in its own async loop, updating shared state. The draw loop reads that state every frame for smooth animation. Don't block rendering on classification callbacks
Keyword matching for category lookup needs tuning per environment. MobileNet labels are specific ("chiffonier", "studio_couch") while your visual mode keys are generic. Use substring matching and test with your actual objects to discover what the model calls things

Five episodes into the ML arc. We started with flat labels (episode 92), built up through body skeletons (93), finger joints (94), and full face meshes (95). This episode circled back to classification and found a lot more depth than we originally gave it credit for. The confidence distribution, spatial scanning, feature extraction, custom training, multi-model layering -- classification isn't just "what is it." It's a rich creative signal. And we still haven't gotten to what happens when one neural network transforms images based on the style of another.

Sallukes! Thanks for reading.

@femdev

stem stemsocial steemstem programming creativecoding
