Learn Creative Coding (#98) - Pix2Pix: Sketch to Image

about 1 month ago

Last episode we turned webcam feeds into paintings. Style transfer took the content from one image and the rendering from another and mashed them together -- your room painted like Monet, your face as a Van Gogh, generative code output repainted as cubist abstraction. The neural network separated what's in the image from how it looks and recombined them. Powerful stuff. But style transfer has a fundamental limitation: it can only change the rendering of an image. The shapes, the objects, the composition -- those stay the same. Your room looks like a painting of your room. It's the same room.

Pix2Pix breaks that boundary. Instead of preserving content and changing style, it learns to translate between entirely different image domains. Give it an edge drawing of a cat and it generates a photograph of a cat. Give it a rough architectural sketch and it produces a building facade. Give it a daytime photo and it makes it nighttime. The input and output aren't the same image rendered differently -- they're genuinely different images that correspond to each other. The neural network learned the mapping between two visual worlds from examples, and now it applies that mapping to inputs it has never seen before.

And the creative goldmine is this: it gets things wrong. Beautifully, fascinatingly wrong. Draw a vaguely cat-shaped blob and the model hallucinates a nightmare cat with too many eyes and fur that melts into the background. Draw impossible architecture and you get dreamlike buildings that couldn't exist. The gap between what you draw and what the model generates IS the creative material. You're not using Pix2Pix to make accurate images. You're using it as a collaborator that interprets your sketches through its own learned understanding of what things should look like. The interpretation is the art.

How Pix2Pix works: the conditional GAN

Pix2Pix is a conditional generative adversarial network (cGAN). That's a lot of jargon. Let me break it down.

A GAN has two neural networks training against each other. The generator tries to create realistic images. The discriminator tries to tell real images from fake ones. They train together: the generator gets better at fooling the discriminator, and the discriminator gets better at detecting fakes. After enough training, the generator produces images that the discriminator can't distinguish from real ones. The adversarial tension drives quality up.

The "conditional" part means the generator doesn't produce images from random noise (that would be a regular GAN). Instead, it takes an input image and produces an output image. The condition is the input. "Given this edge drawing, produce a photograph." "Given this sketch, produce a building." The generator learns a mapping from input domain to output domain.

The training data is pairs. Hundreds or thousands of aligned image pairs: (edge drawing, photograph), (sketch, facade), (label map, street scene). The model sees both sides of each pair during training. The generator learns to produce outputs that (1) fool the discriminator into thinking they're real photographs and (2) match the paired ground truth closely. The discriminator gets both the input AND the output -- it judges whether the input-output pair looks realistic as a whole. This paired structure is what makes Pix2Pix different from unpaired translation methods. The alignment is tight.

// conceptual: what pix2pix learns from training data
//
// training pairs:
//   sketch_1.png <-> photo_1.png
//   sketch_2.png <-> photo_2.png
//   ...
//   sketch_500.png <-> photo_500.png
//
// generator: sketch -> photo (learns the mapping)
// discriminator: (sketch, photo) -> real or fake?
//
// at inference time:
//   your_sketch.png -> generator -> generated_photo.png
//
// the generator has never seen your_sketch before
// but it learned the general mapping from the training pairs
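
The generator's two goals (fool the discriminator, match the ground truth) can be folded into one number to minimize. Here is a minimal sketch in plain JavaScript, assuming the formulation from the original Pix2Pix paper: an adversarial term plus an L1 term weighted by lambda = 100. The function name and the flat-array inputs are hypothetical and only for illustration. None of this is ml5 API.

```javascript
// Illustrative only: the generator's objective for ONE training pair.
// dProbFake: discriminator's probability (0..1) that (input, generated) is a real pair
// generated, target: flat arrays of pixel values, same length and same range
// lambda: weight on the L1 term (assumption: 100, as in the Pix2Pix paper)
function generatorLoss(dProbFake, generated, target, lambda = 100) {
  const eps = 1e-12; // avoids log(0)

  // adversarial term: small when the discriminator is fooled (dProbFake near 1)
  const adversarial = -Math.log(dProbFake + eps);

  // L1 term: mean absolute difference to the paired ground truth
  let sumAbs = 0;
  for (let i = 0; i < generated.length; i++) {
    sumAbs += Math.abs(generated[i] - target[i]);
  }
  const l1 = sumAbs / generated.length;

  return { adversarial, l1, total: adversarial + lambda * l1 };
}
```

With lambda that large, the L1 term dominates early in training and keeps the output pinned to the paired target. The adversarial term is what adds crisp texture on top.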

The architecture uses a U-Net generator (encoder-decoder with skip connections that preserve spatial detail) and a PatchGAN discriminator (classifies overlapping patches rather than the whole image, which captures texture quality better). We don't need to implement any of this -- pre-trained models handle it. But understanding the architecture helps you understand why the outputs look the way they do: the U-Net preserves the structure of your input (lines stay as lines, regions stay as regions) while the PatchGAN pushes the textures toward photorealism at the patch level.
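
To make "patch level" concrete, this small sketch computes how big a patch each discriminator output actually looks at. It assumes the 70x70 PatchGAN layout from the Pix2Pix paper's reference implementation: five 4x4 convolutions with strides 2, 2, 2, 1, 1 and padding 1. The pre-trained weights you load may have been built differently, and the names below are hypothetical.

```javascript
// Assumed layer layout of a 70x70 PatchGAN (not read from any ml5 model).
const PATCHGAN_LAYERS = [
  { kernel: 4, stride: 2 },
  { kernel: 4, stride: 2 },
  { kernel: 4, stride: 2 },
  { kernel: 4, stride: 1 },
  { kernel: 4, stride: 1 },
];

// Walk backwards from one output value to see how many input pixels it covers.
function receptiveField(layers) {
  let rf = 1;
  for (let i = layers.length - 1; i >= 0; i--) {
    rf = (rf - 1) * layers[i].stride + layers[i].kernel;
  }
  return rf;
}

// Walk forwards to see how many real/fake verdicts come out per side.
function outputSize(inputSize, layers, padding = 1) {
  let n = inputSize;
  for (const { kernel, stride } of layers) {
    n = Math.floor((n + 2 * padding - kernel) / stride) + 1;
  }
  return n;
}

console.log(receptiveField(PATCHGAN_LAYERS)); // patch size in pixels
console.log(outputSize(256, PATCHGAN_LAYERS)); // verdicts per side for a 256px input
```

Under these assumptions a 256x256 image gets a grid of overlapping verdicts rather than a single real/fake score, which is why local texture tends to look convincing even when the global anatomy is off.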

Loading Pix2Pix in ml5

ml5 provides a Pix2Pix implementation with pre-trained models. The models are trained on specific domain pairs -- edges to cats, edges to shoes, facades, etc. Each model understands one specific translation. You can't feed a shoe sketch to the cat model and expect a shoe. (Well, you can feed it in, and the results are gloriously weird, but that's a feature not a bug.)

let pix2pix;
let inputCanvas;
let outputImage;

function setup() {
  createCanvas(512, 256);

  // create a drawing surface for input
  inputCanvas = createGraphics(256, 256);
  inputCanvas.background(255);

  // load edges2cats model
  pix2pix = ml5.pix2pix('edges2cats', function() {
    console.log('pix2pix model loaded');
  });
}

function draw() {
  background(10, 12, 18);

  // show input canvas on the left
  image(inputCanvas, 0, 0);

  // show generated output on the right
  if (outputImage) {
    image(outputImage, 256, 0);
  }

  // divider
  stroke(60);
  strokeWeight(1);
  line(256, 0, 256, 256);

  // labels
  fill(140);
  noStroke();
  textSize(10);
  textFont('monospace');
  text('your sketch', 10, 250);
  text('pix2pix output', 266, 250);
}

function mouseDragged() {
  if (mouseX < 256) {
    inputCanvas.stroke(0);
    inputCanvas.strokeWeight(3);
    inputCanvas.line(mouseX, mouseY, pmouseX, pmouseY);
  }
}

function keyPressed() {
  if (key === 'g') {
    // generate from current sketch
    pix2pix.transfer(inputCanvas.canvas, function(err, result) {
      if (!err) {
        outputImage = result.raw;
      }
    });
  }
  if (key === 'c') {
    inputCanvas.background(255);
    outputImage = null;
  }
}

Draw on the left side with your mouse. Black lines on white background -- that's what the edges2cats model expects (it was trained on edge-detected outlines of cats). Press G to generate. The model processes your sketch and produces an image on the right side. Press C to clear and start over.

The first time you do this, two things will happen. First, there's a delay -- the model takes a second or two to process each generation. It's not real-time like style transfer. Second, the output will look... wrong. Wonderfully, disturbingly wrong. A few oval shapes that vaguely resemble eyes become glassy cat eyes staring from a mass of fur-textured skin. Lines that weren't meant to be whiskers become whiskers. Empty space becomes fur. The model fills in everything you didn't draw based on what it learned "cat" looks like. And because your sketch is ambiguous, its interpretation is creative.

Drawing tools for better sketches

The basic mouse drawing above works but it's crude. A few drawing tools make the sketch-to-generation loop more interesting.

let brushSize = 3;
let erasing = false;

function mouseDragged() {
  if (mouseX >= 256) return;

  if (erasing) {
    inputCanvas.noStroke();
    inputCanvas.fill(255);
    inputCanvas.circle(mouseX, mouseY, brushSize * 4);
  } else {
    inputCanvas.stroke(0);
    inputCanvas.strokeWeight(brushSize);
    inputCanvas.line(mouseX, mouseY, pmouseX, pmouseY);
  }
}

function keyPressed() {
  if (key === 'g') {
    pix2pix.transfer(inputCanvas.canvas, function(err, result) {
      if (!err) outputImage = result.raw;
    });
  }
  if (key === 'c') {
    inputCanvas.background(255);
    outputImage = null;
  }
  if (key === 'e') {
    erasing = !erasing;
  }
  if (key === '1') brushSize = 1;
  if (key === '2') brushSize = 3;
  if (key === '3') brushSize = 6;
  if (key === '4') brushSize = 12;
}

function draw() {
  background(10, 12, 18);
  image(inputCanvas, 0, 0);

  if (outputImage) {
    image(outputImage, 256, 0);
  }

  // cursor preview
  if (mouseX < 256) {
    noFill();
    stroke(erasing ? color(200, 80, 80) : color(80, 80, 200));
    strokeWeight(1);
    circle(mouseX, mouseY, erasing ? brushSize * 4 : brushSize);
  }

  // controls info
  fill(120);
  noStroke();
  textSize(9);
  textFont('monospace');
  text('G: generate | C: clear | E: eraser | 1-4: brush size', 10, height - 6);
  text('brush: ' + brushSize + (erasing ? ' [eraser]' : ''), 10, height - 18);
}

Now you can erase parts of your sketch and try again. The thick brush (key 4) creates bold outlines that the model reads as strong edges. The thin brush (key 1) creates delicate lines that the model sometimes ignores or sometimes interprets as whiskers, fur texture, or fine detail. The difference between a thick-line cat and a thin-line cat is dramatic -- same shape, different line weight, completely different output. That sensitivity to input subtlety is part of what makes Pix2Pix interesting as a creative tool. You learn the model's preferences through experimentation.

Live generation: drawing as conversation

The generate-on-keypress workflow is nice for careful sketching, but the real magic happens when you generate continuously as you draw. Every stroke changes the input and the model re-interprets it. The output evolves alongside your sketch. It's a conversation between you and the model -- you propose a line, the model responds with its interpretation, you react to that interpretation by adding or erasing, the model responds again.

let lastGenTime = 0;
let genInterval = 1500;
let isGenerating = false;

function draw() {
  background(10, 12, 18);
  image(inputCanvas, 0, 0);

  if (outputImage) {
    image(outputImage, 256, 0);
  }

  // auto-generate every genInterval ms
  if (millis() - lastGenTime > genInterval && !isGenerating) {
    generateFromSketch();
  }

  // generation indicator
  if (isGenerating) {
    fill(200, 160, 80);
    noStroke();
    textSize(9);
    textFont('monospace');
    text('generating...', 266, 250);
  }
}

function generateFromSketch() {
  isGenerating = true;
  lastGenTime = millis();

  pix2pix.transfer(inputCanvas.canvas, function(err, result) {
    isGenerating = false;
    if (!err) {
      outputImage = result.raw;
    }
  });
}

Every 1.5 seconds, the model regenerates from whatever's on the canvas. Draw a circle and watch the model try to make it into a cat face. Add two dots for eyes and suddenly the interpretation snaps -- the model "sees" a face now and generates accordingly. Erase one eye and the output becomes confused, asymmetrical, alien. The feedback loop is tight enough to feel like collaboration but slow enough to actually see each interpretation before the next one arrives.

The genInterval matters. At 500ms the generation is frantic -- every tiny stroke triggers a new interpretation before you've finished the thought. At 3000ms it's too slow, you draw multiple strokes and only see the combined result. 1000-2000ms is the sweet spot where you can see the model "thinking" about your latest change.
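
One refinement worth considering: the timer above regenerates even when nothing on the canvas has changed. Here is a minimal sketch of a pacing check that also tracks whether the sketch is "dirty". The `shouldGenerate` helper and the `dirty` flag are hypothetical additions, not part of ml5 or p5.

```javascript
// Decide whether to kick off a new generation.
// state.lastGenTime:  timestamp (ms) of the last generation start
// state.isGenerating: true while a transfer is still in flight
// state.dirty:        true if the sketch changed since the last generation
function shouldGenerate(now, state, interval = 1500) {
  if (state.isGenerating) return false; // never overlap two transfers
  if (!state.dirty) return false;       // nothing new to interpret
  return now - state.lastGenTime > interval; // respect the pacing
}
```

To wire it in, you would set the flag to true inside mouseDragged(), call `shouldGenerate(millis(), state, genInterval)` in draw(), and reset the flag to false when a generation starts. The model then only responds when you have actually said something new.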

Multiple models: same sketch, different worlds

Each pre-trained model translates between a different pair of domains. The same sketch drawn on the left can be interpreted as a cat, a shoe, a building, depending on which model is loaded. The sketch is a seed. The model is the lens.

let models = {};
let modelNames = ['edges2cats', 'edges2shoes'];
let currentModel = 'edges2cats';
let loadedCount = 0;

function setup() {
  createCanvas(768, 256);
  inputCanvas = createGraphics(256, 256);
  inputCanvas.background(255);

  for (const name of modelNames) {
    models[name] = ml5.pix2pix(name, function() {
      loadedCount++;
      console.log(name + ' loaded (' + loadedCount + '/' + modelNames.length + ')');
    });
  }
}

function keyPressed() {
  if (key === 'g') {
    // generate from all loaded models
    generateAll();
  }
  if (key === '1') currentModel = 'edges2cats';
  if (key === '2') currentModel = 'edges2shoes';
}

function generateAll() {
  for (const name of modelNames) {
    if (!models[name]) continue;

    models[name].transfer(inputCanvas.canvas, function(err, result) {
      if (!err) {
        // store the latest result on the model object so draw() can find it
        models[name]._lastResult = result.raw;
      }
    });
  }
}

function draw() {
  background(10, 12, 18);

  // input
  image(inputCanvas, 0, 0);

  // outputs
  let xPos = 256;
  for (const name of modelNames) {
    if (models[name] && models[name]._lastResult) {
      image(models[name]._lastResult, xPos, 0, 256, 256);
    }

    fill(120);
    noStroke();
    textSize(9);
    textFont('monospace');
    text(name, xPos + 5, 250);
    xPos += 256;
  }
}

Draw an oval shape. The edges2cats model interprets it as a face and generates furry features, glassy eyes, maybe a nose. The edges2shoes model sees the same oval as a shoe silhouette and generates leather, stitching, a sole. Same input, radically different outputs. The model's training data defines what "reality" means -- reality is cats in one world and shoes in another. Your sketch exists in both worlds simultaneously, interpreted through different lenses.

This is genuinely interesting as a creative exercise. Draw something abstract -- not a cat, not a shoe, just shapes and lines -- and see what each model makes of it. A zigzag line becomes whiskers in one world and stitching in another. A circle becomes an eye or an eyelet. The model finds the closest thing in its vocabulary to your ambiguous input. Those interpretations reveal what the model "wants" to see, what patterns dominate its training data.
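
If juggling one callback per model gets messy, the calls can be wrapped in Promises and collected in one place. This is a sketch under one assumption: each model exposes the callback-style `transfer(input, (err, result) => ...)` used in the code above. `transferAsync` and `transferAll` are hypothetical helper names.

```javascript
// Wrap a callback-style transfer(input, cb) in a Promise.
function transferAsync(model, input) {
  return new Promise((resolve, reject) => {
    model.transfer(input, (err, result) => (err ? reject(err) : resolve(result)));
  });
}

// Run the same input through every model.
// Returns { modelName: result }, with null for any model that failed.
async function transferAll(models, input) {
  const names = Object.keys(models);
  const settled = await Promise.allSettled(
    names.map((name) => transferAsync(models[name], input))
  );
  const results = {};
  names.forEach((name, i) => {
    results[name] = settled[i].status === 'fulfilled' ? settled[i].value : null;
  });
  return results;
}
```

Using `Promise.allSettled` means one failing model doesn't take the others down with it. You get whatever interpretations succeeded and can draw those.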

Feeding generative art to Pix2Pix

You don't have to draw by hand. Everything we've built in this series -- particles, noise, L-systems, geometry -- can be drawn to the input canvas. Use code as the "artist" and Pix2Pix as the "interpreter."

let inputCanvas;
let outputImage;
let pix2pix;
let phase = 0;

function setup() {
  createCanvas(512, 256);
  inputCanvas = createGraphics(256, 256);

  pix2pix = ml5.pix2pix('edges2cats', function() {
    console.log('ready');
  });
}

function draw() {
  background(10, 12, 18);

  // generate input with code
  inputCanvas.background(255);
  inputCanvas.stroke(0);
  inputCanvas.strokeWeight(2);
  inputCanvas.noFill();

  phase += 0.01;

  // draw noise-driven curves
  for (let i = 0; i < 8; i++) {
    inputCanvas.beginShape();
    for (let t = 0; t < TWO_PI; t += 0.1) {
      const r = 60 + noise(i * 0.5, t * 0.3, phase) * 50;
      const x = 128 + cos(t) * r;
      const y = 128 + sin(t) * r;
      inputCanvas.vertex(x, y);
    }
    inputCanvas.endShape(CLOSE);
  }

  // add some dots
  for (let i = 0; i < 12; i++) {
    const dx = 128 + noise(i * 0.7, phase * 0.5) * 200 - 100;
    const dy = 128 + noise(i * 0.7 + 10, phase * 0.5) * 200 - 100;
    inputCanvas.circle(dx, dy, 4);
  }

  image(inputCanvas, 0, 0);

  if (outputImage) {
    image(outputImage, 256, 0);
  }

  // auto-generate periodically
  if (frameCount % 90 === 0) {
    pix2pix.transfer(inputCanvas.canvas, function(err, result) {
      if (!err) outputImage = result.raw;
    });
  }
}

Perlin noise drives organic, blobby shapes on the input canvas. Every 90 frames (about 1.5 seconds at 60fps), the model interprets the current blob configuration as a cat. The noise evolves slowly, so successive interpretations show the model's response to gradually changing input. Some configurations trigger clear cat features -- the model suddenly "sees" a face and generates defined eyes and fur. Other configurations confuse it and the output becomes abstract, hallucinatory, more texture than form. The noise explores the model's input space automatically, finding shapes that activate different aspects of the model's cat-knowledge.

You could replace the noise curves with L-system branches (episode 54), particle traces (episode 11), or audio-reactive geometry (episode 19). The creative coding output becomes the raw material for neural interpretation. Code draws. Model interprets. The output is a hybrid -- neither purely algorithmic nor purely neural.
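
As one example of swapping the "artist", here is a compact L-system that produces plain line segments you could stroke onto the input canvas. It is a fresh minimal sketch, not the code from the earlier episode. The function names, the default turn angle, and the step size are all assumptions chosen to fit a 256x256 canvas.

```javascript
// Expand an L-system string by applying the rewrite rules `iterations` times.
function expandLSystem(axiom, rules, iterations) {
  let s = axiom;
  for (let i = 0; i < iterations; i++) {
    s = Array.from(s, (ch) => rules[ch] ?? ch).join('');
  }
  return s;
}

// Turtle-interpret the string into line segments [x1, y1, x2, y2].
// F = draw forward, + / - = turn, [ / ] = save / restore position and heading.
function lSystemSegments(
  commands,
  { x = 128, y = 240, angle = -Math.PI / 2, step = 6, turn = Math.PI / 7 } = {}
) {
  const segments = [];
  const stack = [];
  for (const ch of commands) {
    if (ch === 'F') {
      const nx = x + Math.cos(angle) * step;
      const ny = y + Math.sin(angle) * step;
      segments.push([x, y, nx, ny]);
      x = nx;
      y = ny;
    } else if (ch === '+') {
      angle += turn;
    } else if (ch === '-') {
      angle -= turn;
    } else if (ch === '[') {
      stack.push({ x, y, angle });
    } else if (ch === ']' && stack.length > 0) {
      ({ x, y, angle } = stack.pop());
    }
  }
  return segments;
}
```

In the sketch you would loop over the segments and call `inputCanvas.line(...)` for each one in place of the noise curves. A rule like `{ F: 'F[+F]F[-F]F' }` with three or four iterations should give the cat model plenty of branching edges to misread as whiskers.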

The uncanny valley: embracing the artifacts

Pix2Pix outputs are never perfect. The proportions are slightly off. Textures blur at boundaries. Fur melts into background. Eyes stare from wrong positions. Seams between generated patches show. These aren't bugs -- they're the signature of the technique. The model is hallucinating an image from a sketch, and hallucinations have artifacts.

For practical image-to-image translation, these artifacts are problems to fix. For creative coding, they're the whole point. The wrongness has its own aesthetic. A cat with three eyes is more interesting than a cat with two. Architecture that bends in impossible ways is more evocative than a straight building. The model's confident interpretation of impossible input produces surreal imagery that no human artist and no algorithm alone would create. It takes both -- your sketch as provocation, the model's trained imagination as response.

// intentionally "bad" input to maximize creative artifacts
// the model's response to impossible geometry is the art

function drawImpossibleCat() {
  inputCanvas.background(255);
  inputCanvas.stroke(0);
  inputCanvas.strokeWeight(2);
  inputCanvas.noFill();

  // draw a face with 5 eyes
  inputCanvas.circle(128, 120, 120);

  // five eye-like ovals
  for (let i = 0; i < 5; i++) {
    // i runs 0..4, so map from 0..4 to reach both ends of the angle range
    const angle = map(i, 0, 4, -PI * 0.6, PI * 0.6);
    const ex = 128 + cos(angle) * 40;
    const ey = 110 + sin(angle) * 15;
    inputCanvas.ellipse(ex, ey, 14, 8);
    inputCanvas.circle(ex, ey, 3);
  }

  // three mouths at different heights
  for (let i = 0; i < 3; i++) {
    const my = 140 + i * 15;
    inputCanvas.arc(128, my, 30 - i * 5, 12, 0, PI);
  }

  // too many whiskers
  for (let i = 0; i < 12; i++) {
    const side = i < 6 ? -1 : 1;
    const idx = i % 6;
    const startX = 128 + side * 35;
    const startY = 130 + idx * 4;
    const endX = startX + side * (40 + random(20));
    const endY = startY + random(-8, 8);
    inputCanvas.line(startX, startY, endX, endY);
  }
}

Feed this to the edges2cats model. Five eyes. Three mouths. Twelve whiskers. The model tries to make sense of every element. Some eyes become actual cat eyes. Others become patches of fur or dark spots. The mouths merge or stack. The whiskers extend and multiply. The result is a cat from a fever dream -- recognizably feline but deeply wrong. Exactly the kind of image that sits at the intersection of human intention and machine interpretation.

Style variation: the sensitivity boundary

Even small changes to the input produce different outputs. Line thickness, position, completeness -- each affects the generation. The boundary between "the model sees a cat" and "the model sees garbage" is narrow and fascinating. Exploring that boundary systematically reveals the model's internal thresholds.

let variants = [];
let variantOutputs = [];

function generateVariants() {
  variants = [];
  variantOutputs = [];

  for (let v = 0; v < 4; v++) {
    const vc = createGraphics(256, 256);
    vc.background(255);
    vc.stroke(0);
    vc.noFill();

    // same basic cat shape, different line weights
    const weight = 1 + v * 2;
    vc.strokeWeight(weight);

    // head
    vc.ellipse(128, 128, 100, 90);

    // ears
    vc.triangle(85, 90, 75, 50, 105, 75);
    vc.triangle(171, 90, 181, 50, 151, 75);

    // eyes
    vc.ellipse(108, 118, 16, 10);
    vc.ellipse(148, 118, 16, 10);

    // nose
    vc.triangle(128, 130, 122, 138, 134, 138);

    // mouth
    vc.arc(128, 142, 20, 10, 0, PI);

    variants.push(vc);

    // generate each variant
    pix2pix.transfer(vc.canvas, function(err, result) {
      if (!err) variantOutputs[v] = result.raw;
    });
  }
}

function draw() {
  background(10, 12, 18);

  for (let v = 0; v < variants.length; v++) {
    const x = v * 128;
    if (variants[v]) {
      image(variants[v], x, 0, 128, 128);
    }
    if (variantOutputs[v]) {
      image(variantOutputs[v], x, 128, 128, 128);
    }

    fill(100);
    noStroke();
    textSize(8);
    textFont('monospace');
    text('weight: ' + (1 + v * 2), x + 5, 125);
  }
}

Four identical cat sketches with line weights of 1, 3, 5, and 7 pixels. The thin-line version produces a different cat than the thick-line version even though the shape is the same. Thin lines might generate a delicate, wispy cat. Thick lines generate a bolder, more defined cat. The model reads line weight as a signal for edge strength, and that affects how confidently it fills in the regions.

This sensitivity means you have more control than you might think. Line weight is a creative parameter. So is line completeness -- a half-drawn circle generates differently from a full circle. A dotted line generates differently from a solid one. The vocabulary of sketch marks maps to a vocabulary of photographic qualities in the output space.
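
Line completeness can be explored the same way as line weight. A minimal sketch of a helper that returns the points of a partially drawn circle, so you can sweep completeness from 0 to 1 and compare the generations. The function name and parameters are hypothetical.

```javascript
// Points along a partially drawn circle.
// completeness: 0..1 (1 = full circle), steps: number of line segments used
function partialCirclePoints(cx, cy, r, completeness, steps = 48) {
  const clamped = Math.min(Math.max(completeness, 0), 1);
  const sweep = Math.PI * 2 * clamped;
  const points = [];
  for (let i = 0; i <= steps; i++) {
    const a = (i / steps) * sweep;
    points.push([cx + Math.cos(a) * r, cy + Math.sin(a) * r]);
  }
  return points;
}
```

On a p5 graphics buffer you would feed these points to beginShape() / vertex() / endShape() without CLOSE, so the gap stays open. Somewhere between a quarter arc and a full circle the model will probably flip from "stray line" to "eye", and finding that flip point is the experiment.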

Ethical considerations: what the model learned

Every Pix2Pix model is shaped by its training data. The edges2cats model was trained on a dataset of cat photographs with their edge maps. Whatever biases exist in that dataset -- the breeds represented, the poses, the backgrounds, the lighting conditions -- exist in the model's outputs. It doesn't generate cats from first principles. It generates cats from memory. Its memory is the training set.

This matters more for models trained on human data. A face model trained primarily on one demographic will generate faces from that demographic. A building model trained on Western architecture will struggle with non-Western forms. The model doesn't know what it doesn't know -- it confidently generates something for any input, even when that something reflects a limited training dataset.

For creative coding, being aware of this is part of the practice. When you choose a pre-trained model, you're choosing the dataset's worldview. When you train your own model (which is possible but requires paired data), you define the worldview. The artistic choice includes the training data, not just the input sketch.

The creative exercise: collaborative drawing studio

Allez, let's put it all together. A proper sketch-to-generation studio where you draw freely, the model interprets continuously, and you iterate based on what you see. Your sketch and the model's interpretation evolve together. Neither has full control. The final image is a negotiation.

let inputCanvas, pix2pix, outputImage;
let brushSize = 3;
let erasing = false;
let isGenerating = false;
let genCount = 0;

function setup() {
  createCanvas(512, 300);
  inputCanvas = createGraphics(256, 256);
  inputCanvas.background(255);

  pix2pix = ml5.pix2pix('edges2cats', function() {
    console.log('model loaded -- draw something');
  });
}

function draw() {
  background(10, 12, 18);

  image(inputCanvas, 0, 0);
  if (outputImage) {
    image(outputImage, 256, 0);
  }

  // divider
  stroke(40);
  strokeWeight(1);
  line(256, 0, 256, 256);

  // cursor
  if (mouseX < 256 && mouseY < 256) {
    noFill();
    stroke(erasing ? color(200, 80, 80, 120) : color(80, 100, 200, 120));
    strokeWeight(1);
    circle(mouseX, mouseY, brushSize * 2);
  }

  // controls panel
  fill(20, 22, 30);
  noStroke();
  rect(0, 256, 512, 44);

  fill(130, 140, 160);
  textSize(9);
  textFont('monospace');
  text('G: generate | C: clear | E: eraser (' +
       (erasing ? 'ON' : 'off') + ') | 1-4: brush', 10, 274);
  text('generations: ' + genCount +
       (isGenerating ? '  [working...]' : ''), 10, 290);
  text('brush: ' + brushSize + 'px', 350, 274);
}

function mouseDragged() {
  if (mouseX >= 256 || mouseY >= 256) return;

  if (erasing) {
    inputCanvas.noStroke();
    inputCanvas.fill(255);
    inputCanvas.circle(mouseX, mouseY, brushSize * 5);
  } else {
    inputCanvas.stroke(0);
    inputCanvas.strokeWeight(brushSize);
    inputCanvas.line(mouseX, mouseY, pmouseX, pmouseY);
  }
}

function keyPressed() {
  if (key === 'g' && !isGenerating) {
    isGenerating = true;
    pix2pix.transfer(inputCanvas.canvas, function(err, result) {
      isGenerating = false;
      if (!err) {
        outputImage = result.raw;
        genCount++;
      }
    });
  }
  if (key === 'c') {
    inputCanvas.background(255);
    outputImage = null;
  }
  if (key === 'e') erasing = !erasing;
  if (key === '1') brushSize = 1;
  if (key === '2') brushSize = 3;
  if (key === '3') brushSize = 6;
  if (key === '4') brushSize = 12;
}

Draw a cat face. Press G. See what the model makes of it. Not right? Erase the ears. Redraw them bigger. Press G again. Better. Add whiskers. G. The mouth is wrong now -- erase it, try a different shape. G. The model's interpretation shifts with every change. You learn what it responds to. It learns nothing -- it's the same frozen weights every time -- but your understanding of its tendencies grows with each generation. After ten iterations, you know: wide ovals make better eyes than circles. The nose needs to be small. Ears need to be pointy and high. You've reverse-engineered the model's preferences through play.

That loop -- draw, generate, evaluate, modify -- is the creative process. Same loop you use when sketching on paper, but with a neural network as co-author. The final image belongs to both of you. Pretty cool, right? :-)

Performance and practical notes

Model loading time. Pix2Pix models are large -- typically 50-200MB depending on the domain. The initial load takes several seconds on a fast connection. Don't expect instant startup. Show a loading indicator.

Generation speed. Each transfer call takes 0.5-2 seconds depending on hardware. Not real-time. This limits Pix2Pix to a sketch-then-generate workflow rather than the continuous-feed approach of style transfer. You can auto-generate on a timer (every 1-2 seconds) but expect visible lag between strokes and responses.

Input resolution. Pix2Pix models are trained at 256x256. Larger inputs get downsampled. The output is also 256x256. For high-res output, you'd need to tile the image or use a super-resolution post-process, but for creative coding purposes 256x256 is fine -- the artifacts and hallucinations are more interesting at low resolution anyway.

Canvas vs webcam. We used a drawing canvas as input. You can also feed webcam frames through an edge detection filter and pass those to Pix2Pix. The result is similar to style transfer but with a very different visual quality -- instead of re-rendering the same scene, the model generates a new scene based on the edges it detects. Try running edge detection on your webcam (Canny via OpenCV.js, or a simple Sobel filter written in p5) and feeding the edges to edges2cats. Your room becomes a landscape of cats.

// tip: edge-detect webcam frames for live pix2pix input
// 1. capture frame from video
// 2. convert to grayscale
// 3. apply Sobel or threshold to get edges
// 4. invert (pix2pix expects dark lines on white)
// 5. feed to pix2pix.transfer()
//
// the result: every edge in your room becomes a cat feature
// doorframes become whiskers. light switches become eyes.
// your face becomes a cat face (sort of).
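
Steps 3 and 4 from the tip above can be sketched as one function. This is a plain Sobel filter over a flat grayscale array with the inversion built in, so edges come out black on white. The function name, the array layout, and the threshold of 100 are assumptions for illustration. A proper Canny pass would add smoothing and edge thinning on top.

```javascript
// gray: flat array of grayscale values (0..255), length = width * height
// Returns a flat Uint8ClampedArray: 0 (black) on edges, 255 (white) elsewhere.
// Border pixels are left white because the 3x3 kernel doesn't fit there.
function sobelEdges(gray, width, height, threshold = 100) {
  const out = new Uint8ClampedArray(width * height).fill(255);
  for (let y = 1; y < height - 1; y++) {
    for (let x = 1; x < width - 1; x++) {
      // helper to read a neighbour relative to (x, y)
      const p = (dx, dy) => gray[(y + dy) * width + (x + dx)];
      // horizontal and vertical Sobel gradients
      const gx =
        -p(-1, -1) - 2 * p(-1, 0) - p(-1, 1) +
         p(1, -1) + 2 * p(1, 0) + p(1, 1);
      const gy =
        -p(-1, -1) - 2 * p(0, -1) - p(1, -1) +
         p(-1, 1) + 2 * p(0, 1) + p(1, 1);
      if (Math.hypot(gx, gy) > threshold) out[y * width + x] = 0;
    }
  }
  return out;
}
```

In a p5 sketch you would build `gray` from the video's pixels array (averaging R, G and B per pixel works), then write the result back into inputCanvas before calling transfer. Lower the threshold for busier edges, raise it for cleaner outlines.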

Where does this lead?

Pix2Pix translates between paired image domains. It needs aligned training pairs -- every input has a matching output. That pairing requirement limits what domains you can work with (someone had to prepare those pairs). The next step is models that work without paired data, that learn to translate between two collections of images without any alignment at all. No sketch-photo pairs needed -- just a pile of sketches and a pile of photos, and the model figures out the mapping. That opens up radically more creative possibilities because you can work with any two collections of images.

Beyond domain translation, there are models that generate images from nothing -- from random noise, from text descriptions, from latent codes. GANs that create faces, landscapes, artworks that never existed. The progression from style transfer (change rendering) to Pix2Pix (translate between domains) to pure generation (create from scratch) follows a clear arc of increasing generative power. Each step gives the model more creative autonomy and gives you different creative levers to pull.

What it comes down to...

Pix2Pix is a conditional GAN that learns to translate between paired image domains. Trained on aligned pairs (sketch, photograph), it learns the mapping. At inference time, draw a sketch and get a photographic interpretation. Unlike style transfer which changes rendering while preserving content, Pix2Pix generates genuinely different images that correspond to the input.
The architecture uses a U-Net generator (encoder-decoder with skip connections for spatial detail) and a PatchGAN discriminator (classifies overlapping patches for texture quality). The generator must fool the discriminator AND match the ground truth. The adversarial training pushes outputs toward photorealism while the paired loss keeps them aligned with the input.
ml5 provides pre-trained Pix2Pix models for specific domain pairs: edges2cats, edges2shoes, facades. Each model understands one translation. Same sketch fed to different models produces radically different outputs -- a circle becomes a cat eye or a shoe eyelet depending on the model's training domain.
The creative value is in the mistakes. Ambiguous sketches produce hallucinated, surreal outputs. A vaguely cat-shaped blob generates a nightmare cat. Impossible geometry produces dreamlike architecture. The gap between your intention and the model's interpretation IS the creative material. You're collaborating with the model's trained imagination.
Live generation creates a feedback loop: draw, generate, evaluate, modify. Every 1-2 seconds the model re-interprets the current sketch. Add a line and the interpretation shifts. Erase and it changes again. This loop is the creative process -- same as sketching on paper but with a neural co-author.
Input sensitivity is a feature. Line weight, completeness, position -- all affect the output. Thick lines generate differently from thin lines. A half-drawn circle generates differently from a complete one. These are creative parameters. Exploring the boundary between "the model sees a cat" and "the model sees garbage" reveals the model's internal thresholds.
Feeding generative code to Pix2Pix creates hybrid outputs. Noise-driven curves, L-system branches, particle traces -- any algorithmic drawing can be the input. Code draws, model interprets. The output is neither purely algorithmic nor purely neural. Two creative systems layered on each other.
Training data shapes the model's worldview. Biases in the dataset become biases in the output -- breeds represented, poses, demographics, architectural styles. Choosing a pre-trained model means choosing the dataset's perspective. Training a custom model means defining your own perspective. The training data is an artistic choice.
Performance: models are 50-200MB (slow initial load), generation takes 0.5-2 seconds per image (not real-time), input/output is 256x256. These constraints push toward a deliberate sketch-then-generate workflow rather than continuous processing. The delay between stroke and response is part of the creative rhythm.
Pix2Pix requires paired training data. The next step is unpaired translation -- models that learn mappings between two image collections without aligned pairs. Beyond that, models that generate images from scratch. The progression from style transfer to domain translation to pure generation follows an arc of increasing generative power.

Six episodes into the ML arc. We started with flat labels (92), built up through body tracking (93-95), circled back for deep classification (96), discovered neural painting with style transfer (97), and now we're generating images from sketches. Each episode the network's role shifts further from observer to creator. Classification watches. Style transfer paints. Pix2Pix imagines. And we haven't gotten to the models that generate entirely new images from nothing but random noise.

Sallukes! Thanks for reading.

@femdev
