Learn AI Series (#86) - Image-to-Image and Editing

@scipio 70

20 days ago

StemSocial

What will I learn

You will learn inpainting: filling in masked regions of images with contextually appropriate content using diffusion models;
outpainting: extending images beyond their original borders;
neural style transfer: separating and recombining content and style from different images using Gram matrices;
super-resolution: enhancing image quality and upscaling low-resolution images with perceptual losses;
image-to-image translation: pix2pix for paired domains and CycleGAN for unpaired domain translation;
diffusion-based image editing: controlling edit strength with the noise parameter;
InstructPix2Pix: natural language editing instructions applied to existing images;
building a practical image editing pipeline that combines multiple techniques.

Requirements

A working modern computer running macOS, Windows or Ubuntu;
An installed Python 3(.11+) distribution;
The ambition to learn AI and machine learning.

Difficulty

Beginner

Curriculum (of the `Learn AI Series`):

Learn AI Series (#86) - Image-to-Image and Editing

Solutions to Episode #85 Exercises

Exercise 1: VAE compression analyzer.

import numpy as np


class VAECompressionAnalyzer:
    """Analyze VAE compression ratios and memory
    savings across different image resolutions."""

    def __init__(self, latent_ch=4,
                 downsample_factor=8):
        self.latent_ch = latent_ch
        self.factor = downsample_factor

    def analyze(self, resolutions):
        results = []
        for res in resolutions:
            pixel_size = res * res * 3
            lat_h = res // self.factor
            lat_w = res // self.factor
            latent_size = lat_h * lat_w * self.latent_ch
            ratio = pixel_size / latent_size
            # float32 = 4 bytes per value
            pixel_mb = pixel_size * 4 / (1024 ** 2)
            latent_mb = latent_size * 4 / (1024 ** 2)
            saved_mb = pixel_mb - latent_mb
            # DDIM steps in same memory budget
            steps_in_budget = pixel_size / latent_size
            results.append({
                "res": res,
                "pixel_size": pixel_size,
                "latent_size": latent_size,
                "ratio": ratio,
                "pixel_mb": pixel_mb,
                "latent_mb": latent_mb,
                "saved_mb": saved_mb,
                "steps_in_budget": steps_in_budget,
            })
        return results

    def print_table(self, resolutions):
        results = self.analyze(resolutions)
        print(f"{'Res':>6} {'Pixels':>10} "
              f"{'Latent':>10} {'Ratio':>7} "
              f"{'Saved(MB)':>10} {'Steps':>6}")
        print("-" * 55)
        for r in results:
            print(f"{r['res']:>6} "
                  f"{r['pixel_size']:>10,} "
                  f"{r['latent_size']:>10,} "
                  f"{r['ratio']:>7.1f}x "
                  f"{r['saved_mb']:>10.2f} "
                  f"{r['steps_in_budget']:>6.0f}")

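        # Verify constant ratio (computed outside the
        # f-string: a replacement field spanning lines
        # is a syntax error before Python 3.12)
        ratios = [r["ratio"] for r in results]
        all_equal = all(abs(r - ratios[0]) < 0.01
                        for r in ratios)
        print(f"\nAll ratios equal: {all_equal}")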


analyzer = VAECompressionAnalyzer()
analyzer.print_table([256, 512, 768, 1024])

The compression ratio is 48x across all resolutions -- exactly as expected since both numerator (H * W * 3) and denominator (H/8 * W/8 * 4) scale with H * W, so the ratio (3 * 64) / 4 = 48 is resolution-independent. The absolute memory savings grow quadratically though: going from 256 to 512 (2x resolution) quadruples the pixel count, so the savings in MB also roughly quadruple. The "steps in budget" column shows that in the memory needed for one pixel-space forward pass, you could run 48 DDIM steps in latent space -- nearly a complete generation cycle.

Exercise 2: CFG scale simulator.

import numpy as np


class CFGSimulator:
    """Simulate classifier-free guidance on
    synthetic 1D features."""

    def __init__(self, dim=100, seed=42):
        rng = np.random.RandomState(seed)
        self.unconditional = rng.randn(dim)
        self.signal = rng.randn(dim) * 0.5
        self.conditional = (self.unconditional
                            + self.signal)

    def apply_cfg(self, unconditional,
                  conditional, scale):
        return (unconditional
                + scale
                * (conditional - unconditional))

    def cosine_sim(self, a, b):
        dot = np.dot(a, b)
        na = np.linalg.norm(a)
        nb = np.linalg.norm(b)
        return dot / max(na * nb, 1e-12)

    def run(self):
        scales = [1.0, 3.0, 5.0, 7.5,
                  10.0, 15.0, 20.0]
        cond_norm = np.linalg.norm(
            self.conditional)

        print(f"{'Scale':>6} {'CosSim':>8} "
              f"{'L2 Norm':>9} {'Amplify':>8}")
        print("-" * 35)

        artifact_scale = None
        for s in scales:
            guided = self.apply_cfg(
                self.unconditional,
                self.conditional, s)
            cs = self.cosine_sim(
                guided, self.signal)
            norm = np.linalg.norm(guided)
            # Signal component in guided output
            guided_signal = (guided
                             - self.unconditional)
            raw_signal = (self.conditional
                          - self.unconditional)
            amp = (np.linalg.norm(guided_signal)
                   / max(np.linalg.norm(raw_signal),
                         1e-12))
            print(f"{s:>6.1f} {cs:>8.4f} "
                  f"{norm:>9.2f} {amp:>8.2f}x")

            if (artifact_scale is None
                    and norm > 3 * cond_norm):
                artifact_scale = s

        if artifact_scale:
            print(f"\nArtifact threshold "
                  f"(3x norm): scale={artifact_scale}")


sim = CFGSimulator()
sim.run()

Cosine similarity to the signal increases with scale -- the guided output aligns more closely with the text direction. But the L2 norm grows linearly without bound (because the CFG formula is a linear extrapolation). The amplification factor equals the scale itself (by construction -- scale * signal has scale times the norm of signal). The artifact threshold typically hits around scale 5-10, which matches real-world experience where scales above 10-12 start producing oversaturated, artifact-heavy images.

Exercise 3: DDIM step quality estimator.

import numpy as np


class DDIMStepEstimator:
    """Estimate DDIM reconstruction quality
    as a function of step count."""

    def __init__(self, T=1000, dim=256, seed=42):
        rng = np.random.RandomState(seed)
        self.T = T
        betas = np.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        self.alpha_bars = np.cumprod(alphas)
        self.x0 = rng.randn(dim)

    def ddim_reconstruct(self, x0, num_steps):
        # Add noise at t=T-1
        ab_T = self.alpha_bars[-1]
        noise = np.random.RandomState(123).randn(
            len(x0))
        xT = (np.sqrt(ab_T) * x0
              + np.sqrt(1 - ab_T) * noise)

        # Select timestep subset
        step_size = self.T // num_steps
        timesteps = list(
            range(0, self.T, step_size))[::-1]

        x = xT.copy()
        for i, t in enumerate(timesteps):
            ab_t = self.alpha_bars[t]
            # Perfect noise predictor
            pred_noise = (
                (x - np.sqrt(ab_t) * x0)
                / np.sqrt(1 - ab_t + 1e-12))

            # Predict x0
            pred_x0 = (
                (x - np.sqrt(1 - ab_t) * pred_noise)
                / np.sqrt(ab_t + 1e-12))
            pred_x0 = np.clip(pred_x0, -3, 3)

            if i + 1 < len(timesteps):
                ab_prev = self.alpha_bars[
                    timesteps[i + 1]]
            else:
                ab_prev = 1.0

            dir_xt = (np.sqrt(1 - ab_prev)
                      * pred_noise)
            x = np.sqrt(ab_prev) * pred_x0 + dir_xt

        return x

    def run(self):
        step_counts = [5, 10, 20, 50, 100,
                       200, 500, 1000]
        results = []
        for ns in step_counts:
            recon = self.ddim_reconstruct(
                self.x0, ns)
            mse = np.mean((recon - self.x0) ** 2)
            results.append((ns, mse))

        ref_mse = results[-1][1]
        print(f"{'Steps':>6} {'MSE':>12} "
              f"{'Relative':>10}")
        print("-" * 30)

        prev_mse = None
        knee = None
        for ns, mse in results:
            rel = mse / max(ref_mse, 1e-15)
            print(f"{ns:>6} {mse:>12.6f} "
                  f"{rel:>10.2f}x")
            if (prev_mse is not None
                    and knee is None
                    and prev_mse > 0):
                improvement = (
                    (prev_mse - mse) / prev_mse)
                if improvement < 0.10:
                    knee = ns
            prev_mse = mse

        if knee:
            print(f"\nKnee: {knee} steps "
                  f"(<10% improvement from "
                  f"doubling)")


estimator = DDIMStepEstimator()
estimator.run()

With a perfect noise predictor, the MSE is extremely small at all step counts: the x0 estimate is exact at every step, so whatever residual remains comes from the clip to [-3, 3] and floating-point rounding, not from prediction quality. That also means the knee detector has very little to work with in this idealized setup. With a real (imperfect) network, every step carries prediction error and a coarse timestep grid adds ODE discretization error on top; in practice the gains flatten out somewhere around 50-100 steps, which is why 50-step DDIM is a common practical choice -- most of the quality at 5% of the cost of the full 1000 steps.

On to today's episode

Welcome back! In episodes #84 and #85 we built the complete text-to-image pipeline: DDPM for denoising in pixel space, then latent diffusion with VAE compression, CLIP text conditioning, classifier-free guidance, DDIM sampling, and ControlNet for structural control. That's the machinery for generating images from scratch -- you start from pure Gaussian noise and iteratively denoise until a photorealistic image materializes out of nothing.

But here's the thing most people actually NEED from generative models: not creating images from scratch, but editing existing ones. Remove that person photobombing your vacation picture. Change the color of a house. Upscale a blurry old family photo. Convert a pencil sketch into a watercolor painting. Turn a summer landscape into winter. Every one of these tasks is a form of conditional image generation -- given an input image (and possibly some instruction), produce a modified version that preserves what should stay and transforms what should change.

The techniques in this episode span roughly a decade of research, from neural style transfer (Gatys et al., 2015) through pix2pix and CycleGAN (both 2017) to modern diffusion-based editing. What's remarkable is how they're all converging toward a single paradigm: conditional diffusion with various forms of guidance ;-)

Inpainting: filling in the blanks

Inpainting replaces a masked region of an image with content that blends seamlessly with the surroundings. Remove a person from a beach photo. Erase a watermark. Fill in a scratched area of a scanned photograph. Reconstruct a damaged painting.

Classic inpainting (before deep learning) used texture propagation -- copy patterns from nearby pixels into the hole, working from the edges inward. It works reasonably well for small holes in textured regions but completely falls apart when the missing area is large or contains semantic content (like a face, a building, or an animal). You can't reconstruct a face from surrounding tree textures by copying pixels.

Deep learning inpainting uses the model's understanding of what should be there given the context. A diffusion inpainting model is essentially the same U-Net from episode #85, but with additional input channels -- the masked image and the mask itself are concatenated with the noisy latent. The model learns to denoise only in the masked region while keeping the unmasked areas untouched:

from diffusers import StableDiffusionInpaintPipeline
import torch
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("beach_with_person.png").resize(
    (512, 512))
mask = Image.open("person_mask.png").resize(
    (512, 512))  # white = replace

result = pipe(
    prompt="a sandy beach with ocean waves, "
           "sunny day, golden sand",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("beach_clean.png")

The prompt guides what gets generated in the masked region. Without a prompt (or with a generic one), the model fills in based purely on visual context -- surrounding textures, colors, and scene understanding. With a specific prompt, you can replace the masked area with something entirely new: mask out the sky and prompt "dramatic thunderstorm clouds" to change the weather.

Under the hood, the inpainting model modifies the standard diffusion denoising loop. At each step, after predicting the noise and computing the denoised estimate, the unmasked regions are forced to match the original image (at the appropriate noise level for that timestep). Only the masked region is free to evolve through the denoising process:

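import torch


def inpaint_denoise_step(model, xt, t,
                         mask_latent,
                         original_latent,
                         text_emb):
    """Modified denoising step for inpainting.
    Unmasked regions are forced to match the
    original at each timestep.

    Sketch only: denoise_step() and add_noise()
    stand for the scheduler's reverse step and
    forward noising and are not defined here."""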
    # Predict noise for the whole image
    pred_noise = model(
        torch.cat([xt, mask_latent,
                   original_latent], dim=1),
        t, text_emb)

    # Standard denoise step
    xt_prev = denoise_step(xt, pred_noise, t)

    # Force unmasked regions to match original
    # (at appropriate noise level for t-1)
    original_noised = add_noise(
        original_latent, t - 1)
    xt_prev = (mask_latent * xt_prev
               + (1 - mask_latent)
               * original_noised)
    return xt_prev

This "force the known pixels" approach is simple but effective. The boundary between masked and unmasked regions can sometimes show visible seams, which is why more sophisticated methods use feathered masks (gradual transition from 0 to 1 at the boundary) or apply additional blending in post-processing. Having said that, modern inpainting models trained specifically for this task (like the SD2 inpainting checkpoint above) handle boundaries remarkably well because they've seen millions of examples during training.
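To make the blending step concrete, here is a toy NumPy sketch under a few assumptions of my own: the linear beta schedule from the exercise code above, a random array standing in for the model's denoised output (`model_output` is a stand-in, not a real prediction), and a hard box mask. It only illustrates the "keep the known region, free the masked region" bookkeeping, not a full sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear beta schedule, same shape as in the DDIM exercise above
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)


def add_noise(x0, t, rng):
    """Forward-noise a clean latent x0 to timestep t."""
    noise = rng.standard_normal(x0.shape)
    return (np.sqrt(alpha_bars[t]) * x0
            + np.sqrt(1.0 - alpha_bars[t]) * noise)


def blend_known_region(xt_prev, original, mask, t_prev, rng):
    """Keep the model's result where mask == 1 and the
    re-noised original where mask == 0."""
    original_noised = add_noise(original, t_prev, rng)
    return mask * xt_prev + (1.0 - mask) * original_noised


original = rng.standard_normal((4, 8, 8))      # fake 4-channel latent
model_output = rng.standard_normal((4, 8, 8))  # stand-in for a denoised step
mask = np.zeros((1, 8, 8))
mask[:, 2:6, 2:6] = 1.0                        # 1 = region to replace

blended = blend_known_region(
    model_output, original, mask, t_prev=0, rng=rng)

# Inside the mask the model output passes through untouched
inside_ok = np.allclose(blended[:, 2:6, 2:6],
                        model_output[:, 2:6, 2:6])
# Outside, at t=0, the result is very close to the original latent
outside_err = np.abs(blended[:, 0, 0] - original[:, 0, 0]).max()
print(inside_ok, outside_err)
```

Swapping the hard 0/1 mask for a blurred one turns the same line of arithmetic into a soft blend across the boundary, which is the feathering idea mentioned above.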

Style transfer: content meets aesthetics

Neural style transfer, proposed by Gatys et al. (2015), was one of the first "wow" demonstrations of deep learning for creative purposes. The insight: a pretrained CNN (like VGG, which we covered in episode #46) naturally separates content (what objects are where) from style (textures, colors, brushstrokes) in its internal representations.

Content is captured by the feature maps themselves -- which neurons are active where. Style is captured by the Gram matrix -- the correlations between feature maps, which describe texture patterns independent of spatial layout. If two feature channels tend to activate together (say, one detecting horizontal edges and another detecting warm colors), that correlation is part of the "style" regardless of where in the image those patterns appear:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


def gram_matrix(features):
    """Compute the Gram matrix: correlations
    between feature channels."""
    b, c, h, w = features.shape
    F_map = features.view(b, c, h * w)
    G = torch.bmm(F_map, F_map.transpose(1, 2))
    return G / (c * h * w)


def extract_features(model, x, layers):
    """Extract features at specified VGG layers."""
    features = {}
    current = x
    layer_idx = 0
    for module in model.features:
        current = module(current)
        if isinstance(module, nn.ReLU):
            name = f"relu{layer_idx}"
            if name in layers:
                features[name] = current
            layer_idx += 1
    return features


def style_transfer_loss(generated,
                        content_target,
                        style_target,
                        vgg,
                        alpha=1, beta=1e5):
    """Combined content + style loss."""
    content_layers = ['relu8']
    style_layers = ['relu0', 'relu2', 'relu4',
                    'relu8', 'relu12']
    all_layers = set(content_layers + style_layers)

    gen_feat = extract_features(
        vgg, generated, all_layers)
    con_feat = extract_features(
        vgg, content_target, all_layers)
    sty_feat = extract_features(
        vgg, style_target, all_layers)

    content_loss = sum(
        F.mse_loss(gen_feat[l], con_feat[l])
        for l in content_layers
        if l in gen_feat and l in con_feat
    )
    style_loss = sum(
        F.mse_loss(
            gram_matrix(gen_feat[l]),
            gram_matrix(sty_feat[l]))
        for l in style_layers
        if l in gen_feat and l in sty_feat
    )
    return alpha * content_loss + beta * style_loss

The optimization works differently from normal training: instead of updating model weights, you update the image itself. Start with the content image (or random noise), compute the combined loss, backpropagate through the frozen VGG network, and update the pixel values via gradient descent. The result is an image that has the spatial structure of the content image but the textures and color palette of the style image.
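The "optimize the image, not the weights" loop can be sketched in a few lines. This is a minimal illustration, not the Gatys setup: a small frozen random conv net stands in for VGG (my assumption, so the sketch runs without downloading weights), the images are random tensors, and the style weight of 1e4 is picked for this toy only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for VGG: a small frozen random conv net (assumption)
extractor = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
for p in extractor.parameters():
    p.requires_grad_(False)


def gram(feat):
    """Channel-by-channel feature correlations."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)


content = torch.rand(1, 3, 32, 32)
style = 0.8 + 0.2 * torch.rand(1, 3, 32, 32)  # brighter, flatter
content_feat = extractor(content)
style_gram = gram(extractor(style))

# The image is the only tensor being optimized; start from noise
image = torch.rand(1, 3, 32, 32).requires_grad_(True)
optimizer = torch.optim.Adam([image], lr=0.02)

losses = []
for step in range(60):
    optimizer.zero_grad()
    feat = extractor(image)
    loss = (F.mse_loss(feat, content_feat)
            + 1e4 * F.mse_loss(gram(feat), style_gram))
    loss.backward()          # gradients flow to the pixels
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0.0, 1.0)   # keep pixels in a valid range
    losses.append(loss.item())

print(f"loss: {losses[0]:.6f} -> {losses[-1]:.6f}")
```

With a real pretrained VGG and real images the structure stays the same; only the feature extractor, the layer selection, and the loss weights change.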

The beta parameter (style weight) relative to alpha (content weight) controls the tradeoff. High beta produces output that looks more like the style image at the expense of recognizability. Low beta preserves content faithfully but the style becomes subtle. The typical ratio is around 1e5:1, which gives a good balance for artistic style transfer.

Fast style transfer (Johnson et al., 2016) trains a feedforward network to apply a specific style in a single forward pass, trading generality for speed. AdaIN (Adaptive Instance Normalization, Huang and Belongie, 2017) enables arbitrary style transfer in real-time by adjusting the channel-wise mean and variance of the content features to match those of the style features -- it's essentially transferring the first- and second-order statistics of the style onto the content.
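The AdaIN operation itself fits in one function. A minimal sketch in NumPy (for brevity; in a real model this would sit between a VGG encoder and a learned decoder), with random arrays standing in for the feature maps:

```python
import numpy as np


def adain(content_feat, style_feat, eps=1e-5):
    """Adaptive Instance Normalization on (C, H, W) feature
    maps: normalize each content channel, then rescale it to
    the style channel's mean and standard deviation."""
    c_mean = content_feat.mean(axis=(1, 2), keepdims=True)
    c_std = content_feat.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style_feat.mean(axis=(1, 2), keepdims=True)
    s_std = style_feat.std(axis=(1, 2), keepdims=True)
    return s_std * (content_feat - c_mean) / c_std + s_mean


rng = np.random.default_rng(0)
content_feat = rng.normal(0.0, 1.0, size=(8, 16, 16))
style_feat = rng.normal(3.0, 2.0, size=(8, 16, 16))

out = adain(content_feat, style_feat)

# Per-channel statistics now follow the style features
mean_match = np.allclose(out.mean(axis=(1, 2)),
                         style_feat.mean(axis=(1, 2)))
std_match = np.allclose(out.std(axis=(1, 2)),
                        style_feat.std(axis=(1, 2)), atol=1e-3)
print(mean_match, std_match)
```

The spatial pattern inside each channel is untouched (it is only shifted and scaled), which is why the content layout survives while the "look" changes.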

Image-to-image translation: pix2pix

Pix2pix (Isola et al., 2017) learns to translate between paired image domains: sketch -> photo, day -> night, aerial view -> map, edges -> handbag. It uses a conditional GAN (episode #55) where the generator receives the source image and produces the target domain version, while the discriminator judges whether the output looks realistic given the input:

import torch
import torch.nn as nn


class Pix2PixGenerator(nn.Module):
    """U-Net generator for pix2pix. Encoder-decoder
    with skip connections that pass fine detail
    directly from input to output."""

    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        # Encoder
        self.enc1 = nn.Conv2d(
            in_ch, base, 4, stride=2, padding=1)
        self.enc2 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1),
            nn.BatchNorm2d(base * 2))
        self.enc3 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1),
            nn.BatchNorm2d(base * 4))
        self.enc4 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, base * 8, 4, 2, 1))

        # Decoder with skip connections
        self.dec4 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 8, base * 4, 4, 2, 1),
            nn.BatchNorm2d(base * 4))
        self.dec3 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 8, base * 2, 4, 2, 1),
            nn.BatchNorm2d(base * 2))
        self.dec2 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 4, base, 4, 2, 1),
            nn.BatchNorm2d(base))
        self.dec1 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 2, out_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        # Skip connections: cat encoder features
        # with decoder features
        d3 = self.dec3(torch.cat([d4, e3], dim=1))
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return d1


gen = Pix2PixGenerator()
sketch = torch.randn(1, 3, 256, 256)
photo = gen(sketch)
print(f"Input:  {sketch.shape}")
print(f"Output: {photo.shape}")

The skip connections are critical here -- without them, the decoder would have to reconstruct every spatial detail from the compressed bottleneck, which inevitably loses fine edges, textures, and precise alignments. With skip connections, the decoder receives the input's spatial detail directly and only needs to transform it rather than reconstruct it. This is the same principle we saw in segmentation U-Nets (episode #80) and the diffusion U-Net (episode #84).

Pix2pix requires paired training data: you need the SAME scene in both domains. That's easy for synthetic pairs (extract edges from photos, then train the reverse) but impossible for many real-world translations. You can't get paired horse-zebra photos of the exact same animal in the exact same pose.
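Paired data is also what makes pix2pix's reconstruction term possible: the original paper combines the conditional GAN loss with an L1 loss against the ground-truth target, weighted by 100. A hedged sketch of the generator-side objective follows; the tiny discriminator is hypothetical and only there so the shapes can be checked, and `pix2pix_generator_loss` is my own name for it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pix2pix_generator_loss(disc, source, fake, target,
                           lambda_l1=100.0):
    """Adversarial term (fool the conditional discriminator)
    plus an L1 term against the paired ground truth."""
    # The discriminator sees source and output together
    logits = disc(torch.cat([source, fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(
        logits, torch.ones_like(logits))
    l1 = F.l1_loss(fake, target)
    return adv + lambda_l1 * l1, adv, l1


# Hypothetical tiny patch discriminator, for shape-checking only
disc = nn.Sequential(
    nn.Conv2d(6, 16, 4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(16, 1, 4, stride=2, padding=1),
)

source = torch.randn(2, 3, 64, 64)
target = torch.randn(2, 3, 64, 64)
fake = torch.tanh(torch.randn(2, 3, 64, 64))

total, adv, l1 = pix2pix_generator_loss(disc, source, fake, target)
print(total.item(), adv.item(), l1.item())
```

Without a paired `target` there is nothing to compute the L1 term against, which is exactly the gap CycleGAN fills with cycle consistency.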

CycleGAN: unpaired translation

CycleGAN (Zhu et al., 2017) removes the paired data requirement entirely. It learns to translate between domains using only unpaired examples: a collection of horse photos and a separate collection of zebra photos, with no correspondences between them.

The key constraint is cycle consistency: if you translate a horse photo to zebra, then translate the result back to horse, you should get the original photo back. And vice versa. This prevents the model from producing arbitrary outputs that happen to look like zebras but have no relationship to the input horse:

import torch
import torch.nn.functional as F


def cyclegan_losses(real_A, real_B,
                    G_AB, G_BA,
                    D_A, D_B,
                    lambda_cycle=10.0):
    """Full CycleGAN loss computation.
    Two generators, two discriminators."""
    # Forward cycle: A -> B -> A
    fake_B = G_AB(real_A)
    recon_A = G_BA(fake_B)
    cycle_loss_A = F.l1_loss(recon_A, real_A)

    # Backward cycle: B -> A -> B
    fake_A = G_BA(real_B)
    recon_B = G_AB(fake_A)
    cycle_loss_B = F.l1_loss(recon_B, real_B)

    cycle_loss = lambda_cycle * (
        cycle_loss_A + cycle_loss_B)

    # GAN losses: fake images should fool
    # discriminators (each one is run only once)
    pred_fake_B = D_B(fake_B)
    pred_fake_A = D_A(fake_A)
    gan_loss_AB = F.mse_loss(
        pred_fake_B, torch.ones_like(pred_fake_B))
    gan_loss_BA = F.mse_loss(
        pred_fake_A, torch.ones_like(pred_fake_A))

    # Identity loss (optional but helps preserve
    # color): G_AB(real_B) should look like real_B
    ident_B = G_AB(real_B)
    ident_A = G_BA(real_A)
    identity_loss = 0.5 * (
        F.l1_loss(ident_B, real_B)
        + F.l1_loss(ident_A, real_A))

    total = (gan_loss_AB + gan_loss_BA
             + cycle_loss + identity_loss * 5.0)
    return total, {
        "cycle": cycle_loss.item(),
        "gan": (gan_loss_AB + gan_loss_BA).item(),
        "identity": identity_loss.item(),
    }

CycleGAN requires training four networks: two generators (A->B, B->A) and two discriminators (one per domain). That's considerably more expensive than pix2pix, but the freedom from paired data makes it dramatically more versatile. The classic demos include horse<->zebra, summer<->winter, photo<->painting, and apple<->orange.

The identity loss term is a subtle addition: when you feed a real zebra to the horse->zebra generator, it should return the image unchanged (it's already a zebra). This prevents the generator from making unnecessary changes and helps preserve colors that are already correct.

Super-resolution: enhancing detail

Super-resolution upscales low-resolution images while adding realistic detail that the model invents based on learned priors about what real images look like at high resolution. A 128x128 image becomes a convincing 512x512 image -- not just bilinearly interpolated (which gives blurry upscaling), but with sharp edges, realistic textures, and fine details:

from diffusers import StableDiffusionUpscalePipeline
import torch
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("small_photo.png")
upscaled = pipe(
    prompt="high quality detailed photograph",
    image=low_res,
    num_inference_steps=20,
).images[0]
upscaled.save("upscaled_4x.png")

The classic approach, SRGAN (Super-Resolution GAN, Ledig et al., 2017), trains a generator to upscale and a discriminator to distinguish real high-res from fake upscaled images. The critical innovation was using perceptual loss -- comparing VGG features rather than raw pixels. MSE pixel loss produces blurry results because the "average" of all possible high-resolution versions of a low-res image is, well, blurry. Perceptual loss encourages the model to produce images that look realistic to a pretrained feature extractor, even if the specific details it invents don't match the ground truth exactly.
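The "average of all plausible answers is blurry" argument is easy to check numerically. A toy 1D sketch (my own construction, not from the SRGAN paper): two equally plausible sharp edges at different positions, and the pixel MSE of three candidate predictions against both.

```python
import numpy as np

# Two equally plausible sharp 1D "high-res" signals: the same
# step edge, placed at two different positions
n = 16
edge_a = np.where(np.arange(n) < 6, 0.0, 1.0)
edge_b = np.where(np.arange(n) < 10, 0.0, 1.0)
candidates = np.stack([edge_a, edge_b])


def expected_mse(prediction):
    """Average pixel MSE against both plausible targets."""
    return float(np.mean((candidates - prediction) ** 2))


# The mean of the two edges: a 0.5 plateau between them
blurry = candidates.mean(axis=0)

print(f"sharp guess A: {expected_mse(edge_a):.4f}")
print(f"sharp guess B: {expected_mse(edge_b):.4f}")
print(f"blurry mean:   {expected_mse(blurry):.4f}")
```

The smeared-out average scores 0.0625 against 0.125 for either sharp guess, so a model trained purely on pixel MSE is rewarded for hedging. A perceptual or adversarial loss changes what "good" means so that committing to one sharp edge is no longer penalized as heavily.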

Modern super-resolution increasingly uses diffusion. The approach: take the low-res image, upscale it with bilinear interpolation to the target size, add noise, then denoise. The denoising process fills in the missing high-frequency detail -- sharpening edges, adding texture, refining fine structures. The prompt can guide what kind of detail gets added ("photograph" vs "oil painting" vs "pencil drawing"), which is something SRGAN couldn't do at all.

Diffusion-based image editing: the strength parameter

The most flexible approach to image editing today uses the img2img pipeline from Stable Diffusion. The core idea: take a real image, encode it into latent space, add some noise (not all the way to pure noise -- just partially), then denoise with a new text prompt. The amount of noise controls how much the image changes:

from diffusers import StableDiffusionImg2ImgPipeline
import torch
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

original = Image.open("photo_of_house.png").resize(
    (512, 512))
edited = pipe(
    prompt="a house covered in snow, "
           "winter scene, frost on windows",
    image=original,
    strength=0.6,
    num_inference_steps=30,
).images[0]
edited.save("house_winter.png")

The strength parameter is the critical control. It determines how much noise is added to the encoded image and, with it, how many of the scheduled denoising steps actually run: strength=0.6 with 30 steps noises the latent to roughly 60% of the way up the noise schedule, skips the first 12 steps, and runs the remaining 18. Lower strength preserves more of the original, higher strength allows more radical changes:

strength=0.2-0.3: subtle modifications. Color shifts, slight texture changes, minor atmospheric adjustments. The composition and structure are essentially unchanged.
strength=0.5-0.6: moderate edits. Weather changes, time-of-day shifts, material swaps. Overall composition preserved but significant details can change.
strength=0.7-0.8: major transformation. The subject and rough layout survive but fine details are largely regenerated.
strength=0.9-1.0: near-complete regeneration. The prompt dominates almost entirely. At 1.0 you're basically doing txt2img with a very weak hint from the input.
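The step bookkeeping behind strength is plain arithmetic. The helper below is a sketch of how, as far as I know, the diffusers img2img pipeline derives the number of steps it runs; `img2img_schedule` is my own name, not a library function.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength),
                    num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


for s in [0.2, 0.5, 0.6, 0.8, 1.0]:
    skipped, run = img2img_schedule(30, s)
    print(f"strength={s:.1f}: skip {skipped:>2}, "
          f"run {run:>2} of 30")
```

One practical consequence: at low strength only a handful of steps execute, so it can be worth raising num_inference_steps when making subtle edits.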

Finding the right strength for a specific edit is genuinely an art. Too low and nothing changes. Too high and you lose the original image entirely. I usually start at 0.5 and adjust up or down depending on whether the edit is too subtle or too destructive ;-)

InstructPix2Pix: editing with natural language

InstructPix2Pix (Brooks et al., 2023) takes a different approach: instead of providing a replacement prompt that describes the entire output image, you provide an editing instruction. "Make it winter", "Add sunglasses to the person", "Change the car color to red", "Make the dog look surprised". The model learns to apply the edit while preserving everything else:

from diffusers import (
    StableDiffusionInstructPix2PixPipeline)
import torch
from PIL import Image

pipe = (StableDiffusionInstructPix2PixPipeline
    .from_pretrained(
        "timbrooks/instruct-pix2pix",
        torch_dtype=torch.float16)
    .to("cuda"))

image = Image.open("summer_garden.png").resize(
    (512, 512))
edited = pipe(
    prompt="make it a snowy winter scene",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]
edited.save("winter_garden.png")

InstructPix2Pix has two guidance scales instead of one: guidance_scale controls how strongly the text instruction is followed (same as CFG from episode #85), and image_guidance_scale controls how much the original image is preserved. High image guidance = stay close to the original, change less. Low image guidance = more freedom to make dramatic changes.
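The two scales combine three noise predictions per step rather than the usual two. Here is a sketch of that combination as described in the InstructPix2Pix paper, using random vectors in place of real U-Net outputs (the function and variable names are mine):

```python
import numpy as np


def instruct_pix2pix_guidance(eps_uncond, eps_image, eps_full,
                              image_scale, text_scale):
    """Combine three noise predictions:
    eps_uncond: no image, no text
    eps_image:  image conditioning only
    eps_full:   image + text instruction"""
    return (eps_uncond
            + image_scale * (eps_image - eps_uncond)
            + text_scale * (eps_full - eps_image))


rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal(16)
eps_image = eps_uncond + 0.3 * rng.standard_normal(16)
eps_full = eps_image + 0.3 * rng.standard_normal(16)

# With both scales at 1.0 the result is just the fully
# conditioned prediction
plain = instruct_pix2pix_guidance(
    eps_uncond, eps_image, eps_full, 1.0, 1.0)
print(np.allclose(plain, eps_full))

# The scales from the pipeline call above
guided = instruct_pix2pix_guidance(
    eps_uncond, eps_image, eps_full, 1.5, 7.0)
print(np.linalg.norm(guided - eps_uncond))
```

Each scale amplifies its own direction: the image term pushes toward "consistent with the input picture", the text term pushes toward "consistent with the instruction", and tuning them is a tug-of-war between the two.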

The training data for InstructPix2Pix was generated synthetically: take an image caption, use a language model to generate editing instructions AND modified captions, then use Stable Diffusion to generate paired before/after images from the original and modified captions. This is a clever bootstrapping approach -- you don't need humans to manually edit millions of images, you use existing generative models to create the training data for a new editing model.

Building a practical editing pipeline

In practice, you often want to combine multiple techniques into a pipeline. Here's a class that composes different editing operations:

import torch


class ImageEditingPipeline:
    """Combine multiple image editing techniques
    into a reusable pipeline."""

    def __init__(self, device="cuda"):
        self.device = device
        self.dtype = torch.float16

    def compute_edit_strength(self, edit_type):
        """Suggest appropriate strength for
        different edit types."""
        suggestions = {
            "color_shift": 0.3,
            "weather_change": 0.55,
            "style_transfer": 0.65,
            "object_removal": 0.7,
            "scene_change": 0.8,
        }
        return suggestions.get(edit_type, 0.5)

    def create_mask_from_bbox(self, image_size,
                              bbox):
        """Create a binary mask from a bounding box
        (x1, y1, x2, y2) for inpainting."""
        import numpy as np
        from PIL import Image as PILImage
        mask = np.zeros(image_size[::-1],
                        dtype=np.uint8)
        x1, y1, x2, y2 = bbox
        mask[y1:y2, x1:x2] = 255
        return PILImage.fromarray(mask)

    def feather_mask(self, mask_array, radius=10):
        """Apply Gaussian blur to mask edges for
        smoother inpainting boundaries."""
        import numpy as np
        from scipy.ndimage import gaussian_filter
        feathered = gaussian_filter(
            mask_array.astype(np.float32),
            sigma=radius)
        # Renormalize to 0-255 (guard against an
        # all-zero mask, where max() would be 0)
        peak = max(float(feathered.max()), 1e-8)
        feathered = (feathered / peak
                     * 255).astype(np.uint8)
        return feathered

    def log_operation(self, operation, params,
                      success):
        """Log editing operations for
        reproducibility."""
        import json
        entry = {
            "operation": operation,
            "params": params,
            "success": success,
        }
        print(json.dumps(entry, indent=2))


pipeline = ImageEditingPipeline()

# Example: suggest strengths for different edits
for edit in ["color_shift", "weather_change",
             "style_transfer", "object_removal",
             "scene_change"]:
    s = pipeline.compute_edit_strength(edit)
    print(f"{edit:<20} -> strength={s:.2f}")
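Why bother feathering a mask at all? A rough sketch, assuming the inpainted result is pasted back over the original with a simple alpha blend (that compositing step is my assumption, not something the class above does). To keep it dependency-free, the blur here is a small NumPy-only separable Gaussian standing in for scipy's `gaussian_filter`, and the "images" are flat grayscale arrays:

```python
import numpy as np


def gaussian_kernel_1d(sigma):
    """Normalized 1D Gaussian kernel covering about +/- 3 sigma."""
    half = max(1, int(3 * sigma))
    x = np.arange(-half, half + 1, dtype=np.float32)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()


def feather(mask, sigma=4.0):
    """Blur a 0/255 mask; returns floats in [0, 1]."""
    k = gaussian_kernel_1d(sigma)
    m = mask.astype(np.float32) / 255.0
    # Separable blur: rows first, then columns (zero padding)
    m = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode="same"), 1, m)
    m = np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="same"), 0, m)
    return np.clip(m, 0.0, 1.0)


def composite(original, generated, soft_mask):
    """1.0 in the mask takes the generated pixel,
    0.0 keeps the original pixel."""
    return soft_mask * generated + (1.0 - soft_mask) * original


# Toy images: dark original, bright "generated" content
original = np.zeros((64, 64), dtype=np.float32)
generated = np.ones((64, 64), dtype=np.float32)

# Hard mask from a bbox (x1, y1, x2, y2) = (16, 16, 48, 48)
hard = np.zeros((64, 64), dtype=np.uint8)
hard[16:48, 16:48] = 255

hard_result = composite(original, generated, hard / 255.0)
soft_result = composite(original, generated, feather(hard))

# Largest jump between neighboring pixels along the middle row
hard_jump = np.abs(np.diff(hard_result[32])).max()
soft_jump = np.abs(np.diff(soft_result[32])).max()
print(f"hard mask max step: {hard_jump:.2f}")
print(f"soft mask max step: {soft_jump:.2f}")
```

The hard mask produces a full-intensity jump at the seam, while the feathered mask spreads the same transition over many pixels, which is what hides the boundary in a real edit.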

The key insight for building real editing systems: each technique has its sweet spot. Inpainting is best for localized changes (remove this object, replace this region). Style transfer works for global aesthetic changes. Img2img is best for scene-level transformations (day to night, summer to winter). Super-resolution is for enhancement without semantic changes. InstructPix2Pix bridges the gap between targeted edits and scene-level changes by letting you describe the edit in natural language.

All of these techniques are converging. Modern systems like SDXL editing pipelines, Adobe Firefly, and research prototypes increasingly combine them into unified models that can inpaint, restyle, translate, upscale, and edit -- all through a single interface controlled by text prompts and masks. The underlying machinery is the same conditional diffusion framework we've been building across episodes #84-86, just with different conditioning signals plugged in.

In summary

Inpainting fills masked regions using a modified diffusion process where unmasked pixels are forced to match the original at each denoising step; the text prompt guides what gets generated in the masked area;
neural style transfer separates content (feature maps) from style (Gram matrices) using a pretrained CNN like VGG; optimization updates the image pixels while keeping the network frozen; fast variants (AdaIN) enable real-time arbitrary style application;
pix2pix translates between paired image domains using a conditional GAN with a U-Net generator and skip connections; CycleGAN handles unpaired domains through cycle consistency constraints (A->B->A should reconstruct A);
super-resolution adds realistic high-frequency detail using perceptual and adversarial losses (GAN-based methods) or diffusion-based denoising; perceptual loss (comparing VGG features, not pixels) was the key innovation over MSE-based approaches that produce blurry results;
diffusion img2img editing controls change intensity through the strength parameter: more noise = more timesteps denoised = more change; finding the right strength for a specific edit is an iterative process;
InstructPix2Pix adds natural language editing instructions with separate guidance scales for text adherence and image preservation, trained on synthetically generated before/after pairs;
all image editing tasks are converging toward a single paradigm: conditional diffusion with various forms of guidance -- masks for inpainting, images for img2img, text for prompting, structure maps for ControlNet (episode #85).
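As a quick arithmetic recap of the strength parameter, here is a minimal sketch of how strength maps to the number of denoising steps that actually run. The assumed convention (an assumption about common img2img implementations, not a specific library call) is that the input is noised to the level reached after `strength` of the schedule, and only that final portion of the steps is executed. `steps_for_strength` is a hypothetical helper name.

```python
def steps_for_strength(strength, num_inference_steps=50):
    """Return (steps_skipped, steps_run) for an img2img edit.

    Assumed convention: only the last `strength` fraction of
    the scheduled denoising steps is executed.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength),
                    num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


for s in [0.3, 0.55, 0.8, 1.0]:
    skipped, run = steps_for_strength(s)
    print(f"strength={s:.2f} -> skip {skipped:2d}, "
          f"run {run:2d} of 50")
```

Low strength means few steps run and little changes; strength 1.0 runs the full schedule and the result barely depends on the input image.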

The vision section of this series has been quite a journey -- from basic image processing through detection, segmentation, pose estimation, OCR, video understanding, and now generative and editing models. There's still fascinating ground to cover in how machines understand the visual world from different angles, including the geometry and structure of three-dimensional scenes.

Exercises

Exercise 1: Build a mask quality analyzer for inpainting. Create a class InpaintMaskAnalyzer that: (a) takes a binary mask as a 2D numpy array (0 = keep, 255 = replace), (b) computes mask statistics: total area (number of masked pixels), coverage ratio (masked / total pixels), bounding box of the masked region, aspect ratio of the bounding box, (c) computes a "boundary complexity" score: the number of masked pixels that have at least one unmasked neighbor (4-connectivity) divided by the total masked pixel count -- this measures how jagged/complex the mask boundary is (a large square mask has low complexity, a fractal mask has high complexity), (d) implements classify_mask(mask) that categorizes the mask as "small_patch" (coverage < 5%), "medium_region" (5-25%), "large_area" (25-50%), or "reconstruction" (> 50%), and recommends an appropriate number of diffusion steps for each category (more steps for larger masks), (e) generates 4 test masks on a 256x256 canvas: a 30x30 centered square, a 100x100 centered square, a horizontal stripe (full width, 80px tall), and a checkerboard (alternating 16x16 blocks). Print the analysis for each mask and verify that boundary complexity is low for the solid masks (the two squares and the stripe, with the larger shapes scoring lowest) and highest for the checkerboard.

Exercise 2: Build a style transfer weight explorer. Create a class StyleWeightExplorer that: (a) generates synthetic "content features" and "style features" as random tensors of shape (1, 64, 32, 32), (b) implements gram_matrix(features) that computes the Gram matrix (episode formula: flatten spatial dims, then F * F^T, normalized by CHW), (c) implements content_loss(gen_features, content_features) as MSE between feature maps and style_loss(gen_features, style_features) as MSE between their Gram matrices, (d) for a grid of (alpha, beta) weight pairs -- alpha in [1, 10, 100] and beta in [1e3, 1e4, 1e5, 1e6] -- computes the combined loss alpha * content_loss + beta * style_loss using a fixed "generated" feature tensor (the content features plus a small amount of random noise, so the content loss starts small but nonzero; with an exact copy the content loss would be 0 and the style dominance ratio would always be 1), (e) prints a table showing: for each (alpha, beta) pair, the content loss contribution, the style loss contribution, the total loss, and the "style dominance ratio" (style contribution / total). Identify which (alpha, beta) pairs give style dominance between 0.4 and 0.6 (the balanced sweet spot) and which are heavily style-dominated (> 0.9) or heavily content-dominated (< 0.1).

Exercise 3: Build a diffusion strength calibrator. Create a class StrengthCalibrator that: (a) implements a toy 1D "image editing" scenario: x0_original is a known signal of length 128, x0_target is a modified version (original with a sine wave added to simulate an edit), (b) implements simulate_edit(x0, strength, T=1000, num_steps=50) that: encodes x0 into "latent" (just the signal itself for this toy model), adds noise at timestep t_start = int(strength * T) using a linear schedule (higher strength means more noise, consistent with the strength parameter described in this episode), then runs DDIM-style denoising from t_start to 0 using a "perfect oracle" that always predicts the correct noise (so any error comes from discretization and the strength setting, not model quality), (c) for strengths [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], computes: MSE between the edited result and x0_original (should increase with strength -- more noise destroys more of the original), MSE between the edited result and x0_target (should decrease initially then plateau -- moderate strength finds the target), and a "preservation score" (1 - normalized MSE to original) and "transformation score" (1 - normalized MSE to target), (d) prints a table and identifies the strength that gives the best balance (closest to equal preservation and transformation scores). Verify that strength=0.0 (if you tested it) would perfectly preserve the original, and strength=1.0 almost completely destroys it.

Thanks, and until next time!

@scipio
