Learn AI Series (#86) - Image-to-Image and Editing

What will I learn
- You will learn inpainting: filling in masked regions of images with contextually appropriate content using diffusion models;
- outpainting: extending images beyond their original borders;
- neural style transfer: separating and recombining content and style from different images using Gram matrices;
- super-resolution: enhancing image quality and upscaling low-resolution images with perceptual losses;
- image-to-image translation: pix2pix for paired domains and CycleGAN for unpaired domain translation;
- diffusion-based image editing: controlling edit strength with the noise parameter;
- InstructPix2Pix: natural language editing instructions applied to existing images;
- building a practical image editing pipeline that combines multiple techniques.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant
- Learn AI Series (#77) - Image Processing Fundamentals
- Learn AI Series (#78) - Object Detection (Part 1) - Foundations
- Learn AI Series (#79) - Object Detection (Part 2) - Modern Approaches
- Learn AI Series (#80) - Image Segmentation
- Learn AI Series (#81) - Pose Estimation and Tracking
- Learn AI Series (#82) - Optical Character Recognition
- Learn AI Series (#83) - Video Understanding
- Learn AI Series (#84) - Generative Images - Diffusion Models (Part 1)
- Learn AI Series (#85) - Generative Images - Diffusion Models (Part 2)
- Learn AI Series (#86) - Image-to-Image and Editing (this post)
Solutions to Episode #85 Exercises
Exercise 1: VAE compression analyzer.
import numpy as np


class VAECompressionAnalyzer:
    """Analyze VAE compression ratios and memory
    savings across different image resolutions."""

    def __init__(self, latent_ch=4,
                 downsample_factor=8):
        self.latent_ch = latent_ch
        self.factor = downsample_factor

    def analyze(self, resolutions):
        results = []
        for res in resolutions:
            pixel_size = res * res * 3
            lat_h = res // self.factor
            lat_w = res // self.factor
            latent_size = lat_h * lat_w * self.latent_ch
            ratio = pixel_size / latent_size
            # float32 = 4 bytes per value
            pixel_mb = pixel_size * 4 / (1024 ** 2)
            latent_mb = latent_size * 4 / (1024 ** 2)
            saved_mb = pixel_mb - latent_mb
            # DDIM steps in same memory budget
            steps_in_budget = pixel_size / latent_size
            results.append({
                "res": res,
                "pixel_size": pixel_size,
                "latent_size": latent_size,
                "ratio": ratio,
                "pixel_mb": pixel_mb,
                "latent_mb": latent_mb,
                "saved_mb": saved_mb,
                "steps_in_budget": steps_in_budget,
            })
        return results

    def print_table(self, resolutions):
        results = self.analyze(resolutions)
        print(f"{'Res':>6} {'Pixels':>10} "
              f"{'Latent':>10} {'Ratio':>7} "
              f"{'Saved(MB)':>10} {'Steps':>6}")
        print("-" * 55)
        for r in results:
            print(f"{r['res']:>6} "
                  f"{r['pixel_size']:>10,} "
                  f"{r['latent_size']:>10,} "
                  f"{r['ratio']:>7.1f}x "
                  f"{r['saved_mb']:>10.2f} "
                  f"{r['steps_in_budget']:>6.0f}")
        # Verify constant ratio (computed outside the
        # f-string so it also parses before Python 3.12)
        ratios = [r["ratio"] for r in results]
        all_equal = all(abs(r - ratios[0]) < 0.01
                        for r in ratios)
        print(f"\nAll ratios equal: {all_equal}")


analyzer = VAECompressionAnalyzer()
analyzer.print_table([256, 512, 768, 1024])
The compression ratio is 48x across all resolutions -- exactly as expected since both numerator (H * W * 3) and denominator (H/8 * W/8 * 4) scale with H * W, so the ratio (3 * 64) / 4 = 48 is resolution-independent. The absolute memory savings grow quadratically though: going from 256 to 512 (2x resolution) quadruples the pixel count, so the savings in MB also roughly quadruple. The "steps in budget" column shows that in the memory needed for one pixel-space forward pass, you could run 48 DDIM steps in latent space -- nearly a complete generation cycle.
Exercise 2: CFG scale simulator.
import numpy as np


class CFGSimulator:
    """Simulate classifier-free guidance on
    synthetic 1D features."""

    def __init__(self, dim=100, seed=42):
        rng = np.random.RandomState(seed)
        self.unconditional = rng.randn(dim)
        self.signal = rng.randn(dim) * 0.5
        self.conditional = (self.unconditional
                            + self.signal)

    def apply_cfg(self, unconditional,
                  conditional, scale):
        return (unconditional
                + scale
                * (conditional - unconditional))

    def cosine_sim(self, a, b):
        dot = np.dot(a, b)
        na = np.linalg.norm(a)
        nb = np.linalg.norm(b)
        return dot / max(na * nb, 1e-12)

    def run(self):
        scales = [1.0, 3.0, 5.0, 7.5,
                  10.0, 15.0, 20.0]
        cond_norm = np.linalg.norm(
            self.conditional)
        print(f"{'Scale':>6} {'CosSim':>8} "
              f"{'L2 Norm':>9} {'Amplify':>8}")
        print("-" * 35)
        artifact_scale = None
        for s in scales:
            guided = self.apply_cfg(
                self.unconditional,
                self.conditional, s)
            cs = self.cosine_sim(
                guided, self.signal)
            norm = np.linalg.norm(guided)
            # Signal component in guided output
            guided_signal = (guided
                             - self.unconditional)
            raw_signal = (self.conditional
                          - self.unconditional)
            amp = (np.linalg.norm(guided_signal)
                   / max(np.linalg.norm(raw_signal),
                         1e-12))
            print(f"{s:>6.1f} {cs:>8.4f} "
                  f"{norm:>9.2f} {amp:>8.2f}x")
            if (artifact_scale is None
                    and norm > 3 * cond_norm):
                artifact_scale = s
        if artifact_scale:
            print(f"\nArtifact threshold "
                  f"(3x norm): scale={artifact_scale}")


sim = CFGSimulator()
sim.run()
Cosine similarity to the signal increases with scale -- the guided output aligns more closely with the text direction. But the L2 norm grows linearly without bound (because the CFG formula is a linear extrapolation). The amplification factor equals the scale itself (by construction -- scale * signal has scale times the norm of signal). The artifact threshold typically hits around scale 5-10, which matches real-world experience where scales above 10-12 start producing oversaturated, artifact-heavy images.
Exercise 3: DDIM step quality estimator.
import numpy as np


class DDIMStepEstimator:
    """Estimate DDIM reconstruction quality
    as a function of step count."""

    def __init__(self, T=1000, dim=256, seed=42):
        rng = np.random.RandomState(seed)
        self.T = T
        betas = np.linspace(1e-4, 0.02, T)
        alphas = 1.0 - betas
        self.alpha_bars = np.cumprod(alphas)
        self.x0 = rng.randn(dim)

    def ddim_reconstruct(self, x0, num_steps):
        # Add noise at t=T-1
        ab_T = self.alpha_bars[-1]
        noise = np.random.RandomState(123).randn(
            len(x0))
        xT = (np.sqrt(ab_T) * x0
              + np.sqrt(1 - ab_T) * noise)
        # Select timestep subset
        step_size = self.T // num_steps
        timesteps = list(
            range(0, self.T, step_size))[::-1]
        x = xT.copy()
        for i, t in enumerate(timesteps):
            ab_t = self.alpha_bars[t]
            # Perfect noise predictor
            pred_noise = (
                (x - np.sqrt(ab_t) * x0)
                / np.sqrt(1 - ab_t + 1e-12))
            # Predict x0
            pred_x0 = (
                (x - np.sqrt(1 - ab_t) * pred_noise)
                / np.sqrt(ab_t + 1e-12))
            pred_x0 = np.clip(pred_x0, -3, 3)
            if i + 1 < len(timesteps):
                ab_prev = self.alpha_bars[
                    timesteps[i + 1]]
            else:
                ab_prev = 1.0
            dir_xt = (np.sqrt(1 - ab_prev)
                      * pred_noise)
            x = np.sqrt(ab_prev) * pred_x0 + dir_xt
        return x

    def run(self):
        step_counts = [5, 10, 20, 50, 100,
                       200, 500, 1000]
        results = []
        for ns in step_counts:
            recon = self.ddim_reconstruct(
                self.x0, ns)
            mse = np.mean((recon - self.x0) ** 2)
            results.append((ns, mse))
        ref_mse = results[-1][1]
        print(f"{'Steps':>6} {'MSE':>12} "
              f"{'Relative':>10}")
        print("-" * 30)
        prev_mse = None
        knee = None
        for ns, mse in results:
            rel = mse / max(ref_mse, 1e-15)
            print(f"{ns:>6} {mse:>12.6f} "
                  f"{rel:>10.2f}x")
            if (prev_mse is not None
                    and knee is None
                    and prev_mse > 0):
                improvement = (
                    (prev_mse - mse) / prev_mse)
                if improvement < 0.10:
                    knee = ns
            prev_mse = mse
        if knee:
            print(f"\nKnee: {knee} steps "
                  f"(<10% improvement from "
                  f"doubling)")


estimator = DDIMStepEstimator()
estimator.run()
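With a perfect noise predictor, the MSE is extremely small at all step counts. Keep in mind what this toy setup can and cannot show: because the oracle predictor is computed from the true x0, the pred_x0 term recovers x0 at every step (up to the clipping to [-3, 3] and floating-point error), so the discretization error that a learned, imperfect predictor accumulates when you take fewer, larger steps barely shows up here. That also means the reported knee point depends on tiny numerical differences and should not be over-interpreted. In real models it is exactly that accumulated error that gives diminishing returns as you add steps, which is why roughly 50-step DDIM is a common practical choice -- most of the quality at about 5% of the cost of the full 1000 steps.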
On to today's episode
Welcome back! In episodes #84 and #85 we built the complete text-to-image pipeline: DDPM for denoising in pixel space, then latent diffusion with VAE compression, CLIP text conditioning, classifier-free guidance, DDIM sampling, and ControlNet for structural control. That's the machinery for generating images from scratch -- you start from pure Gaussian noise and iteratively denoise until a photorealistic image materializes out of nothing.
But here's the thing most people actually NEED from generative models: not creating images from scratch, but editing existing ones. Remove that person photobombing your vacation picture. Change the color of a house. Upscale a blurry old family photo. Convert a pencil sketch into a watercolor painting. Turn a summer landscape into winter. Every one of these tasks is a form of conditional image generation -- given an input image (and possibly some instruction), produce a modified version that preserves what should stay and transforms what should change.
The techniques in this episode span roughly a decade of research, from neural style transfer (Gatys et al., 2015) through pix2pix and CycleGAN (both 2017) to modern diffusion-based editing. What's remarkable is how they're all converging toward a single paradigm: conditional diffusion with various forms of guidance ;-)
Inpainting: filling in the blanks
Inpainting replaces a masked region of an image with content that blends seamlessly with the surroundings. Remove a person from a beach photo. Erase a watermark. Fill in a scratched area of a scanned photograph. Reconstruct a damaged painting.
Classic inpainting (before deep learning) used texture propagation -- copy patterns from nearby pixels into the hole, working from the edges inward. It works reasonably well for small holes in textured regions but completely falls apart when the missing area is large or contains semantic content (like a face, a building, or an animal). You can't reconstruct a face from surrounding tree textures by copying pixels.
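To make the "work from the edges inward" idea concrete, here is a minimal NumPy sketch. It is not any specific published algorithm: the function name propagate_fill and the rule "average the known 4-neighbours" are my own simplifications, chosen only to show the propagation pattern and why it cannot invent semantic content.

```python
import numpy as np


def propagate_fill(image, hole):
    """Fill hole pixels layer by layer from the boundary inward.

    image: 2D float array (grayscale).
    hole:  2D bool array, True = missing pixel.
    Each pass fills every missing pixel that touches at least one
    known 4-neighbour with the mean of those known neighbours.
    """
    img = image.astype(np.float64).copy()
    known = ~hole
    while not known.all():
        # Zero-pad so border pixels simply see fewer neighbours
        vals = np.pad(np.where(known, img, 0.0), 1)
        cnt = np.pad(known.astype(np.float64), 1)
        nb_sum = (vals[:-2, 1:-1] + vals[2:, 1:-1]
                  + vals[1:-1, :-2] + vals[1:-1, 2:])
        nb_cnt = (cnt[:-2, 1:-1] + cnt[2:, 1:-1]
                  + cnt[1:-1, :-2] + cnt[1:-1, 2:])
        frontier = (~known) & (nb_cnt > 0)
        if not frontier.any():
            break  # no known pixels at all, nothing to propagate
        img[frontier] = nb_sum[frontier] / nb_cnt[frontier]
        known = known | frontier
    return img


# A smooth horizontal ramp with a 6x6 hole punched in the middle
ramp = np.tile(np.linspace(0.0, 1.0, 16), (16, 1))
hole = np.zeros((16, 16), dtype=bool)
hole[5:11, 5:11] = True
filled = propagate_fill(ramp, hole)
```

On a smooth ramp the fill looks plausible, because averaging neighbours is a reasonable guess for smooth content. Replace the ramp with a face and the same code can only smear surrounding values into the hole, which is exactly the limitation described above.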
Deep learning inpainting uses the model's understanding of what should be there given the context. A diffusion inpainting model is essentially the same U-Net from episode #85, but with additional input channels -- the masked image and the mask itself are concatenated with the noisy latent. The model learns to denoise only in the masked region while keeping the unmasked areas untouched:
from diffusers import StableDiffusionInpaintPipeline
import torch
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16
).to("cuda")

image = Image.open("beach_with_person.png").resize(
    (512, 512))
mask = Image.open("person_mask.png").resize(
    (512, 512))  # white = replace

result = pipe(
    prompt="a sandy beach with ocean waves, "
           "sunny day, golden sand",
    image=image,
    mask_image=mask,
    num_inference_steps=30,
).images[0]
result.save("beach_clean.png")
The prompt guides what gets generated in the masked region. Without a prompt (or with a generic one), the model fills in based purely on visual context -- surrounding textures, colors, and scene understanding. With a specific prompt, you can replace the masked area with something entirely new: mask out the sky and prompt "dramatic thunderstorm clouds" to change the weather.
Under the hood, the inpainting model modifies the standard diffusion denoising loop. At each step, after predicting the noise and computing the denoised estimate, the unmasked regions are forced to match the original image (at the appropriate noise level for that timestep). Only the masked region is free to evolve through the denoising process:
import torch


def inpaint_denoise_step(model, xt, t,
                         mask_latent,
                         original_latent,
                         text_emb):
    """Modified denoising step for inpainting.
    Unmasked regions are forced to match the
    original at each timestep.

    Sketch only: denoise_step and add_noise stand
    for the scheduler's reverse step and forward
    noising helpers and are not defined here."""
    # Predict noise for the whole image
    pred_noise = model(
        torch.cat([xt, mask_latent,
                   original_latent], dim=1),
        t, text_emb)
    # Standard denoise step
    xt_prev = denoise_step(xt, pred_noise, t)
    # Force unmasked regions to match original
    # (at appropriate noise level for t-1)
    original_noised = add_noise(
        original_latent, t - 1)
    # mask_latent is 1 where content is regenerated
    # and 0 where the original must be kept
    xt_prev = (mask_latent * xt_prev
               + (1 - mask_latent)
               * original_noised)
    return xt_prev
This "force the known pixels" approach is simple but effective. The boundary between masked and unmasked regions can sometimes show visible seams, which is why more sophisticated methods use feathered masks (gradual transition from 0 to 1 at the boundary) or apply additional blending in post-processing. Having said that, modern inpainting models trained specifically for this task (like the SD2 inpainting checkpoint above) handle boundaries remarkably well because they've seen millions of examples during training.
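A feathered mask is easy to sketch in plain NumPy. The version below is an illustration under my own assumptions: it uses a separable box blur instead of a Gaussian to avoid extra dependencies, and the helper names box_blur and feathered_composite are made up for this example. The "generated" image here is just a constant array standing in for a model's output.

```python
import numpy as np


def box_blur(arr, radius):
    """Separable box blur with edge padding
    (a simple stand-in for a Gaussian blur)."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    padded = np.pad(arr, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def feathered_composite(original, generated, mask, radius=3):
    """Blend generated content into the original with a soft mask.
    mask: 1 inside the region to replace, 0 elsewhere."""
    soft = np.clip(
        box_blur(mask.astype(np.float64), radius), 0.0, 1.0)
    blended = soft * generated + (1.0 - soft) * original
    return blended, soft


original = np.zeros((32, 32))   # stand-in for the source image
generated = np.ones((32, 32))   # stand-in for the model output
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0

blended, soft = feathered_composite(original, generated, mask)
print(f"center={soft[16, 16]:.2f} "
      f"edge={soft[8, 16]:.2f} "
      f"outside={soft[0, 0]:.2f}")
```

Deep inside the mask the generated content wins completely, far outside the original is untouched, and along the boundary the two are mixed gradually. A larger radius gives a wider transition band, at the cost of letting generated content bleed further into the region you wanted to keep.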
Style transfer: content meets aesthetics
Neural style transfer, proposed by Gatys et al. (2015), was one of the first "wow" demonstrations of deep learning for creative purposes. The insight: a pretrained CNN (like VGG, which we covered in episode #46) naturally separates content (what objects are where) from style (textures, colors, brushstrokes) in its internal representations.
Content is captured by the feature maps themselves -- which neurons are active where. Style is captured by the Gram matrix -- the correlations between feature maps, which describe texture patterns independent of spatial layout. If two feature channels tend to activate together (say, one detecting horizontal edges and another detecting warm colors), that correlation is part of the "style" regardless of where in the image those patterns appear:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


def gram_matrix(features):
    """Compute the Gram matrix: correlations
    between feature channels."""
    b, c, h, w = features.shape
    F_map = features.view(b, c, h * w)
    G = torch.bmm(F_map, F_map.transpose(1, 2))
    return G / (c * h * w)


def extract_features(model, x, layers):
    """Extract features at specified VGG layers."""
    features = {}
    current = x
    layer_idx = 0
    for module in model.features:
        current = module(current)
        if isinstance(module, nn.ReLU):
            name = f"relu{layer_idx}"
            if name in layers:
                features[name] = current
            layer_idx += 1
    return features


def style_transfer_loss(generated,
                        content_target,
                        style_target,
                        vgg,
                        alpha=1, beta=1e5):
    """Combined content + style loss."""
    content_layers = ['relu8']
    style_layers = ['relu0', 'relu2', 'relu4',
                    'relu8', 'relu12']
    all_layers = set(content_layers + style_layers)
    gen_feat = extract_features(
        vgg, generated, all_layers)
    con_feat = extract_features(
        vgg, content_target, all_layers)
    sty_feat = extract_features(
        vgg, style_target, all_layers)
    content_loss = sum(
        F.mse_loss(gen_feat[layer], con_feat[layer])
        for layer in content_layers
        if layer in gen_feat and layer in con_feat
    )
    style_loss = sum(
        F.mse_loss(
            gram_matrix(gen_feat[layer]),
            gram_matrix(sty_feat[layer]))
        for layer in style_layers
        if layer in gen_feat and layer in sty_feat
    )
    return alpha * content_loss + beta * style_loss
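A quick way to convince yourself that the Gram matrix really ignores spatial layout: shuffle the pixel positions of a feature map and compare. This is a small NumPy check of my own (random arrays stand in for real VGG activations), using the same normalisation as gram_matrix above.

```python
import numpy as np


def gram(features):
    """features: (C, H, W) -> (C, C) Gram matrix."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))  # stand-in activations

# Apply the SAME random permutation of positions to every channel
perm = rng.permutation(16 * 16)
shuffled = feats.reshape(8, -1)[:, perm].reshape(8, 16, 16)

same_maps = np.allclose(feats, shuffled)
same_gram = np.allclose(gram(feats), gram(shuffled))
print(f"feature maps identical: {same_maps}")
print(f"Gram matrices identical: {same_gram}")
```

The feature maps differ (the "content" has been scrambled) while the Gram matrices match, because each entry is a sum over all positions and a sum does not care about order. That is the sense in which the Gram matrix captures texture statistics rather than layout.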
The optimization works differently from normal training: instead of updating model weights, you update the image itself. Start with the content image (or random noise), compute the combined loss, backpropagate through the frozen VGG network, and update the pixel values via gradient descent. The result is an image that has the spatial structure of the content image but the textures and color palette of the style image.
The beta parameter (style weight) relative to alpha (content weight) controls the tradeoff. High beta produces output that looks more like the style image at the expense of recognizability. Low beta preserves content faithfully but the style becomes subtle. The typical ratio is around 1e5:1, which gives a good balance for artistic style transfer.
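The "optimize the image, not the weights" idea can be shown without VGG at all. The toy below is my own simplification: the "features" are just the raw pixel channels (3 channels, 64 pixels), the loss is only the Gram-matrix style term, and the gradient is written out by hand instead of using autograd. Real style transfer backpropagates through VGG, but the update loop has the same shape.

```python
import numpy as np


def gram(feat):
    """feat: (C, N) -> (C, C) Gram matrix."""
    return feat @ feat.T / feat.shape[1]


def style_loss_and_grad(feat, target_gram):
    """Squared Gram difference and its gradient w.r.t. feat.
    For G = F F^T / N: d/dF sum((G - S)^2) = (4 / N) (G - S) F."""
    diff = gram(feat) - target_gram
    loss = np.sum(diff ** 2)
    grad = 4.0 / feat.shape[1] * diff @ feat
    return loss, grad


rng = np.random.default_rng(0)
# "Style" image: strong first channel, weak third channel
style = rng.uniform(0.0, 1.0, size=(3, 64)) * np.array(
    [[1.0], [0.5], [0.1]])
# Image being optimized starts as unrelated noise
image = rng.uniform(0.0, 1.0, size=(3, 64))
target = gram(style)

losses = []
for _ in range(300):
    loss, grad = style_loss_and_grad(image, target)
    losses.append(loss)
    image -= 2.0 * grad  # update the pixels; nothing else is trained

print(f"style loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

There are no model parameters anywhere in this loop. The only thing gradient descent changes is the image array, which drifts until its channel statistics match the style target. Adding a content term would pull it back toward a second image, and the alpha/beta ratio decides which pull wins.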
Fast style transfer (Johnson et al., 2016) trains a feedforward network to apply a specific style in a single forward pass, trading generality for speed. AdaIN (Adaptive Instance Normalization, Huang and Belongie, 2017) enables arbitrary style transfer in real-time by adjusting the per-channel mean and variance of the content features to match those of the style features -- it's essentially transferring the style's first- and second-order feature statistics onto the content.
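The AdaIN operation itself is only a few lines. Here is a NumPy sketch on a single (C, H, W) feature map; in the actual method this is applied to encoder features and followed by a trained decoder, which is not shown. The eps value is an assumption of mine for numerical safety.

```python
import numpy as np


def adain(content, style, eps=1e-5):
    """Re-normalise content features to the style's
    per-channel mean and standard deviation.
    content, style: arrays of shape (C, H, W)."""
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    normalised = (content - c_mean) / c_std
    return s_std * normalised + s_mean


rng = np.random.default_rng(0)
content = rng.standard_normal((4, 8, 8))
style = 3.0 * rng.standard_normal((4, 8, 8)) + 2.0
out = adain(content, style)

print("style mean  :", style.mean(axis=(1, 2)).round(2))
print("output mean :", out.mean(axis=(1, 2)).round(2))
```

The output keeps the spatial pattern of the content (it is just shifted and scaled per channel) while taking on the style's channel statistics. No optimization loop and no per-style training are needed, which is where the speed comes from.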
Image-to-image translation: pix2pix
Pix2pix (Isola et al., 2017) learns to translate between paired image domains: sketch -> photo, day -> night, aerial view -> map, edges -> handbag. It uses a conditional GAN (episode #55) where the generator receives the source image and produces the target domain version, while the discriminator judges whether the output looks realistic given the input:
import torch
import torch.nn as nn


class Pix2PixGenerator(nn.Module):
    """U-Net generator for pix2pix. Encoder-decoder
    with skip connections that pass fine detail
    directly from input to output."""

    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        # Encoder
        self.enc1 = nn.Conv2d(
            in_ch, base, 4, stride=2, padding=1)
        self.enc2 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1),
            nn.BatchNorm2d(base * 2))
        self.enc3 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1),
            nn.BatchNorm2d(base * 4))
        self.enc4 = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, base * 8, 4, 2, 1))
        # Decoder with skip connections
        self.dec4 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 8, base * 4, 4, 2, 1),
            nn.BatchNorm2d(base * 4))
        self.dec3 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 8, base * 2, 4, 2, 1),
            nn.BatchNorm2d(base * 2))
        self.dec2 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 4, base, 4, 2, 1),
            nn.BatchNorm2d(base))
        self.dec1 = nn.Sequential(
            nn.ReLU(),
            nn.ConvTranspose2d(
                base * 2, out_ch, 4, 2, 1),
            nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        e4 = self.enc4(e3)
        d4 = self.dec4(e4)
        # Skip connections: cat encoder features
        # with decoder features
        d3 = self.dec3(torch.cat([d4, e3], dim=1))
        d2 = self.dec2(torch.cat([d3, e2], dim=1))
        d1 = self.dec1(torch.cat([d2, e1], dim=1))
        return d1


gen = Pix2PixGenerator()
sketch = torch.randn(1, 3, 256, 256)
photo = gen(sketch)
print(f"Input: {sketch.shape}")
print(f"Output: {photo.shape}")
The skip connections are critical here -- without them, the decoder would have to reconstruct every spatial detail from the compressed bottleneck, which inevitably loses fine edges, textures, and precise alignments. With skip connections, the decoder receives the input's spatial detail directly and only needs to transform it rather than reconstruct it. This is the same principle we saw in segmentation U-Nets (episode #80) and the diffusion U-Net (episode #84).
Pix2pix requires paired training data: you need the SAME scene in both domains. That's easy for synthetic pairs (extract edges from photos, then train the reverse) but impossible for many real-world translations. You can't get paired horse-zebra photos of the exact same animal in the exact same pose.
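The "extract edges from photos, then train the reverse" trick for building paired data looks roughly like this. It is a deliberately crude sketch: the original pix2pix edge datasets were built with a proper edge detector, whereas I'm using a thresholded gradient magnitude here, and the function names and the threshold of 0.2 are my own choices.

```python
import numpy as np


def edge_map(gray, threshold=0.2):
    """Binary edge map from finite-difference gradients.
    gray: 2D float array with values in [0, 1]."""
    gy, gx = np.gradient(gray.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    return (magnitude > threshold).astype(np.float32)


def make_pairs(photos, threshold=0.2):
    """Build (source, target) training pairs:
    the edge map is the input, the photo is the target."""
    return [(edge_map(p, threshold), p) for p in photos]


# Synthetic "photo": a bright square on a dark background
photo = np.zeros((32, 32))
photo[8:24, 8:24] = 1.0

pairs = make_pairs([photo])
edges, target = pairs[0]
print(f"edge pixels: {int(edges.sum())} of {edges.size}")
```

Because the edge map is computed from the photo, the two are perfectly aligned by construction, which is exactly the pixel-level correspondence pix2pix needs. There is no equivalent automatic procedure that turns a horse photo into the matching zebra photo, and that gap is what the next section addresses.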
CycleGAN: unpaired translation
CycleGAN (Zhu et al., 2017) removes the paired data requirement entirely. It learns to translate between domains using only unpaired examples: a collection of horse photos and a separate collection of zebra photos, with no correspondences between them.
The key constraint is cycle consistency: if you translate a horse photo to zebra, then translate the result back to horse, you should get the original photo back. And vice versa. This prevents the model from producing arbitrary outputs that happen to look like zebras but have no relationship to the input horse:
import torch
import torch.nn.functional as F


def cyclegan_losses(real_A, real_B,
                    G_AB, G_BA,
                    D_A, D_B,
                    lambda_cycle=10.0):
    """CycleGAN generator loss. Uses both
    generators and both discriminators; the
    discriminators' own real-vs-fake loss is
    trained separately and not shown here."""
    # Forward cycle: A -> B -> A
    fake_B = G_AB(real_A)
    recon_A = G_BA(fake_B)
    cycle_loss_A = F.l1_loss(recon_A, real_A)
    # Backward cycle: B -> A -> B
    fake_A = G_BA(real_B)
    recon_B = G_AB(fake_A)
    cycle_loss_B = F.l1_loss(recon_B, real_B)
    cycle_loss = lambda_cycle * (
        cycle_loss_A + cycle_loss_B)
    # GAN losses: fake images should fool
    # discriminators (each discriminator is
    # evaluated once and the result reused)
    pred_fake_B = D_B(fake_B)
    gan_loss_AB = F.mse_loss(
        pred_fake_B,
        torch.ones_like(pred_fake_B))
    pred_fake_A = D_A(fake_A)
    gan_loss_BA = F.mse_loss(
        pred_fake_A,
        torch.ones_like(pred_fake_A))
    # Identity loss (optional but helps preserve
    # color): G_AB(real_B) should look like real_B
    ident_B = G_AB(real_B)
    ident_A = G_BA(real_A)
    identity_loss = 0.5 * (
        F.l1_loss(ident_B, real_B)
        + F.l1_loss(ident_A, real_A))
    total = (gan_loss_AB + gan_loss_BA
             + cycle_loss + identity_loss * 5.0)
    return total, {
        "cycle": cycle_loss.item(),
        "gan": (gan_loss_AB + gan_loss_BA).item(),
        "identity": identity_loss.item(),
    }
CycleGAN requires training four networks: two generators (A->B, B->A) and two discriminators (one per domain). That's considerably more expensive than pix2pix, but the freedom from paired data makes it dramatically more versatile. The classic demos include horse<->zebra, summer<->winter, photo<->painting, and apple<->orange.
The identity loss term is a subtle addition: when you feed a real zebra to the horse->zebra generator, it should return the image unchanged (it's already a zebra). This prevents the generator from making unnecessary changes and helps preserve colors that are already correct.
Super-resolution: enhancing detail
Super-resolution upscales low-resolution images while adding realistic detail that the model invents based on learned priors about what real images look like at high resolution. A 128x128 image becomes a convincing 512x512 image -- not just bilinearly interpolated (which gives blurry upscaling), but with sharp edges, realistic textures, and fine details:
from diffusers import StableDiffusionUpscalePipeline
import torch
from PIL import Image

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("small_photo.png")
upscaled = pipe(
    prompt="high quality detailed photograph",
    image=low_res,
    num_inference_steps=20,
).images[0]
upscaled.save("upscaled_4x.png")
The classic approach, SRGAN (Super-Resolution GAN, Ledig et al., 2017), trains a generator to upscale and a discriminator to distinguish real high-res from fake upscaled images. The critical innovation was using perceptual loss -- comparing VGG features rather than raw pixels. MSE pixel loss produces blurry results because the "average" of all possible high-resolution versions of a low-res image is, well, blurry. Perceptual loss encourages the model to produce images that look realistic to a pretrained feature extractor, even if the specific details it invents don't match the ground truth exactly.
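The "average of all plausible answers is blurry" argument is worth seeing with numbers. In this toy example of mine, a 1D "image" has a sharp edge whose exact position is ambiguous from the low-res input: it could be at index 3 or at index 5, with equal probability.

```python
import numpy as np

# Two equally plausible sharp high-res versions of the same
# low-res input: a step edge at index 3 or at index 5
candidate_a = np.array([0, 0, 0, 1, 1, 1, 1, 1], dtype=np.float64)
candidate_b = np.array([0, 0, 0, 0, 0, 1, 1, 1], dtype=np.float64)
candidates = np.stack([candidate_a, candidate_b])


def expected_mse(prediction):
    """Average pixel MSE of one prediction
    against all plausible ground truths."""
    return np.mean((candidates - prediction) ** 2)


# The prediction that minimises expected MSE is the mean
blurry_mean = candidates.mean(axis=0)

print("sharp guess A :", candidate_a, expected_mse(candidate_a))
print("sharp guess B :", candidate_b, expected_mse(candidate_b))
print("blurry mean   :", blurry_mean, expected_mse(blurry_mean))
```

Either sharp guess is a realistic image, but each is badly wrong half the time. The mean has a soft 0.5 ramp where the edge might be, which is not a realistic image at all, yet it gets the lowest expected MSE. A model trained purely on pixel MSE is pushed toward that hedge, and a perceptual or adversarial loss is what lets it commit to one sharp answer.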
Modern super-resolution increasingly uses diffusion. The approach: take the low-res image, upscale it with bilinear interpolation to the target size, add noise, then denoise. The denoising process fills in the missing high-frequency detail -- sharpening edges, adding texture, refining fine structures. The prompt can guide what kind of detail gets added ("photograph" vs "oil painting" vs "pencil drawing"), which is something SRGAN couldn't do at all.
Diffusion-based image editing: the strength parameter
The most flexible approach to image editing today uses the img2img pipeline from Stable Diffusion. The core idea: take a real image, encode it into latent space, add some noise (not all the way to pure noise -- just partially), then denoise with a new text prompt. The amount of noise controls how much the image changes:
from diffusers import StableDiffusionImg2ImgPipeline
import torch
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16
).to("cuda")

original = Image.open("photo_of_house.png").resize(
    (512, 512))
edited = pipe(
    prompt="a house covered in snow, "
           "winter scene, frost on windows",
    image=original,
    strength=0.6,
    num_inference_steps=30,
).images[0]
edited.save("house_winter.png")
The strength parameter is the critical control. It determines how much noise is added to the encoded image and, as a result, how many of the denoising steps actually run -- strength=0.6 with 30 steps means the image is noised to roughly 60% of the way up the noise schedule, the first 12 steps are skipped, and the remaining 18 steps denoise from there. Lower strength preserves more of the original, higher strength allows more radical changes:
- strength=0.2-0.3: subtle modifications. Color shifts, slight texture changes, minor atmospheric adjustments. The composition and structure are essentially unchanged.
- strength=0.5-0.6: moderate edits. Weather changes, time-of-day shifts, material swaps. Overall composition preserved but significant details can change.
- strength=0.7-0.8: major transformation. The subject and rough layout survive but fine details are largely regenerated.
- strength=0.9-1.0: near-complete regeneration. The prompt dominates almost entirely. At 1.0 you're basically doing txt2img with a very weak hint from the input.
Finding the right strength for a specific edit is genuinely an art. Too low and nothing changes. Too high and you lose the original image entirely. I usually start at 0.5 and adjust up or down depending on whether the edit is too subtle or too destructive ;-)
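If you want to see what a given strength means in terms of actual work done, the arithmetic is simple. The helper below is my own sketch; it mirrors, to the best of my understanding, how the diffusers img2img pipeline derives its starting point from strength and num_inference_steps, but treat the function name and exact rounding as assumptions rather than the library's API.

```python
def img2img_schedule(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for an img2img call.

    The input is noised to the level where the skipped steps
    would have ended, and only the remaining steps are denoised.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    steps_run = min(int(num_inference_steps * strength),
                    num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


for s in (0.3, 0.6, 0.8, 1.0):
    skipped, run = img2img_schedule(30, s)
    print(f"strength={s:.1f}: skip {skipped:>2}, "
          f"run {run:>2} of 30 steps")
```

Two practical consequences follow. Low strength is also cheaper, because fewer denoising steps run. And at very low strength with few total steps, the number of steps that run can round down to almost nothing, so raising num_inference_steps is worth trying when a subtle edit seems to have no effect.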
InstructPix2Pix: editing with natural language
InstructPix2Pix (Brooks et al., 2023) takes a different approach: instead of providing a replacement prompt that describes the entire output image, you provide an editing instruction. "Make it winter", "Add sunglasses to the person", "Change the car color to red", "Make the dog look surprised". The model learns to apply the edit while preserving everything else:
from diffusers import (
    StableDiffusionInstructPix2PixPipeline)
import torch
from PIL import Image

pipe = (StableDiffusionInstructPix2PixPipeline
        .from_pretrained(
            "timbrooks/instruct-pix2pix",
            torch_dtype=torch.float16)
        .to("cuda"))

image = Image.open("summer_garden.png").resize(
    (512, 512))
edited = pipe(
    prompt="make it a snowy winter scene",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]
edited.save("winter_garden.png")
InstructPix2Pix has two guidance scales instead of one: guidance_scale controls how strongly the text instruction is followed (same as CFG from episode #85), and image_guidance_scale controls how much the original image is preserved. High image guidance = stay close to the original, change less. Low image guidance = more freedom to make dramatic changes.
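The way the two scales combine is a direct extension of the CFG formula. The sketch below follows the formulation in the InstructPix2Pix paper as I understand it: the model is evaluated three times per step (no conditioning, image only, image plus instruction) and the three noise predictions are mixed. Random arrays stand in for real U-Net outputs, and the function name is my own.

```python
import numpy as np


def dual_guidance(eps_uncond, eps_image, eps_full,
                  image_guidance_scale, guidance_scale):
    """Combine three noise predictions.

    eps_uncond: prediction with neither image nor text
    eps_image:  prediction conditioned on the input image only
    eps_full:   prediction conditioned on image and instruction
    """
    return (eps_uncond
            + image_guidance_scale * (eps_image - eps_uncond)
            + guidance_scale * (eps_full - eps_image))


rng = np.random.default_rng(0)
e_u, e_i, e_f = rng.standard_normal((3, 4, 8, 8))

# Both scales at 1.0 collapse to the fully conditioned prediction
plain = dual_guidance(e_u, e_i, e_f, 1.0, 1.0)
# Settings from the pipeline call above
guided = dual_guidance(e_u, e_i, e_f, 1.5, 7.0)
print(np.allclose(plain, e_f))
```

Each scale amplifies one difference vector: the image scale pushes along "what the input image adds", the text scale pushes along "what the instruction adds on top of the image". That is why raising image_guidance_scale pulls the result toward the original while raising guidance_scale makes the edit more forceful.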
The training data for InstructPix2Pix was generated synthetically: take an image caption, use a language model to generate editing instructions AND modified captions, then use Stable Diffusion to generate paired before/after images from the original and modified captions. This is a clever bootstrapping approach -- you don't need humans to manually edit millions of images, you use existing generative models to create the training data for a new editing model.
Building a practical editing pipeline
In practice, you often want to combine multiple techniques into a pipeline. Here's a class that composes different editing operations:
import json
import numpy as np
import torch
from PIL import Image
from scipy.ndimage import gaussian_filter
class ImageEditingPipeline:
    """Combine multiple image editing techniques
    into a reusable pipeline."""
    def __init__(self, device="cuda"):
        self.device = device
        self.dtype = torch.float16
    def compute_edit_strength(self, edit_type):
        """Suggest appropriate strength for
        different edit types."""
        suggestions = {
            "color_shift": 0.3,
            "weather_change": 0.55,
            "style_transfer": 0.65,
            "object_removal": 0.7,
            "scene_change": 0.8,
        }
        # Unknown edit types fall back to a middle value
        return suggestions.get(edit_type, 0.5)
    def create_mask_from_bbox(self, image_size,
                              bbox):
        """Create a binary mask from a bounding box
        (x1, y1, x2, y2) for inpainting."""
        # PIL sizes are (width, height); NumPy wants
        # (height, width), hence the reversal
        mask = np.zeros(image_size[::-1],
                        dtype=np.uint8)
        x1, y1, x2, y2 = bbox
        mask[y1:y2, x1:x2] = 255
        return Image.fromarray(mask)
    def feather_mask(self, mask_array, radius=10):
        """Apply Gaussian blur to mask edges for
        smoother inpainting boundaries."""
        feathered = gaussian_filter(
            mask_array.astype(np.float32),
            sigma=radius)
        peak = feathered.max()
        if peak == 0:
            # Empty mask: avoid dividing by zero
            return np.zeros_like(mask_array,
                                 dtype=np.uint8)
        # Renormalize to 0-255
        feathered = (feathered / peak
                     * 255).astype(np.uint8)
        return feathered
    def log_operation(self, operation, params,
                      success):
        """Log editing operations for
        reproducibility."""
        entry = {
            "operation": operation,
            "params": params,
            "success": success,
        }
        print(json.dumps(entry, indent=2))
pipeline = ImageEditingPipeline()
# Example: suggest strengths for different edits
for edit in ["color_shift", "weather_change",
             "style_transfer", "object_removal",
             "scene_change"]:
    s = pipeline.compute_edit_strength(edit)
    print(f"{edit:<20} -> strength={s:.2f}")
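A feathered mask like the one feather_mask produces is commonly used as an alpha map when pasting an edited region back onto the original. The sketch below shows that blending step in plain NumPy. The helper `composite_with_mask` is hypothetical and not part of the class above, and the flat gray test images are stand-ins for a real original and a real inpainting result.

```python
import numpy as np


def composite_with_mask(original, edited, mask):
    """Blend `edited` into `original` using a 0-255 mask.

    255 takes the edited pixel, 0 keeps the original pixel,
    and values in between give a soft transition.
    """
    alpha = mask.astype(np.float32) / 255.0
    if original.ndim == 3:
        # Broadcast the single-channel mask over RGB
        alpha = alpha[..., None]
    blended = (alpha * edited.astype(np.float32)
               + (1.0 - alpha) * original.astype(np.float32))
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


# Stand-in images: dark original, bright "edited" version
original = np.full((64, 64, 3), 50, dtype=np.uint8)
edited = np.full((64, 64, 3), 200, dtype=np.uint8)

# Hard-edged mask covering the center; pass a feathered
# mask instead to get a gradual seam
mask = np.zeros((64, 64), dtype=np.uint8)
mask[16:48, 16:48] = 255

result = composite_with_mask(original, edited, mask)
print(result[0, 0], result[32, 32])
```

Blending in pixel space like this is only one option. Some pipelines blend in latent space during denoising instead, so treat this as an illustration of the idea rather than a description of what any particular library does.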
The key insight for building real editing systems: each technique has its sweet spot. Inpainting is best for localized changes (remove this object, replace this region). Style transfer works for global aesthetic changes. Img2img is best for scene-level transformations (day to night, summer to winter). Super-resolution is for enhancement without semantic changes. InstructPix2Pix bridges the gap between targeted edits and scene-level changes by letting you describe the edit in natural language.
All of these techniques are converging. Modern systems like SDXL editing pipelines, Adobe Firefly, and research prototypes increasingly combine them into unified models that can inpaint, restyle, translate, upscale, and edit -- all through a single interface controlled by text prompts and masks. The underlying machinery is the same conditional diffusion framework we've been building across episodes #84-86, just with different conditioning signals plugged in.
In summary
- Inpainting fills masked regions using a modified diffusion process where unmasked pixels are forced to match the original at each denoising step; the text prompt guides what gets generated in the masked area;
- neural style transfer separates content (feature maps) from style (Gram matrices) using a pretrained CNN like VGG; optimization updates the image pixels while keeping the network frozen; fast variants (AdaIN) enable real-time arbitrary style application;
- pix2pix translates between paired image domains using a conditional GAN with a U-Net generator and skip connections; CycleGAN handles unpaired domains through cycle consistency constraints (A->B->A should reconstruct A);
- super-resolution adds realistic high-frequency detail using perceptual losses (GAN-based) or diffusion-based denoising; perceptual loss (comparing VGG features, not pixels) was the key innovation over MSE-based approaches that produce blurry results;
- diffusion img2img editing controls change intensity through the strength parameter: more noise = more timesteps denoised = more change; finding the right strength for a specific edit is an iterative process;
- InstructPix2Pix adds natural language editing instructions with separate guidance scales for text adherence and image preservation, trained on synthetically generated before/after pairs;
- all image editing tasks are converging toward a single paradigm: conditional diffusion with various forms of guidance -- masks for inpainting, images for img2img, text for prompting, structure maps for ControlNet (episode #85).
The vision section of this series has been quite a journey -- from basic image processing through detection, segmentation, pose estimation, OCR, video understanding, and now generative and editing models. There's still fascinating ground to cover in how machines understand the visual world from different angles, including the geometry and structure of three-dimensional scenes.
Exercises
Exercise 1: Build a mask quality analyzer for inpainting. Create a class InpaintMaskAnalyzer that: (a) takes a binary mask as a 2D numpy array (0 = keep, 255 = replace), (b) computes mask statistics: total area (number of masked pixels), coverage ratio (masked / total pixels), bounding box of the masked region, aspect ratio of the bounding box, (c) computes a "boundary complexity" score: the number of masked pixels that have at least one unmasked neighbor (4-connectivity) divided by the total masked pixel count -- this measures how jagged/complex the mask boundary is (a large solid mask has low complexity, a fractal mask has high complexity), (d) implements classify_mask(mask) that categorizes the mask as "small_patch" (coverage < 5%), "medium_region" (5-25%), "large_area" (25-50%), or "reconstruction" (> 50%), and recommends an appropriate number of diffusion steps for each category (more steps for larger masks), (e) generates 4 test masks on a 256x256 canvas: a 30x30 centered square, a 100x100 centered square, a horizontal stripe (full width, 80px tall), and a checkerboard (alternate 16x16 blocks). Print the analysis for each mask and verify that boundary complexity is lowest for the large solid masks (the stripe and the 100x100 square), somewhat higher for the small 30x30 square (small shapes have relatively more boundary per masked pixel), and highest for the checkerboard.
Exercise 2: Build a style transfer weight explorer. Create a class StyleWeightExplorer that: (a) generates synthetic "content features" and "style features" as random tensors of shape (1, 64, 32, 32), (b) implements gram_matrix(features) that computes the Gram matrix (episode formula: flatten spatial dims, then F * F^T, normalized by CHW), (c) implements content_loss(gen_features, content_features) as MSE between feature maps and style_loss(gen_features, style_features) as MSE between their Gram matrices, (d) for a grid of (alpha, beta) weight pairs -- alpha in [1, 10, 100] and beta in [1e3, 1e4, 1e5, 1e6] -- computes the combined loss alpha * content_loss + beta * style_loss using a fixed "generated" feature tensor (equal to the content features, so content loss starts at 0), (e) prints a table showing: for each (alpha, beta) pair, the content loss contribution, the style loss contribution, the total loss, and the "style dominance ratio" (style contribution / total). Identify which (alpha, beta) pairs give style dominance between 0.4 and 0.6 (the balanced sweet spot) and which are heavily style-dominated (> 0.9) or heavily content-dominated (< 0.1).
Exercise 3: Build a diffusion strength calibrator. Create a class StrengthCalibrator that: (a) implements a toy 1D "image editing" scenario: x0_original is a known signal of length 128, x0_target is a modified version (original with a sine wave added to simulate an edit), (b) implements simulate_edit(x0, strength, T=1000, num_steps=50) that: encodes x0 into "latent" (just the signal itself for this toy model), adds noise at timestep t_start = int(strength * T) using a linear schedule (higher strength = later timestep = more noise), then runs DDIM-style denoising from t_start to 0 using a "perfect oracle" that always predicts the correct noise (so any error comes from discretization and the strength setting, not model quality), (c) for strengths [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0], computes: MSE between the edited result and x0_original (should increase with strength -- more noise destroys more of the original), MSE between the edited result and x0_target (should decrease initially then plateau -- moderate strength finds the target), and a "preservation score" (1 - normalized MSE to original) and "transformation score" (1 - normalized MSE to target), (d) prints a table and identifies the strength that gives the best balance (closest to equal preservation and transformation scores). Verify that strength=0.0 (if you tested it) would perfectly preserve the original, and strength=1.0 almost completely destroys it.