Learn Creative Coding (#86) - Text as Data: Analyzing and Visualizing Language

Last episode we built network visualizations -- force-directed layouts, community detection, adjacency matrices, arc diagrams, interactive hover highlighting, edge bundling. We turned structural relationships into visual patterns. Nodes and edges, clusters and bridges, the social architecture of data made visible. But all the data we've worked with so far in this arc -- geographic coordinates, timestamps, network connections -- has been inherently numeric. Numbers map naturally to visual properties: position, size, color, angle. The mapping is direct.

Text is different. Text is messy. A paragraph of English isn't a number. A word isn't a coordinate. Before you can visualize text, you have to measure it -- turn words and sentences into quantities that your canvas can work with. Word frequency, sentence length, character distribution, sentiment scores. Once you extract those measurements, text becomes data like any other. And the patterns hiding in language turn out to be surprisingly beautiful.

This episode is about treating text as raw material for creative coding. We'll count words (and discover Zipf's law along the way), measure sentence rhythm, score sentiment with a dictionary lookup, map characters and words to colors, compare two texts side by side, and generate new text with Markov chains. We parsed structured data files back in episode 81 -- CSV rows and JSON objects with neat fields. Text is the unstructured cousin: no columns, no types, just a stream of characters that your code has to make sense of.

Word frequency: the shape of language

The most basic measurement of text: how often does each word appear? Count every word, sort by frequency, and you've got a profile of what the text is about. The most frequent words in any English text are always the same: "the", "a", "is", "and", "of", "to". These are called stop words -- they're structural, not meaningful. Filter them out and the content words that remain tell you the subject matter. A novel about whales will have "whale", "sea", "captain" near the top. A tech blog post will have "function", "data", "code".

function countWords(text) {
  const words = text.toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 0);

  // no prototype, so a word like "constructor" can't collide with an inherited key
  const counts = Object.create(null);
  for (const word of words) {
    counts[word] = (counts[word] || 0) + 1;
  }

  return Object.entries(counts)
    .sort((a, b) => b[1] - a[1]);
}

const sampleText = `The quick brown fox jumps over the lazy dog.
The dog barked at the fox. The fox ran away from the dog.
A cat watched the dog chase the fox from the garden wall.
The garden was quiet after the fox and the dog left.`;

const freq = countWords(sampleText);
// [["the", 12], ["fox", 5], ["dog", 5], ["from", 2], ["garden", 2], ...]

Twelve occurrences of "the" in five sentences. Five each for "fox" and "dog". That's already telling you something -- this text is about a dog and a fox, which of course it is, but the algorithm doesn't know that. It just counted.

The replace(/[^a-z\s]/g, '') strip is crude but effective -- it removes punctuation, digits, and special characters, leaving only lowercase letters and whitespace. The cost is that it mangles some words: "don't" becomes "dont", "well-known" becomes "wellknown", and accented letters simply vanish ("café" becomes "caf"). A proper tokenizer would keep apostrophes and hyphens inside words and handle unicode. For creative coding purposes, the crude version usually works fine.
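
If the crude strip loses too much for your source text, a slightly more careful tokenizer is not much extra work. This is a minimal sketch, not a full tokenizer: the name tokenize is made up for this example, and it assumes a JavaScript engine that supports Unicode property escapes in regular expressions.

```javascript
// Hypothetical helper: keeps letters from any script, plus apostrophes
// and hyphens when they sit between letters inside a word.
function tokenize(text) {
  const pattern = /\p{L}+(?:['\u2019-]\p{L}+)*/gu;
  // match() returns null when nothing matches, so fall back to an empty list
  return text.toLowerCase().match(pattern) || [];
}

const tokens = tokenize("Don't stop: the well-known café is open.");
// tokens: ["don't", "stop", "the", "well-known", "café", "is", "open"]
```

Digits and stray punctuation are still dropped, which is usually what you want for word counting.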

Zipf's law: the universal pattern

Here's something wild. If you take any sufficiently long text in a natural language, rank the words from most to least frequent, and plot frequency against rank on a log-log scale, you get something very close to a straight line. The most frequent word appears roughly twice as often as the second most frequent, three times as often as the third, and so on. Frequency times rank stays roughly constant. This is Zipf's law, and it has been observed in English, French, Japanese, Arabic, ancient Greek -- practically every natural language that has been studied.

const canvas = document.createElement('canvas');
canvas.width = 700;
canvas.height = 500;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// generate synthetic word frequencies following Zipf's law
// (in real use, feed in actual counted words from a text)
const numWords = 200;
const zipfFreqs = [];
const maxFreq = 5000;
for (let rank = 1; rank <= numWords; rank++) {
  // zipf: freq = C / rank^s, where s is close to 1
  const freq = maxFreq / Math.pow(rank, 1.07);
  zipfFreqs.push({ rank, freq });
}

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 700, 500);

// log-log plot
const logMaxRank = Math.log10(numWords);
const logMaxFreq = Math.log10(maxFreq);

for (const { rank, freq } of zipfFreqs) {
  const logRank = Math.log10(rank);
  const logFreq = Math.log10(freq);

  const x = 60 + (logRank / logMaxRank) * 580;
  const y = 460 - (logFreq / logMaxFreq) * 420;

  ctx.beginPath();
  ctx.arc(x, y, 3, 0, Math.PI * 2);
  ctx.fillStyle = 'rgba(120, 180, 255, 0.6)';
  ctx.fill();
}

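// reference line (perfect Zipf, s = 1): freq = maxFreq / rank
const refEndFreq = maxFreq / numWords;  // frequency at the last rank
const refEndY = 460 - (Math.log10(refEndFreq) / logMaxFreq) * 420;
ctx.beginPath();
ctx.moveTo(60, 40);
ctx.lineTo(640, refEndY);
ctx.strokeStyle = 'rgba(255, 150, 100, 0.3)';
ctx.lineWidth = 1;
ctx.stroke();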

The dots fall on a straight line. That straight line on a log-log plot means the relationship is a power law -- a mathematical signature that appears in city populations, earthquake magnitudes, website traffic, income distributions, and apparently every language humans have ever spoken. Nobody fully agrees on why Zipf's law holds for language. It's one of those patterns that seem too universal to be coincidence but too mysterious to explain from first principles.
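
To check how close a real text comes to that straight line, you can fit the slope of the log-log plot. The sketch below uses an ordinary least-squares fit; zipfSlope is a hypothetical helper name, and it assumes the frequencies are already sorted from most to least frequent and are all greater than zero.

```javascript
// Hypothetical helper: least-squares slope of log10(frequency) against log10(rank).
function zipfSlope(sortedFreqs) {
  const pts = sortedFreqs.map((f, i) => [Math.log10(i + 1), Math.log10(f)]);
  const n = pts.length;
  const meanX = pts.reduce((s, p) => s + p[0], 0) / n;
  const meanY = pts.reduce((s, p) => s + p[1], 0) / n;
  let num = 0;
  let den = 0;
  for (const [x, y] of pts) {
    num += (x - meanX) * (y - meanY);
    den += (x - meanX) ** 2;
  }
  return num / den;
}

// synthetic check: freq = 1000 / rank should give a slope very close to -1
const ideal = Array.from({ length: 100 }, (_, i) => 1000 / (i + 1));
const slope = zipfSlope(ideal);
```

For a real text you could pass in countWords(text).map(entry => entry[1]). A slope near -1 means the text follows the classic Zipf shape; short texts will be noisier.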

For creative coding, Zipf's law means word frequency distributions always have the same shape: a few very common words, a long tail of rare words. Your visualization has to handle that extreme range. Linear mapping won't work -- the top 5 words will dominate and everything else will be invisible. Log scaling (like we used for population density in episode 83) fixes that.
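
Here is what that difference looks like in numbers. A minimal sketch: logScale is a hypothetical helper, and it assumes value, min, and max are all greater than zero.

```javascript
// Hypothetical helper: map a count onto an output range using a log scale.
function logScale(value, min, max, outMin, outMax) {
  const t = (Math.log(value) - Math.log(min)) / (Math.log(max) - Math.log(min));
  return outMin + t * (outMax - outMin);
}

// bar lengths for counts between 1 and 5000, drawn into 0..300 px
const counts = [5000, 500, 50, 5];
const linear = counts.map(c => (c / 5000) * 300);            // roughly 300, 30, 3, 0.3
const logged = counts.map(c => logScale(c, 1, 5000, 0, 300)); // every bar stays visible
```

With the linear mapping the two rarest words are at most a few pixels long. With the log mapping even the count of 5 still gets a bar of more than 50 pixels.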

Stop words and content filtering

Those ultra-frequent words -- "the", "and", "is", "of", "to", "a", "in", "that" -- are the stop words. They're grammatically essential but semantically empty. For most text visualization, you want to remove them so the meaningful words shine through.

const stopWords = new Set([
  'the', 'a', 'an', 'and', 'or', 'but', 'is', 'are', 'was', 'were',
  'be', 'been', 'being', 'have', 'has', 'had', 'do', 'does', 'did',
  'will', 'would', 'could', 'should', 'may', 'might', 'shall', 'can',
  'of', 'to', 'in', 'for', 'on', 'with', 'at', 'by', 'from', 'as',
  'into', 'through', 'during', 'before', 'after', 'above', 'below',
  'between', 'out', 'off', 'over', 'under', 'again', 'further',
  'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how',
  'all', 'each', 'every', 'both', 'few', 'more', 'most', 'other',
  'some', 'such', 'no', 'not', 'only', 'own', 'same', 'so', 'than',
  'too', 'very', 'just', 'because', 'about', 'up', 'down',
  'it', 'its', 'he', 'she', 'they', 'them', 'his', 'her', 'their',
  'this', 'that', 'these', 'those', 'i', 'me', 'my', 'we', 'us', 'our',
  'you', 'your', 'who', 'which', 'what'
]);

function contentWords(text) {
  return text.toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 1 && !stopWords.has(w));
}

function contentWordFrequency(text) {
  const words = contentWords(text);
  // no prototype, so a word like "constructor" can't collide with an inherited key
  const counts = Object.create(null);
  for (const word of words) {
    counts[word] = (counts[word] || 0) + 1;
  }
  return Object.entries(counts).sort((a, b) => b[1] - a[1]);
}

Now instead of "the, a, is, and, of" you get the actual subject words. Feed in a speech by Martin Luther King and you'll get "dream", "freedom", "nation", "justice" at the top. Feed in a cooking recipe and you'll get "butter", "flour", "minutes", "stir". The content filter is the difference between seeing the skeleton of English grammar (same for every text) and seeing the fingerprint of this particular text.

There's a creative argument for keeping stop words, though. Their frequency patterns reveal writing style, not content. One author might use "however" constantly while another never does. The ratio of "I" to "we" tells you something about perspective. If you're comparing how two texts are written rather than what they're about, the stop words are the signal, not the noise.
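
One way to turn that idea into data is to count a handful of function words per 1000 words of text. This is a rough sketch: functionWordProfile and the marker list are made up for illustration, and the cleanup is the same crude regex strip as before.

```javascript
// Hypothetical helper: occurrences per 1000 words for a chosen set of function words.
function functionWordProfile(text, markers) {
  const words = text.toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 0);

  const counts = new Map(markers.map(m => [m, 0]));
  for (const w of words) {
    if (counts.has(w)) counts.set(w, counts.get(w) + 1);
  }

  const profile = {};
  for (const [marker, n] of counts) {
    profile[marker] = words.length > 0 ? (n / words.length) * 1000 : 0;
  }
  return profile;
}

const personal = functionWordProfile(
  'I think I was wrong, but we can fix it.',
  ['i', 'we', 'however']
);
// personal.i is twice personal.we, and personal.however is 0
```

Run it on two authors with the same marker list and you get two small vectors you can draw as paired bars or a radar shape.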

Sentence length: the rhythm of prose

Every writer has a rhythm. Hemingway wrote short, punchy sentences. Faulkner wrote sentences that stretched across half a page, nesting clause inside clause inside clause, circling back to pick up threads that seemed abandoned paragraphs ago. Visualizing sentence length turns that rhythmic fingerprint into a visible pattern.

const canvas = document.createElement('canvas');
canvas.width = 900;
canvas.height = 300;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

function getSentenceLengths(text) {
  // split on . ! ? (simplified -- doesn't handle abbreviations like "Dr." or "U.S.")
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  return sentences.map(s => s.trim().split(/\s+/).length);
}

// simulate two different writing styles
const shortStyle = [];  // hemingway-ish
const longStyle = [];   // faulkner-ish

for (let i = 0; i < 40; i++) {
  shortStyle.push(4 + Math.floor(Math.random() * 8));
  longStyle.push(15 + Math.floor(Math.random() * 30));
}

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 900, 300);

const barWidth = 900 / 40 - 2;
const maxLen = Math.max(...shortStyle, ...longStyle);

// top row: short style
for (let i = 0; i < 40; i++) {
  const h = (shortStyle[i] / maxLen) * 120;
  ctx.fillStyle = 'rgba(100, 200, 180, 0.6)';
  ctx.fillRect(i * (barWidth + 2) + 1, 130 - h, barWidth, h);
}

// bottom row: long style
for (let i = 0; i < 40; i++) {
  const h = (longStyle[i] / maxLen) * 120;
  ctx.fillStyle = 'rgba(200, 120, 180, 0.6)';
  ctx.fillRect(i * (barWidth + 2) + 1, 170, barWidth, h);
}

Two writers, same number of sentences. The top row is low and uniform -- short, choppy, rhythmically tight. The bottom row is tall and varied -- long, sprawling, rhythmically loose. You can see the writing style without reading a single word. The bar chart IS the prose rhythm, translated from temporal flow into spatial pattern.

This is one of my favourite text visualizations because it captures something that's genuinely hard to articulate in words. You feel writing rhythm when you read, but you can't easily describe it. "Hemingway writes short sentences" is true but doesn't tell you the pattern -- how his short sentences cluster, where the occasional long one breaks the rhythm for emphasis. The visualization shows all of that at once.
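
The demo above uses simulated lengths. For real text, a small summary of the rhythm is handy next to the bars: the average length and how much the lengths vary. A minimal sketch, where rhythmStats is a hypothetical name and the sentence splitting is the same simplified split on . ! ? (it assumes the text contains at least one sentence).

```javascript
// Hypothetical helper: sentence lengths in words, plus mean and standard deviation.
function rhythmStats(text) {
  const lengths = text.split(/[.!?]+/)
    .map(s => s.trim())
    .filter(s => s.length > 0)
    .map(s => s.split(/\s+/).length);

  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance = lengths.reduce((a, n) => a + (n - mean) ** 2, 0) / lengths.length;
  return { lengths, mean, stdDev: Math.sqrt(variance) };
}

const stats = rhythmStats('He ran. She waited by the door for an hour. Nothing.');
// stats.lengths: [2, 8, 1]
```

A low standard deviation means an even beat. A high one means the writer mixes short and long sentences, which is often where the interesting rhythm lives.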

Sentiment analysis: positive and negative

Sentiment analysis rates text as positive, negative, or neutral. The simplest approach: use a word-level dictionary where each word has a pre-assigned score. The AFINN lexicon gives integer scores from -5 (very negative) to +5 (very positive); the widely used AFINN-111 version covers 2,477 English words and phrases. "Love" is +3. "Hate" is -3. "Disaster" is -3. "Excellent" is +3. Most words aren't in the dictionary at all -- they're neutral.

// simplified AFINN-style sentiment dictionary
const sentiment = {
  'love': 3, 'happy': 3, 'joy': 3, 'great': 3, 'excellent': 3,
  'wonderful': 4, 'beautiful': 3, 'amazing': 4, 'good': 2, 'nice': 2,
  'like': 1, 'best': 3, 'brilliant': 4, 'perfect': 3, 'hope': 2,
  'win': 3, 'won': 3, 'success': 2, 'smile': 2, 'laugh': 1,
  'hate': -3, 'terrible': -3, 'awful': -3, 'bad': -2, 'worst': -3,
  'ugly': -2, 'stupid': -2, 'fail': -2, 'failed': -2, 'disaster': -3,
  'horrible': -3, 'angry': -2, 'sad': -2, 'pain': -2, 'fear': -2,
  'kill': -3, 'die': -2, 'death': -2, 'war': -2, 'destroy': -3,
  'wrong': -2, 'broken': -1, 'hurt': -2, 'lost': -1, 'miss': -1
};

function scoreSentiment(text) {
  const words = text.toLowerCase().replace(/[^a-z\s]/g, '').split(/\s+/);
  let total = 0;
  let scored = 0;

  for (const word of words) {
    // own keys only, so inherited names like "constructor" are never scored
    if (Object.hasOwn(sentiment, word)) {
      total += sentiment[word];
      scored++;
    }
  }

  return { total, scored, average: scored > 0 ? total / scored : 0 };
}

This is crude. It doesn't understand negation ("not happy" should be negative but scores positive because "happy" is +3 and "not" isn't scored). It doesn't understand sarcasm, context, or intensity modifiers ("really happy" vs "happy"). But for creative coding, crude works. You're not building a production sentiment classifier -- you're extracting a signal that drives visual output. The imperfections add character. A text that's mostly positive with occasional negative spikes creates a visual rhythm that's more interesting than a perfectly calibrated flat score.
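
If the negation problem bothers you, a one-word lookbehind gets you surprisingly far. The sketch below is an illustration only: the lexicon is a tiny stand-in with assumed scores, the negator list is my own pick, and scoreWithNegation is a hypothetical name.

```javascript
// Tiny stand-in lexicon with assumed scores, in the same style as the dictionary above
const lexicon = new Map([['happy', 3], ['good', 2], ['bad', -2], ['terrible', -3]]);
const negators = new Set(['not', 'no', 'never']);

function scoreWithNegation(text) {
  const words = text.toLowerCase()
    .replace(/[^a-z\s]/g, '')
    .split(/\s+/)
    .filter(w => w.length > 0);

  let total = 0;
  for (let i = 0; i < words.length; i++) {
    if (!lexicon.has(words[i])) continue;
    // flip the sign when the word directly before is a negator
    const flipped = i > 0 && negators.has(words[i - 1]);
    total += flipped ? -lexicon.get(words[i]) : lexicon.get(words[i]);
  }
  return total;
}

scoreWithNegation('I am happy');      // 3
scoreWithNegation('I am not happy');  // -3
```

It still misses "not very happy" and contractions like "isn't", because it only looks one word back and the cleanup turns "isn't" into "isnt".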

Visualizing sentiment over a story

The real magic happens when you score sentiment per sentence and plot it across a text. The emotional arc of a story becomes visible. Happy beginning, dark middle, triumphant ending? You'll see it as a color gradient shifting from warm to cold and back to warm.

const canvas = document.createElement('canvas');
canvas.width = 900;
canvas.height = 250;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// fake story: 30 sentences with an emotional arc
// positive start, dip into negative, positive ending
const storyScores = [];
for (let i = 0; i < 30; i++) {
  const t = i / 29;
  // arc shape: starts positive, dips negative at 60%, recovers
  const arc = Math.sin(t * Math.PI * 2 + Math.PI * 0.3) * 2;
  const noise = (Math.random() - 0.5) * 1.5;
  storyScores.push(arc + noise);
}

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 900, 250);

const barW = 900 / 30 - 2;

for (let i = 0; i < storyScores.length; i++) {
  const score = storyScores[i];
  const x = i * (barW + 2) + 1;

  // positive = warm amber, negative = cool blue
  const normalized = (score + 4) / 8;  // roughly -4 to +4 -> 0 to 1
  const hue = normalized * 40 + (1 - normalized) * 220;
  const lightness = 25 + Math.abs(score) * 8;

  const barH = Math.abs(score) * 20 + 5;
  const y = score >= 0 ? 125 - barH : 125;

  ctx.fillStyle = `hsl(${hue}, 55%, ${lightness}%)`;
  ctx.fillRect(x, y, barW, barH);
}

// center line
ctx.beginPath();
ctx.moveTo(0, 125);
ctx.lineTo(900, 125);
ctx.strokeStyle = 'rgba(80, 90, 110, 0.3)';
ctx.lineWidth = 1;
ctx.stroke();

Warm bars above the line: positive sentiment. Cool bars below: negative. The emotional arc is visible -- optimistic opening, descent into darkness around sentence 18, recovery toward the end. If you fed in an actual novel (sentence by sentence), you'd see the emotional structure that the author built. Kurt Vonnegut famously sketched story shapes by hand -- "man in hole" (happy, falls into trouble, climbs out), "boy meets girl" (rises, falls, rises). This visualization generates those shapes automatically from the text itself.
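
Real per-sentence scores are much spikier than the fake arc, because most sentences score zero. A moving average brings out the underlying shape. A minimal sketch: smoothScores is a hypothetical helper, and the window shrinks at the edges instead of padding.

```javascript
// Hypothetical helper: centered moving average over per-sentence scores.
function smoothScores(scores, radius) {
  return scores.map((_, i) => {
    const start = Math.max(0, i - radius);
    const end = Math.min(scores.length, i + radius + 1);
    const window = scores.slice(start, end);
    return window.reduce((a, b) => a + b, 0) / window.length;
  });
}

const raw = [3, 0, 0, -2, 0, -3, 0, 2];
const arc = smoothScores(raw, 1);
// arc[0] is the mean of [3, 0], arc[1] the mean of [3, 0, 0], and so on
```

A larger radius gives a smoother curve but flattens short emotional beats, so it is worth trying a few values.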

Text as color: every character a pixel

Here's a purely aesthetic approach: map each character in a text to a color and draw them as a grid of tiny colored rectangles. No semantic analysis, no word counting. Just the visual pattern of the character stream. Different languages produce different patterns because their character frequency distributions differ. English has lots of 'e', 't', 'a'. German has lots of 'e', 'n', 'r' plus umlauts. The visual fingerprint of a language emerges from character-level coloring.

const canvas = document.createElement('canvas');
canvas.width = 700;
canvas.height = 500;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

function charToColor(ch) {
  const code = ch.charCodeAt(0);

  if (ch === ' ') return { h: 0, s: 0, l: 5 };      // spaces: near-black
  if (ch === '\n') return { h: 0, s: 0, l: 3 };      // newlines: very dark
  if (/[aeiou]/.test(ch)) {
    // vowels: warm tones
    const vowelMap = { a: 0, e: 30, i: 50, o: 25, u: 40 };
    // ?? instead of || so that the hue 0 for 'a' isn't replaced by the fallback
    return { h: vowelMap[ch] ?? 20, s: 60, l: 45 };
  }
  if (/[a-z]/.test(ch)) {
    // consonants: cool tones
    const hue = 180 + ((code - 97) / 26) * 120;
    return { h: hue, s: 45, l: 35 };
  }
  if (/[0-9]/.test(ch)) return { h: 60, s: 40, l: 30 };  // digits: yellow
  // punctuation: bright accents
  return { h: 300, s: 50, l: 50 };
}

const text = `the quick brown fox jumps over the lazy dog and the cat sat on the mat
while the rain fell softly on the old tin roof making patterns in the dust
a bird sang somewhere in the distance its melody weaving through the afternoon
shadows lengthened across the garden as the sun dipped below the horizon
everything was still except for the wind rustling through the dry autumn leaves`;

const chars = text.split('');
const cols = 70;
const cellW = 700 / cols;
const cellH = cellW;

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 700, 500);

for (let i = 0; i < chars.length; i++) {
  const col = i % cols;
  const row = Math.floor(i / cols);
  const ch = chars[i].toLowerCase();
  const color = charToColor(ch);

  ctx.fillStyle = `hsl(${color.h}, ${color.s}%, ${color.l}%)`;
  ctx.fillRect(col * cellW, row * cellH, cellW - 0.5, cellH - 0.5);
}

Vowels glow warm (reds and oranges). Consonants sit cool (cyans and blues through to violets, since their hues run from about 180 to 300). Spaces are near-black gaps. Punctuation pops as bright magenta accents. The visual texture of English prose becomes a fabric of warm and cool patches, with the vowel-consonant rhythm creating a kind of weave pattern. Compare this to a language with fewer vowels (like Czech or Polish) and you'd see a cooler, bluer image. A language with lots of vowels (like Hawaiian or Italian) would glow warmer.

This technique maps the texture of language, not its meaning. Two texts about completely different subjects in the same language will produce similar color patterns. Two translations of the same text into different languages will look very different. The visualization captures the phonetic DNA of the language itself.
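
You can put a rough number on that warmth by measuring what share of the letters are vowels. This is only a sketch: vowelRatio is a hypothetical helper and it counts plain a, e, i, o, u, so accented vowels are treated as non-vowels.

```javascript
// Hypothetical helper: share of letters that are plain vowels (a, e, i, o, u).
function vowelRatio(text) {
  const letters = text.toLowerCase().match(/\p{L}/gu) || [];
  if (letters.length === 0) return 0;
  const vowels = letters.filter(ch => 'aeiou'.includes(ch));
  return vowels.length / letters.length;
}

const english = vowelRatio('the quick brown fox');  // 5 vowels out of 16 letters
const czech = vowelRatio('strč prst skrz krk');     // a Czech tongue twister with no vowels
```

The higher the ratio, the more warm cells the character grid will contain.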

Reading level: measuring complexity

The Flesch-Kincaid readability formula estimates what grade level a text is written at. It uses two inputs: average sentence length (words per sentence) and average syllable count per word. Longer sentences and longer words mean higher reading level. It's imperfect -- a long but familiar word like "hippopotamus" counts as difficult even though most 10-year-olds know it -- but as a rough metric it works surprisingly well.

function countSyllables(word) {
  word = word.toLowerCase().replace(/[^a-z]/g, '');
  if (word.length <= 3) return 1;

  // simple heuristic: count vowel groups
  const vowelGroups = word.match(/[aeiouy]+/g);
  let count = vowelGroups ? vowelGroups.length : 1;

  // silent e
  if (word.endsWith('e') && count > 1) count--;
  // words ending in 'le' after consonant
  if (word.endsWith('le') && word.length > 2 && !/[aeiouy]/.test(word[word.length - 3])) {
    count++;
  }

  return Math.max(1, count);
}

function fleschKincaid(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  const words = text.replace(/[^a-z\s]/gi, '').split(/\s+/).filter(w => w.length > 0);

  const totalSentences = sentences.length;
  const totalWords = words.length;
  const totalSyllables = words.reduce((sum, w) => sum + countSyllables(w), 0);

  const avgSentenceLen = totalWords / totalSentences;
  const avgSyllables = totalSyllables / totalWords;

  // Flesch-Kincaid Grade Level
  const grade = 0.39 * avgSentenceLen + 11.8 * avgSyllables - 15.59;
  return { grade: Math.max(0, grade), avgSentenceLen, avgSyllables };
}

A children's book might score grade 3-4. A newspaper article scores 8-10. Academic papers score 12-16. Feed in the text of a novel chapter by chapter and plot the grade level over the book -- you can see where the author simplifies (action scenes, dialogue) and where they get technical (exposition, world-building). The complexity map of a text is another kind of rhythm, related to but distinct from the sentiment arc.
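
To get a feel for how the two inputs move the score, it helps to plug in round numbers. The sketch below uses the same coefficients as fleschKincaid above; gradeFromTotals is a hypothetical helper and the two passages are invented totals, not measurements of real texts.

```javascript
// Grade level from raw totals, same coefficients as the formula above.
function gradeFromTotals(words, sentences, syllables) {
  return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59;
}

// two invented 100-word passages
const simple = gradeFromTotals(100, 10, 120);  // 10 words/sentence, 1.2 syllables/word -> about 2.5
const dense = gradeFromTotals(100, 4, 170);    // 25 words/sentence, 1.7 syllables/word -> about 14.2
```

Notice how heavily the syllable term weighs: half a syllable more per word adds almost six grade levels, while 15 more words per sentence adds a bit under six.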

Word clouds: done right

Word clouds are the most common text visualization and also the most criticized. They're often random, ugly, and hard to read. But the concept -- size proportional to frequency -- is sound. The execution just needs care.

const canvas = document.createElement('canvas');
canvas.width = 800;
canvas.height = 600;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// top 30 content words from a hypothetical text
const wordData = [
  ['ocean', 45], ['wave', 38], ['ship', 32], ['storm', 28], ['captain', 25],
  ['crew', 22], ['wind', 20], ['sail', 19], ['harbor', 17], ['coast', 16],
  ['night', 15], ['stars', 14], ['compass', 13], ['journey', 12], ['fog', 11],
  ['island', 10], ['reef', 10], ['current', 9], ['horizon', 9], ['depth', 8],
  ['anchor', 8], ['rope', 7], ['deck', 7], ['mast', 6], ['whale', 6],
  ['lighthouse', 5], ['tide', 5], ['salt', 5], ['port', 4], ['bow', 4]
];

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 800, 600);

const maxFreq = wordData[0][1];

// spiral placement: try positions along a spiral, accept if no overlap
const placed = [];

for (const [word, freq] of wordData) {
  const fontSize = 14 + (freq / maxFreq) * 48;
  ctx.font = `${Math.floor(fontSize)}px monospace`;
  const metrics = ctx.measureText(word);
  const wordW = metrics.width;
  const wordH = fontSize;

  // spiral outward from center until we find a free spot
  let px = 0, py = 0;
  let found = false;

  for (let t = 0; t < 500; t++) {
    const angle = t * 0.15;
    const radius = t * 0.8;
    const testX = 400 + Math.cos(angle) * radius - wordW / 2;
    const testY = 300 + Math.sin(angle) * radius + wordH / 3;

    // check overlap with placed words
    let overlaps = false;
    for (const p of placed) {
      if (testX < p.x + p.w + 4 && testX + wordW + 4 > p.x &&
          testY - wordH < p.y && testY > p.y - p.h) {
        overlaps = true;
        break;
      }
    }

    if (!overlaps) {
      px = testX;
      py = testY;
      found = true;
      break;
    }
  }

  if (found) {
    const hue = 180 + (freq / maxFreq) * 60;
    const lightness = 30 + (freq / maxFreq) * 25;

    ctx.font = `${Math.floor(fontSize)}px monospace`;
    ctx.fillStyle = `hsl(${hue}, 50%, ${lightness}%)`;
    ctx.fillText(word, px, py);

    placed.push({ x: px, y: py, w: wordW, h: wordH });
  }
}

The spiral placement algorithm starts at the center and works outward, testing each position along an Archimedean spiral until it finds a spot that doesn't overlap any previously placed word. High-frequency words are placed first (they're biggest and need the most room), so they end up near the center. Low-frequency words fill the gaps around the edges. The result is compact and organized, not the random scatter that gives word clouds a bad name.

The key improvement over naive word clouds: overlap detection. Without it, words stack on top of each other and become unreadable. With it, every word is visible and the size hierarchy is clean. It's still a word cloud -- purists will complain that exact comparison between similarly-sized words is hard -- but for creative coding, the visual impact is solid.
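
The overlap test is worth pulling into its own function if you plan to reuse it. A minimal sketch with hypothetical names (boxesOverlap, fitsAmong): note that it uses boxes with a top-left origin, which is simpler than the baseline-anchored coordinates in the loop above.

```javascript
// Axis-aligned overlap test with padding.
// Boxes are { x, y, w, h } with (x, y) at the top-left corner.
function boxesOverlap(a, b, pad = 0) {
  return a.x < b.x + b.w + pad &&
         a.x + a.w + pad > b.x &&
         a.y < b.y + b.h + pad &&
         a.y + a.h + pad > b.y;
}

// true when the candidate box collides with none of the placed boxes
function fitsAmong(candidate, placed, pad = 4) {
  return placed.every(p => !boxesOverlap(candidate, p, pad));
}

const first = { x: 0, y: 0, w: 10, h: 10 };
fitsAmong({ x: 5, y: 5, w: 10, h: 10 }, [first]);  // false: the boxes intersect
fitsAmong({ x: 20, y: 0, w: 5, h: 5 }, [first]);   // true: clear even with padding
```

The padding keeps a little air between words, which does a lot for readability.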

Comparing two texts

Side-by-side comparison reveals how differently two texts use language. Same visualization, different data. The contrast tells the story.

const canvas = document.createElement('canvas');
canvas.width = 900;
canvas.height = 400;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// two fake texts: one technical, one poetic
const techWords = [
  ['function', 22], ['data', 18], ['array', 15], ['loop', 14], ['variable', 12],
  ['code', 11], ['return', 10], ['string', 9], ['object', 9], ['method', 8],
  ['index', 7], ['value', 7], ['error', 6], ['type', 6], ['class', 5]
];

const poetWords = [
  ['moon', 20], ['river', 16], ['shadow', 14], ['silence', 13], ['dream', 12],
  ['light', 11], ['wind', 10], ['stone', 9], ['rain', 9], ['flower', 8],
  ['dawn', 7], ['bird', 7], ['song', 6], ['ocean', 6], ['star', 5]
];

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 900, 400);

const maxFreq = 22;

function drawWordBars(words, startX, hue, label) {
  ctx.fillStyle = 'rgba(180, 180, 200, 0.5)';
  ctx.font = '12px monospace';
  ctx.textAlign = 'center';
  ctx.fillText(label, startX + 190, 25);

  for (let i = 0; i < words.length; i++) {
    const [word, freq] = words[i];
    const y = 40 + i * 23;
    const barW = (freq / maxFreq) * 250;

    ctx.fillStyle = `hsla(${hue}, 50%, 45%, 0.7)`;
    ctx.fillRect(startX, y, barW, 18);

    ctx.fillStyle = 'rgba(200, 200, 220, 0.6)';
    ctx.font = '10px monospace';
    ctx.textAlign = 'left';
    ctx.fillText(word, startX + barW + 6, y + 13);
  }
}

drawWordBars(techWords, 40, 200, 'Technical Text');
drawWordBars(poetWords, 500, 320, 'Poetry');

Two horizontal bar charts, side by side. The technical text is dominated by "function", "data", "array" -- the vocabulary of code. The poetry is dominated by "moon", "river", "shadow" -- the vocabulary of imagery. The bar lengths make the frequency differences scannable at a glance. The hue difference (cool blue for tech, warm magenta for poetry) reinforces the thematic contrast.

This is the simplest form of text comparison but it's effective. You could go deeper: shared words highlighted in a third color, unique words per text emphasized, frequency ratios computed. But the basic side-by-side bar chart already communicates the core insight -- these two texts live in completely different vocabulary spaces.
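
The "go deeper" step mostly comes down to set operations on the two word lists. Here is a sketch under the assumption that both inputs are [word, count] lists like the ones above; compareVocab is a hypothetical name and the sample counts are invented.

```javascript
// Split two [word, count] lists into shared words (with frequency ratio)
// and words unique to each side.
function compareVocab(listA, listB) {
  const a = new Map(listA);
  const b = new Map(listB);
  const shared = [];
  const onlyA = [];
  const onlyB = [];

  for (const [word, count] of a) {
    if (b.has(word)) shared.push({ word, ratio: count / b.get(word) });
    else onlyA.push(word);
  }
  for (const word of b.keys()) {
    if (!a.has(word)) onlyB.push(word);
  }
  return { shared, onlyA, onlyB };
}

const result = compareVocab(
  [['ocean', 45], ['wind', 20], ['ship', 32]],
  [['moon', 20], ['wind', 10], ['ocean', 6]]
);
// shared: ocean (ratio 7.5) and wind (ratio 2); onlyA: ["ship"]; onlyB: ["moon"]
```

From there, the shared words can get a third color and the ratio can drive how far a bar leans toward one text or the other.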

Markov chain text generation

Allez, time for something different. Instead of analyzing text, let's generate it. A Markov chain builds a probability model from a source text: for each word, what words are likely to follow it? Then you generate new text by picking words according to those probabilities. The output has the same statistical texture as the input -- similar word patterns, similar rhythm -- but it's nonsense. Beautiful, evocative nonsense.

function buildMarkovChain(text, order) {
  const words = text.split(/\s+/);
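  // no prototype, so an order-1 key like "constructor" can't hit an inherited property
  const chain = Object.create(null);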

  for (let i = 0; i < words.length - order; i++) {
    const key = words.slice(i, i + order).join(' ');
    const next = words[i + order];

    if (!chain[key]) chain[key] = [];
    chain[key].push(next);
  }

  return chain;
}

function generateText(chain, order, length) {
  const keys = Object.keys(chain);
  let current = keys[Math.floor(Math.random() * keys.length)];
  const output = current.split(' ');

  for (let i = 0; i < length; i++) {
    const options = chain[current];
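    if (!options || options.length === 0) {
      // dead end: reseed from a random key and emit its words so the next lookup matches
      current = keys[Math.floor(Math.random() * keys.length)];
      output.push(...current.split(' '));
      continue;
    }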

    const next = options[Math.floor(Math.random() * options.length)];
    output.push(next);

    const words = output.slice(-order);
    current = words.join(' ');
  }

  return output.join(' ');
}

// example usage:
const source = `the sea was calm and the ship sailed slowly through the
dark water the stars reflected in the waves and the wind was gentle
the captain stood on the deck watching the horizon where clouds gathered
slowly the sea grew rough and the waves crashed against the hull
the wind howled through the rigging and the crew worked to secure the sails`;

const chain = buildMarkovChain(source, 2);
const generated = generateText(chain, 2, 50);

Order 1 produces word salad -- each word connects to any word that ever followed it in the source, so the output jumps around randomly. Order 2 is the sweet spot for creative use: it captures two-word phrases, so the output has local coherence ("the sea", "the wind") but global incoherence (sentences don't make sense as a whole). Order 3 starts reproducing entire source sentences because three-word sequences in a short source text are often unique.
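
You can measure that effect instead of eyeballing it: count how many keys have more than one distinct successor. Where there's no branching, the chain can only replay the source. A minimal sketch with a hypothetical helper name (branchingStats) and a made-up demo sentence.

```javascript
// Hypothetical helper: how many keys of an order-n chain have more than
// one distinct successor?
function branchingStats(text, order) {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const successors = new Map();

  for (let i = 0; i < words.length - order; i++) {
    const key = words.slice(i, i + order).join(' ');
    if (!successors.has(key)) successors.set(key, new Set());
    successors.get(key).add(words[i + order]);
  }

  let branching = 0;
  for (const set of successors.values()) {
    if (set.size > 1) branching++;
  }
  return { keys: successors.size, branching };
}

const demo = 'the sea was calm and the wind was gentle and the sea grew rough';
const order1 = branchingStats(demo, 1);  // { keys: 8, branching: 3 }
const order3 = branchingStats(demo, 3);  // { keys: 11, branching: 0 }
```

At order 3 every key in this demo has exactly one successor, so generation is fully deterministic. Longer source texts push that point to higher orders.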

For creative coding, Markov-generated text is material. You can visualize the generation process itself: draw each word as it's selected, color-coded by which source sentence it came from. The generated text becomes a visual patchwork of fragments from the original, reassembled into new patterns. The source text as DNA, the generated text as a mutant offspring.

Visualizing generated text with provenance

const canvas = document.createElement('canvas');
canvas.width = 800;
canvas.height = 400;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// generate text and record, for each word, which of the possible successors was picked
function generateWithProvenance(chain, order, length) {
  const keys = Object.keys(chain);
  let current = keys[Math.floor(Math.random() * keys.length)];
  const output = [];

  const startWords = current.split(' ');
  for (const w of startWords) {
    output.push({ word: w, source: Math.random() });
  }

  for (let i = 0; i < length; i++) {
    const options = chain[current];
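    if (!options || options.length === 0) {
      // dead end: reseed from a random key and emit its words so the next lookup matches
      current = keys[Math.floor(Math.random() * keys.length)];
      for (const w of current.split(' ')) output.push({ word: w, source: Math.random() });
      continue;
    }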

    const idx = Math.floor(Math.random() * options.length);
    const next = options[idx];
    output.push({ word: next, source: idx / options.length });

    const words = output.slice(-order).map(o => o.word);
    current = words.join(' ');
  }

  return output;
}

const source = `the sea was calm and the ship sailed through the dark water
the stars reflected in the waves and the wind was gentle the captain
stood on the deck watching the horizon where clouds gathered slowly`;

const chain = buildMarkovChain(source, 2);
const generated = generateWithProvenance(chain, 2, 60);

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 800, 400);

let x = 20;
let y = 40;
const lineHeight = 28;

ctx.font = '14px monospace';

for (const item of generated) {
  const metrics = ctx.measureText(item.word + ' ');

  if (x + metrics.width > 780) {
    x = 20;
    y += lineHeight;
  }

  // color from the picked option's index: single-option transitions always land on hue 120
  const hue = item.source * 280 + 120;
  ctx.fillStyle = `hsl(${hue}, 50%, 50%)`;
  ctx.fillText(item.word, x, y);

  x += metrics.width;
}

Each word is colored by which entry of the successor list was picked for it. A word that came from a deterministic transition (only one possible successor) always has index 0 and gets the same green. A word picked at a junction with many possible successors (high entropy) gets a hue that depends on which option was chosen. The first couple of seed words get a random hue. The color pattern reveals the structure of the Markov chain itself -- predictable passages are uniform in color, unpredictable junctions create color shifts. You're visualizing not just the text but the generation process.
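
How unpredictable a junction is can be put into a single number with Shannon entropy. This is a sketch of an alternative color driver, not what the code above does: transitionEntropy is a hypothetical helper that takes a successor list like chain[key].

```javascript
// Shannon entropy (in bits) of a successor list such as chain[key].
function transitionEntropy(options) {
  const counts = new Map();
  for (const w of options) counts.set(w, (counts.get(w) || 0) + 1);

  let entropy = 0;
  for (const n of counts.values()) {
    const p = n / options.length;
    entropy -= p * Math.log2(p);
  }
  return entropy;
}

transitionEntropy(['sea']);                 // 0 bits: fully deterministic
transitionEntropy(['sea', 'wind']);         // 1 bit: a coin flip
transitionEntropy(['sea', 'sea', 'wind']);  // about 0.92 bits
```

Map entropy to lightness or saturation and the junctions in the generated text light up, while the replayed stretches stay dim.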

Creative exercise: text portrait of two texts

Time to put it together. Take two short texts with different styles. Compute word frequency, average sentence length, and sentiment for each. Create a side-by-side visual portrait: a colored bar for each sentence (height from sentence length, color from sentiment), with the most frequent content words arranged around it.

const canvas = document.createElement('canvas');
canvas.width = 900;
canvas.height = 500;
document.body.appendChild(canvas);
const ctx = canvas.getContext('2d');

// two contrasting texts
const textA = `The sun rose over the mountain and light poured into the valley.
Birds sang in the trees and the river sparkled.
It was a beautiful morning and everything felt alive.
The children ran through the meadow laughing.
Flowers bloomed everywhere in brilliant colors.
A gentle breeze carried the scent of pine.
Life was good and the world was at peace.`;

const textB = `The factory stood silent in the grey rain.
Rust crept along the walls and the windows were broken.
Nobody came here anymore.
The machines had stopped years ago and dust covered everything.
Wind whistled through the cracks.
A dog wandered through the empty parking lot alone.
The only sound was water dripping from the rusted pipes.`;

function analyzeText(text) {
  const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
  return sentences.map(s => {
    const words = s.trim().split(/\s+/);
    let score = 0;
    for (const w of words) {
      const lower = w.toLowerCase().replace(/[^a-z]/g, '');
      // own keys only, so inherited names like "constructor" are never scored
      if (Object.hasOwn(sentiment, lower)) score += sentiment[lower];
    }
    return { length: words.length, sentiment: score, text: s.trim() };
  });
}

const analysisA = analyzeText(textA);
const analysisB = analyzeText(textB);

ctx.fillStyle = '#0a0a1a';
ctx.fillRect(0, 0, 900, 500);

function drawTextPortrait(analysis, startX, label) {
  ctx.fillStyle = 'rgba(180, 180, 200, 0.5)';
  ctx.font = '12px monospace';
  ctx.textAlign = 'center';
  ctx.fillText(label, startX + 180, 30);

  const maxLen = 20;
  const barWidth = 360 / analysis.length - 3;

  for (let i = 0; i < analysis.length; i++) {
    const s = analysis[i];
    const x = startX + i * (barWidth + 3);
    const barH = (Math.min(s.length, maxLen) / maxLen) * 300;
    const y = 420 - barH;

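    // sentiment -> color: positive = warm amber (hue 40), negative = cool blue (hue 220)
    // clamp to [-6, 6] so an extreme score can't push hue or lightness out of range
    const clamped = Math.max(-6, Math.min(6, s.sentiment));
    const sentNorm = (clamped + 6) / 12;
    const hue = sentNorm * 40 + (1 - sentNorm) * 220;
    const lightness = 25 + Math.abs(clamped) * 5;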

    ctx.fillStyle = `hsl(${hue}, 55%, ${lightness}%)`;
    ctx.fillRect(x, y, barWidth, barH);

    // sentence length number
    ctx.fillStyle = 'rgba(160, 170, 190, 0.4)';
    ctx.font = '9px monospace';
    ctx.textAlign = 'center';
    ctx.fillText(s.length, x + barWidth / 2, 440);
  }
}

drawTextPortrait(analysisA, 30, 'Sunny Valley');
drawTextPortrait(analysisB, 480, 'Abandoned Factory');

If the lexicon does its job, the sunny valley bars lean warm amber (positive sentiment) and the abandoned factory bars lean cool blue (negative sentiment). Sentences with no lexicon hits score 0 and sit at the midpoint of the hue ramp (hue 130, a dim green), so expect a few of those too. The bar heights show sentence length variation: the valley text stays between 6 and 12 words per sentence, while the factory text has one very short sentence ("Nobody came here anymore.", 4 words) that creates a visual dip. The two portraits are visually distinct even if you can't read the words. The color temperature alone tells you: one text is happy, one is not.
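The exercise brief also mentions the most frequent content words, which the portrait code above doesn't draw. Here is a minimal sketch of that step, assuming a small hand-picked stop-word set. `STOP_WORDS` and `topContentWords` are illustrative names; swap in whatever stop list you used earlier in the episode.

```javascript
// small illustrative stop-word set; a real list would be longer
const STOP_WORDS = new Set([
  'the', 'a', 'an', 'and', 'of', 'to', 'in', 'on', 'is', 'it', 'was', 'were',
  'at', 'from', 'over', 'into', 'through', 'had', 'here', 'along'
]);

// returns up to n [word, count] pairs, most frequent first
function topContentWords(text, n = 5) {
  const counts = new Map();
  const tokens = text.toLowerCase().match(/[a-z']+/g) || [];
  for (const word of tokens) {
    if (STOP_WORDS.has(word)) continue;
    counts.set(word, (counts.get(word) || 0) + 1);
  }
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, n);
}

const sample = 'The rain fell on the factory. Rain and rust and more rain.';
console.log(topContentWords(sample, 2));
```

To put the words on the canvas, loop over the result and call `ctx.fillText` for each one above or below the bars, scaling the font size by count.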

See where this is going? :-) Text is infinite creative material. Every book, every speech, every tweet, every log file has measurable properties that map to visual channels. Word frequency drives size. Sentiment drives color. Sentence length drives rhythm. Character distribution drives texture. And Markov chains can generate new text with the statistical DNA of the original. Data art from text isn't limited to word clouds -- it's an entire visual language for the written word.

The techniques from episodes 82-85 all apply here. The map() function from episode 82 converts word counts to pixel sizes. The log scaling from episode 83 tames Zipf's power law. The temporal layout from episode 84 works for plotting sentiment over a story. The network visualization from episode 85 could show word co-occurrence graphs. Text analysis plugs directly into the data-to-visuals pipeline we've been building.
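As a concrete case of the first two, here is a sketch that combines a linear range mapper with log scaling to turn word counts into font sizes. `mapRange` is a stand-in with the common (value, inMin, inMax, outMin, outMax) shape, which may differ from the exact helper in episode 82. `fontSizeFor` and the 10-48px range are arbitrary choices for illustration.

```javascript
// linear remap of value from [inMin, inMax] to [outMin, outMax]
function mapRange(value, inMin, inMax, outMin, outMax) {
  return outMin + ((value - inMin) / (inMax - inMin)) * (outMax - outMin);
}

// log-scale the count first so a word that appears 100x more often
// isn't drawn 100x larger
function fontSizeFor(count, maxCount, minPx = 10, maxPx = 48) {
  // every word appears once: avoid dividing by log(1) = 0
  if (maxCount <= 1) return maxPx;
  return mapRange(Math.log(count), 0, Math.log(maxCount), minPx, maxPx);
}

console.log(fontSizeFor(1, 1000));    // rarest word -> minPx
console.log(fontSizeFor(30, 1000));   // somewhere in between
console.log(fontSizeFor(1000, 1000)); // most frequent word -> maxPx
```

Without the log step, a Zipf-shaped distribution would leave one or two giant words and everything else at the minimum size.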

't Komt erop neer...

Text is unstructured data. Before you can visualize it, you have to measure it -- word frequency, sentence length, character distribution, sentiment scores. These measurements turn prose into numbers your canvas can work with
Word frequency analysis counts each word's occurrences and sorts by count. Stop words ("the", "and", "is") dominate every English text -- filter them to reveal the content words that characterize what the text is about
Zipf's law: in any natural language, word frequency times rank is roughly constant. This power law creates extreme distributions (a few very common words, a long tail of rare ones) that require log scaling for visual display
Sentence length variation reveals writing rhythm. Short sentences often signal tension. Long sentences often signal exposition. Plotting sentence lengths as a bar chart turns prose rhythm into visible pattern -- you can see Hemingway vs Faulkner without reading a word
Sentiment analysis scores text as positive or negative using dictionary lookup (AFINN lexicon: -5 to +5 per word). Crude but effective for creative coding. Plot sentiment per sentence across a story to see the emotional arc -- the shape of the narrative's mood
Character-to-color mapping treats each letter as a pixel. Vowels get warm tones, consonants get cool tones, spaces stay dark. The resulting grid shows the phonetic texture of the language itself -- different languages produce different visual patterns from their character frequency distributions
Flesch-Kincaid readability estimates a grade level from average sentence length (words per sentence) and average syllables per word. Visualizing reading level across a text shows where the author simplifies and where they get technical
Word clouds work when you use spiral placement with overlap detection. Place high-frequency words first at the center, let smaller words fill gaps outward. The overlap test is what separates readable word clouds from visual garbage
Text comparison works through side-by-side bar charts of word frequency. Two texts in different domains (technical vs poetic) occupy completely different vocabulary spaces -- the contrast is immediately visible in the bar lengths and word labels
Markov chains build probability models from source text (what words follow what other words) and generate new text with the same statistical properties. Order 2 hits the creative sweet spot: locally coherent phrases, globally nonsensical. The generated text carries the statistical DNA of the source
Visualizing Markov generation with provenance tracking (coloring each word by its origin in the probability table) reveals the structure of the generation process itself -- predictable passages vs high-entropy junctions

Sallukes! Thanks for reading.

@femdev
