Learn AI Series (#76) - Mini Project - Your Own AI Assistant

What will I learn

  • You will learn how to combine RAG, agents, tools, and a local model into a working AI assistant;
  • building a domain-specific assistant that actually knows your data through ChromaDB-backed retrieval;
  • tool integration for calculation, code execution, file access, and knowledge search;
  • conversation management with context window strategies and history compression;
  • evaluation and iteration: measuring whether your assistant actually works;
  • what separates a demo from something people actually use in practice.

Requirements

  • A working modern computer running macOS, Windows or Ubuntu;
  • An installed Python 3(.11+) distribution;
  • The ambition to learn AI and machine learning.

Difficulty

  • Beginner

Curriculum (of the Learn AI Series):

Learn AI Series (#76) - Mini Project - Your Own AI Assistant

Solutions to Episode #75 Exercises

Exercise 1: CLIP-style similarity search engine.

import random
import math


class ImageTextSearch:
    """CLIP-style similarity search using simulated embeddings."""

    def __init__(self, embed_dim=128):
        self.embed_dim = embed_dim
        self.entries = []

    def _make_embedding(self, seed_val):
        """Generate a deterministic normalized random embedding."""
        rng = random.Random(seed_val)
        vec = [rng.gauss(0, 1) for _ in range(self.embed_dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def _cosine_similarity(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot  # both are unit vectors, so dot = cosine

    def add_image(self, path, description):
        # Seed with the string itself: hash() on str is randomized per
        # process, so it would give different embeddings on every run
        emb = self._make_embedding(path)
        self.entries.append({
            "path": path,
            "description": description,
            "embedding": emb,
        })

    def search_by_text(self, query, top_k=5):
        # Seed with the query string (str hash() varies between runs)
        query_emb = self._make_embedding(query)
        scored = []
        for entry in self.entries:
            sim = self._cosine_similarity(query_emb, entry["embedding"])
            scored.append((sim, entry))
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:top_k]

    def search_by_image(self, image_path, top_k=5):
        target = None
        for entry in self.entries:
            if entry["path"] == image_path:
                target = entry["embedding"]
                break
        if target is None:
            return []
        scored = []
        for entry in self.entries:
            if entry["path"] == image_path:
                continue
            sim = self._cosine_similarity(target, entry["embedding"])
            scored.append((sim, entry))
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:top_k]

    def find_duplicates(self, threshold=0.95):
        pairs = []
        for i in range(len(self.entries)):
            for j in range(i + 1, len(self.entries)):
                sim = self._cosine_similarity(
                    self.entries[i]["embedding"],
                    self.entries[j]["embedding"]
                )
                if sim > threshold:
                    pairs.append((
                        self.entries[i]["path"],
                        self.entries[j]["path"],
                        sim
                    ))
        return pairs


search = ImageTextSearch()

categories = {
    "animals": [
        ("golden_retriever.jpg", "A golden retriever playing in a park"),
        ("tabby_cat.jpg", "An orange tabby cat sleeping on a windowsill"),
        ("parrot.jpg", "A colorful macaw parrot on a branch"),
        ("horse_field.jpg", "A brown horse grazing in a green field"),
        ("penguin_ice.jpg", "Emperor penguins huddled on Antarctic ice"),
    ],
    "landscapes": [
        ("mountain_sunset.jpg", "Sunset over snow-capped mountain peaks"),
        ("ocean_cliff.jpg", "Dramatic ocean cliffs at golden hour"),
        ("desert_dunes.jpg", "Rolling sand dunes in the Sahara"),
        ("forest_path.jpg", "A winding path through an autumn forest"),
        ("lake_reflection.jpg", "Mountain lake with mirror-like reflections"),
    ],
    "food": [
        ("pasta_dish.jpg", "Fresh pasta with tomato basil sauce"),
        ("sushi_platter.jpg", "Assorted sushi and sashimi on a wooden board"),
        ("chocolate_cake.jpg", "Rich dark chocolate layer cake"),
        ("fruit_bowl.jpg", "Colorful bowl of tropical fruits"),
        ("bread_basket.jpg", "Freshly baked sourdough bread loaves"),
    ],
    "architecture": [
        ("gothic_cathedral.jpg", "Gothic cathedral with flying buttresses"),
        ("modern_tower.jpg", "Glass and steel skyscraper at night"),
        ("japanese_temple.jpg", "Traditional Japanese temple in a garden"),
        ("bridge_sunset.jpg", "Suspension bridge at sunset"),
        ("old_town.jpg", "Cobblestone streets in a European old town"),
    ],
}

for cat, items in categories.items():
    for path, desc in items:
        search.add_image(path, desc)

# Text search
print("Search: 'sunset over mountains'")
results = search.search_by_text("sunset over mountains", top_k=5)
for sim, entry in results:
    print(f"  {sim:+.4f}  {entry['path']:25s}  {entry['description']}")

# Image search
print("\nSimilar to 'mountain_sunset.jpg':")
results = search.search_by_image("mountain_sunset.jpg", top_k=5)
for sim, entry in results:
    print(f"  {sim:+.4f}  {entry['path']:25s}  {entry['description']}")

# Duplicate detection
print("\nDuplicate pairs (threshold 0.95):")
dupes = search.find_duplicates(threshold=0.95)
if dupes:
    for a, b, sim in dupes:
        print(f"  {sim:.4f}  {a} <-> {b}")
else:
    print("  None found (expected with random embeddings)")

The seeded random embeddings are deterministic -- same path or query always produces the same vector. In a real system you'd replace _make_embedding with actual CLIP inference, but the search and ranking logic stays identical. Cosine similarity between unit vectors is just their dot product, which is the core operation behind every embedding-based retrieval system we've built since episode #63.
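
To make the unit-vector claim concrete, here is a minimal sketch that compares the full cosine formula against a plain dot product of normalized vectors. The helper names normalize and cosine are illustrative and not part of the solution above:

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Full cosine similarity: dot product divided by both norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a = [3.0, 4.0]
b = [4.0, 3.0]

full = cosine(a, b)
unit_dot = sum(x * y for x, y in zip(normalize(a), normalize(b)))

print(f"cosine(a, b)        = {full:.4f}")
print(f"dot of unit vectors = {unit_dot:.4f}")
```

Both lines print the same value, which is why normalizing once at insert time lets the search loop skip the two square roots per comparison.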

Exercise 2: Contrastive learning trainer.

import random
import math


class ContrastiveTrainer:
    """Simulates CLIP-style contrastive training."""

    def __init__(self, n_categories=10, pairs_per_cat=10,
                 embed_dim=64):
        self.n_categories = n_categories
        self.pairs_per_cat = pairs_per_cat
        self.embed_dim = embed_dim
        self.data = self._generate_data()

        # Initialize random embeddings
        rng = random.Random(42)
        self.image_embs = []
        self.text_embs = []
        for _ in range(len(self.data)):
            ie = [rng.gauss(0, 1) for _ in range(embed_dim)]
            te = [rng.gauss(0, 1) for _ in range(embed_dim)]
            self.image_embs.append(self._normalize(ie))
            self.text_embs.append(self._normalize(te))

    def _generate_data(self):
        pairs = []
        for cat in range(self.n_categories):
            for _ in range(self.pairs_per_cat):
                pairs.append(cat)
        return pairs

    def _normalize(self, vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm < 1e-10:
            return vec
        return [x / norm for x in vec]

    def _dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

    def compute_loss(self, img_indices, txt_indices,
                     temperature=0.07):
        batch = len(img_indices)
        # Similarity matrix
        logits = []
        for i in img_indices:
            row = []
            for j in txt_indices:
                sim = self._dot(self.image_embs[i],
                                self.text_embs[j])
                row.append(sim / temperature)
            logits.append(row)

        # Cross entropy in both directions
        loss = 0.0
        for k in range(batch):
            # Image-to-text
            max_val = max(logits[k])
            exps = [math.exp(v - max_val) for v in logits[k]]
            total = sum(exps)
            loss -= math.log(exps[k] / total + 1e-10)

            # Text-to-image
            col = [logits[r][k] for r in range(batch)]
            max_val = max(col)
            exps = [math.exp(v - max_val) for v in col]
            total = sum(exps)
            loss -= math.log(exps[k] / total + 1e-10)

        return loss / (2 * batch)

    def alignment_score(self):
        """Average cosine sim: matching vs non-matching."""
        n = len(self.data)
        match_sims = []
        nonmatch_sims = []
        for i in range(n):
            for j in range(n):
                sim = self._dot(self.image_embs[i],
                                self.text_embs[j])
                if self.data[i] == self.data[j]:
                    match_sims.append(sim)
                else:
                    nonmatch_sims.append(sim)
        avg_match = sum(match_sims) / len(match_sims)
        avg_non = sum(nonmatch_sims) / len(nonmatch_sims)
        return avg_match, avg_non

    def train(self, steps=50, lr=0.1, batch_size=16,
              temperature=0.07):
        rng = random.Random(123)
        indices = list(range(len(self.data)))

        for step in range(steps):
            batch = rng.sample(indices, min(batch_size,
                                            len(indices)))

            # Heuristic update (not a true loss gradient): pull each
            # matching pair together, push non-matching pairs apart
            for idx in range(len(batch)):
                i = batch[idx]
                for d in range(self.embed_dim):
                    # Increase similarity to matching pair
                    j = batch[idx]
                    grad_img = (self.text_embs[j][d]
                                - self.image_embs[i][d])
                    grad_txt = (self.image_embs[i][d]
                                - self.text_embs[j][d])

                    # Decrease similarity to non-matching
                    for k_idx in range(len(batch)):
                        k = batch[k_idx]
                        if self.data[k] != self.data[i]:
                            grad_img -= (
                                0.1 * self.text_embs[k][d]
                            )
                            grad_txt -= (
                                0.1 * self.image_embs[k][d]
                            )

                    self.image_embs[i][d] += lr * grad_img
                    self.text_embs[j][d] += lr * grad_txt

                self.image_embs[i] = self._normalize(
                    self.image_embs[i]
                )
                self.text_embs[batch[idx]] = self._normalize(
                    self.text_embs[batch[idx]]
                )

            if step in [0, 10, 25, 49]:
                loss = self.compute_loss(batch, batch,
                                         temperature)
                m, nm = self.alignment_score()
                gap = m - nm
                print(f"Step {step:>3}: loss={loss:.4f}  "
                      f"match={m:.4f}  non={nm:.4f}  "
                      f"gap={gap:.4f}")


trainer = ContrastiveTrainer()
trainer.train(steps=50, lr=0.1)

The alignment gap (matching similarity minus non-matching similarity) is the metric that actually matters. A gap of 0 means the model can't distinguish matching from non-matching pairs. As training progresses, matching pairs get pulled together while non-matching pairs get pushed apart -- that's the entire contrastive learning dynamic in two numbers.
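
As a small illustration of how that gap behaves, the sketch below computes it for two hand-written similarity matrices. The matrices and labels are made-up values for demonstration, not output from the trainer:

```python
def alignment_gap(sim_matrix, labels):
    """Mean similarity of same-label pairs minus mean of other pairs."""
    match, nonmatch = [], []
    for i, row in enumerate(sim_matrix):
        for j, sim in enumerate(row):
            if labels[i] == labels[j]:
                match.append(sim)
            else:
                nonmatch.append(sim)
    return sum(match) / len(match) - sum(nonmatch) / len(nonmatch)


labels = [0, 0, 1]  # hypothetical category of each image/text pair

# Untrained: every image-text similarity looks about the same
untrained = [[0.1, 0.1, 0.1],
             [0.1, 0.1, 0.1],
             [0.1, 0.1, 0.1]]

# Trained: same-category pairs score high, the rest score low
trained = [[0.9, 0.8, 0.0],
           [0.8, 0.9, 0.0],
           [0.0, 0.0, 0.9]]

gap_untrained = alignment_gap(untrained, labels)
gap_trained = alignment_gap(trained, labels)
print(f"untrained gap: {gap_untrained:.3f}")
print(f"trained gap:   {gap_trained:.3f}")
```

The untrained matrix gives a gap of zero, while the trained one gives a clearly positive gap, which is the direction the printed gap column should move during training.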

Exercise 3: Visual question answering evaluator.

class VQAEvaluator:
    """Evaluate VQA model accuracy across question categories."""

    def __init__(self):
        self.test_cases = [
            {"image": "Three red cars parked on a street",
             "question": "How many cars are there?",
             "truth": "3", "category": "counting"},
            {"image": "Five people standing near two bicycles",
             "question": "How many people are in the image?",
             "truth": "5", "category": "counting"},
            {"image": "A single white cat on a blue couch",
             "question": "How many cats are visible?",
             "truth": "1", "category": "counting"},
            {"image": "A bright red sports car on a highway",
             "question": "What color is the car?",
             "truth": "red", "category": "color"},
            {"image": "A woman in a green dress near yellow flowers",
             "question": "What color is the dress?",
             "truth": "green", "category": "color"},
            {"image": "A blue mug on a wooden brown table",
             "question": "What color is the mug?",
             "truth": "blue", "category": "color"},
            {"image": "A dog sitting to the left of a tree",
             "question": "Is the dog to the left or right of the tree?",
             "truth": "left", "category": "spatial"},
            {"image": "A lamp above a desk with books below",
             "question": "Where is the lamp relative to the desk?",
             "truth": "above", "category": "spatial"},
            {"image": "A cat behind a glass window",
             "question": "Is the cat inside or outside?",
             "truth": "inside", "category": "spatial"},
            {"image": "A park with several dogs and a fountain",
             "question": "Is there a fountain in the image?",
             "truth": "yes", "category": "yes-no"},
            {"image": "An empty parking lot at night",
             "question": "Are there any cars in the lot?",
             "truth": "no", "category": "yes-no"},
            {"image": "A beach with umbrellas and waves",
             "question": "Is this a beach scene?",
             "truth": "yes", "category": "yes-no"},
            {"image": "A sign reading STOP in bold red letters",
             "question": "What does the sign say?",
             "truth": "stop", "category": "reading"},
            {"image": "A storefront with OPEN neon sign",
             "question": "What word is on the neon sign?",
             "truth": "open", "category": "reading"},
            {"image": "A license plate showing ABC 1234",
             "question": "What is on the license plate?",
             "truth": "abc 1234", "category": "reading"},
        ]

    def _simple_vqa(self, image_desc, question):
        """Keyword-matching VQA simulator."""
        desc = image_desc.lower()
        q = question.lower()
        numbers = {"one": "1", "two": "2", "three": "3",
                   "four": "4", "five": "5", "six": "6"}
        colors = ["red", "blue", "green", "yellow", "white",
                  "black", "brown", "orange", "purple", "pink"]
        spatials = ["left", "right", "above", "below",
                    "inside", "outside", "behind", "front"]

        if "how many" in q:
            for word in desc.split():
                if word.isdigit():
                    return word
                if word in numbers:
                    return numbers[word]
            return "1"

        if "color" in q or "colour" in q:
            for color in colors:
                if color in desc:
                    return color
            return "unknown"

        if any(s in q for s in ["left", "right", "where",
                                "inside", "outside"]):
            for sp in spatials:
                if sp in desc:
                    return sp
            return "unknown"

        if q.startswith("is ") or q.startswith("are "):
            for word in q.split():
                if word in desc:
                    return "yes"
            return "no"

        if "what" in q and ("say" in q or "sign" in q
                            or "read" in q or "word" in q
                            or "plate" in q):
            import re
            caps = re.findall(r'[A-Z][A-Z\s\d]+', image_desc)
            if caps:
                return caps[0].strip().lower()
            return "unknown"

        return "unknown"

    def evaluate(self, model_fn=None):
        if model_fn is None:
            model_fn = self._simple_vqa

        results = []
        for case in self.test_cases:
            answer = model_fn(case["image"], case["question"])
            answer = answer.lower().strip()
            truth = case["truth"].lower().strip()

            exact = answer == truth
            relaxed = truth in answer or answer in truth
            confidence = 1.0 if exact else (0.5 if relaxed
                                            else 0.0)
            results.append({
                "category": case["category"],
                "exact": exact,
                "relaxed": relaxed,
                "confidence": confidence,
                "answer": answer,
                "truth": truth,
            })

        # Per-category report
        cats = {}
        for r in results:
            c = r["category"]
            if c not in cats:
                cats[c] = {"exact": 0, "relaxed": 0,
                           "conf": [], "total": 0,
                           "failures": []}
            cats[c]["total"] += 1
            if r["exact"]:
                cats[c]["exact"] += 1
            if r["relaxed"]:
                cats[c]["relaxed"] += 1
            cats[c]["conf"].append(r["confidence"])
            if not r["exact"]:
                cats[c]["failures"].append(
                    f"  got '{r['answer']}', "
                    f"expected '{r['truth']}'"
                )

        print(f"{'Category':<12} {'Exact':>7} {'Relaxed':>9} "
              f"{'Conf':>6} {'N':>3}")
        print("-" * 42)
        for cat in sorted(cats):
            s = cats[cat]
            ea = s["exact"] / s["total"]
            ra = s["relaxed"] / s["total"]
            ac = sum(s["conf"]) / len(s["conf"])
            print(f"{cat:<12} {ea:>7.0%} {ra:>9.0%} "
                  f"{ac:>6.2f} {s['total']:>3}")

        total = len(results)
        exact_all = sum(1 for r in results if r["exact"])
        relax_all = sum(1 for r in results if r["relaxed"])
        print(f"\nOverall: {exact_all}/{total} exact, "
              f"{relax_all}/{total} relaxed")

        return results, cats

    def error_analysis(self, results=None, cats=None):
        if results is None:
            results, cats = self.evaluate()

        worst = sorted(cats.items(),
                       key=lambda x: x[1]["exact"] / x[1]["total"])
        print("\nError Analysis (hardest categories first):")
        for cat, s in worst:
            rate = s["exact"] / s["total"]
            print(f"  {cat}: {rate:.0%} exact match")
            for f in s["failures"]:
                print(f"    FAIL:{f}")


evaluator = VQAEvaluator()
results, cats = evaluator.evaluate()
evaluator.error_analysis(results, cats)

The per-category breakdown is where the real insight lives. Counting and color questions tend to work well with simple keyword matching because the answers are directly stated in image descriptions. Spatial reasoning is harder because "left of the tree" requires understanding relational prepositions. Reading/OCR is even harder since it depends on how exactly the text was transcribed in the description. This maps loosely to how real VLMs perform -- they are generally strong on color and simple counting, weaker on spatial reasoning, and inconsistent with text extraction. Knowing your model's category-level weaknesses tells you exactly where to invest in improvement.
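
To show why the breakdown matters more than the headline number, here is a minimal sketch with hypothetical results in the same shape that evaluate() produces. The function name per_category_accuracy and the sample data are assumptions for illustration:

```python
from collections import defaultdict


def per_category_accuracy(results):
    """Return {category: exact-match rate} from a list of result dicts."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        if r["exact"]:
            correct[r["category"]] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}


# Hypothetical results, not real model output
results = [
    {"category": "color", "exact": True},
    {"category": "color", "exact": True},
    {"category": "spatial", "exact": True},
    {"category": "spatial", "exact": False},
]

rates = per_category_accuracy(results)
overall = sum(r["exact"] for r in results) / len(results)

print(f"overall: {overall:.0%}")
# Weakest category first, mirroring error_analysis()
for cat, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    print(f"  {cat}: {rate:.0%}")
```

An overall score of 75% hides that every miss sits in one category, and that is the signal you need when deciding what to fix next.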

On to today's episode

Here we go! This is the fourth mini project in our series, and it's the capstone of Arc 4 -- the entire LLM arc. In episode #21 we built a crypto market regime classifier. In episode #36 we built a complete ML pipeline. In episode #56 we built a transformer from scratch. Each mini project combined everything we'd learned up to that point into something you could actually use.

This time we're pulling together twenty episodes' worth of LLM building blocks: language models (#57-58), embeddings and RAG (#63-65), APIs (#66), agents and tools (#67-68), fine-tuning (#69), local models (#70), generation techniques (#71), evaluation (#73), the Hugging Face ecosystem (#74), and multimodal models (#75). We're building a personal AI assistant that knows your documents, uses tools, and runs entirely on your own hardware.

Not a wrapper around someone else's API. An actual system you own and control ;-)

The architecture

Our assistant has four components that each map directly to concepts we've covered:

  1. Local LLM (via Ollama, episode #70) -- the brain
  2. RAG system (ChromaDB + embedding model, episodes #63-65) -- domain knowledge
  3. Tool system (episodes #67-68) -- actions beyond text generation
  4. Conversation manager -- context window strategy and memory

Here's the full class scaffold with initialization:

import ollama
import chromadb
import json
import subprocess
import re
from pathlib import Path


class PersonalAssistant:
    def __init__(self, model="llama3.1:8b",
                 docs_path="./knowledge"):
        self.model = model
        self.chroma = chromadb.PersistentClient(
            path="./assistant_db"
        )
        self.collection = self.chroma.get_or_create_collection(
            "knowledge"
        )
        self.conversation = []
        self.max_history = 20
        self.tools = self._register_tools()
        self.system_prompt = self._build_system_prompt()

        # Index documents if provided
        if Path(docs_path).exists():
            self._index_documents(docs_path)

    def _build_system_prompt(self):
        tool_descriptions = "\n".join(
            f"- {name}: {info['description']}"
            for name, info in self.tools.items()
        )
        return (
            "You are a helpful personal assistant with "
            "access to the user's documents and tools.\n\n"
            "When answering questions:\n"
            "1. Check if the retrieved context contains "
            "relevant information\n"
            "2. If yes, base your answer on that context "
            "and cite the source\n"
            "3. If no relevant context, use your general "
            "knowledge but say so\n"
            "4. Use tools when the user's request requires "
            "action\n\n"
            f"Available tools:\n{tool_descriptions}\n\n"
            "To use a tool, respond with: "
            "TOOL: tool_name(arg1, arg2)\n"
            "After seeing the result, provide your "
            "final answer."
        )

The system prompt is deliberately explicit about when to use tools versus when to use retrieved context versus when to fall back on general knowledge. This is prompt engineering from episode #62 in practice -- clear instructions reduce the model's tendency to hallucinate or ignore available tools.

Indexing your documents

The RAG system needs your documents chunked and embedded. We covered the theory in episodes #63-64, but here's the full implementation tailored for a personal knowledge base:

    def _index_documents(self, docs_path):
        """Index all text and markdown files."""
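        docs_path = Path(docs_path)
        # Chunk IDs look like "<doc_id>::<n>"; strip the suffix so the
        # skip check below compares document IDs, not chunk IDs
        existing_ids = {
            chunk_id.rsplit("::", 1)[0]
            for chunk_id in self.collection.get()["ids"]
        }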

        for file_path in docs_path.rglob("*"):
            if file_path.suffix not in [".txt", ".md",
                                         ".py", ".json"]:
                continue

            doc_id = str(file_path.relative_to(docs_path))
            if doc_id in existing_ids:
                continue

            text = file_path.read_text(
                encoding="utf-8", errors="ignore"
            )
            chunks = self._chunk_text(
                text, chunk_size=500, overlap=50
            )

            for i, chunk in enumerate(chunks):
                self.collection.upsert(
                    ids=[f"{doc_id}::{i}"],
                    documents=[chunk],
                    metadatas=[{
                        "source": doc_id,
                        "chunk": i
                    }]
                )

        print(f"Knowledge base: "
              f"{self.collection.count()} chunks indexed")

    def _chunk_text(self, text, chunk_size=500, overlap=50):
        """Split text into overlapping chunks
        at paragraph boundaries."""
        paragraphs = text.split("\n\n")
        chunks = []
        current = ""

        for para in paragraphs:
            if (len(current) + len(para) > chunk_size
                    and current):
                chunks.append(current.strip())
                # Carry roughly `overlap` characters (~5 per word)
                # from the end of the previous chunk
                words = current.split()
                n_words = overlap // 5
                overlap_text = " ".join(
                    words[-n_words:] if n_words else []
                )
                current = overlap_text + "\n\n" + para
            else:
                current = (current + "\n\n" + para
                           if current else para)

        if current.strip():
            chunks.append(current.strip())

        return chunks

    def _retrieve(self, query, n_results=3):
        """Retrieve relevant document chunks."""
        if self.collection.count() == 0:
            return []

        results = self.collection.query(
            query_texts=[query], n_results=n_results
        )
        contexts = []
        for doc, meta in zip(results["documents"][0],
                             results["metadatas"][0]):
            contexts.append({
                "text": doc,
                "source": meta["source"]
            })
        return contexts

The _chunk_text method splits at paragraph boundaries rather than cutting mid-sentence. This matters more than you'd expect -- a chunk that ends in the middle of a code block or a sentence loses context that the embedding model needs to generate a useful vector representation. The overlap parameter ensures that concepts spanning two paragraphs don't fall through the cracks.
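
To see the overlap in action, here is a standalone sketch of the same idea with a deliberately tiny chunk size. The function name chunk_paragraphs and the overlap_words parameter are illustrative simplifications; the method above takes a character-based overlap instead:

```python
def chunk_paragraphs(text, chunk_size=60, overlap_words=3):
    """Group paragraphs into chunks of roughly chunk_size characters,
    carrying the last overlap_words words into the next chunk."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current.strip())
            words = current.split()
            tail = " ".join(words[-overlap_words:]) if overlap_words else ""
            current = tail + "\n\n" + para
        else:
            current = current + "\n\n" + para if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = (
    "Ollama runs models locally.\n\n"
    "ChromaDB stores the embeddings on disk.\n\n"
    "Tools let the model act."
)

chunks = chunk_paragraphs(text)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

Each chunk after the first opens with the last three words of the previous one, so a query that straddles the boundary still has a chance of matching. Note that a single paragraph longer than chunk_size is never split by this approach; it simply becomes one oversized chunk.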

Notice that we're using ChromaDB's built-in embedding model (it defaults to all-MiniLM-L6-v2 from sentence-transformers, which we discussed in episode #74). You could swap in any embedding model from the Hub, but for a personal assistant the default works well enough. Don't over-engineer the retrieval until you've confirmed it's actually the bottleneck.

The tool system

Tools give the assistant capabilities beyond text generation. We keep the set small and focused -- each tool does one thing well. This is the same philosophy from episode #68 where we discussed that a few reliable tools beat many unreliable ones:

    def _register_tools(self):
        return {
            "calculate": {
                "function": self._tool_calculate,
                "description": (
                    "Evaluate a math expression. "
                    "Usage: calculate(expression)"
                )
            },
            "read_file": {
                "function": self._tool_read_file,
                "description": (
                    "Read a file's contents. "
                    "Usage: read_file(path)"
                )
            },
            "list_files": {
                "function": self._tool_list_files,
                "description": (
                    "List files in a directory. "
                    "Usage: list_files(path)"
                )
            },
            "run_python": {
                "function": self._tool_run_python,
                "description": (
                    "Execute Python code and return output. "
                    "Usage: run_python(code)"
                )
            },
            "search_knowledge": {
                "function": self._tool_search_knowledge,
                "description": (
                    "Search the knowledge base. "
                    "Usage: search_knowledge(query)"
                )
            }
        }

    def _tool_calculate(self, expression):
        try:
            allowed = set("0123456789+-*/.() ")
            if not all(c in allowed for c in expression):
                return "Error: only numeric expressions allowed"
            return str(eval(expression))
        except Exception as e:
            return f"Error: {e}"

    def _tool_read_file(self, path):
        try:
            # Explicit encoding: the platform default is not always UTF-8
            content = Path(path).read_text(
                encoding="utf-8", errors="replace"
            )
            if len(content) > 2000:
                return (content[:2000] +
                        f"\n... (truncated, "
                        f"{len(content)} chars total)")
            return content
        except Exception as e:
            return f"Error reading file: {e}"

    def _tool_list_files(self, path):
        try:
            entries = sorted(Path(path).iterdir())
            return "\n".join(
                f"{'[DIR]' if e.is_dir() else '[FILE]'} "
                f"{e.name}"
                for e in entries[:50]
            )
        except Exception as e:
            return f"Error: {e}"

    def _tool_run_python(self, code):
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                capture_output=True, text=True, timeout=10
            )
            output = result.stdout
            if result.stderr:
                output += f"\nSTDERR: {result.stderr}"
            return output or "(no output)"
        except subprocess.TimeoutExpired:
            return "Error: execution timed out (10s limit)"
        except Exception as e:
            return f"Error: {e}"

    def _tool_search_knowledge(self, query):
        results = self._retrieve(query, n_results=5)
        if not results:
            return "No relevant documents found."
        return "\n---\n".join(
            f"[{r['source']}]: {r['text']}"
            for r in results
        )

Notice the security constraints on the tools. calculate only accepts digits, decimal points, arithmetic operators, parentheses, and spaces -- no builtins, no imports, no os.system through a crafted expression. run_python has a 10-second timeout (which bounds runtime but does not sandbox what the code can touch), and read_file truncates large files. These are the guardrails we discussed in episode #68 (agent safety patterns). A personal assistant running locally is lower risk than a public-facing one, but the habit of constraining tool capabilities is worth building from the start.
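
In the same spirit, one further guardrail you might consider (it is not part of the assistant above) is confining file access to a single directory. The sketch below is a hypothetical helper, resolve_inside, that rejects any path escaping a base folder; it assumes Python 3.9+ for Path.is_relative_to:

```python
from pathlib import Path


def resolve_inside(base_dir, requested):
    """Return the resolved path if it stays inside base_dir, else None."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks
    target = (base / requested).resolve()
    return target if target.is_relative_to(base) else None


inside = resolve_inside("./knowledge", "notes/todo.md")
escaped = resolve_inside("./knowledge", "../secrets.txt")
absolute = resolve_inside("./knowledge", "/etc/passwd")

print(inside is not None)    # True
print(escaped is not None)   # False
print(absolute is not None)  # False
```

A tool like read_file could call such a helper first and return an error string when it gets None back.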

Having said that, the calculate tool has a subtle limitation: it uses eval(), and character filtering alone doesn't make that safe. The filter blocks names and function calls, but an expression like 9**9**9**9 consists only of allowed characters and would tie up the process trying to compute an astronomically large number. In a production system you'd want a proper expression parser. For a personal tool running on your own machine, the risk is minimal.
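
For reference, a proper expression parser does not need to be large. The following is a minimal sketch built on Python's standard ast module; safe_calculate is a hypothetical replacement, not the series' implementation, and it deliberately supports only +, -, * and / (no exponentiation):

```python
import ast
import operator

# Whitelisted operators; anything else is rejected
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression):
    """Evaluate basic arithmetic by walking the AST; never calls eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        # Plain int/float literals only (bool is excluded on purpose)
        if (isinstance(node, ast.Constant)
                and type(node.value) in (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](
                _eval(node.left), _eval(node.right)
            )
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    try:
        return str(_eval(ast.parse(expression, mode="eval")))
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return f"Error: {e}"


print(safe_calculate("(2 + 3) * 4"))       # 20
print(safe_calculate("9**9**9**9"))        # Error: unsupported expression
print(safe_calculate("__import__('os')"))  # Error: unsupported expression
```

Because the expression is parsed but never executed, anything outside the whitelist (function calls, names, the power operator) is rejected before any computation happens.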

The conversation loop

This is where everything comes together -- the agent loop from episode #67, with RAG context injection and tool handling:

    def chat(self, user_message):
        """Process a user message and return a response."""
        # Retrieve relevant context
        contexts = self._retrieve(user_message)
        context_text = ""
        if contexts:
            context_text = (
                "\n\nRelevant knowledge:\n"
                + "\n---\n".join(
                    f"[{c['source']}]: {c['text']}"
                    for c in contexts
                )
            )

        # Build messages
        messages = [{
            "role": "system",
            "content": self.system_prompt + context_text
        }]
        messages.extend(
            self.conversation[-self.max_history:]
        )
        messages.append({
            "role": "user",
            "content": user_message
        })

        # Get LLM response
        response = ollama.chat(
            model=self.model, messages=messages
        )
        assistant_msg = response["message"]["content"]

        # Check for tool calls
        if "TOOL:" in assistant_msg:
            assistant_msg = self._handle_tool_calls(
                assistant_msg, messages
            )

        # Update conversation history
        self.conversation.append({
            "role": "user",
            "content": user_message
        })
        self.conversation.append({
            "role": "assistant",
            "content": assistant_msg
        })

        # Compress history if it gets too long
        if len(self.conversation) > self.max_history * 2:
            self._compress_history()

        return assistant_msg

    def _handle_tool_calls(self, response, messages):
        """Extract and execute tool calls,
        then get final response."""
        tool_pattern = r'TOOL:\s*(\w+)\(([^)]*)\)'
        matches = re.findall(tool_pattern, response)

        tool_results = []
        for tool_name, args in matches:
            if tool_name in self.tools:
                result = self.tools[tool_name]["function"](
                    args.strip().strip("\"'")
                )
                tool_results.append(
                    f"[{tool_name}] Result: {result}"
                )
            else:
                tool_results.append(
                    f"[{tool_name}] Error: unknown tool"
                )

        if not tool_results:
            return response

        # Feed tool results back to the model
        messages.append({
            "role": "assistant",
            "content": response
        })
        messages.append({
            "role": "user",
            "content": (
                "Tool results:\n"
                + "\n".join(tool_results)
                + "\n\nNow provide your final answer "
                "based on these results."
            )
        })

        final = ollama.chat(
            model=self.model, messages=messages
        )
        return final["message"]["content"]

    def _compress_history(self):
        """Summarize older conversation history."""
        old_messages = self.conversation[
            :len(self.conversation) // 2
        ]
        summary_text = "\n".join(
            f"{m['role']}: {m['content'][:100]}"
            for m in old_messages
        )

        summary = ollama.chat(
            model=self.model,
            messages=[{
                "role": "user",
                "content": (
                    "Summarize this conversation "
                    "history concisely:\n" + summary_text
                )
            }]
        )

        self.conversation = [{
            "role": "system",
            "content": (
                "Earlier conversation summary: "
                + summary["message"]["content"]
            )
        }] + self.conversation[
            len(self.conversation) // 2:
        ]

The _compress_history method is a practical solution to the context window problem we discussed in episode #72 (tokenization). Every conversation eventually exceeds the model's context window. Rather than hard-truncating (which loses important context from earlier in the conversation), we summarize the older messages and keep the recent ones verbatim. The model gets the gist of what was discussed before plus full detail of the recent exchanges.

Is this a perfect solution? No. The summary itself takes up context, the summarization can lose important details, and the model doing the summarizing might miss nuances. But it's pragmatic. It works well enough for a personal assistant where conversations rarely need to reference details from 50 messages ago. For production systems with long conversations, you'd want a more sophisticated approach -- maybe a sliding window with selective retrieval from a conversation database.
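
As a rough illustration of the "sliding window with selective retrieval" idea, the sketch below keeps the most recent messages verbatim and pulls back a few older ones that share words with the current question. Everything here is an assumption for the example: build_history, window and recall are made-up names, and plain word overlap stands in for the embedding similarity a real system would use.

```python
import re


def _words(text):
    """Lowercased word set, a crude stand-in for an embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_history(store, query, window=6, recall=2):
    """Return up to `recall` relevant older messages plus the last `window`.

    `store` is the full, uncompressed list of {"role", "content"} dicts.
    """
    split = max(len(store) - window, 0)
    older, recent = store[:split], store[split:]

    query_words = _words(query)
    scored = []
    for index, message in enumerate(older):
        score = len(query_words & _words(message["content"]))
        if score > 0:
            scored.append((score, index))

    # Highest overlap first; ties go to the earlier message
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Restore chronological order for the messages we keep
    keep = sorted(index for _, index in scored[:recall])
    return [older[i] for i in keep] + recent
```

The trade-off compared to summarizing: nothing is paraphrased, so details survive intact, but messages that are relevant without sharing vocabulary with the question are missed. Swapping _words for real embeddings would narrow that gap.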

Running the assistant

def main():
    assistant = PersonalAssistant(
        model="llama3.1:8b",
        docs_path="./my_documents"
    )

    print("Personal AI Assistant ready. "
          "Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            break
        if not user_input:
            continue

        response = assistant.chat(user_input)
        print(f"\nAssistant: {response}\n")


if __name__ == "__main__":
    main()

Put some documents in ./my_documents/ -- your project README, meeting notes, code files, whatever you want the assistant to know about -- and run it. The first query will be slow while Ollama loads the model into memory. Subsequent queries should respond in a few seconds depending on your hardware and the model size.

Evaluation: does it actually work?

Building the assistant is the straightforward part. Knowing whether it works well is harder. Apply what we learned in episode #73 (LLM evaluation):

def evaluate_assistant(assistant, test_cases):
    results = []
    for case in test_cases:
        # Start every test case from an empty conversation
        assistant.conversation = []
        response = assistant.chat(case["question"])

        result = {
            "question": case["question"],
            "response": response,
            "passed": True,
        }

        # Check required elements
        for check in case.get("must_contain", []):
            if check.lower() not in response.lower():
                result["passed"] = False
                result["missing"] = check

        # Check forbidden elements
        for check in case.get("must_not_contain", []):
            if check.lower() in response.lower():
                result["passed"] = False
                result["forbidden"] = check

        results.append(result)

    passed = sum(1 for r in results if r["passed"])
    print(f"Passed: {passed}/{len(results)}")

    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"  [{status}] {r['question'][:50]}...")
        if not r["passed"]:
            if "missing" in r:
                print(f"    Missing: {r['missing']}")
            if "forbidden" in r:
                print(f"    Found forbidden: {r['forbidden']}")

    return results


test_cases = [
    {
        "question": "What is in my project README?",
        "must_contain": ["readme"],
    },
    {
        "question": "Calculate 15% of 2847",
        "must_contain": ["427"],
    },
    {
        "question": "List the Python files in my project",
        "must_contain": [".py"],
    },
    {
        "question": "What is the capital of France?",
        "must_contain": ["paris"],
        "must_not_contain": ["error"],
    },
]

Write test cases based on real things you'd ask the assistant. Every time it gives a wrong answer, add that question and the correct answer as a test case. Over time, your test suite becomes a reliable quality measure -- exactly the custom evaluation approach from episode #73.

The must_contain / must_not_contain pattern is deliberately simple. You don't need sophisticated NLP metrics for a personal tool. You need to know: did it find my document? Did it calculate correctly? Did it use the right tool? Binary checks on expected content catch most failures, and they're easy to maintain.
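
The "did it use the right tool?" question can be folded into the same binary style. The sketch below is one possible shape, under two assumptions that go beyond the code above: a hypothetical expect_tool key in the test case, and a tools_used sequence that the assistant would have to record itself (the PersonalAssistant class shown here does not do that yet).

```python
def check_case(response, case, tools_used=()):
    """Return a list of failure descriptions; an empty list means pass."""
    failures = []
    text = response.lower()

    # Required fragments: report every missing one, not just the last
    for needle in case.get("must_contain", []):
        if needle.lower() not in text:
            failures.append(f"missing: {needle}")

    # Forbidden fragments
    for needle in case.get("must_not_contain", []):
        if needle.lower() in text:
            failures.append(f"forbidden: {needle}")

    # Hypothetical extra key: which tool the answer should have used
    expected_tool = case.get("expect_tool")
    if expected_tool and expected_tool not in tools_used:
        failures.append(f"tool not used: {expected_tool}")

    return failures
```

Returning a list instead of a single flag means a test case with several problems reports all of them in one run, which saves a few debugging round trips.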

What separates a demo from a product

This assistant works. But it's a demo. Here's what a real product would add, and why each piece matters:

Reliability: error handling for every failure mode. What happens when Ollama isn't running? When ChromaDB's database is corrupted? When a tool call times out? A demo crashes with a traceback. A product catches the exception, tells the user what happened, and suggests a fix.
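
A minimal sketch of that "catch, explain, suggest" pattern could look like the following. call_with_recovery is a hypothetical helper, and catching OSError is an assumption: which exceptions the Ollama client or ChromaDB actually raise depends on the library version, so check that before relying on it.

```python
import time


def call_with_recovery(action, what, hint, retries=1, delay=0.0):
    """Run action(); return (True, result) or (False, readable message).

    OSError covers ConnectionError and TimeoutError, which are its
    subclasses. Other exception types are deliberately not swallowed.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return True, action()
        except OSError as error:
            last_error = error
            if attempt < retries:
                time.sleep(delay)  # brief pause before trying again
    return False, f"{what} failed: {last_error}. {hint}"
```

In the chat loop this would wrap the model call, for example by passing a lambda around ollama.chat(...) together with a hint such as "Is Ollama running? Try 'ollama serve'." The caller then prints the message instead of a traceback.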

Persistence: the conversation history is lost when you restart. A product stores conversations in a database, lets you resume them across sessions, and maintains the knowledge base index. You'd also want a way to update documents -- right now, re-indexing an already-indexed file gets skipped by the existing_ids check.
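
For the conversation part, a JSON file is enough to get started. The sketch below assumes the history is the same list of role/content dicts used above; save_conversation and load_conversation are names invented for this example, and a real product would more likely use a database.

```python
import json
from pathlib import Path


def save_conversation(conversation, path):
    """Write the history to disk via a temp file plus rename."""
    path = Path(path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(
        json.dumps(conversation, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    # Rename last, so a crash mid-write cannot leave a half-written file
    tmp.replace(path)


def load_conversation(path):
    """Return the saved history, or an empty list if none is usable."""
    path = Path(path)
    if not path.exists():
        return []
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return []  # corrupted file: start fresh rather than crash
    return data if isinstance(data, list) else []
```

Calling load_conversation at startup and save_conversation after each turn would be enough to resume a session after a restart.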

Performance: the first query is slow because Ollama needs to load the model. Subsequent queries depend on model size and context length. A product would keep the model warm, optimize chunk sizes based on actual retrieval quality, and cache embeddings for frequent queries.
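
Caching embeddings for repeated queries can be sketched with the standard library alone. Here embed_fn is a placeholder for whatever function turns text into a vector in your setup; make_cached_embedder is a hypothetical wrapper, not something ChromaDB or Ollama provides.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, maxsize=256):
    """Wrap embed_fn so that repeated texts are only embedded once."""

    @lru_cache(maxsize=maxsize)
    def cached(text):
        # Store a tuple: cached values are shared between callers,
        # so they should not be mutable
        return tuple(embed_fn(text))

    return cached
```

The cache key is the exact query string, so "What is RAG?" and "what is rag?" count as different entries. Normalizing the text before the lookup would raise the hit rate, at the cost of treating slightly different questions as identical.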

Security: our run_python tool executes arbitrary code. On your own machine that's acceptable (you trust yourself). A product for other users needs sandboxed execution, restricted file access to specific directories, and input validation on every tool argument.
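
Restricting file access to one directory is the easiest of those three to show in a few lines. The sketch below assumes Python 3.9+ for Path.is_relative_to; resolve_inside is a name made up for this example and is not part of the read_file tool above.

```python
from pathlib import Path


def resolve_inside(base_dir, requested):
    """Return the absolute path for `requested`, or raise PermissionError
    if it would end up outside base_dir."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # check below sees the location that would really be opened
    target = (base / requested).resolve()
    if not target.is_relative_to(base):
        raise PermissionError(f"{requested!r} is outside {base}")
    return target
```

A file tool would call this on every path argument before opening anything. Both "../" traversal and absolute paths are rejected, because the comparison runs on the fully resolved path instead of the raw string.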

User experience: a command-line interface works for us developers. Non-technical users need a web interface, or voice input, or integration with the tools they already use. That's a whole separate engineering effort.

The gap between demo and product is where most AI projects stall. The model works in the notebook. The demo impresses in a meeting. Then reality hits -- edge cases, error handling, scale, security, user experience. Building a reliable AI system is an engineering challenge as much as a machine learning one. If that sounds familiar, it's because we said the same thing back in episode #34 (ML engineering). The fundamentals don't change just because the model got bigger.

Arc 4 retrospective

We've covered quite some ground. Here's what Arc 4 taught us, and how this mini project connects each piece:

  • Language modeling (#57): our assistant generates text token by token, same autoregressive process
  • GPT architecture (#58): the decoder-only design that powers our local model
  • BERT (#59): the embedding model in our RAG pipeline uses BERT-style encoding
  • Training LLMs (#60): understanding why smaller local models have different capabilities than larger ones
  • Alignment (#61): the instruction-following behavior that makes the assistant actually useful
  • Prompt engineering (#62): our system prompt and tool-calling format
  • Embeddings (#63): the vector representations that power document retrieval
  • RAG (#64-65): our entire document knowledge system
  • APIs (#66): the Ollama API follows the same patterns as OpenAI's
  • Agents (#67-68): our tool-calling loop and conversation management
  • Fine-tuning (#69): the path for making the assistant better at specific tasks
  • Local models (#70): running everything on your own hardware, no cloud dependency
  • Generation (#71): temperature and sampling settings that affect response quality
  • Tokenization (#72): why context windows are measured in tokens, not words
  • Evaluation (#73): our test suite approach to measuring quality
  • Hugging Face (#74): the embedding models from the Hub that ChromaDB uses
  • Multimodal (#75): the natural next step would be adding image understanding to the assistant (which we now know how to do with LLaVA or BLIP-2)

Twenty episodes of building blocks, assembled into something that actually runs on your laptop and answers questions about your own documents. That's the pattern of this entire series: understand the fundamentals, build from scratch, then combine into real systems.

And with that, Arc 4 is complete. We started with language modeling theory and ended with a working local AI assistant. The next arc shifts focus to computer vision -- processing images from fundamentals through detection, segmentation, and generative models. The skills transfer directly: attention mechanisms, transformers, embeddings, evaluation -- all the same concepts, applied to visual data.

In summary

  • A personal AI assistant combines RAG (document knowledge), agents (tool use), and a local LLM into a system you fully own and control -- no API keys, no data leaving your machine, no monthly bill;
  • document indexing chunks your files at paragraph boundaries and stores embeddings in ChromaDB for semantic retrieval at query time -- the same embedding + vector search pipeline from episodes #63-65;
  • tools give the assistant capabilities beyond text: calculation, file access, code execution, and knowledge search -- each with security guardrails (character filtering, timeouts, truncation);
  • the conversation loop retrieves context, builds messages, calls the LLM, handles tool calls, and compresses older history into a summary once the conversation grows past the configured message limit, so it keeps fitting in the context window;
  • evaluation with specific test cases (must_contain, must_not_contain) catches regressions and measures quality over time -- the same custom evaluation approach from episode #73;
  • the demo-to-product gap is where real engineering happens: reliability, persistence, security, performance, and user experience -- the same production concerns from episode #34, applied to an LLM system.

Thanks for reading! Hope you enjoyed the build.

@scipio


