Learn AI Series (#76) - Mini Project - Your Own AI Assistant

What will I learn
- You will learn how to combine RAG, agents, tools, and a local model into a working AI assistant;
- building a domain-specific assistant that actually knows your data through ChromaDB-backed retrieval;
- tool integration for calculation, code execution, file access, and knowledge search;
- conversation management with context window strategies and history compression;
- evaluation and iteration: measuring whether your assistant actually works;
- what separates a demo from something people actually use in practice.
Requirements
- A working modern computer running macOS, Windows or Ubuntu;
- An installed Python 3(.11+) distribution;
- The ambition to learn AI and machine learning.
Difficulty
- Beginner
Curriculum (of the Learn AI Series):
- Learn AI Series (#1) - What Machine Learning Actually Is
- Learn AI Series (#2) - Setting Up Your AI Workbench - Python and NumPy
- Learn AI Series (#3) - Your Data Is Just Numbers - How Machines See the World
- Learn AI Series (#4) - Your First Prediction - No Math, Just Intuition
- Learn AI Series (#5) - Patterns in Data - What "Learning" Actually Looks Like
- Learn AI Series (#6) - From Intuition to Math - Why We Need Formulas
- Learn AI Series (#7) - The Training Loop - See It Work Step by Step
- Learn AI Series (#8) - The Math You Actually Need (Part 1) - Linear Algebra
- Learn AI Series (#9) - The Math You Actually Need (Part 2) - Calculus and Probability
- Learn AI Series (#10) - Your First ML Model - Linear Regression From Scratch
- Learn AI Series (#11) - Making Linear Regression Real
- Learn AI Series (#12) - Classification - Logistic Regression From Scratch
- Learn AI Series (#13) - Evaluation - How to Know If Your Model Actually Works
- Learn AI Series (#14) - Data Preparation - The 80% Nobody Talks About
- Learn AI Series (#15) - Feature Engineering and Selection
- Learn AI Series (#16) - Scikit-Learn - The Standard Library of ML
- Learn AI Series (#17) - Decision Trees - How Machines Make Decisions
- Learn AI Series (#18) - Random Forests - Wisdom of Crowds
- Learn AI Series (#19) - Gradient Boosting - The Kaggle Champion
- Learn AI Series (#20) - Support Vector Machines - Drawing the Perfect Boundary
- Learn AI Series (#21) - Mini Project - Predicting Crypto Market Regimes
- Learn AI Series (#22) - K-Means Clustering - Finding Groups
- Learn AI Series (#23) - Advanced Clustering - Beyond K-Means
- Learn AI Series (#24) - Dimensionality Reduction - PCA
- Learn AI Series (#25) - Advanced Dimensionality Reduction - t-SNE and UMAP
- Learn AI Series (#26) - Anomaly Detection - Finding What Doesn't Belong
- Learn AI Series (#27) - Recommendation Systems - "Users Like You Also Liked..."
- Learn AI Series (#28) - Time Series Fundamentals - When Order Matters
- Learn AI Series (#29) - Time Series Forecasting - Predicting What Comes Next
- Learn AI Series (#30) - Natural Language Processing - Text as Data
- Learn AI Series (#31) - Word Embeddings - Meaning in Numbers
- Learn AI Series (#32) - Bayesian Methods - Thinking in Probabilities
- Learn AI Series (#33) - Ensemble Methods Deep Dive - Stacking and Blending
- Learn AI Series (#34) - ML Engineering - From Notebook to Production
- Learn AI Series (#35) - Data Ethics and Bias in ML
- Learn AI Series (#36) - Mini Project - Complete ML Pipeline
- Learn AI Series (#37) - The Perceptron - Where It All Started
- Learn AI Series (#38) - Neural Networks From Scratch - Forward Pass
- Learn AI Series (#39) - Neural Networks From Scratch - Backpropagation
- Learn AI Series (#40) - Training Neural Networks - Practical Challenges
- Learn AI Series (#41) - Optimization Algorithms - SGD, Momentum, Adam
- Learn AI Series (#42) - PyTorch Fundamentals - Tensors and Autograd
- Learn AI Series (#43) - PyTorch Data and Training
- Learn AI Series (#44) - PyTorch nn.Module - Building Real Networks
- Learn AI Series (#45) - Convolutional Neural Networks - Theory
- Learn AI Series (#46) - CNNs in Practice - Classic to Modern Architectures
- Learn AI Series (#47) - CNN Applications - Detection, Segmentation, Style Transfer
- Learn AI Series (#48) - Recurrent Neural Networks - Sequences
- Learn AI Series (#49) - LSTM and GRU - Solving the Memory Problem
- Learn AI Series (#50) - Sequence-to-Sequence Models
- Learn AI Series (#51) - Attention Mechanisms
- Learn AI Series (#52) - The Transformer Architecture (Part 1)
- Learn AI Series (#53) - The Transformer Architecture (Part 2)
- Learn AI Series (#54) - Vision Transformers
- Learn AI Series (#55) - Generative Adversarial Networks
- Learn AI Series (#56) - Mini Project - Building a Transformer From Scratch
- Learn AI Series (#57) - Language Modeling - Predicting the Next Word
- Learn AI Series (#58) - GPT Architecture - Decoder-Only Transformers
- Learn AI Series (#59) - BERT and Encoder Models
- Learn AI Series (#60) - Training Large Language Models
- Learn AI Series (#61) - Instruction Tuning and Alignment
- Learn AI Series (#62) - Prompt Engineering - Getting the Most from LLMs
- Learn AI Series (#63) - Embeddings and Vector Search
- Learn AI Series (#64) - Retrieval-Augmented Generation (RAG) - Basics
- Learn AI Series (#65) - RAG - Advanced Techniques
- Learn AI Series (#66) - Working with LLM APIs
- Learn AI Series (#67) - Building AI Agents (Part 1) - Foundations
- Learn AI Series (#68) - Building AI Agents (Part 2) - Advanced Patterns
- Learn AI Series (#69) - Fine-Tuning Language Models
- Learn AI Series (#70) - Running Local Models
- Learn AI Series (#71) - Text Generation Techniques
- Learn AI Series (#72) - Tokenization Deep Dive
- Learn AI Series (#73) - LLM Evaluation
- Learn AI Series (#74) - The Hugging Face Ecosystem
- Learn AI Series (#75) - Multimodal Models - Text Meets Vision
- Learn AI Series (#76) - Mini Project - Your Own AI Assistant (this post)
Solutions to Episode #75 Exercises
Exercise 1: CLIP-style similarity search engine.
import random
import math


class ImageTextSearch:
    """CLIP-style similarity search using simulated embeddings."""

    def __init__(self, embed_dim=128):
        self.embed_dim = embed_dim
        self.entries = []

    def _make_embedding(self, seed_val):
        """Generate a deterministic normalized random embedding."""
        rng = random.Random(seed_val)
        vec = [rng.gauss(0, 1) for _ in range(self.embed_dim)]
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def _cosine_similarity(self, a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot  # both are unit vectors, so dot = cosine

    def add_image(self, path, description):
        seed = hash(path) & 0xFFFFFFFF
        emb = self._make_embedding(seed)
        self.entries.append({
            "path": path,
            "description": description,
            "embedding": emb,
        })

    def search_by_text(self, query, top_k=5):
        seed = hash(query) & 0xFFFFFFFF
        query_emb = self._make_embedding(seed)
        scored = []
        for entry in self.entries:
            sim = self._cosine_similarity(query_emb, entry["embedding"])
            scored.append((sim, entry))
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:top_k]

    def search_by_image(self, image_path, top_k=5):
        target = None
        for entry in self.entries:
            if entry["path"] == image_path:
                target = entry["embedding"]
                break
        if target is None:
            return []
        scored = []
        for entry in self.entries:
            if entry["path"] == image_path:
                continue
            sim = self._cosine_similarity(target, entry["embedding"])
            scored.append((sim, entry))
        scored.sort(key=lambda x: x[0], reverse=True)
        return scored[:top_k]

    def find_duplicates(self, threshold=0.95):
        pairs = []
        for i in range(len(self.entries)):
            for j in range(i + 1, len(self.entries)):
                sim = self._cosine_similarity(
                    self.entries[i]["embedding"],
                    self.entries[j]["embedding"]
                )
                if sim > threshold:
                    pairs.append((
                        self.entries[i]["path"],
                        self.entries[j]["path"],
                        sim
                    ))
        return pairs


search = ImageTextSearch()
categories = {
    "animals": [
        ("golden_retriever.jpg", "A golden retriever playing in a park"),
        ("tabby_cat.jpg", "An orange tabby cat sleeping on a windowsill"),
        ("parrot.jpg", "A colorful macaw parrot on a branch"),
        ("horse_field.jpg", "A brown horse grazing in a green field"),
        ("penguin_ice.jpg", "Emperor penguins huddled on Antarctic ice"),
    ],
    "landscapes": [
        ("mountain_sunset.jpg", "Sunset over snow-capped mountain peaks"),
        ("ocean_cliff.jpg", "Dramatic ocean cliffs at golden hour"),
        ("desert_dunes.jpg", "Rolling sand dunes in the Sahara"),
        ("forest_path.jpg", "A winding path through an autumn forest"),
        ("lake_reflection.jpg", "Mountain lake with mirror-like reflections"),
    ],
    "food": [
        ("pasta_dish.jpg", "Fresh pasta with tomato basil sauce"),
        ("sushi_platter.jpg", "Assorted sushi and sashimi on a wooden board"),
        ("chocolate_cake.jpg", "Rich dark chocolate layer cake"),
        ("fruit_bowl.jpg", "Colorful bowl of tropical fruits"),
        ("bread_basket.jpg", "Freshly baked sourdough bread loaves"),
    ],
    "architecture": [
        ("gothic_cathedral.jpg", "Gothic cathedral with flying buttresses"),
        ("modern_tower.jpg", "Glass and steel skyscraper at night"),
        ("japanese_temple.jpg", "Traditional Japanese temple in a garden"),
        ("bridge_sunset.jpg", "Suspension bridge at sunset"),
        ("old_town.jpg", "Cobblestone streets in a European old town"),
    ],
}
for cat, items in categories.items():
    for path, desc in items:
        search.add_image(path, desc)

# Text search
print("Search: 'sunset over mountains'")
results = search.search_by_text("sunset over mountains", top_k=5)
for sim, entry in results:
    print(f" {sim:+.4f} {entry['path']:25s} {entry['description']}")

# Image search
print("\nSimilar to 'mountain_sunset.jpg':")
results = search.search_by_image("mountain_sunset.jpg", top_k=5)
for sim, entry in results:
    print(f" {sim:+.4f} {entry['path']:25s} {entry['description']}")

# Duplicate detection
print("\nDuplicate pairs (threshold 0.95):")
dupes = search.find_duplicates(threshold=0.95)
if dupes:
    for a, b, sim in dupes:
        print(f" {sim:.4f} {a} <-> {b}")
else:
    print(" None found (expected with random embeddings)")
The seeded random embeddings are deterministic within a run -- the same path or query always produces the same vector. (Python salts string hash() per process, so the vectors change between runs unless PYTHONHASHSEED is fixed.) In a real system you'd replace _make_embedding with actual CLIP inference, but the search and ranking logic stays identical. Cosine similarity between unit vectors is just their dot product, which is the core operation behind every embedding-based retrieval system we've built since episode #63.
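If the simulated vectors need to be reproducible across runs as well, the seed can come from a stable checksum instead of hash(). The following is a minimal sketch under that assumption; the helper names stable_seed and unit_vector are hypothetical and not part of the solution above.

```python
import math
import random
import zlib


def stable_seed(text: str) -> int:
    """Hypothetical helper: 32-bit seed that is identical in every run."""
    return zlib.crc32(text.encode("utf-8"))


def unit_vector(seed: int, dim: int = 128) -> list:
    """Seeded Gaussian vector scaled to length 1."""
    rng = random.Random(seed)
    vec = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    """Dot product; equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(u, v))


a = unit_vector(stable_seed("mountain_sunset.jpg"))
b = unit_vector(stable_seed("mountain_sunset.jpg"))
c = unit_vector(stable_seed("pasta_dish.jpg"))

print(f"same path:      {dot(a, b):.4f}")
print(f"different path: {dot(a, c):+.4f}")
```

The same path yields a similarity of 1 (up to floating-point rounding), while unrelated paths land near zero, which is what you'd expect from random high-dimensional vectors.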
Exercise 2: Contrastive learning trainer.
import random
import math


class ContrastiveTrainer:
    """Simulates CLIP-style contrastive training."""

    def __init__(self, n_categories=10, pairs_per_cat=10,
                 embed_dim=64):
        self.n_categories = n_categories
        self.pairs_per_cat = pairs_per_cat
        self.embed_dim = embed_dim
        self.data = self._generate_data()
        # Initialize random embeddings
        rng = random.Random(42)
        self.image_embs = []
        self.text_embs = []
        for _ in range(len(self.data)):
            ie = [rng.gauss(0, 1) for _ in range(embed_dim)]
            te = [rng.gauss(0, 1) for _ in range(embed_dim)]
            self.image_embs.append(self._normalize(ie))
            self.text_embs.append(self._normalize(te))

    def _generate_data(self):
        pairs = []
        for cat in range(self.n_categories):
            for _ in range(self.pairs_per_cat):
                pairs.append(cat)
        return pairs

    def _normalize(self, vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm < 1e-10:
            return vec
        return [x / norm for x in vec]

    def _dot(self, a, b):
        return sum(x * y for x, y in zip(a, b))

    def compute_loss(self, img_indices, txt_indices,
                     temperature=0.07):
        batch = len(img_indices)
        # Similarity matrix
        logits = []
        for i in img_indices:
            row = []
            for j in txt_indices:
                sim = self._dot(self.image_embs[i],
                                self.text_embs[j])
                row.append(sim / temperature)
            logits.append(row)
        # Cross entropy in both directions
        loss = 0.0
        for k in range(batch):
            # Image-to-text
            max_val = max(logits[k])
            exps = [math.exp(v - max_val) for v in logits[k]]
            total = sum(exps)
            loss -= math.log(exps[k] / total + 1e-10)
            # Text-to-image
            col = [logits[r][k] for r in range(batch)]
            max_val = max(col)
            exps = [math.exp(v - max_val) for v in col]
            total = sum(exps)
            loss -= math.log(exps[k] / total + 1e-10)
        return loss / (2 * batch)

    def alignment_score(self):
        """Average cosine sim: matching vs non-matching."""
        n = len(self.data)
        match_sims = []
        nonmatch_sims = []
        for i in range(n):
            for j in range(n):
                sim = self._dot(self.image_embs[i],
                                self.text_embs[j])
                if self.data[i] == self.data[j]:
                    match_sims.append(sim)
                else:
                    nonmatch_sims.append(sim)
        avg_match = sum(match_sims) / len(match_sims)
        avg_non = sum(nonmatch_sims) / len(nonmatch_sims)
        return avg_match, avg_non

    def train(self, steps=50, lr=0.1, batch_size=16,
              temperature=0.07):
        rng = random.Random(123)
        indices = list(range(len(self.data)))
        for step in range(steps):
            batch = rng.sample(indices, min(batch_size,
                                            len(indices)))
            # Heuristic update (not the true gradient of the loss):
            # nudge each pair together, push other categories away
            for idx in range(len(batch)):
                i = batch[idx]
                for d in range(self.embed_dim):
                    # Increase similarity to matching pair
                    j = batch[idx]
                    grad_img = (self.text_embs[j][d]
                                - self.image_embs[i][d])
                    grad_txt = (self.image_embs[i][d]
                                - self.text_embs[j][d])
                    # Decrease similarity to non-matching
                    for k_idx in range(len(batch)):
                        k = batch[k_idx]
                        if self.data[k] != self.data[i]:
                            grad_img -= (
                                0.1 * self.text_embs[k][d]
                            )
                            grad_txt -= (
                                0.1 * self.image_embs[k][d]
                            )
                    self.image_embs[i][d] += lr * grad_img
                    self.text_embs[j][d] += lr * grad_txt
                self.image_embs[i] = self._normalize(
                    self.image_embs[i]
                )
                self.text_embs[batch[idx]] = self._normalize(
                    self.text_embs[batch[idx]]
                )
            if step in [0, 10, 25, 49]:
                loss = self.compute_loss(batch, batch,
                                         temperature)
                m, nm = self.alignment_score()
                gap = m - nm
                print(f"Step {step:>3}: loss={loss:.4f} "
                      f"match={m:.4f} non={nm:.4f} "
                      f"gap={gap:.4f}")


trainer = ContrastiveTrainer()
trainer.train(steps=50, lr=0.1)
The alignment gap (matching similarity minus non-matching similarity) is the metric that actually matters. A gap of 0 means the model can't distinguish matching from non-matching pairs. As training progresses, matching pairs get pulled together while non-matching pairs get pushed apart -- that's the entire contrastive learning dynamic in two numbers.
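To see how the loss reacts to alignment, here is a minimal standalone sketch of the same symmetric cross-entropy on two hand-written 2x2 similarity matrices. The function name symmetric_contrastive_loss and the toy matrices are illustrative assumptions, not part of the trainer above.

```python
import math


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    sim[i][j] is the cosine similarity between image i and text j;
    matching pairs sit on the diagonal.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]

    def cross_entropy(values, target):
        # Numerically stable: log-sum-exp minus the target logit
        m = max(values)
        log_sum = m + math.log(sum(math.exp(v - m) for v in values))
        return log_sum - values[target]

    loss = 0.0
    for k in range(n):
        loss += cross_entropy(logits[k], k)                  # image k vs all texts
        column = [logits[r][k] for r in range(n)]
        loss += cross_entropy(column, k)                     # text k vs all images
    return loss / (2 * n)


aligned = [[0.9, 0.1], [0.1, 0.9]]      # matching pairs clearly closest
unaligned = [[0.5, 0.5], [0.5, 0.5]]    # model cannot tell pairs apart

loss_aligned = symmetric_contrastive_loss(aligned)
loss_unaligned = symmetric_contrastive_loss(unaligned)
print(f"aligned:   {loss_aligned:.6f}")
print(f"unaligned: {loss_unaligned:.6f}")
```

When every similarity is identical the loss equals ln(2), roughly 0.693 for a batch of two, which is the "pure guessing" baseline; the aligned matrix drives it close to zero.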
Exercise 3: Visual question answering evaluator.
import re


class VQAEvaluator:
    """Evaluate VQA model accuracy across question categories."""

    def __init__(self):
        self.test_cases = [
            {"image": "Three red cars parked on a street",
             "question": "How many cars are there?",
             "truth": "3", "category": "counting"},
            {"image": "Five people standing near two bicycles",
             "question": "How many people are in the image?",
             "truth": "5", "category": "counting"},
            {"image": "A single white cat on a blue couch",
             "question": "How many cats are visible?",
             "truth": "1", "category": "counting"},
            {"image": "A bright red sports car on a highway",
             "question": "What color is the car?",
             "truth": "red", "category": "color"},
            {"image": "A woman in a green dress near yellow flowers",
             "question": "What color is the dress?",
             "truth": "green", "category": "color"},
            {"image": "A blue mug on a wooden brown table",
             "question": "What color is the mug?",
             "truth": "blue", "category": "color"},
            {"image": "A dog sitting to the left of a tree",
             "question": "Is the dog to the left or right of the tree?",
             "truth": "left", "category": "spatial"},
            {"image": "A lamp above a desk with books below",
             "question": "Where is the lamp relative to the desk?",
             "truth": "above", "category": "spatial"},
            {"image": "A cat behind a glass window",
             "question": "Is the cat inside or outside?",
             "truth": "inside", "category": "spatial"},
            {"image": "A park with several dogs and a fountain",
             "question": "Is there a fountain in the image?",
             "truth": "yes", "category": "yes-no"},
            {"image": "An empty parking lot at night",
             "question": "Are there any cars in the lot?",
             "truth": "no", "category": "yes-no"},
            {"image": "A beach with umbrellas and waves",
             "question": "Is this a beach scene?",
             "truth": "yes", "category": "yes-no"},
            {"image": "A sign reading STOP in bold red letters",
             "question": "What does the sign say?",
             "truth": "stop", "category": "reading"},
            {"image": "A storefront with OPEN neon sign",
             "question": "What word is on the neon sign?",
             "truth": "open", "category": "reading"},
            {"image": "A license plate showing ABC 1234",
             "question": "What is on the license plate?",
             "truth": "abc 1234", "category": "reading"},
        ]

    def _simple_vqa(self, image_desc, question):
        """Keyword-matching VQA simulator."""
        desc = image_desc.lower()
        q = question.lower()
        numbers = {"one": "1", "two": "2", "three": "3",
                   "four": "4", "five": "5", "six": "6"}
        colors = ["red", "blue", "green", "yellow", "white",
                  "black", "brown", "orange", "purple", "pink"]
        spatials = ["left", "right", "above", "below",
                    "inside", "outside", "behind", "front"]
        if "how many" in q:
            for word in desc.split():
                if word.isdigit():
                    return word
                if word in numbers:
                    return numbers[word]
            return "1"
        if "color" in q or "colour" in q:
            for color in colors:
                if color in desc:
                    return color
            return "unknown"
        if any(s in q for s in ["left", "right", "where",
                                "inside", "outside"]):
            for sp in spatials:
                if sp in desc:
                    return sp
            return "unknown"
        if q.startswith("is ") or q.startswith("are "):
            for word in q.split():
                if word in desc:
                    return "yes"
            return "no"
        if "what" in q and ("say" in q or "sign" in q
                            or "read" in q or "word" in q
                            or "plate" in q):
            caps = re.findall(r'[A-Z][A-Z\s\d]+', image_desc)
            if caps:
                return caps[0].strip().lower()
            return "unknown"
        return "unknown"

    def evaluate(self, model_fn=None):
        if model_fn is None:
            model_fn = self._simple_vqa
        results = []
        for case in self.test_cases:
            answer = model_fn(case["image"], case["question"])
            answer = answer.lower().strip()
            truth = case["truth"].lower().strip()
            exact = answer == truth
            relaxed = truth in answer or answer in truth
            confidence = 1.0 if exact else (0.5 if relaxed
                                            else 0.0)
            results.append({
                "category": case["category"],
                "exact": exact,
                "relaxed": relaxed,
                "confidence": confidence,
                "answer": answer,
                "truth": truth,
            })
        # Per-category report
        cats = {}
        for r in results:
            c = r["category"]
            if c not in cats:
                cats[c] = {"exact": 0, "relaxed": 0,
                           "conf": [], "total": 0,
                           "failures": []}
            cats[c]["total"] += 1
            if r["exact"]:
                cats[c]["exact"] += 1
            if r["relaxed"]:
                cats[c]["relaxed"] += 1
            cats[c]["conf"].append(r["confidence"])
            if not r["exact"]:
                cats[c]["failures"].append(
                    f" got '{r['answer']}', "
                    f"expected '{r['truth']}'"
                )
        print(f"{'Category':<12} {'Exact':>7} {'Relaxed':>9} "
              f"{'Conf':>6} {'N':>3}")
        print("-" * 42)
        for cat in sorted(cats):
            s = cats[cat]
            ea = s["exact"] / s["total"]
            ra = s["relaxed"] / s["total"]
            ac = sum(s["conf"]) / len(s["conf"])
            print(f"{cat:<12} {ea:>7.0%} {ra:>9.0%} "
                  f"{ac:>6.2f} {s['total']:>3}")
        total = len(results)
        exact_all = sum(1 for r in results if r["exact"])
        relax_all = sum(1 for r in results if r["relaxed"])
        print(f"\nOverall: {exact_all}/{total} exact, "
              f"{relax_all}/{total} relaxed")
        return results, cats

    def error_analysis(self, results=None, cats=None):
        if results is None:
            results, cats = self.evaluate()
        worst = sorted(cats.items(),
                       key=lambda x: x[1]["exact"] / x[1]["total"])
        print("\nError Analysis (hardest categories first):")
        for cat, s in worst:
            rate = s["exact"] / s["total"]
            print(f" {cat}: {rate:.0%} exact match")
            for f in s["failures"]:
                print(f" FAIL:{f}")


evaluator = VQAEvaluator()
results, cats = evaluator.evaluate()
evaluator.error_analysis(results, cats)
The per-category breakdown is where the real insight lives. Counting and color questions tend to work well with simple keyword matching because the answers are directly stated in image descriptions. Spatial reasoning is harder because "left of the tree" requires understanding relational prepositions. Reading/OCR is even harder since it depends on how exactly the text was transcribed in the description. This maps loosely to how real VLMs perform -- they tend to do well on color and simple counting, struggle more with spatial reasoning, and are inconsistent with text extraction. Knowing your model's category-level weaknesses tells you exactly where to invest in improvement.
On to today's episode
Here we go! This is the fourth mini project in our series, and it's the capstone of Arc 4 -- the entire LLM arc. In episode #21 we built a crypto market regime classifier. In episode #36 we built a complete ML pipeline. In episode #56 we built a transformer from scratch. Each mini project combined everything we'd learned up to that point into something you could actually use.
This time we're pulling together twenty episodes' worth of LLM building blocks: language models (#57-58), embeddings and RAG (#63-65), APIs (#66), agents and tools (#67-68), fine-tuning (#69), local models (#70), generation techniques (#71), evaluation (#73), the Hugging Face ecosystem (#74), and multimodal models (#75). We're building a personal AI assistant that knows your documents, uses tools, and runs entirely on your own hardware.
Not a wrapper around someone else's API. An actual system you own and control ;-)
The architecture
Our assistant has four components that each map directly to concepts we've covered:
- Local LLM (via Ollama, episode #70) -- the brain
- RAG system (ChromaDB + embedding model, episodes #63-65) -- domain knowledge
- Tool system (episodes #67-68) -- actions beyond text generation
- Conversation manager -- context window strategy and memory
Here's the full class scaffold with initialization:
import ollama
import chromadb
import json
import subprocess
import re
from pathlib import Path


class PersonalAssistant:
    def __init__(self, model="llama3.1:8b",
                 docs_path="./knowledge"):
        self.model = model
        self.chroma = chromadb.PersistentClient(
            path="./assistant_db"
        )
        self.collection = self.chroma.get_or_create_collection(
            "knowledge"
        )
        self.conversation = []
        self.max_history = 20
        self.tools = self._register_tools()
        self.system_prompt = self._build_system_prompt()
        # Index documents if provided
        if Path(docs_path).exists():
            self._index_documents(docs_path)

    def _build_system_prompt(self):
        tool_descriptions = "\n".join(
            f"- {name}: {info['description']}"
            for name, info in self.tools.items()
        )
        return (
            "You are a helpful personal assistant with "
            "access to the user's documents and tools.\n\n"
            "When answering questions:\n"
            "1. Check if the retrieved context contains "
            "relevant information\n"
            "2. If yes, base your answer on that context "
            "and cite the source\n"
            "3. If no relevant context, use your general "
            "knowledge but say so\n"
            "4. Use tools when the user's request requires "
            "action\n\n"
            f"Available tools:\n{tool_descriptions}\n\n"
            "To use a tool, respond with: "
            "TOOL: tool_name(arg1, arg2)\n"
            "After seeing the result, provide your "
            "final answer."
        )
The system prompt is deliberately explicit about when to use tools versus when to use retrieved context versus when to fall back on general knowledge. This is prompt engineering from episode #62 in practice -- clear instructions reduce the model's tendency to hallucinate or ignore available tools.
Indexing your documents
The RAG system needs your documents chunked and embedded. We covered the theory in episodes #63-64, but here's the full implementation tailored for a personal knowledge base:
    def _index_documents(self, docs_path):
        """Index all text and markdown files."""
        docs_path = Path(docs_path)
        # Chunk IDs look like "<relative path>::<n>", so strip the
        # suffix to find which source files are already indexed
        indexed_sources = {
            chunk_id.rsplit("::", 1)[0]
            for chunk_id in self.collection.get()["ids"]
        }
        for file_path in docs_path.rglob("*"):
            if file_path.suffix not in [".txt", ".md",
                                        ".py", ".json"]:
                continue
            doc_id = str(file_path.relative_to(docs_path))
            if doc_id in indexed_sources:
                continue
            text = file_path.read_text(
                encoding="utf-8", errors="ignore"
            )
            chunks = self._chunk_text(
                text, chunk_size=500, overlap=50
            )
            for i, chunk in enumerate(chunks):
                self.collection.upsert(
                    ids=[f"{doc_id}::{i}"],
                    documents=[chunk],
                    metadatas=[{
                        "source": doc_id,
                        "chunk": i
                    }]
                )
        print(f"Knowledge base: "
              f"{self.collection.count()} chunks indexed")

    def _chunk_text(self, text, chunk_size=500, overlap=50):
        """Split text into overlapping chunks
        at paragraph boundaries."""
        paragraphs = text.split("\n\n")
        chunks = []
        current = ""
        for para in paragraphs:
            if (len(current) + len(para) > chunk_size
                    and current):
                chunks.append(current.strip())
                # Keep overlap from end of previous chunk
                # (overlap is in characters, ~5 chars per word)
                words = current.split()
                overlap_text = " ".join(
                    words[-overlap // 5:]
                )
                current = overlap_text + "\n\n" + para
            else:
                current = (current + "\n\n" + para
                           if current else para)
        if current.strip():
            chunks.append(current.strip())
        return chunks

    def _retrieve(self, query, n_results=3):
        """Retrieve relevant document chunks."""
        if self.collection.count() == 0:
            return []
        results = self.collection.query(
            query_texts=[query], n_results=n_results
        )
        contexts = []
        for doc, meta in zip(results["documents"][0],
                             results["metadatas"][0]):
            contexts.append({
                "text": doc,
                "source": meta["source"]
            })
        return contexts
The _chunk_text method splits at paragraph boundaries rather than cutting mid-sentence. This matters more than you'd expect -- a chunk that ends in the middle of a code block or a sentence loses context that the embedding model needs to generate a useful vector representation. The overlap parameter (given in characters and approximated as overlap // 5 words) carries the tail of each chunk into the next one, so concepts spanning two paragraphs don't fall through the cracks.
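To make the chunking behaviour concrete, here is a standalone copy of the same logic run on three synthetic paragraphs. It is a sketch for experimentation: the free function chunk_text, the guard for overlap values below 5, and the sample text are assumptions added for the demo.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks at paragraph boundaries."""
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) > chunk_size and current:
            chunks.append(current.strip())
            # Carry the last few words forward (~5 characters per word).
            n_words = overlap // 5
            tail = " ".join(current.split()[-n_words:]) if n_words else ""
            current = tail + "\n\n" + para if tail else para
        else:
            current = current + "\n\n" + para if current else para
    if current.strip():
        chunks.append(current.strip())
    return chunks


# Three paragraphs of 20 short "words" each (69 characters per paragraph)
p1 = " ".join(f"a{i}" for i in range(20))
p2 = " ".join(f"b{i}" for i in range(20))
p3 = " ".join(f"c{i}" for i in range(20))

chunks = chunk_text("\n\n".join([p1, p2, p3]), chunk_size=100, overlap=25)
for n, chunk in enumerate(chunks):
    print(f"--- chunk {n} ({len(chunk)} chars) ---")
    print(chunk)
```

Each paragraph ends up in its own chunk, and chunks 1 and 2 start with the last five words of the previous chunk. Note that a single paragraph longer than chunk_size is never split, so very long paragraphs produce oversized chunks.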
Notice that we're using ChromaDB's built-in embedding model (it defaults to all-MiniLM-L6-v2 from sentence-transformers, which we discussed in episode #74). You could swap in any embedding model from the Hub, but for a personal assistant the default works well enough. Don't over-engineer the retrieval until you've confirmed it's actually the bottleneck.
The tool system
Tools give the assistant capabilities beyond text generation. We keep the set small and focused -- each tool does one thing well. This is the same philosophy from episode #68 where we discussed that a few reliable tools beat many unreliable ones:
    def _register_tools(self):
        return {
            "calculate": {
                "function": self._tool_calculate,
                "description": (
                    "Evaluate a math expression. "
                    "Usage: calculate(expression)"
                )
            },
            "read_file": {
                "function": self._tool_read_file,
                "description": (
                    "Read a file's contents. "
                    "Usage: read_file(path)"
                )
            },
            "list_files": {
                "function": self._tool_list_files,
                "description": (
                    "List files in a directory. "
                    "Usage: list_files(path)"
                )
            },
            "run_python": {
                "function": self._tool_run_python,
                "description": (
                    "Execute Python code and return output. "
                    "Usage: run_python(code)"
                )
            },
            "search_knowledge": {
                "function": self._tool_search_knowledge,
                "description": (
                    "Search the knowledge base. "
                    "Usage: search_knowledge(query)"
                )
            }
        }

    def _tool_calculate(self, expression):
        try:
            allowed = set("0123456789+-*/.() ")
            if not all(c in allowed for c in expression):
                return "Error: only numeric expressions allowed"
            return str(eval(expression))
        except Exception as e:
            return f"Error: {e}"

    def _tool_read_file(self, path):
        try:
            content = Path(path).read_text()
            if len(content) > 2000:
                return (content[:2000] +
                        f"\n... (truncated, "
                        f"{len(content)} chars total)")
            return content
        except Exception as e:
            return f"Error reading file: {e}"

    def _tool_list_files(self, path):
        try:
            entries = sorted(Path(path).iterdir())
            return "\n".join(
                f"{'[DIR]' if e.is_dir() else '[FILE]'} "
                f"{e.name}"
                for e in entries[:50]
            )
        except Exception as e:
            return f"Error: {e}"

    def _tool_run_python(self, code):
        try:
            result = subprocess.run(
                ["python3", "-c", code],
                capture_output=True, text=True, timeout=10
            )
            output = result.stdout
            if result.stderr:
                output += f"\nSTDERR: {result.stderr}"
            return output or "(no output)"
        except subprocess.TimeoutExpired:
            return "Error: execution timed out (10s limit)"
        except Exception as e:
            return f"Error: {e}"

    def _tool_search_knowledge(self, query):
        results = self._retrieve(query, n_results=5)
        if not results:
            return "No relevant documents found."
        return "\n---\n".join(
            f"[{r['source']}]: {r['text']}"
            for r in results
        )
Notice the security constraints on each tool. calculate only allows numeric characters -- no builtins, no imports, no os.system through a crafted expression. run_python has a 10-second timeout, and read_file truncates large files. Keep in mind what these limits do not cover: the timeout stops runaway loops, but the code still runs with your full user permissions, and read_file will open any path the model asks for. These are the guardrails we discussed in episode #68 (agent safety patterns). A personal assistant running locally is lower risk than a public-facing one, but the habit of constraining tool capabilities is worth building from the start.
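One guardrail that is cheap to add is confining the file tools to a single directory. The following is a minimal sketch of such a check, assuming you want read_file and list_files limited to a base folder; the helper name resolve_inside is hypothetical and not part of the assistant class.

```python
import tempfile
from pathlib import Path


def resolve_inside(base_dir, user_path):
    """Return the resolved path if it stays inside base_dir, else None."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks, so the
    # comparison happens on the real location of the file.
    candidate = (base / user_path).resolve()
    if candidate == base or base in candidate.parents:
        return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "notes.txt").write_text("hello")
    inside = resolve_inside(tmp, "notes.txt")
    escaped = resolve_inside(tmp, "../../etc/passwd")
    absolute = resolve_inside(tmp, "/etc/passwd")
    print(inside is not None, escaped, absolute)
```

A tool would call the helper first and return an error string when it gets None, instead of passing the model's path straight to Path(path).read_text().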
Having said that, the calculate tool has a subtle limitation: it uses eval(). The character filter rules out names and function calls, but it still lets ** through, so an expression such as 9**9**9**9 can exhaust CPU and memory. In a production system you'd want a proper expression parser. For a personal tool running on your own machine, the risk is minimal.
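For reference, a proper expression parser does not have to be large. Below is a minimal sketch built on Python's standard ast module that accepts only numeric literals, parentheses, and the four basic operators. The function name safe_calculate is hypothetical, and exponentiation is deliberately left out.

```python
import ast
import operator

# Whitelisted operators; anything else (including **) is rejected
_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> str:
    """Evaluate + - * / and parentheses on numeric literals only."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
            return _BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY:
            return _UNARY[type(node.op)](_eval(node.operand))
        raise ValueError(f"unsupported syntax: {type(node).__name__}")

    try:
        tree = ast.parse(expression.strip(), mode="eval")
        return str(_eval(tree))
    except (SyntaxError, ValueError, ZeroDivisionError) as e:
        return f"Error: {e}"


print(safe_calculate("2 + 3 * 4"))
print(safe_calculate("(1 + 2) * 3"))
print(safe_calculate("9**9**9"))
print(safe_calculate("__import__('os')"))
```

The expression is parsed into a syntax tree and only whitelisted node types are evaluated, so nothing is ever handed to eval(). The last two calls return an error string instead of running.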
The conversation loop
This is where everything comes together -- the agent loop from episode #67, with RAG context injection and tool handling:
    def chat(self, user_message):
        """Process a user message and return a response."""
        # Retrieve relevant context
        contexts = self._retrieve(user_message)
        context_text = ""
        if contexts:
            context_text = (
                "\n\nRelevant knowledge:\n"
                + "\n---\n".join(
                    f"[{c['source']}]: {c['text']}"
                    for c in contexts
                )
            )
        # Build messages
        messages = [{
            "role": "system",
            "content": self.system_prompt + context_text
        }]
        messages.extend(
            self.conversation[-self.max_history:]
        )
        messages.append({
            "role": "user",
            "content": user_message
        })
        # Get LLM response
        response = ollama.chat(
            model=self.model, messages=messages
        )
        assistant_msg = response["message"]["content"]
        # Check for tool calls
        if "TOOL:" in assistant_msg:
            assistant_msg = self._handle_tool_calls(
                assistant_msg, messages
            )
        # Update conversation history
        self.conversation.append({
            "role": "user",
            "content": user_message
        })
        self.conversation.append({
            "role": "assistant",
            "content": assistant_msg
        })
        # Compress history once it outgrows the window we send to
        # the model, so the summary stays within the last
        # max_history messages instead of being sliced away
        if len(self.conversation) > self.max_history:
            self._compress_history()
        return assistant_msg

    def _handle_tool_calls(self, response, messages):
        """Extract and execute tool calls,
        then get final response."""
        tool_pattern = r'TOOL:\s*(\w+)\(([^)]*)\)'
        matches = re.findall(tool_pattern, response)
        tool_results = []
        for tool_name, args in matches:
            if tool_name in self.tools:
                result = self.tools[tool_name]["function"](
                    args.strip().strip("\"'")
                )
                tool_results.append(
                    f"[{tool_name}] Result: {result}"
                )
            else:
                tool_results.append(
                    f"[{tool_name}] Error: unknown tool"
                )
        if not tool_results:
            return response
        # Feed tool results back to the model
        messages.append({
            "role": "assistant",
            "content": response
        })
        messages.append({
            "role": "user",
            "content": (
                "Tool results:\n"
                + "\n".join(tool_results)
                + "\n\nNow provide your final answer "
                "based on these results."
            )
        })
        final = ollama.chat(
            model=self.model, messages=messages
        )
        return final["message"]["content"]

    def _compress_history(self):
        """Summarize older conversation history."""
        old_messages = self.conversation[
            :len(self.conversation) // 2
        ]
        summary_text = "\n".join(
            f"{m['role']}: {m['content'][:100]}"
            for m in old_messages
        )
        summary = ollama.chat(
            model=self.model,
            messages=[{
                "role": "user",
                "content": (
                    "Summarize this conversation "
                    "history concisely:\n" + summary_text
                )
            }]
        )
        self.conversation = [{
            "role": "system",
            "content": (
                "Earlier conversation summary: "
                + summary["message"]["content"]
            )
        }] + self.conversation[
            len(self.conversation) // 2:
        ]
The _compress_history method is a practical solution to the context window problem we discussed in episode #72 (tokenization). Every conversation eventually exceeds the model's context window. Rather than hard-truncating (which loses important context from earlier in the conversation), we summarize the older messages and keep the recent ones verbatim. The model gets the gist of what was discussed before plus full detail of the recent exchanges.
Is this a perfect solution? No. The summary itself takes up context, the summarization can lose important details, and the model doing the summarizing might miss nuances. But it's pragmatic. It works well enough for a personal assistant where conversations rarely need to reference details from 50 messages ago. For production systems with long conversations, you'd want a more sophisticated approach -- maybe a sliding window with selective retrieval from a conversation database.
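One more detail of the loop is worth knowing. The pattern in _handle_tool_calls captures arguments with [^)]*, so it stops at the first closing parenthesis, and a call like TOOL: calculate((1 + 2) * 3) is cut off after "(1 + 2". A minimal sketch of a parser that balances parentheses instead follows; the function name extract_tool_calls is hypothetical, and parentheses inside string literals are not handled.

```python
import re


def extract_tool_calls(text):
    """Return (tool_name, raw_args) pairs, matching nested parentheses."""
    calls = []
    for match in re.finditer(r"TOOL:\s*(\w+)\(", text):
        depth = 1                 # we are just past the opening "("
        start = match.end()
        for pos in range(start, len(text)):
            if text[pos] == "(":
                depth += 1
            elif text[pos] == ")":
                depth -= 1
                if depth == 0:    # found the matching close
                    calls.append((match.group(1), text[start:pos]))
                    break
        # Calls whose parentheses never balance are skipped
    return calls


reply = "TOOL: calculate((1 + 2) * 3)"
old_pattern = r"TOOL:\s*(\w+)\(([^)]*)\)"

print(re.findall(old_pattern, reply))
print(extract_tool_calls(reply))
```

The regex yields ('calculate', '(1 + 2'), while the balanced scan returns the full '(1 + 2) * 3'. Asking the model for JSON-formatted tool calls would be another option, at the cost of a stricter output format for a small local model to follow.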
Running the assistant
def main():
    assistant = PersonalAssistant(
        model="llama3.1:8b",
        docs_path="./my_documents"
    )
    print("Personal AI Assistant ready. "
          "Type 'quit' to exit.\n")

    while True:
        user_input = input("You: ").strip()
        if user_input.lower() in ["quit", "exit", "q"]:
            break
        if not user_input:
            continue
        response = assistant.chat(user_input)
        print(f"\nAssistant: {response}\n")


if __name__ == "__main__":
    main()
Put some documents in ./my_documents/ -- your project README, meeting notes, code files, whatever you want the assistant to know about -- and run it. The first query will be slow while Ollama loads the model into memory. Subsequent queries should respond in a few seconds depending on your hardware and the model size.
Evaluation: does it actually work?
Building the assistant is the straightforward part. Knowing whether it works well is harder. Apply what we learned in episode #73 (LLM evaluation):
def evaluate_assistant(assistant, test_cases):
    results = []
    for case in test_cases:
        response = assistant.chat(case["question"])
        # Reset conversation between test cases
        assistant.conversation = []

        result = {
            "question": case["question"],
            "response": response,
            "passed": True,
        }
        # Check required elements
        for check in case.get("must_contain", []):
            if check.lower() not in response.lower():
                result["passed"] = False
                result["missing"] = check
        # Check forbidden elements
        for check in case.get("must_not_contain", []):
            if check.lower() in response.lower():
                result["passed"] = False
                result["forbidden"] = check
        results.append(result)

    passed = sum(1 for r in results if r["passed"])
    print(f"Passed: {passed}/{len(results)}")
    for r in results:
        status = "PASS" if r["passed"] else "FAIL"
        print(f"  [{status}] {r['question'][:50]}...")
        if not r["passed"]:
            if "missing" in r:
                print(f"    Missing: {r['missing']}")
            if "forbidden" in r:
                print(f"    Found forbidden: {r['forbidden']}")
    return results


test_cases = [
    {
        "question": "What is in my project README?",
        "must_contain": ["readme"],
    },
    {
        "question": "Calculate 15% of 2847",
        "must_contain": ["427"],
    },
    {
        "question": "List the Python files in my project",
        "must_contain": [".py"],
    },
    {
        "question": "What is the capital of France?",
        "must_contain": ["paris"],
        "must_not_contain": ["error"],
    },
]
Write test cases based on real things you'd ask the assistant. Every time it gives a wrong answer, add that question and the correct answer as a test case. Over time, your test suite becomes a reliable quality measure -- exactly the custom evaluation approach from episode #73.
The must_contain / must_not_contain pattern is deliberately simple. You don't need sophisticated NLP metrics for a personal tool. You need to know: did it find my document? Did it calculate correctly? Did it use the right tool? Binary checks on expected content catch most failures, and they're easy to maintain.
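One small variation you might find handy once the suite grows: collecting every failed check per case, rather than only the last one. The helper below is a sketch with a hypothetical name (`check_response`), not part of the evaluation code above. It only needs a response string and a test case dict, so you can try it without a running model.

```python
def check_response(response, case):
    """Return all failed checks for one test case.

    `case` uses the same optional keys as the test cases above:
    "must_contain" and "must_not_contain" (lists of strings).
    """
    text = response.lower()
    missing = [
        check for check in case.get("must_contain", [])
        if check.lower() not in text
    ]
    forbidden = [
        check for check in case.get("must_not_contain", [])
        if check.lower() in text
    ]
    return {
        "passed": not missing and not forbidden,
        "missing": missing,
        "forbidden": forbidden,
    }


if __name__ == "__main__":
    case = {"must_contain": ["427", "percent"], "must_not_contain": ["error"]}
    print(check_response("15% of 2847 is 427.05", case))
```

Because it is a pure function, it is also easy to unit test on saved responses, which helps when you want to tweak the checks without re-running the model.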
What separates a demo from a product
This assistant works. But it's a demo. Here's what a real product would add, and why each piece matters:
Reliability: error handling for every failure mode. What happens when Ollama isn't running? When ChromaDB's database is corrupted? When a tool call times out? A demo crashes with a traceback. A product catches the exception, tells the user what happened, and suggests a fix.
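As a rough idea of what "catch the exception and suggest a fix" could look like, here is a small wrapper. `safe_chat` is a hypothetical helper, and which exception types your Ollama client actually raises when the server is down is an assumption here, so check that against the version you have installed.

```python
def safe_chat(chat_fn, user_input):
    """Call chat_fn(user_input) and turn failures into readable messages.

    Assumption: connection problems surface as ConnectionError and
    timeouts as TimeoutError; adjust to what your client really raises.
    """
    try:
        return chat_fn(user_input)
    except ConnectionError:
        return (
            "Could not reach the model server. Is Ollama running? "
            "Try starting it with 'ollama serve'."
        )
    except TimeoutError:
        return (
            "The request timed out. Try a shorter question "
            "or a smaller model."
        )
    except Exception as exc:
        # Last resort: report the problem instead of crashing the loop
        return f"Something went wrong ({type(exc).__name__}): {exc}"
```

In the main loop you would then call `safe_chat(assistant.chat, user_input)` instead of `assistant.chat(user_input)`.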
Persistence: the conversation history is lost when you restart. A product stores conversations in a database, lets you resume them across sessions, and maintains the knowledge base index. You'd also want a way to update documents -- right now, re-indexing an already-indexed file gets skipped by the existing_ids check.
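The simplest possible version of conversation persistence is a JSON file. The sketch below assumes the conversation is a list of `{"role": ..., "content": ...}` dicts, as in our assistant. The function names and the file path are made up for illustration, and a real product would more likely use a database.

```python
import json
from pathlib import Path


def save_conversation(conversation, path):
    """Write the message list to a JSON file (overwrites the file)."""
    Path(path).write_text(
        json.dumps(conversation, indent=2), encoding="utf-8"
    )


def load_conversation(path):
    """Read a saved message list, or start fresh if no file exists."""
    file = Path(path)
    if not file.exists():
        return []
    return json.loads(file.read_text(encoding="utf-8"))
```

You could call `load_conversation` when the assistant starts and `save_conversation` after every exchange, so a restart picks up where you left off.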
Performance: the first query is slow because Ollama needs to load the model. Subsequent queries depend on model size and context length. A product would keep the model warm, optimize chunk sizes based on actual retrieval quality, and cache embeddings for frequent queries.
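Caching embeddings for repeated queries can be sketched with nothing more than the standard library. Below, `make_cached_embedder` is a hypothetical helper that wraps whatever embedding function you use, and `fake_embed` is a stand-in so the example runs without a model.

```python
from functools import lru_cache


def make_cached_embedder(embed_fn, maxsize=256):
    """Wrap an embedding function so identical texts are embedded once.

    Vectors are returned as tuples so cached values cannot be
    modified by accident.
    """
    @lru_cache(maxsize=maxsize)
    def cached(text):
        return tuple(embed_fn(text))

    return cached


if __name__ == "__main__":
    def fake_embed(text):
        # Stand-in for a real embedding model call
        print(f"embedding {text!r}")
        return [float(len(text)), 1.0]

    embed = make_cached_embedder(fake_embed)
    embed("what is in my readme?")
    embed("what is in my readme?")  # served from the cache
```

This only helps when the exact same text comes back, so it pays off mostly for frequently repeated queries, not for free-form conversation.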
Security: our run_python tool executes arbitrary code, and that code is written by the model, not by you. On your own machine that's an acceptable risk for a demo. A product for other users needs sandboxed execution, restricted file access to specific directories, and input validation on every tool argument.
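Restricting file access to one directory is the easiest of these to illustrate. The sketch below uses a hypothetical helper, `resolve_inside`, that resolves the requested path and refuses anything that ends up outside the allowed base directory (for example via `..` or an absolute path). It relies on `Path.is_relative_to`, which needs Python 3.9 or newer.

```python
from pathlib import Path


def resolve_inside(base_dir, requested):
    """Return the resolved path if it stays inside base_dir.

    Raises PermissionError for paths that escape the base directory.
    """
    base = Path(base_dir).resolve()
    # Joining with an absolute `requested` discards `base`, which the
    # check below then rejects
    target = (base / requested).resolve()
    if not target.is_relative_to(base):
        raise PermissionError(
            f"{requested!r} is outside the allowed directory"
        )
    return target
```

A file-reading tool would call this first and only open the returned path. It is one guardrail, not a sandbox: it does nothing about code executed through run_python.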
User experience: a command-line interface works for us developers. Non-technical users need a web interface, or voice input, or integration with the tools they already use. That's a whole separate engineering effort.
The gap between demo and product is where most AI projects stall. The model works in the notebook. The demo impresses in a meeting. Then reality hits -- edge cases, error handling, scale, security, user experience. Building a reliable AI system is an engineering challenge as much as a machine learning one. If that sounds familiar, it's because we said the same thing back in episode #34 (ML engineering). The fundamentals don't change just because the model got bigger.
Arc 4 retrospective
We've covered quite some ground. Here's what Arc 4 taught us, and how this mini project connects each piece:
- Language modeling (#57): our assistant generates text token by token, same autoregressive process
- GPT architecture (#58): the decoder-only design that powers our local model
- BERT (#59): the embedding model in our RAG pipeline uses BERT-style encoding
- Training LLMs (#60): understanding why smaller local models have different capabilities than larger ones
- Alignment (#61): the instruction-following behavior that makes the assistant actually useful
- Prompt engineering (#62): our system prompt and tool-calling format
- Embeddings (#63): the vector representations that power document retrieval
- RAG (#64-65): our entire document knowledge system
- APIs (#66): the Ollama API follows the same patterns as OpenAI's
- Agents (#67-68): our tool-calling loop and conversation management
- Fine-tuning (#69): the path for making the assistant better at specific tasks
- Local models (#70): running everything on your own hardware, no cloud dependency
- Generation (#71): temperature and sampling settings that affect response quality
- Tokenization (#72): why context windows are measured in tokens, not words
- Evaluation (#73): our test suite approach to measuring quality
- Hugging Face (#74): the embedding models from the Hub that ChromaDB uses
- Multimodal (#75): the natural next step would be adding image understanding to the assistant (which we now know how to do with LLaVA or BLIP-2)
Twenty episodes of building blocks, assembled into something that actually runs on your laptop and answers questions about your own documents. That's the pattern of this entire series: understand the fundamentals, build from scratch, then combine into real systems.
And with that, Arc 4 is complete. We started with language modeling theory and ended with a working local AI assistant. The next arc shifts focus to computer vision -- processing images from fundamentals through detection, segmentation, and generative models. The skills transfer directly: attention mechanisms, transformers, embeddings, evaluation -- all the same concepts, applied to visual data.
In summary
- A personal AI assistant combines RAG (document knowledge), agents (tool use), and a local LLM into a system you fully own and control -- no API keys, no data leaving your machine, no monthly bill;
- document indexing chunks your files at paragraph boundaries and stores embeddings in ChromaDB for semantic retrieval at query time -- the same embedding + vector search pipeline from episodes #63-65;
- tools give the assistant capabilities beyond text: calculation, file access, code execution, and knowledge search -- each with security guardrails (character filtering, timeouts, truncation);
- the conversation loop retrieves context, builds messages, calls the LLM, handles tool calls, and manages history compression when the conversation exceeds the context window;
- evaluation with specific test cases (must_contain, must_not_contain) catches regressions and measures quality over time -- the same custom evaluation approach from episode #73;
- the demo-to-product gap is where real engineering happens: reliability, persistence, security, performance, and user experience -- the same production concerns from episode #34, applied to an LLM system.