Data Pipeline: Image Indexing
Overview
The indexer.py script is the primary data ingestion pipeline. It converts a folder of raw images and videos into a searchable vector database in ChromaDB.
Workflow
```mermaid
flowchart LR
    A[Scan vault/] --> B{Is file<br/>new or modified?}
    B -->|No| Z[Skip]
    B -->|Yes| C{File type?}
    C -->|Image| D[Load via PIL]
    C -->|Video| E[Extract frame at 10%]
    D --> F[Generate 300x300 thumbnail]
    E --> F
    F --> G[Embed via Qwen3-VL]
    G --> H[Upsert to ChromaDB]
```
File Discovery
Recursively scans the target directory for supported formats.
| Category | Extensions |
|---|---|
| Images | .jpg, .jpeg, .png, .webp, .bmp, .heic, .heif, .tiff, .tif |
| Videos | .mp4, .mov, .avi, .webm, .mkv |
Location: indexer.py:106-110
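The discovery step can be sketched roughly as follows. This is a minimal illustration rather than the code in indexer.py: the names `IMAGE_EXTS`, `VIDEO_EXTS`, and `discover_media` are hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extension sets taken from the table above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".heic", ".heif", ".tiff", ".tif"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}


def discover_media(root):
    """Yield (path, kind) for every supported file under root, recursively."""
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        ext = path.suffix.lower()  # assumption: extensions match case-insensitively
        if ext in IMAGE_EXTS:
            yield path, "image"
        elif ext in VIDEO_EXTS:
            yield path, "video"
```

Calling `list(discover_media("vault"))` would return the supported files in sorted path order, each tagged as "image" or "video"; unsupported files are ignored.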
Incremental Indexing
A file is indexed only if it is new or its mtime (modification time) has changed since the last scan. The mtimes recorded by the previous run are read back from the existing ChromaDB metadata:
```python
existing_mtimes = {
    id: meta.get("mtime", 0.0)
    for id, meta in zip(existing_data["ids"], existing_data["metadatas"])
}
```
Constants:
- Database path: ./.db
- Thumbnails path: ./.cache/thumbnails
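Given a mapping like `existing_mtimes`, the skip decision might look like the sketch below. The helper name `needs_indexing` is hypothetical, and the sketch assumes that record ids are file path strings, which the source does not confirm.

```python
import os


def needs_indexing(path, existing_mtimes):
    """Return True if the file is new or its mtime differs from the stored value."""
    stored = existing_mtimes.get(str(path))  # assumption: ids are file path strings
    if stored is None:
        return True  # never indexed before
    return os.path.getmtime(path) != stored
```

Comparing with `!=` rather than `>` also re-indexes a file whose mtime moved backwards, for example after it was restored from a backup.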
Thumbnail Generation
A 300x300 WebP thumbnail is generated for every indexed media file and stored in the local .cache/thumbnails directory.
Why: Loading 10MB original photos during search is too slow. Serving small thumbnails makes the UI feel instant.
Implementation: indexer.py:74-84
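The source does not say whether thumbnails are cropped to exactly 300x300 or scaled to fit inside a 300x300 box. As an illustration of the second reading only, the sketch below computes the scaled size while preserving aspect ratio; `fit_within` is a hypothetical helper, not a function from indexer.py.

```python
def fit_within(width, height, box=300):
    """Scale (width, height) to fit inside a box x box square, keeping aspect ratio."""
    scale = min(box / width, box / height, 1.0)  # 1.0 cap: never upscale small images
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(4000, 3000))  # landscape photo -> (300, 225)
print(fit_within(100, 50))     # already small   -> (100, 50)
```

Under this assumption a 4000x3000 photo shrinks to 300x225, which is why a thumbnail is so much cheaper to serve than the original.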
Video Frame Extraction
Videos are not indexed as whole files. A single representative frame is extracted at ~10% of the video duration to avoid black starting frames.
Implementation: indexer.py:86-103
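Picking the frame at roughly 10% of the duration comes down to simple arithmetic on the frame count. The sketch below shows only that calculation; `representative_frame_index` is a hypothetical name, and how indexer.py actually reads the frame count and decodes the frame is not shown in the source.

```python
def representative_frame_index(frame_count, fraction=0.10):
    """Return the 0-based index of the frame located at `fraction` of the video."""
    if frame_count <= 0:
        raise ValueError("frame_count must be positive")
    index = int(frame_count * fraction)
    return min(index, frame_count - 1)  # clamp for very short clips


print(representative_frame_index(250))  # 25
print(representative_frame_index(1))    # 0
```

For a 250-frame clip this selects frame 25, skipping past the black or fade-in frames that often open a video.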
Embedding & Storage
- Batch size: 8 (configurable via --batch parameter)
- Model: Qwen3-VL-Embedding-2B
- Vector dimensions: model-specific (typically 2048-dim)
- Distance metric: Cosine
- Storage: ChromaDB Persistent Client at ./.db
Implementation: indexer.py:171-179
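Batching means the pending files are sent to the embedding model in groups of 8 rather than one at a time. A minimal sketch of the grouping step, using a hypothetical `iter_batches` helper (the embedding and upsert calls themselves are omitted because their exact form is not given in the source):

```python
from itertools import islice


def iter_batches(items, batch_size=8):
    """Yield successive lists of at most batch_size items."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


sizes = [len(b) for b in iter_batches(range(20), batch_size=8)]
print(sizes)  # [8, 8, 4]
```

The final batch may be smaller than the configured size, so any downstream code should not assume a fixed batch length.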
Configuration
| Parameter | CLI Flag | Default | Description |
|---|---|---|---|
| Vault path | path | Required | Path to the image vault directory |
| Batch size | --batch | 8 | Batch size for embedding inference |
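A command-line interface matching this table could be declared with argparse as sketched below. This assumes the script uses argparse, which the source does not state; `build_parser` is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser with a required vault path and an optional batch size."""
    parser = argparse.ArgumentParser(description="Index an image vault into ChromaDB.")
    parser.add_argument("path", help="Path to the image vault directory")
    parser.add_argument("--batch", type=int, default=8,
                        help="Batch size for embedding inference")
    return parser


args = build_parser().parse_args(["vault", "--batch", "16"])
print(args.path, args.batch)  # vault 16
```

Omitting `--batch` leaves `args.batch` at the default of 8.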
Usage
```bash
python indexer.py vault
python indexer.py vault --batch 16
```
Related Components
- Qwen3VL Embeddings – Embedding logic used during vectorization
- REST API Endpoints – Serves search results and thumbnails
- Media Entities – Data structures for vault files
Last Updated: 2026-06-17