Data Pipeline: Image Indexing

Overview

The indexer.py script is the primary data ingestion pipeline. It converts a folder of raw images and videos into a searchable vector database in ChromaDB.

Workflow

flowchart LR
    A[Scan vault/] --> B{Is file<br/>new or modified?}
    B -->|No| Z[Skip]
    B -->|Yes| C{File type?}
    C -->|Image| D[Load via PIL]
    C -->|Video| E[Extract frame at 10%]
    D --> F[Generate 300x300 thumbnail]
    E --> F
    F --> G[Embed via Qwen3-VL]
    G --> H[Upsert to ChromaDB]

File Discovery

The indexer recursively scans the target directory and picks up every file whose extension is in the supported list.

Category  Extensions
Images    .jpg, .jpeg, .png, .webp, .bmp, .heic, .heif, .tiff, .tif
Videos    .mp4, .mov, .avi, .webm, .mkv

Implementation: indexer.py:106-110
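
As an illustration of the discovery step, here is a minimal sketch that walks a directory tree and classifies files by extension. The names discover_media, IMAGE_EXTS, and VIDEO_EXTS are hypothetical, and the case-insensitive extension match is an assumption; the actual code in indexer.py may differ.

```python
from pathlib import Path

# Extension sets copied from the table above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".heic", ".heif", ".tiff", ".tif"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}


def discover_media(root):
    """Return (path, kind) pairs for every supported file under root, sorted by path."""
    found = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        ext = path.suffix.lower()  # assumption: treat .JPG and .jpg alike
        if ext in IMAGE_EXTS:
            found.append((path, "image"))
        elif ext in VIDEO_EXTS:
            found.append((path, "video"))
    return sorted(found)
```

Files with any other extension are silently ignored, which matches the "supported formats" wording above.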

Incremental Indexing

A file is indexed only if it is new or its mtime (modification time) differs from the value recorded during the previous scan. Unchanged files are skipped.

# Map each stored ID to its recorded mtime (0.0 if the entry has none).
existing_mtimes = {
    file_id: (meta or {}).get("mtime", 0.0)
    for file_id, meta in zip(existing_data["ids"], existing_data["metadatas"])
}
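
The lookup table above could then drive the skip decision roughly as follows. This is a sketch under two assumptions that the source does not confirm: that record IDs are file path strings, and that "changed" means any difference in mtime rather than strictly newer. The function name needs_indexing is hypothetical.

```python
import os


def needs_indexing(path, existing_mtimes):
    """Return True if the file is new or its mtime differs from the stored one."""
    stored = existing_mtimes.get(str(path))  # assumption: IDs are path strings
    if stored is None:
        return True  # never indexed before
    return os.path.getmtime(path) != stored
```

Comparing with != rather than > also catches files restored from a backup with an older timestamp.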

Constants:

  • Database path: ./.db
  • Thumbnails path: ./.cache/thumbnails

Thumbnail Generation

For every indexed media file, the indexer generates a 300x300 WebP thumbnail and stores it in the local .cache/thumbnails directory.

Why: Loading 10MB original photos during search is too slow. Serving small thumbnails makes the UI feel instant.

Implementation: indexer.py:74-84
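
To make the thumbnail step concrete, the sketch below shows one way to derive a cache filename and compute the scaled-down size. Both helpers are hypothetical: the hashed filename scheme is an assumption, and so is the choice to preserve aspect ratio within a 300x300 box (the behavior of Pillow's Image.thumbnail) rather than crop to a square. The actual resizing and WebP encoding would be done by the imaging library and is omitted here.

```python
import hashlib
from pathlib import Path

THUMB_DIR = Path("./.cache/thumbnails")
THUMB_SIZE = 300


def thumbnail_path(source, thumb_dir=THUMB_DIR):
    """Derive a stable cache filename from the source path (hypothetical scheme)."""
    digest = hashlib.sha1(str(source).encode("utf-8")).hexdigest()
    return Path(thumb_dir) / f"{digest}.webp"


def fit_within(width, height, box=THUMB_SIZE):
    """Scale (width, height) to fit inside a box x box square, keeping aspect ratio."""
    scale = min(box / width, box / height, 1.0)  # 1.0 cap: never upscale
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under these assumptions a 4000x3000 photo would be reduced to 300x225, while an image already smaller than the box is left at its original size.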

Video Frame Extraction

Videos are not indexed as whole files. A single representative frame is extracted at ~10% of the video duration to avoid black starting frames.

Implementation: indexer.py:86-103
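
The "~10% of the duration" rule can be expressed as a frame index calculation. The sketch below is library-agnostic and the function name is hypothetical; the source does not say which decoder indexer.py uses. With OpenCV, for example, the frame count would come from cv2.CAP_PROP_FRAME_COUNT and the seek would use cv2.CAP_PROP_POS_FRAMES.

```python
def representative_frame_index(total_frames, fraction=0.10):
    """Pick the frame index located `fraction` of the way through the video."""
    if total_frames <= 0:
        raise ValueError("video has no frames")
    index = int(total_frames * fraction)
    return min(index, total_frames - 1)  # clamp so very short clips stay in range
```

For a 1000-frame clip this selects frame 100; a single-frame clip falls back to frame 0.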

Embedding & Storage

  • Batch size: 8 (configurable via --batch parameter)
  • Model: Qwen3-VL-Embedding-2B
  • Vector dimensions: model-specific (typically 2048-dim)
  • Distance metric: Cosine
  • Storage: ChromaDB Persistent Client at ./.db

Implementation: indexer.py:171-179
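
The batching behavior listed above might look like the following sketch. The embed_fn and upsert_fn callables are hypothetical placeholders for the Qwen3-VL embedding call and the ChromaDB upsert; neither reflects the real signatures used in indexer.py.

```python
from itertools import islice


def batched(items, batch_size=8):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(items, embed_fn, upsert_fn, batch_size=8):
    """Embed and store items batch by batch; return the number of items processed."""
    total = 0
    for batch in batched(items, batch_size):
        vectors = embed_fn(batch)  # hypothetical: returns one vector per item
        upsert_fn(batch, vectors)  # hypothetical: e.g. wraps a ChromaDB upsert
        total += len(batch)
    return total
```

Processing in fixed-size batches keeps peak memory bounded during embedding inference, which is presumably why the batch size is exposed as a CLI parameter.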

Configuration

Parameter   CLI Flag  Default   Description
Vault path  path      Required  Path to the image vault directory
Batch size  --batch   8         Batch size for embedding inference
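
A command-line interface matching the table above can be sketched with argparse. The function name build_parser and the description string are assumptions; only the positional path argument and the --batch flag with its default of 8 come from the documentation.

```python
import argparse


def build_parser():
    """Build a parser with the two documented parameters."""
    parser = argparse.ArgumentParser(description="Index an image vault into ChromaDB.")
    parser.add_argument("path", help="Path to the image vault directory")
    parser.add_argument("--batch", type=int, default=8,
                        help="Batch size for embedding inference")
    return parser
```

Calling build_parser().parse_args(["vault", "--batch", "16"]) yields a namespace with path set to "vault" and batch set to 16.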

Usage

python indexer.py vault
python indexer.py vault --batch 16

Last Updated: 2026-06-17