Data Pipeline: Image Indexing

Overview

The indexer.py script is the primary data ingestion pipeline. It converts a folder of raw images and videos into a searchable vector database in ChromaDB.

Workflow

flowchart LR
    A[Scan vault/] --> B{Is file<br/>new or modified?}
    B -->|No| Z[Skip]
    B -->|Yes| C{File type?}
    C -->|Image| D[Load via PIL]
    C -->|Video| E[Extract frame at 10%]
    D --> F[Generate 300x300 thumbnail]
    E --> F
    F --> G[Embed via Qwen3-VL]
    G --> H[Upsert to ChromaDB]

File Discovery

The indexer recursively scans the target directory and picks up every file with one of the following extensions.

Category	Extensions
Images	`.jpg`, `.jpeg`, `.png`, `.webp`, `.bmp`, `.heic`, `.heif`, `.tiff`, `.tif`
Videos	`.mp4`, `.mov`, `.avi`, `.webm`, `.mkv`

Location: indexer.py:106-110
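
The scan described above could look roughly like the following. This is a minimal sketch, not the code at indexer.py:106-110: the names `IMAGE_EXTS`, `VIDEO_EXTS`, `discover_media`, and `media_kind` are hypothetical, and case-insensitive extension matching is an assumption.

```python
from pathlib import Path

# Extension sets copied from the table above.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp", ".bmp", ".heic", ".heif", ".tiff", ".tif"}
VIDEO_EXTS = {".mp4", ".mov", ".avi", ".webm", ".mkv"}


def discover_media(root: Path) -> list[Path]:
    """Return every supported image or video under root, in sorted order."""
    supported = IMAGE_EXTS | VIDEO_EXTS
    return sorted(
        p for p in root.rglob("*")
        # Assumption: extensions are matched case-insensitively (.JPG == .jpg).
        if p.is_file() and p.suffix.lower() in supported
    )


def media_kind(path: Path) -> str:
    """Classify a discovered file as "image" or "video" by its extension."""
    return "video" if path.suffix.lower() in VIDEO_EXTS else "image"
```

Calling `discover_media(Path("vault"))` would return the files that the rest of the pipeline then filters by modification time.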

Incremental Indexing

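A file is re-indexed only if its mtime (modification time) has changed since the last scan. The mtimes recorded on the previous run are read back from the stored metadata, defaulting to 0.0 when a record has none: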

existing_mtimes = {
    id: meta.get("mtime", 0.0)
    for id, meta in zip(existing_data["ids"], existing_data["metadatas"])
}
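
Given a mapping like `existing_mtimes`, the skip-or-index decision can be sketched as below. The function name `needs_indexing` and the use of exact float comparison are assumptions for illustration; the real script may compare differently.

```python
def needs_indexing(file_id: str, current_mtime: float,
                   existing_mtimes: dict[str, float]) -> bool:
    """Return True when the file is new or its mtime differs from the stored one."""
    stored = existing_mtimes.get(file_id)
    if stored is None:
        return True  # no record yet, so the file has never been indexed
    # Any difference counts as a change, including an mtime that moved backwards.
    return current_mtime != stored
```

In practice `current_mtime` would come from something like `os.path.getmtime(path)`. Because records without an mtime default to 0.0 in the mapping, they never match a real file's mtime and are re-indexed on the next run.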

Constants:

Database path: ./.db
Thumbnails path: ./.cache/thumbnails

Thumbnail Generation

The indexer generates a 300x300 WebP thumbnail for every indexed media file and stores it in the local ./.cache/thumbnails directory.

Why: Loading full-size originals (around 10 MB each) for every search result is too slow. Serving small thumbnails keeps the results view responsive.

Implementation: indexer.py:74-84
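
One way to produce such thumbnails with Pillow is sketched below. It is not the code at indexer.py:74-84: the hashed filename scheme and the names `thumbnail_path` and `make_thumbnail` are hypothetical, and the sketch assumes the thumbnail fits within 300x300 while preserving aspect ratio rather than being cropped to an exact square.

```python
import hashlib
from pathlib import Path

THUMB_DIR = Path("./.cache/thumbnails")
THUMB_SIZE = (300, 300)


def thumbnail_path(source: Path, cache_dir: Path = THUMB_DIR) -> Path:
    """Derive a stable cache filename from the source path (hypothetical scheme)."""
    digest = hashlib.sha1(str(source).encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.webp"


def make_thumbnail(source: Path, cache_dir: Path = THUMB_DIR) -> Path:
    """Write a WebP thumbnail that fits within 300x300 and return its path."""
    # Pillow is imported lazily so thumbnail_path works without it installed.
    from PIL import Image

    out = thumbnail_path(source, cache_dir)
    out.parent.mkdir(parents=True, exist_ok=True)
    with Image.open(source) as img:
        rgb = img.convert("RGB")
        rgb.thumbnail(THUMB_SIZE)  # resizes in place, preserving aspect ratio
        rgb.save(out, format="WEBP")
    return out
```

Note that stock Pillow does not open `.heic` or `.heif` files; a plugin such as pillow-heif would be needed for those formats.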

Video Frame Extraction

Videos are not indexed as whole files. Instead, a single representative frame is extracted at roughly 10% of the video's duration, which avoids the black frames that often open a video.

Implementation: indexer.py:86-103
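
The 10% seek can be sketched as follows, assuming OpenCV is the video library (the source does not say which one indexer.py uses). `target_frame_index` and `extract_frame` are hypothetical names, and seeking by frame index equals seeking by duration only for constant-frame-rate video.

```python
def target_frame_index(frame_count: int, fraction: float = 0.10) -> int:
    """Return the frame index at the given fraction, clamped to a valid frame."""
    if frame_count <= 0:
        return 0
    return min(max(int(frame_count * fraction), 0), frame_count - 1)


def extract_frame(video_path: str, fraction: float = 0.10):
    """Return the chosen frame as an RGB array, or None if it cannot be read."""
    # Assumption: opencv-python is available; imported lazily for that reason.
    import cv2

    cap = cv2.VideoCapture(video_path)
    try:
        if not cap.isOpened():
            return None
        total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame_index(total, fraction))
        ok, frame = cap.read()
        # OpenCV decodes to BGR; convert so downstream image code sees RGB.
        return cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) if ok else None
    finally:
        cap.release()
```

The returned array could then be wrapped with `PIL.Image.fromarray` so that videos and images share the same thumbnail and embedding path.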

Embedding & Storage

Batch size: 8 by default (configurable via the --batch flag)
Model: Qwen3-VL-Embedding-2B
Vector dimensions: model-specific (typically 2048)
Distance metric: Cosine
Storage: ChromaDB Persistent Client at ./.db

Implementation: indexer.py:171-179
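
The batch-and-upsert loop might be structured like this. It is a sketch under stated assumptions, not the code at indexer.py:171-179: `batched`, `index_items`, the `embed_fn` callable, the `(id, image, mtime)` item shape, and the collection name `media` are all hypothetical.

```python
from typing import Callable, Iterator, Sequence

# Assumed setup, using ChromaDB's persistent client with cosine distance:
#   import chromadb
#   client = chromadb.PersistentClient(path="./.db")
#   collection = client.get_or_create_collection(
#       "media", metadata={"hnsw:space": "cosine"})


def batched(items: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_items(collection, items: Sequence[tuple], embed_fn: Callable,
                batch_size: int = 8) -> int:
    """Embed (id, image, mtime) items in batches and upsert them.

    Returns the number of records written.
    """
    written = 0
    for batch in batched(items, batch_size):
        ids = [item_id for item_id, _, _ in batch]
        vectors = embed_fn([image for _, image, _ in batch])
        # Storing mtime alongside each vector is what enables incremental runs.
        metadatas = [{"mtime": mtime} for _, _, mtime in batch]
        collection.upsert(ids=ids, embeddings=vectors, metadatas=metadatas)
        written += len(ids)
    return written
```

Using upsert rather than add means a modified file overwrites its previous record instead of creating a duplicate.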

Configuration

Parameter	CLI Argument	Default	Description
Vault path	`path` (positional)	Required	Path to the image vault directory
Batch size	`--batch`	`8`	Batch size for embedding inference
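
A command-line interface matching this table could be declared with argparse as sketched below. The function name `build_parser` and the description text are assumptions; only the `path` argument, the `--batch` flag, and its default of 8 come from the table.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one required positional path and an optional --batch."""
    parser = argparse.ArgumentParser(description="Index a media vault into ChromaDB.")
    parser.add_argument("path", help="Path to the image vault directory")
    parser.add_argument("--batch", type=int, default=8,
                        help="Batch size for embedding inference")
    return parser


# Example: parse an explicit argument list instead of sys.argv.
args = build_parser().parse_args(["vault", "--batch", "16"])
```

Here `args.path` is `"vault"` and `args.batch` is the integer 16; omitting `--batch` leaves it at 8.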

Usage

python indexer.py vault
python indexer.py vault --batch 16

Related Components

Qwen3VL Embeddings – Embedding logic used during vectorization
REST API Endpoints – Serves search results and thumbnails
Media Entities – Data structures for vault files

Last Updated: 2026-06-17
