ML Models: Qwen3-VL-Embedding-2B

Overview

The project uses Qwen3-VL-Embedding-2B (2 billion parameters) to produce multimodal embeddings. Unlike traditional CLIP models, this vision-language model (VLM) captures more nuance in complex scenes, text within images, and stylistic attributes.

Model ID: Qwen/Qwen3-VL-Embedding-2B
Local Path: ./Qwen/Qwen3-VL-Embedding-2B/ (auto-downloaded via Hugging Face Hub)

Architecture

classDiagram
    class Qwen3VLForEmbedding {
        +model: Qwen3VLModel
        +forward(input_ids, attention_mask, pixel_values, image_grid_thw)
    }
    class Qwen3VLEmbedder {
        +model: Qwen3VLForEmbedding
        +processor: Qwen3VLProcessor
        +embed(inputs, normalize)
        -_pooling_last(hidden_state, attention_mask)
    }
    Qwen3VLEmbedder --> Qwen3VLForEmbedding

Custom Model Wrapper

The custom model class Qwen3VLForEmbedding extends Qwen3VLPreTrainedModel and exposes last_hidden_state so that the embedder can pool it.

File: embedding_utils.py:22-40

Output Dataclass

@dataclass
class Qwen3VLForEmbeddingOutput(ModelOutput):
    last_hidden_state: Optional[torch.FloatTensor] = None  # per-token hidden states, pooled later
    attention_mask: Optional[torch.Tensor] = None  # marks real tokens versus padding

Embedding Process

  1. Conversation formatting: Each input (image or text) is wrapped in a chat template with the system prompt "Represent the user's input."
  2. Vision processing: Images are resized and tiled according to min_pixels and max_pixels constraints.
  3. Tokenizer/Processor: Qwen3VLProcessor turns the conversation into token IDs and vision tensors.
  4. Forward pass: The model generates last_hidden_state.
  5. Pooling: The hidden state of the last non-padding token is selected by _pooling_last (last-token pooling) and becomes the embedding.
  6. Normalization: L2 normalization is applied by default.
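
Steps 5 and 6 can be illustrated with a small framework-free sketch. It is not the project's _pooling_last, which operates on torch tensors; the function names pool_last_token and l2_normalize are hypothetical, and plain nested lists stand in for tensors so the logic is easy to follow.

```python
import math
from typing import List

Vector = List[float]


def pool_last_token(hidden_state: List[List[Vector]],
                    attention_mask: List[List[int]]) -> List[Vector]:
    """Pick the hidden vector of the last non-padding token per sequence.

    hidden_state is indexed as [batch][position][dim];
    attention_mask is indexed as [batch][position], 1 for real tokens.
    """
    pooled = []
    for tokens, mask in zip(hidden_state, attention_mask):
        # Highest position whose mask is 1; works for left or right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(list(tokens[last]))
    return pooled


def l2_normalize(vectors: List[Vector], eps: float = 1e-12) -> List[Vector]:
    """Scale each vector to unit Euclidean length (eps guards against zero norm)."""
    normalized = []
    for v in vectors:
        norm = math.sqrt(sum(x * x for x in v))
        normalized.append([x / max(norm, eps) for x in v])
    return normalized
```

Because the vectors end up with unit length, the dot product of two embeddings equals their cosine similarity, which is presumably why normalization is on by default.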

Configuration Constants

Constant      Value               Description
MAX_LENGTH    8192                Maximum token sequence length
MIN_PIXELS    4096 (4×32×32)      Minimum image pixel count
MAX_PIXELS    57,600 (1800×32)    Maximum image pixel count
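
To show how a pixel budget like this can constrain image size, here is a minimal sketch of a resize calculation. It is an illustration only: the helper fit_to_pixel_budget is hypothetical, and the assumption that both sides are rounded to multiples of 32 is inferred from the 32×32 factor in MIN_PIXELS rather than taken from the project's code. The actual processor may compute sizes differently.

```python
import math
from typing import Tuple

MIN_PIXELS = 4 * 32 * 32   # 4096, as listed in the table
MAX_PIXELS = 1800 * 32     # 57,600, as listed in the table
FACTOR = 32                # assumed patch granularity (not stated in the source)


def fit_to_pixel_budget(height: int, width: int,
                        factor: int = FACTOR,
                        min_pixels: int = MIN_PIXELS,
                        max_pixels: int = MAX_PIXELS) -> Tuple[int, int]:
    """Return (height, width) rounded to multiples of `factor` whose
    product lies within [min_pixels, max_pixels], keeping the aspect
    ratio approximately unchanged."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink: scale both sides down, rounding down to stay under budget.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Grow: scale both sides up, rounding up to reach the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w
```

Under these assumptions, a 1080×1920 frame would be reduced to 160×320 (51,200 pixels), and a 16×16 icon would be enlarged to 64×64 (4096 pixels).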

Hardware Acceleration

Device    Precision    Use Case
CUDA      float16      NVIDIA GPUs
MPS       float16      Apple Silicon
CPU       float32      Fallback
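
The table can be read as a simple selection rule. The sketch below encodes it as a pure function; the name select_device_and_dtype is hypothetical, and the priority order (CUDA first, then MPS, then CPU) is an assumption based on the order of the rows.

```python
from typing import Tuple


def select_device_and_dtype(cuda_available: bool,
                            mps_available: bool) -> Tuple[str, str]:
    """Map backend availability to (device, precision) as in the table.

    Assumed priority: CUDA, then MPS, then the CPU fallback.
    """
    if cuda_available:
        return "cuda", "float16"   # NVIDIA GPUs
    if mps_available:
        return "mps", "float16"    # Apple Silicon
    return "cpu", "float32"        # fallback
```

With PyTorch, the two flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available(), and the precision name can be turned into a dtype with getattr(torch, name).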

Auto-Download

On first run, the model weights are automatically downloaded from Hugging Face Hub to ./Qwen/Qwen3-VL-Embedding-2B/ using snapshot_download.

Implementation: indexer.py:27-38
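
A download-on-first-run check might look like the following sketch. It is not a copy of indexer.py: the helper name ensure_model and the "directory exists and is non-empty" test are assumptions, while snapshot_download with its repo_id and local_dir parameters comes from the huggingface_hub package.

```python
from pathlib import Path
from typing import Union

MODEL_ID = "Qwen/Qwen3-VL-Embedding-2B"
LOCAL_PATH = Path("./Qwen/Qwen3-VL-Embedding-2B")


def ensure_model(repo_id: str = MODEL_ID,
                 local_dir: Union[str, Path] = LOCAL_PATH) -> Path:
    """Return the local model directory, downloading the weights if needed."""
    local_dir = Path(local_dir)
    # Assumption: a non-empty directory means the weights are already present.
    if local_dir.is_dir() and any(local_dir.iterdir()):
        return local_dir
    # Imported lazily so the dependency is only needed on first run.
    from huggingface_hub import snapshot_download
    snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
    return local_dir
```

A stricter check could verify that specific files such as the config and weight shards exist, since an interrupted download can leave a partially filled directory.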


Last Updated: 2026-06-17