ML Models: Qwen3-VL-Embedding-2B
Overview
The project uses Qwen3-VL-Embedding-2B (2 Billion parameters) for multimodal embeddings. Unlike traditional CLIP models, this VLM provides significantly better nuanced understanding of complex scenes, text within images, and stylistic attributes.
Model ID: Qwen/Qwen3-VL-Embedding-2B
Local Path: ./Qwen/Qwen3-VL-Embedding-2B/ (auto-downloaded via HuggingFace Hub)
Architecture
classDiagram class Qwen3VLForEmbedding { +model: Qwen3VLModel +forward(input_ids, attention_mask, pixel_values, image_grid_thw) } class Qwen3VLEmbedder { +model: Qwen3VLForEmbedding +processor: Qwen3VLProcessor +embed(inputs, normalize) -_pooling_last(hidden_state, attention_mask) } Qwen3VLEmbedder --> Qwen3VLForEmbedding
Custom Model Wrapper
A custom model class extends Qwen3VLPreTrainedModel to expose last_hidden_state for pooling.
File: embedding_utils.py:22-40
Output Dataclass
@dataclass
class Qwen3VLForEmbeddingOutput(ModelOutput):
last_hidden_state: Optional[torch.FloatTensor] = None
attention_mask: Optional[torch.Tensor] = NoneEmbedding Process
- Conversation formatting: Each input (image or text) is wrapped in a chat template with a system prompt
"Represent the user's input.". - Vision processing: Images are resized and tiled according to
min_pixelsandmax_pixelsconstraints. - Tokenizer/Processor:
Qwen3VLProcessorturns the conversation into token IDs and vision tensors. - Forward pass: The model generates
last_hidden_state. - Pooling: The last non-padding token is selected using
_pooling_last(last-token pooling). - Normalization: L2 normalization is applied by default.
Configuration Constants
| Constant | Value | Description |
|---|---|---|
MAX_LENGTH | 8192 | Maximum token sequence length |
MIN_PIXELS | 4096 (4×32×32) | Minimum image pixel count |
MAX_PIXELS | 57,600 (1800×32) | Maximum image pixel count |
Hardware Acceleration
| Device | Precision | Use Case |
|---|---|---|
| CUDA | float16 | NVIDIA GPUs |
| MPS | float16 | Apple Silicon |
| CPU | float32 | Fallback |
Auto-Download
On first run, the model weights are automatically downloaded from Hugging Face Hub to ./Qwen/Qwen3-VL-Embedding-2B/ using snapshot_download.
Implementation: indexer.py:27-38
Related Components
- Image Indexing Pipeline – Uses embedder for batch vectorization
- REST API Endpoints – Uses embedder for text query embedding
- Environment Configuration – Model path and hardware settings
Last Updated: 2026-06-17