Usage
Vibe Coding Guide · Mar 9, 2026 · 7 min read

# Building Semantic Video Search with Voyage Multimodal 3.5

Why this matters for builders

Voyage AI just dropped voyage-multimodal-3.5, the first production-grade multimodal embedding model that natively understands interleaved text+images and video frames in a single unified embedding space.

It beats Cohere Embed v4 by 4.56% on 15 visual document retrieval benchmarks and Google Multimodal Embedding 001 by 4.65% on video retrieval datasets, while matching top text-only models on pure text. Most importantly, it supports Matryoshka embeddings (2048 → 1024 → 512 → 256 dimensions) and multiple quantization options (fp32, int8, binary) with minimal quality loss.
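To see why Matryoshka plus quantization matters in practice, here is a back-of-the-envelope storage calculation. The per-vector byte counts follow directly from dimension × bytes per value; the 1M-segment corpus size is an arbitrary illustration, not a Voyage figure:

```python
def bytes_per_vector(dim: int, quantization: str) -> int:
    """Storage cost of one embedding vector."""
    bytes_per_value = {"fp32": 4, "int8": 1, "binary": 1 / 8}[quantization]
    return int(dim * bytes_per_value)

# Full-fidelity vs. aggressively compressed, per vector:
full = bytes_per_vector(2048, "fp32")    # 8192 bytes
small = bytes_per_vector(512, "int8")    # 512 bytes
tiny = bytes_per_vector(2048, "binary")  # 256 bytes

# For a hypothetical library of 1M video segments:
print(f"fp32/2048: {full * 1_000_000 / 1e9:.1f} GB")   # 8.2 GB
print(f"int8/512:  {small * 1_000_000 / 1e9:.2f} GB")  # 0.51 GB
print(f"binary:    {tiny * 1_000_000 / 1e9:.2f} GB")   # 0.26 GB
```

A 16x-32x storage reduction is the difference between "fits in memory on one box" and "needs a sharded vector DB" for a large video library.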

This unlocks practical, cost-effective semantic search over mixed media content — especially video libraries, product demos, meeting recordings, training content, and customer support videos.

For vibe coders who ship real products, this is a rare opportunity: a single embedding model that can power both visual document RAG and video search without modality-gap headaches.

When to use it

Use voyage-multimodal-3.5 when you need to:

  • Search inside video libraries using natural language ("show me the part where the engineer explains authentication flow")
  • Build unified search over PDFs, slides, screenshots, and video clips
  • Support long-context visual documents that exceed single-image limits
  • Reduce vector storage costs via Matryoshka or binary quantization
  • Replace separate text + image + video pipelines with one model

Skip it for pure text search (use voyage-3 or voyage-3.5 instead) or when you need real-time frame-by-frame analysis (this is for retrieval, not streaming video understanding).


The full process — Ship a working video semantic search prototype in one focused sprint

1. Define the goal (30 minutes)

Write a one-paragraph spec:

"Build a semantic video search demo that lets users type natural language queries and retrieve the most relevant video segments from a library of 50+ product and engineering videos. Each video is split into coherent scenes with optional transcripts. Results should show the video segment + timestamp + relevance score. Support Matryoshka dimensionality so we can test 512-dim vs 2048-dim tradeoffs. Store everything in a local vector DB for fast iteration."

Prompt your AI coding assistant with this exact goal so it stays scoped.

2. Shape the spec and prompt (45 minutes)

Good system prompt for Claude/Cursor/GPT:

You are an experienced retrieval engineer. We are building a semantic video search prototype using Voyage AI's new voyage-multimodal-3.5 model. Key constraints:

  • Videos must be processed as ordered sequences of frames + optional transcript text
  • Respect the 32k token limit per embedding (every 1120 pixels = 1 token)
  • Use scene-level segmentation aligned with transcript timestamps when possible
  • Support Matryoshka embeddings (2048, 1024, 512, 256)
  • Use voyageai Python SDK
  • Store in Chroma or Qdrant with metadata (video_id, start_time, end_time, transcript_snippet)
  • Return playable video segments with timestamps
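The 1120-pixels-per-token rule in the constraints above translates directly into a hard cap on frames per embedded segment. A quick sanity check (the resolutions are illustrative):

```python
def tokens_per_frame(width: int, height: int) -> int:
    # Voyage's stated rule: every 1120 pixels = 1 token
    return (width * height) // 1120

def max_frames(width: int, height: int, budget: int = 32000) -> int:
    """How many frames of this resolution fit in one 32k-token embedding."""
    return budget // tokens_per_frame(width, height)

# A 512x512 frame costs 234 tokens, so at 1 FPS one embedding covers
# roughly a 2-minute segment; full 1080p frames blow the budget fast.
print(max_frames(512, 512))     # 136 frames ≈ 2m16s at 1 FPS
print(max_frames(1920, 1080))   # 17 frames at 1 FPS
```

This is why the guide downsizes frames and segments videos into scenes rather than embedding whole recordings.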

3. Scaffold the project

mkdir voyage-video-search && cd voyage-video-search
mkdir videos segments embeddings
touch main.py embedder.py retriever.py utils.py requirements.txt

requirements.txt

voyageai
chromadb
opencv-python-headless
pillow
pydub
tqdm
python-dotenv

4. Implement carefully

embedder.py (core logic)

import cv2
import voyageai
from PIL import Image
from typing import Dict, List, Optional, Tuple

class VoyageVideoEmbedder:
    def __init__(self, model: str = "voyage-multimodal-3.5"):
        # Reads VOYAGE_API_KEY from the environment
        self.client = voyageai.Client()
        self.model = model

    def extract_frames(self, video_path: str, fps: float = 1.0) -> List[Tuple]:
        """Sample frames at roughly `fps` frames/second, tracking token cost."""
        cap = cv2.VideoCapture(video_path)
        original_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
        # Guard against fps >= original_fps, which would make the interval 0
        frame_interval = max(1, int(round(original_fps / fps)))

        frames = []
        frame_count = 0
        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break
            if frame_count % frame_interval == 0:
                # Voyage: every 1120 pixels = 1 token
                height, width = frame.shape[:2]
                tokens_per_frame = (height * width) // 1120
                frames.append((frame, tokens_per_frame))
            frame_count += 1
        cap.release()
        return frames

    def embed_video_segment(self,
                            video_path: str,
                            transcript: Optional[str] = None,
                            target_dim: int = 1024,
                            fps: float = 1.0) -> Dict:
        frames = self.extract_frames(video_path, fps=fps)

        # Simple token budget check: sample more sparsely once if over budget
        total_tokens = sum(tokens for _, tokens in frames)
        if total_tokens > 32000:
            # Production logic should loop and/or downscale frames instead
            fps = max(0.5, fps * 0.7)
            frames = self.extract_frames(video_path, fps=fps)

        # The SDK's multimodal endpoint expects PIL images, not raw BGR arrays
        images = [Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
                  for frame, _ in frames]

        # One input = one interleaved sequence of text and images
        input_data = list(images)
        if transcript:
            input_data = [transcript] + input_data

        result = self.client.multimodal_embed(
            inputs=[input_data],          # batch of one interleaved sequence
            model=self.model,
            input_type="document",
            truncation=True,
            output_dimension=target_dim   # Matryoshka magic
        )

        return {
            "embedding": result.embeddings[0],
            "metadata": {
                "video_path": video_path,
                "transcript": transcript,
                "frame_count": len(frames),
                "approx_duration_s": len(frames) / fps,
                "dimension": target_dim
            }
        }
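The spec also calls for scene-level segmentation aligned with transcript timestamps. A minimal, hypothetical helper for that step: it greedily groups `(start, end, text)` transcript entries into segments capped at a maximum duration. Real scene detection would also use visual cues (see PySceneDetect), but this is enough for a prototype:

```python
from typing import Dict, List

def segment_by_transcript(entries: List[Dict], max_seconds: float = 40.0) -> List[Dict]:
    """Group transcript entries (sorted by start time) into scene-sized segments.

    A new segment opens whenever adding the next entry would push the
    segment past `max_seconds`.
    """
    segments = []
    current = None
    for e in entries:
        if current and e["end"] - current["start_time"] > max_seconds:
            segments.append(current)
            current = None
        if current is None:
            current = {"start_time": e["start"], "end_time": e["end"], "text": e["text"]}
        else:
            current["end_time"] = e["end"]
            current["text"] += " " + e["text"]
    if current:
        segments.append(current)
    return segments

entries = [
    {"start": 0.0, "end": 15.0, "text": "Intro to the auth flow."},
    {"start": 15.0, "end": 35.0, "text": "Token exchange explained."},
    {"start": 35.0, "end": 55.0, "text": "Refresh token handling."},
]
print(len(segment_by_transcript(entries)))  # 2 segments
```

Each resulting segment maps cleanly onto one `embed_video_segment` call plus the `(video_id, start_time, end_time, transcript_snippet)` metadata the spec requires.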

retriever.py

import chromadb
from chromadb import Documents, EmbeddingFunction, Embeddings

from embedder import VoyageVideoEmbedder

class VoyageEmbeddingFunction(EmbeddingFunction):
    def __init__(self, embedder: VoyageVideoEmbedder, target_dim: int = 1024):
        self.embedder = embedder
        self.target_dim = target_dim

    def __call__(self, input: Documents) -> Embeddings:
        # At query time we usually embed text only, but we keep the same
        # multimodal model so queries and documents share one vector space
        result = self.embedder.client.multimodal_embed(
            inputs=[[text] for text in input],  # each query is its own sequence
            model=self.embedder.model,
            input_type="query",
            output_dimension=self.target_dim
        )
        return result.embeddings

embedder = VoyageVideoEmbedder()
chroma_client = chromadb.Client()
collection = chroma_client.get_or_create_collection(
    name="video_segments",
    embedding_function=VoyageEmbeddingFunction(embedder, target_dim=512)
)

5. Validate rigorously

Test checklist:

  • Single video segment embeds successfully
  • 2048-dim vs 512-dim embeddings show expected similarity patterns
  • Text query retrieves correct video scene (test with 5 known queries)
  • Token budget respected on long videos
  • Matryoshka truncation actually reduces dimension without crashing
  • Retrieval quality degrades gracefully when using lower dimensions
  • Metadata (timestamps, transcripts) preserved
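The Matryoshka checks in this list can be exercised offline before spending API credits: truncate a full-dimension vector, re-normalize, and confirm the geometry behaves. A toy sketch in pure Python (the 8-value vector stands in for a real 2048-dim embedding):

```python
import math

def truncate_and_renormalize(vec, dim):
    """Matryoshka truncation: keep the first `dim` values, then re-normalize."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy vector standing in for a 2048-dim embedding:
full = [0.9, 0.3, 0.2, 0.1, 0.05, 0.05, 0.01, 0.01]
short = truncate_and_renormalize(full, 4)

# Truncated vector is unit length, and its direction matches the head
assert abs(sum(x * x for x in short) - 1.0) < 1e-9
print(round(cosine(full[:4], short), 6))  # 1.0: head direction preserved
```

If your vector DB stores raw truncated vectors instead, make sure its distance metric is cosine (not inner product), or the missing re-normalization will skew scores.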

Run this validation prompt with your AI coding tool:

"Write a test suite that embeds 3 sample video segments, runs 5 natural language queries, and prints recall@1 and average cosine similarity for the correct segment."

6. Ship it safely

Production guardrails:

  • Always split videos longer than ~30-45 seconds into scenes
  • Store both full-res and low-res versions if needed
  • Cache embeddings aggressively (they're expensive)
  • Monitor embedding latency and cost
  • Start with 512-dim embeddings in production unless you need maximum accuracy
  • Add hybrid search (keyword + vector) for best results

Fastest way to demo: Use a public dataset like MSR-VTT or YouCook2. Process 20 videos, build a Gradio or Streamlit UI that lets users type queries and plays the top 3 matching clips with timestamps.
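One simple way to implement the hybrid-search guardrail above is reciprocal rank fusion (RRF) over the keyword and vector result lists. A minimal sketch; the segment IDs are made up and `k=60` is the conventional RRF constant, not anything Voyage-specific:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["seg_12", "seg_07", "seg_33"]   # from Chroma
keyword_hits = ["seg_07", "seg_90", "seg_12"]  # from transcript BM25/FTS
print(rrf_fuse([vector_hits, keyword_hits]))
# ['seg_07', 'seg_12', 'seg_90', 'seg_33']
```

RRF needs no score calibration between the two retrievers, which is exactly why it works well when mixing vector similarity with keyword hits on transcripts.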


Pitfalls and guardrails

  • Token explosion: Don't feed entire hour-long videos. Always segment.
  • Modality gap regression: Even though Voyage uses a unified encoder, very long transcripts can still dominate. Balance text vs visual tokens.
  • FPS choice: 1 FPS is usually enough for most retrieval use cases. 2+ FPS is rarely worth the token cost.
  • Matryoshka surprise: Not all downstream vector DBs or rerankers handle variable dimensions equally well. Test your full retrieval stack at each dimension.
  • Cost: Video embedding is significantly more expensive than text. Implement smart caching and segment reuse.
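The caching guardrail can be as simple as keying stored embeddings by a hash of the segment file plus the embedding parameters, so re-runs and re-encodes never pay twice. A hypothetical sketch (the JSON-on-disk layout is an assumption for prototyping, not a Voyage recommendation):

```python
import hashlib
import json
from pathlib import Path

class EmbeddingCache:
    """Disk cache keyed by file content + embedding parameters."""

    def __init__(self, cache_dir: str = "embeddings/cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _key(self, video_path: str, model: str, dim: int) -> str:
        h = hashlib.sha256()
        h.update(Path(video_path).read_bytes())   # content hash, not just the path
        h.update(f"{model}:{dim}".encode())
        return h.hexdigest()

    def get(self, video_path: str, model: str, dim: int):
        f = self.dir / f"{self._key(video_path, model, dim)}.json"
        return json.loads(f.read_text()) if f.exists() else None

    def put(self, video_path: str, model: str, dim: int, embedding):
        f = self.dir / f"{self._key(video_path, model, dim)}.json"
        f.write_text(json.dumps(embedding))
```

Check the cache before every `embed_video_segment` call; since the key includes model and dimension, experimenting with 512 vs 1024 dims never clobbers existing entries.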

What to do next — Iteration checklist

  1. Add automatic scene boundary detection using transcript timestamps or PySceneDetect
  2. Add a lightweight reranker (voyage-rerank or Cohere) on top of the top 20 results
  3. Benchmark 512 vs 1024 dimensions on your specific domain
  4. Build a simple UI and share internally
  5. Measure end-to-end latency and cost per query
  6. Explore hybrid search with traditional text search on transcripts

Once this works, you’ll have a strong foundation for internal video knowledge bases, customer support clip libraries, or content recommendation systems.

Sources

This guide is written for builders who can edit code and use AI coding tools. All code snippets are starter templates — check the official Voyage AI documentation and Python SDK for the latest parameters and best practices.

Original Source

blog.voyageai.com
