Vector Search

We have found vector search particularly helpful for finding rare classes or for bootstrapping a dataset when you have only a few examples. We have used it with these tools to mine rare examples of plankton and midwater animals, although the methodology generalizes to other domains. Given an embedding model suited to your data, a vector search can find similar items, and the data does not need to be images. Vector search answers the question "Have I seen this before?" rather than "What is this?", which is a classification problem.

How it works

We have a service called VSS (Vector Search System). This service uses models to convert data into high-dimensional vectors, called embeddings in machine learning jargon. These embeddings are essentially "features" that capture the data's characteristics. For example, the process of generating and searching an embedding for an image is outlined below.

  1. Embedding Generation: Data is processed through a pre-trained model (such as DINOv2 or CLIP for images, or Perch2 for audio) to generate a vector representation.
  2. Vector Database: These vectors are stored in a specialized database optimized for fast similarity searches.
  3. Similarity Search: When you provide a "query" input (here, an image), the system finds its nearest neighbors in the vector space: the images whose vectors are most similar to the query vector.
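
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the VSS implementation: the embedding step is replaced by hand-written toy vectors, the "database" is an in-memory dict, and the names `cosine_similarity` and `nearest_neighbors` are hypothetical. Cosine similarity is assumed as the metric here; the metric VSS actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, database, k=3):
    """Return the k (item_id, score) pairs most similar to the query, best first."""
    scored = [
        (item_id, cosine_similarity(query, vector))
        for item_id, vector in database.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embeddings that a real model would produce (assumption).
database = {
    "img_001": [0.9, 0.1, 0.0],
    "img_002": [0.8, 0.2, 0.1],
    "img_003": [0.0, 0.1, 0.9],
    "img_004": [0.1, 0.9, 0.1],
}
query_vector = [1.0, 0.0, 0.0]

top_matches = nearest_neighbors(query_vector, database, k=2)
for item_id, score in top_matches:
    print(f"{item_id}: {score:.3f}")
```

A brute-force scan like this is only practical for small collections; a dedicated vector database exists to make the same lookup fast at scale.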

Conceptual Overview

[Figure: Vector DB Concept]

VSS Key Features

  • Project-Specific Databases: Each project has its own dedicated vector database, which keeps search results relevant to that project's domain and data quality. It does not make sense to combine plankton data with drone/UAV data, and separate databases ensure the two are never confused.
  • Multiple Model Support: We support various embedding models, including DINOv2 (excellent for general visual similarity). If you have enough data, we find that a model fine-tuned on your own data works best.
  • Tator Integration: Results can be visualized and labeled directly within Tator, allowing for a seamless transition from discovery to annotation.
  • Voxel51 Integration: Results can be uploaded to Voxel51 for annotation. The professional license of Voxel51 allows for more advanced annotation and integrated model training.
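
To illustrate why project-specific databases matter, here is a small sketch of per-project isolation. The `ProjectVectorStores` class, the project names, and the item IDs are all hypothetical and are not part of the VSS API; squared Euclidean distance is used only to keep the example short.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ProjectVectorStores:
    """Keeps one independent vector collection per project (hypothetical helper)."""

    def __init__(self):
        self._stores = {}  # project name -> {item_id: vector}

    def add(self, project, item_id, vector):
        self._stores.setdefault(project, {})[item_id] = vector

    def search(self, project, query, k=5):
        """Return up to k item IDs from ONE project, closest first."""
        store = self._stores.get(project, {})
        ranked = sorted(store.items(), key=lambda kv: squared_distance(query, kv[1]))
        return [item_id for item_id, _ in ranked[:k]]


stores = ProjectVectorStores()
stores.add("plankton", "copepod_01", [0.20, 0.80])
stores.add("plankton", "diatom_07", [0.90, 0.10])
# This UAV vector happens to sit very close to the copepod vector.
stores.add("uav", "whale_03", [0.21, 0.79])

plankton_hits = stores.search("plankton", [0.20, 0.80], k=2)
print(plankton_hits)
```

Even though the UAV item is numerically close to the query, it can never appear in a plankton search because the two collections are never mixed.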

Workflow: Expanding Rare Classes

  1. Identify a Query: Find one or more high-quality examples of a rare class in Tator.
  2. Run a Search: Use the VSS API or integrated tools to search for similar images across your project's entire image collection.
  3. Review Results: Examine the top matches returned by the search.
  4. Label in Tator: Quickly label the correct matches to expand your training set for that rare class.
  5. Iterate: Use the newly labeled examples as additional queries to find even more candidates.
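
The workflow above is essentially a loop: search, review, label, then search again from the new labels. The sketch below simulates that loop in memory under stated assumptions. The function `mine_rare_class`, the toy embeddings, and the `is_match` callback (a stand-in for a person reviewing results in Tator) are all hypothetical; a real run would call the VSS service rather than scanning a dict.

```python
import math


def dot(a, b):
    """Dot product; equals cosine similarity when both vectors have unit length."""
    return sum(x * y for x, y in zip(a, b))


def mine_rare_class(seed_ids, embeddings, is_match, k=1, max_rounds=5):
    """Grow a labeled set by repeatedly searching from newly confirmed examples."""
    labeled = set(seed_ids)
    rejected = set()
    queries = list(seed_ids)
    for _ in range(max_rounds):
        new_matches = []
        for query_id in queries:
            # Only consider items that have not been reviewed yet.
            candidates = [
                item_id for item_id in embeddings
                if item_id not in labeled and item_id not in rejected
            ]
            candidates.sort(
                key=lambda item_id: dot(embeddings[query_id], embeddings[item_id]),
                reverse=True,
            )
            for candidate in candidates[:k]:
                if is_match(candidate):  # the human review step
                    labeled.add(candidate)
                    new_matches.append(candidate)
                else:
                    rejected.add(candidate)
        if not new_matches:
            break  # nothing new was confirmed, so stop iterating
        queries = new_matches
    return labeled


def unit_vector(angle_degrees):
    """2-D unit vector at the given angle; a toy stand-in for a real embedding."""
    radians = math.radians(angle_degrees)
    return (math.cos(radians), math.sin(radians))


angles = {
    "rare_a": 0, "rare_b": 10, "rare_c": 25, "rare_d": 40,
    "common_x": 90, "common_y": 100, "common_z": 80,
}
embeddings = {item_id: unit_vector(angle) for item_id, angle in angles.items()}
true_rare = {"rare_a", "rare_b", "rare_c", "rare_d"}

found = mine_rare_class(["rare_a"], embeddings, is_match=lambda i: i in true_rare)
print(sorted(found))
```

In this toy setup a single search from the seed with `k=1` surfaces only its closest neighbor, and the remaining rare examples are reached by querying from each newly confirmed match, which is the point of the Iterate step.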

API and Tools

  • 🧭 VSS Service: The main interactive documentation for the Vector Similarity Search service.
  • SDCAT: Often used as the first step to generate the initial clusters and embeddings that are then searchable via VSS.

Updated: 2026-04-13