Algorithm Deep Dive¶
This page provides an in-depth explanation of the sampling algorithm, including mathematical details and design rationale.
Overview¶
The sampler implements a three-pass pipeline designed to extract a sparse, diverse subset of high-quality frames from hours of underwater video:
- Pass 1: Quality Filtering — Compute metrics and reject problematic frames
- Pass 2: Diversity Selection — Grid-based selection spanning visual feature space
- Pass 3: Frame Extraction — Save selected frames to disk
The goal is to maximize visual diversity while maintaining quality standards, producing a dataset suitable for training robust computer vision models.
Design Principles¶
Why Not Random Sampling?¶
Random sampling gives equal weight to all frames, leading to:
- Over-representation of common scenes (e.g., blue water)
- Under-representation of rare events (e.g., unusual lighting, particle aggregations)
- Inclusion of poor-quality frames (blur, darkness, overexposure)
Why Not Uniform Temporal Sampling?¶
Uniform spacing (e.g., every 5 seconds) doesn't account for:
- Scene transitions: May miss short transects through interesting features
- Quality variation: Includes frames during rapid motion or bad lighting
- Redundancy: Captures nearly identical frames during static scenes
Our Approach: Quality + Diversity¶
- Quality gates ensure every selected frame meets minimum standards
- Diversity selection ensures broad coverage of visual conditions
- Interest scoring prioritizes visually rich, informative content
- Temporal filtering prevents adjacent redundant frames
Pass 1: Quality Filtering¶
Metric Computation¶
For each sampled frame (at --sample-fps rate), we compute four metrics:
1. Brightness¶
\( \text{brightness} = \frac{1}{WH} \sum_{i=1}^{H} \sum_{j=1}^{W} \text{gray}(i,j) \)

Where gray(i,j) is the grayscale pixel value (0–255) obtained via OpenCV's cvtColor with COLOR_BGR2GRAY, and W, H are the frame width and height.
Purpose: Reject frames that are too dark (sensor failure, occlusion) or overexposed (lighting artifacts).
Implementation:
Typical ranges:

- Deep water, ambient light: 30–80
- Shallow water, sunlit: 80–150
- Artificial lighting: 100–200
- Overexposed: >240
2. Sharpness (Laplacian Variance)¶
\( S = \operatorname{Var}\!\left(\nabla^2 \, \text{gray}\right) \)

The Laplacian operator detects edges and high-frequency content. Low variance of the filtered image indicates blur or lack of detail.
Implementation:
cv::Mat gray, laplacian;
cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
cv::Laplacian(gray, laplacian, CV_64F);
cv::Scalar mu, sigma;
cv::meanStdDev(laplacian, mu, sigma);
double sharpness = sigma[0] * sigma[0]; // variance
Typical ranges:

- Severe blur: <10
- Slight blur: 10–30
- Acceptable: 30–100
- Very sharp: >100
Why Laplacian?
It's fast, rotation-invariant, and well-correlated with perceptual sharpness. Alternatives (FFT-based methods, gradient magnitude) are slower or more complex.
3. Entropy (Shannon Entropy)¶
\( H = -\sum_{k=0}^{255} p_k \log_2 p_k \)

Where \(p_k\) is the probability of pixel value \(k\) in the grayscale histogram.
Purpose: Measure information content. Low entropy → featureless (e.g., uniform blue water). High entropy → rich texture and detail.
Implementation:
cv::Mat gray;
cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
int hist[256] = {0};
for (int r = 0; r < gray.rows; r++) {
const uchar* row = gray.ptr<uchar>(r);
for (int c = 0; c < gray.cols; c++)
hist[row[c]]++;
}
int total = gray.rows * gray.cols;
double entropy = 0.0;
for (int i = 0; i < 256; i++) {
if (hist[i] > 0) {
double p = (double)hist[i] / total;
entropy -= p * std::log2(p);
}
}
Typical ranges:

- Uniform (e.g., solid color): 0–2 bits
- Low texture (blue water): 2–4 bits
- Moderate detail: 4–6 bits
- High texture (particles, organisms): 6–8 bits
Maximum possible: 8 bits (uniform distribution of all 256 gray levels)
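The same computation as the loop above, factored into a standalone helper (hypothetical name) so the 8-bit ceiling is easy to verify:

```cpp
#include <cmath>
#include <cassert>

// Shannon entropy (bits) of a 256-bin grayscale histogram.
double histogram_entropy(const int hist[256], int total) {
    double entropy = 0.0;
    for (int i = 0; i < 256; i++) {
        if (hist[i] > 0) {
            double p = (double)hist[i] / total;
            entropy -= p * std::log2(p);
        }
    }
    return entropy;
}
```

A perfectly uniform histogram (all 256 gray levels equally likely) yields exactly 8 bits; a single-valued image yields 0.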
4. Motion (Frame Differencing)¶
\( M = \frac{1}{WH} \sum_{i,j} \left| \text{gray}_t(i,j) - \text{gray}_{t-1}(i,j) \right| \)

The mean absolute difference between the current frame and the previous frame.
Purpose: Detect camera motion, subject movement, or scene changes. Zero motion → static camera on featureless background.
Implementation:
cv::Mat curr_gray, prev_gray, diff;
cv::cvtColor(curr_frame, curr_gray, cv::COLOR_BGR2GRAY);
cv::cvtColor(prev_frame, prev_gray, cv::COLOR_BGR2GRAY);
cv::absdiff(curr_gray, prev_gray, diff);
double motion = cv::mean(diff)[0];
Typical ranges:

- Static camera, static scene: 0–2
- Slow pan or drift: 2–10
- Moderate motion: 10–30
- Rapid motion or scene change: >30
Note: Motion is not a quality gate by default, but contributes to the interest score (see below).
Quality Gates¶
After computing metrics, frames are rejected if they fail any threshold:
bool passes = (brightness >= min_brightness) &&
(brightness <= max_brightness) &&
(sharpness >= min_sharpness) &&
(entropy >= min_entropy);
No motion gate: Motion can be zero (static scene) without implying poor quality. It's used for ranking, not filtering.
Metric Caching¶
To avoid re-computing metrics on repeated runs (e.g., when tuning --max-frames or --n-bins), we cache results to disk.
Cache key (FNV-1a hash):
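For reference, a standard 64-bit FNV-1a implementation; exactly which fields (path, sample rate, etc.) are concatenated into the key is not spelled out here, so treat the input as an assumption:

```cpp
#include <cstdint>
#include <string>
#include <cassert>

// Standard 64-bit FNV-1a hash over a byte string.
uint64_t fnv1a64(const std::string& data) {
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;                 // XOR the byte, then multiply
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}
```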
Cache format (JSON):
{
"video_path": "/data/Triton43_Cam1.mp4",
"sample_fps": 1.0,
"records": [
{
"frame_idx": 0,
"frame_ts": "20250904T120000Z",
"brightness": 85.3,
"sharpness": 42.1,
"entropy": 3.8,
"motion": 12.5,
"fps": 29.97
},
...
]
}
Cache directory: ./.metric_cache by default, configurable via --cache-dir.
When to skip cache:
- Video file modified (timestamp or size changed)
- Different --sample-fps
- --no-cache flag passed
Pass 2: Diversity Selection¶
After quality filtering, we have a set of candidate frames. Now we select a diverse subset that spans the full range of visual conditions.
Temporal Filtering (Pre-Step)¶
Before grid binning, we enforce temporal separation to avoid near-duplicate frames from the same clip.
Algorithm (per video):
- Sort candidate frames by frame_idx
- Initialize last_selected = -∞
- For each frame:
    - If (frame_idx - last_selected) / fps >= min_gap:
        - Accept frame
        - Update last_selected = frame_idx
    - Else: reject (too close to previous)
Example (--min-gap 1.0 seconds, 30 fps):
- Frame 0 → accepted
- Frame 15 (0.5s later) → rejected
- Frame 30 (1.0s later) → accepted
- Frame 31 (0.03s after frame 30) → rejected (too close to frame 30)
Rationale: Adjacent frames differ only slightly (camera moves ~cm, lighting unchanged). Enforcing gaps increases diversity and reduces dataset size without losing information.
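The gap rule above can be sketched as a small filter (hypothetical free function; the real sampler may organize this differently):

```cpp
#include <vector>
#include <cassert>

// Keep only frames separated by at least min_gap seconds.
// Input must be sorted by frame index.
std::vector<int> temporal_filter(const std::vector<int>& frame_indices,
                                 double fps, double min_gap) {
    std::vector<int> kept;
    double last_selected = -1e18;  // stands in for -infinity
    for (int idx : frame_indices) {
        if ((idx - last_selected) / fps >= min_gap) {
            kept.push_back(idx);       // accept: gap is large enough
            last_selected = idx;
        }                              // else: reject, too close to previous
    }
    return kept;
}
```

With a 1.0-second gap at 30 fps, frames {0, 15, 30, 31} reduce to {0, 30}, matching the example above.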
Feature Normalization¶
To bin frames into a 3D grid, we need features on a common scale. We use:
- Brightness (B)
- Log-sharpness: \(\ln(1 + S)\) — compresses large dynamic range
- Entropy (E)
Normalization (per-feature):
- Compute 2nd and 98th percentiles across all candidates
- Linearly scale to [0, 1]: \( x' = \mathrm{clamp}\!\left(\frac{x - p_2}{p_{98} - p_2},\ 0,\ 1\right) \)
Why 2nd/98th percentiles? Avoids outliers skewing the scale (e.g., one extremely bright frame).
Implementation:
auto percentile = [](std::vector<float> v, double p) -> float {
    std::sort(v.begin(), v.end());  // sorts the copy; caller's data untouched
    int n = (int)v.size();
    double idx = (p / 100.0) * (n - 1);
    int lo = (int)idx;
    double frac = idx - lo;
    if (lo + 1 >= n)                // guard the upper edge (n == 1, or p == 100)
        return v[n - 1];
    return (float)(v[lo] * (1 - frac) + v[lo + 1] * frac);
};
float p2 = percentile(brightness, 2);
float p98 = percentile(brightness, 98);
for (auto& b : brightness)
b = std::clamp((b - p2) / (p98 - p2), 0.0f, 1.0f);
Grid Binning¶
Divide the normalized 3D space \([0,1]^3\) into a regular grid with n_bins per axis (default: 8³ = 512 cells).
Bin assignment:
auto bin = [&](float v) {
return std::clamp((int)(v * n_bins), 0, n_bins - 1);
};
int cell_id = bin(brightness[i])
+ bin(log_sharpness[i]) * n_bins
+ bin(entropy[i]) * n_bins * n_bins;
Example (n_bins = 8): a frame with (B=0.1, log S=0.5, E=0.9) lands in cell (0, 4, 7) → ID = 0 + 4×8 + 7×64 = 480
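Combining the bin lambda into a single helper (hypothetical name) reproduces the worked example:

```cpp
#include <algorithm>
#include <cassert>

// Flat cell ID for normalized features (b, s, e) in [0, 1].
int cell_id(float b, float s, float e, int n_bins) {
    auto bin = [&](float v) {
        // clamp so v == 1.0 maps to the last bin, not one past it
        return std::clamp((int)(v * n_bins), 0, n_bins - 1);
    };
    return bin(b) + bin(s) * n_bins + bin(e) * n_bins * n_bins;
}
```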
Grid properties:
- Uniform coverage: Each cell represents a distinct region of visual space
- Adaptive density: Dense regions (many frames) → many cells occupied; sparse regions → few cells
- Scale-free: Works for any dataset size (adjusts via n_bins)
Interest Scoring¶
Within each cell, frames are ranked by interest score:

\( \text{interest} = E \times \ln(1 + S) \times (1 + M) \)
Components:
- Entropy (E): Favors textured, detailed scenes
- Log-sharpness: Favors sharp focus (log prevents dominance by extreme values)
- Motion (M): Favors dynamic content (camera movement, subject motion)
Why this formula?
- Multiplicative: All factors must be reasonable (low entropy → low score, regardless of sharpness)
- Log-sharpness: Diminishing returns for extreme sharpness (100 vs 200 matters less than 10 vs 20)
- Motion bonus: \((1 + M)\) gives 1× to 10× boost depending on activity level
Example scores:

- Blue water (E=2.0, S=20, M=1): \(2.0 \times \ln(21) \times 2 = 12.2\)
- Particle cloud (E=4.5, S=80, M=15): \(4.5 \times \ln(81) \times 16 = 316.4\)
- Blurry organism (E=3.0, S=8, M=10): \(3.0 \times \ln(9) \times 11 = 72.5\)
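The score used in the worked examples above can be sketched directly (assuming the multiplicative form E × ln(1+S) × (1+M) implied by the listed components; the function name is illustrative):

```cpp
#include <cmath>
#include <cassert>

// Interest score: entropy × log-sharpness × motion bonus.
double interest_score(double entropy, double sharpness, double motion) {
    return entropy * std::log(1.0 + sharpness) * (1.0 + motion);
}
```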
Per-Cell Sampling¶
For each occupied grid cell:
- Sort frames by interest score (descending)
- Select top max_per_cell frames (default: ceil(max_frames / n_bins³))
Example (--max-frames 5000, --n-bins 8):
- 512 cells available
- max_per_cell = ceil(5000 / 512) = 10
- Each cell contributes ≤10 frames
If a cell has fewer frames: Take all of them.
Budget Enforcement¶
After per-cell sampling, if total exceeds --max-frames:
- Pool all selected frames
- Sort by interest score (descending)
- Trim to max_frames
This preserves the highest-interest frames while respecting the budget.
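The pool–sort–trim step above can be sketched as follows (`Selected` is a hypothetical record type, not the sampler's actual data structure):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <cassert>

struct Selected {
    int frame_idx;
    double score;
};

// Keep the max_frames highest-interest frames from the pooled selection.
void enforce_budget(std::vector<Selected>& pool, std::size_t max_frames) {
    std::sort(pool.begin(), pool.end(),
              [](const Selected& a, const Selected& b) { return a.score > b.score; });
    if (pool.size() > max_frames)
        pool.resize(max_frames);  // drop the lowest-scoring tail
}
```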
Diversity Guarantees¶
What the grid ensures:
- ✅ No blind spots: If a visual condition exists in candidates, it's represented (≥1 frame per occupied cell)
- ✅ Proportional representation: Cells with many high-quality candidates contribute more frames (up to max_per_cell)
- ✅ Bounded redundancy: At most max_per_cell frames from the same feature region
What it doesn't ensure:
- ❌ Semantic diversity: Grid operates on low-level metrics, not object categories (may select multiple frames of same organism if visually distinct)
- ❌ Uniform distribution: Dense regions (common scenes) contribute more total frames than sparse regions
Trade-offs: For stricter uniformity, reduce max_per_cell (may miss high-interest frames). For maximum interest, increase it (may over-sample common scenes).
Pass 3: Frame Extraction¶
Video Re-Reading¶
Selected frames are re-read from source videos and saved to disk:
- Group selections by video
- Sort by frame_idx (minimizes seeking)
- Open video, seek to first frame
- Read sequentially when possible (if frame_idx = prev + 1)
Optimization: Sequential reads are ~10× faster than random seeks.
Filename Format¶
Example:
Components:
- vehicle: Extracted from filename (e.g., "Triton43")
- camera: Extracted from filename (e.g., "Cam1")
- ISO_timestamp: ISO 8601 format (YYYYMMDDTHHMMSSZ) from video metadata or filename
- frame_idx: 0-padded to 7 digits (supports up to 10M frames per video)
- ext: .png (default) or .jpg/.jpeg (via --format)
Why zero-padded? Enables correct lexicographic sorting (e.g., for creating videos from frames).
Output Format¶
PNG (default):

- Lossless compression
- Larger files (~1–5 MB per frame at 1920×1080)
- Preferred for training (no compression artifacts)
JPEG (--format .jpg):
- Lossy compression
- Smaller files (~100–500 KB per frame)
- Acceptable for training if storage limited
Computational Complexity¶
Pass 1: Quality Filtering¶
Per frame:

- Brightness: \(O(WH)\) — single pass over pixels
- Sharpness: \(O(WH)\) — Laplacian convolution
- Entropy: \(O(WH)\) — histogram computation
- Motion: \(O(WH)\) — frame differencing
Total: \(O(N \times WH)\) where \(N\) is number of examined frames
Parallelism: OpenMP across videos (no inter-video dependencies)
Typical performance (1920×1080, 1 fps sampling, 8 threads):

- ~5–10 videos/sec (no cache)
- ~50–100 videos/sec (cache hit)
Pass 2: Diversity Selection¶
- Percentile computation: \(O(N \log N)\) per feature (sorting)
- Grid binning: \(O(N)\) (single pass)
- Per-cell sorting: \(O(N \log N)\) worst-case (if all frames in one cell)
- Budget trim: \(O(N \log N)\) (sorting)
Total: \(O(N \log N)\)
Typical performance: <1 sec for N ≤ 100,000 candidates
Pass 3: Frame Extraction¶
Per frame: Video seek + decode (~10–50 ms depending on codec and distance)
Total: \(O(M)\) where \(M\) is number of selected frames
Typical performance: ~20–50 frames/sec (mixed seeks), ~100+ frames/sec (sequential)
Memory Usage¶
- Metric records: ~50 bytes × N → 5 MB per 100,000 examined frames
- Grid map: ~10 bytes × occupied cells → <10 KB typical
- Video buffers: ~10 MB per open video (OpenCV internal)
Peak memory: <100 MB for typical workloads
Comparison to Alternatives¶
vs. K-means Clustering¶
K-means in feature space (brightness, sharpness, entropy) is a common approach.
Advantages of grid-based:
- ✅ Deterministic: No random initialization, same results every run
- ✅ Faster: No iterative convergence, single-pass binning
- ✅ Interpretable: Cell boundaries correspond to fixed feature ranges
- ✅ Uniform coverage: Doesn't collapse to dense regions (k-means centroids gravitate to high-density areas)
Disadvantages:
- ❌ Axis-aligned: Doesn't adapt to data manifold (k-means clusters can be arbitrarily shaped)
- ❌ Fixed resolution: Must choose n_bins upfront (k-means adapts K to data)
vs. FPS (Farthest Point Sampling)¶
FPS iteratively selects frames maximizing minimum distance to selected set.
Advantages of grid-based:
- ✅ Faster: \(O(N \log N)\) vs \(O(MN)\) for FPS
- ✅ Parallelizable: Grid binning is embarrassingly parallel
- ✅ Controllable: n_bins and max_per_cell give intuitive control over diversity vs coverage
Disadvantages:
- ❌ Looser packing: FPS guarantees minimum separation; grid allows clustering at cell boundaries
vs. Active Learning¶
Active learning selects frames a model is uncertain about.
Advantages of grid-based:
- ✅ No model required: Works for initial dataset creation
- ✅ Faster: No inference loop
- ✅ Broader diversity: Doesn't focus only on decision boundaries
When to use active learning: After initial training, to improve specific weaknesses. When to use grid sampling: Before training, to create initial diverse dataset.
Design Decisions and Rationale¶
Why Log-Sharpness in Normalization?¶
Sharpness has a fat-tailed distribution (many low values, few extreme values). Using raw sharpness would cause:
- Most frames binned to low-sharpness cells
- Extreme-sharpness frames dominate high bins
Log transform compresses the tail, giving more uniform binning across sharpness levels.
Why 2nd/98th Percentile Normalization?¶
Alternatives:
- Min/max: Vulnerable to outliers (one extreme frame ruins scale)
- Mean/std: Assumes Gaussian distribution (metrics are often skewed)
- Median/IQR: Better than mean/std, but 2nd/98th gives wider range while rejecting outliers
Our choice balances robustness and range coverage.
Why Brightness + Log-Sharpness + Entropy (not 4D with Motion)?¶
Reasons for 3D grid:
- Interpretability: 3D is visualizable (e.g., with scatter plots)
- Sparsity: 4D grid (e.g., 8⁴ = 4096 cells) would be mostly empty for typical datasets (<10,000 candidates)
- Motion is temporal: Motion depends on frame order, not intrinsic to a single frame (less suitable for binning)
Motion is used for ranking, not diversity selection.
Why Not Use Color Features?¶
Underwater video color is unreliable:
- Wavelength absorption: Red light disappears at depth, color shifts to blue/green
- Artificial lighting: Strobes and vehicle lights create unnatural color casts
- Camera white balance: Auto-WB creates inconsistent color across clips
Grayscale metrics (brightness, sharpness, entropy) are more stable and physically meaningful in underwater environments.
Validation and Tuning¶
See Tuning Guide for practical advice on parameter selection.
Key validation questions:
- Are quality gates appropriate? → Run calibrate on representative footage
- Is diversity adequate? → Manually inspect selected frames, check for under-represented conditions
- Is temporal separation sufficient? → Check for near-duplicate frames, increase --min-gap if needed
- Is grid resolution appropriate? → Increase n_bins if over-sampling common scenes, decrease if cells are too sparse
Metrics to track:
- Pass rate (quality filtering): 40–60% ideal (too low → overly strict, too high → accepting poor frames)
- Occupied cells: 30–70% of n_bins³ typical (too few → under-utilizing diversity, too many → over-fragmenting)
- Frames per cell: ~5–20 typical (too few → sparse, too many → redundant)
Further Reading¶
- Tuning Guide: How to calibrate parameters for your data
- Quality Metrics: Detailed explanation of brightness, sharpness, entropy, motion
- Diversity Selection: Grid-based sampling deep dive
- Temporal Filtering: Why and how temporal separation works