Algorithm Deep Dive¶
This page provides an in-depth explanation of the sampling algorithm, including mathematical details and design rationale.
Overview¶
The sampler implements a three-pass pipeline designed to extract a sparse, diverse subset of high-quality frames from hours of underwater video:
- Pass 1: Quality Filtering — Compute metrics and reject problematic frames
- Pass 2: Diversity Selection — Grid-based selection spanning visual feature space
- Pass 3: Frame Extraction — Save selected frames to disk
The goal is to maximize visual diversity while maintaining quality standards, producing a dataset suitable for training robust computer vision models.
Design Principles¶
Why Not Random Sampling?¶
Random sampling gives equal weight to all frames, leading to:
- Over-representation of common scenes (e.g., blue water)
- Under-representation of rare events (e.g., unusual lighting, particle aggregations)
- Inclusion of poor-quality frames (blur, darkness, overexposure)
Why Not Uniform Temporal Sampling?¶
Uniform spacing (e.g., every 5 seconds) doesn't account for:
- Scene transitions: May miss short transects through interesting features
- Quality variation: Includes frames during rapid motion or bad lighting
- Redundancy: Captures nearly identical frames during static scenes
Our Approach: Quality + Diversity¶
- Quality gates ensure every selected frame meets minimum standards
- Diversity selection ensures broad coverage of visual conditions
- Interest scoring prioritizes visually rich, informative content
- Temporal filtering prevents adjacent redundant frames
Pass 1: Quality Filtering¶
Metric Computation¶
For each sampled frame (at --sample-fps rate), we compute four metrics:
1. Brightness¶
\( \text{brightness} = \frac{1}{WH} \sum_{i=1}^{H} \sum_{j=1}^{W} \text{gray}(i,j) \)

Where gray(i,j) is the grayscale pixel value (0–255) obtained via OpenCV's cvtColor with COLOR_BGR2GRAY, and W, H are the frame width and height.
Purpose: Reject frames that are too dark (sensor failure, occlusion) or overexposed (lighting artifacts).
Implementation:
Typical ranges:

- Deep water, ambient light: 30–80
- Shallow water, sunlit: 80–150
- Artificial lighting: 100–200
- Overexposed: >240
2. Sharpness (Laplacian Variance)¶
\( S = \operatorname{Var}\!\left(\nabla^2 \, \text{gray}\right) \)

The Laplacian operator detects edges and high-frequency content. Low variance of the filtered image indicates blur or lack of detail.
Implementation:
cv::Mat gray, laplacian;
cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
cv::Laplacian(gray, laplacian, CV_64F);
cv::Scalar mu, sigma;
cv::meanStdDev(laplacian, mu, sigma);
double sharpness = sigma[0] * sigma[0]; // variance
Typical ranges:

- Severe blur: <10
- Slight blur: 10–30
- Acceptable: 30–100
- Very sharp: >100
Why Laplacian?
It's fast, rotation-invariant, and well-correlated with perceptual sharpness. Alternatives (FFT-based methods, gradient magnitude) are slower or more complex.
3. Entropy (Shannon Entropy)¶
\( H = -\sum_{k=0}^{255} p_k \log_2 p_k \)

Where \(p_k\) is the probability of pixel value \(k\) in the grayscale histogram.
Purpose: Measure information content. Low entropy → featureless (e.g., uniform blue water). High entropy → rich texture and detail.
Implementation:
cv::Mat gray;
cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
int hist[256] = {0};
for (int r = 0; r < gray.rows; r++) {
const uchar* row = gray.ptr<uchar>(r);
for (int c = 0; c < gray.cols; c++)
hist[row[c]]++;
}
int total = gray.rows * gray.cols;
double entropy = 0.0;
for (int i = 0; i < 256; i++) {
if (hist[i] > 0) {
double p = (double)hist[i] / total;
entropy -= p * std::log2(p);
}
}
Typical ranges:

- Uniform (e.g., solid color): 0–2 bits
- Low texture (blue water): 2–4 bits
- Moderate detail: 4–6 bits
- High texture (particles, organisms): 6–8 bits
Maximum possible: 8 bits (uniform distribution of all 256 gray levels)
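The same computation as the loop above, factored into a standalone helper (hypothetical name) so the 8-bit ceiling is easy to verify:

```cpp
#include <cmath>
#include <cassert>

// Shannon entropy (bits) of a 256-bin grayscale histogram.
double histogram_entropy(const int hist[256], int total) {
    double entropy = 0.0;
    for (int i = 0; i < 256; i++) {
        if (hist[i] > 0) {
            double p = (double)hist[i] / total;
            entropy -= p * std::log2(p);
        }
    }
    return entropy;
}
```

A perfectly uniform histogram (all 256 gray levels equally likely) yields exactly 8 bits; a single-valued image yields 0.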
4. Motion (Frame Differencing)¶
\( M = \frac{1}{WH} \sum_{i,j} \left| \text{gray}_t(i,j) - \text{gray}_{t-1}(i,j) \right| \)

The mean absolute difference between the current frame and the previous frame.
Purpose: Detect camera motion, subject movement, or scene changes. Zero motion → static camera on featureless background.
Implementation:
cv::Mat curr_gray, prev_gray, diff;
cv::cvtColor(curr_frame, curr_gray, cv::COLOR_BGR2GRAY);
cv::cvtColor(prev_frame, prev_gray, cv::COLOR_BGR2GRAY);
cv::absdiff(curr_gray, prev_gray, diff);
double motion = cv::mean(diff)[0];
Typical ranges:

- Static camera, static scene: 0–2
- Slow pan or drift: 2–10
- Moderate motion: 10–30
- Rapid motion or scene change: >30
Note: Motion is not a quality gate by default, but contributes to the interest score (see below).
Quality Gates¶
After computing metrics, frames are rejected if they fail any threshold:
bool passes = (brightness >= min_brightness) &&
(brightness <= max_brightness) &&
(sharpness >= min_sharpness) &&
(entropy >= min_entropy);
No motion gate: Motion can be zero (static scene) without implying poor quality. It's used for ranking, not filtering.
Metric Caching¶
To avoid re-computing metrics on repeated runs (e.g., when tuning --max-frames or --n-bins), we cache results to disk.
Cache key (FNV-1a hash):
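For reference, a standard 64-bit FNV-1a implementation; exactly which fields (path, sample rate, etc.) are concatenated into the key is not spelled out here, so treat the input as an assumption:

```cpp
#include <cstdint>
#include <string>
#include <cassert>

// Standard 64-bit FNV-1a hash over a byte string.
uint64_t fnv1a64(const std::string& data) {
    uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : data) {
        h ^= c;                 // XOR the byte, then multiply
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}
```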
Cache format (JSON):
{
"video_path": "/data/Triton43_Cam1.mp4",
"sample_fps": 1.0,
"records": [
{
"frame_idx": 0,
"frame_ts": "20250904T120000Z",
"brightness": 85.3,
"sharpness": 42.1,
"entropy": 3.8,
"motion": 12.5,
"fps": 29.97
},
...
]
}
Cache directory: ./.metric_cache by default, configurable via --cache-dir.
When to skip cache:
- Video file modified (timestamp or size changed)
- Different --sample-fps
- --no-cache flag passed
Pass 2: Diversity Selection¶
After quality filtering, we have a set of candidate frames. Now we select a diverse subset that spans the full range of visual conditions.
Temporal Filtering (Pre-Step)¶
Before grid binning, we enforce temporal separation to avoid near-duplicate frames from the same clip.
Algorithm (per video):
- Sort candidate frames by frame_idx
- Initialize last_selected = -∞
- For each frame:
    - If (frame_idx - last_selected) / fps >= min_gap:
        - Accept frame
        - Update last_selected = frame_idx
    - Else: reject (too close to previous)
Example (--min-gap 1.0 seconds, 30 fps):
- Frame 0 → accepted
- Frame 15 (0.5s later) → rejected
- Frame 30 (1.0s later) → accepted
- Frame 31 (0.03s after frame 30) → rejected (too close to frame 30)
Rationale: Adjacent frames differ only slightly (camera moves ~cm, lighting unchanged). Enforcing gaps increases diversity and reduces dataset size without losing information.
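The gap rule above can be sketched as a small filter (hypothetical free function; the real sampler may organize this differently):

```cpp
#include <vector>
#include <cassert>

// Keep only frames separated by at least min_gap seconds.
// Input must be sorted by frame index.
std::vector<int> temporal_filter(const std::vector<int>& frame_indices,
                                 double fps, double min_gap) {
    std::vector<int> kept;
    double last_selected = -1e18;  // stands in for -infinity
    for (int idx : frame_indices) {
        if ((idx - last_selected) / fps >= min_gap) {
            kept.push_back(idx);       // accept: gap is large enough
            last_selected = idx;
        }                              // else: reject, too close to previous
    }
    return kept;
}
```

With a 1.0-second gap at 30 fps, frames {0, 15, 30, 31} reduce to {0, 30}, matching the example above.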
Feature Normalization¶
To bin frames into a 3D grid, we need features on a common scale. We use:
- Brightness (B)
- Log-sharpness: \(\ln(1 + S)\) — compresses large dynamic range
- Entropy (E)
Normalization (per-feature):
- Compute 2nd and 98th percentiles across all candidates
- Linearly scale to [0, 1]: \( x' = \mathrm{clamp}\!\left(\frac{x - p_2}{p_{98} - p_2},\ 0,\ 1\right) \)
Why 2nd/98th percentiles? Avoids outliers skewing the scale (e.g., one extremely bright frame).
Implementation:
auto percentile = [](std::vector<float> v, double p) -> float {
    std::sort(v.begin(), v.end());  // sorts the copy; caller's data untouched
    int n = (int)v.size();
    double idx = (p / 100.0) * (n - 1);
    int lo = (int)idx;
    double frac = idx - lo;
    if (lo + 1 >= n)                // guard the upper edge (n == 1, or p == 100)
        return v[n - 1];
    return (float)(v[lo] * (1 - frac) + v[lo + 1] * frac);
};
float p2 = percentile(brightness, 2);
float p98 = percentile(brightness, 98);
for (auto& b : brightness)
b = std::clamp((b - p2) / (p98 - p2), 0.0f, 1.0f);
Grid Binning¶
Divide the normalized 3D space \([0,1]^3\) into a regular grid with n_bins per axis (default: 8³ = 512 cells).
Bin assignment:
auto bin = [&](float v) {
return std::clamp((int)(v * n_bins), 0, n_bins - 1);
};
int cell_id = bin(brightness[i])
+ bin(log_sharpness[i]) * n_bins
+ bin(entropy[i]) * n_bins * n_bins;
Example (n_bins = 8): a frame with (B=0.1, log S=0.5, E=0.9) lands in cell (0, 4, 7) → ID = 0 + 4×8 + 7×64 = 480
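Combining the bin lambda into a single helper (hypothetical name) reproduces the worked example:

```cpp
#include <algorithm>
#include <cassert>

// Flat cell ID for normalized features (b, s, e) in [0, 1].
int cell_id(float b, float s, float e, int n_bins) {
    auto bin = [&](float v) {
        // clamp so v == 1.0 maps to the last bin, not one past it
        return std::clamp((int)(v * n_bins), 0, n_bins - 1);
    };
    return bin(b) + bin(s) * n_bins + bin(e) * n_bins * n_bins;
}
```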
Grid properties:
- Uniform coverage: Each cell represents a distinct region of visual space
- Adaptive density: Dense regions (many frames) → many cells occupied; sparse regions → few cells
- Scale-free: Works for any dataset size (adjusts via n_bins)
Interest Scoring¶
Within each cell, frames are ranked by interest score:

\( \text{interest} = E \times \ln(1 + S) \times (1 + M) \)
Components:
- Entropy (E): Favors textured, detailed scenes
- Log-sharpness: Favors sharp focus (log prevents dominance by extreme values)
- Motion (M): Favors dynamic content (camera movement, subject motion)
Why this formula?
- Multiplicative: All factors must be reasonable (low entropy → low score, regardless of sharpness)
- Log-sharpness: Diminishing returns for extreme sharpness (100 vs 200 matters less than 10 vs 20)
- Motion bonus: \((1 + M)\) gives 1× to 10× boost depending on activity level
Example scores:

- Blue water (E=2.0, S=20, M=1): \(2.0 \times \ln(21) \times 2 = 12.2\)
- Particle cloud (E=4.5, S=80, M=15): \(4.5 \times \ln(81) \times 16 = 316.4\)
- Blurry organism (E=3.0, S=8, M=10): \(3.0 \times \ln(9) \times 11 = 72.5\)
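The score used in the worked examples above can be sketched directly (assuming the multiplicative form E × ln(1+S) × (1+M) implied by the listed components; the function name is illustrative):

```cpp
#include <cmath>
#include <cassert>

// Interest score: entropy × log-sharpness × motion bonus.
double interest_score(double entropy, double sharpness, double motion) {
    return entropy * std::log(1.0 + sharpness) * (1.0 + motion);
}
```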
Per-Cell Sampling¶
For each occupied grid cell:
- Sort frames by interest score (descending)
- Select top max_per_cell frames (default: ceil(max_frames / n_bins³))
Example (--max-frames 5000, --n-bins 8):
- 512 cells available
- max_per_cell = ceil(5000 / 512) = 10
- Each cell contributes ≤10 frames
If a cell has fewer frames: Take all of them.
Budget Enforcement¶
After per-cell sampling, if total exceeds --max-frames:
- Pool all selected frames
- Sort by interest score (descending)
- Trim to max_frames
This preserves the highest-interest frames while respecting the budget.
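The pool–sort–trim step above can be sketched as follows (`Selected` is a hypothetical record type, not the sampler's actual data structure):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
#include <cassert>

struct Selected {
    int frame_idx;
    double score;
};

// Keep the max_frames highest-interest frames from the pooled selection.
void enforce_budget(std::vector<Selected>& pool, std::size_t max_frames) {
    std::sort(pool.begin(), pool.end(),
              [](const Selected& a, const Selected& b) { return a.score > b.score; });
    if (pool.size() > max_frames)
        pool.resize(max_frames);  // drop the lowest-scoring tail
}
```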
Diversity Guarantees¶
What the grid ensures:
- ✅ No blind spots: If a visual condition exists in candidates, it's represented (≥1 frame per occupied cell)
- ✅ Proportional representation: Cells with many high-quality candidates contribute more frames (up to max_per_cell)
- ✅ Bounded redundancy: At most max_per_cell frames from the same feature region
What it doesn't ensure:
- ❌ Semantic diversity: Grid operates on low-level metrics, not object categories (may select multiple frames of same organism if visually distinct)
- ❌ Uniform distribution: Dense regions (common scenes) contribute more total frames than sparse regions
Trade-offs: For stricter uniformity, reduce max_per_cell (may miss high-interest frames). For maximum interest, increase it (may over-sample common scenes).
Pass 3: Frame Extraction¶
Video Re-Reading¶
Selected frames are re-read from source videos and saved to disk:
- Group selections by video
- Sort by frame_idx (minimizes seeking)
- Open video, seek to first frame
- Read sequentially when possible (if frame_idx = prev + 1)
Optimization: Sequential reads are ~10× faster than random seeks.
Filename Format¶
Example:
Components:
- vehicle: Extracted from filename (e.g., "Triton43")
- camera: Extracted from filename (e.g., "Cam1")
- ISO_timestamp: ISO 8601 format (YYYYMMDDTHHMMSSZ) from video metadata or filename
- frame_idx: 0-padded to 7 digits (supports up to 10M frames per video)
- ext: .png (default) or .jpg/.jpeg (via --format)
Why zero-padded? Enables correct lexicographic sorting (e.g., for creating videos from frames).
Output Format¶
PNG (default):

- Lossless compression
- Larger files (~1–5 MB per frame at 1920×1080)
- Preferred for training (no compression artifacts)
JPEG (--format .jpg):
- Lossy compression
- Smaller files (~100–500 KB per frame)
- Acceptable for training if storage limited
Computational Complexity¶
Pass 1: Quality Filtering¶
Per frame:

- Brightness: \(O(WH)\) — single pass over pixels
- Sharpness: \(O(WH)\) — Laplacian convolution
- Entropy: \(O(WH)\) — histogram computation
- Motion: \(O(WH)\) — frame differencing
Total: \(O(N \times WH)\) where \(N\) is number of examined frames
Parallelism: OpenMP across videos (no inter-video dependencies)
Typical performance (1920×1080, 1 fps sampling, 8 threads):

- ~5–10 videos/sec (no cache)
- ~50–100 videos/sec (cache hit)
Pass 2: Diversity Selection¶
- Percentile computation: \(O(N \log N)\) per feature (sorting)
- Grid binning: \(O(N)\) (single pass)
- Per-cell sorting: \(O(N \log N)\) worst-case (if all frames in one cell)
- Budget trim: \(O(N \log N)\) (sorting)
Total: \(O(N \log N)\)
Typical performance: <1 sec for N ≤ 100,000 candidates
Pass 3: Frame Extraction¶
Per frame: Video seek + decode (~10–50 ms depending on codec and distance)
Total: \(O(M)\) where \(M\) is number of selected frames
Typical performance: ~20–50 frames/sec (mixed seeks), ~100+ frames/sec (sequential)
Memory Usage¶
- Metric records: ~50 bytes × N → 5 MB per 100,000 examined frames
- Grid map: ~10 bytes × occupied cells → <10 KB typical
- Video buffers: ~10 MB per open video (OpenCV internal)
Peak memory: <100 MB for typical workloads
Comparison to Alternatives¶
vs. K-means Clustering¶
K-means in feature space (brightness, sharpness, entropy) is a common approach.
Advantages of grid-based:
- ✅ Deterministic: No random initialization, same results every run
- ✅ Faster: No iterative convergence, single-pass binning
- ✅ Interpretable: Cell boundaries correspond to fixed feature ranges
- ✅ Uniform coverage: Doesn't collapse to dense regions (k-means centroids gravitate to high-density areas)
Disadvantages:
- ❌ Axis-aligned: Doesn't adapt to data manifold (k-means clusters can be arbitrarily shaped)
- ❌ Fixed resolution: Must choose n_bins upfront (k-means adapts K to data)
vs. FPS (Farthest Point Sampling)¶
FPS iteratively selects frames maximizing minimum distance to selected set.
Advantages of grid-based:
- ✅ Faster: \(O(N \log N)\) vs \(O(MN)\) for FPS
- ✅ Parallelizable: Grid binning is embarrassingly parallel
- ✅ Controllable: n_bins and max_per_cell give intuitive control over diversity vs coverage
Disadvantages:
- ❌ Looser packing: FPS guarantees minimum separation; grid allows clustering at cell boundaries
vs. Active Learning¶
Active learning selects frames a model is uncertain about.
Advantages of grid-based:
- ✅ No model required: Works for initial dataset creation
- ✅ Faster: No inference loop
- ✅ Broader diversity: Doesn't focus only on decision boundaries
When to use active learning: After initial training, to improve specific weaknesses. When to use grid sampling: Before training, to create initial diverse dataset.
Design Decisions and Rationale¶
Why Log-Sharpness in Normalization?¶
Sharpness has a fat-tailed distribution (many low values, few extreme values). Using raw sharpness would cause:
- Most frames binned to low-sharpness cells
- Extreme-sharpness frames dominate high bins
Log transform compresses the tail, giving more uniform binning across sharpness levels.
Why 2nd/98th Percentile Normalization?¶
Alternatives:
- Min/max: Vulnerable to outliers (one extreme frame ruins scale)
- Mean/std: Assumes Gaussian distribution (metrics are often skewed)
- Median/IQR: Better than mean/std, but 2nd/98th gives wider range while rejecting outliers
Our choice balances robustness and range coverage.
Why Brightness + Log-Sharpness + Entropy (not 4D with Motion)?¶
Reasons for 3D grid:
- Interpretability: 3D is visualizable (e.g., with scatter plots)
- Sparsity: 4D grid (e.g., 8⁴ = 4096 cells) would be mostly empty for typical datasets (<10,000 candidates)
- Motion is temporal: Motion depends on frame order, not intrinsic to a single frame (less suitable for binning)
Motion is used for ranking, not diversity selection.
Why Not Use Color Features?¶
Underwater video color is unreliable:
- Wavelength absorption: Red light disappears at depth, color shifts to blue/green
- Artificial lighting: Strobes and vehicle lights create unnatural color casts
- Camera white balance: Auto-WB creates inconsistent color across clips
Grayscale metrics (brightness, sharpness, entropy) are more stable and physically meaningful in underwater environments.
Validation and Tuning¶
See Tuning Guide for practical advice on parameter selection.
Key validation questions:
- Are quality gates appropriate? → Run calibrate on representative footage
- Is diversity adequate? → Manually inspect selected frames, check for under-represented conditions
- Is temporal separation sufficient? → Check for near-duplicate frames, increase --min-gap if needed
- Is grid resolution appropriate? → Increase n_bins if over-sampling common scenes, decrease if cells are too sparse
Metrics to track:
- Pass rate (quality filtering): 40–60% ideal (too low → overly strict, too high → accepting poor frames)
- Occupied cells: 30–70% of n_bins³ typical (too few → under-utilizing diversity, too many → over-fragmenting)
- Frames per cell: ~5–20 typical (too few → sparse, too many → redundant)
Further Reading¶
- Tuning Guide: How to calibrate parameters for your data
- Quality Metrics: Detailed explanation of brightness, sharpness, entropy, motion
- Diversity Selection: Grid-based sampling deep dive
- Temporal Filtering: Why and how temporal separation works