Skip to content

Diversity Selection

In-depth explanation of the grid-based diversity selection algorithm.

Problem Statement

After quality filtering, we have a set of candidate frames. Our goal is to select a diverse subset that:

  1. Spans the full range of visual conditions (lighting, texture, focus quality)
  2. Avoids redundancy (no over-representation of common scenes)
  3. Respects a frame budget (--max-frames)
  4. Prioritizes high-interest frames (sharp, textured, dynamic)

Naive approaches (random sampling, uniform temporal sampling) fail to balance these goals.


Grid-Based Selection

Overview

We partition the 3D feature space (brightness, log-sharpness, entropy) into a regular grid, then sample frames from each occupied cell.

Key insight: Grid cells act as diversity buckets — frames in the same cell are visually similar, frames in different cells are visually distinct.

Algorithm Steps

  1. Temporal filtering: Enforce --min-gap between frames from same video
  2. Feature normalization: Scale brightness, log-sharpness, entropy to [0, 1]
  3. Grid binning: Assign each frame to a 3D cell based on normalized features
  4. Per-cell ranking: Sort frames within each cell by interest score
  5. Per-cell sampling: Select top max-per-cell frames from each cell
  6. Budget enforcement: If total exceeds max-frames, keep highest-scoring frames globally

Step 1: Temporal Filtering

See Temporal Filtering for full details.

Summary: For each video, enforce minimum time gap (--min-gap seconds) between selected frames.

Effect: Reduces candidates by ~10–40% (depends on gap size and video diversity).

Purpose: Avoid near-duplicate frames from same clip.


Step 2: Feature Normalization

Why Normalize?

Raw metrics have different scales:

  • Brightness: 0–255
  • Sharpness: 0–1000+ (fat-tailed distribution)
  • Entropy: 0–8

Without normalization, sharpness would dominate grid binning (most bins would be empty).

Normalization Method

For each feature (brightness, log-sharpness, entropy):

  1. Compute 2nd and 98th percentiles across all candidates: \(\(p_2, p_{98}\)\)

  2. Linearly scale to [0, 1]: \(\(f_{\text{norm}} = \frac{f - p_2}{p_{98} - p_2}\)\)

  3. Clamp to [0, 1]: \(\(f_{\text{norm}} = \max(0, \min(1, f_{\text{norm}}))\)\)

Why 2nd/98th Percentiles?

Alternatives:

Method Pros Cons
Min/max Full range Vulnerable to outliers
Mean ± 2σ Robust to outliers Assumes Gaussian (metrics are skewed)
Median ± IQR Robust, non-parametric Narrow range (wastes resolution)
2nd/98th percentile Wide range, outlier-resistant Requires sorting (O(N log N))

2nd/98th balances range and robustness: Captures 96% of data while rejecting extreme outliers.

Why Log-Sharpness?

Sharpness has a fat-tailed distribution:

  • Most frames: 10–50
  • Some frames: 100–500
  • Few frames: 1000+

Without log transform: - 90% of frames → bins 0–2 (low sharpness) - 10% of frames → bins 3–7 (high sharpness)

With log transform \(\ln(1 + S)\): - More uniform distribution across bins - Diminishing returns for extreme sharpness (100 vs 200 matters less than 10 vs 20)


Step 3: Grid Binning

Grid Structure

Divide normalized space \([0, 1]^3\) into a regular grid with n-bins per axis.

Total cells: n_bins³

Example (n-bins = 8): - 512 cells total - Cell (0, 0, 0): [0.0–0.125) × [0.0–0.125) × [0.0–0.125) - Cell (7, 7, 7): [0.875–1.0) × [0.875–1.0) × [0.875–1.0)

Cell Assignment

For normalized features \((B_n, S_n, E_n) \in [0, 1]^3\):

\[\text{cell}_B = \lfloor B_n \times \text{n\_bins} \rfloor$$ $$\text{cell}_S = \lfloor S_n \times \text{n\_bins} \rfloor$$ $$\text{cell}_E = \lfloor E_n \times \text{n\_bins} \rfloor\]

Linear index (for hash map): \(\(\text{cell\_id} = \text{cell}_B + \text{cell}_S \times \text{n\_bins} + \text{cell}_E \times \text{n\_bins}^2\)\)

Example: - Frame with \((B_n=0.1, S_n=0.5, E_n=0.9)\) and n-bins=8: - \(\text{cell}_B = \lfloor 0.1 \times 8 \rfloor = 0\) - \(\text{cell}_S = \lfloor 0.5 \times 8 \rfloor = 4\) - \(\text{cell}_E = \lfloor 0.9 \times 8 \rfloor = 7\) - \(\text{cell\_id} = 0 + 4 \times 8 + 7 \times 64 = 480\)

Implementation

auto bin = [&](float v) {
    return std::clamp((int)(v * n_bins), 0, n_bins - 1);
};

std::unordered_map<int, std::vector<size_t>> cell_map;
for (size_t i = 0; i < candidates.size(); i++) {
    int key = bin(brightness[i])
            + bin(log_sharpness[i]) * n_bins
            + bin(entropy[i]) * n_bins * n_bins;
    cell_map[key].push_back(i);
}

Result: Each cell contains indices of frames with similar features.


Step 4: Per-Cell Ranking

Within each occupied cell, frames are sorted by interest score:

\[I = E \times \ln(1 + S) \times (1 + M)\]

Why this formula?

  • Entropy (E): Favors textured, detailed frames
  • Log-sharpness: Favors sharp focus (log prevents extreme values from dominating)
  • Motion (M): Favors dynamic content (camera movement, subject motion)
  • Multiplicative: All factors must be reasonable (low entropy → low score regardless of sharpness)

Example scores:

Frame Type E S M Score
Blue water 2.0 20 1 2.0 × ln(21) × 2 = 12.2
Blurry organism 3.0 8 10 3.0 × ln(9) × 11 = 72.6
Particle cloud 4.5 80 15 4.5 × ln(81) × 16 = 316.8
Sharp static organism 4.0 100 2 4.0 × ln(101) × 3 = 55.3

Effect: Frames with rich texture, sharp focus, and dynamic content score highest.


Step 5: Per-Cell Sampling

For each occupied cell:

  1. Sort frames by interest score (descending)
  2. Select top max-per-cell frames
  3. Discard the rest

Default max-per-cell: \(\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil\)\)

Rationale: Ensures cells can collectively reach frame budget.

Example (max-frames = 5000, n-bins = 8): \(\(\text{max\_per\_cell} = \lceil \frac{5000}{512} \rceil = 10\)\)

Each cell contributes ≤10 frames, up to 5120 total (then trimmed to 5000).

Cell Occupancy

Not all cells are occupied (sparse data). Typical occupancy: 30–70% of cells.

Example (1000 candidates, 512 cells, n-bins=8): - ~400 occupied cells (78%) - ~0–5 frames per cell on average - Some dense cells (common scenes) have 20+ frames - Some sparse cells (rare conditions) have 1–2 frames

Impact: - Dense cells: Contribute max-per-cell frames (e.g., 10) - Sparse cells: Contribute all frames (e.g., 2)

Result: Common scenes contribute more total frames, but each cell is capped (prevents complete dominance).


Step 6: Budget Enforcement

After per-cell sampling, if total exceeds max-frames:

  1. Pool all selected frames (from all cells)
  2. Sort by interest score (descending)
  3. Trim to max-frames

Effect: Keeps the highest-interest frames while respecting budget.

Example: - 400 occupied cells × 10 frames/cell = 4000 selected - Budget = 5000 → no trim needed

Example 2: - 450 occupied cells × 10 frames/cell = 4500 selected - Budget = 3000 → trim to top 3000 by interest score


Diversity Guarantees

What the Grid Ensures

Coverage: Every occupied cell contributes ≥1 frame (up to max-per-cell)
Bounded redundancy: At most max-per-cell frames from same feature region
Proportional representation: Dense cells (common scenes) contribute more frames than sparse cells (rare scenes), but capped
Deterministic: Same candidates → same output (no random seed)

What the Grid Doesn't Ensure

Uniform distribution: Dense regions contribute more frames than sparse regions
Semantic diversity: Grid operates on low-level metrics, not object categories (may select multiple frames of same organism if visually distinct)
Spatial diversity: No geographic or temporal spread constraints


Parameter Tuning

n-bins (Grid Resolution)

Trade-off:

  • Higher (e.g., 10–12): Finer distinctions, more diversity, sparser cells
  • Lower (e.g., 5–7): Coarser grouping, less diversity, denser cells

Rule of thumb: \(\(\text{n\_bins} \approx \sqrt[3]{\frac{\text{num\_candidates}}{20}}\)\)

Target: ~10–50 candidates per cell on average.

Example: - 1,000 candidates → n-bins = 5 (125 cells, ~8 per cell) - 10,000 candidates → n-bins = 8 (512 cells, ~20 per cell) - 100,000 candidates → n-bins = 12 (1,728 cells, ~58 per cell)

Diagnostics: After running sample, check output:

After grid filter (8³ cells, ≤10/cell):  4237  (387 occupied cells)

  • Occupied < 20%: Grid too sparse → lower n-bins
  • Occupied > 80%: Grid too dense → raise n-bins

max-per-cell (Cell Budget)

Trade-off:

  • Lower (e.g., 1–5): Strong uniform diversity, may miss high-interest regions
  • Higher (e.g., 20–50): Interest-driven, may over-sample common scenes

Default (balanced): \(\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil\)\)

When to override:

Force uniform distribution (each cell equally weighted):

--max-per-cell 1  # Each cell contributes ≤1 frame

Maximize high-interest frames (allow dense cells to dominate):

--max-per-cell 50  # Each cell contributes ≤50 frames

Example (5000 frame budget, 8³ = 512 cells):

max-per-cell Effect
1 Up to 512 frames (uniform, may under-shoot budget)
10 (default) Up to 5120 frames (balanced)
50 Up to 25,600 frames (interest-driven, trimmed to 5000)

Comparison to Alternatives

vs. K-means Clustering

K-means partitions feature space into K clusters, selecting representative frames.

Aspect Grid-Based K-means
Determinism ✅ Deterministic ❌ Random initialization
Speed ✅ O(N log N) (sorting) ❌ O(KNT) (T iterations)
Coverage ✅ Guarantees all regions ❌ Centroids gravitate to dense regions
Interpretability ✅ Cell boundaries are fixed ❌ Cluster shapes are data-dependent
Adaptability ❌ Axis-aligned (doesn't follow data manifold) ✅ Arbitrary cluster shapes

When to prefer grid-based: Fast, reproducible, uniform coverage
When to prefer k-means: Data lies on a low-dimensional manifold (e.g., elongated clusters)

vs. Farthest Point Sampling (FPS)

FPS iteratively selects frames maximizing minimum distance to selected set.

Aspect Grid-Based FPS
Speed ✅ O(N log N) ❌ O(MN) (M = selected count)
Coverage ✅ Guarantees all regions ✅ Guarantees minimum separation
Quality ✅ Interest-weighted ❌ Distance-only (no quality ranking)
Packing ❌ Allows clustering at cell boundaries ✅ Tighter uniform packing

When to prefer grid-based: Large datasets (>10K candidates), quality-weighted selection
When to prefer FPS: Strict uniform spacing required, small datasets

vs. Random Sampling

Random sampling selects frames uniformly at random.

Aspect Grid-Based Random
Diversity ✅ Guaranteed coverage ❌ Clustering due to randomness
Quality ✅ Interest-weighted ❌ Equal weight to all frames
Reproducibility ✅ Deterministic ❌ Requires seed for reproducibility

When to prefer grid-based: Always (unless you need unbiased sampling for statistics)


Visualization

3D Grid Occupancy (Conceptual)

Imagine a 3D cube \([0,1]^3\) divided into 8³ = 512 smaller cubes:

              Entropy
                 |  □  □  ■  ■
                 | □  ■  ■  □
                 |■  ■  □  □
                 |─────────→ Log-Sharpness
                /
               /
              / Brightness
  • ■ (filled): Occupied cell (has candidates)
  • □ (empty): Unoccupied cell (no candidates)

Typical pattern: - Dense region (blue water): Many occupied cells, many frames per cell - Sparse region (organisms): Few occupied cells, few frames per cell

Grid selection: - Dense region: Contributes max-per-cell × num_cells (e.g., 10 × 200 = 2000 frames) - Sparse region: Contributes all frames (e.g., 1–5 per cell × 50 cells = 100 frames)

Total: 2100 frames (blue water dominates, but organisms are represented)

Occupancy vs. n-bins

n-bins Total Cells Candidates (10K) Occupancy Avg per Cell
5 125 10,000 ~90% 80
8 512 10,000 ~60% 20
10 1,000 10,000 ~40% 10
12 1,728 10,000 ~25% 6

Sweet spot: 30–70% occupancy (balance between coverage and cell density).


Advanced: Custom Diversity Metrics

To use different features for diversity selection:

  1. Replace normalization in grid_diversity() (sample.cpp):

    // Example: Use contrast instead of entropy
    for (size_t i = 0; i < n; i++) {
        bright[i] = ...; sharp[i] = ...; contrast[i] = ...;
    }
    

  2. Compute custom features (add to FrameRecord and sample_video_metrics()):

    double compute_contrast(const cv::Mat& frame) {
        cv::Scalar mu, sigma;
        cv::meanStdDev(frame, mu, sigma);
        return sigma[0];  // Standard deviation
    }
    

  3. Rebuild and test

See API Reference for complete example.


Summary

Grid-based diversity selection:

  1. Fast: O(N log N) (sorting for percentiles and per-cell ranking)
  2. Deterministic: Same candidates → same output
  3. Balanced: Covers all feature regions while prioritizing high-interest frames
  4. Tunable: n-bins and max-per-cell control diversity vs. interest trade-off

Key parameters: - --n-bins: Grid resolution (5–12 typical) - --max-per-cell: Cell budget (default: auto-computed)

When to adjust: - Over-sampling common scenes → lower max-per-cell, raise n-bins - Missing rare events → lower max-per-cell, raise n-bins - Under-utilizing budget → lower n-bins, raise max-per-cell


Next Steps