Diversity Selection¶

In-depth explanation of the grid-based diversity selection algorithm.

Problem Statement¶

After quality filtering, we have a set of candidate frames. Our goal is to select a diverse subset that:

Spans the full range of visual conditions (lighting, texture, focus quality)
Avoids redundancy (no over-representation of common scenes)
Respects a frame budget (--max-frames)
Prioritizes high-interest frames (sharp, textured, dynamic)

Naive approaches (random sampling, uniform temporal sampling) fail to balance these goals.

Grid-Based Selection¶

Overview¶

We partition the 3D feature space (brightness, log-sharpness, entropy) into a regular grid, then sample frames from each occupied cell.

Key insight: Grid cells act as diversity buckets — frames in the same cell are visually similar, frames in different cells are visually distinct.

Algorithm Steps¶

Temporal filtering: Enforce --min-gap between frames from same video
Feature normalization: Scale brightness, log-sharpness, entropy to [0, 1]
Grid binning: Assign each frame to a 3D cell based on normalized features
Per-cell ranking: Sort frames within each cell by interest score
Per-cell sampling: Select top max-per-cell frames from each cell
Budget enforcement: If total exceeds max-frames, keep highest-scoring frames globally

Step 1: Temporal Filtering¶

See Temporal Filtering for full details.

Summary: For each video, enforce minimum time gap (--min-gap seconds) between selected frames.

Effect: Reduces candidates by ~10–40% (depends on gap size and video diversity).

Purpose: Avoid near-duplicate frames from same clip.

Step 2: Feature Normalization¶

Why Normalize?¶

Raw metrics have different scales:

Brightness: 0–255
Sharpness: 0–1000+ (fat-tailed distribution)
Entropy: 0–8

Without normalization, sharpness would dominate grid binning (most bins would be empty).

Normalization Method¶

For each feature (brightness, log-sharpness, entropy):

Compute 2nd and 98th percentiles across all candidates: $\(p_2, p_{98}$\)
Linearly scale to [0, 1]: $\(f_{\text{norm}} = \frac{f - p_2}{p_{98} - p_2}$\)
Clamp to [0, 1]: $\(f_{\text{norm}} = \max(0, \min(1, f_{\text{norm}}))$\)

Why 2nd/98th Percentiles?¶

Alternatives:

Method	Pros	Cons
Min/max	Full range	Vulnerable to outliers
Mean ± 2σ	Robust to outliers	Assumes Gaussian (metrics are skewed)
Median ± IQR	Robust, non-parametric	Narrow range (wastes resolution)
2nd/98th percentile	Wide range, outlier-resistant	Requires sorting (O(N log N))

2nd/98th balances range and robustness: Captures 96% of data while rejecting extreme outliers.

Why Log-Sharpness?¶

Sharpness has a fat-tailed distribution:

Most frames: 10–50
Some frames: 100–500
Few frames: 1000+

Without log transform: - 90% of frames → bins 0–2 (low sharpness) - 10% of frames → bins 3–7 (high sharpness)

With log transform $\ln(1 + S)$: - More uniform distribution across bins - Diminishing returns for extreme sharpness (100 vs 200 matters less than 10 vs 20)

Step 3: Grid Binning¶

Grid Structure¶

Divide normalized space $[0, 1]^3$ into a regular grid with n-bins per axis.

Total cells: n_bins³

Example (n-bins = 8): - 512 cells total - Cell (0, 0, 0): [0.0–0.125) × [0.0–0.125) × [0.0–0.125) - Cell (7, 7, 7): [0.875–1.0) × [0.875–1.0) × [0.875–1.0)

Cell Assignment¶

For normalized features $(B_n, S_n, E_n) \in [0, 1]^3$:

\[\text{cell}_B = \lfloor B_n \times \text{n\_bins} \rfloor$$ $$\text{cell}_S = \lfloor S_n \times \text{n\_bins} \rfloor$$ $$\text{cell}_E = \lfloor E_n \times \text{n\_bins} \rfloor\]

Linear index (for hash map): $\(\text{cell\_id} = \text{cell}_B + \text{cell}_S \times \text{n\_bins} + \text{cell}_E \times \text{n\_bins}^2$\)

Example: - Frame with $(B_n=0.1, S_n=0.5, E_n=0.9)$ and n-bins=8: - $\text{cell}_B = \lfloor 0.1 \times 8 \rfloor = 0$ - $\text{cell}_S = \lfloor 0.5 \times 8 \rfloor = 4$ - $\text{cell}_E = \lfloor 0.9 \times 8 \rfloor = 7$ - $\text{cell\_id} = 0 + 4 \times 8 + 7 \times 64 = 480$

Implementation¶

auto bin = [&](float v) {
    return std::clamp((int)(v * n_bins), 0, n_bins - 1);
};

std::unordered_map<int, std::vector<size_t>> cell_map;
for (size_t i = 0; i < candidates.size(); i++) {
    int key = bin(brightness[i])
            + bin(log_sharpness[i]) * n_bins
            + bin(entropy[i]) * n_bins * n_bins;
    cell_map[key].push_back(i);
}

Result: Each cell contains indices of frames with similar features.

Step 4: Per-Cell Ranking¶

Within each occupied cell, frames are sorted by interest score:

\[I = E \times \ln(1 + S) \times (1 + M)\]

Why this formula?

Entropy (E): Favors textured, detailed frames
Log-sharpness: Favors sharp focus (log prevents extreme values from dominating)
Motion (M): Favors dynamic content (camera movement, subject motion)
Multiplicative: All factors must be reasonable (low entropy → low score regardless of sharpness)

Example scores:

Frame Type	E	S	M	Score
Blue water	2.0	20	1	2.0 × ln(21) × 2 = 12.2
Blurry organism	3.0	8	10	3.0 × ln(9) × 11 = 72.6
Particle cloud	4.5	80	15	4.5 × ln(81) × 16 = 316.8
Sharp static organism	4.0	100	2	4.0 × ln(101) × 3 = 55.3

Effect: Frames with rich texture, sharp focus, and dynamic content score highest.

Step 5: Per-Cell Sampling¶

For each occupied cell:

Sort frames by interest score (descending)
Select top max-per-cell frames
Discard the rest

Default max-per-cell: $\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil$\)

Rationale: Ensures cells can collectively reach frame budget.

Example (max-frames = 5000, n-bins = 8): $\(\text{max\_per\_cell} = \lceil \frac{5000}{512} \rceil = 10$\)

Each cell contributes ≤10 frames, up to 5120 total (then trimmed to 5000).

Cell Occupancy¶

Not all cells are occupied (sparse data). Typical occupancy: 30–70% of cells.

Example (1000 candidates, 512 cells, n-bins=8): - ~400 occupied cells (78%) - ~0–5 frames per cell on average - Some dense cells (common scenes) have 20+ frames - Some sparse cells (rare conditions) have 1–2 frames

Impact: - Dense cells: Contribute max-per-cell frames (e.g., 10) - Sparse cells: Contribute all frames (e.g., 2)

Result: Common scenes contribute more total frames, but each cell is capped (prevents complete dominance).

Step 6: Budget Enforcement¶

After per-cell sampling, if total exceeds max-frames:

Pool all selected frames (from all cells)
Sort by interest score (descending)
Trim to max-frames

Effect: Keeps the highest-interest frames while respecting budget.

Example: - 400 occupied cells × 10 frames/cell = 4000 selected - Budget = 5000 → no trim needed

Example 2: - 450 occupied cells × 10 frames/cell = 4500 selected - Budget = 3000 → trim to top 3000 by interest score

Diversity Guarantees¶

What the Grid Ensures¶

✅ Coverage: Every occupied cell contributes ≥1 frame (up to max-per-cell)
✅ Bounded redundancy: At most max-per-cell frames from same feature region
✅ Proportional representation: Dense cells (common scenes) contribute more frames than sparse cells (rare scenes), but capped
✅ Deterministic: Same candidates → same output (no random seed)

What the Grid Doesn't Ensure¶

❌ Uniform distribution: Dense regions contribute more frames than sparse regions
❌ Semantic diversity: Grid operates on low-level metrics, not object categories (may select multiple frames of same organism if visually distinct)
❌ Spatial diversity: No geographic or temporal spread constraints

Parameter Tuning¶

`n-bins` (Grid Resolution)¶

Trade-off:

Higher (e.g., 10–12): Finer distinctions, more diversity, sparser cells
Lower (e.g., 5–7): Coarser grouping, less diversity, denser cells

Rule of thumb: $\(\text{n\_bins} \approx \sqrt[3]{\frac{\text{num\_candidates}}{20}}$\)

Target: ~10–50 candidates per cell on average.

Example: - 1,000 candidates → n-bins = 5 (125 cells, ~8 per cell) - 10,000 candidates → n-bins = 8 (512 cells, ~20 per cell) - 100,000 candidates → n-bins = 12 (1,728 cells, ~58 per cell)

Diagnostics: After running sample, check output:

After grid filter (8³ cells, ≤10/cell):  4237  (387 occupied cells)

Occupied < 20%: Grid too sparse → lower n-bins
Occupied > 80%: Grid too dense → raise n-bins

`max-per-cell` (Cell Budget)¶

Trade-off:

Lower (e.g., 1–5): Strong uniform diversity, may miss high-interest regions
Higher (e.g., 20–50): Interest-driven, may over-sample common scenes

Default (balanced): $\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil$\)

When to override:

Force uniform distribution (each cell equally weighted):

--max-per-cell 1  # Each cell contributes ≤1 frame

Maximize high-interest frames (allow dense cells to dominate):

--max-per-cell 50  # Each cell contributes ≤50 frames

Example (5000 frame budget, 8³ = 512 cells):

max-per-cell	Effect
1	Up to 512 frames (uniform, may under-shoot budget)
10 (default)	Up to 5120 frames (balanced)
50	Up to 25,600 frames (interest-driven, trimmed to 5000)

Comparison to Alternatives¶

vs. K-means Clustering¶

K-means partitions feature space into K clusters, selecting representative frames.

Aspect	Grid-Based	K-means
Determinism	✅ Deterministic	❌ Random initialization
Speed	✅ O(N log N) (sorting)	❌ O(KNT) (T iterations)
Coverage	✅ Guarantees all regions	❌ Centroids gravitate to dense regions
Interpretability	✅ Cell boundaries are fixed	❌ Cluster shapes are data-dependent
Adaptability	❌ Axis-aligned (doesn't follow data manifold)	✅ Arbitrary cluster shapes

When to prefer grid-based: Fast, reproducible, uniform coverage
When to prefer k-means: Data lies on a low-dimensional manifold (e.g., elongated clusters)

vs. Farthest Point Sampling (FPS)¶

FPS iteratively selects frames maximizing minimum distance to selected set.

Aspect	Grid-Based	FPS
Speed	✅ O(N log N)	❌ O(MN) (M = selected count)
Coverage	✅ Guarantees all regions	✅ Guarantees minimum separation
Quality	✅ Interest-weighted	❌ Distance-only (no quality ranking)
Packing	❌ Allows clustering at cell boundaries	✅ Tighter uniform packing

When to prefer grid-based: Large datasets (>10K candidates), quality-weighted selection
When to prefer FPS: Strict uniform spacing required, small datasets

vs. Random Sampling¶

Random sampling selects frames uniformly at random.

Aspect	Grid-Based	Random
Diversity	✅ Guaranteed coverage	❌ Clustering due to randomness
Quality	✅ Interest-weighted	❌ Equal weight to all frames
Reproducibility	✅ Deterministic	❌ Requires seed for reproducibility

When to prefer grid-based: Always (unless you need unbiased sampling for statistics)

Visualization¶

3D Grid Occupancy (Conceptual)¶

Imagine a 3D cube $[0,1]^3$ divided into 8³ = 512 smaller cubes:

              Entropy
                 ↑
                 |  □  □  ■  ■
                 | □  ■  ■  □
                 |■  ■  □  □
                 |─────────→ Log-Sharpness
                /
               /
              / Brightness

■ (filled): Occupied cell (has candidates)
□ (empty): Unoccupied cell (no candidates)

Typical pattern: - Dense region (blue water): Many occupied cells, many frames per cell - Sparse region (organisms): Few occupied cells, few frames per cell

Grid selection: - Dense region: Contributes max-per-cell × num_cells (e.g., 10 × 200 = 2000 frames) - Sparse region: Contributes all frames (e.g., 1–5 per cell × 50 cells = 100 frames)

Total: 2100 frames (blue water dominates, but organisms are represented)

Occupancy vs. n-bins¶

n-bins	Total Cells	Candidates (10K)	Occupancy	Avg per Cell
5	125	10,000	~90%	80
8	512	10,000	~60%	20
10	1,000	10,000	~40%	10
12	1,728	10,000	~25%	6

Sweet spot: 30–70% occupancy (balance between coverage and cell density).

Advanced: Custom Diversity Metrics¶

To use different features for diversity selection:

Replace normalization in grid_diversity() (sample.cpp):

// Example: Use contrast instead of entropy
for (size_t i = 0; i < n; i++) {
    bright[i] = ...; sharp[i] = ...; contrast[i] = ...;
}

Compute custom features (add to FrameRecord and sample_video_metrics()):

double compute_contrast(const cv::Mat& frame) {
    cv::Scalar mu, sigma;
    cv::meanStdDev(frame, mu, sigma);
    return sigma[0];  // Standard deviation
}

Rebuild and test

See API Reference for complete example.

Summary¶

Grid-based diversity selection:

✅ Fast: O(N log N) (sorting for percentiles and per-cell ranking)
✅ Deterministic: Same candidates → same output
✅ Balanced: Covers all feature regions while prioritizing high-interest frames
✅ Tunable: n-bins and max-per-cell control diversity vs. interest trade-off

Key parameters: - --n-bins: Grid resolution (5–12 typical) - --max-per-cell: Cell budget (default: auto-computed)

When to adjust: - Over-sampling common scenes → lower max-per-cell, raise n-bins - Missing rare events → lower max-per-cell, raise n-bins - Under-utilizing budget → lower n-bins, raise max-per-cell

Next Steps¶

Temporal Filtering: How time gaps are enforced
Quality Metrics: What features are used for grid binning
Tuning Guide: Practical parameter selection