Diversity Selection¶
In-depth explanation of the grid-based diversity selection algorithm.
Problem Statement¶
After quality filtering, we have a set of candidate frames. Our goal is to select a diverse subset that:
- Spans the full range of visual conditions (lighting, texture, focus quality)
- Avoids redundancy (no over-representation of common scenes)
- Respects a frame budget (
--max-frames) - Prioritizes high-interest frames (sharp, textured, dynamic)
Naive approaches (random sampling, uniform temporal sampling) fail to balance these goals.
Grid-Based Selection¶
Overview¶
We partition the 3D feature space (brightness, log-sharpness, entropy) into a regular grid, then sample frames from each occupied cell.
Key insight: Grid cells act as diversity buckets — frames in the same cell are visually similar, frames in different cells are visually distinct.
Algorithm Steps¶
- Temporal filtering: Enforce
--min-gapbetween frames from same video - Feature normalization: Scale brightness, log-sharpness, entropy to [0, 1]
- Grid binning: Assign each frame to a 3D cell based on normalized features
- Per-cell ranking: Sort frames within each cell by interest score
- Per-cell sampling: Select top
max-per-cellframes from each cell - Budget enforcement: If total exceeds
max-frames, keep highest-scoring frames globally
Step 1: Temporal Filtering¶
See Temporal Filtering for full details.
Summary: For each video, enforce minimum time gap (--min-gap seconds) between selected frames.
Effect: Reduces candidates by ~10–40% (depends on gap size and video diversity).
Purpose: Avoid near-duplicate frames from same clip.
Step 2: Feature Normalization¶
Why Normalize?¶
Raw metrics have different scales:
- Brightness: 0–255
- Sharpness: 0–1000+ (fat-tailed distribution)
- Entropy: 0–8
Without normalization, sharpness would dominate grid binning (most bins would be empty).
Normalization Method¶
For each feature (brightness, log-sharpness, entropy):
-
Compute 2nd and 98th percentiles across all candidates: \(\(p_2, p_{98}\)\)
-
Linearly scale to [0, 1]: \(\(f_{\text{norm}} = \frac{f - p_2}{p_{98} - p_2}\)\)
-
Clamp to [0, 1]: \(\(f_{\text{norm}} = \max(0, \min(1, f_{\text{norm}}))\)\)
Why 2nd/98th Percentiles?¶
Alternatives:
| Method | Pros | Cons |
|---|---|---|
| Min/max | Full range | Vulnerable to outliers |
| Mean ± 2σ | Robust to outliers | Assumes Gaussian (metrics are skewed) |
| Median ± IQR | Robust, non-parametric | Narrow range (wastes resolution) |
| 2nd/98th percentile | Wide range, outlier-resistant | Requires sorting (O(N log N)) |
2nd/98th balances range and robustness: Captures 96% of data while rejecting extreme outliers.
Why Log-Sharpness?¶
Sharpness has a fat-tailed distribution:
- Most frames: 10–50
- Some frames: 100–500
- Few frames: 1000+
Without log transform: - 90% of frames → bins 0–2 (low sharpness) - 10% of frames → bins 3–7 (high sharpness)
With log transform \(\ln(1 + S)\): - More uniform distribution across bins - Diminishing returns for extreme sharpness (100 vs 200 matters less than 10 vs 20)
Step 3: Grid Binning¶
Grid Structure¶
Divide normalized space \([0, 1]^3\) into a regular grid with n-bins per axis.
Total cells: n_bins³
Example (n-bins = 8):
- 512 cells total
- Cell (0, 0, 0): [0.0–0.125) × [0.0–0.125) × [0.0–0.125)
- Cell (7, 7, 7): [0.875–1.0) × [0.875–1.0) × [0.875–1.0)
Cell Assignment¶
For normalized features \((B_n, S_n, E_n) \in [0, 1]^3\):
Linear index (for hash map): \(\(\text{cell\_id} = \text{cell}_B + \text{cell}_S \times \text{n\_bins} + \text{cell}_E \times \text{n\_bins}^2\)\)
Example:
- Frame with \((B_n=0.1, S_n=0.5, E_n=0.9)\) and n-bins=8:
- \(\text{cell}_B = \lfloor 0.1 \times 8 \rfloor = 0\)
- \(\text{cell}_S = \lfloor 0.5 \times 8 \rfloor = 4\)
- \(\text{cell}_E = \lfloor 0.9 \times 8 \rfloor = 7\)
- \(\text{cell\_id} = 0 + 4 \times 8 + 7 \times 64 = 480\)
Implementation¶
auto bin = [&](float v) {
return std::clamp((int)(v * n_bins), 0, n_bins - 1);
};
std::unordered_map<int, std::vector<size_t>> cell_map;
for (size_t i = 0; i < candidates.size(); i++) {
int key = bin(brightness[i])
+ bin(log_sharpness[i]) * n_bins
+ bin(entropy[i]) * n_bins * n_bins;
cell_map[key].push_back(i);
}
Result: Each cell contains indices of frames with similar features.
Step 4: Per-Cell Ranking¶
Within each occupied cell, frames are sorted by interest score:
Why this formula?
- Entropy (E): Favors textured, detailed frames
- Log-sharpness: Favors sharp focus (log prevents extreme values from dominating)
- Motion (M): Favors dynamic content (camera movement, subject motion)
- Multiplicative: All factors must be reasonable (low entropy → low score regardless of sharpness)
Example scores:
| Frame Type | E | S | M | Score |
|---|---|---|---|---|
| Blue water | 2.0 | 20 | 1 | 2.0 × ln(21) × 2 = 12.2 |
| Blurry organism | 3.0 | 8 | 10 | 3.0 × ln(9) × 11 = 72.6 |
| Particle cloud | 4.5 | 80 | 15 | 4.5 × ln(81) × 16 = 316.8 |
| Sharp static organism | 4.0 | 100 | 2 | 4.0 × ln(101) × 3 = 55.3 |
Effect: Frames with rich texture, sharp focus, and dynamic content score highest.
Step 5: Per-Cell Sampling¶
For each occupied cell:
- Sort frames by interest score (descending)
- Select top
max-per-cellframes - Discard the rest
Default max-per-cell:
\(\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil\)\)
Rationale: Ensures cells can collectively reach frame budget.
Example (max-frames = 5000, n-bins = 8):
\(\(\text{max\_per\_cell} = \lceil \frac{5000}{512} \rceil = 10\)\)
Each cell contributes ≤10 frames, up to 5120 total (then trimmed to 5000).
Cell Occupancy¶
Not all cells are occupied (sparse data). Typical occupancy: 30–70% of cells.
Example (1000 candidates, 512 cells, n-bins=8): - ~400 occupied cells (78%) - ~0–5 frames per cell on average - Some dense cells (common scenes) have 20+ frames - Some sparse cells (rare conditions) have 1–2 frames
Impact:
- Dense cells: Contribute max-per-cell frames (e.g., 10)
- Sparse cells: Contribute all frames (e.g., 2)
Result: Common scenes contribute more total frames, but each cell is capped (prevents complete dominance).
Step 6: Budget Enforcement¶
After per-cell sampling, if total exceeds max-frames:
- Pool all selected frames (from all cells)
- Sort by interest score (descending)
- Trim to
max-frames
Effect: Keeps the highest-interest frames while respecting budget.
Example: - 400 occupied cells × 10 frames/cell = 4000 selected - Budget = 5000 → no trim needed
Example 2: - 450 occupied cells × 10 frames/cell = 4500 selected - Budget = 3000 → trim to top 3000 by interest score
Diversity Guarantees¶
What the Grid Ensures¶
✅ Coverage: Every occupied cell contributes ≥1 frame (up to max-per-cell)
✅ Bounded redundancy: At most max-per-cell frames from same feature region
✅ Proportional representation: Dense cells (common scenes) contribute more frames than sparse cells (rare scenes), but capped
✅ Deterministic: Same candidates → same output (no random seed)
What the Grid Doesn't Ensure¶
❌ Uniform distribution: Dense regions contribute more frames than sparse regions
❌ Semantic diversity: Grid operates on low-level metrics, not object categories (may select multiple frames of same organism if visually distinct)
❌ Spatial diversity: No geographic or temporal spread constraints
Parameter Tuning¶
n-bins (Grid Resolution)¶
Trade-off:
- Higher (e.g., 10–12): Finer distinctions, more diversity, sparser cells
- Lower (e.g., 5–7): Coarser grouping, less diversity, denser cells
Rule of thumb: \(\(\text{n\_bins} \approx \sqrt[3]{\frac{\text{num\_candidates}}{20}}\)\)
Target: ~10–50 candidates per cell on average.
Example:
- 1,000 candidates → n-bins = 5 (125 cells, ~8 per cell)
- 10,000 candidates → n-bins = 8 (512 cells, ~20 per cell)
- 100,000 candidates → n-bins = 12 (1,728 cells, ~58 per cell)
Diagnostics: After running sample, check output:
- Occupied < 20%: Grid too sparse → lower
n-bins - Occupied > 80%: Grid too dense → raise
n-bins
max-per-cell (Cell Budget)¶
Trade-off:
- Lower (e.g., 1–5): Strong uniform diversity, may miss high-interest regions
- Higher (e.g., 20–50): Interest-driven, may over-sample common scenes
Default (balanced): \(\(\text{max\_per\_cell} = \lceil \frac{\text{max\_frames}}{\text{n\_bins}^3} \rceil\)\)
When to override:
Force uniform distribution (each cell equally weighted):
Maximize high-interest frames (allow dense cells to dominate):
Example (5000 frame budget, 8³ = 512 cells):
| max-per-cell | Effect |
|---|---|
| 1 | Up to 512 frames (uniform, may under-shoot budget) |
| 10 (default) | Up to 5120 frames (balanced) |
| 50 | Up to 25,600 frames (interest-driven, trimmed to 5000) |
Comparison to Alternatives¶
vs. K-means Clustering¶
K-means partitions feature space into K clusters, selecting representative frames.
| Aspect | Grid-Based | K-means |
|---|---|---|
| Determinism | ✅ Deterministic | ❌ Random initialization |
| Speed | ✅ O(N log N) (sorting) | ❌ O(KNT) (T iterations) |
| Coverage | ✅ Guarantees all regions | ❌ Centroids gravitate to dense regions |
| Interpretability | ✅ Cell boundaries are fixed | ❌ Cluster shapes are data-dependent |
| Adaptability | ❌ Axis-aligned (doesn't follow data manifold) | ✅ Arbitrary cluster shapes |
When to prefer grid-based: Fast, reproducible, uniform coverage
When to prefer k-means: Data lies on a low-dimensional manifold (e.g., elongated clusters)
vs. Farthest Point Sampling (FPS)¶
FPS iteratively selects frames maximizing minimum distance to selected set.
| Aspect | Grid-Based | FPS |
|---|---|---|
| Speed | ✅ O(N log N) | ❌ O(MN) (M = selected count) |
| Coverage | ✅ Guarantees all regions | ✅ Guarantees minimum separation |
| Quality | ✅ Interest-weighted | ❌ Distance-only (no quality ranking) |
| Packing | ❌ Allows clustering at cell boundaries | ✅ Tighter uniform packing |
When to prefer grid-based: Large datasets (>10K candidates), quality-weighted selection
When to prefer FPS: Strict uniform spacing required, small datasets
vs. Random Sampling¶
Random sampling selects frames uniformly at random.
| Aspect | Grid-Based | Random |
|---|---|---|
| Diversity | ✅ Guaranteed coverage | ❌ Clustering due to randomness |
| Quality | ✅ Interest-weighted | ❌ Equal weight to all frames |
| Reproducibility | ✅ Deterministic | ❌ Requires seed for reproducibility |
When to prefer grid-based: Always (unless you need unbiased sampling for statistics)
Visualization¶
3D Grid Occupancy (Conceptual)¶
Imagine a 3D cube \([0,1]^3\) divided into 8³ = 512 smaller cubes:
- ■ (filled): Occupied cell (has candidates)
- □ (empty): Unoccupied cell (no candidates)
Typical pattern: - Dense region (blue water): Many occupied cells, many frames per cell - Sparse region (organisms): Few occupied cells, few frames per cell
Grid selection:
- Dense region: Contributes max-per-cell × num_cells (e.g., 10 × 200 = 2000 frames)
- Sparse region: Contributes all frames (e.g., 1–5 per cell × 50 cells = 100 frames)
Total: 2100 frames (blue water dominates, but organisms are represented)
Occupancy vs. n-bins¶
| n-bins | Total Cells | Candidates (10K) | Occupancy | Avg per Cell |
|---|---|---|---|---|
| 5 | 125 | 10,000 | ~90% | 80 |
| 8 | 512 | 10,000 | ~60% | 20 |
| 10 | 1,000 | 10,000 | ~40% | 10 |
| 12 | 1,728 | 10,000 | ~25% | 6 |
Sweet spot: 30–70% occupancy (balance between coverage and cell density).
Advanced: Custom Diversity Metrics¶
To use different features for diversity selection:
-
Replace normalization in
grid_diversity()(sample.cpp): -
Compute custom features (add to
FrameRecordandsample_video_metrics()): -
Rebuild and test
See API Reference for complete example.
Summary¶
Grid-based diversity selection:
- ✅ Fast: O(N log N) (sorting for percentiles and per-cell ranking)
- ✅ Deterministic: Same candidates → same output
- ✅ Balanced: Covers all feature regions while prioritizing high-interest frames
- ✅ Tunable:
n-binsandmax-per-cellcontrol diversity vs. interest trade-off
Key parameters:
- --n-bins: Grid resolution (5–12 typical)
- --max-per-cell: Cell budget (default: auto-computed)
When to adjust:
- Over-sampling common scenes → lower max-per-cell, raise n-bins
- Missing rare events → lower max-per-cell, raise n-bins
- Under-utilizing budget → lower n-bins, raise max-per-cell
Next Steps¶
- Temporal Filtering: How time gaps are enforced
- Quality Metrics: What features are used for grid binning
- Tuning Guide: Practical parameter selection