Temporal Filtering

In-depth explanation of temporal filtering and minimum time gap enforcement.

Problem Statement

Video frames are highly correlated in time. Adjacent frames differ only by:

  • Camera motion: ~cm displacement at typical AUV speeds (0.5–2 m/s)
  • Subject motion: Organisms and particles move slowly relative to frame rate (30 fps)
  • Lighting changes: Gradual as vehicle moves relative to sun/lights

Without temporal filtering, diversity selection would pick many near-duplicate frames from the same clip, wasting frame budget and annotation effort.


Algorithm

Overview

For each video, enforce minimum time gap (--min-gap seconds) between selected frames.

Input: Frames sorted by frame_idx
Output: Subset with temporal separation ≥ min-gap

Pseudo-Code

def temporal_filter(frames, min_gap_sec, fps):
    kept = []
    last_selected_idx = float("-inf")

    for frame in sorted(frames, key=lambda f: f.frame_idx):
        gap_sec = (frame.frame_idx - last_selected_idx) / fps
        if gap_sec >= min_gap_sec:
            kept.append(frame)
            last_selected_idx = frame.frame_idx

    return kept

Implementation

#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::vector<FrameRecord> temporal_filter(
    const std::vector<FrameRecord>& candidates,
    double min_gap_sec)
{
    // Group candidate indices by source video: the gap constraint is per-video.
    std::map<std::string, std::vector<std::size_t>> by_video;
    for (std::size_t i = 0; i < candidates.size(); i++)
        by_video[candidates[i].video_path.string()].push_back(i);

    std::vector<FrameRecord> kept;
    for (auto& [_, indices] : by_video) {
        // Sort each video's candidates into temporal order.
        std::sort(indices.begin(), indices.end(),
            [&](std::size_t a, std::size_t b) {
                return candidates[a].frame_idx < candidates[b].frame_idx;
            });

        double last_idx = -1e9;  // effectively -infinity, so the first frame is always accepted
        double fps = candidates[indices[0]].fps;

        // Greedy pass: keep a frame only if it lies at least min_gap_sec after the last kept frame.
        for (std::size_t i : indices) {
            const auto& r = candidates[i];
            if ((r.frame_idx - last_idx) / fps >= min_gap_sec) {
                kept.push_back(r);
                last_idx = r.frame_idx;
            }
        }
    }
    return kept;
}

Key points:

  1. Group by video: Temporal filtering is per-video (frames from different videos are independent)
  2. Sort by frame index: Ensures temporal order
  3. Greedy selection: Accept frame if gap ≥ threshold (doesn't try to maximize total frames)
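
For illustration, here is a minimal Python sketch of the same logic, combining the greedy rule from the pseudo-code with the per-video grouping described in point 1. It assumes each frame object exposes video_path, frame_idx, and fps attributes; it is not the production C++ API.

from collections import defaultdict

def temporal_filter_per_video(frames, min_gap_sec):
    # Group candidates by source video; the gap constraint applies within each video only.
    by_video = defaultdict(list)
    for frame in frames:
        by_video[frame.video_path].append(frame)

    kept = []
    for group in by_video.values():
        last_idx = float("-inf")
        # Greedy pass in temporal order: keep a frame if it is >= min_gap_sec after the last kept one.
        for frame in sorted(group, key=lambda f: f.frame_idx):
            if (frame.frame_idx - last_idx) / frame.fps >= min_gap_sec:
                kept.append(frame)
                last_idx = frame.frame_idx
    return kept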

Example

Input

Video at 30 fps, min-gap = 2.0 seconds:

Frame Index   Time (sec)   Quality/Interest Score
0             0.0          50
15            0.5          80
30            1.0          60
60            2.0          90
61            2.03         100
120           4.0          70
180           6.0          85

Processing

  1. Frame 0: Gap = ∞ (first frame) ≥ 2.0 → Accept
    last_selected = 0

  2. Frame 15: Gap = (15 - 0) / 30 = 0.5 sec < 2.0 → Reject

  3. Frame 30: Gap = (30 - 0) / 30 = 1.0 sec < 2.0 → Reject

  4. Frame 60: Gap = (60 - 0) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 60

  5. Frame 61: Gap = (61 - 60) / 30 = 0.03 sec < 2.0 → Reject
    (Even though interest score is highest)

  6. Frame 120: Gap = (120 - 60) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 120

  7. Frame 180: Gap = (180 - 120) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 180

Output

Kept frames: 0, 60, 120, 180
Rejection rate: 3 / 7 = 43%

Note: Frame 61 (highest interest score) is rejected due to proximity to frame 60.
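
To make the trace above concrete, here is a small, self-contained Python check of the same numbers (frames reduced to bare indices; this is an illustration, not the production code path):

def greedy_min_gap(frame_indices, fps, min_gap_sec):
    # Keep a frame only if it is at least min_gap_sec after the last kept frame.
    kept, last_idx = [], float("-inf")
    for idx in sorted(frame_indices):
        if (idx - last_idx) / fps >= min_gap_sec:
            kept.append(idx)
            last_idx = idx
    return kept

# Frame indices from the table above (30 fps, min-gap = 2.0 s)
print(greedy_min_gap([0, 15, 30, 60, 61, 120, 180], fps=30, min_gap_sec=2.0))
# [0, 60, 120, 180]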


When Temporal Filtering is Applied

Order in pipeline:

  1. Pass 1: Compute metrics, apply quality gates
  2. Pass 2a: Temporal filtering (this step)
  3. Pass 2b: Grid-based diversity selection
  4. Pass 3: Extract frames

Rationale: Apply temporal filtering before grid binning to ensure candidates entering the grid are temporally separated.

Alternative orderings (not used):

  • Apply after grid binning → risk of near-duplicates in different cells
  • Apply after budget enforcement → too late (redundant frames already selected)
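
Schematically, the chosen ordering looks like the sketch below. The function names are placeholders for the pipeline stages described above, not actual API calls.

def select_frames(videos, min_gap_sec, budget):
    # Pass 1: compute metrics and apply quality gates (placeholder stage)
    candidates = [f for f in compute_metrics(videos) if passes_quality_gates(f)]

    # Pass 2a: temporal filtering (this step) -- per-video minimum time gap
    candidates = temporal_filter(candidates, min_gap_sec)

    # Pass 2b: grid-based diversity selection within the frame budget
    selected = grid_diversity_select(candidates, budget)

    # Pass 3: extract the selected frames
    return extract_frames(selected)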


Parameter Tuning

--min-gap (Minimum Gap in Seconds)

Trade-off:

  • Higher (e.g., 5–10s): Strong temporal separation, fewer frames per video
  • Lower (e.g., 0.5–1s): More frames per video, risk of near-duplicates

Recommended values:

Vehicle Speed                 Scene Change Rate          Recommended min-gap
Stationary (benthic lander)   Slow (static scene)        5–10s
Slow (<1 m/s)                 Moderate (gradual drift)   2–5s
Fast (>1 m/s)                 Rapid (transect)           1–2s

How to Choose

Method 1: Visual inspection

  1. Run with small gap (e.g., --min-gap 0.5)
  2. Inspect output frames from same video
  3. If many look nearly identical, increase gap

Method 2: Scene change estimation

Estimate how long it takes for the scene to change:

\[\text{min\_gap} = \frac{\text{field\_of\_view}}{\text{vehicle\_speed}}\]

Example (1 m FOV, 0.5 m/s vehicle): \(\text{min\_gap} = \frac{1 \text{ m}}{0.5 \text{ m/s}} = 2 \text{ sec}\)
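
The same estimate as a trivial helper (field of view in metres, vehicle speed in m/s; the values are the ones from the example above):

def estimate_min_gap(field_of_view_m, vehicle_speed_m_s):
    # Time for the vehicle to traverse one field of view, i.e. for the scene to change.
    return field_of_view_m / vehicle_speed_m_s

print(estimate_min_gap(1.0, 0.5))  # 2.0 seconds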

Method 3: Empirical

Start with 1–2 seconds (typical for AUV surveys), then adjust based on output.


Impact on Diversity

Positive Effects

Reduces redundancy: Avoids near-duplicate frames
Spreads coverage: Forces sampling across time (different scenes, conditions)
Improves annotation efficiency: Each frame provides new information

Potential Issues

Misses short events: If an interesting feature lasts less than min-gap, it may not be captured
Reduces candidate pool: Fewer frames available for grid binning
Ignores spatial diversity: Doesn't account for vehicle position (may sample same location at different times)

Mitigation Strategies

For short events:

  • Lower --min-gap (e.g., 0.5–1s)
  • Raise --sample-fps to examine more frames (increases the chance of catching the event)

For large candidate pools:

  • A temporal filtering rejection rate of 10–40% is acceptable
  • If rejection exceeds 50%, consider lowering --min-gap or increasing --sample-fps (see the sketch below)

For spatial diversity:

  • If the vehicle revisits the same location (e.g., a benthic survey grid), temporal filtering is beneficial (avoids duplicate coverage)
  • If the vehicle runs a single transect, temporal filtering spreads frames along the path
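
As a sketch of the rule of thumb for large candidate pools, the rejection rate can be checked directly from the candidate counts before and after filtering (the 50% threshold is the figure quoted above):

def rejection_rate(n_before, n_after):
    # Fraction of candidates removed by temporal filtering.
    return 1.0 - n_after / n_before

rate = rejection_rate(n_before=7, n_after=4)  # the worked example above: 3/7 ≈ 0.43
if rate > 0.5:
    print("Consider lowering --min-gap or increasing --sample-fps")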


Greedy vs. Optimal Selection

Current Approach (Greedy)

Algorithm: Accept first frame, then next frame ≥ min-gap later, repeat.

Pros:

  • ✅ Simple, fast (O(N log N) for sorting)
  • ✅ Deterministic
  • ✅ Intuitive

Cons:

  • ❌ Not optimal (may not maximize the number of selected frames)

Example: Suboptimal Case

Frames at times: 0, 1, 3, 4, 6 seconds (min-gap = 2s):

Greedy selection:

  1. Accept frame 0
  2. Reject frame 1 (gap = 1s < 2s)
  3. Accept frame 3 (gap = 3s ≥ 2s)
  4. Reject frame 4 (gap = 1s < 2s)
  5. Accept frame 6 (gap = 6 − 3 = 3s ≥ 2s)

Result: Frames 0, 3, 6 (3 frames)

Optimal selection (maximize count): frames 0, 3, 6 (3 frames) ← Same!

Alternative scenario: Frames at 0, 1.9, 4 seconds:

Greedy: 0, 4 (2 frames)
Optimal: also 2 frames (e.g., 0, 4 or 1.9, 4) ← Same count, possibly different choices

Conclusion: Greedy is often near-optimal for typical frame distributions. An optimal algorithm (dynamic programming) would be O(N²) and is not worth the added complexity.
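
For reference, here is a sketch of the O(N²) dynamic program alluded to above, maximizing the number of kept frames; the same recurrence works with per-frame weights such as interest scores. It is not part of the implementation.

def optimal_min_gap(times_sec, min_gap_sec):
    # best[i] = size of the largest valid subset whose last kept frame is i (frames sorted by time).
    times = sorted(times_sec)
    n = len(times)
    if n == 0:
        return []
    best = [1] * n
    prev = [-1] * n  # back-pointers for reconstructing one optimal subset
    for i in range(n):
        for j in range(i):
            if times[i] - times[j] >= min_gap_sec and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=lambda k: best[k])
    chosen = []
    while i != -1:
        chosen.append(times[i])
        i = prev[i]
    return chosen[::-1]

print(optimal_min_gap([0, 1, 3, 4, 6], min_gap_sec=2.0))  # [0, 3, 6] -- same count as greedy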


Interaction with Grid Diversity

Temporal filtering runs first, so grid binning operates on temporally separated frames.

Effect on grid occupancy:

Without temporal filtering:

  • Many frames from the same clip may fall in the same grid cell (near-duplicates)
  • Dense cells become denser

With temporal filtering:

  • Fewer frames per clip enter the grid
  • Cells represent temporally diverse instances of similar visual conditions

Example:

Clip A (blue water, 100 frames examined, 80 pass quality):

  • Without temporal filter: 80 frames → many in the same cell (brightness=0.3, sharpness=0.4, entropy=0.2)
  • With temporal filter (min-gap=2s, 30 fps): 80 → ~15 kept → fewer redundant frames in the same cell

Result: Grid cells are more diverse (temporally and visually).
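
The size of that reduction can also be bounded without running the filter: at most one frame can survive per min-gap window, so a clip of duration T seconds keeps at most ⌊T / min_gap⌋ + 1 frames. The quick check below assumes Clip A spans roughly 30 seconds of video, which is not stated above and is only an illustrative guess.

def max_kept(duration_sec, min_gap_sec):
    # Upper bound on frames surviving temporal filtering for one clip.
    return int(duration_sec // min_gap_sec) + 1

print(max_kept(duration_sec=30, min_gap_sec=2.0))  # 16, consistent with the ~15 figure above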


Multi-Video Considerations

Independent Filtering

Temporal filtering is per-video:

  • Frames from Video A and Video B are independent (no temporal constraint between them)
  • min-gap applies within each video

Rationale: Videos may be recorded at different times/locations (no temporal relationship).

Cross-Video Diversity

Grid diversity handles cross-video selection:

  • If Video A and Video B both have blue water frames, grid ensures limited contribution from that cell
  • Temporal filtering ensures frames within each video are spread out

Combined effect: Both temporal and visual diversity.


Advanced: Adaptive Gaps

Current implementation: Fixed min-gap for all videos.

Potential enhancement: Adaptive gap based on scene change rate:

double adaptive_gap(const FrameRecord& prev, const FrameRecord& curr,
                    double min_gap_base) {
    double motion_factor = curr.motion / 10.0;            // normalize the motion metric
    return min_gap_base * (1.0 / (1.0 + motion_factor));  // shrink the gap as motion grows
}

Effect: Larger gaps for static scenes, smaller gaps for dynamic scenes.

Not implemented (added complexity, minor benefit for typical use cases).
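
If it were implemented, the adaptive gap would simply replace the fixed threshold inside the greedy loop. A Python sketch under the same assumptions (a per-frame motion metric normalized by the same factor of 10):

def temporal_filter_adaptive(frames, min_gap_base, fps):
    kept, last_idx = [], float("-inf")
    for frame in sorted(frames, key=lambda f: f.frame_idx):
        # Required gap shrinks as scene motion increases.
        required_gap = min_gap_base / (1.0 + frame.motion / 10.0)
        if (frame.frame_idx - last_idx) / fps >= required_gap:
            kept.append(frame)
            last_idx = frame.frame_idx
    return kept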


Visualization

Timeline Example

Video timeline (30 fps, min-gap = 2s):
Time (sec):  0    1    2    3    4    5    6    7    8    9    10
Frames:      |    |    |    |    |    |    |    |    |    |    |
             ✓         ✓         ✓         ✓         ✓
             │         │         │         │         │
             │         │         │         │         │
            Kept     Kept      Kept      Kept      Kept

Rejected:    ╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳
             (all frames between kept frames)

Gap enforcement: Minimum 2 seconds (60 frames at 30 fps) between kept frames.


Summary

Temporal filtering:

  • Purpose: Avoid near-duplicate frames from same clip
  • Method: Greedy selection with minimum time gap
  • When: Before grid diversity selection
  • Impact: Reduces candidates by ~10–40%, improves temporal diversity

Key parameter: --min-gap (seconds)

  • 1–2s: Typical for fast-moving vehicles
  • 2–5s: Typical for slow-moving vehicles
  • 5–10s: Static or very slow scenes

Trade-off: Larger gap → fewer candidates, stronger temporal diversity


Next Steps