Temporal Filtering

In-depth explanation of temporal filtering and minimum time gap enforcement.

Problem Statement

Video frames are highly correlated in time. Adjacent frames differ only by:

  • Camera motion: ~cm displacement at typical AUV speeds (0.5–2 m/s)
  • Subject motion: Organisms and particles move slowly relative to frame rate (30 fps)
  • Lighting changes: Gradual as vehicle moves relative to sun/lights

Without temporal filtering, diversity selection would pick many near-duplicate frames from the same clip, wasting frame budget and annotation effort.


Algorithm

Overview

For each video, enforce minimum time gap (--min-gap seconds) between selected frames.

Input: Frames sorted by frame_idx
Output: Subset with temporal separation ≥ min-gap

Pseudo-Code

def temporal_filter(frames, min_gap_sec, fps):
    kept = []
    last_selected_idx = float("-inf")

    for frame in sorted(frames, key=lambda f: f.frame_idx):
        gap_sec = (frame.frame_idx - last_selected_idx) / fps
        if gap_sec >= min_gap_sec:
            kept.append(frame)
            last_selected_idx = frame.frame_idx

    return kept

Implementation

#include <algorithm>
#include <map>
#include <string>
#include <vector>

std::vector<FrameRecord> temporal_filter(
    const std::vector<FrameRecord>& candidates,
    double min_gap_sec)
{
    // Group candidate indices by source video: the gap constraint is per-video.
    std::map<std::string, std::vector<std::size_t>> by_video;
    for (std::size_t i = 0; i < candidates.size(); i++)
        by_video[candidates[i].video_path.string()].push_back(i);

    std::vector<FrameRecord> kept;
    for (auto& [_, indices] : by_video) {
        // Sort each video's candidates into temporal order.
        std::sort(indices.begin(), indices.end(),
            [&](std::size_t a, std::size_t b) {
                return candidates[a].frame_idx < candidates[b].frame_idx;
            });

        double last_idx = -1e9;  // effectively -infinity, so the first frame is always accepted
        double fps = candidates[indices[0]].fps;

        // Greedy pass: keep a frame only if it lies at least min_gap_sec after the last kept frame.
        for (std::size_t i : indices) {
            const auto& r = candidates[i];
            if ((r.frame_idx - last_idx) / fps >= min_gap_sec) {
                kept.push_back(r);
                last_idx = r.frame_idx;
            }
        }
    }
    return kept;
}

Key points:

  1. Group by video: Temporal filtering is per-video (frames from different videos are independent)
  2. Sort by frame index: Ensures temporal order
  3. Greedy selection: Accept frame if gap ≥ threshold (doesn't try to maximize total frames)
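
For illustration, here is a minimal Python sketch of the same logic, combining the greedy rule from the pseudo-code with the per-video grouping described in point 1. It assumes each frame object exposes video_path, frame_idx, and fps attributes; it is not the production C++ API.

from collections import defaultdict

def temporal_filter_per_video(frames, min_gap_sec):
    # Group candidates by source video; the gap constraint applies within each video only.
    by_video = defaultdict(list)
    for frame in frames:
        by_video[frame.video_path].append(frame)

    kept = []
    for group in by_video.values():
        last_idx = float("-inf")
        # Greedy pass in temporal order: keep a frame if it is >= min_gap_sec after the last kept one.
        for frame in sorted(group, key=lambda f: f.frame_idx):
            if (frame.frame_idx - last_idx) / frame.fps >= min_gap_sec:
                kept.append(frame)
                last_idx = frame.frame_idx
    return kept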

Example

Input

Video at 30 fps, min-gap = 2.0 seconds:

Frame Index   Time (sec)   Quality/Interest Score
0             0.0          50
15            0.5          80
30            1.0          60
60            2.0          90
61            2.03         100
120           4.0          70
180           6.0          85

Processing

  1. Frame 0: Gap = ∞ (first frame) ≥ 2.0 → Accept
    last_selected = 0

  2. Frame 15: Gap = (15 - 0) / 30 = 0.5 sec < 2.0 → Reject

  3. Frame 30: Gap = (30 - 0) / 30 = 1.0 sec < 2.0 → Reject

  4. Frame 60: Gap = (60 - 0) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 60

  5. Frame 61: Gap = (61 - 60) / 30 = 0.03 sec < 2.0 → Reject
    (Even though interest score is highest)

  6. Frame 120: Gap = (120 - 60) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 120

  7. Frame 180: Gap = (180 - 120) / 30 = 2.0 sec ≥ 2.0 → Accept
    last_selected = 180

Output

Kept frames: 0, 60, 120, 180
Rejection rate: 3 / 7 = 43%

Note: Frame 61 (highest interest score) is rejected due to proximity to frame 60.
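
To make the trace above concrete, here is a small, self-contained Python check of the same numbers (frames reduced to bare indices; this is an illustration, not the production code path):

def greedy_min_gap(frame_indices, fps, min_gap_sec):
    # Keep a frame only if it is at least min_gap_sec after the last kept frame.
    kept, last_idx = [], float("-inf")
    for idx in sorted(frame_indices):
        if (idx - last_idx) / fps >= min_gap_sec:
            kept.append(idx)
            last_idx = idx
    return kept

# Frame indices from the table above (30 fps, min-gap = 2.0 s)
print(greedy_min_gap([0, 15, 30, 60, 61, 120, 180], fps=30, min_gap_sec=2.0))
# [0, 60, 120, 180]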


When Temporal Filtering is Applied

Order in pipeline:

  1. Pass 1: Compute metrics, apply quality gates
  2. Pass 2a: Temporal filtering (this step)
  3. Pass 2b: Grid-based diversity selection
  4. Pass 3: Extract frames

Rationale: Apply temporal filtering before grid binning to ensure candidates entering the grid are temporally separated.

Alternative orderings (not used):

  • Apply after grid binning → risk of near-duplicates in different cells
  • Apply after budget enforcement → too late (redundant frames already selected)
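
Schematically, the chosen ordering looks like the sketch below. The function names are placeholders for the pipeline stages described above, not actual API calls.

def select_frames(videos, min_gap_sec, budget):
    # Pass 1: compute metrics and apply quality gates (placeholder stage)
    candidates = [f for f in compute_metrics(videos) if passes_quality_gates(f)]

    # Pass 2a: temporal filtering (this step) -- per-video minimum time gap
    candidates = temporal_filter(candidates, min_gap_sec)

    # Pass 2b: grid-based diversity selection within the frame budget
    selected = grid_diversity_select(candidates, budget)

    # Pass 3: extract the selected frames
    return extract_frames(selected)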


Parameter Tuning

--min-gap (Minimum Gap in Seconds)

Trade-off:

  • Higher (e.g., 5–10s): Strong temporal separation, fewer frames per video
  • Lower (e.g., 0.5–1s): More frames per video, risk of near-duplicates

Recommended values:

Vehicle Speed                 Scene Change Rate          Recommended min-gap
Stationary (benthic lander)   Slow (static scene)        5–10s
Slow (<1 m/s)                 Moderate (gradual drift)   2–5s
Fast (>1 m/s)                 Rapid (transect)           1–2s

How to Choose

Method 1: Visual inspection

  1. Run with small gap (e.g., --min-gap 0.5)
  2. Inspect output frames from same video
  3. If many look nearly identical, increase gap

Method 2: Scene change estimation

Estimate how long it takes for the scene to change:

\[\text{min\_gap} = \frac{\text{field\_of\_view}}{\text{vehicle\_speed}}\]

Example (1 m FOV, 0.5 m/s vehicle): \(\text{min\_gap} = \frac{1 \text{ m}}{0.5 \text{ m/s}} = 2 \text{ sec}\)
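
The same estimate as a trivial helper (field of view in metres, vehicle speed in m/s; the values are the ones from the example above):

def estimate_min_gap(field_of_view_m, vehicle_speed_m_s):
    # Time for the vehicle to traverse one field of view, i.e. for the scene to change.
    return field_of_view_m / vehicle_speed_m_s

print(estimate_min_gap(1.0, 0.5))  # 2.0 seconds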

Method 3: Empirical

Start with 1–2 seconds (typical for AUV surveys), then adjust based on output.


Impact on Diversity

Positive Effects

Reduces redundancy: Avoids near-duplicate frames
Spreads coverage: Forces sampling across time (different scenes, conditions)
Improves annotation efficiency: Each frame provides new information

Potential Issues

Misses short events: If an interesting feature lasts less than min-gap, it may not be captured
Reduces candidate pool: Fewer frames available for grid binning
Ignores spatial diversity: Doesn't account for vehicle position (may sample same location at different times)

Mitigation Strategies

For short events:

  • Lower --min-gap (e.g., 0.5–1s)
  • Raise --sample-fps to examine more frames (increases the chance of catching the event)

For large candidate pools:

  • A temporal filtering rejection rate of 10–40% is acceptable
  • If rejection exceeds 50%, consider lowering --min-gap or increasing --sample-fps (see the sketch below)

For spatial diversity:

  • If the vehicle revisits the same location (e.g., a benthic survey grid), temporal filtering is beneficial (avoids duplicate coverage)
  • If the vehicle runs a single transect, temporal filtering spreads frames along the path
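
As a sketch of the rule of thumb for large candidate pools, the rejection rate can be checked directly from the candidate counts before and after filtering (the 50% threshold is the figure quoted above):

def rejection_rate(n_before, n_after):
    # Fraction of candidates removed by temporal filtering.
    return 1.0 - n_after / n_before

rate = rejection_rate(n_before=7, n_after=4)  # the worked example above: 3/7 ≈ 0.43
if rate > 0.5:
    print("Consider lowering --min-gap or increasing --sample-fps")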


Greedy vs. Optimal Selection

Current Approach (Greedy)

Algorithm: Accept first frame, then next frame ≥ min-gap later, repeat.

Pros:

  • ✅ Simple, fast (O(N log N) for sorting)
  • ✅ Deterministic
  • ✅ Intuitive

Cons:

  • ❌ Not optimal (may not maximize the number of selected frames)

Example: Suboptimal Case

Frames at times: 0, 1, 3, 4, 6 seconds (min-gap = 2s):

Greedy selection:

  1. Accept frame 0
  2. Reject frame 1 (gap = 1s < 2s)
  3. Accept frame 3 (gap = 3s ≥ 2s)
  4. Reject frame 4 (gap = 1s < 2s)
  5. Accept frame 6 (gap = 6 − 3 = 3s ≥ 2s)

Result: Frames 0, 3, 6 (3 frames)

Optimal selection (maximize count): frames 0, 3, 6 (3 frames) ← Same!

Alternative scenario: Frames at 0, 1.9, 4 seconds:

Greedy: 0, 4 (2 frames)
Optimal: also 2 frames (e.g., 0, 4 or 1.9, 4) ← Same count, possibly different choices

Conclusion: Greedy is often near-optimal for typical frame distributions. An optimal algorithm (dynamic programming) would be O(N²) and is not worth the added complexity.
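
For reference, here is a sketch of the O(N²) dynamic program alluded to above, maximizing the number of kept frames; the same recurrence works with per-frame weights such as interest scores. It is not part of the implementation.

def optimal_min_gap(times_sec, min_gap_sec):
    # best[i] = size of the largest valid subset whose last kept frame is i (frames sorted by time).
    times = sorted(times_sec)
    n = len(times)
    if n == 0:
        return []
    best = [1] * n
    prev = [-1] * n  # back-pointers for reconstructing one optimal subset
    for i in range(n):
        for j in range(i):
            if times[i] - times[j] >= min_gap_sec and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=lambda k: best[k])
    chosen = []
    while i != -1:
        chosen.append(times[i])
        i = prev[i]
    return chosen[::-1]

print(optimal_min_gap([0, 1, 3, 4, 6], min_gap_sec=2.0))  # [0, 3, 6] -- same count as greedy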


Interaction with Grid Diversity

Temporal filtering runs first, so grid binning operates on temporally separated frames.

Effect on grid occupancy:

Without temporal filtering:

  • Many frames from the same clip may fall in the same grid cell (near-duplicates)
  • Dense cells become denser

With temporal filtering:

  • Fewer frames per clip enter the grid
  • Cells represent temporally diverse instances of similar visual conditions

Example:

Clip A (blue water, 100 frames examined, 80 pass quality):

  • Without temporal filter: 80 frames → many in the same cell (brightness=0.3, sharpness=0.4, entropy=0.2)
  • With temporal filter (min-gap=2s, 30 fps): 80 → ~15 kept → fewer redundant frames in the same cell

Result: Grid cells are more diverse (temporally and visually).
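
The size of that reduction can also be bounded without running the filter: at most one frame can survive per min-gap window, so a clip of duration T seconds keeps at most ⌊T / min_gap⌋ + 1 frames. The quick check below assumes Clip A spans roughly 30 seconds of video, which is not stated above and is only an illustrative guess.

def max_kept(duration_sec, min_gap_sec):
    # Upper bound on frames surviving temporal filtering for one clip.
    return int(duration_sec // min_gap_sec) + 1

print(max_kept(duration_sec=30, min_gap_sec=2.0))  # 16, consistent with the ~15 figure above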


Multi-Video Considerations

Independent Filtering

Temporal filtering is per-video:

  • Frames from Video A and Video B are independent (no temporal constraint between them)
  • min-gap applies within each video

Rationale: Videos may be recorded at different times/locations (no temporal relationship).

Cross-Video Diversity

Grid diversity handles cross-video selection:

  • If Video A and Video B both have blue water frames, grid ensures limited contribution from that cell
  • Temporal filtering ensures frames within each video are spread out

Combined effect: Both temporal and visual diversity.


Advanced: Adaptive Gaps

Current implementation: Fixed min-gap for all videos.

Potential enhancement: Adaptive gap based on scene change rate:

double adaptive_gap(const FrameRecord& prev, const FrameRecord& curr,
                    double min_gap_base) {
    double motion_factor = curr.motion / 10.0;            // normalize the motion metric
    return min_gap_base * (1.0 / (1.0 + motion_factor));  // shrink the gap as motion grows
}

Effect: Larger gaps for static scenes, smaller gaps for dynamic scenes.

Not implemented (added complexity, minor benefit for typical use cases).
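
If it were implemented, the adaptive gap would simply replace the fixed threshold inside the greedy loop. A Python sketch under the same assumptions (a per-frame motion metric normalized by the same factor of 10):

def temporal_filter_adaptive(frames, min_gap_base, fps):
    kept, last_idx = [], float("-inf")
    for frame in sorted(frames, key=lambda f: f.frame_idx):
        # Required gap shrinks as scene motion increases.
        required_gap = min_gap_base / (1.0 + frame.motion / 10.0)
        if (frame.frame_idx - last_idx) / fps >= required_gap:
            kept.append(frame)
            last_idx = frame.frame_idx
    return kept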


Visualization

Timeline Example

Video timeline (30 fps, min-gap = 2s):
Time (sec):  0    1    2    3    4    5    6    7    8    9    10
Frames:      |    |    |    |    |    |    |    |    |    |    |
             ✓         ✓         ✓         ✓         ✓
             │         │         │         │         │
             │         │         │         │         │
            Kept     Kept      Kept      Kept      Kept

Rejected:    ╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳╳
             (all frames between kept frames)

Gap enforcement: Minimum 2 seconds (60 frames at 30 fps) between kept frames.


Summary

Temporal filtering:

  • Purpose: Avoid near-duplicate frames from same clip
  • Method: Greedy selection with minimum time gap
  • When: Before grid diversity selection
  • Impact: Reduces candidates by ~10–40%, improves temporal diversity

Key parameter: --min-gap (seconds)

  • 1–2s: Typical for fast-moving vehicles
  • 2–5s: Typical for slow-moving vehicles
  • 5–10s: Static or very slow scenes

Trade-off: Larger gap → fewer candidates, stronger temporal diversity


Next Steps