
dedup-voc: Deduplicate Pascal VOC Annotations

The dedup-voc command scans Pascal VOC XML annotation files and removes duplicate objects from each file. Two objects count as duplicates when they have the same class name and the same bounding box coordinates. After deduplication, each object appears only once per image, which improves dataset quality for training and analysis.

Usage

m3-download dedup-voc DIRECTORY [OPTIONS]

Parameters

  • DIRECTORY: Directory containing Pascal VOC annotation XML files, or a single XML file
  • --output-dir: (Optional) Directory to save deduplicated annotations; if omitted, the original files are overwritten in place
  • --verbose, -v: (Optional) Display detailed logging information

Output

  • Modified VOC XML files with duplicates removed
  • Console output summarizing the number of objects removed per file and in total

How It Works

  1. The command loads each annotation file and parses all objects
  2. Objects are compared by class name and bounding box coordinates
  3. Only unique objects are kept; duplicates are removed
  4. Deduplicated annotations are saved in pretty-printed XML format
  5. A summary of removed objects is displayed
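
The steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the code in m3_download/cli/dedup_voc.py. The function names (object_key, dedup_tree, dedup_file) are hypothetical, and the sketch assumes coordinates are compared as the literal strings found in the XML, so "10" and "10.0" would be treated as different values.

```python
import xml.etree.ElementTree as ET


def object_key(obj):
    """Build a comparison key: (class name, xmin, ymin, xmax, ymax)."""
    name = (obj.findtext("name") or "").strip()
    box = obj.find("bndbox")
    coords = tuple(
        (box.findtext(tag) or "").strip() if box is not None else ""
        for tag in ("xmin", "ymin", "xmax", "ymax")
    )
    return (name, *coords)


def dedup_tree(tree):
    """Remove repeated <object> elements in place and return the number removed."""
    root = tree.getroot()
    seen = set()
    removed = 0
    # Iterate over a copy of the list because elements are removed while looping.
    for obj in list(root.findall("object")):
        key = object_key(obj)
        if key in seen:
            root.remove(obj)  # a later copy of an object already kept
            removed += 1
        else:
            seen.add(key)  # first occurrence is kept
    return removed


def dedup_file(src, dst=None):
    """Deduplicate one annotation file; write to dst, or back to src if dst is None."""
    tree = ET.parse(src)
    removed = dedup_tree(tree)
    ET.indent(tree)  # pretty-print; requires Python 3.9 or newer
    tree.write(dst or src, encoding="utf-8", xml_declaration=True)
    return removed
```

Keeping the first occurrence and dropping later ones preserves the original object order, which keeps diffs against the source files small.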

Safe Processing

Use the --output-dir option to keep your original files unchanged. If the option is omitted, the original files are overwritten in place.
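
The choice between in-place and separate output can be pictured as a small path-resolution step. The sketch below is an assumption about how such a step might look, not the tool's actual logic. The helper name resolve_output_path is hypothetical, and it assumes that output files keep their original file names inside the output directory.

```python
from pathlib import Path


def resolve_output_path(src, output_dir=None):
    """Return where the deduplicated copy of src should be written."""
    src = Path(src)
    if output_dir is None:
        # No output directory given: overwrite the original file.
        return src
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    return out / src.name  # assumed: same file name, different directory
```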

Examples

Deduplicate all annotations in a directory, updating files in place:

m3-download dedup-voc annotations/

Deduplicate and save results to a new directory:

m3-download dedup-voc annotations/ --output-dir deduped_annotations/

Deduplicate a single file:

m3-download dedup-voc annotation.xml

Enable verbose logging for troubleshooting:

m3-download dedup-voc annotations/ --verbose

Notes

  • The tool works with both single files and directories
  • If no files are provided, or no XML files are found, an error is displayed
  • A directory cannot be passed together with multiple files in the same invocation
  • Deduplication is based on object name and bounding box coordinates
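
The input handling described in these notes could look roughly like the following. This is a hedged sketch under stated assumptions rather than the tool's real behavior: the helper name collect_xml_files is hypothetical, the directory scan is assumed to be non-recursive, and the error is shown as a Python exception instead of the CLI's own message.

```python
from pathlib import Path


def collect_xml_files(path):
    """Return the XML files for a directory or a single file; fail if there are none."""
    path = Path(path)
    if path.is_dir():
        # Assumed: only files directly inside the directory are considered.
        files = sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() == ".xml"
        )
    elif path.is_file() and path.suffix.lower() == ".xml":
        files = [path]
    else:
        files = []
    if not files:
        raise FileNotFoundError(f"No XML files found at {path}")
    return files
```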

For more details, see the source code in m3_download/cli/dedup_voc.py.