dedup-voc: Deduplicate Pascal VOC Annotations
The dedup-voc command scans Pascal VOC XML annotation files and removes duplicate objects from each file. Two objects count as duplicates when they have the same class name and the same bounding box coordinates. This ensures that each object appears only once per image, which improves dataset quality for training and analysis.
Usage
Parameters
- DIRECTORY: Directory containing Pascal VOC annotation XML files, or a single XML file
- --output-dir: (Optional) Directory to save deduplicated annotations; if omitted, original files are updated
- --verbose, -v: (Optional) Display detailed logging information
Output
- Modified VOC XML files with duplicates removed
- Console output summarizing the number of objects removed per file and in total
How It Works
- The command loads each annotation file and parses all objects
- Objects are compared by class name and bounding box coordinates
- Only unique objects are kept; duplicates are removed
- Deduplicated annotations are saved in pretty-printed XML format
- A summary of removed objects is displayed
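The steps above can be sketched in Python with the standard library. This is a minimal illustration, not the code in dedup_voc.py: the function names `object_key`, `dedup_tree`, and `dedup_file` are hypothetical, and the sketch assumes coordinates are compared as the literal strings found in the XML (so `10` and `10.0` would not match).

```python
import xml.etree.ElementTree as ET


def object_key(obj):
    """Build a (name, xmin, ymin, xmax, ymax) key for one <object> element."""
    box = obj.find("bndbox")
    coords = tuple(
        (box.findtext(tag) or "").strip() if box is not None else ""
        for tag in ("xmin", "ymin", "xmax", "ymax")
    )
    return ((obj.findtext("name") or "").strip(),) + coords


def dedup_tree(tree):
    """Remove repeated <object> elements in place; return how many were removed."""
    root = tree.getroot()
    seen = set()
    removed = 0
    # Iterate over a copy because the root is modified inside the loop.
    for obj in list(root.findall("object")):
        key = object_key(obj)
        if key in seen:
            root.remove(obj)
            removed += 1
        else:
            seen.add(key)
    return removed


def dedup_file(src, dst):
    """Deduplicate one annotation file and write pretty-printed XML to dst."""
    tree = ET.parse(src)
    removed = dedup_tree(tree)
    ET.indent(tree)  # pretty-printing helper, available from Python 3.9
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return removed
```

The first occurrence of each object is kept and later repeats are dropped, so the order of the remaining objects in the file does not change.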
Safe Processing
Use the --output-dir option to keep your original files unchanged. If --output-dir is not specified, the original files are overwritten in place.
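The choice between writing to a separate directory and updating in place comes down to where each result is written. The sketch below shows one way that decision could look; `resolve_output_path` is a hypothetical helper, and keeping the original file name inside the output directory is an assumption.

```python
from pathlib import Path


def resolve_output_path(src, output_dir=None):
    """Return the path where the deduplicated copy of src should be written."""
    src = Path(src)
    if output_dir is None:
        # No output directory given: the original file is overwritten.
        return src
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)  # create the directory if needed
    # Assumption: the output keeps the same file name as the input.
    return out / src.name
```

With this layout, annotations from different source directories that share a file name would overwrite each other in the output directory, which is worth checking before processing a large dataset.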
Examples
Deduplicate all annotations in a directory, updating files in-place:
Deduplicate and save results to a new directory:
Deduplicate a single file:
Enable verbose logging for troubleshooting:
Notes
- The tool works with both single files and directories
- If no files are provided, or no XML files are found, an error is displayed
- Directories cannot be processed together with multiple files
- Deduplication is based on object name and bounding box coordinates
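Accepting either a single file or a directory, and failing when no XML files are found, could be handled along these lines. This is a sketch under assumptions: `collect_xml_files` is a hypothetical name, the directory scan is non-recursive, and the exception types stand in for whatever error the command actually displays.

```python
from pathlib import Path


def collect_xml_files(target):
    """Return the XML files named by target (a single file or a directory)."""
    target = Path(target)
    if target.is_file():
        if target.suffix.lower() != ".xml":
            raise ValueError(f"Not an XML file: {target}")
        return [target]
    if target.is_dir():
        # Assumption: only the top level of the directory is scanned.
        files = sorted(
            p for p in target.iterdir()
            if p.is_file() and p.suffix.lower() == ".xml"
        )
        if not files:
            raise FileNotFoundError(f"No XML files found in {target}")
        return files
    raise FileNotFoundError(f"Path does not exist: {target}")
```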
For more details, see the source code in m3_download/cli/dedup_voc.py.