Prepare your data

Overview

Your data needs to be organized into separate folders for training, validation, and test. This is the input format required by the popular Ultralytics YOLOv5 model. The split command does this for you.

Assuming that you have downloaded your images and YOLO Darknet TXT files into separate images/ and labels/ folders,

data
├── images
│   ├── image1.jpg
│   └── image2.jpg
└── labels
    ├── image1.txt
    └── image2.txt
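
Before splitting, it can help to confirm that each image has a matching label file. The following is a minimal Python sketch and is not part of deepsea-ai; the function name find_unpaired and the set of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumption: the image extensions in use; extend this set as needed.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired(data_dir):
    """Return (images without labels, labels without images) as sorted stem lists."""
    data_dir = Path(data_dir)
    # Compare files by stem, so image1.jpg pairs with image1.txt.
    image_stems = {
        p.stem
        for p in (data_dir / "images").iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    }
    label_stems = {p.stem for p in (data_dir / "labels").glob("*.txt")}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

For the layout above, find_unpaired("data") should return two empty lists. An image without a label file may be intentional, for example a background image with no objects, so treat the first list as something to review rather than an error.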

Run the split command

Run split on the data folder and write the results to the split folder:

deepsea-ai split -i data -o split
Autosplitting images from data
100%|██████████| 160/160 [00:00<00:00, 28060.24it/s]
Splitting autosplit_train.txt
132it [00:00, 428.74it/s]
Splitting autosplit_val.txt
19it [00:00, 259.01it/s]
Splitting autosplit_test.txt
9it [00:00, 295.15it/s]
Creating split/labels.tar.gz...
Creating split/images.tar.gz...
Done

You should now have compressed labels and images that have been split into training/validation/test sets by roughly 85%/10%/5% respectively (in the run above, 132, 19, and 9 of the 160 images). These compressed files are used in the train command.

split
├── images.tar.gz
└── labels.tar.gz
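
To check what ended up in the archives before training, they can be listed with Python's standard tarfile module. This is a small sketch; the helper name list_archive is hypothetical, and the internal folder layout of the archives is not described here, so inspect the output rather than assume it.

```python
import itertools
import tarfile


def list_archive(path, limit=10):
    """Return up to `limit` member names from a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        # Iterating a TarFile yields TarInfo objects one at a time,
        # so large archives are not read in full.
        return [member.name for member in itertools.islice(tar, limit)]
```

For example, list_archive("split/labels.tar.gz") returns the first ten entries of the labels archive.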

Lastly, a plain text file is needed to map the numeric class IDs in labels/ to class names. It is passed in with the --label-map option. The file lists the YOLO class names in the order of the training label indexes: the first line names class 0, the second line names class 1, and so on.

For example, if you have only two classes Benthocodon and Nanomia which map to classes 0 and 1 respectively, the class file looks like this, with one class per line:

Benthocodon
Nanomia
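
The mapping this file implies can be sketched in a few lines of Python. This illustrates how line position corresponds to class ID; it is not deepsea-ai's own loader, and the function name load_label_map is hypothetical.

```python
from pathlib import Path


def load_label_map(path):
    """Read one class name per line; the zero-based line position is the class ID."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Skip blank lines so a trailing newline does not create an empty class.
    names = [line.strip() for line in lines if line.strip()]
    return dict(enumerate(names))
```

For the two-class file above, this gives {0: "Benthocodon", 1: "Nanomia"}.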

Why do I need a label-map file?

YOLOv5 uses an annotation format similar to YOLO Darknet TXT, with the addition of a YAML file that contains the model configuration and class values. A deepsea-ai processing job autogenerates the YAML file with the help of this simple .txt file. Important: the class names must be listed in the same order as the class indexes used in the YOLO Darknet TXT files. Pay attention to the order, as getting it wrong is a common mistake.
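
One partial safeguard is to check that every class ID used in the label files has a line in the label-map file. The sketch below assumes the usual YOLO Darknet TXT layout, where each line starts with an integer class ID followed by the normalized box coordinates; the function names are hypothetical and this is not a deepsea-ai feature.

```python
from pathlib import Path


def class_ids_in_labels(labels_dir):
    """Collect every class ID that appears as the first field of a label line."""
    ids = set()
    for label_file in Path(labels_dir).glob("*.txt"):
        for line in label_file.read_text(encoding="utf-8").splitlines():
            fields = line.split()
            if fields:  # ignore blank lines
                ids.add(int(fields[0]))
    return ids


def ids_missing_from_map(labels_dir, num_classes):
    """Return the class IDs used in the labels that fall outside 0..num_classes-1."""
    used = class_ids_in_labels(labels_dir)
    return sorted(i for i in used if not 0 <= i < num_classes)
```

With the two-class example, ids_missing_from_map("data/labels", 2) should return an empty list. This catches a label-map file that is too short, but it cannot detect names listed in the wrong order, so the order still needs a manual check.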


Updated: 2024-08-14