Split

The split command randomly divides a dataset into training, validation, and test sets. By default, it uses a split ratio of 85% training, 10% validation, and 5% test.

Prerequisites

The input directory must be organized with images/ and labels/ subfolders:

dataset_root/
├── images/
│   ├── image1.jpg
│   ├── image2.jpg
│   └── ...
└── labels/
    ├── image1.txt
    ├── image2.txt
    └── ...

The labels must be in YOLO format (.txt files), and each label file must share its base name with the image it describes (for example, image1.txt for image1.jpg).
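Before running the command, it can help to confirm that every image has a matching label. The following is a minimal sketch, not part of aidata: the function name `find_unpaired` and the set of image extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired(dataset_root):
    """Return (images_without_label, labels_without_image) as sorted base names."""
    root = Path(dataset_root)
    images_dir = root / "images"
    labels_dir = root / "labels"
    for required in (images_dir, labels_dir):
        if not required.is_dir():
            raise FileNotFoundError(f"missing required subfolder: {required}")

    # Compare base names (file name without extension) across the two folders.
    image_stems = {p.stem for p in images_dir.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES}
    label_stems = {p.stem for p in labels_dir.iterdir() if p.suffix.lower() == ".txt"}
    return sorted(image_stems - label_stems), sorted(label_stems - image_stems)
```

Calling `find_unpaired("./my_dataset")` returns two lists; both are empty when every image and label is paired.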

Usage

To split your dataset, use the split command with the --input (or -i) and --output (or -o) options.

aidata split --input ./my_dataset --output ./my_dataset_split

Output

The command generates two compressed tarballs in the output directory:

images.tar.gz: Contains the split images organized into train/, val/, and test/ subfolders.
labels.tar.gz: Contains the split labels organized into train/, val/, and test/ subfolders.

The resulting structure inside the tarballs (when extracted) will look like this:

images/
├── train/
├── val/
└── test/

labels/
├── train/
├── val/
└── test/
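To inspect the result, both tarballs can be unpacked with Python's standard `tarfile` module. This is a sketch under the assumption that the archives are named as described above; `extract_split` and the destination folder are hypothetical names, not part of aidata.

```python
import tarfile
from pathlib import Path


def extract_split(output_dir, dest_dir):
    """Extract images.tar.gz and labels.tar.gz from output_dir into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for name in ("images.tar.gz", "labels.tar.gz"):
        with tarfile.open(Path(output_dir) / name, "r:gz") as archive:
            archive.extractall(dest)
    # Return the extracted folders (relative, POSIX-style) for a quick check.
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_dir())
```

For example, `extract_split("./my_dataset_split", "./extracted")` should list folders such as `images/train` and `labels/val` if the archives follow the layout shown above.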

Additionally, the command creates three text files in the input directory listing the files assigned to each split:

- autosplit_train.txt
- autosplit_val.txt
- autosplit_test.txt
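These listing files make it easy to check how many files landed in each split. The sketch below assumes each file holds one entry per line, which this page does not specify; `split_counts` is a hypothetical helper, not part of aidata.

```python
from pathlib import Path

SPLIT_FILES = {
    "train": "autosplit_train.txt",
    "val": "autosplit_val.txt",
    "test": "autosplit_test.txt",
}


def split_counts(input_dir):
    """Count non-empty lines in each autosplit_*.txt file in input_dir."""
    counts = {}
    for split, filename in SPLIT_FILES.items():
        # Assumption: one listed file per line; blank lines are ignored.
        lines = (Path(input_dir) / filename).read_text().splitlines()
        counts[split] = sum(1 for line in lines if line.strip())
    return counts
```

With the default ratio, the three counts should be roughly 85%, 10%, and 5% of the dataset.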

Note

The split is reproducible as it uses a fixed random seed (0).
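To illustrate why a fixed seed makes the split reproducible, here is a minimal sketch of a seeded 85/10/5 split. It is not aidata's actual implementation, whose internals are not described here; the shuffle-and-slice approach and the name `seeded_split` are assumptions chosen for clarity.

```python
import random


def seeded_split(names, ratios=(0.85, 0.10, 0.05), seed=0):
    """Shuffle names with a fixed seed and cut them into train/val/test lists."""
    ordered = sorted(names)  # sort first so the input order does not matter
    random.Random(seed).shuffle(ordered)  # same seed gives the same order
    n_train = round(len(ordered) * ratios[0])
    n_val = round(len(ordered) * ratios[1])
    return {
        "train": ordered[:n_train],
        "val": ordered[n_train:n_train + n_val],
        "test": ordered[n_train + n_val:],  # remainder goes to test
    }
```

Running `seeded_split` twice on the same file names yields identical assignments, which is the property the note describes.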

last updated: 2026-02-08