Flexible Taxonomies in Machine Learning
While scientific taxonomies prioritize stable, hierarchical categorization for knowledge preservation, machine learning taxonomies serve a fundamentally different purpose: they are tools for building effective systems. In fact, most commercial ML tools do not enforce a taxonomy at all; they allow arbitrary labels to be assigned to any input.
A taxonomy in a machine learning system is best understood as a performance optimization tool. In machine learning, each taxon is called a "label", and a single input may carry more than one label.
MBARI projects should not take on the overhead of a rigid taxonomy, which can slow down development and training, unless one is needed to compare performance across multiple models or projects.
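As a rough illustration of labels that are not tied to a hierarchy, the following minimal sketch stores one or more free-form labels per input and counts how often each occurs. The `Annotation` class, the file names, and the label strings are hypothetical and are not taken from any MBARI tool.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """One labeled input; `labels` holds one or more free-form strings."""
    source: str  # hypothetical image or audio file name
    labels: list[str] = field(default_factory=list)


# Hypothetical annotations: nothing constrains the label strings to a hierarchy.
annotations = [
    Annotation("frame_0001.png", ["bubble"]),
    Annotation("frame_0002.png", ["aggregate"]),
    Annotation("frame_0003.png", ["aggregate", "blurry"]),  # two labels, one input
]

# Count how often each label occurs across all inputs.
label_counts = Counter(label for a in annotations for label in a.labels)
print(label_counts)
```

A flat list of strings is enough here because nothing in the sketch depends on parent and child relationships between labels.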
Here are a few examples of how this is used in MBARI projects:
The Unknown
An Unknown category may be one that is not present in the training data. It could represent a new species or a new type of object, but in the truest sense it is something that is unknown. The examples below from the i2MAP project show objects that may be of interest but cannot be identified. The i2MAP project refers to these as "marine organism" or "marine snow"; in the machine learning domain they are referred to as "Unknown". This category is not particularly useful for training a classification model, but it can be useful for tracking algorithms or for exploring the data in other ways.
| Unknown1 | Unknown2 | Unknown3 |
|---|---|---|
| ![]() | ![]() | ![]() |
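One way to act on this distinction is to map project-specific names onto the "Unknown" label, leave those items out of classifier training, and keep them for tracking. The sketch below assumes a simple list of (file name, label) pairs; the helper `to_ml_label`, the file names, and the class names "copepod" and "krill" are hypothetical.

```python
# Project-specific names that this sketch treats as the ML label "Unknown".
UNKNOWN_ALIASES = {"marine organism", "marine snow"}


def to_ml_label(project_label: str) -> str:
    """Map a project label to "Unknown" if it is one of the aliases."""
    return "Unknown" if project_label in UNKNOWN_ALIASES else project_label


# Hypothetical detections as (file name, project label) pairs.
detections = [
    ("crop_01.png", "copepod"),
    ("crop_02.png", "marine organism"),
    ("crop_03.png", "marine snow"),
    ("crop_04.png", "krill"),
]

labeled = [(name, to_ml_label(label)) for name, label in detections]

# Classifier training skips Unknown items; tracking keeps every detection.
training_set = [(name, label) for name, label in labeled if label != "Unknown"]
tracking_set = labeled

print(len(training_set), "items for training,", len(tracking_set), "for tracking")
```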
The Noise
What scientists might dismiss as "noise" or "irrelevant data" may often make up the majority of real-world data. For example, an artifact from a camera lens, a bubble in underwater imagery, or a sound that can be confused with the sound of interest in audio data should be captured as a separate category.
🧪 Aggregate?
Thousands of bubbles in the ISIIS instrument data needed to be separated into their own bubble class, because bubbles can be confused with important scientific data such as aggregates.
| Dense Aggregate | Bubble |
|---|---|
| ![]() | ![]() |
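Once noise has its own class, it can be set aside after classification rather than being mistaken for a class of scientific interest. The following minimal sketch assumes a list of (region identifier, predicted label) pairs; the identifiers and the contents of `NOISE_CLASSES` are assumptions made for illustration only.

```python
# Assumption: the set of predicted labels that this sketch treats as noise.
NOISE_CLASSES = {"bubble"}

# Hypothetical classifier output as (region identifier, predicted label) pairs.
predictions = [
    ("roi_001", "dense aggregate"),
    ("roi_002", "bubble"),
    ("roi_003", "bubble"),
    ("roi_004", "dense aggregate"),
    ("roi_005", "bubble"),
]

# Separate noise from the regions passed on for scientific analysis.
noise = [(roi, label) for roi, label in predictions if label in NOISE_CLASSES]
kept = [(roi, label) for roi, label in predictions if label not in NOISE_CLASSES]

noise_fraction = len(noise) / len(predictions)
print(f"{len(kept)} regions kept, {noise_fraction:.0%} of predictions were noise")
```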
🐋 🎵 Blue Whale or Ship Noise?
This approach was used in a project to build a binary classification model that separates true Blue Whale A calls from false detections, such as ship noise, that resemble them. With the two classes separated this way, the model reached 94.5% accuracy on validation data.
| Blue Whale A call | False Blue Whale A call |
|---|---|
| ![]() | ![]() |
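For reference, accuracy for a two-class model is the fraction of validation examples whose predicted class matches the true class. The sketch below shows that calculation on made-up values; it does not reproduce the project's data or its reported figure, and the `accuracy` helper is hypothetical.

```python
def accuracy(y_true: list[bool], y_pred: list[bool]) -> float:
    """Return the fraction of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up validation labels:
# True = "Blue Whale A call", False = "False Blue Whale A call".
y_true = [True, True, False, False, True, False, True, False]
y_pred = [True, True, False, True, True, False, False, False]

print(f"validation accuracy: {accuracy(y_true, y_pred):.3f}")
```

Accuracy alone can be misleading when one class dominates the validation set, so per-class counts are worth checking alongside it.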
🗓️ Updated: 2025-12-12






