Meta Generation
The pbp-meta-gen command-line program generates JSON files with audio metadata. This step is required before running the main HMB generation program, in order to extract and optionally correct the time data.
It also generates an overview of the recording coverage for the specified date range, which can be used to identify gaps in the data, help select the data to process, or spot issues with the data before processing.
Instructions below assume you have already installed the package, e.g. with pip install mbari-pbp.
Once this is done, you can proceed to the main program pbp-hmb-gen.
Overview
Three types of audio recorders are supported: NRS, IcListen, and Soundtrap. Here is the currently supported matrix:
Recorder | Google Storage | AWS S3 | Local Storage |
---|---|---|---|
NRS | |||
IcListen | |||
Soundtrap |
For audio that is stored in a cloud storage bucket, the URI that is required to access the audio files depends on the cloud storage provider. The data must be stored in a public cloud storage bucket; private buckets are not supported.
- For Google Storage, use the gs: prefix, e.g. gs://noaa-passive-bioacoustic/nrs/audio/11/nrs_11_2019-2021/audio.
- For AWS S3, use the s3: prefix, e.g. s3://pacific-sound-256khz.
- For local files, the URI is the path to the directory where the audio files are stored, with the file: prefix, e.g. file:///Volumes/PAM_Archive/FK01 or file:\\Users\dcline\PAM_Archive\FK01.
Note the triple slash after the prefix for a local archive file:///Volumes. This is required for the URI to be parsed correctly.
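To see why the triple slash matters, the sketch below runs Python's standard urllib.parse.urlsplit on the example URIs. It only illustrates generic URI parsing and is not necessarily how pbp-meta-gen parses its --uri argument; the describe_uri helper is hypothetical.

```python
from urllib.parse import urlsplit


def describe_uri(uri):
    """Return (scheme, bucket_or_host, path) for a gs:, s3:, or file: URI.

    Hypothetical helper for illustration only; not part of the pbp package.
    """
    parts = urlsplit(uri)
    return parts.scheme, parts.netloc, parts.path


# Cloud URIs: the bucket name lands in the network-location slot.
print(describe_uri("s3://pacific-sound-256khz"))

# Three slashes: the host is empty, so the full absolute path is preserved.
print(describe_uri("file:///Volumes/PAM_Archive/FK01"))

# Two slashes: "Volumes" is read as a host name and drops out of the path.
print(describe_uri("file://Volumes/PAM_Archive/FK01"))
```

With only two slashes, the first path component is treated as a host name, so the directory that a generic parser sees no longer starts at /Volumes.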
Examples
Note
The prefix for a file is the string that matches the beginning of the file name before the timestamp. For example, if the file name is ONMS_FK01_7412_20230315_000000.wav, the prefix would be ONMS_FK01_7412_ or ONMS_FK01_7412; NRS11_20191024_022220.flac would have a prefix of NRS11_; and MARS_20220902_000000.wav would have a prefix of MARS_ or MARS.
Timestamp handling in the file name is flexible, so all of the following file names are valid:
NRS11_20191024_022220.flac
NRS11_191024T022220Z.flac
NRS11_20191024T022220Z.wav
NRS11_20191024022220.wav
NRS11_20191024T022220Z.d100.x.wav
NRS11_191024T022220Z.d100.x.wav
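As a rough illustration of how a prefix and a flexible timestamp can be pulled out of these names, here is a minimal Python sketch. It is not the parser used by pbp-meta-gen: the regular expression, the parse_start_time helper, and the assumption that every timestamp is UTC (even without a trailing Z) are all introduced here for illustration.

```python
import re
from datetime import datetime, timezone

# Optional underscore, a date of 8 (YYYYMMDD) or 6 (YYMMDD) digits,
# an optional "_" or "T" separator, a 6-digit time, and an optional "Z".
_TIMESTAMP = re.compile(r"_?(\d{8}|\d{6})[_T]?(\d{6})Z?")


def parse_start_time(file_name, prefix):
    """Return the start time encoded in file_name, or None if it does not match.

    Hypothetical helper: assumes the timestamp directly follows the prefix
    and that all timestamps are in UTC.
    """
    if not file_name.startswith(prefix):
        return None
    match = _TIMESTAMP.match(file_name[len(prefix):])
    if match is None:
        return None
    date_part, time_part = match.groups()
    date_format = "%Y%m%d" if len(date_part) == 8 else "%y%m%d"
    parsed = datetime.strptime(date_part + time_part, date_format + "%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)


for name in [
    "NRS11_20191024_022220.flac",
    "NRS11_191024T022220Z.flac",
    "NRS11_20191024022220.wav",
    "NRS11_191024T022220Z.d100.x.wav",
]:
    print(name, parse_start_time(name, "NRS11_"))
```

Because the leading underscore is optional in the pattern, the sketch accepts a prefix written with or without the trailing underscore, e.g. NRS11_ or NRS11.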
Generate JSONs with audio metadata from NRS flac files for a date range
The following command generates JSON files in the json/nrs directory only for files in gs://noaa-passive-bioacoustic/nrs/audio/11/nrs_11_2019-2021/audio that include the file string NRS11, for the specified date range. Logs will be stored in the output directory.
pbp-meta-gen --recorder=NRS \
--json-base-dir=json/nrs \
--output-dir=output \
--uri=gs://noaa-passive-bioacoustic/nrs/audio/11/nrs_11_2019-2021/audio \
--start=20191023 \
--end=20191024 \
--prefix=NRS11_
If your data is stored locally on Windows, e.g. in your \Users\dcline\Downloads directory, the command might look something like:
pbp-meta-gen --recorder=NRS \
--json-base-dir=json/nrs \
--output-dir=output \
--uri=file:\\Users\dcline\Downloads\ \
--start=20191023 \
--end=20191024 \
--prefix=NRS11_
After running this command, you should see two JSON files in the json/nrs directory, one for each day of the date range.
json/nrs/
└── 2019
├── 20191023.json
└── 20191024.json
output/
├── NRS20191023_20191024.log
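The layout above suggests one JSON file per day, grouped by year, plus a single log file named after the recorder and the date range. The sketch below reproduces that naming pattern in Python so the expected files can be checked after a run. The pattern is inferred from the listings on this page only, and the expected_outputs helper is hypothetical; days without any matching audio may not produce a file.

```python
from datetime import datetime, timedelta
from pathlib import PurePosixPath


def expected_outputs(json_base_dir, output_dir, recorder, start, end):
    """List the JSON files and the log file expected for a date range.

    Hypothetical helper; start and end are YYYYMMDD strings, as passed
    to --start and --end. The naming pattern is an inference, not an API.
    """
    first = datetime.strptime(start, "%Y%m%d").date()
    last = datetime.strptime(end, "%Y%m%d").date()
    json_files = []
    day = first
    while day <= last:
        # <json_base_dir>/<YYYY>/<YYYYMMDD>.json
        json_files.append(
            PurePosixPath(json_base_dir) / f"{day:%Y}" / f"{day:%Y%m%d}.json"
        )
        day += timedelta(days=1)
    # <output_dir>/<RECORDER><start>_<end>.log
    log_file = PurePosixPath(output_dir) / f"{recorder}{start}_{end}.log"
    return json_files, log_file


json_files, log_file = expected_outputs(
    "json/nrs", "output", "NRS", "20191023", "20191024"
)
for path in json_files:
    print(path)
print(log_file)
```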
Generate JSONs with audio metadata from IcListen wav files for a date range
The following command generates JSON files in the json/iclisten directory only for files in s3://pacific-sound-256khz that include the file string MARS, for the specified date range. Logs will be stored in the output directory.
The MARS data is recorded in 10-minute intervals (up to 144 files per day), and the Pacific Sound archive holds many files, so this command takes a while to run. This would be a good time to go get a cup of coffee.
pbp-meta-gen --recorder=ICLISTEN \
--json-base-dir=json/iclisten \
--output-dir=output \
--uri=s3://pacific-sound-256khz \
--start=20191023 \
--end=20191024 \
--prefix=MARS
You should see two JSON files in the json/iclisten directory, one for each day of the date range.
json/iclisten/
└── 2019
├── 20191023.json
└── 20191024.json
output/
├── ICLISTEN20191023_20191024.log
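Generate JSONs with audio metadata from Soundtrap wav files for a date range
The following command generates JSON files in the json/FK01 directory for files in the local /Volumes/PAM_Archive/FK01 directory whose names start with ONMS_FK01_7412, for the specified date range. Logs will be stored in the logs/json/FK01 directory.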
pbp-meta-gen --recorder=SOUNDTRAP \
--json-base-dir=json/FK01 \
--output-dir=logs/json/FK01 \
--uri=file:///Volumes/PAM_Archive/FK01 \
--start=20230315 \
--end=20230316 \
--prefix=ONMS_FK01_7412
JSON format
Why JSON?
We chose JSON files to store the metadata because they are human-readable, easy to parse, and easy to integrate into a larger data processing pipeline.
The JSON file schema is as follows:
Field | Description |
---|---|
channels | The number of channels in the audio file. |
uri | The location of the audio file. This is a URI that can be used to access the file in a public cloud storage bucket or local file system. |
start | The start time of the audio file in ISO 8601 format. |
end | The end time of the audio file in ISO 8601 format. |
duration_secs | The duration of the audio file in seconds. |
[
{
"uri": "gs://noaa-passive-bioacoustic/nrs/audio/11/nrs_11_2019-2021/audio/NRS11_20191023_222213.flac",
"start": "2019-10-23T22:22:13Z",
"end": "2019-10-24T02:22:13Z",
"duration_secs": 14400,
"channels": 1
}
]
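Since the files are plain JSON, they can be inspected with a few lines of Python. The sketch below loads the sample record shown above, totals the recorded hours, and flags any record whose duration_secs disagrees with its start and end times. The summarize helper and the one-second tolerance are assumptions made for this example, not part of the package.

```python
import json
from datetime import datetime

# The sample record from this page, embedded so the sketch is self-contained.
SAMPLE = """
[
  {
    "uri": "gs://noaa-passive-bioacoustic/nrs/audio/11/nrs_11_2019-2021/audio/NRS11_20191023_222213.flac",
    "start": "2019-10-23T22:22:13Z",
    "end": "2019-10-24T02:22:13Z",
    "duration_secs": 14400,
    "channels": 1
  }
]
"""


def parse_iso_utc(text):
    """Parse an ISO 8601 timestamp that ends in 'Z' (UTC)."""
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


def summarize(records, tolerance_secs=1.0):
    """Return (total_seconds, uris_with_inconsistent_duration)."""
    total = 0.0
    inconsistent = []
    for record in records:
        span = parse_iso_utc(record["end"]) - parse_iso_utc(record["start"])
        total += record["duration_secs"]
        if abs(span.total_seconds() - record["duration_secs"]) > tolerance_secs:
            inconsistent.append(record["uri"])
    return total, inconsistent


total_secs, inconsistent = summarize(json.loads(SAMPLE))
print(f"{total_secs / 3600:.1f} hours of audio, {len(inconsistent)} inconsistent records")
```

To run the same check on a generated file, replace json.loads(SAMPLE) with json.load applied to an open file such as json/nrs/2019/20191023.json.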
Need help? Try the --help option
$ pbp-meta-gen --help
usage: pbp-meta-gen [-h] [--version] --recorder {NRS,ICLISTEN,SOUNDTRAP} --json-base-dir dir --output-dir dir --uri uri --start YYYYMMDD --end YYYYMMDD --prefix PREFIX [PREFIX ...]
Generate JSONs with audio metadata for NRS flac files, IcListen wav files, and Soundtrap wav files from either a local directory or gs/s3 bucket.
options:
-h, --help show this help message and exit
--version show program's version number and exit
--recorder {NRS,ICLISTEN,SOUNDTRAP}
Choose the audio instrument type
--json-base-dir dir JSON base directory to store the metadata
--output-dir dir Output directory to store logs
--uri uri Location of the audio files. S3 location supported for IcListen or Soundtrap, and GS supported for NRS.
--start YYYYMMDD The starting date to be processed.
--end YYYYMMDD The ending date to be processed.
--prefix PREFIX [PREFIX ...]
Prefix for search to match the audio files. Assumption is the prefix is separated by an underscore, e.g. 'MARS_'.
Examples:
pbp-meta-gen \
--json-base-dir=tests/json/nrs \
--output-dir=output \
--uri=s3://pacific-sound-ch01 \
--start=20220902 \
--end=20220902 \
--prefix=MARS \
--recorder=NRS
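When pbp-meta-gen is driven from a script, it can help to assemble the arguments programmatically. The Python sketch below builds the argument list from the options listed in the help text above and prints the equivalent shell command; it does not run anything. The build_meta_gen_command helper is hypothetical and is not part of the package.

```python
import shlex

RECORDERS = {"NRS", "ICLISTEN", "SOUNDTRAP"}


def build_meta_gen_command(recorder, json_base_dir, output_dir, uri,
                           start, end, prefixes):
    """Build a pbp-meta-gen argument list (hypothetical helper).

    start and end are YYYYMMDD strings; prefixes is a list of one or
    more prefix strings, matching --prefix PREFIX [PREFIX ...].
    """
    if recorder not in RECORDERS:
        raise ValueError(f"unsupported recorder: {recorder}")
    if not prefixes:
        raise ValueError("at least one prefix is required")
    return [
        "pbp-meta-gen",
        f"--recorder={recorder}",
        f"--json-base-dir={json_base_dir}",
        f"--output-dir={output_dir}",
        f"--uri={uri}",
        f"--start={start}",
        f"--end={end}",
        "--prefix",
        *prefixes,
    ]


command = build_meta_gen_command(
    recorder="ICLISTEN",
    json_base_dir="json/iclisten",
    output_dir="output",
    uri="s3://pacific-sound-256khz",
    start="20191023",
    end="20191024",
    prefixes=["MARS"],
)
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) once the package is installed; passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.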