API reference

This section describes the specifications of the Cochlear.ai Sense models and gives an example output for each. The output examples assume that the input is either a 3-second audio file or a 1-second audio stream.

speech_detector

Input type: audio file
Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz (recommended) or higher
Output example:
{"result": [{"speech": [0.972, 0.995, 1.0, 0.994, 0.992]}]}
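
The number of values in the output follows from the prediction unit and the inter-prediction duration. The sketch below is a minimal illustration, assuming that windows start at 0 s, advance by the inter-prediction duration, and that only full windows are scored. The function name `expected_predictions` is hypothetical and not part of the Sense API.

```python
def expected_predictions(audio_seconds, unit=1.0, hop=0.5):
    """Count full prediction windows of length `unit` spaced `hop` apart.

    Assumption: windows start at 0 s and partial windows are dropped.
    """
    if audio_seconds < unit:
        return 0
    # Small epsilon guards against floating-point error in the division.
    return int((audio_seconds - unit) / hop + 1e-9) + 1


print(expected_predictions(3.0))  # 5
```

For the 3-second input assumed in the examples, this gives 5, which matches the five values in the output above.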

music_detector

Input type: audio file
Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz (recommended) or higher
Output example:
{"result": [{"music": [0.602, 0.789, 0.515, 0.866, 1.0]}]}

age_gender

This model estimates the probability that an input source is an adult male voice, an adult female voice, or a child voice.

Input type: audio file
Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz (recommended) or higher
Output example:
{"result": [{"age/gender": "child", "probability": [0.173, 0.202, 0.336, 0.775, 0.997]},
            {"age/gender": "male", "probability": [0.654, 0.461, 0.125, 0.051, 0.001]},
            {"age/gender": "female", "probability": [0.173, 0.336, 0.539, 0.174, 0.002]}]}
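
Each entry in the result holds one class and its probability for every prediction window. As a rough sketch of how a client might reduce this to one label per window, the snippet below parses the example output with the standard `json` module and picks the most probable class at each index. The variable names are illustrative only.

```python
import json

response = json.loads("""
{"result": [{"age/gender": "child", "probability": [0.173, 0.202, 0.336, 0.775, 0.997]},
            {"age/gender": "male", "probability": [0.654, 0.461, 0.125, 0.051, 0.001]},
            {"age/gender": "female", "probability": [0.173, 0.336, 0.539, 0.174, 0.002]}]}
""")

# Map each class label to its list of per-window probabilities.
by_class = {item["age/gender"]: item["probability"] for item in response["result"]}

# For every prediction window, keep the label with the highest probability.
n_windows = len(next(iter(by_class.values())))
top_labels = [
    max(by_class, key=lambda label: by_class[label][i]) for i in range(n_windows)
]

print(top_labels)  # ['male', 'male', 'female', 'child', 'child']
```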

music_genre

The 23 available genre classes are:

'Traditional', 'Old-time', 'Pop', 'Rock', 'Electronic', 'R&B', 'World', 'Latin',
'Metal', 'Alternative', 'Hip-Hop', 'New-Age', 'Country', 'Jazz', 'Folk', 'Classical',
'Punk', 'Reggae', 'Blues', 'Dance', 'Ballad', 'Trot', 'Funk'

Input type: audio file
Prediction unit: entire audio
Inter-prediction duration: N/A
Sample rate: 22050 Hz (recommended) or higher
Output example:
{"result": [{"genre": ["Alternative", "Dance"], "probability": [0.443, 0.411]}]}
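
The `genre` and `probability` lists appear to be index-aligned, so the first probability belongs to the first genre. Under that assumption, a minimal sketch for pairing and ranking them might look like this:

```python
import json

response = json.loads(
    '{"result": [{"genre": ["Alternative", "Dance"], "probability": [0.443, 0.411]}]}'
)

entry = response["result"][0]
# Pair each genre label with the probability at the same index
# (assumed alignment), then sort from most to least probable.
ranked = sorted(
    zip(entry["genre"], entry["probability"]),
    key=lambda pair: pair[1],
    reverse=True,
)

for genre, prob in ranked:
    print(f"{genre}: {prob:.3f}")
```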

music_mood

The mood is represented as a point in the valence-arousal plane. “Valence” is a measure of an individual’s emotional state (high valence: positive, low valence: negative). “Arousal” is a measure of how energized an individual feels (high arousal: exciting, low arousal: calming).

Input type: audio file
Prediction unit: entire audio
Inter-prediction duration: N/A
Sample rate: 22050 Hz (recommended) or higher
Output example:
{"result": [{"arousal": [0.536], "valence": [0.029]}]}

music_tempo

The output lists the top two tempo candidates in BPM and their corresponding probabilities.

Input type: audio file
Prediction unit: entire audio
Inter-prediction duration: N/A
Sample rate: 22050 Hz or higher
Output example:
{"result": [{"tempo": [72.0, 36.0], "probability": [0.881, 0.119]}]}
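
In the example, the two candidates (72 and 36 BPM) differ by a factor of two, a common half/double-time ambiguity in tempo estimation. The following sketch, which assumes the `tempo` and `probability` lists are index-aligned, selects the most probable candidate and flags that relationship. The 0.05 tolerance is an arbitrary illustrative choice.

```python
import json

response = json.loads(
    '{"result": [{"tempo": [72.0, 36.0], "probability": [0.881, 0.119]}]}'
)

entry = response["result"][0]
candidates = list(zip(entry["tempo"], entry["probability"]))

# Keep the candidate with the highest probability.
best_bpm, best_prob = max(candidates, key=lambda c: c[1])

# Check whether the two candidates differ by roughly a factor of two.
low, high = sorted(entry["tempo"])
is_octave_pair = abs(high / low - 2.0) < 0.05

print(best_bpm, best_prob, is_octave_pair)  # 72.0 0.881 True
```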

music_key

The output gives the single most likely key and its corresponding probability.

Input type: audio file
Prediction unit: entire audio
Inter-prediction duration: N/A
Sample rate: 22050 Hz or higher
Output example:
{"result": [{"key": ["Gb"], "probability": [0.752]}]}

event

The event model requires the user to specify a subtask, which must be one of the following:

'babycry', 'carhorn', 'cough', 'dogbark', 'glassbreak', 'siren', 'snoring'

Input type: audio file
Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 22050 Hz (recommended) or higher
Output example:
{"result": [{"event": "babycry", "probability": [0.999, 1.0, 0.531, 0.091, 0.486]}]}
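
Per-window probabilities can be turned into time segments by thresholding and merging overlapping windows. The sketch below is one possible approach, not the method used by the service. It assumes that window `i` spans `[i * hop, i * hop + unit]` seconds, and the 0.5 threshold and the function name `detected_segments` are illustrative choices.

```python
def detected_segments(probabilities, threshold=0.5, unit=1.0, hop=0.5):
    """Merge above-threshold windows into (start, end) segments in seconds.

    Assumptions: window i spans [i * hop, i * hop + unit], and
    `threshold` is an arbitrary illustrative cut-off.
    """
    segments = []
    for i, p in enumerate(probabilities):
        if p < threshold:
            continue
        start, end = i * hop, i * hop + unit
        if segments and start <= segments[-1][1]:
            # Overlaps or touches the previous segment: extend it.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments


probs = [0.999, 1.0, 0.531, 0.091, 0.486]
print(detected_segments(probs))  # [(0.0, 2.0)]
```

With the example probabilities, the first three windows exceed the threshold and merge into a single segment covering the first two seconds.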

speech_detector_stream

Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz
Output example:
{"result": [{"speech": [0.972]}]}
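
Because each stream response carries a single probability, a client may want to smooth successive values before acting on them. The sketch below shows a simple moving average. The function name `smooth_stream` and the window size are hypothetical, and the list of responses is a stand-in that reuses the values from the speech_detector file example rather than real stream output.

```python
import json
from collections import deque


def smooth_stream(responses, window=3):
    """Yield a moving average over the latest `window` speech probabilities."""
    recent = deque(maxlen=window)
    for raw in responses:
        # Each stream response is assumed to hold exactly one probability.
        recent.append(json.loads(raw)["result"][0]["speech"][0])
        yield sum(recent) / len(recent)


# Stand-in for successive stream responses (illustrative, not real output).
responses = [
    '{"result": [{"speech": [%s]}]}' % p
    for p in (0.972, 0.995, 1.0, 0.994, 0.992)
]
smoothed = list(smooth_stream(responses))
print([round(s, 3) for s in smoothed])
```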

music_detector_stream

Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz
Output example:
{"result": [{"music": [0.602]}]}

age_gender_stream

This model estimates the probability that an input source is an adult male voice, an adult female voice, or a child voice.

Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 16000 Hz
Output example:
{"result": [{"age/gender": "child", "probability": [0.173]},
            {"age/gender": "male", "probability": [0.654]},
            {"age/gender": "female", "probability": [0.173]}]}

music_genre_stream

The 23 available genre classes are:

'Traditional', 'Old-time', 'Pop', 'Rock', 'Electronic', 'R&B', 'World', 'Latin',
'Metal', 'Alternative', 'Hip-Hop', 'New-Age', 'Country', 'Jazz', 'Folk', 'Classical',
'Punk', 'Reggae', 'Blues', 'Dance', 'Ballad', 'Trot', 'Funk'

Prediction unit: 3 seconds
Inter-prediction duration: 0.5 seconds
Sample rate: 22050 Hz
Output example:
{"result": [{"genre": ["Alternative", "Dance"], "probability": [0.443, 0.411]}]}

music_mood_stream

The mood is represented as a point in the valence-arousal plane. “Valence” is a measure of an individual’s emotional state (high valence: positive, low valence: negative). “Arousal” is a measure of how energized an individual feels (high arousal: exciting, low arousal: calming).

Prediction unit: 3 seconds
Inter-prediction duration: 0.5 seconds
Sample rate: 22050 Hz
Output example:
{"result": [{"arousal": [0.536], "valence": [0.029]}]}

event_stream

The event model requires the user to specify a subtask, which must be one of the following:

'babycry', 'carhorn', 'cough', 'dogbark', 'glassbreak', 'siren', 'snoring'

Prediction unit: 1 second
Inter-prediction duration: 0.5 seconds
Sample rate: 22050 Hz
Output example:
{"result": [{"event": "babycry", "probability": [0.999]}]}