Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges: the client sends StartRecognition, the server replies with RecognitionStarted, the client streams binary AddAudio chunks (each acknowledged with AudioAdded) while the server emits AddPartialTranscript and AddTranscript messages, and the client finishes with EndOfStream, after which the server sends EndOfTranscript and closes the session.
Browser based transcription
When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
Because the browser WebSocket API does not allow custom headers such as Authorization, you must provide the temporary key as a query parameter. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
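Assuming a temporary key has already been obtained, building this URL could look like the following sketch (the `realtime_url` helper name is an assumption, not part of any SDK):

```python
from urllib.parse import urlencode

def realtime_url(temporary_key: str,
                 base: str = "wss://eu2.rt.speechmatics.com/v2") -> str:
    """Build the Realtime WebSocket URL with the temporary key as a query parameter."""
    return f"{base}?{urlencode({'jwt': temporary_key})}"

url = realtime_url("my-temporary-key")
```

Passing the key through `urlencode` ensures any characters that are not URL-safe are percent-encoded.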
Handshake Responses
Successful Response
101 Switching Protocols - switch to the WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
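The Sec-WebSocket-Accept value is not arbitrary: per RFC 6455 it is derived from the client's Sec-WebSocket-Key, which is how the client verifies it is talking to a real WebSocket server. A sketch of that derivation:

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the WebSocket opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

WebSocket client libraries perform this check automatically; the sketch is only to show how the two headers in the example above relate.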
Malformed Request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
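One way to encode this recommendation in a client, as a sketch (the constant and helper names are assumptions):

```python
from typing import Optional

# Close codes for which a delayed reconnect is recommended:
# 4005 quota_exceeded, 4013 job_error, 1011 internal_error.
RETRYABLE_CLOSE_CODES = {4005, 4013, 1011}

def retry_delay(close_code: int, base_seconds: float = 5.0) -> Optional[float]:
    """Return how long to wait before reconnecting, or None if the client should not retry."""
    return base_seconds if close_code in RETRYABLE_CLOSE_CODES else None
```

For other close codes (e.g. 4001 not_authorised), retrying will not help and the helper returns None.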
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message we are sending. Any other fields depend on the value of the message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.
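In most WebSocket client libraries this distinction maps onto text frames (str) for JSON control messages and binary frames (bytes) for AddAudio. A sketch of preparing both (helper names are assumptions; last_seq_no is shown only as an illustrative extra field):

```python
import json

def control_frame(message: str, **fields) -> str:
    """Serialize a control message (e.g. EndOfStream) as a JSON text frame."""
    return json.dumps({"message": message, **fields})

def add_audio_frame(chunk: bytes) -> bytes:
    """AddAudio has no JSON wrapper: the audio chunk itself is the binary frame."""
    return chunk

text = control_frame("EndOfStream", last_seq_no=42)
binary = add_audio_frame(b"\x00\x01\x02\x03")
```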
The following values of the message field are supported:
Sent messages
StartRecognition
audio_format object required
type - Possible values: [raw, file]
encoding (when type is raw) - Possible values: [pcm_f32le, pcm_s16le, mulaw]
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
translation_config object
audio_events_config object
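Putting the schema above together, a StartRecognition message might look like the following sketch (values are illustrative; sample_rate is an assumed field for raw audio, and the exact required fields and nesting should be checked against the schema above):

```python
import json

# Illustrative StartRecognition message (sample_rate is an assumption).
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
    },
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "max_delay": 2.0,
        "operating_point": "enhanced",
        "conversation_config": {"end_of_utterance_silence_trigger": 0.75},
    },
}

payload = json.dumps(start_recognition)  # sent as a single text frame
```

This would be the first message sent after the WebSocket handshake completes; the server should answer with RecognitionStarted.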
AddAudio
EndOfStream
SetRecognitionConfig
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
Received messages
RecognitionStarted
AudioAdded
AddPartialTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[]required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
AddTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[]required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
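As a sketch of consuming these messages, a client might join the top alternative of each result, assuming the results[].alternatives[].content shape listed above (this deliberately ignores the attaches_to handling that punctuation results would need):

```python
def extract_text(add_transcript: dict) -> str:
    """Concatenate the top alternative of each result in an AddTranscript message."""
    words = []
    for result in add_transcript.get("results", []):
        alternatives = result.get("alternatives")
        if alternatives:  # punctuation-only results may differ; see attaches_to
            words.append(alternatives[0]["content"])
    return " ".join(words)

sample = {
    "message": "AddTranscript",
    "results": [
        {"type": "word", "alternatives": [{"content": "Hello"}]},
        {"type": "word", "alternatives": [{"content": "world"}]},
    ],
}
```

Because AddTranscript messages are final, appending their extracted text in arrival order yields the full transcript.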
AddPartialTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
AddTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
EndOfTranscript
AudioEventStarted
event object required
AudioEventEnded
event object required
EndOfUtterance
metadata object required
Info
type - Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Warning
type - Possible values: [duration_limit_exceeded]
Error
type - Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, timelimit_exceeded, quota_exceeded, unknown_error]
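Since every server message carries a message field, a client's receive loop typically dispatches on it. A minimal sketch covering only a couple of the message types above (handler names and structure are assumptions, not part of any SDK):

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[dict], Any]

def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str], Any]:
    """Return a function that routes a raw JSON text frame to its handler."""
    def dispatch(raw: str) -> Any:
        msg = json.loads(raw)
        handler = handlers.get(msg["message"])
        if handler is None:
            raise ValueError(f"Unhandled message: {msg['message']}")
        return handler(msg)
    return dispatch

dispatch = make_dispatcher({
    "Error": lambda m: m.get("type"),          # e.g. quota_exceeded
    "EndOfTranscript": lambda m: "done",       # session is complete
})
```

A real client would register one handler per received message type listed above, including AddTranscript and EndOfUtterance.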