Realtime API Reference
GET wss://eu2.rt.speechmatics.com/v2/
Protocol overview
A basic Realtime session will have the following message exchanges: the client sends StartRecognition, the server replies with RecognitionStarted, the client streams binary AddAudio chunks (each acknowledged with AudioAdded) while the server emits AddPartialTranscript and AddTranscript messages, and the client finishes with EndOfStream, after which the server sends EndOfTranscript and closes the session.
Browser based transcription
When starting a Realtime transcription session in the browser, temporary keys should be used to avoid exposing your long-lived API key.
Because the browser WebSocket API does not allow custom headers such as Authorization, you must provide the temporary key as a query parameter. For example:
wss://eu2.rt.speechmatics.com/v2?jwt=<temporary-key>
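Assuming a temporary key has already been obtained, building this URL could look like the following sketch (the `realtime_url` helper name is an assumption, not part of any SDK):

```python
from urllib.parse import urlencode

def realtime_url(temporary_key: str,
                 base: str = "wss://eu2.rt.speechmatics.com/v2") -> str:
    """Build the Realtime WebSocket URL with the temporary key as a query parameter."""
    return f"{base}?{urlencode({'jwt': temporary_key})}"

url = realtime_url("my-temporary-key")
```

Passing the key through `urlencode` ensures any characters that are not URL-safe are percent-encoded.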
Handshake Responses
Successful Response
101 Switching Protocols - switch to the WebSocket protocol
Here is an example for a successful WebSocket handshake:
GET /v2/ HTTP/1.1
Host: eu2.rt.speechmatics.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: ujRTbIaQsXO/0uCbjjkSZQ==
Sec-WebSocket-Version: 13
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Authorization: Bearer wmz9fkLJM6U5NdyaG3HLHybGZj65PXp
User-Agent: Python/3.8 websockets/8.1
A successful response should look like:
HTTP/1.1 101 Switching Protocols
Server: nginx/1.17.8
Date: Wed, 06 Jan 2021 11:01:05 GMT
Connection: upgrade
Upgrade: WebSocket
Sec-WebSocket-Accept: 87kiC/LI5WgXG52nSylnfXdz260=
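The Sec-WebSocket-Accept value is not arbitrary: per RFC 6455 it is derived from the client's Sec-WebSocket-Key, which is how the client verifies it is talking to a real WebSocket server. A sketch of that derivation:

```python
import base64
import hashlib

# Fixed GUID defined by RFC 6455 for the WebSocket opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"

def websocket_accept(sec_websocket_key: str) -> str:
    """Compute the Sec-WebSocket-Accept header for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")
```

WebSocket client libraries perform this check automatically; the sketch is only to show how the two headers in the example above relate.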
Malformed Request
A malformed handshake request will result in one of the following HTTP responses:
400 Bad Request
401 Unauthorized - when the API key is not valid
405 Method Not Allowed - when the request method is not GET
Client Retry
Following a successful handshake and switch to the WebSocket protocol, the client could receive an immediate error message and WebSocket close handshake from the server. For the following errors only, we recommend adding a client retry interval of at least 5-10 seconds:
4005 quota_exceeded
4013 job_error
1011 internal_error
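One way to encode this recommendation in a client, as a sketch (the constant and helper names are assumptions):

```python
from typing import Optional

# Close codes for which a delayed reconnect is recommended:
# 4005 quota_exceeded, 4013 job_error, 1011 internal_error.
RETRYABLE_CLOSE_CODES = {4005, 4013, 1011}

def retry_delay(close_code: int, base_seconds: float = 5.0) -> Optional[float]:
    """Return how long to wait before reconnecting, or None if the client should not retry."""
    return base_seconds if close_code in RETRYABLE_CLOSE_CODES else None
```

For other close codes (e.g. 4001 not_authorised), retrying will not help and the helper returns None.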
Message Handling
Each message that the Server accepts is a stringified JSON object with the following fields:
message (String): The name of the message we are sending. Any other fields depend on the value of the message and are described below.
The messages sent by the Server to a Client are stringified JSON objects as well.
The only exception is a binary message sent from the Client to the Server containing a chunk of audio which will be referred to as AddAudio.
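In most WebSocket client libraries this distinction maps onto text frames (str) for JSON control messages and binary frames (bytes) for AddAudio. A sketch of preparing both (helper names are assumptions; last_seq_no is shown only as an illustrative extra field):

```python
import json

def control_frame(message: str, **fields) -> str:
    """Serialize a control message (e.g. EndOfStream) as a JSON text frame."""
    return json.dumps({"message": message, **fields})

def add_audio_frame(chunk: bytes) -> bytes:
    """AddAudio has no JSON wrapper: the audio chunk itself is the binary frame."""
    return chunk

text = control_frame("EndOfStream", last_seq_no=42)
binary = add_audio_frame(b"\x00\x01\x02\x03")
```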
The following values of the message field are supported:
Sent messages
StartRecognition
audio_format object required
type - Possible values: [raw, file]
encoding (when type is raw) - Possible values: [pcm_f32le, pcm_s16le, mulaw]
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
translation_config object
audio_events_config object
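Putting the schema above together, a StartRecognition message might look like the following sketch (values are illustrative; sample_rate is an assumed field for raw audio, and the exact required fields and nesting should be checked against the schema above):

```python
import json

# Illustrative StartRecognition message (sample_rate is an assumption).
start_recognition = {
    "message": "StartRecognition",
    "audio_format": {
        "type": "raw",
        "encoding": "pcm_s16le",
        "sample_rate": 16000,
    },
    "transcription_config": {
        "language": "en",
        "diarization": "speaker",
        "max_delay": 2.0,
        "operating_point": "enhanced",
        "conversation_config": {"end_of_utterance_silence_trigger": 0.75},
    },
}

payload = json.dumps(start_recognition)  # sent as a single text frame
```

This would be the first message sent after the WebSocket handshake completes; the server should answer with RecognitionStarted.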
AddAudio
EndOfStream
SetRecognitionConfig
transcription_config object required
domain - Request a specialized model based on 'language' but optimized for a particular field, e.g. "finance" or "medical". Possible values: non-empty
additional_vocab object[]
Possible values: non-empty
Possible values: non-empty
Possible values: >= 1
diarization - Possible values: [none, speaker]
max_delay - Possible values: >= 0
max_delay_mode - Possible values: [flexible, fixed]
speaker_diarization_config object
max_speakers - Possible values: >= 2 and <= 100
Possible values: >= 0 and <= 1
audio_filtering_config object
volume_threshold - Possible values: >= 0 and <= 100
transcript_filtering_config object
A list of replacement rules to apply to the transcript. Each rule consists of a pattern to match and a replacement string.
operating_point - Possible values: [standard, enhanced]
punctuation_overrides object
The punctuation marks which the client is prepared to accept in transcription output, or the special value 'all' (the default). Unsupported marks are ignored. This value is used to guide the transcription process.
Possible values: Value must match regular expression ^(.|all)$
Ranges between zero and one. Higher values will produce more punctuation. The default is 0.5.
Possible values: >= 0 and <= 1
conversation_config object
This mode will detect when a speaker has stopped talking. The end_of_utterance_silence_trigger is the time in seconds after which the server will assume that the speaker has finished speaking, and will emit an EndOfUtterance message. A value of 0 disables the feature.
Possible values: >= 0 and <= 2
Default: 0
Received messages
RecognitionStarted
AudioAdded
AddPartialTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[]required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
AddTranscript
format - Speechmatics JSON output format version number. Example: 2.1
metadata object required
results object[]required
Possible values: [word, punctuation]
Possible values: [next, previous, none, both]
alternatives object[]
display
Possible values: [ltr, rtl]
Possible values: >= 0 and <= 1
Possible values: >= 0 and <= 100
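As a sketch of consuming these messages, a client might join the top alternative of each result, assuming the results[].alternatives[].content shape listed above (this deliberately ignores the attaches_to handling that punctuation results would need):

```python
def extract_text(add_transcript: dict) -> str:
    """Concatenate the top alternative of each result in an AddTranscript message."""
    words = []
    for result in add_transcript.get("results", []):
        alternatives = result.get("alternatives")
        if alternatives:  # punctuation-only results may differ; see attaches_to
            words.append(alternatives[0]["content"])
    return " ".join(words)

sample = {
    "message": "AddTranscript",
    "results": [
        {"type": "word", "alternatives": [{"content": "Hello"}]},
        {"type": "word", "alternatives": [{"content": "world"}]},
    ],
}
```

Because AddTranscript messages are final, appending their extracted text in arrival order yields the full transcript.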
AddPartialTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
AddTranslation
format - Speechmatics JSON output format version number. Example: 2.1
results object[] required
EndOfTranscript
AudioEventStarted
event object required
AudioEventEnded
event object required
EndOfUtterance
metadata object required
Info
type - Possible values: [recognition_quality, model_redirect, deprecated, concurrent_session_usage]
Warning
type - Possible values: [duration_limit_exceeded]
Error
type - Possible values: [invalid_message, invalid_model, invalid_config, invalid_audio_type, not_authorised, insufficient_funds, not_allowed, job_error, data_error, buffer_error, protocol_error, timelimit_exceeded, quota_exceeded, unknown_error]
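Since every server message carries a message field, a client's receive loop typically dispatches on it. A minimal sketch covering only a couple of the message types above (handler names and structure are assumptions, not part of any SDK):

```python
import json
from typing import Any, Callable, Dict

Handler = Callable[[dict], Any]

def make_dispatcher(handlers: Dict[str, Handler]) -> Callable[[str], Any]:
    """Return a function that routes a raw JSON text frame to its handler."""
    def dispatch(raw: str) -> Any:
        msg = json.loads(raw)
        handler = handlers.get(msg["message"])
        if handler is None:
            raise ValueError(f"Unhandled message: {msg['message']}")
        return handler(msg)
    return dispatch

dispatch = make_dispatcher({
    "Error": lambda m: m.get("type"),          # e.g. quota_exceeded
    "EndOfTranscript": lambda m: "done",       # session is complete
})
```

A real client would register one handler per received message type listed above, including AddTranscript and EndOfUtterance.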