Walter van Heuven

 

Speech2Text


Speech2Text provides a simple graphical user interface for several automatic speech recognition (ASR) systems and services: OpenAI's whisper, whisper.cpp, faster-whisper, the whisper ASR webservice, and the whisper API. Speech2Text transcribes the speech in audio and video files (e.g., .mp3, .m4a, .wav, .mp4, .mov). The output is a text file or a subtitle file (.vtt or .srt). When you select OpenAI's whisper, whisper.cpp, or faster-whisper, the ASR runs locally on your computer.
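The two subtitle formats differ mainly in their header and in the decimal marker of the timestamps: .srt numbers each cue and uses a comma before the milliseconds, while .vtt starts with a "WEBVTT" line and uses a period. The sketch below illustrates that difference only. It is not taken from Speech2Text; the function names and the (start, end, text) segment layout are assumptions made for the example.

```python
def format_timestamp(seconds: float, decimal_marker: str) -> str:
    """Format seconds as HH:MM:SS<marker>mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"


def to_srt(segments):
    """Build SRT text: numbered cues, comma before the milliseconds."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        timing = f"{format_timestamp(start, ',')} --> {format_timestamp(end, ',')}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)


def to_vtt(segments):
    """Build WebVTT text: 'WEBVTT' header, period before the milliseconds."""
    blocks = ["WEBVTT\n"]
    for start, end, text in segments:
        timing = f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}"
        blocks.append(f"{timing}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical transcript segments: (start in seconds, end in seconds, text)
segments = [(0.0, 2.5, "Hello"), (2.5, 5.0, "World")]
print(to_srt(segments))
print(to_vtt(segments))
```

Both functions take the same segment list, so a transcript can be written in either format from one transcription run.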

The app converts mp4, mov, avi, and m4a files to 16 kHz, 16-bit wav files if FFmpeg is installed on your computer. On macOS you can install FFmpeg with Homebrew (brew); on Windows, use Scoop or Chocolatey. If you use the whisper ASR webservice or the whisper API, there is no need to install FFmpeg because the file conversion takes place on the server.
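For readers who want to prepare files by hand, an equivalent conversion can be sketched as follows. This is an illustration under assumptions, not the command Speech2Text itself runs: the helper names are hypothetical, and the mono downmix (`-ac 1`) is an assumption, since the text above only specifies 16 kHz and 16-bit.

```python
import shutil
from pathlib import Path
from typing import List, Optional


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


def build_ffmpeg_command(source: str, target: Optional[str] = None) -> List[str]:
    """Build an FFmpeg argument list that converts `source` to a 16 kHz, 16-bit WAV."""
    src = Path(source)
    # Default output: same name as the input, with a .wav extension.
    dst = Path(target) if target else src.with_suffix(".wav")
    return [
        "ffmpeg",
        "-y",                 # overwrite the output file if it already exists
        "-i", str(src),       # input audio or video file
        "-vn",                # drop any video stream
        "-ar", "16000",       # resample to 16 kHz
        "-ac", "1",           # assumption: downmix to mono
        "-c:a", "pcm_s16le",  # 16-bit signed little-endian PCM
        str(dst),
    ]


print(build_ffmpeg_command("lecture.mp4"))
```

The list can be passed to `subprocess.run(cmd, check=True)` once `ffmpeg_available()` returns True. Building the command as a list rather than a single string avoids quoting problems with file names that contain spaces.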

By default, the application uses whisper.cpp with the 'base' model for transcribing speech. This runs entirely locally on your computer. To improve transcription accuracy, select the medium or large/large-v2 model in the Settings. The medium and large models are, in particular, more accurate for foreign names, specialist terminology, poor or noisy recordings, and accented speech. However, these models require a powerful computer because transcription is processor-intensive and needs a significant amount of memory.


Download

Version: 2.0.2

Installation and starting Speech2Text

On macOS, open the DMG file and drag Speech2Text to the Applications folder. On Windows, double-click "Speech2Text_Setup.zip" to open the zip file that contains the installer. Next, double-click "Speech2Text Setup" to install Speech2Text.

The app is not code signed. Therefore, you will see a security message when you start the application for the first time or run the installer. Below are instructions to open the app in macOS and to run the installer in Windows.

macOS

The first time you start the application on macOS (double click on the application icon in Finder), you will see the dialog box below:

[Screenshot: macOS security message 1]

Please note that starting the application the first time might take a while.

To get past the security warning, right-click the application icon in Finder and select "Open" in the popup menu. You will then see this message:

[Screenshot: macOS security message 2]

Select "Open". It will take a few seconds before the app starts. The security warning will no longer appear for this app, and the application should start up quickly from then on.

Windows

The following message will appear when you start the installer.

[Screenshot: Windows security message 1]

Click on "More info", and the following message will appear.

[Screenshot: Windows security message 2]

Click on "Run anyway" to start the installer.

Speech to Text transcription speed

Table 1 gives an indication of how long it takes to generate captions (a VTT file) for a 19-minute recording.

Table 1. Comparison of ASR Performance using whisper.cpp and the Whisper ASR Webservice

Model    Dell Latitude (a)   iMac (M1) (b)   MacBook Pro (M1 Max) (c)   ASR webservice (d)
tiny     0h01m11s            0h00m17s        0h00m10s                   -
base     0h02m06s            0h00m28s        0h00m15s                   -
small    0h09m37s            0h01m06s        0h00m31s                   -
medium   0h31m08s            0h02m56s        0h01m08s                   -
large    3h41m35s            0h05m32s        0h01m57s                   0h02m21s

(a) Dell Latitude 5520 (Windows 11, 16 GB, Intel i7 11th gen).
(b) 24" iMac M1 (macOS Ventura, 16 GB), whisper.cpp using GPU (Metal).
(c) 14" MacBook Pro M1 Max (macOS Ventura, 32 GB), whisper.cpp using GPU (Metal).
(d) ASR webservice running on a ThinkStation P360 (Windows 11, 32 GB, Intel i9 12th gen, RTX A4000) using CUDA.
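To compare the timings in Table 1 with the length of the recording, it can help to express each one as a speed factor: recording length divided by processing time. The sketch below does this for three entries, assuming that the first three values in each row belong to the Dell Latitude, iMac, and MacBook Pro columns in that order. The function names are chosen for this example only.

```python
import re

RECORDING_SECONDS = 19 * 60  # the 19-minute recording used for Table 1


def to_seconds(duration: str) -> int:
    """Convert a Table 1 duration such as '0h02m06s' to seconds."""
    match = re.fullmatch(r"(\d+)h(\d+)m(\d+)s", duration)
    if match is None:
        raise ValueError(f"unexpected duration format: {duration!r}")
    hours, minutes, seconds = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds


def speed_factor(duration: str) -> float:
    """Recording length divided by processing time (above 1 is faster than real time)."""
    return RECORDING_SECONDS / to_seconds(duration)


for label, duration in [
    ("base, Dell Latitude", "0h02m06s"),
    ("large, Dell Latitude", "3h41m35s"),
    ("large, MacBook Pro (M1 Max)", "0h01m57s"),
]:
    print(f"{label}: {speed_factor(duration):.2f}x real time")
```

With these numbers, the base model on the Dell Latitude runs at roughly 9 times real time, while the large model on the same machine runs at less than a tenth of real time. The large model on the MacBook Pro (M1 Max) is again close to 10 times real time.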

Source code

The source code will be made available on GitHub.

If you have any questions, contact Walter van Heuven.

Known issues