
Noise Reduction in CMU Sphinx

We can safely say that today CMU Sphinx has become the leader among free speech recognition software. Pocketsphinx ships with Ubuntu, the promising Simon project makes heavy use of it, and the structure of the Voxforge corpus suggests it was created with sphinxtrain in mind.

Despite the rapid development of Sphinx itself and of speech recognition methods in general, everyone who has tried to use it in practice knows how hard it is to get a sane result even for simple tasks. That is because you cannot simply plug in the default model and expect the system to understand you. You have to adapt the acoustic model, build a relevant language model, and find the optimal engine parameters and configuration - in short, spend weeks painstakingly shaving the error rate down percent by percent. As someone who has spent those weeks, I can assure you that even then nothing is guaranteed. Especially if you want to recognize speech recorded not with a headset but with a laptop's built-in microphone, as is often the case.

In general, the fundamental cause of poor recognition is the mismatch between training and testing conditions (what the literature calls conditions mismatch). It covers everything: unfamiliar speakers, mismatched channel characteristics, an inadequate language model, even displays of emotion we did not expect from the user. In the case of a laptop microphone we get various additive noises and reverberation that were absent from the training corpus and that can significantly hurt recognition accuracy.

Background


The implementation of noise reduction in CMU Sphinx started exactly a year ago with this post by Nikolay Shmyrev (deep thanks to him for everything, by the way): Around noise-robust PNCC features. The corresponding commit landed two months later, but the first mention in the FAQ appeared only on June 10, 2014. Until then, the recommended way to fight noise was to adapt the model to the channel (very good advice, by the way, which still stands). For the experiments you will need the latest version to date, 0.8.
The algorithm itself is described in the original paper and in Nikolay's post. In short, it is very similar to MFCC, with modifications motivated by research on the human auditory system. Noise reduction and suppression for speech recognition is a very broad field that I will not go into, since I am no expert in it; I will only show how to apply it in practice. This post is a compilation of information found in articles and forums, and it assumes some familiarity with Sphinx. If you lack that, welcome to the wiki.

Noise reduction in practice


If PNCCs are just new features, it would be logical to assume that you can enable them by passing the appropriate value to -feat. But no. In this implementation they are a modification of the existing feature extraction mechanism, and it looks slightly different for pocketsphinx and Sphinx4. But first things first.

Creating acoustic models

So, before we can recognize anything, we need acoustic models. Existing models will not do, because they were trained the usual way, which means that using them in a noise-robust system would create exactly the fundamental mismatch described above. The models therefore have to be retrained. For that you need a corpus plus installed sphinxbase and sphinxtrain. As the corpus I recommend voxforge, which will need to be slightly modified.

Now we get to the most important part. As you probably know, sphinxtrain is driven by a common config (sphinx_train.cfg), which sets all the parameters for training (and testing) models, plus an additional feat.params file, which sets the feature extraction parameters. Starting from version 0.8, some Sphinx utilities gained extra parameters responsible for noise reduction, namely -remove_noise and -lifter. -remove_noise must be set to yes (which is its default value anyway), and the usual value of -lifter is 22. If you set it in the main config:
 $CFG_LIFTER = "22"; # Cepstrum lifter is smoothing to improve recognition 

then feat.params can pick it up from there:
 -lifter __CFG_LIFTER__ 


Another parameter important to us is -transform. Its default value is legacy, but we need dct. So, to train noise-robust models, we need to set a trio of parameters in feat.params:
-transform dct
-remove_noise yes
-lifter 22


But it is better to move them into sphinx_train.cfg, as is done for the other parameters:

$CFG_TRANSFORM = "dct"; # Previously legacy transform is used, but dct is more accurate
$CFG_LIFTER = "22"; # Cepstrum lifter is smoothing to improve recognition

feat.params:
-transform __CFG_TRANSFORM__
-remove_noise yes
-lifter __CFG_LIFTER__


Keep in mind that sphinxtrain is just a wrapper around individual utilities such as fe, so if you call them separately you should always pass these parameters yourself (where they are supported).
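For instance, a minimal sketch of running feature extraction by hand with sphinx_fe could look like the following. The control file and directory names are placeholders matching the voxforge-en layout used below, and it assumes a sphinxbase 0.8 build in which sphinx_fe exposes the -remove_noise and -lifter options:

# Hypothetical example: run feature extraction manually with the same settings.
# Reads wav/<id>.wav for every id listed in etc/voxforge_en_full.fileids
# and writes the corresponding feat/<id>.mfc files.
sphinx_fe \
    -c etc/voxforge_en_full.fileids \
    -di wav -ei wav -mswav yes \
    -do feat -eo mfc \
    -samprate 16000 -nfilt 40 -lowerf 133.33334 -upperf 6855.4976 \
    -transform dct -remove_noise yes -lifter 22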

Here is an example of my configs for voxforge-en:

sphinx_train.cfg:
# Configuration script for sphinx trainer                  -*-mode:Perl-*-

$CFG_VERBOSE = 1;        # Determines how much goes to the screen.

# These are filled in at configuration time
$CFG_DB_NAME = "voxforge_en";
# Experiment name, will be used to name model files and log files
$CFG_EXPTNAME = "$CFG_DB_NAME";

# Directory containing SphinxTrain binaries
$CFG_BASE_DIR = "/home/speechdat/voxforge-en";
$CFG_SPHINXTRAIN_DIR = "/usr/local/lib/sphinxtrain";
$CFG_BIN_DIR = "/usr/local/libexec/sphinxtrain";
$CFG_SCRIPT_DIR = "/usr/local/lib/sphinxtrain/scripts";

# Audio waveform and feature file information
$CFG_WAVFILES_DIR = "$CFG_BASE_DIR/wav";
$CFG_WAVFILE_EXTENSION = 'wav';
$CFG_WAVFILE_TYPE = 'mswav'; # one of nist, mswav, raw
$CFG_FEATFILES_DIR = "$CFG_BASE_DIR/feat";
$CFG_FEATFILE_EXTENSION = 'mfc';
$CFG_VECTOR_LENGTH = 13;

# Feature extraction parameters
$CFG_WAVFILE_SRATE = 16000.0;
$CFG_NUM_FILT = 40; # For wideband speech it's 40, for telephone 8khz reasonable value is 31
$CFG_LO_FILT = 133.33334; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 6855.4976; # For telephone 8kHz speech value is 3500
$CFG_TRANSFORM = "dct"; # Previously legacy transform is used, but dct is more accurate
$CFG_LIFTER = "22"; # Cepstrum lifter is smoothing to improve recognition

$CFG_MIN_ITERATIONS = 1;  # BW Iterate at least this many times
$CFG_MAX_ITERATIONS = 10; # BW Don't iterate more than this, somethings likely wrong.

# (none/max) Type of AGC to apply to input files
$CFG_AGC = 'none';
# (current/none) Type of cepstral mean subtraction/normalization
# to apply to input files
$CFG_CMN = 'current';
$CFG_CMNINIT = 10.0;
# (yes/no) Normalize variance of input files to 1.0
$CFG_VARNORM = 'no';
# (yes/no) Train full covariance matrices
$CFG_FULLVAR = 'no';
# (yes/no) Use diagonals only of full covariance matrices for
# Forward-Backward evaluation (recommended if CFG_FULLVAR is yes)
$CFG_DIAGFULL = 'no';

# (yes/no) Perform vocal tract length normalization in training. This
# will result in a "normalized" model which requires VTLN to be done
# during decoding as well.
$CFG_VTLN = 'no';
# Starting warp factor for VTLN
$CFG_VTLN_START = 0.80;
# Ending warp factor for VTLN
$CFG_VTLN_END = 1.40;
# Step size of warping factors
$CFG_VTLN_STEP = 0.05;

# Directory to write queue manager logs to
$CFG_QMGR_DIR = "$CFG_BASE_DIR/qmanager";
# Directory to write training logs to
$CFG_LOG_DIR = "$CFG_BASE_DIR/logdir";
# Directory for re-estimation counts
$CFG_BWACCUM_DIR = "$CFG_BASE_DIR/bwaccumdir";
# Directory to write model parameter files to
$CFG_MODEL_DIR = "$CFG_BASE_DIR/model_parameters";

# Directory containing transcripts and control files for
# speaker-adaptive training
$CFG_LIST_DIR = "$CFG_BASE_DIR/etc";

# Decoding variables for MMIE training
$CFG_LANGUAGEWEIGHT = "11.5";
$CFG_BEAMWIDTH = "1e-100";
$CFG_WORDBEAM = "1e-80";
$CFG_LANGUAGEMODEL = "$CFG_LIST_DIR/${CFG_DB_NAME}_full.lm.DMP";
$CFG_WORDPENALTY = "0.2";

# Lattice pruning variables
$CFG_ABEAM = "1e-50";
$CFG_NBEAM = "1e-10";
$CFG_PRUNED_DENLAT_DIR = "$CFG_BASE_DIR/pruned_denlat";

# MMIE training related variables
$CFG_MMIE = "no";
$CFG_MMIE_MAX_ITERATIONS = 5;
$CFG_LATTICE_DIR = "$CFG_BASE_DIR/lattice";
$CFG_MMIE_TYPE = "best"; # Valid values are "rand", "best" or "ci"
$CFG_MMIE_CONSTE = "3.0";
$CFG_NUMLAT_DIR = "$CFG_BASE_DIR/numlat";
$CFG_DENLAT_DIR = "$CFG_BASE_DIR/denlat";

# Variables used in main training of models
$CFG_DICTIONARY = "$CFG_LIST_DIR/$CFG_DB_NAME.dict";
$CFG_RAWPHONEFILE = "$CFG_LIST_DIR/$CFG_DB_NAME.phone";
$CFG_FILLERDICT = "$CFG_LIST_DIR/$CFG_DB_NAME.filler";
$CFG_LISTOFFILES = "$CFG_LIST_DIR/${CFG_DB_NAME}_full.fileids";
$CFG_TRANSCRIPTFILE = "$CFG_LIST_DIR/${CFG_DB_NAME}_full.transcription";
$CFG_FEATPARAMS = "$CFG_LIST_DIR/feat.params";

# Variables used in characterizing models

$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
#$CFG_HMM_TYPE = '.semi.'; # PocketSphinx
#$CFG_HMM_TYPE = '.ptm.'; # PocketSphinx (larger data sets)

if (($CFG_HMM_TYPE ne ".semi.")
    and ($CFG_HMM_TYPE ne ".ptm.")
    and ($CFG_HMM_TYPE ne ".cont.")) {
  die "Please choose one CFG_HMM_TYPE out of '.cont.', '.ptm.', or '.semi.', " .
    "currently $CFG_HMM_TYPE\n";
}

# This configuration is fastest and best for most acoustic models in
# PocketSphinx and Sphinx-III. See below for Sphinx-II.
$CFG_STATESPERHMM = 3;
$CFG_SKIPSTATE = 'no';

if ($CFG_HMM_TYPE eq '.semi.') {
  $CFG_DIRLABEL = 'semi';
  # Four stream features for PocketSphinx
  $CFG_FEATURE = "s2_4x";
  $CFG_NUM_STREAMS = 4;
  $CFG_INITIAL_NUM_DENSITIES = 256;
  $CFG_FINAL_NUM_DENSITIES = 256;
  die "For semi continuous models, the initial and final models have the same density"
    if ($CFG_INITIAL_NUM_DENSITIES != $CFG_FINAL_NUM_DENSITIES);
} elsif ($CFG_HMM_TYPE eq '.ptm.') {
  $CFG_DIRLABEL = 'ptm';
  # Four stream features for PocketSphinx
  $CFG_FEATURE = "s2_4x";
  $CFG_NUM_STREAMS = 4;
  $CFG_INITIAL_NUM_DENSITIES = 64;
  $CFG_FINAL_NUM_DENSITIES = 64;
  die "For phonetically tied models, the initial and final models have the same density"
    if ($CFG_INITIAL_NUM_DENSITIES != $CFG_FINAL_NUM_DENSITIES);
} elsif ($CFG_HMM_TYPE eq '.cont.') {
  $CFG_DIRLABEL = 'cont';
  # Single stream features - Sphinx 3
  $CFG_FEATURE = "1s_c_d_dd";
  $CFG_NUM_STREAMS = 1;
  $CFG_INITIAL_NUM_DENSITIES = 1;
  $CFG_FINAL_NUM_DENSITIES = 32;
  die "The initial has to be less than the final number of densities"
    if ($CFG_INITIAL_NUM_DENSITIES > $CFG_FINAL_NUM_DENSITIES);
}

# Number of top gaussians to score a frame. A little bit less accurate computations
# make training significantly faster. Uncomment to apply this during the training
# For good accuracy make sure you are using the same setting in decoder
# In theory this can be different for various training stages. For example 4 for
# CI stage and 16 for CD stage
# $CFG_CI_TOPN = 4;
# $CFG_CD_TOPN = 16;

# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise) in the models created
# specifically for forced alignment
$CFG_FALIGN_CI_MGAU = 'no';
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise)
$CFG_CI_MGAU = 'no';
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 3000;

# How many parts to run Forward-Backward estimatinon in
$CFG_NPART = 1;

# (yes/no) Train a single decision tree for all phones (actually one
# per state) (useful for grapheme-based models, use 'no' otherwise)
$CFG_CROSS_PHONE_TREES = 'no';

# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'no';

# Use a specific set of models for force alignment. If not defined,
# context-independent models for the current experiment will be used.
$CFG_FORCE_ALIGN_MDEF = "$CFG_BASE_DIR/model_architecture/$CFG_EXPTNAME.falign_ci.mdef";
$CFG_FORCE_ALIGN_MODELDIR = "$CFG_MODEL_DIR/$CFG_EXPTNAME.falign_ci_$CFG_DIRLABEL";

# Use a specific dictionary and filler dictionary for force alignment.
# If these are not defined, a dictionary and filler dictionary will be
# created from $CFG_DICTIONARY and $CFG_FILLERDICT, with noise words
# removed from the filler dictionary and added to the dictionary (this
# is because the force alignment is not very good at inserting them)
# $CFG_FORCE_ALIGN_DICTIONARY = "$ST::CFG_BASE_DIR/falignout$ST::CFG_EXPTNAME.falign.dict";;
# $CFG_FORCE_ALIGN_FILLERDICT = "$ST::CFG_BASE_DIR/falignout/$ST::CFG_EXPTNAME.falign.fdict";;

# Use a particular beam width for force alignment. The wider
# (ie smaller numerically) the beam, the fewer sentences will be
# rejected for bad alignment.
$CFG_FORCE_ALIGN_BEAM = 1e-60;

# Calculate an LDA/MLLT transform?
$CFG_LDA_MLLT = 'yes';
# Dimensionality of LDA/MLLT output
$CFG_LDA_DIMENSION = 29;

# This is actually just a difference in log space (it doesn't make
# sense otherwise, because different feature parameters have very
# different likelihoods)
$CFG_CONVERGENCE_RATIO = 0.1;

# Queue::POSIX for multiple CPUs on a local machine
# Queue::PBS to use a PBS/TORQUE queue
$CFG_QUEUE_TYPE = "Queue::POSIX";

# Name of queue to use for PBS/TORQUE
$CFG_QUEUE_NAME = "workq";

# (yes/no) Build questions for decision tree clustering automatically
$CFG_MAKE_QUESTS = "yes";
# If CFG_MAKE_QUESTS is yes, questions are written to this file.
# If CFG_MAKE_QUESTS is no, questions are read from this file.
$CFG_QUESTION_SET = "${CFG_BASE_DIR}/model_architecture/${CFG_EXPTNAME}.tree_questions";
#$CFG_QUESTION_SET = "${CFG_BASE_DIR}/linguistic_questions";

$CFG_CP_OPERATION = "${CFG_BASE_DIR}/model_architecture/${CFG_EXPTNAME}.cpmeanvar";

# This variable has to be defined, otherwise utils.pl will not load.
$CFG_DONE = 1;

return 1;



feat.params:
-alpha 0.97
-dither yes
-doublebw no
-nfilt __CFG_NUM_FILT__
-ncep __CFG_VECTOR_LENGTH__
-lowerf __CFG_LO_FILT__
-upperf __CFG_HI_FILT__
-samprate __CFG_WAVFILE_SRATE__
-nfft 512
-wlen 0.0256
-transform __CFG_TRANSFORM__
-feat __CFG_FEATURE__
-agc __CFG_AGC__
-cmn __CFG_CMN__
-varnorm __CFG_VARNORM__
-remove_noise yes
-lifter __CFG_LIFTER__
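
With both files in place, training is launched through the sphinxtrain wrapper as usual. A rough sketch, assuming the sphinxtrain command from version 1.0.8 is on your PATH and that the experiment lives in the base directory from the config above:

cd /home/speechdat/voxforge-en
# One-time step: generates the default etc/sphinx_train.cfg and etc/feat.params,
# which are then edited as described above.
sphinxtrain -t voxforge_en setup
# Runs all training stages, including feature extraction with the parameters above.
sphinxtrain run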



Of course, training acoustic models is a whole ordeal of its own. Besides specialized knowledge, it requires installing sphinxbase and sphinxtrain and takes about a day. That is why I have shared my own models, trained on voxforge-en by the recipe above: dropbox.

Using acoustic models

Once we have the models, we can finally breathe out and plug them into our system. The recipe differs depending on whether you use pocketsphinx or Sphinx4. With pocketsphinx everything is simple: you just pass the trio of parameters -transform, -remove_noise and -lifter (a sample invocation is sketched after the Sphinx4 config below). If we want to use Sphinx4, we need to add the denoising components to the frontend and slightly rework the frontend itself. The corresponding pipeline looks something like this:

  1. AudioFileDataSource
  2. Dither
  3. Preemphasizer
  4. RaisedCosineWindower
  5. DiscreteFourierTransform
  6. MelFrequencyFilterBank
  7. Denoise
  8. DiscreteCosineTransform2
  9. Lifter
  10. BatchCMN
  11. DeltasFeatureExtractor
  12. FeatureTransform

NB: featureTransform is needed only if you have used LDA / MLLT in model training.

The three components that provide the noise reduction are Denoise, DiscreteCosineTransform2, and Lifter.

In XML, the corresponding part of the config will look like this:

config.xml
<component name="mfcFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
    <propertylist name="pipeline">
        <item>audioFileDataSource</item>
        <item>dither</item>
        <item>preemphasizer</item>
        <item>windower</item>
        <item>fft</item>
        <item>melFilterBank</item>
        <item>denoise</item>
        <item>dct</item>
        <item>lifter</item>
        <item>batchCMN</item>
        <item>featureExtraction</item>
        <item>featureTransform</item>
    </propertylist>
</component>

<component name="audioFileDataSource" type="edu.cmu.sphinx.frontend.util.AudioFileDataSource">
</component>

<component name="preemphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer">
</component>

<component name="dither" type="edu.cmu.sphinx.frontend.filter.Dither">
</component>

<component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
</component>

<component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
</component>

<component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
    <property name="numberFilters" value="40"/>
    <property name="minimumFrequency" value="133.33334"/>
    <property name="maximumFrequency" value="6855.4976"/>
</component>

<component name="denoise" type="edu.cmu.sphinx.frontend.denoise.Denoise">
</component>

<component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform2">
</component>

<component name="lifter" type="edu.cmu.sphinx.frontend.transform.Lifter">
</component>

<component name="batchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN">
</component>

<component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor">
</component>

<component name="featureTransform" type="edu.cmu.sphinx.frontend.feature.FeatureTransform">
    <property name="loader" value="modelLoader"/>
</component>
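
And here is what the pocketsphinx route mentioned above might look like. This is only a sketch: the model, dictionary and language model paths are placeholders for whatever you trained or downloaded, and it assumes a pocketsphinx 0.8 build that accepts the sphinxbase feature-extraction flags and the -infile option:

# Hypothetical example: decode a 16 kHz mono WAV file with the noise-robust models.
pocketsphinx_continuous \
    -hmm model_parameters/voxforge_en.cd_cont_3000 \
    -lm etc/voxforge_en_full.lm.DMP \
    -dict etc/voxforge_en.dict \
    -transform dct -remove_noise yes -lifter 22 \
    -infile recording.wav

In practice pocketsphinx should also pick up feat.params from the model directory, so if that file already contains these three settings, passing them explicitly is redundant but harmless.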



Does it work?


For my task, noise reduction improved recognition accuracy by roughly 6.7 percentage points: from 74.65% to 81.38%. Channel adaptation is still worth doing on top of that. And be careful with this mechanism: on clean audio it can actually degrade the results.

Source: https://habr.com/ru/post/227099/

