
WavPlayer: we don't look for easy paths, we pave them

As you know, telephony is about transmitting voice. Nobody needs the full 20 Hz to 20 kHz band for that: a clear, intelligible, recognizable voice needs only up to about 3.5 kHz. More precisely, the speech band used in telephony runs from 300 to 3400 Hz. When calls are multiplexed into a shared channel, guard bands are needed at the edges to separate them cleanly, so each channel gets 4 kHz. Digitizing a 4 kHz band takes a sampling rate of 8 kHz. Now that communication channels have grown fatter, Skype and others boast of "higher" quality and use 16 kHz or even 32 kHz, a difference that is, by the way, practically impossible to hear in a normal conversation (the quality of the communication channel matters more, but when has that ever bothered marketing?).

So almost all audio files used in telephony are recorded at an 8 kHz sampling rate. To keep the processing of large streams fast, the compression methods are just as simple and are tuned for a decent result on the one thing that matters here: speech. These are plain digitization (PCM), simple companding or delta codecs (G.711, ADPCM), or smarter codecs (GSM 06.10). These formats are "native" to telephony systems: Asterisk, FreeSWITCH (and probably others too). Prompts that the system plays to people are prepared in these formats, and the systems can write their recordings in the same formats.

However, the web is now sweeping the planet, and people want to listen to call recordings, greetings, and so on in the browser, where MP3 has become the "native" format...

As a result, for the rarely used "listen to the archive" function, the naive solution is to set up server-side transcoding of recordings from the telephony format to MP3.
That would all be fine, but:
Seeing this disgrace, the engineer's soul ached and demanded that things be done properly. And not "badly again, only differently", but properly and directly: after all, the codecs used in telephony are designed to give a good result extremely cheaply. So why run an expensive encoding into MP3 just so that the client can run an expensive decoding from MP3, merely because that decoder is already there? Let's just write the simplest possible decoder on the client, and that's it!

I was particularly surprised that no ready-made decoders of this kind existed. This is how WavPlayer was born: a Flash player for telephony files.

What it can do:
And recently, users added a proxy to a standard MP3 player, so that WavPlayer alone can be used to play both native and transcoded archives. (Initially I did not do this, assuming it was the JS code's job to choose between one of the Flash MP3 players, HTML5, or WavPlayer.)

Anyone who reads the descriptions of each of the codecs and formats will see that the player is dead simple. But if that were really so, it would have existed long ago... So I will briefly tell the story of how it was built.

Flash originally offered only one way to play sound: playing MP3 files. That's it. Nothing else. Starting with version 10, the flash.media.Sound interface gained a sampleData event that lets you generate sound and play it. But, as befits Flash, it does so strictly on its own terms: only 44 kHz (44100 Hz, to be exact), only stereo, only 32-bit floating-point numbers.
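To make the format gap concrete, here is a minimal Python sketch (not the player's code) that turns signed 16-bit mono PCM into the interleaved stereo floats such a callback expects. The function name and the little-endian byte order are assumptions for illustration; the sample rate is deliberately left untouched.

```python
import struct


def pcm16_to_stereo_floats(raw):
    """Convert little-endian signed 16-bit mono PCM bytes to interleaved
    stereo floats in the range [-1.0, 1.0)."""
    count = len(raw) // 2  # ignore a trailing odd byte, if any
    samples = struct.unpack("<%dh" % count, raw[:count * 2])
    out = []
    for s in samples:
        f = s / 32768.0
        out.append(f)  # left channel
        out.append(f)  # right channel: same value, mono duplicated
    return out
```

This only fixes the number format and the channel count; the 8 kHz versus 44 kHz mismatch is a separate problem.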

And we have 8 kHz / 16 kHz integers. If we just hand the source data over as is, we get something unintelligible, sped up and very high-pitched. The conclusion? We need to interpolate the samples we have, in other words, to resample.
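A quick back-of-the-envelope check of that effect, as a small Python sketch (the function name and the 1 kHz / 60 s figures are just illustrative inputs):

```python
def naive_playback(src_rate, dst_rate, tone_hz, duration_s):
    """What happens if samples recorded at src_rate are played back,
    unchanged, at dst_rate: pitch goes up and duration shrinks by the
    same factor."""
    speedup = dst_rate / src_rate
    return tone_hz * speedup, duration_s / speedup


# A 1 kHz tone in a one-minute 8 kHz recording, played as if it were 44100 Hz.
pitch_hz, seconds = naive_playback(8000, 44100, 1000.0, 60.0)
```

Here the speed-up factor is 44100 / 8000 = 5.5125, so the minute of speech would fly by in under 11 seconds at more than five times the pitch.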

When resampling, it is important to understand that even for a simple doubling of the rate you cannot just insert the "average" between each pair of samples: the result will "whistle" at very high frequencies, because instead of a smooth sinusoid we get a sawtooth. Correct resampling reconstructs the original smooth signal (minimizing the second derivative) and digitizes it again at the desired rate. That gives a properly smooth sound at the target sample rate.
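One common way to approximate that "reconstruct, then sample again" idea is a windowed-sinc kernel such as Lanczos. The Python sketch below is a simplified illustration under stated assumptions, not the OpenJDK or WavPlayer code: it handles upsampling only (for downsampling the kernel would have to be widened), treats samples outside the input as silence, and uses a kernel half-width `a` of 3 as an arbitrary default.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0.0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def resample(samples, src_rate, dst_rate, a=3):
    """Upsample by evaluating the reconstructed signal at the new sample
    positions. Input outside the list is treated as silence."""
    ratio = src_rate / dst_rate
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        t = i * ratio  # output position measured in input samples
        center = math.floor(t)
        acc = 0.0
        # Only the 2*a nearest input samples contribute to each output.
        for k in range(center - a + 1, center + a + 1):
            if 0 <= k < len(samples):
                acc += samples[k] * lanczos_kernel(t - k, a)
        out.append(acc)
    return out
```

Each output sample needs `a` input samples on either side, which is why the first and last few outputs are only correct if the input is padded.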

I know the theory, of course, but in practice I was very lazy, and the "play the recordings" task was pressing, so I had to decide quickly. I don't know Flash, and my work machine runs Linux. I looked at the size of the Flash compiler, over a hundred megabytes, and was so put off that I decided to look for an alternative way to knock something together for Flash quickly and simply. A quick Google search turned up a wonderful option: HaXe, a simple C/Java-like language that compiles to several target platforms, including the Flash I needed. So I took it.

In short, the first working prototype was thrown together like this:

I found the fogg project, which decodes ogg files by hand. From it I took AudioSink, which implements a push interface instead of pull: a buffer that we write to, and when Flash wants the next chunk of data, AudioSink hands it over from the buffer. Not the most efficient or elegant implementation, but ready to use. For the resampler I took, as is, the implementation of the Lanczos resampler (the highest-quality one, based on sinc functions) from OpenJDK. The code is not the most efficient (I later reimplemented it in pure ActionScript and managed to speed it up almost 4 times), but it works (and I needed nothing more).
The simplest possible interface: a triangle is drawn when the player is ready. On a click, play() starts and a square is drawn. On a click, two vertical bars are drawn.
For decoding, the G.711 code was taken from SoX; the PCM code I wrote myself.
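The push-instead-of-pull buffer described in the first item can be sketched roughly as follows. This is a hypothetical Python stand-in, not the fogg AudioSink itself; the method names and the choice to pad an underrun with silence are assumptions.

```python
from collections import deque


class AudioSink:
    """Producer pushes samples in; the audio callback pulls fixed-size
    chunks out whenever it wants more data."""

    def __init__(self):
        self._buffer = deque()

    def write(self, samples):
        """Called by the decoding side as data becomes available."""
        self._buffer.extend(samples)

    def available(self):
        return len(self._buffer)

    def pull(self, count):
        """Called by the playback side. Always returns exactly `count`
        samples, padding with silence if the buffer runs dry."""
        out = []
        for _ in range(count):
            out.append(self._buffer.popleft() if self._buffer else 0.0)
        return out
```

The point of the design is that the decoder never has to know when the audio system will ask for data, and the audio callback never has to wait for the decoder.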

And, of course, a spoonful of OOP in this barrel of hack code: the File and Decoder interfaces, which let the main player abstract away from the specific format. True, the interfaces were born of necessity rather than design, but when was it ever otherwise? File works like this: the file's input data is read and pushed in through the push() method. Once all the headers have been read, a decoder for the corresponding format is created inside the file, and the audio data is pushed into it. ready() starts returning true, from that moment all the other stream-metadata methods become valid as well, and the audio data can be read with getSamples(), which returns samplesAvailable() samples.
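A rough Python sketch of that File life cycle is shown below. It is an illustration only, not the player's code: the Sun/NeXT .au container is used simply because its fixed 24-byte big-endian header is easy to parse, the class names are hypothetical, only 16-bit PCM (encoding 3) is handled, and the article's getSamples() / samplesAvailable() are spelled in Python style.

```python
import struct


class Pcm16BeDecoder:
    """Decoder for big-endian signed 16-bit PCM (.au encoding 3)."""
    sample_size = 2  # bytes per encoded sample

    def decode(self, chunk):
        (value,) = struct.unpack(">h", chunk)
        return value / 32768.0


class AuFile:
    """Push-style reader: bytes go in via push(), floats come out."""
    HEADER_SIZE = 24  # magic, data offset, data size, encoding, rate, channels

    def __init__(self):
        self._pending = b""
        self._skip = 0
        self._decoder = None
        self._samples = []
        self.rate = None
        self.channels = None

    def push(self, data):
        self._pending += data
        if self._decoder is None:
            if len(self._pending) < self.HEADER_SIZE:
                return  # header not complete yet, wait for more data
            magic, offset, _size, encoding, rate, channels = struct.unpack(
                ">4sIIIII", self._pending[:self.HEADER_SIZE])
            if magic != b".snd" or encoding != 3:
                raise ValueError("this sketch handles only 16-bit PCM .au data")
            self.rate, self.channels = rate, channels
            self._skip = offset  # audio starts this many bytes into the file
            self._decoder = Pcm16BeDecoder()
        if self._skip:
            n = min(self._skip, len(self._pending))
            self._pending = self._pending[n:]
            self._skip -= n
            if self._skip:
                return
        # Cut the pending bytes into whole samples; keep any remainder.
        size = self._decoder.sample_size
        usable = len(self._pending) - len(self._pending) % size
        for i in range(0, usable, size):
            self._samples.append(self._decoder.decode(self._pending[i:i + size]))
        self._pending = self._pending[usable:]

    def ready(self):
        return self._decoder is not None

    def samples_available(self):
        return len(self._samples)

    def get_samples(self):
        out, self._samples = self._samples, []
        return out
```

Because push() tolerates arbitrary chunk boundaries, the caller can feed it network data exactly as it arrives, even when a chunk ends in the middle of the header or of a sample.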

The decoder's job is simple too: it reports the size of a sample in bytes, so that the file can cut the data into correctly sized pieces for feeding the decoder. The decoder is then called sequentially, each call converting the buffered data into one sample (a signed float).
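As an example of such a per-sample decoder, here is a Python sketch for G.711 mu-law, following the classic bit-manipulation expansion used in the widely copied g711.c reference code. The class and helper names are hypothetical, and scaling to a float by dividing by 32768 is an assumption about the output convention.

```python
class ULawDecoder:
    """G.711 mu-law: one encoded byte per sample."""
    sample_size = 1  # bytes per encoded sample

    def decode(self, chunk):
        u = ~chunk[0] & 0xFF  # mu-law bytes are stored bit-inverted
        # Mantissa (low 4 bits) plus bias, shifted by the 3-bit segment.
        magnitude = (((u & 0x0F) << 3) + 0x84) << ((u & 0x70) >> 4)
        linear = (0x84 - magnitude) if (u & 0x80) else (magnitude - 0x84)
        return linear / 32768.0


def decode_stream(decoder, data):
    """Cut `data` into sample_size pieces and decode them one by one."""
    size = decoder.sample_size
    return [decoder.decode(data[i:i + size])
            for i in range(0, len(data) - size + 1, size)]
```

The caller needs nothing from the decoder beyond `sample_size` and `decode()`, which is what lets the same loop serve different simple codecs.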

The main difficulty is feeding the resampler properly. Let me remind you that the resampler works on the principle of a virtual double conversion: from the input data, a smooth signal is reconstructed at the input sampling rate and then digitized again at the output rate. Reconstructing a signal always requires history, so the resampler must first be fed silence of the required length to initialize it, and that silence must be discarded from the first output; then we get correct resampling right from the start. Likewise, after our data runs out, the resampler must be fed trailing silence in order to get all of the reconstructed information out.
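The priming and flushing can be sketched in Python like this. It is a simplified batch illustration with hypothetical names: `resample` stands for any function taking (samples, src_rate, dst_rate), `history` is however many input samples the resampler needs on each side, and the resampler is assumed to add no latency of its own, so the output produced for the padding can simply be cut off at both ends.

```python
def resample_with_silence(samples, history, src_rate, dst_rate, resample):
    """Surround the real data with silence so the resampler has history
    before the first sample and lookahead after the last one, then drop
    the output that corresponds to the padding."""
    silence = [0.0] * history
    padded = silence + list(samples) + silence
    out = resample(padded, src_rate, dst_rate)
    lead = round(history * dst_rate / src_rate)      # output produced by leading silence
    want = round(len(samples) * dst_rate / src_rate) # output belonging to the real data
    return out[lead:lead + want]


def repeat_resample(samples, src_rate, dst_rate):
    """Trivial stand-in resampler for the demo: repeats each sample.
    Works only for integer rate ratios and is NOT a proper resampler."""
    factor = dst_rate // src_rate
    return [s for s in samples for _ in range(factor)]


result = resample_with_silence([1.0, 2.0, 3.0], 2, 1, 2, repeat_resample)
```

A streaming resampler with internal delay would need the same padding, but the trimming offsets would have to account for that delay.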

And so this whole company produces exactly as much data as is needed, at 44 kHz and in the required form.

Once the basic player was working, I started polishing it. First came support for more complex codecs, specifically GSM. It immediately became clear that not everything can be decoded sample by sample; block processing is needed here. So the decoder interface was reworked to take an input array + offset and an output array + offset, returning how many samples were written to the output. To support raw files, most of the code, being generic, was moved into a separate common class, so that a format only has to override the minimum: the parameters it requires, set in the initializer. The GSM decoder itself was, as usual, taken from where it was found and quickly converted to the required syntax. Oddly enough, it all worked right away.
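The shape of that reworked interface might look like the Python sketch below. The codec here is a deliberately trivial stand-in (4-byte frames holding two little-endian 16-bit samples), not GSM; a real GSM 06.10 decoder would instead consume 33-byte frames and emit 160 samples each. All names are hypothetical.

```python
import struct


class BlockPcm16Decoder:
    """Stand-in block codec: every 4-byte frame holds two 16-bit samples."""
    frame_bytes = 4
    frame_samples = 2

    def decode(self, src, src_off, dst, dst_off):
        """Decode whole frames from src starting at src_off into dst
        starting at dst_off. Returns the number of samples written."""
        written = 0
        pos = src_off
        while (pos + self.frame_bytes <= len(src)
               and dst_off + written + self.frame_samples <= len(dst)):
            a, b = struct.unpack_from("<2h", src, pos)
            dst[dst_off + written] = a / 32768.0
            dst[dst_off + written + 1] = b / 32768.0
            written += self.frame_samples
            pos += self.frame_bytes
        return written
```

Passing offsets instead of slices lets the caller reuse one input and one output buffer, and the return value tells it how far the output position has advanced.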

At the same time, a control interface for JS code was added to the player, along with load, play, and pause events, allowing the page to draw the player's state in the browser however it likes. The resulting product started going into production. Testing turned up some problems, especially in the dearly beloved IE, which loaded the file in chunks of, it seems, 8k or 4k... so a ton of events was generated, and I had to throttle how often they fire.

Unfortunately, it quickly became clear that nobody had any desire to build an interface in JS. So a quick, back-of-the-napkin decision was made to put a GUI inside the player. The player began generating internal events, and WavPlayerGui was created. Its Mini descendant stayed as before, just a button; in addition, Full was created, which has the same button on the left and, on the right, a progress bar showing the length, the amount loaded, and the current position. That is, a few more little rectangles whose sizes change in response to events.

As soon as the progress bar appeared, it became clear that one should be able to click on it, too. And in general, it is quite silly to listen to recordings only from the beginning when what you need is the third minute out of fifteen... seek() was needed. Implementing seek() turned out to be the hardest task here: since we cannot load the source file from an arbitrary position (we cannot guarantee that the server supports Range requests, and it is not so easy to do in Flash anyway), seek() had to be limited to the part already loaded. But even then, we do not store the full data converted to 44 kHz (memory is too precious for that), so when the position has to change, the following happens:


Then came a few cosmetic changes from people who started using it publicly, and then another challenge: could I support IMA ADPCM? The format is rather ugly from the standpoint of fitting it into the generic design: the data is not stored per channel but interleaved, so the decoder also had to be told which channel is being decoded; at the same time, some generality had to be sacrificed for all the other codecs, because for them the amount of output data is a fixed, simple function of the input, whereas here... it depends. The decoder also needs its complete history, so decoding cannot start from an arbitrary position. Accordingly, seek() for this format works like this:


All in all, oddly enough, this works too. And it is now available for anyone to use: it does exactly what is needed, exactly as it should.
For complete happiness, all that remains is to finally write the JS interface that I had expected our web developers to write, plus a simple, clear integration example that can be copy-pasted into your site, because integration most often falls on the shoulders of a sysadmin, not a programmer... So, to be continued.

Project on Github | Online demo.

Source: https://habr.com/ru/post/223293/

