Chunked inference result depends on chunk length #39
A proof-of-concept implementation of the encoding part is here: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80. The main algorithm is this part: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80#file-chunked_dac-py-L140-L181.
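For readers who don't want to open the gist, here is a minimal sketch of the chunk-size-invariant encoding idea it implements (not the gist's exact code): pad the full signal once, slice overlapping chunks, encode each with the model's internal padding disabled, and concatenate the code frames. The `encode_fn` callable, the hop length, and the per-side context below are placeholders; in practice the context has to come from a receptive-field computation over the encoder.

```python
import torch

HOP = 512       # assumed encoder hop length (product of the encoder strides)
CONTEXT = 2048  # assumed per-side context in samples; derive from the encoder

def encode_chunked(encode_fn, signal, chunk_len=HOP * 500):
    """Encode `signal` (B x 1 x T) in chunks, matching full-signal encoding.

    `encode_fn` maps a padded audio chunk to code frames (B x Nq x frames)
    and is assumed to run with all internal Conv1d padding set to 0.
    """
    assert chunk_len % HOP == 0, "chunks must cover a whole number of frames"
    # Pad the whole signal once, exactly as a single full-length pass would.
    padded = torch.nn.functional.pad(signal, (CONTEXT, CONTEXT))
    codes = []
    for start in range(0, signal.shape[-1], chunk_len):
        # Each chunk carries CONTEXT extra samples on both sides, replacing
        # the zero-padding the convolutions would otherwise have applied.
        chunk = padded[..., start : start + chunk_len + 2 * CONTEXT]
        with torch.no_grad():
            codes.append(encode_fn(chunk))
    return torch.cat(codes, dim=-1)
```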
Hey, thanks for the implementation! I've been on parental leave the last few months, so I haven't been plugged in since writing the existing chunking code. Happy to take a look though, as I noticed the same issue with needing to use the same chunk size at encoding and decoding time. My fix then was just to save the chunk length in the metadata so things can be decoded properly, but this comes with some downsides as you mentioned. I'll take a look at your code and see what I can do! Thanks!
Nice, congratulations!
Take your time! From what I see, the decoding algorithm will need more attention than you can spare during parental leave, so don't bother for now. Integrating the encoding algorithm alone will not be of much use.
Thank you for understanding! And thanks for the congrats! If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful. The encoding algorithm looks quite nice, and it has nice properties with regard to invariance to chunk size. I also feel that these sorts of tricks with convolutional nets for audio are not widely available, so getting it right in this repo would be a nice contribution to open source!
Well, we'd start by figuring out how many code frames we can decode at once to stay within the window size given by the user. Then we'd decode those, getting, say, 5 seconds of audio. We take the next chunk of code frames (which will have to overlap with the previous one, at the very least because of the initial size-7 convolution) and again get 5 seconds of audio. My implementation includes a receptive field computation for the decoder, but maybe it is more helpful to compute the decoder's receptive field in the inverse direction, or to separate the effects of the forward and transposed convolutions. /edit: The overlap-add idea does not apply because of the nonlinearities. Instead, we will need to overlap the code chunks and crop the decoder output to remove the wrongly computed borders (which should have taken the neighboring codes into account, but could not). Also, the code for disabling padding needs to be fixed: to disable zero-padding in a transposed convolution, its `padding` argument must not simply be set to zero, since in a transposed convolution `padding` crops the output rather than padding the input, so zeroing it changes the output length instead.
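To make the idea above concrete, here is a hedged sketch of such a chunked decoder, under the simplifying assumption that the decoder keeps its internal padding (so F code frames always decode to exactly F × HOP samples) and that CONTEXT frames of overlap push the border errors entirely into the cropped region. The constants and the `decode_fn` callable are placeholders, not this repo's API; with padding disabled, the output lengths change and the cropping has to be adjusted via the receptive-field computation discussed above.

```python
import torch

CONTEXT = 16  # assumed decoder context, in code frames per side
HOP = 512     # assumed audio samples produced per code frame

def decode_chunked(decode_fn, codes, frames_per_chunk=500):
    """Decode `codes` (B x Nq x T_frames) in overlapping chunks."""
    n_frames = codes.shape[-1]
    audio = []
    for start in range(0, n_frames, frames_per_chunk):
        end = min(n_frames, start + frames_per_chunk)
        # Extend the chunk by CONTEXT frames on both sides where possible.
        lo = max(0, start - CONTEXT)
        hi = min(n_frames, end + CONTEXT)
        with torch.no_grad():
            y = decode_fn(codes[..., lo:hi])  # B x 1 x ((hi - lo) * HOP)
        # Crop the samples that correspond to the overlap frames; their
        # borders were computed without the full neighbouring context.
        left = (start - lo) * HOP
        right = (hi - end) * HOP
        audio.append(y[..., left : y.shape[-1] - right])
    return torch.cat(audio, dim=-1)
```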
I was having problems getting the example Python code running and wound up here. This is working, but I've noticed that, using your script, the decoded file is half the size of the original input file (7.4 MB vs 15 MB). Is there a setting I'm missing somewhere? I've tried setting… UPDATE: derp… umm… just noticed my input file is 32-bit and the output is 16-bit. 🙈
Glad you found it. I've just updated the gist to the version I ended up with for my use case; it adds support for input and output directories and can be launched multiple times with the same input and output directories but different CUDA devices, taking care not to process the same files. Chunked decoding is still left as an exercise for the reader ;) |
Decoding in overlapped chunks and then chopping off the overlapped samples sounds like a very plausible method. I'll give it a go! Thanks for the additional detail!
I ran into the same issue, but I have since solved it.
First of all, thanks for the great work and clean code!
For the purpose of training a model on the discrete codes (as opposed to just encoding and decoding a signal), the current chunked inference is not ideal. As nicely summarized by @cinjon in #35, the current implementation slices up the input into chunks of about the requested chunk length, encodes them separately, and saves the blocks of latent codes along with the chunk length. However, concatenating the separately encoded chunks gives a different sequence of discrete codes than encoding the whole sequence at once (or, more generally, using a different chunk size). Specifically, decoding with a larger chunk size will lead to repeated audio segments at the original chunk boundaries (about 5 ms per boundary in the default settings). This means a model cannot be fed with arbitrary excerpts from the discrete code sequence; the excerpts have to be aligned on chunk boundaries to be meaningful, and the model will have to learn to model the boundaries at the expected positions. It also means I cannot jump to a specific position in the audio by just multiplying the timestamp by 86 Hz.
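As a small illustration of the indexing this would enable, assuming the 44.1 kHz model with a total encoder stride of 512 samples (which gives the roughly 86 Hz frame rate mentioned above; both values are assumptions about the configuration):

```python
sample_rate = 44100
hop_length = 512                       # assumed product of the encoder strides
frame_rate = sample_rate / hop_length  # ≈ 86.13 code frames per second

def frame_index(seconds: float) -> int:
    """Map a timestamp to a code frame index; only meaningful if the
    discrete codes do not depend on how the audio was chunked."""
    return round(seconds * frame_rate)

print(frame_index(10.0))  # ≈ 861, the code frame at the 10-second mark
```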
Since the model is convolutional, it is possible to implement a chunked inference that gives the same result as passing the full signal (except at the beginning and end, since we cannot simulate the zero-padding of hidden layers by padding the input). This entails setting padding to 0 in all Conv1d layers, zero-padding the input signal / code sequence once (before chunking), and overlapping the chunks by the amount of padding. The current implementation already sets padding to 0 and pads the input, but follows a different strategy: to obtain the same codes as a full pass, the input signal chunks would have to overlap by the amount they are padded with, and for decoding, the code chunks would have to be padded and overlapped as well; however, the decompression routine neither pads nor overlaps the codes. Instead, it relies on the input signal being padded and overlapped enough to cater for both the encoder and the decoder (i.e., it produces code chunks that are already overlapped and padded, but only for the chunk length that was used for encoding).
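A rough sketch of how "the amount of padding" could be computed, assuming a plain cascade of Conv1d layers (the stride/padding list below is illustrative, not read from this repo): each layer's disabled padding costs `padding × cumulative stride` input samples of context on either side.

```python
def encoder_context(layers):
    """layers: list of (stride, padding) per Conv1d, in order.

    Returns the one-sided context in input samples needed to replace the
    disabled zero-padding, and the total hop length of the cascade.
    """
    context, cum_stride = 0, 1
    for stride, padding in layers:
        # `padding` positions at this layer's input correspond to
        # `padding * cum_stride` samples of the original signal.
        context += padding * cum_stride
        cum_stride *= stride
    return context, cum_stride

# Illustrative layer list (not the real encoder configuration):
layers = [(1, 3), (2, 1), (4, 2), (8, 4), (8, 4), (1, 3)]
context, hop = encoder_context(layers)
print(context, hop)  # one-sided overlap in samples, and the encoder hop length
```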