Chunked inference result depends on chunk length #39

Open
f0k opened this issue Sep 5, 2023 · 9 comments

f0k commented Sep 5, 2023

First of all, thanks for the great work and clean code!

For the purpose of training a model on the discrete codes (as opposed to just encoding and decoding a signal), the current chunked inference is not ideal. As nicely summarized by @cinjon in #35, the current implementation slices up the input into chunks of about the requested chunk length, encodes them separately, and saves the blocks of latent codes along with the chunk length. However, concatenating the separately encoded chunks gives a different sequence of discrete codes than encoding the whole sequence at once (or, more generally, using a different chunk size). Specifically, decoding with a larger chunk size will lead to repeated audio segments at the original chunk boundaries (about 5ms per boundary in the default settings). This means a model cannot be fed with arbitrary excerpts from the discrete code sequence; the excerpts have to be aligned on chunk boundaries to be meaningful, and the model will have to learn to model the boundaries at the expected positions. It also means I cannot jump to a specific position in the audio by just multiplying the timestamp by 86Hz.
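The boundary mismatch can be reproduced with a toy 1-D convolution: encoding chunks independently zero-pads each chunk where its neighbor's samples belong, so the frames at the seams differ from single-pass encoding. A minimal NumPy sketch with a hypothetical length-3 filter (not the actual DAC encoder):

```python
import numpy as np

# hypothetical length-3 filter standing in for one "same"-padded Conv1d layer
kernel = np.array([0.25, 0.5, 0.25])
signal = np.arange(16, dtype=float)

def conv_same(x):
    # zero-pad by 1 on each side, then "valid" convolution -> same length as x
    return np.convolve(np.pad(x, 1), kernel, mode="valid")

full = conv_same(signal)                        # encode in one pass
chunked = np.concatenate(                       # encode two chunks separately
    [conv_same(c) for c in np.split(signal, 2)]
)

# interior samples match, but the samples at the chunk boundary differ,
# because each chunk was zero-padded where its neighbor's samples belong
print(np.allclose(full, chunked))          # False
print(np.allclose(full[:7], chunked[:7]))  # True (away from the boundary)
```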

Since the model is convolutional, it is possible to implement a chunked inference that gives the same result as passing the full signal (except at the beginning and end, since we cannot simulate the zero-padding of hidden layers by padding the input). This entails setting padding to 0 in all Conv1d layers, zero-padding the input signal / code sequence (before chunking), and overlapping the chunks by the amount of padding. The current implementation already sets padding to 0 and pads the input, but follows a different strategy: to obtain the same codes, the input signal chunks would have to overlap by the amount they're padded with, and the code chunks would have to be padded and overlapped as well, but the decompression routine neither pads nor overlaps the codes. Instead, it relies on the input signal being padded and overlapped to cater for both the encoder and the decoder (i.e., it produces overlapped and padded code chunks for the chunk length that was used for encoding).
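The pad-once-then-overlap strategy for the encoder side can be illustrated with the same kind of toy filter: pad the whole signal a single time, convolve chunks that overlap by twice the padding, and the concatenated result matches the single-pass output exactly. A sketch with a hypothetical length-3 filter, not the actual DAC layers:

```python
import numpy as np

# hypothetical length-3 filter standing in for a Conv1d layer
kernel = np.array([0.25, 0.5, 0.25])
pad = len(kernel) // 2            # a real layer pads by (kernel_size - 1) // 2
signal = np.arange(16, dtype=float)

# single-pass reference: pad once, then "valid" convolution
padded = np.pad(signal, pad)
full = np.convolve(padded, kernel, mode="valid")

# chunked version: take windows from the padded signal that overlap by 2 * pad
chunk_len = 8
pieces = []
for start in range(0, len(signal), chunk_len):
    window = padded[start : start + chunk_len + 2 * pad]
    pieces.append(np.convolve(window, kernel, mode="valid"))
chunked = np.concatenate(pieces)

print(np.allclose(full, chunked))  # True: identical to single-pass encoding
```

For a stack of layers, the overlap has to cover the accumulated receptive field of the whole encoder rather than a single layer's padding.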

f0k commented Sep 8, 2023

A proof-of-concept implementation of the encoding part is here: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80. The main algorithm is this part: https://gist.github.com/f0k/266dd89e52417ba6138d33afa9ff8e80#file-chunked_dac-py-L140-L181.
It produces the same codes as python3 -m dac encode --win_duration=10000, except for the first and last 6 frames.
I did not attempt to implement it as a pull request in your code base because it would require some decisions on whether and how to handle backwards compatibility, and for my purposes I also need the output stored in a more efficiently readable format. If you're interested, I'm happy to help integrate it, though.
Decoding is not implemented yet. Due to the stacked strided transposed convolutions, it needs an overlap-add algorithm to be useful (otherwise it would be limited to impractically large window sizes). I don't urgently need it, so let's see.

pseeth commented Sep 9, 2023

Hey, thanks for the implementation! I've been on parental leave for the last few months, so I haven't been plugged in since writing the existing chunking code. Happy to take a look, though, as I noticed the same issue with needing to use the same chunk size at encoding and decoding time. My fix at the time was just to save the chunk length in the metadata so things could be decoded properly, but this comes with some downsides, as you mentioned. I'll take a look at your code and see what I can do!

Thanks!

f0k commented Sep 11, 2023

I've been on parental leave the last few months

Nice, congratulations!

I'll take a look at your code and see what I can do!

Take your time! From what I can see, the decoding algorithm will need more attention than you can spare during parental leave, so don't bother for now. Integrating the encoding algorithm alone would not be of much use.

pseeth commented Sep 11, 2023

Thank you for understanding! And thanks for the congrats!

If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful. The encoding algorithm looks quite nice, and it has nice properties with regard to invariance to chunk size. I also feel that these sorts of tricks with convolutional nets for audio are not widely available, so getting it right in this repo would be a nice contribution to open source!

f0k commented Sep 12, 2023

If you have time, a slightly more detailed sketch of the decoding algorithm would be super helpful.

Well, we'd start by figuring out how many code frames we can decode at once to stay within the window size given by the user. Then we'd decode those, getting, say, 5 seconds of audio. We take the next chunk of code frames (which will have to overlap with the previous one, at least due to the initial size-7 convolution) and again get 5 seconds of audio. Now we don't concatenate the two 5-second chunks; instead, we overlap the second chunk a bit with the first and add up the samples in the overlapped part. The tough part is figuring out from the network architecture by how much to overlap the code chunks and by how much to overlap the outputs. If the architecture were perfectly symmetric, the output would need to overlap exactly as much as we overlapped the input during encoding, but the architecture is not symmetric (the decoder has additional convolutions interspersed with the transposed convolutions).

My implementation includes some receptive field computation for the decoder, but maybe it is more helpful to compute the receptive field of the decoder in reverse, or to separate the effects of the forward and transposed convolutions.

/edit: The overlap-add idea does not apply due to the nonlinearities. Instead, we will need to overlap the code chunks and crop the decoder output to remove the wrongly computed borders (that should have taken the neighboring codes into account, but could not). Also the code for disabling padding needs to be fixed: To disable zero-padding in a transposed convolution, its padding ought to be set to its kernel_size, not to zero. Leaving it at zero will just increase the size of the wrongly computed borders that we need to discard, if I see correctly.
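The overlap-and-crop idea can be checked on a single toy transposed convolution with padding 0, where the border samples that depend on missing neighboring codes are produced explicitly and can be discarded. This is only a sketch with made-up kernel values and one layer; for the real decoder the overlap and crop widths would have to be derived from the whole stack:

```python
import numpy as np

K, S = 4, 2                              # toy decoder layer: kernel 4, stride 2
kernel = np.array([0.1, 0.4, 0.4, 0.1])  # made-up kernel values
codes = np.arange(12, dtype=float)

def tconv(x):
    # transposed convolution with padding=0: place the code frames S samples
    # apart, then full convolution; output length is (len(x) - 1) * S + K
    up = np.zeros((len(x) - 1) * S + 1)
    up[::S] = x
    return np.convolve(up, kernel, mode="full")

full = tconv(codes)                      # decode all codes in one pass

chunk, ov = 4, 1                         # 4 code frames per chunk, 1 frame overlap
pieces = []
for start in range(0, len(codes), chunk):
    lo = max(start - ov, 0)              # extend the chunk into its neighbors
    hi = min(start + chunk + ov, len(codes))
    y = tconv(codes[lo:hi])[(start - lo) * S :]  # crop the left-overlap samples
    if hi > start + chunk:               # crop the right overlap (non-final chunks)
        y = y[: chunk * S]
    pieces.append(y)
chunked = np.concatenate(pieces)

print(np.allclose(full, chunked))        # True: matches the one-shot decode
```

Since this toy layer is linear, overlap-add would happen to give the same result here; cropping is what still works once nonlinearities sit between the layers.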

jbmaxwell commented Sep 13, 2023

I was having problems getting the example Python code running and wound up here. This is working, but I've noticed that, using your script, the decoded file is half the size of the original input file (7.4 MB vs. 15 MB). Is there a setting I'm missing somewhere? I've tried setting it to 8 kbps and 16 kbps.

UPDATE: never mind, I just noticed my input file is 32-bit and the output is 16-bit. 🙈

f0k commented Sep 14, 2023

Glad you found it. I've just updated the gist to the version I ended up with for my use case; it adds support for input and output directories and can be launched multiple times with the same input and output directories but different CUDA devices, taking care not to process the same files. Chunked decoding is still left as an exercise for the reader ;)

pseeth commented Sep 18, 2023

Decoding in chunks that are overlapped and then chopping off the overlapped samples sounds like a very plausible method. I'll give it a go! Thanks for the additional detail!

BridgetteSong commented Sep 20, 2023

I ran into the same issue, but I have since solved it. A few things to check:

  • First confirm that you are in inference mode, i.e. that dropout layers and the like are turned off.
  • Confirm that identical inputs give identical outputs from the encoder and decoder; you can feed the same audio twice and compare the outputs.
  • I found that when "@torch.jit.script" is enabled in the Snake module, the outputs differ slightly even for identical inputs.
  • Outputs for the same input will also differ between CPU and GPU.
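The first two checks can be turned into a quick script: with dropout active, the same input gives different outputs; in eval mode it does not. A generic torch sketch with a stand-in model, not the actual codec:

```python
import torch

torch.manual_seed(0)
# stand-in model with a dropout layer, used in place of the real codec
model = torch.nn.Sequential(
    torch.nn.Conv1d(1, 4, 3, padding=1),
    torch.nn.Dropout(0.5),
    torch.nn.Conv1d(4, 1, 3, padding=1),
)
x = torch.randn(1, 1, 64)

model.train()                      # dropout active: repeated runs differ
a, b = model(x), model(x)

model.eval()                       # inference mode: dropout disabled
with torch.no_grad():
    c, d = model(x), model(x)      # identical on the same device

print(torch.equal(c, d))           # True
print(torch.equal(a, b))           # False (almost surely, due to dropout)
```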
