Due to the large size of the dataset, I decided early in the project not to iterate over the whole training set (200M+ examples) during each training epoch. Instead, a random subset of the training examples is sampled at every epoch. This way, I get feedback from the monitoring channels more often, and the learning rate is also adjusted more often.
I recently decided to increase the number of batches used per epoch, from 5,000 to 20,000. With a batch size of 2048, this means I am now iterating over more than 40M examples per epoch. For reasons I cannot really explain, increasing the number of batches per epoch seems to have an impact on the MSE (which is still monitored on the same validation examples).
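The per-epoch subsampling described above can be sketched as follows. This is a hypothetical illustration, not the actual training code: the function names and the use of sampling with replacement are my assumptions.

```python
import numpy as np

rng = np.random.RandomState(0)

def epoch_batches(n_examples, batch_size=2048, batches_per_epoch=20000, rng=rng):
    """Yield index arrays for one epoch's worth of randomly sampled batches.

    Instead of sweeping all n_examples (200M+), each epoch draws
    batches_per_epoch random mini-batches (here, with replacement).
    """
    for _ in range(batches_per_epoch):
        yield rng.randint(0, n_examples, size=batch_size)

# With the defaults, one epoch sees 20000 * 2048 = 40,960,000 examples (> 40M).
# Small demo with 3 batches:
n_seen = sum(len(idx) for idx in epoch_batches(200_000_000, batches_per_epoch=3))
```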
With a two-layer MLP of 2000 units (1050-950), the MSE went from 0.026 to 0.024.
Moreover, I am currently training a convolutional network like the one I described in my previous blog entry, but with 2000 units in the two MLP layers (instead of 500). The training also uses 20,000 batches per epoch. So far, it seems to be doing at least as well as the 2000-unit two-layer MLP.
I have been able to match my best result (MSE of 0.027) using a convolutional architecture. Basically, I added a convolutional layer at the ‘bottom’ of the two-layer MLP architecture I was using before. Since this only matches, rather than beats, my previous best, the new layer was not really useful.
I found that using 13 channels in the convolutional layer was optimal; I was not able to match my best result with other numbers of channels. As for the kernel, a length of 7 or 8 gets the job done.
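For concreteness, here is a shape-only sketch of such a convolutional bottom layer: 13 channels with a kernel of length 7 slid over the 100 previous acoustic samples. This is an illustrative NumPy re-implementation, not the actual Pylearn2 model; the weights are random placeholders.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(100)           # the 100-sample acoustic context
kernels = rng.randn(13, 7)   # 13 channels, kernel length 7 (random placeholders)

def conv1d_valid(x, kernels):
    """'Valid' 1-D convolution: output length = 100 - 7 + 1 = 94 per channel."""
    klen = kernels.shape[1]
    n_out = len(x) - klen + 1
    out = np.empty((kernels.shape[0], n_out))
    for c, k in enumerate(kernels):
        for t in range(n_out):
            out[c, t] = np.dot(x[t:t + klen], k)
    return out

feature_map = conv1d_valid(x, kernels)  # shape (13, 94), fed to the MLP above it
```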
In the future, it would be interesting to add a second convolutional layer.
I have added the one-hot encoded speaker features available in the TIMIT dataset (dialect, education, gender and race) to Vincent Dumoulin’s Pylearn2 wrapper. Unfortunately, I cannot seem to make them improve the results. I have concatenated each of them, one at a time, to the input vector I was using before (100 previous acoustic samples and a one-hot encoding of the phone ID associated with the prediction) in a two-layer MLP with 500 (rectified linear) units (300-200) (see this yaml file for more details).
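The input construction amounts to concatenating one extra one-hot vector at a time onto the existing input. A minimal sketch, with assumed sizes (61 phones for the one-hot phone ID, 2 classes for gender; the real wrapper's encodings may differ):

```python
import numpy as np

def one_hot(index, n):
    """Length-n one-hot encoding of a category index."""
    v = np.zeros(n)
    v[index] = 1.0
    return v

acoustic = np.random.randn(100)  # 100 previous acoustic samples
phone = one_hot(12, 61)          # phone ID (61 TIMIT phones, assumed)
gender = one_hot(0, 2)           # one speaker feature at a time, e.g. gender

# The MLP input grows by the size of the added one-hot vector.
x = np.concatenate([acoustic, phone, gender])
```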
As presented before, an MSE of 0.027 is obtained with this model when not using the speaker features. When I add any of the speaker features, I get more or less the same MSE (0.027 or 0.028). I did not have time to tweak the learning rate much, but I do not get the sense that this explains the lack of improvement. That will remain a mystery for now …
When I started working on the project, I intuitively decided to use bigger mini-batches (2048 examples) than I was used to because of the huge size of the training set (200M+ examples). We saw in class that the number of training examples should not be a factor in the choice of the batch size. I have thus decided to experiment with different batch sizes to try to figure out which one is best. Here are the results of these experiments for two models: a one-layer MLP with 500 (rectified linear) units and a two-layer MLP with 500 units (300-200), respectively.
The number of batches per iteration (epoch) was adjusted so that the total number of training examples per iteration stayed the same. Also, the learning rate was tweaked for each batch size.
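Holding the number of examples seen per epoch constant while varying the batch size is simple arithmetic; here is a sketch of how the batch counts scale (the helper name is mine, and I assume the original 5,000 × 2048 setting as the fixed budget):

```python
# Fixed budget of examples per epoch, taken from the original setting.
examples_per_epoch = 5000 * 2048  # 10,240,000

def batches_for(batch_size, total=examples_per_epoch):
    """Number of mini-batches needed to see `total` examples per epoch."""
    assert total % batch_size == 0
    return total // batch_size

settings = {bs: batches_for(bs) for bs in (512, 1024, 2048, 4096)}
# e.g. batch size 1024 -> 10000 batches, batch size 4096 -> 2500 batches
```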
The size of the mini-batches seems to have a small (but probably not significant) impact on the MSE. For both models, the best MSE is obtained when using 1024 training examples per mini-batch. Moreover, the relation between the batch size and the MSE seems to follow a U-shape. It could be interesting to test batch sizes of 256 and 8192 to confirm that.
As expected, the training time per epoch goes down as the mini-batches get bigger. However, we cannot conclude much from the total training time.
All in all, it does not seem to matter much what batch size we use (at least if we are in the 256-4096 range), but for the rest of the project, I will still decrease the batch size to 1024 just to feel safer in terms of MSE.
Using one of the best models I have trained so far (a two-layer MLP with 500 rectified linear units), I have tried to generate sound from a sequence of phones and a seed of 100 acoustic samples. The phone sequence is taken from the 78th example in the test set, and the seed consists of the first 100 samples from this example (excluding the silence at the beginning of the sequence).
The acoustic sequences have been generated using Laurent Dinh’s randomization scheme. The code used to generate the sequence is adapted from JF Santos’ gen_phone.py.
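At a high level, generation is an autoregressive loop: predict the next sample from the last 100 samples plus the current phone, append it, and slide the window. The sketch below is a simplification with a placeholder in place of the trained MLP, and it omits the randomization step from Laurent Dinh's scheme; see JF Santos' gen_phone.py for the real thing.

```python
import numpy as np

def predict(window, phone_onehot):
    # Placeholder for the trained network's forward pass (not the real model).
    return 0.5 * window[-1]

def generate(seed, phone_ids, n_phones=61):
    """Autoregressively generate one sample per entry of phone_ids."""
    window = list(seed)  # the 100-sample seed
    out = []
    for pid in phone_ids:
        onehot = np.zeros(n_phones)
        onehot[pid] = 1.0
        nxt = predict(np.array(window[-100:]), onehot)  # condition on last 100
        out.append(nxt)
        window.append(nxt)  # the prediction becomes part of the next input
    return out

samples = generate(seed=np.zeros(100), phone_ids=[5, 5, 5])
```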
Here is the original sequence:
Here is one of the best looking generated sequences:
I have started using phones in the inputs of my models. More precisely, in addition to the 100 previous acoustic samples used to predict the next one, I now also feed in a one-hot encoding of the phone ID associated with the prediction. This extra feature is now available in Vincent Dumoulin’s latest version of his Pylearn2 wrapper for the TIMIT dataset.
Here are some preliminary results:
The middle column (Without phones) is simply a copy of results presented in my previous blog entries. For the last column (With phones), very little has been done yet to tweak the initial learning rate; I have pretty much just tried the learning rates that worked best without phones. I am still using rectified linear units. You can look at the yaml file here if you want to see how to combine the acoustic samples with the phone ID.
So far, it seems like using the phone associated with the prediction does not help much. A next step would be to add the previous and next phones to the inputs, as well as information about the speaker. This would require modifying Vincent’s dataset wrapper to add these extra features.
EDIT#1: I have updated the table to include the number of epochs during which the models were trained. (An epoch consists of an iteration over 5000 mini-batches of size 2048.)
EDIT#2: I have updated the table to include the training time, in hours.