Introduction

Neural networks have been in the spotlight recently, and it isn’t a big surprise: they are producing incredible results in a variety of problem spaces. Speech recognition, for example, has reached human parity thanks to the expressive power of neural networks [1]. The amazing thing is that they were actually introduced back in 1943 but have only now become popular [2], thanks to the availability of large amounts of data and GPU parallelisation. These results are quite exciting and inspired me to undertake a musical experiment with these networks!

Fig 1. A deep neural network.


Music and Neural Networks

Observing the success of neural networks in a variety of fields made me think that they could be used to make some interesting predictions about music. Google is doing amazing things with AI and music: Magenta, a Google Brain project, is producing groundbreaking results with AI composers, using neural networks to generate melodies. You can even perform a duet with an AI in this experiment. This is quite amazing!

However, generating melodies is only one important aspect of music. One could delve deeper and say that there are two main elements to music: the composition and the performance. The composition focuses on the musical notes that define a song. One can think of this as sheet music.

Fig 2. A bar from sheet music.

Then there is the performance: how the notes are actually played by the performer. If you gave the sheet music for Mozart’s last symphony to two different people, it’s unlikely that they would perform it in identical ways. Everybody has their own unique way of performing, and this individualistic way of playing music can be labelled as a musical style.


What is Musical Style?

How do we define musical style? It is quite hard to parameterise, since style is not a tangible property of music like pitch. If we listen to a novice pianist and an experienced pianist playing the same sheet music, we realise that each produces a different range of dynamics. Dynamics are the variation in the loudness of music over time. On a piano, for example, the loudness of a note depends on how hard the pianist hits the key. How do we represent this on paper? In Western music notation, dynamics are indicated by Italian terms and their abbreviations (such as p for piano and f for forte), which indicate how loudly the notes should be played. This is where it gets a bit tricky: performers are likely to have their own interpretation of a song’s dynamics, and a pianist will perform a given song with their own unique set of dynamics. This allows us to say that dynamics are a very important feature of style.

Then there are musical genres such as Classical, Jazz, and so on. What’s interesting here is that people can usually label a song with a genre after hearing it, so it appears that songs within a genre follow a similar style. So now we understand that dynamics help define musical style, and that musical styles can be categorised by genre.

Fig 3. The scale of dynamics in musical notation.


What Can We Do With Dynamics?

Previous projects have attempted to generate the composition, but not performance aspects such as the dynamics [3, 4, 5]. As a result, generated performances lack dynamics and sound monotonous. This opens the door to adding musical style to these compositions! It all leads to one interesting question:

Can a machine perform like a human?


More specifically, can a machine produce the dynamics for digital sheet music and thus generate stylised performances? We’ll be using the MIDI file format for this project, as it’s similar to digital sheet music.


Human or Bot?

So I have successfully designed and trained a model that can synthesise the dynamics for any sheet music! Here is a little quiz: A and B below are performances of the same sheet music. One is a human performance and the other is machine-generated. Can you figure out which performance is by the human?

A


B


Answer: B.
Baseline with randomised velocities:


In my experimental study, fewer than half of the respondents could guess correctly, so the model has passed a musical Turing test! Now that you have seen a glimpse of what is to come, let’s refresh the underlying theory of neural networks. I will only briefly explain the pipeline of the project; if you’re a curious soul and want the specifics, then please read my thesis. I started this project not knowing much about neural networks or machine learning, so if you’re in the same position, it’ll hopefully give you a good start.


Let’s Refresh Our Knowledge…

The Neuron

Fig 4. The artificial neuron.

The artificial neuron is modelled after the neurons in our bodies. It is a basic unit of computation [2]. It takes a number of inputs and performs a weighted sum of the incoming values. The inputs $x_{i}$, where $i = 1, 2, …, n$, are multiplied by their corresponding weights $w_{i}$. There is also a bias unit whose input is always $1$; its weight $b$ can also be written as $w_{0}$. The bias exists to add flexibility to the neuron’s output by shifting the activation function $g$ to the left or right. The net input, $z$, can be calculated as follows: \begin{equation} z = \sum_{i=1}^n w_{i} x_{i} + b \label{input} \end{equation}

The output of a neuron, $o$, is calculated by applying an activation function $g$ on the net input $z$ : \begin{equation} o = g(z) \label{output} \end{equation}

A commonly used activation function is the sigmoid function. The sigmoid takes any real-valued number and squashes it into the range between $0$ and $1$: very negative numbers are mapped close to $0$ and large positive numbers close to $1$. Other activation functions include tanh and ReLU.

Fig 5. The sigmoid function.
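To make this concrete, here is a tiny NumPy sketch of a single neuron (the input values and weights are made up purely for illustration):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    z = np.dot(w, x) + b   # net input: weighted sum plus bias
    return sigmoid(z)      # output o = g(z)

x = np.array([0.5, -1.0, 2.0])   # example inputs (made up)
w = np.array([0.4, 0.3, -0.2])   # example weights (made up)
print(neuron(x, w, b=0.1))
```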

Feedforward Neural Networks

Using these neurons as building blocks, we can arrange them into different architectures. The Feedforward Neural Network (FNN) is the most commonly used Artificial Neural Network (ANN) architecture. Neurons are connected in layers, where the first layer is the input layer and the last layer is the output layer. The layers between the input and output layer are called hidden layers.

Data is fed from one layer to the next, which is why they are called “feedforward” networks. Fig 6 shows a three-layer feedforward neural network with one hidden layer. Feedforward networks make an assumption about the data they see: they assume that every input is independent of the rest.

Fig 6. Three layer feedforward neural network.

Recurrent Neural Networks

So how does one deal with sequences, or time? A simple FNN is limited by its assumption that the inputs are independent of each other. It also lacks memory: it does not remember what the previous or future inputs were.

In music, the notes articulated at any given time depend on what came before and even on what comes after. Musicians write music with the global structure of a song in mind, so music has a complex macro-harmonic structure with many long-term dependencies. And this is where Recurrent Neural Networks (RNNs) come in!

RNNs are similar to FNNs, but they differ by having state and a feedback loop known as a recurrent weight. They feed their previous state into the computation of their current output, and the recurrent weight determines how much of that previous state is introduced. So every layer takes its previous state and the current input into consideration when calculating the corresponding output. What does this all mean in terms of music? Put simply, RNNs have a simple memory mechanism that allows them to remember what was computed in the past and use it in the current computation. RNNs have been applied successfully to a range of interesting problems [6, 7].

Fig 7. Unfolded RNN over time.
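As a rough sketch (illustrative only, with random toy weights), the recurrence looks like this in NumPy: each output depends on the current input and on the previous hidden state through the recurrent weights.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    h = np.zeros(W_h.shape[0])              # initial state
    outputs = []
    for x in xs:                            # one time-step at a time
        h = np.tanh(W_x @ x + W_h @ h + b)  # current input + previous state
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 3))                # 5 time-steps, 3 features each
out = rnn_forward(xs, rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4))
print(out.shape)                            # (5, 4): one hidden state per time-step
```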

Bi-directional Recurrent Neural Network

When reading sheet music, the human eye can look ahead of the current bar; it can read ahead of and/or behind any point on the sheet. The vanilla RNN, however, reads inputs strictly in order and has no information about upcoming time-steps. This is where an interesting architecture called the Bi-directional RNN comes in [8]. It is composed of two RNN layers: a forward layer, which processes the input sequence in its original, chronological order, and a backward layer, which processes the sequence in reverse. Bi-directional RNNs seem like a good choice!

Fig 8. A Bi-directional RNN.
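A minimal sketch of the idea: run one RNN over the sequence in chronological order, another over the reversed sequence, and concatenate their outputs at each time-step (the RNN here is the toy version from above).

```python
import numpy as np

def simple_rnn(xs, W_x, W_h, b):
    h, outs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)
        outs.append(h)
    return np.stack(outs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    forward = simple_rnn(xs, *fwd_params)               # chronological order
    backward = simple_rnn(xs[::-1], *bwd_params)[::-1]  # reversed, then flipped back
    return np.concatenate([forward, backward], axis=-1) # both views per time-step
```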

Long Short-Term Memory Networks

When working with long sequences such as music, RNNs face one major problem: vanishing gradients. In practice, this means they cannot remember long-term dependencies [9, 10]. To combat this issue, a special network architecture called the Long Short-Term Memory network (LSTM) was introduced [11]. It adds a gating mechanism that gives fine control over which parts of the context should be remembered and how much should be forgotten.

Fig 9. An LSTM Cell.
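For reference, the standard LSTM update can be written in terms of its forget, input and output gates (here $\sigma$ is the sigmoid and $\odot$ is element-wise multiplication): \begin{align} f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\ i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\ o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\ \tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ h_t &= o_t \odot \tanh(c_t) \end{align} The forget gate $f_t$ decides how much of the old cell state to keep, the input gate $i_t$ how much new information to write, and the output gate $o_t$ how much of the cell state to expose as the hidden state.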


Architecture Design

So we’ll be playing with Bi-directional LSTMs. But how do we design our model?

GenreNet

Remembering that each genre has a distinctive musical style, we can design a basic model that learns the dynamics of songs within a known genre. This simple model is called “GenreNet”: it takes in sheet music and produces the corresponding dynamics. The model consists of two main layers, as seen in Fig 10:

  1. The Bi-directional LSTM layers: The LSTMs provide memory for learning dependencies, and the bi-directional architecture allows the model to take the future into consideration. To increase the expressive power of the model, these layers can be stacked, meaning one layer’s output feeds into the next layer’s input.

  2. The linear layer: The output of an LSTM usually lies in $[-1, 1]$. To scale these numbers to a larger range, a linear layer is used, which performs a linear transformation on its input.

Fig 10. GenreNet.
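For illustration, here is a minimal Keras-style sketch of a GenreNet-like stack (the project itself was built in TensorFlow; the number of LSTM units is a placeholder here, not a value taken from the thesis):

```python
import tensorflow as tf

def build_genrenet(input_dim=176, lstm_units=256, output_dim=88, depth=3):
    # Sheet-music matrix in (one time-step per row), velocity matrix out.
    inputs = tf.keras.Input(shape=(None, input_dim))
    x = inputs
    for _ in range(depth):  # stacked Bi-directional LSTM layers
        x = tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)
    # Linear layer (no activation) scales the LSTM outputs beyond [-1, 1].
    outputs = tf.keras.layers.Dense(output_dim, activation=None)(x)
    return tf.keras.Model(inputs, outputs)
```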

StyleNet

However, there are several genres, and GenreNet is limited to learning the dynamics of the genre it is trained on. This motivates a more complex design called “StyleNet”, which contains one or more GenreNet subnetworks, allowing the network to learn genre-specific style. The GenreNet subnetworks share a layer called the interpretation layer. This was motivated by the Siamese neural network from computer vision [12], but in reverse: a Siamese network tries to learn the similarity between its inputs, whereas in our case we already know what is shared between them, namely the sheet music.

StyleNet is similar to a translation tool: it takes sheet music as input and generates dynamics in different styles. It is a multi-task learning model, similar to how Luong et al. trained a neural network to translate English into a range of languages [13]. The shared interpretation layer is advantageous because it reduces the number of parameters the network needs to learn. Fewer parameters should mean less data is needed, which is always good!

Fig 11. StyleNet.
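A sketch of the same idea for StyleNet: one shared interpretation layer feeding a GenreNet-style branch per genre. The layer type used for the interpretation layer and the unit counts below are my assumptions for illustration, not the exact thesis configuration.

```python
import tensorflow as tf

def build_stylenet(input_dim=176, interp_units=176, lstm_units=256,
                   output_dim=88, genres=('classical', 'jazz')):
    sheet = tf.keras.Input(shape=(None, input_dim), name='sheet_music')
    # Shared interpretation layer: every genre branch reads from it.
    shared = tf.keras.layers.LSTM(interp_units, return_sequences=True,
                                  name='interpretation')(sheet)
    outputs = []
    for genre in genres:
        x = shared
        for i in range(3):  # each GenreNet branch: stacked Bi-LSTM layers
            x = tf.keras.layers.Bidirectional(
                tf.keras.layers.LSTM(lstm_units, return_sequences=True),
                name=f'{genre}_bilstm_{i}')(x)
        outputs.append(tf.keras.layers.Dense(output_dim, activation=None,
                                             name=f'{genre}_velocities')(x))
    return tf.keras.Model(sheet, outputs)
```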

Data

Now we’re ready to collect our data! But where do we get it from? There are several sites hosting music files in the MIDI format. Existing MIDI datasets do exist, but they aren’t really suitable for this problem: we want to focus on actual human performances.

MIDI is a great format to use because, unlike waveform audio, it stores the musical properties of a piece. Dynamics are stored in the format through a parameter called velocity, which is analogous to volume but restricted to the range $0$ to $127$; every played note has a velocity. So we will use velocity to capture the dynamics of our music. The next steps are to clean the gathered data and then use it to create a data representation for StyleNet.
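As a quick illustration (a sketch, not part of the original pipeline), the pretty_midi library can read velocities straight from a MIDI file; the file name below is hypothetical. A check like this also comes in handy later when filtering for human-recorded dynamics.

```python
import pretty_midi

def distinct_velocities(path):
    pm = pretty_midi.PrettyMIDI(path)
    return len({note.velocity
                for inst in pm.instruments if not inst.is_drum
                for note in inst.notes})

# Files whose notes all share one global velocity are unlikely to be
# genuine human performances.
print(distinct_velocities('some_performance.mid'))
```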

The Piano Dataset

To create a valid dataset for learning style, we need our downloaded files to adhere to certain requirements:

  1. Genres of the MIDI files: This is important because we want to capture the dynamics of human-recorded performances. When it comes to recording music, a MIDI controller is used, and the most prevalent one is the keyboard. This means that genres such as Classical and Jazz are most likely to contain human performance styles, so we will restrict our dataset to Classical and Jazz.

  2. Instrument of the MIDI files: Most MIDI files are in format 1, which supports multiple instrument tracks; this is especially true for Jazz. We will be restricting our problem to just the piano. There are also many MIDI files containing separate tracks for the pianist’s left and right hands. The best-suited format for our problem is format 0, as it allows us to focus on one track, so all downloaded MIDI files were converted and merged into format 0 MIDI files with a single piano track.

  3. Human-Recorded Dynamics: Many available MIDI files do not capture the dynamics of the performances; they usually share a single global velocity across all notes. Using the MIDI files from the Yamaha Piano e-Competition as a reference, I noticed that human performances usually contain at least $45$ distinct velocities, as can be seen in Fig 12. However, most of those performances are quite lengthy, whereas the downloaded set also contains shorter tracks. A minimum threshold of at least $20$ distinct velocities was therefore chosen for the dataset.

  4. Time Signature: Time is continuous, but we need to discretise (quantise) the notes in order to represent them in a way our model can process. To maximise the amount of data captured consistently across the dataset, only songs with the same time signature were kept; $4$/$4$ is the most common and was therefore chosen.

Fig 12. Left: Number of velocities in downloaded MIDI files. Right: Number of velocities in performance MIDI files.

So the Piano dataset contains $349$ Classical and $349$ Jazz tracks, adding up to a total of $698$ files. This isn’t large, however, and I plan on expanding it soon by cleaning more data.


MIDI Encoding Scheme

With the Piano dataset in hand, we can now design our input/output matrix representation.

Quantisation

Firstly, we need to quantise our dataset. This allows us to capture the notes and represent them in matrix form. Unfortunately, we lose the exact timing of the notes, but this is unavoidable: if the note times are not binned, the notes may not be captured in the matrix representation.

So I chose a sampling interval of a $1/16$th note, meaning all note times are binned to the nearest $1/16$th note. The sampling intervals then align with note times, which allows us to capture as much information as possible. A finer resolution would result in extremely large input matrices, which would substantially increase training times.

Fig 13. Quantisation.
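A minimal sketch of the quantisation step using pretty_midi, assuming a single constant tempo (real files with tempo changes would need per-segment handling):

```python
import pretty_midi

def quantise_to_sixteenths(path, default_tempo=120.0):
    pm = pretty_midi.PrettyMIDI(path)
    _, tempi = pm.get_tempo_changes()
    tempo = tempi[0] if len(tempi) else default_tempo
    step = 60.0 / tempo / 4.0  # duration of a 1/16 note in seconds
    for inst in pm.instruments:
        for note in inst.notes:
            # Snap start and end times to the nearest 1/16-note grid point.
            note.start = round(note.start / step) * step
            note.end = max(note.start + step, round(note.end / step) * step)
    return pm, step
```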

Input Matrix Representation

After quantisation, we can start converting our MIDI files into a suitable input representation. The input is responsible for carrying note pitches and their start and end times; it is analogous to sheet music. A note can be in one of three states: on, sustained from the previous time-step, or off. Using a binary vector, “note on” is encoded as $[1,1]$, “note sustained” as $[0,1]$ and “note off” as $[0, 0]$. The first bit represents whether the note was articulated in that time-step, and the second whether the note is being held.

Next, the note pitch needs to be encoded. At any time-step, any possible pitch could be played. MIDI encodes pitch as a number ranging from $0$ to $127$, so a matrix is created whose first dimension represents the MIDI pitch number and whose second dimension represents the quantised time-step (one $1/16$ note per step).

Because the note state is a two-bit vector, the pitch dimension is twice the number of pitches. A pitch dimension of $88 \times 2 = 176$ was chosen, where $88$ is the range of note pitches and $2$ is the size of the note-state vector. The reasoning is that most pianos only have $88$ keys, so only pitches from MIDI note $21$ up to (but not including) $109$ are captured. An explanatory diagram of the representation can be seen in Fig 14.

Fig 14. Input and Output Representation.
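Here is a sketch of how such a sheet-music matrix could be built from quantised notes (`notes` is any iterable of objects with `.pitch`, `.start` and `.end`, e.g. pretty_midi notes; this is an illustration, not the exact thesis code):

```python
import numpy as np

PITCH_LOW, N_PITCHES = 21, 88  # the 88 piano keys

def encode_sheet(notes, n_steps, step):
    X = np.zeros((n_steps, 2 * N_PITCHES), dtype=np.float32)
    for note in notes:
        k = note.pitch - PITCH_LOW
        start = int(round(note.start / step))
        if not (0 <= k < N_PITCHES) or start >= n_steps:
            continue  # skip pitches outside the piano range
        end = min(max(start + 1, int(round(note.end / step))), n_steps)
        X[start, 2 * k] = 1.0          # first bit: articulated here -> [1, 1]
        X[start:end, 2 * k + 1] = 1.0  # second bit: held -> [0, 1] on later steps
    return X
```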

Output Matrix Representation

The output is responsible for carrying the velocities of our sheet-music input. Similar to the sheet-music matrix above, the columns of the matrix represent pitch and the rows represent time-steps. The difference, in this case, is that the pitch dimension is only $88$ notes wide, as we only need to represent one velocity per pitch.

To make the learning process easier, it is best to reduce the scale of the data so the network does not have to learn the scale itself. This is done by dividing the velocities by the maximum velocity, $127$, which ensures that all velocities lie between $0$ and $1$. Finally, a predicted velocity matrix can be decoded back into a quantised MIDI file by reversing the encoding process above, and the resulting stylised MIDI file is saved.
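A matching sketch for the velocity side: scale velocities into $[0, 1]$ when encoding, and multiply by $127$ (clamping to valid MIDI values) when decoding a prediction back into a MIDI file.

```python
import numpy as np

MAX_VELOCITY = 127.0

def encode_velocities(notes, n_steps, step):
    Y = np.zeros((n_steps, 88), dtype=np.float32)
    for note in notes:
        t, k = int(round(note.start / step)), note.pitch - 21
        if 0 <= t < n_steps and 0 <= k < 88:
            Y[t, k] = note.velocity / MAX_VELOCITY  # scale into [0, 1]
    return Y

def decode_velocity(y):
    # Map a prediction in [0, 1] back to a MIDI velocity (clamped to 1-127).
    return int(np.clip(round(float(y) * MAX_VELOCITY), 1, MAX_VELOCITY))
```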


Training

Let’s start training!

Model Setup: The input interpretation layer is $176$ nodes wide and one layer deep. There are two GenreNet units, one for Jazz and one for Classical, each $3$ layers deep. TensorFlow was used to build the model; if you wish to use another deep learning library, it shouldn’t be difficult to reimplement. I used my university’s HPC Zoo to train the model, primarily on NVIDIA GTX 1080 Ti and Titan X GPUs.

Training Method: The model was trained on alternating mini-batches of Classical and Jazz songs, with mini-batches created per genre and a mini-batch size of $4$. Training is most successful when there is a large variation in the information carried by the data the model learns from [14]. For this reason, the dataset is shuffled to encourage different combinations of songs in every mini-batch; this also reduces the chance of a batch of outliers impacting training negatively.
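A simple sketch of this batching scheme: shuffle each genre’s songs and then yield genre-labelled mini-batches in alternation (details such as padding songs to a common length are omitted).

```python
import random

def alternating_batches(classical, jazz, batch_size=4, seed=0):
    rng = random.Random(seed)
    rng.shuffle(classical)  # new song combinations every epoch
    rng.shuffle(jazz)
    for i in range(0, min(len(classical), len(jazz)), batch_size):
        # Alternate genres so neither style dominates consecutive updates.
        yield 'classical', classical[i:i + batch_size]
        yield 'jazz', jazz[i:i + batch_size]
```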

Learning Rate: Choosing a suitable learning rate for the model can be tricky. A small learning rate means the model takes an extremely long time to train; a high learning rate can cause gradient descent to perform such big updates that the algorithm diverges, leading to a disastrous training run. After a series of runs, a learning rate of $10^{-3}$ was chosen.

Optimiser: The Adam optimiser was chosen. Adam performs stochastic optimisation with an adaptive learning rate and momentum [15].

Training-Validation: Most songs are perceived differently, and it is usually difficult to say that two songs sound the same. This suggests that each song contains meaningful information for the model to train on, which motivates a training/validation split of $95\%$/$5\%$: $95\%$ of the Classical songs and $95\%$ of the Jazz songs are used for training, and $5\%$ of each for validation.

Error: StyleNet outputs a velocity matrix for each genre through its GenreNet units; this is a one-to-many setting. Given a sheet-music input of a specific genre, the prediction is taken from the GenreNet unit for that genre.

This is a regression problem, so a meaningful metric for the model’s performance is the mean squared error between the true and predicted velocity matrices. The genre of the input mini-batch determines which GenreNet’s prediction is used. One must remember, however, that since our matrices are mostly zeroes, the computed errors are on an extremely small scale.
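In code, the idea is simply mean squared error over the velocity matrix, computed only on the GenreNet head that matches the mini-batch’s genre (a NumPy sketch):

```python
import numpy as np

def mse(y_true, y_pred):
    # Averaged over every (time-step, pitch) cell; because most cells are
    # zero, the resulting values are very small.
    return np.mean((y_true - y_pred) ** 2)

def stylenet_loss(genre, predictions, y_true):
    # `predictions` maps genre name -> that GenreNet head's velocity matrix;
    # only the head matching the mini-batch's genre is penalised.
    return mse(y_true, predictions[genre])
```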

Truncated Backpropagation Through Time: Backpropagation through time is a very computationally expensive process [14, 16]. Backpropagating the error through every time-step of a sheet-music matrix takes a remarkably long time, and the average length of a sheet-music matrix is around $1500$ time-steps. Backpropagation is therefore truncated to $200$ time-steps to reduce training time. The downside is that the model can only learn dependencies within a $200$-time-step window, but training time improved significantly: convergence time was reduced from $36$ hours to around $12$ hours.
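Conceptually, truncation just means feeding the model fixed-length windows of each song so gradients never flow further back than $200$ steps (a sketch):

```python
def truncated_windows(X, Y, window=200):
    # X: (time, 176) sheet-music matrix, Y: (time, 88) velocity matrix.
    for start in range(0, len(X), window):
        yield X[start:start + window], Y[start:start + window]
```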

Dropout: A common problem faced during training, especially with small amounts of data, is overfitting: the model memorises the dataset and does not generalise well to new data. Dropout tries to combat this issue [17, 18]. During training, each neuron is kept with probability $p$ (i.e. “turned off” with probability $1-p$), so only a subset of the network is updated at each step. Such updates allow the weights of neurons to become less dependent on other neurons.

Dropout can also be thought of as an ensemble of models. Bagging is a type of ensemble learning where each model is trained on a subsample of the training data and thus learns only a subset of the feature space. Dropout can similarly be thought of as an ensemble of different subnetworks, each trained at a single training step on a single sample.

A keep probability of $p = 0.5$ should produce optimal results in theory [19]. However, in practice this is usually not the case. Since our dataset isn’t that large, losing half of the network at any given time could mean the network does not see enough examples to learn the less common patterns, as some notes are played only rarely. The result would be a model that cannot learn the underlying patterns in music, i.e. it underfits. This can be seen in Fig 15. After experimentation, $p$ was chosen to be $0.8$, so only $20\%$ of neurons are dropped at each step.

Fig 15. Epoch snapshot for dropout p = [0.5, 0.8].

Gradient Clipping: LSTM networks are vulnerable to exploding gradients during training. A commonly used technique to combat this is “clipping gradients by norm”, which introduces an additional hyperparameter, $g$: when the norm of a calculated gradient is greater than $g$, the gradient is rescaled relative to $g$. This parameter is set to $10$.
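Put together, the optimiser settings quoted above could be expressed like this in Keras (a sketch of the configuration, not the original training script; `clipnorm` rescales any gradient whose norm exceeds $g$):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=10.0)
```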

Batching Method: In the field of machine translation, Dong et al. used alternating mini-batches of different language pairs when translating English into other languages [20]. In StyleNet’s case, this means alternating between mini-batches of Jazz and Classical, as described above.

Fig 16. Error for training and validation set for final run.

Summary of Final Setup: A dropout keep probability of $p = 0.8$ was applied, gradients were clipped by norm with $g = 10$, and the learning rate was $10^{-3}$. The model was trained for a total of $160$ epochs. The final training and validation losses were $7.0 \times 10^{-4}$ and $1.1 \times 10^{-3}$ respectively.

Fig 17. An epoch snapshot from a training session.

Click here if you wish to see more snapshots 📸.


The training snapshots show the outputs of the network at each epoch during training. I’ve plotted the difference between the Classical and Jazz outputs, and there is definitely a visible difference between them. So the network is learning the style associated with each genre!


Results

Alright, so a really low loss has been achieved on the training and validation sets. But this doesn’t mean much for what we’re trying to accomplish: the loss metrics do not reflect the model’s ability to produce convincing music. The decreasing loss shows us that the model is solving the problem numerically, but what we really want to minimise is the “perceptual” loss. The question at hand is whether StyleNet can generate human-like performances.

Music only holds meaning through the confirmation of a human listener. To evaluate StyleNet, I took inspiration from Alan Turing’s Turing test [21], which tests whether a machine can exhibit intelligent behaviour indistinguishable from that of a human. If the model passes a musical Turing test, we can conclude that it is possible for a machine to play sheet music like a human. If it is able to fool a human, then job done!

“Identify The Human” Test

The “Identify the Human” survey was set up in two parts with $9$ questions each. For each question, participants are shown two $10$-second clips of the same performance, as seen in Fig 18.

Fig 18. Identify the Human Test.

One performance is generated and the other is an actual human performance. Participants need to identify the human performance. The ordering of the generated and human tracks was randomised to reduce bias towards a particular answer. A total of $30$ respondents answered the first part, and $20$ answered the second.

Fig 19. Identify the Human Test Results.


Click here to view the survey! Select the human performance 🎶.
(Each question presents two clips, A and B, of the same performance. Answer key, i.e. the human performance in each question: Q1 A, Q2 A, Q3 B, Q4 B, Q5 B, Q6 A, Q7 B, Q8 A, Q9 A, Q10 A, Q11 B, Q12 B, Q13 B, Q14 A, Q15 A, Q16 B, Q17 A, Q18 B.)


Fig 19 shows the average number of correct answers per question. Combining both surveys, an average of $53\%$ of the participant pool could identify the human performance. We can’t really compare our results with an existing benchmark as this model is the first of its kind, so the baseline for this experiment is a random guess between the two available options, as seen in Fig 18. This means that, on average, participants performed only $3$ percentage points better than random guessing. This is a surprisingly low number! We can conclude that the model passed this first Turing test.

We could further analyse the model’s results question by question. However, this would make the evaluation subjective to the question in focus, and it is difficult to make generic conclusions about the model when each song is a unique case. Participants mentioned two interesting points:

“10 seconds is too short to determine which track is human/machine-generated.”
“I was forced to choose an answer when I could not identify the human.”

It’s hard to assess a short clip without its surrounding musical context; this scenario is analogous to taking a sentence out of its contextual paragraph. A more valid Turing test would assess the model on a complete performance. This time, we will generate a long performance and also give participants the option of answering “Cannot Determine”. This option should reduce the noise that forcing a choice produced in the previous survey, and a larger participant pool will address the small sample size.

Final “Identify The Human” Test

The set-up was identical to the first “Identify the Human” test with short audio clips, except that participants answered a single question featuring an extended performance, and, as mentioned earlier, a new “Cannot Determine” option was added to the survey. The song used for this experiment was “chpn-p25.mid”, a $2$:$30$ Classical piece, “Étude Op. 25 No. 1” by Frédéric Chopin [22]. This is the “Human or Bot” test at the top of this post. A total of $99$ people answered the survey.

Fig 20. Final Identify the Human Test Results.

Fig 20 shows that only $46\%$ of participants could identify the human performance. The “Cannot Determine” answers are grouped together with participants who incorrectly chose the generated track as the human, which means $54\%$ of the participant pool could not distinguish the human performance. That is more than half of a much larger participant pool! So the StyleNet model has passed this musical Turing test and can generate performances that many listeners cannot distinguish from a human’s.


Closing Thoughts…

These results are quite exciting! They open the door to using neural networks to assist the musical creative process. When musicians create MIDI tracks in their Digital Audio Workstation (DAW), they have to encode velocities manually. This monotonous task is an opportunity for StyleNet: it should be able to synthesise and inject dynamics into MIDI files, including existing, flat-sounding ones. Something like this could be packaged as a plugin for DAWs such as Ableton Live.

So I was curious and needed to try it out for myself. I thought I’d find a track I liked and pass it through StyleNet. “River Flows in You” by Yiruma is a popular study track amongst university students. The only MIDI file I could find sounded really robotic and nothing like a human recording [23].

This is what it sounded like originally:


Then I passed the MIDI file through StyleNet:

Baseline with randomised velocities:


I’m quite impressed! StyleNet’s performance actually sounds much better. I’ll be playing around with more samples… Stay tuned!


Links to Useful Materials

I would like to put out the dataset for public use. If you wish to use the Piano Dataset 🎹 for academic purposes, you can download it from here.

The Piano Dataset is distributed with a CC-BY 4.0 license. If you use this dataset, please reference this paper:

Iman Malik, Carl Henrik Ek, “Neural Translation of Musical Style”, 2017.

You can read my thesis here.

You can find the GitHub repo for this project here. It needs some polishing but I think you should be able to figure out what’s going on.


Discussion

Hacker News
Reddit


References

[1] W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. Achieving Human Parity in Conversational Speech Recognition.

[2] Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.

[3] D. Eck and J. Schmidhuber. A First Look at Music Composition using LSTM Recurrent Neural Networks

[4] Bob L. Sturm, Joao Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning.

[5] Composing Music With Recurrent Neural Networks.

[6] Alex Graves. Generating Sequences With Recurrent Neural Networks.

[7] Feynman Liang. BachBot: Automatic composition in the style of Bach chorales. Developing, analyzing, and evaluating a deep LSTM model for musical style.

[8] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks.

[9] Ilya Sutskever. Training Recurrent Neural Networks.

[10] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. Understanding the exploding gradient problem.

[11] Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory.

[12] Jane Bromley, James W. Bentz, Léon Bottou, Isabelle Guyon, Yann LeCun, Cliff Moore, Eduard Säckinger, and Roopak Shah. Signature Verification Using a Siamese Time Delay Neural Network.

[13] Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. Multi-task Sequence to Sequence Learning.

[14] Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp.

[15] Diederik P Kingma and Jimmy Lei Ba. ADAM: A Method For Stochastic Optimization.

[16] Alex Graves. Generating Sequences With Recurrent Neural Networks.

[17] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent Neural Network Regularization.

[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

[19] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks.

[20] Daxiang Dong, Hua Wu, Wei He, Dianhai Yu, and Haifeng Wang. Multi-Task Learning for Multiple Language Translation.

[21] Alan M. Turing. Computing Machinery and Intelligence.

[22] Frédéric Chopin, Étude Op. 25 No. 1, performed by Valentina Lisitsa; “chpn_op25_e1_format0.mid” from the Piano Dataset.

[23] “River flows in You” by Yiruma from here.