How to Generate Music Using Machine Learning

Dec 10 2022

Ever wanted to generate music using Python and machine learning? Let's take a look at how we can do it!

Photo by Namroud Gorguis on Unsplash

As a music enthusiast and data scientist, I have always wondered whether there was a way to combine music and machine learning to create AI-generated music. Well, there is! There are several ways to approach this. One is to use a sequence model (such as a GRU or an LSTM) and generate a sequence of notes and/or chords based on the n previous ones. Another is to feed raw audio into a trainable VAE (Variational Autoencoder) and have it output different sounds.

We will use the first approach in this article and try the second another time.

For this we need a large dataset of music (preferably all from one specific genre, or from similar ones) to feed into our sequence model, so that we can hopefully recreate some of the songs or create our own.

For this project, we will work with the LoFi genre. I found a great dataset full of LoFi segments that will help us get the LoFi sound we are aiming for. This dataset comes from Kaggle and many other sources.

Now that we have our dataset of MIDI files, how do we turn it into data our machine can read? We will use music21 to convert the MIDI files into a list of note and chord sequences.

First we need to get the directory where we stored our MIDI files:

import random
from pathlib import Path

# Collect every MIDI file under the dataset folder (including subfolders)
songs = []
folder = Path('insert directory here')
for file in folder.rglob('*.mid'):
    songs.append(file)

# Get a random subset of up to 1000 songs
result = random.sample(songs, min(1000, len(songs)))

from music21 import converter, instrument, note, chord

notes = []
for i, file in enumerate(result):
    print(f'{i+1}: {file}')
    try:
        midi = converter.parse(file)
        notes_to_parse = None
        parts = instrument.partitionByInstrument(midi)
        if parts:  # file has instrument parts
            notes_to_parse = parts.parts[0].recurse()
        else:  # file has notes in a flat structure
            # Newer music21 releases prefer midi.flatten().notes
            notes_to_parse = midi.flat.notes
        for element in notes_to_parse:
            if isinstance(element, note.Note):
                # A single note is stored as its pitch name, e.g. 'C2'
                notes.append(str(element.pitch))
            elif isinstance(element, chord.Chord):
                # A chord is stored as its normal-order pitch classes, e.g. '0.4.7'
                notes.append('.'.join(str(n) for n in element.normalOrder))
    except Exception as err:
        # Skip files that cannot be parsed, but report them
        print(f'FAILED: {i+1}: {file} ({err})')

import pickle

# Save the extracted notes so the MIDI files do not have to be parsed again
with open('notes', 'wb') as filepath:
    pickle.dump(notes, filepath)

>>> ['C2', 'A4', 'F1', 'F1', ..., '0.6', '0.4.7']
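
The list mixes two kinds of tokens: a single note is stored as its pitch name (for example C2), while a chord is stored as its normal-order pitch classes joined by dots (for example 0.4.7). As a rough, music21-free sketch of how to read them, the hypothetical helper describe_token below maps pitch-class numbers to note names. Both the helper and the name table are illustrative assumptions, not part of the original code.

```python
# Pitch classes 0-11 mapped to note names (sharps only, for simplicity).
PITCH_CLASS_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def describe_token(token):
    """Return ('chord', [names]) or ('note', name) for one token (hypothetical helper)."""
    # Dots, or a bare digit string, mean the token is a chord of pitch classes.
    if '.' in token or token.isdigit():
        names = [PITCH_CLASS_NAMES[int(pc) % 12] for pc in token.split('.')]
        return ('chord', names)
    # Anything else is a pitch name such as 'C2'.
    return ('note', token)

print(describe_token('C2'))     # ('note', 'C2')
print(describe_token('0.4.7'))  # ('chord', ['C', 'E', 'G'])
```

Read this way, 0.4.7 is a C major triad and 0.6 is the pair C and F#.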

import numpy
# Assumes the Keras API bundled with TensorFlow 2.x
from tensorflow.keras.utils import to_categorical

def prepare_sequences(notes, n_vocab):
    """ Prepare the sequences used by the Neural Network """
    sequence_length = 32

    # Get all unique pitch names
    pitchnames = sorted(set(notes))

    # Create a dictionary to map pitches to integers
    note_to_int = dict((note, number) for number, note in enumerate(pitchnames))

    network_input = []
    network_output = []

    # Create input sequences and the corresponding outputs
    for i in range(0, len(notes) - sequence_length, 1):
        # sequence_in is a window of sequence_length consecutive notes
        sequence_in = notes[i:i + sequence_length]
        # sequence_out is the note that comes right after that window, so the
        # model reads sequence_length notes before predicting the next one
        sequence_out = notes[i + sequence_length]
        # The model is only fed integer indexes, so map each note to its index
        network_input.append([note_to_int[char] for char in sequence_in])
        # network_output contains the index of sequence_out
        network_output.append(note_to_int[sequence_out])

    # n_patterns is the number of windows created by the loop above
    n_patterns = len(network_input)
    print(n_patterns)

    # Reshape the input into the (n_patterns, sequence_length, 1) array
    # expected by LSTM layers
    network_input = numpy.reshape(network_input, (n_patterns, sequence_length, 1))
    # Normalize the input to the range [0, 1)
    network_input = network_input / float(n_vocab)

    # One-hot encode the output, with one column per item in the vocabulary
    network_output = to_categorical(network_output, num_classes=n_vocab)

    return (network_input, network_output)


n_vocab = len(set(notes))
network_input, network_output = prepare_sequences(notes,n_vocab)
n_patterns = len(network_input)
pitchnames = sorted(set(item for item in notes))
numPitches = len(pitchnames)

DataFrame for network_input
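
To make the windowing concrete, here is a small pure-Python sketch of the same idea on five tokens with a window of 3 instead of 32. The helper name make_windows and the toy list are assumptions made for illustration; the real function also reshapes, normalizes and one-hot encodes the result.

```python
def make_windows(tokens, sequence_length):
    """Return (inputs, targets) as integer-encoded sliding windows (toy version)."""
    vocab = sorted(set(tokens))
    token_to_int = {tok: i for i, tok in enumerate(vocab)}
    inputs, targets = [], []
    for i in range(len(tokens) - sequence_length):
        window = tokens[i:i + sequence_length]
        # Each input row holds the indexes of sequence_length consecutive tokens
        inputs.append([token_to_int[t] for t in window])
        # The target is the index of the token that follows the window
        targets.append(token_to_int[tokens[i + sequence_length]])
    return inputs, targets

toy_notes = ['C2', 'A4', 'F1', 'F1', '0.6']
inputs, targets = make_windows(toy_notes, sequence_length=3)
print(inputs)   # [[2, 1, 3], [1, 3, 3]]
print(targets)  # [3, 0]
```

Five tokens with a window of 3 give 5 - 3 = 2 training rows, which is why the real dataset yields len(notes) - 32 patterns.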

It is fine if the dataset looks somewhat imbalanced: not all notes/chords are equally frequent. However, we may run into cases where one note occurs more than 4000 times and another occurs only once. We can try oversampling the dataset; this does not always give the best results, but it may be worth a try. In our case, we will not oversample, since our dataset is balanced.
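
One quick way to judge the balance is to count how often each token occurs and compare the most and least frequent ones. The sketch below does this with the standard library only; the helper name imbalance_ratio and the toy data are assumptions, and in practice it would be called on the notes list built earlier.

```python
from collections import Counter

def imbalance_ratio(tokens):
    """Return (most_common_count / least_common_count, Counter) for a token list."""
    counts = Counter(tokens)
    ordered = counts.most_common()  # sorted from most to least frequent
    most, least = ordered[0][1], ordered[-1][1]
    return most / least, counts

toy_notes = ['C2'] * 8 + ['A4'] * 4 + ['0.4.7'] * 2
ratio, counts = imbalance_ratio(toy_notes)
print(counts.most_common(3))  # [('C2', 8), ('A4', 4), ('0.4.7', 2)]
print(ratio)                  # 4.0
```

A ratio in the thousands, like the 4000-to-1 case mentioned above, is the kind of situation where oversampling may be worth trying.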

import numpy as np
import pandas as pd

def oversample(network_input, network_output, sequence_length=32):
    """ Oversample so that every target note/chord is equally represented.

    Expects integer-encoded data: network_input of shape (n_patterns, sequence_length)
    and network_output of shape (n_patterns,), i.e. before reshaping and one-hot encoding.
    """
    n_patterns = len(network_input)
    # Create a DataFrame from the two arrays
    new_df = pd.concat([pd.DataFrame(network_input), pd.DataFrame(network_output)], axis=1)

    # Rename the columns to 0..sequence_length-1 plus 'Notes' for the target
    new_df.columns = [x for x in range(sequence_length + 1)]
    new_df = new_df.rename(columns={sequence_length: 'Notes'})

    print(new_df.tail(20))
    print('###################################################')
    print(f'Distribution of notes before oversampling: {new_df["Notes"].value_counts()}')

    # Oversampling: draw max_class_size rows (with replacement) from every class
    # max_class_size = np.max(new_df['Notes'].value_counts())
    max_class_size = 700
    print('Size of each class after oversampling: ', max_class_size)

    class_subsets = [new_df[new_df['Notes'] == label] for label in new_df['Notes'].unique()]
    class_subsets = [subset.sample(max_class_size, random_state=42, replace=True) for subset in class_subsets]

    # Concatenate the classes and shuffle the rows
    oversampled_df = pd.concat(class_subsets, axis=0).sample(frac=1.0, random_state=42).reset_index(drop=True)

    print('###################################################')
    print(f'Distribution of notes in the oversampled DataFrame: {oversampled_df["Notes"].value_counts()}')

    # Sample n_patterns rows so the result keeps the original size
    # (the oversampled DataFrame may be too big to train on)
    sampled_df = oversampled_df.sample(n_patterns, replace=True)

    print('###################################################')
    print(f'Distribution of notes after the final sampling: {sampled_df["Notes"].value_counts()}')

    # Convert the training columns back to a 3D array for the LSTM
    network_in = np.array(sampled_df[[x for x in range(sequence_length)]])
    network_in = np.reshape(network_in, (n_patterns, sequence_length, 1))
    # Normalize with the vocabulary size (numPitches is defined above)
    network_in = network_in / numPitches
    print(network_in.shape)
    print(sampled_df['Notes'].shape)
    # Convert the target column into a one-hot encoded matrix
    network_out = pd.get_dummies(sampled_df['Notes']).to_numpy(dtype='float32')
    print(network_out.shape)

    return network_in, network_out

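# Oversampling is optional and is not used here because the dataset is balanced.
# If it is needed, call oversample() on the integer-encoded sequences and targets
# (before the reshape, normalization and one-hot encoding done in prepare_sequences).
# Without oversampling, use the arrays returned by prepare_sequences directly:
networkInputShaped, networkOutputShaped = network_input, network_output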

So far, we have:
Collected our MIDI files
Loaded the MIDI files into memory
Turned the MIDI files into a list of notes/chords in sequence
Turned the list into an (n, m, 1) matrix and an (n, 1) vector (n = 99968, m = 32)

LSTMs are a type of recurrent neural network (RNN), but they differ from plain RNNs. A plain RNN repeats the same simple module at every time step and tends to lose information from earlier steps. An LSTM also has a chain-like structure of repeating modules, but each module is designed to retain information for much longer.

An LSTM is built from units like the one depicted here:

Image from https://en.wikipedia.org/wiki/Long_short-term_memory

An LSTM unit is made up of a cell, an input gate, an output gate and a forget gate. Let's look at what that means and why LSTMs are useful for sequential data.

The forget gate decides whether information is kept or discarded. The previous hidden state and the current input are passed through a sigmoid function. Values close to one are kept, and values close to zero are forgotten.

The input gate helps update the cell state. The current input and the previous hidden state are passed through a sigmoid function, which outputs values between 0 and 1 that indicate how much of each value should be updated. To regulate the network, the same data are also passed through a tanh function. The sigmoid output is then multiplied by the tanh output, so the sigmoid decides which parts of the tanh output are worth keeping.

The output gate determines the next hidden state. To obtain it, we multiply the sigmoid output of the gate by the tanh of the new cell state. The new hidden state and the new cell state are then passed on to the next time step.
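
The three gates can be sketched as one time step of a single-unit LSTM in plain Python. This follows the standard LSTM equations rather than Keras' internal implementation; the function name lstm_step, the weight layout and the weight values are all assumptions chosen for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One time step of a single-unit LSTM; w maps gate name -> (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre('forget'))       # forget gate: how much of the old cell state to keep
    i = sigmoid(pre('input'))        # input gate: how much of the candidate to write
    g = math.tanh(pre('candidate'))  # candidate values, squashed into [-1, 1]
    o = sigmoid(pre('output'))       # output gate: how much of the cell state to expose
    c = f * c_prev + i * g           # new cell state
    h = o * math.tanh(c)             # new hidden state
    return h, c

# Arbitrary illustrative weights, identical for every gate
weights = {gate: (0.5, 0.5, 0.0) for gate in ('forget', 'input', 'candidate', 'output')}
h, c = 0.0, 0.0
for x in [0.1, 0.4, 0.9]:  # a short input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

A real layer such as LSTM(512) does the same thing with weight matrices instead of scalars, one row per unit.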

Training an LSTM network in practice requires a GPU. In my case, I used Google Colab Pro to train the neural network. Google Colab limits the compute units we can use when training with GPUs. The free GPU is enough for a couple of dozen epochs.

# Assumes the Keras API bundled with TensorFlow 2.x
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout, Activation

model = Sequential()
model.add(Dropout(0.2))
model.add(LSTM(
    512,
    input_shape=(networkInputShaped.shape[1], networkInputShaped.shape[2]),
    return_sequences=True
))
model.add(Dense(256))
model.add(Dense(256))
model.add(LSTM(512, return_sequences=True))
model.add(Dense(256))
model.add(LSTM(512))
# Output layer: one unit per note/chord in the vocabulary
model.add(Dense(numPitches))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
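
To get a feel for the size of this network, the trainable parameters of each layer can be worked out by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are assumptions, and model.summary() should report the actual numbers.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias vector
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # One weight per input-unit pair plus one bias per unit
    return input_dim * units + units

print(lstm_params(1, 512))     # first LSTM, one feature per time step: 1052672
print(dense_params(512, 256))  # Dense(256) after an LSTM(512): 131328
print(lstm_params(256, 512))   # LSTM(512) after a Dense(256): 1574912
```

Most of the weights sit in the LSTM layers, which is a large part of why training without a GPU is slow.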

# Assumes the Keras API bundled with TensorFlow 2.x
from tensorflow.keras.callbacks import ModelCheckpoint

num_epochs = 100

# Save a checkpoint whenever the training loss improves
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}-bigger_1.hdf5"
checkpoint = ModelCheckpoint(
    filepath,
    monitor='loss',
    verbose=1,
    save_best_only=True,
    mode='min'
)
callbacks_list = [checkpoint]

history = model.fit(networkInputShaped, networkOutputShaped, epochs=num_epochs, batch_size=64, callbacks=callbacks_list)
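
The placeholders in filepath are filled in with the epoch number and the logged loss each time a checkpoint is written. Plain str.format reproduces the resulting file name; the epoch and loss values below are made up for illustration.

```python
filepath = "weights-improvement-{epoch:02d}-{loss:.4f}-bigger_1.hdf5"

# Hypothetical values: epoch 5 with a training loss of 1.23456
example_name = filepath.format(epoch=5, loss=1.23456)
print(example_name)  # weights-improvement-05-1.2346-bigger_1.hdf5
```

Because the loss is part of the name, the saved files can later be compared without loading them.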

The plots below show the results of training the neural network.

The plot on the left shows accuracy against epochs. The plot on the right shows loss against epochs.

What do we do once the network is trained? We pick a random number between 0 and the length of the network input; this is the index of the row in the training matrix that we use as the starting point for our predictions. We take that sequence of 32 notes/chords and use it to predict 1 note. We then repeat this (n - 1) more times (n is 500 in this case). For each prediction we slide the 32-item window one element to the right. In other words, for the second prediction we drop the first note, and our first prediction becomes the last note/chord in the sequence of length 32. The images below illustrate the process just described.

Illustration of the sequence for the first prediction

Illustration of the sequence for the second prediction

import numpy

def generate_notes(model, network_input, pitchnames, n_vocab):
    """ Generate notes from the neural network based on a sequence of notes """
    # Pick a random row of network_input as the starting point for the prediction
    start = numpy.random.randint(0, len(network_input))
    print(f'start: {start}')
    int_to_note = dict((number, note) for number, note in enumerate(pitchnames))

    # network_input was already normalized by n_vocab, so convert the chosen
    # row back to integer note indexes to avoid normalizing it twice
    pattern = numpy.rint(network_input[start].flatten() * n_vocab).astype(int)
    prediction_output = []

    # Generate 500 notes
    for note_index in range(500):
        # Reshape the pattern into the (1, sequence_length, 1) shape the model expects
        prediction_input = numpy.reshape(pattern, (1, len(pattern), 1))
        # Normalize the pattern the same way as the training data
        prediction_input = prediction_input / float(n_vocab)

        # Predict the probability of each possible next note
        prediction = model.predict(prediction_input, verbose=0)

        # Pick the index with the highest probability
        index = numpy.argmax(prediction)
        # Map the index back to its note/chord
        result = int_to_note[index]
        # Append the note to prediction_output
        prediction_output.append(result)

        # Slide the window: append the predicted index and drop the first
        # element, so the model always sees a sequence of the same length
        pattern = numpy.append(pattern, index)
        pattern = pattern[1:len(pattern)]

    return prediction_output

n_vocab = len(set(notes))
pitchnames = sorted(set(notes))
prediction_output = generate_notes(model, networkInputShaped, pitchnames, n_vocab)

>>> ['B2', 'B2', '2.7', ..., '5.10']
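
The sliding window is easier to follow with a stand-in for the trained model. In the sketch below, toy_predictor is a made-up function that plays the role of model.predict, and generate_indices is a hypothetical pure-Python version of the generation loop; neither is part of the original code.

```python
def generate_indices(predict_next, seed, steps):
    """Greedy generation: predict, record, then slide the window one step right."""
    window = list(seed)
    generated = []
    for _ in range(steps):
        probabilities = predict_next(window)
        # argmax: index of the most probable next token
        index = max(range(len(probabilities)), key=probabilities.__getitem__)
        generated.append(index)
        # Drop the oldest item and append the prediction
        window = window[1:] + [index]
    return generated

def toy_predictor(window, n_vocab=4):
    """Stand-in for model.predict: favours the token after the last one, cyclically."""
    probabilities = [0.1] * n_vocab
    probabilities[(window[-1] + 1) % n_vocab] = 0.7
    return probabilities

print(generate_indices(toy_predictor, seed=[0, 1, 2], steps=5))  # [3, 0, 1, 2, 3]
```

After the first step the window is [1, 2, 3]: the seed's first item is gone and the first prediction sits at the end, exactly as described for the 32-item window.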

from music21 import instrument, note, chord, stream

def create_midi(prediction_output):
    """ Convert the predicted notes/chords into a MIDI file """
    offset = 0
    output_notes = []

    # Create note and chord objects based on the values generated by the model
    for pattern in prediction_output:
        # pattern is a chord: dot-separated pitch classes, or a single bare number
        if ('.' in pattern) or pattern.isdigit():
            notes_in_chord = pattern.split('.')
            chord_notes = []
            for current_note in notes_in_chord:
                new_note = note.Note(int(current_note))
                new_note.storedInstrument = instrument.Piano()
                chord_notes.append(new_note)
            new_chord = chord.Chord(chord_notes)
            new_chord.offset = offset
            output_notes.append(new_chord)
        # pattern is a note
        else:
            new_note = note.Note(pattern)
            new_note.offset = offset
            new_note.storedInstrument = instrument.Piano()
            output_notes.append(new_note)

        # Increase the offset each iteration so that notes do not stack
        offset += 0.5

    midi_stream = stream.Stream(output_notes)
    midi_stream.write('midi', fp='output.mid')

OPTIONAL: when training a neural network, each epoch produces a different set of weights. When generating music, each set of weights produces a different result (a different sequence of notes/chords), so it is best to keep track of every set of weights.

We can now go one step further and check the predictions of all the weights saved earlier. First, we take the location where we stored the weights and iterate through that folder:

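from pathlib import Path
import pandas as pd

# Collect every saved weights file
weight_files = []
folder = Path('Training Weights LoFi')
for file in folder.rglob('*.hdf5'):
    weight_files.append(file)

# Generate one sequence of notes/chords per set of weights
songsList = []
weightsList = []
for weight_file in weight_files:
    try:
        model.load_weights(str(weight_file))
        prediction_output = generate_notes(model, networkInputShaped, pitchnames, n_vocab)
        songsList.append(prediction_output)
        weightsList.append(str(weight_file))
    except Exception as err:
        # Skip weights that cannot be loaded, but report them
        print(f'FAILED: {weight_file} ({err})')

songs_df = pd.DataFrame({'Weights': weightsList,
                         'Notes': songsList})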
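
Since the checkpoint names encode the epoch and the loss, they can be ranked before any weights are loaded. The sketch below parses names that follow the checkpoint pattern used during training; the helper parse_checkpoint and the three file names are made up for illustration.

```python
import re

# Matches names such as weights-improvement-05-1.2346-bigger_1.hdf5
CHECKPOINT_PATTERN = re.compile(r'weights-improvement-(\d+)-(\d+\.\d+)-bigger_1\.hdf5$')

def parse_checkpoint(filename):
    """Return (epoch, loss) parsed from a checkpoint name, or None if it does not match."""
    match = CHECKPOINT_PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

# Hypothetical file names
names = [
    'weights-improvement-01-4.5120-bigger_1.hdf5',
    'weights-improvement-12-2.0431-bigger_1.hdf5',
    'weights-improvement-37-0.9876-bigger_1.hdf5',
]
best = min(names, key=lambda name: parse_checkpoint(name)[1])
print(best)  # weights-improvement-37-0.9876-bigger_1.hdf5
```

The lowest training loss does not guarantee the most pleasant music, so it is still worth listening to the output of several checkpoints.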

All the code shown here is on my GitHub if you want to replicate what you have seen. With all this information, you should be able to create your own songs. If you do, please upload them somewhere and send me the link so I can check them out!