This reminds me of the Octuple MIDI tokenization scheme introduced in the MusicBERT paper (https://arxiv.org/pdf/2106.05630.pdf), which packs the attributes of each note into a single token and predicts each attribute with its own softmax. It would be interesting to see how much of a performance difference comes from using a smaller decoder model within each patch (as in Megabyte) instead of just doing a softmax per element (à la Octuple).
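To make the comparison concrete, here is a rough toy sketch of the two output strategies in NumPy. Everything in it is an assumption for illustration: the sizes, the random weights, and the names `parallel_heads` and `local_decoder` are hypothetical, and the "local decoder" is a tiny recurrent update standing in for Megabyte's local transformer, not the actual architecture from either paper.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, GROUP = 16, 8, 4  # toy sizes, chosen for illustration only


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical random weights standing in for trained parameters.
W_heads = rng.normal(size=(GROUP, HIDDEN, VOCAB)) * 0.1
W_embed = rng.normal(size=(VOCAB, HIDDEN)) * 0.1
W_local = rng.normal(size=(HIDDEN, VOCAB)) * 0.1


def parallel_heads(h):
    """Octuple-style output: one softmax per slot, each conditioned only on h."""
    return softmax(np.einsum("h,ghv->gv", h, W_heads))


def local_decoder(h, tokens):
    """Megabyte-style output (simplified): slot i sees h plus tokens[:i]."""
    state = h.copy()
    probs = []
    for tok in tokens:
        probs.append(softmax(state @ W_local))
        # Fold the token just emitted into the state before the next slot.
        state = np.tanh(state + W_embed[tok])
    return np.stack(probs)


h = rng.normal(size=HIDDEN)  # stand-in for the global model's output
heads = parallel_heads(h)
a = local_decoder(h, [0, 1, 2, 3])
b = local_decoder(h, [5, 1, 2, 3])
```

With the parallel heads, the distribution for each slot is fixed once `h` is known, so the slots are conditionally independent given the global context. With the local decoder, changing the first token (`a` vs. `b`) shifts the distributions for the later slots, which is the extra modeling capacity the question is about.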