Realtime audio decompression?

Hey guys,

For a prop I'm building, I'd like to include a little easter egg which will play back one or two short audio samples via the included DAC.

Now, the circuit doesn't contain any additional flash RAM for storing the samples, so I need to fit them into the built-in flash or PROGMEM. I don't know exactly how much PROGMEM I have left, but let's just assume it's less than 4K and I've got 1K of flash. That isn't much for one audio sample, let alone two, so I've been looking for a simple way to compress and decompress the audio. If I could get even a 2:1 compression ratio, that would be fantastic.

I know about delta coding, but delta coding isn't a compression algorithm by itself; it just converts the audio into a form which is easier to compress. So I've been trying to find the best method to compress the delta-coded audio sample, which I can then decompress in realtime on playback, but so far I haven't had much luck.
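For reference, the delta-coding step itself is only a few lines. Here's a host-side sketch (my own illustration, not code from this thread) that encodes 8-bit unsigned samples as signed differences and decodes them back; because the differences wrap modulo 256, the round trip is lossless:

```c
#include <stdint.h>
#include <stddef.h>

/* Delta-encode: store each sample as the difference from the previous
   one. With 8-bit samples the differences wrap modulo 256, so decoding
   with the same wrap-around recovers the input exactly. */
static void delta_encode(const uint8_t *in, int8_t *out, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = (int8_t)(in[i] - prev);  /* wraps mod 256 */
        prev = in[i];
    }
}

static void delta_decode(const int8_t *in, uint8_t *out, size_t n)
{
    uint8_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (uint8_t)(prev + (uint8_t)in[i]);
        out[i] = prev;
    }
}
```

After this pass most values cluster near zero, which is exactly what makes a follow-up entropy-coding step worthwhile.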

At this point I'm looking at the Wikipedia article on entropy encoding... I think I want to use one of these schemes, but I'm not sure which one is best suited for audio. One which does not require a stored table of symbols would probably be best.

Linear predictive coding (LPC) is a good way to get started on compression for speech.

I think the simplest LPC decoders could fit in microcontroller code, but unfortunately I don't have any specific code samples to refer you to.

--
The Gadget Shield: accelerometer, RGB LED, IR transmit/receive, speaker, microphone, light sensor, potentiometer, pushbuttons

I just want something simple. It looks like Huffman is the most efficient. I'm trying to understand how it works right now. I would think there would be a good way to do it without storing the symbol table, because with delta-coded audio you know that the lower values are more likely to occur than the higher values. So I guess I'm looking to construct a Huffman table where 0 is the most likely to occur, and -127 and +127 are the least likely.
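One entropy coder that has exactly that shape and needs no stored table at all is Rice (Golomb) coding. It isn't Huffman, but it automatically gives short codes to small magnitudes, which is the distribution delta coding produces. A host-side sketch of the idea (my own code, and the parameter k is a tuning knob, not anything from this thread):

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal bit stream over a caller-provided, zero-initialized buffer. */
typedef struct { uint8_t *buf; size_t bitpos; } BitStream;

static void put_bit(BitStream *s, int b)
{
    if (b) s->buf[s->bitpos >> 3] |= (uint8_t)(0x80u >> (s->bitpos & 7));
    s->bitpos++;
}

static int get_bit(BitStream *s)
{
    int b = (s->buf[s->bitpos >> 3] >> (7 - (s->bitpos & 7))) & 1;
    s->bitpos++;
    return b;
}

/* Zigzag map: 0,-1,+1,-2,+2,... -> 0,1,2,3,4,... so small deltas of
   either sign become small unsigned codes. */
static uint8_t zigzag(int8_t v)
{
    return (v >= 0) ? (uint8_t)(2 * v) : (uint8_t)(-2 * v - 1);
}

static int8_t unzigzag(uint8_t u)
{
    return (u & 1) ? (int8_t)(-((int)u + 1) / 2) : (int8_t)(u / 2);
}

/* Rice code: quotient u >> k in unary (1-bits then a 0), remainder in
   k plain bits. No symbol table; small values get short codes. */
static void rice_put(BitStream *s, uint8_t u, int k)
{
    unsigned q = (unsigned)u >> k;
    while (q--) put_bit(s, 1);
    put_bit(s, 0);
    for (int i = k - 1; i >= 0; i--) put_bit(s, (u >> i) & 1);
}

static uint8_t rice_get(BitStream *s, int k)
{
    unsigned q = 0;
    while (get_bit(s)) q++;
    unsigned r = 0;
    for (int i = 0; i < k; i++) r = (r << 1) | (unsigned)get_bit(s);
    return (uint8_t)((q << k) | r);
}
```

Decoding is just counting 1-bits and shifting, so it should fit comfortably in an interrupt-driven playback loop.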

I don't think Huffman is going to get me where I need to be, because the two samples I want to include are around 6K and 15K in size after being converted to 8 bits at 8000 Hz.

I was looking at the sound data though, and it sure looks like one should be able to get better than 4:1 compression without much difficulty. In general, it looks like one could linearly interpolate over four or more samples on average and still have a pretty accurate reproduction of the waveform. Also, it occurred to me that instead of interpolating linearly, one could interpolate using a cosine and get a more accurate reproduction, and any artifacts would probably be easier on the ears. And I've already got a sine table in my code.

I'm counting the samples right now in areas of quiet and speech, and it seems like there are between 4 and 8 samples over which a half cosine, scaled properly, would be a good match. You'd need one byte to define the start point and another to define the end point, but the end point of one would be the start point of the next, so those count as just one byte per cosine. Then you'd need the number of samples over which to interpolate. You could use a whole byte for that, but a nibble should be sufficient. Also, we need a way to choose which half of the cosine to use. You could use a bit for that, but I think just using the sign of the difference between the start point of this half cosine and the next will give the result we're looking for.

The only thing remaining is deciding how many samples we can interpolate for each half cosine before switching to the next. That, I'm not too sure about. I guess you could try both upslope and downslope half-cosines at different sample lengths, keep some kind of error metric, and switch to the next one wherever the error is exceeded. Though something tells me there's gotta be a better way.

Anyway, let's say it's two bytes per half-cosine, and each covers 4-8 samples. That's an average of 6 bytes covered for every 2 bytes stored, so that would be a 3:1 compression ratio. Not bad. And you could increase it if you only stored a nibble for the number of samples in each half-cosine.

Hm, it seems that doing that may be totally unnecessary.

I've been experimenting in Sound Forge, and I decided to see what the samples would sound like if even fewer bits were used to encode them. So, I took the 8-bit 8000 Hz samples and reduced the volume to 10%. Then I restored the original volume by increasing it by 1000%. The resulting waveform was clearly quantized into around 10 discrete levels, with 95% of the audio being within the first 8. And when I played it back... it actually sounded really good. Way better than I would have expected it to. It would seem the sample rate is far more important than the number of bits used to encode the sample, because here I have cut the number of bits used to around 37% and the result sounds pretty good, but if I merely halve the sample rate to 4000 Hz, the sound comes out all muffled.
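The Sound Forge trick of attenuating and then amplifying back is just coarse requantization, and it can be done directly in code when preparing the samples. A sketch of the same idea (mine, not from the thread; the step size is a free parameter, and 16 gives roughly 4-bit resolution around the 128 midpoint):

```c
#include <stdint.h>

/* Requantize an 8-bit unsigned sample (silence at 128) to the nearest
   multiple of 'step' around the midpoint. step = 16 leaves about
   16 levels, i.e. roughly a 4-bit encode played back at 8 bits. */
static uint8_t requantize(uint8_t s, int step)
{
    int v = (int)s - 128;                  /* center at 0 */
    int q;
    if (v >= 0)
        q = (v + step / 2) / step;         /* round to nearest level */
    else
        q = -((-v + step / 2) / step);
    int out = q * step + 128;
    if (out < 0)   out = 0;                /* clamp to 8-bit range */
    if (out > 255) out = 255;
    return (uint8_t)out;
}
```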

3:1 compression isn't enough to include both samples I wanted to, but it's definitely enough to include the shorter one which was only 6K to begin with.

Now to see if I can get away with 4:1 or 6:1 compression.

[edit]

Well, it seems that if I drop the number of discrete levels down to 6, the speech is still intelligible, but only barely. So there's no point trying to get it down to 4 levels.

I just realised, though, that I only counted the number of levels on one side of 0. So those 10 discrete levels were more like 20, which makes it more like a 4-bit encode, not a 3-bit one like I thought. So, 2:1 compression. Still good enough perhaps for the one sample, but not nearly enough for the other.

I don't know if this will help you, but it may:

I found that earlier. It doesn't really seem to be a general compression algorithm; rather, it's a way to compress sound in a form that's suitable for playback through an RC filter. Without the RC filter there to fiddle with the output, you'd need to simulate it in software.

Anyway, I discovered just now that I don't have nearly as big a problem as I thought I did. I forgot how much flash the ATmega328 has, and I assumed it was more like 8K. It's actually 32K, and my program is only 10K, so I actually have around 20K to play with. That's almost enough to store both my samples without any compression at all, and if I store them as 4-bit instead of 8-bit, I should have room to spare.

Now I just need a way to convert the files to 4-bit and output the source code for PROGMEM storage. I think I can use Processing to do that, but I haven't looked into what it can do yet.
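If Processing doesn't pan out, the conversion is simple enough to do in a few lines of C on the PC: keep the high nibble of each 8-bit sample and pack two samples per byte, and unpack by shifting on playback. A sketch (function names are mine, not from any existing tool):

```c
#include <stdint.h>
#include <stddef.h>

/* Pack two 8-bit samples per output byte, keeping only the high
   nibble of each: a straight 2:1 size reduction. */
static void pack4bit(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2)
        out[i / 2] = (uint8_t)((in[i] & 0xF0) | (in[i + 1] >> 4));
}

/* Playback-side unpack: recover sample i, rescaled to the 8-bit
   range (coarsely -- the low 4 bits are gone for good). */
static uint8_t unpack4bit(const uint8_t *packed, size_t i)
{
    uint8_t b = packed[i / 2];
    uint8_t nib = (i & 1) ? (uint8_t)(b & 0x0F) : (uint8_t)(b >> 4);
    return (uint8_t)(nib << 4);
}
```

On the AVR side the packed array would live in PROGMEM and be read with pgm_read_byte(); the unpack is two shifts and a mask, cheap enough for an 8 kHz playback interrupt.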