Fixed point signed multiplication - shifts and unions

ShermanP · October 24, 2025, 4:54pm

I'm working on converting the Goertzel algorithm for detecting DTMF frequencies from floating to fixed point, in the hope that it will save some code space and maybe speed things up. It appears I will need to multiply two fixed point signed integers, one Q1.14 and the other Q13.2, with the result scaled to Q13.2.

If I understand correctly, the product long will be Q.16. So to bring it back to Q13.2, I would need to right-shift 14 places. But I'm wondering if instead I could left-shift the long two places, and then use a union to extract the high word.

I'm pretty sure the range of numbers I'm dealing with leave room on the left for the two shifts. But I read that a left shift in C++ does not preserve the sign bit. So if it's a negative product, I thought I would need to copy the sign bit, then restore it after the shifts. But then I thought, well, if the size of positive numbers would permit shifting left two places, then if it's a negative number, those bits are going to be 1's to begin with. This is based on my understanding that negative numbers are stored in 2's complement form, which means any unfilled bits on the left of a positive number would be zeros, but in 2's complement they would be ones. So the sign bit would be correct automatically.

This is my first attempt to deal with fixed point math(s), so I would appreciate any comments on my logic from those with more experience.

Paul_KD7HB · October 24, 2025, 6:59pm

You are showing human readable numbers. How are you using those numbers in your Arduino? Don’t tell me they are a string of characters. If you really are using fixed point integers, then those are multiplied by 100 and then do the calculations. Divide the results by 100 to print the actual numbers to see your results when you print them.

gcjr · October 24, 2025, 7:29pm

what processor are you doing this on? wouldn't that processor support the fixed-pt math operations

Coding_Badly · October 24, 2025, 8:38pm

That would be undefined behavior.

gcjr · October 24, 2025, 8:50pm

wouldn't a /2^n preserve the sign bits?

ShermanP · October 24, 2025, 9:15pm

I want to convert these calculations to fixed point arithmetic:

float Goertzel::detect()
{
  Q2 = 0;                  // defined elsewhere as floats
  Q1 = 0;
  for ( uint16_t index = 0; index < N; index++)
  {
    // byte sample is ( *( testData + index ) );
    float Q0;
    Q0 = coeff * Q1 - Q2 + (float) ( *( testData + index ) - 128 ) ; 
    Q2 = Q1;
    Q1 = Q0;
  }

  /* standard Goertzel processing. */
  float magnitude = sqrt(Q1 * Q1 + Q2 * Q2 - coeff * Q1 * Q2);
  return magnitude  ;
}

Edit: I can't do sqrt() in fixed point, so the result will just be the magnitude squared, which works as well.

This is a simple version of an FFT just looking for one frequency. It would be done on a classic Nano. It would use fixed point variables to represent fractional values, as described here:

https://en.wikipedia.org/wiki/Fixed-point_arithmetic

Not sure what you mean. I've used unions before to input individual bytes and read them out as a long, and vice versa, and that worked fine.

Yes it would, but that would take even more code than a simple >> n.

Paul_KD7HB · October 24, 2025, 9:18pm

Then use scaling to make them all identical scales.

kenb4 · October 24, 2025, 10:54pm

Let's see; for example

1.5 in Q1.14 is 01.1 and thirteen more zeroes
255 in Q13.2 is 0 for the sign bit, five more zeroes, then the eight 11111111 and .00
We'll cheat and pre-compute the product: 382.5

            01.10000000000000
00000011111111.00
-----------------------------
     101111110.1

The basic idea is to ignore/drop the decimal point, do the multiplication like normal, and then put the decimal back in the right spot, Q.16 in this case. For those following along in their browsers, you can do the math with JavaScript

> (b => b.slice(0,-16) +'.'+ b.slice(-16))
    ((parseInt('01.10000000000000'.replace('.',''),2) *
      parseInt('00000011111111.00'.replace('.',''),2)).toString(2))

"101111110.1000000000000000"

Since you're probably not manually shifting with a for loop, it would depend on the CPU. Maybe for some CPUs, the compiler has to unroll the product >> 14 into fourteen single-bit shifts; or it's faster to divide by 16384. More likely it's a single instruction to shift N places, which may be the same effective speed regardless of N.

(And since we're in the weeds here, right-shift of a signed integral type is actually implementation dependent before C++20, but usually does sign extension.)

With the multiplication of two 16-bit ints, since you start with two separate sign bits, you "get a bit back" and cannot overflow a long.

> (-32768 * -32768).toString(16)
"40000000"  // 8000'0000 would be negative, and therefore wrong

The issue is when you scale the result to fit, whether that will overflow. For example, multiplying max-Q13.2 by 1.5 again will overflow Q13.2

> (b => b.slice(0,-16) +'.'+ b.slice(-16))
    ((parseInt('01.10000000000000'.replace('.',''),2) *
      parseInt('01111111111111.11'.replace('.',''),2)).toString(2))

"10111111111111.1010000000000000"

That's 14 bits to the left of the decimal point. So you'll want to check the resulting long before the shift to see if it would overflow, positive or negative, and return the max values instead.

As you figured, left shift will preserve sign as long as it doesn't overflow. (If it does overflow, it may or may not preserve the sign.)

Finally, type punning through a union is (as previously mentioned) Undefined Behavior in C++, but can "work". If it's "working" with a particular CPU and compiler, is it likely stop working? No. If it does stop, is anyone obligated to fix it? No.

You can instead do memcpy and let the compiler figure out the best way to handle it

    int16_t a = 0b0110000000000000;
    int16_t b = 0b0000001111111100;
    int32_t product = a;
    product *= b;
    //TODO check for overflow, positive and negative
    product <<= 2;
    int16_t q13_2;
    memcpy(&q13_2, reinterpret_cast<int16_t*>(&product) + 1, sizeof(q13_2));
    Serial.println(q13_2, BIN);

gcjr · October 24, 2025, 10:59pm

the compiler is smart enoiugh to translate a divison by a power of N to a shift, but because it is division the the sign must be preserved

you're doing this

since the coefficients are values of cos, < 1, the results of the multiplications need to be normalized, hence `1/2^N'

ShermanP · October 25, 2025, 12:56am

Well, it's whatever the Arduino IDE and GCC do to execute >>14. Of course I would be shifting all four bytes of the long 14 times, so I suspect the compiler just sets up a loop. I'm pretty sure 328P assembly only does one shift at a time. But maybe >>14 gives the minimum code size to do that even if it takes a while.

I tried an earlier version that set the coeff value to eight binary places, and used that to calculate coeff*Q1 product to the nearest integer, but that wasn't accurate enough. It got the column frequency correct, but not the row. So if I pressed 2, it would read as either 2 or 5. So I'm hoping the additional significance doing Q1.14 and Q13.2 will make it work correctly while still using integers for everything but the multiplication. If not, I may have to put everything in longs. The original floats effectively have three bytes of significance. I'm just hoping two bytes will be enough. I don't think the IDE supports three byte integer types.

I modified the original float version to print out all 64 values of Q1 during the algorithm, and the largest value I saw was 3864. The coeff values range from about 0.9 to about 1.8. 3864 uses 12 bits, so 13.2 would allow up to 8191, and the saved extra sign bit would double that. The Q1 values get big only when the incoming frequency contains the one you're calculating for, but I don't know how big they actually get, which may depend upon the level of the signal. So I'll just have to see how it all works - if it works at all.

Apparently it works in Arduino, or at least it does for me. And it's quite handy.

Coding_Badly · October 25, 2025, 1:51am

Specious reasoning. Homer Simpson would be proud. But, it’s your project. I cannot force you to act in your own best interest.

For everyone else, a trip through memcpy is the correct choice. Don’t be like ShermanP.

You left the optimizer out of that list. Which means it may stop working given a particular code structure or register spill or if the optimizer unrolls a loop. Undefined behavior means all bets are off including correct code generation.

westfw · October 25, 2025, 6:44am

avr only shifts one bit at a time, so shifting by 14 does indeed result in a loop.

  volatile unsigned long y = x >>14;
  26:	2e e0       	ldi	r18, 0x0E	; 14
  28:	b6 95       	lsr	r27
  2a:	a7 95       	ror	r26
  2c:	97 95       	ror	r25
  2e:	87 95       	ror	r24
  30:	2a 95       	dec	r18
  32:	d1 f7       	brne	.-12

All the 32bit chips I know of have "barrel shifters" that can shift any number of bits in a single instruction/cycle.

Are you using the fixed point support built into avr-gcc, or trying to do things manually? If the former, it may do everything for you...

gcjr · October 25, 2025, 10:06am

  20000  0x4e20 a
   1250  0x4e20 a >>4
   1250  0x4e20 a /16

 -20000  0xffffb1e0 a
  -1250  0xffffb1e0 a >>4
  -1250  0xffffb1e0 a /16

#include <stdio.h>

void
shift (int a)
{
    printf (" %6d  0x%04x a\n", a, a);
    printf (" %6d  0x%04x a >>4\n", a >> 4, a);
    printf (" %6d  0x%04x a /16\n", a / 16, a);
    printf ("\n");
}

int
main ()
{
    int a = 1000;

    printf (" %d sizeof(int)\n", sizeof(int));

    shift (20000);
    shift (-20000);

    return 0;
}

J-M-L · October 25, 2025, 10:47am

may be this would work

run this first code

constexpr size_t numSamples = 205; // number of samples for Goertzel
constexpr float sampleRate = 8000.0f;
constexpr float dtmfFreqsInput[] = {697, 770, 852, 941, 1209, 1336, 1477, 1633};
constexpr size_t nbTones = sizeof dtmfFreqsInput / sizeof *dtmfFreqsInput;

void setup() {
  Serial.begin(115200);
  while (!Serial);Serial.println();

  // Print numSamples
  Serial.print(F("constexpr size_t numSamples = "));
  Serial.print(numSamples);
  Serial.println(F(";"));

  // Compute and print sample interval in microseconds
  constexpr unsigned long sampleIntervalUs = (unsigned long)(1e6 / sampleRate);
  Serial.print(F("constexpr unsigned long sampleIntervalUs = "));
  Serial.print(sampleIntervalUs);
  Serial.println(F(";"));

  Serial.print("constexpr int16_t dtmfFreqs[] = {");
  for (size_t i = 0; i < nbTones; i++) {
    Serial.print((int16_t)dtmfFreqsInput[i]);
    if (i < nbTones - 1) Serial.print(", ");
  }
  Serial.println("};");

  Serial.println("constexpr size_t nbTones = sizeof dtmfFreqsInput / sizeof *dtmfFreqsInput;");

  // Generate the Q15 coefficients
  Serial.println("constexpr int16_t coeffsQ15[] = {");
  for (size_t i = 0; i < nbTones; i++) {
    float k = 0.5f + numSamples * dtmfFreqsInput[i] / sampleRate;
    float w = 2.0f * 3.14159265f * k / numSamples;
    float coeff = 2.0f * cos(w);
    int16_t coeffQ15 = (int16_t)(coeff * 16384.0f);
    Serial.print(" ");
    Serial.print(coeffQ15);
    Serial.print(", // ");
    Serial.print((int)dtmfFreqsInput[i]);
    Serial.println(" Hz");
  }
  Serial.println("};");
}

void loop() {}

it calculates some data we can use in the algorithm . The output is

constexpr size_t numSamples = 205;
constexpr unsigned long sampleIntervalUs = 125;
constexpr int16_t dtmfFreqs[] = {697, 770, 852, 941, 1209, 1336, 1477, 1633};
constexpr size_t nbTones = sizeof dtmfFreqsInput / sizeof *dtmfFreqsInput;
constexpr int16_t coeffsQ15[] = {
 27714, // 697 Hz
 26667, // 770 Hz
 25386, // 852 Hz
 23877, // 941 Hz
 18662, // 1209 Hz
 15887, // 1336 Hz
 12622, // 1477 Hz
 8832, // 1633 Hz
};

so now use that into something like

constexpr size_t numSamples = 205;
constexpr unsigned long sampleIntervalUs = 125;
constexpr int16_t dtmfFreqs[] = {697, 770, 852, 941, 1209, 1336, 1477, 1633};
constexpr size_t nbTones = sizeof dtmfFreqs / sizeof *dtmfFreqs;
constexpr int16_t coeffsQ15[] = {
  27714, // 697 Hz
  26667, // 770 Hz
  25386, // 852 Hz
  23877, // 941 Hz
  18662, // 1209 Hz
  15887, // 1336 Hz
  12622, // 1477 Hz
  8832,  // 1633 Hz
};

constexpr int16_t adcCenterValue = 512; // 10-bit ADC center
static uint16_t sampleBlock[numSamples];

int32_t magnitudeSquaredGoertzelQ15(uint16_t *samples, int16_t coeff_q15, size_t numSamples) {
  int16_t Q1 = 0;
  int16_t Q2 = 0;
  int32_t Q0;
  for (size_t i = 0; i < numSamples; i++) {
    int16_t sample = (int16_t)samples[i] - adcCenterValue;
    Q0 = ((int32_t)coeff_q15 * Q1 >> 15) - Q2 + sample;
    Q2 = Q1;
    Q1 = (int16_t)Q0;
  }
  return (int32_t)Q1*Q1 + (int32_t)Q2*Q2 - ((int32_t)coeff_q15 * Q1 * Q2 >> 15);
}

// Acquire one sample at a time; return true when block is full
bool acquireSample() {
  static size_t counter = 0;
  static unsigned long lastMicros = 0;

  if (counter == 0) {
    lastMicros = micros() - sampleIntervalUs;
  }

  if (micros() - lastMicros < sampleIntervalUs) return false;
  lastMicros += sampleIntervalUs;

  sampleBlock[counter] = analogRead(A0);
  counter++;
  if (counter >= numSamples) {
    counter = 0;
    return true;
  }
  return false;
}

void setup() {
  Serial.begin(115200);
  while (!Serial); Serial.println();
  Serial.println(F("Goertzel Q15 demo"));
}

void loop() {
  if (acquireSample()) {
    for (size_t i = 0; i < nbTones; i++) {
      int32_t magnitude = magnitudeSquaredGoertzelQ15(sampleBlock, coeffsQ15[i], numSamples);
      Serial.print(F("Frequency "));
      Serial.print(dtmfFreqs[i]);
      Serial.print(F(" Hz: Magnitude "));
      Serial.println(magnitude);
    }
    Serial.println(F("------------"));
  }
}

may be that will work ?

PS: we could avoid the first code and use constexpr nowadays to compute the coeffsQ15 array at compile time (given you take an approximation of the cos() function as it's not a constexpr).

ShermanP · October 25, 2025, 2:37pm

The former. I should make clear that I am a hobbyist, and not experienced or trained in C++. I looked at that info you linked to, and don't have any idea what it's talking about. So I'll stick to manual if I can get it to work.

ShermanP · October 25, 2025, 2:48pm

@gcjr and @J-M-L, I really appreciate your posts with the code examples. I will go through them later today and try to understand. But I suspect you guys are pretty far over my head.

With respect to unions and my original question, I think the answer is to just >>14, and then make the long equal to an integer variable. If the size of the numbers involved is as I suspect, that should work. But of course I'll need to confirm that, or build in an overrun check. So then a union won't be needed.

J-M-L · October 25, 2025, 2:52pm

You can’t use union in C++ for what you have in mind, as mentioned by @Coding_Badly

ShermanP · October 25, 2025, 3:02pm

And yet, it works. It may be undefined in C++, but apparently it became defined in GCC or in the Arduino IDE. But it does actually work. As Galileo would say, "Eppure funziona". Are "setup" and "loop" defined in C++?

notthesharpestspooninthedrawer · October 25, 2025, 3:23pm

You’re comparing language behaviour with simple function names?

Coding_Badly · October 25, 2025, 4:21pm

It is not undefined. It triggers undefined behavior. Which includes undefined behavior in the compiler. Please do us all a favor, including yourself, reduce your ignorance.

Topic		Replies	Views
Floating point tones Programming	27	219	December 10, 2025
Casting uint16_t to uint8_t but >>1 shifts high byte as well as low Programming	40	643	June 20, 2025
Left an right shift a Long and also have access to the 4 individual bytes (efficently) Programming	31	2338	March 26, 2022
Why is division by 4096 not the same as ">>12"? Programming	24	474	January 7, 2025
Arbitrary Precision Decimal Arithmetic by Arduino Programming	39	560	June 7, 2025

Fixed point signed multiplication - shifts and unions

Related topics