Serial Encoding to 7-bit using Base85 or anything else - your thoughts?

I'd like to develop a generic strategy for specialising Firmata sketches for bespoke circuits. The aim is to preserve all the functionality of StandardFirmata whilst also permitting build-specific messages to be sent. I have a strategy in mind, building on the Base85 C code distributed as part of the code-versioning tool 'git', which I think should permit the easy encoding of arbitrary multi-byte data into the 7-bit data space of Firmata Sysex messages. I'd like to get feedback on the feasibility and possible bear traps of the proposed approach, as well as preferred alternatives.

An example which requires this approach is a two-wheeled robot platform. The proposed Arduino sketch code should support the sending of messages to control the two stepper motors. It should also permit the requesting of measured distances from an ultrasonic sensor. However, students experimenting with the robot may also wish to attach arbitrary other analog sensors or digital components to the robot, which they should be able to query over Bluetooth serial as part of control logic in Python through pyFirmata. Building on StandardFirmata, changing the Serial data rate (to be consistent with Arduino code upload) and adding messages to support the extra hardware seems a natural way to serve all these needs.

However, it's hard to encode, for example, a 'long' integer value using the recommended Firmata approach. The AccelStepper library I'm using employs a 'long' as its representation of stepper position.

Firmata recommends the use of the Sysex extension to achieve build-specific messaging, but data must be sent in a 7-bit encoding, as the high bit of each byte is reserved to flag Firmata control messages (including the start and end of Sysex). With a 4-byte 'long' it's hard to know exactly how to shove it into 7-bit bytes in a comprehensible and debuggable way.
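For comparison, here is a minimal sketch of the plain approach: splitting the 32-bit value into five 7-bit groups, least significant first, in the same spirit as Firmata's two-byte encoding of 14-bit values. The function names encodeLong7 and decodeLong7 are made up for illustration and are not part of the Firmata library.

```cpp
#include <stdint.h>

// Hypothetical helpers, not part of Firmata.
// Split a 32-bit value into five 7-bit groups, least significant group first.
void encodeLong7(int32_t value, uint8_t out[5]) {
  uint32_t bits = (uint32_t)value;     // work on the raw bit pattern
  for (int i = 0; i < 5; i++) {
    out[i] = (uint8_t)(bits & 0x7F);   // high bit is always clear
    bits >>= 7;
  }
}

// Reassemble the 32-bit value from five 7-bit groups.
int32_t decodeLong7(const uint8_t in[5]) {
  uint32_t bits = 0;
  for (int i = 4; i >= 0; i--) {
    bits = (bits << 7) | (uint32_t)(in[i] & 0x7F);
  }
  return (int32_t)bits;
}
```

This costs the same 5 bytes on the wire as Base85 does for a long, but the encoded bytes are not printable text, so they are harder to eyeball in a serial monitor.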

Enter Base85 (see Ascii85 on Wikipedia). This has a canonical (and efficient) implementation as part of the git version control utility. The code is base85.c in the git/git repository on GitHub. The transform can be reversed using mainstream utilities for debugging.

I would like to know if there are any reasons why the git code for base85 linked above should not be ported to Arduino in order to facilitate the encoding of arbitrary bytestreams (including the 4 bytes of my long) into the 7-bit data space of Firmata Sysex messages. This would mean that generic sendInt, sendFloat, receiveInt, receiveFloat operations could be defined on each side of the Firmata link, using the base85 encoding on the component bytes of each type to create a straightforward wrapper for data.

I've arrived at code which almost compiles, but I'm hoping to get insights from others on the practicality of the approach. There may also be a much easier model for encoding and decoding bytes into a 7-bit transport, or a better base85 code reference which I should adopt instead.

The code I'm proposing to use to wrap the base85 invocations from git would look like this (I'm now back on the machine with the code).

Hopefully the example code using these wrappers (also inlined below) underlines the simplicity of the approach for users.

I've copied the base85 encoding and decoding functions unmodified from the git source (base85.c in the git/git repository on GitHub). These are just the wrapper functions intended to call those functions.

//CH generic function to write 4 bytes (a long) as 5 7-bit bytes using Base85
void writeBase85Long(long toencode){
    //6 bytes, not 5: git's encode_85 appends a terminating NUL after the 5 encoded characters
    char encodedBytes[6];
    encode_85(encodedBytes, (const unsigned char*)&toencode, 4);
    //send only the 5 encoded characters, not the NUL
    Serial.write((const uint8_t*)encodedBytes, 5);
}
 
//CH: Just read one long (4 bytes) from base85 encoded source (5 bytes)
long readBase85Long(byte* src){
  long decodedLong = 0; //stays 0 if decode_85 rejects the input
  decode_85((char*)&decodedLong, (const char*)src, 4);
  return decodedLong;
}

//CH: Read a series of longs (4 bytes) from base85 encoded source (5 bytes), returning number of source bytes read
int readBase85Longs(byte* src, long* dst, int numLongs){
  for(int count = 0; count < numLongs; count++){
    dst[count]=readBase85Long(src);
    src+=5; //reading 4 bytes, but will consume 5 src bytes to do it
  }
  return numLongs * 5;
}

...and a typical invocation of this wrapper code in the context of decoding/encoding a Sysex message within the standard switch of Firmata's sysexCallback function might look like this...

  case STEPPER_MESSAGE: {
    //braces give the locals their own scope, so later case labels don't jump over their initialisation
    byte stepperSide = argv[0]; //the side is the first data byte, not the pointer itself
    long stepperTarget = readBase85Long(argv + 1);
    long stepperCurrent = -1;
    if(stepperSide == STEPPER_LEFT){
      //left motor control
      leftStepper.moveTo(stepperTarget);
      stepperCurrent = leftStepper.currentPosition();
    }
    else if(stepperSide == STEPPER_RIGHT){
      //right motor control
      rightStepper.moveTo(stepperTarget);
      stepperCurrent = rightStepper.currentPosition();
    }
    //always report back current position
    Serial.write(START_SYSEX);
    Serial.write(STEPPER_MESSAGE);
    Serial.write(stepperSide);
    writeBase85Long(stepperCurrent);
    Serial.write(END_SYSEX);
    break;
  }

Particularly worrying among the compiler warnings is...
"warning, right shift count >= width of type"
...which I guess comes from the git code in base85.c assuming an int of at least 32 bits. On the 8-bit AVR boards an int is only 16 bits, so the shifts by 24 in that code exceed the width of the type.
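One way around the width problem is to use fixed-width types instead of plain int. The following is a minimal sketch of my own, not the git code: it encodes a single 32-bit value as 5 characters and decodes it again. The names encode85Word and decode85Word are hypothetical, and the alphabet is the RFC 1924 one, which I understand git and Python's base64.b85decode both use (worth checking against both before relying on it).

```cpp
#include <stdint.h>
#include <string.h>

// RFC 1924 alphabet (assumed to match git's base85.c and Python's b85 functions).
static const char kAlphabet85[] =
    "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+-;<=>?@^_`{|}~";

// Encode one 32-bit value as 5 printable characters, most significant digit
// first. out needs room for 6 chars: 5 digits plus a terminating NUL.
void encode85Word(uint32_t value, char out[6]) {
  for (int i = 4; i >= 0; i--) {
    out[i] = kAlphabet85[value % 85];
    value /= 85;
  }
  out[5] = '\0';
}

// Decode 5 characters back into a 32-bit value. Returns false on a character
// outside the alphabet or on a value that does not fit in 32 bits.
bool decode85Word(const char in[5], uint32_t *value) {
  uint64_t acc = 0;  // 85^5 exceeds 2^32, so accumulate in 64 bits
  for (int i = 0; i < 5; i++) {
    const char *pos = (in[i] != '\0') ? strchr(kAlphabet85, in[i]) : NULL;
    if (pos == NULL) return false;
    acc = acc * 85 + (uint64_t)(pos - kAlphabet85);
  }
  if (acc > 0xFFFFFFFFULL) return false;
  *value = (uint32_t)acc;
  return true;
}
```

Note that git's encode_85 places the first input byte in the most significant position, so to interoperate with it the 32-bit value here would have to be the big-endian reading of the four bytes. The 64-bit accumulator and the division by 85 are both slow on an 8-bit AVR, which is part of why this feels heavy for the job.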

I think I have to go back to the drawing board.

I still like the idea of base85 (especially given Python support for decoding it) but would struggle to implement a solid, conformant implementation. Perhaps a subset...?

Any other ideas on how to tackle encoding multiple-of-8-bit data types into 7-bit bytes in simple-to-manage, easy-to-understand, debuggable ways?

This strategy has developed into a home-grown, but probably more elegant, approach which avoids the overengineering of Base85.

Essentially, I've written a routine which pushes the extra bit per byte into an overflow byte which is then written (at most) every 8th byte. This needs only masks and shifts rather than division by 85, is never larger on the wire than Base85, and is easier to follow.
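To make the idea concrete, here is a minimal sketch of that kind of packing. It is an illustration under my own assumptions (function names pack7 and unpack7, overflow byte placed after each group of up to 7 data bytes, bit i of the overflow byte holding the high bit of the i-th byte in the group), not necessarily the exact layout of the routine discussed on the issue.

```cpp
#include <stddef.h>
#include <stdint.h>

// Hypothetical names and framing, for illustration only.
// Pack len bytes into 7-bit bytes. After each group of up to 7 data bytes,
// one extra byte carries the stripped high bits. Returns the number of bytes
// written to out, which must hold at least len + (len + 6) / 7 bytes.
size_t pack7(const uint8_t *in, size_t len, uint8_t *out) {
  size_t written = 0;
  uint8_t overflow = 0;
  uint8_t inGroup = 0;
  for (size_t i = 0; i < len; i++) {
    out[written++] = in[i] & 0x7F;
    if (in[i] & 0x80) overflow |= (uint8_t)(1u << inGroup);
    if (++inGroup == 7) {            // group full: emit the overflow byte
      out[written++] = overflow;
      overflow = 0;
      inGroup = 0;
    }
  }
  if (inGroup > 0) out[written++] = overflow;  // flush a partial group
  return written;
}

// Reverse pack7. Returns the number of decoded bytes written to out.
size_t unpack7(const uint8_t *in, size_t len, uint8_t *out) {
  size_t written = 0;
  size_t pos = 0;
  while (pos < len) {
    size_t remaining = len - pos;
    // A full group is 7 data bytes + 1 overflow byte; the final group may be
    // shorter, with its last byte being the overflow byte.
    size_t group = (remaining >= 8) ? 7 : remaining - 1;
    uint8_t overflow = in[pos + group];
    for (size_t i = 0; i < group; i++) {
      uint8_t high = (uint8_t)(((overflow >> i) & 1u) << 7);
      out[written++] = (uint8_t)(in[pos + i] | high);
    }
    pos += group + 1;
  }
  return written;
}
```

With this layout a 4-byte long occupies 5 bytes in the Sysex payload, and 7 bytes of data occupy 8, with every transmitted byte keeping its high bit clear.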

You can see some more discussion at the relevant issue on the pyFirmata GitHub, with some example (untested) code...