wasn't sure why some "commands" had the number first. as you can see, isdigit() is easy to use
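a minimal sketch of how isdigit() could sort out a token pair whichever order it arrives in. decodePair() is a hypothetical helper, not something from the posted code, and it assumes each pair is one word token plus one integer token.

```cpp
#include <ctype.h>
#include <stdlib.h>

// Decode one token pair into a command letter and a value, accepting
// either order ("X 10" or "10 X"). Returns false unless the pair is
// exactly one word token plus one numeric token.
bool decodePair(const char *a, const char *b, char &cmd, int &val)
{
    // a token counts as numeric if it starts with a digit or a minus sign
    bool aNum = isdigit((unsigned char) a[0]) || a[0] == '-';
    bool bNum = isdigit((unsigned char) b[0]) || b[0] == '-';
    if (aNum == bNum)
        return false;               // two numbers or two words

    const char *num  = aNum ? a : b;
    const char *word = aNum ? b : a;
    if (! isalpha((unsigned char) word[0]))
        return false;               // command must start with a letter

    cmd = (char) toupper((unsigned char) word[0]);
    val = atoi(num);
    return true;
}
```

the casts to unsigned char are there because the ctype functions are only defined for values in that range (or EOF).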
i wonder if it might just be easier to sequentially process the tokens (in pairs) rather than translate them and store them in a struct.
the code can keep track of the number of tokens as well as which token pair is being processed. however, it seems delays are needed both to give the robot time to move the specified x and y distances and to pause between movements.
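stepping through the tokens in pairs might look something like the sketch below. the names (toks, nTok, pairIdx, tokenize(), nextPair()) are made up for illustration, and it assumes the command letter comes first in each pair and that tokens are separated by spaces or commas.

```cpp
#include <stdlib.h>
#include <string.h>

const int MaxTok = 10;
char *toks [MaxTok];    // pointers into the line buffer
int   nTok    = 0;      // number of tokens found
int   pairIdx = 0;      // which pair is processed next

// split the line in place on spaces/commas; returns the token count
int tokenize(char *line)
{
    nTok = pairIdx = 0;
    for (char *t = strtok(line, " ,"); t && nTok < MaxTok; t = strtok(NULL, " ,"))
        toks [nTok++] = t;
    return nTok;
}

// fetch the next pair; returns false when no complete pair remains
bool nextPair(char &cmd, int &val)
{
    if (2 * pairIdx + 1 >= nTok)
        return false;
    cmd = toks [2 * pairIdx][0];            // assumed: letter first
    val = atoi(toks [2 * pairIdx + 1]);     // assumed: number second
    pairIdx++;
    return true;
}
```

loop() would call nextPair() only when the previous action has finished, so nothing needs to be translated and stored up front. a dangling token at the end of an odd-length line is simply ignored.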
millis() can be used to determine when those delays have passed without blocking other code execution (e.g. waiting for serial commands that may want to abort the current operation).
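a rough sketch of such a non-blocking timer. the Timer struct is hypothetical; the current time is passed in as an argument so the struct has no hardware dependency, and on the arduino the caller would pass millis().

```cpp
#include <stdint.h>

// One-shot, non-blocking timer. Nothing here waits; the caller polls
// expired() each time through loop().
struct Timer {
    uint32_t start    = 0;
    uint32_t duration = 0;
    bool     running  = false;

    void begin(uint32_t now, uint32_t msec) {
        start    = now;
        duration = msec;
        running  = true;
    }

    // returns true exactly once, when the interval has elapsed.
    // the unsigned subtraction stays correct across counter rollover.
    bool expired(uint32_t now) {
        if (running && now - start >= duration) {
            running = false;
            return true;
        }
        return false;
    }
};
```

in loop() this would be polled with something like `if (timer.expired(millis()))`, with the serial check done on every pass regardless, so an abort command is seen while a movement is still timing out.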
but there are different types of "actions" that are being waited on (e.g. an x or y movement, command delay). it seems some state needs to keep track of what action is next
a complete line of input, up to a newline (\n), can easily be read using readBytesUntil(). i've shown you how that line can be tokenized and decoded afterwards
i don't understand the code posted. it's not obvious how the robot maintains vertical and horizontal movements (is it in a maze with walls using "feelers", or dead reckoning)?
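one way reading the line could be wrapped, as a sketch. readLine() is a hypothetical helper; it is written as a template only so it can be tried against something other than Serial, and it assumes the object passed in offers readBytesUntil() with the same signature as the arduino Stream class.

```cpp
#include <stddef.h>

// Read one line from any object with an Arduino-style readBytesUntil()
// (e.g. Serial). Returns the length of the line stored in buf.
template <typename StreamT>
size_t readLine(StreamT &in, char *buf, size_t size)
{
    // readBytesUntil() stores neither the terminator nor a NUL,
    // so leave room for one and add it here
    size_t n = in.readBytesUntil('\n', buf, size - 1);
    if (n > 0 && buf [n - 1] == '\r')   // drop the CR of a CR/LF ending
        n--;
    buf [n] = '\0';
    return n;
}
```

on the arduino this would be called as `readLine(Serial, buf, sizeof(buf))`, preferably only after Serial.available() reports data, since readBytesUntil() waits up to the stream timeout if no newline arrives.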
regardless of how movements are handled, loop() needs to read commands thru some interface (serial) and process tokens as described above.
i imagine a moveto() routine determining a sub-sequence of operations. it initiates an x or y move, which ends when a millis() based timer expires in loop(), and finally updates the robot's position.
as i suggested, a timer expiration may invoke some processing function that keeps track (i.e. state) of what sub-sequence step is next (X, Y or cmd-delay)
i assume all this sounds daunting. but as the code i posted demonstrated, it can be broken down into sub-processing steps (e.g. tokenize, decode, ...)
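the sequencing described above could be sketched as a small state machine. everything here is hypothetical: the Mover struct, the step names, the 10 msec per unit of travel and the 200 msec command delay are placeholders, and the comments mark where real motor control would go. the time is passed in so that loop() can supply millis().

```cpp
#include <stdint.h>

enum Step { Idle, MoveX, MoveY, CmdDelay };

struct Mover {
    Step     step  = Idle;
    uint32_t start = 0, wait = 0;
    int      x = 0, y = 0;          // current position
    int      tgtX = 0, tgtY = 0;    // requested position

    static const uint32_t MsecPerUnit = 10;     // assumed travel time
    static const uint32_t CmdDelayMs  = 200;    // assumed pause

    // enter a step and start its timer
    void arm(Step s, uint32_t now, uint32_t msec) {
        step = s; start = now; wait = msec;
    }

    static uint32_t travel(int from, int to) {
        return MsecPerUnit * (uint32_t) (to > from ? to - from : from - to);
    }

    // begin a move; the X leg runs first
    void moveto(int nx, int ny, uint32_t now) {
        tgtX = nx; tgtY = ny;
        arm(MoveX, now, travel(x, tgtX));   // X motors would start here
    }

    // call from loop(); advances one step each time the timer expires
    void update(uint32_t now) {
        if (step == Idle || now - start < wait)
            return;
        switch (step) {
        case MoveX:                         // X leg done, start Y leg
            x = tgtX;
            arm(MoveY, now, travel(y, tgtY));
            break;
        case MoveY:                         // Y leg done, start the pause
            y = tgtY;
            arm(CmdDelay, now, CmdDelayMs);
            break;
        default:                            // pause done, ready for next pair
            step = Idle;
            break;
        }
    }

    void cancel() { step = Idle; }          // e.g. on an abort command
};
```

loop() would call `mover.update(millis())` on every pass and only fetch the next token pair once `mover.step` is back to Idle.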