My question now is: how can I send the data and the control commands and, at the same time, also a picture when one is requested?
Every block of data that you send needs to contain something that identifies what kind of data it is, for example whether it is camera data or not. The app that receives the data then parses each block and uses that identifier to decide what to do with it.
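One way to do this, as a rough sketch, is to put a small header in front of every block: one byte for the type and four bytes for the payload length. The type codes, the header layout, and the names `pack_message` and `MessageParser` below are all assumptions made for illustration; they are not part of any particular library or of your existing protocol.

```python
import struct

# Hypothetical type codes; pick whatever values suit your protocol.
MSG_DATA = 0x01
MSG_COMMAND = 0x02
MSG_IMAGE = 0x03

# Assumed header layout: 1-byte type, 4-byte big-endian payload length.
HEADER = struct.Struct("!BI")


def pack_message(msg_type: int, payload: bytes) -> bytes:
    """Prefix the payload with its type and length."""
    return HEADER.pack(msg_type, len(payload)) + payload


class MessageParser:
    """Collects incoming bytes and splits them back into typed messages."""

    def __init__(self) -> None:
        self._buffer = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add received bytes; return every complete (type, payload) pair."""
        self._buffer.extend(chunk)
        messages = []
        while len(self._buffer) >= HEADER.size:
            msg_type, length = HEADER.unpack_from(self._buffer, 0)
            end = HEADER.size + length
            if len(self._buffer) < end:
                break  # payload not fully received yet, wait for more bytes
            messages.append((msg_type, bytes(self._buffer[HEADER.size:end])))
            del self._buffer[:end]
        return messages


# Usage: the sender interleaves sensor data, an image, and a command.
stream = (
    pack_message(MSG_DATA, b"temp=21.5")
    + pack_message(MSG_IMAGE, b"\xff\xd8fake-jpeg-bytes")
    + pack_message(MSG_COMMAND, b"STOP")
)

parser = MessageParser()
# Bytes may arrive in arbitrary pieces, so feed them in two chunks.
received = parser.feed(stream[:7]) + parser.feed(stream[7:])
for msg_type, payload in received:
    if msg_type == MSG_IMAGE:
        print("image,", len(payload), "bytes")
    elif msg_type == MSG_COMMAND:
        print("command:", payload.decode())
    else:
        print("data:", payload.decode())
```

The length field matters as much as the type field: a picture is usually much larger than a data or command block and may arrive in several pieces, so the receiver needs to know where one block ends and the next begins.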