Audio over IP System

I am interested in having someone develop a two-way audio over IP device for integration in a pre-existing system.

The microcontroller must support 16-bit audio input and output channels. Full duplex would be nice but is not necessary; simplex is OK.

Since I plan to integrate multiple channels in one device, ideally, only one network interface/ethernet shield would be needed, but if that is not possible, it is OK. I just want to minimize the amount of hardware needed.

AES256 encryption is preferred.

Attached is a diagram of what I hope to accomplish. If there are more questions, please ask.

Cool topic - my favorite (Audio via Network):

There is a commercial solution: Audinate (DANTE).

But there are also open source or "private use" solutions:
PulseAudio (mainly Linux)

The most recent one I have tried (and it works well for me): VB-Audio, Voicemeeter.

Also JACK (JACK Audio Connection Kit).

In general, it is not a big issue to bring audio via network. Many channels, in both directions.
These solutions often use UDP and a special "audio packet" in network frames.

You can also consider this:

  • use UDP with RTP
  • pack the audio into an RTP frame
    Such a stream can be received by VLC player (but VLC needs an SDP file to open the "stream" on the network).
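
The RTP option above can be sketched in a few lines of Python. This is only a minimal illustration, assuming uncompressed 16-bit mono audio sent with the static RTP payload type 11 (L16, 44.1 kHz, mono, from RFC 3551) and a fixed 12-byte header without CSRC entries or extensions. The function names `build_rtp_packet` and `samples_to_l16` are made up for this example.

```python
import struct

RTP_VERSION = 2
PAYLOAD_TYPE_L16_MONO = 11  # RFC 3551 static type: L16, 44.1 kHz, mono (assumed format)


def samples_to_l16(samples):
    """Pack signed 16-bit samples in network byte order (L16 is big-endian)."""
    return struct.pack("!%dh" % len(samples), *samples)


def build_rtp_packet(seq, timestamp, ssrc, payload,
                     payload_type=PAYLOAD_TYPE_L16_MONO):
    """Prepend a 12-byte RTP header (no padding, extension, CSRC, marker)."""
    byte0 = RTP_VERSION << 6        # V=2, P=0, X=0, CC=0
    byte1 = payload_type & 0x7F     # M=0, 7-bit payload type
    header = struct.pack(
        "!BBHII",
        byte0,
        byte1,
        seq & 0xFFFF,               # sequence number wraps at 16 bits
        timestamp & 0xFFFFFFFF,     # timestamp counts samples, wraps at 32 bits
        ssrc & 0xFFFFFFFF,          # stream identifier chosen by the sender
    )
    return header + payload
```

The sender would increment `seq` by one per packet and `timestamp` by the number of samples per packet, then send each packet with a plain UDP socket. With a static payload type, the media line in the SDP file for VLC would look something like `m=audio 5004 RTP/AVP 11` (port 5004 is an assumption here).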

Take a look at how existing audio streaming solutions work (e.g. VLC player, VLC streamer).
You can also think about "Spotify" as a 'format': it just becomes a question of the "network protocol", i.e. how to pack audio into network packets so that "any" application can receive it.

You can even use your own "personal" protocol or format: pack audio into UDP frames and write a Python script that listens on a UDP socket. When the script receives a packet, it forwards the audio to the PC audio interface.
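
A "personal" format like this could look roughly as follows. This is just a sketch under assumptions: the 4-byte header (sequence number plus sample count) and the names `make_packet`, `parse_packet` and `receive_audio` are invented for illustration, and the samples are assumed to be little-endian signed 16-bit mono. The audio output is left as a callback (`sink`), so the sketch does not depend on any particular audio library.

```python
import socket
import struct

# Hypothetical header: 16-bit sequence number, 16-bit sample count (big-endian).
HEADER = struct.Struct("!HH")


def make_packet(seq, samples):
    """Build one datagram: header followed by little-endian int16 samples."""
    body = struct.pack("<%dh" % len(samples), *samples)
    return HEADER.pack(seq & 0xFFFF, len(samples)) + body


def parse_packet(data):
    """Split a datagram into (sequence number, tuple of int16 samples)."""
    seq, count = HEADER.unpack_from(data)
    samples = struct.unpack_from("<%dh" % count, data, HEADER.size)
    return seq, samples


def receive_audio(sock, sink, max_packets):
    """Read max_packets datagrams from sock and hand each one to sink."""
    for _ in range(max_packets):
        data, _addr = sock.recvfrom(2048)   # one datagram = one audio packet
        seq, samples = parse_packet(data)
        sink(seq, samples)                  # e.g. write to the PC audio output
```

On the PC, the script would create a `socket.socket(socket.AF_INET, socket.SOCK_DGRAM)`, bind it to the port the MCU sends to, and pass a `sink` that writes the samples to the sound card. The sequence number lets the receiver notice lost or reordered packets, which UDP does not prevent.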

I would suggest: check out "VB-Audio", "VoiceMeeter" (double e). It does for me what you want to do. (I also like Audinate DANTE, but I am not allowed to use it privately, even though I have it working.)

BTW: encrypting audio, e.g. via AES256, might add a lot of delay (latency, jitter).
Many network audio solutions (e.g. Audinate DANTE) do not encrypt. With encryption, it might not be "real-time" anymore.

Sure, I understand: personal audio sent via a public network should be encrypted. But it is a trade-off: you can encrypt and decrypt audio, but both sender and receiver might see a larger latency.

The encryption itself is a separate topic (it is not specific to audio, it applies to "any" data stream). How to do it on an MCU - no idea. But streaming audio via network from an MCU - I can help:
I have MICs working on a Portenta H7, and I send the audio via network (VB-Audio) to a host PC (but unencrypted).

You have to deal with different topics:
a) how to bring audio via network to a remote host (PC)?
b) how to encrypt the data stream (as encrypting any network traffic)?
c) also, how to provide the key for decryption?

Thank you for the detailed response.

Can you tell me how many audio channels you can get the Portenta to support at one time (an audio channel to me is both input and output)?

Can it be 16-bit audio? It seems like PyAudio does not like 8-bit streams, at least in my experience.

I guess you are asking "how many audio channels can I bring over the network"?

8-bit audio format: very old (telephony, and then often with encoders); potentially "nobody" supports it anymore (16-bit and 24-bit are common).

Assume ETH is connected (not WiFi: WiFi throughput is a different topic, e.g. bandwidth drops due to weak RF signals):

  • Portenta H7 (breakout board) has an ETH connector
  • it is a 100 Mbps ETH
  • I have measured with a single-channel UDP stream: it looks like I can achieve approx. 90 Mbps throughput.

BUT:

  • how many channels (different UDP port streams) you can use depends on the MCU FW, the DMA buffers, the latency in your FW etc.
  • The more audio streams (UDP port connections) you have, the more DMA buffers, DMA descriptors etc. you need. This might be limited, e.g. to 8 or 16 parallel open network streams (sockets)
  • the more audio streams are running, the busier the MCU FW is handling, sending and receiving ETH network traffic, and the fewer channels are possible in parallel (you are simply limited by MCU performance)

There is no clear number. Also, IN and OUT in parallel is challenging: you stress the MCU FW by handling both at the same time. So, compared to just "Send OUT Audio via Network" (uni-directional), going bi-directional will cut the number of possible channels in half (at least).

I would guess: 4 parallel audio channels via network are possible, potentially also bi-directional (4 + 4). But more than this? Not enough memory and MCU performance.

It all depends on the FW "load" and also on the audio format:
If you change from 16-bit to 24-bit audio format (assuming the samples are now sent as 32-bit words), the number of channels goes down as well, due to the larger amount of data the MCU has to handle (and the buffers needed, the time to copy into buffers ...).
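
To put numbers on the format question, here is a small back-of-the-envelope calculation in Python. It is only a sketch under assumptions that are not from the measurements above: a 48 kHz sample rate, one packet per millisecond (48 samples per packet), and 42 bytes of Ethernet + IPv4 + UDP headers per packet (preamble, FCS and inter-frame gap are ignored). The function names are made up for this example.

```python
ETH_IP_UDP_HEADER_BYTES = 14 + 20 + 8   # Ethernet + IPv4 + UDP headers (assumed)


def raw_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bit rate of the bare PCM samples, without any network overhead."""
    return sample_rate_hz * bits_per_sample * channels


def wire_bitrate_bps(sample_rate_hz, bits_per_sample, samples_per_packet,
                     header_bytes=ETH_IP_UDP_HEADER_BYTES):
    """Bit rate of one mono stream including per-packet header overhead."""
    packets_per_second = sample_rate_hz // samples_per_packet
    payload_bytes = samples_per_packet * bits_per_sample // 8
    return packets_per_second * (header_bytes + payload_bytes) * 8


# Assumed example: 48 kHz mono, 1 ms packets
rate_16 = wire_bitrate_bps(48_000, 16, 48)   # 16-bit samples
rate_32 = wire_bitrate_bps(48_000, 32, 48)   # 24-bit audio sent as 32-bit words
```

Under these assumptions, one 16-bit stream needs about 1.1 Mbps on the wire and one 32-bit stream about 1.9 Mbps. The link bandwidth is therefore rarely the limit; the per-packet work and the buffer memory on the MCU are.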

Rough estimate:

  • assume your MCU runs at 480 MHz and you have 100 Mbps ETH: the ETH bit rate is already about 1/5 of the MCU clock, or like: "5 assembly code instructions per ETH bit". That is not much: with C/C++ code, overhead in functions, latency ... you will potentially never manage to run one single stream at 100 Mbps (way too fast for the MCU performance; assume you need 100 instructions, each taking 2 MCU clock cycles, when sending a network stream).
  • also: the allocation of ETH DMA buffers and descriptors is set by your FW. No idea how it is in Arduino/mbed, but for my native STM32 project: I have just 8 per direction. So, no more than 8 channels are possible, because afterwards I am out of buffers and descriptors.
  • I could tweak the config to have more DMA buffers and descriptors, but this reduces the amount of memory remaining for other stuff (e.g. stack size, other local buffers). So, it is a "trade-off".
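
The rough estimate above can be written down as simple arithmetic. This is only a sketch: the 480 MHz clock, the 100 Mbps link and the 8 descriptors per direction come from the points above, while the 1536-byte buffer size and the packet rate in the example are assumptions for illustration.

```python
def cycles_per_bit(cpu_hz, link_bps):
    """MCU clock cycles available per bit at full link speed."""
    return cpu_hz / link_bps


def cycles_per_packet(cpu_hz, packets_per_second):
    """MCU clock cycles available per packet at a given total packet rate."""
    return cpu_hz // packets_per_second


def dma_buffer_bytes(descriptors_per_direction, buffer_bytes):
    """SRAM used by ETH DMA buffers for RX and TX together."""
    return 2 * descriptors_per_direction * buffer_bytes


budget_per_bit = cycles_per_bit(480_000_000, 100_000_000)
# 8 streams (4 in + 4 out) at an assumed 1000 packets per second each
budget_per_packet = cycles_per_packet(480_000_000, 8 * 1000)
# 8 descriptors per direction, assumed 1536 bytes per buffer
buffer_memory = dma_buffer_bytes(8, 1536)
```

With these numbers there are fewer than 5 cycles per bit at full line rate, but about 60,000 cycles per packet for eight 1 ms audio streams, and the ETH buffers alone take 24 KB of SRAM. Doubling the descriptors doubles that memory, which is the "trade-off" mentioned above.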

MCU memory is "small", and it is not just the MCU performance that determines what is possible: the amount of memory available also limits how many channels are possible.
But how many? It depends on your FW design (config, buffer allocations ...).

So, I guess (!): 4 in every direction for audio is reasonable. And even this might be tough: you have to program the FW for "best performance" to get there. The more channels, the more overhead, e.g. also due to the RTOS threads involved and context switches.

A 480 MHz MCU with 1 MB SRAM is for sure limited in speed and in the amount of data it can handle.
Personally I would assume: FOUR audio streams in and out, but none of them should have a higher throughput than 10 Mbps (enough for a regular audio channel). And your MCU will be pretty busy (not enough performance left for LCD screens, or for audio processing such as DFT/spectrum ...).
