Some thoughts from what I have understood of the UOTGHS controller (fully USB 2.0 compliant):
Before a USB Host sets a USB Device configuration, the Host retrieves all the necessary Descriptors from the Device (enumeration stage). It then selects parameters among those the Device advertises through its Descriptors and sets (in fact, proposes) a configuration to the Device (negotiation stage between Host and Device). Finally, the Device answers with the same configuration or another one (end of negotiation stage).
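To make the enumeration requests concrete, here is a minimal sketch of the standard 8-byte SETUP packets the Host sends on the control endpoint. The request codes and field layout come from chapter 9 of the USB 2.0 specification; the type and function names (usb_setup_t, usb_get_descriptor, etc.) are made up for illustration and are not part of any Arduino or Atmel library.

```c
#include <stdint.h>

/* Fields of a standard SETUP packet (USB 2.0 spec, chapter 9). */
typedef struct {
    uint8_t  bmRequestType; /* direction, type, recipient */
    uint8_t  bRequest;      /* request code */
    uint16_t wValue;
    uint16_t wIndex;
    uint16_t wLength;       /* bytes in the data stage */
} usb_setup_t;

/* Standard request codes and descriptor types. */
enum { USB_REQ_SET_ADDRESS = 0x05, USB_REQ_GET_DESCRIPTOR = 0x06,
       USB_REQ_SET_CONFIGURATION = 0x09 };
enum { USB_DESC_DEVICE = 0x01, USB_DESC_CONFIGURATION = 0x02 };

/* GET_DESCRIPTOR: device-to-host, standard, recipient = device (0x80). */
usb_setup_t usb_get_descriptor(uint8_t type, uint8_t index, uint16_t length)
{
    usb_setup_t s = { 0x80, USB_REQ_GET_DESCRIPTOR,
                      (uint16_t)((type << 8) | index), 0, length };
    return s;
}

/* SET_ADDRESS: host-to-device, no data stage. */
usb_setup_t usb_set_address(uint8_t address)
{
    usb_setup_t s = { 0x00, USB_REQ_SET_ADDRESS, address, 0, 0 };
    return s;
}

/* SET_CONFIGURATION: wValue carries bConfigurationValue from the
 * Configuration Descriptor. */
usb_setup_t usb_set_configuration(uint8_t configuration_value)
{
    usb_setup_t s = { 0x00, USB_REQ_SET_CONFIGURATION,
                      configuration_value, 0, 0 };
    return s;
}

/* Serialize to the wire format: multi-byte fields are little-endian. */
void usb_setup_to_bytes(const usb_setup_t *s, uint8_t out[8])
{
    out[0] = s->bmRequestType;
    out[1] = s->bRequest;
    out[2] = (uint8_t)(s->wValue & 0xFF);
    out[3] = (uint8_t)(s->wValue >> 8);
    out[4] = (uint8_t)(s->wIndex & 0xFF);
    out[5] = (uint8_t)(s->wIndex >> 8);
    out[6] = (uint8_t)(s->wLength & 0xFF);
    out[7] = (uint8_t)(s->wLength >> 8);
}
```

A typical order would be usb_get_descriptor(USB_DESC_DEVICE, 0, 18), then usb_get_descriptor(USB_DESC_CONFIGURATION, 0, 9) to learn wTotalLength, then the full Configuration Descriptor, and finally usb_set_configuration() with the value the Device reported. How the 8 bytes are actually pushed into the UOTGHS pipe is not shown here.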
AFAIK, you only need a USB cable between the Native USB port (a Micro-AB receptacle, which also accepts a Micro-A plug) and the USB Device.
A guess: maybe you haven't extracted the correct parameters from the Device Descriptors, so the configuration you set is not valid and the Device responds with a STALL.
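As a way to check which parameters you are extracting, here is a small sketch that walks the bytes returned for a Configuration Descriptor request and pulls out bConfigurationValue (the value to pass to SET_CONFIGURATION). The field offsets and descriptor type codes come from the USB 2.0 specification; config_summary_t and parse_configuration are hypothetical names, and the code assumes you already have the raw bytes in a buffer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of what the Host needs from the descriptor set. */
typedef struct {
    uint16_t total_length;        /* wTotalLength */
    uint8_t  num_interfaces;      /* bNumInterfaces */
    uint8_t  configuration_value; /* bConfigurationValue */
    uint8_t  num_endpoints;       /* endpoint descriptors found */
} config_summary_t;

/* Returns 0 on success, -1 if the buffer is malformed or shorter than
 * wTotalLength (in that case, re-request with wLength = total_length). */
int parse_configuration(const uint8_t *buf, size_t len, config_summary_t *out)
{
    size_t pos = 0;

    /* The header is 9 bytes long and has descriptor type 0x02. */
    if (len < 9 || buf[0] < 9 || buf[1] != 0x02)
        return -1;

    out->total_length        = (uint16_t)(buf[2] | (buf[3] << 8));
    out->num_interfaces      = buf[4];
    out->configuration_value = buf[5];
    out->num_endpoints       = 0;

    if (out->total_length > len)
        return -1;

    /* Every descriptor starts with bLength, then bDescriptorType. */
    while (pos + 2 <= out->total_length) {
        uint8_t dlen  = buf[pos];
        uint8_t dtype = buf[pos + 1];

        if (dlen < 2 || pos + dlen > out->total_length)
            return -1;
        if (dtype == 0x05) /* ENDPOINT descriptor */
            out->num_endpoints++;
        pos += dlen;
    }
    return (pos == out->total_length) ? 0 : -1;
}
```

If the value you send in SET_CONFIGURATION does not match a bConfigurationValue found this way, that alone could explain a STALL, so it is worth printing the parsed fields before setting the configuration.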
There is a tutorial on retrieving a video stream from a web camera here. Sections 6 and 7 cover Descriptors and setting a Device configuration. In that tutorial the Device address is changed before a new Device configuration is set, but IMO changing the address is not absolutely necessary when there is only one Device (no hub).