Need some professional advice on my project

Recognising a fruit in a video stream, deciding what type of fruit it is and how ripe it is, locating the fruit in three-dimensional space, and moving a mechanical arm to interact with it are all complex problems. Solving them is likely to involve decades of person-hours. Of course, you don't have to implement everything yourself from scratch, but even working out which approach to take and which third-party tools are available to help will be a huge undertaking.

If you're at the stage of thinking "Wouldn't it be great if ...", then I suggest you start with something a lot simpler.