A method for fine-grained object recognition in a robotic system is disclosed that includes obtaining an image of an object from an imaging device. Based on the image, a deep category-level detection neural network is used to detect pre-defined categories of objects. A feature map is generated for each pre-defined category of object detected by the deep category-level detection neural network. Embedded features are generated, based on the feature map, using a deep instance-level detection neural network corresponding to the pre-defined category of the object, wherein each pre-defined category of an object comprises a corresponding different instance-level detection neural network. An instance-level of the object is determined based on classification of the embedded features.