(EN) A data processing apparatus comprises decoder circuitry responsive to an instruction (320 in Figure 9B) specifying a first source register and a second source register. In response to the instruction, processing circuitry performs a dot product operation: at least a first data element and a second data element are extracted from each of the first source register and the second source register, at least the first data element pair and the second data element pair are each multiplied together (340-346), and the products are summed (348). The dot product operation is performed independently in each of multiple intra-register lanes across the first source register and the second source register, each of which is treated as a vector. A widening operation with a high density of operations per instruction may thus be provided. Fused multiply-add (FMA) units (514-520 in Figure 15A) and a dot product and accumulate operation (Figure 14) may also be implemented.
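
The per-lane behaviour described above can be sketched in software as follows. This is an illustrative model only, not the claimed circuitry: the 32-bit lane width, the 8-bit unsigned element width, the wrap-around of the result to the lane width, and all function names (split_lane, lane_dot_product, vector_dot_product, pack) are assumptions chosen for the example and are not stated in the abstract.

```python
# Illustrative software model of a lane-wise dot product.
# ASSUMPTIONS (not from the abstract): 32-bit lanes, 8-bit unsigned elements,
# result wrapped to the lane width. All names here are hypothetical.
LANE_BITS = 32
ELEM_BITS = 8
ELEMS_PER_LANE = LANE_BITS // ELEM_BITS
ELEM_MASK = (1 << ELEM_BITS) - 1
LANE_MASK = (1 << LANE_BITS) - 1


def split_lane(lane):
    """Extract the data elements packed in one lane, lowest element first."""
    return [(lane >> (i * ELEM_BITS)) & ELEM_MASK for i in range(ELEMS_PER_LANE)]


def lane_dot_product(lane_a, lane_b, acc=0):
    """Multiply corresponding element pairs, sum the products, and add an
    optional accumulator value (the dot product and accumulate variant)."""
    products = [x * y for x, y in zip(split_lane(lane_a), split_lane(lane_b))]
    return (acc + sum(products)) & LANE_MASK


def vector_dot_product(src1, src2, acc=None):
    """Apply the dot product independently to each lane of two source
    registers, each modelled as a list of lane values."""
    if acc is None:
        acc = [0] * len(src1)
    return [lane_dot_product(a, b, c) for a, b, c in zip(src1, src2, acc)]


def pack(elems):
    """Helper for the example: pack small integers into one lane value."""
    lane = 0
    for i, e in enumerate(elems):
        lane |= (e & ELEM_MASK) << (i * ELEM_BITS)
    return lane


# Two lanes per register, four elements per lane.
src1 = [pack([1, 2, 3, 4]), pack([5, 6, 7, 8])]
src2 = [pack([1, 1, 1, 1]), pack([2, 2, 2, 2])]
result = vector_dot_product(src1, src2)                 # [10, 52]
accumulated = vector_dot_product(src1, src2, [100, 200])  # [110, 252]
```

In this model each lane produces a result that is wider than any single input element, which is one way to read the widening property: four 8-bit elements of value 255 multiplied by four of value 255 give 260100, which does not fit in 8 bits but does fit in the 32-bit lane.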