The serial version of this algorithm would have an outer loop over i and an inner loop over j. We parallelize the i loop with thrust::transform(): each i is "transformed" into the result of the j loop for that i. We then sum these per-i results by calling thrust::reduce(). The i's could be randomized for possibly better load balance, as in rthconv(). This could also be sped up by passing a transform iterator to the reduce operation, which fuses the two steps and avoids storing the intermediate per-i results.
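
The transform-then-reduce pattern described above can be sketched as follows. This is only an illustration, not the actual code: it uses the C++ standard library on the host (std::transform and std::accumulate standing in for thrust::transform() and thrust::reduce()) so that it runs without CUDA. The inner-loop body, the functor name InnerLoop, and the function names twoPass and fused are all hypothetical; the real j loop is not shown in these notes, so a simple pair count is assumed in its place.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the j loop: for a given i, count the
// indices j > i with x[j] > x[i]. The real inner loop is an assumption.
struct InnerLoop {
    const std::vector<double>& x;
    long operator()(std::size_t i) const {
        long count = 0;
        for (std::size_t j = i + 1; j < x.size(); ++j)
            if (x[j] > x[i]) ++count;
        return count;
    }
};

// Two-step version: "transform" each i into its inner-loop result,
// store the results, then sum them in a separate reduction.
long twoPass(const std::vector<double>& x) {
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});  // 0, 1, ..., n-1
    std::vector<long> perI(x.size());
    assert(perI.size() == idx.size());  // output range must match input
    std::transform(idx.begin(), idx.end(), perI.begin(), InnerLoop{x});
    return std::accumulate(perI.begin(), perI.end(), 0L);
}

// Fused version: apply the inner loop inside the reduction itself, so no
// intermediate vector of per-i results is stored. In Thrust, wrapping a
// counting iterator with thrust::make_transform_iterator and passing it
// to thrust::reduce would play the corresponding role.
long fused(const std::vector<double>& x) {
    std::vector<std::size_t> idx(x.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    InnerLoop f{x};
    return std::accumulate(idx.begin(), idx.end(), 0L,
                           [&f](long acc, std::size_t i) { return acc + f(i); });
}
```

Both functions take the data vector and return the same total; for instance, twoPass(x) and fused(x) agree for any x. Note that the work per i shrinks as i grows in this assumed inner loop, which is the kind of imbalance that randomizing the order of the i's might help with.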