granicus.if.org Git - libvpx/commit

author	A.Mahfoodh <ab.mahfoodh@gmail.com>
	Wed, 2 Oct 2013 22:41:54 +0000 (18:41 -0400)
committer	A.Mahfoodh <ab.mahfoodh@gmail.com>
	Thu, 3 Oct 2013 00:02:03 +0000 (20:02 -0400)
commit	5215b83aead928f66d9133e846ede6fd1b52aa89
tree	9b3cca405f9c4652961b3813f9db0e708e7904b1
parent	d958c0486a755101146b81b58a01a37798afccee

Simplifying and inlining k_cvtlo_epi16 and k_cvthi_epi16

Simplify k_cvtlo_epi16 and k_cvthi_epi16 to only two
instructions, then inline them.

Quoting from Intel's MMX_App_Compute_16bit_Vector.pdf:
"The PMADDWD instruction multiplies four
pairs of 16-bit numbers and produces partial sums of the results
and can do so once per clock (with a three-clock latency)."
So I am assuming there will be a three-clock overhead after the
last _mm_madd_pi16 instruction.
Even with that overhead, the total number of clocks should in general
be smaller. I am not sure, though, because I could not find information
about the number of clocks required for the instructions in
k_cvtlo_epi16 and k_cvthi_epi16. I will run a test and compare the
execution times.
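
For illustration only, here is a minimal sketch of how a two-instruction
16-bit to 32-bit sign extension built on a multiply-add could look. It is
not taken from this commit: the helper names widen_lo_epi16,
widen_hi_epi16 and check_widening are hypothetical, and the sketch
assumes the SSE2 intrinsic _mm_madd_epi16 (the 128-bit counterpart of
_mm_madd_pi16), since the helpers being replaced operate on epi16 lanes.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper (not the commit's code): widen the low four signed
 * 16-bit lanes of `a` to signed 32-bit lanes with unpack + multiply-add. */
static __m128i widen_lo_epi16(__m128i a) {
  const __m128i zero = _mm_setzero_si128();
  const __m128i one = _mm_set1_epi16(1);
  /* 16-bit lanes become a0, 0, a1, 0, a2, 0, a3, 0. */
  const __m128i interleaved = _mm_unpacklo_epi16(a, zero);
  /* Each 32-bit lane is a_i * 1 + 0 * 1, using signed 16-bit multiplies,
   * so the sign of a_i carries into the 32-bit result. */
  return _mm_madd_epi16(interleaved, one);
}

/* Same idea for the high four lanes (a4..a7). */
static __m128i widen_hi_epi16(__m128i a) {
  const __m128i zero = _mm_setzero_si128();
  const __m128i one = _mm_set1_epi16(1);
  const __m128i interleaved = _mm_unpackhi_epi16(a, zero);
  return _mm_madd_epi16(interleaved, one);
}

/* Small self-check: every widened lane must equal its 16-bit source. */
static void check_widening(void) {
  const int16_t in[8] = { -32768, -1, 0, 1, 32767, -12345, 12345, -2 };
  int32_t lo[4];
  int32_t hi[4];
  const __m128i a = _mm_loadu_si128((const __m128i *)in);
  int i;
  _mm_storeu_si128((__m128i *)lo, widen_lo_epi16(a));
  _mm_storeu_si128((__m128i *)hi, widen_hi_epi16(a));
  for (i = 0; i < 4; ++i) {
    assert(lo[i] == in[i]);
    assert(hi[i] == in[i + 4]);
  }
}
```

Calling check_widening() from a test exercises both halves, including
the extreme values -32768 and 32767. Whether this form is actually
faster than the original helpers depends on the target CPU and still
has to be measured.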

Change-Id: Ieda4aa338f69ad3dd196ac6e7892da3cf1b47ea7