<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Gavin Smith &#187; Uncategorized</title>
	<atom:link href="http://www.gavin-smith.me/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.gavin-smith.me</link>
	<description>Software according to Gavin</description>
	<lastBuildDate>Mon, 30 Jan 2012 23:58:11 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=</generator>
		<item>
		<title>SSE intrinsics, moving between 32-bit and 16-bit integer types</title>
		<link>http://www.gavin-smith.me/sse-intrinsics-moving-between-32-bit-and-16-bit-integer-types-2222/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=sse-intrinsics-moving-between-32-bit-and-16-bit-integer-types</link>
		<comments>http://www.gavin-smith.me/sse-intrinsics-moving-between-32-bit-and-16-bit-integer-types-2222/#comments</comments>
		<pubDate>Mon, 30 Jan 2012 23:58:11 +0000</pubDate>
		<dc:creator>Gavin Smith</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.gavin-smith.me/?p=2222</guid>
		<description><![CDATA[Recently, I have been using SSE intrinsics to optimize some code which processes raw image/video data in YUYV 422 format. Now similar to other raw video formats, YUYV is packed in 2-byte (16-bit) chunks making it ideal for SSE processing &#8230; <a href="http://www.gavin-smith.me/sse-intrinsics-moving-between-32-bit-and-16-bit-integer-types-2222/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Recently, I have been using SSE intrinsics to optimize some code which processes raw image/video data in YUYV 422 format. Like other raw video formats, YUYV is packed in 2-byte (16-bit) chunks, making it ideal for SSE processing.</p>
<p>If you intend to couple single-precision floating-point vector arithmetic with 16-bit integer types, you soon need to convert between 16-bit and 32-bit integer types, because the SSE integer-to-float conversions operate on 32-bit integers.</p>
<p>Amazingly, there are not too many SSE intrinsics tutorials or sources available which indicate how to convert between 16-bit and 32-bit types. One source of inspiration has been Apple&#8217;s &#8220;<a title="SSE Performance Programming" href="http://developer.apple.com/hardwaredrivers/ve/sse.html">SSE Performance Programming</a>&#8221; guide for hardware. Unfortunately, the introduction of AltiVec in the tutorial makes some parts slightly more difficult to understand.</p>
<p>Firstly, we&#8217;ll look at how to convert a vector of 16-bit values to 2 vectors of 32-bit values.</p>
<pre class="brush: cpp; gutter: true">void SSEUnsigned16to32
(
	const __m128i	*in_pvn,	// 1 set of 8 x 16-bit
	__m128i out_vn[]			// 2 sets of 4 x 32-bit
)
{
	__m128i zero = _mm_setzero_si128();

	out_vn[0] = _mm_unpacklo_epi16(*in_pvn, zero);
	out_vn[1] = _mm_unpackhi_epi16(*in_pvn, zero);

}

void SSESigned16to32
(
	const __m128i	*in_pvn,	// 1 set of 8 x 16-bit
	__m128i out_vn[]			// 2 sets of 4 x 32-bit
)
{
	out_vn[0] = _mm_unpacklo_epi16(*in_pvn, *in_pvn);
	out_vn[1] = _mm_unpackhi_epi16(*in_pvn, *in_pvn);

	out_vn[0] = _mm_srai_epi32(out_vn[0], 16 );
	out_vn[1] = _mm_srai_epi32(out_vn[1], 16 );
}</pre>
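<p>As a quick sanity check of the two widening tricks above, here is a minimal, self-contained sketch that works on plain arrays instead of <em>__m128i</em> pointers. The helper names <em>widenUnsigned16to32</em> and <em>widenSigned16to32</em> are hypothetical, and only SSE2 is assumed.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: zero-extends 8 unsigned 16-bit values to 32 bits by
// interleaving each value with a zero word (the zero becomes the upper half).
inline void widenUnsigned16to32(const std::uint16_t *in, std::uint32_t *out)
{
    assert(in != nullptr && out != nullptr);
    const __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    const __m128i zero = _mm_setzero_si128();
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out),     _mm_unpacklo_epi16(v, zero));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), _mm_unpackhi_epi16(v, zero));
}

// Hypothetical helper: sign-extends 8 signed 16-bit values to 32 bits.
// Interleaving a value with itself places a copy in the upper 16 bits of each
// 32-bit lane; the arithmetic shift right by 16 then replicates the sign bit.
inline void widenSigned16to32(const std::int16_t *in, std::int32_t *out)
{
    assert(in != nullptr && out != nullptr);
    const __m128i v  = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    const __m128i lo = _mm_srai_epi32(_mm_unpacklo_epi16(v, v), 16);
    const __m128i hi = _mm_srai_epi32(_mm_unpackhi_epi16(v, v), 16);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out),     lo);
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out + 4), hi);
}
```

<p>Feeding the bit pattern 0xffff through the unsigned helper should give 65535, while the same bit pattern through the signed helper should give -1.</p>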
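<p>Next, let&#8217;s look at how to convert 2 vectors of 32-bit values back to 1 vector of 16-bit values.</p>
<p>The instruction that does most of the magic here is</p>
<pre>_mm_packus_epi32</pre>
<p>This SSE4.1 instruction treats its input as signed 32-bit integers and saturates each one to the unsigned 16-bit range, which is all very dandy for signed integer types, but how does one handle unsigned types? Any unsigned value of 0x80000000 or larger looks negative to the instruction and is clamped to 0 instead of 65535. What we do is truncate instead of saturate, by bitwise AND&#8217;ing each value with 0x0000ffff so that only the low 16 bits reach the pack. If only SSE2 is available, the same truncation can be done by sign-extending the low 16 bits with a pair of shifts and then packing with <em>_mm_packs_epi32</em>.</p>
<pre class="brush: actionscript3; gutter: true">// Requires SSE4.1 (_mm_packus_epi32 is declared in smmintrin.h)
void SSEUnsigned32to16
(
	const __m128i	in_vn[],		// 2 sets of 4 x 32-bit
	__m128i *out_pvn				// 1 set of 8 x 16-bit
)
{
	__m128i vnMask = _mm_set1_epi32(0x0000ffff);

	// mask off high 16 bits so the pack cannot saturate
	__m128i lo = _mm_and_si128(in_vn[0], vnMask);
	__m128i hi = _mm_and_si128(in_vn[1], vnMask);

	*out_pvn = _mm_packus_epi32(lo, hi);
}

// SSE2-only alternative with the same truncating behaviour
void SSEUnsigned32to16Shift
(
	const __m128i	in_vn[],		// 2 sets of 4 x 32-bit
	__m128i *out_pvn				// 1 set of 8 x 16-bit
)
{
    // shift hi and lo left by 16 to chop off high 16 bits
    __m128i hi = _mm_slli_epi32(in_vn[1], 16 );
    __m128i lo = _mm_slli_epi32(in_vn[0], 16 );

    // shift hi and lo back right again (algebraic)
    hi = _mm_srai_epi32( hi, 16 );
    lo = _mm_srai_epi32( lo, 16 );

    // every lane now fits in a signed 16-bit value, so the signed pack
    // copies the low 16 bits through without saturating
    *out_pvn = _mm_packs_epi32(lo, hi);
}</pre>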
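<p>To tie the two directions together, the following is a rough sketch (not production code) of the round trip that motivated this post: widen 16-bit samples, do the arithmetic in single precision, then narrow again with a shift-and-pack truncation. The helper name <em>scaleU16</em> is hypothetical, only SSE2 is assumed, and the scaled results are assumed to stay within 0 to 65535, since the narrowing step truncates rather than saturates.</p>

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>
#include <cstdint>

// Hypothetical helper: multiplies 8 unsigned 16-bit samples by `gain`
// using single-precision arithmetic and writes 8 unsigned 16-bit results.
inline void scaleU16(const std::uint16_t *in, float gain, std::uint16_t *out)
{
    assert(in != nullptr && out != nullptr);

    const __m128i v    = _mm_loadu_si128(reinterpret_cast<const __m128i *>(in));
    const __m128i zero = _mm_setzero_si128();

    // 16-bit -> 32-bit (zero-extend)
    __m128i lo = _mm_unpacklo_epi16(v, zero);
    __m128i hi = _mm_unpackhi_epi16(v, zero);

    // 32-bit integer -> float, then scale
    const __m128 g   = _mm_set1_ps(gain);
    const __m128 flo = _mm_mul_ps(_mm_cvtepi32_ps(lo), g);
    const __m128 fhi = _mm_mul_ps(_mm_cvtepi32_ps(hi), g);

    // float -> 32-bit integer (uses the current rounding mode)
    lo = _mm_cvtps_epi32(flo);
    hi = _mm_cvtps_epi32(fhi);

    // 32-bit -> 16-bit: sign-extend the low 16 bits so that the signed pack
    // passes them through unchanged (truncation, not saturation)
    lo = _mm_srai_epi32(_mm_slli_epi32(lo, 16), 16);
    hi = _mm_srai_epi32(_mm_slli_epi32(hi, 16), 16);

    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), _mm_packs_epi32(lo, hi));
}
```

<p>If the results can leave the 16-bit range, a saturating narrow would be the safer choice; this sketch deliberately keeps to the truncating approach discussed above.</p>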
]]></content:encoded>
			<wfw:commentRss>http://www.gavin-smith.me/sse-intrinsics-moving-between-32-bit-and-16-bit-integer-types-2222/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>QMutex, what Nokia never told you</title>
		<link>http://www.gavin-smith.me/qmutex-what-nokia-never-told-you-2179/?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=qmutex-what-nokia-never-told-you</link>
		<comments>http://www.gavin-smith.me/qmutex-what-nokia-never-told-you-2179/#comments</comments>
		<pubDate>Sat, 19 Nov 2011 23:23:46 +0000</pubDate>
		<dc:creator>Gavin Smith</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.gavin-smith.me/?p=2179</guid>
		<description><![CDATA[Who is a big fan of easy-to-use, third-party cross-platform code? Yeah, hands down. I love it just as much as the rest of you. Over the years, I have come to realize that one disadvantage to using third party libraries &#8230; <a href="http://www.gavin-smith.me/qmutex-what-nokia-never-told-you-2179/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>Who is a big fan of easy-to-use, third-party cross-platform code? Yeah, hands down. I love it just as much as the rest of you.</p>
<p>Over the years, I have come to realize that one disadvantage to using third-party libraries is that application developers often neglect to consider what is actually happening behind the scenes. Of course, this becomes even more of an issue with closed-source libraries.</p>
<p>In one of our multithreaded applications, we&#8217;ve noticed that the application is spending an awful amount of unnecessary time in kernel mode. We have long suspected the culprit, but it was not until recently (when performance became an issue) that we conducted a more thorough investigation. Using Intel&#8217;s VTune, we were able to pin the culprit down to Qt&#8217;s <a title="QMutex" href="http://doc.qt.nokia.com/latest/qmutex.html"><em>QMutex</em></a>.</p>
<p><img class="aligncenter size-full wp-image-2213" style="float: right;" title="QMutex" src="http://www.gavin-smith.me/wp-content/uploads/2011/11/QMutex1.png" alt="" width="234" height="283" /></p>
<p>Let&#8217;s talk a little bit about contention. If you have ever taken a university-level course in operating systems, you will know that <a title="Context Switch" href="http://en.wikipedia.org/wiki/Context_switch">context switches</a> are expensive. Wikipedia offers a great analogy:</p>
<blockquote><p>&#8220;To give an analogy, multiple threads in a process are like multiple cooks reading off the same cook book and following its instructions, not necessarily from the same page.&#8221;</p></blockquote>
<p>Any time the page is busy being turned in the cook book, time is being wasted by all of the cooks. The goal in multithreaded applications is to turn the pages as little as possible AND only as and when it is needed. When the pages need to be turned, it must be quick so that the cooks can resume their work.</p>
<p>Although there are other operations which require an application to enter kernel mode, a high ratio of kernel time to user time indicated that a combination of our threads and points of mutual exclusion was causing issues.</p>
<p>Diving into the depths of Qt (4.6.3), I found that QMutex was using the following code on a Windows machine:</p>
<pre class="brush: cpp; gutter: false">handle = ::CreateMutex( NULL, FALSE, NULL );</pre>
<p>Now this is a rather heavy-handed approach to mutual exclusion. In short, this code makes use of Win32 mutexes. What exactly does <a title="MSDN" href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms686927%28v=vs.85%29.aspx">MSDN</a> have to say?</p>
<blockquote><p>&#8220;You can use a <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms684266%28v=vs.85%29.aspx">mutex object</a> to protect a shared resource from simultaneous access by multiple threads or processes. Each thread must wait for ownership of the mutex before it can execute the code that accesses the shared resource.&#8221;</p></blockquote>
<p>The key phrase here is, &#8220;&#8230;use a mutex object to protect a shared resource from simultaneous access by multiple &#8230; processes&#8221;. If you want to provide mutual exclusion across threads and threads only, a Win32 mutex object is far too heavy.</p>
<p>Digging through MSDN, I came across something that Windows calls a <a href="http://msdn.microsoft.com/en-us/library/windows/desktop/ms686908%28v=VS.85%29.aspx">critical section</a>. It is similar to a mutex, except that a critical section can be used only by the threads of a single process.</p>
<blockquote><p>&#8220;Event, mutex, and semaphore objects can also be used in a single-process application, but critical section objects provide a slightly faster, more efficient mechanism for mutual-exclusion synchronization (a processor-specific test and set instruction).&#8221;</p></blockquote>
<p>Taking advantage of <a title="RAII" href="http://en.wikipedia.org/wiki/Resource_Acquisition_Is_Initialization">RAII</a>, it&#8217;s simple to construct a class to instantiate critical section objects.</p>
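<pre class="brush: cpp; gutter: false">// Thin wrapper around a Win32 CRITICAL_SECTION (needs windows.h and QtGlobal)
class CCriticalSection
{
public:
    // On multiprocessor systems, a contended lock() spins up to nSpinCount
    // times in user mode before falling back to a kernel wait
    explicit CCriticalSection(quint32 nSpinCount=4000)
    {
        ::InitializeCriticalSectionAndSpinCount
        (
            &amp;m_critSection,
            static_cast&lt;DWORD&gt;(nSpinCount)
        );
    }

    ~CCriticalSection()
    {
        ::DeleteCriticalSection(&amp;m_critSection);
    }

    void lock()    {::EnterCriticalSection(&amp;m_critSection);}
    // TryEnterCriticalSection returns nonzero on success, so test against FALSE
    bool tryLock() {return (FALSE!=::TryEnterCriticalSection(&amp;m_critSection));}
    void unlock()  {::LeaveCriticalSection(&amp;m_critSection);}

private:
    // an initialized CRITICAL_SECTION must not be copied
    Q_DISABLE_COPY(CCriticalSection)

    ::CRITICAL_SECTION m_critSection;
};</pre>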
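<p>The class above ties the lifetime of the critical section object to a C++ object, but callers still have to pair every <em>lock()</em> with an <em>unlock()</em>. A small scoped guard, in the spirit of Qt&#8217;s QMutexLocker, takes care of that as well. The sketch below is only an illustration: <em>CScopedLock</em> and the stand-in <em>CCountingLock</em> are hypothetical names, and the guard assumes nothing more than a type with <em>lock()</em> and <em>unlock()</em> members.</p>

```cpp
#include <cassert>

// Hypothetical guard: locks in the constructor and unlocks in the destructor,
// so the lock is released on every path out of the enclosing scope.
template <typename Lockable>
class CScopedLock
{
public:
    explicit CScopedLock(Lockable &lockable) : m_lockable(lockable)
    {
        m_lockable.lock();
    }

    ~CScopedLock()
    {
        m_lockable.unlock();
    }

private:
    // non-copyable: exactly one unlock() per lock()
    CScopedLock(const CScopedLock &);
    CScopedLock &operator=(const CScopedLock &);

    Lockable &m_lockable;
};

// Hypothetical stand-in for a real lock, used only to observe the guard.
// It is not thread safe and does no actual locking.
struct CCountingLock
{
    CCountingLock() : depth(0), acquisitions(0) {}

    void lock()   { ++depth; ++acquisitions; }
    void unlock() { assert(depth > 0); --depth; }

    int depth;         // currently held count
    int acquisitions;  // total number of lock() calls
};
```

<p>With the critical section class from this post, usage would look like <em>CScopedLock&lt;CCriticalSection&gt; guard(section);</em> at the top of the protected scope.</p>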
<p>The two (2) big advantages to critical sections are clear. They&#8217;re quick and don&#8217;t require kernel mode unless there is contention. According to <a title="Critical section" href="http://msdn.microsoft.com/en-us/magazine/cc164040.aspx">Microsoft</a>,</p>
<blockquote><p>&#8220;Unlike events, mutexes, and semaphores, which are also used for multithreaded synchronization, critical sections don&#8217;t always perform an expensive control transfer to kernel mode. As you&#8217;ll see later, acquiring an unheld critical section requires, in effect, just a few memory modifications and is very quick. Only if you try to acquire an already-held critical section does it jump into kernel mode.&#8221;</p></blockquote>
<p>I wrote a simple unit test using the common producer/consumer scenario to pit a Win32 critical section against a Qt QMutex. One thread (the producer) adds 100000 integer items to a shared vector. Meanwhile, another thread (the consumer) runs in parallel and attempts to remove the same 100000 items from the shared vector.</p>
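<p>The original test code is not reproduced here, but the scenario can be sketched roughly as follows. This is a portable approximation under stated assumptions: it uses C++11 <em>std::thread</em> rather than Qt or Win32 threads, the function name <em>runProducerConsumer</em> is hypothetical, and the lock type is a template parameter that only needs <em>lock()</em> and <em>unlock()</em> members.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Runs one producer and one consumer against a shared vector guarded by
// `Mutex`. Returns the number of items the consumer removed.
template <typename Mutex>
std::size_t runProducerConsumer(std::size_t itemCount)
{
    Mutex mutex;
    std::vector<int> shared;
    std::size_t consumed = 0;  // touched only by the consumer until join()

    std::thread producer([&] {
        for (std::size_t i = 0; i < itemCount; ++i) {
            std::lock_guard<Mutex> guard(mutex);
            shared.push_back(static_cast<int>(i));
        }
    });

    std::thread consumer([&] {
        while (consumed < itemCount) {
            bool removed = false;
            {
                std::lock_guard<Mutex> guard(mutex);
                if (!shared.empty()) {
                    shared.pop_back();
                    removed = true;
                }
            }
            if (removed) {
                ++consumed;
            } else {
                std::this_thread::yield();  // nothing to take yet
            }
        }
    });

    producer.join();
    consumer.join();
    assert(shared.empty());
    return consumed;
}
```

<p>Timing a call such as <em>runProducerConsumer&lt;std::mutex&gt;(100000)</em> with a steady clock, and then repeating it with a different lock type, gives a like-for-like comparison of the two locks.</p>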
<p>The results are in, and there is a clear winner. Firstly, I&#8217;ll start with how the QMutex performed:</p>
<pre class="brush: text; gutter: false">1 Writer, 1 Readers - 393us</pre>
<p>Here&#8217;s how the win32 critical section performed in comparison:</p>
<pre class="brush: text; gutter: false">1 Writer, 1 Readers - 21us</pre>
<p>The difference in results between the Qt and the Win32 implementations was fairly consistent regardless of the number of writer (producer) and reader (consumer) threads.</p>
<p>[<strong>EDIT</strong>:] The new, revised implementation of QMutex in 4.7.4 does not perform any better.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.gavin-smith.me/qmutex-what-nokia-never-told-you-2179/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

