Boost C++ Libraries: Ticket #11895: Strand service scheduling is hurting ASIO scalability
https://svn.boost.org/trac10/ticket/11895

Reported by Chris White <chriswhitemsu@…>, Fri, 08 Jan 2016 04:39:30 GMT

This problem is best explained by walking through a scenario:

I have an io_service being run with one thread per CPU core. The CPU has 4 cores. I also have a strand. I then post 10 one-second operations to the strand and 30 one-second operations to the io_service. That's 40 cumulative seconds of work, and since there are 4 cores, that work should optimally take 10 seconds when performed in parallel.

However, that's not what happens. The work in the io_service is given precedence over the work in the strand. First, the 30 cumulative seconds of io_service work are performed in parallel, which takes roughly 7.5 seconds (30 seconds / 4 cores). Only after that work is complete are the 10 seconds of strand work performed, serially. All the work therefore ends up taking a total of 17.5 seconds, because CPU time wasn't distributed optimally.

If we think of the strand and the io_service as queues, what we have here is a queue within a queue: the outer queue's work can be performed in parallel, while the inner queue's work is performed serially. It makes a great deal of sense that when control reaches the inner queue, that queue should be serviced until it's empty; other threads can still service the outer queue while that's happening. The opposite is not true. By giving the outer queue priority, ASIO is crippling the concurrency potential of the work in the inner queue, i.e. the strand.

This simplified scenario is a very real performance problem for my company's server application. As far as I can tell, my only options are to implement my own strand that doesn't give control back to the io_service until its work queue is empty, or to put all of my strands onto one io_service and post all non-strand work to a second, separate io_service.

I attached a sample application that demonstrates the scenario above, complete with timings and a visual printout of the order of operations. It also includes a stripped-down strand implementation that demonstrates how the problem could be addressed. The application was built in Visual Studio 2013 with Boost 1.60.0.

Attachment set (Fri, 08 Jan 2016 04:39:30 GMT): Source.cpp

Comment 1 by Chris White <chriswhitemsu@…>, Sat, 09 Jan 2016 18:47:12 GMT:

To shed a little more light: if I post 10 operations to a strand in quick succession, the first operation goes into ready_queue_, the strand gets scheduled, and then my other 9 operations go into waiting_queue_. When strand_service::do_complete() is called, it services ready_queue_, then appends waiting_queue_ and reschedules the strand for the remaining work. So whenever multiple operations are posted to a strand, the strand has to be scheduled at least twice to perform all of the work.
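To make the cost concrete, here is a minimal sketch of the scenario from the ticket description (illustrative only, not the attached Source.cpp; the thread count, tags, and timing output are assumptions):

{{{
#include <boost/asio.hpp>
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service io;
    boost::asio::io_service::strand strand(io);
    auto const start = std::chrono::steady_clock::now();

    // One second of simulated work; prints which queue the handler
    // came from and roughly when it finished.
    auto job = [start](char tag)
    {
        std::this_thread::sleep_for(std::chrono::seconds(1));
        auto s = std::chrono::duration_cast<std::chrono::seconds>(
            std::chrono::steady_clock::now() - start).count();
        std::printf("%c finished at ~%llds\n", tag, (long long)s);
    };

    for (int i = 0; i < 10; ++i)
        strand.post([job]{ job('S'); }); // serialized strand work
    for (int i = 0; i < 30; ++i)
        io.post([job]{ job('P'); });     // parallel io_service work

    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)          // one thread per core
        threads.emplace_back([&io]{ io.run(); });
    for (auto& t : threads)
        t.join();
    // Optimal finish: ~10s. Observed: the 'P' jobs finish first
    // (~7.5s) and the 'S' jobs trail out serially afterwards,
    // for a total of ~17.5s.
}
}}}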
Instead, waiting_queue_ should be drained at least once when do_complete() is called, so that all the operations posted between the time the strand was scheduled and the time it was serviced get performed. Here is one possible solution; I inlined some code and added comments for clarity:

{{{
void strand_service::do_complete(io_service_impl* owner, operation* base,
    const boost::system::error_code& ec, std::size_t /*bytes_transferred*/)
{
  if (owner)
  {
    strand_impl* impl = static_cast<strand_impl*>(base);
    call_stack<strand_impl>::context ctx(impl);

    // ONLY THE FIRST POSTED OPERATION IS IN THE READY QUEUE.
    // SERVICE IT NOW WITHOUT HAVING TO LOCK THE MUTEX.
    while (operation* o = impl->ready_queue_.front())
    {
      impl->ready_queue_.pop();
      o->complete(*owner, ec, 0);
    }

    // SUBSEQUENT OPERATIONS ARE IN THE WAITING QUEUE.
    // SERVICE THEM NOW INSTEAD OF RESCHEDULING!
    impl->mutex_.lock();
    impl->ready_queue_.push(impl->waiting_queue_);
    bool more_handlers = impl->locked_ = !impl->ready_queue_.empty();
    impl->mutex_.unlock();
    if (!more_handlers)
      return;

    while (operation* o = impl->ready_queue_.front())
    {
      impl->ready_queue_.pop();
      o->complete(*owner, ec, 0);
    }

    // ONLY NOW DO WE RESCHEDULE, IF MORE WORK HAS SINCE BEEN POSTED.
    impl->mutex_.lock();
    impl->ready_queue_.push(impl->waiting_queue_);
    more_handlers = impl->locked_ = !impl->ready_queue_.empty();
    impl->mutex_.unlock();
    if (more_handlers)
      owner->post_immediate_completion(impl, true);
  }
}
}}}

I was originally thinking that this could loop as long as there is work in either of the queues, but that might let a single strand monopolize a thread, so I backed off from that solution. The implementation above is enough to address the problem in my attached sample application.
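For anyone needing a stopgap before a fix lands, the second workaround mentioned in the ticket description (strands on one io_service, non-strand work on another) can look roughly like the sketch below; the thread split and handler bodies are placeholders, not code from the attachment:

{{{
#include <boost/asio.hpp>
#include <thread>
#include <vector>

int main()
{
    boost::asio::io_service pool_io;    // non-strand, parallel handlers
    boost::asio::io_service strand_io;  // strand handlers only
    boost::asio::io_service::strand strand(strand_io);

    // Queue the work up front so neither run() returns before
    // executing it (a real server would hold io_service::work).
    for (int i = 0; i < 10; ++i)
        strand.post([]{ /* serialized work, no longer starved */ });
    for (int i = 0; i < 30; ++i)
        pool_io.post([]{ /* parallel work */ });

    // A strand runs serially anyway, so one dedicated thread for
    // strand_io suffices; the remaining threads drive pool_io.
    std::vector<std::thread> threads;
    for (int i = 0; i < 3; ++i)
        threads.emplace_back([&pool_io]{ pool_io.run(); });
    threads.emplace_back([&strand_io]{ strand_io.run(); });
    for (auto& t : threads)
        t.join();
}
}}}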