Ring-buffer for high performance, reduced contention, parallel processing - LMAX Disruptor

LMAX
Disruptor on GitHub
Martin Fowler's write up

Example code for v3.0

Queues are wrong for inter-thread communication
SEDA, Actor mechanisms bottleneck on contention

Mechanical Sympathy - know how to drive the systems to get the most out of them
DRAM isn't getting faster, but it is getting cheaper
Bandwidth to memory is increasing
The GHz race is over; CPUs aren't getting faster
Instead: bigger caches, more cores

Networks getting faster
Standard 10 gig-E with RDMA can bypass the kernel and transfer between userspace memory on two machines in under 10 µs
Java 7 SDP
j-zerocopy
Use RDMA to replicate to another node for HA/DR

Mechanical disks have great sequential access/streaming performance
SSDs are not much better for sequential, single-threaded access
but they are great for multi-threaded random access
The disk controller is the limit: the standard SSD interface isn't very fast; the new PCIe interface is much faster
Fusion-io cards are very fast

10 gig-E can move data from a process on one system to a process on another in tens of µs

move data between cores via the L3 cache for best performance

memory is read a cache line (64 bytes) at a time

keep data on one core only for write, broadcast for read

DRAM is slow: ~65 ns

The caches are all staged write-through; each level feeds the next
All SRAM, no refresh cycle
L3: ~15 ns, 8-24 MB - used for sharing data between cores
L2: ~3 ns, 256 KB
L1: ~1 ns, 64 KB (32 KB instruction + 32 KB data)
registers: <1 ns
links between cores: ~20 ns
Except for I/O, a cache miss and the resulting DRAM access is the slowest thing that can happen
Contiguous lists (arrays) are fast because the CPU detects the access stride, pipelines the loads, and prefetches into cache efficiently

data is fetched in 64-byte cache lines, so co-located data gets pulled in together
the fields of an object load in together
under concurrency, make sure independently hot data doesn't share a cache line (false sharing) - see the padding sketch below
make sure one core owns the data for writes
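
The classic trick for this (used in the spirit of the Disruptor's Sequence class, before Java 8 added @Contended) is to pad the hot field with unused longs so nothing else can land in its cache line. A minimal sketch - class and field names are illustrative:

    // Padding is split across a small class hierarchy so the JVM cannot
    // reorder the pad fields away from the hot value (superclass fields
    // are laid out first). 7 longs each side = 56 bytes, enough to keep
    // any neighbour off the 64-byte cache line.
    class LhsPadding { protected long p1, p2, p3, p4, p5, p6, p7; }
    class Value extends LhsPadding { protected volatile long value; }
    class RhsPadding extends Value { protected long q1, q2, q3, q4, q5, q6, q7; }

    public final class PaddedSequence extends RhsPadding {
        public long get()       { return value; }
        public void set(long v) { value = v; }
    }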

Memory and CPUs aren't getting faster
more cores and larger caches = take advantage of them

write clean, compact code
simple is high performance - HotSpot likes this
use single responsibility
let the relationships do the work (cardinality, methods, characteristics, usage patterns, etc.)

Einstein: "Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction."

a try/catch block caused an order-of-magnitude slowdown when the method got so complex that HotSpot gave up on optimizing it

modeling is so important and underutilized
model and write code to be efficient in memory, not on disk
use software simulation for the domain
model the disk controller
simple code is high performance - HotSpot can optimize it
single responsibility principle:
one class, one thing
one method, one thing
one statement, one thing
don't conflate concerns - complexity goes up combinatorially

two approaches to concurrency: locking, and atomic non-blocking CAS
atomic/CAS non-blocking is harder, but it happens in user space and is much faster
can update in a single machine instruction (lock-free)
avoids going to the kernel for expensive arbitration
with a lock you could lose your quantum and your cache, and even be moved to another core, when you are rescheduled
since the blocked thread can't progress, the OS might run something else, like defragging memory or other housekeeping
CAS happens in user space (see the sketch below)
more difficult to get right
very well optimized on the CPU
with locks under contention you have to go down through the layers to the kernel to arbitrate
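
A minimal sketch of the CAS approach using java.util.concurrent.atomic - the loser of a race simply retries in user space instead of parking in the kernel:

    import java.util.concurrent.atomic.AtomicLong;

    // Lock-free increment: the CAS executes as a single machine instruction
    // in user space; on contention the loser retries rather than blocking.
    public final class CasCounter {
        private final AtomicLong count = new AtomicLong();

        public long increment() {
            long current, next;
            do {
                current = count.get();
                next = current + 1;
            } while (!count.compareAndSet(current, next));   // retry on contention
            return next;
        }
    }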

taking out all concurrency:
a single thread does ~3 billion instructions per second
probably quite a bit more, considering the multiple execution units (ALU, shifter, loads and stores, etc.) all run in parallel
10K+ TPS if you don't do anything stupid
100K+ TPS with clean, well-organized code
1M+ TPS with custom collections, cache-friendly layouts, no garbage, etc.
they have gotten up to 6M+ TPS with further customization

must do test-first development
"optimize later" is a fallacy - optimize early
do "premature" optimization
write (and optimize against) the tests before writing the code
must separate load generation from measurement

queues are always a source of contention - there are always at least two threads on one
a queue is either full or empty, never anything in between, so producers and consumers end up contending on the same end

the ring buffer eliminates garbage - entries are pre-allocated, you just copy into the buffer
it optimizes processor caches since the data stays in cache
it is not serial like a pipeline:
everything occurs in parallel
you only have to fix the slowest thing to improve performance, unlike a pipeline

more load increases performance, up until you saturate the network or disk
because everything works in parallel with no contention
streaming consumers (journaling, replication, etc.) can grab multiple items, chuck them into a byte array, and do the I/O in batches - see the batching sketch below
if you saturate the ring buffer, you busy-spin or yield - you get the effect of a queue without its problems/contention
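
A hedged sketch of that batching effect - the names (cursor, slots, the flush methods) are illustrative, not the Disruptor's API. A journaling consumer drains everything published since its last pass and issues a single write for the whole batch:

    import java.util.concurrent.atomic.AtomicLong;

    public final class BatchingJournaler {
        private final byte[][] slots;        // ring storage (illustrative)
        private final AtomicLong cursor;     // highest sequence published so far
        private long nextToRead = 0;

        public BatchingJournaler(byte[][] slots, AtomicLong cursor) {
            this.slots = slots;
            this.cursor = cursor;
        }

        public void poll() {
            long available = cursor.get();
            if (available < nextToRead) {
                Thread.yield();              // nothing new yet: yield (or busy-spin)
                return;
            }
            // Grab every item published since the last pass...
            for (long seq = nextToRead; seq <= available; seq++) {
                append(slots[(int) (seq % slots.length)]);
            }
            flush();                         // ...and do the I/O once for the batch
            nextToRead = available + 1;
        }

        private void append(byte[] entry) { /* accumulate into a byte array */ }
        private void flush()              { /* single write/fsync for the whole batch */ }
    }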

use monotonically increasing IDs and modulo to assign slots - very fast, no contention
use a non-blocking CAS to increment the ID (sketch below)
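
Tying those two points together, a minimal sketch - not the Disruptor's API - of pre-allocated entries, a monotonically increasing sequence claimed with an atomic increment, and modulo to map sequence to slot. Gating producers against slow consumers is deliberately elided:

    import java.util.concurrent.atomic.AtomicLong;

    public final class SimpleRing<T> {
        public interface EventFactory<T> { T newInstance(); }

        private final Object[] entries;                    // no garbage per event
        private final AtomicLong nextSequence = new AtomicLong(0);

        public SimpleRing(int size, EventFactory<T> factory) {
            entries = new Object[size];
            for (int i = 0; i < size; i++) {
                entries[i] = factory.newInstance();        // allocate once, up front
            }
        }

        // Claim a unique slot ID without blocking; contention costs a retry,
        // not a trip into the kernel.
        public long claim() {
            return nextSequence.getAndIncrement();
        }

        @SuppressWarnings("unchecked")
        public T get(long sequence) {
            // Modulo assigns the slot; a power-of-two size would let this be
            // a bit-mask (sequence & (size - 1)), as the real Disruptor does.
            return (T) entries[(int) (sequence % entries.length)];
        }
    }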

journal logs allow simple reproduction of bugs - just replay the log up to the bug
also useful for producing load

they actually use the same system to hoover up the TCP packets at the front door and reconstitute the TCP streams

knowing your hardware is extremely important to get the most out of it
developers are too far away from / abstracted from the hardware
a cache miss is the biggest cost

restart to avoid GC
they can run about 5 days before needing a GC
snapshot, then replay the journal to reload (see the recovery sketch below)
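
A hedged sketch of that recovery path - the interfaces are illustrative, not LMAX's. Load the latest snapshot, then re-apply every journaled event after it; the same replay loop reproduces bugs by replaying up to the offending event:

    import java.util.List;

    public final class Recovery {
        public interface Event { long sequence(); }
        public interface BusinessLogic {
            void restoreFrom(byte[] snapshot);
            void onEvent(Event e);           // must be deterministic for replay to work
        }

        public static void recover(BusinessLogic logic, byte[] snapshot,
                                   long snapshotSeq, List<Event> journal) {
            logic.restoreFrom(snapshot);     // load the latest snapshot
            for (Event e : journal) {
                if (e.sequence() > snapshotSeq) {
                    logic.onEvent(e);        // re-apply everything after the snapshot
                }
            }
        }
    }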

Don't use reflection, generate invoker code for incoming messages
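
A hedged sketch of what "generate invoker code" can look like - the message type codes and handler interface are made up for illustration. Instead of Method.invoke per message, a switch over type codes calls the handler directly, which the JIT can inline:

    import java.nio.ByteBuffer;

    public final class OrderMessageInvoker {
        private static final int NEW_ORDER = 1;
        private static final int CANCEL    = 2;

        public interface OrderHandler {
            void onNewOrder(ByteBuffer payload);
            void onCancel(ByteBuffer payload);
        }

        private final OrderHandler handler;

        public OrderMessageInvoker(OrderHandler handler) {
            this.handler = handler;
        }

        public void dispatch(int typeCode, ByteBuffer payload) {
            switch (typeCode) {
                case NEW_ORDER:
                    handler.onNewOrder(payload);   // direct call, no reflection
                    break;
                case CANCEL:
                    handler.onCancel(payload);
                    break;
                default:
                    throw new IllegalArgumentException("unknown type " + typeCode);
            }
        }
    }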

To scale (see the striping sketch after this list):
stripe the replicator mod n
stripe the journaler across disks mod n
stripe the un-marshaller mod n
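
A minimal sketch of mod-n striping - each of the n parallel workers sees every sequence but acts only on its own stripe, so the stripes never contend:

    // Worker k of n handles only sequences where seq % n == k. Each
    // journaler/replicator/un-marshaller instance filters this way, so
    // the n instances run in parallel on disjoint, contention-free stripes.
    public final class StripedWorker {
        private final int index;         // this worker's stripe, 0..workerCount-1
        private final int workerCount;

        public StripedWorker(int index, int workerCount) {
            this.index = index;
            this.workerCount = workerCount;
        }

        public void onEvent(long sequence, byte[] entry) {
            if (sequence % workerCount != index) {
                return;                  // another stripe's worker handles it
            }
            // journal / replicate / un-marshal this entry...
        }
    }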
