Showing posts from February, 2014

Atlantic.net - Full-service hosting

I needed a special infrastructure configuration for testing a new architecture with high-performance computing requirements. This architecture isn't offered through Amazon's Elastic Compute Cloud, Google's Cloud, or Microsoft Azure. Atlantic.net's team took the time to understand our exact requirements and offered to build what we needed on their infrastructure at no additional cost. This team knows how to please, and they've already earned my business. I was planning to move my personal and business hosting to Linode; however, now I'll be moving to Atlantic.net. They offer more value for the money, with virtual servers starting at $3.65/month that include the following features:

- InfiniBand QDR 40Gb interconnect between servers
- SSDs
- 1Gb/s port
- 1000 GB / 1TB outbound transfer included
- Full root admin - Linux or Windows
- Dedicated IP
- Free nightly backup
- No commitments, no contract, no setup fee
- Redundant Tier-1 internet connections with automatic failover
- Redundant …

Performance Metrics

From the WhatsApp scalability talk.
Slides: http://www.erlang-factory.com/upload/presentations/558/efsf2012-whatsapp-scaling.pdf
Talk: http://vimeo.com/44312354

- pmcstat - processor hardware perf counters
- dtrace, kernel lock-counting, gprof
- fprof with & without cpu_timestamp
- BEAM lock-counting (invaluable) - contention was the most significant issue
- FreeBSD: backported the TSC-based kernel timecounter, making gettimeofday(2) calls much less expensive
- Backported the igb network driver, which had an issue with MSI-X queue stalls
- sysctl tuning: obvious limits (e.g. kern.ipc.maxsockets), net.inet.tcp.tcphashsize=524288
- BEAM is the Erlang VM - lots of other info on that
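As a rough sketch of where that tuning lives on FreeBSD: both of these are boot-time tunables, so they would typically go in /boot/loader.conf rather than being set at runtime. The maxsockets value below is an assumed example; only the tcphashsize value comes from the talk.

    # /boot/loader.conf -- boot-time kernel tunables (illustrative)
    kern.ipc.maxsockets="2400000"       # raise the socket ceiling well above the default (assumed value)
    net.inet.tcp.tcphashsize="524288"   # larger TCP connection hash table (value from the talk)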

AsyncIO

AsynchronousServerSocketChannel - channels of this type are safe for use by multiple concurrent threads, though at most one accept operation can be outstanding at any time. If a thread initiates an accept operation before a previous accept operation has completed, an AcceptPendingException will be thrown.

AsynchronousChannelGroup specifies the thread pool that manages the async operation callbacks; if no group is specified, a default group is used. You need to pass a custom ThreadFactory to the AsynchronousChannelGroup, or all the threads will have generic names and it won't be clear what they are for without inspecting the stack. Here are options for a custom ThreadFactory.

Server/Consumer outline:

    class Consumer implements Runnable .....
    class SocketAcceptHandler implements CompletionHandler .....
    AsynchronousServerSocketChannel socket =
        AsynchronousServerSocketChannel
            .open(AsynchronousChannelGroup.withThreadPool(Executors.newFixedThreadPool(1)));
    socket.setOption(Standard…
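A minimal, self-contained sketch of the same pattern (the class name, port, and thread names are my own choices, not from the original post):

    import java.net.InetSocketAddress;
    import java.nio.channels.*;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicInteger;

    public class AsyncAcceptServer {
        public static void main(String[] args) throws Exception {
            // Custom ThreadFactory so the callback threads get meaningful names.
            ThreadFactory named = new ThreadFactory() {
                private final AtomicInteger n = new AtomicInteger();
                public Thread newThread(Runnable r) {
                    return new Thread(r, "accept-pool-" + n.incrementAndGet());
                }
            };
            AsynchronousChannelGroup group = AsynchronousChannelGroup
                .withThreadPool(Executors.newFixedThreadPool(1, named));

            final AsynchronousServerSocketChannel server =
                AsynchronousServerSocketChannel.open(group)
                    .bind(new InetSocketAddress(9000));

            // Only one accept may be outstanding, so re-arm it from the handler.
            server.accept(null, new CompletionHandler<AsynchronousSocketChannel, Void>() {
                public void completed(AsynchronousSocketChannel ch, Void att) {
                    server.accept(null, this);   // accept the next connection
                    // ... hand ch off to a Consumer here ...
                }
                public void failed(Throwable exc, Void att) { exc.printStackTrace(); }
            });

            group.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        }
    }

Without the named factory, the callback threads show up with generic "pool-N-thread-N" style names in stack dumps.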

Unsafe - very fast serialization

Copy directly from an object's bytes into a DirectByteBuffer.
http://java.dzone.com/articles/fast-java-file-serialization
http://mishadoff.github.io/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
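A minimal sketch of the technique: read primitive fields straight out of an object with sun.misc.Unsafe and write them into a direct buffer, with no intermediate byte[] or ObjectOutputStream. The Point class and 16-byte layout are my own illustration; a real serializer would also handle alignment, endianness, and object graphs.

    import java.lang.reflect.Field;
    import java.nio.ByteBuffer;
    import sun.misc.Unsafe;

    public class UnsafeFieldCopy {
        private static final Unsafe UNSAFE;
        static {
            try {
                Field f = Unsafe.class.getDeclaredField("theUnsafe");
                f.setAccessible(true);
                UNSAFE = (Unsafe) f.get(null);
            } catch (Exception e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        static class Point { long x, y; }

        public static void main(String[] args) throws Exception {
            Point p = new Point();
            p.x = 42; p.y = 7;

            // Field offsets only need to be computed once and can be cached.
            long xOff = UNSAFE.objectFieldOffset(Point.class.getDeclaredField("x"));
            long yOff = UNSAFE.objectFieldOffset(Point.class.getDeclaredField("y"));

            // Copy the raw field values into a DirectByteBuffer.
            ByteBuffer buf = ByteBuffer.allocateDirect(16);
            buf.putLong(UNSAFE.getLong(p, xOff));
            buf.putLong(UNSAFE.getLong(p, yOff));
            buf.flip();
            System.out.println(buf.getLong() + ", " + buf.getLong());   // 42, 7
        }
    }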

PAXOS Consensus Protocol vs 2PC & 3PC

Paxos is used by Google's Chubby, Apache ZooKeeper, and FoundationDB.
Majority consensus: http://the-paper-trail.org/blog/consensus-protocols-paxos/

If the leader is relatively stable, phase 1 becomes unnecessary, so it is possible to skip phase 1 for future instances of the protocol with the same leader. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.

2PC: The greatest disadvantage of the two-phase commit protocol is that it is a blocking protocol. If the coordinator fails permanently, some cohorts will never resolve their transactions: after a cohort has sent an agreement message to the coordinator, it will block until a commit or rollback is received.

3PC: The main disadvantage of this algorithm is that it cannot recover if the network is segmented in any manner. The original 3PC algorithm assumes a fail-stop model, where processes fail by crashing and crashes can be accurately detected, and d…
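To make the 2PC blocking problem concrete, here is a toy sketch (class and method names are mine; this is not a real protocol implementation). The point is the cohort's unconditional wait after voting: if the coordinator crashes before broadcasting, decision.take() blocks forever.

    import java.util.List;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    enum Vote { COMMIT, ABORT }

    class Cohort {
        final BlockingQueue<Vote> decision = new LinkedBlockingQueue<>();

        Vote prepare() { return Vote.COMMIT; }          // phase 1: vote

        void awaitOutcome() throws InterruptedException {
            // After voting COMMIT, the cohort can only wait for the coordinator.
            Vote outcome = decision.take();
            System.out.println("cohort applies " + outcome);
        }
    }

    class Coordinator {
        void runTransaction(List<Cohort> cohorts) {
            boolean allYes = cohorts.stream()
                .allMatch(c -> c.prepare() == Vote.COMMIT);
            Vote outcome = allYes ? Vote.COMMIT : Vote.ABORT;   // phase 2: decide
            for (Cohort c : cohorts) c.decision.add(outcome);   // broadcast decision
        }
    }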

How much does a thread context switch cost

http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html

Threaded vs. Evented servers

http://mmcgrana.github.io/2010/07/threaded-vs-evented-servers.html

HPC InfiniBand 40/56Gb vs 10GbE

http://www.mellanox.com/related-docs/case_studies/CS_Atlantic.Net.pdf

Processors now have many hyper-threaded cores and lots of memory and cache. Standard high-performance disk technology has lazy-write caches and battery backup for reliability. Disks are striped/parallelized to alleviate them as performance bottlenecks. This means network I/O for Internet traffic, replication, caching, and disk access will typically be the most substantial bottleneck. 40/56Gb IB at ~$5.6/Gb/s is currently cheaper than 10GbE at ~$11.5/Gb/s, giving it better value. You can also run TCP/IP over IB (IPoIB). There is a 40GbE switch for ~$208/Gb and even 100GbE switches, but not much is available in 40GbE adapters and I could find nothing in 100GbE. You can't get the 40/56Gb/s bandwidth or the lower latency of InfiniBand RDMA on 10GbE. RDMA supposedly exists on 10GbE, but after looking on Intel's site, I found only one card supporting it. Here is a paper which compares the performance of a custom key/value store, memcached, …
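To make the $/Gb/s figures concrete (the adapter prices here are assumptions back-calculated from the quoted ratios, not sourced prices): a 56Gb/s IB adapter at about $315 works out to 315 / 56 ≈ $5.6 per Gb/s, while a 10GbE adapter at about $115 works out to 115 / 10 = $11.5 per Gb/s, roughly twice the cost per unit of bandwidth.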

Ring-buffer for high performance, reduced contention, parallel processing - LMAX Disruptor

LMAX GIT Disruptor
Martin Fowler's write-up
Example code for v3.0

- Queues are wrong for inter-process communication; SEDA and Actor mechanisms bottleneck on contention
- Mechanical Sympathy - know how to drive the systems to get the most out of them
- DRAM is not getting faster, but it is getting cheaper, and bandwidth to memory is increasing
- The GHz race is over; CPUs aren't getting faster - bigger caches, more cores
- Networks are getting faster; standard 10GbE with RDMA can bypass the kernel and transfer userspace memory between machines in sub-10 us (Java 7 SDP, j-zerocopy)
- Use RDMA to HA/DR replicate to another node
- Mechanical disks have great sequential access/streaming; SSDs are not much better for sequential, single-threaded access, but great for multi-threaded random access
- The disk controller is limited; the standard SSD interface is not very fast, while the new PCIe interface is much faster (Fusion-io cards are very fast)
- 10GbE can go from a process on one system to a process on another system in tens of us; move data between cores in the L3 cache for be…
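A minimal sketch of the Disruptor's pre-allocated ring buffer and claim/publish cycle, assuming a late-3.x Disruptor (the ThreadFactory constructor; earlier 3.0 releases took an Executor). Event and class names are mine.

    import com.lmax.disruptor.RingBuffer;
    import com.lmax.disruptor.dsl.Disruptor;
    import java.util.concurrent.Executors;

    public class DisruptorSketch {
        static class ValueEvent { long value; }   // mutable, pre-allocated slot

        public static void main(String[] args) {
            Disruptor<ValueEvent> disruptor = new Disruptor<>(
                ValueEvent::new, 1024, Executors.defaultThreadFactory());
            disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("consumed " + event.value));
            disruptor.start();

            RingBuffer<ValueEvent> ring = disruptor.getRingBuffer();
            long seq = ring.next();           // claim the next slot (no locks)
            try {
                ring.get(seq).value = 42;     // write into the pre-allocated event
            } finally {
                ring.publish(seq);            // make it visible to consumers
            }
        }
    }

No objects are allocated per message; producers and consumers coordinate through sequence counters rather than locks, which is where the reduced contention comes from.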

Distributed monitoring

Ganglia - a scalable distributed monitoring system for high-performance computing systems such as clusters and grids. Can be embedded…

Screencasting options

Mac
- QuickTime Player - free, already installed
- ScreenFlow $100 - used by Lifehacker; simultaneously records screen, audio, and camera
- Camtasia $99
- Snapz Pro X $69

Linux
- Kdenlive screen capture + Audacity for audio
- recordMyDesktop
- Wink
- Byzanz - records to a GIF which auto-plays

Load Balancing

GSLB techniques: round-robin DNS, anycast, redirects, cookies, etc.
http://www.tenereillo.com/GSLBPageOfShame.htm
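As a tiny illustration of the round-robin idea on that list (a sketch with invented names, independent of any DNS server):

    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    // Hands back pool members in rotation -- the same policy round-robin DNS
    // applies when it rotates A records across queries.
    class RoundRobin {
        private final List<String> servers;
        private final AtomicLong counter = new AtomicLong();

        RoundRobin(List<String> servers) { this.servers = servers; }

        String next() {
            // floorMod keeps the index non-negative even if the counter wraps
            return servers.get((int) Math.floorMod(
                counter.getAndIncrement(), (long) servers.size()));
        }
    }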

Queuing

- Apache Kafka (Java)
- Apache ActiveMQ (Java, JMS)
- Red Hat HornetQ (Java, JMS)
- ZeroMQ (0MQ)
- Kestrel (JVM/Scala, Twitter)
- Redis

Grid Computing

- Open Grid Scheduler/Grid Engine: http://gridscheduler.sourceforge.net/
- SLURM: A Highly Scalable Resource Manager: https://computing.llnl.gov/linux/slurm/
- Univa took over Sun Grid Engine: http://gridengine.org/blog/

Free web diagramming tool

Diagramly - saves to Google Drive or Dropbox; no multi-user collaboration support.