System.nanoTime() returns sporadic values on multi-processor Solaris on VMWare

Consecutive calls to System.nanoTime() return sporadic differences that are zero, negative, large/huge, and stuck/hung/frozen/not changing for long periods on multi-processor Solaris x86 on/under VMWare. This has not been reproduced on non-VMWare multi-processor systems and single-processor VMWare systems.

A trouble ticket was opened with VMWare and they indicated the %RDY was too high at 50 - 100% instead of 5%. They also recommend single-processor VMs vs. multi-processor VMs. The bottom line is that if the code does not run deterministically.


Yes nanoTime issues cause real errors


Dual processor Fedora 11 on VMWare does not exhibit zero, negative, or stuck nano times, so this appears to be a Solaris on VMWare issue.

So as it turns out the VMWare instance created for windows was reloaded with both Fedora 11 and Solaris 5.10 and none of them exhibited the issue. So it seems that the VM image defaults for Windows vs. Solaris may be the issue.

Dual processor Windows Server 2008 with Java 1.6.0_13 on VMWare does not exhibit zero, negative, or stuck nano times. Consecutive nanoTime calls appear 5-10 times slower than on Solaris.

Date Created: Fri Jun 12 08:10:20 MDT 2009
Type: bug
Category: java
Subcategory: classes_lang
release: 6u10
hardware: sun4
OSversion: solaris_10
priority: 4
Synopsis: System.nanoTime stops advancing and GC Logging malfunctions reporting 0 time
Description:
FULL PRODUCT VERSION :
Reproduced with both 1.6.0_2 and 1.6.0_13

ADDITIONAL OS VERSION INFORMATION :
SunOS 5.10 Generic_138889-06 i86pc i386 i86pc - latest patch level

EXTRA RELEVANT SYSTEM CONFIGURATION :
Reproduced on both AMD Opteron 8356 and Intel Xeon E5345.
This issue has only been observed when running with multiple processors on VMWare ESX. Not observed or reproducible on multi-processor Ubuntu 8.10, Kubuntu 9.04, Gentoo Linux, or Sparc Solaris when they are not running on VMWare. Also not observed on single processor Solaris running on VMWare ESX.

A DESCRIPTION OF THE PROBLEM :
GC logging malfunctions then stops reporting accurately. The underlying symptoms can be reproduced if System.nanoTime is polled continuously because it eventually stops incrementing. This condition causes many java.util.concurrent classes which rely on System.nanoTime to malfunction including ThreadPoolExecution, delay, and blocking queues among others.

Specifically in java.util.concurrent.locks.AbstractQueuedSynchronizer.ConditionObject.awaitNanos(long nanosTimeout) will not timeout if nanoTime stops incrementing and client code can hang indefinitely. Also it has been noted that System.nanoTime only on multi-processor VMWare sporadically increments by zero and negative values. This has never been observed when non-VMWare multi-processor or when on single-processor VMWare. System nanoTime returning sporadic negative values would also cause the above referenced awaitNanos to malfunction.

STEPS TO FOLLOW TO REPRODUCE THE PROBLEM :
Run the attached test code on multi-processor VMWare and eventually the nanoTime will stop incrementing.

EXPECTED VERSUS ACTUAL BEHAVIOR :
EXPECTED -
Not expecting nanoTime to stop incrementing indefinitely.
ACTUAL -
GC log sporadically reports large value, then reports zero indefinately:

Below you can see that GC logging initially is normal, then it reports a very large value ~38 hours, then reports zero indefinitely (for at least 12+ hours).

295.341: [GC [PSYoungGen: 197118K->3205K(204224K)] 578435K->388848K(1387968K), 0.5534078 secs]
296.489: [GC [PSYoungGen: 195013K->5698K(204672K)] 580656K->393592K(1388416K), 0.3963039 secs]
298.493: [GC [PSYoungGen: 197890K->3580K(204608K)] 585783K->392768K(1388352K), 0.4133058 secs]
300.697: [GC [PSYoungGen: 196156K->10969K(202688K)] 585344K->402131K(1386432K), 138476.3001686 secs]
138776.998: [GC [PSYoungGen: 201592K->12755K(203520K)] 592754K->413325K(1387264K), 0.0000000 secs]
138776.998: [GC [PSYoungGen: 203475K->12147K(198912K)] 604045K->423147K(1382656K), 0.0000000 secs]
138776.998: [GC [PSYoungGen: 197491K->8443K(201216K)] 608491K->430399K(1384960K), 0.0000000 secs]

Output from attached test code: eventually the nanoTime stops incrementing indefinately:

Below it can be seen that nanoTime has not incremented for 40-50 thousand executions.
Also it can be observed that the maximum difference between two consecutive nanoTime execution was reported as ~80 days. The test only ran for a few minutes.
On non-VMWare it increments on every execution and has never been observed to stop incrementing.

pool-1-thread-9 maxConsec=43314 minDiff=0 maxDiff=6978157632565349 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-22 maxConsec=48786 minDiff=0 maxDiff=6978158312802313 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-24 maxConsec=50552 minDiff=0 maxDiff=6978158093338584 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-16 maxConsec=44805 minDiff=0 maxDiff=6978157852663558 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-27 maxConsec=48969 minDiff=0 maxDiff=6978158252826428 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-19 maxConsec=48628 minDiff=0 maxDiff=6978157632643095 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-1 maxConsec=46531 minDiff=0 maxDiff=6978157792642802 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-3 maxConsec=46575 minDiff=0 maxDiff=6978157752584362 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-12 maxConsec=49622 minDiff=0 maxDiff=6978157472579260 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-17 maxConsec=48469 minDiff=0 maxDiff=6978157912748608 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-28 maxConsec=53015 minDiff=0 maxDiff=6978157674021878 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-13 maxConsec=42159 minDiff=0 maxDiff=6978157512622147 a=13956314863692146 b=13956314863692146 changed=maxConsec
pool-1-thread-28 maxConsec=53016 minDiff=0 maxDiff=6978157674021878 a=13956314863692146 b=13956314863692146 changed=maxConsec


REPRODUCIBILITY :
This bug can be reproduced often.

---------- BEGIN SOURCE ----------
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
* @author Troy T. Collinsworth
*/
public class TestNanoTime implements Runnable {

public static void main(String[] args) throws InterruptedException {
int threads = 30;

final ExecutorService es = Executors.newFixedThreadPool(threads);
for (int i = 0; i < threads; i++) {
es.execute(new TestNanoTime());
}
Thread.sleep(30000000);
System.out.println("Completed test, exiting");
System.exit(0);
}

public void run() {
int loopSize = 10;
long[] times = new long[loopSize];

int consec = 0;
int maxConsec = 0;
int currConsec = 0;
long a = System.nanoTime();
long b = System.nanoTime();
long minDiff = b - a;
long maxDiff = b - a;

String changed = null;
while (true) {
for (int i = 0; i < loopSize; i++) {
times[i] = System.nanoTime();
}
a = b;
b = System.nanoTime();
long newDiff = b - a;
if (newDiff < minDiff) {
minDiff = newDiff;
changed = "minDiff";
}
if (newDiff > maxDiff) {
maxDiff = newDiff;
changed = "maxDiff";
}
if (a == b) {
++consec;
if (consec > currConsec) {
currConsec = consec;
}
if (currConsec > maxConsec) {
maxConsec = currConsec;
changed = "maxConsec";
}
} else {
consec = 0;
}
if (changed != null) {
System.out.println(Thread.currentThread().getName() + " maxConsec=" + maxConsec + " minDiff=" + minDiff
+ " maxDiff=" + maxDiff + " a=" + a + " b=" + b + " changed=" + changed);
changed = null;
}
}
}
}

---------- END SOURCE ----------

Comments

Popular posts from this blog

Oracle JDBC ReadTimeout QueryTimeout

Sites, Newsletters, and Blogs

Locks held on Oracle for hours after sessions abnormally terminated by node failure