Major Linux I/O Bug? SPOILER: No


Alternate title: Betteridge's Law Holds!

Last week, I posted on what I thought was an I/O bug in RHEL 7 and CentOS 7. I've now done enough testing to eliminate this possibility.  So, as promised:

So what was happening, then, that led me to get such drastically different results?

First, and most importantly, as I noted in an update to the prior post, one system was using SSD and the other was using spinning disk. That accounted for the bulk of the difference. It still leaves unsolved the mystery of my production performance problems, which began at roughly the same time the OS was updated (and which led me down that rabbit hole in the first place).

But other configuration changes made around that time may account for this. In preliminary testing, I've found that I get better performance by turning kernel asynchronous I/O off, contrary to conventional wisdom (and to what I've found on other platforms). I've also disabled DIRECT_IO, though I believe that may be a bad thing: the performance may be buoyed by caching at the OS level, which would mean that data hasn't really been committed to disk even though the DB engine believes it has. With these two changes, I've gotten performance on the production system back close to its previous (good) levels, at least for the time being.
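The caching concern is easy to see outside the database. Here is a minimal sketch in Python, not a reproduction of how the DB engine does its I/O: it times the same series of writes twice, once leaving the data to the OS page cache and once calling `os.fsync` after every block. The function names, block size, and block count are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def timed_writes(path, sync_each_write, block_count=200, block_size=4096):
    """Write block_count blocks to path and return the elapsed seconds.

    With sync_each_write=False the data may still be sitting in the OS
    page cache when this returns. With True, os.fsync asks the kernel to
    flush the file to the storage device after every block.
    """
    block = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(block_count):
            os.write(fd, block)
            if sync_each_write:
                os.fsync(fd)
        return time.perf_counter() - start
    finally:
        os.close(fd)


def compare(directory=None):
    """Return (buffered_seconds, synced_seconds) for a scratch file."""
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        path = os.path.join(tmp, "io_probe.dat")
        buffered = timed_writes(path, sync_each_write=False)
        synced = timed_writes(path, sync_each_write=True)
    return buffered, synced


if __name__ == "__main__":
    buffered, synced = compare()
    print(f"buffered: {buffered:.4f}s  fsync per write: {synced:.4f}s")
```

On spinning disk the synced run is typically much slower, and that gap is roughly the "performance" the page cache can lend to a workload that skips direct I/O. Pass `directory` to `compare` to point the scratch file at the filesystem you actually care about; the default temp directory may be on different storage, or even in memory.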

But anyway, false alarm, and egg on my face.