Tuning Journaling File Systems

Roderick Derks, 03/02/2007

File systems are part of our everyday lives. We store and retrieve data constantly, but rarely do we think about how each file system works. Perhaps that’s as it should be: Linux supports many different kinds of file systems, and most are mature and robust. For example, the Linux kernel supports the traditional Ext2 file system (among others), several cluster file systems (Lustre, GFS, GPFS, and CXFS), and also includes no fewer than four journaling file systems that have been proven time and again in production server environments, where high throughput and near-perennial uptime are essential. (For additional information on journaling file systems, see the October 2002 Linux Magazine article titled “Journaling File Systems”, available online at http://www.linux-mag.com/2002-10/jfs_01.html.)

But journaling file systems need not be limited to servers. Journaling file systems can also benefit client machines, where performance and reliability are often just as critical. However, the jobs assigned to a workstation and the demands placed on a server are radically different. To meet high-throughput and high-uptime requirements and get the best performance out of both, you have to tune each configuration to suit. Let’s use the open source dbench benchmark (http://samba.org/ftp/tridge/dbench/) to tweak and measure a number of different workloads and see how a little work can yield big results.

(This article is taken from the www.linux-mag.com website; the author is Steve Best.)

The Linux kernel source includes not one, but four journaling file systems: JFS from IBM (http://jfs.sourceforge.net/), XFS from SGI (http://oss.sgi.com/projects/xfs/), Ext3, and ReiserFS from Namesys (http://www.namesys.com), which also develops Reiser4 outside the mainline kernel. The output in Figure One shows a system with all four journaling file systems.

FIGURE ONE: A system with four journaling file systems, Ext3, ReiserFS, JFS, and XFS

# mount
/dev/hda1 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
devpts on /dev/pts type devpts (rw,gid=5,mode=620)
tmpfs on /dev/shm type tmpfs (rw)
/dev/hda6 on /jfs type jfs (rw)
/dev/hda5 on /reiserfs type reiserfs (rw)
/dev/hda7 on /xfs type xfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
sunrpc on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
automount(pid1871) on /net type autofs (rw,fd=4,pgrp=1871,minproto=2,maxproto=4)

During installation, your distribution picks one of these journaling file systems as the default for each partition, but you can typically change the choice. The default format during a Red Hat install is Ext3; SuSE defaults the format to ReiserFS.

Figure Two shows that JFS, XFS, ReiserFS, and Ext3 are independent “peers.” It is possible for a single Linux machine to use all of these types of file systems at the same time. A system administrator can configure a system to use JFS on one partition/volume, and ReiserFS on another.

FIGURE TWO: The layers of the Linux file system

General Tips: File System Layout

File system performance is often a major component of overall system performance. To achieve optimal performance, the underlying file system configuration must be balanced to match the characteristics of the system’s primary application.

Before creating file systems, plan their layout. The following are some general considerations to keep in mind when planning your system:

*I/O workload should be distributed as evenly as possible across the disk drives.

*The number of file systems on any one disk should be kept to a minimum. All of the Linux file systems are better able to manage fragmentation of a file system in a larger partition/volume than in a small, completely full partition.

*If a large set of files (in size, number, or both) has characteristics that make the files significantly different than “typical” files, create a separate file system for these files that is tuned to their requirements.

*Split file system workloads depending on access patterns. Put random I/O on different spindles than streaming traffic.

Most parameters that affect file system performance are set once and for all when a file system is created. Hence, it’s simpler to provide the parameters you want when you run mkfs. Some mount options can also be used to change the performance of the file system.

Disable Access Times

The first performance tip is a simple change you can make via mount: Disable file access times if your system doesn’t need them.

Linux records an atime, or access time, whenever a file is read. However, few applications rely on this information, and it can be quite costly to track because every read also causes a metadata update. To get a quick performance boost, simply disable access time updates with the mount option noatime.
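As a minimal sketch, the following shell function reports whether a mount point already carries noatime. It assumes a mount table laid out like /proc/mounts (device, mount point, type, options); the function name has_noatime and the sample table are illustrative only.

```shell
#!/bin/sh
# has_noatime MOUNTS_FILE MOUNT_POINT
# Succeeds (exit status 0) if MOUNT_POINT is listed in MOUNTS_FILE with
# the noatime option. MOUNTS_FILE is expected to look like /proc/mounts:
#   device mountpoint fstype options dump pass
has_noatime() {
    awk -v mp="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "noatime") found = 1
        }
        END { exit !found }
    ' "$1"
}

# Hypothetical sample table, used here instead of the live /proc/mounts.
sample=$(mktemp)
cat > "$sample" <<'EOF'
/dev/hda1 / ext3 rw 0 0
/dev/hdb1 /mnt/reiserfs reiserfs rw,noatime,notail 0 0
EOF

for mp in / /mnt/reiserfs; do
    if has_noatime "$sample" "$mp"; then
        echo "$mp: noatime is set"
    else
        echo "$mp: access times are still being updated"
    fi
done
rm -f "$sample"
```

On a live system you would pass /proc/mounts as the first argument. To turn the option on for a mounted file system, a command along the lines of mount -o remount,noatime / (run as root) should do it; adding noatime to the options column of /etc/fstab makes the change permanent.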

Use Ext3 Instead of Ext2

Ext3 is a minimal extension to Ext2 that adds support for journaling. Ext3 uses the same disk layout and data structures as Ext2, making it forward- and backward-compatible with Ext2. Migration from Ext2 to Ext3 (and vice versa) is quite easy, and can even be done in-place in the same partition. (The other journaling file systems require the partition to be formatted with their own mkfs utility.) Despite the similarities, Ext3 provides higher availability and performance than Ext2 without sacrificing Ext2’s robustness, that is, its simplicity and reliability.

Ext3’s first improvement is directory indexing. This feature improves file access in directories containing many files by using hashed B-trees (HTrees) to store the directory information. If a directory grows beyond a single disk block, it’s automatically indexed using a hash tree.

One way to enable dir_index is to use the tune2fs command:

# tune2fs -O dir_index /dev/hda1

This command only applies to those directories created on the named filesystem after tune2fs runs. To apply directory indexing to existing directories, run the e2fsck utility to optimize and reindex the directories on the filesystem:

# e2fsck -D -f /dev/hda1

Another Ext3 enhancement is preallocation. This feature is useful when using Ext3 with multiple threads appending to files in the same directory. You can enable preallocation using the reservation option.

# mount -t ext3 -o reservation /dev/hda1 /ext3

You can further improve Ext3 performance by keeping the file system’s journal on another device. An external log improves performance because the log updates are saved to a different partition than the data of the corresponding file system. This reduces the number of hard disk seeks.

To create an external Ext3 log, run the mkfs utility on the journal device, making sure that the block size of the external journal is the same block size as the Ext3 file system. For example, the commands…

# mkfs.ext3 -b 4096 -O journal_dev /dev/hda1
# mkfs.ext3 -b 4096 -J device=/dev/hda1 /dev/hdb1

… use /dev/hda1 as the external log for the Ext3 file system on /dev/hdb1.

To measure the impact of the external journal, let’s run dbench. In the next few examples, the first run of dbench measures Ext3 with an internal log; the second run measures Ext3 with an external log. (Remember to place the dbench software on the partition being benchmarked.)

First, the internal log:

# mkfs.ext3 /dev/hda1
# mount -t ext3 /dev/hda1 /ext3
# cd /ext3
# cp /tmp/dbench-1.3.tar.gz . 
# tar zxvf dbench-1.3.tar.gz
# cd dbench
# make
# date && ./dbench 20 && date

Mon Jul  3 13:38:50 PDT 2006
20 clients started
.......................+......+.......+++++....+********* 
Throughput 84.6415 MB/sec (NB=105.802 MB/sec  846.415 MBit/sec) 20 procs
Mon Jul  3 13:39:21 PDT 2006

Next, /dev/hdb1 is used as the external log device.

# mkfs.ext3 -b 4096 -O journal_dev /dev/hdb1
# mkfs.ext3 -b 4096 -J device=/dev/hdb1 /dev/hda1
# mount -t ext3 /dev/hda1 /ext3
# cd /ext3
# cp /tmp/dbench-1.3.tar.gz . 
# tar zxvf dbench-1.3.tar.gz
# cd dbench
# make
# date && ./dbench 20 && date

Mon Jul  3 14:00:03 PDT 2006
20 clients started
.......................+......+.......+++++....+********* 
Throughput 117.848 MB/sec (NB=147.309 MB/sec  1178.48 MBit/sec) 20 procs
Mon Jul  3 14:00:25 PDT 2006

As you can see, the second run of dbench shows increased throughput, from 84.6415 MB/sec to 117.848 MB/sec. The time required to run dbench was also reduced, from 31 seconds to 22 seconds. The dbench benchmark creates a very large amount of metadata activity. Therefore, determining the metadata activity for your system can help determine the type of tuning that will be most useful.
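To put numbers on the comparison, here is a minimal sketch that pulls the MB/sec figure out of the dbench summary line and computes the relative change. The function names dbench_throughput and percent_gain are hypothetical helpers, and the parsing assumes the summary line format shown in the two runs above.

```shell
#!/bin/sh
# dbench_throughput: read dbench output on stdin and print the MB/sec
# figure from the "Throughput" summary line.
dbench_throughput() {
    awk '$1 == "Throughput" { print $2 }'
}

# percent_gain BASE NEW: print the relative change of NEW over BASE,
# in percent, with one decimal place. Negative means NEW is smaller.
percent_gain() {
    awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (new - base) / base * 100 }'
}

# Summary lines copied from the internal-log and external-log runs.
internal=$(echo "Throughput 84.6415 MB/sec (NB=105.802 MB/sec  846.415 MBit/sec) 20 procs" | dbench_throughput)
external=$(echo "Throughput 117.848 MB/sec (NB=147.309 MB/sec  1178.48 MBit/sec) 20 procs" | dbench_throughput)

echo "throughput change: $(percent_gain "$internal" "$external")%"
# Elapsed time went from 31 seconds to 22 seconds.
echo "elapsed time change: $(percent_gain 31 22)%"
```

In practice you would redirect the output of each dbench run to a file and feed that file to dbench_throughput instead of pasting the summary line by hand.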

Tweaking ReiserFS

The ReiserFS journaling file system supports metadata journaling, and has a unique design that differentiates it from the other journaling file systems. Specifically, ReiserFS stores all file system objects in a single B*-balanced tree. ReiserFS also supports compact, indexed directories, dynamic inode allocation, resizable items, and 60-bit offsets.

The tree contains four basic components: stat() data, directory components, direct components, and indirect components. Components are found by searching for a key, which consists of an ID, the offset within the object being searched, and the item type. Directories grow and shrink as their contents change. A hash of the file name is used to keep an entry’s offset in the directory permanent. For files, indirect components point to data blocks, and direct components contain packed file data. All of the components can be resized by rebalancing the tree.

ReiserFS is especially adept at managing lots and lots of small files.

Like Ext3, the ReiserFS file system journal can be maintained separately from the file system itself. To accomplish this, your system needs two unused partitions. Assuming that /dev/hda1 is the external journal and /dev/hdb1 is the file system you want to create, simply run the command:

# mkreiserfs -j /dev/hda1 /dev/hdb1

That’s all it takes.

In addition to an external journal, there are three mount options that can change the performance of ReiserFS:

*The hash option allows you to choose which hash algorithm to use to locate and write files within directories. There are three choices. The rupasov hashing algorithm is a fast hashing method that preserves locality, mapping lexicographically close file names to close hash values. The tea hashing algorithm is a Davies-Meyer function that creates keys by thoroughly permuting bits in the name. It achieves high randomness and, therefore, a low probability of hash collision, but this entails performance costs. Finally, the r5 hashing algorithm is a modified version of the rupasov hash with a reduced probability of collisions. r5 is the default hashing algorithm. You can set the hash scheme using a command such as mount -t reiserfs -o hash=tea /dev/hdb1 /mnt/reiserfs.

*The nolog option disables journaling, and also provides a slight performance improvement in some situations, albeit at the cost of forcing fsck if the file system is not cleanly unmounted. This is a good option to use when restoring a file system from a backup. A sample command is mount -t reiserfs -o nolog /dev/hdb1 /mnt/reiserfs.

*The notail option disables the packing of files into the tree. By default, ReiserFS stores small files and “file tails” directly in the tree.

It is possible to combine mount options by separating them with a comma. Here’s an example that uses two mount options (noatime, notail) to increase file system performance:

# mount -t reiserfs -o noatime,notail /dev/hdb1 /mnt/reiserfs

Tweaking JFS

JFS for Linux is based on the IBM JFS file system for OS/2 Warp. JFS is well-suited to enterprise environments and uses many advanced techniques to boost performance, provide for very large file systems, and keep track of changes to the file system. Some of the features of JFS include:

*Extent-based addressing structures. JFS uses extent-based addressing structures, along with aggressive block-allocation policies, to produce compact, efficient, and scalable structures for mapping logical offsets within files to physical addresses on disk. This feature yields excellent performance.

*Directory organization. Two different directory organizations are provided: one is used for small directories, and the other for large directories. The contents of a small directory (up to 8 entries, excluding the self, . or “dot”, and parent, .. or “dot dot”, entries) are stored within the directory’s inode. This eliminates the need for separate directory block I/O and the need to allocate separate storage. The contents of larger directories are organized in a B+ tree keyed on name. The B+ tree provides faster directory lookups, inserts, and deletes when compared to traditional, unsorted directory indices.

*64-bits. JFS is a full, 64-bit file system. All of the appropriate file system structure fields are 64 bits in size. This allows JFS to support large files and partitions.

There are other advanced features in JFS, such as allocation groups, which are shown in Figure Three. Allocation groups speed file access times by maximizing locality. (XFS also has this feature.)

FIGURE THREE: JFS file system allocation groups

Again, JFS file systems can be journaled on a separate device. To create a JFS file system with the log on an external device, the system needs to have two unused partitions. In the following example, /dev/hda1 and /dev/hdb1 are spare partitions. /dev/hda1 is used as the external log.

# mkfs.jfs -j /dev/hda1 /dev/hdb1

There is one mount option that can change the performance of the JFS file system. The nointegrity option tells JFS not to write to the journal, which allows for higher performance when restoring a volume from backup media. The integrity of the volume is not guaranteed if the system terminates abnormally.

The integrity option is the default. It commits metadata changes to the journal. Use this option to remount a volume where the nointegrity option was previously specified to restore normal behavior.

Unlike ReiserFS, the JFS jfs_tune utility allows you to change the location of the journal. To create a journal on an external device, say, /dev/hda2, run:

# mkfs.jfs -J journal_dev /dev/hda2

Then attach the external journal to the file system, which is located on /dev/hdb1.

# jfs_tune -J device=/dev/hda2 /dev/hdb1

Tweaking XFS

The XFS file system for Linux is based on SGI’s IRIX XFS file system technology. XFS supports metadata journaling and extremely large disk farms. In addition, XFS is designed for scalability and high performance.

XFS is a 64-bit file system. All of the file system counters are 64-bit, as are the addresses used for each disk block and the unique number assigned to each file (the inode number).

XFS supports delayed allocation. Rather than choosing disk blocks as soon as an application writes data, XFS defers the decision until the data is actually flushed to disk, which allows the file system to optimize write performance. When it comes time to write data to disk, XFS can allocate free space in an intelligent way that optimizes file system performance by allocating a single, contiguous region on the disk to store this data.

XFS partitions the file system into regions called Allocation Groups (AGs). Each AG manages its own free space and inodes, as shown in Figure Three. In addition, AGs provide scalability and parallelism for the file system. Files and directories are not limited to a single AG. Free space and inodes within each AG are managed so that multiple processes can allocate free space throughout the file system simultaneously, thus reducing the bottleneck that can occur on large, active file systems.

One option that can make a difference in an XFS file system is the -i size=xxx option to mkfs.xfs. The default inode size is 256 bytes. The inode size can be increased (up to 2 KB, and no more than half the file system block size), which allows more directories to keep their contents inside the inode and so requires less disk I/O to read and write them. Conversely, larger inodes need more I/O to read, because inodes are read and written in clusters. Because extents are also held in the inode if there is room, larger inodes also reduce the number of files with out-of-inode metadata.

Another option that affects performance of the file system is the log size: -l size=xxx. When there is a large amount of metadata activity, a larger log translates to more elapsed time before modified metadata is flushed to the disk. However, a larger log also slows down recovery.

As with the other journaling file systems, an external log improves performance because the log updates are saved to a different partition than their corresponding file system. To create an XFS file system with the log on an external device, you again need two unused partitions. In the following example, /dev/hda1 and /dev/hdb1 are spare partitions. The /dev/hda1 partition is used as the external log.

# mkfs.xfs -l logdev=/dev/hda1 /dev/hdb1

At mount time, there are three XFS options that can alter performance:

*osyncisdsync indicates that O_SYNC is treated as O_DSYNC, which is the behavior Ext2 gives you by default. Without this option, O_SYNC file I/O syncs more metadata for the file.

*logbufs=size sets the number of log buffers that are held in memory. This means you can have more active transactions at once, and can still perform metadata changes while the log is synced to disk. The flip side is that the amount of metadata changes that might be lost due to a system crash is greater. Valid values are 2 through 8.

*logbsize=size sets the size of the log buffer held in memory. Valid values are 16, 32, 64, 128, and 256 Kbytes.

For a metadata-intensive workload, the default log size could be the limiting factor that reduces the performance of the file system. Better results are achieved by creating file systems with a larger log size. The following mkfs command creates a log of 32,768 file system blocks (mkfs.xfs reads the b suffix as blocks, not bytes).

# mkfs -t xfs -l size=32768b -f /dev/hdb1

(Currently, in order to resize a log inside the volume, you need to remake the file system.)
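Because mkfs.xfs interprets the b suffix as file system blocks, the resulting log size depends on the block size. The following sketch works out the size for size=32768b, assuming a 4096-byte block size (a common default, but an assumption here); log_size_mib is a hypothetical helper name.

```shell
#!/bin/sh
# log_size_mib BLOCKS BLOCK_SIZE_BYTES
# Print the size of a log of BLOCKS file system blocks, in MiB.
log_size_mib() {
    echo $(( $1 * $2 / 1024 / 1024 ))
}

# 32768 blocks at an assumed 4096 bytes per block.
echo "size=32768b with 4096-byte blocks: $(log_size_mib 32768 4096) MiB"
```

On an existing file system, xfs_info reports the actual block size (bsize) and the log size in blocks, so the same arithmetic can be checked against a real volume.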

Benchmarking XFS

Let’s look at two ways to tune the XFS file system, using dbench to measure each. The first example uses the defaults to format an XFS partition, which places the log inside the same partition as the data; this test provides a baseline. The second example adds the mount options logbufs and logbsize. The third example uses an external log and the same two mount options.

# mkfs.xfs -f /dev/hda1
# mount -t xfs /dev/hda1 /xfs
# cd /xfs
# cp /tmp/dbench-1.3.tar.gz .
# tar zxvf dbench-1.3.tar.gz
# cd dbench
# make
# date && ./dbench 10 && date

Mon Jul  3 14:13:25 PDT 2006
10 clients started
.......................+......+.......+++++....+********* 
Throughput 92.7404 MB/sec (NB=115.925 MB/sec  927.404 MBit/sec) 10 procs
Mon Jul  3 14:13:39 PDT 2006

Next, let’s use mount options logbufs and logbsize.

# mkfs.xfs -f /dev/hda1
# mount -t xfs -o logbufs=8,logbsize=32768b /dev/hda1 /xfs
# cd /xfs
# cp /tmp/dbench-1.3.tar.gz .
# tar zxvf dbench-1.3.tar.gz
# cd dbench
# make
# date && ./dbench 10 && date

Mon Jul  3 14:17:35 PDT 2006
10 clients started
.......................+......+.......+++++....+********* 
Throughput 96.4556 MB/sec (NB=120.57 MB/sec  964.556 MBit/sec) 10 procs
Mon Jul  3 14:17:49 PDT 2006

Finally, let’s place the log on an external device for XFS. The example in this section runs dbench with the same parameters as in the previous example, but the log is placed on an external device, /dev/hdb1.

# mkfs.xfs -l logdev=/dev/hdb1,size=32768b -f /dev/hda1
# mount -t xfs -o logbufs=8,logbsize=32768b,logdev=/dev/hdb1 /dev/hda1 /xfs
# cd /xfs
# cp /tmp/dbench-1.3.tar.gz .
# tar zxvf dbench-1.3.tar.gz
# cd dbench
# make
# date && ./dbench 10 && date

Mon Jul  3 14:22:45 PDT 2006
10 clients started
.......................+......+.......+++++....+********* 
Throughput 182.083 MB/sec (NB=227.604 MB/sec  1820.83 MBit/sec) 10 procs
Mon Jul  3 14:22:52 PDT 2006

When the logbufs and logbsize options are added, throughput increases from 92.7404 MB/sec to 96.4556 MB/sec. When the log is moved to an external device, the throughput nearly doubles to 182.083 MB/sec. Clearly, the external log increases file system performance under dbench, a program that generates a large amount of metadata activity.

Tuning the I/O Scheduler

The I/O scheduler orders pending I/O requests to minimize the time spent moving the disk head. This, in turn, minimizes disk seek time and maximizes hard disk throughput. Hence, tuning the I/O scheduler can also help increase file system performance.

Linux I/O Schedulers

The Linux I/O schedulers present I/O requests to block devices in an optimal order. There are currently four schedulers in the kernel, each with a different notion of “high performance”. All of them, however, maintain a dispatch queue, which is a list of requests that have been selected for submission to the device. The purpose of the I/O scheduler is to sort and merge the I/O requests from the I/O queues in order to increase efficiency and enable the best performance.

Using the entries under /sys (the sysfs file system), you can change and tune the I/O scheduler for a given block device. Each scheduler exposes its own directory tree of tuning options. Let’s discuss the design of each scheduler and the areas where one would be better than another.

The noop scheduler is a FIFO queue that provides only I/O merging. It is a good choice if your application already sorts the I/O.

The deadline scheduler uses a round-robin algorithm to minimize the latency for any I/O request. It implements merging and sorting plus a deadline mechanism to avoid starvation. It favors reads over writes.

The anticipatory scheduler tries to predict the future workload, delaying I/O in order to merge requests and decrease the number of seeks. It implements merging and sorting plus an algorithm to minimize disk head movements. It is suggested for workstations and old hardware.

The cfq scheduler uses a round-robin technique that tries to divide the available I/O bandwidth fairly among all I/O requests. It implements sorting and merging. This is the default I/O scheduler for the Red Hat Enterprise Linux 4 release and for the SuSE SLES9 and SLES10 releases.

To see which scheduler a device is using, run the following command. The active scheduler is the one shown in square brackets.

# cat /sys/block/hda/queue/scheduler
noop anticipatory deadline [cfq]

On newer kernels, you can change the scheduler without a reboot by writing the name of another I/O scheduler to the same file. The example shows how to switch to the deadline scheduler.

# echo deadline > /sys/block/hda/queue/scheduler
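As a small sketch of how the scheduler file can be read from a script, the function below extracts the bracketed (active) name from a line in the format shown above, and the loop applies it to every block device that exposes a scheduler file. The function name active_scheduler is a hypothetical helper; the loop simply prints nothing on systems without such files.

```shell
#!/bin/sh
# active_scheduler LINE
# Print the scheduler name enclosed in square brackets, for example
# "cfq" for the line "noop anticipatory deadline [cfq]".
# Prints nothing if the line contains no bracketed name.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Report the active scheduler for each block device, if any are readable.
for f in /sys/block/*/queue/scheduler; do
    [ -r "$f" ] || continue
    dev=${f#/sys/block/}
    dev=${dev%%/*}
    echo "$dev: $(active_scheduler "$(cat "$f")")"
done

# The sample line from the example above.
active_scheduler "noop anticipatory deadline [cfq]"
```

Reading the file needs no special privileges; writing a new scheduler name to it, as in the echo command above, requires root.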

The I/O scheduler can also be selected at boot time using the kernel parameter elevator=xxx. The first option is noop, best for smart storage controllers. Another is deadline, which limits the maximum latency per request to disk. The third is anticipatory, which maximizes throughput by increasing latency, and is suitable for desktops. The fourth, completely fair queuing, abbreviated as cfq, compromises between reads and writes and tries to balance throughput and latency, which is best for file servers. (See the sidebar for additional information on the I/O schedulers available in the 2.6.x series of the kernel.)
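To check whether a kernel was booted with an elevator= setting, you can look at its command line. The sketch below assumes the parameters are separated by spaces, as they are in /proc/cmdline; boot_elevator is a hypothetical helper name and the sample command line is made up for illustration.

```shell
#!/bin/sh
# boot_elevator CMDLINE
# Print the value of the elevator= parameter found in CMDLINE, or
# nothing if the parameter is absent.
boot_elevator() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^elevator=//p'
}

# Hypothetical kernel command line used as sample input.
boot_elevator "ro root=/dev/hda1 elevator=deadline quiet"

# On a running system, inspect the real command line if it is readable.
if [ -r /proc/cmdline ]; then
    current=$(boot_elevator "$(cat /proc/cmdline)")
    echo "elevator from boot parameters: ${current:-not set}"
fi
```

If nothing is set at boot, the kernel falls back to its built-in default scheduler, which is why the per-device file under /sys remains the more reliable place to look.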

A Need for Speed

You can improve the performance of journaling file systems by taking the time to tune them. Considerable effort has gone into designing these file systems to be scalable and fast without requiring significant expertise. Just twisting a few knobs, such as setting mount options and placing the journal on an external device, can make the journaling file systems run significantly faster.
