Enabling rbd cache in a Cloud Computing Environment
Analysis of Disk Cache Settings
Definition
In this article, the disk cache (Disk Cache) refers to the write cache inside the disk drive itself. It is volatile: its contents are lost on power failure. It is typically around 16~128 MB in size.
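As a concrete illustration, on Linux the drive's volatile write cache can usually be inspected and toggled with hdparm (the device name below is only an example):

        # Report whether the drive's volatile write cache is enabled
        hdparm -W /dev/sda
        # Disable the write cache (safer, slower)
        hdparm -W0 /dev/sda
        # Re-enable the write cache
        hdparm -W1 /dev/sda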
When the Cache Is Flushed
The fsync/fdatasync/sync system calls all flush data from memory back to disk, and when the filesystem's barrier option is enabled they also flush data from the disk cache to the storage medium. The evidence is as follows:
- The fsync(2) man page states:
fsync() transfers (“flushes”) all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file.
- Reading the fsync implementations of ext3 and xfs shows that fsync calls blkdev_issue_flush to flush the disk cache only when the barrier option is set.
The ext3 implementation:
int ext3_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
        /* ... */

        /*
         * In case we didn't commit a transaction, we have to flush
         * disk caches manually so that data really is on persistent
         * storage
         */
        if (needs_barrier) {
                int err;

                err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
                if (!ret)
                        ret = err;
        }
        /* ... */
}
The xfs implementation:
STATIC int
xfs_file_fsync(
        struct file             *file,
        loff_t                  start,
        loff_t                  end,
        int                     datasync)
{
        /* ... */

        /*
         * If we only have a single device, and the log force about was
         * a no-op we might have to flush the data device cache here.
         * This can only happen for fdatasync/O_DSYNC if we were overwriting
         * an already allocated file and thus do not have any metadata to
         * commit.
         */
        if ((mp->m_flags & XFS_MOUNT_BARRIER) &&
            mp->m_logdev_targp == mp->m_ddev_targp &&
            !XFS_IS_REALTIME_INODE(ip) &&
            !log_flushed)
                xfs_blkdev_issue_flush(mp->m_ddev_targp);

        /* ... */
}
The block device implementation
For a block device file, fsync flushes the disk cache directly:
int blkdev_fsync(struct file *filp, loff_t start, loff_t end, int datasync)
{
        struct inode *bd_inode = filp->f_mapping->host;
        struct block_device *bdev = I_BDEV(bd_inode);
        int error;

        error = filemap_write_and_wait_range(filp->f_mapping, start, end);
        if (error)
                return error;

        /*
         * There is no need to serialise calls to blkdev_issue_flush with
         * i_mutex and doing so causes performance issues with concurrent
         * O_SYNC writers to a block device.
         */
        error = blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
        if (error == -EOPNOTSUPP)
                error = 0;

        return error;
}
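To tie the kernel paths above back to userspace, here is a minimal sketch (the file path is a placeholder) of the pattern an application uses to make data durable: write(2) only reaches the page cache, while fsync(2) goes through the code shown above and, on a barrier-enabled filesystem, also flushes the drive's volatile cache.

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
                const char buf[] = "critical record\n";

                /* "/data/journal.log" is a placeholder path for this sketch */
                int fd = open("/data/journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* write() completes once the data is in the page cache; it says
                 * nothing about the platters or the drive's write cache */
                if (write(fd, buf, strlen(buf)) != (ssize_t)strlen(buf)) {
                        perror("write");
                        return 1;
                }

                /* fsync() blocks until the device reports the transfer complete;
                 * with barriers enabled this also flushes the disk cache */
                if (fsync(fd) < 0) {
                        perror("fsync");
                        return 1;
                }

                close(fd);
                return 0;
        }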
rbd cache
The rbd cache plays the same role for a virtual disk that the disk cache plays for a physical drive.
rbd cache settings
rbd cache = true
rbd cache writethrough until flush = true (default: true)
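For reference, a minimal sketch of how these options typically appear on the client side of ceph.conf (the section name follows the usual Ceph convention; adjust to your deployment):

        [client]
            rbd cache = true
            rbd cache writethrough until flush = true

Note that the hypervisor may also have a say: QEMU, for example, propagates its own cache= drive setting to librbd, so a guest disk attached with cache=none can end up with the rbd cache disabled regardless of ceph.conf.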
Setting rbd cache = true does not by itself put the rbd cache into writeback mode; the behavior is further controlled by the option rbd cache writethrough until flush:
it makes the RBD cache run in writethrough mode, passing data straight through so that none is lost, until the first flush command is received; only after that first flush does it switch to writeback mode.
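A minimal librbd sketch (the pool name "rbd", image name "test", and config path are placeholders) of where that first flush comes from on the client side. In a VM the flush is normally issued by the guest OS and relayed by the hypervisor, but a direct librbd user triggers the same path with rbd_flush():

        #include <stdio.h>
        #include <string.h>
        #include <rados/librados.h>
        #include <rbd/librbd.h>

        int main(void)
        {
                rados_t cluster;
                rados_ioctx_t ioctx;
                rbd_image_t image;
                const char data[] = "hello rbd";

                /* connect to the cluster using the placeholder config path */
                if (rados_create(&cluster, NULL) < 0 ||
                    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf") < 0 ||
                    rados_connect(cluster) < 0) {
                        fprintf(stderr, "failed to connect to cluster\n");
                        return 1;
                }
                if (rados_ioctx_create(cluster, "rbd", &ioctx) < 0 ||
                    rbd_open(ioctx, "test", &image, NULL) < 0) {
                        fprintf(stderr, "failed to open image\n");
                        rados_shutdown(cluster);
                        return 1;
                }

                /* with writethrough-until-flush, writes before the first flush
                 * are written through to the OSDs even though rbd cache = true */
                rbd_write(image, 0, strlen(data), data);

                /* the first flush tells librbd that this client knows how to
                 * flush; from here on the cache may operate in writeback mode */
                rbd_flush(image);

                rbd_close(image);
                rados_ioctx_destroy(ioctx);
                rados_shutdown(cluster);
                return 0;
        }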
Questions
- Is there a risk of data loss when the rbd cache is enabled?
There is a risk of data loss. In most cases the applications inside the virtual machine should be able to tolerate it. An application that cannot tolerate data loss should itself make sure the data is flushed back to disk after each operation, via system calls such as fsync.
- Can enabling the rbd cache corrupt XFS?
No.
XFS writes all metadata to its journal first and makes sure the journal is persisted on disk.
- For Ceph data disks, the xfs barrier option must be set, otherwise there is a risk of data loss (see the mount sketch after this list):
- When Ceph writes data, both the metadata and the data are kept in the journal, so no data is lost.
- But when Ceph syncs its journal it must 1) first make sure the XFS data has been flushed to persistent storage on disk, and 2) only then trim the journal. This process has to guarantee the data has really been written back to disk, which requires the data disk's XFS to have barriers enabled.
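A minimal sketch of what that means in practice (the device path and mount point are examples only). On most kernels of this era barrier is already the XFS default, so the essential point is never to mount an OSD data disk with nobarrier:

        # barrier is the XFS default on most kernels; shown explicitly here.
        # Never use nobarrier on an OSD data disk backed by a volatile disk cache.
        mount -t xfs -o barrier /dev/sdb1 /var/lib/ceph/osd/ceph-0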
Conclusions
- Enabling the rbd cache in virtual machines poses no risk (applications that cannot tolerate the cache window should issue fsync themselves, exactly as with a physical disk cache).
- On physical machines, the XFS data disks must be mounted in barrier mode.