swift/swift
Samuel Merritt cc2f0f4ed6 Speed up reading and writing xattrs for object metadata
Object metadata is stored as a pickled dict: first the data is
pickled, then split into strings of length <= 254, then stored in a
series of extended attributes named "user.swift.metadata",
"user.swift.metadata1", "user.swift.metadata2", and so forth.
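
The layout described above can be sketched roughly as follows. This is
an illustration only, not the code from this change: chunk_metadata is
a hypothetical helper, the pickle protocol is an assumption, and it
returns (name, value) pairs instead of calling setxattr so that it runs
on any filesystem.

```python
import pickle

# xattr name prefix as given in the commit message
METADATA_KEY = "user.swift.metadata"


def chunk_metadata(metadata, chunk_size):
    """Pickle metadata and split it into (xattr_name, value) pairs.

    Hypothetical helper for illustration; a real writer would pass each
    pair to setxattr on the object file.
    """
    blob = pickle.dumps(metadata, protocol=2)  # protocol is an assumption
    pairs = []
    index = 0
    while blob:
        # The first xattr has no numeric suffix; later ones get 1, 2, ...
        name = METADATA_KEY + (str(index) if index else "")
        pairs.append((name, blob[:chunk_size]))
        blob = blob[chunk_size:]
        index += 1
    return pairs


# Example metadata, made up for this sketch
meta = {"Content-Length": "1024", "X-Object-Meta-Note": "x" * 1000}
old_layout = chunk_metadata(meta, 254)      # several short xattrs
new_layout = chunk_metadata(meta, 65536)    # everything in one xattr
```

With the old chunk size, every extra piece adds another xattr name that
has to be stored alongside its value, which is the overhead the next
paragraph is about.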

The choice of length 254 is odd, undocumented, and dates back to the
initial commit of Swift. From talking to people, I believe this was an
attempt to fit the first xattr in the inode, thus avoiding a
seek. However, it doesn't work. XFS _either_ stores all the xattrs
together in the inode (local), _or_ it spills them all to blocks
located outside the inode (extents or btree). Using short xattrs
actually hurts us here; by splitting into more pieces, we end up with
more names to store, thus reducing the metadata size that'll fit in
the inode.

[Source: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Extended_Attributes.html]

I did some benchmarking of read_metadata with various xattr sizes
against an XFS filesystem on a spinning disk, no VMs involved.

Summary:

 name | rank | runs |      mean |        sd | timesBaseline
------|------|------|-----------|-----------|--------------
32768 |    1 | 2500 | 0.0001195 |  3.75e-05 |           1.0
16384 |    2 | 2500 | 0.0001348 | 1.869e-05 | 1.12809122912
 8192 |    3 | 2500 | 0.0001604 | 2.708e-05 | 1.34210998858
 4096 |    4 | 2500 | 0.0002326 | 0.0004816 | 1.94623473988
 2048 |    5 | 2500 | 0.0003414 | 0.0001409 | 2.85674781189
 1024 |    6 | 2500 | 0.0005457 | 0.0001741 | 4.56648611635
  254 |    7 | 2500 |  0.001848 |  0.001663 | 15.4616067887

Here, "name" is the chunk size for the pickled metadata. A total
metadata size of around 31.5 KiB was used, so the "32768" runs
represent storing everything in one single xattr, while the "254" runs
represent things as they are without this change.
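
To make the chunk sizes concrete, here is a small back-of-the-envelope
calculation of how many xattrs each benchmark row implies. It assumes
the pickled metadata is exactly 31.5 KiB (32256 bytes); the benchmark
only says "around 31.5 KiB", so the counts are approximate.

```python
# Assumed total size: the commit message says "around 31.5 KiB".
TOTAL_BYTES = int(31.5 * 1024)  # 32256

CHUNK_SIZES = [254, 1024, 2048, 4096, 8192, 16384, 32768]


def xattr_count(total_bytes, chunk_size):
    """Number of xattrs needed to hold total_bytes (ceiling division)."""
    return -(-total_bytes // chunk_size)


COUNTS = {size: xattr_count(TOTAL_BYTES, size) for size in CHUNK_SIZES}

for size in CHUNK_SIZES:
    print("%6d-byte chunks -> %3d xattrs" % (size, COUNTS[size]))
```

Under that assumption the "254" runs read 127 separate xattrs per
object, while the "32768" runs read one.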

Since larger xattr chunks are faster, the new chunk size is 64 KiB,
the largest xattr value that XFS allows.

Reading of metadata from existing files is unaffected; the
read_metadata() function already handles xattrs of any size.
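
A reader that is independent of chunk size might look like the sketch
below. It is not the actual read_metadata() implementation: the
function name is hypothetical, and a plain dict stands in for the
file's xattrs so the example runs anywhere.

```python
import pickle

# xattr name prefix as given in the commit message
METADATA_KEY = "user.swift.metadata"


def read_chunked_metadata(xattrs):
    """Reassemble and unpickle metadata from a name -> bytes mapping.

    xattrs is a stand-in for getxattr calls on a real file. The loop
    never looks at chunk length, so 254-byte and 64 KiB chunks are
    handled identically.
    """
    blob = b""
    index = 0
    while True:
        name = METADATA_KEY + (str(index) if index else "")
        if name not in xattrs:
            break  # no more pieces
        blob += xattrs[name]
        index += 1
    return pickle.loads(blob)


def store(metadata, chunk_size):
    """Build the stand-in xattr mapping for a given chunk size."""
    blob = pickle.dumps(metadata, protocol=2)  # protocol is an assumption
    pieces = [blob[i:i + chunk_size] for i in range(0, len(blob), chunk_size)]
    return {METADATA_KEY + (str(i) if i else ""): piece
            for i, piece in enumerate(pieces)}


# Example metadata, made up for this sketch
meta = {"ETag": "abc123", "X-Object-Meta-Note": "y" * 2000}
from_old = read_chunked_metadata(store(meta, 254))
from_new = read_chunked_metadata(store(meta, 65536))
```

Both layouts decode to the same dict, which is why files written
before this change stay readable.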

On non-XFS filesystems, this is no worse than what came before:

ext4 has a limit of one block (typically 4 KiB) for all xattrs (names
and values) taken together [1], so this change slightly increases the
amount of Swift metadata that can be stored on ext4.

ZFS let me store an xattr with an 8 MiB value, so that's plenty. It'll
probably go further, but I stopped there.

[1] https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Extended_Attributes

Change-Id: Ie22db08ac0050eda693de4c30d4bc0d620e7f7d4
2014-12-05 15:52:58 -08:00
account Correct misspelled words 2014-11-25 15:44:30 +00:00
cli Fix the behavior of swift-ring-builder list_parts before rebalance 2014-12-06 02:44:59 +09:00
common Merge "Raise ValueError for offset on Timestamp over limit" 2014-12-04 18:44:12 +00:00
container Fix reclaim on deleted containers 2014-12-03 17:10:15 -08:00
locale Imported Translations from Transifex 2014-11-26 06:13:29 +00:00
obj Speed up reading and writing xattrs for object metadata 2014-12-05 15:52:58 -08:00
proxy Removing unused method: _remaining_items 2014-12-02 16:57:07 -05:00
__init__.py Make pbr a build-time only dependency 2013-10-29 12:29:49 -07:00