From 04a9379a62c2ab34a46f336e92be0f7ca7f54a70 Mon Sep 17 00:00:00 2001
From: Scott McClymont
Date: Tue, 26 Sep 2017 15:31:04 +0000
Subject: [PATCH] Updating multi-hash spec for revision

Change-Id: I1f25c8d49c2b117cefb09d7edab494d32fed4ddb
---
 specs/queens/approved/glance/multihash.rst | 237 +++++++++++++++++++++
 specs/untargeted/glance/multihash.rst      | 214 -------------------
 2 files changed, 237 insertions(+), 214 deletions(-)
 create mode 100644 specs/queens/approved/glance/multihash.rst
 delete mode 100644 specs/untargeted/glance/multihash.rst

diff --git a/specs/queens/approved/glance/multihash.rst b/specs/queens/approved/glance/multihash.rst
new file mode 100644
index 00000000..44cecbfd
--- /dev/null
+++ b/specs/queens/approved/glance/multihash.rst
@@ -0,0 +1,237 @@
+..
+   This work is licensed under a Creative Commons Attribution 3.0 Unported
+   License.
+
+   http://creativecommons.org/licenses/by/3.0/legalcode
+
+=============================
+Secure Hash Algorithm Support
+=============================
+
+https://blueprints.launchpad.net/glance/+spec/multihash
+
+This spec provides a future-proof mechanism for providing a
+secure SHA512 hash for each image.
+
+Problem description
+===================
+
+The hash of an image can be obtained from the ``checksum`` field in the
+image metadata. Currently, this checksum field holds an md5 hexdigest
+value. While md5 can still be used to verify data integrity against
+unintentional corruption, it should not be used to verify data integrity
+against tampering. A more collision-resistant hashing algorithm should
+be used for broader data integrity coverage.
+
+Proposed change
+===============
+
+This spec proposes adding two new fields to the image metadata, with the
+keys ``os_hash_algo`` and ``os_hash_value``. ``os_hash_value`` will hold a
+hash formatted as a sha512 hexdigest, obtained using Python's hashlib
+library, and ``os_hash_algo`` will hold the name of the algorithm used to
+produce it (initially ``sha512``). The configuration will default to
+sha512 for now, but this change allows the permitted hashing algorithms to
+be altered at any time. The ``os_hash_algo`` field also allows users to
+pass the algorithm name directly to their Python code. MD5 calculation and
+the ``checksum`` field will be kept for backwards compatibility.
+
+Adding a new image will set all of the ``checksum``, ``os_hash_algo``, and
+``os_hash_value`` fields.
+
+Requesting the image metadata of an image without ``os_hash_algo`` and
+``os_hash_value`` set will result in a null for those fields in the
+metadata response.
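+
+For illustration, the following is a minimal sketch of how both hashes
+could be calculated during an upload. It is a sketch only: the
+``CONF_HASHING_ALGORITHM`` stand-in for the configuration option, the
+helper name, and the 64KiB chunk size are illustrative, not part of this
+spec.
+
+.. code:: python
+
+    import hashlib
+
+    # Illustrative stand-in for the proposed configuration option.
+    CONF_HASHING_ALGORITHM = "sha512"
+
+
+    def calculate_hashes(image_file):
+        """Return (checksum, os_hash_algo, os_hash_value) for image data."""
+        md5 = hashlib.md5()
+        # hashlib.new() takes the algorithm by name, which is what lets
+        # the configured algorithm change without any API change.
+        secure = hashlib.new(CONF_HASHING_ALGORITHM)
+        for chunk in iter(lambda: image_file.read(65536), b''):
+            md5.update(chunk)
+            secure.update(chunk)
+        return md5.hexdigest(), CONF_HASHING_ALGORITHM, secure.hexdigest()
+
+Because the algorithm name travels with the image metadata, a consumer can
+simply call ``hashlib.new(metadata['os_hash_algo'])`` without knowing in
+advance which algorithm the deploying cloud selected.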
+
+Alternatives
+------------
+
+1. Update the value of the ``checksum`` field with the hexdigest of a
+   different hashing algorithm. This will likely break client-side
+   checksum verification code.
+
+2. Add only the new ``os_hash_value`` field. While it is a simpler change,
+   it is a less future-proof approach because it ties the sha512 algorithm
+   to the ``os_hash_value`` property in the same way that the md5 algorithm
+   is tied to the current ``checksum`` property. When a sha512 collision is
+   produced, a new spec will be needed to add yet another checksum field.
+
+   The approach described in this spec of using an additional
+   ``os_hash_algo`` field will allow changing the hash algorithm without
+   adding a new field or breaking the API contract. All we'll have to do is
+   update the algorithm used and put its name in the ``os_hash_algo``
+   property, and then any image consumer will know how to interpret what's
+   in the ``os_hash_value`` property.
+
+   It is worth noting that the Glance implementation of image signature
+   validation gives us a precedent for having a value in one property
+   (``img_signature``) and the name of the algorithm used in another
+   property (``img_signature_hash_method``) [1]_. Thus the proposal in this
+   spec is completely consistent with a related Glance workflow.
+
+3. Use the hexdigest of a different algorithm. Tests using hashlib's
+   sha256 and sha512 algorithms consistently yielded faster times for
+   sha512 (SEE: Performance impact). The implementation of the sha512
+   algorithm within hashlib demonstrates reasonable performance and
+   should be considered collision-resistant for many years.
+
+4. Implement a single ``multihash`` field in the image metadata holding
+   the sha512 value. The single field would contain the coded algorithm
+   name and the hash [0]_. The advantage is that only one field is added
+   to the data model and schema. However, having only one field would make
+   the hash harder to use quickly, as the end user would be required to
+   decode which algorithm is being used before verifying the hash.
+
+Data model impact
+-----------------
+
+* Triggers: None
+* Expand: Two new columns (``os_hash_algo`` & ``os_hash_value``) with type
+  string/varchar (defaulting to null) will be added to the images table,
+  along with an index similar to the one on ``checksum``. A sketch of the
+  expand step follows this list.
+* Migrate: None
+* Contract: None
+* Conflicts: None
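+
+As a rough illustration only (assuming an alembic-style expand script;
+the column sizes, index name, and bare ``upgrade()`` entry point are
+illustrative, not part of this spec), the expand step might look like:
+
+.. code:: python
+
+    import sqlalchemy as sa
+    from alembic import op
+
+
+    def upgrade():
+        # Both columns default to null so that existing rows stay valid.
+        op.add_column('images',
+                      sa.Column('os_hash_algo', sa.String(64),
+                                nullable=True))
+        # A sha512 hexdigest is 128 characters long.
+        op.add_column('images',
+                      sa.Column('os_hash_value', sa.String(128),
+                                nullable=True))
+        # Mirror the existing index on ``checksum``.
+        op.create_index('os_hash_value_image_idx', 'images',
+                        ['os_hash_value'])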
+
+REST API impact
+---------------
+
+Two new fields (``os_hash_algo`` & ``os_hash_value``) will be returned in
+responses to requests for the image metadata.
+
+Security impact
+---------------
+
+A new, more collision-resistant hash will be returned in addition to the
+current checksum.
+
+Notification impact
+-------------------
+
+None
+
+Other end user impact
+---------------------
+
+None
+
+Performance impact
+------------------
+
+New image uploads (and the first download of previously uploaded images)
+will take an additional amount of time to calculate the sha512 checksum:
+
+5G binary blob of random data:
+
+* md5: ~9s
+* sha256: ~22s
+* sha512: ~14s
+* 1Gbps line speed upload: 42s
+
+1.5G Ubuntu 16.04 cloud image:
+
+* md5: ~2.9s
+* sha256: ~7.2s
+* sha512: ~4.6s
+* 1Gbps line speed upload: 12s
+
+555M Debian 8 cloud image:
+
+* md5: ~1.0s
+* sha256: ~2.5s
+* sha512: ~1.6s
+* 1Gbps line speed upload: 4.5s
+
+Note: sha512 has been selected and should have minimal impact on overall
+upload time relative to the upload process as a whole.
+
+Test Code:
+
+.. code:: python
+
+    #!/usr/bin/env python3
+
+    import hashlib
+    import time
+
+
+    def runtime(f):
+        def wrapper(*args, **kwargs):
+            start = time.time()
+            f(*args, **kwargs)
+            print("Time elapsed: %s" % (time.time() - start))
+        return wrapper
+
+
+    @runtime
+    def checksum(filename, algorithm):
+        algorithms = {"md5": hashlib.md5,
+                      "256": hashlib.sha256,
+                      "512": hashlib.sha512,
+                      }
+        with open(filename, "rb") as f:
+            m = algorithms[algorithm]()
+            # The file is opened in binary mode, so the end-of-file
+            # sentinel must be b'' rather than ''.
+            for chunk in iter(lambda: f.read(65536), b''):
+                m.update(chunk)
+            print("%s: %s" % (algorithm, m.hexdigest()))
+
+    checksum("fake.img", "512")
+    checksum("fake.img", "256")
+    checksum("fake.img", "md5")
+    checksum("fake.img", "256")
+    checksum("fake.img", "md5")
+    checksum("fake.img", "512")
+
+
+Developer impact
+----------------
+
+Any future checksum verification code should use the ``os_hash_algo`` &
+``os_hash_value`` fields, falling back to the ``checksum`` field when they
+are not populated.
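+
+A sketch of what such verification logic might look like (the function
+and key names here are illustrative, not a definitive implementation):
+
+.. code:: python
+
+    import hashlib
+
+
+    def verify_image_data(image_meta, image_file):
+        """Verify image data, preferring the secure hash fields."""
+        algo = image_meta.get('os_hash_algo')
+        expected = image_meta.get('os_hash_value')
+        if not (algo and expected):
+            # Image predates this change; fall back to the md5 checksum.
+            algo, expected = 'md5', image_meta['checksum']
+        m = hashlib.new(algo)
+        for chunk in iter(lambda: image_file.read(65536), b''):
+            m.update(chunk)
+        if m.hexdigest() != expected:
+            raise ValueError("%s mismatch for image %s" %
+                             (algo, image_meta.get('id')))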
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee: Scott McClymont
+
+
+Other contributors:
+
+Work Items
+----------
+
+* Add tests
+
+* Update the db to add the ``os_hash_algo`` & ``os_hash_value`` columns to
+  the images table (including expand, migrate, contract, and monolith code)
+
+* Update the sections of code that calculate the ``checksum`` to also
+  calculate ``os_hash_algo`` & ``os_hash_value`` (includes calculation on
+  upload)
+
+* Discuss updating ``os_hash_algo`` & ``os_hash_value`` on download when
+  the values are null
+
+* Update internal checksum verification code to use the ``os_hash_value``
+  field and fall back to the ``checksum`` field when not present
+
+* Update glance client
+
+* Update docs
+
+* Add the ``os_hash_algo`` value to the discovery API if the API is ready
+
+Dependencies
+============
+
+None
+
+Testing
+=======
+
+Update the tests to verify proper population of the image properties
+
+
+References
+==========
+
+.. [0] http://multiformats.io/multihash/
+.. [1] https://docs.openstack.org/glance/latest/user/signature.html
diff --git a/specs/untargeted/glance/multihash.rst b/specs/untargeted/glance/multihash.rst
deleted file mode 100644
index f1a8a61b..00000000
--- a/specs/untargeted/glance/multihash.rst
+++ /dev/null
@@ -1,214 +0,0 @@
-..
-   This work is licensed under a Creative Commons Attribution 3.0 Unported
-   License.
-
-   http://creativecommons.org/licenses/by/3.0/legalcode
-
-=================
-Multihash Support
-=================
-
-https://blueprints.launchpad.net/glance/+spec/multihash
-
-This spec provides a future-proof mechanism for providing a
-self-describing hash for each image.
-
-Problem description
-===================
-
-The hash of an image can be obtained from the ``checksum`` field in the
-image metadata. Currently, this checksum field holds a md5 hexdigest
-value. While md5 can still be used to verify data integrity against
-unintentional corruption, it should not be used to verify data integrity
-against tampering. A more collision-resistant hashing algorithm should
-be used for broader data integrity coverage.
-
-Proposed change
-===============
-
-This spec proposes adding a new field to the image metadata. The
-proposed key name for this field is ``multihash``. The proposed value is
-a multihash [0]_ formatted sha512 hexdigest (codec:sha2-512, code:0x13).
-The hexdigest will be obtained using the current hashlib library. The
-digest length and hash function type will be prepended as described in
-the documentation [0]_ for the multihash format.
-
-Adding a new image will set both the ``checksum`` and ``multihash``
-fields.
-
-Requesting the image metadata of an image without the ``multihash``
-value set, will result in a null for the ``multihash`` field in the
-metadata response.
-
-Downloading an image without the ``multihash`` value set, will result in
-the value being populated upon completion.
-
-Uploading an image to an existing image location will result in the
-``multihash`` value being set / updated.
-
-Alternatives
-------------
-
-1. Update the value of the ``checksum`` field with the hexdigest of a
-   different hashing algorithm. This will likely break client-side
-   checksum verification code.
-
-2. Add a new non-multihash checksum field. This is a less future proof
-   approach that will likely result in a new spec in a few years to add
-   yet another checksum field. Using the multihash format allows
-   changing the algorithm without adding a new field or breaking the API
-   contract.
-
-3. Use the hexdigest of a different algorithm. Tests using hashlib's
-   sha256 and sha512 algorithms consistently yielded faster times for
-   sha512 (SEE: Performance impact). The implementation of the sha512
-   algorithm within hashlib demonstrates reasonable performance and
-   should be considered collision-resistant for many years.
-
-Data model impact
------------------
-
-* Triggers: None
-* Expand: A new column (``multihash``) with type string/varchar
-  (defaulting to null) will be added to the images table along with an
-  index similar to the one on ``checksum``
-* Migrate: None
-* Contract: None
-* Conflicts: None
-
-REST API impact
----------------
-
-A new field (``multihash``) will exist in requests for the image
-metadata.
-
-Security impact
----------------
-
-A new more collision-resistant hash will be returned in addition to the
-current checksum.
-
-Notification impact
--------------------
-
-None
-
-Other end user impact
----------------------
-
-None
-
-Performance impact
-------------------
-
-New image uploads (and the first download of previously uploaded images)
-will take an additional amount of time to calculate the sha512 checksum:
-
-5G binary blob of random data:
-* md5: ~9s
-* sha256: ~22s
-* sha512: ~14s
-* 1Gbps line speed upload: 42s
-
-1.5G Ubuntu 16.04 cloud image:
-* md5: ~2.9s
-* sha256: ~7.2s
-* sha512: ~4.6s
-* 1Gbps line speed upload: 12s
-
-555M Debian 8 cloud image:
-* md5: ~1.0
-* sha256: ~2.5
-* sha512: ~1.6
-* 1Gbps line speed upload: 4.5s
-
-Test Code:
-
-.. code:: python
-
-    #!/usr/bin/env python3
-
-    import hashlib
-    import time
-
-
-    def runtime(f):
-        def wrapper(*args, **kwargs):
-            start = time.time()
-            f(*args, **kwargs)
-            print("Time elapsed: %s" % (time.time() - start))
-        return wrapper
-
-
-    @runtime
-    def checksum(filename, algorithm):
-        algorithms = {"md5": hashlib.md5,
-                      "256": hashlib.sha256,
-                      "512": hashlib.sha512,
-                      }
-        with open(filename, "rb") as f:
-            m = algorithms[algorithm]()
-            for chunk in iter(lambda: f.read(65536), ''):
-                m.update(chunk)
-            print("%s: %s" % (algorithm, m.hexdigest()))
-
-    checksum("fake.img", "512")
-    checksum("fake.img", "256")
-    checksum("fake.img", "md5")
-    checksum("fake.img", "256")
-    checksum("fake.img", "md5")
-    checksum("fake.img", "512")
-
-
-Developer impact
-----------------
-
-Any future checksum verification code should use the ``multihash`` field.
-
-
-Implementation
-==============
-
-Assignee(s)
------------
-
-Primary assignee: unassigned
-
-
-Other contributors:
-
-Work Items
-----------
-
-* Add tests
-
-* Update the db to add ``multihash`` column to the images table (including
-  expand, migrate, contract, and monolith code)
-
-* Update the sections of code that calculate the ``checksum`` to also
-  calculate ``multihash`` (includes calculation on upload)
-
-* Update ``multihash`` on image download when value is null
-
-* Update internal checksum verification code to use the ``multihash``
-  field and fallback to ``checksum`` field when not present
-
-* Update glance client
-
-* Update docs
-
-Dependencies
-============
-
-The multihash specification [0]_
-
-Testing
-=======
-
-Update the tests to verify multihash
-
-
-References
-==========
-
-.. [0] http://multiformats.io/multihash/