========
Manifest
========

============
Introduction
============

As described in ../deduplication.rst, adding transparent redirect
machinery to RADOS would enable a more capable tiering solution
than RADOS currently has with "cache/tiering".

See ../deduplication.rst

At a high level, each object has a piece of metadata embedded in
the object_info_t which can map subsets of the object data payload
to (refcounted) objects in other pools.

This document exists to detail:

1. Manifest data structures
2. RADOS operations for manipulating manifests
3. How those operations interact with other features like snapshots

Status and Future Work
======================

At the moment, initial versions of a manifest data structure along
with IO path support and RADOS control operations exist. This section
is meant to outline next steps.

Future work will proceed in the following order: cleanups, then
snapshots, then cache/tiering, and finally RBD support.

Cleanups
--------

There are some rough edges we may want to improve:

* object_manifest_t: appears to duplicate the offset for each chunk
  between the chunk_map key and the chunk_info_t, with different widths
  (uint64_t in the map key vs uint32_t in chunk_info_t).
* SET_REDIRECT: should perhaps create the object if it doesn't exist.
* SET_CHUNK:

  * Appears to trigger a new clone because user_modify gets set in
    do_osd_ops. This probably isn't desirable; see the Snapshots section
    below for some options on how to mix these operations with
    snapshots generally.
    TODO: modify SET_CHUNK to set user_modify = false so that it does
    not trigger a new clone.
  * Appears to assume that the corresponding section of the object
    does not exist (it sets FLAG_MISSING) but does not check whether the
    corresponding extent already exists in the object.
    TODO: set FLAG_DIRTY if the corresponding section of the object exists.
  * Appears to clear the manifest unconditionally if the object is not
    chunked; that's probably wrong. We should return an error if it's a
    REDIRECT.
    TODO: add the following lines ::

      case CEPH_OSD_OP_SET_CHUNK:
        if (oi.manifest.is_redirect()) {
          result = -EINVAL;
          goto fail;
        }

* TIER_PROMOTE:

  * Document control flow in this file. The CHUNKED case is
    particularly unintuitive and could use some explanation.

    - Control flow

      Based on the chunk_map in a head object (each chunk's state can be
      dirty or missing), the chunked objects backing the head object can
      be retrieved when TIER_PROMOTE is invoked. After promotion
      completes, all chunks in the object should be in the clean state.

      One thing to note is the snapshotted manifest object.
      Suppose that there are clones 10(9) and 5(4). If we want to read
      clone id 4, we need to read chunks according to the chunk_map for
      clone id 4. Because the base tier keeps the object_info_t in the
      case of a manifest object, reads with a snapshot context can work:
      if clone id 4 is dirty or clean, the chunks will be read directly
      from the base tier; if clone id 4 is missing, the chunks will be
      copied from the lower tier, as in the sketch below.
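
      A simplified sketch of that flow (purely illustrative pseudocode,
      not the actual OSD code; copy_from_lower_tier is a hypothetical
      helper)::

        // Hypothetical outline of the CHUNKED promote path: fetch any
        // missing chunks from the lower tier, then mark everything clean.
        void promote_chunks(object_manifest_t& manifest) {
          for (auto& [offset, chunk] : manifest.chunk_map) {
            if (chunk.flags & chunk_info_t::FLAG_MISSING) {
              // Not resident in the base tier: copy the extent up from
              // the (refcounted) chunk object in the lower tier.
              copy_from_lower_tier(chunk.oid, chunk.offset, chunk.length);
            }
            // Dirty or clean chunks are already readable from the base
            // tier; after promotion all chunks should be clean.
            chunk.flags = static_cast<chunk_info_t::cflag_t>(
              chunk.flags & ~(chunk_info_t::FLAG_DIRTY |
                              chunk_info_t::FLAG_MISSING));
          }
        }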

  * SET_REDIRECT clears the contents of the object. PROMOTE appears
    to copy them back in, but does not unset the redirect or clear the
    reference. Doesn't this violate the invariant? In particular, as
    long as the redirect is set, it appears that all operations will be
    proxied even after the promote, defeating the purpose.
    TODO: to prevent this, clear the redirect as part of promote.
  * For a chunked manifest, we appear to flush prior to promoting;
    is that actually desirable?
    TODO: the initial thought behind this is that we don't know the
    fingerprint oid before flushing the chunk, so before promoting the
    chunk we need to figure out the fingerprint oid. However, to avoid
    unnecessary dedup work when the user specifically asks for the data
    to be resident in the base pool, TIER_DO_NOT_DEDUP_PIN can be used.

Plans:

1. https://github.com/ceph/ceph/pull/29283
   (This PR adds basic snapshotted manifest objects by using each
   clone's chunk_info_t.)
2. Modify SET_CHUNK not to trigger a clone.
3. Cleanups regarding existing manifest objects as discussed here.

Cache/Tiering
-------------

There already exists a cache/tiering mechanism based on whiteouts.
One goal here should ultimately be for this manifest machinery to
provide a complete replacement.

See cache-pool.rst

The manifest machinery already shares some code paths with the
existing cache/tiering code, mainly stat_flush.

In no particular order, here's an incomplete list of things that need
to be wired up to provide feature parity:

* Online object access information: the OSD already has pool configs
  for maintaining bloom filters which provide estimates of access
  recency for objects. We probably need to modify this to permit
  hitset maintenance for a normal pool -- there are already
  CEPH_OSD_OP_PG_HITSET* interfaces for querying them.
* Tiering agent: the OSD already has a background tiering agent which
  would need to be modified to instead flush and evict using
  manifests.

Plans:

* Rework the tiering agent for manifest objects

  - Flush method (passive vs active)

    The current flush operation for manifest objects uses an active
    flush model: read the content, generate the fingerprint, then write
    and set. The existing tiering method, by contrast, uses copy-get
    and copy-from to flush the object (a passive model).
    The active flush model has two advantages over the existing method.
    First, it reduces the number of operations needed per flush::

      Active:  read -> generate fingerprint -> write and set
      Passive: request copy_from -> copy_get -> read
               -> generate fingerprint -> write and set

    Second, by managing flush-related I/O on the base tier, we can
    control the overall I/O on the cluster appropriately. A sketch of
    the active model follows.
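
    A rough sketch of the active model from the agent's point of view
    (hypothetical helper names; fingerprint() stands in for whatever
    sha1/sha256/sha512 implementation the pool is configured with)::

      // Active flush: the base tier reads the dirty extent itself,
      // derives the fingerprint oid, writes the chunk, and remaps it.
      void active_flush(librados::IoCtx& base, librados::IoCtx& chunk_pool,
                        const std::string& oid,
                        uint64_t off, uint64_t len) {
        librados::bufferlist bl;
        base.read(oid, bl, len, off);            // 1. read the content
        std::string fp_oid = fingerprint(bl);    // 2. generate fingerprint
        chunk_pool.write_full(fp_oid, bl);       // 3. write the chunk
        librados::ObjectWriteOperation op;       // 4. set the chunk mapping
        op.set_chunk(off, len, chunk_pool, fp_oid, 0,
                     CEPH_OSD_OP_FLAG_WITH_REFERENCE);
        base.operate(oid, &op);
      }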

  - Tier agent

    Because the active flush model is used, the tier agent does more
    work than before (read, generate fingerprint, write and set, versus
    simply sending copy_from). So the concern is parallelism and
    performance, since a single agent thread will probably struggle to
    achieve high throughput. To this end, the idea is to use N threads
    working on N PGs. The existing agent threads (agent_work) can
    manage the list of dirty objects (pgbackend->list_partial) and
    flush the dirty objects in parallel, roughly as sketched below.
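
    A very rough illustration of the N-threads-on-N-PGs idea (all names
    here are hypothetical; a real implementation would be integrated
    with agent_work and pgbackend->list_partial)::

      #include <thread>
      #include <vector>

      // One worker per PG: each walks that PG's dirty-object list and
      // flushes the objects, so PGs are processed in parallel.
      void parallel_agent_flush(std::vector<PG*>& pgs) {
        std::vector<std::thread> workers;
        for (PG* pg : pgs) {
          workers.emplace_back([pg] {
            for (const auto& obj : list_dirty_objects(pg)) // hypothetical
              active_flush_object(pg, obj);                // hypothetical
          });
        }
        for (auto& w : workers)
          w.join();
      }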

* Use existing features regarding the cache flush policy, such as
  hitset, age, and ratio

  - hitset
  - age, ratio, bytes

* Add tiering modes to manifest-tiering

  - Writeback
  - Read-only

Snapshots
---------

Fundamentally, I think we need to be able to manipulate the manifest
status of clones because we want to be able to dynamically promote,
flush (if the state was dirty when the clone was created), and evict
clones.

As such, I think we want to add a new op type which permits writes on
clones.

See snaps.rst for a rundown of the librados snapshot system and OSD
support details. I'd like to call out one particular data structure
we may want to exploit.

We maintain a clone_overlap mapping which gives, between two adjacent
clones, the byte ranges which are identical. It's used during recovery
and cache/tiering promotion to ensure that the ObjectStore ends up
with the right byte range sharing via clone. We probably want to
ensure that adjacent clones agree on chunk mappings for shared
regions.

* Implementation

  My thought is that set_chunk and set_redirect shouldn't set
  user_modify, to indicate that they should not trigger a new clone
  (we probably also need a flag like cache_evict). Also, to use a
  manifest with snapshots, set_chunk should be applied when the
  manifest object is created. As a result, the overall procedure is:

  1. Write the object A
  2. SET_CHUNK (offset: 0 ~ 4)
  3. SET_CHUNK (offset: 8 ~ 12)
  4. SET_CHUNK (offset: 16 ~ 20)
  5. Create a snapshot
  6. Write the object A
  7. Create a snapshot

  When a snapshot is created between steps 5 and 6, clones prior to
  head are dirty if the head object is dirty (and the flush should
  start from the old clones). Also, there are two use cases.

  Use case 1:

  1. Create object A
  2. Write Full
  3. SET_CHUNK
  4. SET_CHUNK
  5. Write the object A
  6. Create a snapshot
  7. Write the object A
  8. Create a snapshot

  Use case 2:

  1. Create object A
  2. Write Full
  3. SET_CHUNK
  4. SET_CHUNK
  5. Write the object A
  6. Create a snapshot ABC
  7. Write the object A
  8. SET_CHUNK to the snapshot ABC

  The state of the chunk_map in a clone should be MISSING after
  eviction is done. If we want to read clones, we have to look at the
  chunk_info_t according to the given snap_id to find out which chunk
  is needed. Therefore, the clone needs to be mutable; to do that, we
  probably need a new op type. If a snap read occurs, the chunks can be
  read from snapshots without doing a rollback. Such reads must return
  the contents of the object at the time the snapshot was taken.

  - Clone_overlap

    clone_overlap is an optimization that ensures that recovery
    preserves the underlying ObjectStore-level byte range sharing
    inherent in clone. Therefore, to support snapshots with manifest
    objects, we should ensure correct interaction between clone_overlap
    and manifest objects. Here is a basic example describing how it
    works::

      HEAD:           [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone 10(9, 7): [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone 6(6, 5):  [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone 4(2, 1):  [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone_overlap:  {10: [1 ~ 10, 20 ~ 10], 6: [10 ~ 10], 4: [1 ~ 10, 10 ~ 20]}

    At this point, if we SET_CHUNK clone 6 at 20 ~ 10 with chunk aaa,
    the object will become as below::

      HEAD:           [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone 10(9, 7): [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 nochunk]
      clone 6(6, 5):  [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 aaa]
      clone 4(2, 1):  [1 ~ 10 nochunk] [10 ~ 10 nochunk] [20 ~ 10 aaa]
      clone_overlap:  {10: [1 ~ 10, 20 ~ 10], 6: [10 ~ 10], 4: [1 ~ 10]}

  - Leak management

    Fixing a reference leak via the dedup tool also needs to be
    reconsidered, because the dedup tool finds leaks by using the head
    object. If the head object has snapshots, it should search the
    chunk_info_t in the clones as well.

* Plans

  1. Separate evict from flush (manifest tier uses flush+evict)
  2. Promote, flush, and evict clones for snapshotted manifest objects
     (with a new op)
  3. Scrub test for manifest objects and the dedup tool

Interaction with RBD
--------------------

TODO

Data Structures
===============

Each object contains an object_manifest_t embedded within the
object_info_t (see osd_types.h):

::

  struct object_manifest_t {
    enum {
      TYPE_NONE = 0,
      TYPE_REDIRECT = 1,
      TYPE_CHUNKED = 2,
    };
    uint8_t type;  // redirect, chunked, ...
    hobject_t redirect_target;
    std::map<uint64_t, chunk_info_t> chunk_map;
  };

The type enum reflects three possible states an object can be in:

1. TYPE_NONE: normal RADOS object
2. TYPE_REDIRECT: object payload is backed by a single object
   specified by redirect_target
3. TYPE_CHUNKED: object payload is distributed among objects with
   size and offset specified by the chunk_map. chunk_map maps
   the offset of the chunk to a chunk_info_t (shown below), further
   specifying the length, target oid, and flags.

::

  struct chunk_info_t {
    typedef enum {
      FLAG_DIRTY = 1,
      FLAG_MISSING = 2,
      FLAG_HAS_REFERENCE = 4,
      FLAG_HAS_FINGERPRINT = 8,
    } cflag_t;
    uint32_t offset;
    uint32_t length;
    hobject_t oid;
    cflag_t flags;  // FLAG_*
  };
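
For illustration, a hypothetical TYPE_CHUNKED object whose 12 KiB
payload has been deduplicated into three 4 KiB chunks might carry a
manifest like the following (the fingerprint oids, flags, and sizes are
invented for this example)::

  type: TYPE_CHUNKED
  chunk_map:
    0    -> { offset: 0,    length: 4096, oid: <fp_aaa...>,
              flags: FLAG_HAS_FINGERPRINT | FLAG_HAS_REFERENCE }
    4096 -> { offset: 4096, length: 4096, oid: <fp_bbb...>,
              flags: FLAG_HAS_FINGERPRINT | FLAG_HAS_REFERENCE }
    8192 -> { offset: 8192, length: 4096, oid: <fp_ccc...>,
              flags: FLAG_DIRTY }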

Request Handling
================

Similarly to cache/tiering, the initial touchpoint is
maybe_handle_manifest_detail.

For manifest operations listed below, we return NOOP and continue onto
dedicated handling within do_osd_ops.

For redirect objects which haven't been promoted (apparently oi.size >
0 indicates that it's present?) we proxy reads and writes.

For reads on TYPE_CHUNKED, if can_proxy_chunked_read (basically, all
of the ops are reads of extents in the object_manifest_t chunk_map),
we proxy requests to those objects.
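
A condensed sketch of that dispatch (illustrative only, not the actual
function; proxy_to, proxy_chunked_read, and the surrounding names are
stand-ins)::

  // Rough shape of the decision tree described above.
  switch (obc->obs.oi.manifest.type) {
  case object_manifest_t::TYPE_REDIRECT:
    if (!promoted)  // apparently oi.size > 0 indicates presence
      return proxy_to(redirect_target);  // proxy reads and writes
    break;
  case object_manifest_t::TYPE_CHUNKED:
    if (op_is_read && can_proxy_chunked_read(op))
      return proxy_chunked_read(op);     // proxy to the chunk objects
    break;
  default:
    break;                               // TYPE_NONE: normal object
  }
  return NOOP;  // manifest ops fall through to do_osd_ops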

RADOS Interface
===============

To set up deduplication, you must have two pools: one will act as the
base pool and the other will act as the chunk pool. The base pool
needs to be configured with the fingerprint_algorithm option as
follows. ::

  ceph osd pool set $BASE_POOL fingerprint_algorithm sha1|sha256|sha512
  --yes-i-really-mean-it

1. Create objects. ::

     rados -p base_pool put foo ./foo
     rados -p chunk_pool put foo-chunk ./foo-chunk

2. Make a manifest object. ::

     rados -p base_pool set-chunk foo $START_OFFSET $END_OFFSET --target-pool
     chunk_pool foo-chunk $START_OFFSET --with-reference

Operations:

* set-redirect

  Set a redirection between a base_object in the base_pool and a
  target_object in the target_pool.
  A redirected object will forward all operations from the client to
  the target_object. ::

    void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
                      uint64_t tgt_version, int flag = 0);

    rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
    <target_object>

  Returns ENOENT if the object does not exist (TODO: why?).
  Returns EINVAL if the object already is a redirect.

  Takes a reference to the target as part of the operation; can
  possibly leak a ref if the acting set resets and the client dies
  between taking the ref and recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side
  effect.

  At the top of do_osd_ops, does not set user_modify.

  This operation is not a user mutation and does not trigger a clone to
  be created.

  The purpose of set_redirect is twofold:

  1. Redirect all operations to the target object (like a proxy)
  2. Cache when tier_promote is called (the redirect will be cleared at
     this time)

* set-chunk

  Set the chunk-offset in a source_object to make a link between it and
  a target_object. ::

    void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
                   std::string tgt_oid, uint64_t tgt_offset, int flag = 0);

    rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
    <caspool> <target_object> <target-offset>

  Returns ENOENT if the object does not exist (TODO: why?).
  Returns EINVAL if the object already is a redirect.
  Returns EINVAL on an ill-formed parameter buffer.
  Returns ENOTSUPP if existing mapped chunks overlap with the new chunk
  mapping.

  Takes references to targets as part of the operation; can possibly
  leak refs if the acting set resets and the client dies between taking
  the ref and recording the redirect.

  Truncates the object, clears omap, and clears xattrs as a side
  effect.

  This operation is not a user mutation and does not trigger a clone to
  be created.

  TODO: SET_CHUNK appears to clear the manifest unconditionally if it's
  not chunked. That seems wrong. ::

    if (!oi.manifest.is_chunked()) {
      oi.manifest.clear();
    }
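
  As a rough usage sketch of the API above (assuming IoCtx handles for
  the pools from the setup section and ignoring error handling;
  CEPH_OSD_OP_FLAG_WITH_REFERENCE corresponds to the CLI's
  --with-reference)::

    #include <rados/librados.hpp>

    // Map bytes [0, 4096) of "foo" in the base pool onto the same
    // extent of "foo-chunk" in the chunk pool, taking a reference on
    // the target chunk as part of the operation.
    int make_chunk(librados::IoCtx& base_ioctx, librados::IoCtx& chunk_ioctx)
    {
      librados::ObjectWriteOperation op;
      op.set_chunk(0, 4096, chunk_ioctx, "foo-chunk", 0,
                   CEPH_OSD_OP_FLAG_WITH_REFERENCE);
      return base_ioctx.operate("foo", &op);
    }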

* tier-promote

  Promote the object (including chunks). ::

    void tier_promote();

    rados -p base_pool tier-promote <obj-name>

  Returns ENOENT if the object does not exist.
  Returns EINVAL if the object already is a redirect.

  For a chunked manifest, copies all chunks to the head.

  For a redirect manifest, copies the data to the head.

  TODO: to replace a redirect or dedup'd chunk with a local copy
  atomically, the redirect will be cleared as part of the promote.

  Does not clear the manifest.

  Note: for a chunked manifest, calls start_copy on itself and uses the
  existing read proxy machinery to proxy the reads.

  TODO: use TIER_DO_NOT_DEDUP_PIN to avoid unnecessary dedup work.

  - Two use cases:

    Case a:

    1. Create objects A and B
    2. Set-chunk A to B
    3. Write A
    4. TIER_DO_NOT_DEDUP_PIN
    5. Flush does not occur

    Case b:

    1. Create objects A and B
    2. Set-chunk A to B
    3. TIER_DO_NOT_DEDUP_PIN
    4. Promote A
    5. Write A
    6. Flush does not occur

  TODO: free the old fingerprint oid earlier. Consider::

    There is HEAD: [1-10 manifest: aaa, clean, size is 20]
    Then, we write the region 5 ~ 15:
    HEAD: [size is 20] (1-10 is not in the manifest)
    Then, we write the region 6 ~ 15:
    HEAD: [size is 20] (1-10 is not in the manifest)
    Then, we write the region 7 ~ 15.

  If the tiering agent wants to dedup 1-10 because it is now cold, it
  can use a read and a set-chunk::

    Read 1-10 and compute the fingerprint
    HEAD: [1-10 manifest: ddd, clean, size is 20]

  This way, we shorten the lifetime of the aaa dedup target object,
  freeing space earlier.

  At the top of do_osd_ops, does not set user_modify.

* unset-manifest

  Unset the manifest info in an object that has a manifest. ::

    void unset_manifest();

    rados -p base_pool unset-manifest <obj-name>

  Clears manifest chunks or redirect. Lazily releases references; may
  leak.

  do_osd_ops seems not to include it in the user_modify=false
  whitelist, and so it will trigger a snapshot. Note that this is true
  even for a redirect, though SET_REDIRECT does not flip user_modify.

* tier-flush

  Flush the object, which has chunks, to the chunk pool. ::

    void tier_flush();

    rados -p base_pool tier-flush <obj-name>

  Included in the user_modify=false whitelist; does not trigger a
  clone.

Dedup tool
==========

The dedup tool has two features: finding an optimal chunk offset for
dedup chunking, and fixing the reference count (see ./refcount.rst).

* Find an optimal chunk offset

  a. Fixed chunk

     To find a good fixed chunk length, you need to run the following
     command many times while changing the chunk_size. ::

       ceph-dedup-tool --op estimate --pool $POOL --chunk-size chunk_size
       --chunk-algorithm fixed --fingerprint-algorithm sha1|sha256|sha512

  b. Rabin chunk (Rabin-Karp algorithm)

     The Rabin-Karp algorithm is a string-searching algorithm based on
     a rolling hash, but a rolling hash alone is not enough for
     deduplication because we don't know the chunk boundaries. So we
     need content-based slicing using a rolling hash for
     content-defined chunking.
     The current implementation uses the simplest approach: look for
     chunk boundaries by inspecting the rolling hash for a pattern
     (such as the lower N bits all being zero).

     - Usage

       Users who want to use deduplication need to find an ideal chunk
       offset. To do so, they should discover the optimal configuration
       for their data workload via ceph-dedup-tool. This chunking
       information will then be used for object chunking through the
       set-chunk API. ::

         ceph-dedup-tool --op estimate --pool $POOL --min-chunk min_size
         --chunk-algorithm rabin --fingerprint-algorithm rabin

       ceph-dedup-tool has many options for tuning rabin chunking.
       These are the rabin chunk options. ::

         --mod-prime <uint64_t>
         --rabin-prime <uint64_t>
         --pow <uint64_t>
         --chunk-mask-bit <uint32_t>
         --window-size <uint32_t>
         --min-chunk <uint32_t>
         --max-chunk <uint64_t>

       Users should refer to the following equation when using the
       above options for rabin chunking. ::

         rabin_hash =
           (rabin_hash * rabin_prime + new_byte - old_byte * pow) % (mod_prime)
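
     For illustration only (this is not the tool's actual code, and the
     parameter values are arbitrary examples), a minimal content-defined
     chunker following that recurrence might look like this::

       #include <cstdint>
       #include <string>
       #include <utility>
       #include <vector>

       // Split data into chunks whose boundaries fall where the rolling
       // hash has its lower chunk_mask_bit bits equal to zero.
       std::vector<std::pair<size_t, size_t>>   // (offset, length) pairs
       rabin_chunks(const std::string& data) {
         const uint64_t rabin_prime = 3;          // example parameter
         const uint64_t mod_prime = 1000000007;   // example parameter
         const uint32_t window_size = 48;
         const uint32_t chunk_mask_bit = 12;      // ~4 KiB average chunks
         const uint64_t mask = (1ull << chunk_mask_bit) - 1;

         // pow = rabin_prime^(window_size-1) % mod_prime; used to remove
         // the contribution of the byte leaving the window.
         uint64_t pow = 1;
         for (uint32_t i = 0; i + 1 < window_size; ++i)
           pow = (pow * rabin_prime) % mod_prime;

         std::vector<std::pair<size_t, size_t>> chunks;
         uint64_t hash = 0;
         size_t start = 0;
         for (size_t i = 0; i < data.size(); ++i) {
           uint64_t new_byte = (uint8_t)data[i];
           uint64_t old_byte =
             i >= window_size ? (uint8_t)data[i - window_size] : 0;
           // The recurrence above, with mod_prime added before the
           // subtraction to keep the value non-negative.
           hash = (hash * rabin_prime + new_byte +
                   mod_prime - (old_byte * pow) % mod_prime) % mod_prime;
           if ((hash & mask) == 0) {              // boundary found
             chunks.emplace_back(start, i + 1 - start);
             start = i + 1;
           }
         }
         if (start < data.size())
           chunks.emplace_back(start, data.size() - start);
         return chunks;
       }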

  c. Fixed chunk vs content-defined chunk

     Content-defined chunking may or may not be the optimal solution.
     For example, consider::

       Data chunk A : abcdefgabcdefgabcdefg

     Think about data chunk A's deduplication. The ideal chunk is
     "abcdefg" (offsets 1 to 7), so with fixed chunking, 7 is the
     optimal chunk length. But in the case of content-based slicing,
     the optimal chunk length might not be found (the dedup ratio will
     not be 100%), because we need to find optimal parameters such as
     the boundary bit, window size, and prime value, which is not as
     easy as with fixed chunking. Content-defined chunking is, however,
     very effective in the following case::

       Data chunk B : abcdefgabcdefgabcdefg

       Data chunk C : Tabcdefgabcdefgabcdefg

* Fix the reference count

  The key idea behind reference counting for dedup is to allow false
  positives: (manifest object (no ref), chunk object (has ref)) can
  happen instead of (manifest object (has ref), chunk 1 (no ref)).
  To fix such inconsistencies, ceph-dedup-tool supports chunk_scrub. ::

    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL