Device::binding_port
1. Use port_id instead of i as the variable name; it is more meaningful.
2. Start port_id from 1 instead of 0.
PoolAllocator::malloc
1. Clarify the relationship among buffer/chunk/block/memory_region with new
variable names.
2. Define each variable where it is first used.
RDMAConnectedSocketImpl::submit
1. Rename "need_reserve_bytes" to "wait_copy_len", which stands for the amount
of memory waiting to be copied into chunks.
2. Rename "copy_it" to "copy_start", which stands for the iterator where copying starts.
3. Rename "total" to "total_copied", which stands for the amount of memory already copied.
allocate huge page
1. Use "HUGE_PAGE_SIZE_2MB" for 2MB page alignment.
2. Use "ALIGN_TO_PAGE_2MB" to align the requested size to 2MB.
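A minimal sketch of how the two names could be defined. This is not the code
from the patch: the 2 MiB value follows from the name, while the round-up
behaviour of ALIGN_TO_PAGE_2MB and the helper huge_pages_needed are assumptions
made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// 2 MiB huge page size; the value is implied by the name.
constexpr std::size_t HUGE_PAGE_SIZE_2MB = 2 * 1024 * 1024;

// Assumed behaviour: round x up to the next multiple of 2 MiB.
// The mask trick is valid because the page size is a power of two.
#define ALIGN_TO_PAGE_2MB(x) \
  (((x) + (HUGE_PAGE_SIZE_2MB - 1)) & ~(HUGE_PAGE_SIZE_2MB - 1))

// Hypothetical helper: number of 2 MiB pages needed to hold `bytes`.
inline std::size_t huge_pages_needed(std::size_t bytes) {
  std::size_t aligned = ALIGN_TO_PAGE_2MB(bytes);
  assert(aligned % HUGE_PAGE_SIZE_2MB == 0);  // result is page aligned
  return aligned / HUGE_PAGE_SIZE_2MB;
}
```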
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
The parameter "block" points to the mem_info::chunks space, so the intent of
"reinterpret_cast<mem_info *>(block) - 1;" is not obvious.
Instead, take the mem_info::chunks address and subtract the member's offset
from the start of the struct to get the mem_info address.
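The address calculation can be sketched as below. The struct layouts are
simplified stand-ins (the real mem_info and Chunk carry more members), and the
helper name mem_info_of is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified stand-ins for illustration only.
struct Chunk {
  uint32_t offset;
  uint32_t bound;
};

struct mem_info {
  void *mr;         // stand-in for the memory region handle
  unsigned nbufs;   // number of chunks that follow
  Chunk chunks[1];  // placeholder length for this sketch
};

// Recover the owning mem_info from the address of its chunks member by
// subtracting the member's offset from the start of the struct.
inline mem_info *mem_info_of(void *block) {
  assert(block != nullptr);
  char *base = static_cast<char *>(block) - offsetof(mem_info, chunks);
  return reinterpret_cast<mem_info *>(base);
}
```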
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
When releasing a read chunk to the pool, chunk::offset and chunk::bound
should be reset to zero. For a write chunk, it's better to reset
chunk::offset to zero and chunk::bound to the chunk length, which means that
[offset, bound) is writable.
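A minimal sketch of the two reset rules. The Chunk struct here is a reduced
stand-in, and the function names reset_read_chunk and reset_write_chunk are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for the real Chunk (sketch only).
struct Chunk {
  uint32_t bytes;   // capacity of the chunk buffer
  uint32_t offset;  // start of the valid/writable range
  uint32_t bound;   // end of the valid/writable range
};

// Read chunk: nothing has been received yet, so [offset, bound) is empty.
inline void reset_read_chunk(Chunk &c) {
  c.offset = 0;
  c.bound = 0;
}

// Write chunk: the whole buffer [0, bytes) is writable again.
inline void reset_write_chunk(Chunk &c) {
  c.offset = 0;
  c.bound = c.bytes;
  assert(c.offset <= c.bound);
}
```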
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
API usage:
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr)
Input Parameters:
qp struct ibv_qp from ibv_create_qp
wr first work request (WR)
Output Parameters:
bad_wr pointer to first rejected WR
Return Value:
0 on success, -1 on error.
If the call fails, errno will be set to indicate the reason for the failure.
To avoid checking a stale or uninitialized bad_wr value after the call,
it's better to initialize bad_wr to nullptr.
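The reason for the initialization can be sketched without libibverbs. In the
sketch below, send_wr and fake_post_send are hypothetical stand-ins that only
mimic the wr/bad_wr shape of ibv_post_send: the out-parameter is written only
when a work request is rejected, so the caller must start from nullptr.

```cpp
#include <cassert>

// Hypothetical stand-in for a chained work request.
struct send_wr {
  int id;
  send_wr *next;
};

// Hypothetical stand-in for the post call: accepts at most `capacity`
// requests and writes *bad_wr only when a request is rejected.
int fake_post_send(send_wr *wr, send_wr **bad_wr, int capacity) {
  for (int posted = 0; wr != nullptr; wr = wr->next, ++posted) {
    if (posted == capacity) {
      *bad_wr = wr;  // first rejected request
      return -1;
    }
  }
  return 0;  // success: *bad_wr is left untouched
}

// Returns the id of the first rejected request, or -1 if all were posted.
int first_rejected_id(send_wr *head, int capacity) {
  send_wr *bad_wr = nullptr;  // initialized so it is never read as garbage
  if (fake_post_send(head, &bad_wr, capacity) == 0) {
    assert(bad_wr == nullptr);
    return -1;
  }
  return bad_wr->id;
}
```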
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. There's no need to get the stack & dispatcher from RDMAStack again
since RDMAWorker has already stored them.
2. Cache the Infiniband object for use in the local scope.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
After refactoring, the following check is no longer needed:
- if (c != buffers.end() && (*c)->over())
- ++c;
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. It's not proper to allocate large arrays on the stack, e.g. when
rx_queue_len is 4096.
The patch allocates rx_work_request and isge on the heap instead.
2. Zero the whole rx_work_request and isge arrays at once, which avoids
zeroing the entries one by one in the while loop.
3. Rename parameter "num" to "rq_wr_num" to improve readability.
rq_wr_num, i.e. receive-queue work-request number
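One way to get heap storage that is zeroed in a single step is sketched below.
The structs sge and recv_wr are simplified stand-ins for ibv_sge and
ibv_recv_wr, RecvBatch and build_recv_batch are hypothetical names, and the
use of std::vector is an assumption rather than what the patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-ins for ibv_sge and ibv_recv_wr (sketch only).
struct sge {
  uint64_t addr;
  uint32_t length;
  uint32_t lkey;
};

struct recv_wr {
  uint64_t wr_id;
  recv_wr *next;
  sge *sg_list;
  int num_sge;
};

struct RecvBatch {
  std::vector<recv_wr> wrs;
  std::vector<sge> sges;
};

// Build a linked batch of rq_wr_num receive work requests on the heap.
RecvBatch build_recv_batch(std::size_t rq_wr_num) {
  RecvBatch batch;
  batch.wrs.resize(rq_wr_num);   // value-initialized: every field is zero
  batch.sges.resize(rq_wr_num);  // same here, no per-entry memset needed
  for (std::size_t i = 0; i < rq_wr_num; ++i) {
    batch.wrs[i].sg_list = &batch.sges[i];
    batch.wrs[i].num_sge = 1;
    if (i + 1 < rq_wr_num)
      batch.wrs[i].next = &batch.wrs[i + 1];  // last entry keeps next == nullptr
  }
  assert(batch.wrs.size() == batch.sges.size());
  return batch;  // moving the vectors keeps element addresses valid
}
```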
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. All members are initialized in the constructor.
This makes it easy to construct Chunk objects in the
PoolAllocator::malloc function.
2. For a read chunk, member bound is initialized to 0.
3. For a send chunk, member bound is initialized to the full space size.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Extract the long lambda function to improve readability.
The lambda offers no advantage here, since the "this" pointer is also
needed in the original lambda function.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. List all asynchronous events of the RDMA device.
2. Output the fatal error events to check the RDMA device status.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
The original implementation makes it hard to understand:
1) Whether timer events should be executed.
2) How long epoll should wait before timing out.
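The two questions can be answered by one small decision function, sketched
below. All names (WaitPlan, plan_wait) are hypothetical, and the rule shown
(run timers only if the earliest deadline falls inside the wait window, and
shorten the wait to that deadline) is an assumption about the intended
behaviour, not the patch itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

struct WaitPlan {
  bool run_timers;     // should timer events be processed after the wait?
  int64_t timeout_us;  // how long the poll call may block
};

// next_timer is the earliest timer deadline, or nullptr if no timer is set.
WaitPlan plan_wait(const Clock::time_point *next_timer,
                   Clock::time_point now, int64_t timeout_us) {
  assert(timeout_us >= 0);
  WaitPlan plan{false, timeout_us};
  if (next_timer == nullptr)
    return plan;  // no timers: wait the full timeout
  Clock::time_point window_end = now + std::chrono::microseconds(timeout_us);
  if (*next_timer > window_end)
    return plan;  // earliest timer is beyond this wait window
  plan.run_timers = true;
  plan.timeout_us =
      (*next_timer > now)
          ? std::chrono::duration_cast<std::chrono::microseconds>(
                *next_timer - now).count()
          : 0;  // deadline already passed: do not block
  return plan;
}
```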
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
After reading one chunk, the chunk can be pushed into the buffer list if its
effective content size is not zero. In this case, it also means that the
caller has got the required read length. Then all the following chunks will
be pushed into the buffer list since their effective content size is not zero.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Keep the same logic:
1. If parameter block_size is zero, then allocate all the free chunks
to parameter std::vector<Chunk*> &chunks, i.e.
chunk_buffer_number = free_chunks.size()
2. If parameter block_size is not zero, then allocate the requested or
all the free chunks to parameter std::vector<Chunk*> &chunks.
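The rule can be sketched as a pure function. The function name and the
buffer_size parameter are hypothetical, and converting block_size (bytes) to a
chunk count by rounding up is an assumption made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// How many chunks to hand out for a request of block_size bytes.
std::size_t chunk_buffer_number(std::size_t block_size,
                                std::size_t buffer_size,
                                std::size_t free_chunks) {
  assert(buffer_size > 0);
  if (block_size == 0)
    return free_chunks;  // zero means: take every free chunk
  // Assumed conversion: round the byte count up to whole chunks.
  std::size_t wanted = (block_size + buffer_size - 1) / buffer_size;
  return std::min(wanted, free_chunks);  // never more than what is free
}
```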
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. offload chunk::read without managing bound.
2. reset chunk::offset & chunk::bound before releasing to pool.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Remove the Chunk::over interface and add the Chunk::get_size interface.
1) The meaning of the "over" function name is not clear when reading the code.
2) Some places need to know the current chunk's effective content size.
3) "Chunk::over()" can be replaced by "Chunk::get_size() == 0".
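A minimal sketch of the replacement, assuming the effective content of a chunk
is the range [offset, bound). The struct is a reduced stand-in for the real
Chunk.

```cpp
#include <cassert>
#include <cstdint>

// Reduced stand-in for the real Chunk (sketch only).
struct Chunk {
  uint32_t offset = 0;  // next byte to consume
  uint32_t bound = 0;   // end of valid content

  // Effective content size: bytes still available in [offset, bound).
  uint32_t get_size() const {
    assert(offset <= bound);
    return bound - offset;
  }
};

// What callers of the old over() would write now.
inline bool is_drained(const Chunk &c) { return c.get_size() == 0; }
```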
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
If ms_async_rdma_cm is false, there's no need to call the API
rdma_get_devices. If rdma_get_devices is called, the devices remain
opened while librdmacm is loaded. This is not what we want when
ms_async_rdma_cm is false.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. Use the wrapper functions event_read & event_write to access the
event file descriptor.
2. Rename the value read from or written to the event fd to event_val.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. It's wrong to use "-1" as the argument to query the queue pair state.
In the rdma-core library, ibv_query_qp calls ibv_cmd_query_qp to query
the queue pair state. If "-1" is used as attr_mask, ibv_cmd_query_qp
returns the error EOPNOTSUPP, which means the query failed.
2. In class QueuePair, is_error() can use the member function get_state()
to get the queue pair state.
3. It's better to use qp_state as the queue pair state, according to
the ibv_query_qp manual page.
struct ibv_qp_attr {
enum ibv_qp_state qp_state; /* Current QP state */
enum ibv_qp_state cur_qp_state; /* Current QP state - irrelevant for ibv_query_qp */
...
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
In the rdma-core library, ibv_fork_init checks the environment variable
RDMAV_HUGEPAGES_SAFE to decide whether huge pages are usable in the system.
It doesn't make sense to export the RDMAV_HUGEPAGES_SAFE environment
variable after calling ibv_fork_init.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
1. Avoid manual memory management where no pointer is needed to operate
on the allocated space. Otherwise, memory could leak.
2. Since the member type has been changed in class Device, the member
access operator "." must be used to access the sub-members of the
object.
3. There's no need to consider the experimental API of ibv_query_port.
So, merge ibv_query_port into the prolog.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
The allocated buffer size should not exceed the hardware's max_mr_size.
Otherwise, it'll trigger an out-of-bounds access when calling ibv_reg_mr.
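A minimal sketch of such a check, done before registering memory. The function
name fits_in_one_mr and its parameters are hypothetical; max_mr_size stands for
the value reported in ibv_device_attr.max_mr_size.

```cpp
#include <cassert>
#include <cstdint>

// True when `count` buffers of `buffer_size` bytes fit into a single
// memory region of at most `max_mr_size` bytes.
inline bool fits_in_one_mr(uint64_t buffer_size, uint64_t count,
                           uint64_t max_mr_size) {
  assert(buffer_size > 0);
  // Divide instead of multiplying so the check cannot overflow.
  return count <= max_mr_size / buffer_size;
}
```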
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Some RDMA devices don't support SRQ (shared receive queue).
Check the hardware attributes if Ceph is configured to use SRQ.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
It'll trigger an out-of-bounds access in the kernel if the required
memory region size is bigger than ibv_device_attr.max_mr_size.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
Without this patch, the following misleading log is printed:
Infiniband init requested receive queue length 4095 is too big. Setting 4095
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
* refs/pull/29780/head:
osd/PeeringState: semi-colon after DECLARE_LOCALS
osd/PeeringState: on_new_interval on child PG after split
Reviewed-by: Samuel Just <sjust@redhat.com>
The pool_stats map comes from a get('df') that may not include a pool
because it was just deleted.
Fixes: https://tracker.ceph.com/issues/41386
Signed-off-by: Sage Weil <sage@redhat.com>
* refs/pull/28378/head:
qa/tasks: introduce Thrasher base class
qa/tasks: Fix typo
qa/tasks: manage thrashers
qa/tasks: start DaemonWatchdog when ceph starts
qa/tasks: make watch and bark handle more daemons
qa/tasks: move DaemonWatchdog to new file
Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>