If poll() says a socket is ready for reading but a subsequent read()
returns zero bytes, the peer has sent a FIN. Handle that case.
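For illustration, a minimal sketch of the rule outside the messenger code
(the helper and its callers are made up, not the actual socket handling):

#include <poll.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper showing the rule: a socket that poll() reports as
// readable but then read()s zero bytes has received a FIN from the peer,
// so it must be treated as closed rather than as "no data yet".
// Returns false when the connection should be torn down.
bool read_from_ready_socket(int sd, char *buf, size_t len, ssize_t *got)
{
  struct pollfd pfd = { sd, POLLIN, 0 };
  if (poll(&pfd, 1, 1000) <= 0 || !(pfd.revents & POLLIN)) {
    *got = 0;
    return true;                   // nothing to read yet
  }
  *got = ::read(sd, buf, len);
  if (*got == 0)
    return false;                  // peer sent FIN: connection is closed
  if (*got < 0)
    return errno == EAGAIN || errno == EINTR;   // retryable vs. fatal
  return true;                     // *got bytes of real data
}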
One way the incorrect handling manifested was as follows:
Under a heavy write load, clients log many messages like this:
[19021.523192] libceph: tid 876 timed out on osd6, will reset osd
[19021.523328] libceph: tid 866 timed out on osd10, will reset osd
[19081.616032] libceph: tid 841 timed out on osd0, will reset osd
[19081.616121] libceph: tid 826 timed out on osd2, will reset osd
[19081.616176] libceph: tid 806 timed out on osd3, will reset osd
[19081.616226] libceph: tid 875 timed out on osd9, will reset osd
[19081.616275] libceph: tid 834 timed out on osd12, will reset osd
[19081.616326] libceph: tid 874 timed out on osd10, will reset osd
After the clients are done writing and the file system should
be quiet, osd hosts have a high load with many active threads:
$ ps u -C cosd
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1383 162 11.5 1456248 943224 ? Ssl 11:31 406:59 /usr/bin/cosd -i 7 -c /etc/ceph/ceph.conf
$ for p in `ps -C cosd -o pid --no-headers`; do grep -nH State /proc/$p/task/*/status | grep -v sleep; done
/proc/1383/task/10702/status:2:State: R (running)
/proc/1383/task/10710/status:2:State: R (running)
/proc/1383/task/10717/status:2:State: R (running)
/proc/1383/task/11396/status:2:State: R (running)
/proc/1383/task/27111/status:2:State: R (running)
/proc/1383/task/27117/status:2:State: R (running)
/proc/1383/task/27162/status:2:State: R (running)
/proc/1383/task/27694/status:2:State: R (running)
/proc/1383/task/27704/status:2:State: R (running)
/proc/1383/task/27728/status:2:State: R (running)
With this fix applied, a heavy load still causes many client
resets of osds, but no runaway threads result.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
Signed-off-by: Sage Weil <sage@newdream.net>
make: create /etc/ceph if it doesn't exist. On uninstall, remove the
directory if it's empty. (Never remove a user's config file, though.)
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
object_info_t has one constructor that initializes everything from a
bufferlist. This means that the decode function needs to give default
values to fields in object_info_t that aren't found in the bufferlist.
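The pattern, sketched with a stand-in struct rather than the real
object_info_t (the fields and the raw-buffer decode here are illustrative):

#include <cstdint>
#include <cstring>

// Stand-in for object_info_t (the real fields differ).  The constructor
// that takes an encoded buffer runs decode() before anything else has
// initialized the members, so every field needs a default in case the
// encoding is an older version that does not carry it.
struct example_info_t {
  uint8_t  struct_v;
  uint64_t version;
  bool     lost;          // newer field: must default to false

  void set_defaults() {
    struct_v = 1;
    version = 0;
    lost = false;
  }

  void decode(const uint8_t *buf, size_t len) {
    set_defaults();                        // defaults first, then overwrite
    if (len < 1 + sizeof(version))
      return;                              // truncated/old encoding: keep defaults
    struct_v = buf[0];
    std::memcpy(&version, buf + 1, sizeof(version));
    if (struct_v >= 2 && len > 1 + sizeof(version))
      lost = (buf[1 + sizeof(version)] != 0);   // only present in v2 encodings
  }

  example_info_t() { set_defaults(); }
  example_info_t(const uint8_t *buf, size_t len) { decode(buf, len); }
};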
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
bench_write and bench_seq will now wait on any write/read
rather than the one least recently started.
bench_write adds its pid to the BENCH_DATA object.
bench_read uses the pid in BENCH_DATA to generate the object
names to read.
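A sketch of how the pid-based naming might look (the exact name format used
by the tool is an assumption):

#include <sstream>
#include <string>
#include <sys/types.h>

// Hypothetical helper: both the writer and the reader derive object names
// from the pid that bench_write stored in BENCH_DATA, so bench_read can
// regenerate the same names without tracking them explicitly.
std::string bench_object_name(pid_t writer_pid, int index)
{
  std::ostringstream name;
  name << "benchmark_data_" << writer_pid << "_" << index;
  return name.str();
}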
Signed-off-by: Samuel Just <samuelj@hq.newdream.net>
SyslogStreambuf is a kind of stream buffer that allows you to output
characters from an ostream to syslog. Most standard IO streams can make
use of this Streambuf.
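The general shape of such a stream buffer, as a simplified sketch rather
than the class as committed:

#include <ostream>
#include <streambuf>
#include <string>
#include <syslog.h>

// Simplified sketch of a syslog-backed stream buffer: characters written
// through an ostream are buffered and flushed to syslog(3) one line at a
// time.  The real SyslogStreambuf is more general than this.
class SyslogStreambufSketch : public std::streambuf {
  std::string line;
  int prio;
public:
  explicit SyslogStreambufSketch(int priority = LOG_INFO) : prio(priority) {}
protected:
  int_type overflow(int_type c) {
    if (c == traits_type::eof())
      return sync() == 0 ? 0 : traits_type::eof();
    if (c == '\n')
      sync();                       // a full line: hand it to syslog
    else
      line.push_back(static_cast<char>(c));
    return c;
  }
  int sync() {
    if (!line.empty()) {
      syslog(prio, "%s", line.c_str());
      line.clear();
    }
    return 0;
  }
};

Any ostream can then be pointed at it, e.g.
SyslogStreambufSketch buf;
std::ostream out(&buf);
out << "hello from the log" << std::endl;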
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We used to call apply_transactions, which avoided rejournaling anything
because the journal wasn't writeable yet, but that path relies on machinery
(threads, finishers, and so on) that is neither appropriate nor necessary
when we're just replaying journaled events.
Instead, call the lower-level do_transactions() directly.
Signed-off-by: Sage Weil <sage@newdream.net>
Not even sure where min() was coming from, but it seems to be missing on
i386 lucid:
g++ -DHAVE_CONFIG_H -I. -Wall -D__CEPH__ -D_FILE_OFFSET_BITS=64 -D_REENTRANT -D_THREAD_SAFE -rdynamic -g -O2 -MT rbd.o -MD -MP -MF .deps/rbd.Tpo -c -o rbd.o rbd.cc
rbd.cc: In function 'int do_import(void*, const char*, int, const char*)':
rbd.cc:837: error: no matching function for call to 'min(uint64_t&, off_t)'
make[3]: *** [rbd.o] Error 1
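The error comes from std::min being a single-type template: a uint64_t and
an off_t don't deduce to one T. A sketch of the usual ways to resolve it
(the parameter names are illustrative and the fix actually applied may
differ):

#include <algorithm>
#include <stdint.h>
#include <sys/types.h>

// std::min<T>(a, b) takes both arguments as the same type T, so mixing a
// uint64_t with an off_t (signed, and a different width on some targets)
// fails to deduce.
uint64_t chunk_len(uint64_t remaining, off_t file_size)
{
  // spell out the template parameter so off_t converts to uint64_t...
  return std::min<uint64_t>(remaining, file_size);
  // ...or, equivalently, cast one side so both arguments match:
  //   return std::min(remaining, (uint64_t)file_size);
}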
Reported-by: John Leach <john@johnleach.co.uk>
Signed-off-by: Sage Weil <sage@newdream.net>
I've found the manpage problem that I noted before. It's about monmaptool;
the CLI gives its usage as:
[--print] [--create [--clobber]] [--add name 1.2.3.4:567] [--rm name]
<mapfilename>
But the manpage states this as an example:
monmaptool --create --add 192.168.0.10:6789 --add 192.168.0.11:6789 --add
192.168.0.12:6789 --clobber monmap
This is missing 'name' after the '--add' switch, resulting in the error
message "invalid ip:port '--add'". The attached patch fixes this
inconsistency.
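Following the usage string above, the example should presumably read
something like (the monitor names are illustrative):
monmaptool --create --add a 192.168.0.10:6789 --add b 192.168.0.11:6789 \
    --add c 192.168.0.12:6789 --clobber monmap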
Signed-off-by: Laszlo Boszormenyi <gcs@debian.hu>
If for some reason we enter scrub() without scrub_reserved == true, don't
adjust the osd->scrubs_pending or we'll screw up the accounting.
Signed-off-by: Sage Weil <sage@newdream.net>
Create a copy constructor for object_info_t, since we often want to copy
an object_info_t and would rather not try to remember all the fields.
Drop the lost parameter from one of the other constructors, because it
isn't used much.
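A stand-in sketch (the real object_info_t has many more members; the point
is simply that the copy constructor keeps the member-by-member copy in one
place):

#include <stdint.h>

struct info_sketch_t {
  uint64_t version;   // illustrative members only
  bool lost;

  info_sketch_t() : version(0), lost(false) {}
  info_sketch_t(const info_sketch_t &other)
    : version(other.version), lost(other.lost) {}
};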
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
mark_all_unfound_as_lost: just delete items from the rmissing set as we
find them, rather than using a multi-pass system.
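The single-pass deletion is the usual erase-while-iterating idiom, sketched
here on a stand-in std::set (the real rmissing container and the "found"
test differ):

#include <set>
#include <stdint.h>

// Walk the set once, erasing matching entries as we go; erasing through a
// post-incremented iterator keeps the loop position valid, so no second
// pass is needed.
void drop_found_entries(std::set<uint64_t> &rmissing,
                        bool (*is_found)(uint64_t))
{
  for (std::set<uint64_t>::iterator p = rmissing.begin();
       p != rmissing.end(); ) {
    if (is_found(*p))
      rmissing.erase(p++);    // erases the old position; p already advanced
    else
      ++p;
  }
}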
Update info.last_update as we go so that log printouts will look correct
(the log printout function checks info.last_update).
Don't remove from missing or missing_loc in mark_obj_as_lost:
PG::missing_loc should never contain the soid, and PG::missing is handled
elsewhere.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
In PG::mark_obj_as_lost, we have to mark a missing object as lost. We
should not assume that we have an old version of the missing object in
the ObjectStore. If the object doesn't exist in the object store, we
have to create it so that recovery can function correctly.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
This one verifies:
1. Client asks for an unfound object and gets put to sleep
2. Object gets declared lost
3. Client wakes up
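A toy model of the behaviour being tested, with illustrative names (the
OSD's actual wait/wakeup path is different machinery):

#include <condition_variable>
#include <mutex>

// A client blocks while the object is unfound and is woken when the object
// is either recovered or declared lost.
struct unfound_waiter {
  std::mutex m;
  std::condition_variable cv;
  bool found, lost;

  unfound_waiter() : found(false), lost(false) {}

  // client side: sleep until the cluster resolves the object's fate
  bool wait_for_object() {
    std::unique_lock<std::mutex> l(m);
    while (!found && !lost)
      cv.wait(l);
    return found;                  // false => the object was declared lost
  }

  // cluster side: declaring the object lost wakes any sleeping clients
  void mark_lost() {
    std::lock_guard<std::mutex> l(m);
    lost = true;
    cv.notify_all();
  }
};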
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We don't have enough information to mark objects as lost until we
activate the PG. might_have_unfound isn't even built until PG::activate.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
Make all survivors participate in resolve stage, so that survivors can
properly determine the outcome of migrations to the failed node that did
not complete.
The sequence (before):
- A starts to export /foo to B
- C has ambiguous auth (A,B) in its subtree map
- B journals import_start
- B fails
...
- B restarts
- B sends resolves to everyone
  - does not claim /foo
- A sends resolve _only_ to B
  - does claim /foo
- B knows its import did not complete
- C doesn't know anything
Also, the maybe_resolve_finish stuff was totally broken because the
recovery_set wasn't initialized.
See new (commented out) assert in Migrator.cc to reproduce the above.
Signed-off-by: Sage Weil <sage@newdream.net>
In _process_pg_info, if the primary sends us a PG::Log, a replica should
merge that log into its own.
mark_all_unfound_as_lost / share_pg_log: don't send the whole PG::Log.
Just send the new entries that were just added when marking the objects
as lost.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
We can now permanently mark objects as lost by setting the lost bit in
their object_info_t. Rev the object_info_t struct.
get_object_context: re-arrange this so that we're always setting the
lost bit. Also avoid some unnecessary steps.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>
In activate_map, we now mark objects that we know are unfindable as
lost. This relies on the might_have_unfound set introduced earlier.
Signed-off-by: Colin McCabe <colinm@hq.newdream.net>