prometheus

Matt Palmer 3369422327 Improve DNS response handling to prevent "stuck" records [Fixes #2799] (#3138)

The problem reported in #2799 was that when all records for a name were removed, the target group was never updated to be the "empty" set. Essentially, whatever Prometheus last saw as a non-empty list of targets would stay that way forever (or at least until Prometheus restarted...).

This came about because of a fairly naive interpretation of what a valid-looking DNS response actually looked like -- essentially, the only valid DNS responses were ones that had a non-empty record list. That's fine as long as your config only ever lists target names which have non-empty record sets; if your environment happens to legitimately have empty record sets sometimes, all hell breaks loose (systems that were shut down cleanly trigger up==0 alerts, for instance).

This patch is a refactoring of the DNS lookup behaviour that maintains existing behaviour with regard to search paths, but correctly handles empty and non-existent record sets.

RFC 1034, section 4.3.1, says there are three ways a recursive DNS server can respond:

1. Here is your answer (possibly an empty answer, because of the way DNS considers all records for a name, regardless of type, when deciding whether the name exists).
2. There is no spoon (the name you asked for definitely does not exist).
3. I am a teapot (something has gone terribly wrong).

Situations 1 and 2 are fine and dandy; whatever the answer is (empty or otherwise) is the list of targets. In situation 3, something has gone wrong, so we shouldn't update the target list, because we don't really know what the target list should be.

Querying multiple DNS servers is a straightforward augmentation: if you get an error, try the next server in the list, until you get an answer or run out of servers to ask. Only if all the servers return errors should you return an error to the calling code.

Where things get complicated is the search path. In order to be able to confidently say, "this name does not exist anywhere, you can remove all the targets for this name because it's definitely GORN", at least one server for each of the possible names needs to return either a successful-but-empty response or NXDOMAIN. If any name errors out, then -- since that one might have been the one where the records came from -- you need to say "maintain the status quo until we get a known-good response".

It is possible, though unlikely, that a poorly-configured DNS setup (say, one which had a domain in its search path for which all configured recursive resolvers respond with REFUSED) could result in the same "stuck" records problem we're solving here, but the DNS configuration should be fixed in that case, and there's nothing we can do in Prometheus itself to fix the problem.

I've tested this patch on a local scratch instance in all the various ways I can think of:

1. Adding records (targets get scraped)
2. Adding records of a different type
3. Removing records of the requested type, leaving records of other types intact (targets don't get scraped)
4. Removing all records for the name (targets don't get scraped)
5. Shutting down the resolver (targets still get scraped)

There are no automated test suite additions, because there isn't a test suite for DNS discovery, and I was stretching my Go skills to the limit to make this happen; mock objects are beyond me.

2017-09-15 12:26:10 +02:00
..
azure	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00
consul	Hide consul token on UI. (#2797)	2017-06-01 22:14:23 +01:00
dns	Improve DNS response handling to prevent "stuck" records [Fixes #2799] (#3138)	2017-09-15 12:26:10 +02:00
ec2	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00
file	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00
gce	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00
kubernetes	Replace regex with Secret type and remarshal config to hide secrets (#2775)	2017-05-29 12:46:23 +01:00
marathon	Marathon SD: Set port index label	2017-09-11 13:40:51 +01:00
openstack	Openstack Service Discovery (#2701)	2017-06-01 23:49:02 +02:00
triton	Add triton zone brand metadata	2017-04-06 21:35:42 +00:00
zookeeper	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00
discovery.go	Openstack Service Discovery (#2701)	2017-06-01 23:49:02 +02:00
discovery_test.go	Use log.Logger interface for all discovery services	2017-06-01 11:25:55 -05:00