19:29:27 <TheSnide> #startmeeting
19:29:27 <MeetBot> Meeting started Wed Jan 27 19:29:27 2016 UTC.  The chair is TheSnide. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:29:27 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
19:29:30 <TheSnide> hi all
19:31:18 <TheSnide> another meeting. those were quite sporadic these days, i'll try to make them much more regular.
19:32:18 <TheSnide> since last time, *lots* has been done on the debian package, which enabled much more testing :)
19:33:03 <TheSnide> notably dmm0 has been updated, first with the latest beta, then with the latest daily builds.
19:33:23 <TheSnide> the daily builds are also back, but not automated yet. They should be automated by tomorrow.
19:34:07 <TheSnide> ... They will be auto-updated from both github:devel and alioth:debian/experimental.
19:35:19 <TheSnide> i only have resources (time, knowledge & incentive) to do it for the debian package, but i *do* encourage any other distro to do the same.
19:35:59 <TheSnide> ... the deb package should also work on ubuntu, but i won't test it there.
19:37:15 <TheSnide> also, you should thank shapirus for quite a number of bugfixes.
19:37:55 <TheSnide> ssm for the deb packaging (as usual). So if the package doesn't work for you, it's him. And if the package is great, it's him also :-D
19:38:45 <TheSnide> that said, if the auto-built debian package isn't working, it's usually me.
19:39:24 <TheSnide> chteuchteu: i think what you made for the new UI is quite great now. most things are just working as planned.
19:40:19 <TheSnide> chteuchteu also redesigned our institutional website, which looks quite great now, you can have a peek at http://mm0.eu/n
19:40:28 <shapirus> > ... the deb package should also work on ubuntu  <--  as far as 14.10 is concerned, not without a few patches: there's a test failing, and List::Util doesn't have the "any" and "all" methods. Also the init scripts have to be worked on.
19:40:54 <TheSnide> shapirus: yeah, the init scripts are systemd-based IIRC.
19:41:48 <shapirus> yes, and systemd isn't in 14.10. If we want the packages to work for that distro, we'll have to provide upstart or sysv-style configs.
19:42:06 <TheSnide> i'd prefer sysv-style if i have a say in it.
19:42:20 <TheSnide> as it's compatible with any /sbin/init
19:42:27 <shapirus> upstart sounds easier to me, but sysv is way more portable
19:42:57 <shapirus> also think of debian systems with sysv instead of systemd (yes it is possible to choose either of them in jessie)
19:43:00 <TheSnide> ... but i'd take anything that makes it work on ubuntu
19:43:18 <shapirus> so it's not only ubuntu-related
19:43:23 <TheSnide> yeah, that's why i'd prefer sysv-style.
19:43:56 <shapirus> I have a working munin-node sysv script already (took it from 2.1.9 I believe, it works almost as is)
19:43:59 <TheSnide> the good part is that with sysv-style, i can do it myself if needed.
19:44:16 <TheSnide> (only time will be missing, not skills)
19:44:26 <shapirus> I can share it, but I have no idea how to inject it into the deb packages, so that'll be up to you guys
19:44:47 <shapirus> as far as munin-asyncd goes, I run that one under supervisord currently
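For reference, a minimal supervisord stanza for munin-asyncd could look like the following sketch; the daemon path, user and log location are the usual Debian ones and are assumptions that may differ on other setups:

    [program:munin-asyncd]
    ; adjust command path, user and log location to match your install
    command=/usr/share/munin/munin-asyncd
    user=munin
    autorestart=true
    redirect_stderr=true
    stdout_logfile=/var/log/munin/munin-asyncd.log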
19:44:48 <TheSnide> shapirus: put it in a PR as a contrib/
19:45:05 <shapirus> should probably not be difficult to make a sysv script for it as well
19:45:18 <shapirus> I'll look at that
19:45:34 <TheSnide> sysv scripts aren't that difficult if you don't care about babysitting the processes
19:45:45 <shapirus> I'm also working on a mysql plugin that provides a bunch of TokuDB status graphs
19:45:46 <TheSnide> (restart upon failure, etc)
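As a rough illustration of how small such a script can stay when it does no process babysitting, a sysv-style sketch for munin-node might look like this; the daemon and pidfile paths are the usual Debian ones and should be adjusted as needed:

    #!/bin/sh
    ### BEGIN INIT INFO
    # Provides:          munin-node
    # Required-Start:    $network $remote_fs
    # Required-Stop:     $network $remote_fs
    # Default-Start:     2 3 4 5
    # Default-Stop:      0 1 6
    # Short-Description: Munin node
    ### END INIT INFO

    DAEMON=/usr/sbin/munin-node
    PIDFILE=/var/run/munin/munin-node.pid   # must match pid_file in munin-node.conf

    case "$1" in
      start)
        # munin-node daemonizes itself and writes its own pidfile
        start-stop-daemon --start --pidfile "$PIDFILE" --exec "$DAEMON"
        ;;
      stop)
        start-stop-daemon --stop --pidfile "$PIDFILE" --retry 5
        ;;
      restart)
        "$0" stop
        "$0" start
        ;;
      *)
        echo "Usage: $0 {start|stop|restart}" >&2
        exit 1
        ;;
    esac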
19:45:51 <shapirus> will contribute it when it's ready
19:46:14 <TheSnide> also, now is the time to make the graphs look more 2016-ish :)
19:46:19 <shapirus> I thought of incorporating it in the existing mysql_ plugin, but that one's too bloated and maintained outside of the munin repo
19:46:26 <shapirus> and it's not multigraph capable
19:46:29 <TheSnide> the new CPU colors are quite great IMHO
19:46:35 <shapirus> hence the decision for a separate plugin
19:47:02 <shapirus> thanks, I've spent quite a bit of time picking all those colors
19:47:45 <shapirus> so that they don't visually merge into one on adjacent fields, yet still remain contrasting, readable and easy on the eyes
19:48:03 <TheSnide> oh, and i'm really thinking about providing a 1sec variant of all the "system" plugins
19:48:13 <shapirus> and there's only 16.7M colors to choose from
19:48:18 <TheSnide> since it has a great WOW factor.
19:49:08 <TheSnide> shapirus: now you can proceed to others :)
19:49:25 <shapirus> yeah, and some good documentation on 1sec (or anything with higher than 5min resolution, for that matter) would be nice
19:49:36 <TheSnide> +1 on doc
19:49:53 <TheSnide> I'll review the whole guide and write more doc
19:50:01 <shapirus> then, what comes to my mind at once, is the logging and configuration issues
19:50:13 <TheSnide> as right now it's only a copy/paste from a former blog article of mine
19:50:18 <shapirus> munin-httpd: prefork or not, how many processes to run
19:50:54 <TheSnide> the CPAN lib might not be in debian. so we have to choose at runtime, based on its availability.
19:50:55 <shapirus> munin-asyncd or others: log level, log destination (if it is feasible to switch it between syslog and files)
19:51:47 <TheSnide> ssm is the logging master, i mostly delegated everything to him. let's ping him. If he doesn't reply, I'll see.
19:52:18 <TheSnide> #action TheSnide will write more details in the Guide about 1sec plugins
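Until the guide covers it, here is a hedged sketch of where a 1sec variant differs from a regular plugin: the config section declares a 1-second update_rate, while the actual per-second sampling still has to come from munin-asyncd or a streaming setup, which is exactly what the guide needs to spell out. The plugin below is hypothetical and only illustrates the declaration:

    #!/bin/sh
    # hypothetical load_1sec plugin -- only the declared resolution differs
    case "$1" in
      config)
        echo "graph_title Load average (1 second resolution)"
        echo "graph_category system"
        echo "update_rate 1"   # tell the master to expect 1-second samples
        echo "load.label load"
        exit 0
        ;;
    esac
    echo "load.value $(cut -d' ' -f1 /proc/loadavg)"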
19:52:19 <shapirus> then there's that issue of the host subdirectories under /var/spool/munin/async
19:52:35 <TheSnide> shapirus: yup. it was by design.
19:52:41 <shapirus> which I think has been agreed upon, but hardly documented anywhere
19:52:53 <TheSnide> ... but after some time, it just feels wrong.
19:53:23 <TheSnide> shapirus: yes, what we agreed on is much better. can you write the guide blurb about it?
19:53:28 <shapirus> oh and the "1 hour bug"
19:53:49 <TheSnide> it's called the GhostBug™ :)
19:53:50 <shapirus> I haven't the slightest idea how to possibly debug it
19:54:09 <shapirus> other than recording a whole log of the process's strace
19:54:33 <shapirus> or there has to be a way to turn on the most verbose debug output in the process itself
19:54:40 <TheSnide> shapirus: that's what i did before, and discovered several races. but i guess i didn't find them all.
19:54:52 <shapirus> upon which I can set up some alerting and debug it when it hangs
19:55:28 <TheSnide> if it hangs, it's already too late, as it issues a sleep(3600)
19:55:44 <TheSnide> the key to fixing it, is *why*.
19:56:00 <shapirus> that's already something
19:56:18 <shapirus> there must be some condition that makes it call sleep(3600), right?
19:56:47 <shapirus> well probably recording a strace log is not a bad idea at all
19:57:35 <shapirus> if only it happened predictably...
19:57:38 <TheSnide> shapirus: i did previously, and it paid off.
19:57:52 <TheSnide> but it's a little bit tedious and HDD unfriendly :)
19:58:30 <shapirus> I have a couple dozen VMs
19:58:49 <shapirus> I'll see what it takes to record that output
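One way to record that output, assuming the hang is in munin-update (the log paths and the pgrep pattern are only examples): attach strace to the running process with timestamps and fork-following, or wrap a whole munin-cron run the same way.

    # attach to a running (possibly hung) munin-update and its children
    strace -f -tt -s 256 -o /tmp/munin-update.strace -p "$(pgrep -f munin-update)"

    # or trace a whole cron run from the start
    strace -f -tt -o /var/log/munin/munin-cron.strace /usr/bin/munin-cron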
19:58:54 <hugin> [munin] steveschnepp commented on issue #625: Recent version of the deb package are ok. Closing it. https://git.io/vzHBk
19:59:04 <shapirus> I'd love to catch it
19:59:13 <TheSnide> also, madduck has a good point here https://github.com/munin-monitoring/munin/issues/619
20:00:04 <shapirus> right
20:00:37 <shapirus> it should be easy to convert that to "send as you read"
20:00:53 <shapirus> unless some data aggregation happens after it's read completely
20:01:03 <TheSnide> i don't think there is.
20:01:08 <TheSnide> also, https://github.com/munin-monitoring/munin/issues/617 is quite relevant
20:01:49 <TheSnide> https://github.com/munin-monitoring/munin/issues/612 obviously has to be fixed
20:01:59 <shapirus> hard coded stuff bites again
20:02:40 <shapirus> the latter looks like a pure coding mistake
20:02:49 <shapirus> or something under-implemented
20:03:35 <shapirus> I'd also mention this one https://github.com/munin-monitoring/munin/issues/634
20:04:22 <shapirus> at the very least it needs a cron job schedule workaround merged
20:04:40 <shapirus> until there is a decision on what to do for a permanent solution
20:05:55 <TheSnide> oh, the update-async struggle :)
20:06:51 <TheSnide> but the first part of the issue, i don't really understand
20:07:07 <hugin> [munin] Skaronator commented on issue #592: ah completely forgot this issue but yeah rebuild the whole graph system in HTML5 with some fancy JS-Graphs would be the best solution. https://git.io/vzH0k
20:07:13 <shapirus> what I was thinking about in regard to that is a mechanism that would allow munin-master to connect to the nodes as soon as new data is available
20:07:32 <shapirus> instead of fixed cron runs
20:07:52 <shapirus> but that has to involve support for some callback from the nodes
20:08:17 <shapirus> which doesn't sound impossible, however a new daemon on the master server will be needed
20:08:47 <shapirus> the first part of the issue is easy
20:09:03 <shapirus> well basically it's all in the steps to reproduce
20:09:10 <shapirus> try them and see what happens :)
20:10:48 <TheSnide> i was thinking about allowing some POST on the munin-httpd from the nodes
20:10:59 <shapirus> yes
20:11:07 <TheSnide> but that got dropped from 3.0
20:11:08 <shapirus> node->master: "I am ready"
20:11:19 <TheSnide> nah, directly the data :D
20:11:20 <shapirus> master->node: "ok I'm coming, give me your data"
20:11:32 <shapirus> how does that sound?
20:11:37 <madduck> go TheSnide go! ;)
20:11:46 <shapirus> it'll also spread load over time
20:12:08 <shapirus> since munin-update will not connect to all nodes at once, but only when there is new data
20:12:18 <madduck> shapirus: this poking could also presumably be done with SSH?
20:12:26 <TheSnide> as it would make it possible to monitor loosely connected nodes (those behind an infamous mobile internet NAT)
20:12:26 <shapirus> which in turn will allow updates more frequently than every 5 min
20:12:41 <shapirus> or less frequently as well
20:12:45 <madduck> shapirus: or simply keep persistent connections from the server to the nodes and poll.
20:12:59 <shapirus> that doesn't sound as scalable
20:13:11 <shapirus> the notify->poll method sounds better to me
20:13:14 <madduck> you mean past 64k connections?
20:13:23 <TheSnide> ;)
20:13:27 <shapirus> think of poor connections
20:13:50 <TheSnide> madduck: well, if you have more than 1 ip for all the nodes, you can have more than 64k
20:13:53 <shapirus> I'm imagining some DNS transfers-like system here
20:14:05 <madduck> YUK ;)
20:14:11 <shapirus> where on serial change the master sends notifies
20:14:16 <shapirus> and the slaves come for the fresh zone copy
20:14:23 <TheSnide> shapirus: i have the FTP-data in mind when i look at your proposal, and it doesn't sound pretty
20:14:32 <shapirus> that approach suits munin perfectly I think
20:15:00 <madduck> so wait a minute (sorry that I am talking as bystander, but…)
20:15:05 <TheSnide> shapirus: even better would be "node POST the new data to munin-httpd". period.
20:15:24 <TheSnide> madduck: we're quite tolerant & open :)
20:15:29 <madduck> asyncd collects data and you are suggesting that it reaches out to newdaemond to inform it about new data such that munin then opens SSH to asyncd?
20:15:34 <shapirus> FTP isn't bad either, it's just that the technology turned out to be not very suitable for where people started using it
20:15:55 <TheSnide> madduck: ... basically, yes.
20:15:59 <madduck> why not let asynd push the data instead of poking for a pickup?
20:16:16 <TheSnide> madduck: *precisely* what i'd prefer :)
20:16:17 <madduck> without replacing pull support ;)
20:16:31 <shapirus> the problem with the "node POSTs the new data to munin-httpd" approach is that you'll have to handle the situation when the server isn't available
20:16:44 <shapirus> and mark a portion of data as "failed to transfer"
20:16:49 <TheSnide> shapirus: bah. the async already takes care of that.
20:16:50 <madduck> you just accumulate, as if the server doesn't fetch
20:17:01 <TheSnide> ... remember the "node is *dumb*" approach?
20:17:19 <shapirus> yes, but you've just said the opposite: that the node has to push data
20:17:20 <TheSnide> the callback will *only* be done by async implementations.
20:17:40 <TheSnide> shapirus: ok. i lied. async will directly push data to master.
20:18:35 <TheSnide> madduck: i think you are thinking the same as my original design.
20:18:49 <TheSnide> (which i didn't have time to implement)
20:19:15 <TheSnide> what's currently missing is: #1 a persistent SQL db. #2 a POST handler on the munin-httpd
20:19:29 <shapirus> [22:16] < madduck> asyncd collects data and you are suggesting that it reaches out to newdaemond to inform it about new data such that munin then opens SSH to asyncd?
20:19:40 <shapirus> that's exactly what I'm proposing
20:19:58 <shapirus> as to the "why not push data" thing
20:20:14 <shapirus> well, it's just gonna be more difficult to implement I think
20:20:19 <shapirus> more prone to bugs
20:20:24 <shapirus> and therefore less reliable
20:20:46 <shapirus> whereas the notify->poll method will be bulletproof even in case of poor connection
20:20:53 <shapirus> think dns :)
20:21:26 <shapirus> then again, think of load
20:21:43 <shapirus> if a hundred nodes chime in and try to push their data at once
20:21:48 <shapirus> it may overload the master
20:22:19 <shapirus> and you'll have to implement a "sorry, come back later" mechanism and make the node handle it properly
20:23:03 <shapirus> whereas with the notify-poll method, the master will know the list of nodes which have fresh data and work on it at its convenience
20:23:03 <kenyon> like what http already has with its status codes
20:24:05 <shapirus> and that will feel more like a "distributed" rather than a "centralized" architecture
20:25:00 <shapirus> then you can have the node update its spool files as frequently as you wish (think 1sec graph) and notify the master in a lightweight way about that (every time, or not more frequently than a configured interval)
20:25:30 <shapirus> and then the master may poll "when the node asks for it, but not more frequently than <configvalue> seconds"
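To make the proposal concrete, the node-side half could be as small as one datagram per spool update; the message format, the port and the master-side listener below are all hypothetical:

    # node side: fire-and-forget "I have fresh data" notification after each spool update
    printf 'FRESH %s %s\n' "$(hostname -f)" "$(date +%s)" | nc -u -w1 munin-master.example.com 4950

The master would then need a small listener that queues the hostname and schedules a poll of just that node, instead of relying on the fixed cron run.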
20:25:42 <TheSnide> shapirus: i agree with most of what you said. But, as the spooling is already done, avoiding callbacks altogether means that we don't need to open the FW in both directions
20:26:01 <shapirus> that's right
20:26:02 <TheSnide> and retrying is mostly implemented
20:26:20 <shapirus> in what I suggest, there must be an open port on the master
20:26:32 <TheSnide> (since the biggest issue in retrying is retaining the failed data. which we *already* do.)
20:26:40 <shapirus> the only drawback as I see it
20:27:01 <TheSnide> so, "the async POSTs *data* to the master" has mostly only advantages.
20:27:18 <shapirus> not until you think of scalability
20:27:21 <TheSnide> ... the only drawback i see is "security"
20:27:46 <shapirus> as soon as the number of nodes grows high enough, the master is stuck
20:27:46 <TheSnide> shapirus: bah. scaling HTTP requests is something that is well understood nowadays :D
20:28:14 <shapirus> no, I mean the case of N nodes coming with their data all at the same time
20:28:21 <TheSnide> ... and with a persistent DB comes a *multi-host* master
20:28:48 <TheSnide> N nodes coming with their data all at the same time <-- that *will* happen. as we are time-sensitive
20:28:55 <shapirus> the security issue is still there no matter POST or new-data notification for poll
20:29:29 <shapirus> my approach would achieve time-insensitivity as a side bonus
20:29:34 <TheSnide> if notif, you'll only eventually DoS the platform, not inject wrong data
20:29:35 <shapirus> think of that :)
20:30:23 <TheSnide> shapirus: i'm _really_ not in favor of callbacks. Got my back burnt too often with that :)
20:31:05 <shapirus> and I've had my share of trouble with centralized systems ;)
20:31:23 <shapirus> I've been designing and working with HA and high-load systems for 8 years
20:31:24 <TheSnide> ... it becomes a nightmare to open holes in most FW, and you are at the mercy of NAT :)
20:31:44 <shapirus> I can see where scalability issues are part of design
20:31:59 <shapirus> but then, as far as the firewall goes
20:32:01 <TheSnide> shapirus: that said, _nothing_ prevents you from providing a contrib tool to do that.
20:32:17 <shapirus> let's assume your HTTP data POST method
20:32:24 <TheSnide> i'd even merge it if it works
20:32:25 <shapirus> how are you going to secure it?
20:32:51 <TheSnide> i was naively thinking about SSL
20:33:04 <TheSnide> client-side certs
20:33:16 <TheSnide> _if_ you want to secure it.
20:33:39 <TheSnide> or simple HTTPS + HTTP-Basic Auth.
20:33:47 <TheSnide> that might also be enough
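A sketch of what such a push could look like from the async side, assuming either of those two options; the endpoint, port and file paths are hypothetical, since munin-httpd has no POST handler yet:

    # client-certificate variant
    curl --fail --silent \
         --cacert /etc/munin/master-ca.pem \
         --cert /etc/munin/async-client.pem --key /etc/munin/async-client.key \
         --data-binary @"$SPOOLFILE" \
         https://munin-master.example.com:4443/push

    # or HTTPS + HTTP Basic auth
    curl --fail --silent -u "node01:$SECRET" \
         --data-binary @"$SPOOLFILE" \
         https://munin-master.example.com:4443/push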
20:33:48 <shapirus> well if not secured, then anyone can inject wrong data or simply DoS the server
20:33:57 <TheSnide> yup
20:34:14 <shapirus> but then, whichever way you secure it, the node callbacks can be secured in just the same way
20:34:31 <shapirus> so the security issue is the same for both approaches
20:34:39 <TheSnide> i said : "node callbacks are *more* secure than node posts"
20:34:50 <shapirus> how are they more secure?
20:35:02 <TheSnide> data injection is impossible. only DoS is.
20:35:23 <shapirus> agreed
20:35:51 <shapirus> and then with some firewall rules (e.g., allow LAN only) it protects against DoS as well
20:35:58 <TheSnide> I also wish for some UDP data protocol :)
20:36:14 <shapirus> that's a bright idea
20:36:30 <shapirus> the perfect choice for notifications
20:36:41 <shapirus> we don't care about the packet loss
20:36:42 <TheSnide> ... as a typical "$plugin fetch" usually fits in a 1.5K packet
20:37:01 <shapirus> 512 bytes.
20:37:12 <TheSnide> ha ?
20:37:22 <shapirus> UDP packet size is up to 512 bytes
20:37:30 <TheSnide> #wut ?
20:37:31 <shapirus> lower than the typical MTU :)
20:38:00 <be0rn> That's not true. UDP packets can easily be bigger.
20:38:01 <TheSnide> well, that's the *guaranteed* size.
20:38:10 <shapirus> well theoretically it can be up to 65k
20:38:17 * TheSnide enlarged his UDP packets.
20:38:26 <be0rn> Bloat
20:38:45 <TheSnide> but i'd say a 1k packet is fair
20:38:50 <shapirus> either way, I don't think it's a good idea to use UDP for actual data fetch: too much hassle for no profit
20:39:16 <shapirus> very unlike the lightweight "new data" notifications sent from the nodes to the master, if we think that way
20:39:18 <TheSnide> super-duper lightweight node. no async.
20:39:29 <TheSnide> if data is lost, so be it.
20:39:52 <shapirus> as an additional feature?
20:39:59 * TheSnide has a WIP of munin-node-c with streaming plugin via UDP
20:40:00 <shapirus> might be useful
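A minimal sketch of that lightweight, loss-tolerant mode: pipe a plugin's fetch output into a single datagram. The port and the master-side receiver are hypothetical; the plugin path assumes a standard node install.

    # one datagram per fetch; fine as long as the output stays well under the MTU
    /etc/munin/plugins/load | nc -u -w1 munin-master.example.com 4951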
20:40:42 <TheSnide> cherry on the cake would be to have a statsd gateway (that's what i'm _really_ thinking of)
20:40:44 <shapirus> won't you eventually be creating another graphite that way? :)
20:41:30 <TheSnide> shapirus: graphite does many things right.
20:41:35 <TheSnide> from a user perspective.
20:41:52 <TheSnide> anyway... i'm closing the meeting
20:41:56 <TheSnide> #endmeeting