19:29:27 #startmeeting
19:29:27 Meeting started Wed Jan 27 19:29:27 2016 UTC. The chair is TheSnide. Information about MeetBot at http://wiki.debian.org/MeetBot.
19:29:27 Useful Commands: #action #agreed #help #info #idea #link #topic.
19:29:30 hi all
19:31:18 another meeting. they have been quite sporadic these days, i'll try to make them much more regular.
19:32:18 since last time, *lots* has been done on the debian package, which enabled much more testing :)
19:33:03 notably the dmm0 has been updated, first with the latest beta, then with the latest daily builds.
19:33:23 the daily builds are also back, but not automated yet. They should be by tomorrow.
19:34:07 ... They will be auto-updated from both github:devel and alioth:debian/experimental.
19:35:19 i only have the resources (time, knowledge & incentive) to do it for the debian package, but i *do* encourage any other distro to do the same.
19:35:59 ... the deb packages should also work on ubuntu, but i won't test them there.
19:37:15 also, you should thank shapirus for quite a number of bugfixes.
19:37:55 and ssm for the deb packaging (as usual). So if the package doesn't work for you, it's him. And if the package is great, it's him also :-D
19:38:45 that said, if the debian auto-built package isn't working, it's usually me.
19:39:24 chteuchteu: i think what you made for the new UI is quite great now. most of the things are just working as planned.
19:40:19 chteuchteu also redesigned our institutional website, which looks quite great now, you can have a peek at http://mm0.eu/n
19:40:28 > ... the deb package should also work on ubuntu <-- as far as 14.10 is concerned, not without a few patches: there's a failing test, and List::Util doesn't have the "any" and "all" methods. Also the init scripts have to be worked on.
19:40:54 shapirus: yeah, the init scripts are systemd-based IIRC.
19:41:48 yes, and systemd isn't in 14.10. If we want the packages to work for that distro, we'll have to provide upstart or sysv-style configs.
19:42:06 i'd prefer sysv-style if i have a say in it.
19:42:20 as it's compatible with any /sbin/init
19:42:27 upstart sounds easier to me, but sysv is way more portable
19:42:57 also think of debian systems with sysv instead of systemd (yes, it is possible to choose either of them in jessie)
19:43:00 ... but i'd take anything that makes it work on ubuntu
19:43:18 so it's not only ubuntu-related
19:43:23 yeah, that's why i'd prefer sysv-style.
19:43:56 I have a working munin-node sysv script already (took it from 2.1.9 I believe, it works almost as is)
19:43:59 the good part is that with sysv-style, i can do it myself if needed.
19:44:16 (only time will be missing, not skills)
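For reference, a minimal sketch of the kind of sysv-style script being discussed, for munin-node on a Debian-style system. This is not the 2.1.9 script shapirus mentions; the daemon path, pidfile location and runlevels are assumptions and would need to match the actual packaging.

```sh
#!/bin/sh
### BEGIN INIT INFO
# Provides:          munin-node
# Required-Start:    $network $remote_fs
# Required-Stop:     $network $remote_fs
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: Munin node agent
### END INIT INFO

# Assumed locations; adjust to the actual package layout.
DAEMON=/usr/sbin/munin-node
PIDFILE=/var/run/munin/munin-node.pid

case "$1" in
  start)
    mkdir -p /var/run/munin
    # munin-node daemonizes itself and writes its own pidfile,
    # which keeps this script short.
    start-stop-daemon --start --quiet --pidfile "$PIDFILE" --exec "$DAEMON"
    ;;
  stop)
    start-stop-daemon --stop --quiet --oknodo --pidfile "$PIDFILE"
    ;;
  restart)
    "$0" stop
    "$0" start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac
```

As noted in the discussion, this does no babysitting of the process; restart-on-failure would need something like supervisord on top.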
19:44:26 I can share it, but I have no idea how to inject it into the deb packages, so that'll be up to you guys
19:44:47 as for munin-asyncd, I run that one under supervisord currently
19:44:48 shapirus: put it in a PR as contrib/
19:45:05 it should probably not be difficult to make a sysv script for it as well
19:45:18 I'll look at that
19:45:34 sysv scripts aren't that difficult if you don't care about babysitting the processes
19:45:45 I'm also working on a mysql plugin that provides a bunch of TokuDB status graphs
19:45:46 (restart upon failure, etc)
19:45:51 will contribute it when it's ready
19:46:14 also, now is the time to make the graphs look more 2016-ish :)
19:46:19 I thought of incorporating it in the existing mysql_ plugin, but that one's too bloated and maintained outside of the munin repo
19:46:26 and it's not multigraph capable
19:46:29 the new CPU colors are quite great IMHO
19:46:35 hence the decision for a separate plugin
19:47:02 thanks, I've spent quite a bit of time picking all those colors
19:47:45 so that they don't merge into one visually on adjacent fields, yet still remain contrasting, readable and easy on the eyes
19:48:03 oh, and i'm really thinking about providing all the "system" plugins with a 1sec variant
19:48:13 and there's only 16.7M colors to choose from
19:48:18 since it has a great WOW factor.
19:49:08 shapirus: now you can proceed to the others :)
19:49:25 yeah, and some good documentation on 1sec (or anything with higher than 5min resolution, for that matter) would be nice
19:49:36 +1 on doc
19:49:53 I'll review the whole guide and write more doc
19:50:01 then, what comes to my mind at once is the logging and configuration issues
19:50:13 as it's only a copy/paste from a former blog article of mine right now
19:50:18 munin-httpd: prefork or not, how many processes to run
19:50:54 the CPAN lib might not be in debian. so we have to choose at runtime, based on its availability.
19:50:55 munin-asyncd or others: log level, log destination (if it is feasible to switch it between syslog and files)
19:51:47 the logging master is ssm, i mostly delegated everything to him. let's ping him. If he doesn't reply, I'll see.
19:52:18 #action TheSnide will write more details in the Guide about 1sec plugins
19:52:19 then there's that issue of the host subdirectories under /var/spool/munin/async
19:52:35 shapirus: yup. it was by design.
19:52:41 which I think has been agreed upon, but hardly documented anywhere
19:52:53 ... but after some time, it just feels wrong.
19:53:23 shapirus: yes, what we agreed on is much better. can you write the guide blurb about it?
19:53:28 oh, and the "1 hour bug"
19:53:49 it's called the GhostBug™ :)
19:53:50 I don't have the slightest idea how to possibly debug it
19:54:09 other than recording a whole strace log of the process
19:54:33 or there has to be a way to turn on the most verbose debug output in the process itself
19:54:40 shapirus: that's what i did before, and discovered several races. but i guess i didn't find them all.
19:54:52 upon which I can set up some alerting and debug it when it hangs
19:55:28 if it hangs, it's already too late, as it issues a sleep(3600)
19:55:44 the key to fixing it is *why*.
19:56:00 that's already something
19:56:18 there must be some condition that makes it call sleep(3600), right?
19:56:47 well, probably recording a strace log is not a bad idea at all
19:57:35 if only it happened predictably...
19:57:38 shapirus: i did that previously, and it paid off.
19:57:52 but it's a little bit tedious and HDD-unfriendly :)
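A sketch of the strace capture being discussed: wrap the regular munin-update run so that the next sleep(3600) hang is recorded with timestamps and per-syscall timings. The munin-update path and the log location are assumptions.

```sh
#!/bin/sh
# Run from the munin user's cron in place of the plain munin-update call.
# Assumed paths; adjust to the local install.
LOG="/var/log/munin/munin-update.strace.$(date +%Y%m%d-%H%M)"

# -f follows forked workers, -tt adds microsecond timestamps,
# -T shows the time spent in each syscall (handy for spotting a long sleep).
exec strace -f -tt -T -o "$LOG" /usr/share/munin/munin-update
```

The captures are bulky, which is the HDD-unfriendliness mentioned above, so old ones should be compressed or rotated away; once a hang is caught, searching the trace around the hang time for the long sleep syscall should show the code path that led there.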
19:58:30 I have a couple dozen VMs
19:58:49 I'll see what it takes to record that output
19:58:54 [munin] steveschnepp commented on issue #625: Recent version of the deb package are ok. Closing it. https://git.io/vzHBk
19:59:04 I'd love to catch it
19:59:13 also, madduck has a good point here: https://github.com/munin-monitoring/munin/issues/619
20:00:04 right
20:00:37 it should be easy to convert that to "send as you read"
20:00:53 unless some data aggregation happens after it's read completely
20:01:03 i don't think there is.
20:01:08 also, https://github.com/munin-monitoring/munin/issues/617 is quite relevant
20:01:49 https://github.com/munin-monitoring/munin/issues/612 obviously has to be fixed
20:01:59 hard-coded stuff bites again
20:02:40 the latter looks like a pure coding mistake
20:02:49 or something under-implemented
20:03:35 I'd also mention this one: https://github.com/munin-monitoring/munin/issues/634
20:04:22 at the very least it needs the cron job schedule workaround merged
20:04:40 until there is a decision on what to do for a permanent solution
20:05:55 oh, the update-async struggle :)
20:06:51 but the first part of the issue, i don't really understand
20:07:07 [munin] Skaronator commented on issue #592: ah completely forgot this issue but yeah rebuild the whole graph system in HTML5 with some fancy JS-Graphs would be the best solution. https://git.io/vzH0k
20:07:13 what I was thinking about in that regard is a mechanism that would allow munin-master to connect to a node as soon as new data is available
20:07:32 instead of fixed cron runs
20:07:52 but that has to involve support for some callback from the nodes
20:08:17 which doesn't sound impossible, however a new daemon on the master server will be needed
20:08:47 the first part of the issue is easy
20:09:03 well, basically it's all in the steps to reproduce
20:09:10 try them and see what happens :)
20:10:48 i was thinking about allowing some POST on the munin-httpd from the nodes
20:10:59 yes
20:11:07 but that got dropped from 3.0
20:11:08 node->master: "I am ready"
20:11:19 nah, directly the data :D
20:11:20 master->node: "ok I'm coming, give me your data"
20:11:32 how does that sound?
20:11:37 go TheSnide go! ;)
20:11:46 it'll also spread the load over time
20:12:08 since munin-update will not connect to all nodes at once, but only as soon as there is new data
20:12:18 shapirus: this poking could also presumably be done with SSH?
20:12:26 as it would enable monitoring loosely connected nodes (those behind an infamous mobile internet NAT)
20:12:26 which in turn will allow updates more frequent than every 5 min
20:12:41 or less frequent as well
20:12:45 shapirus: or simply keep persistent connections from the server to the nodes and poll.
20:12:59 that doesn't sound as scalable
20:13:11 the notify->poll method sounds better to me
20:13:14 you mean past 64k connections?
20:13:23 ;)
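A sketch of the node side of the notify->poll idea above: when the spool has been refreshed, the node sends a tiny "I have fresh data" ping, and the master connects back over the normal munin protocol at its convenience. The /notify endpoint, port and payload format are hypothetical; no such handler exists in munin-httpd today.

```sh
#!/bin/sh
# Hypothetical hook to run right after munin-asyncd updates its spool.
MASTER="https://munin-master.example.com:4443"   # assumed master URL

# Fire-and-forget notification; a lost ping is harmless because the
# regular scheduled poll still picks the spooled data up later.
curl --silent --show-error --max-time 5 \
     --data "node=$(hostname -f)&ts=$(date +%s)" \
     "$MASTER/notify" || true
```

The ping carries no metric values, which is why, as argued later in the discussion, such a callback can at worst be abused for DoS, not for injecting wrong data.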
20:13:27 think of poor connections
20:13:50 madduck: well, if you have more than 1 ip for all the nodes, you can have more than 64k
20:13:53 I'm imagining some DNS zone-transfer-like system here
20:14:05 YUK ;)
20:14:11 where, on a serial change, the master sends notifies
20:14:16 and the slaves come for the fresh zone copy
20:14:23 shapirus: i have FTP-data in mind when i look at your proposal, and it doesn't sound pretty
20:14:32 that approach suits munin perfectly, I think
20:15:00 so wait a minute (sorry that I am talking as a bystander, but…)
20:15:05 shapirus: even better would be "node POSTs the new data to munin-httpd". period.
20:15:24 madduck: we're quite tolerant & open :)
20:15:29 asyncd collects data and you are suggesting that it reaches out to newdaemond to inform it about new data such that munin then opens SSH to asyncd?
20:15:34 FTP isn't bad either, it's just that the technology turned out to be not very suitable for where people started using it
20:15:55 madduck: ... basically, yes.
20:15:59 why not let asyncd push the data instead of poking for a pickup?
20:16:16 madduck: *precisely* what i'd prefer :)
20:16:17 without replacing pull support ;)
20:16:31 the problem with the "node POSTs the new data to munin-httpd" approach is that you'll have to handle the situation when the server isn't available
20:16:44 and mark a portion of data as "failed to transfer"
20:16:49 shapirus: bah. the async already takes care of that.
20:16:50 you just accumulate, as if the server doesn't fetch
20:17:01 ... remember the "node is *dumb*" approach?
20:17:19 yes, but you've just said the opposite: that the node has to push data
20:17:20 the callback will *only* be done by async implementations.
20:17:40 shapirus: ok. i lied. async will directly push data to the master.
20:18:35 madduck: i think you are thinking the same as my original design.
20:18:49 (which i didn't have time to implement)
20:19:15 what's currently missing is: #1 a persistent SQL db. #2 a POST handler on the munin-httpd
20:19:29 [22:16] < madduck> asyncd collects data and you are suggesting that it reaches out to newdaemond to inform it about new data such that munin then opens SSH to
20:19:32 asyncd?
20:19:40 that's exactly what I'm proposing
20:19:58 as to the "why not push data" thing
20:20:14 well, it's just gonna be more difficult to implement, I think
20:20:19 more prone to bugs
20:20:24 and therefore less reliable
20:20:46 whereas the notify->poll method will be bulletproof even in case of a poor connection
20:20:53 think dns :)
20:21:26 then again, think of load
20:21:43 if a hundred nodes chime in and try to push their data at once
20:21:48 it may overload the master
20:22:19 and you'll have to implement a "sorry, come later" mechanism and make the node handle it properly
20:23:03 whereas with the notify->poll method, the master will know the list of nodes which have fresh data and work on it at its convenience
20:23:03 like what http already has with its status codes
20:24:05 and that will feel more like a 'distributed' rather than a 'centralized' architecture
20:25:00 then you can have the node update its spool files as frequently as you wish (think 1sec graphs) and notify the master in a lightweight way about that (every time, or no more frequently than a configured interval)
20:25:30 and then the master may poll "when the node asks for it, but no more frequently than a configured number of seconds"
20:25:42 shapirus: i agree with most of what you said.
But, as the spooling is already done, avoiding callbacks altogether means that we don't need to open the FW in both directions
20:26:01 that's right
20:26:02 and retrying is mostly implemented
20:26:20 in what I suggest, there must be an open port on the master
20:26:32 (since the biggest issue in retrying is retaining the failed data. which we *already* do.)
20:26:40 the only drawback as I see it
20:27:01 so, "the async will POST *data* to the master" has mostly only advantages.
20:27:18 not until you think of scalability
20:27:21 ... the only drawback i see is "security"
20:27:46 as soon as the number of nodes grows high enough, the master is stuck
20:27:46 shapirus: bah. scaling HTTP requests is something that is well understood nowadays :D
20:28:14 no, I mean the case of N nodes coming with their data all at the same time
20:28:21 ... and with a persistent DB comes a *multi-host* master
20:28:48 N nodes coming with their data all at the same time <-- that *will* happen. as we are time-sensitive
20:28:55 the security issue is still there no matter whether it's a POST or a new-data notification for a poll
20:29:29 my approach will achieve time-insensitivity as a side bonus
20:29:34 if it's a notif, you can only eventually DoS the platform, not inject wrong data
20:29:35 think of that :)
20:30:23 shapirus: i'm _really_ not in favor of callbacks. Got my back burnt too often with that :)
20:31:05 and I've had my share of trouble with centralized systems ;)
20:31:23 I've spent 8 years designing and working with HA and high-load systems
20:31:24 ... it becomes a nightmare to open holes in most FWs, and you are at the mercy of NAT :)
20:31:44 I can see where scalability issues are part of the design
20:31:59 but then, as far as the firewall goes
20:32:01 shapirus: that said, _nothing_ prevents you from providing a contrib tool to do that.
20:32:17 let's assume your HTTP data POST method
20:32:24 i'd even merge it if it works
20:32:25 how are you going to secure it?
20:32:51 i was naively thinking about SSL
20:33:04 client-side certs
20:33:16 _if_ you want to secure it.
20:33:39 or simple HTTPS + HTTP Basic Auth.
20:33:47 that might also be enough
20:33:48 well, if not secured, then anyone can inject wrong data or simply DoS the server
20:33:57 yup
20:34:14 but then, however it is secured, the node callbacks can be secured in just the same way
20:34:31 so the security issue is the same for both approaches
20:34:39 i said: "node callbacks are *more* secure than node POSTs"
20:34:50 how are they more secure?
20:35:02 data injection is impossible. only DoS is.
20:35:23 agreed
20:35:51 and then with some firewall rules (e.g., allow LAN only) it protects against DoS as well
20:35:58 I also wish for some UDP data protocol :)
20:36:14 that's a bright idea
20:36:30 the perfect choice for notifications
20:36:41 we don't care about the packet loss
20:36:42 ... as a typical "$plugin fetch" usually fits in a 1.5K packet
20:37:01 512 bytes.
20:37:12 ha?
20:37:22 UDP packet size is up to 512 bytes
20:37:30 #wut ?
20:37:31 lower than the typical MTU :)
20:38:00 That's not true. UDP packets can easily be bigger.
20:38:01 well, that's the *guaranteed* size.
20:38:10 well, theoretically it can be up to 65k
20:38:17 * TheSnide enlarged his UDP packets.
20:38:26 Bloat
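Stepping back to the HTTP POST idea from a few minutes earlier: a sketch of the "async node POSTs its spool data straight to munin-httpd" approach, secured in the two ways discussed above. The /push endpoint, the port, the spool path, the certificate locations and the shared secret are all hypothetical; nothing like this exists in munin-httpd yet.

```sh
#!/bin/sh
# Assumed values, for illustration only.
MASTER="https://munin-master.example.com:4443"
SPOOL="/var/lib/munin-async/$(hostname -f).spool"
NODE_SECRET="change-me"            # placeholder shared secret

# Variant 1: TLS with a client-side certificate
curl --cacert /etc/munin/tls/ca.pem \
     --cert   /etc/munin/tls/node.pem \
     --key    /etc/munin/tls/node.key \
     --data-binary "@$SPOOL" \
     "$MASTER/push/$(hostname -f)"

# Variant 2: plain HTTPS plus HTTP Basic Auth
curl --user "$(hostname -f):$NODE_SECRET" \
     --data-binary "@$SPOOL" \
     "$MASTER/push/$(hostname -f)"
```

Both variants are meant to keep unauthenticated clients from injecting data; DoS remains the residual risk noted above for both the POST and the notification schemes, to be handled with firewall rules or rate limiting.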
20:38:45 but i'd say a 1k packet is fair
20:38:50 either way, I don't think it's a good idea to use UDP for the actual data fetch: too much hassle for no profit
20:39:16 very unlike the lightweight "new data" notifications sent from the nodes to the master, if we think that way
20:39:18 super-duper lightweight node. no async.
20:39:29 if data is lost, so be it.
20:39:52 as an additional feature?
20:39:59 * TheSnide has a WIP of munin-node-c with a streaming plugin via UDP
20:40:00 might be useful
20:40:42 the cherry on the cake would be to have a statsd gateway (that's what i'm _really_ thinking of)
20:40:44 won't you eventually be creating another graphite that way? :)
20:41:30 shapirus: graphite does many things right.
20:41:35 from a user perspective.
20:41:52 anyway... i'm closing the meeting
20:41:56 #endmeeting