15:58:11 <phw> #startmeeting anti-censorship team meeting
15:58:11 <MeetBot> Meeting started Thu Dec 10 15:58:11 2020 UTC. The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:11 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:15 <cohosh> hi
15:58:23 <phw> hi all. here's our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:58:45 <agix> hi
15:59:05 <phw> i added a few discussion items but didn't have enough time to prepare, so i'll probably cover them next week
15:59:47 <phw> agix: let's talk about your topic then
15:59:56 <agix> sure
16:00:04 <agix> I'll provide some more context
16:00:29 <agix> The research question on that issue is how to generate a synthetic index page for HTTPT proxies.
16:01:08 <agix> The way I see it, we would first need to find content keywords that most likely won’t be blocked, for something like a web blog. We could feed those keywords into OpenAI’s text generator, for example, to create the content for the blog.
16:01:21 <agix> What seems trickier is how to generate different DOM structures and CSS layouts for different proxy index pages.
16:01:44 <agix> So I wanted to gather your thoughts on this and, additionally, on what a realistic threat model would look like. Are we trying to defend against web scrapers or human censors?
16:01:57 <agix> If the latter, will the index page be sufficient or do we need to generate more content and pages in order to not attract attention?
16:02:52 <phw> i would say that scanners should be in the threat model but targeted analysis by a person should not
16:03:55 <cohosh> yeah, agreed. at least for a first step
16:04:05 <phw> hmm, is it really such a big deal if we have blocked keywords on the page? only if we assume that the censor would then block the entire host/domain, no?
16:04:46 <phw> we should probably avoid it as much as we can but it may not be an automatic death sentence
16:04:59 <agix> good point
16:05:55 <agix> so if we consider scanners a potential threat, would the index page be enough to disguise the proxy?
16:06:10 <cohosh> i wonder if the GAN work our race colleagues have been doing would come in handy for evaluating and coming up with pages that look realistic enough to a bot?
16:06:22 * cohosh says this not really knowing much about GANs
16:07:14 <agix> I don't have much experience with GANs either ^^ but phw was kind enough to ask the people from race and they provided us with the paper
16:07:45 <cohosh> ah nice!
16:07:45 <agix> I still have to look into it though
16:08:39 <agix> I also found another paper where researchers were able to create synthetic research papers; perhaps that's also helpful for us
16:08:48 <phw> there may be papers out there that looked at the landing pages of web servers and tell you what you could expect
16:09:20 <phw> a simple default nginx/apache index page may be good enough in a lot of cases (even though other censors may not hesitate to block those)
16:09:58 <phw> oh, this one? https://pdos.csail.mit.edu/archive/scigen/
16:10:20 <agix> oh yeah, i think it was that one
16:10:56 <agix> sergey also proposed similar approaches, where we could display common error pages or a login page
16:11:45 <cohosh> oh, a login page is a good one
16:11:55 <phw> yes, i like that too
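As a rough illustration of the login-page camouflage idea endorsed above, here is a minimal sketch using only Python's standard-library http.server. The page content, paths, and port are invented for illustration; this is not HTTPT's actual implementation, only the general "bland login facade" idea under discussion.

```python
# Minimal sketch of the "plausible login page" idea from the meeting.
# Uses only the Python 3 standard library; all content is illustrative.
from http.server import BaseHTTPRequestHandler, HTTPServer

LOGIN_PAGE = b"""<!doctype html>
<html><head><title>Sign in</title></head>
<body>
  <form method="post" action="/login">
    <label>Username <input name="user"></label>
    <label>Password <input name="pass" type="password"></label>
    <button type="submit">Sign in</button>
  </form>
</body></html>"""

class LoginFacade(BaseHTTPRequestHandler):
    def do_GET(self):
        # Every path returns the same bland login page, so a scanner
        # sees nothing but an unremarkable password-protected site.
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(LOGIN_PAGE)))
        self.end_headers()
        self.wfile.write(LOGIN_PAGE)

    def do_POST(self):
        # Drain the request body, then reject all credentials. A real
        # proxy would demultiplex covert traffic elsewhere entirely.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(403)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), LoginFacade).serve_forever()
```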
16:12:52 <agix> is there any way to evaluate how "censorable" a web page is?
16:13:29 <agix> just based on the appearance
16:14:20 <cohosh> i don't know of any previous work on this
16:14:34 <phw> that's difficult, in part because it's country-specific
16:14:51 <phw> i also think we shouldn't get carried away worrying about increasingly exotic attacks
16:15:01 <phw> similar to how people worry about obfs4 flow classifiers
16:15:11 <phw> ...when the real issue is that a simple decision tree already works great
16:15:13 <cohosh> for a research paper you'll probably need a way to evaluate it though
16:15:29 <cohosh> but I agree with the simple decision tree
16:16:12 <agix> cohosh: the evaluation might be a tricky one :-/
16:16:51 <cohosh> i guess you could attempt to show that a decision tree would have low accuracy when distinguishing between your page and non-HTTPT pages?
16:17:05 <cohosh> i always find evaluation for censorship resistance research a bit tricky
16:17:44 <agix> yeah, I like that one; that might be a good way to do it
16:18:10 <phw> (the comparison to obfs4 isn't great because in httpt there's a clear difference between the protocol itself and the content that's served by the web server)
16:19:43 <phw> in other words: it may be wise to broaden your scope a bit and look at synthetic content specifically in the context of httpt
16:20:11 <agix> phw: good point
16:20:14 <phw> if this was a corporate meeting room, i'd say you have to look at it holistically
16:20:47 <agix> Do you think that in the end a simple login page might still be a better choice than a complex AI-generated web page?
16:21:29 <cohosh> lol phw
16:22:14 <phw> i'd say diversity is important here. if all we have is login pages, then we may find ourselves in trouble soon. if we have login pages *and* default apache pages *and* synthetic content, it gets much harder for censors
16:24:02 <agix> sure, that makes sense
16:25:02 <agix> I don't want to take too much of your time, so thanks so much for the input, and I'll keep the open issue updated on my progress :)
16:25:59 <phw> no worries, i enjoy these brainstorming sessions
16:26:24 <agix> cool, I will bug you with new ones in the future
16:27:56 <phw> ok, let's move to reviews
16:28:39 <phw> hmm, nothing?
16:28:47 <cohosh> i guess not
16:29:05 <phw> anything else for today?
16:29:18 <cohosh> not from me :)
16:29:25 <agix> same here
16:29:33 <phw> same. let's wrap it up then
16:29:35 <phw> #endmeeting
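To make cohosh's evaluation suggestion concrete: one could train a simple decision tree to distinguish candidate HTTPT pages from ordinary web pages and check that its cross-validated accuracy stays near chance. A rough sketch follows, assuming scikit-learn; the page_features() helper and its hand-picked features are hypothetical stand-ins for whatever features a censor's classifier might actually use, and nothing here is an agreed-upon methodology from the meeting.

```python
# Sketch of the evaluation idea: a decision tree should do no better
# than chance at separating synthetic HTTPT pages from ordinary ones.
# Assumes scikit-learn; page_features() is a hypothetical extractor.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def page_features(html: str) -> list[float]:
    # Hypothetical hand-picked features a censor's classifier might use.
    return [
        float(len(html)),
        float(html.count("<div")),
        float(html.count("<form")),
        float(html.count("<script")),
        float(html.count("href=")),
    ]


def camouflage_score(httpt_pages: list[str], ordinary_pages: list[str]) -> float:
    X = [page_features(p) for p in httpt_pages + ordinary_pages]
    y = [1] * len(httpt_pages) + [0] * len(ordinary_pages)
    clf = DecisionTreeClassifier(max_depth=5, random_state=0)
    # Mean accuracy near 0.5 suggests the pages are hard to tell apart;
    # near 1.0 means the camouflage fails against even a simple classifier.
    return cross_val_score(clf, X, y, cv=5).mean()
```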