15:58:11 <phw> #startmeeting anti-censorship team meeting
15:58:11 <MeetBot> Meeting started Thu Dec 10 15:58:11 2020 UTC.  The chair is phw. Information about MeetBot at http://wiki.debian.org/MeetBot.
15:58:11 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
15:58:15 <cohosh> hi
15:58:23 <phw> hi all. here's our meeting pad: https://pad.riseup.net/p/tor-anti-censorship-keep
15:58:45 <agix> hi
15:59:05 <phw> i added a few discussion items but didn't have enough time to prepare, so i'll probably cover them next week
15:59:47 <phw> agix: let's talk about your topic then
15:59:56 <agix> sure
16:00:04 <agix> I'll provide some more context
16:00:29 <agix> The research question on that issue is how to generate a synthetic index page for HTTPT proxies.
16:01:08 <agix> The way I see it is that we would need to first find content keywords that most likely won’t be blocked, for something like a web blog. We could feed those keywords into OpenAI’s text generator for example, in order to create the content for the blog.
16:01:21 <agix> What seems to be trickier is the issue of how to generate different DOM structures and css layouts for different proxy index pages.
16:01:44 <agix> So I wanted to gather your thoughts on this and, additionally, on what a realistic threat model would look like. Are we trying to defend against web scrapers or human censors?
16:01:57 <agix> If the latter, will the index page be sufficient or do we need to generate more content and pages in order to not attract attention?
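The per-proxy layout idea agix raises (different DOM structures and CSS layouts for different index pages) can be sketched with a seeded template randomizer. This is a minimal, stdlib-only illustration, not anything from HTTPT itself; the function and field names are hypothetical, and in practice `paragraphs` would hold generated text rather than placeholders:

```python
import random
import string

def build_index_page(paragraphs, seed):
    """Assemble a blog-like index page with a per-proxy DOM layout.

    `paragraphs` would come from a text generator (e.g. the OpenAI idea
    above); `seed` would be derived from something stable per proxy, so
    each deployment gets its own markup structure and class names.
    """
    rng = random.Random(seed)
    # Vary the element used to wrap each post and the CSS class name.
    wrapper = rng.choice(["div", "section", "article"])
    css_class = "".join(rng.choice(string.ascii_lowercase) for _ in range(8))
    # Vary the layout: sometimes a sidebar, sometimes a flat column.
    sidebar = (
        "<nav><ul><li>Archive</li><li>About</li></ul></nav>"
        if rng.random() < 0.5
        else ""
    )
    body = "\n".join(
        f'<{wrapper} class="{css_class}"><p>{p}</p></{wrapper}>'
        for p in paragraphs
    )
    return (
        "<!DOCTYPE html><html><head>"
        f"<style>.{css_class} {{ margin: {rng.randint(4, 24)}px; }}</style>"
        "</head><body>"
        f"{sidebar}{body}"
        "</body></html>"
    )
```

The point of seeding is that a given proxy always serves the same page (so repeated probes look consistent), while different proxies don't share a fingerprintable template.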
16:02:52 <phw> i would say that scanners should be in the threat model but targeted analysis by a person should not
16:03:55 <cohosh> yeah, agreed. at least for a first step
16:04:05 <phw> hmm, is it really such a big deal if we have blocked keywords on the page? only if we assume that the censor would then block the entire host/domain, no?
16:04:46 <phw> we should probably avoid it as much as we can but it may not be an automatic death sentence
16:04:59 <agix> good point
16:05:55 <agix> so in case we consider scanners as a potential threat, would the index page be enough to disguise the proxy?
16:06:10 <cohosh> i wonder if the GAN work our race colleagues have been doing would come in handy for evaluating and coming up with pages that look realistic enough to a bot?
16:06:22 * cohosh says this not really knowing much about GANs
16:07:14 <agix> I don't have much experience with GANs either ^^ but phw was kind enough to ask the people from race and they provided us with the paper
16:07:45 <cohosh> ah nice!
16:07:45 <agix> I still have to look into it though
16:08:39 <agix> I also found another paper, where researchers were able to create synthetic research papers, perhaps also helpful for us
16:08:48 <phw> there may be papers out there that looked at the landing pages of web servers and tell you what you could expect
16:09:20 <phw> a simple default nginx/apache index page may be good enough in a lot of cases (even though other censors may not hesitate to block those)
16:09:58 <phw> oh, this one? https://pdos.csail.mit.edu/archive/scigen/
16:10:20 <agix> oh yeah i think it was that one
16:10:56 <agix> sergey also proposed similar approaches, where we could display common error pages or a login page
16:11:45 <cohosh> oh a login page is a good one
16:11:55 <phw> yes, i like that too
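The decoy login page sergey proposed needs no generation machinery at all; a static template behind the proxy's web server is enough. Here is a minimal, stdlib-only sketch of the idea. All of the markup, wording, and handler names are illustrative and not taken from HTTPT:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# A bland, generic login form shown to any unauthenticated visitor.
LOGIN_PAGE = """<!DOCTYPE html>
<html>
<head><title>Sign in</title></head>
<body>
  <form method="post" action="/login">
    <label>Username <input name="user" type="text"></label>
    <label>Password <input name="pass" type="password"></label>
    <button type="submit">Sign in</button>
  </form>
</body>
</html>
"""

class DecoyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the same innocuous login page for every path probed.
        body = LOGIN_PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        # Reject every login attempt, like a real site with no valid account.
        self.send_response(401)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), DecoyHandler).serve_forever()
```

A login page has the nice property that a scanner has no obvious next step: there is nothing to crawl behind it, so a thin front is plausible.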
16:12:52 <agix> is there any way to evaluate how "censorable" a web page is?
16:13:29 <agix> just based on the appearance
16:14:20 <cohosh> i don't know of any previous work on this
16:14:34 <phw> that's difficult, in part because it's country-specific
16:14:51 <phw> i also think we shouldn't get carried away worrying about increasingly exotic attacks
16:15:01 <phw> similar to how people worry about obfs4 flow classifiers
16:15:11 <phw> ...when the real issue is that a simple decision tree already works great
16:15:13 <cohosh> for a research paper you'll probably need a way to evaluate it though
16:15:29 <cohosh> but I agree with the simple decision tree
16:16:12 <agix> cohosh the evaluation might be a tricky one :-/
16:16:51 <cohosh> i guess you could attempt to show that a decision tree would have low accuracy when distinguishing between your page and non HTTPT pages?
16:17:05 <cohosh> i always find evaluation for censorship resistance research a bit tricky
16:17:44 <agix> yeah I like that one, that might be a good way to do it
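The evaluation cohosh sketches — checking whether a simple classifier can tell a synthetic HTTPT index page apart from ordinary pages — can be prototyped with a depth-1 decision tree (a "stump") over crude structural features. This is a stdlib-only illustration; a real study would use something like scikit-learn's `DecisionTreeClassifier`, and the feature set here is purely hypothetical:

```python
def page_features(html):
    """Extract a few crude structural features from raw HTML."""
    return {
        "length": len(html),
        "links": html.count("<a "),
        "forms": html.count("<form"),
        "scripts": html.count("<script"),
    }

def best_stump(dataset):
    """Find the single-feature threshold rule with the highest accuracy.

    dataset: list of (features, label) pairs, label 1 = ordinary page,
    0 = synthetic page. Returns (feature, threshold, direction, accuracy).
    A best accuracy near 0.5 would suggest the synthetic pages are hard
    to distinguish from ordinary ones -- the property we want.
    """
    best = None
    for feat in dataset[0][0]:
        for thresh in sorted({f[feat] for f, _ in dataset}):
            for direction in (1, -1):  # 1: predict "ordinary" when >= thresh
                correct = sum(
                    1
                    for f, label in dataset
                    if ((f[feat] >= thresh) == (direction == 1)) == (label == 1)
                )
                acc = correct / len(dataset)
                if best is None or acc > best[3]:
                    best = (feat, thresh, direction, acc)
    return best
```

This mirrors phw's earlier point about obfs4: the interesting question is not whether an exotic classifier exists, but whether a trivial one already separates the pages.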
16:18:10 <phw> (the comparison to obfs4 isn't great because in httpt there's a clear difference between the protocol itself and the content that's served by the web server)
16:19:43 <phw> in other words: it may be wise to broaden your scope a bit and look at synthetic content specifically in the context of httpt
16:20:11 <agix> phw good point
16:20:14 <phw> if this was a corporate meeting room, i'd say you have to look at it holistically
16:20:47 <agix> Do you think that in the end a simple login page might still be a better choice than a complex AI-generated web page?
16:21:29 <cohosh> lol phw
16:22:14 <phw> i'd say diversity is important here. if all we have is login pages, then we may find ourselves in trouble soon. if we have login pages *and* default apache pages *and* synthetic content, it gets much harder for censors
16:24:02 <agix> sure, that makes sense
16:25:02 <agix> I don't want to take too much of your time, so thanks so much for the input and I'll keep the open issue updated on my progress :)
16:25:59 <phw> no worries, i enjoy these brainstorming sessions
16:26:24 <agix> cool, I will bug you with new ones in the future
16:27:56 <phw> ok, let's move to reviews
16:28:39 <phw> hmm, nothing?
16:28:47 <cohosh> i guess not
16:29:05 <phw> anything else for today?
16:29:18 <cohosh> not from me :)
16:29:25 <agix> same here
16:29:33 <phw> same. let's wrap it up then
16:29:35 <phw> #endmeeting