16:00:08 <shelikhoo> #startmeeting tor anti-censorship meeting
16:00:08 <shelikhoo> here is our meeting pad: https://pad.riseup.net/p/r.9574e996bb9c0266213d38b91b56c469
16:00:08 <shelikhoo> editable link available on request
16:00:08 <MeetBot> Meeting started Thu Feb 27 16:00:08 2025 UTC.  The chair is shelikhoo. Information about MeetBot at http://wiki.debian.org/MeetBot.
16:00:08 <MeetBot> Useful Commands: #action #agreed #help #info #idea #link #topic.
16:00:12 <shelikhoo> hi~hi~
16:00:24 <cohosh> hi
16:01:01 <meskio> hello
16:04:08 <shelikhoo> okay, I think we can start with the first discussion point:
16:04:09 <shelikhoo> Next steps for datagram transport mode for Snowflake
16:04:09 <shelikhoo> The broker can now reject older proxies based on the version number
16:04:09 <shelikhoo> The new server, broker, and proxy are designed to work with both new and old clients
16:04:09 <shelikhoo> We still need to add support for the new protocol to the webextension version of the proxy
16:04:10 <shelikhoo> Should we add both versions of the protocol to the client? Or should we just merge the proxy, broker, and server code now, and wait long enough before merging the client?
16:04:32 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/315
16:05:19 <shelikhoo> It is possible to add both stream and datagram transport modes to the client, but that work would not be strictly necessary
16:06:00 <shelikhoo> We can proceed with aiming to merge the proxy, broker, and server first
16:06:07 <shelikhoo> and wait for proxy deployment
16:06:16 <shelikhoo> before going ahead with merging the client
16:06:33 <shelikhoo> so that we don't have to support both modes in the client, then delete it later
16:07:00 <shelikhoo> we might still want to run a staging broker and some proxies for testing
16:07:20 <shelikhoo> before the proxy part is merged
16:07:44 <shelikhoo> so there are 2 discussion points: should we go ahead and deploy a staging server
16:07:59 <shelikhoo> should we add both versions of the protocol to the client?
16:09:04 <cohosh> we've had one instance before where we rolled out a feature that required all proxies to update before we updated clients
16:09:29 <cohosh> that was when we added support for multiple snowflake bridges
16:09:31 <cohosh> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/issues/28651
16:09:47 <cohosh> we ended up deploying some metrics to track the proxy update process https://gitlab.torproject.org/tpo/anti-censorship/team/-/issues/95
16:10:21 <shelikhoo> yes, and in that case the clients already had code to support both single bridge mode
16:10:30 <shelikhoo> and multiple bridge mode
16:10:43 <shelikhoo> we just changed an option on the client
16:11:11 <shelikhoo> in the current case, we are discussing whether to support both transport modes in a single version of the code base
16:11:11 <cohosh> that's right, so it was a little different
16:11:38 <shelikhoo> this will increase the complexity of the code
16:13:22 <meskio> seeing that we can take time to do this deployment and wait until most proxies have upgraded, it sounds to me like the complexity may not be needed
16:13:33 <cohosh> i think the plan of merging and deploying support in proxies first, and watching how it rolls out is a good one
16:13:50 <meskio> +1
16:13:59 <meskio> we can change our mind if the roll out is too slow
16:14:09 <meskio> BTW, is the webextension also ready for this?
16:14:25 <cohosh> i think it would be good for performance testing purposes to try out both old and new clients with the deployed proxies
16:14:44 <cohosh> since that's the goal of this work
16:14:51 <shelikhoo> no, but webextension deployment is much faster
16:15:04 <shelikhoo> while standalone proxies will take some time to update
16:15:12 <cohosh> it would be nice to get an idea of how successful it was before we flip the switch
16:15:25 <cohosh> well, i guess it's one goal of this work
16:15:51 <shelikhoo> yes, in our current configuration, if the client asks for UDP transport mode and the proxy does not support it
16:16:01 <cohosh> yes
16:16:02 <shelikhoo> the connection would fail or not work
16:16:46 <shelikhoo> so unless the broker is rejecting old proxies, connecting with a new client will not work as expected
16:16:47 <dcf1> so, is the question about whether to implement backward compatibility at the broker or the client?
16:17:23 <dcf1> "the connection would fail or not work" could be addressed in at least two ways: the broker could avoid making matches that don't work; or all clients could be universally compatible
16:17:23 <shelikhoo> we are thinking about whether to add code to support both transport modes to the client
16:17:44 <cohosh> i think they are linked, if we decide not to make the client backwards compatible, we will need the broker to reject old proxies
16:17:49 <shelikhoo> the broker can already avoid making matches that don't work by rejecting old proxies
16:18:02 <shelikhoo> this part is done
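(To make the version-gating idea above concrete: a minimal sketch, not the actual Snowflake broker code. The version tuple, field names, and minimum version are all hypothetical.)

```python
from dataclasses import dataclass

# Hypothetical minimum proxy version required for datagram-mode support
MIN_PROXY_VERSION = (1, 4)

@dataclass
class ProxyPoll:
    proxy_id: str
    version: tuple  # (major, minor), parsed from the proxy's poll request

def accept_proxy(poll: ProxyPoll) -> bool:
    """Reject proxies too old to speak the new protocol, so that a new
    client is never matched with a proxy that cannot serve it."""
    return poll.version >= MIN_PROXY_VERSION

print(accept_proxy(ProxyPoll("p1", (1, 4))))  # new proxy: accepted (True)
print(accept_proxy(ProxyPoll("p2", (1, 3))))  # old proxy: rejected (False)
```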
16:18:13 <cohosh> we've used this before
16:18:22 <dcf1> Ok, to me these seem like two alternatives, not something where you would do both.
16:18:58 <shelikhoo> yes, and my current proposal is to merge the proxy, server, and broker part first
16:19:11 <shelikhoo> then wait for enough proxies to switch
16:19:26 <dcf1> also I'm mentally reserving that the broker could be more subtle than rejecting old proxies, it could instead avoid matching new clients with old proxies
16:19:36 <shelikhoo> while we write the webextension proxy support for UDP transport mode
16:19:51 <shelikhoo> before we reject the old proxy
16:20:04 <shelikhoo> dcf1: it would make the matching process more complex
16:20:32 <dcf1> yes, granted
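(dcf1's subtler alternative, capability-aware matching instead of outright rejection, could be sketched as follows. This is an illustration only, not the real broker matching code; the names and pool shape are invented.)

```python
# Sketch: pair datagram-mode clients only with proxies that support
# datagram mode, while old stream-mode clients can still use any proxy.

def match(client_wants_datagram, proxies):
    """proxies: list of (proxy_id, supports_datagram) tuples.
    Returns a matching proxy id, or None if no compatible proxy exists."""
    for proxy_id, supports_datagram in proxies:
        if not client_wants_datagram or supports_datagram:
            return proxy_id
    return None

pool = [("old-proxy", False), ("new-proxy", True)]
print(match(True, pool))   # datagram client -> "new-proxy"
print(match(False, pool))  # stream client   -> "old-proxy"
```

This keeps old proxies useful for old clients, at the cost of the extra matchmaking complexity shelikhoo mentions.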
16:20:46 <shelikhoo> and the version that did broker-assisted protocol negotiation was rejected in merge review
16:20:48 <dcf1> and universal protocol support in clients is also a cost in complexity
16:21:07 <dcf1> that's why I was mentally framing it as two alternatives, but my understanding might be bad
16:21:21 <cohosh> we also don't need to make this decision now, we can wait and see what the proxy metrics look like after the deployment
16:21:22 <shelikhoo> and the client-selects-protocol approach was used instead
16:21:34 <dcf1> "the version that does a broker assisted protocol negotiation was rejected in merge review" that's not because it's a bad idea, it's because that was not the focus of that part of the patch development
16:22:04 <shelikhoo> okay... but anyway, there are 2 things we could do now:
16:22:21 <dcf1> that was a feature I asked you to leave out of the protocol development, specifically to defer the discussion of how to do backward compatibility for later, which is what we are doing now
16:22:28 <shelikhoo> 1. I can create a merge request with only proxy, broker, and server update
16:23:05 <shelikhoo> 2. I can deploy a testing broker, so that we can test the new protocol without impacting current proxy pool
16:24:29 <shelikhoo> alternatively we could consider other plans, such as adding more tolerance for version differences
16:25:23 <cohosh> a testing broker would be nice for catching bugs, we'll be limited in what we can learn about performance
16:26:05 <shelikhoo> yes, I could just deploy a testing broker
16:26:13 <shelikhoo> without deploying testing proxies
16:26:33 <cohosh> why without?
16:26:56 <shelikhoo> in this way, when testing, the proxies' network environment can be controlled by the tester
16:27:13 <cohosh> ah i see
16:28:20 <shelikhoo> oh, actually we can have both: first we run a broker with proxies to test for bugs
16:28:33 <shelikhoo> before we shut down the proxies and have another round of testing
16:29:33 <cohosh> that sounds like a reasonable testing plan to me
16:30:01 <meskio> +1
16:30:16 <cohosh> we can also test proxies that have support for the new feature in the real snowflake network, if some operators are willing
16:30:30 <shelikhoo> yes. So, my plan is as follow: I will create a testing broker deployment with some proxies for testing
16:30:50 <cohosh> (just to make sure the proxy backwards compatibility works for now)
16:31:46 <shelikhoow> cohosh: I assume we need to merge and update the broker and server for the backwards compatibility to actually work
16:32:02 <cohosh> shelikhoo: ah, i missed that part of it
16:32:06 <cohosh> ok
16:32:11 <shelikhoow> otherwise if the proxies are updated first, then the result will be undefined
16:32:50 <cohosh> can this testing broker be used with both new and old clients?
16:33:06 <shelikhoow> both new and old clients will work
16:33:19 <shelikhoow> but proxies, broker, and server must be updated
16:34:49 <cohosh> ok so the test setup would need a server too
16:34:51 <shelikhoo> so my plan will be to create a testing network with server, broker, and proxies
16:35:13 <cohosh> ok, it's a lot of work but this is a big feature
16:35:21 <shelikhoo> yes, but things like that aren't hard, compared to getting our current pool to update
16:35:48 <cohosh> we can make some tor browser builds with the new client and build in the test bridge line if we want to do a wider testing call
16:36:33 <shelikhoo> yes! that being said the speed won't be directly comparable
16:36:40 <cohosh> yep
16:36:48 <shelikhoo> unless we distribute 2 Tor Browser builds
16:36:58 <shelikhoo> one with new client, and another one with old client
16:37:07 <shelikhoo> and no one else is running proxy
16:37:36 <shelikhoo> but anyway I think we already reached a conclusion for the next step: setting up a testing network
16:37:42 <shelikhoo> let's move to the next topic
16:37:46 <meskio> :)
16:38:04 <cohosh> in the meantime, it also makes sense to me to prepare a MR with just the proxy, broker, and server changes, since regardless of how we handle backwards compatibility that will be the first step
16:38:27 <cohosh> oh, we can move on, since there's also no rush on that i suppose
16:38:38 <shelikhoo> yes, we are running out of time
16:38:46 <shelikhoo> there is no rush
16:38:47 <shelikhoo> Gitlab Dependency Proxy and our merge request pipeline
16:38:47 <shelikhoo> https://gitlab.torproject.org/tpo/anti-censorship/pluggable-transports/snowflake/-/merge_requests/522
16:38:47 <shelikhoo> The pipeline will stop working when we are developing from our "personal" fork of it
16:39:13 <shelikhoo> with the new dependency proxy system
16:39:39 <shelikhoo> the pipeline would use the dependency proxy if it is running from the team namespace
16:39:58 <shelikhoo> and pull directly without it when running from a personal namespace
16:40:02 <shelikhoo> so...
16:40:16 <shelikhoo> if we continue to develop with our existing workflow
16:40:44 <shelikhoo> then the personal forks that we develop on will have a malfunctioning pipeline
16:40:57 <shelikhoo> should we: develop in the team namespace
16:41:00 <meskio> ohh, I guess I missed that part when writing it, I just copypasted from what the network team is using
16:41:15 <shelikhoo> or find some better plan about gitlab actions
16:41:45 <shelikhoo> we could, obviously, run the gitlab actions locally on our own machine
16:42:03 <meskio> I like the personal forks as they don't pollute the main repo with branches
16:42:10 <meskio> but I don't have strong opinions there
16:42:18 <shelikhoo> but it would make it significantly harder to collaborate and do merge review
16:42:34 <meskio> yes, it is nice to get CI on our merge requests
16:43:07 <meskio> I could check with TPA on this and see if we can find a solution that includes personal forks
16:43:30 <meskio> I haven't really looked into their template
16:43:36 <cohosh> +1 checking that sounds good
16:43:38 <shelikhoo> okay, we can discuss this again next week
16:43:47 <meskio> okay, I'll dig into it
16:43:57 <shelikhoo> here are the announcements:
16:43:59 <shelikhoo> rdsys 1.0 release
16:43:59 <shelikhoo> https://blog.torproject.org/making-connections-from-bridgedb-to-rdsys/
16:44:01 <shelikhoo> yeah!
16:44:18 <shelikhoo> interesting links:
16:44:19 <shelikhoo> TURN/STUN server networks from https://www.petsymposium.org/foci/2025/foci-2025-0003.php "Using TURN Servers for Censorship Evasion"
16:44:19 <shelikhoo> https://developers.cloudflare.com/calls/turn/
16:44:19 <shelikhoo> https://www.metered.ca/tools/openrelay/
16:44:19 <shelikhoo> https://www.expressturn.com/
16:44:20 <shelikhoo> https://xirsys.com/
16:44:32 <shelikhoo> ===================
16:44:33 <shelikhoo> https://arxiv.org/abs/2409.06247 "Differential Degradation Vulnerabilities in Censorship Circumvention Systems"
16:44:33 <shelikhoo> Several recently proposed censorship circumvention systems use encrypted network channels of popular applications to hide their communications. For example, a Tor pluggable transport called Snowflake uses the WebRTC data channel, while a system called Protozoa substitutes content in a WebRTC video-call application. By using the same channel as the cover application and (in the case of Protozoa) matching its observable traffic
16:44:34 <shelikhoo> characteristics, these systems aim to resist powerful network-based censors capable of large-scale traffic analysis. Protozoa, in particular, achieves a strong indistinguishability property known as behavioral independence.
16:44:36 <shelikhoo> We demonstrate that this class of systems is generically vulnerable to a new type of active attacks we call "differential degradation." These attacks do not require multi-flow measurements or traffic classification and are thus available to all real-world censors. They exploit the discrepancies between the respective network requirements of the circumvention system and its cover application. We show how a censor can use the minimal
16:44:40 <shelikhoo> application-level information exposed by WebRTC to create network conditions that cause the circumvention system to suffer a much bigger degradation in performance than the cover application. Even when the attack causes no observable differences in network traffic and behavioral independence still holds, the censor can block circumvention at a low cost, without resorting to traffic analysis, and with minimal collateral damage to non-
16:44:42 <shelikhoo> circumvention users.
16:44:47 <shelikhoo> ================================
16:44:49 <shelikhoo> Wallbleed: A Memory Disclosure Vulnerability in the Great Firewall of China
16:44:51 <shelikhoo> https://gfw.report/publications/ndss25/en/
16:44:56 <dcf1> I suppose none of you on the snowflake team was contacted about the "Differential Degradations" manuscript?
16:45:07 <cohosh> no
16:45:14 <shelikhoo> no from me
16:45:24 <dcf1> I'm guessing this is yet another instance of the research anti-pattern.
16:45:47 <dcf1> I found out about it in the references when reading another paper draft. I haven't read it yet.
16:46:00 <shelikhoo> Wallbleed was previously discussed in last year's FOCI online version.
16:46:34 <shelikhoo> okay, let's discuss it once we have read it
16:46:42 <shelikhoo> and now is the reading group
16:46:43 <shelikhoo> Identifying VPN Servers through Graph-Represented Behaviors
16:46:46 <meskio> yes, maybe for another reading group
16:46:57 <shelikhoo> https://dl.acm.org/doi/10.1145/3589334.3645552
16:46:57 <shelikhoo> https://dl.acm.org/doi/pdf/10.1145/3589334.3645552
16:47:38 <shelikhoo> do we want to have a summary of the paper, or can we move to discussion directly?
16:47:44 <dcf1> I wrote a summary of this one, unfortunately I had trouble keeping it concise.
16:47:48 <dcf1> https://github.com/net4people/bbs/issues/455
16:48:10 <meskio> dcf1: thank you for the summary, it helped me to understand parts of the paper I was struggling with
16:48:26 <meskio> and also to see that some parts that are confusing to me are also unclear to you
16:48:35 <dcf1> The main reason I was interested in this one is the "graph-represented" term in the title. I supposed that meant it was going to be a paper about using server access patterns to identify VPNs.
16:48:51 <onyinyang> yes thanks for this write up and pointing to the open reviews dcf1
16:49:30 <dcf1> It turns out that is true (they do do that), but only in part. Because they also represent a whole bunch of active probing features, and store those in a graph as well, which muddies the concept somewhat.
16:50:28 <dcf1> These are the "communication graph" and the "probing graph". They use a technique called GraphSAGE to aggregate features from nearby nodes in each graph, then concatenate the features from a particular server to be classified and feed it to a normal ML classifier.
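(A toy illustration of the aggregate-then-concatenate step dcf1 describes, using a plain mean aggregator; this is not the paper's code, and the feature values are invented. GraphSAGE's key idea is to build a node's representation from its neighbors' features plus its own.)

```python
import numpy as np

def mean_aggregate(node_feat, neighbor_feats):
    """Mean-aggregate the features of a node's graph neighbors, then
    concatenate the aggregate with the node's own features. The result
    is what gets fed to an ordinary ML classifier."""
    agg = neighbor_feats.mean(axis=0)
    return np.concatenate([node_feat, agg])

server = np.array([1.0, 0.0])       # features of the server to classify
neighbors = np.array([[0.0, 1.0],   # features of adjacent nodes in the
                      [2.0, 3.0]])  # communication or probing graph
embedding = mean_aggregate(server, neighbors)
print(embedding)  # [1. 0. 1. 2.]
```

In the paper's setup this would be done once per graph (communication and probing), with the two resulting vectors concatenated before classification.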
16:50:50 <shelikhoo> I think one of the important observations is that "no feature" is a feature
16:51:18 <shelikhoo> like when receiving bad request, keep reading the connection
16:51:23 <dcf1> Ultimately I was disappointed, because they don't have much to say about what makes a communication graph characteristic of a VPN. All they say (and they do not back it up with any evidence) is that VPNs have fewer distinct users than "normal" servers.
16:51:33 <shelikhoo> or do not send close signal
16:51:57 <dcf1> shelikhoo: yes, that is relevant. However it's an active probing feature, not an "access relation" feature as I hoped to read more about.
16:52:38 <cohosh> this work reminded me a little of the host-based classification from https://censorbib.nymity.ch/#Wails2024a
16:53:01 <shelikhoo> yes, I think once there is machine learning involved, the method is no longer human explainable
16:53:13 <dcf1> If you check Table 4 "Ablation Experiments" on page 7, you can see in the "W/PG" row (which means "without probing graph; i.e., with communication graph only") that the access relation features, on their own, are not good at all. E.g. Accuracy = 0.5582.
16:53:50 <shelikhoo> so even if they were able to make machine-learning-based detection work, we could learn very little about it, even with all the weights open access
16:54:10 <dcf1> shelikhoo: I disagree, I mean they are doing feature engineering in selecting the features, they must have some motivation for them. But my guess is that the authors were just kind of throwing some thing together to see if they worked, without having a strong insight.
16:54:41 <shelikhoo> dcf1: I agree the feature inputs are important
16:55:02 <shelikhoo> we should try to make our protocol match other protocols' "feature inputs"
16:55:13 <dcf1> If I had to guess, I would venture that the motivation for writing this paper was to try classification using communication graph features only, because, as they say in the introduction, that offers the possibility of classifying probe-resistant servers.
16:56:09 <dcf1> Then they discovered they were not able to make that work well, so they added a bunch of traditional probing features (which they were then obligated to cast in the form of a graph for compatibility with what they had already done).
16:56:48 <shelikhoo> yeah, I think this theory would explain the structure of the paper
16:57:08 <dcf1> Because their way of constructing the probing graph is really arbitrary. Every node represents a server, and edges between servers indicate similarity of open port patterns? I mean sure, but why not any of a dozen other things?
16:58:09 <dcf1> I am still quite interested in access relation–based classification, and will keep on the lookout for such. I listed a few other possibilities at https://github.com/net4people/bbs/issues/455#issuecomment-2683821673.
16:58:35 <onyinyang> it seemed suspicious to me that open port patterns was a good indicator of anything
16:58:50 <onyinyang> especially when some of the ones they listed were port 22, and ports 80 and 443 -__-
16:59:02 <dcf1> The "Web Proxy Detection based on Multiple Features Analysis" is the only one I've read, and it actually is pretty clear in the motivation, saying things like, a user that accesses one proxy is likely to access another proxy.
16:59:16 <dcf1> https://jcs.iie.ac.cn/xxaqxben/ch/reader/view_abstract.aspx?file_no=20180404
16:59:28 <shelikhoo> 80 443 22 is like any web server...
16:59:34 <dcf1> There is an English translation of this that I'm not sure is online anywhere yet.
16:59:34 <onyinyang> exactly
17:00:00 <meskio> they also look into hidden ports, which is a bit more of an indicator
17:00:26 <dcf1> There is some similarity there with the idea of, say, an obfs4 bridge being compromised because it is also a vanilla bridge. That is, a server offers multiple ports/protocols, which in itself makes it look like a VPN or proxy server.
17:01:25 <cohosh> yeah i think the access pattern technique could still work to identify proxies/bridges, i was wondering if combining it with the host-based threshold idea would make it stronger too
17:01:32 <dcf1> Yeah, the "stealth ports" part is good. I was a bit confused about how they represent it as a feature. As I understood it, the concrete feature is the *number* of stealth ports detected? Alongside other features like the *number* of "observed ports" and the *number* of scanned ports?
17:02:03 <dcf1> cohosh: I think you are absolutely right about the similarity with host-based classification of https://censorbib.nymity.ch/#Wails2024a
17:03:06 <dcf1> Let me gripe a little bit about the notation.
17:03:39 <dcf1> meskio, you said some parts were hard to understand, I think part of the cause is that the authors are trying to pull a trick, using a lot of $$ LaTeX in an effort to appear intimidating.
17:04:02 <onyinyang> lol
17:04:12 <dcf1> But it really falls apart in places like Equation 1 on page 3.
17:04:19 <meskio> XD
17:04:25 <shelikhoo> we are a little over time
17:04:48 <dcf1> Where they are using the ∑ operator to somehow sum *sets*?!? And then taking the union of these now *integers*?
17:04:57 <cohosh> lmao yes formula 1 just to say all servers
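(For reference, assuming the $S_i$ are the per-source sets of servers, what Equation 1 presumably intends, "all servers," is just a plain set union, with no summation involved:)

```latex
S = \bigcup_{i=1}^{n} S_i
```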
17:05:00 <dcf1> shelikhoo: I'll only gripe a moment longer.
17:05:07 <dcf1> cohosh: yes!!! infuriating
17:05:12 * meskio is ok with a bit of overtime
17:05:17 <cohosh> shelikhoo: i think it's okay to run over time, unless you need to go
17:05:18 <dcf1> And Equation 2 is supposed to be Jaccard similarity
17:05:19 <shelikhoo> yes, dcf1: you have lock now
17:05:45 <dcf1> But they got the || in the wrong places, so again they're taking intersections and unions of integers
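(The standard Jaccard similarity that Equation 2 is presumably trying to state puts the |·| around each set expression, so both numerator and denominator are cardinalities:)

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
```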
17:06:37 <dcf1> It really irritates me when authors use a pile of notation to try and look obscure, but what they're explaining isn't really all that complicated, and then they mess up basic details like writing ">=" instead of "≥".
17:07:00 <shelikhoo> cohosh: yes, I have no rush to leave...
17:07:21 <shelikhoo> I think they are trying to impress their boss to get more KPI
17:07:28 <meskio> maybe a way to cover when you don't have so much real meat for your paper
17:07:34 <dcf1> Rule of thumb: whenever you see a paper break out the \mathcal, it either means you are reading something difficult but quite real and meaningful, or you are reading a lot of fluff that probably doesn't mean much.
17:07:43 <dcf1> And this paper sure does love its \mathcal
17:07:51 <cohosh> i have a high level discussion question from this work
17:08:11 <cohosh> that's related to something i think roya asked me a while ago
17:08:38 <cohosh> which was whether bridges or proxies with more users could be subject to more scrutiny and blocking
17:08:56 <cohosh> and if that's the case, what kinds of defenses could we come up with against that
17:09:03 <cohosh> or how would we detect it
17:09:23 <cohosh> i don't necessarily trust the outcomes of this paper to say how feasible it is
17:09:24 <dcf1> My last point about notation, the equations in 3.4 look intimidating if you haven't seen them before, but it's not anything novel, they're just re-stating some standard neural network classification stuff. E.g., search for "tanh", "softmax", "ReLu" at https://en.wikipedia.org/wiki/Convolutional_neural_network.
17:09:31 <shelikhoo> I once read a paper where the author overrode math symbols with their own functions; as a result a reader needs to read other papers first to understand it
17:10:03 <dcf1> cohosh: yes, exactly, that's the source of my interest too.
17:10:27 <shelikhoo> cohosh: it is a lot more difficult to run experiments on such claims
17:10:38 <shelikhoo> since we would need a lot of clients
17:10:39 <cohosh> we do have some usage metrics from collector, and we have shell's vantage point data
17:10:52 <dcf1> (Oh yeah, and Equations 7 and 8, what a bad way to express what they're doing.)
17:11:05 <cohosh> i was thinking at some point to do some data analysis on the rotating bridge reachability data we have to see if there is a correlation between use and blocks
17:11:06 <shelikhoo> yes...
17:11:20 <dcf1> cohosh: I imagine it could also go the other direction; that is, 1:1, one user using one server, a sign of a personal VPN perhaps.
17:11:37 <cohosh> dcf1: that's true
17:11:47 <dcf1> Is rotating bridges something that happens with Tor Browser default bridges?
17:11:52 <cohosh> low use also means a trivial tradeoff for a censor to block it
17:12:21 <cohosh> dcf1: no, i think these might have been telegram bridges? or maybe bridges we handed out manually with the community team?
17:12:27 <cohosh> maybe others remember more
17:13:07 <dcf1> https://www.bamsoftware.com/proxy-probe/ is kind of similar, but we were measuring time intervals, not levels of use.
17:13:30 <cohosh> yes our vantage point scripts are based on that originally :)
17:13:32 <meskio> I have always assumed high use brings high blocks, because if people keep sharing a bridge a censor will hear about it at some point, more than the censor noticing the traffic
17:13:49 <dcf1> meskio: there are at least 2 possible causes of that though
17:14:09 <cohosh> meskio: yeah it's hard to track down what the cause would be
17:14:20 <dcf1> one is that the censor observes high uses through passive monitoring, this is the "access relation" type classification I am talking about
17:14:56 <dcf1> another is that, the more users there are, the higher the chance that one of them is a censor agent, this is the "bridge distribution"/Lox type classification
17:15:18 <meskio> yes, and I've been assuming the former is more common
17:15:21 <cohosh> lox does also limit the number of users per bridge in a way that our other distribution methods don't
17:15:57 <onyinyang> at the initial access point yes, but not in how many users can learn of it through invitations
17:16:03 <dcf1> Yeah. Which would mitigate the "censor agent" attack and have an unknown effect on the "access relation" attack (probably mitigates it tho).
17:16:10 <cohosh> oh true good point onyinyang
17:16:14 <onyinyang> though it takes longer to build up a user base through invitations
17:16:15 <meskio> I tend to believe that when I have a high-tech and a low-tech (social) explanation, the low-tech one is true
17:16:59 <meskio> but I know there are many examples where I'm wrong
17:17:17 <shelikhoo> there is another possibility: some users have a compromised device, like a Windows machine with an app that uploads the user's bridge line to the censor
17:17:49 <dcf1> shelikhoo: yes, that's right, in that case real users are acting as censor agents without knowing it
17:17:49 <shelikhoo> so they can be helping the censor to block proxies without opting in to such a thing
17:17:54 <shelikhoo> yes
17:18:30 <dcf1> Let me just write a few questions about this paper.
17:18:39 <shelikhoo> so from this point of view, a server with a lot of users does increase the risk, no matter how the increased risk actually works
17:18:54 <dcf1> I was not sure what to make of this strange sentence in the introduction: "Given our limited dataset, the result may not fully represent the actual capabilities of these detection engines, and our purpose is not to distinguish which is the best."
17:19:21 <cohosh> heh yeah that was interesting
17:19:31 <dcf1> It's not clear how they integrated "probing graph" features into the "offline" evaluation. The ISP gave them a list of IP addresses; did they then do their active probing after the fact? After what time delay?
17:19:55 <shelikhoo> and if a server has only one user, but generates a lot of traffic, that will be another pattern, since with a server one doesn't own, it is usually not appropriate to exchange a lot of traffic
17:20:04 <shelikhoo> considering how expensive the traffic is
17:20:21 <shelikhoo> like 0.3 USD per GB with many providers
17:20:39 <dcf1> The data set seems unbalanced. See Table 2 on page 6. The servers represented in the access logs are 6.6% Psiphon? That seems unreasonably high. Is this supposed to be a uniform sample, or is it something already filtered by the ISP?
17:21:29 <meskio> maybe a true usage percentage in their corner of china
17:21:39 <meskio> but for sure not representative of other parts of the world
17:21:44 <cohosh> oh wow, that's a hell of a base rate lol
17:21:48 <dcf1> There's a really ambiguous sentence in Section 3.3, describing what the nodes are that make up the communication graph. "Where L_i represents a node of client IP, server IP, and domain"
17:22:23 <dcf1> The only way I could make this make sense to me is if I change the "and" to an "or": ""Where L_i represents a node of client IP, server IP, or domain"; i.e., there are three types of nodes in the communication graph.
17:23:05 <dcf1> But even that doesn't quite work, because the domain names come from PTR records and TLS certificates; they should not stand on their own, they should be somehow attached to server nodes.
17:23:07 <meskio> their ethics review claims that they didn't have access to client IPs
17:23:09 <meskio> ...
17:23:27 <cohosh> they might have been replaced with identifiers
17:23:48 <meskio> maybe
17:23:58 <dcf1> meskio: https://github.com/chenxuStep/VPNChecker/blob/main/dataset/cipIndex_time_sip.csv, they have identifiers such as ClientIP1
17:24:15 <meskio> I see
17:24:49 <dcf1> Also https://github.com/chenxuStep/VPNChecker/tree/main/dataset, there are long lists of reasonable-looking server IP addresses, I don't know if they have been tweaked in some way
17:25:25 <shelikhoo> I also believe this kind of analysis (flow data analysis) will be less useful with CGNAT deployment
17:25:32 <dcf1> Also of note regarding the communication graph, they never actually say that an edge in the graph means that some communication was detected between two hosts (even though I can't imagine they would mean anything else).
17:25:57 <dcf1> They just say, circularly, "R(L_i) signifies the edges between node L_i and others"
17:26:35 <shelikhoo> an IP address no longer represents a single client, but a random set of clients assigned to a NAT gateway
17:26:37 <dcf1> I looked at the https://github.com/chenxuStep/VPNChecker code by the way. It only has some of the active probing code, none of the GraphSAGE aggregation or analysis code.
17:27:18 <dcf1> Oh yeah, I don't want to get into this point because it will take too long, but there is mention of NAT in Section 4.1 around "seed IPs".
17:27:36 <dcf1> To me, this sequence of sentences is a non-sequitur, though it seems to be trying to say something:
17:28:12 <dcf1> "Owing to IP-sharing access technologies, such as Network Address Translation (NAT), multiple endpoints might be behind one client IP."
17:28:15 <dcf1> "VPN software allows a portion of application traffic to be routed through the VPN tunnel, while it routes other traffic outside this tunnel."
17:28:18 <dcf1> "As a result, clients may access multiple VPN servers within a short time."
17:28:57 <dcf1> Like, it's hard to see how these three sentences are related, but it seems to be hinting at the idea of multiple clients behind one client IP address.
17:29:25 <dcf1> But if that's the case, it would be a reason *not* to use "seed" server IPs to discover more server IPs.
17:29:44 <dcf1> But, they don't even seem to use the "seed IPs" in that way at all! This part was thoroughly confusing to me.
17:30:14 <dcf1> In conclusion, I apologize for once again suggesting a problematic paper, but maybe we can learn a little from it.
17:30:31 <cohosh> it was an interesting discussion!
17:30:35 <meskio> it's been a nice conversation, so it was useful
17:30:45 <meskio> and I find funny how they call VPN providers "attackers"
17:30:53 <onyinyang> agree :0
17:31:00 <shelikhoo> personally I think the active probing part is very interesting, so many things we should avoid when designing a new protocol
17:31:00 <onyinyang> oops I meant :)
17:31:12 <shelikhoo> haha attackers....
17:31:23 <meskio> also how they say their motivation is Netflix blocking VPNs, but they only care about the client-to-VPN traffic, not the outgoing traffic from the VPN
17:32:33 <dcf1> meskio: the "attackers" thing was called out by one of the reviewers too.
17:33:06 <cohosh> solidarity means attack!
17:33:07 <dcf1> https://openreview.net/forum?id=7024czziih&noteId=3RLxwrJDQC "In Section 5, who are the "attackers"? This term first appears in this section and does not make sense."
17:33:26 <meskio> :D
17:33:31 <cohosh> maybe it's positive :P
17:33:33 <shelikhoo> do we really want them to say "you are the property of your ruler, who will decide what you can read or write; some external attackers want to violate the property rights of rulers"
17:34:09 <shelikhoo> it would make sense if we put it into context like that 233333
17:34:30 <dcf1> I have a collection of first-paragraph rationales from papers like this.
17:34:39 <dcf1> One even claimed the goal was to prevent ticket scalping...
17:35:34 <meskio> lol
17:36:34 <cohosh> lol
17:36:41 <shelikhoo> preventing editing wikipedia
17:37:29 <dcf1> Let's keep this "access relation" idea in the back of our mind, and maybe something good will come up.
17:37:52 <meskio> :)
17:37:54 <shelikhoo> yes...
17:38:05 <dcf1> I'll mention there is a connection to "zig-zag between bridges and users" from "10 ways to discover Tor bridges" https://research.torproject.org/techreports/ten-ways-discover-tor-bridges-2011-10-31.pdf#page=5
17:40:28 <meskio> I guess we are done with the reading group
17:40:31 <shelikhoo> I imagine this would only work if the IP:port combination is used for a single purpose
17:40:45 <shelikhoo> with httpupgrade/websocket based proxy
17:40:59 <shelikhoo> the same ip port is used for more than one purpose
17:41:22 <shelikhoo> so it is harder to say that someone connecting to a specific port is a user of the proxy service
17:41:31 <shelikhoo> and yes, we can call this a meeting
17:41:41 <shelikhoo> is there anything else we would like to discuss in this meeting?
17:42:13 <meskio> not from me
17:42:26 <shelikhoo> #endmeeting