Follow-up to: Law of No Evidence
Recently, there was some debate about a few Twitter polls, which led into a dispute over the usefulness of Twitter polls in general and how to deal with biased and potentially misleading evidence.
Agnes Callard is explicitly asking the same question I asked, which is the opposite of ignoring sample bias: What is accounting for the difference?
Sample selection is definitely one of the explanations here. One can also point to several other key differences.
My poll asks about you; Patrick’s asks about how others seem.
My poll asks about struggle; Patrick’s asks about stability.
My poll asks about a year rather than a point in time, a potential flaw.
My poll asks about now; Patrick’s asks about the period since pandemic onset.
None of this is well-controlled or ‘scientific’ in the Science™ sense. No one is saying any of this is conclusive or precise.
What is ‘bad’ evidence if it isn’t weak evidence? Adam’s theory here is that it is misleading evidence. That makes sense as a potential distinction. Under this model:
Weak evidence induces a small Bayesian update in the correct direction.
Bad evidence can induce an update in the wrong direction.
Usually, people with such taxonomies will also think that strong evidence by default trumps weak evidence, allowing you to entirely ignore the weak evidence. That is not how that works. Either something has a likelihood ratio, or it doesn’t.
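The likelihood-ratio framing can be made concrete. A minimal sketch (the numbers are illustrative, not taken from any of the polls discussed): updating works on odds, and every piece of evidence multiplies in via its likelihood ratio, so strong evidence never licenses dropping the weak evidence from the product.

```python
def bayes_update(prior: float, likelihood_ratio: float) -> float:
    """Update a probability via odds: posterior odds = prior odds * LR."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Weak evidence: LR slightly above 1, a small update in the right direction.
print(bayes_update(0.50, 1.2))   # ~0.545

# Strong evidence (LR = 9) does not let you ignore the weak evidence;
# both likelihood ratios multiply into the same odds.
print(bayes_update(bayes_update(0.50, 9.0), 1.2))  # ~0.915
```

An LR of exactly 1 is the only case where evidence moves you nowhere; anything else moves you, however slightly.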
The question is, what to do about the danger that someone might misinterpret the data and update ‘wrong’?
I love that the account is called ‘Deconstruction Guide.’ Thanks, kind sir.
Whether or not this ‘depends on the poll’ depends on what level of technically correct we are on, and one can go back and forth on that several times. The fully correct answer is: Yes, some info. You always know that the person chose to make the poll, and how many people chose to respond given the level of exposure, and the responses always tell you something, even if the choices were ‘Grune’ and ‘Mlue,’ ‘Yes’ and ‘Absolutely,’ or ‘Maybe’ and ‘Maybe Not.’
Remember that if any other result would have told you something, then this result also tells you something, because it means the result that would have told you something did not happen. That doesn’t mean it helps you with any particular question.
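This is Conservation of Expected Evidence: if some outcomes would move you one way, the other outcomes must move you the other way, so that your prior equals the expectation of your posterior over possible results. A toy check with made-up likelihoods:

```python
# Hypothetical numbers: prior P(H), two possible poll outcomes A and B.
prior_h = 0.4
p_a_given_h, p_a_given_not_h = 0.7, 0.3   # illustrative likelihoods

p_a = p_a_given_h * prior_h + p_a_given_not_h * (1 - prior_h)
p_b = 1 - p_a

# Posteriors after each outcome (Bayes' rule).
post_h_given_a = p_a_given_h * prior_h / p_a
post_h_given_b = (1 - p_a_given_h) * prior_h / p_b

# The expected posterior equals the prior: you cannot expect to update.
expected_posterior = post_h_given_a * p_a + post_h_given_b * p_b
print(round(expected_posterior, 10))  # 0.4
```

Seeing outcome A moves you up and seeing B moves you down; you cannot know in advance which way you will move, only that the probability-weighted average is zero movement.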
Anyway, back to main thread.
Getting into a Socratic dialog with a Socratic philosopher, and letting them play the role of Socrates. Classic blunder.
I certainly want to know the extent to which the world is full of lunatics.
Adam Gurri’s new claim has now narrowed to something more reasonable, that citing a Twitter poll as representative even of some subgroup marks you as foolish.
We can agree that taking a Twitter poll, failing to adjust for sample bias, and drawing conclusions is foolish. Saying the poll reflects some subgroup similar to the group polled still requires dealing with response bias and all that, but mostly seems fine. Adjusting for the nature of your sample should render the whole thing fine in any case.
You can also find good information in a Twitter poll by comparing its results to another Twitter poll using the same account (and same retweets, ideally). The difference between the two is meaningful. This can be a difference between questions or wordings, or a difference over time, or something else.
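The compare-two-polls move can be made quantitative. Because the audience is held roughly constant, the sample bias largely cancels and the difference is the signal. A sketch of a simple two-proportion comparison (the poll counts here are hypothetical):

```python
import math

def diff_with_se(yes1, n1, yes2, n2):
    """Difference between two poll proportions and its standard error."""
    p1, p2 = yes1 / n1, yes2 / n2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return p1 - p2, se

# Hypothetical: same account, same audience, two different wordings.
diff, se = diff_with_se(220, 400, 180, 400)
print(f"diff = {diff:.3f} +/- {1.96 * se:.3f}")  # ~0.100 +/- 0.069
```

If the difference clears the noise band, the wording (or the passage of time, or whatever else varied) did something real to this audience, whatever that audience is.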
Rules of Evidence
Aristotle is indeed wise. He points to the important distinction between evidence, as in Bayesian evidence or a reason one might change one’s mind or one’s probabilities, and the rules of evidence in a given format of debate or discourse. In a court of law, some forms of Bayesian evidence are considered irrelevant or, even more extremely, prejudicial, exactly because they should cause one to update their probabilities and the law wants the jury not to do that.
Which is sometimes the right thing to do. Still, you have to admit it is kind of weird.
I think a lot of the reason it is so often right to do it is because we use very strange standards of evidence and burdens of proof in other places, forcing corrections. And also of course juries are random people so they have a lot of biases and we worry about overadjustments. Then there are the cases where we think the jury would reach exactly the right conclusion, but we think that’s bad, actually.
Anyway.
In the formal rules for public discourse, how should we consider Twitter polls?
A Twitter poll without proper context should be fully inadmissible here.
What about with the proper context? That gets trickier.
I consider what I do on my blog a form of public discourse, and I notice that in whatever thing that it is I am doing in most posts, a Twitter poll with context is obviously admissible. That is because ‘the thing I am doing’ is attempting to reason in public and establish a model of the world, how it works and what it is going to do. I am not trying to persuade anyone as such.
That’s a different department.
We should strive to minimize our visits to that department, whenever possible.
Exactly. Keep your evidential requirements as low as possible. But no lower.
I do occasionally, and likely will more often in the future, visit the other department. In those situations, I am more careful about using such evidence. I know it is by its nature unpersuasive to most, a point of vulnerability, and something that requires a certain level of epistemic trust. Thus, in these situations, I try even harder than usual to rely on it and other similar facts at most for loose bounds and non-binding intuitions - by default, it’s not admissible.
Crux One
And now, at least I hope, a crux.
Yes, exactly. Everything is evidence. You should update on almost anything. That is indeed how probability and knowledge work.
To state the obvious, if evidence does not cause one to be more likely to be led to the correct conclusion, you are doing evidence wrong, bro do you even Bayes?
My first response would be to attempt to fix it. If I couldn’t, then yes, I would consider not seeking out, or even actively avoiding, such information.
The tricky case is when you are being shown evidence that is selected to attempt to change your mind. Which is the basis of most ‘public discourse,’ especially that which is going to engage with someone (in any direction) with a publication called Liberal Currents. In such situations, you need to ask what actual evidence you are getting when you are given evidence. Often this is mainly comparing the quality and strength of the evidence you got to the quality and strength you would expect. If the evidence is weaker than you expected, you should update in the opposite direction on the information that this was the best this source could do.
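One way to cash out ‘weaker than expected argues against the claim’: treat the strength of the best evidence a motivated source presents as itself an observation, with its own likelihood ratio. A toy sketch with invented numbers:

```python
def update_on_presented_evidence(prior, p_weak_if_true, p_weak_if_false):
    """If the claim were true, a motivated source would likely have strong
    evidence; observing only weak evidence is itself Bayesian evidence."""
    odds = prior / (1 - prior)
    odds *= p_weak_if_true / p_weak_if_false
    return odds / (1 + odds)

# Hypothetical: if the claim were true, only a 20% chance the best they can
# show is this weak; if false, a 70% chance. The net update is downward,
# even though the evidence nominally "supports" the claim.
print(update_on_presented_evidence(0.5, 0.2, 0.7))  # ~0.222
```

The two likelihoods are the whole game: they encode your model of how hard the source tried and what they would have found in each world.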
I do not understand the claim that ‘we have statistics’ on the Twitter poll question. Is Adam suggesting someone ran a Proper Scientific Study on people’s updates from looking at Twitter polls? Which seems very hard to do usefully, and I assume is not it. Instead, I am assuming he means ‘we have statistical tools for evaluating samples and they say that your samples are worthless.’
I think this claim is simply doing statistics wrong. The samples are quite big enough. All you have to do is understand the nature of the samples. Or, use the poll to get insight into the sample. Which, then, you can, among other things, poll again later.
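‘The samples are quite big enough’ is checkable: for a simple proportion, sampling noise shrinks with the square root of n, and is already small at a few hundred responses. A sketch:

```python
import math

def max_margin_of_error(n, z=1.96):
    """Worst-case (p = 0.5) 95% margin of error for a poll of size n."""
    return z * math.sqrt(0.25 / n)

for n in (300, 1000, 3000):
    print(n, round(max_margin_of_error(n), 3))
# 300 -> ~0.057, 1000 -> ~0.031, 3000 -> ~0.018
```

At these sizes the binding constraint is the nature of the sample, not its size, which is exactly the point: the statistical tools flag bias, not insufficient n.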
Whenever I read a scientific paper, there is about a 50/50 chance I conclude that they have buried the lede, often having missed it entirely, even if I also agree with their main claim. They do not realize what they have learned. They do the equivalent of concluding that the key thing in life is herring sandwiches, instead of realizing it is boredom.
Instead of looking for something specific, look for anything at all. Much better odds.
Crux Two
Thus:
Tiago nails it. Knowing that different samples and differently worded questions and answers explain the difference in results is better than not knowing that. One should not mistake that knowledge either for Deep Wisdom, or for the main thing available to be learned. It can become a way to avoid learning what there is to learn, by stopping once you have figured out which differences did it. There is a surprising result. It has a cause, and the details of that cause are often going to be interesting. Using ‘there is a cause one could find’ as a semantic stop sign will not help you.
Indeed, I realized I could Do Science to the situation. Was it primarily the different samples, or was it primarily the different wording? There’s a way to find out!
I grabbed the results here because someone new retweeted the poll, potentially corrupting the comparison after that, and any sample >300 is fine here. Here is the larger sample, which converged some towards Patrick’s results.
That is exactly Patrick’s wording. Does it match Patrick’s poll?
Mostly it does. The difference is that my sample includes more ‘about the same’ and less at the extremes, which likely reflects cultural differences in what counts as about the same. I’m also guessing my audience has a lower-than-usual Lizardman Constant, and that together these explain the whole difference.
Thus, we have learned that, at least in this context, no, the samples are very similar. Mostly the difference is the wordings. If Patrick were to do my exact poll For Science, I expect him to get roughly my result with a bit more noise.
The next step, if one wanted to continue learning, would be to change individual components and see if anything more changed - e.g. do Patrick’s wording with respect to yourself only.
Does this represent people having a more optimistic view of themselves than they do of others? Or is this people correctly doing aggregation, since 10% of people becoming less stable makes people overall less stable and larger groups have less variance? My presumption is this is a mix.
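The aggregation point can be simulated: if 10% of individuals become less stable and the rest are unchanged, most individuals honestly report ‘about the same’ for themselves, while an honest observer of the group correctly reports the group as less stable. A toy sketch with invented parameters:

```python
import random

random.seed(0)
N = 10_000
# Hypothetical: 10% of people become less stable (-1), 90% unchanged (0).
people = [-1 if random.random() < 0.10 else 0 for _ in range(N)]

self_reports_same = sum(1 for p in people if p == 0) / N
group_average_change = sum(people) / N

print(f"'about the same' self-reports: {self_reports_same:.0%}")
print(f"group-level average change: {group_average_change:+.3f}")
# ~90% of individuals report "same", yet the group is less stable on average.
```

Both reports are accurate; the self question and the others question are simply measuring different statistics of the same distribution.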
This still leaves stability down, versus the original finding that struggling was also down. The two are logically compatible but jointly implausible on their own, so there is more here to explain. One could continue. For now, I will stop there.
Conclusions
The original thread finished up with Agnes using the Robin Hanson signal to attempt to put a bet together, which did not work, as there was nothing close to a meeting of the minds on what was in dispute. Adam’s final position seemed to be that as long as Twitter polls did not match national polls as accurately as other national polls matched each other, they were useless. It was unclear whether one would be allowed to correct for bias before checking. That seems important, given that most national polls are doing various bias-correcting things under the hood.
Adam’s whole position here, to me, is rather silly, even if we limit ourselves to use cases where the Twitter poll is being used only to try and extrapolate towards national sentiment. Of course when we are trying to measure the output of process X we will get a less accurate measure by using process Y than by repeating process X. That is true even if X is not doing as good a job as Y of measuring underlying value V. We still might gain insight into V. We especially might gain insight into V if X costs hundreds or thousands of dollars per use while Y falls under the slogan ‘this website is free.’
The principle mirrors the question about to what extent Proper Scientific Studies are the only form of evidence, making it legitimate to say No Evidence of X whenever there is no Proper Scientific Study claiming X, no matter what your lying eyes think or how many times your lying ears hear “Look! It’s an X!”
Takeaways
All evidence is evidence. All evidence is net useful if well-handled.
Those who deny this are likely epistemically hostile and/or operate in a highly hostile epistemic environment. Treat accordingly.
Do your best to stay out of such places and discussions, when you can.
Biased or misleading evidence is evidence, often of many things.
One must preserve Conservation of Expected Evidence.
Mostly compare information from hostile or biased sources to expectations.
See what is there to be learned, being curious and exploring.
Look for comparisons that let you control for bias. Often quite straightforward.
Never get into a Socratic dialog where a Socratic philosopher gets to ask the questions when death is on the line. Or you want to ‘win.’ Otherwise, sure.
Twitter polls are neat and chances are you are not doing enough of them.
I really want a shirt that says "Bro do you even Bayes?"
I notice a few things:
1: You seem to assume that you can accurately tell when survey evidence is misleading. Perhaps that falls under the heading of "well-handled," but identifying whether survey evidence is misleading seems terribly difficult. Sure, if you can tell it is misleading by verifying against evidence you are more certain of, that tells you something (a la #4 in your list), but how do you get there? If information has a very high possibility of being misleading, but you can't be sure, probably the answer is to not update on it, no? (I think that is Adam's point, but I wouldn't want to presume.)
2: I think your discussion under-weights the selection effects of Twitter. If your questions are along the lines of "what do my followers and the followers of their retweets think," then you are maybe fine, although you still have all the caveats of your sample being people who respond to surveys, people who check Twitter enough to notice the poll and respond before it closes, etc. If your questions are "what do humans in general in my country think," nope, right out. Even things like "what do humans who care about this topic and are on Twitter think?" won't be well sampled, because the poll still has to reach them for them to answer. Beyond the problems of adjusting for sample bias (like having 75% women, or whatever, which you can't adjust for because you don't have all the demographics), you still run into the "survey at a shopping mall" effect: the people even exposed to the possibility of taking your survey are a fraction of the group you actually care about, but your brain wants to lie to you about that and overgeneralize.
3: That brings up the next problem: you are always trying to persuade someone with evidence, and that someone is yourself. Evidence that could be misleading, but that you aren't sure about (the non-transparent non-transparency, as David Levy puts it), is the most dangerous kind. A transparent non-transparency (a known lie) you can work around, but if you are not sure it is a lie, the information is extra dangerous, because you will always want to interpret it in a way that supports what you want it to support. In that case it might be best to assign it a very, very small update, possibly zero, or even slightly negative just to be sure. After all, if the evidence might be misleading, and you don't know exactly how likely that is, there is a good possibility that the proper update is in the opposite direction of the naive evidence/brain interaction.
Just some thoughts on that. I dabbled in survey design and that kind of work in my graduate student years, and... whew... it wasn't a field I wanted to get much involved in afterwards. The difficulty of getting information that doesn't actively mislead is surpassed only by the difficulty of telling whether it is misleading, and whether it misleads about magnitude, direction, or both.