Motivated Reasoning As Mis-applied Reinforcement Learning
# | #psychology, #mental-model, #productivity, #procrastination
Scott Alexander presents a model to answer the question:
why does the brain so often confuse what is true with what I want to be true?
Explanation:
So there are two types of brain region: basically behavioral (which hedonic reinforcement learning makes better), and basically epistemic (which hedonic reinforcement learning would make worse, so they don’t do it).
...
Motivated reasoning is the tendency for people to believe comfortable lies, like “my wife isn’t cheating on me” or “I’m totally right about politics, the only reason my program failed was that wreckers from the other party sabotaged it”. In this model, it’s got to be what happens when you try to run epistemics on partly-reinforceable architecture.
...
Maybe thinking about politics - like doing your taxes - is such a novel modality that the relevant brain networks get placed kind of randomly on a bunch of different architectures, and some of them are reinforceable and others aren’t. Or maybe evolution deliberately put some of this stuff on reinforceable architecture in order to keep people happy and conformist and politically savvy.
This is a neat explanation. But why is RL being applied only to short-term rewards? If we consider longer-term rewards, the examples Scott presents don't make as much sense. I will present an alternative view. It does not explain everything, but it doesn't have to; it's elegant, sorta.
It's all action -> reward -> reinforce up/down, depending on how the reward compares to expectation. The rewards are spread out in time. You could choose to ignore your taxes (positive reward now), then wait a bit until a massive anvil drops on your face (the action gets reinforced downward). The discount rate on the reward dictates the magnitude of that reinforcement: the action -> reward connection gets weaker the longer the delay between the two. If one can avoid the bitter 'truth' while continuing to live on, then RL places no constraint on the belief, and people will go on believing untrue things.
A lower discount rate would mean stronger action -> reward connections that last longer, and maybe you could argue that people with lower discount rates generally do better. Discount rates also vary within a single individual across areas of life: one might be a great strategist in professional life (low discounting) yet not be bothered about their failing marriage (high discounting).
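To make the discounting argument concrete, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the reward numbers, the 12-step delay before the anvil, the two discount rates, and the helper names `discounted_value` and `preferred_action`. It only shows how the same two actions rank differently once delayed rewards are shrunk by `1 / (1 + discount_rate)` per time step; it is not a model of what the brain actually computes.

```python
def discounted_value(rewards, discount_rate):
    """Present value of a reward stream, where rewards[t] arrives t steps from now."""
    gamma = 1.0 / (1.0 + discount_rate)  # per-step weight; a higher rate shrinks the future faster
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def preferred_action(actions, discount_rate):
    """Name of the action with the highest discounted value."""
    return max(actions, key=lambda name: discounted_value(actions[name], discount_rate))


# Hypothetical reward streams (made-up numbers, one entry per time step).
ACTIONS = {
    # Small relief now, then the anvil lands 12 steps later.
    "ignore_taxes": [1.0] + [0.0] * 11 + [-10.0],
    # Some discomfort now, nothing bad afterwards.
    "do_taxes": [-2.0] + [0.0] * 12,
}

for rate in (0.01, 0.5):
    values = {name: round(discounted_value(r, rate), 3) for name, r in ACTIONS.items()}
    print(f"discount rate {rate}: {values} -> prefers {preferred_action(ACTIONS, rate)}")
```

With these made-up numbers, the low discount rate (0.01) still feels most of the anvil and prefers doing the taxes, while the high rate (0.5) shrinks the anvil to almost nothing and prefers ignoring them.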
This RL is on the pain/happiness metric. One level above, evolution can be seen as RL on the survival metric: if a propensity for certain actions leads to more survival, it gets reinforced.
Phil:
Not sold on the "visual-cortex-is-not-a-reinforcement-learner" conclusion. If the objective is to maximize total reward (the reinforcement learning objective), then surely having your day ruined by spotting a tiger is better than ignoring the tiger and having your day much more ruined by being eaten by said tiger. (i.e.: visual cortex is "clever" and has incurred some small cost now in order to save you a big cost). Total reward is the same reason humans will do any activities with delayed payoffs.
Sounds like a plausible experiment would be something like: have somebody show you pictures of circles. You report whether it looks like a circle or a square (be honest!) Each time you say circle, the person gives you an electric shock. See whether, after long enough, you start genuinely seeing squares.
My guess is this never happens - do you guess the opposite?
You would stop saying you see a circle lol. If you got an electric shock for seeing red - say by placing an electrode in your red-perceiving brain region and shocking you each time activity is detected there - would the region stop activating over time? Yeah, maybe not. Pure RL certainly does not explain everything. Over some generations, though, no one doubts that what we see would change. It's a question of the timelines involved in brain architectures being molded by RL - on the scale of a single person's life they are essentially unchangeable though🤷.
I also have a similar mental model of two brain architectures, just like Scott:
- low discount rate - easier long term planning (epistemic)
- high discount rate - immediate discomfort overrides all (reinforcement/behavioural)