Hi! Here I post links to interesting things I find. Click the link below each post to go to the original.
Posts
-
Are OpenAI and Anthropic Really Losing Money on Inference?
Importance: 4 | # | llm
I'm only going to look at raw compute costs. This is obviously a complete oversimplification, but given how useful the current models are - even assuming no improvements - I want to stress test the idea that everyone is losing so much money on inference that it is completely unsustainable.
Martin makes some rough calculations - starting with DeepSeek-V3 as a baseline, taking the hourly rate of H100s into account, and assuming a bandwidth-bound process (3.35 TB/s HBM bandwidth) - and calculates the number of forward passes per second. There are two separate calculations, one for the prefill stage and one for decode, and here's the outcome:
The asymmetry is stark: $144 ÷ 46,800M = $0.003 per million input tokens versus $144 ÷ 46.7M = $3.08 per million output tokens. That's a thousand-fold difference!
Of course the prefill stage can become compute bound at long context lengths.
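The arithmetic above is easy to reproduce. A minimal sketch, using only the numbers quoted (the $144/hr instance price and the estimated throughputs of 46,800M input tokens/hr for prefill and 46.7M output tokens/hr for decode; nothing here is measured data):

```python
# Back-of-envelope inference cost per million tokens, using the quoted
# estimates. All figures are illustrative assumptions, not measurements.

INSTANCE_COST_PER_HR = 144.0       # USD/hr for the assumed H100 node
PREFILL_TOK_PER_HR = 46_800e6      # input tokens processed per hour
DECODE_TOK_PER_HR = 46.7e6         # output tokens generated per hour

def cost_per_million(tokens_per_hr: float) -> float:
    """Dollars per one million tokens at the given hourly throughput."""
    return INSTANCE_COST_PER_HR / (tokens_per_hr / 1e6)

print(f"input:  ${cost_per_million(PREFILL_TOK_PER_HR):.4f}/M tok")  # ~$0.0031
print(f"output: ${cost_per_million(DECODE_TOK_PER_HR):.2f}/M tok")   # $3.08
print(f"ratio:  {cost_per_million(DECODE_TOK_PER_HR) / cost_per_million(PREFILL_TOK_PER_HR):.0f}x")
```

The ~1000x input/output gap falls straight out of the throughput ratio, which is why output-token pricing dominates the economics.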
These costs map to what DeepInfra charges for R1 hosting, except that DeepInfra applies a much higher markup on input tokens.
By these numbers, there is a good chance that even the Claude Max plan is running at a 10x markup for heavy claude-code users - costing Anthropic $20 while charging $200.
One interesting point here is how Anthropic and OpenAI differ in their approach to models. Judging by raw token speed, my guess would be that Claude models are larger but token-efficient (Opus 4.1 barely thinks and is competitive with, say, GPT-5-thinking-high), whereas GPT-5 without thinking (minimal) is a pretty dumb model, and for high-quality outputs it thinks forever. If each output token is so costly, and OpenAI was supposedly optimizing for inference cost - wouldn't they have gone for a model like Claude? One explanation is that these estimates assume an MoE model with roughly 5% active parameters; it could be that Claude either has a much higher fraction of active params or is not MoE at all. It will be interesting to see what approach people take moving forward - Anthropic doesn't need to optimize for inference as much as OpenAI, having no free users.
-
In Search Of AI Psychosis
Importance: 5 | # | acx, ai-psychosis
AI psychosis (NYT, PsychologyToday) is an apparent phenomenon where people go crazy after talking to chatbots too much. There are some high-profile anecdotes, but still many unanswered questions. For example, how common is it really? Are the chatbots really driving people crazy, or just catching the attention of people who were crazy already? Isn't psychosis supposed to be a biological disease? Wouldn't that make chatbot-induced psychosis the same kind of category error as chatbot-induced diabetes?
I don't have all the answers, so think of this post as an exploration of possible analogies and precedents rather than a strongly-held thesis. Also, I might have one answer - I think the yearly incidence of AI psychosis is somewhere around 1 in 10,000 (for a loose definition) to 1 in 100,000 (for a strict definition). I'll talk about how I got those numbers at the end.
Scott conducts a survey (4,156 responses; respondents are asked to estimate the psychosis rate among their friends and family) and estimates a loose AI-psychosis rate of 1 in 10,000 - all people who get pushed into psychosis - and a strict rate of 1 in 100,000 - people who showed no signs or risk factors of psychosis before AI use and still get pushed into psychosis.
-
Google will require developer verification to install Android apps, including sideloading
Importance: 7 | # | google
To combat malware and financial scams, Google announced today that only apps from developers that have undergone verification can be installed on certified Android devices starting in 2026.
-
How to think in writing
Importance: 6 | # | henrik-karlsson, writing, thinking
When I sit down to write, the meadow is still engulfed in darkness under a sky where satellites pass by, one after the other. My thoughts are flighty and shapeless, like dreams; they morph as I approach them. But when I open my notebook and write, it is as if I pin my thoughts to the table. I can examine them. ...
Since the goal is to find flaws in our guesses (so that we can change our minds, refine our mental models and our language, and be more right), stretching a claim thin through an explanation is progress. Even if the explanation is wrong.
You are interested only in proofs which "prove" what they have set out to prove. I am interested in proofs even if they do not accomplish their intended task. Columbus did not reach India but he discovered something interesting. —Lakatos
Super useful and well written as always. Henrik provides a couple of concrete techniques to write for thinking better:
- go as far as possible with an idea/hypothesis/conjecture you have and make a strong and concrete case for it
- concrete is important because the core value of writing comes from pinning down thoughts - handwaving possibilities should be minimized
- you are not looking to be right, you are looking to learn and refine your ideas
- the mere act of having to concretely write down the idea and put thoughts into words takes a lot of effort and is fruitful by itself, and you have concrete words etched in ink or silicon to hold yourself accountable
- find counterexamples
I make extensive use of the counterexamples technique. I try to lay down a hypothesis and then think of counterevidence. It kinda comes naturally after having scrolled through Hacker News comments for so long now.
Writing down concrete ideas while trying to reduce ambiguity is not something I do consciously. I am usually trying to accurately represent my state of mind, and as the state evolves, so does the ink. I do spread ideas thin - but it's handwavy a lot of the time. Henrik provides plenty of examples where one escapes one's own words by handwaving away what one actually meant. I observe this in myself as well. Making a conscious effort to minimize this should bode well.
It is useful to take a hypothesis and make as strong a case for it as possible, write it down, and leave little room for ambiguity. Then find counterevidence, refine, repeat. This is usually how anyone hardens ideas. Very Bayesian. It is also useful to do this when you think the idea is wrong or only right in 80% of cases. Wrong ideas are of course worth writing and talking about.
Gwern has a confidence tag to indicate how likely he thinks the ideas in a post are to be true. There's also a LessWrong post/comment about how simple ideas that explain say 80% of observations and cannot account for the rest are super useful - I can't find it right now.
-
P(doom) vs P(permanent underclass)
Importance: 4 | # | zvi, agi, post-agi
zvi:
Jordi Hays: I'm updating my timelines. You now have at least 4 years to escape the permanent underclass.
Luke Metro: This is the best news that founding engineers have received in years.
Nabeel Qureshi: The 'vibe shift' on here is everyone realizing they will still have jobs in 2030.
(Those jobs will look quite different, to be clear...)
It's a funny marker of OpenAI's extreme success that they released what is likely going to be most people's daily driver AI model across both chat and coding, and people are still disappointed.
Part of the issue is that the leaps in the last two years were absolutely massive (gpt4 to o3 in particular) and it's going to take time to work out the consequences of that. People were bound to be disappointed eventually.
Zvi: On the question of economic prospects if and when They Took Our Jobs and how much to worry about this, I remind everyone that my position is unchanged: I do not think one should worry much about being in a "permanent underclass" or anything like that, as this requires a highly narrow set of things to happen - the AI is good enough to take the jobs, and the humans stay in charge and alive, but those humans do you dirty - and even if it did happen the resulting underclass probably does damn well compared to today.
You should worry more about not surviving or humanity not remaining in control, or your place in the social and economic order if transformational AI does not arrive soon, and less about your place relative to other humans in positive post-AI worlds.
So how likely is the scenario where (most people, including you, lose their jobs to AI) AND (powerful people decide to fuck over everyone else) AND (AI good enough to do all human labour does not kill us all)? We can ignore the issue of the underclass having a way better quality of life than today - that should come included in the second event.
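Zvi's argument is a conjunction: the underclass scenario needs every leg to come true at once. A toy sketch with made-up placeholder probabilities (not Zvi's numbers, nor mine) shows how fast the product shrinks:

```python
# The "permanent underclass" outcome requires ALL of these at once, so
# (treating the legs as independent, itself a simplification) its
# probability is their product. Every number below is a placeholder.

p_jobs_automated   = 0.5   # AI good enough to take most jobs
p_humans_in_charge = 0.6   # ...and humanity survives and stays in control
p_elites_defect    = 0.2   # ...and those in charge lock everyone else out

p_underclass = p_jobs_automated * p_humans_in_charge * p_elites_defect
print(f"P(permanent underclass) ~ {p_underclass:.2f}")  # prints 0.06
```

Even generous per-leg probabilities leave the conjunction well below any single leg, which is the shape of Zvi's point.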
-
Claude Sonnet 4 now supports 1M tokens of context
Importance: 4 | # | anthropic, claude
Claude Sonnet 4 now supports up to 1 million tokens of context on the Anthropic API—a 5x increase that lets you process entire codebases with over 75,000 lines of code or dozens of research papers in a single request.
Long context support for Sonnet 4 is now in public beta on the Anthropic API and in Amazon Bedrock, with Google Cloud's Vertex AI coming soon.
Anthropic has kinda cucked themselves here with their Pro and Max plans. The claude-code users will bleed them dry if 1M context is brought to claude-code. It wouldn't be surprising if they had this ready for a while but couldn't figure out how to manage costs, and now it's a bit too late and they are weirdly releasing it for the API only and not for subscribers.
Opus is a good model. It isn't a reasoning model in the way OpenAI's or Google's are. Opus rarely thinks long and is super token-efficient in general. Test-time scaling is used least with Opus compared to other frontier models. It's crazy that this almost-not-a-reasoning-model is one of the very best right now.
Fitting everything into the 200k context while being super useful requires high token efficiency - especially for code. Given that this is the first time we are seeing 1M context from Anthropic, it wouldn't be surprising if they are working on reducing inference cost by decreasing model size and increasing test-time compute.
I am also curious about how these 1M models are trained. We have had Sonnet 4 for a while now, and they presumably took the 200k-context model and continued training it on longer sequences? But then this would definitely affect the model's behaviour even at smaller context lengths. It would make much more sense to leave the current model as is and release this new model under a different number - 4.2.
-
Qwen3-4B-Thinking: "This is art—pelicans don't ride bikes!"
Importance: 4 | # | simonw, qwen, slm
These are relatively tiny models that punch way above their weight.
I used the Instruct model to summarize this Hacker News conversation about GPT-5.
The good news is Qwen spat out a genuinely useful summary of the conversation! You can read that here—it's the best I've seen yet from a model running on my laptop, though honestly I've not tried many other recent models in this way.
The bad news... it took almost five minutes to process and return the result!
They're fun, they have personality and I'm confident there are classes of useful problems they will prove capable at despite their small size. Their ability at summarization should make them a good fit for local RAG, and I've not started exploring their tool calling abilities yet.
I tried the model locally (Ollama) and on OpenRouter. A 4B model that runs on a phone with some (unreliable) intelligence is insane. If it can do even one thing well - like reliably summarizing - that would be huge. In practice, when you want to put intermediate LLMs between the user request and the response, latency becomes super crucial - and although providers like Groq and Cerebras can push 1000+ tok/s, time-to-first-token becomes the bottleneck. A model small enough to easily host on a server you control could solve this to an extent.
-
Ozempic Shows Anti-Aging Effects in First Clinical Trial, Reversing Biological Age by 3.1 Years
Importance: 6 | # | aging, ozempic
A randomized controlled trial of 108 people with HIV-associated lipohypertrophy found that weekly Ozempic treatment for 32 weeks reversed biological age by an average of 3.1 years.
The study used epigenetic clocks to measure biological aging, showing the most pronounced anti-aging effects in the inflammatory system and brain, where aging was delayed by almost 5 years.
TIL about epigenetic clocks. I asked Claude 4.1 Opus about it:
DNA methylation involves adding small chemical groups called methyl groups (CH₃) to specific locations on your DNA, particularly at sites called CpG sites (where a cytosine nucleotide is followed by a guanine).
As we age, our methylation patterns change in predictable ways:
- Hypermethylation: Some regions gain methyl groups over time, often silencing genes that were previously active
- Hypomethylation: Other regions lose methyl groups, potentially activating genes that should remain silent
- These changes accumulate like marks on a biological scorecard
Training the Clock: Scientists analyzed methylation patterns at hundreds or thousands of CpG sites across the genome in people of known ages. Using machine learning, they identified which specific sites show the most consistent age-related changes.
Patterns are identified and algorithms are developed to estimate age. This estimated age correlates with health outcomes, mortality risk, and age-related diseases better than chronological age alone. Also, this is a tissue-specific measurement: the Ozempic trial accordingly found anti-aging effects in some organs but not in others.
Until now I had only known about telomere length as a measure of biological age, and Opus says it has "high variability, poor correlation with age, easily influenced by single factors".
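To make the clock idea concrete: a trained epigenetic clock is essentially a sparse linear model over methylation fractions (values in [0, 1]) at selected CpG sites. The site names, weights, and intercept below are invented for illustration; real clocks such as Horvath's use hundreds of CpG sites fit with penalized regression on samples of known chronological age:

```python
# Hypothetical epigenetic clock: predicted age = intercept + weighted sum
# of methylation fractions at the clock's CpG sites. All site names and
# weights here are made up for illustration.

INTERCEPT = 20.0
WEIGHTS = {             # CpG site -> weight in "years"
    "cg0001": 35.0,     # hypermethylates with age
    "cg0002": -15.0,    # hypomethylates with age
    "cg0003": 25.0,
}

def predict_biological_age(methylation: dict) -> float:
    """Score one sample: linear combination over the clock's sites."""
    return INTERCEPT + sum(w * methylation[site] for site, w in WEIGHTS.items())

sample = {"cg0001": 0.62, "cg0002": 0.40, "cg0003": 0.55}
print(f"{predict_biological_age(sample):.2f} years")  # prints 49.45 years
```

Tissue specificity then falls out naturally: fit the same kind of model on methylation data from a particular tissue and you get an organ-specific clock, which is how the trial could report different effects for different organs.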
-
Practically-A-Book Review: Byrnes on Trance
We don't directly perceive the external world. Every philosopher has their own way of saying exactly what it is we do perceive, but the predictive processing interpretation is that we perceive our models of the world. To be very naive and hand-wavey, lower-level brain centers get sense-data, make a guess about what produced that sense data, then "show" "us" that guess. If the guess is wrong, too bad - we see the incorrect guess, not the reality.
-
Anthropic tightens usage limits for Claude Code — without telling users
Importance: 2 | # | claude, claude-code
Since Monday morning, Claude Code users have been hit with unexpectedly restrictive usage limits. The problems, many of which have been aired on Claude Code's GitHub page, seem to be concentrated among heavy users of the service, many of whom are on the $200-a-month Max plan.
All you gotta do is inform the users. Anthropic has a lil-Nvidia vibe, since other frontier labs offer little to no real competition on coding capabilities. Cursor and claude-code interfaces are different enough that switching back and forth isn't frictionless.
More here.
Say hello.