34
Alignment is not free: How model upgrades can silence your confidence signals (variance.co)
a week ago | karinemellata | variance.co | best
4
We used sparse autoencoders to explain LLM moderation flags of violent threats (variance.co)
3 weeks ago | karinemellata | variance.co | newest