Newsletter
V1.2.6 - AI, AI-safety

The three main AI Safety stances


I have written in the past that we must prevent human extinction from superintelligence.

As stated in the declaration signed by the CEOs of DeepMind, OpenAI, and Anthropic, by Geoffrey Hinton and Yoshua Bengio, and by many others:

Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.

Given this, why aren't all these signatories, and everyone else who shares this concern, going ape-shit crazy about AI Safety? Shouldn't we pour 90% of our AI budget into preventing human extinction, start massive international treaties to stop the ongoing development of superintelligence precursors, and so on?

That's a fair question, and there's a big reason why people inside the AI Safety community don't do that.

(To be truly fair, the biggest reason is that everyone in the West has lost trust in institutions, including AI Safety people, who have therefore focused on personal power-seeking instead of global coordination. But that's a whole other story.)

So, back to the question. Why aren't they?
Because they expect alignment to be kind of easy, basically.

The Weak-to-Strong Alignment Thesis

Weak alignment is aligning LLM systems with benchmarks legible to humans.

Strong alignment is aligning superintelligence.

Many in the AI Safety community believe that weak alignment will naturally lead to strong alignment. Either because it will be the same work in both cases, or because we will use weak AIs to align more and more powerful AIs until they reach superintelligence.
Fundamentally, alignment is mostly seen as a problem that can be solved by engineering, rather than one that requires time for research breakthroughs.

Let's call this the Weak-to-Strong Alignment thesis.
This thesis underpins the strategy that many people in the AI Safety community are advocating for.

Consider the following:


I have found that people's positions on this thesis are closely connected to their strategies for AI Safety.
I do not mean that everyone arrived at their strategy honestly, deciding on it only after thinking hard about the implications of the Weak-to-Strong Alignment thesis.
For instance, when someone works at an organisation racing for AGI and has heard about extinction risks, it is very convenient for them to believe that alignment is just engineering, and that racing to AGI is necessary to solve it.

Regardless, for the purpose of improving the discourse on AI Safety, it is useful to assume good faith and describe the main stances on this thesis.

Stance A: The Centralised Stance

This stance agrees with the Weak-to-Strong Alignment thesis.

It is the most common point of view in AI Safety. It is shared by DeepMind, OpenAI, and Anthropic, by their employees, by many people working on AI safety and AI governance, and by intellectuals around those circles.

They all agree that we must work on aligning superintelligence, and that getting it wrong would lead to human extinction.

Fundamentally, the biggest point of contention with this stance is the Weak-to-Strong Alignment thesis itself.

For them, aligning superintelligence is an engineering problem: they roughly all believe that, given an AGI they can study and use, a careful team of ~100 people can solve it within a couple of years. This will happen organically if the team building AGI keeps working on weak alignment.

From their point of view, extinction risks from AI boil down to who builds AGI first, and how much effort they put into AI Safety.
Efforts in AI Safety are roughly fungible, which is reflected in concepts like the Alignment Tax.
As a result, their plans always boil down to the same thing: racing to AGI to get a head start over the others. That way, they may solve the alignment problem, with more affordances for doing safety work.

Ultimately, in this view, there is no need for ahead-of-time planning. The best you can do is get as much time as you can with AGI.
A pause does not really make sense, especially if it increases the risk of the tribal adversaries of the day (like Western Governments, Open Source, Meta, China, etc.) building AGI first.
In this view, the only thing that matters is getting AGI first and having a head start over other, less safety-minded actors.

This view justifies the current race to AGI and the current lack of AI safety regulation. Aside from accelerating, and accumulating power with people you trust and like, there is not much else that can be done.

Stance B: The Decentralised Stance

This stance also agrees with the Weak-to-Strong Alignment thesis, but more as a background assumption that is not explicitly stated.

This is a point of view that is quite common in Academia, the Open Source Software community, and the VC ecosystem.

They do not always agree that extinction risks from AI are real. The main point of contention with this stance is not necessarily alignment, but how powerful AI will be, and how quickly we will get there.

For them, alignment happens mostly naturally. Other technologies became safe over time, and AI is no exception.

Their main concerns lie with the distribution of power in AI. If only a few people have access to AGI, then they may get way too much power over the rest of us.

In that frame, to the extent that risks of human extinction are real, they manifest mainly through oppression, with the rest of humanity lacking access to AIs powerful enough to defend itself.

Ultimately, they believe that the best way to deal with this is to have AI spread as much as possible, so that everyone can have access to it.

Stance C: The Hard Stance

This stance disagrees with the Weak-to-Strong Alignment thesis.

In polls, people often express this stance. ([TODO]: Add link to various polls on the topic.) Beyond that, I am not sure there is a natural social group, like "Academia" or "Republicans", that espouses this stance.

Eliezer Yudkowsky tried to articulate it with his "list of lethalities", Nate Soares with his essay on the Sharp Left Turn, and a couple of pals and I with The Compendium.

It is espoused by organisations like ControlAI, MIRI, FLI, and PauseAI.

And to the surprise of no one... it's the stance I hold!

(If you talked to people in the AI Safety community, you'd find that many more of them actually hold this stance. But all too often they "strategically" align with the other stances out of cowardice, up to working directly in orgs whose strategies only make sense under the other stances. It is a common case of not being true to yourself in order to gain more clout.)

The Hard Stance basically amounts to the belief that building artificial gods is a terrible idea, as we are nowhere wise enough to do so without blowing up the world.

Progress on AI capabilities will quickly outpace progress on AI Safety, and at the current pace, we'll build misaligned superintelligence before we can do anything meaningful about it.

There are usually two main points of contention with this stance. The first one is about how quickly AI capabilities will grow, and the second one is about how hard it will be to align superintelligence.

In that frame, human extinction is the default outcome; we would need to be much better at various forms of science and global governance to avoid it.

Ultimately, the only way to mitigate extinction risks for now is to avoid building superintelligence precursors and to enact a global pause.

Conclusion

There is much more that can be said about these stances: how they came to be, how they relate to each other, their prescriptions on regulation, etc.

But I prefer to keep each article self-contained.

To summarise:

  • The Weak-to-Strong Alignment Thesis: Weak alignment will naturally lead to strong alignment.
  • Stance A: The Centralised Stance: Human extinction has a non-trivial chance of occurring. It depends on who builds AGI first and how much effort they put into AI Safety.
  • Stance B: The Decentralised Stance: Human extinction is a tail risk. It would only occur if access to AGI were concentrated and its developers were particularly stupid or malicious.
  • Stance C: The Hard Stance: At our current level of science and institutions, human extinction is the default outcome of building superintelligence. We must stop the progress towards it as soon as possible.

Cheers, and have a nice day!