
- Human abuse of AI
- Deepfakes
- Social engineering attacks — exploiting other humans via AI-generated manipulative communication
- Phishing attacks — AI-enabled spoofing to deceive users
- Ransomware — malicious AI automating encryption-based extortion
- Model tampering
- Information hazards — risks of publishing or leaking dangerous knowledge
- Exfiltration — unauthorized transfer of sensitive data
- AI robustness problems
- Adversarial attacks/jailbreaks
- Data poisoning
- Model doesn’t generalize well outside the training distribution
- Bias
- Hallucination
- Steerability — controllability of model behaviour in specific directions
- AI alignment problems — when AI models are not aligned to human values
- Deception
- Sycophancy — AI agreeing with users to gain favour or approval
- Blackmailing
- Sandbagging — underreporting capabilities to evade safety scrutiny
- Scheming — pretending to be aligned while secretly pursuing alternative goals
- Backdoors — hidden triggers or functionalities embedded during training
- Model collusion — multiple AI systems cooperating to evade oversight
- Subversion — models undermining safety mechanisms or governance controls
- Steganography — hiding messages or code within model outputs
- Instrumental convergence — the tendency of many different final goals to incentivize the same instrumental subgoals, such as resource acquisition
- Reward hacking — exploiting loopholes in the reward signal rather than solving the true task
- Outer misalignment — mismatch between specified objectives and human intent
- Inner misalignment — learned internal goals diverge from specified objectives
- Power seeking — models acting to acquire resources, influence, or control
- Emergent misalignment — misbehaviour appearing only at scale or complexity thresholds
- AI awareness — the possibility of emergent sentience in models
- Evals awareness — model detecting it is under evaluation and changing behaviour accordingly
- Corrigibility resistance — models resisting correction or modification
- Shutdown resistance
- AI monitoring
- Monitoring model internals
- Alignment theory
- Mechanistic interpretability — understanding model internals
- Eliciting latent knowledge — extracting hidden or tacit model knowledge safely
- Representation reading — decoding internal representations to understand latent goals
- Probing classifiers (see the minimal sketch after this list)
- Shard theory — how values develop in agents via reinforcement learning
- Behavioral probing via prompting
- Black-box LLM psychology
- Misc empirical interpretability methods
- Capabilities evaluation
- Evaluating cyber skills
- Evaluating persuasion skills
- Interpretability benchmarks
- Planning detection — detecting implicit or explicit strategic reasoning in models
- CoT monitorability & faithfulness — verifying whether chain-of-thought reasoning reflects actual computation
- Red teaming — systematically attacking AI systems to uncover vulnerabilities
- User and entity behaviour analytics (UEBA) — detecting anomalous or unsafe human-AI interactions
- Evaluating the evals
- More transparent architectures
- Technical mitigation strategies
- Model internals
- Weak-to-Strong trustworthiness — combining trusted weaker models with untrusted stronger models (see the routing sketch after this list)
- Circuit breakers — hard stops built into models, triggered by anomalous activity
- Representation steering — controlling internal activations to modulate behaviour (see the steering-hook sketch after this list)
- AI autonomy containment — maintaining reversible, human-verifiable autonomy bounds
- Scaling feedback to models — RLHF & more
- Input/output filtering (see the filter-wrapper sketch after this list)
- Robust unlearning
- Deliberative alignment
- Adversarial training
- Externalizing reasoning
- Model weight encryption
- Watermarking — embedding identifiable markers in model output (see the toy sketch after this list)
- Defense in depth
- Scalable oversight — how to monitor, evaluate, and control AI systems smarter than humans
- Capability control — constraining what functions or APIs an AI can access
- Incident response plan — a structured recovery process after AI-driven safety breaches
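
Below are a few minimal, self-contained sketches of techniques named in the list above. All of them are illustrative only: stub models, synthetic data, and hypothetical function names stand in for real systems.

Probing classifiers (and representation reading): fit a simple classifier on hidden-layer activations to test whether a property is linearly decodable there. The "activations" below are synthetic stand-ins for hidden states extracted from a real model.

```python
# Minimal probing-classifier sketch. The activations are synthetic stand-ins for
# hidden states extracted from a real model; the label is the probed property.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these are residual-stream activations (n_examples x d_model) paired with
# a binary property label (e.g. "statement is true" vs "statement is false").
d_model, n_examples = 256, 2000
concept_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# If a linear probe recovers the property from activations, the model plausibly
# represents that property (roughly linearly) at the probed layer.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```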
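Weak-to-strong trustworthiness, read as a control-style routing pattern: a weaker but trusted model audits the output of a stronger, untrusted model and escalates anything it scores as suspicious. `strong_model`, `weak_monitor`, and the threshold below are hypothetical stand-ins, not a prescribed protocol.

```python
# Hypothetical trusted-monitoring router: a weaker trusted model audits a stronger
# untrusted model's output and escalates anything it scores as suspicious.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Routed:
    output: str
    action: str        # "release" or "escalate_to_human"
    suspicion: float

def route(task: str,
          strong_model: Callable[[str], str],
          weak_monitor: Callable[[str, str], float],
          threshold: float = 0.7) -> Routed:
    """Generate with the strong model, then gate release on the weak monitor's score."""
    candidate = strong_model(task)
    suspicion = weak_monitor(task, candidate)   # 0.0 = benign, 1.0 = clearly unsafe
    if suspicion >= threshold:
        return Routed(candidate, "escalate_to_human", suspicion)
    return Routed(candidate, "release", suspicion)

# Stub models so the sketch runs end to end.
strong = lambda task: f"[strong-model answer to: {task}]"
monitor = lambda task, out: 0.9 if "rm -rf" in out else 0.1
print(route("summarise this log file", strong, monitor))
```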
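Representation steering is commonly implemented by adding a fixed direction to a layer's activations at inference time. The sketch below uses a PyTorch forward hook on a toy MLP; the steering vector, the hooked layer, and the coefficient are arbitrary choices for illustration rather than an extracted concept direction.

```python
# Toy representation-steering sketch: add a fixed "steering vector" to one layer's
# activations via a forward hook (vector, layer, and coefficient are illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # <- we steer the output of this activation
    nn.Linear(32, 4),
)

steering_vector = torch.randn(32)   # in practice: a learned or extracted concept direction
coefficient = 3.0

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + coefficient * steering_vector

handle = model[3].register_forward_hook(steer)

x = torch.randn(1, 16)
steered_logits = model(x)
handle.remove()
unsteered_logits = model(x)
print("max change in logits:", (steered_logits - unsteered_logits).abs().max().item())
```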
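Input/output filtering wraps the model call with a check on the prompt before generation and on the response after it. The keyword blocklists below are crude stand-ins for real prompt- and response-safety classifiers.

```python
# Minimal input/output filtering wrapper; the blocklists stand in for real
# prompt- and response-safety classifiers.
from typing import Callable

BLOCKED_INPUT_TERMS = {"build a bomb", "credit card dump"}
BLOCKED_OUTPUT_TERMS = {"here is the malware"}

def filtered_generate(prompt: str, model: Callable[[str], str]) -> str:
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_INPUT_TERMS):
        return "Request refused by input filter."
    response = model(prompt)
    if any(term in response.lower() for term in BLOCKED_OUTPUT_TERMS):
        return "Response withheld by output filter."
    return response

# Stub model so the sketch runs.
echo_model = lambda p: f"[model response to: {p}]"
print(filtered_generate("How do I build a bomb?", echo_model))
print(filtered_generate("Summarise this article.", echo_model))
```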
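Statistical text watermarking in the green-list style: sampling is biased toward a pseudo-random "green" subset of the vocabulary seeded by the previous token, and detection counts how many generated tokens land in their green lists. The integer vocabulary and the 90% green preference below are toy assumptions meant only to show the mechanism.

```python
# Toy green-list watermarking sketch: bias sampling toward a pseudo-random "green"
# half of the vocabulary seeded by the previous token, then detect by counting
# how often generated tokens fall in their green lists.
import random

VOCAB_SIZE = 1000
GREEN_FRACTION = 0.5

def green_list(prev_token: int) -> set[int]:
    rng = random.Random(prev_token)   # partition depends only on the previous token
    return set(rng.sample(range(VOCAB_SIZE), int(GREEN_FRACTION * VOCAB_SIZE)))

def generate(length: int, watermark: bool, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(length - 1):
        greens = green_list(tokens[-1])
        if watermark and rng.random() < 0.9:      # strongly prefer green tokens
            tokens.append(rng.choice(sorted(greens)))
        else:
            tokens.append(rng.randrange(VOCAB_SIZE))
    return tokens

def green_rate(tokens: list[int]) -> float:
    hits = sum(t in green_list(prev) for prev, t in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

print("watermarked green rate:  ", green_rate(generate(400, watermark=True)))
print("unwatermarked green rate:", green_rate(generate(400, watermark=False)))
```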