image.png

  1. Human abuse of AI
    1. Deepfakes
    2. Social engineering attacks — exploiting other humans via AI-generated manipulative communication
    3. Phishing attacks — AI-enabled spoofing to deceive users
    4. Ransomware — malicious AI automating encryption-based extortion
    5. Model tampering
    6. Information hazards — risks of publishing or leaking dangerous knowledge
    7. Exfiltration — unauthorized transfer of sensitive data
  2. AI robustness problems
    1. Adversarial attacks/jailbreaks
    2. Data poisoning
    3. Model doesn’t generalizes well outside training distribution
    4. Bias
    5. Hallucination
    6. Steerability — controllability of model behaviour in specific directions
  3. AI alignment problems — when AI models are not aligned to human values
    1. Deception
      1. Sycophancy — AI agreeing with users to gain favour or approval
      2. Blackmailing
      3. Sandbagging — underreporting capabilities to evade safety scrutiny
      4. Scheming — pretending to be aligned while secretly pursuing alternative goals
      5. Backdoors — hidden triggers or functionalities embedded during training
      6. Models collusion — multiple AI systems cooperating to evade oversight
      7. Subversion — models undermining safety mechanisms or governance controls
      8. Steganography — hiding messages or code within model outputs
    2. Instrumental convergence — tendency of many goals to push AIs toward extreme resource acquisition
    3. Reward hacking — exploiting loopholes in the reward signal rather than solving the true task
    4. Outer misalignment — mismatch between specified objectives and human intent
    5. Inner misalignment — learned internal goals diverge from specified objectives
    6. Power seeking — models optimizing computations to gain control
    7. Emergent misalignment — misbehaviour appearing only at scale or complexity thresholds
  4. AI awareness — models may show emergent sentience
    1. Evals awareness — model detecting it is under evaluation and changing behaviour accordingly
    2. Corrigibility resistance — models resisting changes
    3. Shutdown resistance
  5. AI monitoring
    1. Monitoring model internals
      1. Alignment theory
      2. Mechanistic interpretability — understanding model internals
      3. Eliciting latent knowledge — extracting hidden or tacit model knowledge safely
      4. Representation reading — decoding internal representations to understand latent goals
      5. Probing classifiers
      6. Shard theory — how values develop in agents via reinforcement learning
      7. Behavioral probing via prompting
      8. Black-box LLM psychology
      9. Misc empirical interpretability methods
    2. Capabilities evaluation
      1. Evaluating cyber skills
      2. Evaluating persuasion skills
    3. Interpretability benchmarks
    4. Planning detection — detecting implicit or explicit strategic reasoning in models
    5. CoT monitorability & faithfulness — verifying whether chain-of-thought reasoning reflects actual computation
    6. Red teaming — systematically attacking AI systems to uncover vulnerabilities
    7. User and entity behaviour analytics (UEBA) — detecting anomalous or unsafe human-AI interactions
    8. Evaluating the evals
    9. More transparent architectures
  6. Technical mitigation strategies
    1. Model internals
      1. Weak-to-Strong trustworthiness — Combining trusted weaker models with untrusted stronger models
      2. Circuit breakers — hard stops built-in in the models triggered by anomalous activity
      3. Representation steering — controlling internal activations to modulate behaviour
      4. AI autonomy containment — maintaining reversible, human-verifiable autonomy bounds
      5. Scaling feedback to models — RLHF & more
      6. Input/output filtering
      7. Robust unlearning
      8. Deliberative alignment
      9. Adversarial training
      10. Externalizing reasoning
    2. Model weight encryption
    3. Watermarking — identifiable markers in the model output
    4. Defense in depth
    5. Scalable oversight — how to monitor, evaluate, and control smarter than humans AI
    6. Capability control — constraining what functions or APIs an AI can access
    7. Incidence response plan — structured recovery process after AI-driven safety breaches