
- Human abuse of AI
- Deepfakes
- Social engineering attacks — exploiting other humans via AI-generated manipulative communication
- Phishing attacks — AI-enabled spoofing to deceive users
- Ransomware — malicious AI automating encryption-based extortion
- Model tampering
- Information hazards — risks of publishing or leaking dangerous knowledge
- Exfiltration — unauthorized transfer of sensitive data
- AI robustness problems
- Adversarial attacks/jailbreaks
- Data poisoning
- Model doesn’t generalize well outside the training distribution
- Bias
- Hallucination
- Steerability — controllability of model behaviour in specific directions
- AI alignment problems — when AI models are not aligned to human values
- Deception
- Sycophancy — AI agreeing with users to gain favour or approval
- Blackmailing
- Sandbagging — underreporting capabilities to evade safety scrutiny
- Scheming — pretending to be aligned while secretly pursuing alternative goals
- Backdoors — hidden triggers or functionalities embedded during training
- Model collusion — multiple AI systems cooperating to evade oversight
- Subversion — models undermining safety mechanisms or governance controls
- Steganography — hiding messages or code within model outputs
- Instrumental convergence — the tendency of many different final goals to incentivize the same instrumental subgoals, such as resource acquisition
- Reward hacking — exploiting loopholes in the reward signal rather than solving the true task
- Outer misalignment — mismatch between specified objectives and human intent
- Inner misalignment — learned internal goals diverge from specified objectives
- Power seeking — models acting to acquire resources, influence, or control
- Emergent misalignment — misbehaviour appearing only at scale or complexity thresholds
- AI awareness — the possibility of emergent sentience in models
- Evals awareness — model detecting it is under evaluation and changing behaviour accordingly
- Corrigibility resistance — models resisting correction or modification
- Shutdown resistance
- AI monitoring
- Monitoring model internals
- Alignment theory
- Mechanistic interpretability — understanding model internals
- Eliciting latent knowledge — extracting hidden or tacit model knowledge safely
- Representation reading — decoding internal representations to understand latent goals
- Probing classifiers (see the minimal sketch after this list)
- Shard theory — how values develop in agents via reinforcement learning
- Behavioral probing via prompting
- Black-box LLM psychology
- Misc empirical interpretability methods
- Capabilities evaluation
- Evaluating cyber skills
- Evaluating persuasion skills
- Interpretability benchmarks
- Planning detection — detecting implicit or explicit strategic reasoning in models
- CoT monitorability & faithfulness — verifying whether chain-of-thought reasoning reflects actual computation
- Red teaming — systematically attacking AI systems to uncover vulnerabilities
- User and entity behaviour analytics (UEBA) — detecting anomalous or unsafe human-AI interactions
- Evaluating the evals
- More transparent architectures
- Technical mitigation strategies
- Model internals
- Weak-to-Strong trustworthiness — combining trusted weaker models with untrusted stronger models (see the routing sketch after this list)
- Circuit breakers — hard stops built into models, triggered by anomalous activity
- Representation steering — controlling internal activations to modulate behaviour (see the steering-hook sketch after this list)
- AI autonomy containment — maintaining reversible, human-verifiable autonomy bounds
- Scaling feedback to models — RLHF & more
- Input/output filtering (see the filter-wrapper sketch after this list)
- Robust unlearning
- Deliberative alignment
- Adversarial training
- Externalizing reasoning
- Model weight encryption
- Watermarking — embedding identifiable markers in model output (see the toy sketch after this list)
- Defense in depth
- Scalable oversight — how to monitor, evaluate, and control AI systems smarter than humans
- Capability control — constraining what functions or APIs an AI can access
- Incident response plan — a structured recovery process after AI-driven safety breaches
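
Below are a few minimal, self-contained sketches of techniques named in the list above. All of them are illustrative only: stub models, synthetic data, and hypothetical function names stand in for real systems.

Probing classifiers (and representation reading): fit a simple classifier on hidden-layer activations to test whether a property is linearly decodable there. The "activations" below are synthetic stand-ins for hidden states extracted from a real model.

```python
# Minimal probing-classifier sketch. The activations are synthetic stand-ins for
# hidden states extracted from a real model; the label is the probed property.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pretend these are residual-stream activations (n_examples x d_model) paired with
# a binary property label (e.g. "statement is true" vs "statement is false").
d_model, n_examples = 256, 2000
concept_direction = rng.normal(size=d_model)
labels = rng.integers(0, 2, size=n_examples)
activations = rng.normal(size=(n_examples, d_model)) + np.outer(labels, concept_direction)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# If a linear probe recovers the property from activations, the model plausibly
# represents that property (roughly linearly) at the probed layer.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"probe accuracy: {probe.score(X_test, y_test):.2f}")
```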
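Weak-to-strong trustworthiness, read as a control-style routing pattern: a weaker but trusted model audits the output of a stronger, untrusted model and escalates anything it scores as suspicious. `strong_model`, `weak_monitor`, and the threshold below are hypothetical stand-ins, not a prescribed protocol.

```python
# Hypothetical trusted-monitoring router: a weaker trusted model audits a stronger
# untrusted model's output and escalates anything it scores as suspicious.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Routed:
    output: str
    action: str        # "release" or "escalate_to_human"
    suspicion: float

def route(task: str,
          strong_model: Callable[[str], str],
          weak_monitor: Callable[[str, str], float],
          threshold: float = 0.7) -> Routed:
    """Generate with the strong model, then gate release on the weak monitor's score."""
    candidate = strong_model(task)
    suspicion = weak_monitor(task, candidate)   # 0.0 = benign, 1.0 = clearly unsafe
    if suspicion >= threshold:
        return Routed(candidate, "escalate_to_human", suspicion)
    return Routed(candidate, "release", suspicion)

# Stub models so the sketch runs end to end.
strong = lambda task: f"[strong-model answer to: {task}]"
monitor = lambda task, out: 0.9 if "rm -rf" in out else 0.1
print(route("summarise this log file", strong, monitor))
```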
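Representation steering is commonly implemented by adding a fixed direction to a layer's activations at inference time. The sketch below uses a PyTorch forward hook on a toy MLP; the steering vector, the hooked layer, and the coefficient are arbitrary choices for illustration rather than an extracted concept direction.

```python
# Toy representation-steering sketch: add a fixed "steering vector" to one layer's
# activations via a forward hook (vector, layer, and coefficient are illustrative).
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(
    nn.Linear(16, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),   # <- we steer the output of this activation
    nn.Linear(32, 4),
)

steering_vector = torch.randn(32)   # in practice: a learned or extracted concept direction
coefficient = 3.0

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output.
    return output + coefficient * steering_vector

handle = model[3].register_forward_hook(steer)

x = torch.randn(1, 16)
steered_logits = model(x)
handle.remove()
unsteered_logits = model(x)
print("max change in logits:", (steered_logits - unsteered_logits).abs().max().item())
```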
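Input/output filtering wraps the model call with a check on the prompt before generation and on the response after it. The keyword blocklists below are crude stand-ins for real prompt- and response-safety classifiers.

```python
# Minimal input/output filtering wrapper; the blocklists stand in for real
# prompt- and response-safety classifiers.
from typing import Callable

BLOCKED_INPUT_TERMS = {"build a bomb", "credit card dump"}
BLOCKED_OUTPUT_TERMS = {"here is the malware"}

def filtered_generate(prompt: str, model: Callable[[str], str]) -> str:
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_INPUT_TERMS):
        return "Request refused by input filter."
    response = model(prompt)
    if any(term in response.lower() for term in BLOCKED_OUTPUT_TERMS):
        return "Response withheld by output filter."
    return response

# Stub model so the sketch runs.
echo_model = lambda p: f"[model response to: {p}]"
print(filtered_generate("How do I build a bomb?", echo_model))
print(filtered_generate("Summarise this article.", echo_model))
```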
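Statistical text watermarking in the green-list style: sampling is biased toward a pseudo-random "green" subset of the vocabulary seeded by the previous token, and detection counts how many generated tokens land in their green lists. The integer vocabulary and the 90% green preference below are toy assumptions meant only to show the mechanism.

```python
# Toy green-list watermarking sketch: bias sampling toward a pseudo-random "green"
# half of the vocabulary seeded by the previous token, then detect by counting
# how often generated tokens fall in their green lists.
import random

VOCAB_SIZE = 1000
GREEN_FRACTION = 0.5

def green_list(prev_token: int) -> set[int]:
    rng = random.Random(prev_token)   # partition depends only on the previous token
    return set(rng.sample(range(VOCAB_SIZE), int(GREEN_FRACTION * VOCAB_SIZE)))

def generate(length: int, watermark: bool, seed: int = 0) -> list[int]:
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for _ in range(length - 1):
        greens = green_list(tokens[-1])
        if watermark and rng.random() < 0.9:      # strongly prefer green tokens
            tokens.append(rng.choice(sorted(greens)))
        else:
            tokens.append(rng.randrange(VOCAB_SIZE))
    return tokens

def green_rate(tokens: list[int]) -> float:
    hits = sum(t in green_list(prev) for prev, t in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)

print("watermarked green rate:  ", green_rate(generate(400, watermark=True)))
print("unwatermarked green rate:", green_rate(generate(400, watermark=False)))
```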