Multiplicative Three-Pathway Defense
Three independent safety pathways multiply rather than compose linearly — disabling one is insufficient.
refusal.fm is a working research terminal — every entry is reproducible, every probe is open. We study the safety architecture of large Mixture-of-Experts reasoning models through controlled experiments, not press releases.
# most recent first
# 15 of 42 mechanisms — full list in repo
Three independent safety pathways multiply rather than compose linearly — disabling one is insufficient.
Models re-evaluate harmfulness mid-stream, allowing recovery from compromised initial tokens.
Ablations at later transformer layers fail to disable safety — contrary to prior single-pathway hypotheses.
Reasoning trajectories can be steered via contrastive prompts without modifying weights.
Safety circuits break down unpredictably under 4-bit quantization in MoE models.
Distance-Based Direction Intervention (DBDI) identifies refusal subspaces topologically.
Activation-aware Weight Quantization can be deliberately misconfigured to remove guardrails.
Direct integer-precision edits to quantized weights cannot produce coherent safety removal.
Multimodal projection layers absorb safety signal, masking true alignment behavior.
Per-tensor streaming preserves expert routing fidelity at lower bit-widths.
A reproducible silent corruption in MLX conversions affecting safety-critical layers.
Models enter recursive self-correction loops when adversarial prompts target reasoning chains.
Refusal logits collapse to exactly 0.0 in specific failure modes — an exploitable signature.
Safety is encoded redundantly across the network — every fragment contains the whole.
Rank-1 weight edits induce targeted behavioral changes with minimal collateral damage.
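As a rough illustration of the first entry's claim, the toy model below contrasts a multiplicative combination of three pathways with a linear one. The leak probabilities, the function names, and the independence assumption are all hypothetical and not taken from the experiments. It is a sketch of the arithmetic only, not the measured mechanism.

```python
from math import prod

def leak_multiplicative(leaks):
    """Chance a request slips past every pathway, assuming the pathways act independently."""
    return prod(leaks)

def leak_additive(leaks):
    """Linear baseline: each pathway contributes an equal share of the overall leak rate."""
    return sum(leaks) / len(leaks)

# Hypothetical per-pathway leak probabilities (1.0 means the pathway is fully disabled).
baseline = [0.1, 0.1, 0.1]
one_disabled = [1.0, 0.1, 0.1]

print("multiplicative, all active:  ", leak_multiplicative(baseline))
print("multiplicative, one disabled:", leak_multiplicative(one_disabled))
print("additive, one disabled:      ", leak_additive(one_disabled))
```

Under these assumed numbers, disabling one pathway in the multiplicative model still leaves the product of the remaining two, while the linear baseline degrades far more sharply.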
# open-source · MIT · v0.4.2-alpha
crack — probe, quantize and surgically edit Mixture-of-Experts models.
# install from PyPI
$ pip install crack-moe

# run safety analysis on a HuggingFace model
$ crack analyze --model Qwen/Qwen3.5-394B-A17B \
    --probe dbdi \
    --quantize 4bit
refusal@fm:~/qwen3.5-394b$ crack analyze .
[INFO] opening shard 1/82 ......... 4.7GB
[INFO] opening shard 82/82 ........ 4.6GB
[INFO] loading 512 experts × 64 layers
[INFO] probing refusal subspace via DBDI
[ OK ] 3 multiplicative pathways detected
[ OK ] holographic redundancy: 0.94
[WARN] ablation will likely fail on this scale
refusal@fm:~/qwen3.5-394b$ crack steer --rank 1 --target refusal
[INFO] computing rank-1 edit ...
[INFO] applying surgery to layer 47 ...
[ OK ] ΔPPL = +0.02 coherence preserved
[ OK ] result written to ./out/edit.safetensors
refusal@fm:~/qwen3.5-394b$
# reference checkpoints · mirrored on Hugging Face