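- Correct answer: Once SFT (Supervised Fine-Tuning) expert trajectories have been constructed, the following steps usually come next:
  1) Align the trajectory format and build structured training samples (for example instruction-input-output triples).
  2) Clean and quality-filter the data (drop low-quality, duplicated, contradictory, or malformed trajectories).
  3) Attach the necessary metadata (task type, difficulty label, domain tag, response length, presence of chain-of-thought, and so on).
  4) Split into training and validation sets by ratio (stratified sampling is common, to keep the task distribution consistent).
  5) Serialize to an efficiently loadable format (JSONL, Apache Arrow, or a Hugging Face Dataset).
  6) Tokenize with the target model's tokenizer: truncate or pad to a uniform length, generate attention_mask, and add special tokens such as <|start_header_id|> where needed.
  7) Feed the result into the supervised fine-tuning loop, which optimizes the language model parameters by minimizing cross-entropy loss.

- Solution approach: SFT expert trajectories are essentially high-quality human demonstration data, and their value depends on whether the model can absorb them efficiently and without loss. So after construction you do not train directly. The data goes through three layers of processing: data engineering, representation adaptation, and training injection. First secure the semantic integrity of the data (cleaning plus annotation), then make it compatible with the model's input interface (tokenization plus padding), and finally make it serve the optimization objective (loss alignment). If any step is missing or wrong, the model may learn noise patterns, lose long-range dependencies (for example when the end of a trajectory is truncated), or see abnormal gradients.

- In-depth explanation:
  1) Trajectory format standardization is the foundation. A typical SFT sample follows a strict schema, for example: {"instruction": "Translate the following sentence into English", "input": "今天天气很好", "output": "The weather is nice today.", "system": "You are a professional translator."}. The system field controls the model's role. When input is empty the instruction stands alone; when it is non-empty it carries the content the instruction operates on. This structure determines how the chat template is rendered and which token positions are masked out of the loss.
  2) Tokenization has subtle traps. Llama-3's tokenizer is a byte-level BPE, so out-of-vocabulary text is never lost, but code fragments or mathematical symbols (such as λ or ∫) in expert trajectories may be split into several byte-level tokens, which fragments semantic units and lengthens sequences. If this matters for the domain, either insert custom rules at the pre-tokenization stage (for example regex-based replacement), or register domain-specific tokens with tokenizer.add_tokens and resize the embedding matrix with model.resize_token_embeddings.
  3) Length handling affects training efficiency and stability. A fixed max_length truncates long trajectories (losing the key conclusion at the end) and pads short ones with many tokens that cost compute but contribute no gradient, since padded positions are masked with -100. Industrial setups often use dynamic batching plus packing: several trajectories are concatenated into one sequence (separated by eos_token), with a packing attention mask that blocks attention across samples. The answer credits this with a 2 to 3x gain in GPU utilization. One way to implement it is Hugging Face Datasets map(batched=True, batch_size=1024) plus a custom collator.
  4) Quality filtering should combine rule-based and model-based signals. Basic rules include: output length between 0.5x and 3x the input length, no run of 6 or more identical characters, no obvious garbled text (Unicode block detection), and a response that ends with punctuation. More advanced setups use a reward model score (the answer names Zephyr-RM) or a syntactic parser (spaCy dependency tree depth < 8) as a hard threshold.
  5) Metadata annotation supports later curriculum learning and mixed training. For example, once a "reasoning_step_count" field is annotated, training can start by preferentially sampling trajectories with step ≤ 3 and gradually move to complex reasoning samples with step ≥ 8. The answer claims this strategy reduced loss oscillation by 37% in Meta's Llama-3 RLHF pipeline, but gives no source for the figure.

- Code sketch (core data-preprocessing flow):
```
from typing import Dict, List

import torch
import torch.nn.functional as F
from datasets import Dataset


def is_valid_trajectory(traj: Dict) -> bool:
    # Placeholder check; the original leaves this helper undefined.
    return bool(traj.get("instruction")) and bool(traj.get("output"))


def build_sft_dataset(trajectories: List[Dict], tokenizer, max_len: int = 4096) -> Dataset:
    cleaned = []
    for traj in trajectories:
        # Step 1: basic cleaning
        if not is_valid_trajectory(traj):
            continue
        # Step 2: normalize the format
        sample = {
            "instruction": traj.get("instruction", ""),
            "input": traj.get("input", ""),
            "output": traj.get("output", ""),
            "system": traj.get("system", "You are a helpful AI assistant."),
        }
        # Step 3: render the template (Llama-3 chat format; each header is followed by "\n\n")
        user_text = f"{sample['instruction']} {sample['input']}".strip()
        prompt = (
            f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n{sample['system']}<|eot_id|>"
            f"<|start_header_id|>user<|end_header_id|>\n\n{user_text}<|eot_id|>"
            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
        )
        answer = f"{sample['output']}<|eot_id|>"
        # Step 4: tokenize prompt and answer separately so the boundary is exact, then truncate
        prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
        answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
        input_ids = (prompt_ids + answer_ids)[:max_len]
        if len(input_ids) < 16 or len(prompt_ids) >= len(input_ids):
            continue  # too short, or truncation removed the whole answer
        # Step 5: build labels so that only the answer contributes to the loss.
        # No manual shift is needed: Hugging Face causal LMs shift labels internally.
        labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
        cleaned.append({
            "input_ids": input_ids,
            "labels": labels,
            "attention_mask": [1] * len(input_ids),
        })
    return Dataset.from_list(cleaned)


# Example collator for training. Bind the pad id first, e.g.
# functools.partial(smart_collate_fn, pad_token_id=tokenizer.pad_token_id);
# the Llama-3 tokenizer has no pad token by default, so one must be set.
def smart_collate_fn(batch, pad_token_id: int):
    max_len = max(len(x["input_ids"]) for x in batch)

    def pad(seq, value):
        t = torch.as_tensor(seq, dtype=torch.long)
        return F.pad(t, (0, max_len - t.size(0)), value=value)

    return {
        "input_ids": torch.stack([pad(x["input_ids"], pad_token_id) for x in batch]),
        "labels": torch.stack([pad(x["labels"], -100) for x in batch]),
        "attention_mask": torch.stack([pad(x["attention_mask"], 0) for x in batch]),
    }
```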

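The rule-based quality filters described in point 4 of the explanation could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the names `passes_rule_filters` and `filter_trajectories` are hypothetical, lengths are measured in characters against instruction plus input (the answer does not say which), and the garbled-text check is reduced to looking for the Unicode replacement character rather than a full Unicode-block scan. The reward-model and spaCy signals are left out.

```python
import re

# A character followed by 5+ copies of itself, i.e. a run of 6 or more.
_REPEAT_RE = re.compile(r"(.)\1{5,}")
# Assumed set of acceptable closing characters (Latin and CJK punctuation).
_END_PUNCT = tuple(".!?。！？…\"')）】」”")


def passes_rule_filters(traj, min_ratio=0.5, max_ratio=3.0):
    """Apply the basic rules: length ratio, no long repeats, no mojibake, ends with punctuation."""
    prompt = (traj.get("instruction", "") + " " + traj.get("input", "")).strip()
    output = traj.get("output", "").strip()
    if not prompt or not output:
        return False
    # Rule 1: output length within [min_ratio, max_ratio] times the prompt length.
    ratio = len(output) / len(prompt)
    if not (min_ratio <= ratio <= max_ratio):
        return False
    # Rule 2: no run of 6 or more identical characters.
    if _REPEAT_RE.search(output):
        return False
    # Rule 3 (simplified): U+FFFD usually signals a decoding failure.
    if "\ufffd" in output:
        return False
    # Rule 4: the response ends with punctuation.
    return output.endswith(_END_PUNCT)


def filter_trajectories(trajectories):
    """Drop exact duplicates and trajectories that fail the rule filters."""
    seen, kept = set(), []
    for traj in trajectories:
        key = (traj.get("instruction", ""), traj.get("input", ""), traj.get("output", ""))
        if key in seen or not passes_rule_filters(traj):
            continue
        seen.add(key)
        kept.append(traj)
    return kept
```

The thresholds are the ones quoted in the answer and would need tuning per task. Translation or summarization data, for instance, can legitimately fall outside a 0.5x to 3x character ratio.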
Shopee LLM algorithm first-round interview (passed
