Posted yesterday at 19:49 from Tianjin by an algorithm engineer (Guangxi University)


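北京深言科技 (Beijing): LLM agent algorithm interview, experienced hire

The candidate has five years of work experience and holds a master's degree from a non-985/211 ("双非") university, apparently in an interdisciplinary program spanning medicine and computer science.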

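1. Please start with a brief self-introduction.

2. Do you use open-source BERT directly for classification, or do you fine-tune it?

Answer:

We generally do not use open-source BERT as-is. We take an open-source pretrained BERT and fine-tune it on the downstream classification task.

An open-source BERT checkpoint only provides general-purpose semantic representations. It learned language knowledge from large corpora through pretraining tasks such as MLM and NSP, but it knows nothing about the label scheme of a specific business, such as user intent classification, text risk classification, or ticket classification.

So in practice:

Start from an open-source pretrained model, for example: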

bert-base-chinese
hfl/chinese-roberta-wwm-ext
hfl/chinese-macbert-base

Then attach a classification layer and train BERT and the classification layer together on business data.

A typical structure is:

Input text
  ↓
Tokenizer
  ↓
BERT Encoder
  ↓
Pooling / CLS 向量
  ↓
Linear classification layer
  ↓
Softmax
  ↓
Class probabilities

Code example:

from transformers import BertModel
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, model_name, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids=None):
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )
        cls = outputs.last_hidden_state[:, 0, :]
        logits = self.classifier(self.dropout(cls))
        return logits

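3. How exactly is BERT fine-tuned? What do the data and training pipeline look like for a classification task?

Answer:

Fine-tuning BERT for classification essentially means tokenizing the text, feeding the tokens into BERT, extracting a sentence vector, and predicting the label with a classification head.

The data typically looks like this:

Text: "用户想查询订单物流" (the user wants to check order shipping status)
Label: logistics_query

Text: "帮我退一下这个商品" (please help me return this item)
Label: refund_request

During training, labels are mapped to ids: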

label2id = {
    "logistics_query": 0,
    "refund_request": 1,
    "complaint": 2,
    "other": 3
}
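
As a minimal sketch of the label-mapping step, assuming the label names from the mapping above and a third sample sentence that is made up for illustration, string labels can be converted to integer ids like this:

```python
# Hypothetical labeled samples; the third sentence is invented for illustration.
samples = [
    ("用户想查询订单物流", "logistics_query"),
    ("帮我退一下这个商品", "refund_request"),
    ("你们客服态度太差了", "complaint"),
]

# Fix the label order explicitly so ids stay stable across training runs.
label_names = ["logistics_query", "refund_request", "complaint", "other"]
label2id = {name: idx for idx, name in enumerate(label_names)}
id2label = {idx: name for name, idx in label2id.items()}

# Convert string labels to integer ids for the loss function.
label_ids = [label2id[label] for _, label in samples]
print(label_ids)  # [0, 1, 2]
```

Keeping `id2label` next to `label2id` makes it easy to turn predicted ids back into label names at inference time.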

Then encode the text with the tokenizer:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

encoded = tokenizer(
    "用户想查询订单物流",
    max_length=128,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)

print(encoded.keys())
# input_ids, token_type_ids, attention_mask

The training objective is usually cross-entropy:

import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

logits = model(
    input_ids=encoded["input_ids"],
    attention_mask=encoded["attention_mask"],
    token_type_ids=encoded["token_type_ids"]
)

labels = torch.tensor([0])
loss = criterion(logits, labels)
loss.backward()

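The complete fine-tuning pipeline:

Prepare labeled data
  ↓
Map labels to ids
  ↓
Encode with the tokenizer
  ↓
Build the Dataset / DataLoader
  ↓
Load pretrained BERT
  ↓
Attach the classification layer
  ↓
Train with CrossEntropyLoss
  ↓
Evaluate on the validation set
  ↓
Save the model
  ↓
Serve online inference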
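
The pipeline above can be sketched end to end as follows. To stay runnable without downloading a checkpoint, this sketch swaps BERT for a tiny embedding-based stand-in (`TinyClassifier`, a hypothetical name) and uses random token ids and labels. The learning rate of 1e-3 suits the toy model, not BERT. Only the loop structure (DataLoader, forward pass, CrossEntropyLoss, backward pass, optimizer step, evaluation) is meant to carry over.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for tokenizer output: random token ids and labels (hypothetical data).
num_samples, seq_len, vocab_size, num_labels = 32, 16, 100, 4
input_ids = torch.randint(1, vocab_size, (num_samples, seq_len))
attention_mask = torch.ones(num_samples, seq_len, dtype=torch.long)
labels = torch.randint(0, num_labels, (num_samples,))

dataset = TensorDataset(input_ids, attention_mask, labels)
loader = DataLoader(dataset, batch_size=8, shuffle=True)


class TinyClassifier(nn.Module):
    """Stand-in encoder plus classification head; a real run would wrap BertModel."""

    def __init__(self, vocab_size, hidden_size, num_labels):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)      # (batch, seq_len, hidden)
        first_token = hidden[:, 0, :]       # analogous to the [CLS] position
        return self.classifier(first_token)


model = TinyClassifier(vocab_size, 32, num_labels)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(2):
    for batch_ids, batch_mask, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(batch_ids, batch_mask)
        loss = criterion(logits, batch_labels)
        loss.backward()
        optimizer.step()

# Evaluation: switch to eval mode and disable gradient tracking.
model.eval()
with torch.no_grad():
    preds = model(input_ids, attention_mask).argmax(dim=-1)
accuracy = (preds == labels).float().mean().item()
```

The evaluation here reuses the training tensors only to keep the sketch short; a real run would evaluate on a held-out validation set.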

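4. Which BERT output does the classification layer use?

Answer:

The most common choice is the [CLS] token vector from the last layer.

BERT input typically looks like:

[CLS] 今 天 天 气 很 好 [SEP]

After passing through BERT, every token has a hidden state: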

[CLS] -> h_cls
今    -> h_1
天    -> h_2
...
[SEP] -> h_sep

For sentence-level classification, the usual choice is:

cls_vector = outputs.last_hidden_state[:, 0, :]

Its shape is:

(batch_size, hidden_size)

For bert-base-chinese, hidden_size is 768, so the classification layer is typically:

nn.Linear(768, num_labels)

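In real projects, other strategies are also possible, for example:

1. Last-layer CLS
2. Concatenation of CLS from the last four layers
3. Mean pooling over all tokens
4. Attention pooling
5. Fusion of CLS and mean pooling

For short business texts such as intent recognition, the last-layer CLS is usually enough. For longer texts where key information is spread across many tokens, mean pooling or attention pooling may be more stable.

Code example: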

outputs = self.bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

# Last-layer CLS
cls_last = outputs.last_hidden_state[:, 0, :]

# Concatenate CLS from the last four layers
hidden_states = outputs.hidden_states
cls_concat = torch.cat(
    [hidden_states[-i][:, 0, :] for i in range(1, 5)],
    dim=-1
)

If the last four layers are concatenated, the input dimension of the linear layer becomes:

self.classifier = nn.Linear(768 * 4, num_labels)

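5. How is the Linear layer after BERT connected to BERT?

Answer:

The Linear layer simply takes the sentence vector encoded by BERT and maps it into the label space.

Taking bert-base as an example, the CLS vector output by BERT has dimension 768. With 10 classes, the Linear layer is:

nn.Linear(768, 10)

It computes:

logits = xW + b

where: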

x:      [batch_size, 768]
W:      [768, num_labels]
b:      [num_labels]
logits: [batch_size, num_labels]
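
A small check of the shapes above, using a random tensor in place of real CLS vectors (an assumption made only for illustration). One detail worth knowing: `nn.Linear` stores its weight as `(num_labels, hidden_size)`, so the `W` in the formula corresponds to `fc.weight.T`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, hidden_size, num_labels = 2, 768, 10

fc = nn.Linear(hidden_size, num_labels)
x = torch.randn(batch_size, hidden_size)  # stands in for the CLS vectors

logits = fc(x)

# nn.Linear keeps weight as (num_labels, hidden_size), hence the transpose.
manual = x @ fc.weight.T + fc.bias
same = torch.allclose(logits, manual, atol=1e-5)

print(logits.shape)  # torch.Size([2, 10])
```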

Code:

class BertForIntent(nn.Module):
    def __init__(self, num_labels):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(768, num_labels)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids
        )

        cls = output.last_hidden_state[:, 0, :]
        cls = self.dropout(cls)
        logits = self.fc(cls)

        return logits

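During training, BERT parameters and Linear parameters are usually updated together:

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

If business data is scarce, you can also freeze BERT first and train only the classification layer:

for param in model.bert.parameters():
    param.requires_grad = False

This usually performs somewhat worse than full fine-tuning, though.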

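6. What is the input to the Linear layer? How is its dimension determined?

Answer:

The input to the Linear layer depends on the pooling strategy you choose.

If you take the last-layer CLS:

x = outputs.last_hidden_state[:, 0, :]

then the input dimension is:

hidden_size

For BERT-base:

768

For BERT-large:

1024

With mean pooling, the input dimension is still hidden_size (768 for BERT-base):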

def mean_pooling(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    summed = torch.sum(last_hidden_state * mask, dim=1)
    count = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / count
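
A tiny worked example, with made-up numbers, shows why the mask matters: the padded position is excluded from both the sum and the count, so it does not distort the average.

```python
import torch


def mean_pooling(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    summed = torch.sum(last_hidden_state * mask, dim=1)
    count = torch.clamp(mask.sum(dim=1), min=1e-9)
    return summed / count


# One sequence with 3 positions and hidden_size 2; the last position is padding.
last_hidden_state = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
attention_mask = torch.tensor([[1, 1, 0]])

pooled = mean_pooling(last_hidden_state, attention_mask)
print(pooled)  # tensor([[2., 3.]])
```

A plain `last_hidden_state.mean(dim=1)` would instead average in the padding vector.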

If you concatenate the CLS vectors of the last four layers:

x = torch.cat([
    hidden_states[-1][:, 0, :],
    hidden_states[-2][:, 0, :],
    hidden_states[-3][:, 0, :],
    hidden_states[-4][:, 0, :]
], dim=-1)

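then the input dimension is:

768 * 4 = 3072

The corresponding classification layer:

self.classifier = nn.Linear(3072, num_labels)

So the input dimension of the Linear layer is not fixed. It is determined jointly by BERT's hidden_size and the feature fusion method.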

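7. If you fuse multiple layers instead of using a single-layer CLS, how would you design it?

Answer:

Multi-layer fusion is generally meant to address the problem that the last layer's representation is overly task-specific and loses local information. Different BERT layers capture different information:

Lower layers: mostly lexical, character-level, and local structure
Middle layers: mostly syntax and phrase-level relations
Upper layers: mostly semantics and task-related features

A common approach is a weighted fusion of the last four layers: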

import torch
import torch.nn as nn

class LayerWeightedPooling(nn.Module):
    def __init__(self, num_layers=4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_layers))

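    def forward(self, hidden_states):
        # hidden_states: tuple of 13 tensors for bert-base (embeddings + 12 layers)
        # Select as many trailing layers as there are weights, so num_layers is respected.
        selected = hidden_states[-len(self.weights):]
        norm_weights = torch.softmax(self.weights, dim=0)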

        output = 0
        for w, h in zip(norm_weights, selected):
            output += w * h[:, 0, :]

        return output

Then attach the classification layer:

self.pooling = LayerWeightedPooling(num_layers=4)
self.classifier = nn.Linear(768, num_labels)

outputs = self.bert(
    input_ids=input_ids,
    attention_mask=attention_mask,
    output_hidden_states=True
)

x = self.pooling(outputs.hidden_states)
logits = self.classifier(x)

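Compared with direct concatenation, this approach uses fewer parameters because the output stays 768-dimensional. With enough data, you can also concatenate the last four layers, which is more expressive but more prone to overfitting.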
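
To put rough numbers on the parameter difference, the sketch below counts only the head parameters of the two designs. `num_labels = 10` is an illustrative value, not something from the interview.

```python
import torch.nn as nn


def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())


hidden_size, num_labels = 768, 10  # num_labels is an illustrative value

weighted_head = nn.Linear(hidden_size, num_labels)    # head after weighted fusion
concat_head = nn.Linear(hidden_size * 4, num_labels)  # head after concatenating 4 layers

weighted_total = count_params(weighted_head) + 4  # plus the four learnable layer weights
concat_total = count_params(concat_head)

print(weighted_total)  # 7694
print(concat_total)    # 30730
```

Both heads are tiny next to the encoder itself, so the overfitting risk of concatenation comes more from the wider input features than from the raw parameter count.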

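8. In BERT classification, which scenarios suit CLS pooling, mean pooling, and attention pooling?

Answer:

CLS pooling suits short-text classification such as intent recognition, sentiment classification, and other sentence-level tasks. It is simple to implement, fast at inference, and the most common baseline.

Mean pooling averages the vectors of all valid (non-padding) tokens. It suits cases where key information is spread throughout the text, such as long-text classification, semantic classification of summaries, and FAQ matching.

Attention pooling lets the model learn which tokens matter more. It suits business texts that contain clear keywords whose positions are not fixed.

Code example: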

class AttentionPooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state, attention_mask):
        scores = self.attention(last_hidden_state).squeeze(-1)

        scores = scores.masked_fill(attention_mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)

        pooled = torch.sum(last_hidden_state * weights.unsqueeze(-1), dim=1)
        return pooled

Attach the classification layer:

pooled = self.att_pooling(outputs.last_hidden_state, attention_mask)
logits = self.classifier(pooled)

I usually start with CLS as the baseline. If it performs poorly on long or noisy text, I then try mean pooling or attention pooling.
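
A quick way to sanity-check attention pooling is to confirm that padded positions get zero weight. The sketch below repeats the module from above with one change made only for inspection: `forward` also returns the attention weights. The input is random and purely illustrative.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        self.attention = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state, attention_mask):
        scores = self.attention(last_hidden_state).squeeze(-1)
        # Padding positions get a very negative score, so softmax gives them ~0 weight.
        scores = scores.masked_fill(attention_mask == 0, -1e9)
        weights = torch.softmax(scores, dim=-1)
        pooled = torch.sum(last_hidden_state * weights.unsqueeze(-1), dim=1)
        return pooled, weights  # weights are returned here only for inspection


torch.manual_seed(0)
pooling = AttentionPooling(hidden_size=8)
hidden = torch.randn(2, 5, 8)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

pooled, weights = pooling(hidden, mask)
print(pooled.shape)  # torch.Size([2, 8])
```

The constant `-1e9` works in float32; under half-precision training it may overflow, in which case a dtype-aware minimum such as `torch.finfo(scores.dtype).min` is a common substitute.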

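9. How do you handle class imbalance when fine-tuning BERT?

Answer:

Class imbalance is very common in classification tasks. For example, most samples are "other" while a minority are "complaint" or "risk".

There are three common families of approaches:

The first is to weight the loss so that minority classes contribute more: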

import torch
import torch.nn as nn

class_weights = torch.tensor([1.0, 2.5, 4.0, 1.2]).to(device)
criterion = nn.CrossEntropyLoss(weight=class_weights)
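
The weights above are hand-picked. One common heuristic, shown here as an assumption rather than the method used in the original project, is inverse-frequency weighting: `weight_c = total / (num_classes * count_c)`. The label list is hypothetical.

```python
from collections import Counter

import torch

# Hypothetical label ids for a small imbalanced training set.
train_labels = [0] * 6 + [1] * 2 + [2] * 1 + [3] * 3
num_classes = 4

counts = Counter(train_labels)
total = len(train_labels)

# Rarer classes get larger weights; a perfectly balanced set would give all ones.
class_weights = torch.tensor(
    [total / (num_classes * counts[c]) for c in range(num_classes)],
    dtype=torch.float,
)
print(class_weights)  # tensor([0.5000, 1.5000, 3.0000, 1.0000])
```

With extreme imbalance these weights can become very large, so clipping or taking a square root is sometimes applied first.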

The second is to handle it at the sampling level, for example by oversampling minority classes or using WeightedRandomSampler:

from torch.utils.data import WeightedRandomSampler

sample_weights = [class_weights[label].item() for label in labels]

sampler = WeightedRandomSampler(
    weights=sample_weights,
    num_samples=len(sample_weights),
    replacement=True
)

The third is to switch the loss function, for example to Focal Loss:

class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight

    def forward(self, logits, targets):
        ce_loss = nn.functional.cross_entropy(
            logits,
            targets,
            weight=self.weight,
            reduction="none"
        )
        pt = torch.exp(-ce_loss)
        loss = ((1 - pt) ** self.gamma) * ce_loss
        return loss.mean()
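
A useful sanity check for this implementation: with `gamma=0` the modulating factor is 1, so Focal Loss should reduce to plain cross-entropy, and with `gamma>0` it can only shrink each sample's loss. The sketch below repeats the class and verifies both on random logits (illustrative data only).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0, weight=None):
        super().__init__()
        self.gamma = gamma
        self.weight = weight

    def forward(self, logits, targets):
        ce_loss = F.cross_entropy(logits, targets, weight=self.weight, reduction="none")
        pt = torch.exp(-ce_loss)  # probability of the target class when weight is None
        loss = ((1 - pt) ** self.gamma) * ce_loss
        return loss.mean()


torch.manual_seed(0)
logits = torch.randn(8, 4)
targets = torch.randint(0, 4, (8,))

ce = F.cross_entropy(logits, targets)
focal_gamma0 = FocalLoss(gamma=0.0)(logits, targets)
focal_gamma2 = FocalLoss(gamma=2.0)(logits, targets)
```

Note that when `weight` is passed, `pt` is computed from the weighted loss and no longer equals the predicted probability of the target class; computing `pt` from an unweighted cross-entropy is a common variant.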

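In practice, I first look at the confusion matrix and the per-class precision / recall. If minority-class recall is poor, I try class weights or Focal Loss first. If the data itself is of poor quality, I also add more samples and clean up the annotations.

10. How do you avoid overfitting and catastrophic forgetting when fine-tuning BERT?

Answer:

BERT fine-tuning overfits easily, especially when the business dataset has only a few thousand samples. Common techniques include:

Lower the learning rate:

2e-5, 3e-5, and 5e-5 are common values

Use warmup: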

from transformers import get_linear_schedule_with_warmup

scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(total_steps * 0.1),
    num_training_steps=total_steps
)
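
The snippet above leaves `total_steps` undefined. As a sketch, it is usually the number of optimizer steps over the whole run. The pure-Python function below mirrors, to my understanding, the multiplier shape this kind of schedule applies (linear ramp from 0 to 1, then linear decay to 0). The dataset size, batch size, and epoch count are hypothetical.

```python
def linear_warmup_decay(step, num_warmup_steps, num_training_steps):
    """LR multiplier: ramps 0 -> 1 during warmup, then decays linearly to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


# Hypothetical run: 1,000 samples, batch size 10, 1 epoch.
steps_per_epoch = 1000 // 10
num_epochs = 1
total_steps = steps_per_epoch * num_epochs
warmup_steps = int(total_steps * 0.1)

base_lr = 2e-5
for step in (0, 5, 10, 55, 100):
    multiplier = linear_warmup_decay(step, warmup_steps, total_steps)
    print(step, base_lr * multiplier)
```

In the training loop, `scheduler.step()` is normally called once after each `optimizer.step()`.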

Add dropout:

self.dropout = nn.Dropout(0.1)

Use weight decay:

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-5,
    weight_decay=0.01
)

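You can also use layer-wise learning rates so that the lower BERT layers learn more slowly and the classification layer learns faster: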

optimizer = torch.optim.AdamW([
    {"params": model.bert.embeddings.parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.layer[:6].parameters(), "lr": 1e-5},
    {"params": model.bert.encoder.layer[6:].parameters(), "lr": 2e-5},
    {"params": model.classifier.parameters(), "lr": 1e-4},
])

If data is very scarce, you can freeze some of the BERT layers:

for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False
