Python Crawler in Practice: Retrieving Public Case Information from the China Procuratorate Website

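Goal Analysis
The China Procuratorate website (www.12309.gov.cn) publishes data such as procedural case information and legal documents. The goal is to collect this data in structured form with a crawler and to build basic analysis on top of it. The two main difficulties are the site's anti-scraping mechanisms and data cleaning.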

Technology Choices

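  • Request libraries: requests + httpx (for asynchronously loaded content)
  • Parsing library: parsel (XPath/CSS selectors)
  • Storage: MongoDB (handles unstructured data well)
  • Visualization: pyecharts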

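Handling Anti-Scraping Measures

Obtaining Dynamic Cookies
The site validates requests with a SessionID, so the crawler has to reproduce a full visit, starting from the home page: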

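import httpx

def get_session_cookie():
    # Header that browsers typically send with AJAX requests
    headers = {"X-Requested-With": "XMLHttpRequest"}
    with httpx.Client(headers=headers, follow_redirects=True) as client:
        # Visit the home page so the server can set its session cookies.
        # httpx does not execute JavaScript: only cookies delivered through
        # Set-Cookie response headers are captured here.
        client.get("https://www.12309.gov.cn/")
        # Copy the cookies into a plain dict before the client is closed
        return dict(client.cookies)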

Captcha Recognition
For captchas rendered in a blurred font, use the ddddocr library:

import ddddocr

def crack_captcha(img_bytes):
    # img_bytes: raw bytes of the captcha image; returns the recognized text
    ocr = ddddocr.DdddOcr()
    return ocr.classification(img_bytes)

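Core Scraping Implementation

Reverse-Engineering the API
Use the browser's developer tools to capture the endpoint that actually serves the data:

  • Case list endpoint: /api/case/list?page=1
  • Response format: {"data":[...],"total":1000}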
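
Given a response of that shape, the number of pages follows from `total` and the page size. The sketch below walks all pages through an injected `fetch_page` callable. That callable is a hypothetical placeholder for whatever actually performs the HTTP request; the field names `data` and `total` are taken from the response format above, and everything else is an assumption for illustration.

```python
import math
from typing import Callable, Dict, Iterator


def page_count(total: int, page_size: int) -> int:
    """Number of pages needed to cover `total` records."""
    return math.ceil(total / page_size) if total > 0 else 0


def iter_cases(fetch_page: Callable[[int], Dict], page_size: int = 20) -> Iterator[Dict]:
    """Yield every record, requesting pages 1..N in order.

    `fetch_page(page)` must return a dict shaped like
    {"data": [...], "total": <int>}.
    """
    first = fetch_page(1)
    yield from first.get("data", [])
    last_page = page_count(first.get("total", 0), page_size)
    for page in range(2, last_page + 1):
        yield from fetch_page(page).get("data", [])
```

Keeping the HTTP call outside the iterator makes the paging logic easy to test with a stub before pointing it at a live endpoint.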

Pagination Control
Split requests into date-range segments to avoid being rate limited:

params = {
    "startDate": "2023-01-01",
    "endDate": "2023-12-31",
    "pageSize": 20  # page size must stay below the server's threshold
}
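
One way to produce the segments is to cut the full range into fixed-length windows and build one parameter dict per window. This is a minimal sketch: the 30-day window is an arbitrary assumption, and the helper names `split_date_range` and `build_params` are made up for illustration. The parameter keys are the ones shown above.

```python
from datetime import date, timedelta
from typing import Dict, Iterator, List, Tuple


def split_date_range(start: str, end: str, days: int = 30) -> Iterator[Tuple[str, str]]:
    """Yield (start, end) ISO date pairs covering [start, end] inclusive,
    each window spanning at most `days` days, with no overlap."""
    cur = date.fromisoformat(start)
    last = date.fromisoformat(end)
    while cur <= last:
        window_end = min(cur + timedelta(days=days - 1), last)
        yield cur.isoformat(), window_end.isoformat()
        cur = window_end + timedelta(days=1)


def build_params(start: str, end: str, page_size: int = 20, days: int = 30) -> List[Dict]:
    """One query-parameter dict per date window."""
    return [
        {"startDate": s, "endDate": e, "pageSize": page_size}
        for s, e in split_date_range(start, end, days)
    ]
```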

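Key Data Cleaning Steps

Field Normalization

  • Document date conversion: pd.to_datetime(raw_date, format='%Y年%m月%d日')
  • Region code mapping: build a lookup dictionary from province ID to province name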
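
Both normalization steps can be sketched with the standard library alone. `datetime.strptime` accepts the same format string as the pandas call above. The province table here is a three-entry sample keyed by the two-digit prefix of the national administrative division codes; whether the site's own region IDs follow that scheme is an assumption, and the function names are illustrative.

```python
from datetime import date, datetime
from typing import Optional

# Sample entries only; the site's actual region IDs are not documented here.
PROVINCE_NAMES = {"11": "北京市", "31": "上海市", "44": "广东省"}


def parse_doc_date(raw: str) -> Optional[date]:
    """Parse dates written like '2023年5月1日'; return None if malformed."""
    try:
        return datetime.strptime(raw.strip(), "%Y年%m月%d日").date()
    except ValueError:
        return None


def province_name(code: str) -> str:
    """Map a region code to a province name via its two-digit prefix."""
    return PROVINCE_NAMES.get(str(code)[:2], "unknown")
```

Returning `None` for malformed dates, rather than raising, lets the cleaning pass continue and leaves bad rows easy to filter afterwards.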

Text Feature Extraction
Use jieba to extract keywords:

import jieba.analyse

# Top 5 TF-IDF keywords, restricted to nouns ('n') and verbs ('v')
keywords = jieba.analyse.extract_tags(
    text,
    topK=5,
    allowPOS=('n', 'v')
)

Example Analysis Dimensions

Case Type Distribution

df['case_type'].value_counts().plot.pie(
    autopct='%.1f%%',
    figsize=(8,8)
)

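Case Duration Analysis
Compute the time between case filing and case closure (both columns must already be datetime typed):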

df['duration'] = (df['end_date'] - df['start_date']).dt.days
print(df['duration'].describe())

Semantic Analysis of Legal Documents

Charge Co-occurrence Network
Use networkx to build a co-occurrence graph:

import networkx as nx
from itertools import combinations

G = nx.Graph()
for case in cases:
    charges = case['charges']  # list of charges in this case
    # Link every pair of charges that appear in the same case
    for a, b in combinations(charges, 2):
        G.add_edge(a, b)
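
The graph above records only whether two charges ever appear together, not how often. If pair frequencies matter, they can be counted first. The sketch below uses only the standard library and assumes the same `case['charges']` structure; the function name and the sample charge names in the usage note are illustrative.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Iterable, List


def cooccurrence_counts(cases: Iterable[Dict[str, List[str]]]) -> Counter:
    """Count how many cases each unordered pair of charges shares.

    Charges are de-duplicated and sorted per case, so (a, b) and (b, a)
    always land on the same key.
    """
    counts = Counter()
    for case in cases:
        charges = sorted(set(case["charges"]))
        counts.update(combinations(charges, 2))
    return counts
```

For example, two cases that both list "fraud" and "theft" give `counts[("fraud", "theft")] == 2`. Each count can then be attached to the graph as an edge attribute with `G.add_edge(a, b, weight=n)`.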

Sentencing Prediction Model
Random forest regression trained on historical data:

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X_train, y_train)  # features include amount involved, number of prior convictions, etc.

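Compliance Notes

  1. Strictly follow the restrictions in robots.txt
  2. Keep at least 3 seconds between requests
  3. Do not bypass access controls to obtain non-public data
  4. Anonymize the data after storing it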
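
The first two rules are straightforward to enforce in code. The sketch below checks a URL against robots.txt text with the standard library's `urllib.robotparser` and enforces a minimum gap between requests. The `Throttle` class and `is_allowed` helper are illustrative names, not part of any library, and the clock and sleep functions are injectable only so the logic can be tested without waiting.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if `url` may be fetched under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval: float = 3.0, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` immediately before each request keeps the interval at or above the configured minimum regardless of how long parsing takes in between.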

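For a complete project, consider building on the Scrapy framework for distributed crawling, with Kafka as the message queue to manage incremental updates.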
