冷值分析软件记录(一)爬取数据
本次工作大概分为一下几步:
- 获取彩客网足彩部分每天场次的平均欧赔,澳门亚赔,必发指数数据
- 将数据存入数据库
一. 爬取数据
1. 设计每场比赛的类,包含所需所有数据
class Race:
id = '' # 场次id,为月+日+当日场次序号
matchid = '' # 网站所安排的id,用来搜索数据
conpany = '' # 举办公司
BeginTime = ''
HomeTeam = ''
VisitingTeam = ''
euro_win = 0 # 欧赔
euro_draw = 0
euro_lose = 0
asia_win = 0 # 亚赔
asia_draw = 0
asia_lose = 0
betfair_win = 0 # 必发指数
betfair_draw = 0
betfair_lose = 0 2.分析网页结构
经过查看分析,场次信息在id = "MatchTable"下
前两个标签为表头,第三个标签为场次日期
接下来的标签则为比赛信息
每个中的表项可针对性提取,获取用信息。
发现主要信息,欧赔亚赔必发指数等信息不在源码中静态显示,通过改变选项查看Network中XHR变化找到信息存放的网页,可使用正则表达式匹配提取。
3.爬取网页处理基本数据
建立spider.py文件,写入有关获取网页信息的代码
导入request库,爬取网页html文件,采用BeautifulSoup库,整理html代码,用find_all查询id="MatchTable"的标签,对表单进行数据获取。
具体代码:
def deal(text): #text
now = time.strftime("%m%d")
soup = BeautifulSoup(text,'html.parser')
all = soup(id = "MatchTable")[0]
date = ''
matchid = ''
for tr in all.children:
if not isinstance(tr,bs4.element.Tag):
continue
tr_class = tr.get('class')[0]
if tr_class == 'ttis':
continue
elif tr_class == 'niDate':
date = tr('td')[0].text[:12]
continue
if date[5:7]+date[8:10] != now:
continue;
matchid = tr.get('matchid')
# print(matchid)
# assert()
print("-Attaching the new race...") #
new_race = Race()
new_race.BeginTime = date
cnt = 0
no = tr.get('id')[-7:]
for td in tr.children:
if not isinstance(td,bs4.element.Tag):
continue
cnt+=1
if cnt == 1:
new_race.id = date[5:7]+date[8:10]+td.text
elif cnt == 2:
new_race.conpany = td.text
elif cnt == 3:
new_race.BeginTime += td.text
elif cnt == 4 or cnt == 6 or cnt == 7:
continue
elif cnt == 5:
new_race.HomeTeam = td.text[:-2]
elif cnt == 8:
new_race.VisitingTeam = td.text[:-2]
else :
continue;
new_race.matchid = matchid
try:
data.append(new_race)
print("Successfully.");
except:
printf("Failed to get new race.");
assert() 每场比赛基本信息得到后,同样爬取欧赔,亚赔,必发指数文件
使用正则表达式匹配需要的数据条,以','分割为列表返回,接受返回值,选择数据存入data列表中
def get_euro(no):
url = 'http://www.310win.com/data/op101.xml'
data_html = get_html(url)
p = '<i>0,'+no+',.[^<\/i>]*'
# print(p)
# assert()
pattern = re.compile(p)
d = re.search(pattern,data_html).group()
a = d.split(',')
print("-Get score successfully.")
return a 二. 存入数据库
1.配置环境
安装MySQLdb库,使用函数连接本机数据库。
db = MySQLdb.connect("localhost", "root", "q123", "lz", charset='utf8' )
cur = db.cursor() 2.建立数据表
数据库设计如下
CREATE TABLE race_info(
id INT(8) PRIMARY KEY,
conpany VARCHAR(20) NOT NULL,
begitime VARCHAR(40) NOT NULL,
hometeam VARCHAR(20) NOT NULL,
visitingteam VARCHAR(20) NOT NULL
);
CREATE TABLE race_euro(
id INT(8) PRIMARY KEY,
win DOUBLE,
drow DOUBLE,
lose DOUBLE,
FOREIGN KEY(id) REFERENCES race_info(id)
);
CREATE TABLE race_asia(
id INT(8) PRIMARY KEY,
win DOUBLE,
drow DOUBLE,
lose DOUBLE,
FOREIGN KEY(id) REFERENCES race_info(id)
);
CREATE TABLE race_betfair(
id INT(8) PRIMARY KEY,
win DOUBLE,
drow DOUBLE,
lose DOUBLE,
FOREIGN KEY(id) REFERENCES race_info(id)
); 对数据进行处理
前一天的数据基本没有价值,仅需记录当时数据即可,所以插入数据前可以先清空数据表。
清空数据:
def clean_mysysql():
sql ="TRUNCATE TABLE race_euro"
cur.execute(sql)
db.commit()
sql ="TRUNCATE TABLE race_asia"
cur.execute(sql)
db.commit()
sql ="TRUNCATE TABLE race_betfair"
cur.execute(sql)
db.commit()
sql ="DELETE FROM race_info"
cur.execute(sql)
db.commit()
print("TABLE race have cleaned.") 插入数据:
def insert_into_mysql():
for item in spider.data:
sql = "INSERT INTO race_info VALUES({},\'{}\',\'{}\',\'{}\',\'{}\');".format(item.id,item.conpany,item.BeginTime,item.HomeTeam,item.VisitingTeam)
try:
cur.execute(sql)
db.commit()
print("-Insert bace information successfully.")
except:
db.rollback()
print(sql)
print("-Failed to insert data")
for item in spider.data:
sql = "INSERT INTO race_euro VALUES({},\'{}\',\'{}\',\'{}\');".format(item.id,item.euro_win,item.euro_draw,item.euro_lose)
try:
cur.execute(sql)
db.commit()
print("-Insert euro successfully.")
except:
db.rollback()
print(sql)
print("-Failed to insert data")
for item in spider.data:
sql = "INSERT INTO race_asia VALUES({},\'{}\',\'{}\',\'{}\');".format(item.id,item.asia_win,item.asia_draw,item.asia_lose)
try:
cur.execute(sql)
db.commit()
print("-Insert asia successfully.")
except:
db.rollback()
print(sql)
print("-Failed to insert data")
for item in spider.data:
sql = "INSERT INTO race_betfair VALUES({},\'{}\',\'{}\',\'{}\');".format(item.id,item.betfair_win,item.betfair_draw,item.betfair_lose)
try:
cur.execute(sql)
db.commit()
print("-Insert betfair successfully.")
except:
db.rollback()
print(sql)
print("-Failed to insert data")

查看14道真题和解析
