2022-08-12 14:48 (edited) | Xiamen University | Data analyst

[Python Data Analysis, from Beginner to Advanced]: Importing Data

Loading data

Loading datasets from scikit-learn
Creating simulated datasets
Importing CSV datasets
Importing Excel datasets
Connecting to a MySQL database

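1. Loading datasets from the sklearn package

sklearn (scikit-learn) is a machine learning library that ships with a number of standard machine learning datasets, such as the well-known iris dataset. Commonly used ones include: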

digits
boston
iris

from sklearn import datasets

# Handwritten digits dataset
digits = datasets.load_digits()

# Feature matrix
features = digits.data
# Target vector
target = digits.target

# View the first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
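Each row of digits.data is a flattened 8x8 grayscale image, which is why the first observation above has 64 values. A minimal sketch of recovering the pixel grid, assuming the digits.images attribute that load_digits also returns (the name first_image is only illustrative):

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
features = digits.data

# Reshape the 64-value feature vector back into an 8x8 pixel grid
first_image = features[0].reshape(8, 8)

# digits.images holds the same pixels in unflattened form
print(first_image.shape)
print(np.array_equal(first_image, digits.images[0]))
```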

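load_boston: observations of Boston house prices, used for studying regression algorithms (deprecated in scikit-learn 1.0 and removed in 1.2, so it is unavailable in current releases)

load_iris: measurements of 150 iris flowers, used for studying classification algorithms

load_digits: observations of handwritten digit images, a good dataset for studying image classification algorithms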
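The other loaders follow the same pattern as load_digits. As a small sketch, the iris dataset can be loaded and inspected like this; feature_names and target_names are attributes of the returned object:

```python
from sklearn import datasets

# Iris dataset: flower measurements with a species label
iris = datasets.load_iris()

features = iris.data    # feature matrix
target = iris.target    # target vector (integer class labels)

print(features.shape)        # (number of samples, number of features)
print(iris.feature_names)    # names of the measurement columns
print(iris.target_names)     # names of the species behind each label
```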

2. Creating simulated datasets

2.1 Regression dataset

Below we use make_regression to simulate a regression dataset.

from sklearn.datasets import make_regression
features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=3,
                                                 n_informative=3,
                                                 n_targets=1,
                                                 noise=0,
                                                 coef=True,
                                                 random_state=1)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]
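Because noise=0 and the bias parameter is left at its default of 0, the target here should be an exact linear combination of the features, weighted by the coefficients returned when coef=True. The following sketch checks that relationship; the name reconstructed is only illustrative:

```python
import numpy as np
from sklearn.datasets import make_regression

features, target, coefficients = make_regression(n_samples=100,
                                                 n_features=3,
                                                 n_informative=3,
                                                 n_targets=1,
                                                 noise=0,
                                                 coef=True,
                                                 random_state=1)

# With no noise and no bias, target = features @ coefficients
reconstructed = features @ coefficients

print(coefficients)
print(np.allclose(reconstructed, target))
```

Setting noise to a positive value adds Gaussian noise to the target, so the check above would then no longer hold exactly.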

2.2 Simulated classification dataset

from sklearn.datasets import make_classification
features, target= make_classification(n_samples=100,
                                      n_features=3,
                                      n_informative=3,
                                      n_redundant=0,
                                      n_classes=2,
                                      weights=[.25, .75],
                                      random_state=1)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]

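import matplotlib.pyplot as plt
# Jupyter/IPython magic for inline plots; omit it when running as a plain script
%matplotlib inline
# Scatter plot of the first two features, colored by class
plt.scatter(features[:,0], features[:,1],c=target)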

[Figure: scatter plot of the first two features, colored by class]
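The weights argument controls the approximate share of samples in each class, so this dataset is deliberately imbalanced. A quick sketch for checking the actual class counts; the proportions may not be exactly 25/75 because the default flip_y setting randomly reassigns a small fraction of labels:

```python
import numpy as np
from sklearn.datasets import make_classification

features, target = make_classification(n_samples=100,
                                       n_features=3,
                                       n_informative=3,
                                       n_redundant=0,
                                       n_classes=2,
                                       weights=[.25, .75],
                                       random_state=1)

# Count how many samples fall into each class
counts = np.bincount(target)

print(counts)                 # samples per class
print(counts / counts.sum())  # class proportions
```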

2.3 Clustering dataset

# For clustering
from sklearn.datasets import make_blobs
features, target = make_blobs(n_samples=100,
                              n_features=2,
                              centers=3,
                              cluster_std=0.5,
                              shuffle=True,
                              random_state=1)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target Vector
 [0 1 1]

plt.scatter(features[:,0], features[:,1],c=target)
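To see how the points relate to the generated clusters, make_blobs can also return the cluster centers it sampled from. This sketch assumes a scikit-learn version that supports the return_centers parameter (added in 0.23) and compares each cluster's sample mean with its true center; cluster_mean is an illustrative name:

```python
import numpy as np
from sklearn.datasets import make_blobs

features, target, centers = make_blobs(n_samples=100,
                                       n_features=2,
                                       centers=3,
                                       cluster_std=0.5,
                                       shuffle=True,
                                       random_state=1,
                                       return_centers=True)

# Compare the mean of each cluster's points with the center used to generate it
for label in range(3):
    cluster_mean = features[target == label].mean(axis=0)
    print(label, centers[label], cluster_mean)
```

A smaller cluster_std pulls the points closer to their centers, which makes the clusters easier to separate.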