생능출판사 (publisher), "데이터과학 파이썬" (Data Science with Python, working title): Chapter 14 code

14.6 Implementing linear regression with the scikit-learn library

In [1]:
import numpy as np 
from sklearn import linear_model  # import scikit-learn's linear models

regr = linear_model.LinearRegression()
In [2]:
X = [[164], [179], [162], [170]]  # X is 2-D so the same code also works for multiple regression
y = [53, 63, 55, 59]              # target values y = f(X)
regr.fit(X, y)
Out[2]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

14.7 Checking the linear regression results and making predictions

In [3]:
regr.fit(X, y)
Out[3]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [4]:
coef = regr.coef_           # slope of the fitted line
intercept = regr.intercept_ # y-intercept of the fitted line
score = regr.score(X, y)    # how well the fitted line follows the data

print("y =", coef, "* X + ", intercept)
print("The score of this line for the data: ", score)
y = [0.55221745] * X +  -35.686695278969964
The score of this line for the data:  0.903203123105647
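
As a quick cross-check (an editorial sketch, not part of the original notebook, reusing the X and y from In [2]), the same slope and intercept can be recovered with NumPy's least-squares polynomial fit:

x1d = np.array(X).ravel()      # flatten [[164], [179], ...] into 1-D
slope, intercept = np.polyfit(x1d, y, deg=1)
print(slope, intercept)        # should match regr.coef_[0] and regr.intercept_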
In [5]:
input_data = [[180], [185]]   # new heights (cm) whose weights we want to predict
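
A minimal follow-up sketch (assuming the regr fitted in In [2] above): input_data holds two new heights, and regr.predict returns one estimated weight per input row.

print(regr.predict(input_data))   # estimated weights for 180 cm and 185 cm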

14.8 Predicting with linear regression: are height and weight correlated?

In [10]:
regr.predict([[169]])
Out[10]:
array([57.63805436])
In [11]:
import matplotlib.pyplot as plt
import numpy as np 
from sklearn import linear_model  # import scikit-learn's linear models
 
regr = linear_model.LinearRegression() 
 
X = [[164], [179], [162], [170]]  # the input to linear regression must be 2-D
y = [53, 63, 55, 59]     # target values y = f(X)
regr.fit(X, y)

# Draw a scatter plot of the training data and the y values.
plt.scatter(X, y, color='black')

# Compute predictions with the training data as input.
y_pred = regr.predict(X)

# Draw a line graph of the training data and the predicted values.
# This draws the line with the computed slope and y-intercept.
plt.plot(X, y_pred, color='blue', linewidth=3)
plt.show()

LAB 14-1 Multiple linear regression

In [12]:
import numpy as np 
from sklearn import linear_model 
 
regr = linear_model.LinearRegression() 
# male = 0, female = 1
X = [[164, 1], [167, 1], [165, 0], [170, 0], [179, 0], [163, 1], [159, 0], [166, 1]]    # the input data must be 2-D
y = [43, 48, 47, 66, 67, 50, 52, 44]     # y is 1-D
regr.fit(X, y)         # train
print('Coefficients:', regr.coef_)
print('Intercept:', regr.intercept_)
print('Score:', regr.score(X, y))
print('Estimated weights of Eunji and Dongmin:', regr.predict([[166, 1], [166, 0]]))
Coefficients: [ 0.88542825 -8.87235818]
Intercept: -90.97330367074522
Score: 0.7404546306026769
Estimated weights of Eunji and Dongmin: [47.13542825 56.00778643]
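
The score printed above is the coefficient of determination R^2. As a sketch of what that number means (reusing the X, y and fitted regr of this LAB), it can be reproduced by hand as 1 - SS_res/SS_tot:

y_arr = np.array(y)
y_hat = regr.predict(X)
ss_res = np.sum((y_arr - y_hat) ** 2)         # residual sum of squares
ss_tot = np.sum((y_arr - y_arr.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                    # same value as regr.score(X, y)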

14.9 The diabetes dataset and creating training data

In [27]:
import matplotlib.pyplot as plt 
import numpy as np 
from sklearn.linear_model import LinearRegression 
from sklearn import datasets 
 
# Load the diabetes dataset from sklearn's bundled datasets.
diabetes = datasets.load_diabetes()
In [28]:
print('shape of diabetes.data: ', diabetes.data.shape)
print(diabetes.data)
shape of diabetes.data:  (442, 10)
[[ 0.03807591  0.05068012  0.06169621 ... -0.00259226  0.01990842
  -0.01764613]
 [-0.00188202 -0.04464164 -0.05147406 ... -0.03949338 -0.06832974
  -0.09220405]
 [ 0.08529891  0.05068012  0.04445121 ... -0.00259226  0.00286377
  -0.02593034]
 ...
 [ 0.04170844  0.05068012 -0.01590626 ... -0.01107952 -0.04687948
   0.01549073]
 [-0.04547248 -0.04464164  0.03906215 ...  0.02655962  0.04452837
  -0.02593034]
 [-0.04547248 -0.04464164 -0.0730303  ... -0.03949338 -0.00421986
   0.00306441]]
In [29]:
print('Features of the input data')
print(diabetes.feature_names)
Features of the input data
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
In [30]:
print('target data y:', diabetes.target.shape)
print(diabetes.target)
target data y: (442,)
[151.  75. 141. 206. 135.  97. 138.  63. 110. 310. 101.  69. 179. 185.
 118. 171. 166. 144.  97. 168.  68.  49.  68. 245. 184. 202. 137.  85.
 131. 283. 129.  59. 341.  87.  65. 102. 265. 276. 252.  90. 100.  55.
  61.  92. 259.  53. 190. 142.  75. 142. 155. 225.  59. 104. 182. 128.
  52.  37. 170. 170.  61. 144.  52. 128.  71. 163. 150.  97. 160. 178.
  48. 270. 202. 111.  85.  42. 170. 200. 252. 113. 143.  51.  52. 210.
  65. 141.  55. 134.  42. 111.  98. 164.  48.  96.  90. 162. 150. 279.
  92.  83. 128. 102. 302. 198.  95.  53. 134. 144. 232.  81. 104.  59.
 246. 297. 258. 229. 275. 281. 179. 200. 200. 173. 180.  84. 121. 161.
  99. 109. 115. 268. 274. 158. 107.  83. 103. 272.  85. 280. 336. 281.
 118. 317. 235.  60. 174. 259. 178. 128.  96. 126. 288.  88. 292.  71.
 197. 186.  25.  84.  96. 195.  53. 217. 172. 131. 214.  59.  70. 220.
 268. 152.  47.  74. 295. 101. 151. 127. 237. 225.  81. 151. 107.  64.
 138. 185. 265. 101. 137. 143. 141.  79. 292. 178.  91. 116.  86. 122.
  72. 129. 142.  90. 158.  39. 196. 222. 277.  99. 196. 202. 155.  77.
 191.  70.  73.  49.  65. 263. 248. 296. 214. 185.  78.  93. 252. 150.
  77. 208.  77. 108. 160.  53. 220. 154. 259.  90. 246. 124.  67.  72.
 257. 262. 275. 177.  71.  47. 187. 125.  78.  51. 258. 215. 303. 243.
  91. 150. 310. 153. 346.  63.  89.  50.  39. 103. 308. 116. 145.  74.
  45. 115. 264.  87. 202. 127. 182. 241.  66.  94. 283.  64. 102. 200.
 265.  94. 230. 181. 156. 233.  60. 219.  80.  68. 332. 248.  84. 200.
  55.  85.  89.  31. 129.  83. 275.  65. 198. 236. 253. 124.  44. 172.
 114. 142. 109. 180. 144. 163. 147.  97. 220. 190. 109. 191. 122. 230.
 242. 248. 249. 192. 131. 237.  78. 135. 244. 199. 270. 164.  72.  96.
 306.  91. 214.  95. 216. 263. 178. 113. 200. 139. 139.  88. 148.  88.
 243.  71.  77. 109. 272.  60.  54. 221.  90. 311. 281. 182. 321.  58.
 262. 206. 233. 242. 123. 167.  63. 197.  71. 168. 140. 217. 121. 235.
 245.  40.  52. 104. 132.  88.  69. 219.  72. 201. 110.  51. 277.  63.
 118.  69. 273. 258.  43. 198. 242. 232. 175.  93. 168. 275. 293. 281.
  72. 140. 189. 181. 209. 136. 261. 113. 131. 174. 257.  55.  84.  42.
 146. 212. 233.  91. 111. 152. 120.  67. 310.  94. 183.  66. 173.  72.
  49.  64.  48. 178. 104. 132. 220.  57.]
In [31]:
X = diabetes.data[:, 2]
print(X)
[ 0.06169621 -0.05147406  0.04445121 -0.01159501 -0.03638469 -0.04069594
 -0.04716281 -0.00189471  0.06169621  0.03906215 -0.08380842  0.01750591
 -0.02884001 -0.00189471 -0.02560657 -0.01806189  0.04229559  0.01211685
 -0.0105172  -0.01806189 -0.05686312 -0.02237314 -0.00405033  0.06061839
  0.03582872 -0.01267283 -0.07734155  0.05954058 -0.02129532 -0.00620595
  0.04445121 -0.06548562  0.12528712 -0.05039625 -0.06332999 -0.03099563
  0.02289497  0.01103904  0.07139652  0.01427248 -0.00836158 -0.06764124
 -0.0105172  -0.02345095  0.06816308 -0.03530688 -0.01159501 -0.0730303
 -0.04177375  0.01427248 -0.00728377  0.0164281  -0.00943939 -0.01590626
  0.0250506  -0.04931844  0.04121778 -0.06332999 -0.06440781 -0.02560657
 -0.00405033  0.00457217 -0.00728377 -0.0374625  -0.02560657 -0.02452876
 -0.01806189 -0.01482845 -0.02991782 -0.046085   -0.06979687  0.03367309
 -0.00405033 -0.02021751  0.00241654 -0.03099563  0.02828403 -0.03638469
 -0.05794093 -0.0374625   0.01211685 -0.02237314 -0.03530688  0.00996123
 -0.03961813  0.07139652 -0.07518593 -0.00620595 -0.04069594 -0.04824063
 -0.02560657  0.0519959   0.00457217 -0.06440781 -0.01698407 -0.05794093
  0.00996123  0.08864151 -0.00512814 -0.06440781  0.01750591 -0.04500719
  0.02828403  0.04121778  0.06492964 -0.03207344 -0.07626374  0.04984027
  0.04552903 -0.00943939 -0.03207344  0.00457217  0.02073935  0.01427248
  0.11019775  0.00133873  0.05846277 -0.02129532 -0.0105172  -0.04716281
  0.00457217  0.01750591  0.08109682  0.0347509   0.02397278 -0.00836158
 -0.06117437 -0.00189471 -0.06225218  0.0164281   0.09618619 -0.06979687
 -0.02129532 -0.05362969  0.0433734   0.05630715 -0.0816528   0.04984027
  0.11127556  0.06169621  0.01427248  0.04768465  0.01211685  0.00564998
  0.04660684  0.12852056  0.05954058  0.09295276  0.01535029 -0.00512814
  0.0703187  -0.00405033 -0.00081689 -0.04392938  0.02073935  0.06061839
 -0.0105172  -0.03315126 -0.06548562  0.0433734  -0.06225218  0.06385183
  0.03043966  0.07247433 -0.0191397  -0.06656343 -0.06009656  0.06924089
  0.05954058 -0.02668438 -0.02021751 -0.046085    0.07139652 -0.07949718
  0.00996123 -0.03854032  0.01966154  0.02720622 -0.00836158 -0.01590626
  0.00457217 -0.04285156  0.00564998 -0.03530688  0.02397278 -0.01806189
  0.04229559 -0.0547075  -0.00297252 -0.06656343 -0.01267283 -0.04177375
 -0.03099563 -0.00512814 -0.05901875  0.0250506  -0.046085    0.00349435
  0.05415152 -0.04500719 -0.05794093 -0.05578531  0.00133873  0.03043966
  0.00672779  0.04660684  0.02612841  0.04552903  0.04013997 -0.01806189
  0.01427248  0.03690653  0.00349435 -0.07087468 -0.03315126  0.09403057
  0.03582872  0.03151747 -0.06548562 -0.04177375 -0.03961813 -0.03854032
 -0.02560657 -0.02345095 -0.06656343  0.03259528 -0.046085   -0.02991782
 -0.01267283 -0.01590626  0.07139652 -0.03099563  0.00026092  0.03690653
  0.03906215 -0.01482845  0.00672779 -0.06871905 -0.00943939  0.01966154
  0.07462995 -0.00836158 -0.02345095 -0.046085    0.05415152 -0.03530688
 -0.03207344 -0.0816528   0.04768465  0.06061839  0.05630715  0.09834182
  0.05954058  0.03367309  0.05630715 -0.06548562  0.16085492 -0.05578531
 -0.02452876 -0.03638469 -0.00836158 -0.04177375  0.12744274 -0.07734155
  0.02828403 -0.02560657 -0.06225218 -0.00081689  0.08864151 -0.03207344
  0.03043966  0.00888341  0.00672779 -0.02021751 -0.02452876 -0.01159501
  0.02612841 -0.05901875 -0.03638469 -0.02452876  0.01858372 -0.0902753
 -0.00512814 -0.05255187 -0.02237314 -0.02021751 -0.0547075  -0.00620595
 -0.01698407  0.05522933  0.07678558  0.01858372 -0.02237314  0.09295276
 -0.03099563  0.03906215 -0.06117437 -0.00836158 -0.0374625  -0.01375064
  0.07355214 -0.02452876  0.03367309  0.0347509  -0.03854032 -0.03961813
 -0.00189471 -0.03099563 -0.046085    0.00133873  0.06492964  0.04013997
 -0.02345095  0.05307371  0.04013997 -0.02021751  0.01427248 -0.03422907
  0.00672779  0.00457217  0.03043966  0.0519959   0.06169621 -0.00728377
  0.00564998  0.05415152 -0.00836158  0.114509    0.06708527 -0.05578531
  0.03043966 -0.02560657  0.10480869 -0.00620595 -0.04716281 -0.04824063
  0.08540807 -0.01267283 -0.03315126 -0.00728377 -0.01375064  0.05954058
  0.02181716  0.01858372 -0.01159501 -0.00297252  0.01750591 -0.02991782
 -0.02021751 -0.05794093  0.06061839 -0.04069594 -0.07195249 -0.05578531
  0.04552903 -0.00943939 -0.03315126  0.04984027 -0.08488624  0.00564998
  0.02073935 -0.00728377  0.10480869 -0.02452876 -0.00620595 -0.03854032
  0.13714305  0.17055523  0.00241654  0.03798434 -0.05794093 -0.00943939
 -0.02345095 -0.0105172  -0.03422907 -0.00297252  0.06816308  0.00996123
  0.00241654 -0.03854032  0.02612841 -0.08919748  0.06061839 -0.02884001
 -0.02991782 -0.0191397  -0.04069594  0.01535029 -0.02452876  0.00133873
  0.06924089 -0.06979687 -0.02991782 -0.046085    0.01858372  0.00133873
 -0.03099563 -0.00405033  0.01535029  0.02289497  0.04552903 -0.04500719
 -0.03315126  0.097264    0.05415152  0.12313149 -0.08057499  0.09295276
 -0.05039625 -0.01159501 -0.0277622   0.05846277  0.08540807 -0.00081689
  0.00672779  0.00888341  0.08001901  0.07139652 -0.02452876 -0.0547075
 -0.03638469  0.0164281   0.07786339 -0.03961813  0.01103904 -0.04069594
 -0.03422907  0.00564998  0.08864151 -0.03315126 -0.05686312 -0.03099563
  0.05522933 -0.06009656  0.00133873 -0.02345095 -0.07410811  0.01966154
 -0.01590626 -0.01590626  0.03906215 -0.0730303 ]
In [32]:
X = diabetes.data[:, np.newaxis, 2]
print(X)
[[ 0.06169621]
 [-0.05147406]
 [ 0.04445121]
 [-0.01159501]
 [-0.03638469]
 [-0.04069594]
 [-0.04716281]
 [-0.00189471]
 [ 0.06169621]
 [ 0.03906215]
 [-0.08380842]
 [ 0.01750591]
 [-0.02884001]
 [-0.00189471]
 [-0.02560657]
 [-0.01806189]
 [ 0.04229559]
 [ 0.01211685]
 [-0.0105172 ]
 [-0.01806189]
 [-0.05686312]
 [-0.02237314]
 [-0.00405033]
 [ 0.06061839]
 [ 0.03582872]
 [-0.01267283]
 [-0.07734155]
 [ 0.05954058]
 [-0.02129532]
 [-0.00620595]
 [ 0.04445121]
 [-0.06548562]
 [ 0.12528712]
 [-0.05039625]
 [-0.06332999]
 [-0.03099563]
 [ 0.02289497]
 [ 0.01103904]
 [ 0.07139652]
 [ 0.01427248]
 [-0.00836158]
 [-0.06764124]
 [-0.0105172 ]
 [-0.02345095]
 [ 0.06816308]
 [-0.03530688]
 [-0.01159501]
 [-0.0730303 ]
 [-0.04177375]
 [ 0.01427248]
 [-0.00728377]
 [ 0.0164281 ]
 [-0.00943939]
 [-0.01590626]
 [ 0.0250506 ]
 [-0.04931844]
 [ 0.04121778]
 [-0.06332999]
 [-0.06440781]
 [-0.02560657]
 [-0.00405033]
 [ 0.00457217]
 [-0.00728377]
 [-0.0374625 ]
 [-0.02560657]
 [-0.02452876]
 [-0.01806189]
 [-0.01482845]
 [-0.02991782]
 [-0.046085  ]
 [-0.06979687]
 [ 0.03367309]
 [-0.00405033]
 [-0.02021751]
 [ 0.00241654]
 [-0.03099563]
 [ 0.02828403]
 [-0.03638469]
 [-0.05794093]
 [-0.0374625 ]
 [ 0.01211685]
 [-0.02237314]
 [-0.03530688]
 [ 0.00996123]
 [-0.03961813]
 [ 0.07139652]
 [-0.07518593]
 [-0.00620595]
 [-0.04069594]
 [-0.04824063]
 [-0.02560657]
 [ 0.0519959 ]
 [ 0.00457217]
 [-0.06440781]
 [-0.01698407]
 [-0.05794093]
 [ 0.00996123]
 [ 0.08864151]
 [-0.00512814]
 [-0.06440781]
 [ 0.01750591]
 [-0.04500719]
 [ 0.02828403]
 [ 0.04121778]
 [ 0.06492964]
 [-0.03207344]
 [-0.07626374]
 [ 0.04984027]
 [ 0.04552903]
 [-0.00943939]
 [-0.03207344]
 [ 0.00457217]
 [ 0.02073935]
 [ 0.01427248]
 [ 0.11019775]
 [ 0.00133873]
 [ 0.05846277]
 [-0.02129532]
 [-0.0105172 ]
 [-0.04716281]
 [ 0.00457217]
 [ 0.01750591]
 [ 0.08109682]
 [ 0.0347509 ]
 [ 0.02397278]
 [-0.00836158]
 [-0.06117437]
 [-0.00189471]
 [-0.06225218]
 [ 0.0164281 ]
 [ 0.09618619]
 [-0.06979687]
 [-0.02129532]
 [-0.05362969]
 [ 0.0433734 ]
 [ 0.05630715]
 [-0.0816528 ]
 [ 0.04984027]
 [ 0.11127556]
 [ 0.06169621]
 [ 0.01427248]
 [ 0.04768465]
 [ 0.01211685]
 [ 0.00564998]
 [ 0.04660684]
 [ 0.12852056]
 [ 0.05954058]
 [ 0.09295276]
 [ 0.01535029]
 [-0.00512814]
 [ 0.0703187 ]
 [-0.00405033]
 [-0.00081689]
 [-0.04392938]
 [ 0.02073935]
 [ 0.06061839]
 [-0.0105172 ]
 [-0.03315126]
 [-0.06548562]
 [ 0.0433734 ]
 [-0.06225218]
 [ 0.06385183]
 [ 0.03043966]
 [ 0.07247433]
 [-0.0191397 ]
 [-0.06656343]
 [-0.06009656]
 [ 0.06924089]
 [ 0.05954058]
 [-0.02668438]
 [-0.02021751]
 [-0.046085  ]
 [ 0.07139652]
 [-0.07949718]
 [ 0.00996123]
 [-0.03854032]
 [ 0.01966154]
 [ 0.02720622]
 [-0.00836158]
 [-0.01590626]
 [ 0.00457217]
 [-0.04285156]
 [ 0.00564998]
 [-0.03530688]
 [ 0.02397278]
 [-0.01806189]
 [ 0.04229559]
 [-0.0547075 ]
 [-0.00297252]
 [-0.06656343]
 [-0.01267283]
 [-0.04177375]
 [-0.03099563]
 [-0.00512814]
 [-0.05901875]
 [ 0.0250506 ]
 [-0.046085  ]
 [ 0.00349435]
 [ 0.05415152]
 [-0.04500719]
 [-0.05794093]
 [-0.05578531]
 [ 0.00133873]
 [ 0.03043966]
 [ 0.00672779]
 [ 0.04660684]
 [ 0.02612841]
 [ 0.04552903]
 [ 0.04013997]
 [-0.01806189]
 [ 0.01427248]
 [ 0.03690653]
 [ 0.00349435]
 [-0.07087468]
 [-0.03315126]
 [ 0.09403057]
 [ 0.03582872]
 [ 0.03151747]
 [-0.06548562]
 [-0.04177375]
 [-0.03961813]
 [-0.03854032]
 [-0.02560657]
 [-0.02345095]
 [-0.06656343]
 [ 0.03259528]
 [-0.046085  ]
 [-0.02991782]
 [-0.01267283]
 [-0.01590626]
 [ 0.07139652]
 [-0.03099563]
 [ 0.00026092]
 [ 0.03690653]
 [ 0.03906215]
 [-0.01482845]
 [ 0.00672779]
 [-0.06871905]
 [-0.00943939]
 [ 0.01966154]
 [ 0.07462995]
 [-0.00836158]
 [-0.02345095]
 [-0.046085  ]
 [ 0.05415152]
 [-0.03530688]
 [-0.03207344]
 [-0.0816528 ]
 [ 0.04768465]
 [ 0.06061839]
 [ 0.05630715]
 [ 0.09834182]
 [ 0.05954058]
 [ 0.03367309]
 [ 0.05630715]
 [-0.06548562]
 [ 0.16085492]
 [-0.05578531]
 [-0.02452876]
 [-0.03638469]
 [-0.00836158]
 [-0.04177375]
 [ 0.12744274]
 [-0.07734155]
 [ 0.02828403]
 [-0.02560657]
 [-0.06225218]
 [-0.00081689]
 [ 0.08864151]
 [-0.03207344]
 [ 0.03043966]
 [ 0.00888341]
 [ 0.00672779]
 [-0.02021751]
 [-0.02452876]
 [-0.01159501]
 [ 0.02612841]
 [-0.05901875]
 [-0.03638469]
 [-0.02452876]
 [ 0.01858372]
 [-0.0902753 ]
 [-0.00512814]
 [-0.05255187]
 [-0.02237314]
 [-0.02021751]
 [-0.0547075 ]
 [-0.00620595]
 [-0.01698407]
 [ 0.05522933]
 [ 0.07678558]
 [ 0.01858372]
 [-0.02237314]
 [ 0.09295276]
 [-0.03099563]
 [ 0.03906215]
 [-0.06117437]
 [-0.00836158]
 [-0.0374625 ]
 [-0.01375064]
 [ 0.07355214]
 [-0.02452876]
 [ 0.03367309]
 [ 0.0347509 ]
 [-0.03854032]
 [-0.03961813]
 [-0.00189471]
 [-0.03099563]
 [-0.046085  ]
 [ 0.00133873]
 [ 0.06492964]
 [ 0.04013997]
 [-0.02345095]
 [ 0.05307371]
 [ 0.04013997]
 [-0.02021751]
 [ 0.01427248]
 [-0.03422907]
 [ 0.00672779]
 [ 0.00457217]
 [ 0.03043966]
 [ 0.0519959 ]
 [ 0.06169621]
 [-0.00728377]
 [ 0.00564998]
 [ 0.05415152]
 [-0.00836158]
 [ 0.114509  ]
 [ 0.06708527]
 [-0.05578531]
 [ 0.03043966]
 [-0.02560657]
 [ 0.10480869]
 [-0.00620595]
 [-0.04716281]
 [-0.04824063]
 [ 0.08540807]
 [-0.01267283]
 [-0.03315126]
 [-0.00728377]
 [-0.01375064]
 [ 0.05954058]
 [ 0.02181716]
 [ 0.01858372]
 [-0.01159501]
 [-0.00297252]
 [ 0.01750591]
 [-0.02991782]
 [-0.02021751]
 [-0.05794093]
 [ 0.06061839]
 [-0.04069594]
 [-0.07195249]
 [-0.05578531]
 [ 0.04552903]
 [-0.00943939]
 [-0.03315126]
 [ 0.04984027]
 [-0.08488624]
 [ 0.00564998]
 [ 0.02073935]
 [-0.00728377]
 [ 0.10480869]
 [-0.02452876]
 [-0.00620595]
 [-0.03854032]
 [ 0.13714305]
 [ 0.17055523]
 [ 0.00241654]
 [ 0.03798434]
 [-0.05794093]
 [-0.00943939]
 [-0.02345095]
 [-0.0105172 ]
 [-0.03422907]
 [-0.00297252]
 [ 0.06816308]
 [ 0.00996123]
 [ 0.00241654]
 [-0.03854032]
 [ 0.02612841]
 [-0.08919748]
 [ 0.06061839]
 [-0.02884001]
 [-0.02991782]
 [-0.0191397 ]
 [-0.04069594]
 [ 0.01535029]
 [-0.02452876]
 [ 0.00133873]
 [ 0.06924089]
 [-0.06979687]
 [-0.02991782]
 [-0.046085  ]
 [ 0.01858372]
 [ 0.00133873]
 [-0.03099563]
 [-0.00405033]
 [ 0.01535029]
 [ 0.02289497]
 [ 0.04552903]
 [-0.04500719]
 [-0.03315126]
 [ 0.097264  ]
 [ 0.05415152]
 [ 0.12313149]
 [-0.08057499]
 [ 0.09295276]
 [-0.05039625]
 [-0.01159501]
 [-0.0277622 ]
 [ 0.05846277]
 [ 0.08540807]
 [-0.00081689]
 [ 0.00672779]
 [ 0.00888341]
 [ 0.08001901]
 [ 0.07139652]
 [-0.02452876]
 [-0.0547075 ]
 [-0.03638469]
 [ 0.0164281 ]
 [ 0.07786339]
 [-0.03961813]
 [ 0.01103904]
 [-0.04069594]
 [-0.03422907]
 [ 0.00564998]
 [ 0.08864151]
 [-0.03315126]
 [-0.05686312]
 [-0.03099563]
 [ 0.05522933]
 [-0.06009656]
 [ 0.00133873]
 [-0.02345095]
 [-0.07410811]
 [ 0.01966154]
 [-0.01590626]
 [-0.01590626]
 [ 0.03906215]
 [-0.0730303 ]]
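
As an aside (a sketch reusing diabetes, np and the X just built), indexing with np.newaxis is equivalent to reshape(-1, 1); both turn the 1-D bmi column into the 2-D shape that scikit-learn expects:

X2 = diabetes.data[:, 2].reshape(-1, 1)
print(X2.shape)                # (442, 1), the same shape as X
print(np.array_equal(X, X2))   # True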

14.10 How are body mass index (BMI) and the diabetes measure correlated?

In [33]:
regr.fit(X, diabetes.target)         # build the linear regression model by training
print(regr.coef_, regr.intercept_)
[949.43526038] 152.1334841628967
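
A plotting sketch (reusing X, diabetes, plt and the regr fitted above): drawing the bmi values against the target together with the fitted line makes the trend visible.

plt.scatter(X, diabetes.target, color='black', s=10)     # the data points
plt.plot(X, regr.predict(X), color='blue', linewidth=3)  # the fitted line
plt.xlabel('bmi')
plt.ylabel('target')
plt.show()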
In [36]:
# Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target,
                                                    test_size=0.2)
In [37]:
# Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.data[:,np.newaxis,2],
                                                    diabetes.target,
                                                    test_size=0.2) 
regr = LinearRegression() 
regr.fit(X_train, y_train)
Out[37]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [38]:
score = regr.score(X_train, y_train)
print(score)
score = regr.score(X_test, y_test)
print(score)
0.34621263244077793
0.33329025924682276
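
The two scores above change from run to run because train_test_split shuffles the data randomly. A reproducibility sketch: passing a fixed random_state (42 is an arbitrary choice) pins the split, and therefore the scores.

X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data[:, np.newaxis, 2], diabetes.target,
    test_size=0.2, random_state=42)   # fixed seed, reproducible split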

14.10 Splitting the diabetes data into training and test sets

In [40]:
X_train, X_test, y_train, y_test = train_test_split(diabetes.data, diabetes.target,
                                                    test_size=0.2)
regr.fit(X_train, y_train)
y_pred = regr.predict(X_test)     # predict on the test data
In [43]:
print(y_pred)
print(y_test)
[ 76.19288971 122.8980136  132.68679726 176.74222532 150.40493432
 142.48710548  66.19143825 225.75315415  60.93811615 160.06630859
 161.7616245  127.1956492  250.32597653 136.83805312 104.47431247
  99.52979979 127.50925329 101.41130274 158.20747795 132.79182392
  80.7795418  174.9973218  287.02165646 190.78262199 117.11515329
  62.16905109  64.53672125 225.29320066  69.1953779  120.80935652
 213.84025117  57.90048558 207.03514126  50.57317748 214.37114275
 179.37268122 142.4440009  142.0586992  234.74067428 163.87708461
  68.65378412 183.03182167 137.46083338 168.48871886 160.57063366
 183.50320213 168.43267522 158.65030154 177.76700402 233.77210559
  90.96710468 208.25586583 142.7774825   99.44421264 190.41776893
 154.66243439 125.51055833 100.85916621 110.98087975 153.3642741
 131.898583    83.15038276 104.96664787 101.17321497 190.12253877
 232.45170311  98.64556045 218.84435877  86.86250865 191.45645608
 164.76520878 157.28402233 135.79646464 239.28512034 145.47140687
  55.81550388 170.95013983 208.96457691  94.03668631  98.86180891
 183.61886889 171.4910994  137.94151312 157.16379177 108.19535084
 109.11138602 260.44818846 196.28927273  97.83528846]
[ 60. 103. 178. 138. 197.  88.  72. 261.  96. 185. 196.  97. 310. 202.
 101.  65.  51. 170.  94. 170.  77. 111. 281. 233.  71.  52.  43. 217.
  59.  68. 163.  65. 268.  55. 275. 217. 116. 302. 208. 120.  70.  77.
 187. 127. 259. 272. 110. 118.  66. 281.  71. 151. 172.  31. 202. 276.
  92.  69.  88. 202.  59.  42.  69. 104. 178. 321.  49. 180. 181.  68.
 178. 109. 103. 272.  93.  90. 216. 288.  84.  54. 107. 180. 230. 150.
 125.  72. 308. 163. 137.]

LAB 14-2 Comparing predictions from training on 80% of the data with the actual values

In [45]:
import numpy as np 
from sklearn import linear_model  # import scikit-learn's linear models
from sklearn import datasets
import matplotlib.pyplot as plt

regr = linear_model.LinearRegression()
diabetes = datasets.load_diabetes()   # load here so the cell runs on its own
# Split the data into training and test sets.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(diabetes.data,
                                                    diabetes.target,
                                                    test_size=0.2)
regr.fit(X_train, y_train)
print(regr.coef_, regr.intercept_)

y_pred = regr.predict(X_test)

plt.scatter(y_pred, y_test)
plt.show()
[ -63.57492467 -284.6049703   510.29607689  332.80804801 -914.13334266
  554.96458111  175.3746655   245.40214146  795.88565343   48.31971934] 149.6661397653907

14.11 The error inherent in the algorithm

In [48]:
plt.scatter(y_pred, y_test,  color='black')

x = np.linspace(0, 330, 100)  # points over the interval [0, 330]
plt.plot(x, x, linewidth = 3, color = 'blue')
plt.show()
In [50]:
from sklearn.metrics import mean_squared_error

... # insert the linear regression model code from the previous section here

print('Mean squared error:', mean_squared_error(y_test, y_pred))
Mean squared error: 2450.864111510396
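
Two companion metrics, shown as a sketch over the same y_test and y_pred: the RMSE (the square root of the MSE, in the same units as the target) and the mean absolute error.

from sklearn.metrics import mean_absolute_error

print('Root mean squared error:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Mean absolute error:', mean_absolute_error(y_test, y_pred))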

14.14 Getting ready to classify species of the beautiful iris

In [1]:
from sklearn.datasets import load_iris 
iris = load_iris() 
print(iris.data)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.9 1.5]
 [5.5 2.3 4.  1.3]
 [6.5 2.8 4.6 1.5]
 [5.7 2.8 4.5 1.3]
 [6.3 3.3 4.7 1.6]
 [4.9 2.4 3.3 1. ]
 [6.6 2.9 4.6 1.3]
 [5.2 2.7 3.9 1.4]
 [5.  2.  3.5 1. ]
 [5.9 3.  4.2 1.5]
 [6.  2.2 4.  1. ]
 [6.1 2.9 4.7 1.4]
 [5.6 2.9 3.6 1.3]
 [6.7 3.1 4.4 1.4]
 [5.6 3.  4.5 1.5]
 [5.8 2.7 4.1 1. ]
 [6.2 2.2 4.5 1.5]
 [5.6 2.5 3.9 1.1]
 [5.9 3.2 4.8 1.8]
 [6.1 2.8 4.  1.3]
 [6.3 2.5 4.9 1.5]
 [6.1 2.8 4.7 1.2]
 [6.4 2.9 4.3 1.3]
 [6.6 3.  4.4 1.4]
 [6.8 2.8 4.8 1.4]
 [6.7 3.  5.  1.7]
 [6.  2.9 4.5 1.5]
 [5.7 2.6 3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [5.5 2.4 3.7 1. ]
 [5.8 2.7 3.9 1.2]
 [6.  2.7 5.1 1.6]
 [5.4 3.  4.5 1.5]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [6.3 2.3 4.4 1.3]
 [5.6 3.  4.1 1.3]
 [5.5 2.5 4.  1.3]
 [5.5 2.6 4.4 1.2]
 [6.1 3.  4.6 1.4]
 [5.8 2.6 4.  1.2]
 [5.  2.3 3.3 1. ]
 [5.6 2.7 4.2 1.3]
 [5.7 3.  4.2 1.2]
 [5.7 2.9 4.2 1.3]
 [6.2 2.9 4.3 1.3]
 [5.1 2.5 3.  1.1]
 [5.7 2.8 4.1 1.3]
 [6.3 3.3 6.  2.5]
 [5.8 2.7 5.1 1.9]
 [7.1 3.  5.9 2.1]
 [6.3 2.9 5.6 1.8]
 [6.5 3.  5.8 2.2]
 [7.6 3.  6.6 2.1]
 [4.9 2.5 4.5 1.7]
 [7.3 2.9 6.3 1.8]
 [6.7 2.5 5.8 1.8]
 [7.2 3.6 6.1 2.5]
 [6.5 3.2 5.1 2. ]
 [6.4 2.7 5.3 1.9]
 [6.8 3.  5.5 2.1]
 [5.7 2.5 5.  2. ]
 [5.8 2.8 5.1 2.4]
 [6.4 3.2 5.3 2.3]
 [6.5 3.  5.5 1.8]
 [7.7 3.8 6.7 2.2]
 [7.7 2.6 6.9 2.3]
 [6.  2.2 5.  1.5]
 [6.9 3.2 5.7 2.3]
 [5.6 2.8 4.9 2. ]
 [7.7 2.8 6.7 2. ]
 [6.3 2.7 4.9 1.8]
 [6.7 3.3 5.7 2.1]
 [7.2 3.2 6.  1.8]
 [6.2 2.8 4.8 1.8]
 [6.1 3.  4.9 1.8]
 [6.4 2.8 5.6 2.1]
 [7.2 3.  5.8 1.6]
 [7.4 2.8 6.1 1.9]
 [7.9 3.8 6.4 2. ]
 [6.4 2.8 5.6 2.2]
 [6.3 2.8 5.1 1.5]
 [6.1 2.6 5.6 1.4]
 [7.7 3.  6.1 2.3]
 [6.3 3.4 5.6 2.4]
 [6.4 3.1 5.5 1.8]
 [6.  3.  4.8 1.8]
 [6.9 3.1 5.4 2.1]
 [6.7 3.1 5.6 2.4]
 [6.9 3.1 5.1 2.3]
 [5.8 2.7 5.1 1.9]
 [6.8 3.2 5.9 2.3]
 [6.7 3.3 5.7 2.5]
 [6.7 3.  5.2 2.3]
 [6.3 2.5 5.  1.9]
 [6.5 3.  5.2 2. ]
 [6.2 3.4 5.4 2.3]
 [5.9 3.  5.1 1.8]]
In [2]:
iris.data.shape
Out[2]:
(150, 4)

14.15 A look at the data for the k-NN algorithm

In [3]:
print(iris.feature_names) # print the names of the four features
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [4]:
# The integers denote the species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
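
A one-line check, as a sketch: np.unique with return_counts=True confirms the three classes are balanced, 50 samples each.

import numpy as np

values, counts = np.unique(iris.target, return_counts=True)
print(values)   # [0 1 2]
print(counts)   # [50 50 50]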

14.16 Applying the k-NN algorithm

In [5]:
# Split the data 80:20.
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split 

iris = load_iris() 
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
In [6]:
from sklearn.neighbors import KNeighborsClassifier 
from sklearn import metrics 

num_neigh = 1
knn = KNeighborsClassifier(n_neighbors = num_neigh) 
knn.fit(X_train, y_train) 
y_pred = knn.predict(X_test) 
scores = metrics.accuracy_score(y_test, y_pred) 
print('Accuracy when n_neighbors is {0:d}: {1:.3f}'.format(num_neigh, scores))
Accuracy when n_neighbors is 1: 0.900
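
Accuracy depends on the choice of k. As a sketch (reusing the split and imports above), a simple loop compares a few values of n_neighbors; which k wins varies with the random split.

for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    acc = metrics.accuracy_score(y_test, knn.predict(X_test))
    print('n_neighbors = {0:d}, accuracy: {1:.3f}'.format(k, acc))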

14.17 Classifying a new flower with the trained model

In [7]:
from sklearn.datasets import load_iris 
from sklearn.model_selection import train_test_split 
from sklearn.neighbors import KNeighborsClassifier 
 
iris = load_iris() 
knn = KNeighborsClassifier(n_neighbors=6) 
knn.fit(iris.data, iris.target) 
Out[7]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=6, p=2,
                     weights='uniform')
In [8]:
classes = {0:'setosa', 1:'versicolor', 2:'virginica'} 
 
# Present new data the model has not seen yet.
X = [[3, 4, 5, 2],
     [5, 4, 2, 2]]
y = knn.predict(X) 
 
print(classes[y[0]]) 
print(classes[y[1]]) 
versicolor
setosa
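
A sketch for inspecting these answers (reusing the knn and X above): predict_proba reports the fraction of the 6 neighbors that voted for each class, which shows how close each vote was.

print(knn.predict_proba(X))   # one row per sample, one column per class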

14.19 Loading the Boston housing data and checking for missing values

In [9]:
import numpy as np 
import matplotlib.pyplot as plt 
import pandas as pd 
import seaborn as sns    # use the Seaborn library for visualization
 
from sklearn.datasets import load_boston   # note: load_boston was removed in scikit-learn 1.2
boston = load_boston() 
 
df = pd.DataFrame(boston.data, columns=boston.feature_names) 
print(df.head())
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  \
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0   
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0   
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0   
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0   
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0   

   PTRATIO       B  LSTAT  
0     15.3  396.90   4.98  
1     17.8  396.90   9.14  
2     17.8  392.83   4.03  
3     18.7  394.63   2.94  
4     18.7  396.90   5.33  
In [10]:
df['MEDV'] = boston.target 
In [11]:
print( df.isnull().sum() )
CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

14.20 Examining the correlations between the features

In [12]:
sns.set(rc={'figure.figsize':(12,10)}) 
correlation_matrix = df.corr().round(2) 
sns.heatmap(data=correlation_matrix, annot=True) 
plt.show()

14.21 Which features are correlated with each other?

In [13]:
sns.pairplot(df[["MEDV", "RM", "AGE", "CHAS", "B"]])
plt.show()

14.22 Building a simple regression model

In [14]:
X = df[['LSTAT', 'RM']] 
y = df['MEDV']
In [15]:
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
In [16]:
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error 

lin_model = LinearRegression() 
lin_model.fit(X_train, y_train)
Out[16]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [17]:
y_test_predict = lin_model.predict(X_test) 
rmse = np.sqrt(mean_squared_error(y_test, y_test_predict))
print('RMSE =', rmse)
RMSE = 5.350860024889858
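
Alongside the RMSE, a scale-free summary, as a closing sketch reusing lin_model and the test split above: the R^2 score on the held-out data.

print('R^2 =', lin_model.score(X_test, y_test))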