Predicting Flight Ticket Prices in Python with a Random Forest Model

Posted by 程式設計師小六 in 網頁遊戲 (Web Games)
2022-09-19


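Flight ticket prices in India float with supply and demand and are rarely constrained by regulators. They are therefore generally considered unpredictable, and dynamic pricing only adds to the confusion.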

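Our goal is to build a machine learning model that predicts the prices of future flights from historical data. The predicted prices can serve as reference prices for customers or for airline service providers.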

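1. Preparation

Before you begin, make sure Python and pip are installed on your computer. If they are not, follow the article "A Super-Detailed Python Installation Guide" to install them.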

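(Optional 1) If you use Python for data analysis, you can install Anaconda directly (see the article "Anaconda: A Handy Helper for Python Data Analysis and Mining"). It comes with Python and pip built in.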

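(Optional 2) We also recommend the VSCode editor, which has many advantages (see the article "VSCode: The Best Companion for Python Programming, a Detailed Guide").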

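Open a command line in any one of the following ways, then enter the commands below to install the dependencies:

1. On Windows, open Cmd (Start > Run > CMD).

2. On macOS, open Terminal (press command+space and type Terminal).

3. If you use VSCode or PyCharm, you can use the Terminal panel at the bottom of the window.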

pip install pandas
pip install numpy
pip install matplotlib
pip install seaborn
pip install scikit-learn
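To confirm that the installation worked, a quick check like the following can help. This is only a sketch: the helper name `missing_packages` and the `REQUIRED` mapping are made up for illustration. It relies on the fact that scikit-learn is imported under the module name `sklearn`.

```python
from importlib.util import find_spec

# pip package name -> module name used in "import ..." (hypothetical helper table)
REQUIRED = {
    "pandas": "pandas",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


# An empty list means every dependency is importable.
print(missing_packages())
```

Note that reading .xlsx files with pandas also needs an Excel engine such as openpyxl, which the commands above do not install explicitly.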

2. Loading the Dataset

The dataset used in this article is Data_Train.xlsx. First, look at the format of the training set:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style('whitegrid')

# Load the training set and preview the first rows
flights = pd.read_excel('./Data_Train.xlsx')
flights.head()

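The training set contains the fields airline (Airline), date of journey (Date_of_Journey), origin (Source), destination (Destination), route (Route), departure time (Dep_Time), arrival time (Arrival_Time), duration (Duration), total number of stops (Total_Stops) and additional information (Additional_Info), followed by the ticket price (Price).

The corresponding test set has the same fields as the training set, except that it lacks the Price field.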

To download the complete dataset and code, visit https://pythondict.com/download/predict-ticket/ or send the reply 預測機票 to Python實用寶典.

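3. Exploratory Data Analysis

3.1 Cleaning Missing Data

Look at the basic information for all fields:

flights.info()

Every other field has 10683 non-null values. Only Route and Total_Stops have 10682, which means each of these two fields is missing one value.

To be safe, we drop the rows with missing data: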

# clearing the missing data
flights.dropna(inplace=True)
flights.info()

The non-null counts now match across all fields, so the data cleaning is complete.
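The same check can be made more explicit by counting missing values per column before dropping rows. The sketch below uses a tiny made-up table (the rows are invented for illustration, not taken from Data_Train.xlsx) with `isna().sum()` and `dropna()`:

```python
import pandas as pd

# Toy rows, invented for illustration only
toy = pd.DataFrame({
    "Route": ["DEL-BOM", None, "BLR-DEL"],
    "Total_Stops": ["non-stop", None, "1 stop"],
    "Price": [4000, 7000, 5500],
})

# Number of missing values in each column
missing = toy.isna().sum()
print(missing)

# Drop every row that has at least one missing value
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

Counting first shows which columns are affected and how many rows will be lost, which is useful when more than a single row is incomplete.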

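3.2 Distribution of Airlines

Next, look at how the flights are distributed across airlines:

sns.countplot(x='Airline', data=flights)
plt.xticks(rotation=90)  # rotation value missing in the source; 90 is assumed
plt.show()

The top three airlines are IndiGo, Air India and Jet Airways.

Some of them may be low-cost carriers.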

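3.3 Distribution of Origins

sns.countplot(x='Source', data=flights)
plt.xticks(rotation=90)  # rotation value missing in the source; 90 is assumed
plt.show()

Some origins may be less popular, so they are more likely to have low-demand tickets.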

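3.4 Distribution of the Number of Stops

sns.countplot(x='Total_Stops', data=flights)
plt.xticks(rotation=90)  # rotation value missing in the source; 90 is assumed
plt.show()

Most flights make a single stop or none at all.

Could flights with more stops be cheaper?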
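One way to approach a question like this is to compare the mean price per category with `groupby`. The sketch below uses made-up numbers, so its output says nothing about the real dataset. It only shows the pattern that could be applied to the flights table.

```python
import pandas as pd

# Made-up prices for illustration only, not values from Data_Train.xlsx
toy = pd.DataFrame({
    "Total_Stops": ["non-stop", "1 stop", "1 stop", "2 stops", "non-stop"],
    "Price": [4000, 9000, 11000, 13000, 5000],
})

# Mean price for each stop category, cheapest first
mean_price = toy.groupby("Total_Stops")["Price"].mean().sort_values()
print(mean_price)
```

The same pattern would work for Additional_Info or Airline by changing the column passed to `groupby`.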

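3.5 How Many Records Carry Additional Information

plot = plt.figure()
sns.countplot(x='Additional_Info', data=flights)
plt.xticks(rotation=90)  # rotation value missing in the source; 90 is assumed
plt.show()

Most flight records carry no additional information. The exceptions mainly note that no in-flight meal is included or that no free checked baggage is included.

This information matters: are tickets that exclude these two services cheaper?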

3.6 Time-Dimension Analysis

First, convert the time formats:

flights['Date_of_Journey'] = pd.to_datetime(flights['Date_of_Journey'])
flights['Dep_Time'] = pd.to_datetime(flights['Dep_Time'], format='%H:%M:%S').dt.time

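Next, examine how the travel date relates to the price, starting with the day of the week:

# Name of the weekday for each journey date
flights['weekday'] = flights['Date_of_Journey'].dt.day_name()
sns.barplot(x='weekday', y='Price', data=flights)
plt.show()

Prices are broadly the same across weekdays, which suggests this feature carries little signal.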

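What about the relationship between month and ticket price?

# Name of the month for each journey date
flights['month'] = flights['Date_of_Journey'].dt.month_name()
sns.barplot(x='month', y='Price', data=flights)
plt.show()

Surprisingly, the average ticket price in April is only about half that of the other months, so April may be the off-season for travel in India.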

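Relationship between departure time and price:

# Keep only the departure hour as a numeric feature
flights['Dep_Time'] = flights['Dep_Time'].apply(lambda x: x.hour)
flights['Dep_Time'] = pd.to_numeric(flights['Dep_Time'])
sns.barplot(x='Dep_Time', y='Price', data=flights)
plt.show()

Red-eye flights (late night and early morning) are cheaper, which matches our expectations.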
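The three time features used in this section (weekday, month and departure hour) can also be illustrated on a single record with the standard library. This is a sketch under assumptions: the function name `time_features` is made up, and the date format `%d/%m/%Y` is a guess, since the source does not show what the raw Date_of_Journey values look like. The time format matches the `%H:%M:%S` string used above.

```python
from datetime import datetime


def time_features(date_text, time_text):
    """Derive weekday, month and departure hour from text fields."""
    # Assumed date layout: day/month/year (not confirmed by the source)
    day = datetime.strptime(date_text, "%d/%m/%Y")
    dep = datetime.strptime(time_text, "%H:%M:%S")
    return {
        "weekday": day.strftime("%A"),   # full weekday name
        "month": day.strftime("%B"),     # full month name
        "dep_hour": dep.hour,            # integer hour, 0 to 23
    }


example = time_features("24/03/2019", "22:20:00")
print(example)
```

If the raw dates are stored day-first, it is worth checking that they are parsed that way, because a month/day mix-up would silently distort any per-month comparison.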

3.7 Removing Uninformative Features

Drop the fields that have no relationship with the price:

flights.drop(['Route', 'Arrival_Time', 'Date_of_Journey'], axis=1, inplace=True)
flights.head()
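Duration stays in the table as a feature. If it is stored as text such as "2h 50m" (an assumption, since the source does not show its format), it would need to become numeric before training. A minimal sketch of one possible conversion to minutes, with a made-up helper name `duration_to_minutes`:

```python
import re


def duration_to_minutes(text):
    """Convert strings like '2h 50m', '19h' or '45m' to total minutes.

    The 'Xh Ym' layout is an assumption about the raw data.
    """
    hours = re.search(r"(\d+)\s*h", text)
    minutes = re.search(r"(\d+)\s*m", text)
    if hours is None and minutes is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    total = 0
    if hours is not None:
        total += int(hours.group(1)) * 60
    if minutes is not None:
        total += int(minutes.group(1))
    return total


print(duration_to_minutes("2h 50m"))
print(duration_to_minutes("19h"))
```

On a DataFrame this could be applied with something like `flights['Duration'].map(duration_to_minutes)`, again assuming the column really uses this text layout.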

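4. Model Training

Next, we can get ready to use a model to predict ticket prices. First, though, the data needs preprocessing and feature scaling.

4.1 Data Preprocessing

Replace the string variables with numbers: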

from sklearn.preprocessing import LabelEncoder

# Columns whose values are replaced by integer codes
var_mod = ['Airline', 'Source', 'Destination', 'Additional_Info', 'Total_Stops', 'weekday', 'month', 'Dep_Time']
le = LabelEncoder()
for i in var_mod:
    flights[i] = le.fit_transform(flights[i])
flights.head()
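To make clear what the encoding step does to a column, here is a plain-Python sketch that mirrors the behaviour of `LabelEncoder`: the distinct values are sorted and each one is replaced by its position in that sorted list. The function name `label_encode` is made up for illustration.

```python
def label_encode(values):
    """Map each distinct value to an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {value: code for code, value in enumerate(classes)}
    return [mapping[v] for v in values], classes


codes, classes = label_encode(["IndiGo", "Air India", "Jet Airways", "IndiGo"])
print(classes)  # ['Air India', 'IndiGo', 'Jet Airways']
print(codes)    # [1, 0, 2, 1]
```

The codes carry an artificial order (Air India < IndiGo < Jet Airways). Tree-based models such as random forests usually cope with this, but for models that treat inputs as magnitudes, one-hot encoding is a common alternative.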

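Clip the outliers in each numeric column to the 1.5 × IQR bounds (the step the original describes as feature scaling), then extract the independent variables (x) and the dependent variable (y):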

flights.corr()

def outlier(df):
    # Cap every numeric column at Q1 - 1.5*IQR and Q3 + 1.5*IQR
    for i in df.describe().columns:
        Q1 = df.describe().at['25%', i]
        Q3 = df.describe().at['75%', i]
        IQR = Q3 - Q1
        LE = Q1 - 1.5 * IQR
        UE = Q3 + 1.5 * IQR
        df[i] = df[i].mask(df[i] < LE, LE)
        df[i] = df[i].mask(df[i] > UE, UE)
    return df

flights = outlier(flights)
x = flights.drop('Price', axis=1)
y = flights['Price']
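The clipping rule can be traced by hand on a small list. The sketch below re-implements it in plain Python with linearly interpolated quartiles, which is also the default that pandas uses for the 25% and 75% rows of `describe()`. The names `quantile` and `clip_outliers` are made up, and the numbers are toy values.

```python
def quantile(sorted_vals, q):
    """Quantile with linear interpolation between neighbouring values."""
    pos = (len(sorted_vals) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def clip_outliers(values):
    """Cap values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] at those bounds."""
    s = sorted(values)
    q1, q3 = quantile(s, 0.25), quantile(s, 0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [min(max(v, lower), upper) for v in values]


# Q1 = 2, Q3 = 4, IQR = 2, so the bounds are -1 and 7
print(clip_outliers([1, 2, 3, 4, 100]))  # [1, 2, 3, 4, 7.0]
```

Because the function above runs over every numeric column, the target column Price is capped as well, which affects the error metrics reported later.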

Split the data into a training set and a test set:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=101)
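To see what the split produces, here is a sketch on ten toy samples with the same `test_size` and `random_state` arguments. The toy data is made up, and only the sizes and the pairing of features with targets are checked.

```python
from sklearn.model_selection import train_test_split

# Ten toy samples: one feature each, target = feature * 100
toy_x = [[i] for i in range(10)]
toy_y = [i * 100 for i in range(10)]

x_tr, x_te, y_tr, y_te = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=101
)

print(len(x_tr), len(x_te))  # 8 2

# Features and targets are shuffled together, so the pairs stay aligned
print(all(y == x[0] * 100 for x, y in zip(x_te, y_te)))  # True
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs evaluate the model on the same held-out rows.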

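4.2 Model Training and Testing

Train the model with a random forest:

from sklearn.ensemble import RandomForestRegressor

# 100 trees; each tree is fit on a bootstrap sample of the training data
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(x_train, y_train)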
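For readers without the dataset at hand, the same fit and predict workflow can be tried on synthetic data. Everything in this sketch is made up for illustration (the features, the price formula and the noise level), and it adds `random_state` so that runs are repeatable, which the code above does not do.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: the "price" depends mostly on feature 0, a little on feature 1
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
price = 500 * X[:, 0] + 50 * X[:, 1] + rng.normal(0, 100, size=300)

X_train, X_test, p_train, p_test = train_test_split(
    X, price, test_size=0.2, random_state=101
)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, p_train)

preds = model.predict(X_test)          # one prediction per test row
r2 = model.score(X_test, p_test)       # R^2 on the held-out rows
print(preds.shape, round(r2, 3))
```

Without a fixed `random_state`, the forest is rebuilt differently on each run, so metrics such as the ones reported later will vary slightly between runs.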

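A random forest also gives us a way to rank features by importance, based on how much each feature reduces impurity across the trees: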

features = x.columns
importances = rfr.feature_importances_
indices = np.argsort(importances)  # least to most important

plt.figure()
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()

As the chart shows, Duration (flight time) is the most influential factor.
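How the ranking behaves can be sketched on synthetic data where the answer is known in advance: only one feature drives the target, so it should come out on top. The feature names and data below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
names = np.array(["signal", "noise_a", "noise_b"])

# Only the first column influences the target
X = rng.uniform(0, 1, size=(300, 3))
target = 10 * X[:, 0] + rng.normal(0, 0.1, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, target)

scores = forest.feature_importances_   # impurity-based, sums to 1
order = np.argsort(scores)[::-1]       # most important first
for idx in order:
    print(names[idx], round(float(scores[idx]), 3))
```

These scores are relative shares that sum to 1, so they rank features against each other rather than measure an absolute effect on the price.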

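Predict on the held-out test set and plot the predictions against the true values:

predictions = rfr.predict(x_test)
plt.scatter(y_test, predictions)
plt.show()

This view is not very intuitive, so next we evaluate the model numerically.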

4.3 Model Evaluation

sklearn provides convenient functions for evaluating models in its metrics module:

from sklearn import metrics

print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
print('r2_score:', metrics.r2_score(y_test, predictions))

Result:

MAE: 1453.9350628905618
MSE: 4506308.3645551
RMSE: 2122.806718605135
r2_score: 0.7532074710409375

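Of these four values you can focus on r2_score alone: the closer R² is to 1, the better the model fits. This model scores 0.75, meaning it explains about 75% of the variance in price on the test set, which is a fairly good result.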
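The four metrics follow directly from their definitions, which a small hand-checkable example makes concrete. The function name `regression_metrics` and the toy numbers are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from their definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                   # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)    # total variation
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Errors are 1, 0, -1, -1, so MAE = 0.75, MSE = 0.75, R2 = 1 - 3/20 = 0.85
toy = regression_metrics([3, 5, 7, 9], [2, 5, 8, 10])
print(toy)
```

MAE and RMSE are expressed in the same unit as Price, so they are best judged against typical ticket prices in the data, while R² is unit-free and easier to compare across datasets.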

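Check whether the histogram of the residuals looks normally distributed:

# histplot replaces the deprecated distplot; the bins value is missing in the source, so the default is used
sns.histplot(y_test - predictions, kde=True)
plt.show()

Not bad: most residuals (true price minus predicted price) fall between -1000 and 1000, which is an acceptable result. The residual histogram is also roughly normal, which suggests the model is working.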

