1. Kaggle - New York City Taxi Trip Duration Prediction


New York City Taxi Trip Duration


To practice data analysis, I am going to work through kernels that performed well in past Kaggle competitions.

0. Competition Introduction


The goal of this competition is to build a model that predicts taxi trip durations in New York City.

Rather than simply rewarding the best leaderboard score, the competition was run so that prizes went to participants who built insightful, usable models.

The evaluation metric is as follows.

\epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 }

Where:

ϵ is the RMSLE value (score)
n is the total number of observations in the (public/private) data set,
p_i is your prediction of trip duration, and
a_i is the actual trip duration for observation i.
log(x) is the natural logarithm of x
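
For a quick sanity check, the metric can be written directly as a small helper function (my own sketch, not part of the referenced kernel):

import numpy as np

def rmsle(pred, actual):
    # Root Mean Squared Logarithmic Error, exactly as defined above
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)
    return np.sqrt(np.mean((np.log(pred + 1) - np.log(actual + 1)) ** 2))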

This analysis uses the data from the Kaggle competition New York City Taxi Trip Duration and, for practice, follows Weiying Wang's kernel A Practical Guide to NY Taxi Data (0.379).

# Library import

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.figsize']=(10, 18)
%matplotlib inline
from datetime import datetime
from datetime import date
import xgboost as xgb
from sklearn.cluster import MiniBatchKMeans
import seaborn as sns
import warnings
sns.set()
warnings.filterwarnings('ignore')

1. Data Preview


train = pd.read_csv('Input/train.csv',
                    parse_dates=['pickup_datetime'])
test = pd.read_csv('Input/test.csv',
                   parse_dates=['pickup_datetime'])
train.head()
   id         vendor_id  pickup_datetime      dropoff_datetime     passenger_count  pickup_longitude  pickup_latitude  dropoff_longitude  dropoff_latitude  store_and_fwd_flag  trip_duration
0  id2875421  2          2016-03-14 17:24:55  2016-03-14 17:32:30  1                -73.982155        40.767937        -73.964630         40.765602         N                   455
1  id2377394  1          2016-06-12 00:43:35  2016-06-12 00:54:38  1                -73.980415        40.738564        -73.999481         40.731152         N                   663
2  id3858529  2          2016-01-19 11:35:24  2016-01-19 12:10:48  1                -73.979027        40.763939        -74.005333         40.710087         N                   2124
3  id3504673  2          2016-04-06 19:32:31  2016-04-06 19:39:40  1                -74.010040        40.719971        -74.012268         40.706718         N                   429
4  id2181028  2          2016-03-26 13:30:55  2016-03-26 13:38:10  1                -73.973053        40.793209        -73.972923         40.782520         N                   435
#dataDir = '../input/'
#train = pd.read_csv(dataDir + 'train.csv')
#test = pd.read_csv(dataDir + 'test.csv')
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 11 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null datetime64[ns]
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null object
trip_duration         1458644 non-null int64
dtypes: datetime64[ns](1), float64(4), int64(3), object(3)
memory usage: 122.4+ MB

null값 없음. 11개 열과 1458644개 행

for df in (train, test):
    df['year'] = df['pickup_datetime'].dt.year
    df['month'] = df['pickup_datetime'].dt.month
    df['day'] = df['pickup_datetime'].dt.day
    df['hour'] = df['pickup_datetime'].dt.hour
    df['minute'] = df['pickup_datetime'].dt.minute
    df['store_and_fwd_flag'] = 1 * (df['store_and_fwd_flag'].values == 'Y')
test.head()
(test.head(): the same trip columns as train, minus dropoff_datetime and trip_duration, with the new year, month, day, hour, minute columns appended and store_and_fwd_flag now encoded as 0/1; the five sample rows are all pickups at 2016-06-30 23:59.)

Since submissions are scored with RMSLE, we transform the actual trip duration accordingly: training on log(trip_duration + 1) means that minimizing ordinary RMSE on the log scale is equivalent to minimizing RMSLE on the original durations (predictions are mapped back with exp(x) - 1).

\epsilon = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \log(p_i + 1) - \log(a_i + 1) \right)^2 }


train = train.assign(log_trip_duration = np.log(train.trip_duration+1))
train.head()
(train.head(): the original columns plus year, month, day, hour, minute and the new log_trip_duration column; e.g. trip_duration 455 becomes log_trip_duration 6.122493.)

2. Features


According to the reference kernel, the important features are:

  1. the pickup time (rush hour should lead to longer trip durations)
  2. the trip distance
  3. the pickup location

2.1. Pickup Time and Weekend Features


Let's look at the details through the code.

from datetime import datetime
holiday = pd.read_csv('Input/NYC_2016Holidays.csv', sep=';')
# holiday['Date'] = holiday['Date'].apply(lambda x: x + ' 2016')
# The reference kernel appends ' 2016' with the lambda above,
# but plain string concatenation does the same thing more simply.
holiday['Date'] = holiday['Date'] + ' 2016'
# strptime turns strings such as 'January 01 2016' into datetime objects;
# the format string '%B %d %Y' tells it how the dates are written.
holidays = [datetime.strptime(holiday.loc[i, 'Date'],
            '%B %d %Y').date() for i in range(len(holiday))]
time_train = pd.DataFrame(index = range(len(train)))
time_test = pd.DataFrame(index = range(len(test)))
from datetime import date
def restday(yr, month, day, holidays):
    is_rest = [None]*len(yr)
    is_weekend = [None]*len(yr)
    i=0
    for yy, mm, dd in zip(yr, month, day):
        is_weekend[i] = date(yy, mm, dd).isoweekday() in (6,7)
        is_rest[i] = is_weekend[i] or date(yy, mm, dd) in holidays
        i+=1
    return is_rest, is_weekend
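
For reference, the same flags could also be computed in a vectorized way with pandas (just a sketch of an equivalent approach; the loop above is what is actually used below):

def restday_vectorized(df, holidays):
    # hypothetical vectorized equivalent using pandas datetime accessors
    dates = pd.to_datetime(df[['year', 'month', 'day']])
    is_weekend = dates.dt.dayofweek >= 5                 # Saturday = 5, Sunday = 6
    is_rest = is_weekend | dates.dt.date.isin(holidays)  # weekend or NYC holiday
    return is_rest, is_weekend
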
rest_day, weekend = restday(train.year, train.month, train.day, holidays)
#time_train = time_train.assign(rest_day=rest_day)
#time_train = time_train.assign(weekend=weekend)
time_train['rest_day'] = rest_day
time_train['weekend'] = weekend
time_train['pickup_time'] = train.hour+train.minute/60
time_train.head()
   rest_day  weekend  pickup_time
0  False     False    17.400000
1  True      True     0.716667
2  False     False    11.583333
3  False     False    19.533333
4  True      True     13.500000
rest_day, weekend = restday(test.year, test.month, test.day, holidays)
#time_train = time_train.assign(rest_day=rest_day)
#time_train = time_train.assign(weekend=weekend)
time_test['rest_day'] = rest_day
time_test['weekend'] = weekend
time_test['pickup_time'] = test.hour+test.minute/60
time_test.head()
   rest_day  weekend  pickup_time
0  False     False    23.983333
1  False     False    23.983333
2  False     False    23.983333
3  False     False    23.983333
4  False     False    23.983333

2.2. Distance Features


2.2.1. OSRM Features


According to the kernel, the travel distance along the road network, rather than the straight-line difference between the pickup and dropoff GPS positions, is the more relevant quantity. How different the two really are is not obvious to me yet, so let's find out through the code. This travel distance is hard to compute yourself, but Oscarleo has uploaded an OSRM dataset to Kaggle, so we use that.

fastrout1 = pd.read_csv('Input/fastest_routes_train_part_1.csv',
                usecols=['id', 'total_distance', 'total_travel_time',  
                         'number_of_steps','step_direction'])
fastrout2 = pd.read_csv('Input/fastest_routes_train_part_2.csv',
                usecols=['id', 'total_distance', 'total_travel_time',  
                         'number_of_steps','step_direction'])
fastrout = pd.concat((fastrout1, fastrout2))
fastrout.head()
   id         total_distance  total_travel_time  number_of_steps  step_direction
0  id2875421  2009.1          164.9              5                left|straight|right|straight|arrive
1  id2377394  2513.2          332.0              6                none|right|left|right|left|arrive
2  id3504673  1779.4          235.8              4                left|left|right|arrive
3  id2181028  1614.9          140.1              5                right|left|right|left|arrive
4  id0801584  1393.5          189.4              5                right|right|right|left|arrive
# map applies a function to every element of a sequence;
# here the function is defined inline with a lambda.
right_turn = []
left_turn = []
right_turn += list(map(lambda x:x.count('right')-
                x.count('slight right'), fastrout.step_direction))
left_turn += list(map(lambda x:x.count('left')-
                x.count('slight left'),fastrout.step_direction))

osrm_data = fastrout[['id', 'total_distance', 'total_travel_time',
                      'number_of_steps']]
osrm_data['right_steps'] = right_turn
osrm_data['left_steps'] = left_turn
osrm_data.head()
   id         total_distance  total_travel_time  number_of_steps  right_steps  left_steps
0  id2875421  2009.1          164.9              5                1            1
1  id2377394  2513.2          332.0              6                2            2
2  id3504673  1779.4          235.8              4                1            2
3  id2181028  1614.9          140.1              5                2            2
4  id0801584  1393.5          189.4              5                3            1

The OSRM data has 1,458,643 rows, one fewer than the training data, so instead of simply concatenating the two tables we attach it with a join on id (as in SQL).

osrm_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1458643 entries, 0 to 758642
Data columns (total 6 columns):
id                   1458643 non-null object
total_distance       1458643 non-null float64
total_travel_time    1458643 non-null float64
number_of_steps      1458643 non-null int64
right_steps          1458643 non-null int64
left_steps           1458643 non-null int64
dtypes: float64(2), int64(3), object(1)
memory usage: 77.9+ MB
train = train.join(osrm_data.set_index('id'), on='id')
train.head()
(train.head(): the previous columns plus the joined OSRM features total_distance, total_travel_time, number_of_steps, right_steps, left_steps; e.g. the first trip has total_distance 2009.1 and total_travel_time 164.9.)

5 rows × 22 columns
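
Since the OSRM table is one row short, the left join leaves missing values in exactly one training row (this matches the non-null counts in train.info() further below). A quick check along these lines (my own addition) makes that visible before the row is dropped later:

# number of rows whose OSRM features are missing after the join
train[['total_distance', 'total_travel_time', 'number_of_steps']].isnull().sum()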

Apply the same processing to the test data.

osrm_test = pd.read_csv('Input/fastest_routes_test.csv')
right_turn= list(map(lambda x:x.count('right')-
                x.count('slight right'),osrm_test.step_direction))
left_turn = list(map(lambda x:x.count('left')-
                x.count('slight left'),osrm_test.step_direction))

osrm_test = osrm_test[['id','total_distance','total_travel_time',
                       'number_of_steps']]
osrm_test['right_steps'] = right_turn
osrm_test['left_steps'] = left_turn
osrm_test.head(3)
   id         total_distance  total_travel_time  number_of_steps  right_steps  left_steps
0  id0771704  1497.1          200.2              7                2            3
1  id3274209  1427.1          141.5              2                0            0
2  id2756455  2312.3          324.6              9                4            4
test = test.join(osrm_test.set_index('id'), on='id')
test.head()
(test.head(): the test rows now carry the OSRM columns total_distance, total_travel_time, number_of_steps, right_steps, left_steps as well.)

2.2.2. Other Distance Features


Three other distance measures are used.

  1. Haversine distance: the direct (great-circle) distance between the two GPS locations, taking into account that the earth is round.
  2. Manhattan distance: the usual L1 distance; here the haversine distance is used to compute each coordinate component separately (north-south plus east-west).
  3. Bearing: the direction of the trip, returned in degrees by the code below. (I must admit that I do not fully understand the formula; I have stared at it for a long time but can't come up with anything. If anyone can explain it, that would be a big help.)
    (source: beluga)

def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

def dummy_manhattan_distance(lat1, lng1, lat2, lng2):
    a = haversine_array(lat1, lng1, lat1, lng2)
    b = haversine_array(lat1, lng1, lat2, lng1)
    return a + b

def bearing_array(lat1, lng1, lat2, lng2):
    lng_delta_rad = np.radians(lng2 - lng1)
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    y = np.sin(lng_delta_rad) * np.cos(lat2)
    x = np.cos(lat1) * np.sin(lat2) - np.sin(lat1) * np.cos(lat2) * np.cos(lng_delta_rad)
    return np.degrees(np.arctan2(y, x))
List_dist = []
for df in (train, test):
    lat1, lng1, lat2, lng2 = (df['pickup_latitude'].values, df['pickup_longitude'].values,
                              df['dropoff_latitude'].values,df['dropoff_longitude'].values)
    dist = pd.DataFrame(index=range(len(df)))
    dist = dist.assign(haversine_dist = haversine_array(lat1, lng1, lat2, lng2))
    dist = dist.assign(manhattan_dist = dummy_manhattan_distance(lat1, lng1, lat2, lng2))
    dist = dist.assign(bearing = bearing_array(lat1, lng1, lat2, lng2))
    List_dist.append(dist)
Other_dist_train,Other_dist_test = List_dist
Other_dist_train.head()
   haversine_dist  manhattan_dist  bearing
0  1.498521        1.735433        99.970196
1  1.805507        2.430506        -117.153768
2  6.385098        8.203575        -159.680165
3  1.485498        1.661331        -172.737700
4  1.188588        1.199457        179.473585

2.3. Location Features: K-means Clustering


coord_pickup = np.vstack((train[['pickup_latitude', 'pickup_longitude']].values,                  
                          test[['pickup_latitude', 'pickup_longitude']].values))
coord_dropoff = np.vstack((train[['dropoff_latitude', 'dropoff_longitude']].values,                  
                           test[['dropoff_latitude', 'dropoff_longitude']].values))
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1458644 entries, 0 to 1458643
Data columns (total 23 columns):
id                    1458644 non-null object
vendor_id             1458644 non-null int64
pickup_datetime       1458644 non-null datetime64[ns]
dropoff_datetime      1458644 non-null object
passenger_count       1458644 non-null int64
pickup_longitude      1458644 non-null float64
pickup_latitude       1458644 non-null float64
dropoff_longitude     1458644 non-null float64
dropoff_latitude      1458644 non-null float64
store_and_fwd_flag    1458644 non-null int64
trip_duration         1458644 non-null int64
year                  1458644 non-null int64
month                 1458644 non-null int64
day                   1458644 non-null int64
hour                  1458644 non-null int64
minute                1458644 non-null int64
log_trip_duration     1458644 non-null float64
total_distance        1458643 non-null float64
total_travel_time     1458643 non-null float64
number_of_steps       1458643 non-null float64
right_steps           1458643 non-null float64
left_steps            1458643 non-null float64
pickup_dropoff_loc    1458644 non-null int32
dtypes: datetime64[ns](1), float64(10), int32(1), int64(9), object(2)
memory usage: 250.4+ MB
# drop the single row containing null values
train.dropna(inplace=True)
test.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9258 entries, 0 to 9257
Data columns (total 24 columns):
id                    9258 non-null object
vendor_id             9258 non-null int64
pickup_datetime       9258 non-null datetime64[ns]
passenger_count       9258 non-null int64
pickup_longitude      9258 non-null float64
pickup_latitude       9258 non-null float64
dropoff_longitude     9258 non-null float64
dropoff_latitude      9258 non-null float64
store_and_fwd_flag    9258 non-null int64
year                  9258 non-null int64
month                 9258 non-null int64
day                   9258 non-null int64
hour                  9258 non-null int64
minute                9258 non-null int64
total_distance        9258 non-null float64
total_travel_time     9258 non-null float64
number_of_steps       9258 non-null int64
right_steps           9258 non-null int64
left_steps            9258 non-null int64
pickup_dropoff_loc    9258 non-null int32
Temp.                 9258 non-null float64
Precip                9258 non-null float64
snow                  9258 non-null int64
Visibility            9258 non-null float64
dtypes: datetime64[ns](1), float64(9), int32(1), int64(12), object(1)
memory usage: 1.7+ MB
# drop rows containing null values
test.dropna(inplace=True)
test.head()
(same output as the test.head() shown in section 2.2.1)
# one row per trip (train and test stacked): pickup_lat, pickup_lon, dropoff_lat, dropoff_lon
coords = np.hstack((coord_pickup, coord_dropoff))
# fit MiniBatchKMeans on a random sample of 500,000 trips and group the
# pickup/dropoff pairs into 10 location clusters
sample_ind = np.random.permutation(len(coords))[:500000]
kmeans = MiniBatchKMeans(n_clusters=10, batch_size=10000).fit(coords[sample_ind])
# assign every train/test trip to one of the 10 clusters
for df in (train, test):
    df.loc[:, 'pickup_dropoff_loc'] = kmeans.predict(df[['pickup_latitude', 'pickup_longitude',
                                                         'dropoff_latitude','dropoff_longitude']])
kmean10_train = train[['pickup_dropoff_loc']]
kmean10_test = test[['pickup_dropoff_loc']]
plt.figure(figsize=(16,16))
N = 500
for i in range(10):
    plt.subplot(4,3,i+1)
    tmp = train[train.pickup_dropoff_loc==i]
    drop = plt.scatter(tmp['dropoff_longitude'][:N], tmp['dropoff_latitude'][:N], s=10, lw=0, alpha=0.5,label='dropoff')
    pick = plt.scatter(tmp['pickup_longitude'][:N], tmp['pickup_latitude'][:N], s=10, lw=0, alpha=0.4,label='pickup')    
    plt.xlim([-74.05,-73.75]);plt.ylim([40.6,40.9])
    plt.legend(handles = [pick,drop])
    plt.title('clusters %d'%i)

(Figure: pickup/dropoff scatter plots for each of the 10 location clusters)

2.4. Weather Features


weather = pd.read_csv('Input/KNYC_Metars.csv', parse_dates=['Time'])
weather.head()
(weather.head(): hourly observations with columns Time, Temp., Windchill, Heat Index, Humidity, Pressure, Dew Point, Visibility, Wind Dir, Wind Speed, Gust Speed, Precip, Events, Conditions.)
print('The Events has values {}.'.format(str(weather.Events.unique())))
The Events has values ['None' 'Rain' 'Snow' 'Fog\n\t,\nSnow' 'Fog' 'Fog\n\t,\nRain'].
weather['snow'] = 1*(weather.Events=='Snow') + 1*(weather.Events=='Fog\n\t,\nSnow')
weather['year'] = weather['Time'].dt.year
weather['month'] = weather['Time'].dt.month
weather['day'] = weather['Time'].dt.day
weather['hour'] = weather['Time'].dt.hour
weather = weather[weather['year'] == 2016][['month','day','hour','Temp.','Precip','snow','Visibility']]
weather.head()
    month  day  hour  Temp.  Precip  snow  Visibility
22  1      1    0     5.6    0.0     0     16.1
23  1      1    1     5.6    0.0     0     16.1
24  1      1    2     5.6    0.0     0     16.1
25  1      1    3     5.0    0.0     0     16.1
26  1      1    4     5.0    0.0     0     16.1
train = pd.merge(train, weather, on = ['month', 'day', 'hour'],
                 how = 'left')
test = pd.merge(test, weather, on = ['month', 'day', 'hour'],
                 how = 'left')

3. Analysis of Features


tmp = train
tmp = pd.concat([tmp, time_train], axis=1)
fig = plt.figure(figsize=(18, 8))
sns.boxplot(x='hour', y='log_trip_duration', data=tmp);

(Figure: boxplot of log_trip_duration by pickup hour)

sns.violinplot(x='month', y='log_trip_duration', hue='rest_day',
               data=tmp, split=True, inner='quart');

(Figure: violin plots of log_trip_duration by month, split by rest_day)

tmp.head()
(tmp.head(): the training rows with the weather features Temp., Precip, snow, Visibility and the time features rest_day, weekend, pickup_time appended.)

5 rows × 30 columns

sns.violinplot(x="pickup_dropoff_loc", y="log_trip_duration",
               hue="rest_day",
               data=tmp,
               split=True,inner="quart");

(Figure: violin plots of log_trip_duration by pickup/dropoff location cluster, split by rest_day)

4. XGB Model : the Prediction of trip duration


testdf = test[['vendor_id','passenger_count','pickup_longitude', 'pickup_latitude',
       'dropoff_longitude', 'dropoff_latitude','store_and_fwd_flag']]
len(train)
1458643

The rest of the analysis is still in progress.
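
In the meantime, here is a rough outline of how the features built above could be fed into xgboost (a minimal sketch with my own choice of columns and parameters, not the kernel's final model). Because the target is log_trip_duration, minimizing RMSE on it corresponds to RMSLE on the original durations, and predictions are mapped back with exp(x) - 1.

from sklearn.model_selection import train_test_split

# an illustrative subset of the features built above (not the kernel's final list)
features = ['vendor_id', 'passenger_count', 'pickup_longitude', 'pickup_latitude',
            'dropoff_longitude', 'dropoff_latitude', 'store_and_fwd_flag',
            'month', 'day', 'hour', 'pickup_dropoff_loc',
            'total_distance', 'total_travel_time', 'number_of_steps',
            'right_steps', 'left_steps', 'Temp.', 'Precip', 'snow', 'Visibility']
X = train[features]
y = train['log_trip_duration']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dvalid = xgb.DMatrix(X_val, label=y_val)
params = {'objective': 'reg:linear', 'eval_metric': 'rmse',
          'eta': 0.3, 'max_depth': 6, 'subsample': 0.8}
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dvalid, 'valid')], early_stopping_rounds=10)

# predictions come back on the log scale; undo the log(x + 1) transform
pred_duration = np.exp(model.predict(xgb.DMatrix(test[features]))) - 1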



