
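From competition: Car Destination Intelligent Prediction Competition (汽车目的地智能预测大赛)

[Latest baseline] Car Destination Intelligent Prediction Competition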

小螺号


Baseline author: XBuder

Author's WeChat official account: 麻婆豆腐AI

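Approach:

For each vehicle, count which locations it visits most often on each day of the week (1-7) and treat those locations as candidates. A candidate is labeled 1 if the vehicle actually went there and 0 otherwise. These labels are used to train a model, and the candidate with the highest predicted probability is recommended as the destination.

Model used:

Logistic regression

Possible improvements:

1. Refine the statistics to the hour level: count which locations a vehicle visits at which hour on each of the 7 weekdays (for example, using minutes or hours since midnight) and recommend by hour. A holiday flag can also be added as a feature.

2. Explore whether vehicles in the same time window and the same region behave similarly, and handle holiday recommendations separately.

3. Try models such as SVM, XGBoost (xgb), and LightGBM (lgb), and tune their parameters.

4. Model ensembling.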
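Improvement 1 can be sketched with the standard library alone. The snippet below derives the weekday, hour, minutes since midnight, and a holiday flag from a departure timestamp. The function name `time_features` and the `HOLIDAYS` set are illustrative assumptions and not part of the baseline; the set holds only two sample dates and would need the full holiday calendar in practice.

```python
from datetime import datetime, date

# Illustrative, incomplete sample of public holidays (assumption): extend as needed.
HOLIDAYS = {date(2018, 1, 1), date(2018, 10, 1)}

def time_features(start_time):
    """Return time-based features for a 'YYYY-MM-DD HH:MM:SS' departure string."""
    ts = datetime.strptime(start_time, '%Y-%m-%d %H:%M:%S')
    return {
        'weekday': ts.weekday() + 1,                 # Monday=1 ... Sunday=7
        'hour': ts.hour,                             # hour of day, 0-23
        'minute_of_day': ts.hour * 60 + ts.minute,   # minutes since midnight
        'is_holiday': int(ts.date() in HOLIDAYS),    # 1 if the date is in HOLIDAYS
    }

print(time_features('2018-10-01 08:15:00'))
```

Whether the hour or the minute of day works better as a grouping key is something to check on the validation split.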

Import the required packages and suppress warnings

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Scoring algorithm

from math import radians, atan, tan, sin, acos, cos

def getDistance(latA, lonA, latB, lonB):
    ra = 6378140  # equatorial radius, in meters
    rb = 6356755  # polar radius, in meters
    flatten = (ra - rb) / ra  # flattening of the earth
    # convert degrees to radians
    radLatA = radians(latA)
    radLonA = radians(lonA)
    radLatB = radians(latB)
    radLonB = radians(lonB)

    try:
        pA = atan(rb / ra * tan(radLatA))
        pB = atan(rb / ra * tan(radLatB))
        x = acos(sin(pA) * sin(pB) + cos(pA) * cos(pB) * cos(radLonA - radLonB))
        c1 = (sin(x) - x) * (sin(pA) + sin(pB)) ** 2 / cos(x / 2) ** 2
        c2 = (sin(x) + x) * (sin(pA) - sin(pB)) ** 2 / sin(x / 2) ** 2
        dr = flatten / 8 * (c1 - c2)
        distance = ra * (x + dr)
        return distance  # meters
    except (ZeroDivisionError, ValueError):
        # identical points give x == 0 (division by zero); rounding can also push acos out of its domain
        return 0.0000001

def f(d):
    return 1 / (1 + np.exp(-(d-1000)/250))
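To get a feel for the scoring function `f`, the short sketch below evaluates the same formula at a few distances using only the `math` module (the helper name `score` is an illustrative stand-in for `f`). An error of 1000 m maps to exactly 0.5 and smaller errors map to values closer to 0, so a lower average score means better predictions.

```python
from math import exp

def score(d):
    """Logistic penalty for a prediction error of d meters (same formula as f)."""
    return 1 / (1 + exp(-(d - 1000) / 250))

# Print the penalty for a few error distances, in meters.
for d in (0, 500, 1000, 2000):
    print(d, round(score(d), 4))
```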

Compute the error

def getDistanceFromDF(data):
    # take the four columns 'end_lat', 'end_lon', 'predict_end_lat', 'predict_end_lon' and cast them to float
    tmp = data[['end_lat','end_lon','predict_end_lat','predict_end_lon']].astype(float)
    error = []  # per-row distance errors, in meters
    for i in tmp.values:
        t = getDistance(i[0],i[1],i[2],i[3])  # distance between the true and the predicted destination
        error.append(t)
    print (np.sum(f(np.array(error))) / tmp.shape[0])  # print the mean score over all rows

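Convert the date columns in the dataset to the pandas datetime type

This step creates two new features, the departure weekday and the departure hour. The hour feature is not actually used in this baseline; readers are welcome to try it.

def dateConvert(data,isTrain):
    print ('convert string to datetime')
    data['start_time'] = pd.to_datetime(data['start_time'])  # convert the start time
    if isTrain:  # only the training set has an end time
        data['end_time'] = pd.to_datetime(data['end_time'])  # convert the end time
    data['weekday'] = data['start_time'].dt.weekday + 1  # day of week; dt.weekday is 0 to 6, so add 1
    data['hour']= data['start_time'].dt.hour
    return data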

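Merge latitude and longitude

Note: a more common approach is to convert the latitude and longitude into a geohash code; here the coordinates are only rounded and concatenated.

def latitude_longitude_to_go(data,isTrain):
    tmp = data[['start_lat','start_lon']]  # departure latitude and longitude
    start_geohash = []
    for t in tmp.values:  # iterate over the departure coordinates row by row
        # round to 5 decimal places and join latitude and longitude with '_'
        start_geohash.append(str(round(t[0],5)) + '_' + str(round(t[1],5)))
    data['startGo'] = start_geohash  # new column holding the merged departure coordinates
    if isTrain:  # for the training set, do the same for the destination coordinates
        tmp = data[['end_lat','end_lon']]
        end_geohash = []
        for t in tmp.values:
            end_geohash.append(str(round(t[0],5))+ '_' + str(round(t[1],5)))
        data['endGo'] = end_geohash  # new column holding the merged destination coordinates
    return data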
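For readers who want to try the geohash alternative mentioned in the note, here is a minimal pure-Python sketch of standard geohash encoding. The function name `geohash_encode` and the default `precision=7` are illustrative assumptions; the baseline itself does not use geohash.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=7):
    """Encode a latitude/longitude pair as a geohash string of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return ''.join(chars)

print(geohash_encode(57.64911, 10.40744, precision=5))
```

Nearby points usually share a prefix, so shortening the code is a cheap way to group destinations into coarser cells.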

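Count the 7 places each user visits most often

def getMostTimesCandidate(candidate):
    # keep trips that start on or before 2018-08-30 (the training data printed further down ends on 2018-08-01, so this keeps all of it)
    mostTimeCandidate = candidate[candidate['start_time']<='2018-08-30 23:59:59']
    # keep the columns 'out_id', 'endGo', 'end_lat', 'end_lon', 'weekday'
    mostTimeCandidate = mostTimeCandidate[['out_id','endGo','end_lat','end_lon','weekday']]
    # group by vehicle id, destination and weekday, and count the trips in each group as mostCandidateCount
    # (size() replaces the dict-renaming form of agg(), which recent pandas versions reject)
    mostTimeCandidate_7 = mostTimeCandidate.groupby(['out_id','endGo','weekday']).size().reset_index(name='mostCandidateCount')
    # sort by count and vehicle id, both descending
    mostTimeCandidate_7.sort_values(['mostCandidateCount','out_id'],inplace=True,ascending=False)
    # for each vehicle id and weekday, keep the 7 most visited destinations
    mostTimeCandidate_7 = mostTimeCandidate_7.groupby(['out_id','weekday']).head(7)
    return mostTimeCandidate_7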
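The group, count, and keep-the-top-7 logic can also be read as a plain-Python loop. The sketch below restates it with `collections.Counter` on a toy list of trips; the function name `top_candidates`, the tuple layout, and the sample data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def top_candidates(trips, k=7):
    """trips: iterable of (out_id, weekday, endGo) tuples.

    Returns {(out_id, weekday): [(endGo, count), ...]} with at most k entries
    per key, most frequent first.
    """
    counts = defaultdict(Counter)
    for out_id, weekday, end_go in trips:
        counts[(out_id, weekday)][end_go] += 1
    return {key: counter.most_common(k) for key, counter in counts.items()}

# Toy data: one vehicle, two weekdays.
trips = [
    ('car_a', 1, '30.1_120.2'),
    ('car_a', 1, '30.1_120.2'),
    ('car_a', 1, '30.5_120.9'),
    ('car_a', 2, '30.5_120.9'),
]
print(top_candidates(trips, k=1))
```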

Split the merged latitude/longitude back into two columns

def geoHashToLatLoc(data):
    # take the destination column
    tmp = data[['endGo']]
    # predicted destination latitudes
    predict_end_lat = []
    # predicted destination longitudes
    predict_end_lon = []
    # iterate row by row
    for i in tmp.values:
        # split each 'lat_lon' string and collect the two parts
        lats, lons = str(i[0]).split('_')
        predict_end_lat.append(lats)
        predict_end_lon.append(lons)
    # create the new columns
    data['predict_end_lat'] = predict_end_lat
    data['predict_end_lon'] = predict_end_lon
    return data

Compute the distance between two locations

def calcGeoHasBetween(go1,go2):
    latA, lonA = str(go1).split('_')
    latB, lonB = str(go2).split('_')
    distance = getDistance(float(latA), float(lonA), float(latB), float(lonB))
    return distance
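As a quick sanity check on these distances, the haversine formula on a sphere is a simpler alternative that also handles identical points without a special case. This is only a sketch for cross-checking, not the competition's scoring formula; the name `haversine` and the mean radius constant are assumptions introduced here.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_M = 6371000  # mean earth radius in meters (spherical approximation)

def haversine(latA, lonA, latB, lonB):
    """Great-circle distance in meters between two points given in degrees."""
    dlat = radians(latB - latA)
    dlon = radians(lonB - lonA)
    a = sin(dlat / 2) ** 2 + cos(radians(latA)) * cos(radians(latB)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

# One degree of longitude along the equator.
print(round(haversine(0, 0, 0, 1)))
```

Because it ignores the earth's flattening, the result will differ slightly from `getDistance`.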

Compute the distance for the whole dataset

def calcGeoHasBetweenMain(data):
    distance = []
    tmp = data[['endGo','startGo']]
    for i in tmp.values:
        distance.append(calcGeoHasBetween(i[0],i[1]) / 1000 )  # meters to kilometers
    data['distance'] = distance
    return data

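Load data & feature engineering

Load the training set

The idea is to use January-June to extract the most frequently visited places and July for training. Note that getMostTimesCandidate as written filters on 2018-08-30, so its counts also cover the July trips.

print ('begin')
train = pd.read_csv('train_new.csv',low_memory=False)  # read the training set; low_memory=False avoids column type errors
print (train['start_time'].min(),train['start_time'].max())  # earliest and latest departure time in the training set
print (train[train['start_time']>'2018-06-30 23:59:59'].shape)  # shape of the trips that start after June 30
print (train[train['start_time']<='2018-06-30 23:59:59'].shape)  # shape of the trips that start on or before June 30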

begin

2018-01-01 00:01:58 2018-08-01 00:49:30

(205111, 8)

(1290703, 8)

test = pd.read_csv('test_new.csv',low_memory=False)  # read the test set
print (test['start_time'].min(),test['start_time'].max())
print (test.shape)

2018-09-01 00:32:42 2018-10-23 23:59:48

(58097, 5)

trainIndex = train.shape[0]  # number of rows in the training set
testIndex = test.shape[0]  # number of rows in the test set
print (trainIndex,testIndex)

1495814 58097

Time conversion

train = dateConvert(train,True)
test = dateConvert(test,False)

convert string to datetime

Merge latitude and longitude

train = latitude_longitude_to_go(train,True)
test = latitude_longitude_to_go(test,False)

Write out the files produced by feature engineering

train.to_csv('train1.csv',index=False)
test.to_csv('test1.csv',index=False)

userMostTimes3loc = getMostTimesCandidate(train)

val = train[train['start_time']>'2018-06-30 23:59:59']  # use the trips after June 30 for validation
# keep the columns 'r_key', 'out_id', 'end_lat', 'end_lon', 'weekday', 'startGo', 'endGo', 'start_lat', 'start_lon'
val = val[['r_key','out_id','end_lat','end_lon','weekday','startGo','endGo','start_lat','start_lon']]
# rename 'endGo' to 'trueEndGo'
val.rename(columns={'endGo':'trueEndGo'},inplace=True)
# left join val with userMostTimes3loc on 'out_id' and 'weekday'
val = pd.merge(val,userMostTimes3loc,on=['out_id','weekday'],how='left',copy=False)
# where no candidate destination was found, fall back to the starting point
val['endGo'] = val['endGo'].fillna(val['startGo'])
# True if the vehicle really went to the candidate destination, otherwise False
val['flag1'] = val['trueEndGo'] == val['endGo']
# convert True to 1 and False to 0
val['flag1'] = val['flag1'].astype(int)
# compute the distance between the start and the candidate destination
val = calcGeoHasBetweenMain(val)

# keep the columns 'r_key', 'out_id', 'weekday', 'startGo', 'start_lat', 'start_lon' of the test set
test = test[['r_key','out_id','weekday','startGo','start_lat','start_lon']]
# left join test with userMostTimes3loc on 'out_id' and 'weekday'
test = pd.merge(test,userMostTimes3loc,on=['out_id','weekday'],how='left',copy=False)
# where no candidate destination was found, fall back to the starting point
test['endGo'] = test['endGo'].fillna(test['startGo'])
# compute the distance between the start and the candidate destination
test = calcGeoHasBetweenMain(test)
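The left join plus `fillna` step turns each trip into one row per candidate destination and falls back to the starting point when the vehicle has no history for that weekday. A plain-Python sketch of that expansion follows; the helper name `expand_candidates`, the dictionary layout, and the sample values are assumptions for illustration.

```python
def expand_candidates(trip, candidates_by_key):
    """Return one row per candidate destination for a trip.

    candidates_by_key maps (out_id, weekday) to a list of (endGo, count).
    With no history for the key, the start point is used as the only candidate.
    """
    key = (trip['out_id'], trip['weekday'])
    candidates = candidates_by_key.get(key)
    if not candidates:
        return [dict(trip, endGo=trip['startGo'], mostCandidateCount=None)]
    return [dict(trip, endGo=end_go, mostCandidateCount=count)
            for end_go, count in candidates]

history = {('car_a', 1): [('30.1_120.2', 2), ('30.5_120.9', 1)]}
known = {'r_key': 'r1', 'out_id': 'car_a', 'weekday': 1, 'startGo': '30.0_120.0'}
unknown = {'r_key': 'r2', 'out_id': 'car_b', 'weekday': 3, 'startGo': '31.0_121.0'}
print(len(expand_candidates(known, history)))            # number of candidate rows
print(expand_candidates(unknown, history)[0]['endGo'])   # fallback destination
```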

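Predict with the model

Logistic regression is used here. Readers can try xgb, lgb, and other models, which will likely score better.

feature = ['start_lat','start_lon','weekday','distance','mostCandidateCount']
# other algorithms such as svm, xgb, or lgb can be imported here instead
from sklearn.linear_model import LogisticRegression

print ('training')
model= LogisticRegression()  # train a logistic regression model; swap in another sklearn model here
# X is the feature matrix with missing values filled with -1; y is val['flag1'], i.e. whether the candidate was the true destination
model.fit(val[feature].fillna(-1).values,val['flag1'].values)
pre = model.predict_proba(val[feature].fillna(-1).values)[:,1]  # predicted probabilities on the validation set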

training

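val_result = val[['r_key','endGo','end_lat','end_lon']].copy()
val_result['predict'] = pre

val_result = val_result.sort_values(['predict'],ascending=False)  # sort by predicted probability, descending
val_result = val_result.drop_duplicates(['r_key'])  # keep the first (most probable) row per r_key

val = geoHashToLatLoc(val)  # split each candidate endGo into predicted latitude and longitude
getDistanceFromDF(val)  # score on the validation set; note this scores every candidate row in val, not only the top-ranked rows kept in val_result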

0.7703523722443778
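The sort followed by `drop_duplicates` keeps, for every `r_key`, the candidate with the highest predicted probability. A plain-Python sketch of that selection is shown below, assuming rows given as `(r_key, endGo, probability)` tuples; the helper name `best_per_trip` and the sample rows are hypothetical.

```python
def best_per_trip(rows):
    """rows: iterable of (r_key, endGo, probability).

    Returns {r_key: endGo} with the most probable endGo for each r_key.
    """
    best = {}
    for r_key, end_go, prob in rows:
        if r_key not in best or prob > best[r_key][1]:
            best[r_key] = (end_go, prob)
    return {r_key: end_go for r_key, (end_go, _) in best.items()}

rows = [
    ('r1', '30.1_120.2', 0.70),
    ('r1', '30.5_120.9', 0.20),
    ('r2', '31.0_121.0', 0.05),
]
print(best_per_trip(rows))
```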

subPre = model.predict_proba(test[feature].fillna(-1).values)[:,1]  # predict on the test set

test_result = test[['r_key','endGo']].copy()  # candidate destinations for each test trip
test_result['predict'] = subPre  # add the probability column

# sort by predicted probability, descending
test_result = test_result.sort_values(['predict'],ascending=False)
# keep the first (most probable) row per r_key
test_result = test_result.drop_duplicates(['r_key'])
# split the predicted endGo into latitude and longitude
test_result = geoHashToLatLoc(test_result)

# build the submission file
submit = test_result[['r_key','predict_end_lat','predict_end_lon']]
submit.columns = ['r_key','end_lat','end_lon']
submit.to_csv('result.csv',index=False)
