Published: 2023-11-16 13:00
Ad fraud is one of the major challenges facing digital marketing: click fraud wastes large amounts of advertisers' money and distorts click data. This competition provides roughly 500,000 click records. Note in particular: the data was simulated, the meanings of certain features are hidden, and the data has been anonymized.
The task is to predict whether a user's click is a normal click or a fraudulent one. Click-fraud prediction applies to all kinds of feed ads, banner ads, and the Baidu Union platform, helping merchants identify click fraud and lock onto genuine users.
The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is fraudulent.
Field | Type | Description |
---|---|---|
sid | string | Sample ID / request session sid |
package | string | Media info: package name (encrypted) |
version | string | Media info: app version |
android_id | string | Media info: external ad-slot ID (encrypted) |
media_id | string | Media info: external media ID (encrypted) |
apptype | int | Media info: app category |
timestamp | bigint | Time the request reached the server, in ms |
location | int | User geolocation code (city level) |
fea_hash | int | User feature code (physical meaning withheld) |
fea1_hash | int | User feature code (physical meaning withheld) |
cus_type | int | User feature code (physical meaning withheld) |
ntt | int | Network type: 0-unknown, 1-wired, 2-WiFi, 3-cellular (unknown generation), 4-2G, 5-3G, 6-4G |
carrier | string | Device carrier: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
os | string | Operating system, android by default |
osv | string | Operating system version |
lan | string | Device language, Chinese by default |
dev_height | int | Device height |
dev_width | int | Device width |
dev_ppi | int | Screen resolution (ppi) |
label | int | Whether the click is fraudulent |
The label column tells us this is a binary classification task, which can be solved with classical machine learning or with an MLP.
The solution is split into two parts.
The outline of each modeling approach is given below; for details see the source code: gitee repo.
Machine learning mostly comes down to feature engineering plus battle-tested hyperparameters. To get a first baseline out quickly, we usually start with LightGBM (LGB): its biggest strength is being fast while staying accurate.
Missing values
Inspection shows that missing values appear in lan and osv.
# String (object) columns will need numeric encoding later (LabelEncoder)
object_cols = train.select_dtypes(include='object').columns
# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the feature columns
features = train.columns.tolist()
features.remove('label')
print(features)
Continuous vs. categorical features
Next, we look at which features are continuous and which are categorical. The conclusion: osv needs a conversion step, and fea_hash / fea1_hash are first reduced to their string lengths.
for feature in features:
    print(feature, train[feature].nunique())
How osv is processed
# Clean up osv
def trans_osv(osv):
    osv = (str(osv).replace(' ', '').replace('.', '')
                   .replace('Android_', '').replace('十核20G_HD', '')
                   .replace('Android', '').replace('W', ''))
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)
    # Normalize to a three-digit code, e.g. 8 -> 800, 81 -> 810
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10
    return int(result)
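A quick sanity check on a few illustrative inputs (not necessarily values from the dataset) shows how the mapping behaves:
# Illustrative inputs only; real osv strings may differ
for s in ['8.1.0', 'Android_9', '6.0-b1', 'nan']:
    print(s, '->', trans_osv(s))
# 8.1.0 -> 810, Android_9 -> 900, 6.0-b1 -> 600, nan -> 810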
Finally, apply the conversions to the training and test sets.
# Select feature columns (col is the feature-column list built during EDA)
features = train[col]
# Engineer fea_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very large, very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep the value itself
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)
test_features = test[col]
# Engineer fea_hash_len features
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
# Same treatment for the test set
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)
Modeling with default-parameter LGB yields a final score of 88.094.
# train['os'].value_counts()
# Train with LightGBM
import pandas as pd
import lightgbm as lgb
model = lgb.LGBMClassifier()
# Fit the model (timestamp and version are excluded in this first version)
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
# features['version'].value_counts()
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
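Before submitting, a quick offline hold-out check (not in the original notebook) can approximate the leaderboard score; a minimal sketch:
# Hold-out validation with AUC as a proxy metric
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X = features.drop(['timestamp', 'version'], axis=1)
tr_x, va_x, tr_y, va_y = train_test_split(X, train['label'], test_size=0.2, random_state=42)
m = lgb.LGBMClassifier()
m.fit(tr_x, tr_y)
print('validation AUC:', roc_auc_score(va_y, m.predict_proba(va_x)[:, 1]))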
Below is the list of approaches tried; see the model results at the end for version-by-version comparisons, and the source code: gitee repo.
The deep learning approach is built mainly on Baidu's PaddlePaddle framework.
The data processing module is largely the same as in the machine learning part, but because a neural network is used, the features must additionally be standardized after processing.
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# Load the data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Drop the first column (row index)
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train
# Object-type columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv
# ['os', 'osv', 'lan', 'sid']
features = train.columns.tolist()
features.remove('label')
print(features)
for feature in features:
    print(feature, train[feature].nunique())
# Clean the osv column
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        # Keep the major version, then glue the rest together as the minor part
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')
        return 0
# train['osv'] => LabelEncoder ?
# Fill missing values with the mode
train['osv'].fillna('8.1.0', inplace=True)
# Clean
train['osv'] = train['osv'].apply(osv_trans)
# Fill missing values with the mode
test['osv'].fillna('8.1.0', inplace=True)
# Clean
test['osv'] = test['osv'].apply(osv_trans)
# train['os'].value_counts()
train['lan'].value_counts()
# lan_map = {'zh-CN': 1, }
train['lan'].value_counts().index
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8, 'CN': 9,
           'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18,
           'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()
# Fill missing lan values with 22 (a new "unknown" category)
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)
remove_list = ['os', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col
# train['timestamp'].value_counts()
# train['timestamp'] = pd.to_datetime(train['timestamp'])
# train['timestamp']
from datetime import datetime
# lambda: a one-line anonymous function
# The raw timestamps are in milliseconds, so divide by 1000
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
# 1559892728241.7212
# 1559871800477.1477
# 1625493942.538375
# import time
# time.time()
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']
# Map non-numeric version strings to integers
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)
train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')
# Select feature columns
features = train[col]
# Engineer fea_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Thinking: why map very large, very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep the value itself
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features
test_features = test[col]
# Engineer fea_hash_len features
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
# Same treatment for the test set
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features
# Extract multi-scale time features from the training timestamps
# Build a DatetimeIndex
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # day of week
features['hour'] = temp.hour
features['minute'] = temp.minute
# Time elapsed since the earliest training timestamp, in days
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]
# Build a DatetimeIndex for the test set
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday  # day of week
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute
# Reuse start_time from the training set so both diffs share the same origin
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
# test_features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]
test_features['time_diff']
# test['version'].value_counts()
# features['version'].value_counts()
features['dev_height'].value_counts()
features['dev_width'].value_counts()
# Engineer a screen-area feature
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']
"""
Thinking: can dev_ppi and dev_area be combined into new features?
features['dev_ppi'].value_counts()
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
# features['ntt'].value_counts()
features['carrier'].value_counts()
features['package'].value_counts()
# version_osv: difference between the OS version (osv) and the app version
features['osv'].value_counts()
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)
# Standardize the features: fit the scaler on train, reuse it on test
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)
import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader
import numpy as np
paddle.device.set_device('gpu:0')
# Custom dataset wrapping a feature DataFrame and a label Series/DataFrame
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples
from sklearn.model_selection import train_test_split
train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)
train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])
train_dataloader = DataLoader(MineDataset(train_x, train_y),
                              batch_size=1024,
                              shuffle=True,
                              drop_last=True,
                              num_workers=2)
val_dataloader = DataLoader(MineDataset(val_x, val_y),
                            batch_size=1024,
                            shuffle=True,
                            drop_last=True,
                            num_workers=2)
# MineDataset indexes with .iloc, so wrap the scaled test matrix in a DataFrame;
# keep the test loader unshuffled and complete so predictions line up with sid
test_dataloader = DataLoader(MineDataset(pd.DataFrame(test_features1, columns=features.columns),
                                         pd.Series([0 for i in range(len(test_features1))])),
                             batch_size=1024,
                             shuffle=False,
                             drop_last=False,
                             num_workers=2)
The first version of the network uses only simple fully connected layers: a tower structure that narrows from 250 units down to 2, with a ReLU and a dropout layer between consecutive linear layers.
class ClassifyModel(nn.Layer):
    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()
        self.fc1 = nn.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.ReLU()
        self.drop1 = nn.Dropout(p=0.02)
        self.fc2 = nn.Linear(in_features=250, out_features=100)
        self.ac2 = nn.ReLU()
        self.drop2 = nn.Dropout(p=0.02)
        self.fc3 = nn.Linear(in_features=100, out_features=50)
        self.ac3 = nn.ReLU()
        self.drop3 = nn.Dropout(p=0.02)
        self.fc4 = nn.Linear(in_features=50, out_features=25)
        self.ac4 = nn.ReLU()
        self.drop4 = nn.Dropout(p=0.02)
        self.fc5 = nn.Linear(in_features=25, out_features=2)

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)
        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)
        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)
        # Return raw logits: cross_entropy below applies softmax internally,
        # so no extra Sigmoid/Softmax layer is needed here
        return self.fc5(x)
# Instantiate the model
model = ClassifyModel(int(len(features.columns)))
# Training mode
model.train()
# Optimizer
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()
EPOCHS = 10  # number of passes over the training set
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss (softmax + cross entropy on the logits)
        loss = nn.functional.cross_entropy(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        # Log every 20 iterations
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))
        # Backward pass
        avg_loss.backward()
        # Update parameters to minimize the loss
        opt.step()
        # Reset gradients
        opt.clear_grad()
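The listing stops at training; a minimal evaluation-and-submission sketch against the loaders defined above (an addition, not part of the original notebook) could look like this:
# Evaluate on the validation split, then predict the test set
model.eval()
with paddle.no_grad():
    correct, total = 0, 0
    for x_val, y_val in val_dataloader:
        pred = model(x_val).argmax(axis=1)
        correct += (pred == y_val.squeeze(-1)).astype('int64').sum().item()
        total += len(y_val)
    print('val acc:', correct / total)
    preds = []
    for x_test, _ in test_dataloader:
        preds.append(model(x_test).argmax(axis=1).numpy())
res = pd.DataFrame({'sid': test['sid'], 'label': np.concatenate(preds)})
res.to_csv('./paddle_baseline.csv', index=False)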
Again, for reasons of space, the remaining two approaches can be found in the source code: gitee repo.
Note: before using the Embedding models, first run Embedding分析.ipynb to generate the corresponding dictionary files.
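That notebook is not reproduced here; presumably it maps each categorical value to an integer index for the embedding layers. A heavily hedged sketch of such a dictionary build (the column names and output file name are assumptions, see the actual notebook):
# Hypothetical vocab build: value -> index per categorical column
import pickle
cat_cols = ['package', 'media_id', 'apptype', 'location']  # assumed columns
vocab = {c: {v: i + 1 for i, v in enumerate(train[c].unique())} for c in cat_cols}  # 0 reserved for unseen
with open('embedding_vocab.pkl', 'wb') as f:
    pickle.dump(vocab, f)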
Category | Model | Details | Score |
---|---|---|---|
ML | ML v1 | 1. Initial model. 2. Features excluded from modeling: ['os', 'version', 'lan', 'sid']. 3. Default-parameter LGB | 88.094 |
ML | ML v2 | 1. Based on v1. 2. Added version; simple use of timestamp. 3. Tried default-parameter LGB and XGB | 88.2133 |
ML | ML v3 | 1. Based on v2. 2. Added lan. 3. Difference feature between osv and version. 4. Battle-tested LGB parameters | 88.9487 |
ML | ML v4 | 1. Based on v3. 2. 5-fold LGB. 3. 5-fold XGB. 4. Blend | 89.0293 / 89.0253 / 89.054 |
ML | ML v5 | 1. Based on v3. 2. Added pixel ratio, pixel size, and pixel-to-resolution features. 3. 5-fold LGB. 4. 5-fold XGB. 5. Blend | 89.1873 / 89.108 / 89.1713 |
Paddle | Paddle v1 | 1. Feature engineering from ML v3. 2. Simple network built on Paddle | not submitted |
Paddle | Paddle v2 | 1. Based on Paddle v1. 2. Added embedding dictionary creation (in Embedding分析.ipynb). 3. Hybrid base model with embeddings | 88.71 |
Paddle | Paddle v3 | 1. Based on Paddle v2. 2. Added a DeepFM component, then merged | 87.816 |
TensorFlow | TF v1 | 1. Feature engineering from ML v3. 2. Simple network built on TensorFlow | not submitted |
FM | FM v1 | 1. First simple model based on FM | 57.2147 |
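For reference, the 5-fold LGB averaging behind ML v4 would look roughly like the sketch below (default parameters; the tuned parameters, the XGB folds, and the blending code are in the gitee repo):
# Hypothetical 5-fold LGB: average test-set probabilities across folds,
# assuming the ML-section features / test_features frames from above
from sklearn.model_selection import StratifiedKFold
import lightgbm as lgb
import numpy as np

X = features.drop(['timestamp', 'version'], axis=1)
y = train['label']
X_test = test_features.drop(['timestamp', 'version'], axis=1)
proba = np.zeros(len(X_test))
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, va_idx in skf.split(X, y):
    clf = lgb.LGBMClassifier()
    clf.fit(X.iloc[tr_idx], y.iloc[tr_idx])
    proba += clf.predict_proba(X_test)[:, 1] / skf.n_splits
pred = (proba > 0.5).astype(int)  # average the fold probabilities, then threshold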
Final leaderboard score
Source code:
https://gitee.com/turkeymz/coggle/tree/master/coggle_202112/mlp