Reproducing classification algorithms on the Elliptic Data Set (on Kaggle)

2020-03-04

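Loading the data
Data processing
1. Distribution of transactions over time
2. Selecting illicit and licit transactions
Training and testing models (with aggregated features)
1. Random forest
2. Logistic regression
Training and testing models (without aggregated features)
1. Random forest
2. Logistic regression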

Loading the data

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

import plotly.offline as py 
import plotly.graph_objs as go 
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score

df_classes = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
df_features = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_features.csv', header=None)
df_edgelist = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv')

df_classes.shape, df_features.shape, df_edgelist.shape,

> ((203769, 2), (203769, 167), (234355, 2))

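df_features has 167 columns. The first is the transaction id (an int, not the real on-chain txid) and the second is the time step, followed by 93 local (per-transaction) features and 72 aggregated features. Counting the time step, that makes 166 features.

There are about 203 thousand transactions (203,769) and about 234 thousand edges (234,355).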

Data processing

df_features has no header row, so add column names first:

df_features.columns = ['id', 'time step'] + [f'trans_feat_{i}' for i in range(93)] + [f'agg_feat_{i}' for i in range(72)]
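
As a quick sanity check on this naming scheme, the sketch below rebuilds the same list in plain Python and confirms that it has 167 unique entries, matching the 167 columns reported by `df_features.shape`. The name `columns` is used only for this illustration.

```python
# Rebuild the header list used above and check it against the column count.
columns = (
    ['id', 'time step']
    + [f'trans_feat_{i}' for i in range(93)]   # local (per-transaction) features
    + [f'agg_feat_{i}' for i in range(72)]     # aggregated features
)

print(len(columns))                       # 167
print(len(set(columns)) == len(columns))  # True: no duplicate names
```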

Check the distribution of time steps:

df_features['time step'].value_counts().sort_index()

> 1     7880
> 2     4544
> 3     6621
> 4     5693
> ...
> 46    3519
> 47    5121
> 48    2954
> 49    2454
> Name: time step, dtype: int64
>

The time steps run from 1 to 49.

Next, the class distribution:

df_classes['class'].value_counts()

> unknown    157205
> 2           42019
> 1            4545
> Name: class, dtype: int64
>

Merge the features with the classes:

df_features = pd.merge(df_features, df_classes, left_on='id', right_on='txId', how='left')

Make the class labels uniform by mapping 'unknown' to '0' (the labels remain strings):

df_features['class'] = df_features['class'].apply(lambda x: '0' if x == "unknown" else x)
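
To make the merge and relabelling concrete, here is a minimal sketch on two toy frames. The frames `toy_features` and `toy_classes` and their values are made up for illustration. It also shows why the merged frame ends up with two extra columns (`txId` and `class`), which accounts for the 169 columns seen further down.

```python
import pandas as pd

# Toy stand-ins for df_features and df_classes (values are made up).
toy_features = pd.DataFrame({'id': [10, 11, 12], 'time step': [1, 1, 2]})
toy_classes = pd.DataFrame({'txId': [10, 11, 12], 'class': ['unknown', '1', '2']})

# A left merge keeps every feature row and attaches its label.
merged = pd.merge(toy_features, toy_classes, left_on='id', right_on='txId', how='left')

# Same relabelling as above, written with replace() instead of apply().
merged['class'] = merged['class'].replace('unknown', '0')

print(merged['class'].tolist())  # ['0', '1', '2']
print(merged.shape)              # (3, 4): id, time step, txId, class
```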

Distribution of transactions over time

#plt.figure(figsize=(12, 8))
grouped = df_features.groupby(['time step', 'class'])['id'].count().reset_index().rename(columns={'id': 'count'})
sns.lineplot(x='time step', y='count', hue='class', data=grouped);
plt.legend(loc=(1.0, 0.8));
plt.title('Number of transactions in each time step by class');

Bar chart

count_by_class = df_features[["time step",'class']].groupby(['time step','class']).size().to_frame().reset_index()
illicit_count = count_by_class[count_by_class['class'] == '1']
licit_count = count_by_class[count_by_class['class'] == '2']
unknown_count = count_by_class[count_by_class['class'] == "0"]

x_list = list(range(1,50))
fig = go.Figure(data = [
    go.Bar(name="Unknown",x=x_list,y=unknown_count[0],marker = dict(color = 'rgba(120, 100, 180, 0.6)',
        line = dict(
            color = 'rgba(120, 100, 180, 1.0)',width=1))),
    go.Bar(name="Licit",x=x_list,y=licit_count[0],marker = dict(color = 'rgba(246, 78, 139, 0.6)',
        line = dict(
            color = 'rgba(246, 78, 139, 1.0)',width=1))),
    go.Bar(name="Illicit",x=x_list,y=illicit_count[0],marker = dict(color = 'rgba(58, 190, 120, 0.6)',
        line = dict(
            color = 'rgba(58, 190, 120, 1.0)',width=1)))
])
fig.update_layout(barmode='stack')
py.iplot(fig)
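
The bar chart code pairs each class's counts with `x_list` by position, which only lines up if every class has at least one transaction in every time step. A more defensive variant, sketched here on made-up toy data, pivots the counts into one column per class and fills missing combinations with 0. The names `toy` and `counts` are illustrative only.

```python
import pandas as pd

# Toy data: class '1' has no rows in time step 2 (values are made up).
toy = pd.DataFrame({
    'time step': [1, 1, 1, 2, 2, 3],
    'class':     ['0', '1', '2', '0', '2', '1'],
})

# One row per time step, one column per class; missing combinations become 0.
counts = (
    toy.groupby(['time step', 'class'])
       .size()
       .unstack(fill_value=0)
       .reindex(index=range(1, 4), fill_value=0)
)

print(counts['1'].tolist())  # [1, 0, 1]
print(counts['2'].tolist())  # [1, 1, 0]
```

With a frame like this, `counts.index` can serve as the x values and `counts['1']`, `counts['2']`, `counts['0']` as the y values of the three `go.Bar` traces.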

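Selecting illicit and licit transactions

data = df_features[(df_features['class']=='1') | (df_features['class']=='2')]
data.shape

> (46564, 169)

Only about 46 thousand transactions are labelled (4,545 illicit and 42,019 licit).

Selecting features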

tx_features = [f'trans_feat_{i}' for i in range(93)]
agg_features = [f'agg_feat_{i}' for i in range(72)]

X = data[tx_features+agg_features]
y = data['class']
y = y.apply(lambda x: 0 if x == '2' else 1 )

Split into training and test sets:

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=15,shuffle=False)
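
With `shuffle=False` the split is a plain slice: the first 70% of the rows go to training and the last 30% to testing. The sketch below reproduces the resulting sizes in plain Python, assuming scikit-learn rounds the test share up. `split_sizes` is a hypothetical helper, not a scikit-learn function. The sizes it yields for 46,564 rows agree with the `(32594, 93)` and `(13970, 93)` shapes printed later in this post.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a float test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(46564, 0.3)
print(n_train, n_test)  # 32594 13970

# With shuffle=False the split is a slice: first rows train, last rows test.
rows = list(range(10))
_, k = split_sizes(len(rows), 0.3)
train_rows, test_rows = rows[:-k], rows[-k:]
print(train_rows, test_rows)  # [0, 1, 2, 3, 4, 5, 6] [7, 8, 9]
```

Whether this amounts to a split by time depends on the rows being ordered by time step, which is not checked here.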

Training and testing models (with aggregated features)

Random forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, max_depth=100,random_state=15).fit(X_train,y_train)
preds = clf.predict(X_test)

prec,rec,f1,num = precision_recall_fscore_support(y_test,preds, average=None)
print("Random Forest Classifier")
print("Precision:%.3f \nRecall:%.3f \nF1 Score:%.3f"%(prec[1],rec[1],f1[1]))
micro_f1 = f1_score(y_test,preds,average='micro')
print("Micro-Average F1 Score:",micro_f1)

> Random Forest Classifier
> Precision:0.981 
> Recall:0.651 
> F1 Score:0.782
> Micro-Average F1 Score: 0.9772369362920544
>
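
The gap between the micro-averaged F1 (about 0.977) and the illicit-class recall (0.651) is easier to read with the definitions written out. For single-label classification, micro-averaged F1 equals plain accuracy, so the majority licit class dominates it. The sketch below computes the per-class metrics by hand on made-up labels. `binary_report` is a hypothetical helper written for this illustration.

```python
def binary_report(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the positive class, plus overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return precision, recall, f1, accuracy

# Made-up labels: 2 illicit (1) among 10 transactions, one of them missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
precision, recall, f1, accuracy = binary_report(y_true, y_pred)
print(precision, recall, round(f1, 3), accuracy)  # 1.0 0.5 0.667 0.9
```

Accuracy is 0.9 here even though half of the illicit transactions are missed, which is the same pattern as in the results above.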

Logistic regression

from sklearn.linear_model import LogisticRegression
reg = LogisticRegression().fit(X_train,y_train)
preds = reg.predict(X_test)
prec,rec,f1,num = precision_recall_fscore_support(y_test,preds, average=None)
print("Logistic Regression")
print("Precision:%.3f \nRecall:%.3f \nF1 Score:%.3f"%(prec[1],rec[1],f1[1]))
micro_f1 = f1_score(y_test,preds,average='micro')
print("Micro-Average F1 Score:",micro_f1)

> Logistic Regression
> Precision:0.454 
> Recall:0.633 
> F1 Score:0.529
> Micro-Average F1 Score: 0.928990694345025
>

Training and testing models (without aggregated features)

X = data[tx_features]#+agg_features]
y = data['class']
y = y.apply(lambda x: 0 if x == '2' else 1 )
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=15,shuffle=False)
X_train.shape, X_test.shape

> ((32594, 93), (13970, 93))

Random forest

> Random Forest Classifier
> Precision:0.909 
> Recall:0.648 
> F1 Score:0.757
> Micro-Average F1 Score: 0.9738010021474588
>

Logistic regression

> Logistic Regression
> Precision:0.433 
> Recall:0.653 
> F1 Score:0.521
> Micro-Average F1 Score: 0.9243378668575519
>
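
Only the outputs are shown for the runs without aggregated features; presumably the same training and reporting code was rerun on the reduced `X`. One way to avoid repeating that code is a small helper like the one sketched below. `evaluate` is a hypothetical name, and the data is synthetic stand-in data from `make_classification`, not the Elliptic features, so the numbers it prints will not match the results above. `max_iter=1000` is also an assumption, added here to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split

def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit the model and return class-1 metrics plus the micro-averaged F1."""
    preds = model.fit(X_train, y_train).predict(X_test)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_test, preds, average=None, labels=[0, 1], zero_division=0)
    return {'precision': prec[1], 'recall': rec[1], 'f1': f1[1],
            'micro_f1': f1_score(y_test, preds, average='micro')}

# Synthetic, imbalanced stand-in data; NOT the Elliptic features.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.9, 0.1], random_state=15)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=15, shuffle=False)

models = {
    'Random Forest Classifier': RandomForestClassifier(
        n_estimators=50, max_depth=100, random_state=15),
    'Logistic Regression': LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    print(name, evaluate(model, X_train, X_test, y_train, y_test))
```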

References:

https://www.kaggle.com/dhruvrnaik/illicit-transaction-detection

https://www.kaggle.com/artgor/elliptic-data-eda/
