
Reproducing classification algorithms on the Elliptic Data Set (on Kaggle)

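Contents
  1. Loading the data
  2. Data processing
    1. Transaction distribution over time
    2. Selecting illicit and licit transactions
  3. Training and testing the models (with aggregated features)
    1. Random forest
    2. Logistic regression
  4. Training and testing the models (without aggregated features)
    1. Random forest
    2. Logistic regression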

Loading the data

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

import plotly.offline as py
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score

df_classes = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
df_features = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_features.csv', header=None)
df_edgelist = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv')

df_classes.shape, df_features.shape, df_edgelist.shape,
> ((203769, 2), (203769, 167), (234355, 2))
>

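df_features has 167 columns: the first is the transaction id (an int, not the real on-chain txid) and the second is the time step, followed by 93 local (transaction) features and 72 aggregated features. Counting the time step, that makes 166 features.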

There are about 203,000 transactions (203,769) and about 234,000 edges (234,355).

Data processing

df_features has no header row, so add column names first:

df_features.columns = ['id', 'time step'] + [f'trans_feat_{i}' for i in range(93)] + [f'agg_feat_{i}' for i in range(72)]
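As a quick sanity check on the naming scheme, the sketch below rebuilds the same list of names and confirms that it has 167 unique entries, matching the 167 columns reported by `df_features.shape`. The variable names here are only for illustration.

```python
# Rebuild the column names used above and check them against the file layout.
local_names = [f'trans_feat_{i}' for i in range(93)]  # local (transaction) features
agg_names = [f'agg_feat_{i}' for i in range(72)]      # aggregated features
names = ['id', 'time step'] + local_names + agg_names

n_columns = len(names)      # should equal the number of columns in the features CSV
n_unique = len(set(names))  # duplicated names would make column selection ambiguous
print(n_columns, n_unique)
```

If the two numbers differ from the column count of the loaded frame, assigning `df_features.columns` raises a length-mismatch error rather than silently mislabeling columns.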

Check the distribution of time steps:

df_features['time step'].value_counts().sort_index()
> 1     7880
> 2 4544
> 3 6621
> 4 5693
> ...
> 46 3519
> 47 5121
> 48 2954
> 49 2454
> Name: time step, dtype: int64
>

The time steps run from 1 to 49.

Next, the class distribution:

df_classes['class'].value_counts()
> unknown    157205
> 2 42019
> 1 4545
> Name: class, dtype: int64
>

Merge the features with the classes:

df_features = pd.merge(df_features, df_classes, left_on='id', right_on='txId', how='left')
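A left merge keeps every feature row even when no label matches, so it can be worth checking the join. The sketch below runs the same kind of merge on a tiny made-up frame (the ids and labels are hypothetical, not taken from the dataset) using pandas' `validate` and `indicator` options.

```python
import pandas as pd

# Toy stand-ins for df_features and df_classes (values are made up).
features = pd.DataFrame({"id": [10, 11, 12], "time step": [1, 1, 2]})
classes = pd.DataFrame({"txId": [10, 11, 12], "class": ["unknown", "2", "1"]})

# validate raises MergeError if an id is duplicated on either side;
# indicator adds a _merge column recording where each row came from.
merged = pd.merge(
    features, classes,
    left_on="id", right_on="txId",
    how="left", validate="one_to_one", indicator=True,
)

# True only if every feature row found its label.
all_matched = bool((merged["_merge"] == "both").all())
print(merged.shape, all_matched)
```

Without `indicator`, the merge adds two columns (`txId` and `class`), which is why the merged frame in this post has 169 columns instead of 167.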

Convert the `unknown` label to '0' so that every class is a numeric string:

df_features['class'] = df_features['class'].apply(lambda x: '0' if x == "unknown" else x)
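Note that the labels stay strings after this step ('0', '1', '2'), which is why later cells compare against '1' and '2'. As a minimal sketch on a made-up sample, the same relabeling can be written with `replace`, and a hypothetical alternative maps straight to integers:

```python
import pandas as pd

labels = pd.Series(["unknown", "2", "1", "unknown"])  # made-up sample

# Same effect as the apply/lambda above: only 'unknown' changes, values stay strings.
as_strings = labels.replace({"unknown": "0"})

# Hypothetical alternative: map every label to a real integer.
as_ints = labels.map({"unknown": 0, "1": 1, "2": 2})

print(as_strings.tolist(), as_ints.tolist())
```

The integer variant would require changing the later string comparisons, so the post's string version is kept throughout.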

Transaction distribution over time

#plt.figure(figsize=(12, 8))
grouped = df_features.groupby(['time step', 'class'])['id'].count().reset_index().rename(columns={'id': 'count'})
sns.lineplot(x='time step', y='count', hue='class', data=grouped);
plt.legend(loc=(1.0, 0.8));
plt.title('Number of transactions in each time step by class');

[Figure: line plot of the number of transactions in each time step, by class]

Stacked bar chart:

count_by_class = df_features[["time step",'class']].groupby(['time step','class']).size().to_frame().reset_index()
illicit_count = count_by_class[count_by_class['class'] == '1']
licit_count = count_by_class[count_by_class['class'] == '2']
unknown_count = count_by_class[count_by_class['class'] == "0"]

x_list = list(range(1,50))
fig = go.Figure(data = [
go.Bar(name="Unknown",x=x_list,y=unknown_count[0],marker = dict(color = 'rgba(120, 100, 180, 0.6)',
line = dict(
color = 'rgba(120, 100, 180, 1.0)',width=1))),
go.Bar(name="Licit",x=x_list,y=licit_count[0],marker = dict(color = 'rgba(246, 78, 139, 0.6)',
line = dict(
color = 'rgba(246, 78, 139, 1.0)',width=1))),
go.Bar(name="Illicit",x=x_list,y=illicit_count[0],marker = dict(color = 'rgba(58, 190, 120, 0.6)',
line = dict(
color = 'rgba(58, 190, 120, 1.0)',width=1)))
])
fig.update_layout(barmode='stack')
py.iplot(fig)

[Figure: stacked bar chart of unknown, licit, and illicit transaction counts per time step]
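The bar chart code pairs each class's counts with `x_list = range(1, 50)`, which only lines up if every class appears in every time step. A hedged alternative, shown here on made-up data, is `pd.crosstab`, which fills missing combinations with zero:

```python
import pandas as pd

# Made-up sample: class '1' is absent from time steps 2 and 3.
toy = pd.DataFrame({
    "time step": [1, 1, 1, 2, 2, 3],
    "class":     ["0", "1", "2", "0", "2", "0"],
})

# Rows are time steps, columns are classes, missing pairs become 0.
counts = pd.crosstab(toy["time step"], toy["class"])
illicit_per_step = counts["1"].tolist()
print(counts)
```

Each column of `counts` then has one entry per time step and can be passed to `go.Bar` as `y` with `counts.index` as `x`.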

Selecting illicit and licit transactions

data = df_features[(df_features['class']=='1') | (df_features['class']=='2')]
data.shape
> (46564, 169)
>

Only about 46,000 transactions (46,564) are labeled.

Select the features; the label is encoded as 1 for illicit (class '1') and 0 for licit (class '2'):

tx_features = [f'trans_feat_{i}' for i in range(93)]
agg_features = [f'agg_feat_{i}' for i in range(72)]

X = data[tx_features+agg_features]
y = data['class']
y = y.apply(lambda x: 0 if x == '2' else 1 )

Split into training and test sets (70/30, without shuffling):

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=15,shuffle=False)
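With `shuffle=False`, `train_test_split` simply cuts the rows in their existing order, so `random_state` has no effect; the first 70% of rows become the training set and the last 30% the test set. If the rows are ordered by time step (an assumption worth verifying on the loaded frame), this amounts to training on earlier time steps and testing on later ones. A minimal sketch on ten ordered toy rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows whose values record their original position.
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)

# The training part is the first seven rows, the test part the last three.
train_rows = y_tr.tolist()
test_rows = y_te.tolist()
print(train_rows, test_rows)
```

On the 46,564 labeled rows the same rule yields 32,594 training rows and 13,970 test rows.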

Training and testing the models (with aggregated features)

Random forest

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=50, max_depth=100,random_state=15).fit(X_train,y_train)
preds = clf.predict(X_test)

prec,rec,f1,num = precision_recall_fscore_support(y_test,preds, average=None)
print("Random Forest Classifier")
print("Precision:%.3f \nRecall:%.3f \nF1 Score:%.3f"%(prec[1],rec[1],f1[1]))
micro_f1 = f1_score(y_test,preds,average='micro')
print("Micro-Average F1 Score:",micro_f1)
> Random Forest Classifier
> Precision:0.981
> Recall:0.651
> F1 Score:0.782
> Micro-Average F1 Score: 0.9772369362920544
>
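The precision, recall, and F1 values above are for the illicit class (index 1), while the micro-averaged F1 is computed over all test rows. For single-label classification, micro-averaged F1 equals plain accuracy, so with far more licit than illicit rows it stays high even when illicit recall is modest. A small sketch on made-up labels illustrates the gap:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_recall_fscore_support

# Made-up labels: six licit (0) and four illicit (1) transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Per-class scores; index 1 is the illicit class, as in the cells above.
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average=None)

micro = f1_score(y_true, y_pred, average="micro")
acc = accuracy_score(y_true, y_pred)

print("illicit precision/recall/F1:", prec[1], rec[1], f1[1])
print("micro F1:", micro, "accuracy:", acc)
```

For that reason the per-class F1 of the illicit class is the more informative number when comparing models on this dataset.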

Logistic regression

from sklearn.linear_model import LogisticRegression
reg = LogisticRegression().fit(X_train,y_train)
preds = reg.predict(X_test)
prec,rec,f1,num = precision_recall_fscore_support(y_test,preds, average=None)
print("Logistic Regression")
print("Precision:%.3f \nRecall:%.3f \nF1 Score:%.3f"%(prec[1],rec[1],f1[1]))
micro_f1 = f1_score(y_test,preds,average='micro')
print("Micro-Average F1 Score:",micro_f1)
> Logistic Regression
> Precision:0.454
> Recall:0.633
> F1 Score:0.529
> Micro-Average F1 Score: 0.928990694345025
>

Training and testing the models (without aggregated features)

X = data[tx_features]#+agg_features]
y = data['class']
y = y.apply(lambda x: 0 if x == '2' else 1 )
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=15,shuffle=False)
X_train.shape, X_test.shape
> ((32594, 93), (13970, 93))
>

Random forest

> Random Forest Classifier
> Precision:0.909
> Recall:0.648
> F1 Score:0.757
> Micro-Average F1 Score: 0.9738010021474588
>

Logistic regression

> Logistic Regression
> Precision:0.433
> Recall:0.653
> F1 Score:0.521
> Micro-Average F1 Score: 0.9243378668575519
>
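Both experiments follow the same steps (unshuffled 70/30 split, fit, score the illicit class) and differ only in the columns passed in. A minimal sketch of a reusable helper is shown below; `evaluate` is a hypothetical name, and the data is synthetic from `make_classification` rather than the Elliptic features, so the printed scores are not comparable to the results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split


def evaluate(model, X, y):
    """Hypothetical helper: unshuffled 70/30 split, fit, report class-1 scores."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, shuffle=False)
    preds = model.fit(X_tr, y_tr).predict(X_te)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_te, preds, average=None, labels=[0, 1], zero_division=0
    )
    return {
        "precision": float(prec[1]),
        "recall": float(rec[1]),
        "f1": float(f1[1]),
        "micro_f1": float(f1_score(y_te, preds, average="micro")),
    }


# Synthetic, imbalanced stand-in data (about 90% of rows in class 0).
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9], random_state=15)

models = {
    "random forest": RandomForestClassifier(n_estimators=50, random_state=15),
    "logistic regression": LogisticRegression(max_iter=1000),
}
results = {name: evaluate(model, X, y) for name, model in models.items()}
for name, scores in results.items():
    print(name, scores)
```

With the real data, the two experiments would become `evaluate(model, data[tx_features + agg_features], y)` and `evaluate(model, data[tx_features], y)`.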

References:

https://www.kaggle.com/dhruvrnaik/illicit-transaction-detection

https://www.kaggle.com/artgor/elliptic-data-eda/
