
Reproducing the Deanonymization of the Elliptic Data Set

Contents
  1. Loading the data
  2. Counting in-degree and out-degree from the edgelist
  3. Examining the relationship between features and in-degree
  4. Analyzing local feature 3
  5. Inferring the meaning of lf_0, lf_1, lf_3, lf_4, lf_5, lf_13
  6. Finding the real transactions

Related posts:

Reproducing classification algorithms on the Elliptic Data Set (on Kaggle)

Paper notes: Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics

The Elliptic Data Set: a Bitcoin transaction dataset

References:

BenZik's deanonymization scheme:

  1. Deanonymized results: https://www.kaggle.com/alexbenzik/deanonymized-995-pct-of-elliptic-transactions
  2. Discussion on the elliptic dataset page: https://www.kaggle.com/ellipticco/elliptic-data-set/discussion/117862
  3. Blog post describing the approach (in Russian): https://habr.com/ru/post/479178/

With the help of machine translation, I followed the write-up on BenZik's blog, ran the experiments on the original dataset, and essentially reproduced the whole process.

Loading the data

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

import plotly.offline as py
import plotly.graph_objs as go
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import f1_score

df_classes = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_classes.csv')
df_features = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_features.csv', header=None)
df_edgelist = pd.read_csv('/kaggle/input/elliptic-data-set/elliptic_bitcoin_dataset/elliptic_txs_edgelist.csv')

df_classes.shape, df_features.shape, df_edgelist.shape,

((203769, 2), (203769, 167), (234355, 2))

df_features.columns = ['id', 'time step'] + [f'trans_feat_{i}' for i in range(93)] + [f'agg_feat_{i}' for i in range(72)]
df_features = pd.merge(df_features, df_classes, left_on='id', right_on='txId', how='left')
df_features['class'] = df_features['class'].apply(lambda x: '0' if x == "unknown" else x)

Counting in-degree and out-degree from the edgelist

# Count in-degree and out-degree
import collections
in_d = collections.Counter()
out_d = collections.Counter()
for i in range(df_edgelist.shape[0]):
    s, d = df_edgelist.iloc[i]
    in_d[d] += 1
    out_d[s] += 1
print(len(in_d), len(out_d))
# Append in-degree and out-degree as new feature columns
i_d = []
o_d = []
for i in range(df_features.shape[0]):
    i_d.append(in_d[df_features.iloc[i]['id']])
    o_d.append(out_d[df_features.iloc[i]['id']])
df_features['i_d'] = i_d
df_features['o_d'] = o_d
len(i_d), len(o_d)

148447 166345
(203769, 203769)
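The row-by-row loop above is correct but slow on 234k edges and 203k nodes. A vectorized equivalent is sketched below, assuming the edgelist columns are named txId1 and txId2 as in the published CSV:

# Vectorized degree computation with pandas (assumes columns txId1, txId2)
in_counts = df_edgelist['txId2'].value_counts()
out_counts = df_edgelist['txId1'].value_counts()
df_features['i_d'] = df_features['id'].map(in_counts).fillna(0).astype(int)
df_features['o_d'] = df_features['id'].map(out_counts).fillna(0).astype(int)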

As a sanity check, also verify that no edge crosses time steps:

# Check whether any edge crosses time steps
# Build a map from id to time step
id2time_step = {}
for i in range(df_features.shape[0]):
    id2time_step[df_features.iloc[i]['id']] = df_features.iloc[i]['time step']
for i in range(df_edgelist.shape[0]):
    s, d = df_edgelist.iloc[i]
    if id2time_step[s] != id2time_step[d]:
        print(i, s, d)
# Result: nothing is printed, so no edge crosses a time step
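The same check can be written without an explicit loop; a minimal vectorized sketch (again assuming the txId1/txId2 column names):

# Map both endpoints of every edge to their time steps and compare
id2ts = df_features.set_index('id')['time step']
src_ts = df_edgelist['txId1'].map(id2ts)
dst_ts = df_edgelist['txId2'].map(id2ts)
print((src_ts != dst_ts).sum())  # 0: no edge crosses a time step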

Examining the relationship between features and in-degree

# Bar chart: correlation of each of the first 20 local features with in-degree
plt.figure(figsize=(15, 10))
plt.bar([f'lf_{i}' for i in range(20)],
        df_features[[f'trans_feat_{i}' for i in range(20)]].corrwith(df_features['i_d']))

Figure: correlation between in-degree and the first 20 local features

lf stands for local feature; these are just the trans_feat_ columns.

Here lf_3 is what the Russian blog calls V6: its V numbering starts at 1 (with V1 being the id and V2 the time step), while lf numbering starts at 0.

The blog reports cor(in-degree, V6) = 0.589, whereas I get 0.508025. There is some discrepancy, and I haven't figured out the cause yet.
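For reference, the quoted value can be computed directly from the columns built above:

# Pearson correlation between lf_3 and in-degree
print(df_features['trans_feat_3'].corr(df_features['i_d']))  # ≈ 0.508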

1
2
3
4
plt.figure(figsize=(10,10))
plt.axes()
#plt.scatter(df_features['trans_feat_3'] - df_features['trans_feat_3'].min(), df_features['i_d'])
plt.loglog(df_features['trans_feat_3'] - df_features['trans_feat_3'].min(), df_features['i_d'] , '.')

Figure: log-log plot of lf_3 against in-degree

Analyzing local feature 3

# Collect the distinct values of trans_feat_3, sort them,
# and tally the gaps between consecutive values
tf3cnt = collections.Counter(df_features['trans_feat_3'])
l = list(tf3cnt.items())
l.sort()
diff_trans_feat_3 = []
for i in range(1, len(l)):
    diff_trans_feat_3.append(l[i][0] - l[i-1][0])
dtf3cnt = collections.Counter(['%.5f' % x for x in diff_trans_feat_3])
dtf3list = list(dtf3cnt.items())
dtf3list.sort()
dtf3list

[('0.07504', 202),
 ('0.15008', 36),
 ('0.22511', 20),
 ('0.30015', 9),
 ('0.37519', 10),
 ('0.45023', 4),
 ('0.52526', 3),
 ('0.60030', 3),
 ('0.67534', 3),
 ('0.75038', 2),
 ('0.82541', 2),
 ('1.05053', 1),
 ('1.12556', 1),
 ('1.65083', 1),
 ('1.72586', 2)]

This is the table from the blog post: every gap between consecutive lf_3 values is an integer multiple of 0.07504, so lf_3 is an integer-valued quantity under an affine transform.

lf_3 gap (interval)   count   gap / 0.07504
0.07504               202     1
0.15008               36      2
0.22511               20      3
0.30015               9       4
0.37519               10      5
0.45023               4       6
0.52526               3       7
0.60030               3       8
0.67534               3       9
0.75038               2       10
0.82541               2       11
1.05053               1       14
1.12556               1       15
1.65083               1       22
1.72586               2       23

Inferring the meaning of lf_0, lf_1, lf_3, lf_4, lf_5, lf_13

Since lf_3 is quantized with step min(diff_lf3), and the smallest possible input count is 1, the input count can be recovered as: input_count_lf3 = (lf3 - min(lf3)) / min(diff_lf3) + 1.

After a similar analysis of lf_4, lf_5, and lf_13 (see the decoding sketch after this list):

  • input_count_lf3 = 13.3266685112665 * lf3 + 2.62544842444139
  • input_unique_count_lf5 = 11.9243179897452 * lf5 + 2.34747189219164
  • outputs_count_lf4 = 50.3777694891647 * lf4 + 4.21030186142152
  • outputs_unique_count_lf13 = 49.3957564403755 * lf13 + 4.121809499973
  • fee_lf1 = 81341.4537626213 + 386323.710952989 * lf1
  • total_out_value_lf0 = 2742460603.92287 + 15853961614.9796 * lf0
  • approx_time = 1450468509.80488 + 1155672.19512195 * elliptic_time
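Putting the fitted coefficients together, a minimal decoding sketch (decode_features is my own helper name; the coefficients are exactly those listed above):

# Decode anonymized local features back to raw transaction fields
# using the affine coefficients estimated above.
def decode_features(row):
    return {
        'input_count':         13.3266685112665 * row['trans_feat_3'] + 2.62544842444139,
        'input_unique_count':  11.9243179897452 * row['trans_feat_5'] + 2.34747189219164,
        'output_count':        50.3777694891647 * row['trans_feat_4'] + 4.21030186142152,
        'output_unique_count': 49.3957564403755 * row['trans_feat_13'] + 4.121809499973,
        'fee':                 81341.4537626213 + 386323.710952989 * row['trans_feat_1'],
        'total_out_value':     2742460603.92287 + 15853961614.9796 * row['trans_feat_0'],
        'approx_time':         1450468509.80488 + 1155672.19512195 * row['time step'],
    }

print(decode_features(df_features.iloc[0]))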

Finding the real transactions

With the decoded fields above, the real txids of 92.9% of the transactions can be found. Graph structure then recovers another 6.6%.

In total, 99.5% of the transactions were identified: 202805 of the 203769 transactions in the Elliptic Data Set, leaving 965 undetermined.
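BenZik's matching code is not published; the idea is to look up, for each decoded transaction, the real blockchain transactions whose input count, output count, fee, and approximate timestamp all agree. A hypothetical sketch (candidate_txs and its column names n_inputs, n_outputs, fee, time are assumptions for illustration, not a real API):

# Hypothetical matcher: candidate_txs is assumed to be a DataFrame of real
# blockchain transactions with columns txid, n_inputs, n_outputs, fee, time.
def find_candidates(decoded, candidate_txs, time_window=14 * 24 * 3600):
    hits = candidate_txs[
        (candidate_txs['n_inputs'] == round(decoded['input_count'])) &
        (candidate_txs['n_outputs'] == round(decoded['output_count'])) &
        ((candidate_txs['fee'] - decoded['fee']).abs() < 1) &
        ((candidate_txs['time'] - decoded['approx_time']).abs() < time_window)
    ]
    return hits  # a single hit pins down the real txid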

pandas DataFrame reference:

https://www.jianshu.com/p/8024ceef4fe2
