用 SQL 获取 bigquery 比特币数据 - 使用 Google Big Query API 处理比特币数据

2020-02-16

创建查询对象
一次查询
数据格式
1. transactions 表
  1. 查寻第 50 万块中所有交易

之前写过一篇，主要是总（ban）结（运）了 kaggle 上入门 kernal 的代码：
Kaggle 上的比特币链上数据集 - 使用 Google Big Query API 处理比特币数据

但是实际上这个数据集相关的 kernal 不算很多，很多想要做的东西缺乏参考，需要自己探索。本次文章记录了我探索 bigquery 数据的过程。

创建查询对象

1 2	from google.cloud import bigquery client = bigquery.Client()

一次查询

一段废话：很久没有写过 SQL 了，第一次写 SQL 是大一自学 PHP 后开发了一个用于统计我们专业期末成绩的网站。结果最终版的代码在 Azure 欠费后还都丢掉了。后来上数据库原理不喜欢那个老师，也没好好学，再后来用 django 都是 orm 了，更写 SQL。后来去中智实习，做 ETL 这是第一次实践中用 SQL 才明白 SQL 有这么强大的能力。但是现在也都忘了。

写个最简单，查询 0 号区块的所有信息。

query = """
SELECT * FROM bigquery-public-data.crypto_bitcoin.blocks where number=0
"""
r = client.query(query)

看一下 QueryJob 对象 r 有哪些方法和属性：

1	type(r), dir(r)

(google.cloud.bigquery.job.QueryJob,
[
‘job_id’,
‘job_type’,
‘labels’,
‘location’,
‘path’,
‘query’,
‘query_parameters’,
‘referenced_tables’,
‘state’,
‘table_definitions’,
‘timeline’,
‘to_dataframe’,
‘total_bytes_billed’,
‘total_bytes_processed’,
‘user_email’,
])

有很多，这里就选了一些可能用的着的。
看一下它们的值（这代码是在 kaggle notebook 里输入的，所以不用 print 也能返回结果。）

1	r.state, r.job_id, r.labels, r.location, r.path, r.query, r.referenced_tables, r.state, r.timeline, r.total_bytes_billed, r.total_bytes_processed, r.user_email

(‘DONE’,
‘ee67ed4d-80d4-4abf-bab4-522d33b52438’,
{},
‘US’,
‘/projects/kaggle-161607/jobs/ee67ed4d-80d4-4abf-bab4-522d33b52438’,
‘\nSELECT * FROM `bigquery-public-data.crypto_bitcoin.blocks` where number=0\n’,
[TableReference(DatasetReference(‘bigquery-public-data’, ‘crypto_bitcoin’), ‘blocks’)],
‘DONE’,
[<google.cloud.bigquery.job.TimelineEntry at 0x7faad8e9e860>],
196083712,
196019375,
'data-proxy-public@kaggle-bq.iam.gserviceaccount.com‘)

这些都是一些状态信息之类的，最重要的是数据：
r.to_dataframe 是函数，返回一个 pandas.core.frame.DataFrame 对象。

1 2	df = r.to_dataframe() df.head()

至此已经得到比特币原始数据了。

数据格式

这里大致分析一下我们可以在 bigquery 拿到哪些比特币的数据。

一共有 4 个表： blocks、inputs、outputs、transactions

数量集应该是这样：至今（2020 年 2 月 16 日）比特币有 61 万多区块，交易数量估计是 6 亿左右，而输入输出可能有 10 亿。

但是他这个数据并不是最新的。所以没有这么多数据。

这里面最关键的是 transactions 表。

transactions 表

一共 17 列

hash 交易哈希 String
size 交易大小（bytes) Numeric
virtual_size 好像隔离见证交易这个字段会与 size 不同 Numeric
version 协议版本 Numeric
lock_time 矿工可以把交易纳入 Merkel 树的最早时间 Numeric
block_hash 交易属于的块的哈希 String
block_number 交易属于的区块的编号 Numeric
block_timestamp 区块时间戳 Date
block_timestamp_month 区块时间的月 Date
input_count 输入数量 Numeric
output_count 输出数量 Numeric
input_value 输入值总和 Numeric
output_value 输出值总和 Numeric
is_coinbase 是否是铸币交易 Boolean
fee 交易手续费 Numeric
inputs 交易的输入数组数组
outputs 交易的输出数组数组

查寻第 50 万块中所有交易

query = """
SELECT * FROM `bigquery-public-data.crypto_bitcoin.transactions` where block_number=500000
"""
transactions_r = client.query(query)
tx_df = transactions_r.to_dataframe()
transactions_r.total_bytes_processed / (1024 ** 3)

这里显示： 1076.8477526148781 难道查询一个区块的交易就处理的 1 T 多的数据吗？有点奇怪。

可以使用

1
2
3

for i, row in tx_df.iterrows():
    print(i, row['hash'], row['block_timestamp'])
    break

遍历 dataframe 对象

我的博客

用 SQL 获取 bigquery 比特币数据 - 使用 Google Big Query API 处理比特币数据

创建查询对象

一次查询

数据格式

transactions 表

查寻第 50 万块中所有交易

About

Categories

Tags

Tag Cloud

Archives

Recents