I want to end address reuse in the Bitcoin blockchain.

Today, I’m releasing source code for a tool that collects data about address reuse in the blockchain and visualizes it. I’m also releasing visualizations of the data for the first 200,000 blocks in the blockchain (up to September 22, 2012). This is my request for comment: please (politely) critique my approach. Feedback about the quality of the code, feature requests, and offers to contribute are also welcome; this is my first Big Data project and my first largish Python project. Keep in mind that this is an initial push of the code and that more cleanup is planned.

https://github.com/kristovatlas/address-reuse-tracker

Data Sources

Three sources of data were used to compile these visualizations:

  • The Bitcoin blockchain, downloaded with bitcoind. This is how we find instances of address reuse.
  • Blockchain.info’s “relayed by” field in its block explorer API. This is how we can identify which address-reusing transactions were broadcast to the Bitcoin network using the Blockchain.info push_tx API.
  • WalletExplorer.com for clustering analysis of addresses. This is how we can guess which addresses are associated with particular Bitcoin services.

Address Reuse

Address reuse is considered in two forms:

  1. Sending funds to an address that already has a prior transaction history. (Depicted as the large, red area in the graphs below.)
  2. Using an input address for change. This is a subset of the above form. (Depicted as the smaller, sky blue area in the graphs below.)
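
To make the two forms concrete, here is a minimal sketch of how they could be counted. It is not the code from the repository: the `Tx` and `ReuseCounts` types and the `count_address_reuse` function are hypothetical names made up for illustration, and the sketch assumes transactions arrive in chronological order and have already been reduced to lists of input and output addresses.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Tx:
    """Hypothetical, simplified transaction: only the addresses involved."""
    input_addresses: List[str]
    output_addresses: List[str]


@dataclass
class ReuseCounts:
    total_outputs: int = 0
    sent_to_used_address: int = 0  # form 1
    change_to_sender: int = 0      # form 2, a subset of form 1


def count_address_reuse(transactions: Iterable[Tx]) -> ReuseCounts:
    """Count both forms of reuse over transactions in chronological order."""
    seen = set()  # every address that appeared in an earlier transaction
    counts = ReuseCounts()
    for tx in transactions:
        senders = set(tx.input_addresses)
        for address in tx.output_addresses:
            counts.total_outputs += 1
            if address in senders:
                # Paying back to an input address counts as both forms.
                counts.sent_to_used_address += 1
                counts.change_to_sender += 1
            elif address in seen:
                counts.sent_to_used_address += 1
        # Update history only after the whole transaction is examined.
        seen.update(senders)
        seen.update(tx.output_addresses)
    return counts


history = [
    Tx([], ["A"]),          # coinbase-style: no input address, pays A
    Tx(["A"], ["B", "A"]),  # A pays B and sends change back to A
    Tx(["B"], ["A", "C"]),  # B pays A, which already has a history
]
result = count_address_reuse(history)
print(result)
```

In this toy history, two of the five outputs go to an address with a prior history, and one of those two is change returned to an input address. An address repeated within the outputs of a single transaction is not counted as reuse here; that is a simplification of this sketch.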

This project gathers statistics about three kinds of address reuse implication:

  1. A Bitcoin service was used to send funds, resulting in address reuse.
  2. A Bitcoin service was used to receive funds, resulting in address reuse.
  3. A Bitcoin client was used to send funds, resulting in address reuse.
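
One way to picture how the data sources could combine into these three categories is the sketch below. It is an illustration under assumptions, not the repository's logic: `attribute_reuse`, the `wallet_labels` mapping (standing in for clustering results), and the `client_label` argument (standing in for whatever the "relayed by" data resolves to) are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def attribute_reuse(
    input_addresses: Iterable[str],
    reused_address: str,
    wallet_labels: Dict[str, str],
    client_label: Optional[str] = None,
) -> Set[Tuple[str, str]]:
    """Tag one address-reusing output with the parties implicated.

    wallet_labels: hypothetical address -> service name mapping.
    client_label:  hypothetical name of the client that broadcast the
                   transaction, or None when unknown.
    """
    tags = set()
    # 1. A service sent the funds: an input address belongs to its cluster.
    for address in input_addresses:
        if address in wallet_labels:
            tags.add(("service_sent", wallet_labels[address]))
    # 2. A service received the funds: the reused address is in its cluster.
    if reused_address in wallet_labels:
        tags.add(("service_received", wallet_labels[reused_address]))
    # 3. A client was used to send the funds.
    if client_label is not None:
        tags.add(("client_sent", client_label))
    return tags


labels = {"A": "ExampleExchange", "B": "ExampleCasino"}
tags = attribute_reuse(["A"], "B", labels, client_label="Blockchain.info")
print(sorted(tags))
```

A single reusing output can implicate more than one party, which is why the sketch returns a set of tags rather than a single label.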

My Goals for the Project

The goal for this project is for researchers to be able to synch up with the blockchain quickly and remain in synch, collect statistics about address reuse in the blockchain, and meaningfully visualize the results. We can use this to help identify problematic services within the Bitcoin ecosystem, work toward drastically decreasing address reuse, and track our progress over time.

Limitations

There are at least two important limitations to this data.

  • Clustering analysis is somewhat naive and is confounded by, for example, importing private keys that belong to wallets that have been used in CoinJoins. An obvious example of this is the cluster labeled “MtGoxAndOthers.” Many addresses continue to join this cluster on a daily basis long past the dissolution of the company; this is likely because MtGox for a time allowed users to import private keys.
  • Multiple services, including Blockchain.info itself, use the Blockchain.info push_tx API, but this code is not (yet) able to distinguish between them. Therefore, transactions tagged as “Blockchain.info” may have been created using one of Blockchain.info’s wallet clients or by other services.

Blocks 0 to 100k

Blocks 0 to 200k

Blocks 100k to 200k

An Open Source Python Framework for Blockchain Analysis

If there is sufficient interest in the research community, I’m strongly considering factoring out the context-specific code in this code base to create an open source framework for gathering statistics about the blockchain. It was a non-trivial amount of work to tackle this problem, and I’d like to save others the work. Some features that this code already provides, and that I think any open source analytics framework should provide:

  • The ability to analyze blockchain data systematically, e.g. processing one block at a time.
  • Support for multiple sources of data, including a local copy of the blockchain, remote blockchain explorer APIs, and other remote APIs.
  • The ability to segregate different types of analysis into multiple threads or processes to be run simultaneously.
  • Fast database interaction to query or update the database of blockchain analysis. The goal should be to complete processing the blockchain until caught up on a single, powerful machine in a matter of days.
  • The ability to visualize the data collected.
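
As a rough sketch of the first two points, block-at-a-time processing over interchangeable data sources might look like the following. Every name here (`BlockSource`, `InMemorySource`, `process_chain`, the analyzer signature) is hypothetical and is not taken from the repository; a real source would wrap a local node or a remote API, and this single-process loop does not attempt the parallelism or database work described above.

```python
from typing import Callable, Dict, List, Protocol


class BlockSource(Protocol):
    """Anything that can serve blocks by height (local node, remote API)."""
    def tip_height(self) -> int: ...
    def get_block(self, height: int) -> dict: ...


class InMemorySource:
    """Toy source backed by a list, standing in for a real data source."""
    def __init__(self, blocks: List[dict]) -> None:
        self._blocks = blocks

    def tip_height(self) -> int:
        return len(self._blocks) - 1

    def get_block(self, height: int) -> dict:
        return self._blocks[height]


Analyzer = Callable[[int, dict], None]


def process_chain(source: BlockSource, analyzers: List[Analyzer],
                  next_height: int = 0) -> int:
    """Run every analyzer on each block from next_height up to the tip.

    Returns the next unprocessed height so a later call can resume there.
    """
    tip = source.tip_height()
    while next_height <= tip:
        block = source.get_block(next_height)
        for analyze in analyzers:
            analyze(next_height, block)
        next_height += 1
    return next_height


tx_counts: Dict[int, int] = {}


def count_txs(height: int, block: dict) -> None:
    tx_counts[height] = len(block["tx"])


blocks = [{"tx": ["t0"]}, {"tx": ["t1", "t2"]}]
source = InMemorySource(blocks)
resume_at = process_chain(source, [count_txs])

blocks.append({"tx": ["t3"]})  # a new block arrives
resume_at = process_chain(source, [count_txs], resume_at)
print(tx_counts, resume_at)
```

Returning the next height is what lets a run stay in sync: after the first pass, a later call only processes blocks that arrived since.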

I’m assuming that there aren’t superior open-source frameworks available or soon to be available. If you are working on such a framework, please drop me a line.

Resources for Blockchain Analysis Programmers

Several of my correspondents on Twitter expressed interest in a mailing list dedicated to discussing the programming and engineering challenges of analyzing the blockchain. Request to join here (requires a Google account): https://groups.google.com/forum/#!forum/blockchain-analysis-dev/