What is the content of a blockchain?
A blockchain is an immutable, append-only distributed database. Working with a blockchain means not only understanding these unique features, but also knowing what data is stored on it. What does blockchain data look like, i.e. what is its content and how is it formatted?
As the word suggests, a blockchain is a chain of blocks, and the structure is similar across the major existing chains such as Bitcoin, Ethereum, and other Proof-of-Work networks. To keep things simple, this article focuses on the Ethereum blockchain only. Technical details that are not actual data but mainly support the functioning of the network (e.g. uncles) are intentionally left out.
Each block consists of a header and the transactions mined with the block. The chain is formed by hashing the previous block’s header and including that hash in the new block’s header.
Besides this hash of the previous header, an Ethereum block header also contains the state root and the roots of the transaction and receipt tries – see here for details.
Of course, there is also the block number (the ever-increasing “block height”), a timestamp, and information about the gas limit and gas usage. In addition, there are fields related to the mining process (the miner’s address, difficulty, nonce, etc.).
Check out this post for some diagrams and more details.
Now let’s have a look at the transactions that are bundled into each block from the pool of pending transactions. Besides the (account) nonce, the gas price, and the gas limit, transactions are comprised of a receiving “to” address, a sending “from” address, an ETH value in WEI, and a data field. The transaction can then be sent to the network and tracked by a 256-bit transaction ID which is the hash of the transaction.
For more details and an example, check out this informative article by CodeTract.
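The fields listed above can be sketched as a plain Python dictionary. This is a minimal illustration, not a client library call – the addresses below are made-up placeholders:

```python
# A minimal sketch of the fields that make up an Ethereum transaction.
# The "from"/"to" addresses are placeholders, not real accounts.

def eth_to_wei(eth: int) -> int:
    """Convert whole ETH to wei (1 ETH = 10**18 wei)."""
    return eth * 10**18

tx = {
    "nonce": 42,                 # account nonce: number of txs sent by the sender
    "gasPrice": 20 * 10**9,      # 20 gwei, expressed in wei
    "gas": 21000,                # gas limit (21000 suffices for a plain transfer)
    "from": "0x" + "aa" * 20,    # placeholder sender address (20 bytes)
    "to": "0x" + "bb" * 20,      # placeholder recipient address (20 bytes)
    "value": eth_to_wei(1),      # 1 ETH, denominated in wei
    "data": "0x",                # empty for a simple value transfer
}
```

Once signed and broadcast, this transaction would be tracked by its 256-bit transaction ID, the Keccak-256 hash of the signed transaction.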
How is smart contract interaction data stored in Ethereum?
As opposed to Bitcoin, the Ethereum blockchain offers more than the simple transfer of value via end-to-end transactions. Running Turing-complete code is the key differentiator for Ethereum, and therefore the most interesting data lies in the smart contract interactions of the “world computer”.
To be able to run code, Ethereum provides a virtual machine called the Ethereum Virtual Machine (EVM). It abstracts away the underlying hardware so that smart contracts can run on any computer that runs an Ethereum node. A smart contract is just a fancy word for a program written in a programming language and compiled for the EVM.
So let’s have a closer look at these interactions now: Smart contracts generate logs by firing events whenever a function is called by an external account (transaction) or another smart contract (internal transaction). Events can therefore be described generally as asynchronous triggers with data. Asynchronous, because the log is only written once the originating transaction has been mined into a block. More details on events and logs can be found in the Solidity documentation.
The most important use case for events is to provide smart contract return values for a user interface. Logs can also be used as a cheaper form of storage – as described in more detail here.
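To make the log structure concrete, here is a hedged sketch of decoding an ERC-20 Transfer event log. The first topic is the well-known Keccak-256 hash of the event signature `Transfer(address,address,uint256)`; the indexed `from`/`to` addresses sit in the remaining topics, while the non-indexed value sits in the data field. The log used in the example is constructed with placeholder addresses:

```python
# Sketch: decoding an ERC-20 Transfer event log.
# topics[0] is keccak256("Transfer(address,address,uint256)"),
# the well-known Transfer event signature hash.
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def decode_transfer_log(log: dict) -> dict:
    if log["topics"][0] != TRANSFER_TOPIC:
        raise ValueError("not a Transfer event")
    return {
        # indexed address arguments are left-padded to 32 bytes in topics
        "from": "0x" + log["topics"][1][-40:],
        "to": "0x" + log["topics"][2][-40:],
        # the non-indexed uint256 value lives in the data field
        "value": int(log["data"], 16),
    }
```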
To understand how the function is called, one must look at what makes up the (optional) data field in a transaction. It can contain arbitrary data, but most often it encodes a function call to a smart contract.
The transaction is targeting the smart contract by using its address in the “to” field.
To know which function the transaction is calling within the smart contract, the contract’s functions must be known beforehand, e.g. via its ABI. The first 4 bytes of the data field are the first 4 bytes of the Keccak-256 hash of the function signature – the so-called function selector. This is followed by 32 bytes for each (statically-sized) argument of the function. In essence, this means that data fields in function calls are ABI-encoded. To use them and interpret them, they need to be decoded.
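As a concrete sketch, the calldata of an ERC-20 `transfer(address,uint256)` call can be decoded with nothing but string slicing. The 4-byte selector `a9059cbb` is the well-known first 4 bytes of `keccak256("transfer(address,uint256)")`; the calldata used below is made up for illustration:

```python
# Sketch: decoding the data field of an ERC-20 transfer(address,uint256) call.
# 0xa9059cbb is the well-known function selector for transfer().
TRANSFER_SELECTOR = "a9059cbb"

def decode_transfer_call(data_hex: str):
    data = data_hex[2:] if data_hex.startswith("0x") else data_hex
    selector, args = data[:8], data[8:]          # 4 bytes = 8 hex characters
    if selector != TRANSFER_SELECTOR:
        raise ValueError("not a transfer() call")
    to_addr = "0x" + args[:64][-40:]             # address: last 20 bytes of slot 1
    amount = int(args[64:128], 16)               # uint256 in the second 32-byte slot
    return to_addr, amount
```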
For details on how to interpret the topic and data fields in a transaction receipt log see here.
Reading from the Ethereum blockchain is hard
So now that you know what is stored on the Ethereum blockchain, I think it has become clear how difficult it is to extract actionable insights out of it. Several things make it particularly hard:
hexadecimal hashes instead of human-readable text
sequential nature of the data
slow JSON-RPC interface
The most obvious difficulty is the fact that pretty much everything consists of hexadecimal hashes instead of clear-text labels. For account addresses, this can be seen as a feature allowing pseudonymity. But with regards to smart contract interactions, the data needs to be converted into a human-readable format.
Another aspect is the serialized nature of the data. Only for very few use cases is it possible to read the answer from the blockchain within a single query. Most often, you’ll have to traverse the chain with multiple requests for simple tasks such as displaying the transaction history of an account. Now, think about calculating an average gas price over the history of the blockchain, or monitoring the current state with regards to a specific token…
And lastly, consider the interface for querying the data – it is super slow. Before you can even start, you’ll need to set up an archive node (e.g. Geth or Parity) to have all historical data available. That data amounts to about 1.7 TB as of November 2018 – and you need to store it on SSDs to get the node running reliably. Even then, syncing the Ethereum Mainnet can take several days to weeks. Once you have the node in sync, you can only query it via the JSON-RPC API. This is of course really slow, particularly when considering the multitude of calls you need to make due to the serial nature of the data.
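To illustrate how many round trips even a simple history scan costs, here is a sketch that only builds the JSON-RPC request payloads for `eth_getBlockByNumber` (a standard Ethereum RPC method) without sending them anywhere – in practice each payload would be POSTed to your node’s RPC endpoint, one call per block:

```python
import json

# Sketch: constructing the JSON-RPC payloads needed to walk the chain
# block by block. No network calls are made here; this only shows the
# request shape and the sheer number of sequential round trips required.
def block_request(number: int, request_id: int) -> str:
    return json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_getBlockByNumber",
        # block numbers are hex-encoded; True requests full transaction objects
        "params": [hex(number), True],
        "id": request_id,
    })

# Scanning just 1000 blocks already means 1000 sequential requests.
payloads = [block_request(n, i) for i, n in enumerate(range(6_000_000, 6_001_000))]
```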
How to speed up reading blockchain data
To monitor smart contract development, either on a testnet or on the Ethereum Mainnet, one needs to set up a node and create some sort of index – in other words, a database.
There are essentially two options for doing so: a) index the whole blockchain, or b) limit the amount of data that is drained from the node into the index. Which approach to choose is a question of balancing resource consumption against flexibility constraints. Either way, the necessity for a mechanism to filter the relevant data arises.
Depending on the amount and type of data that you want to access, another thoughtful consideration should be made about how to query your blockchain index. The query language and the database system are, of course, mostly interdependent, so you’ll need to consider both in tandem.
Popular choices tend to be SQL (e.g. for PostgreSQL database) or Elasticsearch Query DSL (for Elasticsearch) as many developers are familiar with their query syntax.
These architectural considerations become more complex if you want to share such a database index across your teams/departments or even between several business entities. This would at least require authentication, likely also authorization, and possibly even accounting.
Let’s assume you have made up your mind about these considerations, have acquired the infrastructure, and set up the systems. Now you’ll do classical extract, transform and load (ETL) processes.
Extract relevant data from the node
First, you have to extract the data relevant to you from the node – for example, everything concerning certain smart contracts, or the complete blockchain history starting from a specific point in time. Of particular interest is the question of how fast you want to retrieve incoming blocks. While it is beneficial to be up to date quickly, you might have to deal with chain reorganizations from time to time. This occurs when a client node discovers a new well-formed chain with a higher total difficulty that excludes one or more blocks the client previously considered part of the canonical chain. These excluded blocks become orphans, and therefore the data contained in them needs to be purged from, or at least flagged in, the index.
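The reorg check described above can be sketched in a few lines: each new block’s `parentHash` must match the hash our index stored for the previous height, otherwise previously indexed blocks have become orphans. The index and block dictionaries below are simplified placeholders for whatever storage your ETL pipeline uses:

```python
# Sketch of reorg detection for an ETL pipeline. `index` maps
# block number -> block hash as previously stored by the pipeline.
def detect_reorg(index: dict, new_block: dict) -> bool:
    parent_number = new_block["number"] - 1
    stored_parent = index.get(parent_number)
    # A mismatch means the node switched to a chain that excludes blocks
    # we already indexed -- those rows must be purged or flagged.
    return stored_parent is not None and stored_parent != new_block["parentHash"]
```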
Transform data into a human-readable format
In the transformation step, you probably want to make the data human-readable. Examples might be labeling Ethereum accounts of known origin (e.g. exchange wallets, smart contract names, etc.), or fetching smart contract Application Binary Interfaces (ABIs) to spell out the names of the functions. You’ll use some form of mapping to match the raw data to clear-text labels. This could also mean, for example, fetching historical price information for ERC-20 tokens and matching the timestamp to the appropriate block height. You’d need that to quantify value transfer transactions in fiat denominations, which would be an example of enriching the blockchain data from external sources.
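At its simplest, the labeling mentioned above is a lookup table from raw addresses to clear-text names. The addresses and labels below are made-up examples, not real accounts:

```python
# Sketch of the labeling step: mapping raw hex addresses to
# human-readable names. Addresses and labels are fictional examples.
KNOWN_ADDRESSES = {
    "0x" + "aa" * 20: "Example Exchange hot wallet",
    "0x" + "bb" * 20: "ExampleToken contract",
}

def label_address(addr: str) -> str:
    # fall back to the raw hex when no label is known
    return KNOWN_ADDRESSES.get(addr.lower(), addr)
```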
Load into database index for faster querying
All of this data then needs to be loaded into a database and be indexed for best query performance. Depending on the chosen technology, this process might take a while and involve different steps in itself.
But now it is done!
You have transferred the serialized blockchain content from an OLTP (On-Line Transaction Processing) to an OLAP (On-Line Analytical Processing) environment. Therefore, you are now able to read from a database index much more quickly to start digging into the blockchain data.
Hopefully, you have gained an understanding of the content stored on the Ethereum blockchain and how to dissect the logs of smart contract events to see their interactions. As you have learned from this article, reading from the blockchain is hard, and the process of making the data accessible faster is quite involved.
Tools and services to help you access blockchain data
Fortunately, there are different tools and services available so that you do not have to go through the whole process by yourself. For example, we blogged about working with Ethereum data on Google Big Query here: http://test.anyblockanalytics.com/news/analyse-transactions-with-ethereum-google-bigquery-data-set/
Our Anyblock Index provides a complete set of all data of 22+ Ethereum- and Bitcoin-based blockchains as Software-as-a-Service. As a smart contract developer, you can get your free API key here and start querying our Elasticsearch API or PostgreSQL right away. Or you might want to check out our documentation on how to get started – and feel free to contact us for any questions, support, and feedback. We’d love to talk to you!
In any case: Happy BUIDLing and stay awesome!