Welcome! I’m assuming you’re a data analyst who is new to web3, is starting to build out a web3 analytics team, or has just taken an interest in web3 data in general. Either way, you should already be loosely familiar with how APIs, databases, transformations, and models work in web2.
For this inaugural guide, I’m going to try to keep things very concise and highlight my thoughts across these three pillars:
This is probably a good time to say that this is all just my view of the space, and I’ve showcased only a few facets of the tools/communities mentioned below.
Let’s get into it!
Let’s start by summarizing how data is built, queried, and accessed in web2 (e.g. accessing Twitter’s API). We have four steps for a simplified data pipeline:
The only step where the data is sometimes open-sourced is after the transformations are done. Communities like Kaggle (thousands of data science/feature engineering competitions) and Hugging Face (26,000 top-notch NLP models) use some subset of that exposed data to help companies build better models. There are some domain-specific cases like OpenStreetMap that open up data in the earlier three steps - but even those still have limits around write permissions.
I do want to clarify that I’m only talking about the data here - I’m not saying web2 doesn’t have any open source at all. Like most other engineering roles, web2 data teams have tons of open-source tools for building their pipelines (dbt, anything Apache, TensorFlow), and we still use all of these tools in web3. In summary: their tooling is open, but their data is closed.
Web3 open-sources the data as well - meaning it’s no longer just data scientists working in the open, but analytics engineers and data engineers too! Instead of a mostly black-box data cycle, everyone gets involved in a more continuous workflow.
The shape of work has gone from web2 data dams to web3 data rivers, deltas, and oceans. It’s also important to note that this new cycle affects all products/protocols in the ecosystem at once.
Let’s look at an example of how web3 analysts work together. There are dozens of DEXs out there, each using different exchange mechanisms and fees to let you swap token A for token B. If these were typical exchanges like Nasdaq, each exchange would report its own data in a 10-K or some API, and then some other service like capIQ would do the work of stitching all the exchange data together and charge $$$ for you to access their API. Maybe once in a while they’d run an innovation competition so they can have an extra data/chart feature to charge for in the future.
With web3 exchanges, we have this data flow instead:
dex.trades is a table on Dune (put together by many community analytics engineers over time) where all DEX swap data is aggregated - so you can very easily search for something like a single token’s swap volume across all exchanges, as in the sketch below.
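To make that concrete, here’s a minimal sketch of the kind of query this unlocks. It’s written against Dune’s community-built dex.trades abstraction; the column names reflect the v1 version of that table as I know it, so treat them as assumptions that may drift as the community keeps iterating:

```sql
-- Weekly USD swap volume for one token across every DEX at once,
-- using the community-built dex.trades abstraction on Dune.
-- Column names assume the v1 abstraction and may change over time.
SELECT
    date_trunc('week', block_time) AS week,
    project,                          -- e.g. 'Uniswap', 'Sushiswap'
    SUM(usd_amount) AS usd_volume
FROM dex.trades
WHERE token_a_symbol = 'WETH'         -- the token you care about
   OR token_b_symbol = 'WETH'
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
```

Answering this from raw transaction data alone would mean decoding every DEX’s contracts yourself - here it’s a dozen lines because analysts before you already shared that work.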
Discussion, collaboration, and learning happen in a much tighter feedback loop because of the shared ecosystem. I will admit this gets very overwhelming at times, and the analysts I know basically all rotate through data burnout. However, as long as one of us keeps pushing the data forward (i.e. someone creates that [insert DEX here] query), then everyone else benefits.
It doesn’t always have to be complicated abstracted views either - sometimes it’s just utility functions, like making it easy to look up an ENS reverse resolver, or improvements in tooling, like auto-generating most of a GraphQL mapping with a single CLI command! All of it is reusable by everyone and can be adapted for API consumption in some product frontend or your own personal trading models.
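As an illustration of the utility-function idea, here’s a hedged sketch of that ENS reverse-record lookup as a shareable query. The decoded table name is my assumption based on Dune v1’s project."Contract_call_function" naming convention - the actual community view may be named differently:

```sql
-- Hypothetical sketch: the most recent name each wallet has set as its
-- ENS reverse record. The ens."ReverseRegistrar_call_setName" table name
-- is an assumption based on Dune v1's decoded-call naming convention.
SELECT DISTINCT ON (tx."from")
    tx."from"           AS wallet,
    c."name"            AS ens_name,
    c.call_block_time   AS set_at
FROM ens."ReverseRegistrar_call_setName" c
JOIN ethereum.transactions tx
    ON tx.hash = c.call_tx_hash       -- recover who called setName
WHERE c.call_success
ORDER BY tx."from", c.call_block_time DESC;
```

Once something like this is saved publicly, anyone can fork it, wrap it in a dashboard, or pull the results into a frontend.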
While the possibilities unlocked here are amazing, I do acknowledge that the wheel isn’t turning that smoothly yet. The ecosystem is still really immature on the data analyst/science side compared to data engineering. I think there are a few reasons for this:
On top of learning to work together, the web3 data community is also still learning how to work across this new data stack. You don’t get to control the infrastructure or slowly build up from Excel to a data lake or data warehouse anymore - as soon as your product is live, your data is live everywhere. Your team is basically thrown into the deep end of data infrastructure.
Here’s what most of you came here for:
*These tools don’t cover the entire space - they’re just the ones that I’ve found myself or others consistently using and referencing in the Ethereum ecosystem (some of them cover other chains as well).
*The “decentralized” tag means there’s either an infrastructure network or guideline framework to stop changes from happening unilaterally. I like to think of it as decoupled infra versus cloud infra, but that will need to be its own article.
Let’s walk through when you would need to use each layer/category:
It wouldn’t be web3 without strong, standout communities to go alongside these tools! I’ve put some of the top communities next to each layer:
Every one of these communities has done immense work to better the web3 ecosystem. It almost goes without saying that the products with a community around them grow at 100x the speed. This is still a heavily underrated competitive edge, one I think people don’t get unless they’ve built something within one of these communities.
It should also go without saying that you want to look within these communities for the people to hire onto your teams. Let’s go further and break down the important web3 data skills and experiences, so you actually know what you’re searching for. And if you’re looking to be hired, view this as the skills and experiences to go after!
If you’re new and want to dive in, start with the free recordings of my 30-day data course which is purely focused on the first pillar. I’ll hopefully have educational content on everything here and be able to run cohorts with it one day!
At a minimum, an analyst should be an Etherscan detective and know how to read Dune dashboards. This takes maybe a month to ramp up to at a leisurely pace, and two weeks if you’re really booking it and binge studying.
There’s a little more context you should have in your mind as well, specifically on time allocations and skill transferability.
Remember, it’s less about knowing how to use the tools - every analyst should more or less be able to write SQL or create dashboards. It’s all about knowing how to contribute and work with the communities. If the person you’re interviewing isn’t part of any web3 data community (and doesn’t seem to express any interest in joining one), you may want to ask yourself if that’s a red flag.
And a final tip for hiring: pitching the analyst a project with your data will work much better than pitching them a role. Graeme originally approached me with a really fun project and set of data, and after working through it I was fairly easily convinced to join the Mirror team.
Disclaimer: I would not recommend taking my word as definitive criteria for everyone, but at least it’s the frame of mind I’ll be using to hire data friends at Mirror (sometime soon 🙂).
I’d also have loved to talk about team structure, but to my knowledge most product data teams have been between zero and two full-time analysts. I’m sure I’ll have better-informed thoughts on this in a year or two.
It’s honestly quite amazing how far data in web3 has come over the last year, and I’m very excited to see where the ecosystem grows to by 2023. I’ll try and make this a yearly thing - if you want to support me and this kind of work, feel free to collect an edition of this entry (at the top) or one of the landscape NFTs below:
If you have ideas/questions on web3 data topics you’d like to learn more about, just dm me on Twitter. I’m always looking for educational partners as well!
Special thanks to Elena as always for reviewing and making great suggestions!