Data in Depth
Data Source
The data used in the DeFi Survival Analysis Toolkit is from The Graph.
The app currently uses data from the DeFi lending protocol Aave. Aave pushes its own data to The Graph, and each network it is deployed on has its own sub-graph. With respect to the transaction-level data, these sub-graphs are structured identically.
The data currently available for use within the toolkit comes from the Aave-maintained sub-graphs, specifically the following 7 markets:
Ethereum (V2), Polygon (V3), Avalanche (V3), Optimism (V3), Harmony (V3), Fantom (V3), and Arbitrum (V3).
The app is designed to be extensible to additional survival datasets as they are added (see Extensible Data Storage). To this end, we are currently working with Amberdata to expand the data feed to more DeFi protocols (Uniswap is in the works).
Transaction Data Structure
Coming soon...
Survival Data Creation
A description of how the survival datasets are created is coming soon (it will include the code for data creation).
Survival Data Structure
Basic Survival Data Structure
The columns necessary for a survival model to be created are:
ID - Identifier value
User - User hash
TimeDiff - Time (in seconds) from either the start of the observation period or the time of the index event to either the time at which the outcome event occurred (status = 1) or the end of the observation period (status = 0)
Status - Binary value: 1 if the outcome event occurred during the observation period, 0 if the observation is censored
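The four columns above can be illustrated with a minimal Python sketch. The toolkit's own data-creation code is not shown on this page (and appears to be written in R), so the names SurvivalRow and make_row, and the example timestamps, are hypothetical and only illustrate the column definitions. The sketch assumes the index event falls inside the observation period.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class SurvivalRow:
    ID: str          # identifier value
    User: str        # user hash
    TimeDiff: float  # elapsed time in seconds
    Status: int      # 1 = outcome event observed, 0 = censored


def make_row(row_id: str, user: str, index_time: datetime,
             outcome_time: Optional[datetime], period_end: datetime) -> SurvivalRow:
    """Build one survival row for an index event inside the observation period."""
    if outcome_time is not None and index_time <= outcome_time <= period_end:
        # Outcome observed: time runs from the index event to the outcome event.
        elapsed = (outcome_time - index_time).total_seconds()
        return SurvivalRow(row_id, user, elapsed, 1)
    # No outcome before the period ended: right-censored at period_end.
    elapsed = (period_end - index_time).total_seconds()
    return SurvivalRow(row_id, user, elapsed, 0)


# Hypothetical example values (not taken from the real data).
period_end = datetime(2022, 1, 11)
repaid = make_row("1", "0xabc", datetime(2022, 1, 1), datetime(2022, 1, 2), period_end)
still_open = make_row("2", "0xdef", datetime(2022, 1, 1), None, period_end)
```

Here repaid has an outcome one day after its index event, while still_open is censored at the end of the observation period.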
Example image:
Survival Datasets with Covariates
These datasets are generally more informative than the basic ones because they allow users to see differences in behavior based on covariates. This makes it possible to discover changes over time (quarters), the overall trend of the market, user clusters, and more. The covariates are added to the dataframe as factor columns after the original columns, producing a dataframe similar to the one shown here:
Overall, here is a data dictionary for the currently-implemented covariates:
Note - all of these columns are stored as factors.
Reserve Type - Type (stable or non-stable) of cryptocurrency used in index event.
USD_Amount_Quartile - Quartile of the value of the cryptocurrency used in the index event (valued as its USD equivalent at the time of the transaction), computed over all transactions of that index event in the given time period.
Market_Trend - Hard-coded labels for the state of the general cryptocurrency market, which correspond to: (Bull [growing market]: FILL IN DATES, Bear [shrinking market]: FILL IN DATES, Steady [stable market]: FILL IN DATES)
Quarter - Year and quarter in which the index event occurred. The format is [Year] Q[Quarter Number] (e.g. 2022 Q4). Created from date-time values using the lubridate package.
User_Cluster - Number identifying the user behavior cluster that the user ID belongs to. See the other GitHub repository for information on the clustering of users.
Borrowed_Reserve_Type (Borrow to Account Liquidated Only) - Type (stable or non-stable) of the borrow (see Aave loan types).
Liquidation_Type (Borrow to Account Liquidated Only) - Format is [Principal Type]:[Collateral Type] of the account liquidation. Principal Type is Stable, Non-Stable, or "Stable, Non-Stable" (meaning combined currency types); Collateral Type is Stable or Non-Stable.
Collateral_Amount_USD_Quartile (Borrow to Account Liquidated Only) - Quartile of the value of the cryptocurrency used as collateral (valued as its USD equivalent at the time of the transaction), computed over all account liquidations in the given time period.
Principal_Amount_USD_Quartile (Borrow to Account Liquidated Only) - Quartile of the value of the cryptocurrency used as principal (valued as its USD equivalent at the time of the transaction), computed over all account liquidations in the given time period.
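As a rough illustration of how two of these covariates could be derived, here is a minimal Python sketch. The toolkit itself uses the lubridate package (R) for the Quarter column, so this is a stand-in rather than the toolkit's code. The function names and sample amounts are hypothetical, and the quartile cut points use Python's default "exclusive" method, which may differ from the one the toolkit uses.

```python
from bisect import bisect_left
from datetime import datetime
from statistics import quantiles


def quarter_label(ts: datetime) -> str:
    """Format a timestamp as '[Year] Q[Quarter Number]'."""
    return f"{ts.year} Q{(ts.month - 1) // 3 + 1}"


def usd_quartiles(amounts: list) -> list:
    """Assign each USD amount a quartile (1-4) relative to all given amounts."""
    cuts = quantiles(amounts, n=4)  # three cut points between the four quartiles
    return [bisect_left(cuts, amount) + 1 for amount in amounts]


# Hypothetical USD amounts for one index event in one time period.
sample_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
sample_quartiles = usd_quartiles(sample_amounts)
sample_quarter = quarter_label(datetime(2022, 11, 5))
```

In the toolkit these values are stored as factors; plain integers and strings are used here only to keep the sketch short.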
Extensible Data Storage
The toolkit uses the file system of the data storage to create UI options, so that additional survival datasets are picked up as they are added. The general structure of the file path is: DeFi_Toolkit/Data/protocol/version/market/compute_quarterly_choice/index_event/outcome_event. The UI elements are created reactively and in sequence, by reading the entries available at each subsequent level of the path. Thus, as more datasets are added, the toolkit will be able to handle them without further changes.
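The level-by-level lookup described above can be sketched in Python as follows. This is not the toolkit's implementation: next_choices is a hypothetical helper, the folder names in the demo tree (Aave, V2, Quarterly, Borrow, Repay, and so on) are made up for illustration, and the sketch assumes every level is stored as a directory.

```python
import tempfile
from pathlib import Path

# Path levels, in the order given in the text above.
LEVELS = ["protocol", "version", "market",
          "compute_quarterly_choice", "index_event", "outcome_event"]


def next_choices(data_root: Path, selected: list) -> list:
    """List the options for the next level, given the choices made so far."""
    current = Path(data_root).joinpath(*selected)
    return sorted(p.name for p in current.iterdir() if p.is_dir())


# Demo on a throwaway directory tree with hypothetical folder names.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "DeFi_Toolkit" / "Data"
    (root / "Aave" / "V2" / "Ethereum" / "Quarterly" / "Borrow" / "Repay").mkdir(parents=True)
    (root / "Aave" / "V3" / "Polygon" / "Quarterly" / "Borrow" / "Repay").mkdir(parents=True)

    demo_protocols = next_choices(root, [])
    demo_versions = next_choices(root, ["Aave"])
    demo_markets = next_choices(root, ["Aave", "V3"])
```

Each call corresponds to populating one drop-down from the choices made in the previous ones, so a new folder at any level shows up as a new option with no code changes.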
Once the toolkit has loaded the data, it determines the categories to split by: these are all columns in the data object other than those required for basic survival curves without splitting by category (see Basic Survival Data Structure above). Thus, new category columns can be added to a dataset and the toolkit will automatically offer them as split options.
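A minimal Python sketch of this column filtering, assuming the basic columns are exactly the four listed under Basic Survival Data Structure (the helper name category_columns is hypothetical):

```python
# Columns required for a basic survival curve (from the section above).
BASIC_COLUMNS = {"ID", "User", "TimeDiff", "Status"}


def category_columns(columns: list) -> list:
    """Return the columns that can be offered as categories to split by."""
    return [name for name in columns if name not in BASIC_COLUMNS]


example_columns = ["ID", "User", "TimeDiff", "Status", "Quarter", "Market_Trend"]
example_categories = category_columns(example_columns)
```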
Quarterly-Computed Data
To show how user behavior has changed over time, the toolkit can compute the survival data quarter by quarter. Each quarter (01-Jan through 31-Mar, 01-Apr through 30-Jun, 01-Jul through 30-Sep, and 01-Oct through 31-Dec) is treated as a separate observation period. As a result, an outcome event can occur without an associated index event having taken place in the same observation period, leading to "left-censored" events. When calculating these quarterly survival datasets, we treat left-censored events as truncated, in a similar manner to right-censored events. If an outcome event occurs, say, 30 days after the start of the observation period and was not preceded by an associated index event, we record an observation with 30 days of elapsed time and mark it as censored.
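The quarterly rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolkit's code: the function names are hypothetical, the end of each quarter is represented as the first instant of the next quarter, and an index event from an earlier quarter with no outcome in the current quarter is simply skipped, since the text does not say how that case is handled.

```python
from datetime import datetime
from typing import Optional, Tuple


def quarter_bounds(year: int, quarter: int) -> Tuple[datetime, datetime]:
    """Return (start, end) of a calendar quarter; end is exclusive."""
    start_month = 3 * (quarter - 1) + 1
    start = datetime(year, start_month, 1)
    if quarter == 4:
        end = datetime(year + 1, 1, 1)
    else:
        end = datetime(year, start_month + 3, 1)
    return start, end


def quarterly_record(index_time: Optional[datetime], outcome_time: Optional[datetime],
                     year: int, quarter: int) -> Optional[Tuple[float, int]]:
    """Return (TimeDiff in seconds, Status) for one quarter, or None if nothing to record."""
    start, end = quarter_bounds(year, quarter)
    index_in = index_time is not None and start <= index_time < end
    outcome_in = outcome_time is not None and start <= outcome_time < end
    if index_in:
        if outcome_in and outcome_time >= index_time:
            # Both events fall in the quarter: outcome observed.
            return (outcome_time - index_time).total_seconds(), 1
        # Index event with no outcome by the end of the quarter: right-censored.
        return (end - index_time).total_seconds(), 0
    if outcome_in:
        # Outcome with no index event in this quarter: "left-censored",
        # recorded as censored with time measured from the quarter start.
        return (outcome_time - start).total_seconds(), 0
    return None


# The 30-day example from the text: outcome on 31-Jan, index event in the previous quarter.
left_censored = quarterly_record(datetime(2021, 12, 15), datetime(2022, 1, 31), 2022, 1)
```

In this sketch left_censored holds 30 days expressed in seconds together with a status of 0, matching the example in the paragraph above.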
This method of computation can be selected using the "Compute Quarterly?:" drop-down. We recommend it when looking at changes over time, especially with "Quarter" as the category to split by. We encourage experimenting with this functionality, particularly for plots whose curves have different lengths (e.g. Market_Trend, since the time spent in each trend differs).