Data in Depth
Clone this wiki locally
Data Source
The data used in the DeFi Survival Analysis Toolkit is from The Graph.
In the current state of the app, the utilized data is from the DeFi lending protocol Aave. Aave pushes its own data to The Graph, and each network it is deployed on has its own sub-graph. These sub-graphs are structured identically, with respect to the transaction-level data.
The data currently available for use within the toolkit comes from the Aave-maintained sub-graphs, specifically the following 7 markets:
Ethereum (V2), Polygon (V3), Avalanche (V3), Optimism (V3), Harmony (V3), Fantom (V3), and Arbitrum (V3).
The structure of the app is designed to be extensible to additional survival datasets as they are added (see Data Storage). To this end, we are currently working with Amberdata to expand the data feed to more DeFi protocols (Uniswap in the works).
Transaction Data Structure
The data from TheGraph is a transactional style dataset, with each transaction having a unique hashed identifier. There are multiple transaction types, each with different data descriptors associated with them. They are as outlined below
Deposits - User deposits currency in to a lending pool to accrue interest
- type(char): type of transaction ("deposit" for deposits)
- user(int): id of user who initiated the transaction
- onBehalfOf(int): id of user who will redeem the tokens in most cases, user = onBehalfOf
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool from which currency is redeemed
- reserve(char): symbol of currency used in transaction -reservePriceETH(int): price of currency at time of transaction, in Ether
- reservePriceUSD(int): price of currency at time of transaction, in USD
- amount(int): number of currency tokens redeemed
- amountUSD(int): amount being borrowed, in USD reservePriceUSD X amount
- amountETH(int): amount being borrowed, in Ether reservePriceETH X amount
Redeems - User removes deposit from lending pool
- type(char): type of transaction ("redeem" for redeems)
- user(int): id of user who initiated the transaction
- onBehalfOf(char): id of user who will redeem the tokens in most cases, user = onBehalfOf
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool from which currency is redeemed
- reserve(char): symbol of currency used in transaction
- reservePriceETH(int): price of currency at time of transaction, in Ether
- reservePriceUSD(int): price of currency at time of transaction, in USD
- amount(int): number of currency tokens redeemed
- amountUSD(int): amount being borrowed, in USD reservePriceUSD X amount
- amountETH(int): amount being borrowed, in Ether reservePriceETH X amount
Borrows - User borrows a currency from a lending pool, which can be repaid at any time (if the user has the required collateral)*
- type(char): type of transaction ("borrow" for borrows)
- user(int): id of user who initiated the transaction
- onBehalfOf(int): id of user who will incur the debt in most cases, user = onBehalfOf
- timestamp(UNIX): Unix Timestamp of transaction
- pool(char): id of lending pool from which currency is borrowed
- reserve(char): symbol of currency used in transaction
- reservePriceETH(int): price of currency at time of transaction, in Ether
- reservePriceUSD(int): price of currency at time of transaction, in USD
- amount(int): number of currency tokens borrowed
- amountUSD(int): amount being borrowed, in USD reservePriceUSD X amount
- amountETH(int): amount being borrowed, in Ether reservePriceETH X amount
- borrowRate(int): interest rate of loan (APR)
- borrowRateMode(char): whether the loan has a variable or stable interest rate
Repays - User repays money borrowed to a lending pool*
- type(char): type of transaction ("borrow" for borrows)
- user(int): id of user who initiated the transaction
- onBehalfOf(int): id of user whose borrow will be repaid in most cases, user = onBehalfOf
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool for which currency is being repaid
- reserve(char): symbol of currency used in transaction
- reservePriceETH(int): price of currency at time of transaction, in Ether
- reservePriceUSD(int): price of currency at time of transaction, in USD
- amount(int): number of currency tokens repaid
- amountUSD(int): amount being repaid, in USD reservePriceUSD X amount
- amountETH(int): amount being borrowed, in Ether reservePriceETH X amount
Liquidations - User's loan is forcibly repaid when they no longer have the required collateral to maintain the loan*
- type(char): type of transaction ("liquidation" for liquidations)
- user(int): id of user who received liquidation
- liquidator(int): id of user who liquidated
- id(char): address of liquidation transaction
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool from which loan is being liquidated
- collateralReserve(char): symbol of currency used as collateral in loan (what is being repaid)
- principalReserve(char): symbol of currency which has been borrowed
- reservePriceETHCollateral(int): price of collateral currency at time of transaction, in Ether
- reservePriceUSDCollateral(int): price of collateral currency at time of transaction, in USD
- reservePriceETHPrincipal(int): price of principal currency at time of transaction, in Ether
- reservePriceUSDPrincipal(int): price of principal currency at time of transaction, in USD
- collateralAmount(int): number of currency tokens repaid
- principalAmount(int): number of currency tokens borrowed for loan
- amountUSDPrincipal(int): amount of principal being borrowed, in USD reservePriceUSDPrincipal X principalAmount
- amountUSDCollateral(int): amount being of collateral being repaid, in USD reservePriceUSDCollateral X collateralAmount
- amountETHPrincipal(int): amount of principal being borrowed, in Ether reservePriceETHPrincipal X principalAmount
- amountETHCollateral(int): amount being of collateral being repaid, in Ether reservePriceETHCollateral X collateralAmount
Collaterals - User has deposited funds, enables/disables the funds to be used as collateral for borrowing*
- type(char): type of transaction ("collateral" for collaterals)
- user(int): id of user who initiated the transaction
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool from which currency is redeemed
- reserve(char): symbol of currency used in transaction
- FromState(boolean): indicates initial state of currency to be used as collateral (TRUE means enabled as collateral)
- ToState(boolean): indicates final state of currency to be used as collateral (FALSE means disabled as collateral)
Swaps - User changes deposit from one type of interest rate to another*
- type(char): type of transaction ("swap" for swaps)
- user(int): id of user who initiated the transaction
- timestamp(UNIX): Unix Timestamp of transaction
- pool(int): id of lending pool from which borrowed currency is being swapped
- reserve(char): symbol of currency used in transaction
- borrowRateModeFrom(int): interest rate mode swapping from
- borrowRateModeTo(int): interest rate mode swapping to
- stableBorrowRate(int): the stable interest rate for the deposit
- variableBorrowRate(int): the variable floating rate for the deposit
Survival Data Creation
The process for the creation of the survival datasets is shown in... (coming soon).
Survival Data Structure
Basic Survival Data Structure
The columns necessary for a survival model to be created are:
ID - Identifier value
User - User hash
TimeDiff - Time (in seconds) from either the start of observation period or the time of the index event to either the time in which the outcome event occurred (status = 1) or the end of the observation period (status = 0)
Status - Binary value with value 0 if the event is censored during the observation period
Example image:
Survival Datasets with Covariates
These datasets are generally more informative than basic models because it allows users to see differences in certain behaviors based on covariates. This allows for the discovering of changes over time (quarters), overall trend of the market, user clusters, and more. These are put into the dataframe as factor columns after the original, making a dataframe similar to as seen here:
Overall, here is a data dictionary for the currently-implemented covariates:
Note - all of these columns are stored as factors.
Reserve Type - Type (stable or non-stable) of cryptocurrency used in index event.
USD_Amount_Quartile - Quartile of value of cryptocurrency used in index event (value assigned as USD amount equivalent at time of transaction) from all transactions of that index event in the given time period.
Market_Trend - Hard-coded general values for state of general cryptocurrency market which correspond to (month/day/year):
Bull(growing market): 11/30/2020 - 05/11/2021, 07/19/2021 - 11/09/2021
Bear(shrinking market): 05/11/2021 - 07/19/2021, 11/09/2021 - 06/18/2022
Steady(stable market): 06/18/2022 - 10/10/2022
Quarter - Year and quarter in which the index event occurred. The format is [Year] Q[Quarter Number] (i.e. 2022 Q4). Used lubridate package for creation of quarter from date-time values.
User_Cluster - See other GitHub for information on clustering of users. Generally, number which shows user behavior cluster which the user ID is associated to.
Borrowed_Reserve_Type (Borrow to Account Liquidated Only) - Type (stable or non-stable) of borrow type (see AAVE loan types) used for borrow.
Liquidation_Type (Borrow to Account Liquidated Only) - Format is [Principal Type (Stable, Non-Stable, or (Stable, Non-Stable - means combined currency types)]:[Collateral Type (Stable or Non-Stable)] of account liquidation.
Collateral_Amount_USD_Quartile (Borrow to Account Liquidated Only) - Quartile of value of cryptocurrency used as collateral (value assigned as USD amount equivalent at time of transaction) from all account liquidations in the given time period.
Principal_Amount_USD_Quartile (Borrow to Account Liquidated Only) - Quartile of value of cryptocurrency used as principal (value assigned as USD amount equivalent at time of transaction) from all account liquidations in the given time period.
Extensible Data Storage
The toolkit is designed to use the file system of the data storage to create UI options based on additional survival datasets added. The general structure of the file path is: DeFi_Toolkit/Data/protocol/version/market/compute_quarterly_choice/index_event/outcome_event. The UI elements are created in a reactive sequential manner by reading the files at each subsequent step. Thus, as more datasets are added, the toolkit will intrinsically be able to handle it.
Once it extracts the data, the categories to split by are also gotten by looking at all columns in the data object which are not the ones necessary for creating basic survival curves without splitting by category (see survival data structure above). Thus, we can add categories and they will automatically be implemented into the toolkit as a functionality.
Quarterly-Computed Data
As to the ends of figuring out how user behavior has changed throughout time, the toolkit allows for the computation of the survival data in a quarterly manner. In this, each quarter (01-Jan through 31-Mar, 01-Apr through 30-Jun, 01-Jul through 30-Sep, and 01-Oct through 31-Dec) is treated as a separate observation period. This means that it is possible for an outcome event to occur with no associated index event having taken place in the observation period, leading to "left-censored" events. For the calculation of these quarterly survival data sets, we treat left-censored events as truncated in a similar manner to right-censored events. If an outcome event occurs, say, 30 days after the start of the observation period and it was not preceded by an associated index event, we record in the survival data that an observation occurred with 30 days of elapsed time, and that the event was censored.
This method of computation can be selected via using the "Compute Quarterly?:" drop-down. We recommend the use of this style of computation when trying to view changes over time, specifically with "Quarter" as the category to split. We promote the experimentation of this functionality, especially with plots with curves with different lengths (eg: Market_Trend, since time in different trends is different).