- add incremental model for syncing near lake s3 bucket

- add s3 udfs
Julius Remigio 2023-01-19 18:43:55 -08:00
parent 107f0e519f
commit e70f097e52
5 changed files with 184 additions and 30 deletions


@ -5,36 +5,38 @@ Curated SQL Views and Metrics for the Near Blockchain.
What's Near? Learn more [here](https://near.org/)
## Setup
### Prerequisites
1. Complete the steps in the [Data Curator Onboarding Guide](https://docs.metricsdao.xyz/data-curation/data-curator-onboarding).
- Note that the Data Curator Onboarding Guide assumes that you will ask to be added as a contributor to a MetricsDAO project. Ex: https://github.com/MetricsDAO/near_dbt.
- However, if you have not yet been added as a contributor, or you'd like to take an even lower-risk approach, you can always follow the [Fork and Pull Workflow](https://reflectoring.io/github-fork-and-pull/) by forking the project you'd like to contribute to into your own GitHub account. Just make sure to:
- Fork the MetricsDAO repository.
- Git clone from your forked repository. Ex: `git clone https://github.com/YourAccount/near_dbt`.
- Create a branch for the changes you'd like to make. Ex: `git branch readme-update`.
- Switch to the branch. Ex: `git checkout readme-update`.
- Make your changes on the branch and follow the rest of the steps in the [Fork and Pull Workflow](https://reflectoring.io/github-fork-and-pull/) to notify the MetricsDAO repository owners to review your changes.
2. Download [Docker for Desktop](https://www.docker.com/products/docker-desktop).
- (Optional) You can run the Docker tutorial.
3. Install [VSCode](https://code.visualstudio.com/).
### Prerequisites: Additional Windows Subsystem for Linux (WSL) Setup
4. For Windows users, you'll need to install WSL and connect VSCode to WSL by
- Right clicking VSCode and running VSCode as admin.
- Installing [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) by typing `wsl --install` in VSCode's terminal.
- Following the rest of the [VSCode WSL instructions](https://code.visualstudio.com/docs/remote/wsl) to create a new WSL user.
- Installing the Remote Development extension (ms-vscode-remote.vscode-remote-extensionpack) in VSCode.
- Finally, restarting VSCode in a directory in which you'd like to work. For example,
  - `cd ~/metricsDAO/data_curation/near_dbt`
  - `code .`
### Create the Environment Variables
1. Create a `.env` file with the following contents (note `.env` will not be committed to source) in the near_dbt directory (ex: near_dbt/.env):
```
SF_ACCOUNT=zsniary-metricsdao
SF_USERNAME=<your_metrics_dao_snowflake_username>
SF_PASSWORD=<your_metrics_dao_snowflake_password>
@ -43,9 +45,9 @@ What's Near? Learn more [here](https://near.org/)
SF_WAREHOUSE=DEFAULT
SF_ROLE=PUBLIC
SF_SCHEMA=SILVER
```
**Replace** the SF_USERNAME and SF_PASSWORD with the temporary Snowflake user name and password you received in the Snowflake step of the [Data Curator Onboarding Guide](https://docs.metricsdao.xyz/data-curation/data-curator-onboarding).
2. New to DBT? It's pretty dope. Read up on it [here](https://www.getdbt.com/docs/)
@ -53,6 +55,31 @@ What's Near? Learn more [here](https://near.org/)
Run the following commands from inside the Near directory (**you must have completed the Setup steps above^^**)
## Variables
To control which external table environment a model references, as well as whether a Stream is invoked at runtime, use the following control variables:
* STREAMLINE_INVOKE_STREAMS
  When True, invokes streamline on model run as normal
  When False, NO-OP
* STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES
  When True, uses the DEV Streamline schema (Ethereum_DEV)
  When False, uses the PROD Streamline schema (Ethereum)
Default values are False
* Usage:
`dbt run --var '{"STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES": True, "STREAMLINE_INVOKE_STREAMS": True}' -m ...`
To control the creation of UDFs or SPs with dbt run:
* UPDATE_UDFS_AND_SPS
  When True, executes all macros included in the on-run-start hooks within dbt_project.yml on model run as normal
  When False, none of the on-run-start macros are executed on model run
Default value is False
* Usage:
`dbt run --var '{"UPDATE_UDFS_AND_SPS": True}' -m ...`
### DBT Environment
1. In VSCode's terminal, type `cd near_dbt`.
@ -61,10 +88,10 @@ Run the following commands from inside the Near directory (**you must have compl
### DBT Project Docs
1. In VSCode, open another terminal.
2. In this new terminal, run `make dbt-docs`. This will compile your dbt documentation and launch a web-server at http://localhost:8080
Documentation is automatically generated and hosted using Netlify.
[![Netlify Status](https://api.netlify.com/api/v1/badges/12fc0079-7428-4771-9923-38ee6599db0f/deploy-status)](https://app.netlify.com/sites/mdao-near/deploys)
## Project Overview
@ -94,12 +121,12 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
| Column | Type | Description |
| --------------- | ------------ | ---------------------------------------------------------------- |
| record_id | VARCHAR | A unique id for the record generated by Chainwalkers |
| offset_id       | NUMBER(38, 0) | Synonymous with block_id for Near                                  |
| block_id        | NUMBER(38, 0) | The height of the chain this block corresponds with               |
| block_timestamp | TIMESTAMP     | The time the block was minted                                      |
| network         | VARCHAR       | The blockchain network (i.e. mainnet, testnet, etc.)              |
| chain_id        | VARCHAR       | Synonymous with blockchain name for Near                           |
| tx_count        | NUMBER(38, 0) | The number of transactions in the block                            |
| header          | json variant  | A json queryable column containing the block's header information |
| ingested_at     | TIMESTAMP     | The time this data was ingested into the table by Snowflake       |
@ -109,13 +136,13 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
| --------------- | ------------ | ---------------------------------------------------------------------- |
| record_id | VARCHAR | A unique id for the record generated by Chainwalkers |
| tx_id | VARCHAR | A unique on chain identifier for the transaction |
| tx_block_index  | NUMBER(38, 0) | The index of the transaction within the block. Starts at 0.            |
| offset_id       | NUMBER(38, 0) | Synonymous with block_id for Near                                       |
| block_id        | NUMBER(38, 0) | The height of the chain this block corresponds with                    |
| block_timestamp | TIMESTAMP     | The time the block was minted                                           |
| network         | VARCHAR       | The blockchain network (i.e. mainnet, testnet, etc.)                   |
| chain_id        | VARCHAR       | Synonymous with blockchain name for Near                                |
| tx_count        | NUMBER(38, 0) | The number of transactions in the block                                 |
| header          | json variant  | A json queryable column containing the block's header information      |
| tx              | array         | An array of json queryable objects containing each tx and decoded logs |
| ingested_at     | TIMESTAMP     | The time this data was ingested into the table by Snowflake            |
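As a rough illustration (not taken from this repo), the `header` variant and `tx` array columns can be queried in Snowflake along the following lines; `<blocks_table>` and `<transactions_table>` are placeholders, not this project's actual model names:
```
-- Placeholder table names; substitute the project's actual Chainwalkers source tables.
SELECT
    block_id,
    block_timestamp,
    header:height::NUMBER AS header_height,  -- JSON field access on the variant column
    tx_count
FROM <blocks_table>
ORDER BY block_id DESC
LIMIT 10;

-- The tx array can be exploded into one row per transaction object with LATERAL FLATTEN:
SELECT
    t.tx_id,
    f.value AS tx_object
FROM <transactions_table> t,
    LATERAL FLATTEN(input => t.tx) f;
```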
@ -124,7 +151,7 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
Data in this DBT project is written to the `NEAR` database in MetricsDAO.
This database has 2 schemas, one for `DEV` and one for `PROD`. As a contributor you have full permission to write to the `DEV` schema. However, the `PROD` schema can only be written to by MetricsDAO's DBT Cloud account. The DBT Cloud account controls running / scheduling models against the `PROD` schema.
## Branching / PRs
@ -138,6 +165,6 @@ When creating a PR please include the following details in the PR description:
## More DBT Resources:
* Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
* Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
* Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices


@ -25,6 +25,7 @@ clean-targets: # directories to be removed by `dbt clean`
- "dbt_packages"
on-run-start:
- "{{ create_udfs() }}"
- "{{create_sps()}}"
- "{{create_get_nearblocks_fts()}}"
@ -51,3 +52,6 @@ tests:
vars:
"dbt_date:time_zone": GMT
STREAMLINE_INVOKE_STREAMS: False
STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES: False
UPDATE_UDFS_AND_SPS: False

macros/create_udfs.sql Normal file

@ -0,0 +1,17 @@
{% macro create_udfs() %}
{% if var("UPDATE_UDFS_AND_SPS") %}
{% if target.database != "NEAR_COMMUNITY_DEV" %}
{% set sql %}
CREATE schema if NOT EXISTS silver;
CREATE schema if NOT EXISTS streamline;
{{ create_udf_introspect() }}
{{ create_udf_s3_list_directories() }}
{{ create_udf_s3_list_objects() }}
{{ create_udf_s3_copy_objects() }}
{{ create_udf_s3_copy_objects_overwrite() }}
{% endset %}
{% do run_query(sql) %}
{% endif %}
{% endif %}
{% endmacro %}


@ -0,0 +1,62 @@
{% macro create_udf_introspect() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_introspect(
echo STRING
) returns text api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/introspect'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/introspect'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_list_directories() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_list_directories(
path STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_directories'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_directories'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_list_objects() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_list_objects(
path STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_objects'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_copy_objects() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_copy_objects(
paths ARRAY,
source STRING,
target STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api headers = (
'overwrite' = '1'
) max_batch_rows = 1 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_copy_objects_overwrite() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_objects_overwrite(
paths ARRAY,
source STRING,
target STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api headers = (
'overwrite' = '1'
) max_batch_rows = 1 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{%- endif %};
{% endmacro %}
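As a non-authoritative sketch, once these external functions have been created they can be exercised ad hoc from a Snowflake worksheet roughly like this; the bucket path and 12-digit block-height prefix are examples only:
```
SELECT streamline.udf_s3_list_directories('s3://near-lake-data-mainnet/');
SELECT streamline.udf_s3_list_objects('s3://near-lake-data-mainnet/000080000000/*');
```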


@ -0,0 +1,44 @@
{{ config(
materialized = 'view',
tags = ['streamline'],
post_hook = "select * from {{this.schema}}.{{this.identifier}}",
comment = "incrementally sync Streamline.near_dev.blocks and Streamline.near_dev.shards"
) }}
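-- Find the most recent block-height prefix registered on the streamline.near_dev.blocks
-- external table within the lookback window (S3_LOOKBACK_HOURS, default 2 hours back).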
WITH latest_block AS (
SELECT
MAX(SPLIT_PART(file_name, '/', 1)) AS block_id
FROM
TABLE(
information_schema.external_table_file_registration_history(
table_name => 'streamline.near_dev.blocks',
start_time => DATEADD('hour', {{ var("S3_LOOKBACK_HOURS", -2) }}, SYSDATE()))
)
WHERE
operation_status = 'REGISTERED_NEW'
),
blocks AS (
-- Look back 1000 blocks from the max block_id and count forward 4000
SELECT
ROW_NUMBER() over (
ORDER BY
SEQ4()
) AS series,
b.block_id :: INTEGER - 1000 + series AS new_block_id,
RIGHT(REPEAT('0', 12) || new_block_id :: STRING, 12) AS prefix
FROM
TABLE(GENERATOR(rowcount => 4000)),
latest_block b
)
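-- For each generated prefix, list the matching objects in the public near-lake bucket
-- and copy them into the staging bucket using the external functions defined above.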
SELECT
b.prefix,
streamline.udf_s3_copy_objects(
streamline.udf_s3_list_objects(
's3://near-lake-data-mainnet/' || b.prefix || '/*'
),
's3://near-lake-data-mainnet/',
's3://stg-us-east-1-serverless-near-lake-mainnet-fsc/'
)
FROM
blocks b