- add incremental model for syncing near lake s3 bucket

- add s3 udfs
Julius Remigio 2023-01-19 18:43:55 -08:00
parent 107f0e519f
commit e70f097e52
5 changed files with 184 additions and 30 deletions


@ -5,36 +5,38 @@ Curated SQL Views and Metrics for the Near Blockchain.
What's Near? Learn more [here](https://near.org/)
## Setup
### Prerequisites
1. Complete the steps in the [Data Curator Onboarding Guide](https://docs.metricsdao.xyz/data-curation/data-curator-onboarding).
- Note that the Data Curator Onboarding Guide assumes that you will ask to be added as a contributor to a MetricsDAO project. Ex: https://github.com/MetricsDAO/near_dbt.
- However, if you have not yet been added as a contributor, or you'd like to take an even lower-risk approach, you can always follow the [Fork and Pull Workflow](https://reflectoring.io/github-fork-and-pull/) by forking the project you'd like to contribute to into your own GitHub account. Just make sure to:
- Fork the MetricsDAO repository.
- Git clone from your forked repository. Ex: `git clone https://github.com/YourAccount/near_dbt`.
- Create a branch for the changes you'd like to make. Ex: `git branch readme-update`.
- Switch to the branch. Ex: `git checkout readme-update`.
- Make your changes on the branch and follow the rest of the steps in the [Fork and Pull Workflow](https://reflectoring.io/github-fork-and-pull/) to notify the MetricsDAO repository owners to review your changes.
2. Download [Docker for Desktop](https://www.docker.com/products/docker-desktop).
- (Optional) You can run the Docker tutorial.
3. Install [VSCode](https://code.visualstudio.com/).
### Prerequisites: Additional Windows Subsystem for Linux (WSL) Setup
4. For Windows users, you'll need to install WSL and connect VSCode to WSL by
- Right clicking VSCode and running VSCode as admin.
- Installing [WSL](https://docs.microsoft.com/en-us/windows/wsl/install) by typing `wsl --install` in VSCode's terminal.
- Following the rest of the [VSCode WSL instructions](https://code.visualstudio.com/docs/remote/wsl) to create a new WSL user.
- Installing the Remote Development extension (ms-vscode-remote.vscode-remote-extensionpack) in VSCode.
- Finally, restarting VSCode in a directory in which you'd like to work. For example,
  - `cd ~/metricsDAO/data_curation/near_dbt`
  - `code .`
### Create the Environment Variables
1. Create a `.env` file with the following contents (note `.env` will not be committed to source) in the near_dbt directory (ex: near_dbt/.env):
```
SF_ACCOUNT=zsniary-metricsdao
SF_USERNAME=<your_metrics_dao_snowflake_username>
SF_PASSWORD=<your_metrics_dao_snowflake_password>
@ -43,9 +45,9 @@ What's Near? Learn more [here](https://near.org/)
SF_WAREHOUSE=DEFAULT
SF_ROLE=PUBLIC
SF_SCHEMA=SILVER
```
**Replace** the SF_USERNAME and SF_PASSWORD with the temporary Snowflake user name and password you received in the Snowflake step of the [Data Curator Onboarding Guide](https://docs.metricsdao.xyz/data-curation/data-curator-onboarding).
2. New to DBT? It's pretty dope. Read up on it [here](https://www.getdbt.com/docs/)
@ -53,6 +55,31 @@ What's Near? Learn more [here](https://near.org/)
Run the following commands from inside the Near directory (**you must have completed the Setup steps above^^**)
## Variables
To control which external table environment a model references, as well as whether a Stream is invoked at runtime, use the following control variables:
* STREAMLINE_INVOKE_STREAMS
  When True, invokes streamline on model run as normal
  When False, NO-OP
* STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES
  When True, uses the DEV Streamline schema (Ethereum_DEV)
  When False, uses the PROD Streamline schema (Ethereum)
Default values are False
* Usage:
`dbt run --var '{"STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES": True, "STREAMLINE_INVOKE_STREAMS": True}' -m ...`
To control the creation of UDFs or SPs with dbt run:
* UPDATE_UDFS_AND_SPS
  When True, executes all macros included in the on-run-start hooks within dbt_project.yml on model run as normal
  When False, none of the on-run-start macros are executed on model run
Default value is False
* Usage:
`dbt run --var '{"UPDATE_UDFS_AND_SPS": True}' -m ...`
### DBT Environment
1. In VSCode's terminal, type `cd near_dbt`.
@ -61,10 +88,10 @@ Run the following commands from inside the Near directory (**you must have compl
### DBT Project Docs
1. In VSCode, open another terminal.
2. In this new terminal, run `make dbt-docs`. This will compile your dbt documentation and launch a web-server at http://localhost:8080
Documentation is automatically generated and hosted using Netlify.
[![Netlify Status](https://api.netlify.com/api/v1/badges/12fc0079-7428-4771-9923-38ee6599db0f/deploy-status)](https://app.netlify.com/sites/mdao-near/deploys)
## Project Overview
@ -94,12 +121,12 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
| Column | Type | Description |
| --------------- | ------------ | ---------------------------------------------------------------- |
| record_id | VARCHAR | A unique id for the record generated by Chainwalkers |
| offset_id       | NUMBER(38, 0) | Synonymous with block_id for Near                                  |
| block_id        | NUMBER(38, 0) | The height of the chain this block corresponds with               |
| block_timestamp | TIMESTAMP     | The time the block was minted                                      |
| network         | VARCHAR       | The blockchain network (i.e. mainnet, testnet, etc.)              |
| chain_id        | VARCHAR       | Synonymous with blockchain name for Near                           |
| tx_count        | NUMBER(38, 0) | The number of transactions in the block                            |
| header          | json variant  | A json queryable column containing the block's header information |
| ingested_at     | TIMESTAMP     | The time this data was ingested into the table by Snowflake       |
@ -109,13 +136,13 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
| --------------- | ------------ | ---------------------------------------------------------------------- |
| record_id | VARCHAR | A unique id for the record generated by Chainwalkers |
| tx_id | VARCHAR | A unique on chain identifier for the transaction |
| tx_block_index  | NUMBER(38, 0) | The index of the transaction within the block. Starts at 0.            |
| offset_id       | NUMBER(38, 0) | Synonymous with block_id for Near                                       |
| block_id        | NUMBER(38, 0) | The height of the chain this block corresponds with                    |
| block_timestamp | TIMESTAMP     | The time the block was minted                                           |
| network         | VARCHAR       | The blockchain network (i.e. mainnet, testnet, etc.)                   |
| chain_id        | VARCHAR       | Synonymous with blockchain name for Near                                |
| tx_count        | NUMBER(38, 0) | The number of transactions in the block                                 |
| header          | json variant  | A json queryable column containing the block's header information      |
| tx              | array         | An array of json queryable objects containing each tx and decoded logs |
| ingested_at     | TIMESTAMP     | The time this data was ingested into the table by Snowflake            |
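As a rough illustration (not taken from this repo), the `header` variant and `tx` array columns can be queried in Snowflake along the following lines; `<blocks_table>` and `<transactions_table>` are placeholders, not this project's actual model names:
```
-- Placeholder table names; substitute the project's actual Chainwalkers source tables.
SELECT
    block_id,
    block_timestamp,
    header:height::NUMBER AS header_height,  -- JSON field access on the variant column
    tx_count
FROM <blocks_table>
ORDER BY block_id DESC
LIMIT 10;

-- The tx array can be exploded into one row per transaction object with LATERAL FLATTEN:
SELECT
    t.tx_id,
    f.value AS tx_object
FROM <transactions_table> t,
    LATERAL FLATTEN(input => t.tx) f;
```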
@ -124,7 +151,7 @@ Blocks and transactions are fed into the above two Near tables utilizing the Cha
Data in this DBT project is written to the `NEAR` database in MetricsDAO.
This database has 2 schemas, one for `DEV` and one for `PROD`. As a contributor you have full permission to write to the `DEV` schema. However, the `PROD` schema can only be written to by MetricsDAO's DBT Cloud account. The DBT Cloud account controls running / scheduling models against the `PROD` schema.
## Branching / PRs
@ -138,6 +165,6 @@ When creating a PR please include the following details in the PR description:
## More DBT Resources:
* Learn more about dbt [in the docs](https://docs.getdbt.com/docs/introduction)
* Check out [Discourse](https://discourse.getdbt.com/) for commonly asked questions and answers
* Check out [the blog](https://blog.getdbt.com/) for the latest news on dbt's development and best practices


@ -25,6 +25,7 @@ clean-targets: # directories to be removed by `dbt clean`
- "dbt_packages"
on-run-start:
- "{{ create_udfs() }}"
- "{{create_sps()}}"
- "{{create_get_nearblocks_fts()}}"
@ -51,3 +52,6 @@ tests:
vars:
"dbt_date:time_zone": GMT
STREAMLINE_INVOKE_STREAMS: False
STREAMLINE_USE_DEV_FOR_EXTERNAL_TABLES: False
UPDATE_UDFS_AND_SPS: False

macros/create_udfs.sql Normal file

@ -0,0 +1,17 @@
{% macro create_udfs() %}
{% if var("UPDATE_UDFS_AND_SPS") %}
{% if target.database != "NEAR_COMMUNITY_DEV" %}
{% set sql %}
CREATE schema if NOT EXISTS silver;
CREATE schema if NOT EXISTS streamline;
{{ create_udf_introspect() }}
{{ create_udf_s3_list_directories() }}
{{ create_udf_s3_list_objects() }}
{{ create_udf_s3_copy_objects() }}
{{ create_udf_s3_copy_objects_overwrite() }}
{% endset %}
{% do run_query(sql) %}
{% endif %}
{% endif %}
{% endmacro %}


@ -0,0 +1,62 @@
{% macro create_udf_introspect() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_introspect(
echo STRING
) returns text api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/introspect'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/introspect'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_list_directories() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_list_directories(
path STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_directories'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_directories'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_list_objects() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_list_objects(
path STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api max_batch_rows = 10 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/list_objects'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_copy_objects() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_copy_objects(
paths ARRAY,
source STRING,
target STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api headers = (
'overwrite' = '1'
) max_batch_rows = 1 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{%- endif %};
{% endmacro %}
{% macro create_udf_s3_copy_objects_overwrite() %}
CREATE
OR REPLACE EXTERNAL FUNCTION streamline.udf_s3_objects_overwrite(
paths ARRAY,
source STRING,
target STRING
) returns ARRAY api_integration = aws_stg_us_east_1_api headers = (
'overwrite' = '1'
) max_batch_rows = 1 AS {% if target.name == "prod" %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{% else %}
'https://jfqhk99kj1.execute-api.us-east-1.amazonaws.com/stg/s3/copy_objects'
{%- endif %};
{% endmacro %}
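As a non-authoritative sketch, once these external functions have been created they can be exercised ad hoc from a Snowflake worksheet roughly like this; the bucket path and 12-digit block-height prefix are examples only:
```
SELECT streamline.udf_s3_list_directories('s3://near-lake-data-mainnet/');
SELECT streamline.udf_s3_list_objects('s3://near-lake-data-mainnet/000080000000/*');
```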


@ -0,0 +1,44 @@
{{ config(
materialized = 'view',
tags = ['streamline'],
post_hook = "select * from {{this.schema}}.{{this.identifier}}",
comment = "incrementally sync Streamline.near_dev.blocks and Streamline.near_dev.shards"
) }}
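-- Find the most recent block-height prefix registered on the streamline.near_dev.blocks
-- external table within the lookback window (S3_LOOKBACK_HOURS, default 2 hours back).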
WITH latest_block AS (
SELECT
MAX(SPLIT_PART(file_name, '/', 1)) AS block_id
FROM
TABLE(
information_schema.external_table_file_registration_history(
table_name => 'streamline.near_dev.blocks',
start_time => DATEADD('hour', {{ var("S3_LOOKBACK_HOURS", -2) }}, SYSDATE()))
)
WHERE
operation_status = 'REGISTERED_NEW'
),
blocks AS (
-- Look back 1000 blocks from the max block_id and count forward 4000
SELECT
ROW_NUMBER() over (
ORDER BY
SEQ4()
) AS series,
b.block_id :: INTEGER - 1000 + series AS new_block_id,
RIGHT(REPEAT('0', 12) || new_block_id :: STRING, 12) AS prefix
FROM
TABLE(GENERATOR(rowcount => 4000)),
latest_block b
)
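-- For each generated prefix, list the matching objects in the public near-lake bucket
-- and copy them into the staging bucket using the external functions defined above.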
SELECT
b.prefix,
streamline.udf_s3_copy_objects(
streamline.udf_s3_list_objects(
's3://near-lake-data-mainnet/' || b.prefix || '/*'
),
's3://near-lake-data-mainnet/',
's3://stg-us-east-1-serverless-near-lake-mainnet-fsc/'
)
FROM
blocks b