Overview

OrgStack is a framework for establishing contracts between components in your data pipelines. These docs contain all of the information necessary to get up and running on the OrgStack platform. If you have additional questions after reading these docs, please feel free to contact us.

The two primary interfaces to the OrgStack platform are the web platform and the command-line interface (CLI).

Use Cases

An Example Scenario

Suppose we have a cron job that produces JSON data with the following structure:

{
    "first_name": "Willy",
    "last_name": "Wonka",
    "age": 40,
    "favorite_candy": "Sugar Plums"
}

We expect that this data will always contain the keys first_name, last_name, etc. Several downstream components in our data pipeline consume this data, so it's imperative that the data's structure remains consistent. On the following day, our cron job produces the following output:

{
    "first_name": "Wyatt",
    "last_name": "Earp",
    "age": 35,
    "favorite_candy": "Lollipops"
}

Notice that the structure is unchanged; the keys first_name, last_name, etc. are still present in this output. Only the values of these keys change from run to run.

OrgStack's Solution

OrgStack generates a schema from this output, and it continuously verifies that the data's structure remains unchanged every time your code is built via CI/CD. This same concept can be applied to CSV or XML data sources, as well. Think of our cron job as a producer, or a data source, and think of all the downstream components that rely on this data as consumers. OrgStack establishes a contract between producing and consuming components, which eliminates an entire class of bugs caused by malformed data, eliminates the need to manually write schema validations, and ensures that your production data pipelines stay online.
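To make the idea of "same structure, different values" concrete, here is a minimal Python sketch of a structural comparison. It is not OrgStack's actual schema algorithm; the infer_schema helper is a hypothetical name introduced for illustration, and it assumes that a schema is simply the set of keys plus the type of each value (lists are described by their first element only).

```python
import json


def infer_schema(value):
    """Describe the structure of a parsed JSON value: keys and type names only."""
    if isinstance(value, dict):
        return {key: infer_schema(item) for key, item in value.items()}
    if isinstance(value, list):
        # Simplifying assumption: a list is described by its first element.
        return [infer_schema(value[0])] if value else []
    return type(value).__name__


day_one = json.loads(
    '{"first_name": "Willy", "last_name": "Wonka", "age": 40, "favorite_candy": "Sugar Plums"}'
)
day_two = json.loads(
    '{"first_name": "Wyatt", "last_name": "Earp", "age": 35, "favorite_candy": "Lollipops"}'
)
# A malformed record: two keys are missing and age has become a string.
broken = json.loads('{"first_name": "Wyatt", "age": "35"}')

print(infer_schema(day_one) == infer_schema(day_two))  # True
print(infer_schema(day_one) == infer_schema(broken))   # False
```

The first comparison succeeds because only the values differ; the second fails because the keys and types no longer match, which is the kind of change a contract between producer and consumer is meant to catch.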

Common Use Cases

Definitions

Getting Started

Follow the steps below to take full advantage of the OrgStack platform:

  1. Sign up for OrgStack
  2. Create a Data Source in the web platform
  3. Install the CLI
  4. Configure the CLI
  5. Verify your Data Source
  6. Consume your Data Source in other repositories

Sign Up For OrgStack

In order to start using the web platform or CLI, you'll need an OrgStack account. Please contact us to learn more about account creation.

Creating a Data Source

Create a data source in the OrgStack web platform by providing a name, version, description, data type, and example data file. OrgStack currently supports json, csv, and xml data sources. The example data file should be a sample of output that you expect your data source to produce. Using our example scenario above, we would upload the following example data file for a json data source:

{
    "first_name": "Willy",
    "last_name": "Wonka",
    "age": 40,
    "favorite_candy": "Sugar Plums"
}

Installing the CLI

The OrgStack command-line interface is currently available via the pip package manager for Python. Make sure pip is installed, then run the following command to install the OrgStack CLI:

pip install orgstackcli

This package can be installed globally, or inside a virtual environment.


NOTE

The OrgStack CLI is a Python 3 application, and therefore requires Python 3 to be installed on your system.


Run the following command at any time to learn more about CLI usage:

orgstack --help

Configuring the CLI

The OrgStack CLI requires the user to authenticate before executing further commands. Run the following command to authenticate:

orgstack configure

At the prompt, enter the same login credentials that you use to access the OrgStack web platform.

This profile is stored in $HOME/.orgstack/credentials.json.

You can overwrite the existing profile by running orgstack configure a second time.


NOTE

You only need to run orgstack configure once per machine, unless you're intentionally overwriting an existing profile.


Verifying Data Sources

Change your working directory to the repository where your data source is defined.

OrgStack expects a file named orgstack.json to be in the root of your repository. Below is an example structure for a Python project.

example-project/
    src/
        driver.py
    orgstack.json    # <-- Our configuration file
    generate.py
    requirements.txt
    setup.py
    README.md

In the following example orgstack.json configuration file, we define version 1.0.0 of a data source called Example Source X.

{
    "producing": [
        {
            "name": "Example Source X",
            "version": "1.0.0",
            "build_sequence": ["python", "generate.py"]
        }
    ]
}

producing definitions inside orgstack.json require only 3 fields: name, version, and build_sequence.

The build_sequence provides a way of reproducing the data source's output, so that we can compare its schema to the well-established, valid schema. This is a command that you would typically run in your terminal, so it must be executable. The build_sequence command must write its output to stdout (a normal print() call is sufficient). The example above runs a Python script to generate the output, but a build_sequence could be any executable that produces output, from using curl on a remote resource to reading in a local file.
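The stdout requirement can be illustrated with a short Python sketch that runs a command given as a list of arguments and captures what it prints. This is only an illustration of the idea, not how the OrgStack CLI is implemented; capture_build_output is a hypothetical helper, and the inline command stands in for ["python", "generate.py"] so the sketch runs without extra files.

```python
import json
import subprocess
import sys


def capture_build_output(build_sequence):
    """Run the command and return whatever it wrote to stdout."""
    result = subprocess.run(
        build_sequence, capture_output=True, text=True, check=True
    )
    return result.stdout


# Stand-in for ["python", "generate.py"]: a one-line script that prints JSON.
build_sequence = [
    sys.executable,
    "-c",
    "import json; print(json.dumps({'first_name': 'Jerry', 'age': 34}))",
]

record = json.loads(capture_build_output(build_sequence))
print(sorted(record))
```

Anything the command writes to stderr, or to a file, would not appear in the captured output, which is why the data itself has to go to stdout.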

Sticking with our cron example above, assume the command python generate.py outputs the following data:

{
    "first_name": "Jerry",
    "last_name": "Seinfeld",
    "age": 34,
    "favorite_candy": "Junior Mints"
}
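For reference, a generate.py that produces this output could be as small as the sketch below. The contents of the real script are not specified in these docs, so build_record and its hard-coded values are assumptions used purely for illustration; a real script would typically fetch or compute the record.

```python
import json
import sys


def build_record():
    # Hypothetical stand-in for the real data source.
    return {
        "first_name": "Jerry",
        "last_name": "Seinfeld",
        "age": 34,
        "favorite_candy": "Junior Mints",
    }


def main(out=sys.stdout):
    # Write the record as JSON to stdout so the build_sequence can capture it.
    json.dump(build_record(), out, indent=4)
    out.write("\n")


if __name__ == "__main__":
    main()
```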

Once the orgstack.json configuration file is complete, run the following command to verify all of the data sources defined in the repository:

orgstack verify

This command will produce build reports for all of the data sources defined in the working repository. These reports are available in the web platform.

Make sure to configure the OrgStack CLI and run the orgstack verify command as part of your CI/CD process. We recommend that you verify data sources immediately after you run your unit tests.


NOTE

Set the ORGSTACK_ENV environment variable to sandbox while using the orgstack verify command for local development. Leave the variable unset when you deploy OrgStack to your CI/CD server.
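If you launch orgstack verify from a Python task runner, one way to apply this rule is sketched below. The verify_env and run_verify helpers are hypothetical names, not part of the OrgStack CLI, and the sketch assumes the CLI is already installed and configured. How the CLI reports failures through its exit code is not documented here, so the sketch returns the completed process without interpreting it.

```python
import os
import subprocess


def verify_env(local_development, base_env=None):
    """Build the environment for running `orgstack verify`."""
    env = dict(os.environ if base_env is None else base_env)
    if local_development:
        # Local development: use the sandbox environment.
        env["ORGSTACK_ENV"] = "sandbox"
    else:
        # CI/CD: leave the variable unset.
        env.pop("ORGSTACK_ENV", None)
    return env


def run_verify(local_development):
    """Run `orgstack verify` with the appropriate environment."""
    return subprocess.run(
        ["orgstack", "verify"], env=verify_env(local_development), check=False
    )
```

Calling run_verify(True) on a developer machine and run_verify(False) on the CI/CD server keeps the two cases explicit in one place.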


Consuming Data Sources

Once you've defined at least one producing data source, you're ready to start consuming that data in downstream repositories. Below is an example of an orgstack.json file in a consuming repository:

{
    "consuming": [
        {"name": "Example Source X", "version": "1.0.0"}
    ]
}

consuming definitions inside orgstack.json require only 2 fields: name and version.

By defining this dependency, you're ensuring that the data produced by Example Source X will always remain available and consistent for this consumer.
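Before committing an orgstack.json file, it can be useful to check locally that each definition carries the fields shown in the examples. The Python sketch below is a hypothetical pre-flight check and not a feature of the OrgStack CLI; the field lists are taken from the example configurations in these docs.

```python
import json

# Fields used by the example producing and consuming definitions in these docs.
REQUIRED_FIELDS = {
    "producing": ("name", "version", "build_sequence"),
    "consuming": ("name", "version"),
}


def find_missing_fields(config):
    """Return a list of problems found in a parsed orgstack.json."""
    problems = []
    for section, required in REQUIRED_FIELDS.items():
        for index, entry in enumerate(config.get(section, [])):
            for field in required:
                if field not in entry:
                    problems.append(f"{section}[{index}] is missing '{field}'")
    return problems


# A consuming definition that forgot its version.
example = json.loads('{"consuming": [{"name": "Example Source X"}]}')
print(find_missing_fields(example))
```

An empty list means every definition has the expected fields; the example above reports one problem for the missing version.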

OrgStack keeps track of all downstream repositories, so that maintainers of upstream data sources can better understand usage. Usage metrics are available for each data source in the web platform.

Additional Resources

Couldn't find an answer to your question? Feel free to contact us.