Merge "Experimental component for a JSON data model"

2019-01-09 18:03:18 +00:00 · 2019-01-09 18:03:18 +00:00 · df4a73b1eb
parent 0a71e6038d 35565e8737
commit df4a73b1eb
2 changed files with 515 additions and 5 deletions
--- a/doc/source/index.rst
+++ b/doc/source/index.rst
@ -15,12 +15,11 @@ Train approved specs:

 Stein approved specs:

-..
-   .. toctree::
-      :glob:
-      :maxdepth: 1
+.. toctree::
+   :glob:
+   :maxdepth: 1

-      specs/stein/*
+   specs/stein/*

 Rocky approved specs:

--- a/specs/stein/json-data-model.rst
+++ b/specs/stein/json-data-model.rst
@ -0,0 +1,511 @@
+..
+ This work is licensed under a Creative Commons Attribution 3.0 Unported
+ License.
+
+ http://creativecommons.org/licenses/by/3.0/legalcode
+
+============================================
+Experimental component for a JSON data model
+============================================
+
+https://blueprints.launchpad.net/congress/+spec/json-data-model
+
+The translation from JSON source data to tabular Congress data places the
+limited resources of Congress developers in the way of the Congress users
+having the flexibility to quickly try/use any data source and data field as
+the need arises in their policy use cases. This spec is a proposal to
+experiment with removing the translation step, thus retaining the source JSON
+data model, as a promising way to overcome the limitation.
+
+Problem description
+===================
+
+In cloud computing, most of the source data is JSON. In the Congress model,
+the developers carefully understand and curate the source data to develop the
+data source drivers that make the data available to end users in table form.
+The translation into table model has the advantage of presenting a simpler
+and more consistent data model to the user writing policy on the data.
+
+However, with so many sources of data out there and new data being added to
+existing sources through frequent microversions (e.g., Nova), it is impossible
+for Congress developers to anticipate and keep up with all the data sources or
+even all the data fields in a given data source the end users desire.
+It often happens that as the user writes policy, the user finds out that
+additional data fields or sources are needed for the policy. In that case,
+the user needs to contact upstream developers, wait for them to add the data
+fields or sources, then wait for the new release, then upgrade to the latest
+release. All this before the end user can even test out the desired policy.
+
+In order to better support those operators who have rapidly emerging and
+evolving use cases, we need to experiment with an additional model: remove
+translation and developer curation from the loop, allowing users to integrate
+new data sources without code change.
+
+
+Proposed change
+===============
+
+Create an experimental component of Congress (tentatively congress-json)
+to allow writing policy over json data stored in PostgreSQL json data model.
+
+Congress-CloudState would ingest the source data mostly as-is, in JSON format,
+into a PostgreSQL database using the JSONB column type in PostgreSQL. The
+`PostgreSQL JSON syntax
+<https://hackernoon.com/how-to-query-jsonb-beginner-sheet-cheat-4da3aa5082a3>`_
+(see the policy_ section for examples) would allow for rules (SQL views) to
+operate directly over the JSON data.
+
+Because no translation is needed, new data sources can be configured without
+code change. For example, the following sample YAML config file configures
+Congress-CloudState to poll the nova endpoint. It tells Congress-CloudState to
+poll the path servers/detail, extract the results using the jsonpath
+expression ``$.servers[:]``, and finally populate the nova.servers table with
+the results.
+
+::
+
+  name: nova
+  poll: 60
+  authentication:
+    type: keystone_password
+    user: congress_access
+    password: secret
+  api_endpoint: https://stack.org/compute/v2.26
+  tables:
+    servers:
+      api_path: servers/detail
+      api_verb: get
+      jsonpath: $.servers[:]
+
+Architecture
+------------
+The architecture is fairly simple. Congress-CloudState has two major
+components.
+
+- The data ingestor polls the cloud services via API and populates the
+  database with the data. The ingestor also accepts webhook notifications.
+- The executor reads from the database to determine what actions to execute,
+  then calls the appropriate APIs in the cloud services to execute the
+  actions.
+
+The users mostly interact directly with the database to add/remove policy
+rules and query policy results.
+
+The following diagram shows the high level components and flow of data.
+
+::
+
+  +----------------------------------------------------------+
+  |                  Cloud Services                          |
+  +--------------------------------------------------^-------+
+         |                                           |
+         |                                           |
+  +------v--------+    +---------------+    +----------------+
+  | Congress-JSON |    |               |    |  Congress-JSON |
+  | Data Ingestor +---->  PostgreSQL   +---->  Executor      |
+  |               |    |               |    |                |
+  +---------------+    +----------^----+    +----------------+
+                            |     |
+                            |     |
+                       +----v----------+
+                       |               |
+                       |    Users      |
+                       |               |
+                       +---------------+
+
+Benefits
+--------
+
+- Very easy to add new data source because no translation is required. This
+  `blog post
+  <https://pndablog.com/2017/04/26/schema-on-write-vs-schema-on-read/>`_
+  has more discussion of the pros and cons of the approach (schema-on-read).
+- No code change and release needed to consume a new field in a new API
+  microversion. The operator simply changes the desired microversion in
+  Congress-CloudState configuration.
+  This `patch <https://review.openstack.org/#/c/611516/>`_ is a recent
+  example of how much time and work it takes to consume a new field previously
+  not included by Congress, not to mention the delay for users waiting for a
+  new release to consume a new field.
+- Much is available in the way of editors, debuggers, tutorials, and online
+  help for PostgreSQL language, making it easy for new users to write
+  policies.
+- PostgreSQL has `JSON indexing support
+  <https://blog.2ndquadrant.com/nosql-postgresql-9-4-jsonb/>`_
+  which allows for a wide class of performant queries.
+
+Alternatives
+------------
+
+- Directly make changes to Congress agnostic policy engine to support JSON data
+
+  - Here is a rough estimate of the work required to implement similar
+    functionality in Congress itself.
+
+    - Design a new version of Datalog language to support JSON. This step is
+      hard. As of now, there is no concensus in the Datalog community on how
+      best to do this.
+    - Major changes to congress/datalog/* to process the new language.
+    - Major changes to the policy engine to support the new language.
+    - Changes to congress/dse2/* to accommodate JSON data.
+
+  - Instead of attempting all the work outlined above, we propose to create
+    a very simple new component which leverages PostgreSQL allow policy over
+    JSON data.
+    If the approach proves successful, a decision could be made in the future
+    to allow congress-json to be used as a tool for extracting from JSON data
+    the table data needed by other parts of Congress (e.g., agnostic, z3).
+
+- Homegrow a policy engine supporting JSON data rather than use one
+  off-the-shelf
+
+  - It would take a tremendous amount of unnecessary development and
+    maintenance work.
+
+- Choose a datastore other than PostgreSQL
+
+  - The following are the major requirements for the datastore
+
+    - JSON data support
+    - Widely-adopted declarative language
+    - Performant (off-key) join
+    - Support for rules (views)
+    - Transactional write
+    - Strong open source community with appropriate licensing
+    - Highly-available deployment option
+    - Fine-grained permissions
+
+  - Most of the well-known NoSQL solutions do not provide performant off-key
+    joins
+  - MySQL/MariaDB has support for JSON data but limited indexing support
+    (typically requiring the indexed value be declared as a virtual column)
+  - Among all the options evaluated, PostgreSQL best satisfies the
+    requirements, with the following notable features:
+
+    - Flexible indexing for JSONB data type supporting performant off-key
+      joins
+    - Very strong open source community with permissive license
+    - A wealth of (first party and third party) tools and online knowledge for
+      learning, query writing, query debugging, and performance-tuning
+    - A flexible and expressive permissions system
+  - We propose to start with PostgreSQL, while leaving the option open to add
+    support for other datastores down the road through a plugin architecture.
+
+
+Policy
+------
+
+Here we give several examples to demonstrate how policy and rules would be
+created. Each interaction is shown in both classic Congress interaction as
+well as Congress-CloudState interaction for reader’s understanding.
+
+Sample data
+~~~~~~~~~~~
+The nova.servers table is the collection of all the JSON documents
+representing servers, each document in a row with a single column d containing
+the document. Each server is represented by a JSON document as supplied by
+Nova’s list servers (detailed) API. Here is a simplified version of a sample
+JSON document representing a Nova server (the UUIDs have been replaced with
+more readable strings for ease of illustration):
+
+::
+
+ {
+   "id":"server-134",
+   "name":"server 134",
+   "status":"ACTIVE",
+   "tags":[
+      "production",
+      "critical"
+   ],
+   "hostId":"host-05",
+   "host_status":"ACTIVE",
+   "metadata":{
+      "HA_Enabled":false
+   },
+   "tenant_id":"tenant-52",
+   "user_id":"user-830",
+   "flavor": {
+     "disk": 1,
+     "ephemeral": 0,
+     "extra_specs": {
+       "hw:cpu_policy": "dedicated",
+       "hw:mem_page_size": "2048"
+     },
+     "original_name": "m1.tiny.specs",
+     "ram": 512,
+     "swap": 0,
+     "vcpus": 1
+   }
+ }
+
+
+Example 1
+~~~~~~~~~
+
+- Create policy (schema).
+
+  - Congress syntax:
+    ::
+
+     congress policy create vm_error
+  - Congress-CloudState equivalent:
+    ::
+
+     CREATE SCHEMA vm_host_down; 
+
+- Create policy rule (view) identifying those servers whose host is down.
+
+  - Congress syntax:
+    ::
+
+     congress policy rule create vm_host_down '
+       error(server_id) :-
+         nova:servers(id=server_id, host_id=host_id),
+         nova:hypervisors(id=host_id, state="DOWN")'
+
+  - Congress-CloudState equivalent:
+    ::
+
+     CREATE VIEW vm_host_down.error AS
+     SELECT d->>'id' AS server_id
+     FROM   nova.servers
+     WHERE  d->>'host_status' = 'DOWN';
+
+    - Note on syntax: the :code:`->>` operator accesses the content of a JSON
+      object field and returns the result as text.
+
+- Query the results in the policy table.
+
+  - Congress syntax:
+    ::
+
+     congress policy row list vm_host_down error
+
+  - Congress-CloudState equivalent:
+    ::
+
+     SELECT * FROM vm_host_down.error;
+
+- Create policy rule (view) identifying 'critical'-tagged servers whose host
+  is down.
+
+  - Congress syntax:
+    ::
+
+     congress policy rule create vm_host_down '
+       critical(server_id) :-
+         nova:servers(id=server_id, host_id=host_id),
+         nova:hypervisors(id=host_id, state="DOWN"),
+         nova:tags(server_id=server_id, tag="critical")'
+
+  - Congress-CloudState equivalent:
+    ::
+
+     CREATE VIEW vm_host_down.critical AS
+     SELECT d->>'id' AS server_id
+     FROM   nova.servers
+     WHERE  d->>'host_status' = 'DOWN'
+     AND    d->'tags' ? 'critical';
+
+    - Note on syntax:
+
+      - The :code:`->` operator accesses the content of a JSON object field
+        and returns the result as JSON. In this example, :code:`d->'tags'`
+        returns the array of tags associated with each server.
+      - The :code:`?` operator checks that a string is a top-level key/element
+        in a JSON structure. In this example, the
+        :code:`d->'tags' ? 'critical'` condition checks that the string
+        'critical' is in the array of tags retrieved by :code:`d->'tags'`.
+
+Example 2
+~~~~~~~~~
+
+- Create policy (schema).
+
+  - Congress syntax:
+    ::
+
+     congress policy create production_stable
+
+  - Congress-CloudState equivalent:
+    ::
+
+     CREATE SCHEMA production_stable;
+
+- Create policy rule (view) identifying production servers using an unstable
+  image.
+
+  - Congress syntax:
+    ::
+
+     congress policy rule create production_stable '
+       error(id) :-
+         nova:servers(id=id, image=image_id),
+         nova:tags(id=id, tag="production"),
+         glance:tags(image_id=image_id, tag="unstable")'
+
+  - Congress-CloudState equivalent:
+    ::
+
+     CREATE VIEW production_stable.error  AS
+     SELECT server.d->>'id'               AS server_id,
+            image.d->>'id'                AS image_id
+     FROM   nova.servers server
+     JOIN   glance.images image
+     ON     server.d->'image'->'id' = image.d->'id'
+     WHERE  (server.d->'tags' ? 'production')
+     AND    (image.d->'tags' ? 'unstable');
+
+    - Note on syntax: the join-condition
+      :code:`server.d->'image'->'id' = image.d->'id'` matches the glance image
+      with the server whose image ID matches the glance image ID.
+
+
+Example 3
+~~~~~~~~~
+
+This example illustrates that accessing deeply nested data can be relatively
+straightforward.
+
+The following view (rule) identifies all the servers using a flavor with cpu
+policy 'dedicated'. Despite the information being buried several layers deep,
+it is relatively straightforward to access simply by following the structure.
+
+::
+
+  CREATE VIEW dedicated_servers AS
+  SELECT *
+  FROM   nova.servers
+  WHERE  d -> 'flavor' -> 'extra_specs' ->> 'hw:cpu_policy' = 'dedicated';
+
+
+Policy actions
+--------------
+
+Policy actions would work similarly to agnostic engine.
+The policy writer would define a policy rule (view) specifying in each row
+what action to call on which service using what parameter values.
+
+As in agnostic engine, the new component would query the 'execute' table and
+then call the appropriate client/API methods according to the rows in the
+'execute' table of each policy.
+
+
+Data sources
+------------
+
+A new type of data source is introduced which is configured through YAML files
+and stores data in the original JSON format without translation.
+
+
+Data model impact
+-----------------
+
+No impact on the data model.
+
+
+REST API impact
+---------------
+
+Minimal REST API impact. The webhook model for accepting webhooks is naturally
+extended to the JSON data sources when configured.
+
+
+Security impact
+---------------
+
+No security impact when the experimental component is not explicitly.
+When enabled, there is an additional data access which would be protected by
+PostgreSQL permissions set up by Congress.
+
+
+Notifications impact
+--------------------
+
+No impact.
+
+
+Other end user impact
+---------------------
+
+End users will have the option of using the well-known
+PostgreSQL syntax to write policy directly on the source data format,
+having direct access to all the source data fields.
+
+
+Performance impact
+------------------
+
+Preliminary performance testing on 100,000 records show adequate performance
+on both data ingestion and query evaluation. If a query performance issues
+does arise, there is a wealth of tools and knowledge for tuning query
+performance in PostgreSQL.
+
+
+Other deployer impact
+---------------------
+
+Additional config/commandline options to control the deployment of the new
+experimental component.
+
+
+Developer impact
+----------------
+
+congress-json is expected to be a very simple component. As a result,
+congress-json is expected to require very minimal maintenance from
+developers.
+
+
+Implementation
+==============
+
+Assignee(s)
+-----------
+
+Primary assignee: ekcs
+
+Others welcome to pick up tasks.
+
+
+Work items
+----------
+
+- Database initializer
+- JSON-postgres data source driver
+- Integration testing
+
+
+Dependencies
+============
+
+The only major new dependency is PostgreSQL, which is already used in
+OpenStack deployments. Version 9.4 or higher required (Ubuntu xenial comes with
+9.5). It affects only those enabling the experimental new component.
+
+
+Testing
+=======
+
+Simple tempest scenarios for the new component. They may piggyback on the
+existing congress-tempest-postgres jobs.
+
+
+Documentation impact
+====================
+
+Standard documentation expected for the new component.
+
+
+References
+==========
+
+* `Schema-on-read pros and cons
+  <https://pndablog.com/2017/04/26/schema-on-write-vs-schema-on-read/>`_
+* `PostgreSQL JSON operators and functions
+  <https://www.postgresql.org/docs/9.4/functions-json.html>`_
+* `PostgreSQL JSON syntax cheat sheet
+  <https://hackernoon.com/how-to-query-jsonb-beginner-sheet-cheat-4da3aa5082a3>`_
+* `JSON indexing support
+  <https://blog.2ndquadrant.com/nosql-postgresql-9-4-jsonb/>`_