Compression / Archive Utility library
This namespace contains functions for reading and writing compressed archive files. Currently only supports gzipped tar archives.
Catalog generation and manipulation
A suite of functions that aid in constructing random catalogs, or randomly modifying an existing catalog (wire format or parsed).
Puppet catalog parsing
Functions that handle conversion of catalogs from wire format to
internal PuppetDB format.
The wire format is described in detail in [the spec](../spec/catalog-wire-format.md).
There are a number of transformations we apply to wire format
catalogs during conversion to our internal format; while wire
format catalogs contain complete records of all resources and
edges, and most things are properly encoded as lists or maps, there
are still a number of places where structure is absent or lacking:
1. Resource specifiers are represented as opaque strings, like
`Class[Foobar]`, as opposed to something like
`{"type" "Class" "title" "Foobar"}`
2. Tags are represented as lists (and may contain duplicates)
instead of sets
3. Resources are represented as a list instead of a map, making
operations that need to correlate against specific resources
unnecessarily difficult
4. Keys to all maps are strings (to conform with JSON), instead of
more convenient Clojure keywords
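As a rough illustration, the following hedged sketch applies those four transformations to a single wire-format resource; the helper names are invented for this example and are not this namespace's actual functions:

```clojure
(require '[clojure.walk :as walk])

;; "Class[Foobar]" -> {:type "Class" :title "Foobar"}
(defn parse-resource-spec [spec]
  (let [[_ rtype title] (re-matches #"(.+?)\[(.*)\]" spec)]
    {:type rtype :title title}))

;; keywordize keys and turn the tag list into a set
(defn transform-resource [wire-resource]
  (-> (walk/keywordize-keys wire-resource)
      (update :tags set)))

;; turn the wire format's resource list into a map keyed by resource-spec
(defn index-resources [wire-resources]
  (into {}
        (for [r (map transform-resource wire-resources)]
          [(select-keys r [:type :title]) r])))

(parse-resource-spec "Class[Foobar]")
;; => {:type "Class", :title "Foobar"}

(index-resources
 [{"type" "Class" "title" "Foobar" "tags" ["class" "foobar" "class"]}])
;; => {{:type "Class", :title "Foobar"}
;;     {:type "Class", :title "Foobar", :tags #{"class" "foobar"}}}
```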
### Terminology
Unless otherwise indicated, all terminology for catalog components
matches terms listed in [the spec](../spec/catalog-wire-format.md).
### Transformed constructs
### Resource Specifier (resource-spec)
A map of the form `{:type "Class" :title "Foobar"}`. This is a
unique identifier for a resource within a catalog.
### Resource
A map that represents a single resource in a catalog:
{:type "..."
:title "..."
:... "..."
:tags #{"tag1", "tag2", ...}
:parameters {:name1 "value1"
:name2 "value2"
...}}
Certain attributes are treated specially:
* `:type` and `:title` are used to produce a `resource-spec` for
this resource
### Edge
A representation of an "edge" (dependency or containment) in the
catalog. All edges have the following form:
{:source <resource spec>
:target <resource spec>
:relationship <relationship id>}
A relationship identifier can be one of:
* `:contains`
* `:required-by`
* `:notifies`
* `:before`
* `:subscription-of`
### Catalog
A wire-format-neutral representation of a Puppet catalog. It is a
map with the following structure:
{:certname "..."
:version "..."
:resources {<resource-spec> <resource>
<resource-spec> <resource>
...}
:edges #{<dependency-spec>,
<dependency-spec>,
...}}
Cheshire related functions
This front-ends the common set of core cheshire functions:

* generate-string
* generate-stream
* parse
* parse-strict
* parse-string
* parse-stream

This namespace, when 'required', will also set up some common JSON encoders globally, so you can avoid doing this for each call.
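For orientation, here is a minimal sketch of the kind of round-trip these wrappers front-end, calling cheshire.core directly:

```clojure
(require '[cheshire.core :as json])

;; Clojure data -> JSON string
(json/generate-string {:certname "foo.example.com" :deactivated nil})
;; => "{\"certname\":\"foo.example.com\",\"deactivated\":null}"

;; JSON string -> Clojure data, keywordizing keys
(json/parse-string "{\"certname\":\"foo.example.com\"}" true)
;; => {:certname "foo.example.com"}
```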
Benchmark suite
This command-line utility will simulate catalog submission for a population. It requires a separate, running instance of PuppetDB to submit catalogs to.
We attempt to approximate a number of hosts submitting catalogs at the specified runinterval with the specified rate-of-churn in catalog content.
### Running parallel Benchmarks
If you are running up against the upper limit at which Benchmark can submit simulated requests, you can run multiple instances of benchmark and make use of the --offset flag to shift the cert numbers.
Example (probably run on completely separate hosts):

```
benchmark --offset 0 --numhosts 100000
benchmark --offset 100000 --numhosts 100000
benchmark --offset 200000 --numhosts 100000
...
```
### Preserving host-map data
By default, each time Benchmark is run, it initializes the host-map catalog, factset and report data randomly from the given set of base --catalogs, --factsets and --reports files. When re-running benchmark, this causes excessive load on puppetdb due to the completely changed catalogs/factsets that must be processed.
To avoid this, set --simulation-dir to preserve all of the host map data between runs as nippy/frozen files. Benchmark will then load and initialize a preserved host matching a particular host-# from these files at startup. Missing hosts (if --numhosts exceeds the number preserved, for example) will be initialized randomly, as by default.
### Mutating Catalogs and Factsets
The benchmark tool automatically refreshes timestamps and transaction ids when submitting catalogs, factsets and reports, but the content does not change.
To simulate system drift, code changes and fact changes, use '--rand-catalog=PERCENT_CHANCE:CHANGE_COUNT' and '--rand-facts=PERCENT_CHANCE:PERCENT_CHANGE'.
The former indicates the chance that any given catalog will perform CHANGE_COUNT resource mutations (additions, modifications or deletions). The latter is the chance that any given factset will mutate PERCENT_CHANGE of its fact values. These may be set multiple times, provided that the PERCENT_CHANCE values do not sum to more than 100%.
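The combination semantics can be pictured as each spec claiming a slice of a single 0-100 roll per catalog. The sketch below is illustrative only (it is not benchmark's actual implementation), assuming specs parsed into maps like {:chance 20 :changes 5}:

```clojure
;; Roll once per catalog; at most one spec's CHANGE_COUNT applies.
(defn pick-change-count [specs]
  (let [roll (rand 100)]
    (loop [remaining specs
           threshold 0]
      (when-let [{:keys [chance changes]} (first remaining)]
        (if (< roll (+ threshold chance))
          changes
          (recur (rest remaining) (+ threshold chance)))))))

;; e.g. 20% chance of 5 mutations, 10% chance of 20, 70% chance of none
(pick-change-count [{:chance 20 :changes 5} {:chance 10 :changes 20}])
```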
By default edges are not included in catalogs. If --include-edges is true, then add-resource and del-resource will involve edges as well:

* adding a resource adds a single 'contains' edge with the source being one of the catalog's original (non-added) resources.
* deleting a resource removes one of the added resources (if there are any) and its related leaf edge.
By ensuring we only ever delete leaves from the graph, we maintain graph integrity, which is important since PuppetDB validates the edges on ingestion.
This provides only limited exercise of edge mutation, which seemed like a reasonable trade-off given that edge submission is deprecated. Running with --include-edges also impacts the nature of catalog mutation, since original resources will never be removed from the catalog.
See add-resource, mod-resource and del-resource for details of resource and edge changes.
TODO: Fact addition/removal
TODO: Mutating reports
### Viewing Metrics
There are benchmark metrics which can be viewed via JMX.
WARNING: DO NOT DO THIS WITH A PRODUCTION OR INTERNET-ACCESSIBLE INSTANCE! This gives remote access to the JVM internals, including potentially secrets. If you absolutely must (you don't), read about using certs with JMX to do it securely. You are better off using the metrics API or Grafana metrics exporter.
Add the following properties to your Benchmark Java process on startup:
```
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.port=5555
-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote.rmi.port=5556
```
Then with a tool like VisualVM, you can add a JMX Connection, and (with the MBeans plugin) view puppetlabs.puppetdb.benchmark metrics.
# Data Generation utility
This command-line tool can generate a base sampling of catalog, fact and
report files suitable for consumption by the PuppetDB benchmark utility.
Note that it is only necessary to generate a small set of initial sample
data since benchmark will permute per node differences. So even if you want
to benchmark 1000 nodes, you don't need to generate initial
catalog/fact/report json for 1000 nodes.
If you want a representative sample with big differences between catalogs,
you will need to run the tool multiple times. For example, if you want a set
of 5 large catalogs and 10 small ones, you will need to run the tool twice
with the desired parameters to create the two different sets.
## Flag Notes
### Catalogs
#### Resource Counts
The num-resources flag is total and includes num-classes. So if you set
--num-resources to 100 and --num-classes to 30, you will get a catalog with a
hundred resources, thirty of which are classes.
#### Edges
A containment edge is always generated between the main stage and each class, and non-class resources get a containment edge to a random class. So there will always be a base set of containment edges equal to the resource count. The --additional-edge-percent governs how many non-containment edges are added on top of that to simulate some further catalog structure. There is no guarantee of relationship depth (chains such as Stage(main) -> Class(foo) -> Class(bar) -> Resource(biff), for example), but it does ensure some edges between classes, as well as between class and non-class resources.
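A hedged sketch of that base edge structure, reusing the edge shape described in the catalog parsing section (the function and the way classes are chosen are illustrative, not the generate tool's actual code):

```clojure
;; Stage[main] contains every class; each non-class resource is
;; contained by a randomly chosen class.
(defn base-containment-edges [classes non-class-resources]
  (concat
   (for [cls classes]
     {:source {:type "Stage" :title "main"}
      :target cls
      :relationship :contains})
   (for [resource non-class-resources]
     {:source (rand-nth (vec classes))
      :target resource
      :relationship :contains})))

(count (base-containment-edges
        [{:type "Class" :title "apache"} {:type "Class" :title "ntp"}]
        [{:type "File" :title "/etc/ntp.conf"}]))
;; => 3, i.e. one containment edge per resource (class or not)
```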
#### Large Resource Parameter Blobs
The --blob-count and --blob-size parameters control inclusion of large
text blobs in catalog resources. By default one ~100kb blob is
added per catalog.
Set --blob-count to 0 to exclude blobs altogether.
### Facts
#### Baseline Facts
Each fact set begins with a set of baseline facts from:
[baseline-agent-node.json](./resources/puppetlabs/puppetdb/generate/samples/facts/baseline-agent-node.json).
These provide some consistency for a common set of baseline fact paths
present on any puppet node. The generator then mutates half of the values to
provide variety.
#### Fact Counts
The --num-facts parameter controls the number of facts to generate per host.
There are 376 leaf facts in the baseline file. Setting num-facts less than
this will remove baseline facts to approach the requested number of facts.
(Empty maps and arrays are not removed from the factset, so it will never
pare down to zero.) Setting num-facts to a larger number will add facts of
random depth based on --max-fact-depth until the requested count is reached.
#### Total Facts Size
The --total-fact-size parameter controls the total weight of the fact values
map in kB. Weight is added after count is reached. So if the weight of the
adjusted baseline facts already exceeds the total-fact-size, nothing more is
done. No attempt is made to pare facts back down to the requested size, as this
would likely require removing facts.
#### Max Fact Depth
The --max-fact-depth parameter is the maximum nested depth a fact added to
the baseline facts may reach. For example, a max depth of 5 would mean that
an added fact would at most be a nest of four maps:
{foo: {bar: {baz: {biff: boz}}}}
Since depth is picked randomly for each additional fact, this does not
guarantee facts of a given depth. Nor does it directly affect the average
depth of facts in the generated factset, although the larger the
max-fact-depth and num-facts, the more likely that the average depth will
drift higher.
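A hedged sketch of what adding one fact bounded by --max-fact-depth could look like (the key and value generation here are purely illustrative):

```clojure
(defn random-nested-fact [max-fact-depth]
  (let [depth (inc (rand-int max-fact-depth))]   ; pick 1..max-fact-depth
    (reduce (fn [value level]
              {(str "level-" level) value})      ; wrap one map per level
            "leaf-value"
            (range 1 depth))))

(random-nested-fact 5)
;; e.g. => {"level-3" {"level-2" {"level-1" "leaf-value"}}}
```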
#### Package Inventory
The --num-packages parameter sets the number of packages to generate for the
factset's package_inventory array. Set to 0 to exclude.
### Reports
#### Reports per Catalog
The --num-reports flag governs the number of reports to generate per
generated catalog. Since one catalog is generated per host, this means you
will end up with num-hosts * num-reports reports.
#### Variation in Reports
A report details changes, or the lack thereof, during enforcement of the puppet
catalog on the host. Since the benchmark tool currently chooses randomly from the
given report files, a simple mechanism for determining the likelihood of
receiving a report of a particular size (with lots of changes, few changes or
no changes) is to produce multiple reports of each type per host to generate
a weighted average. (If there are 10 reports, 2 are large and 8 are small,
then it's 80% likely that any given report submitted by benchmark will
be of the small variety.)
The knobs to control this with the generate tool are:
* --num-reports, to determine the base number of reports to generate per catalog
* --high-change-reports-percent, percentage of that base to generate as
reports with a high number of change events, as determined by:
* --high-change-resource-percent, percentage of resources in a high change
report that will experience events (changes)
* --low-change-reports-percent, percentage of the base reports to generate
as reports with a low number of change events as determined by:
* --low-change-resource-percent, percentage of resources in a low change
report that will experience events (changes)
The leftover percentage of reports will be no-change reports (generally the
most common) indicating the report run was steady-state with no changes.
By default, with a num-reports of 20, a high change percent of 5% and a low
change percent of 20%, you will get 1 high change, 4 low change and 15
unchanged reports per host.
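That default mix is just the percentages applied to the report count; a quick sketch of the arithmetic (the rounding shown is illustrative):

```clojure
(let [num-reports 20
      high-change (Math/round (* num-reports 0.05))       ; 5% of 20  -> 1
      low-change  (Math/round (* num-reports 0.20))       ; 20% of 20 -> 4
      unchanged   (- num-reports high-change low-change)] ; remainder -> 15
  {:high-change high-change :low-change low-change :unchanged unchanged})
;; => {:high-change 1, :low-change 4, :unchanged 15}
```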
#### Unchanged Resources
In Puppet 8, by default, the agent no longer includes unchanged resources in
the report, reducing its size.
The generate tool also does this by default, but you can set
--no-exclude-unchanged-resources to instead include unchanged resources in
every report (for default Puppet 7 behavior, for example).
#### Logs
In addition to a few boilerplate log lines, random logs are generated for
each change event in the report. However other factors, such as pluginsync,
puppet runs with debug lines and additional logging in modules can increase
log output (quite dramatically in the case of debug output from the agent).
To simulate this, you can set --num-additional-logs to the number of additional logs to include in a report, and --percent-add-report-logs to indicate what percentage of reports have this additional number of logs included.
### Random Distribution
The default generation produces relatively uniform structures.
* for catalogs it generates equal resource and edge counts and similar byte
counts.
* for factsets it generates equal fact counts and similar byte counts.
Example:
jpartlow@jpartlow-dev-2204:~/work/src/puppetdb$ lein run generate --verbose --output-dir generate-test
...
:catalogs: 5
| :certname | :resource-count | :resource-weight | :min-resource | :mean-resource | :max-resource | :edge-count | :edge-weight | :catalog-weight |
|---------------+-----------------+------------------+---------------+----------------+---------------+-------------+--------------+-----------------|
| host-sarasu-0 | 101 | 137117 | 90 | 1357 | 110246 | 150 | 16831 | 154248 |
| host-lukoxo-1 | 101 | 132639 | 98 | 1313 | 104921 | 150 | 16565 | 149504 |
| host-dykivy-2 | 101 | 120898 | 109 | 1197 | 94013 | 150 | 16909 | 138107 |
| host-talyla-3 | 101 | 110328 | 128 | 1092 | 82999 | 150 | 16833 | 127461 |
| host-foropy-4 | 101 | 136271 | 106 | 1349 | 109811 | 150 | 16980 | 153551 |
:facts: 5
| :certname | :fact-count | :avg-depth | :max-depth | :fact-weight | :total-weight |
|---------------+-------------+------------+------------+--------------+---------------|
| host-sarasu-0 | 400 | 2.77 | 7 | 10000 | 10118 |
| host-lukoxo-1 | 400 | 2.8 | 7 | 10000 | 10118 |
| host-dykivy-2 | 400 | 2.7625 | 7 | 10000 | 10118 |
| host-talyla-3 | 400 | 2.7825 | 7 | 10000 | 10118 |
| host-foropy-4 | 400 | 2.7925 | 7 | 10000 | 10118 |
...
This mode is best used when generating several different sample sets with
distinct weights and counts to provide (when combined) an overall sample set
for benchmark that includes some fixed number of fairly well described
catalog, fact and report examples.
By setting --random-distribution to true, you can instead generate a more random
sample set, where the exact parameter values used per host will be picked
from a normal curve based on the set value as mean.
* for catalogs, this will affect the class, resource, edge and total blob counts
Blobs will be distributed randomly through the set, so if you
set --blob-count to 2 over --hosts 10, on average there will be two per
catalog, but some may have none, others four, etc...
* for facts, this will affect the fact and package counts, the total weight and the max fact depth.
This has no effect on generated reports at the moment.
Example:
jpartlow@jpartlow-dev-2204:~/work/src/puppetdb$ lein run generate --verbose --random-distribution
:catalogs: 5
| :certname | :resource-count | :resource-weight | :min-resource | :mean-resource | :max-resource | :edge-count | :edge-weight | :catalog-weight |
|---------------+-----------------+------------------+---------------+----------------+---------------+-------------+--------------+-----------------|
| host-cevani-0 | 122 | 33831 | 93 | 277 | 441 | 193 | 22044 | 56175 |
| host-firilo-1 | 91 | 115091 | 119 | 1264 | 91478 | 130 | 14466 | 129857 |
| host-gujudi-2 | 129 | 36080 | 133 | 279 | 465 | 180 | 20230 | 56610 |
| host-xegyxy-3 | 106 | 120603 | 136 | 1137 | 92278 | 153 | 17482 | 138385 |
| host-jaqomi-4 | 107 | 211735 | 87 | 1978 | 98354 | 159 | 17792 | 229827 |
:facts: 5
| :certname | :fact-count | :avg-depth | :max-depth | :fact-weight | :total-weight |
|---------------+-------------+------------+------------+--------------+---------------|
| host-cevani-0 | 533 | 3.4690433 | 9 | 25339 | 25457 |
| host-firilo-1 | 355 | 2.7464788 | 7 | 13951 | 14069 |
| host-gujudi-2 | 380 | 2.75 | 8 | 16111 | 16229 |
| host-xegyxy-3 | 360 | 2.7305555 | 7 | 5962 | 6080 |
| host-jaqomi-4 | 269 | 2.7695167 | 7 | 16984 | 17102 |
...
Pg_restore and timeshift entries utility

This command-line tool restores an empty database from a backup file (a pg_dump generated file), then updates all the timestamps inside the database. It does this by calculating the period between the newest timestamp inside the file and the provided date. Then, every timestamp is shifted by that period. It accepts two parameters:
- [Mandatory] -d / --dumpfile Path to the dumpfile that will be used to restore the database.
- [Optional] -t / --shift-to-time Timestamp to which all timestamps from the dumpfile will be shifted after the restore. If it's not provided, the system's current timestamp will be used.

!!! All timestamps are converted to a zero offset (UTC) format, e.g. timestamps like 2015-03-26T10:58:51+10:00 will become 2015-03-26T00:58:51Z !!!

!!! If the time difference between the latest entry in the dumpfile and the time provided to timeshift-to is less than 24 hours this tool will fail !!!
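A hedged sketch of the period/shift calculation described above, using java.time (the function names are illustrative, not the tool's code):

```clojure
(import '(java.time Duration Instant))

;; period = shift-to minus the newest timestamp found in the dump;
;; every timestamp then moves forward by that same period.
(defn shifter [newest shift-to]
  (let [period (Duration/between newest shift-to)]
    (fn [timestamp] (.plus timestamp period))))

(let [shift (shifter (Instant/parse "2015-03-26T00:58:51Z")
                     (Instant/parse "2024-01-01T00:00:00Z"))]
  (str (shift (Instant/parse "2015-03-20T12:00:00Z"))))
;; => "2023-12-26T11:01:09Z", i.e. the same distance behind shift-to
;;    as the original was behind the newest entry
```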
Main entrypoint
PuppetDB consists of several cooperating components:
Command processing
PuppetDB uses a CQRS pattern for making changes to its domain objects (facts, catalogs, etc). Instead of simply submitting data to PuppetDB and having it figure out the intent, the intent needs to explicitly be codified as part of the operation. This is known as a "command" (e.g. "replace the current facts for node X").
Commands are processed asynchronously; however, we do our best to ensure that once a command has been accepted, it will eventually be executed. Ordering is also preserved. To do this, all incoming commands are placed in a message queue which the command processing subsystem reads from in FIFO order.
Refer to puppetlabs.puppetdb.command for more details.
Message queue
We use stockpile to durably store commands. The "in memory" representation of that queue is a core.async channel.
REST interface
All interaction with PuppetDB is conducted via its REST API. We embed an instance of Jetty to handle web server duties. Commands that come in via REST are relayed to the message queue. Read-only requests are serviced synchronously.
Database sweeper
As catalogs are modified, unused records may accumulate and stale data may linger in the database. We periodically sweep the database, compacting it and performing regular cleanup so we can maintain acceptable performance.
Timestamp shift utility
This simple command-line tool updates all the timestamps inside a PuppetDB export. It does this by calculating the period between the newest timestamp inside the export and the provided date. Then, every timestamp is shifted by that period. It accepts three parameters:
- [Mandatory] -i / --input Path to the .tgz pdb export, which will be shifted.
- [Optional] -o / --output Path to where the shifted export will be saved. If no path is given, the shifted export is sent as a stream to standard output. You may use it like this: lein time-shift-export -i export.tgz -o > shifted.tgz
- [Optional] -t / --shift-to-time Timestamp to which all the export timestamps will be shifted. If it's not provided, the system's current timestamp will be used.
!!! All timestamps are converted to a zero offset (UTC) format, e.g. timestamps like 2015-03-26T10:58:51+10:00 will become 2015-03-26T00:58:51Z !!!
This namespace is separate from cli.util because we don't want to require any more than we have to there.
As this namespace is required by both the tk and non-tk subcommands, it must remain very lightweight, so that subcommands like "version" aren't slowed down by loading the entire logging subsystem or trapperkeeper, etc.
Version utility
This simple command-line tool prints information about the PuppetDB version. It is useful for testing and other situations where you'd like to know some of the version details without having a running instance of PuppetDB.
The output is currently formatted like the contents of a java properties file; each line contains a single property name, followed by an equals sign, followed by the property value.
PuppetDB command handling
Commands are the mechanism by which changes are made to PuppetDB's
model of a population. Commands are represented by `command
objects`, which have the following JSON wire format:
{"command": "...",
"version": 123,
"payload": <json object>}
`payload` must be a valid JSON string of any sort. It's up to an
individual handler function how to interpret that object.
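For orientation, a hedged sketch of building and serializing such a command object with cheshire; the command name, version number, and payload keys here are illustrative, not taken from the spec:

```clojure
(require '[cheshire.core :as json])

(def example-command
  {"command" "replace facts"          ; illustrative command name
   "version" 5                        ; illustrative version
   "payload" {"certname" "foo.example.com"
              "values"   {"kernel" "Linux"}}})

(json/generate-string example-command)
;; => "{\"command\":\"replace facts\",\"version\":5,\"payload\":{...}}"
```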
More details can be found in [the spec](../spec/commands.md).
Commands should include a `received` field containing a timestamp
of when the message was first seen by the system. If this is
omitted, it will be added when the message is first parsed, but may
then be somewhat inaccurate.
Commands should include an `id` field containing a unique integer
identifier for the command. If this is omitted, it will be added
when the message is first parsed.
Failed messages will have an `attempts` annotation containing an
array of maps of the form:
{:timestamp <timestamp>
:error "some error message"
:trace <stack trace from :exception>}
Each entry corresponds to a single failed attempt at handling the
message, containing the error message, stack trace, and timestamp
for each failure. PuppetDB may discard messages which have been
attempted and failed too many times, or which have experienced
fatal errors (including unparseable messages).
Failed messages will be stored in files in the "dead letter
office", located under the MQ data directory, in
`/discarded/<command>`. These files contain the annotated message,
along with each exception that occurred while trying to handle the
message.
We currently support the following wire formats for commands:
1. Java Strings
2. UTF-8 encoded byte-array
In either case, the command itself, once string-ified, must be a JSON-formatted string with the aforementioned structure.

Centralized place for reading a user-defined config INI file, validating, defaulting and converting into a format that can start up a PuppetDB instance.
The schemas in this file define what is expected to be present in the INI file and the format expected by the rest of the application.
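A minimal sketch of what schema-driven validation looks like, assuming Plumatic Schema (the keys shown are invented for the example, not the real PuppetDB config schema):

```clojure
(require '[schema.core :as s])

(s/defschema DatabaseConfig
  {:subname s/Str
   (s/optional-key :username) s/Str
   (s/optional-key :gc-interval) s/Int})

;; Throws if the user-supplied map doesn't match the schema.
(s/validate DatabaseConfig
            {:subname "//localhost:5432/puppetdb" :gc-interval 60})
```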
PuppetDB's normal entry point. Dispatches to command line subcommands.
Some generic HoneySQL extensions, candidates for re-usability and potential upstream submission.
Query parameter manipulation
Functions that aid in the parsing, serialization, and manipulation of PuppetDB queries embedded in HTTP parameters.
REST server
Consolidates our disparate REST endpoints into a single Ring application.
Import utility
This is a command-line tool for importing data into PuppetDB. It expects
as input a tarball generated by the PuppetDB export command-line tool.
Database utilities
JDBC helper functions
External code should not call any of these functions directly, as they are subject to change without notice.
Carrier for bytea parameters, to support clojure.java.jdbc and next.jdbc protocol extensions. Essentially just a typed wrapper around a byte[].
Carrier for bytea[] parameters, to support clojure.java.jdbc and next.jdbc protocol extensions. Essentially just a typed wrapper around a byte[][].
Versioning Utility Library
This namespace contains some utility functions relating to checking version numbers of various fun things.
Ring middleware
Puppet nodes parsing
Functions that handle conversion of nodes from wire format to internal PuppetDB format, including validation.
SQL query compiler
The query compiler operates in a multi-step process. Compilation begins with
one of the `foo-query->sql` functions. The job of these functions is
basically to call `compile-term` on the first term of the query to get back
the "compiled" form of the query, and then to turn that into a complete SQL
query.
The compiled form of a query consists of a map with two keys: `where`
and `params`. The `where` key contains SQL for querying that
particular predicate, written in such a way as to be suitable for placement
after a `WHERE` clause in the database. `params` contains, naturally, the
parameters associated with that SQL expression. For instance, a resource
query for `["=" ["node" "name"] "foo.example.com"]` will compile to:
{:where "catalogs.certname = ?"
:params ["foo.example.com"]}
The `where` key is then inserted into a template query to return
the final result as a string of SQL code.
The compiled query components can be combined by operators such as
`AND` or `OR`, which return the same sort of structure. Operators
which accept other terms as their arguments are responsible for
compiling their arguments themselves. To facilitate this, those
functions accept as their first argument a map from operator to
compile function. This allows us to have a different set of
operators for resources and facts, or queries, while still sharing
the implementation of the operators themselves.
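A hedged sketch of how an `and` operator could combine already-compiled terms of that `{:where ... :params ...}` shape (illustrative only, not this namespace's actual operator; the column names are invented):

```clojure
(require '[clojure.string :as string])

(defn combine-and [compiled-terms]
  {:where  (str "(" (string/join ") AND (" (map :where compiled-terms)) ")")
   :params (vec (mapcat :params compiled-terms))})

(combine-and
 [{:where "catalogs.certname = ?"      :params ["foo.example.com"]}
  {:where "catalog_resources.type = ?" :params ["Class"]}])
;; => {:where "(catalogs.certname = ?) AND (catalog_resources.type = ?)"
;;     :params ["foo.example.com" "Class"]}
```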
Other operators include the subquery operators, `in`, `extract`, and
`select-resources` or `select-facts`. The `select-foo` operators implement
subqueries, and are simply implemented by calling their corresponding
`foo-query->sql` function, which means they return a complete SQL query
rather than the compiled query map. The `extract` function knows how to
handle that, and is the only place those queries are allowed as arguments.
`extract` is used to select a particular column from the subquery. The
sibling operator to `extract` is `in`, which checks that the value of
a certain column from the table being queried is in the result set returned
by `extract`. Composed, these three operators provide a complete subquery
facility. For example, consider this fact query:
["and"
["=" ["fact" "name"] "ipaddress"]
["in" "certname"
["extract" "certname"
["select-resources" ["and"
["=" "type" "Class"]
["=" "title" "apache"]]]]]]
This will perform a query (via `select-resources`) for resources matching
`Class[apache]`. It will then pick out the `certname` from each of those,
and match against the `certname` of fact rows, returning those facts which
have a corresponding entry in the results of `select-resources` and which
are named `ipaddress`. Effectively, the semantics of this query are "find
the ipaddress of every node with Class[apache]".
The resulting SQL from the `foo-query->sql` functions selects all the
columns. Thus consumers of those functions may need to wrap that query with
another `SELECT` to pull out only the desired columns. Similarly for
applying ordering constraints.
AST parsing
Fact query generation
SQL/query-related functions for events
Fact query generation
This provides a monitor for in-progress queries. The monitor keeps track of each registered query's deadline, client socket (channel), and possible postgresql connection, and whenever the deadline is reached or the client disconnects (the query is abandoned), the monitor will attempt to kill the query -- currently by invoking pg_terminate() on the query's registered postgres pid.
The main focus is client disconnections since there didn't appear to be any easy way to detect/handle them otherwise, and because without the pg_terminate, the server might continue executing an expensive query for a long time after the client is gone (say via browser page refresh).
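The kill step itself is a single statement against PostgreSQL; a hedged sketch via next.jdbc (the wrapper function and the `ds` datasource are illustrative, not the monitor's actual code):

```clojure
(require '[next.jdbc :as jdbc])

;; Ask PostgreSQL to terminate the backend process serving pg-pid.
(defn terminate-backend! [ds pg-pid]
  (jdbc/execute-one! ds ["SELECT pg_terminate_backend(?)" pg-pid]))
```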
It's worth noting that including this monitor, we have three different query timeout mechanisms. The other two are the time-limited-seq and the jdbc/update-local-timeouts operations in query-eng. We have all three because the time-limited seq only works when rows are moving (i.e. not when blocked waiting on pg or blocked pushing json to the client), the pg timeouts have an unusual granularity, i.e. they're a per-pg-wire-batch timeout, not a timeout for an entire statement like a top-level select, and the pg_terminate()s used here in the monitor are more expensive than either of those (killing an entire pg worker process).
The core of the monitor is a traditional (here NIO based) socket select loop, which should be able to handle even a large number of queries reasonably efficiently, and without requiring some number of threads proportional to the in-progress query count.
The current implementation is intended to respect the Selector
concurrency requirements, aided in part by limiting most work to the
single monitor thread, though forget does compete with the monitor
loop (coordinating via the :terminated promise).
No operations should block forever; they should all eventually (in some cases customizably) time out, and the current implementation is intended, overall, to try to let pdb keep running, even if the monitor (thread) dies somehow. The precipitating errors should still be reported to the log.
Every monitored query will have a SelectionKey associated with it. The key is cancelled during forget, but won't be removed from the selector's cancelled set until the next call to select. During that time, another query on the same socket/connection could try to re-register the cancelled key. This will throw an exception, which we suppress and retry until the select loop finally removes the cancelled key, and we can re-register the socket.
Every monitored query may also have a postgres pid associated with it, and whenever it does, that pid should be terminated (in coordination with the :terminated promise) once the query has been abandoned or has timed out.
The terminated promise coordinates between the monitor and attempts to remove (forget) a query. The arrangement is intended to make sure that the attempt to forget doesn't return until any competing termination attempt has finished, or at least had a chance to finish (otherwise the termination could kill a pg worker that's no longer associated with the original query, i.e. it's handling a new query that jetty has picked up on that channel).
The client socket monitoring depends on access to the jetty query response which (at least at the moment) provides indirect access to the java socket channel which can be read to determine whether the client is still connected.
The current implementation is completely incompatible with http "pipelining", but it looks like that is no longer a realistic concern: https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-pipelining/
If that turns out to be an incorrect assumption, then we'll have to reevaluate the implementation and/or feasibility of the monitoring. That's because so far, the only way we've found to detect a client disconnection is to attempt to read a byte. At the moment, that's acceptable because the client shouldn't be sending any data during the response (which of course wouldn't be true with pipelining, where it could be sending additional requests).
This provides a monitor for in-progress queries. The monitor keeps track of each registered query's deadline, client socket (channel), and possible postgresql connection, and whenever the deadline is reaached or the client disconnects (the query is abandoned), the monitor will attempt to kill the query -- currently by invoking a pg_terminate() on the query's registered postgres pid. The main focus is client disconnections since there didn't appear to be any easy way to detect/handle them otherwise, and because without the pg_terminate, the server might continue executing an expensive query for a long time after the client is gone (say via browser page refresh). It's worth noting that including this monitor, we have three different query timeout mechanisms. The other two are the time-limited-seq and the jdbc/update-local-timeouts operations in query-eng. We have all three because the time-limited seq only works when rows are moving (i.e. not when blocked waiting on pg or blocked pushing json to the client), the pg timeouts have an unusual granularity, i.e. they're a per-pg-wire-batch timeout, not a timeout for an entire statement like a top-level select, and the pg_terminate()s used here in the monitor are more expensive than either of those (killing an entire pg worker process). The core of the monitor is a traditional (here NIO based) socket select loop, which should be able to handle even a large number of queries reasonably efficiently, and without requiring some number of threads proportional to the in-progress query count. The current implementation is intended to respect the Selector concurrency requirements, aided in part by limiting most work to the single monitor thread, though `forget` does compete with the monitor loop (coordinating via the `:terminated` promise. No operations should block forever; they should all eventually (in some cases customizably) time out, and the current implementation is intended, overall, to try to let pdb keep running, even if the monitor (thread) dies somehow. The precipitating errors should still be reported to the log. Every monitored query will have a SelectionKey associated with it, The key is cancelled during forget, but won't be removed from the selector's cancelled set until the next call to select. During that time, another query on the same socket/connection could try to re-register the cancelled key. This will throw an exception, which we suppress and retry until the select loop finally removes the cancelled key, and we can re-register the socket. Every monitored query may also have a postgres pid associated with it, and whenever it does, that pid should be terminated (in coordination with the :terminated promise) once the query has been abandoned or has timed out. The terminated promise coordinates between the monitor and attempts to remove (forget) a query. The arrangement is intended to make sure that the attempt to forget doesn't return until any competing termination attempt has finished, or at least had a chance to finish (otherwise the termination could kill a pg worker that's no longer associated with the original query, i.e. it's handling a new query that jetty has picked up on that channel). The client socket monitoring depends on access to the jetty query response which (at least at the moment) provides indirect access to the java socket channel which can be read to determine whether the client is still connected. 
The current implementation is completely incompatible with http "pipelining", but it looks like that is no longer a realistic concern (see https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-pipelining/). If that turns out to be an incorrect assumption, then we'll have to reevaluate the implementation and/or feasibility of the monitoring. That's because so far, the only way we've found to detect a client disconnection is to attempt to read a byte. At the moment, that's acceptable because the client shouldn't be sending any data during the response (which of course wouldn't be true with pipelining, where it could be sending additional requests).
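To make the disconnect-detection idea concrete, here is a minimal, hypothetical sketch (not the actual monitor code) of an NIO select loop that watches registered client channels and treats a read of -1 bytes, or an IOException, as an abandoned query. The `watch-client-channels` name, the callback map, and the error handling are assumptions for illustration; the real monitor also tracks deadlines and postgres pids, and has to cope with the cancelled-key re-registration race described above.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.channels Selector SelectionKey SocketChannel))

;; Hypothetical sketch only: watch client channels and invoke a callback when
;; the peer disconnects (read returns -1) or errors out.
(defn watch-client-channels
  "channels->on-abandon is a map of SocketChannel -> zero-arg callback invoked
  when that client appears to have gone away."
  [channels->on-abandon select-timeout-ms]
  (let [selector (Selector/open)
        buf (ByteBuffer/allocate 1)]
    ;; Channels must be in non-blocking mode to be registered with a Selector.
    (doseq [[^SocketChannel ch on-abandon] channels->on-abandon]
      (.configureBlocking ch false)
      (.register ch selector SelectionKey/OP_READ on-abandon))
    (loop []
      (when (.isOpen selector)
        (when (pos? (.select selector (long select-timeout-ms)))
          (let [selected (.selectedKeys selector)]
            (doseq [^SelectionKey k selected]
              (let [^SocketChannel ch (.channel k)
                    on-abandon (.attachment k)
                    n (try
                        (.clear buf)
                        (.read ch buf)
                        (catch java.io.IOException _ -1))]
                ;; 0 just means the client is quietly waiting for the response;
                ;; -1 (or an IOException) means it has disconnected.
                (when (neg? n)
                  (.cancel k)
                  (on-abandon))))
            (.clear selected)))
        (recur)))))
```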
Paging query parameter manipulation
Functions that aid in the validation and processing of the query parameters related to paging PuppetDB queries
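As a hypothetical illustration of the kind of checks involved, the sketch below validates and coerces `limit` and `offset` query parameters. The parameter names, the `validate-paging-params` function, and the error behaviour are assumptions for the example, not this namespace's actual API.

```clojure
;; Hypothetical sketch only; parameter names and error handling are assumed.
(defn- parse-non-negative-int
  "Coerces value (string or number) to a non-negative integer, or throws."
  [pname value]
  (let [n (try
            (if (string? value) (Long/parseLong value) (long value))
            (catch Exception _
              (throw (IllegalArgumentException.
                      (format "Illegal value '%s' for %s" value pname)))))]
    (if (neg? n)
      (throw (IllegalArgumentException.
              (format "%s must be non-negative, got %s" pname n)))
      n)))

(defn validate-paging-params
  "Returns a map of coerced paging options from string-keyed query params."
  [{:strs [limit offset] :as _params}]
  (cond-> {}
    limit  (assoc :limit  (parse-non-negative-int "limit" limit))
    offset (assoc :offset (parse-non-negative-int "offset" offset))))

(comment
  (validate-paging-params {"limit" "50" "offset" "100"})
  ;; => {:limit 50, :offset 100}
  )
```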
Population-wide queries
Contains queries and metrics that apply across an entire population.
Resource querying
This implements resource querying, using the query compiler in
puppetlabs.puppetdb.query, basically by munging the results into the
right format and picking out the desired columns.
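For a rough idea of what that munging can look like, here is a hypothetical sketch that renames database-style columns to API field names and keeps only the requested columns; the column names and the `munge-resource-rows` function are assumptions for illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sketch only; real column names and projections differ.
(defn munge-resource-rows
  "Renames assumed DB column names to API field names and, when
  projected-fields is non-empty, keeps only those fields."
  [projected-fields rows]
  (for [row rows]
    (cond-> (set/rename-keys row {:restype :type})
      (seq projected-fields) (select-keys projected-fields))))

(comment
  (munge-resource-rows
   [:type :title]
   [{:certname "node1.example.com" :restype "File" :title "/etc/motd"}])
  ;; => ({:type "File", :title "/etc/motd"})
  )
```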
Puppet report/event parsing
Functions that handle conversion of reports from wire format to internal PuppetDB format, including validation.
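A minimal, hypothetical sketch of the wire-to-internal step for a report: keywordize the JSON map and check a handful of required fields. The field names and the `wire->internal-report` function are assumptions for the example; the real conversion and validation are more involved.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical sketch only; the field names are assumed for illustration.
(def ^:private assumed-required-fields
  [:certname :puppet_version :report_format :start_time :end_time])

(defn wire->internal-report
  "Keywordizes a wire-format report map and checks a few required fields."
  [wire-report]
  (let [report (walk/keywordize-keys wire-report)
        missing (remove #(contains? report %) assumed-required-fields)]
    (when (seq missing)
      (throw (IllegalArgumentException.
              (str "Report missing required fields: " (pr-str (vec missing))))))
    report))
```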
Schema migrations
The `initialize-schema` function can be used to prepare the
database, applying all the pending migrations to the database, in
ascending order of schema version. Pending is defined as having a
schema version greater than the current version in the database.
A migration is specified by defining a function of arity 0 and adding it to
the `migrations` map, along with its schema version. To apply the migration,
the migration function will be invoked, and the schema version and current
time will be recorded in the schema_migrations table.
A migration function can return a map with ::vacuum-analyze to indicate what tables
need to be analyzed post-migration.
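Here is a simplified, hypothetical sketch of that mechanism: a map of schema version to arity-0 migration function, plus a helper that applies everything newer than the current version in ascending order and collects any ::vacuum-analyze requests. The version numbers, migration bodies, and `record-version!` callback are illustrative assumptions; the real code also records the applied version and current time in schema_migrations.

```clojure
;; Hypothetical, simplified sketch of the mechanism described above.
(def example-migrations
  "Map of schema version -> arity-0 migration fn (versions are illustrative)."
  {28 (fn add-some-index []
        ;; would run DDL here, e.g. CREATE INDEX ...
        nil)
   29 (fn widen-some-column []
        ;; a migration may request a post-migration analyze via ::vacuum-analyze
        {::vacuum-analyze #{"facts"}})})

(defn pending-migrations
  "Returns [version migration-fn] pairs newer than current-version, in
  ascending version order (gaps in the numbering are fine)."
  [migrations current-version]
  (->> migrations
       (filter (fn [[version _]] (> version current-version)))
       (sort-by first)))

(defn apply-pending!
  "Applies all pending migrations, calling record-version! after each one
  (standing in for the schema_migrations insert), and returns the set of
  tables that requested a post-migration vacuum/analyze."
  [migrations current-version record-version!]
  (reduce (fn [tables [version migrate!]]
            (let [result (migrate!)]
              (record-version! version)
              (into tables (::vacuum-analyze result))))
          #{}
          (pending-migrations migrations current-version)))
```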
NOTE: in order to support bug-fix schema changes to older branches without
breaking the ability to upgrade, it is possible to define a sequence of
migrations with non-sequential integers. e.g., if the 1.0.x branch
contains migrations 1-5, and the 2.0.x branch contains schema migrations
1-10, and then a bugfix schema change (such as creating or adding an index)
is identified, this migration can be defined as #11 in both branches. Code
in the 1.0.x branch should happily apply #11 even though it does not have
a definition for 6-10. Then when a 1.0.x user upgrades to 2.0.x, migrations
6-10 will be applied, and 11 will be skipped because it's already been run.
Because of this, it is crucial to be extremely careful about numbering new
migrations if they are going into multiple branches. It's also crucial to
be absolutely certain that the schema change in question is compatible
with both branches and that the migrations missing from the earlier branch
can reasonably and safely be applied *after* the bugfix migration, because
that is what will happen for upgrading users.
In short, here are some guidelines re: applying schema changes to multiple
branches:
1. If at all possible, avoid it.
2. Seriously, are you sure you need to do this? :)
3. OK, if you really must do it, make sure that the schema change in question
is as independent as humanly possible. For example, things like creating
or dropping an index on a table should be fairly self-contained. You should
think long and hard about any change more complex than that.
4. Determine what the latest version of the schema is in each of the two branches.
5. Examine every migration that exists in the newer branch but not the older
branch, and make sure that your new schema change will not conflict with
*any* of those migrations. Your change must be able to execute successfully
regardless of whether it is applied BEFORE all of those migrations or AFTER
them.
6. If you're certain you've met the conditions described above, choose the next
available integer from the *newer* branch and add your migration to both
branches using this integer. This will result in a gap between the integers
in the migrations array in the old branch, but that is not a problem.
_TODO: consider using multimethods for migration funcs_
Handles all work related to database table partitioning
Catalog persistence
Catalogs are persisted in a relational database. Roughly speaking, the schema looks like this:
* resource_parameters are associated with 0 to N catalog_resources rows (they are deduped across catalogs). It's possible for a resource_param to exist in the database, yet not be associated with a catalog. This is done as a performance optimization.
* edges are associated with a single catalog
* catalogs are associated with a single certname
* facts are associated with a single certname
The standard set of operations on information in the database will
likely result in dangling resources and catalogs; to clean these
up, it's important to run garbage-collect!.
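To illustrate the deduplication idea (not the actual storage code), here is a hypothetical sketch where parameter maps are stored once under a content hash and resources only reference that hash; the hashing scheme and names are assumptions.

```clojure
;; Hypothetical sketch only; the real schema, hashing, and storage code differ.
(defn params-hash
  "Stand-in content hash for a resource's parameter map."
  [parameters]
  (str (hash parameters)))

(defn store-resource-parameters!
  "Ensures an entry for these parameters exists in param-store (an atom of
  {hash parameters}, standing in for resource_parameters) and returns the
  hash a catalog_resources row would reference."
  [param-store parameters]
  (let [h (params-hash parameters)]
    (swap! param-store (fn [m] (if (contains? m h) m (assoc m h parameters))))
    h))

(comment
  ;; Two catalogs with identical parameters share one stored entry; entries
  ;; left behind when catalogs go away are the "dangling" rows that
  ;; garbage-collect! cleans up.
  (def store (atom {}))
  (store-resource-parameters! store {"ensure" "present" "mode" "0644"})
  (store-resource-parameters! store {"ensure" "present" "mode" "0644"})
  (count @store) ;; => 1
  )
```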
Time-related Utility Functions
This namespace contains some utility functions for working with objects
related to time; it is mostly based on the `Period` class from
the Joda-Time library for Java.
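As a small, hypothetical example of working with Joda-Time periods from Clojure (the function name is an assumption, not this namespace's API):

```clojure
(import '(org.joda.time Period))

;; Hypothetical example only.
(defn period->millis
  "Converts a days/hours/minutes/seconds Period to milliseconds.
  Periods containing months or years have no fixed length and would throw."
  [^Period p]
  (.getMillis (.toStandardDuration p)))

(comment
  (period->millis (Period/minutes 5))              ;; => 300000
  (period->millis (.plusHours (Period/days 1) 6))  ;; => 108000000
  )
```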