Compression / Archive Utility library
This namespace contains functions for reading and writing compressed archive files. Currently only supports gzipped tar archives.
Catalog generation and manipulation
A suite of functions that aid in constructing random catalogs, or randomly modifying an existing catalog (wire format or parsed).
Puppet catalog parsing
Functions that handle conversion of catalogs from wire format to
internal PuppetDB format.
The wire format is described in detail in [the spec](../spec/catalog-wire-format.md).
There are a number of transformations we apply to wire format
catalogs during conversion to our internal format; while wire
format catalogs contain complete records of all resources and
edges, and most things are properly encoded as lists or maps, there
are still a number of places where structure is absent or lacking:
1. Resource specifiers are represented as opaque strings, like
`Class[Foobar]`, as opposed to something like
`{"type" "Class" "title" "Foobar"}`
2. Tags are represented as lists (and may contain duplicates)
instead of sets
3. Resources are represented as a list instead of a map, making
operations that need to correlate against specific resources
unnecessarily difficult
4. Keys to all maps are strings (to conform with JSON), instead of
more convenient Clojure keywords
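As a rough illustration, the following hedged sketch applies those four transformations to a single wire-format resource; the helper names are invented for this example and are not this namespace's actual functions:

```clojure
(require '[clojure.walk :as walk])

;; "Class[Foobar]" -> {:type "Class" :title "Foobar"}
(defn parse-resource-spec [spec]
  (let [[_ rtype title] (re-matches #"(.+?)\[(.*)\]" spec)]
    {:type rtype :title title}))

;; keywordize keys and turn the tag list into a set
(defn transform-resource [wire-resource]
  (-> (walk/keywordize-keys wire-resource)
      (update :tags set)))

;; turn the wire format's resource list into a map keyed by resource-spec
(defn index-resources [wire-resources]
  (into {}
        (for [r (map transform-resource wire-resources)]
          [(select-keys r [:type :title]) r])))

(parse-resource-spec "Class[Foobar]")
;; => {:type "Class", :title "Foobar"}

(index-resources
 [{"type" "Class" "title" "Foobar" "tags" ["class" "foobar" "class"]}])
;; => {{:type "Class", :title "Foobar"}
;;     {:type "Class", :title "Foobar", :tags #{"class" "foobar"}}}
```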
### Terminology
Unless otherwise indicated, all terminology for catalog components
matches terms listed in [the spec](../spec/catalog-wire-format.md).
### Transformed constructs
### Resource Specifier (resource-spec)
A map of the form `{:type "Class" :title "Foobar"}`. This is a
unique identifier for a resource within a catalog.
### Resource
A map that represents a single resource in a catalog:
{:type "..."
:title "..."
:... "..."
:tags #{"tag1", "tag2", ...}
:parameters {:name1 "value1"
:name2 "value2"
...}}
Certain attributes are treated specially:
* `:type` and `:title` are used to produce a `resource-spec` for
this resource
### Edge
A representation of an "edge" (dependency or containment) in the
catalog. All edges have the following form:
{:source <resource spec>
:target <resource spec>
:relationship <relationship id>}
A relationship identifier can be one of:
* `:contains`
* `:required-by`
* `:notifies`
* `:before`
* `:subscription-of`
### Catalog
A wire-format-neutral representation of a Puppet catalog. It is a
map with the following structure:
{:certname "..."
:version "..."
:resources {<resource-spec> <resource>
<resource-spec> <resource>
...}
:edges #{<dependency-spec>,
<dependency-spec>,
...}}
Cheshire related functions
This front-ends the common set of core cheshire functions:

* generate-string
* generate-stream
* parse
* parse-strict
* parse-string
* parse-stream

This namespace, when 'required', will also set up some common JSON encoders globally, so you can avoid doing this for each call.
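For orientation, here is a minimal sketch of the kind of round-trip these wrappers front-end, calling cheshire.core directly:

```clojure
(require '[cheshire.core :as json])

;; Clojure data -> JSON string
(json/generate-string {:certname "foo.example.com" :deactivated nil})
;; => "{\"certname\":\"foo.example.com\",\"deactivated\":null}"

;; JSON string -> Clojure data, keywordizing keys
(json/parse-string "{\"certname\":\"foo.example.com\"}" true)
;; => {:certname "foo.example.com"}
```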
Benchmark suite
This command-line utility will simulate catalog submission for a population. It requires a separate, running instance of PuppetDB to submit catalogs to.
We attempt to approximate a number of hosts submitting catalogs at the specified runinterval with the specified rate-of-churn in catalog content.
### Running parallel Benchmarks
If you are running up against the upper limit at which Benchmark can submit simulated requests, you can run multiple instances of benchmark and make use of the --offset flag to shift the cert numbers.
Example (probably run on completely separate hosts):

```
benchmark --offset 0 --numhosts 100000
benchmark --offset 100000 --numhosts 100000
benchmark --offset 200000 --numhosts 100000
...
```
### Preserving host-map data
By default, each time Benchmark is run, it initializes the host-map catalog, factset and report data randomly from the given set of base --catalogs, --factsets and --reports files. When re-running benchmark, this causes excessive load on puppetdb due to the completely changed catalogs/factsets that must be processed.
To avoid this, set --simulation-dir to preserve all of the host map data between runs as nippy/frozen files. Benchmark will then load and initialize a preserved host matching a particular host-# from these files at startup. Missing hosts (if --numhosts exceeds the number preserved, for example) will be initialized randomly, as by default.
### Mutating Catalogs and Factsets
The benchmark tool automatically refreshes timestamps and transaction ids when submitting catalogs, factsets and reports, but the content does not change.
To simulate system drift, code changes and fact changes, use '--rand-catalog=PERCENT_CHANCE:CHANGE_COUNT' and '--rand-facts=PERCENT_CHANCE:PERCENT_CHANGE'.
The former indicates the chance that any given catalog will perform CHANGE_COUNT resource mutations (additions, modifications or deletions). The latter is the chance that any given factset will mutate PERCENT_CHANGE of its fact values. These may be set multiple times, provided that the PERCENT_CHANCE values do not sum to more than 100%.
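The combination semantics can be pictured as each spec claiming a slice of a single 0-100 roll per catalog. The sketch below is illustrative only (it is not benchmark's actual implementation), assuming specs parsed into maps like {:chance 20 :changes 5}:

```clojure
;; Roll once per catalog; at most one spec's CHANGE_COUNT applies.
(defn pick-change-count [specs]
  (let [roll (rand 100)]
    (loop [remaining specs
           threshold 0]
      (when-let [{:keys [chance changes]} (first remaining)]
        (if (< roll (+ threshold chance))
          changes
          (recur (rest remaining) (+ threshold chance)))))))

;; e.g. 20% chance of 5 mutations, 10% chance of 20, 70% chance of none
(pick-change-count [{:chance 20 :changes 5} {:chance 10 :changes 20}])
```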
By default edges are not included in catalogs. If --include-edges is true, then add-resource and del-resource will involve edges as well:

* adding a resource adds a single 'contains' edge with the source being one of the catalog's original (non-added) resources.
* deleting a resource removes one of the added resources (if there are any) and its related leaf edge.
By ensuring we only ever delete leaves from the graph, we maintain graph integrity, which is important since PuppetDB validates the edges on ingestion.
This provides only limited exercise of edge mutation, which seemed like a reasonable trade-off given that edge submission is deprecated. Running with --include-edges also impacts the nature of catalog mutation, since original resources will never be removed from the catalog.
See add-resource, mod-resource and del-resource for details of resource and edge changes.
TODO: Fact addition/removal
TODO: Mutating reports
### Viewing Metrics
There are benchmark metrics which can be viewed via JMX.
WARNING: DO NOT DO THIS WITH A PRODUCTION OR INTERNET-ACCESSIBLE INSTANCE! This gives remote access to the JVM internals, including potentially secrets. If you absolutely must (you don't), read about using certs with JMX to do it securely. You are better off using the metrics API or Grafana metrics exporter.
Add the following properties to your Benchmark Java process on startup:
```
-Dcom.sun.management.jmxremote=true
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.port=5555
-Djava.rmi.server.hostname=127.0.0.1
-Dcom.sun.management.jmxremote.rmi.port=5556
```
Then with a tool like VisualVM, you can add a JMX Connection, and (with the MBeans plugin) view puppetlabs.puppetdb.benchmark metrics.
# Data Generation utility
This command-line tool can generate a base sampling of catalog, fact and
report files suitable for consumption by the PuppetDB benchmark utility.
Note that it is only necessary to generate a small set of initial sample
data since benchmark will permute per node differences. So even if you want
to benchmark 1000 nodes, you don't need to generate initial
catalog/fact/report json for 1000 nodes.
If you want a representative sample with big differences between catalogs,
you will need to run the tool multiple times. For example, if you want a set
of 5 large catalogs and 10 small ones, you will need to run the tool twice
with the desired parameters to create the two different sets.
## Flag Notes
### Catalogs
#### Resource Counts
The num-resources flag is total and includes num-classes. So if you set
--num-resources to 100 and --num-classes to 30, you will get a catalog with a
hundred resources, thirty of which are classes.
#### Edges
A containment edge is always generated between the main stage and each class, and non-class resources get a containment edge to a random class. So there will always be a base set of containment edges equal to the resource count. The --additional-edge-percent governs how many non-containment edges are added on top of that to simulate some further catalog structure. There is no guarantee of relationship depth (chains such as Stage(main) -> Class(foo) -> Class(bar) -> Resource(biff), for example), but it does ensure some edges between classes, as well as between class and non-class resources.
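A hedged sketch of that base edge structure, reusing the edge shape described in the catalog parsing section (the function and the way classes are chosen are illustrative, not the generate tool's actual code):

```clojure
;; Stage[main] contains every class; each non-class resource is
;; contained by a randomly chosen class.
(defn base-containment-edges [classes non-class-resources]
  (concat
   (for [cls classes]
     {:source {:type "Stage" :title "main"}
      :target cls
      :relationship :contains})
   (for [resource non-class-resources]
     {:source (rand-nth (vec classes))
      :target resource
      :relationship :contains})))

(count (base-containment-edges
        [{:type "Class" :title "apache"} {:type "Class" :title "ntp"}]
        [{:type "File" :title "/etc/ntp.conf"}]))
;; => 3, i.e. one containment edge per resource (class or not)
```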
#### Large Resource Parameter Blobs
The --blob-count and --blob-size parameters control inclusion of large
text blobs in catalog resources. By default one ~100kb blob is
added per catalog.
Set --blob-count to 0 to exclude blobs altogether.
### Facts
#### Baseline Facts
Each fact set begins with a set of baseline facts from:
[baseline-agent-node.json](./resources/puppetlabs/puppetdb/generate/samples/facts/baseline-agent-node.json).
These provide some consistency for a common set of baseline fact paths
present on any puppet node. The generator then mutates half of the values to
provide variety.
#### Fact Counts
The --num-facts parameter controls the number of facts to generate per host.
There are 376 leaf facts in the baseline file. Setting num-facts less than
this will remove baseline facts to approach the requested number of facts.
(Empty maps and arrays are not removed from the factset, so it will never
pare down to zero.) Setting num-facts to a larger number will add facts of
random depth based on --max-fact-depth until the requested count is reached.
#### Total Facts Size
The --total-fact-size parameter controls the total weight of the fact values
map in kB. Weight is added after count is reached. So if the weight of the
adjusted baseline facts already exceeds the total-fact-size, nothing more is
done. No attempt is made to pare facts back down to the requested size, as this
would likely require removing facts.
#### Max Fact Depth
The --max-fact-depth parameter is the maximum nested depth a fact added to
the baseline facts may reach. For example, a max depth of 5 would mean that
an added fact would at most be a nest of four maps:
{foo: {bar: {baz: {biff: boz}}}}
Since depth is picked randomly for each additional fact, this does not
guarantee facts of a given depth. Nor does it directly affect the average
depth of facts in the generated factset, although the larger the
max-fact-depth and num-facts, the more likely that the average depth will
drift higher.
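A hedged sketch of what adding one fact bounded by --max-fact-depth could look like (the key and value generation here are purely illustrative):

```clojure
(defn random-nested-fact [max-fact-depth]
  (let [depth (inc (rand-int max-fact-depth))]   ; pick 1..max-fact-depth
    (reduce (fn [value level]
              {(str "level-" level) value})      ; wrap one map per level
            "leaf-value"
            (range 1 depth))))

(random-nested-fact 5)
;; e.g. => {"level-3" {"level-2" {"level-1" "leaf-value"}}}
```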
#### Package Inventory
The --num-packages parameter sets the number of packages to generate for the
factset's package_inventory array. Set to 0 to exclude.
### Reports
#### Reports per Catalog
The --num-reports flag governs the number of reports to generate per
generated catalog. Since one catalog is generated per host, this means you
will end up with num-hosts * num-reports reports.
#### Variation in Reports
A report details changes, or the lack thereof, during enforcement of the puppet
catalog on the host. Since the benchmark tool currently chooses randomly from the
given report files, a simple mechanism for determining the likelihood of
receiving a report of a particular size (with lots of changes, few changes or
no changes) is to produce multiple reports of each type per host to generate
a weighted average. (If there are 10 reports, 2 are large and 8 are small,
then it's 80% likely that any given report submitted by benchmark will
be of the small variety.)
The knobs to control this with the generate tool are:
* --num-reports, to determine the base number of reports to generate per catalog
* --high-change-reports-percent, percentage of that base to generate as
reports with a high number of change events, as determined by:
* --high-change-resource-percent, percentage of resources in a high change
report that will experience events (changes)
* --low-change-reports-percent, percentage of the base reports to generate
as reports with a low number of change events as determined by:
* --low-change-resource-percent, percentage of resources in a low change
report that will experience events (changes)
The leftover percentage of reports will be no-change reports (generally the
most common) indicating the report run was steady-state with no changes.
By default, with a num-reports of 20, a high change percent of 5% and a low
change percent of 20%, you will get 1 high change, 4 low change and 15
unchanged reports per host.
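That default mix is just the percentages applied to the report count; a quick sketch of the arithmetic (the rounding shown is illustrative):

```clojure
(let [num-reports 20
      high-change (Math/round (* num-reports 0.05))       ; 5% of 20  -> 1
      low-change  (Math/round (* num-reports 0.20))       ; 20% of 20 -> 4
      unchanged   (- num-reports high-change low-change)] ; remainder -> 15
  {:high-change high-change :low-change low-change :unchanged unchanged})
;; => {:high-change 1, :low-change 4, :unchanged 15}
```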
#### Unchanged Resources
In Puppet 8, by default, the agent no longer includes unchanged resources in
the report, reducing its size.
The generate tool also does this by default, but you can set
--no-exclude-unchanged-resources to instead include unchanged resources in
every report (for default Puppet 7 behavior, for example).
#### Logs
In addition to a few boilerplate log lines, random logs are generated for
each change event in the report. However other factors, such as pluginsync,
puppet runs with debug lines and additional logging in modules can increase
log output (quite dramatically in the case of debug output from the agent).
To simulate this, you can set --num-additional-logs to the number of additional logs to include in a report, and --percent-add-report-logs to indicate what percentage of reports have this additional number of logs included.
### Random Distribution
The default generation produces relatively uniform structures.
* for catalogs it generates equal resource and edge counts and similar byte
counts.
* for factsets it generates equal fact counts and similar byte counts.
Example:
jpartlow@jpartlow-dev-2204:~/work/src/puppetdb$ lein run generate --verbose --output-dir generate-test
...
:catalogs: 5
| :certname | :resource-count | :resource-weight | :min-resource | :mean-resource | :max-resource | :edge-count | :edge-weight | :catalog-weight |
|---------------+-----------------+------------------+---------------+----------------+---------------+-------------+--------------+-----------------|
| host-sarasu-0 | 101 | 137117 | 90 | 1357 | 110246 | 150 | 16831 | 154248 |
| host-lukoxo-1 | 101 | 132639 | 98 | 1313 | 104921 | 150 | 16565 | 149504 |
| host-dykivy-2 | 101 | 120898 | 109 | 1197 | 94013 | 150 | 16909 | 138107 |
| host-talyla-3 | 101 | 110328 | 128 | 1092 | 82999 | 150 | 16833 | 127461 |
| host-foropy-4 | 101 | 136271 | 106 | 1349 | 109811 | 150 | 16980 | 153551 |
:facts: 5
| :certname | :fact-count | :avg-depth | :max-depth | :fact-weight | :total-weight |
|---------------+-------------+------------+------------+--------------+---------------|
| host-sarasu-0 | 400 | 2.77 | 7 | 10000 | 10118 |
| host-lukoxo-1 | 400 | 2.8 | 7 | 10000 | 10118 |
| host-dykivy-2 | 400 | 2.7625 | 7 | 10000 | 10118 |
| host-talyla-3 | 400 | 2.7825 | 7 | 10000 | 10118 |
| host-foropy-4 | 400 | 2.7925 | 7 | 10000 | 10118 |
...
This mode is best used when generating several different sample sets with
distinct weights and counts to provide (when combined) an overall sample set
for benchmark that includes some fixed number of fairly well described
catalog, fact and report examples.
By setting --random-distribution to true, you can instead generate a more random
sample set, where the exact parameter values used per host will be picked
from a normal curve based on the set value as mean.
* for catalogs, this will affect the class, resource, edge and total blob counts
Blobs will be distributed randomly through the set, so if you
set --blob-count to 2 over --hosts 10, on average there will be two per
catalog, but some may have none, others four, etc...
* for facts, this will affect the fact and package counts, the total weight and the max fact depth.
This has no effect on generated reports at the moment.
Example:
jpartlow@jpartlow-dev-2204:~/work/src/puppetdb$ lein run generate --verbose --random-distribution
:catalogs: 5
| :certname | :resource-count | :resource-weight | :min-resource | :mean-resource | :max-resource | :edge-count | :edge-weight | :catalog-weight |
|---------------+-----------------+------------------+---------------+----------------+---------------+-------------+--------------+-----------------|
| host-cevani-0 | 122 | 33831 | 93 | 277 | 441 | 193 | 22044 | 56175 |
| host-firilo-1 | 91 | 115091 | 119 | 1264 | 91478 | 130 | 14466 | 129857 |
| host-gujudi-2 | 129 | 36080 | 133 | 279 | 465 | 180 | 20230 | 56610 |
| host-xegyxy-3 | 106 | 120603 | 136 | 1137 | 92278 | 153 | 17482 | 138385 |
| host-jaqomi-4 | 107 | 211735 | 87 | 1978 | 98354 | 159 | 17792 | 229827 |
:facts: 5
| :certname | :fact-count | :avg-depth | :max-depth | :fact-weight | :total-weight |
|---------------+-------------+------------+------------+--------------+---------------|
| host-cevani-0 | 533 | 3.4690433 | 9 | 25339 | 25457 |
| host-firilo-1 | 355 | 2.7464788 | 7 | 13951 | 14069 |
| host-gujudi-2 | 380 | 2.75 | 8 | 16111 | 16229 |
| host-xegyxy-3 | 360 | 2.7305555 | 7 | 5962 | 6080 |
| host-jaqomi-4 | 269 | 2.7695167 | 7 | 16984 | 17102 |
...
Pg_restore and timeshift entries utility

This command-line tool restores an empty database from a backup file (a pg_dump generated file), then updates all the timestamps inside the database. It does this by calculating the period between the newest timestamp inside the file and the provided date. Then, every timestamp is shifted by that period. It accepts two parameters:
- [Mandatory] -d / --dumpfile Path to the dumpfile that will be used to restore the database.
- [Optional] -t / --shift-to-time Timestamp to which all timestamps from the dumpfile will be shifted after the restore. If it's not provided, the system's current timestamp will be used.

!!! All timestamps are converted to a zero offset (UTC) format, e.g. timestamps like 2015-03-26T10:58:51+10:00 will become 2015-03-26T00:58:51Z !!!

!!! If the time difference between the latest entry in the dumpfile and the time provided to timeshift-to is less than 24 hours this tool will fail !!!
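A hedged sketch of the period/shift calculation described above, using java.time (the function names are illustrative, not the tool's code):

```clojure
(import '(java.time Duration Instant))

;; period = shift-to minus the newest timestamp found in the dump;
;; every timestamp then moves forward by that same period.
(defn shifter [newest shift-to]
  (let [period (Duration/between newest shift-to)]
    (fn [timestamp] (.plus timestamp period))))

(let [shift (shifter (Instant/parse "2015-03-26T00:58:51Z")
                     (Instant/parse "2024-01-01T00:00:00Z"))]
  (str (shift (Instant/parse "2015-03-20T12:00:00Z"))))
;; => "2023-12-26T11:01:09Z", i.e. the same distance behind shift-to
;;    as the original was behind the newest entry
```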
Main entrypoint
PuppetDB consists of several cooperating components:
Command processing
PuppetDB uses a CQRS pattern for making changes to its domain objects (facts, catalogs, etc). Instead of simply submitting data to PuppetDB and having it figure out the intent, the intent needs to explicitly be codified as part of the operation. This is known as a "command" (e.g. "replace the current facts for node X").
Commands are processed asynchronously; however, we do our best to ensure that once a command has been accepted, it will eventually be executed. Ordering is also preserved. To do this, all incoming commands are placed in a message queue which the command processing subsystem reads from in FIFO order.
Refer to puppetlabs.puppetdb.command for more details.
Message queue
We use stockpile to durably store commands. The "in memory" representation of that queue is a core.async channel.
REST interface
All interaction with PuppetDB is conducted via its REST API. We embed an instance of Jetty to handle web server duties. Commands that come in via REST are relayed to the message queue. Read-only requests are serviced synchronously.
Database sweeper
As catalogs are modified, unused records may accumulate and stale data may linger in the database. We periodically sweep the database, compacting it and performing regular cleanup so we can maintain acceptable performance.
Timestamp shift utility
This simple command-line tool updates all the timestamps inside a PuppetDB export. It does this by calculating the period between the newest timestamp inside the export and the provided date. Then, every timestamp is shifted by that period. It accepts three parameters:
- [Mandatory] -i / --input Path to the .tgz pdb export, which will be shifted.
- [Optional] -o / --output Path to where the shifted export will be saved. If no path is given, the shifted export is sent as a stream to standard output. You may use it like this: lein time-shift-export -i export.tgz -o > shifted.tgz
- [Optional] -t / --shift-to-time Timestamp to which all the export timestamps will be shifted. If it's not provided, the system's current timestamp will be used.
!!! All timestamps are converted to a zero offset (UTC) format, e.g. timestamps like 2015-03-26T10:58:51+10:00 will become 2015-03-26T00:58:51Z !!!
This namespace is separate from cli.util because we don't want to require any more than we have to there.
As this namespace is required by both the tk and non-tk subcommands, it must remain very lightweight, so that subcommands like "version" aren't slowed down by loading the entire logging subsystem or trapperkeeper, etc.
Version utility
This simple command-line tool prints information about the PuppetDB version. It is useful for testing and other situations where you'd like to know some of the version details without having a running instance of PuppetDB.
The output is currently formatted like the contents of a java properties file; each line contains a single property name, followed by an equals sign, followed by the property value.
PuppetDB command handling
Commands are the mechanism by which changes are made to PuppetDB's
model of a population. Commands are represented by `command
objects`, which have the following JSON wire format:
{"command": "...",
"version": 123,
"payload": <json object>}
`payload` must be a valid JSON string of any sort. It's up to an
individual handler function how to interpret that object.
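For orientation, a hedged sketch of building and serializing such a command object with cheshire; the command name, version number, and payload keys here are illustrative, not taken from the spec:

```clojure
(require '[cheshire.core :as json])

(def example-command
  {"command" "replace facts"          ; illustrative command name
   "version" 5                        ; illustrative version
   "payload" {"certname" "foo.example.com"
              "values"   {"kernel" "Linux"}}})

(json/generate-string example-command)
;; => "{\"command\":\"replace facts\",\"version\":5,\"payload\":{...}}"
```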
More details can be found in [the spec](../spec/commands.md).
Commands should include a `received` field containing a timestamp
of when the message was first seen by the system. If this is
omitted, it will be added when the message is first parsed, but may
then be somewhat inaccurate.
Commands should include an `id` field containing a unique integer
identifier for the command. If this is omitted, it will be added
when the message is first parsed.
Failed messages will have an `attempts` annotation containing an
array of maps of the form:
{:timestamp <timestamp>
:error "some error message"
:trace <stack trace from :exception>}
Each entry corresponds to a single failed attempt at handling the
message, containing the error message, stack trace, and timestamp
for each failure. PuppetDB may discard messages which have been
attempted and failed too many times, or which have experienced
fatal errors (including unparseable messages).
Failed messages will be stored in files in the "dead letter
office", located under the MQ data directory, in
`/discarded/<command>`. These files contain the annotated message,
along with each exception that occurred while trying to handle the
message.
We currently support the following wire formats for commands:
1. Java Strings
2. UTF-8 encoded byte-array
In either case, the command itself, once string-ified, must be a JSON-formatted string with the aforementioned structure.

Centralized place for reading a user-defined config INI file, validating, defaulting and converting into a format that can start up a PuppetDB instance.
The schemas in this file define what is expected to be present in the INI file and the format expected by the rest of the application.
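A minimal sketch of what schema-driven validation looks like, assuming Plumatic Schema (the keys shown are invented for the example, not the real PuppetDB config schema):

```clojure
(require '[schema.core :as s])

(s/defschema DatabaseConfig
  {:subname s/Str
   (s/optional-key :username) s/Str
   (s/optional-key :gc-interval) s/Int})

;; Throws if the user-supplied map doesn't match the schema.
(s/validate DatabaseConfig
            {:subname "//localhost:5432/puppetdb" :gc-interval 60})
```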
PuppetDB's normal entry point. Dispatches to command line subcommands.
Some generic HoneySQL extensions, candidates for re-usability and potential upstream submission.
Query parameter manipulation
Functions that aid in the parsing, serialization, and manipulation of PuppetDB queries embedded in HTTP parameters.
REST server
Consolidates our disparate REST endpoints into a single Ring application.
Import utility
This is a command-line tool for importing data into PuppetDB. It expects
as input a tarball generated by the PuppetDB export command-line tool.
Database utilities
JDBC helper functions
External code should not call any of these functions directly, as they are subject to change without notice.
Carrier for bytea parameters, to support clojure.java.jdbc and next.jdbc protocol extensions. Essentially just a typed wrapper around a byte[].
Carrier for bytea[] parameters, to support clojure.java.jdbc and next.jdbc protocol extensions. Essentially just a typed wrapper around a byte[][].
Versioning Utility Library
This namespace contains some utility functions relating to checking version numbers of various fun things.
Ring middleware
Puppet nodes parsing
Functions that handle conversion of nodes from wire format to internal PuppetDB format, including validation.
SQL query compiler
The query compiler operates in a multi-step process. Compilation begins with
one of the `foo-query->sql` functions. The job of these functions is
basically to call `compile-term` on the first term of the query to get back
the "compiled" form of the query, and then to turn that into a complete SQL
query.
The compiled form of a query consists of a map with two keys: `where`
and `params`. The `where` key contains SQL for querying that
particular predicate, written in such a way as to be suitable for placement
after a `WHERE` clause in the database. `params` contains, naturally, the
parameters associated with that SQL expression. For instance, a resource
query for `["=" ["node" "name"] "foo.example.com"]` will compile to:
{:where "catalogs.certname = ?"
:params ["foo.example.com"]}
The `where` key is then inserted into a template query to return
the final result as a string of SQL code.
The compiled query components can be combined by operators such as
`AND` or `OR`, which return the same sort of structure. Operators
which accept other terms as their arguments are responsible for
compiling their arguments themselves. To facilitate this, those
functions accept as their first argument a map from operator to
compile function. This allows us to have a different set of
operators for resources and facts, or queries, while still sharing
the implementation of the operators themselves.
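A hedged sketch of how an `and` operator could combine already-compiled terms of that `{:where ... :params ...}` shape (illustrative only, not this namespace's actual operator; the column names are invented):

```clojure
(require '[clojure.string :as string])

(defn combine-and [compiled-terms]
  {:where  (str "(" (string/join ") AND (" (map :where compiled-terms)) ")")
   :params (vec (mapcat :params compiled-terms))})

(combine-and
 [{:where "catalogs.certname = ?"      :params ["foo.example.com"]}
  {:where "catalog_resources.type = ?" :params ["Class"]}])
;; => {:where "(catalogs.certname = ?) AND (catalog_resources.type = ?)"
;;     :params ["foo.example.com" "Class"]}
```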
Other operators include the subquery operators, `in`, `extract`, and
`select-resources` or `select-facts`. The `select-foo` operators implement
subqueries, and are simply implemented by calling their corresponding
`foo-query->sql` function, which means they return a complete SQL query
rather than the compiled query map. The `extract` function knows how to
handle that, and is the only place those queries are allowed as arguments.
`extract` is used to select a particular column from the subquery. The
sibling operator to `extract` is `in`, which checks that the value of
a certain column from the table being queried is in the result set returned
by `extract`. Composed, these three operators provide a complete subquery
facility. For example, consider this fact query:
["and"
["=" ["fact" "name"] "ipaddress"]
["in" "certname"
["extract" "certname"
["select-resources" ["and"
["=" "type" "Class"]
["=" "title" "apache"]]]]]]
This will perform a query (via `select-resources`) for resources matching
`Class[apache]`. It will then pick out the `certname` from each of those,
and match against the `certname` of fact rows, returning those facts which
have a corresponding entry in the results of `select-resources` and which
are named `ipaddress`. Effectively, the semantics of this query are "find
the ipaddress of every node with Class[apache]".
The resulting SQL from the `foo-query->sql` functions selects all the
columns. Thus consumers of those functions may need to wrap that query with
another `SELECT` to pull out only the desired columns. Similarly for
applying ordering constraints.
AST parsing
Fact query generation
SQL/query-related functions for events
Fact query generation
This provides a monitor for in-progress queries. The monitor keeps track of each registered query's deadline, client socket (channel), and possible postgresql connection, and whenever the deadline is reached or the client disconnects (the query is abandoned), the monitor will attempt to kill the query -- currently by invoking pg_terminate() on the query's registered postgres pid.
The main focus is client disconnections since there didn't appear to be any easy way to detect/handle them otherwise, and because without the pg_terminate, the server might continue executing an expensive query for a long time after the client is gone (say via browser page refresh).
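The kill step itself is a single statement against PostgreSQL; a hedged sketch via next.jdbc (the wrapper function and the `ds` datasource are illustrative, not the monitor's actual code):

```clojure
(require '[next.jdbc :as jdbc])

;; Ask PostgreSQL to terminate the backend process serving pg-pid.
(defn terminate-backend! [ds pg-pid]
  (jdbc/execute-one! ds ["SELECT pg_terminate_backend(?)" pg-pid]))
```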
It's worth noting that including this monitor, we have three different query timeout mechanisms. The other two are the time-limited-seq and the jdbc/update-local-timeouts operations in query-eng. We have all three because the time-limited seq only works when rows are moving (i.e. not when blocked waiting on pg or blocked pushing json to the client), the pg timeouts have an unusual granularity, i.e. they're a per-pg-wire-batch timeout, not a timeout for an entire statement like a top-level select, and the pg_terminate()s used here in the monitor are more expensive than either of those (killing an entire pg worker process).
The core of the monitor is a traditional (here NIO based) socket select loop, which should be able to handle even a large number of queries reasonably efficiently, and without requiring some number of threads proportional to the in-progress query count.
The current implementation is intended to respect the Selector
concurrency requirements, aided in part by limiting most work to the
single monitor thread, though forget does compete with the monitor
loop (coordinating via the :terminated promise).
No operations should block forever; they should all eventually (in some cases customizably) time out, and the current implementation is intended, overall, to try to let pdb keep running, even if the monitor (thread) dies somehow. The precipitating errors should still be reported to the log.
Every monitored query will have a SelectionKey associated with it. The key is cancelled during forget, but won't be removed from the selector's cancelled set until the next call to select. During that time, another query on the same socket/connection could try to re-register the cancelled key. This will throw an exception, which we suppress and retry until the select loop finally removes the cancelled key, and we can re-register the socket.
Every monitored query may also have a postgres pid associated with it, and whenever it does, that pid should be terminated (in coordination with the :terminated promise) once the query has been abandoned or has timed out.
The terminated promise coordinates between the monitor and attempts to remove (forget) a query. The arrangement is intended to make sure that the attempt to forget doesn't return until any competing termination attempt has finished, or at least had a chance to finish (otherwise the termination could kill a pg worker that's no longer associated with the original query, i.e. it's handling a new query that jetty has picked up on that channel).
The client socket monitoring depends on access to the jetty query response which (at least at the moment) provides indirect access to the java socket channel which can be read to determine whether the client is still connected.
The current implementation is completely incompatible with http "pipelining", but it looks like that is no longer a realistic concern: https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-pipelining/
If that turns out to be an incorrect assumption, then we'll have to reevaluate the implementation and/or feasibility of the monitoring. That's because so far, the only way we've found to detect a client disconnection is to attempt to read a byte. At the moment, that's acceptable because the client shouldn't be sending any data during the response (which of course wouldn't be true with pipelining, where it could be sending additional requests).
This provides a monitor for in-progress queries. The monitor keeps track of each registered query's deadline, client socket (channel), and possible postgresql connection, and whenever the deadline is reaached or the client disconnects (the query is abandoned), the monitor will attempt to kill the query -- currently by invoking a pg_terminate() on the query's registered postgres pid. The main focus is client disconnections since there didn't appear to be any easy way to detect/handle them otherwise, and because without the pg_terminate, the server might continue executing an expensive query for a long time after the client is gone (say via browser page refresh). It's worth noting that including this monitor, we have three different query timeout mechanisms. The other two are the time-limited-seq and the jdbc/update-local-timeouts operations in query-eng. We have all three because the time-limited seq only works when rows are moving (i.e. not when blocked waiting on pg or blocked pushing json to the client), the pg timeouts have an unusual granularity, i.e. they're a per-pg-wire-batch timeout, not a timeout for an entire statement like a top-level select, and the pg_terminate()s used here in the monitor are more expensive than either of those (killing an entire pg worker process). The core of the monitor is a traditional (here NIO based) socket select loop, which should be able to handle even a large number of queries reasonably efficiently, and without requiring some number of threads proportional to the in-progress query count. The current implementation is intended to respect the Selector concurrency requirements, aided in part by limiting most work to the single monitor thread, though `forget` does compete with the monitor loop (coordinating via the `:terminated` promise. No operations should block forever; they should all eventually (in some cases customizably) time out, and the current implementation is intended, overall, to try to let pdb keep running, even if the monitor (thread) dies somehow. The precipitating errors should still be reported to the log. Every monitored query will have a SelectionKey associated with it, The key is cancelled during forget, but won't be removed from the selector's cancelled set until the next call to select. During that time, another query on the same socket/connection could try to re-register the cancelled key. This will throw an exception, which we suppress and retry until the select loop finally removes the cancelled key, and we can re-register the socket. Every monitored query may also have a postgres pid associated with it, and whenever it does, that pid should be terminated (in coordination with the :terminated promise) once the query has been abandoned or has timed out. The terminated promise coordinates between the monitor and attempts to remove (forget) a query. The arrangement is intended to make sure that the attempt to forget doesn't return until any competing termination attempt has finished, or at least had a chance to finish (otherwise the termination could kill a pg worker that's no longer associated with the original query, i.e. it's handling a new query that jetty has picked up on that channel). The client socket monitoring depends on access to the jetty query response which (at least at the moment) provides indirect access to the java socket channel which can be read to determine whether the client is still connected. 
The current implementation is completely incompatible with http "pipelining", but it looks like that is no longer a realistic concern (see https://daniel.haxx.se/blog/2019/04/06/curl-says-bye-bye-to-pipelining/). If that turns out to be an incorrect assumption, then we'll have to reevaluate the implementation and/or feasibility of the monitoring. That's because so far, the only way we've found to detect a client disconnection is to attempt to read a byte. At the moment, that's acceptable because the client shouldn't be sending any data during the response (which of course wouldn't be true with pipelining, where it could be sending additional requests).
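To make the disconnect-detection idea concrete, here is a minimal, hypothetical sketch (not the actual monitor code) of an NIO select loop that watches registered client channels and treats a read of -1 bytes, or an IOException, as an abandoned query. The `watch-client-channels` name, the callback map, and the error handling are assumptions for illustration; the real monitor also tracks deadlines and postgres pids, and has to cope with the cancelled-key re-registration race described above.

```clojure
(import '(java.nio ByteBuffer)
        '(java.nio.channels Selector SelectionKey SocketChannel))

;; Hypothetical sketch only: watch client channels and invoke a callback when
;; the peer disconnects (read returns -1) or errors out.
(defn watch-client-channels
  "channels->on-abandon is a map of SocketChannel -> zero-arg callback invoked
  when that client appears to have gone away."
  [channels->on-abandon select-timeout-ms]
  (let [selector (Selector/open)
        buf (ByteBuffer/allocate 1)]
    ;; Channels must be in non-blocking mode to be registered with a Selector.
    (doseq [[^SocketChannel ch on-abandon] channels->on-abandon]
      (.configureBlocking ch false)
      (.register ch selector SelectionKey/OP_READ on-abandon))
    (loop []
      (when (.isOpen selector)
        (when (pos? (.select selector (long select-timeout-ms)))
          (let [selected (.selectedKeys selector)]
            (doseq [^SelectionKey k selected]
              (let [^SocketChannel ch (.channel k)
                    on-abandon (.attachment k)
                    n (try
                        (.clear buf)
                        (.read ch buf)
                        (catch java.io.IOException _ -1))]
                ;; 0 just means the client is quietly waiting for the response;
                ;; -1 (or an IOException) means it has disconnected.
                (when (neg? n)
                  (.cancel k)
                  (on-abandon))))
            (.clear selected)))
        (recur)))))
```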
Paging query parameter manipulation
Functions that aid in the validation and processing of the query parameters related to paging PuppetDB queries
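As a hypothetical illustration of the kind of checks involved, the sketch below validates and coerces `limit` and `offset` query parameters. The parameter names, the `validate-paging-params` function, and the error behaviour are assumptions for the example, not this namespace's actual API.

```clojure
;; Hypothetical sketch only; parameter names and error handling are assumed.
(defn- parse-non-negative-int
  "Coerces value (string or number) to a non-negative integer, or throws."
  [pname value]
  (let [n (try
            (if (string? value) (Long/parseLong value) (long value))
            (catch Exception _
              (throw (IllegalArgumentException.
                      (format "Illegal value '%s' for %s" value pname)))))]
    (if (neg? n)
      (throw (IllegalArgumentException.
              (format "%s must be non-negative, got %s" pname n)))
      n)))

(defn validate-paging-params
  "Returns a map of coerced paging options from string-keyed query params."
  [{:strs [limit offset] :as _params}]
  (cond-> {}
    limit  (assoc :limit  (parse-non-negative-int "limit" limit))
    offset (assoc :offset (parse-non-negative-int "offset" offset))))

(comment
  (validate-paging-params {"limit" "50" "offset" "100"})
  ;; => {:limit 50, :offset 100}
  )
```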
Population-wide queries
Contains queries and metrics that apply across an entire population.
Resource querying
This implements resource querying, using the query compiler in
puppetlabs.puppetdb.query, basically by munging the results into the
right format and picking out the desired columns.
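For a rough idea of what that munging can look like, here is a hypothetical sketch that renames database-style columns to API field names and keeps only the requested columns; the column names and the `munge-resource-rows` function are assumptions for illustration.

```clojure
(require '[clojure.set :as set])

;; Hypothetical sketch only; real column names and projections differ.
(defn munge-resource-rows
  "Renames assumed DB column names to API field names and, when
  projected-fields is non-empty, keeps only those fields."
  [projected-fields rows]
  (for [row rows]
    (cond-> (set/rename-keys row {:restype :type})
      (seq projected-fields) (select-keys projected-fields))))

(comment
  (munge-resource-rows
   [:type :title]
   [{:certname "node1.example.com" :restype "File" :title "/etc/motd"}])
  ;; => ({:type "File", :title "/etc/motd"})
  )
```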
Puppet report/event parsing
Functions that handle conversion of reports from wire format to internal PuppetDB format, including validation.
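A minimal, hypothetical sketch of the wire-to-internal step for a report: keywordize the JSON map and check a handful of required fields. The field names and the `wire->internal-report` function are assumptions for the example; the real conversion and validation are more involved.

```clojure
(require '[clojure.walk :as walk])

;; Hypothetical sketch only; the field names are assumed for illustration.
(def ^:private assumed-required-fields
  [:certname :puppet_version :report_format :start_time :end_time])

(defn wire->internal-report
  "Keywordizes a wire-format report map and checks a few required fields."
  [wire-report]
  (let [report (walk/keywordize-keys wire-report)
        missing (remove #(contains? report %) assumed-required-fields)]
    (when (seq missing)
      (throw (IllegalArgumentException.
              (str "Report missing required fields: " (pr-str (vec missing))))))
    report))
```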
Schema migrations
The `initialize-schema` function can be used to prepare the
database, applying all the pending migrations to the database, in
ascending order of schema version. Pending is defined as having a
schema version greater than the current version in the database.
A migration is specified by defining a function of arity 0 and adding it to
the `migrations` map, along with its schema version. To apply the migration,
the migration function will be invoked, and the schema version and current
time will be recorded in the schema_migrations table.
A migration function can return a map with ::vacuum-analyze to indicate what tables
need to be analyzed post-migration.
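Here is a simplified, hypothetical sketch of that mechanism: a map of schema version to arity-0 migration function, plus a helper that applies everything newer than the current version in ascending order and collects any ::vacuum-analyze requests. The version numbers, migration bodies, and `record-version!` callback are illustrative assumptions; the real code also records the applied version and current time in schema_migrations.

```clojure
;; Hypothetical, simplified sketch of the mechanism described above.
(def example-migrations
  "Map of schema version -> arity-0 migration fn (versions are illustrative)."
  {28 (fn add-some-index []
        ;; would run DDL here, e.g. CREATE INDEX ...
        nil)
   29 (fn widen-some-column []
        ;; a migration may request a post-migration analyze via ::vacuum-analyze
        {::vacuum-analyze #{"facts"}})})

(defn pending-migrations
  "Returns [version migration-fn] pairs newer than current-version, in
  ascending version order (gaps in the numbering are fine)."
  [migrations current-version]
  (->> migrations
       (filter (fn [[version _]] (> version current-version)))
       (sort-by first)))

(defn apply-pending!
  "Applies all pending migrations, calling record-version! after each one
  (standing in for the schema_migrations insert), and returns the set of
  tables that requested a post-migration vacuum/analyze."
  [migrations current-version record-version!]
  (reduce (fn [tables [version migrate!]]
            (let [result (migrate!)]
              (record-version! version)
              (into tables (::vacuum-analyze result))))
          #{}
          (pending-migrations migrations current-version)))
```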
NOTE: in order to support bug-fix schema changes to older branches without
breaking the ability to upgrade, it is possible to define a sequence of
migrations with non-sequential integers. e.g., if the 1.0.x branch
contains migrations 1-5, and the 2.0.x branch contains schema migrations
1-10, and then a bugfix schema change (such as creating or adding an index)
is identified, this migration can be defined as #11 in both branches. Code
in the 1.0.x branch should happily apply #11 even though it does not have
a definition for 6-10. Then when a 1.0.x user upgrades to 2.0.x, migrations
6-10 will be applied, and 11 will be skipped because it's already been run.
Because of this, it is crucial to be extremely careful about numbering new
migrations if they are going into multiple branches. It's also crucial to
be absolutely certain that the schema change in question is compatible
with both branches and that the migrations missing from the earlier branch
can reasonably and safely be applied *after* the bugfix migration, because
that is what will happen for upgrading users.
In short, here are some guidelines re: applying schema changes to multiple
branches:
1. If at all possible, avoid it.
2. Seriously, are you sure you need to do this? :)
3. OK, if you really must do it, make sure that the schema change in question
is as independent as humanly possible. For example, things like creating
or dropping an index on a table should be fairly self-contained. You should
think long and hard about any change more complex than that.
4. Determine what the latest version of the schema is in each of the two branches.
5. Examine every migration that exists in the newer branch but not the older
branch, and make sure that your new schema change will not conflict with
*any* of those migrations. Your change must be able to execute successfully
regardless of whether it is applied BEFORE all of those migrations or AFTER
them.
6. If you're certain you've met the conditions described above, choose the next
available integer from the *newer* branch and add your migration to both
branches using this integer. This will result in a gap between the integers
in the migrations array in the old branch, but that is not a problem.
_TODO: consider using multimethods for migration funcs_
Handles all work related to database table partitioning
Catalog persistence
Catalogs are persisted in a relational database. Roughly speaking, the schema looks like this:
* resource_parameters are associated with 0 to N catalog_resources rows (they are deduped across catalogs). It's possible for a resource_param to exist in the database, yet not be associated with a catalog. This is done as a performance optimization.
* edges are associated with a single catalog
* catalogs are associated with a single certname
* facts are associated with a single certname
The standard set of operations on information in the database will
likely result in dangling resources and catalogs; to clean these
up, it's important to run garbage-collect!.
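To illustrate the deduplication idea (not the actual storage code), here is a hypothetical sketch where parameter maps are stored once under a content hash and resources only reference that hash; the hashing scheme and names are assumptions.

```clojure
;; Hypothetical sketch only; the real schema, hashing, and storage code differ.
(defn params-hash
  "Stand-in content hash for a resource's parameter map."
  [parameters]
  (str (hash parameters)))

(defn store-resource-parameters!
  "Ensures an entry for these parameters exists in param-store (an atom of
  {hash parameters}, standing in for resource_parameters) and returns the
  hash a catalog_resources row would reference."
  [param-store parameters]
  (let [h (params-hash parameters)]
    (swap! param-store (fn [m] (if (contains? m h) m (assoc m h parameters))))
    h))

(comment
  ;; Two catalogs with identical parameters share one stored entry; entries
  ;; left behind when catalogs go away are the "dangling" rows that
  ;; garbage-collect! cleans up.
  (def store (atom {}))
  (store-resource-parameters! store {"ensure" "present" "mode" "0644"})
  (store-resource-parameters! store {"ensure" "present" "mode" "0644"})
  (count @store) ;; => 1
  )
```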
Time-related Utility Functions
This namespace contains some utility functions for working with objects
related to time; it is mostly based on the `Period` class from
the Joda-Time library for Java.
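As a small, hypothetical example of working with Joda-Time periods from Clojure (the function name is an assumption, not this namespace's API):

```clojure
(import '(org.joda.time Period))

;; Hypothetical example only.
(defn period->millis
  "Converts a days/hours/minutes/seconds Period to milliseconds.
  Periods containing months or years have no fixed length and would throw."
  [^Period p]
  (.getMillis (.toStandardDuration p)))

(comment
  (period->millis (Period/minutes 5))              ;; => 300000
  (period->millis (.plusHours (Period/days 1) 6))  ;; => 108000000
  )
```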