JEP |
210 |
Title |
External log storage for Pipeline |
Sponsor |
|
Status |
Accepted 👌 |
Type |
Standards |
Created |
2018-07-24 |
JIRA |
|
BDFL-Delegate |
|
Discussions-To |
|
Requires |
JEP-206: UTF-8 for Pipeline logs |
This project overhauls the storage of logs for Jenkins Pipeline builds. (The log can be seen from the Console link, but also from Blue Ocean, Pipeline Steps, and REST and CLI.)
There are four stages of changes:
-
Minimize Remoting channel usage for durable task (
sh
,bat
,powershell
) output. -
Replace aggregation of per-step log files with a single log stream annotated by step ID.
-
Make the traditional Console view use semantic markup of steps and CSS.
-
Permit the entire log to be sent to a pluggable external service: in the reference implementation, Amazon CloudWatch Logs.
#1 is implemented as a set of independently releasable branches. (It is tracked by JENKINS-52165.) #2, #3, and #4 constitute the complete change, and depend on #1 as well as JEP-206 for use of UTF-8 in Pipeline log files.
The scope does not include other project types (freestyle), or other types of logging (system, branch indexing). JEP-207 covers at least freestyle build logs, and could be integrated with these changes via JEP-212.
Together, the effect is that the TaskListener
associated with the overall build,
as well as with each individual step,
is potentially a “remotable” object which can autonomously stream log messages
directly from an agent computer to a cloud log service.
If and when a Jenkins-based UI needs to observe build or step logs,
the Jenkins master will retrieve messages from that service,
and either filter them by step,
or collate them by timestamp into a whole-build log annotated with step metadata.
When running durable tasks such as sh
on an agent,
which involves “tailing” an output.txt
file in a hidden directory,
the agent process begins checking for growth in the file at short recurring intervals.
Any new content is streamed to the master.
By using a last-location.txt
file to record the last-observed file length,
the state is persistent, so that log streaming can resume after a restart of the agent process
(including a restart of the Jenkins master, which forces agents to restart as well).
The principal new API here, in durable-task
, is Controller.watch
.
This takes a newly defined interface Handler
capable of receiving text content and an exit event.
DurableTaskStep
calls the new APIs when available
(which it will be for all standard steps),
otherwise falling back to the old polling mode.
The step will still poll at infrequent intervals
in case the agent became unresponsive and the step needs to abort.
The Handler
implementation serializes the TaskListener
coming from the StepContext
,
allowing an external log sink to bypass all master communications.
There is a default FileLogStorage
implementation based on the traditional log
file inside the build directory.
All log messages associated with the build are sent to the BuildListener
which writes to this file.
Steps which call getContext().get(TaskListener.class)
for logging
trigger addition of a LogStorageAction
(a stateless marker) to the FlowNode
;
this supplies a TaskListener
which tracks the FlowNode.id
(typically an integer)
used at certain byte offsets in the log, in a separate log-index
file.
For example:
1597 7
1985
2285 8
2686
2983 10
3034
4245 14
4287 10
4296 14
4328 10
4337 14
4360 10
4369 14
4392 10
4401 14
4424
5318 10
5329 14
7537
Conversely, calls to LogAction.getLogText()
produce an implementation
which concatenates segments of the log
for text corresponding to that step.
Console annotations (which are ignored by Blue Ocean)
continue to be written to the log file
using serialized, signed, and Base64-encoded ConsoleNote
objects.
Log lines produced on the agent side are temporarily buffered
before being streamed to the master in relatively large chunks,
to avoid putting too much load on the Remoting channel,
so remote tasks need to flush
before exiting.
Every sequence of consecutive lines in the “classic UI” HTML output (AnnotatedLargeText.writeHtmlTo
)
produced by the same step (technically, FlowNode.id
, which is a little more specific as it tracks body blocks)
is marked with an HTML <span>
identifying that ID.
(The absence of such a span indicates log lines not produced by any node.)
For example:
<span class="pipeline-node-12">+ whoami
jglick
</span>
Implementations of LogStorage.overallLog
such as in FileLogStorage
must call startStep
and endStep
.
NewNodeConsoleNote
is used to print the introduction of a new node into the log, typically displayed like
[Pipeline] sh
with variants for block-scoped steps.
The generated HTML includes another <span>
which encodes not just the node ID,
but also the start node ID (in the case of a block end node),
the enclosing (start) node ID,
any the value of any LabelAction
.
NewNodeConsoleNote
defines a combination of CSS and JavaScript
which, together with both kinds of spans,
implements a richer GUI in the build log view.
In particular, (show) and (hide) links are displayed
which allow output from individual steps or entire blocks
to be selectively displayed.
By default the traditional log
file is used for build logs.
Plugins may override this storage in both the read and write aspects.
LogStorageFactory
is the entry point for such an override;
currently a plugin may pick builds to provide storage for,
but this decision is not persisted
(pending work in JEP-207).
On the write side, the plugin is able to supply a custom TaskListener
.
This interface was already defined by Jenkins core to be remotable.
An implementation which streams to external storage
therefore needs only to ensure that all fields are truly serializable
and that the code to connect to a storage service can be run on a remote node.
There is a variant which records a FlowNode.id
association.
The reference implementation creates JSON-format records to be sent to CloudWatch Logs.
ConsoleNote
instances need to be created on the master side so that they may be signed.
Signed and serialized note constants may be transmitted to agents to handle this in simple cases.
Many traditional uses of this API, however, can be replaced by a new TaskListenerDecorator
interface
that improves on the old ConsoleLogFilter
extension point.
ConsoleAnnotatorFactory
has also been improved for Pipeline use and can replace some uses of ConsoleNote
.
The timestamper
and ansicolor
plugins have been reworked accordingly.
On the read side, the plugin can supply an AnnotatedLargeText
for either the build as a whole or one node.
(While this Jenkins core interface supports HTML rendering for the “classic” UI,
it is also responsible for generating plain-text content as consumed by Blue Ocean.)
The reference implementation makes API calls to CloudWatch Logs to serve content
based on JSON filter patterns to select messages by build and optionally node.
A subtle issue is the use of LargeText.isCompleted
by UI callers,
which determines whether a given log is considered finalized,
in which case no further “AJAX” calls need be made to fetch subsequent content.
Yet CloudWatch Logs does not guarantee that a given record
has been committed to permanent storage when the log event is sent,
and outbound records and queued and coalesced in a buffer,
so without any special effort a build log would sometimes stop refreshing before the end.
This is solved with a utility class TimestampTracker
(which could if necessary be pushed into a lower layer)
which records the last (master-side) log message sent for a given scope
and declines to mark the text block as completed
unless the last timestamp observed in CloudWatch Logs
matches the last-delivered timestamp.
Another feature of the reference implementation is to store ConsoleNote
s separately in JSON.
This is accomplished by the ConsoleNotes
utility
(again, potentially extractable to a shared API layer)
which keeps opaque notes (serialized, signed, and Base64-encoded)
in a separate JSON field,
so that external log viewers can access the plain text easily.
The plugin also supplies a sidebar link in builds
which jumps to a suitably constructed CloudWatch Logs search URL
displayed in the AWS Console.
The overall goal is to minimize the load placed on the Jenkins master process in the common case that the build log is written but not read (or read only via an external log browser).
Changes to durable task polling, log aggregation, and especially log pluggability contribute directly to this goal. Semantic console markup is a small extension to log aggregation.
Historically, when running durable tasks (sh
and kin),
running output was handled by having the master send a callable to the agent at intervals,
initially short (¼s) but growing exponentially up to some maximum (15s) if the process seems to be idle,
and resetting to short again if and when fresh output is detected.
Not only is there up to a 15s delay in displaying new output,
this is wasteful of master and network resources when the process is idle for a long time;
and sending a UserRequest
and corresponding response involves a fair amount of Java serialization.
By contrast, non-durable processes (such as those created by Launcher
in a freestyle build)
use RemoteOutputStream
to send content from the agent to master as soon as it is available, minimizing network traffic;
Remoting is also able to optimize this kind of traffic by sending low-overhead Chunk
packets of tailored sizes.
Another minor benefit is that DurableTaskStep
no longer needs to call StepContext.saveState
every time new output is observed,
which was potentially expensive since it involves a fresh serialization to program.dat
.
Therefore durable tasks should switch from the policy of pulling log output to having the agent push log output. The content cannot be detected immediately, since we are effectively tailing a log file, but it can be detected quickly after the log file is updated with minimal overhead.
The original implementation of Pipeline (then “Workflow”) used a separate log file for each step
as a rough-and-ready solution to the problem of allowing clients of the flow graph
to determine which log lines came from which steps.
To provide support for the various methods in Run
which expect to read a single log
,
a method WorkflowRun.copyLogs
periodically checked for new output in the step log files
and synchronized it to the master log file.
The last-read location for each active step was saved in build.xml
to provide durability.
This system had numerous flaws. Most obviously, it requires almost double the disk space.
The copying had an inherent delay,
ameliorated by eager copying at the time of step completion,
which can lead to flaky tests if care is not taken to wait for content.
Content between parallel
steps was also not interleaved in real time.
A heavily loaded system could wind up consuming considerable CPU and IOPS
running copy tasks for numerous concurrent builds.
Not only did many small step log files need to be read frequently,
but in the safest durability modes every update forced a new build.xml
write,
which besides I/O requires Java serialization of possibly large unrelated objects.
The required synchronization also introduced bottlenecks and occasional deadlocks.
All the background tasks also sometimes consumed all available threads in an executor pool,
leading to starvation of more critical operations.
Using a single log
file and streaming all data directly there
is considerably simpler, even accounting for the need to handle step ID prefixes.
It may be less efficient at read time,
but the primary consideration is minimizing overhead at write time.
The original Pipeline log display hard-coded markup for new node notes and failed to expose any of the node association information to potential UIs. Usability issues in the “classic” log build log UI continue to be brought up by users as annoyances, despite the availability of an alternative UI in Blue Ocean.
Other behaviors, like hiding new node displays
or hiding all but the first parallel
branch initially,
could be added later or even perhaps contributed by plugins.
An earlier draft implementation exposed a simpler SPI to plugins:
they could only supply a TaskListener
for the overall build,
and an InputStream
for the overall build content.
A special marker was inserted into the text to indicate the FlowNode.id
associated with each line.
This however had performance issues, and was ill suited to the CloudWatch Logs implementation.
Most obviously, the InputStream
interface forces the implementation to serve a complete build log
even when the text for only a single step (~ FlowNode.id
) has been requested.
The CloudWatch Logs implementation can do better by using a server-side filter.
More subtly, the InputStream
interface lacked any room for indicating
that the build content was incomplete.
For the default filesystem-based storage,
originally it was attempted to use a special ConsoleNote
to mark which step produced a given line.
This worked, but resulted in unacceptably bloated raw logs:
the serialized form, after GZIPping and Base64-encoding, was over 200 characters per line;
and rendered typical raw logs more or less unreadable.
Therefore a text marker was introduced,
as it adds minimal space overhead and does not interfere with legibility.
This in turn was later replaced even in the default file-based implementation (see above).
Some aspects of the implementation would be easier given certain API changes in Jenkins core (or Stapler).
For example, ConsoleAnnotators
could be replaced by a proper API;
some LargeText
/ AnnotatedLargeText
methods could be better designed for subclassing;
and some overrides in WorkflowRun
would make sense pulled up into Run
.
For now, these considerations were outweighed by the convenience of running on stock versions of Jenkins LTS.
JEP-207 tracks some, though not all, of these proposals.
When the synchronous Launcher
interface is used to start non-durable remote processes,
as happens for example from typical SCM
implementations delegating to a command-line tool,
historically the remotability of any supplied TaskListener
is ignored
and all log lines are sent over the Remoting channel to be processed on the master side:
JENKINS-52729.
This was fixed as a simple patch to Launcher
,
which would also benefit JEP-207 by removing any need to use DecoratedLauncher
s for freestyle build steps.
A related change could perhaps also fix encoding issues with such synchronous steps for JEP-206.
(While TaskListener
was long ago designed to be remotable,
and StreamTaskListener
in fact handled that by using RemoteOutputStream
,
until now it was not noticeable that Launcher
fails to remote the listener
since the effect is the same if the instance is in fact a StreamTaskListener
.
Some existing plugins such as Logstash or AWS CloudWatch Logs Publisher support redirecting or mirroring log messages to cloud services. To the extent that these are even compatible with Pipeline, they nonetheless suffer from fundamental limitations compared to the approach in this JEP: job configurers may have to opt-in to the publishing; log messages may still be kept on disk in the Jenkins master; the existing Jenkins UI gestures to display logs do not pick up data from the cloud; Remoting channels are still clogged with log-related traffic.
ConsoleLogFilter
implementations must be safely remotable in order to work correctly on the agent side.
Since any implementations available for use in Pipeline jobs
must already have been Serializable
(to be saved in program.dat
),
this is not as significant a restriction as it might at first appear.
It does mean that besides being careful about state (non-transient
instance fields),
implementations may not assume they are running inside the master JVM.
Completed FlowNode
s using the old LogActionImpl
will continue to serve log text from the per-step file.
This applies both to completed historical builds,
and to steps completed prior to the resume of a build which spanned the upgrade.
For the special case of a step running across the upgrade,
LogActionImpl
will stream new content to the overall build log,
as well as to the per-step log.
(For that purpose, the upgrade is detected as an update to the workflow-job
plugin.)
Historical builds using WorkflowRunConsoleNote
should continue to render logs,
but without the new semantic markup features.
Plugins remoting a TaskListener
need to take care to flush
the stream after completing a task,
as output may have been buffered.
Plugins calling Run.getLogFile
will produce a warning.
If the actual log is remote, a temporary file will be produced with a copy of the log text.
Any ConsoleLogFilter
with security-sensitive fields
(notably the password masking by the withCredentials
step)
must take into account that it will now be sent to the agent side,
where that data is vulnerable to retrieval or even manipulation by rogue builds.
In the case of withCredentials
this is not an issue,
since the agent already received these same secrets as environment variables.
The reference implementation uses IAM to create temporary access tokens for use by agents. See [this detailed explanation](https://github.com/jenkinsci/pipeline-cloudwatch-logs-plugin/blob/8e5024f02698db2aeed90374a92b90960fa32f9b/README.md#security).
Properly testing the reference implementation would require a CI server to have an AWS account as discussed below.
Automated functional tests verify the basic aspects of the change,
such as the fact that with a suitably remotable TaskListener
,
a sh
step will in fact deliver messages to the log sink constructed on the agent side.
Functional tests for open-source, cluster-based implementations such as ElasticSearch could be run using docker-fixtures
.
Tests for SaaS-based implementations such as CloudWatch require either mocks,
and/or live tests run on restricted CI machines.
The nature of performance testing remains to be defined.
The principal constraint is that the production of logs during a build should be efficient;
Jenkins-based display of logs during a running build or of a completed build
may involve some overhead to retrieve and collate messages,
but this is assumed to be a relatively infrequent event.
A disabled test in workflow-job
demonstrates a typical heavy log load scenario for the file-based implementation.
The reference implementation is a
pipeline-cloudwatch-logs
plugin.
-
Demo given to Cloud Native SIG (YouTube, ~19m, discussion before and after)