Tuning the Logstash multiline filter

I’m currently working on a project where we are using Apache Mesos and Marathon as our platform to run and provision our applications. We use the ELK stack to receive, store and analyze our logs. On each of the nodes in the cluster we have Filebeat (the successor of the Logstash Forwarder) installed, which sends the logs to Logstash.

At first everything looked fine; all our tasks’ logs showed up in Kibana. However, as we added more and more applications to the cluster, we started seeing a problem where log lines from one task would show up in the same event as another task in Kibana.

At first we suspected problems with how we had configured the Filebeat prospector path, which contained multiple wildcards:

/tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest/stdout
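For reference, a minimal Filebeat configuration covering this path might look like the following sketch (Filebeat 1.x style; the Logstash host is a placeholder, not our actual setup):

```yaml
filebeat:
  prospectors:
    - input_type: log
      paths:
        - /tmp/mesos/slaves/*/frameworks/*/executors/*/runs/latest/stdout

output:
  logstash:
    # Placeholder host; point this at your own Logstash instance.
    hosts: ["logstash.example.com:5044"]
```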

The source field contained multiple paths, indicating that this message came from two different tasks. A log message should never have more than one path, so how could this be?

It turned out that the problem was being caused by the multiline filter we were using to collapse multiline messages into one Logstash event.

Looking through the documentation of the multiline filter, we found the stream_identity setting, which determines which stream an event belongs to. By default the stream identity is %{host}.%{path}.%{type}, but for some reason this was not enough in our case. After appending %{source} to the stream identity the problem disappeared. This is what we added to our multiline filter:

 stream_identity => "%{host}.%{path}.%{type}.%{source}"

With that fixed we noticed a different problem. Some messages were occasionally being tagged with _grokparsefailure, indicating that the log parsing had failed. Looking closer, this happened when the log message contained long output, like the stack traces from Java exceptions. The start of a message would be parsed correctly, timestamp and all, but the message would be cut off. The following message would not contain a timestamp and would continue where the previous one stopped.

This seemed strange, as this is exactly the job of the multiline filter: it should keep collapsing lines into one message as long as they do not match the specified pattern, in our case waiting until a line matches a timestamp indicating a new message.
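As a sketch, a multiline filter of this kind typically looks like the following; the timestamp pattern here is a hypothetical example and would need to match your actual log format:

```
filter {
  multiline {
    # Any line that does NOT start with a timestamp is appended
    # to the previous event.
    pattern => "^%{TIMESTAMP_ISO8601}"
    negate  => true
    what    => "previous"
  }
}
```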

More reading of the documentation revealed that there is a periodic_flush setting which defaults to true. That is, the multiline filter will periodically flush events based on the max_age setting, which defaults to 5 seconds. Our theory is that, due to processing delay and load on the ELK server, we hit this timeout, causing our messages to be split.

By increasing the max_age to 60 seconds the problem seems to have disappeared.

max_age => 60

Being a time-based setting, this isn’t bulletproof, as the results will vary strongly with the load of the server, but for now it seems to do the trick. Another option would be to disable the periodic flush, but that could mean waiting a very long time for messages from tasks that don’t log very often.
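Putting the pieces together, the resulting filter looks roughly like this (again, the timestamp pattern is an illustrative example that depends on your log format):

```
filter {
  multiline {
    pattern         => "^%{TIMESTAMP_ISO8601}"
    negate          => true
    what            => "previous"
    # Separate streams per source path so tasks don't get mixed.
    stream_identity => "%{host}.%{path}.%{type}.%{source}"
    # Give slow streams more time before a forced flush.
    max_age         => 60
  }
}
```

Setting periodic_flush => false instead would avoid the timeout entirely, at the cost of events from quiet tasks potentially being held back for a long time.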

The problems mentioned in this post are some of the things being addressed by the new multiline codec, which aims to be the successor of the now-deprecated multiline filter. Unfortunately, changing from the filter to the codec is not trivial in all cases, but I’ll leave that for another post.
