Copyright © 2013 Splunk Inc.
Sean Delaney Client Architect, Splunk #splunkconf
Log Velocity Monitoring
Legal Notices
During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Splunk Storm, Listen to Your Data, SPL and The Engine for Machine Data are trademarks and registered trademarks of Splunk Inc. in the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners.
©2013 Splunk Inc. All rights reserved.
About Me
• Splunk Client Architect
  – Splunker for 2+ years
  – Using Splunk for 6+ years
  – Large Splunk Deployments
• Previously
  – Splunk Professional Services
  – 10+ years Production Services for a large Internet Security Company
Agenda
• Log Velocity
• Monitoring and Alerting
• Drill Down Demo
Log Velocity
Traditional Velocity aka Speed
[Figure: Velocity (m/s), plotting Distance (m) against Time (s)]
Log Velocity
• Logging Data Rate
  – Events per Second (eps)
  – Data Volume per Second (kbps)
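As a rough illustration of the two rates above: log velocity is computed like any other rate, a count divided by elapsed time. The sketch below is in Python rather than SPL. The function name `log_velocity` and the sample figures are hypothetical, and it assumes kbps means kilobytes per second with 1 KB = 1024 bytes.

```python
def log_velocity(event_count, bytes_indexed, interval_seconds):
    """Return (eps, kbps) for one measurement interval.

    Assumption: kbps means kilobytes per second, with 1 KB = 1024 bytes.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    eps = event_count / interval_seconds
    kbps = (bytes_indexed / 1024) / interval_seconds
    return eps, kbps


# Hypothetical sample: 18,000 events totalling 6,144,000 bytes in 10 minutes.
eps, kbps = log_velocity(18_000, 6_144_000, 600)
print(f"{eps:.1f} eps, {kbps:.1f} kbps")
```

With these sample figures the interval works out to 30 events per second and 10 kilobytes per second.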
Increases or Decreases in Log Velocity
• Environmental changes
  – New service, servers or new data sources added to Splunk
  – Application change (New code deployment, configuration change)
  – Networking Change (Firewall, Routing)
  – Service migration
• Traffic changes
  – More users accessing service(s)
  – Change in Application logging level (Debug mode)
  – Core component is down or has intermittent issues (Database)
  – Logs not being generated or forwarded (Changed log file directory, syslog server down)
Higher Level Approach to Service Monitoring
• Look at the forest, not just the trees
  – License Usage
  – Event rate per Index, Sourcetype, Source
  – Network Throughput
  – Monitor event counts for errors and alerts
Is There an Issue?
• Operations team is alerted that Splunk is slow
• Service owners notice a slowdown, then their website is unavailable
• Customer Service call volume jumps
• Operations team is now flooded with monitoring alerts, phone, email and chat messages
• Splunk admins notice a major spike in indexing volume
• Further investigation detects a corresponding spike in webserver access and error logs
• Service was DOSed (from an internal source)
• Early detection would have mitigated the issue and reduced customer impact
• Alerts on either indexing volume or webserver event counts would have notified Operations of the change in activity
Log Velocity Use Cases
• Security Use Cases
  – DOS/DDOS
  – Service or Port Knocking
• Webserver Access and Error Logs
• Marketing Campaigns
• Application Error Logs
• Production Code Updates/Rollouts
• Infrastructure Changes
• Network Routing or Spanning Tree changes
• DNS/SMTP Changes
Monitoring Log Velocity
Where to Measure Log Velocity
• Splunk's metrics.log:
  – Event counts (ev), events per second (eps)
  – Data indexed (kb), index throughput (kbps)
• Metrics data is logged by group:
  – per_index_thruput
  – per_sourcetype_thruput
  – per_source_thruput
  – per_host_thruput
• Example searches:
  – index=_internal source="*/metrics.log" "group=per_index_thruput" | timechart span=10m sum(ev) by series
  – index=_internal source="*/metrics.log" "group=per_index_thruput" | timechart span=10m avg(kbps) by series
• Other sources:
  – Splunk license logs
  – Custom event count searches
    index=myapp error | timechart span=10m count
Logging Workloads
• Log data workloads are normally cyclic
• Service peaks often correspond to business or trading hours
• Weekday trends normally follow the same cycle
• Logging may drop off on weekends/holidays (business services)
• Log volume could be greater in the evenings or on weekends (online gaming)
• Logging can go crazy
  – Black Friday/Cyber Monday (online shopping)
• Take into account global/regional timezones
Monitoring and Alerting
Alerting Thresholds
• When defining alerting thresholds, consider either setting a fixed upper boundary or basing the threshold on your normal data workload
• Compare to the same time period yesterday, last week, last month
• Absolute Thresholds:
  – index=_internal source="*/metrics.log" group="per_index_thruput" series="main" | timechart span=10m sum(ev) as ev_count | stats max(ev_count) as max_ev | search max_ev>`lv_threshold`
• Macro used to hold `lv_threshold` value:
  [lv_threshold]
  definition = 600
  iseval = 0
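The logic of the absolute-threshold search can be sketched outside Splunk as follows. This is a minimal Python illustration, not the search itself: the `(epoch_seconds, ev)` sample pairs are a hypothetical stand-in for metrics.log events, and the function names are made up for this example. Only the 10-minute span and the value 600 come from the slide.

```python
from collections import defaultdict

LV_THRESHOLD = 600  # mirrors the lv_threshold macro definition
SPAN_SECONDS = 600  # 10-minute buckets, like span=10m


def max_bucket_count(samples, span=SPAN_SECONDS):
    """Sum ev per time bucket and return the largest bucket total.

    `samples` is an iterable of (epoch_seconds, ev) pairs, a hypothetical
    stand-in for the metrics.log events the search reads.
    """
    buckets = defaultdict(int)
    for ts, ev in samples:
        buckets[ts - ts % span] += ev  # floor the timestamp to its bucket start
    return max(buckets.values(), default=0)


def breaches_threshold(samples, threshold=LV_THRESHOLD):
    """True when the largest bucket total exceeds the threshold."""
    return max_bucket_count(samples) > threshold


# Hypothetical samples spread over three 10-minute buckets.
samples = [(0, 200), (300, 250), (600, 400), (900, 350), (1200, 100)]
print(max_bucket_count(samples), breaches_threshold(samples))
```

Because the comparison uses the maximum bucket, a single busy 10-minute window is enough to trigger, which is the behaviour you want for spike detection but makes the fixed value sensitive to normal daily peaks.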
• Compare to same time previous day, day of week, etc.
  earliest=-10m latest=@m index=_internal source="*/metrics.log" group="per_index_thruput" series="main"
  | stats sum(ev) as ev_count_1
  | append [search earliest=-1450m latest=-1440m index=_internal source="*/metrics.log" group="per_index_thruput" series="main" | stats sum(ev) as ev_count_2 ]
  | stats first(ev_count_1) as ev_count_today, first(ev_count_2) as ev_count_yesterday
  | eval delta=abs(ev_count_today - ev_count_yesterday)
  | eval threshold=ev_count_yesterday*0.1
  | search delta>threshold
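The arithmetic at the end of that search is a relative comparison: alert when today's count differs from yesterday's by more than 10% in either direction. A minimal Python sketch of the same check, with a hypothetical function name and made-up counts:

```python
def deviates_from_baseline(count_today, count_yesterday, tolerance=0.1):
    """True when today's count differs from yesterday's by more than `tolerance`.

    Mirrors the search: delta = abs(today - yesterday),
    threshold = yesterday * 0.1.
    """
    delta = abs(count_today - count_yesterday)
    threshold = count_yesterday * tolerance
    return delta > threshold


# Hypothetical counts for the same 10-minute window on two days.
print(deviates_from_baseline(1150, 1000))  # delta 150 against a threshold of 100
print(deviates_from_baseline(950, 1000))   # delta 50 against a threshold of 100
```

Using the absolute difference means a sudden drop (for example, a forwarder that stopped sending) alerts just like a spike. Note that if yesterday's count was zero, the threshold is zero and any events today will trigger.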
Summary Indexing
• Summary Indexing your Log Velocity has benefits
  – Faster Loads for Monitoring Dashboards
  – Provide faster stats for comparative alerting
Error Log Velocity
• Monitor and baseline error counts for an application
• Table the top 50 error types/codes
• When a new code release is deployed, monitor for an increase in errors
• Table the top 50 error types/codes and compare with the results from the previous release
• Deploy patch/hotfix/update, and repeat until a stable state has been re-established
sourcetype="apache_error" | rex "^(?:[^\]]*\]){3}\s*(?<phperr>[^\:]+)\:\s*(?<msg>.*)" | stats count by phperr,msg | sort - count | head 50 | fields count,msg
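The "compare with the previous release" step can be sketched as a diff of two error tallies. The Python below is an illustration under assumptions, not part of the talk: the function name `error_increases` and the sample error types are hypothetical, and each input list stands in for the error type extracted from one release's events.

```python
from collections import Counter


def error_increases(previous, current, top_n=50):
    """Return (error, old_count, new_count) for top-N errors that grew or are new."""
    prev_counts = Counter(previous)
    curr_counts = Counter(current)
    report = []
    for error, new_count in curr_counts.most_common(top_n):
        old_count = prev_counts.get(error, 0)
        if new_count > old_count:
            report.append((error, old_count, new_count))
    return report


# Hypothetical error types observed before and after a release.
previous = ["PHP Notice"] * 5 + ["PHP Warning"] * 3
current = ["PHP Notice"] * 4 + ["PHP Warning"] * 7 + ["PHP Fatal error"] * 2
for row in error_increases(previous, current):
    print(row)
```

Errors that are new in the current release show up with an old count of 0, which is usually the first thing to look at after a deployment.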
Drill Down Demo
Summary
Monitoring Log Velocity provides additional insight into your environment
• Detect and alert on environmental changes and abnormal traffic volumes
• Provides feedback on code deployments
• First level alerting for issues
• Useful for NOC/SOC monitoring
• Starting point for drill down investigations
Questions?
Next Steps
Download the .conf2013 Mobile App
If not iPhone, iPad or Android, use the Web App
Take the survey & WIN A PASS FOR .CONF2014… Or one of these bags!
THANK YOU