Oracle Database 23c and AHF Insights to do better AIOps
Aug 2023
Sandesh Rao
VP AIOps , Autonomous Database
@sandeshr
https://www.linkedin.com/in/raosandesh/
https://www.slideshare.net/SandeshRao4
AIOps
Observe, Engage, Automate
Realtime data
Historical data
Notification
Collaboration
Compliance
Incident detection
Diagnostic collection
Issue clarification
Cause & solution identification
Runbooks
Machine Learning
Health
Availability
Performance
Capacity
Logs
Scripts
Health Checks
Anomaly Detection
Email
Pager
Jira
Bug
SR
Slack
ACR Sanitization
Copyright © 2023, Oracle and/or its affiliates
AHF AIOps Platform
Detect, Collect, Create, Notify, Clarify, Rediscover, Analyze, Mitigate & Fix
Telemetry
Mini & SRDC
Bug & Jira
Page & Email
IC Service
Bug De-duplication
Issue Clustering
Expert Systems
Timeline
Dev Containers
Source Evaluators
Workaround & Patches
AIOps and Applied Machine Learning
How does Machine Learning play into AIOps?
Time-Series Metrics
Log/Trace Events
Precursor Metric(s)
Precursor Event(s)
Root-Cause Metric
Root-Cause Event
Prevent
Recover
Root-Cause Action
Root-Cause Action
Problem
Predictive
Reactive
AHF Compliance Manager
• Compliance management
• Around 4000+ best practices
• Covers Exadata and security
• Constant Cadence of features
AHF Root Cause Analyzer
• Log scanners for obvious issues
• ML models to root cause
• Eliminate non-defect issues
• Recommend Patches
AHF AutoUpgrade
• Stack Deployment
• RPMs, automated packaged installers
• Standard home locations
What is AHF
AHF Data Collectors
• First Failure Capture
• Telemetry capture, streaming
• Diagnostic log collection
• OS and Database metrics
• Collection standardization
• Rudimentary aggregation and
analysis
AHF ABS
• Bug rediscovery
• Autoclose known issues
• ML based models
• Cloud scale deployment
AHF Service Console
• Front-end for analysis, cause
and solution identification
• Unified Timeline
• Anomaly Detection
• Graphing for Time Series Data
• AHF Insights and Fleet Insights
Autonomous Health Cloud Platform
Diagram labels: Machines, Smart Collectors, SRs, Logs, Bugs, Expert Input, Feedback & Improvement, Knowledge Extraction, Model Generation, Model, Applied Machine Learning, Cloud Ops, Object Store, Message Broker, Admin UI in Control Plane, Oracle Support, Bug DB, SE UI in Support, Tenant (CNS)
1 Deployed as part of cloud image, running from the start
2 Proactive regular health checking, real-time fault detection, automatic incident analysis, diagnostic collection & masking of sensitive data
3 Use real-time health dashboards for anomaly detection, root cause analysis & push of proactive, preventative & corrective actions. Auto bug search & auto bug & SR creation.
4 Auto SR analysis, diagnosis assistance via automatic anomaly detection, collaboration and one click bug creation
5 Cleansing, metadata creation & clustering
6 Model generation with expert scrubbing
What is AHF
AHF Compliance Manager: EXAchk, ORAchk, DBSat, Autoupgrade, CVU, Collection Manager (Apex App)
The verification and compliance tools which support all the components across the stack
AHF Data Collectors: TFA, CHM, Data Plane Telemetry, OSWatcher
The different OS and Data Collectors
AHF Root Cause Analyzer: CHA, DT, Parsers
Automation which responds to the customer issues or makes it easier to slice and dice data
AHF Service Console: AHF Insights and Fleet Insights
The frontend which is visible to Customers and Support
Oracle’s AI Ops Cloud Platform Implementation
What does our platform look like implemented?
Machine View
What are some of the Operations areas that use AML?
AIOps Using Applied Machine Learning
Proactive Prevention: OS Data, DB Data; Real-Time Performance Prognostics Engine; Alert & Preventive Action
Rapid Recovery: Logs, Traces; Alert & Preventive Action
Training (In Lab, steps 1 to 6): Log Cleansing, Entry Feature Creation, Entry Clustering, Model Generation, Knowledge Base Creation (with Expert Input), Knowledge Base Indexing
Real-time (steps 7 to 9): Real-time Log File Processing, Timestamp Correlation & Ranking
Feedback
Pros:
• Destination for Important DB Events
• Single file to monitor by DBAs
• Many tools available to parse
• Supported by TFA for generating alarms
Cons:
• Includes both critical and non-critical events
• Includes messages not intended for DBAs
• Inconsistently reports severity level
• Can report unintuitive cause and action
• New undocumented messages in every release
Oracle Database Alert Log
The Curated Solution - New 21c Attention Log
Contains only important events requiring customer attention
Includes documented set of messages and attributes
All Messages include these attributes:
• Type
• Urgency
• Scope
• Target User
• Cause and Action
• Additional debug information
Oracle Database Attention Log Message Flow
DB Component
Diagnostic Framework
alert/log.xml
log/attention.log
attention.amb (Message Definitions)
Attention Log Message
Attention Curated Message
Attention Log Curation - Message Attributes
TYPE: 1. Error, 2. Warning, 3. Notification
URGENCY: 1. Immediate, 2. Soon, 3. Deferable, 4. Info
SCOPE: 1. Session, 2. Process, 3. PDB-Instance, 4. CDB-Instance, 5. CDB-Cluster, 6. PDB-Persistent, 7. CDB-Persistent
TARGET USER: 1. App-Dev, 2. Sec-Admin, 3. Net-Admin, 4. Cluster-Admin, 5. PDB-Admin, 6. CDB-Admin, 7. Server-Admin, 8. Storage-Admin, 9. DataOps-Admin
// TYPE - 1 error, 2 warning, 3 notification
// URGENCY - 1 immediate, 2 soon, 3 deferable, 4 info
// SCOPE - 1 session, 2 process, 3 pdb-instance, 4 cdb-instance, 5 cdb-cluster, 6 pdb-persistent, 7 cdb-persistent
// TARGETUSER - 1 app-dev, 2 sec-admin, 3 net-admin, 4 cluster-admin, 5 pdb-admin, 6 cdb-admin, 7 server-admin, 8 storage-admin, 9 dataops-admin
ID::2000
TYPE::2
URGENCY::1
SCOPE::4
TARGETUSER::6
TEXT::Parameter %s specified is high
CAUSE::Memory parameter specified for this instance is high
ACTION::Check alert log or trace file for more information relating to instance
configuration, reconfigure the parameter and restart the instance
STARTVERSION::21.1
Example Attention Message Definition – CDB Warning
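The definition above can be read mechanically. The following is a minimal Python sketch, not Oracle's parser: it assumes each field is a KEY::value line, that a line without "::" continues the previous value (as the wrapped ACTION text suggests), and that "//" starts a comment. The function name parse_definition and the CODES table layout are illustrative; the code values themselves are copied from the comment lines of the definition.

```python
# Code tables copied from the comment lines of the definition shown above.
CODES = {
    "TYPE": {1: "error", 2: "warning", 3: "notification"},
    "URGENCY": {1: "immediate", 2: "soon", 3: "deferable", 4: "info"},
    "SCOPE": {1: "session", 2: "process", 3: "pdb-instance", 4: "cdb-instance",
              5: "cdb-cluster", 6: "pdb-persistent", 7: "cdb-persistent"},
    "TARGETUSER": {1: "app-dev", 2: "sec-admin", 3: "net-admin",
                   4: "cluster-admin", 5: "pdb-admin", 6: "cdb-admin",
                   7: "server-admin", 8: "storage-admin", 9: "dataops-admin"},
}

SAMPLE = """\
// TYPE - 1 error, 2 warning, 3 notification
ID::2000
TYPE::2
URGENCY::1
SCOPE::4
TARGETUSER::6
TEXT::Parameter %s specified is high
CAUSE::Memory parameter specified for this instance is high
ACTION::Check alert log or trace file for more information relating to instance
configuration, reconfigure the parameter and restart the instance
STARTVERSION::21.1
"""


def parse_definition(text):
    """Parse KEY::value lines (assumed format) and decode the numeric codes."""
    fields = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("//"):
            continue  # skip blanks and comment lines
        if "::" in line:
            key, value = line.split("::", 1)
            fields[key] = value
            last_key = key
        elif last_key is not None:
            # Assumption: a line without '::' continues the previous value.
            fields[last_key] += " " + line
    for key, table in CODES.items():
        if key in fields:
            fields[key] = table[int(fields[key])]
    return fields


definition = parse_definition(SAMPLE)
print(definition["TYPE"], definition["URGENCY"], definition["SCOPE"], definition["TARGETUSER"])
```

Decoding the numbers up front makes the definition directly comparable with the CLASS line of the curated messages that follow.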
[
IMMEDIATE Parameter SGA_MAX_SIZE specified is high
CAUSE: Memory parameter specified for this instance is high
ACTION: Check alert log or trace file for more information relating to instance
configuration, reconfigure the parameter and restart the instance
CLASS: CDB Instance / CDB ADMINISTRATOR / WARNING / AL-2000
TIME: 2020-05-01T11:09:02.223-07:00
ADDITIONAL INFO: -
WARNING: SGA_MAX_SIZE (6144 MB) is too high - it should be less than 5634 MB (80
percent of physical memory).
]
Example Attention Log Curated Message – CDB Warning
[
IMMEDIATE Shutting down ORACLE instance (abort) (OS id: 8394)
CAUSE: A command to shutdown the instance was executed
ACTION: Check alert log for progress and completion of command
CLASS: CDB Instance / CDB ADMINISTRATOR / ERROR / AL-1002
TIME: 2020-05-08T17:09:33.773-07:00
ADDITIONAL INFO: -
Shutdown is initiated by sqlplus@den02tlh (TNS V1-V3).
]
Example Attention Log Curated Message – CDB Error
[
SOON Heavy swapping observed on system
CAUSE: Memory usage by one more application is leading to heavy swapping
ACTION: Check alert log for more information, use tools to analyze memory
usage and take action
CLASS: CDB Instance / SERVER ADMINISTRATOR / WARNING / AL-2100
TIME: 2020-05-01T11:09:02.223-07:00
ADDITIONAL INFO: -
WARNING: Heavy swapping observed on system in last 15 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
]
Example Attention Log Curated Message – Server Warning
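Because the curated messages share a fixed layout, a monitoring script could split them into fields. The sketch below is an illustration under assumptions, not an AHF or Oracle API: it assumes the layout is exactly as in the three examples (urgency as the first word, labelled CAUSE/ACTION/CLASS/TIME/ADDITIONAL INFO lines, wrapped text continuing the previous field, and CLASS holding scope / target user / type / message id). The name parse_curated is hypothetical.

```python
import re
from datetime import datetime

SAMPLE = """\
[
SOON Heavy swapping observed on system
CAUSE: Memory usage by one more application is leading to heavy swapping
ACTION: Check alert log for more information, use tools to analyze memory
usage and take action
CLASS: CDB Instance / SERVER ADMINISTRATOR / WARNING / AL-2100
TIME: 2020-05-01T11:09:02.223-07:00
ADDITIONAL INFO: -
WARNING: Heavy swapping observed on system in last 15 mins.
Heavy swapping can lead to timeouts, poor performance, and instance eviction.
]
"""

LABEL = re.compile(r"^(CAUSE|ACTION|CLASS|TIME|ADDITIONAL INFO):\s*(.*)$")


def parse_curated(block):
    """Split one bracketed curated message into a dict (layout is assumed)."""
    lines = [ln.strip() for ln in block.strip().splitlines()]
    lines = [ln for ln in lines if ln and ln not in ("[", "]")]
    urgency, _, text = lines[0].partition(" ")
    entry = {"urgency": urgency, "text": text}
    key = None
    for ln in lines[1:]:
        match = LABEL.match(ln)
        if match:
            key = match.group(1)
            entry[key] = match.group(2)
        elif key is not None:
            # Unlabelled lines are treated as a continuation of the last field.
            entry[key] = (entry[key] + " " + ln).strip()
    scope, target, mtype, msg_id = [p.strip() for p in entry["CLASS"].split("/")]
    entry.update(scope=scope, target_user=target, type=mtype, id=msg_id)
    entry["time"] = datetime.fromisoformat(entry["TIME"])
    return entry


entry = parse_curated(SAMPLE)
print(entry["urgency"], entry["target_user"], entry["id"])
```

With the fields separated, routing by urgency or target user becomes a dictionary lookup rather than a text search.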
Attention Log Use Cases – AHF + OCI Integration
Autonomous Health Framework
Trace File Analyzer
Attention Log
Repository
Management VCN
AHF Service
Cloud Ops
Object Store
Runbooks
AHF uses all of 23c Database Features
Real-Time Analytics, Blockchain, Documents, Graph Analysis, Spatial Processing, Text Search, IoT, Machine Learning
• Compliance management
• Around 4000+ best
practices
• Covers Exadata and security
• Constant Cadence of
features
What is AHF
Compliance Manager, Data Collection, Root Cause Analyzer, Service Tooling, Auto Upgrade, Bug Matching, Data Sanitizing, Resource Allocation, Issue Detection, Service Console
Building compliance with best practices
Development methodology
1 Idea: Reports from development, testing, support etc
2 Expert review: Weekly meetings to review and test
3 MOS Note 757552.1: Published Exadata best practices
4 Default deployment: Bake best practices back in to default deployment
5 AHF compliance check: Generation of new checks
Ways to run compliance checks
Limit checks
-profile: one or more of 40+ different component focused check categories
Upgrade readiness
- Database
- GI
- ODA
- Exadata
Limit targets
-cells
-clusternodes
-ibswitches
-dbnames
Security assessment
- Default password for OS and database users
- Database security checks using DBSAT
Collection Manager
• First Failure Capture
• Telemetry capture,
streaming
• Diagnostic log collection
• OS and Database metrics
• Collection standardization
• Rudimentary aggregation
and analysis
What is AHF
Machine View
DomU: Oracle Stack; Alert logs, Health Data, Availability Data, Performance Data, Capacity Data, AHF Compliance Data
Control Plane: Object Store, AHF Service, Diagnostic Collection
1 AHF agents detect issues & create telemetry JSON
2 Uploads telemetry JSON to Object Store
3 AHF agent collects diagnostics then uploads to Object Store
4 AHF Service reads telemetry from Object Store, pushes metrics to T2 and then processes the diagnostic collection
SRDCs (Service Request Diagnostic Collection)
Oracle Grid Infrastructure & Databases
1 AHF detects a fault
2 Diagnostics are collected
3 Distributed diagnostics are consolidated and packaged
4 Notification of fault is sent
5 Diagnostic collection is uploaded to Oracle Storage Service (Object Store) for later analysis
• Database areas
• Errors / Corruption
• Performance
• Install / patching / upgrade
• RAC / Grid Infrastructure
• Import / Export
• RMAN
• Transparent Data Encryption
• Storage / partitioning
• Undo / auditing
• Listener / naming services
• Spatial / XDB
• Other Server Technology
• Enterprise Manager
• Data Guard
• GoldenGate
• Exalogic
Full list in documentation
Some problem areas covered in SRDCs
Around 100 problem types covered
tfactl diagcollect -srdc <srdc_type> [-sr <sr_number>]
Manual collection vs TFA SRDC for database performance
Manual method
1. Generate ADDM reviewing Document 1680075.1 (multiple steps)
2. Identify “good” and “problem” periods and gather AWR reviewing Document 1903158.1 (multiple steps)
3. Generate AWR compare report (awrddrpt.sql) using “good” and “problem” periods
4. Generate ASH report for “good” and “problem” periods reviewing Document 1903145.1 (multiple steps)
5. Collect OSWatcher data reviewing Document 301137.1 (multiple steps)
6. Collect Hang Analyze output at Level 4
7. Generate SQL Healthcheck for problem SQL id using Document 1366133.1 (multiple steps)
8. Run support provided sql scripts – Log File sync diagnostic output using Document 1064487.1 (multiple steps)
9. Check alert.log if there are any errors during the “problem” period
10. Find any trace files generated during the “problem” period
11. Collate and upload all the above files/outputs to SR
TFA SRDC
1. Run tfactl diagcollect -srdc dbperf [-sr <sr_number>]
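For scripted use, the single SRDC command can be assembled programmatically. This is a small Python sketch that only builds the argument list from the syntax shown on the slide (-srdc and the optional -sr); it does not run tfactl, the helper name srdc_command is hypothetical, and the SR number in the usage line is a made-up placeholder.

```python
import shlex


def srdc_command(srdc_type, sr_number=None):
    """Build the argv for 'tfactl diagcollect -srdc <type> [-sr <number>]'."""
    argv = ["tfactl", "diagcollect", "-srdc", srdc_type]
    if sr_number is not None:
        argv += ["-sr", str(sr_number)]
    return argv


# Database performance SRDC, without and with a (placeholder) SR number.
print(shlex.join(srdc_command("dbperf")))
print(shlex.join(srdc_command("dbperf", "3-0000000000")))
```

Keeping the command as a list avoids shell quoting problems if it is later passed to subprocess.run.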
Generates view of Cluster and Database diagnostic
metrics
• Always on - Enabled by default
• Provides Detailed OS Resource Metrics
• Assists Node eviction analysis
• Locally logs all process data
• User can define pinned processes
• Listens to CSS and GIPC GI events
• Categorizes processes by type
• Supports plug-in collectors (ex. traceroute, netstat, ping, etc.)
• New CSV output for ease of analysis
AHF OS Data Collector
Diagram labels: GIMR, ologgerd (master), osysmond (four instances, each with OS Data)
Automatic upgrade when AHF finds a new
version
New versions can be found automatically at:
• The local file system
• REST locations
• Object store locations
On-demand via ahfctl upgrade
The latest version can be pulled on-demand
from My Oracle Support
AHF will also prompt you to upgrade when it
detects it’s older than 180 days
Automatic AHF upgrade
• Log scanners for obvious
issues
• ML models to root cause
• Eliminate non-defect issues
• Recommend Patches
What is AHF
Discovers Potential Cluster & DB Problems
Actual internal data drives model development
Applies purpose-built ML for knowledge extraction
Expert Dev team scrubs data
Generates Bayesian Network-based diagnostic root-cause models
Uses BN-based run-time models to perform real-time prognostics
Database Health - Applied Machine Learning
AHF Dev Team
Inputs: Log, ASH, Metrics
ML Knowledge Extraction
BN Models
Expert Supervision
DB+Node Runtime Models
Feedback
Scrub Data
AHF: Machine Learning, Pattern Recognition, Bayesian Network Engines
CHA Operational Flow : Anomaly Detection -> Diagnostics -> Prognosis
For each data point …
AHF Anomaly Detection flow
Data Validation: Is data valid?
Operating State Estimation: Is behavior expected?
Fault Identification: Is there a problem?
Diagnostic Decision: What is causing the problem?
Prognosis: Is a failure likely?
Models Capture the Dynamic Behavior of all Normal Operation
Models Capture all Normal Operating Modes
Chart: IOPS, user commits (/sec), log file parallel write (usec) and log file sync (usec) over time (x-axis 10:00, 2:00, 6:00; y-axis 0 to 40000)
A model captures the normal load phases and their statistics over time, and thus the characteristics for all load intensities and profiles.
During monitoring, any data point similar to one of the vectors is NORMAL.
One could say that the model REMEMBERS the normal operational dynamics over time.
In-Memory Reference Matrix (Part of “Normality” Model)
IOPS                      …    2500    4900     800   …
User Commits              …   10000   21000    4400   …
Log File Parallel Write   …    2350    4100   22050   …
Log File Sync             …    5100    9025    4024   …
…
AHF Anomaly Detection flow
Estimator/predictor (ESEE): “based on my normality model, the value of IOPS should be in the vicinity of ~4900, but it is reported as 10500; this is causing a residual of ~5600 in magnitude”.
Fault detector: “such high magnitude of residuals should be tracked carefully! I’ll keep an eye on the incoming sequence of this signal IOPS and if it remains deviant I’ll generate a fault on it”.
In-Memory Reference Matrix (Part of “Normality” Model)
IOPS                      …    2500    4900     800   …
User Commits              …   10000   21000    4400   …
Log File Parallel Write   …    2350    4100   22050   …
Log File Sync             …    5100    9025    4024   …
…
Observed - Predicted = Residual (each part of a data point)
Metric                    Observed   Residual
IOPS                         10500       5600
User Commits                 20000      -1000
Log File Parallel Write       4050        -50
Log File Sync                10250        325
…
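The residual step can be illustrated with the numbers on the slide. The following is a minimal Python sketch under stated assumptions: the slide does not say how the ESEE estimator picks its prediction, so the nearest reference vector (Euclidean distance) stands in for it here, and the persistence rule for raising a fault (threshold and consecutive-sample count) is likewise hypothetical. Computed this way, the log file sync residual comes out as 10250 - 9025 = 1225 rather than the 325 printed on the slide.

```python
import math

METRICS = ["IOPS", "user commits", "log file parallel write", "log file sync"]

# The three visible columns of the in-memory reference matrix, one vector each.
REFERENCE = [
    [2500, 10000, 2350, 5100],
    [4900, 21000, 4100, 9025],
    [800, 4400, 22050, 4024],
]
OBSERVED = [10500, 20000, 4050, 10250]


def predict(observed, reference):
    """Assumption: the estimate is the reference vector nearest to the observation."""
    return min(reference, key=lambda vec: math.dist(vec, observed))


def residuals(observed, predicted):
    """Residual = observed - predicted, per metric."""
    return [o - p for o, p in zip(observed, predicted)]


def persistent_fault(residual_series, threshold, min_consecutive):
    """Flag a fault only if |residual| stays above threshold for a run of samples."""
    run = 0
    for value in residual_series:
        run = run + 1 if abs(value) > threshold else 0
        if run >= min_consecutive:
            return True
    return False


predicted = predict(OBSERVED, REFERENCE)
result = residuals(OBSERVED, predicted)
for name, value in zip(METRICS, result):
    print(f"{name}: {value}")
```

The persistence check mirrors the fault detector's remark about watching whether the signal remains deviant before generating a fault, so a single spike does not raise one.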
Inline and Immediate Fault Detection and Diagnostic Inference
AHF Anomaly Detection flow
Machine Learning, Pattern Recognition, & BN Engines
Input: Data Point at Time t, followed by Fault Detection and Classification
Metric                     Value at 15:16:00   Classification
CPU                        0.90                OK
ASM IOPS                   4100                OK
Network % util             88%                 HIGH (1)
Network_Packets Dropped    105                 HIGH (2)
Log file sync              2 ms                OK
Log file parallel write    600 us              OK
GC CR request              504 ms              HIGH (3)
GC current request         513 ms              HIGH (3)
GC current block 2-way     2 ms                HIGH (4)
GC current block busy      5.9 ms              HIGH (4)
Enq: CF - contention       0                   OK
…
Diagnostic Inference (Diagnostic Inference Engine) at 15:16:00
Symptoms
1. Network Bandwidth Utilization
2. Network Packet Loss
3. Global Cache Requests Incomplete
4. Global Cache Message Latency
Root Cause (Target of Corrective Action)
Network Bandwidth Utilization
Cross Node and Cross Instance Diagnostic Inference
AHF Anomaly Detection flow
Node 1, Node 2 and Node 3, each at 15:16:00: Diagnostic Inference Engine reports Root Cause (Target of Corrective Action): Network Bandwidth Utilization
Cross Target Diagnostic Inference
Corrective Action Target
Identify Signatures
• Incidents
• Bugs
Detect anomalies
• Logs
• OS metrics
Predict
• Resource usage
• Maintenance window
• Performance issues
• Workload Stability
Some AIOps Use Cases
Anomaly Detection – High Level
Log Collection: File Type 1, File Type 2, … File Type n
Known normal log entry (discard)
Probable anomalous line (collect)
Log File Anomaly Timeline: Probable Anomalies
Trace File Analyzer – High Level Anomaly Detection Flow
Training (steps 1 to 6): Log Cleansing, Entry Feature Creation, Entry Clustering, Model Generation, Knowledge Base Creation (with Expert Input), Knowledge Base Indexing
Real-time (steps 7 to 9): Real-time Log File Processing, Timestamp Correlation & Ranking
Feedback: Batch Feedback
Drain Algorithm
• Drain is an online log template miner that can extract templates (clusters) from a stream of log
messages in a timely manner.
• It employs a parse tree with fixed depth to guide the log group search process, which effectively
avoids constructing a very deep and unbalanced tree.
• Drain continuously learns on-the-fly and extracts log templates from raw log entries.
• Drain Research Paper :
• Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed
Depth Tree, Proceedings of the 24th International Conference on Web Services (ICWS), 2017.
• Link : http://jiemingzhu.github.io/pub/pjhe_icws2017.pdf
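To make the idea of online template mining concrete, here is a deliberately simplified Python sketch in the spirit of Drain. It is not the paper's implementation and not what AHF ships: the fixed-depth parse tree is reduced to a two-level lookup (token count, then first token), similarity is the fraction of positions with identical tokens, and the class name MiniDrain and the sample log lines are invented for illustration.

```python
class MiniDrain:
    """Toy online log template miner (simplified, assumption-laden sketch)."""

    def __init__(self, sim_th=0.5):
        self.sim_th = sim_th
        # (token count, first token) -> list of templates (token lists)
        self.groups = {}

    @staticmethod
    def _similarity(template, tokens):
        same = sum(1 for t, w in zip(template, tokens) if t == w)
        return same / len(tokens)

    def add(self, message):
        """Route a message to a matching template or start a new one."""
        tokens = message.split()
        if not tokens:
            return ""
        # Tokens containing digits are treated as variables when routing.
        head = "<*>" if any(c.isdigit() for c in tokens[0]) else tokens[0]
        templates = self.groups.setdefault((len(tokens), head), [])
        best = max(templates, key=lambda t: self._similarity(t, tokens), default=None)
        if best is not None and self._similarity(best, tokens) >= self.sim_th:
            # Generalise the template: differing positions become wildcards.
            for i, (t, w) in enumerate(zip(best, tokens)):
                if t != w:
                    best[i] = "<*>"
            return " ".join(best)
        templates.append(list(tokens))
        return " ".join(tokens)

    def templates(self):
        return [" ".join(t) for group in self.groups.values() for t in group]


miner = MiniDrain(sim_th=0.5)
for line in [
    "Connection from 10.0.0.1 closed",
    "Connection from 10.0.0.2 closed",
    "Disk /dev/sda1 is full",
]:
    miner.add(line)
print(miner.templates())
```

Each message is processed once as it arrives, which is the property that makes this family of algorithms suitable for streaming logs.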
Drain Algorithm – Parameters for Tuning
• Drain parameters for tuning to the needs of each log file type.
Parameter                               Description
[DRAIN]/sim_th                          similarity threshold
[DRAIN]/depth                           max depth levels of log clusters
[DRAIN]/max_children                    max number of children of an internal node
[DRAIN]/max_clusters                    max number of tracked clusters
[DRAIN]/extra_delimiters                delimiters to apply when splitting log message into words
[MASKING]/masking                       parameters masking
[SNAPSHOT]/snapshot_interval_minutes    time interval for new snapshots
[SNAPSHOT]/compress_state               whether to compress the state before saving it
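The [SECTION]/key notation in the table reads naturally as an INI-style configuration file. The sketch below assumes that layout and loads it with Python's standard configparser; every value shown is an illustrative placeholder rather than a recommended or default setting, and treating the list-valued entries as JSON is also an assumption.

```python
import configparser
import json

# Assumed INI layout; all values are illustrative placeholders.
EXAMPLE_INI = """
[DRAIN]
sim_th = 0.4
depth = 4
max_children = 100
max_clusters = 1024
extra_delimiters = ["_"]

[MASKING]
masking = []

[SNAPSHOT]
snapshot_interval_minutes = 10
compress_state = True
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

settings = {
    "sim_th": config.getfloat("DRAIN", "sim_th"),
    "depth": config.getint("DRAIN", "depth"),
    "max_children": config.getint("DRAIN", "max_children"),
    "max_clusters": config.getint("DRAIN", "max_clusters"),
    # List-valued entries are assumed to be written as JSON.
    "extra_delimiters": json.loads(config.get("DRAIN", "extra_delimiters")),
    "masking": json.loads(config.get("MASKING", "masking")),
    "snapshot_interval_minutes": config.getint("SNAPSHOT", "snapshot_interval_minutes"),
    "compress_state": config.getboolean("SNAPSHOT", "compress_state"),
}
print(settings)
```

A lower sim_th merges more lines into the same template, while a higher one splits them, so it is usually the first parameter to adjust per log file type.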
Our Improvements over Drain
• Multi-level Drain signatures
• Association of source code with Drain signatures for more precise feature capture
• Interface to tune auto-marking of signatures and view the results of parameter changes in real time
CPU Usage and forecast
Seasonality determination to window identification flow
START_TIME CNT
2021-04-11 15:00:00 290
2021-04-11 16:00:00 31120
2021-04-11 17:00:00 21530
2021-04-11 18:00:00 26240
2021-04-11 19:00:00 40520
2021-04-11 20:00:00 54270
2021-04-11 21:00:00 51460
2021-04-11 22:00:00 44310
2021-04-11 23:00:00 25690
START_TIME
2021-04-11 15:00:00 -0.226098
2021-04-11 16:00:00 -0.069821
2021-04-11 17:00:00 -0.350088
2021-04-11 18:00:00 -0.187483
2021-04-11 19:00:00 -0.513240
2021-04-11 20:00:00 0.019737
2021-04-11 21:00:00 0.059213
2021-04-11 22:00:00 -0.011312
2021-04-11 23:00:00 -0.179156
START_TIME
2021-04-11 15:00:00 5.669881
2021-04-11 16:00:00 10.345606
2021-04-11 17:00:00 9.977203
2021-04-11 18:00:00 10.175040
2021-04-11 19:00:00 10.609551
2021-04-11 20:00:00 10.901727
2021-04-11 21:00:00 10.848560
2021-04-11 22:00:00 10.698966
2021-04-11 23:00:00 10.153857
Current Date : 2021-05-12 15:00:00
Current Position in Seasonality : -0.22609829742533585
Best Maintenance Period in next Cycle : 2021-05-12 19:00:00
Worst Maintenance Period in next Cycle : 2021-05-13 08:00:00
1 Original observation data
2 Convolution filter & average
3 Calculate seasonality
4 Use seasonality to predict best maintenance window
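The four steps can be sketched in plain Python. This is a hedged illustration, not the method behind the slide: it assumes a simple additive decomposition, a flat (moving-average) convolution kernel spanning one cycle, and a known cycle length; the synthetic counts, the period of 6 and all function names are invented for the example.

```python
def moving_average(values, window):
    """Centered moving average (flat convolution kernel); edges use fewer points."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def seasonality(values, period):
    """Average the detrended value at each position within the cycle."""
    trend = moving_average(values, period)
    detrended = [v - t for v, t in zip(values, trend)]
    seasonal = []
    for position in range(period):
        samples = detrended[position::period]
        seasonal.append(sum(samples) / len(samples))
    return seasonal


def best_and_worst(seasonal):
    """Lowest seasonal load = best maintenance slot; highest = worst."""
    positions = range(len(seasonal))
    best = min(positions, key=seasonal.__getitem__)
    worst = max(positions, key=seasonal.__getitem__)
    return best, worst


# Step 1: synthetic observations, four identical cycles of six periods each.
counts = [10, 50, 90, 60, 20, 5] * 4
# Steps 2 and 3: smooth, detrend and average per position in the cycle.
seasonal = seasonality(counts, period=6)
# Step 4: pick the quietest and busiest positions of the next cycle.
best, worst = best_and_worst(seasonal)
print("best position:", best, "worst position:", worst)
```

Mapping a position back to a timestamp is then a matter of adding that many periods to the start of the next cycle.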
Identifying time periods with high z-score events across
multiple metrics
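One way to find such periods is to score each metric separately and keep the time slots where several metrics are unusual at once. The Python sketch below assumes population z-scores, a threshold of 2.0 and a minimum of two agreeing metrics; these values, the metric names and the sample data are all hypothetical.

```python
import statistics


def zscores(values):
    """Population z-score of every sample; all zeros if the series is constant."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0] * len(values)
    return [(v - mean) / stdev for v in values]


def hot_periods(metrics, threshold=2.0, min_metrics=2):
    """Indexes where at least min_metrics metrics have |z| above threshold."""
    scored = {name: zscores(vals) for name, vals in metrics.items()}
    length = len(next(iter(scored.values())))
    hits = []
    for i in range(length):
        count = sum(1 for z in scored.values() if abs(z[i]) > threshold)
        if count >= min_metrics:
            hits.append(i)
    return hits


metrics = {
    "cpu": [10] * 9 + [100],   # spike in the last period
    "io": [5] * 9 + [50],      # spike in the same period
    "mem": [7, 8] * 5,         # ordinary fluctuation
}
print(hot_periods(metrics))
```

Requiring agreement across metrics filters out periods where only one signal is noisy.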
• Front-end for analysis, cause
and solution identification
• Unified Timeline
• Anomaly Detection
• Graphing for Time Series Data
• AHF Insights and Fleet
Insights
What is AHF
Previously, results from different AHF components were not available in a single dashboard, making them challenging to combine and correlate.
To mitigate this, AHF Insights provides a web-based graphical user interface, which does not require a web server to host the web pages, for all diagnostic data collectors and analyzers that are part of the AHF kit.
AHF performs a diagnostic collection for a given period to analyze the performance of database systems from:
• Configuration
• Environment Topology
• Metrics
• Logs
The diagnostic data collected from the system passes through AHF Insights, which produces an offline report.
AHF Insights Overview
AHF Insights provides a bird's eye view of the entire system with the ability to
further drill down for root cause analysis.
Information Captured
System Topology
• Resource Information
• Resource Configuration
• Summarized viewing of resource data
Insights
• Major events happening on the system
• Operating system information and its analysis
• Best practice compliance issues
• Software Recommendation
• Software / Hardware alerts for Database Server
• System changes over last 14 days
• RPM details and RPM inconsistencies among hosts
• Database Parameters and differences among databases
• Kernel Parameters and differences among hosts
• Latest AHF with AHF Insights code
• Feature available from AHF 22.3 for Exadata Systems
• Required AHF data sources (TFA, Exachk, CHM) should be
enabled and running
• 23.4 and higher for RAC Linux and ODA Systems
Prerequisites
How can I generate it ?
• Command : ahf analysis create --type insights --last 2h
• Takes around : 3 - 4 minutes (depending on the system)
• Size : 46MB zip (depending on the system)
System Topology
• Cluster
• Databases
• Database Servers
• Storage Servers
• Fabric Switches
Insights
• Timeline
• Operating System Issues
• Best Practice issues
• System Change
• Recommended Software
• Database Server
• RPM List
• Database Parameters
• Kernel Parameters
AHF Insights Report
Cluster Summary
1. Showcase relevant system cluster information.
2. Get DB Home details by clicking on the dropdown button located inside the DB Home section.
3. Copy Cluster summary into user clipboard.
Cluster
Compliance Manager, Data Collection, Root Cause Analyzer, Service Tooling, Auto Upgrade, Bug Matching, Data Sanitizing, Resource Allocation, Issue Detection, Service Console
Oracle Autonomous Health Framework (AHF) 23c
How has Oracle and Customers benefited from this AI Ops implementation?
✓ AI Ops has become an essential Cloud technology
✓ Understand the problem space
✓ Understand the environmental, technical and legal constraints
✓ Use appropriate ML algorithms to the task
✓ Spend quality time with your training sets
✓ Incorporate explainability into the results
✓ Provide a feedback mechanism for model evolution
✓ Look for opportunities to incorporate actuators
✓ Honor the culture and risk tolerance of your target audience
Oracle Cloud AI Ops Takeaways
Thank you
Any Questions?
Sandesh Rao
VP AIOps Autonomous Database
@sandeshr
https://www.linkedin.com/in/raosandesh/
https://www.slideshare.net/SandeshRao4

More Related Content

How to use 23c AHF AIOPS to protect Oracle Databases 23c

  • 1. Oracle Database 23c and AHF Insights to do better AIOps Aug 2023 Sandesh Rao VP AIOps , Autonomous Database @sandeshr https://www.linkedin.com/in/raosandesh/ https://www.slideshare.net/SandeshRao4
  • 2. Observe Engage Automate AIOps Realtime data Historical data Notification Collaboration Compliance Incident detection Diagnostic collection Issue clarification Cause & solution identification Runbooks Machine Learning Health Availability Performance Capacity Logs Scripts Health Checks Anomaly Detection Email Pager Jira Bug SR Slack ACR Sanitization Copyright © 2023, Oracle and/or its affiliates 2
  • 3. AHF AIOps Platform Detect Collect Create Notify Clarify Rediscover Analyze Mitigate & Fix Telemetry Mini & SRDC Bug & Jira Page & Email IC Service Bug De-duplication Issue Clustering Expert Systems Timeline Dev Containers Source Evaluators Workaround & Patches Copyright © 2023, Oracle and/or its affiliates 3
  • 4. AIOps and Applied Machine Learning Copyright © 2023, Oracle and/or its affiliates 4 How does Machine Learning play into AIOps? Time-Series Metrics Log/Trace Events Precursor Metric(s) Precursor Event(s) Root-Cause Metric Root-Cause Event Prevent Recover Root-Cause Action Root-Cause Action Problem Predictive Reactive
  • 5. AHF Compliance Manager • Compliance management • Around 4000+ best practices • Covers Exadata and security • Constant Cadence of features AHF Root Cause Analyzer • Log scanners for obvious issues • ML models to root cause • Eliminate non-defect issues • Recommend Patches AHF AutoUpgrade • Stack Deployment • RPM’s , automated packaged installers • Standard home locations What is AHF AHF Data Collectors • First Failure Capture • Telemetry capture, streaming • Diagnostic log collection • OS and Database metrics • Collection standardization • Rudimentary aggregation and analysis AHF ABS • Bug rediscovery • Autoclose known issues • ML based models • Cloud scale deployment AHF Service Console • Front-end for analysis, cause and solution identification • Unified Timeline • Anomaly Detection • Graphing for Time Series Data • AHF Insights and Fleet Insights Copyright © 2023, Oracle and/or its affiliates 5
  • 6. 6 Autonomous Health Cloud Platform Autonomous Health Cloud Platform Machines Smart Collectors SRs Expert Input Feedback & Improvement Bugs 1 SRs Logs Model Generation Model Knowledge Extraction Applied Machine Learning Cloud Ops Object Store Admin UI in Control Plane Oracle Support Bug DB SE UI in Support Tenant (CNS) Cleansing, metadata creation & clustering 5 Model generation with expert scrubbing 6 Deployed as part of cloud image, running from the start 1 Proactive regular health checking, real-time fault detection, automatic incident analysis, diagnostic collection & masking of sensitive data 2 Use real-time health dashboards for anomaly detection, root cause analysis & push of proactive, preventative & corrective actions. Auto bug search & auto bug & SR creation. 3 Auto SR analysis, diagnosis assistance via automatic anomaly detection, collaboration and one click bug creation 4 Message Broker Copyright © 2023, Oracle and/or its affiliates
  • 7. EXAchk , ORAchk , DBSat , Autoupgrade , CVU , Collection Manager (Apex App) The verification and compliance tools which support all the components across the stack What is AHF Copyright © 2023, Oracle and/or its affiliates 7 AHF Compliance Manager AHF Data Collectors AHF Root Cause Analyzer AHF Service Console TFA , CHM , Data Plane Telemetry , OSWatcher The different OS and Data Collectors CHA , DT , Parsers Automation which responds to the customer issues or makes it easier to slice and dice data AHF Insights and Fleet Insights The frontend which is visible to Customers and Support
  • 8. Oracle’s AI Ops Cloud Platform Implementation What does our platform look like implemented? Copyright © 2023, Oracle and/or its affiliates 8 Machine View
  • 9. What are some of the Operations areas that use AML? AIOps Using Applied Machine Learning Copyright © 2023, Oracle and/or its affiliates 9 Proactive Prevention OS Data Real-Time Performance Prognostics Engine Alert & Preventive Action DB Data Rapid Recovery Entry Clustering Knowledge Base Indexing Model Generation Log Cleansing 1 3 4 5 6 Expert Input Knowledge Base Creation Feedback Training Real-time Log File Processing Timestamp Correlation & Ranking 8 9 7 Entry Feature Creation 2 Logs Traces Alert & Preventive Action Logs Traces In Lab
  • 10. Pros: • Destination for Important DB Events • Single file to monitor by DBAs • Many tools available to parse • Supported by TFA for generating alarms Cons: • Includes both critical and non-critical events • Incudes messages not intended for DBAs • Inconsistently reports severity level • Can report unintuitive cause and action • New undocumented messages in every release Oracle Database Alert Log Copyright © 2023, Oracle and/or its affiliates 10
  • 11. Copyright © 2023, Oracle and/or its affiliates 11 The Curated Solution - New 21c Attention Log Contains only important events requiring customer attention Includes documented set of messages and attributes All Messages include these attributes: • Type • Urgency • Scope • Target User • Cause and Action • Additional debug information
  • 12. Oracle Database Attention Log Message Flow Copyright © 2023, Oracle and/or its affiliates 12 DB Component Diagnostic Framework alert/log.xml log/attention.log attention.amb (Message Definitions) Attention Log Message Attention Curated Message
  • 13. Attention Log Curation - Message Attributes. TYPE: 1. Error 2. Warning 3. Notification. URGENCY: 1. Immediate 2. Soon 3. Deferable 4. Info. SCOPE: 1. Session 2. Process 3. PDB-Instance 4. CDB-Instance 5. CDB-Cluster 6. PDB-Persistent 7. CDB-Persistent. TARGET USER: 1. App-Dev 2. Sec-Admin 3. Net-Admin 4. Cluster-Admin 5. PDB-Admin 6. CDB-Admin 7. Server-Admin 8. Storage-Admin 9. DataOps-Admin
  • 14. Example Attention Message Definition – CDB Warning
      // TYPE - 1 error, 2 warning, 3 notification
      // URGENCY - 1 immediate, 2 soon, 3 deferable, 4 info
      // SCOPE - 1 session, 2 process, 3 pdb-instance, 4 cdb-instance, 5 cdb-cluster, 6 pdb-persistent, 7 cdb-persistent
      // TARGETUSER - 1 app-dev, 2 sec-admin, 3 net-admin, 4 cluster-admin, 5 pdb-admin, 6 cdb-admin, 7 server-admin, 8 storage-admin, 9 dataops-admin
      ID::2000
      TYPE::2
      URGENCY::1
      SCOPE::4
      TARGETUSER::6
      TEXT::Parameter %s specified is high
      CAUSE::Memory parameter specified for this instance is high
      ACTION::Check alert log or trace file for more information relating to instance configuration, reconfigure the parameter and restart the instance
      STARTVERSION::21.1
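The definition format above can be decoded mechanically. The following is a minimal sketch, not Oracle's parser: it assumes every field is written as `KEY::value` with an upper-case key, and the names `parse_definition` and `decode` are hypothetical helpers introduced only for illustration. The code tables are copied from the comment header of the slide's definition.

```python
import re

# Code tables copied from the comment header of the slide's definition.
TYPE = {1: "error", 2: "warning", 3: "notification"}
URGENCY = {1: "immediate", 2: "soon", 3: "deferable", 4: "info"}
SCOPE = {1: "session", 2: "process", 3: "pdb-instance", 4: "cdb-instance",
         5: "cdb-cluster", 6: "pdb-persistent", 7: "cdb-persistent"}
TARGETUSER = {1: "app-dev", 2: "sec-admin", 3: "net-admin", 4: "cluster-admin",
              5: "pdb-admin", 6: "cdb-admin", 7: "server-admin",
              8: "storage-admin", 9: "dataops-admin"}

# Assumption: a field starts at an upper-case key followed by "::".
FIELD = re.compile(r"\b([A-Z]+)::")

def parse_definition(text):
    """Split 'KEY::value KEY::value ...' into a dict of raw strings."""
    parts = FIELD.split(text)  # ['', 'ID', '2000 ', 'TYPE', '2 ', ...]
    return {key: value.strip() for key, value in zip(parts[1::2], parts[2::2])}

def decode(fields):
    """Translate the numeric attribute codes into their documented names."""
    return {
        "id": int(fields["ID"]),
        "type": TYPE[int(fields["TYPE"])],
        "urgency": URGENCY[int(fields["URGENCY"])],
        "scope": SCOPE[int(fields["SCOPE"])],
        "target_user": TARGETUSER[int(fields["TARGETUSER"])],
        "text": fields["TEXT"],
    }

# The example definition from the slide, as one flattened line.
SAMPLE = (
    "ID::2000 TYPE::2 URGENCY::1 SCOPE::4 TARGETUSER::6 "
    "TEXT::Parameter %s specified is high "
    "CAUSE::Memory parameter specified for this instance is high "
    "ACTION::Check alert log or trace file for more information "
    "STARTVERSION::21.1"
)

fields = parse_definition(SAMPLE)
decoded = decode(fields)
print(decoded)
```

Decoding the sample yields a warning with immediate urgency, scoped to the CDB instance and aimed at the CDB administrator, which agrees with the CLASS line of the curated AL-2000 message shown on the next slide.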
  • 15. Copyright © 2023, Oracle and/or its affiliates 15 [ IMMEDIATE Parameter SGA_MAX_SIZE specified is high CAUSE: Memory parameter specified for this instance is high ACTION: Check alert log or trace file for more information relating to instance configuration, reconfigure the parameter and restart the instance CLASS: CDB Instance / CDB ADMINISTRATOR / WARNING / AL-2000 TIME: 2020-05-01T11:09:02.223-07:00 ADDITIONAL INFO: - WARNING: SGA_MAX_SIZE (6144 MB) is too high - it should be less than 5634 MB (80 percent of physical memory). ] Example Attention Log Curated Message – CDB Warning
  • 16. Copyright © 2023, Oracle and/or its affiliates 17 [ IMMEDIATE Shutting down ORACLE instance (abort) (OS id: 8394) CAUSE: A command to shutdown the instance was executed ACTION: Check alert log for progress and completion of command CLASS: CDB Instance / CDB ADMINISTRATOR / ERROR / AL-1002 TIME: 2020-05-08T17:09:33.773-07:00 ADDITIONAL INFO: - Shutdown is initiated by sqlplus@den02tlh (TNS V1-V3). ] Example Attention Log Curated Message – CDB Error
  • 17. Copyright © 2023, Oracle and/or its affiliates 19 [ SOON Heavy swapping observed on system CAUSE: Memory usage by one more application is leading to heavy swapping ACTION: Check alert log for more information, use tools to analyze memory usage and take action CLASS: CDB Instance / SERVER ADMINISTRATOR / WARNING / AL-2100 TIME: 2020-05-01T11:09:02.223-07:00 ADDITIONAL INFO: - WARNING: Heavy swapping observed on system in last 15 mins. Heavy swapping can lead to timeouts, poor performance, and instance eviction. ] Example Attention Log Curated Message – Server Warning
  • 18. Attention Log Use Cases – AHF + OCI Integration Copyright © 2023, Oracle and/or its affiliates 21 Autonomous Health Framework Trace File Analyzer … … Attention Log Repository Management VCN AHF Service Cloud Ops Object Store Runbooks
  • 19. Real-Time Analytics Blockchain Documents Graph Analysis Spatial Processing Text Search IoT AHF uses all of 23c Database Features Copyright © 2023, Oracle and/or its affiliates 23 Machine Learning
  • 20. What is AHF: Compliance Manager • Compliance management • 4000+ best practices • Covers Exadata and security • Constant cadence of features. AHF components shown: Compliance Manager, Data Collection, Root Cause Analyzer, Service Tooling, Auto Upgrade, Bug Matching, Data Sanitizing, Resource Allocation, Issue Detection, Service Console
  • 21. Building compliance with best practices: development methodology. 1 Idea: reports from development, testing, support etc. 2 Expert review: weekly meetings to review and test. 3 MOS Note 757552.1: published Exadata best practices. 4 Default deployment: bake best practices back into the default deployment. 5 AHF compliance check: generation of new checks.
  • 22. Ways to run compliance checks. Limit checks: -profile, one or more of 40+ different component-focused check categories. Upgrade readiness: -Database -GI -ODA -Exadata -ODA. Limit targets: -cells -clusternodes -ibswitches -dbnames. Security assessment: default password for OS and database users; database security checks using DBSAT.
  • 27. What is AHF: Data Collection • First Failure Capture • Telemetry capture, streaming • Diagnostic log collection • OS and Database metrics • Collection standardization • Rudimentary aggregation and analysis
  • 28. Machine View. Components shown: DomU, Oracle Stack, Control Plane, Object Store, AHF Service. Data gathered: alert logs, health data, availability data, performance data, capacity data, AHF compliance data. 1 AHF agents detect issues and create telemetry JSON. 2 The telemetry JSON is uploaded to the Object Store. 3 The AHF agent collects diagnostics, then uploads the diagnostic collection to the Object Store. 4 The AHF Service reads telemetry from the Object Store, pushes metrics to T2, and then processes the diagnostic collection.
  • 29. SRDCs (Service Request Diagnostic Collection) for Oracle Grid Infrastructure & Databases. 1 AHF detects a fault. 2 Diagnostics are collected. 3 Distributed diagnostics are consolidated and packaged. 4 Notification of the fault is sent. 5 The diagnostic collection is uploaded to Oracle Storage Service (Object Store) for later analysis.
  • 30. Some problem areas covered in SRDCs. Around 100 problem types are covered; the full list is in the documentation. Command: tfactl diagcollect -srdc <srdc_type> [-sr <sr_number>] Database areas: • Errors / Corruption • Performance • Install / patching / upgrade • RAC / Grid Infrastructure • Import / Export • RMAN • Transparent Data Encryption • Storage / partitioning • Undo / auditing • Listener / naming services • Spatial / XDB. Other Server Technology: • Enterprise Manager • Data Guard • GoldenGate • Exalogic
  • 31. Manual collection vs TFA SRDC for database performance. Manual method: 1. Generate ADDM reviewing Document 1680075.1 (multiple steps) 2. Identify “good” and “problem” periods and gather AWR reviewing Document 1903158.1 (multiple steps) 3. Generate AWR compare report (awrddrpt.sql) using “good” and “problem” periods 4. Generate ASH report for “good” and “problem” periods reviewing Document 1903145.1 (multiple steps) 5. Collect OSWatcher data reviewing Document 301137.1 (multiple steps) 6. Collect Hang Analyze output at Level 4 7. Generate SQL Healthcheck for problem SQL id using Document 1366133.1 (multiple steps) 8. Run support-provided SQL scripts: Log File Sync diagnostic output using Document 1064487.1 (multiple steps) 9. Check alert.log for any errors during the “problem” period 10. Find any trace files generated during the “problem” period 11. Collate and upload all the above files/outputs to the SR. TFA SRDC: 1. Run tfactl diagcollect -srdc dbperf [-sr <sr_number>]
  • 32. AHF OS Data Collector: generates a view of cluster and database diagnostic metrics • Always on - enabled by default • Provides detailed OS resource metrics • Assists node eviction analysis • Locally logs all process data • User can define pinned processes • Listens to CSS and GIPC GI events • Categorizes processes by type • Supports plug-in collectors (e.g. traceroute, netstat, ping) • New CSV output for ease of analysis. Diagram: GIMR, ologgerd (master) and four osysmond daemons, each collecting OS data.
  • 33. Automatic AHF upgrade. Automatic upgrade when AHF finds a new version; new versions can be found automatically at: • The local file system • REST locations • Object store locations. On-demand via ahfctl upgrade: the latest version can be pulled on-demand from My Oracle Support. AHF will also prompt you to upgrade when it detects it is older than 180 days.
  • 34. What is AHF: Root Cause Analyzer • Log scanners for obvious issues • ML models to root cause • Eliminate non-defect issues • Recommend patches
  • 35. Database Health - Applied Machine Learning. Discovers potential cluster and DB problems. Actual internal data drives model development. Applies purpose-built applied ML for knowledge extraction. Expert Dev team scrubs data. Generates Bayesian Network (BN)-based diagnostic root-cause models. Uses BN-based run-time models to perform real-time prognostics. Diagram elements: Log, ASH, Metrics; Scrub Data; ML Knowledge Extraction; BN Models; Expert Supervision (AHF Dev Team); DB+Node Runtime Models; Feedback.
  • 36. AHF Anomaly Detection flow. CHA Operational Flow: Anomaly Detection -> Diagnostics -> Prognosis, using Machine Learning, Pattern Recognition and Bayesian Network Engines. For each data point: Data Validation (Is data valid?) -> Operating State Estimation (Is behavior expected?) -> Fault Identification (Is there a problem?) -> Diagnostic Decision (What is causing the problem?) -> Prognosis (Is a failure likely?)
  • 37. Models Capture the Dynamic Behavior of all Normal Operation: Models Capture all Normal Operating Modes. [Chart: IOPS, user commits (/sec), log file parallel write (usec) and log file sync (usec) plotted over time; time axis ticks 10:00, 2:00, 6:00] A model captures the normal load phases and their statistics over time, and thus the characteristics for all load intensities and profiles. During monitoring, any data point similar to one of the vectors is NORMAL. One could say that the model REMEMBERS the normal operational dynamics over time. In-Memory Reference Matrix (part of the “Normality” model), showing three of its vectors: IOPS 2500, 4900, 800; User Commits 10000, 21000, 4400; Log File Parallel Write 2350, 4100, 22050; Log File Sync 5100, 9025, 4024.
  • 38. AHF Anomaly Detection flow. In-Memory Reference Matrix (part of the “Normality” model), showing three of its vectors: IOPS 2500, 4900, 800; User Commits 10000, 21000, 4400; Log File Parallel Write 2350, 4100, 22050; Log File Sync 5100, 9025, 4024. Observed values (part of a data point): 10500, 20000, 4050, 10250. Residual values (observed - predicted, part of a data point): 5600, -1000, -50, 325. Estimator/predictor (ESEE): “Based on my normality model, the value of IOPS should be in the vicinity of ~4900, but it is reported as 10500; this is causing a residual of ~5600 in magnitude.” Fault detector: “Such a high magnitude of residuals should be tracked carefully! I’ll keep an eye on the incoming sequence of this signal, IOPS, and if it remains deviant I’ll generate a fault on it.”
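The estimator and fault detector described above can be sketched in a few lines. This is an illustration only, under loud assumptions: the slide does not say how the estimator picks its prediction, so the sketch simply uses the nearest reference vector by Euclidean distance, and the names `estimate`, `residuals` and `sustained_fault`, together with the threshold and run length, are hypothetical. Three of the slide's signals are used (IOPS, user commits, log file parallel write).

```python
import math

SIGNALS = ("IOPS", "user commits", "log file parallel write")

# Three vectors of the slide's reference matrix, restricted to SIGNALS.
REFERENCE = [
    (2500, 10000, 2350),
    (4900, 21000, 4100),
    (800, 4400, 22050),
]

def estimate(observed, reference=REFERENCE):
    """Assumed estimator: predict the reference vector closest to the data point."""
    return min(reference, key=lambda vec: math.dist(vec, observed))

def residuals(observed, reference=REFERENCE):
    """Residual = observed - predicted, computed per signal."""
    predicted = estimate(observed, reference)
    return tuple(o - p for o, p in zip(observed, predicted))

def sustained_fault(residual_history, threshold, min_run):
    """Assumed fault rule: the last min_run residuals all exceed the threshold."""
    recent = residual_history[-min_run:]
    return len(recent) == min_run and all(abs(r) > threshold for r in recent)

observed = (10500, 20000, 4050)  # observed values from the slide
for name, value in zip(SIGNALS, residuals(observed)):
    print(f"{name}: residual {value}")
```

With the slide's observed values the nearest vector is (4900, 21000, 4100), so the residuals come out as 5600, -1000 and -50, matching the slide. Requiring a run of deviant points before raising a fault mirrors the fault detector's "if it remains deviant" wording; a single outlier is tracked but not reported.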
  • 39. AHF Anomaly Detection flow: Inline and Immediate Fault Detection and Diagnostic Inference (Machine Learning, Pattern Recognition, & BN Engines). Input, data point at time t = 15:16:00, with fault detection and classification result in parentheses: CPU 0.90 (OK); ASM IOPS 4100 (OK); Network % util 88% (HIGH, symptom 1); Network_Packets Dropped 105 (HIGH, symptom 2); Log file sync 2 ms (OK); Log file parallel write 600 us (OK); GC CR request 504 ms (HIGH, symptom 3); GC current request 513 ms (HIGH, symptom 3); GC current block 2-way 2 ms (HIGH, symptom 4); GC current block busy 5.9 ms (HIGH, symptom 4); Enq: CF - contention 0 (OK); … Diagnostic inference at 15:16:00, symptoms: 1. Network Bandwidth Utilization 2. Network Packet Loss 3. Global Cache Requests Incomplete 4. Global Cache Message Latency. Root cause (target of corrective action) from the Diagnostic Inference Engine: Network Bandwidth Utilization.
  • 40. AHF Anomaly Detection flow: Cross Node and Cross Instance Diagnostic Inference. At 15:16:00 the Diagnostic Inference Engine on each of Node 1, Node 2 and Node 3 reports the same root cause (target of corrective action): Network Bandwidth Utilization. Cross Target Diagnostic Inference then identifies the Corrective Action Target.
  • 41. Identify Signatures • Incidents • Bugs Detect anomalies • Logs • OS metrics Predict • Resource usage • Maintenance window • Performance issues • Workload Stability Some AIOps Use Cases Copyright © 2023, Oracle and/or its affiliates 51
  • 42. Anomaly Detection – High Level 52 Known normal log entry (discard) Probable anomalous Line (collect) Log Collection File Type 1 File Type 2 File Type n.. Log File Anomaly Timeline Probable Anomalies Copyright © 2023, Oracle and/or its affiliates
  • 43. Trace File Analyzer – High Level Anomaly Detection Flow. 1 Log Cleansing, 2 Entry Feature Creation, 3 Entry Clustering, 4 Model Generation, 5 Expert Input, 6 Knowledge Base Creation, followed by Knowledge Base Indexing, Feedback, Training, Real-time Log File Processing and Timestamp Correlation & Ranking (steps 7 to 9 on the slide), with Batch Feedback.
  • 44. Drain Algorithm 54 • Drain is an online log template miner that can extract templates (clusters) from a stream of log messages in a timely manner. • It employs a parse tree with fixed depth to guide the log group search process, which effectively avoids constructing a very deep and unbalanced tree. • Drain continuously learns on-the-fly and extracts log templates from raw log entries. • Drain Research Paper : • Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. Drain: An Online Log Parsing Approach with Fixed Depth Tree, Proceedings of the 24th International Conference on Web Services (ICWS), 2017. • Link : http://jiemingzhu.github.io/pub/pjhe_icws2017.pdf
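To make the fixed-depth idea concrete, here is a heavily simplified sketch in the spirit of Drain. It is not the published algorithm or the Drain3 library: the tree is cut down to two routing levels (token count, then first token), there is no masking or cluster limit, and the class name `ToyDrain` is hypothetical. Only the `sim_th` idea and the `<*>` wildcard convention are borrowed from Drain.

```python
from collections import defaultdict

WILDCARD = "<*>"

class ToyDrain:
    """Two-level, fixed-depth template miner (illustrative simplification)."""

    def __init__(self, sim_th=0.5):
        self.sim_th = sim_th
        # (token count, first token) -> list of templates, each a token list
        self.leaves = defaultdict(list)

    @staticmethod
    def _similarity(template, tokens):
        # Fraction of positions where the template and the message agree.
        same = sum(1 for a, b in zip(template, tokens) if a == b)
        return same / len(tokens)

    def add(self, line):
        """Route one log line to a leaf, then match or create a template."""
        tokens = line.split()
        if not tokens:
            return ""
        # Tokens containing digits are treated as variables when routing.
        first = WILDCARD if any(ch.isdigit() for ch in tokens[0]) else tokens[0]
        leaf = self.leaves[(len(tokens), first)]
        best = max(leaf, key=lambda t: self._similarity(t, tokens), default=None)
        if best is None or self._similarity(best, tokens) < self.sim_th:
            leaf.append(list(tokens))  # start a new cluster
            return " ".join(tokens)
        for i, tok in enumerate(tokens):
            if best[i] != tok:
                best[i] = WILDCARD  # generalize the differing position
        return " ".join(best)

    def templates(self):
        return [" ".join(t) for leaf in self.leaves.values() for t in leaf]

miner = ToyDrain(sim_th=0.5)
for entry in ("Connection from 10.0.0.1 closed",
              "Connection from 10.0.0.2 closed",
              "Disk sda1 is full"):
    print(miner.add(entry))
```

Because every message is routed through the same small number of levels, the search cost per line stays bounded regardless of how many templates have been learned, which is the property the fixed-depth tree is meant to provide.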
  • 45. Drain Algorithm – Parameters for Tuning. Drain parameters can be tuned to the needs of each log file type. Parameter: description. [DRAIN]/sim_th: similarity threshold; [DRAIN]/depth: max depth levels of log clusters; [DRAIN]/max_children: max number of children of an internal node; [DRAIN]/max_clusters: max number of tracked clusters; [DRAIN]/extra_delimiters: delimiters to apply when splitting a log message into words; [MASKING]/masking: parameters masking; [SNAPSHOT]/snapshot_interval_minutes: time interval for new snapshots; [SNAPSHOT]/compress_state: whether to compress the state before saving it
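The section/key notation above maps naturally onto an INI file. The sketch below shows one way such a file could be read with Python's standard `configparser`; the section and key names come from the slide, while every value, the JSON encoding of the list-valued keys, and the helper name `load_drain_settings` are assumptions chosen for illustration rather than recommended settings.

```python
import configparser
import json

# Illustrative values only; tune per log file type.
EXAMPLE_INI = r"""
[DRAIN]
sim_th = 0.4
depth = 4
max_children = 100
max_clusters = 1024
extra_delimiters = ["_"]

[MASKING]
masking = [{"regex_pattern": "\\d+", "mask_with": "NUM"}]

[SNAPSHOT]
snapshot_interval_minutes = 10
compress_state = True
"""

def load_drain_settings(ini_text):
    """Parse the INI text and convert each value to a typed Python object."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    drain, snapshot = parser["DRAIN"], parser["SNAPSHOT"]
    return {
        "sim_th": drain.getfloat("sim_th"),
        "depth": drain.getint("depth"),
        "max_children": drain.getint("max_children"),
        "max_clusters": drain.getint("max_clusters"),
        # Assumption: list-valued settings are stored as JSON.
        "extra_delimiters": json.loads(drain["extra_delimiters"]),
        "masking": json.loads(parser["MASKING"]["masking"]),
        "snapshot_interval_minutes": snapshot.getint("snapshot_interval_minutes"),
        "compress_state": snapshot.getboolean("compress_state"),
    }

settings = load_drain_settings(EXAMPLE_INI)
print(settings)
```

Keeping one such file per log file type would let the similarity threshold and delimiters be adjusted independently for, say, alert logs and trace files.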
  • 46. Our Improvements over Drain • Multi-level Drain signatures • Association of Drain signatures with source code for more precise feature capturing • Interface to tune auto-marking of signatures and view the results of parameter changes in real time.
  • 47. CPU Usage and forecast [Chart: CPU usage and forecast; time axis labeled Jan 2021]
  • 48. Seasonality determination to window identification flow. Steps: 1 Original observation data, 2 Convolution filter & average, 3 Calculate seasonality, 4 Use seasonality to predict best maintenance window. The slide shows three value listings for the same timestamps (only the first has a column header, CNT):
      START_TIME            CNT     (second listing)   (third listing)
      2021-04-11 15:00:00   290     -0.226098          5.669881
      2021-04-11 16:00:00   31120   -0.069821          10.345606
      2021-04-11 17:00:00   21530   -0.350088          9.977203
      2021-04-11 18:00:00   26240   -0.187483          10.175040
      2021-04-11 19:00:00   40520   -0.513240          10.609551
      2021-04-11 20:00:00   54270   0.019737           10.901727
      2021-04-11 21:00:00   51460   0.059213           10.848560
      2021-04-11 22:00:00   44310   -0.011312          10.698966
      2021-04-11 23:00:00   25690   -0.179156          10.153857
      Current Date : 2021-05-12 15:00:00
      Current Position in Seasonality : -0.22609829742533585
      Best Maintenance Period in next Cycle : 2021-05-12 19:00:00
      Worst Maintenance Period in next Cycle : 2021-05-13 08:00:00
      [Charts: three time series with weekly ticks from 2021-04-11 to 2021-05-09]
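The four steps can be approximated with a classical additive decomposition. The sketch below is an assumption-laden illustration, not the method used to produce the slide: it log-transforms the counts (the last of the three value listings above matches the natural logarithm of CNT, e.g. ln 290 ≈ 5.669881), removes the trend with a centered moving-average filter, averages the detrended values per position in the cycle, and picks the lowest position as the best maintenance window. The function names, the tiny period of 4 and the synthetic series are all hypothetical.

```python
import math

def seasonal_profile(counts, period):
    """Average detrended log-count for each position within the cycle."""
    logs = [math.log(c) for c in counts]
    half = period // 2
    sums = [0.0] * period
    hits = [0] * period
    for t in range(half, len(logs) - half):
        window = logs[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: centered filter with half weight on both end points.
            trend = (sum(window) - 0.5 * (window[0] + window[-1])) / period
        else:
            trend = sum(window) / period
        sums[t % period] += logs[t] - trend
        hits[t % period] += 1
    # Assumes at least two full cycles, so every position is hit.
    return [s / h for s, h in zip(sums, hits)]

def best_and_worst(profile):
    """Positions in the cycle with the lowest and highest seasonal load."""
    positions = range(len(profile))
    return (min(positions, key=profile.__getitem__),
            max(positions, key=profile.__getitem__))

# Synthetic, clearly made-up load pattern: three cycles of four periods each.
synthetic = [10, 50, 90, 30] * 3
profile = seasonal_profile(synthetic, period=4)
best, worst = best_and_worst(profile)
print(profile, best, worst)
```

On real data the period would be chosen to match the workload cycle, and the best and worst positions would be mapped back to timestamps in the next cycle, as in the "Best Maintenance Period" and "Worst Maintenance Period" lines above.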
  • 49. Identifying time periods with high z-score events across multiple metrics 7 May 2021 Copyright © 2023, Oracle and/or its affiliates 59
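One simple way to find such periods is to standardize each metric and flag the time steps at which several metrics are extreme together. The sketch below assumes that approach; the slide does not state the threshold or how metrics are combined, so `threshold`, `min_metrics`, the function names and the sample series are all illustrative.

```python
import statistics

def z_scores(values):
    """Standardize a series; a constant series yields all zeros."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0] * len(values)
    return [(v - mean) / stdev for v in values]

def high_z_periods(metrics, threshold=3.0, min_metrics=2):
    """Return (index, metric names) where enough metrics exceed the threshold."""
    scored = {name: z_scores(vals) for name, vals in metrics.items()}
    length = min(len(z) for z in scored.values())
    flagged = []
    for t in range(length):
        hot = sorted(name for name, z in scored.items() if abs(z[t]) >= threshold)
        if len(hot) >= min_metrics:
            flagged.append((t, hot))
    return flagged

# Made-up series: cpu and io spike together at index 7, mem stays flat.
spike = [10] * 7 + [100] + [10] * 12
sample = {"cpu": spike, "io": list(spike), "mem": [10] * 20}
print(high_z_periods(sample))
```

Requiring agreement across metrics filters out single-metric noise and highlights the time periods that are most likely to correspond to a real incident.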
  • 50. What is AHF: Service Console • Front-end for analysis, cause and solution identification • Unified Timeline • Anomaly Detection • Graphing for Time Series Data • AHF Insights and Fleet Insights
  • 51. AHF Insights Overview. Previously, results from different AHF components were not available in a single dashboard, making them challenging to combine and correlate. To mitigate this, AHF Insights provides a web-based graphical user interface, which does not require a web server to host the web pages, for all diagnostic data collectors and analyzers that are part of the AHF Kit. AHF performs a diagnostic collection for a given period to analyze the performance of database systems from: • Configuration • Environment Topology • Metrics • Logs. The diagnostic data collected from the system passes through AHF Insights, which produces an offline report. AHF Insights provides a bird's eye view of the entire system with the ability to drill down further for root cause analysis.
  • 52. Information Captured. System Topology: • Resource Information • Resource Configuration • Summarized viewing of resource data. Insights: • Major events happening on the system • Operating system information and its analysis • Best practice compliance issues • Software Recommendation • Software / Hardware alerts for Database Server • System changes over the last 14 days • RPM details and RPM inconsistencies among hosts • Database Parameters and differences among databases • Kernel Parameters and differences among hosts
  • 53. • Latest AHF with AHF Insights code • Feature available from AHF 22.3 for Exadata Systems • Required AHF data sources (TFA, Exachk, CHM) should be enabled and running • 23.4 and higher for RAC Linux and ODA Systems Prerequisites Copyright © 2023, Oracle and/or its affiliates 63
  • 54. How can I generate it ? • Command : ahf analysis create --type insights --last 2h • Takes around : 3 - 4 minutes (depending on the system) • Size : 46MB zip (depending on the system) Copyright © 2023, Oracle and/or its affiliates 64
  • 55. System Topology • Cluster • Databases • Database Servers • Storage Servers • Fabric Switches Insights • Timeline • Operating System Issues • Best Practice issues • System Change • Recommended Software • Database Server • RPM List • Database Parameters • Kernel Parameters AHF Insights Report Copyright © 2023, Oracle and/or its affiliates 65
  • 56. Cluster: Cluster Summary. 1. Showcase relevant system cluster information. 2. Get DB Home details by clicking on the dropdown button located inside the DB Home section. 3. Copy the cluster summary to the user's clipboard.
  • 59. Oracle Cloud AI Ops Takeaways: How have Oracle and customers benefited from this AI Ops implementation? • AI Ops has become an essential Cloud technology • Understand the problem space • Understand the environmental, technical and legal constraints • Use ML algorithms appropriate to the task • Spend quality time with your training sets • Incorporate explainability into the results • Provide a feedback mechanism for model evolution • Look for opportunities to incorporate actuators • Honor the culture and risk tolerance of your target audience
  • 60. Thank you Any Questions? Sandesh Rao VP AIOps Autonomous Database @sandeshr https://www.linkedin.com/in/raosandesh/ https://www.slideshare.net/SandeshRao4 Copyright © 2023, Oracle and/or its affiliates 70