Let’s tackle problems in software development in an automated, data-driven and reproducible way! As developers, we often feel that something might be wrong with the way we develop software. Unfortunately, a gut feeling alone isn’t sufficient for the complex, interconnected problems in software systems. We need solid, understandable arguments to win budgets for improvement projects or to defend ourselves against political decisions. Fortunately, we can help ourselves: every step in the development or use of software leaves valuable digital traces. With clever analysis, these data can reveal the root causes of problems in our software and deliver new insights – understandable for everybody. Once concrete problems and their impact are known, developers and managers can create solutions and take sustainable actions aligned with existing business goals. In this meetup, I talk about analyzing software data using a digital notebook approach. This allows you to make your gut feelings explicit, step by step, with the help of hypotheses, explorations and visualizations. I show how open source analysis tools (Jupyter, Pandas, jQAssistant and, of course, Neo4j) work together to inspect problems in Java applications and their environment. We take a look at performance hotspots, knowledge loss and worthless code parts – completely automated, from raw data up to visualizations for management. Participants learn how to translate vague gut feelings into solid evidence for obtaining budgets for dedicated improvement projects with the help of data analysis.
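As a flavor of the kind of notebook analysis the talk describes, here is a minimal sketch of a knowledge-loss check over version-control data. The commit records and file paths are invented for illustration; a real analysis would parse `git log` output, typically into a Pandas DataFrame.

```python
from collections import defaultdict

# Hypothetical (author, touched_file) pairs, as one might extract from "git log"
commits = [
    ("alice", "src/Billing.java"),
    ("alice", "src/Billing.java"),
    ("bob",   "src/Report.java"),
    ("alice", "src/Report.java"),
]

# Knowledge-loss proxy: files known to only one author ("bus factor" of 1)
authors_per_file = defaultdict(set)
for author, path in commits:
    authors_per_file[path].add(author)

single_owner_files = sorted(p for p, a in authors_per_file.items() if len(a) == 1)
print(single_owner_files)  # ['src/Billing.java']
```

From such a table, a hypothesis like "knowledge about the billing module is concentrated in one person" becomes an explicit, checkable claim rather than a gut feeling.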
Davide Mottin is an assistant professor in the Department of Computer Science at Aarhus University who researches graph mining. His talk discusses unveiling knowledge in knowledge graphs through personalized summarization techniques; knowledge graphs contain entities and the relationships between them. He describes an approach for generating personalized summaries of a knowledge graph based on a user's query history: the algorithm aims to find a subgraph that maximizes the probability of answering future queries, subject to a size limit.
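The core idea can be sketched in a toy form: estimate how useful each edge is from the query log, then keep the most useful edges under a size budget. This is a simplified greedy illustration with invented entities, not Mottin's actual algorithm.

```python
from collections import Counter

# Past queries as (entity, relation) pairs; edges as (source, relation, target)
query_log = [("Aarhus", "locatedIn"), ("Aarhus", "locatedIn"), ("Neo4j", "type")]
edges = {
    ("Aarhus", "locatedIn", "Denmark"),
    ("Neo4j", "type", "GraphDatabase"),
    ("Denmark", "partOf", "Europe"),
}

freq = Counter(query_log)  # how often each (entity, relation) was asked

def utility(edge):
    """Score an edge by how often its (source, relation) appeared in past queries."""
    source, relation, _target = edge
    return freq[(source, relation)]

k = 2  # size budget for the summary
summary = sorted(edges, key=utility, reverse=True)[:k]
```

The real method reasons probabilistically about *future* queries rather than just counting past ones, but the shape of the problem, utility maximization under a size constraint, is the same.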
I gave this presentation at DataOps 19 in Barcelona. You will find information about Neo4j and how to use it with Graph Algorithms for Machine Learning and Artificial Intelligence.
This document discusses graphs and graph databases. It provides examples of graphs and compares SQL queries to Gremlin queries on graphs. It also discusses different types of graph databases for online transaction processing (OLTP) and online analytical processing (OLAP). The document then discusses how a social and data graph could help address the problem of data going dark in life sciences research by enabling collaboration, data sharing and discovery of relevant experts and data. It proposes using bi-clustering algorithms to identify relevant groups within the social and data graph to facilitate data and expert discovery.
This document provides an introduction to graphs and Neo4j. It explains that Neo4j is a native graph database that allows organizations to leverage connections in data in real time to create value. It then provides information on Neo4j as a company and as a product, including that it is the world's leading graph database. The document goes on to define what graphs are from a data structure perspective and provides examples of famous graphs like social networks. It discusses why graph databases are more useful than relational databases for representing complex, connected data and provides examples of use cases for Neo4j like recommendations, fraud detection, and network analysis.
Noel Yuhanna, VP, Principal Analyst, Forrester
Mary Barton, Consultant, Forrester
Blaise James, Analyst Relations, Neo4j
Data is both our most valuable asset and our biggest ongoing challenge. As data grows in volume, variety and complexity, across applications, clouds and siloed systems, traditional ways of working with data no longer work. Unlike traditional databases, which arrange data in rows, columns and tables, Neo4j has a flexible structure defined by stored relationships between data records. In this webinar we:
- Discuss the primary use cases for graph databases
- Explore the properties of Neo4j that make those use cases possible
- Look into the visualisation of graphs
- Introduce how to write queries
Webinar, 23 July 2020
This document provides an overview of the Neo4j Graph Platform vision, including existing and upcoming products. It discusses Neo4j's long-term vision of being a graph platform beyond just a database, including tools for development and administration, analytics, and integrations. It also highlights some key existing products like the Neo4j browser and algorithms library, as well as upcoming capabilities like analytics integrations and better visibility of partner software.
The document discusses graph data science and Neo4j's Graph Data Science (GDS) framework. GDS allows running graph algorithms and machine learning models at scale on large graph datasets. It discusses key aspects of GDS including architecture, data import, algorithm selection, and case studies of customers using GDS on graphs with billions of nodes and relationships. GDS runs on dedicated instances and supports features like enterprise graph compression, unlimited parallelization, and named graphs to optimize performance on large datasets.
Neo4j GraphTour Europe 2019: Neo4j Graph Use Cases, Bruno Ungermann, Sales Director Germany & Switzerland, Neo4j
The document discusses how graph data science can accelerate AI and machine learning by leveraging relationships between data, which traditional approaches often ignore. It describes Neo4j's graph database and graph data science platform that allows users to perform queries, machine learning, and visualization on graph data to gain insights. Neo4j's graph data science library provides algorithms, embeddings, and in-graph machine learning models to make predictions that incorporate a graph's structural relationships.
The document discusses graph databases, Neo4j graph database software, and graph data science algorithms. It provides an overview of graph databases and their components like nodes, edges, and properties. It then describes Neo4j's features including querying, visualization, hosting options, and the Graph Data Science library. Finally, it explains different types of graph data science algorithms in Neo4j like centrality, similarity, and pathfinding algorithms and provides an example of each.
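To make one of those algorithm families concrete, here is a plain-Python sketch of a pathfinding algorithm, unweighted shortest path via breadth-first search, over a toy graph. The graph and node names are invented; in Neo4j the Graph Data Science library exposes comparable pathfinding procedures.

```python
from collections import deque

# Toy directed adjacency list standing in for nodes and relationships
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}

def shortest_path(graph, start, goal):
    """Unweighted shortest path via breadth-first search."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # goal unreachable

print(shortest_path(graph, "A", "E"))  # ['A', 'B', 'D', 'E']
```

Centrality and similarity algorithms follow the same pattern: a pure-graph computation over nodes and relationships, which the database runs at scale rather than in application code.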
The document discusses two use cases for graph technologies and analytics solutions: (1) bill of material and data quality control, and (2) online shopping assistant. For the first use case, a graph database is used to model bill of materials data and rules to detect inconsistencies and prioritize data cleansing. For the second use case, a conversational shopping assistant provides real-time product recommendations using embedded expert knowledge and customer feedback. Both use cases leverage the connections in data through graph technologies to provide faster insights, improved data management and more relevant recommendations.
Mark Jensen, Director, Data Management and Interoperability, Frederick National Laboratory for Cancer Research
Todd Pihl, Director, Technical Project Manager, Frederick National Laboratory for Cancer Research
Ming Ying, Senior Software Engineer, Frederick National Laboratory for Cancer Research
The document discusses how Neo4j can be used to combat money laundering and financial fraud. It introduces the presenters and provides an agenda for the seminar. Additionally, it outlines Neo4j's capabilities for connecting disparate data sources and exposing related information to support enhanced decision making, fraud prevention, and compliance. Neo4j allows users to explore network and transactional data across multiple "anchor points" to discover relationships and patterns that may indicate money laundering or fraud.
These webinar slides are an introduction to Neo4j and Graph Databases. They discuss the primary use cases for Graph Databases and the properties of Neo4j which make those use cases possible. They also cover the high-level steps of modeling, importing, and querying your data using Cypher and touch on RDBMS to Graph.
Neo4j, the leading enterprise graph platform, is now globally available on Amazon Web Services (AWS) as a fully managed, always-on database service. Neo4j Aura Enterprise on AWS empowers organizations to rapidly build mission-critical, intelligent cloud-based applications backed by the performance, scale, security, and reliability that only the most deployed and most trusted graph technology can provide. Customers like Levi Strauss & Co., Sainsbury’s, Siemens, The Orchard and Tourism Media are already using Aura Enterprise on AWS for fraud detection, regulatory compliance, recommendation engines, supply chain analysis, and much more. Join us for this exclusive digital event to learn more about Neo4j Aura Enterprise on AWS:
- Understand the state of the data and analytics market and how investing in Neo4j and AWS fits in the big picture
- Get insights into how Siemens and Tourism Media are unlocking the power of graph databases on AWS during a panel discussion
- Discover how to build modern graph applications with Neo4j on AWS through a step-by-step presentation and demo
This document discusses using graph data science and graph algorithms to detect fraud. It explains that graph data science uses relationships in data to power predictions. It provides examples of how graph algorithms like Louvain clustering, PageRank, connected components, and Jaccard similarity can be used to identify communities that frequently interact, measure influence, identify accounts sharing identifiers, and measure account similarity to detect fraud in applications like banking and financial services. The document also discusses using graph embeddings and feature engineering with graph networks to improve machine learning models for fraud detection by basing predictions on influential entities and their relationships.
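One of the simpler signals mentioned, Jaccard similarity over shared identifiers, can be shown in a few lines of plain Python. The accounts and identifiers below are invented; in practice the identifiers would be nodes in the graph connected to account nodes.

```python
# Each account maps to the set of identifiers (phone, device, SSN) it uses
accounts = {
    "acct_1": {"phone:555-0100", "device:abc"},
    "acct_2": {"phone:555-0100", "device:abc", "ssn:123"},
    "acct_3": {"device:xyz"},
}

def jaccard(a, b):
    """Jaccard similarity: |intersection| / |union| of two identifier sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

score = jaccard(accounts["acct_1"], accounts["acct_2"])
print(round(score, 2))  # 0.67 -> suspiciously similar accounts
```

A high score between nominally unrelated accounts (here, two of three identifiers shared) is exactly the kind of relationship-based feature that a row-oriented view of the same data would miss.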
A changing market landscape and open source innovations are having a dramatic impact on the consumability and ease of use of data science tools. Join this session to learn about the impact these trends and changes will have on the future of data science. If you are a data scientist, or if your organization relies on cutting edge analytics, you won't want to miss this!
This document summarizes Kelli-Jean Chun's presentation on using Python and R for data science. It discusses data science roles, provides an overview of Python and R, compares them for different use cases, and outlines a plan to predict whether NYC dogs are spayed/neutered using both languages. R will be used for exploratory data analysis and visualization, while Python with Scikit-learn, Pandas and NumPy will be used to build and evaluate a predictive model. The languages will be connected using rpy2 to load data from R into Python and reticulate to run Python code in RMarkdown.
Priyanka Dighe received her M.S. in Computer Science and Engineering from UC San Diego in 2017 and B.E. in Computer Science from BITS Pilani in 2013. She has work experience as a Software Engineer at Microsoft and Bloomreach, developing applications for Word and implementing alerting services. She completed internships at HP Labs and Bloomreach focusing on predictive analytics using social media and implementing alerting pipelines. Her skills include programming in Java, C, Python, and technologies like Spark, Hadoop, and Play Framework.
Learn how data scientists can master advanced Python skills, a great way to boost and solidify your career.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016) Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. 
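The paper's central mechanism, edges materializing from intersecting postings lists in an inverted index, can be sketched minimally as follows. The corpus is invented, and real implementations (the paper builds on an inverted index as found in search engines) add scoring of the relationship strength rather than a raw co-occurrence count.

```python
# Tiny corpus: doc id -> text
docs = {
    0: "graph database query",
    1: "graph algorithm",
    2: "database index",
    3: "graph database index",
}

# Build the inverted index: term (node) -> set of doc ids (postings list)
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def edge(term_a, term_b):
    """An edge materializes from the docs where both terms occur:
    the intersection of their postings lists."""
    return len(index[term_a] & index[term_b])

print(edge("graph", "database"))  # 2 (docs 0 and 3)
```

Because edges are computed from the index on demand, any pair (or set) of terms can be connected and scored without ever storing an explicit edge list, which is what makes the representation so compact.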
The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
Python is the language of choice for data analysis. The aim of this slide deck is to provide a comprehensive learning path for people new to Python for data analysis, covering the steps you need to take to use Python for data analysis.
Pandas is an open source Python library that provides high-performance data structures and data analysis tools. It allows users to work with structured and unstructured data, clean and manipulate data sets, and perform complex analyses. The presentation will provide an overview of Pandas functionality, demonstrate how to download and install it, and showcase examples of using Pandas to clean and analyze financial data sets. There will be time for Q&A at the end.
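A small example of the kind of cleaning and analysis the presentation demonstrates, with invented price records standing in for a financial data set:

```python
import pandas as pd

# Toy financial records with a missing close price
prices = pd.DataFrame({
    "ticker": ["AAA", "AAA", "BBB", "BBB"],
    "close":  [10.0, None, 20.0, 22.0],
})

# Fill each gap with the previous close for the same ticker, then average
prices["close"] = prices.groupby("ticker")["close"].ffill()
avg = prices.groupby("ticker")["close"].mean()
print(avg["AAA"], avg["BBB"])  # 10.0 21.0
```

The forward-fill and group-wise aggregation above are typical of the "clean and manipulate data sets" workflow the abstract refers to.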
Developer workflow analysis and ownership management present comprehension challenges for software ecosystems and global software engineering. Dark matter exists because tools are not fully integrated, logging is not designed for analysis, and developer workflow is unstructured. Probabilistic models using machine learning and heuristics can help associate activities with work items to address this. Ownership management challenges include ownership decay, asset subclassing, team-level ownership, and providing explainable recommendations.
Eliminate the unavoidable complexity of object-oriented designs. Using the persistent data structures built into most modern programming languages, Data-oriented programming cleanly separates code and data, which simplifies state management and eases concurrency. Data-Oriented Programming teaches you to design applications using the data-oriented paradigm. These powerful new ideas are presented through conversations, code snippets, diagrams, and even songs to help you quickly grok what’s great about DOP. You’ll learn to write DOP code that can be implemented in languages like JavaScript, Ruby, Python, Clojure and also in traditional OO languages like Java or C#. Learn more about the book here: http://mng.bz/XdKl
Agile Data Science is a lean methodology adopted from Agile Software Development. At its core, it centers on people, interactions, and building minimally viable products to ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past with examples. Get started today with our help by visiting http://www.alpinenow.com
Neuron is a serverless Deep Learning and AI experiment platform for analytics where you can build, deploy and visualise data models. A practical lab in the cloud, accessible from anywhere.
Slides from my talk at Big Data Spain 2014 in Madrid. In this talk, we discuss our approach to bring large-scale deep analytics to the masses. R is an extremely popular numerical computing environment, but scientific data processing frequently hits its memory limits. On the other hand, systems for executing data-intensive tasks like Hadoop or Stratosphere are not popular among R users because writing programs in these paradigms is cumbersome. We present an innovative approach to overcome these limitations using the Stratosphere/Apache Flink big data platform by means of an R package and ready-to-use distributed algorithms. This solution allows the user, with small modifications to the R code, to easily execute distributed scenarios using popular machine learning techniques. We will cover the implementation details of the proposed solution, including the architecture of the system, the functionality implemented and working examples. In addition, we will cover the differences between our approach and other solutions that integrate R with Hadoop or other large-scale analytics systems. Finally, the results of the performance tests show that this solution is competitive with existing R implementations for small amounts of data and able to scale up to the gigabyte level.
Wes McKinney gave a presentation on the past, present, and future of Python for data analysis. He discussed the origins and development of pandas over the past 12 years from the first open source release in 2009 to the current state. Key points included pandas receiving its first formal funding in 2019, its large community of contributors, and factors driving Python's growth for data science like its package ecosystem and education. McKinney also addressed early concerns about Python and looked to the future, highlighting projects like Apache Arrow that aim to improve performance and interoperability.
business model, business model canvas, mission model, mission model canvas, customer development, lean launchpad, lean startup, stanford, startup, steve blank, entrepreneurship, I-Corps, Stanford
Learn Data Science with Python: a course for B.TECH, BCA, MCA, BSC, MSC, B.COM, and statistics students. Data Science with Python online training with certified industry experts and a 100% pre-placement guarantee.
Vanya Sehgal is seeking a full-time position as a software developer. She has a Master's degree in Computer Science from Rochester Institute of Technology with a 3.75 GPA and relevant coursework including Java, C++, data management, and Android development. She has work experience as a software engineer intern at Intuit and Amazon where she worked on web and mobile applications, gained AWS experience, and performed testing. Her technical skills include Java, C++, SQL, Linux, and Android development and she has completed projects involving distributed systems, security evaluation, and mobile applications.
Watch here: https://bit.ly/3cZGCxr For their machine learning and data science projects to be successful, data scientists need access to all of the enterprise data delivered through their myriad of data models. However, gaining access to all data, integrated into a central repository, has been a challenge. Often 80% of the project time is spent on these tasks. But a virtual layer can help the data scientist speed up some of the most tedious tasks, like data exploration and analysis. At the same time, it also integrates well with the data science ecosystem. There is no need to change tools and learn new languages. The data virtualization platform helps data scientists offload these data integration tasks, allowing them to focus on advanced analytics. In this session, you will learn how data virtualization:
- Provides all of the enterprise data, in real time, and without replication
- Enables data scientists to create and share multiple logical models using simple drag and drop
- Provides a catalog of all business definitions, lineage, and relationships
This document provides an overview of artificial intelligence trends and applications in development and operations. It discusses how AI is being used for rapid prototyping, intelligent programming assistants, automatic error handling and code refactoring, and strategic decision making. Examples are given of AI tools from Microsoft, Facebook, and Codota. The document also discusses challenges like interpretability of neural networks and outlines a vision of "Software 2.0" where programs are generated automatically to satisfy goals. It emphasizes that AI will transform software development over the next 10 years.