john-the-ripper – Performance analysis, tools and experiments

Performance characterization of the john-the-ripper password cracker. There are five workloads with slightly different profiles combined together. The first (bcrypt) and third (blowfish) workloads have a very high retire rate, the fourth (HMAC-SHA512) has a lower one, and the second (WPA PSK) and fifth (MD5) fall in between. The profiles below are therefore an aggregate of all five. It also looks like playing with compiler options can make a large difference (https://www.phoronix.com/review/intel-meteorlake-gcc-clang). Those tests seem to have picked just the right compiler options…

Here is the AMD composite profile.

elapsed              719.563
on_cpu               0.876          # 14.02 / 16 cores
utime                10088.098
stime                2.788
nvcsw                5633           # 6.88%
nivcsw               76278          # 93.12%
inblock              17008          # 23.64/sec
onblock              6064           # 8.43/sec
cpu-clock            10091008298375 # 10091.008 seconds
task-clock           10091046378603 # 10091.046 seconds
page faults          707665         # 70.128/sec
context switches     85296          # 8.453/sec
cpu migrations       230            # 0.023/sec
major page faults    89             # 0.009/sec
minor page faults    707576         # 70.119/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             1181761586204  # 18.861 branches per 1000 inst
branch misses        1953540532     # 0.17% branch miss
conditional          875223584314   # 13.969 conditional branches per 1000 inst
indirect             31768861224    # 0.507 indirect branches per 1000 inst
cpu-cycles           42249053159235 # 3.67 GHz
instructions         62449360688413 # 1.48 IPC
slots                84520599603168 #
retiring             22972486141479 # 27.2% (42.9%)
-- ucode             19908311589    #     0.0%
-- fastpath          22952577829890 #    27.2%
frontend             649425361657   #  0.8% ( 1.2%)
-- latency           131863651656   #     0.2%
-- bandwidth         517561710001   #     0.6%
backend              29849704427132 # 35.3% (55.8%)
-- cpu               22970067699975 #    27.2%
-- memory            6879636727157  #     8.1%
speculation          44758977493    #  0.1% ( 0.1%)
-- branch mispredict 27856028572    #     0.0%
-- pipeline restart  16902948921    #     0.0%
smt-contention       31004181580785 # 36.7% ( 0.0%)
cpu-cycles           42471271376087 # 3.69 GHz
instructions         62688427445805 # 1.48 IPC
instructions         20906863465846 # 5.588 l2 access per 1000 inst
l2 hit from l1       98937781951    # 7.01% l2 miss
l2 miss from l1      3949232822     #
l2 hit from l2 pf    13648867835    #
l3 hit from l2 pf    4231154370     #
l3 miss from l2 pf   11181084       #
instructions         20900865024390 # 7.953 float per 1000 inst
float 512            99             # 0.000 AVX-512 per 1000 inst
float 256            1244           # 0.000 AVX-256 per 1000 inst
float 128            166214135569   # 7.953 AVX-128 per 1000 inst
float MMX            0              # 0.000 MMX per 1000 inst
float scalar         0              # 0.000 scalar per 1000 inst
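
The derived figures in the right-hand column can be rechecked from the raw counters. The sketch below is only a sanity check, not the profiling tool's actual code: the helper names (`busy_cores`, `topdown_shares`) are made up for illustration, and the reading of the parenthesized percentages as "share of slots once smt-contention is excluded" is an assumption that happens to reproduce the numbers shown.

```python
# Raw counters copied from the AMD composite profile above.
NCORES = 16                      # taken from the "/ 16 cores" annotation
elapsed = 719.563                # seconds
utime, stime = 10088.098, 2.788  # seconds
cycles = 42249053159235
instructions = 62449360688413
slots = 84520599603168
topdown = {
    "retiring": 22972486141479,
    "frontend": 649425361657,
    "backend": 29849704427132,
    "speculation": 44758977493,
    "smt-contention": 31004181580785,
}


def busy_cores(utime, stime, elapsed):
    """Average number of cores kept busy over the run."""
    return (utime + stime) / elapsed


def topdown_shares(counts, slots, exclude="smt-contention"):
    """Return {name: (share of all slots, share of slots minus `exclude`)}.

    Assumption: the second value corresponds to the parenthesized
    percentage in the profile output.
    """
    remaining = slots - counts.get(exclude, 0)
    shares = {}
    for name, value in counts.items():
        if name == exclude:
            continue
        shares[name] = (value / slots, value / remaining)
    return shares


cores = busy_cores(utime, stime, elapsed)
ipc = instructions / cycles
shares = topdown_shares(topdown, slots)

print(f"busy cores {cores:.2f} of {NCORES} (on_cpu {cores / NCORES:.3f})")
print(f"IPC {ipc:.2f}")
for name, (of_all, of_rest) in shares.items():
    print(f"{name:12s} {of_all:5.1%} ({of_rest:5.1%})")
```

Under that assumption the retiring and backend rows come out as 27.2% (42.9%) and 35.3% (55.8%), matching the profile, which suggests that roughly a third of the slots are attributed to the sibling SMT thread rather than to this workload's own stalls.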

Here is the Intel composite profile. One thing that stands out is a high amount of branch misprediction. What also stands out is that it somehow only ran on about two cores (on_cpu is 2.10 of 16). So overall this is a somewhat squirrely test that could use a deeper dive. It also looks like the sources detect the presence of particular ISA extensions, particularly the cryptographic ones. You can also see from the Phoronix run that particular cryptographic libraries are linked in through the compiler options.

elapsed              323.082
on_cpu               0.131          # 2.10 / 16 cores
utime                531.315
stime                146.991
nvcsw                10503543       # 99.85%
nivcsw               15479          # 0.15%
inblock              8              # 0.02/sec
onblock              2931888        # 9074.76/sec
cpu-clock            670147573846   # 670.148 seconds
task-clock           671777403316   # 671.777 seconds
page faults          50351069       # 74952.013/sec
context switches     10519619       # 15659.382/sec
cpu migrations       30361          # 45.195/sec
major page faults    1              # 0.001/sec
minor page faults    50351068       # 74952.012/sec
alignment faults     0              # 0.000/sec
emulation faults     0              # 0.000/sec
branches             342678860344   # 135.988 branches per 1000 inst
branch misses        3385141914     # 0.99% branch miss
conditional          342679096184   # 135.988 conditional branches per 1000 inst
indirect             49044340836    # 19.463 indirect branches per 1000 inst
slots                6042017136368  #
retiring             2057791788255  # 34.1% (34.1%)
-- ucode             206973543335   #     3.4%
-- fastpath          1850818244920  #    30.6%
frontend             869596524785   # 14.4% (14.4%)
-- latency           441086010050   #     7.3%
-- bandwidth         428510514735   #     7.1%
backend              2531687347647  # 41.9% (41.9%)
-- cpu               625305575545   #    10.3%
-- memory            1906381772102  #    31.6%
speculation          609068467938   # 10.1% (10.1%)
-- branch mispredict 488700891743   #     8.1%
-- pipeline restart  120367576195   #     2.0%
smt-contention       0              #  0.0% ( 0.0%)
cpu-cycles           3264854609930  # 0.61 GHz
instructions         4152760303612  # 1.27 IPC
l2 access            47561932109    # 20.313 l2 access per 1000 inst
l2 miss              20055987975    # 42.17% l2 miss
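
To put the two runs side by side, a few ratios can be computed from the raw counters in both profiles. This is a minimal sketch for comparison only: the `Profile` class and its method names are hypothetical, and the numbers are copied from the two listings above.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Subset of the raw counters from one composite profile."""
    name: str
    elapsed: float       # seconds
    utime: float         # seconds
    stime: float         # seconds
    nvcsw: int           # voluntary context switches
    nivcsw: int          # involuntary context switches
    branches: int
    branch_misses: int

    def busy_cores(self):
        return (self.utime + self.stime) / self.elapsed

    def system_share(self):
        return self.stime / (self.utime + self.stime)

    def voluntary_share(self):
        return self.nvcsw / (self.nvcsw + self.nivcsw)

    def branch_miss_rate(self):
        return self.branch_misses / self.branches


amd = Profile("AMD", 719.563, 10088.098, 2.788,
              5633, 76278, 1181761586204, 1953540532)
intel = Profile("Intel", 323.082, 531.315, 146.991,
                10503543, 15479, 342678860344, 3385141914)

print(f"{'':6s} {'cores':>6s} {'sys':>7s} {'vol csw':>8s} {'br miss':>8s}")
for p in (amd, intel):
    print(f"{p.name:6s} {p.busy_cores():6.2f} {p.system_share():7.2%} "
          f"{p.voluntary_share():8.2%} {p.branch_miss_rate():8.2%}")
```

On the Intel run nearly all context switches are voluntary and a noticeable share of CPU time is system time, whereas on the AMD run both are small. One possible reading is that the Intel threads spent much of the run blocked rather than computing, though the counters alone do not say why.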

Drilling into the speculation amounts, I see occasional bursts of much higher misses. Also apparent in this graph is that more time is spent in particular benchmarks. I believe that in some cases extra runs are made to get the results to converge, but these also add to the totals where there are different workloads… So this test as a whole would benefit from being broken into separate cases rather than running them all together.