js-perf-investigation

Structured performance opportunity investigation for SpiderMonkey (the Firefox JavaScript engine). Use this skill when the user wants to investigate JS engine performance, profile SpiderMonkey, find optimization opportunities, write performance patches, or evaluate benchmark regressions. Trigger on mentions of: profiling JS, SpiderMonkey performance, JIT optimization, benchmark regression analysis, shell benchmarking, or any request to make JS workloads faster. The methodology is described mostly for the JS shell but can be adapted to browser investigation.

SpiderMonkey Performance Investigation

This skill guides a structured, evidence-driven performance investigation for the SpiderMonkey JavaScript engine. The methodology has four phases: hypothesis generation, evidence gathering, patch writing, and evaluation. Each phase builds on the last: resist the urge to skip ahead to writing patches before you have empirical evidence that a change will help.

When asked to create multiple patches, iterate through the phases each time to ensure each patch is independently validated and measured. If you are creating multiple patches, always commit the current patch before moving on to the next one. This makes each patch easier to review and makes its individual contribution easier to measure.

The end result of this skill will be a summary of the investigation, and one or more patches that measurably improve the performance of the targeted workload, with each patch describing supporting evidence and measured impact.

Prerequisites

The user should provide:

  • A workload to investigate (a JS file, benchmark suite, or instructions to reproduce)
  • A build configuration or existing shell to use

You have access to:

  • samply — sampling profiler that produces Firefox Profiler-compatible output
  • profiler-cli — for analyzing profiles. This can also be used to investigate Gecko profiler profiles if the investigation is being done in the browser.
  • searchfox-cli — source code search for the Firefox codebase

For more details on how to use these tools load the "profiler-analysis" skill, which will also hint on how to get the tools installed if needed.

You can create an artifacts/ directory for working files; it is excluded from version control.

Phase 1: Hypothesis Generation

The goal is to identify where time is being spent and form testable hypotheses about what could be improved.

1.1 Prepare the build

Use an opt-nodebug (optimized, no debug checks) build. Debug builds distort profiles with assertion overhead.

The user should provide or confirm the mozconfig to use. The key settings for an opt-nodebug build are:

ac_add_options --enable-optimize
ac_add_options --disable-debug

If the user hasn't specified a mozconfig, ask them — build configurations vary across machines and the user will know which obj-dir and config is appropriate for their setup.

Always run the shell with --strict-benchmark-mode when investigating performance. This flag validates the runtime configuration and will error if something would produce unreliable numbers (e.g. JIT is disabled unexpectedly). Generating profiles without this flag risks producing misleading data.

1.2 Establish the workload

Examine the workload to understand what it does. If the workload has an iteration count or loop parameter, determine an appropriate count so that the workload runs for at least 30 seconds under profiling. Statistical profilers need sufficient samples to produce meaningful data — short runs produce noisy profiles where real hotspots are hard to distinguish from sampling noise.

For targeted micro-optimizations (e.g. improving a single opcode or a specific stub), longer runs (60s+) may be necessary to accumulate enough samples in the specific code path of interest.

If the workload driver supports iteration configuration, prefer that.

Otherwise, wrap it:

for (let i = 0; i < ITERATIONS; i++) {
    load("workload.js");  // or call the main function
}
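One way to pick ITERATIONS is to time a short calibration run and scale it up to the target duration. The sketch below is a minimal illustration, not part of the skill's tooling: the function name and the calibration numbers are hypothetical, and it assumes run time grows roughly linearly with the iteration count, which ignores JIT warm-up and other fixed start-up costs.

```python
import math


def iterations_for_target(calib_iterations, calib_seconds, target_seconds=30.0):
    """Scale a short calibration run up to a target profiling duration.

    Assumes run time is roughly proportional to the iteration count.
    """
    if calib_iterations <= 0 or calib_seconds <= 0:
        raise ValueError("calibration run needs positive iterations and time")
    needed = math.ceil(target_seconds * calib_iterations / calib_seconds)
    # Never suggest fewer iterations than the calibration run already used.
    return max(calib_iterations, needed)


# Hypothetical calibration: 50 iterations took 2.0 seconds.
print(iterations_for_target(50, 2.0))        # 750 iterations for a 30 s run
print(iterations_for_target(50, 2.0, 60.0))  # 1500 iterations for a 60 s run
```

Treat the result as a starting point and confirm the actual duration under samply, since profiling itself can change the run time.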

1.3 Profile

Record a profile with samply. Always set IONPERF=func and PERF_SPEW_DIR so that JIT-compiled functions appear with readable names in the profile instead of raw addresses. The overhead is negligible:

mkdir -p artifacts/perf-spew
PERF_SPEW_DIR=artifacts/perf-spew IONPERF=func \
    samply record --save-only -o artifacts/profile.json.gz -- \
    ./obj-opt-nodebug/dist/bin/js --strict-benchmark-mode workload.js

Using --save-only avoids opening the browser and gives you a local file you can analyze with profiler-cli. Save profiles to the artifacts/ directory; you may need to gzip the profile for profiler-cli to read it.

For deeper JIT investigation (e.g. understanding what IR the JIT emitted for a hot function), use IONPERF=ir instead — see references/advanced-tools.md.

1.4 Analyze the profile

Start broad and narrow down. Looking at the profile, answer some of the following questions:

  1. What are the top CPU consumers?
  2. What does the call tree look like top-down?
  3. Who calls a hot function?
  4. What does a specific function's time look like with callees collapsed?

For Speedometer profiles, always use --focus-marker="-async,-sync" to exclude async idle time between benchmark iterations.

1.5 Form hypotheses

Based on the profile data, form specific, testable hypotheses. Good hypotheses look like:

  • "Function X is called Y times from path Z — reducing call frequency by caching result W should save ~N% of its self time"
  • "The JIT is spending M% of time in IC stubs for property access pattern P — a specialized stub for this pattern could reduce that"
  • "Allocation pressure in function F is causing N% GC time — pretenuring could help"

Bad hypotheses (avoid these):

  • "Let's tune the inlining threshold" — tuning existing knobs tends to overfit to the current benchmark state rather than making general engine progress
  • "This function seems slow, let's rewrite it" — without understanding why it's slow
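To put a number on a hypothesis before investing in it, an Amdahl-style bound is often enough. The sketch below is illustrative only: the function name and the example fractions are hypothetical, and it assumes the profile's sample share for the targeted code is a fair proxy for its share of wall-clock time.

```python
def projected_speedup(hot_fraction, fraction_removed):
    """Estimate the whole-workload effect of shrinking one hot spot.

    hot_fraction: share of total samples in the targeted code (0..1).
    fraction_removed: share of that code's time the change should remove (0..1).
    Returns (percent of total time saved, overall speedup factor).
    """
    if not (0.0 <= hot_fraction <= 1.0 and 0.0 <= fraction_removed <= 1.0):
        raise ValueError("fractions must be between 0 and 1")
    saved = hot_fraction * fraction_removed
    if saved >= 1.0:
        raise ValueError("cannot remove the entire run time")
    return saved * 100.0, 1.0 / (1.0 - saved)


# Hypothetical: a function holds 12% of samples and caching should halve it.
percent_saved, speedup = projected_speedup(0.12, 0.5)
print(f"at most {percent_saved:.1f}% of total time, {speedup:.3f}x overall")
```

If the bound is smaller than the run-to-run noise of the workload, the hypothesis will be hard to confirm by measurement even if it is correct.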

Phase 2: Evidence Gathering

Before writing a patch, gather enough evidence to be confident the hypothesis is sound.

2.1 Source investigation

Use searchfox-cli to read the relevant code and understand its current behavior.

Use searchfox-cli for blame on relevant code, as well as git history on relevant files. This might provide context on why things are the way they are.

2.2 Instrumentation

Profiling shows where time is spent but not always why. When your hypothesis depends on runtime state (data distributions, cache hit rates, list lengths, frequency of code paths), add temporary instrumentation to measure it directly.

Use MOZ_LOG or JS_LOG for instrumentation.

// The "debug" channel should otherwise be unused; you can also add your own channel.
JS_LOG(debug, Debug, "list length: %zu, sorted: %s",
       list.length(), isSorted ? "yes" : "no");

Throttle instrumentation output when it would fire on every iteration — use a counter to log every Nth occurrence, or accumulate statistics and log a summary. Unthrottled logging in a hot path will drown the output and slow the workload enough to distort measurements.

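// totalLength is assumed to be a size_t accumulated elsewhere in this function.
static uint32_t callCount = 0;
if (++callCount % 10000 == 0) {
    // printf-style specifiers, so use JS_LOG rather than JS_LOG_FMT.
    JS_LOG(debug, Debug, "after %u calls: avg length = %zu",
           callCount, totalLength / callCount);
}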

Re-run with MOZ_LOG=debug:5 to see the output.

In a browser build, you can add profiler markers instead of logging; these can be read through gecko-profiling and the profiler-cli.

2.3 Re-run with instrumentation

Run the instrumented build and collect the data. This confirms whether your hypothesis about runtime behavior is correct before you invest in writing a real patch.
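Once the log output is captured to a file, a short script can turn it into the distribution the hypothesis depends on. The sketch below assumes the message format from the JS_LOG example in section 2.2 ("list length: N, sorted: yes|no"); the line prefix in the sample data is made up, since the exact prefix emitted by the logging system is not specified here, and the regular expression deliberately ignores it.

```python
import re
import statistics

# Matches the message body only; any logger prefix on the line is ignored.
LINE_RE = re.compile(r"list length: (\d+), sorted: (yes|no)")


def summarize(lines):
    """Summarize list lengths and sortedness from instrumentation output."""
    lengths = []
    sorted_count = 0
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # unrelated output
        lengths.append(int(match.group(1)))
        if match.group(2) == "yes":
            sorted_count += 1
    if not lengths:
        return None
    return {
        "count": len(lengths),
        "median_length": statistics.median(lengths),
        "max_length": max(lengths),
        "sorted_fraction": sorted_count / len(lengths),
    }


# Hypothetical captured lines; real input would come from the log file.
sample = [
    "[prefix] list length: 3, sorted: yes",
    "[prefix] list length: 10, sorted: no",
    "[prefix] list length: 5, sorted: yes",
    "benchmark iteration done",
]
print(summarize(sample))
```

A summary like this makes it easy to check claims such as "lists are almost always short" or "inputs are usually already sorted" against actual data.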

Phase 3: Patch Writing

Now that you have evidence, write the patch.

3.1 Design for measurability

Where possible, gate the optimization behind a JS::Prefs preference so you can do apples-to-apples comparison on the same binary. This eliminates build-to-build variation as a confounding factor and makes it trivial to re-measure later.

To add the pref, add an entry to StaticPrefList.yaml:

- name: javascript.options.experimental.my_optimization
  type: bool
  value: true
  mirror: always
  set_spidermonkey_pref: always

Then guard the code path:

if (JS::Prefs::experimental_my_optimization()) {
    // new path (default: on)
} else {
    // old path
}

Use set_spidermonkey_pref: always (not startup) so the pref can be toggled via --setpref without requiring a restart:

# Measure with optimization (default):
./js --strict-benchmark-mode workload.js
# Measure without:
./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js

Note that pref-gating is not always feasible. For changes on extremely hot paths (tight JIT loops, inline caches), the branch on the pref check itself can be costly enough to distort measurements. In those cases, fall back to saving the obj-dir from a build without the patch and comparing against a build with the patch applied.

Note: You can't save -just- a js binary, as there are dynamically linked libraries. Always save the obj-dir, or create a different mozconfig.

3.2 Add development logging

During patch development, add JS_LOG logging to the debug channel to verify the new code path is being taken where expected. Throttle by a counter to avoid flooding output. Do a run with the instrumentation logging to ensure the logging fires when/where/as-much as expected. Remove or reduce this logging before the patch is finalized.

3.3 Microbenchmark

For a given optimization, it is often worthwhile to also write a microbenchmark that shows what result is achievable under the most ideal circumstances for that optimization. This is not a replacement for measuring the real workload, but it is a useful sanity check that the optimization works as intended and has the potential to produce the expected impact. It can also inform the decision about whether to keep a patch that is effective in the microbenchmark but does not show good impact under the real workload.

3.4 Multiple patches

When investigating multiple optimization opportunities:

  • Develop each patch independently so its contribution can be measured in isolation
  • Commit each patch separately with a clear message describing the change, the hypothesis it aims to address, the evidence in its favour, and the testing results.
  • At the end of optimization, present:
    1. Total improvement from baseline (no patches) to all patches applied
    2. Individual contribution of each patch measured independently
    3. Any interactions between patches (does applying A make B more or less effective?)
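The three items in that final report reduce to simple arithmetic over four measurements. The sketch below is one possible way to compute them for two patches; the function name, dictionary keys, and timings are hypothetical, and it assumes each input is a representative time (for example a mean or median over several runs) where lower is better.

```python
def contribution_report(baseline, with_a, with_b, with_both):
    """Compare individual and combined improvements for two patches.

    All arguments are times in the same unit; lower is better.
    """
    if baseline <= 0:
        raise ValueError("baseline time must be positive")

    def improvement_pct(t):
        return (baseline - t) / baseline * 100.0

    a = improvement_pct(with_a)
    b = improvement_pct(with_b)
    both = improvement_pct(with_both)
    return {
        "a_alone_pct": a,
        "b_alone_pct": b,
        "combined_pct": both,
        # Negative: the patches overlap. Positive: they reinforce each other.
        "interaction_pct": both - (a + b),
    }


# Hypothetical timings in milliseconds.
report = contribution_report(baseline=100.0, with_a=90.0, with_b=95.0, with_both=88.0)
for key, value in report.items():
    print(f"{key}: {value:.2f}")
```

Summing percentages is only an approximation of independent effects, so treat a small interaction term as noise unless it holds up across repeated measurements.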

Phase 4: Evaluation

4.1 Performance measurement

Run the workload with and without the patch (using the pref toggle or separate builds).

If hyperfine is available, you can use it. If not, start with 5 runs of each configuration, collecting timing results into arrays.

# With pref-gated optimization — collect results into a file:
for i in $(seq 1 5); do
    ./js --strict-benchmark-mode --setpref experimental.my_optimization=true workload.js \
        2>&1 | tee -a artifacts/results_with.txt
done
for i in $(seq 1 5); do
    ./js --strict-benchmark-mode --setpref experimental.my_optimization=false workload.js \
        2>&1 | tee -a artifacts/results_without.txt
done
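If the workload does not print its own timing, wall-clock time per run can be collected directly into arrays. The helper below is a minimal sketch with a hypothetical name; it measures whole-process time, so it includes shell start-up, and it is only appropriate when the workload does not report its own score. The demonstration times the Python interpreter as a stand-in command so that the sketch runs anywhere.

```python
import subprocess
import sys
import time


def time_runs(cmd, runs=5):
    """Run cmd `runs` times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the command fails, so bad runs are not timed silently.
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


# Stand-in command for demonstration purposes only.
demo = time_runs([sys.executable, "-c", "pass"], runs=2)
print([f"{d:.3f}s" for d in demo])

# For the real comparison, the commands would mirror the shell loops above, e.g.:
# with_opt = time_runs(["./js", "--strict-benchmark-mode", "--setpref",
#                       "experimental.my_optimization=true", "workload.js"])
# without_opt = time_runs(["./js", "--strict-benchmark-mode", "--setpref",
#                          "experimental.my_optimization=false", "workload.js"])
```

The two resulting lists can be passed straight to the significance test as the patched and baseline samples.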

After collecting initial results, use a Python script to assess whether the sample size is sufficient. Use the Mann-Whitney U test (non-parametric, robust to non-normal distributions common in benchmark data) to test for significance:

# /// script
# dependencies = [
#   "numpy",
#   "scipy",
# ]
# ///

# Run with `uv run script.py`; the dependencies should be installed automatically.
import numpy as np
from scipy import stats

baseline = np.array([...])  # times without patch
patched = np.array([...])   # times with patch

stat, p_value = stats.mannwhitneyu(baseline, patched, alternative='two-sided')
effect_size = (np.mean(baseline) - np.mean(patched)) / np.mean(baseline) * 100

print(f"Baseline: {np.mean(baseline):.2f} +/- {np.std(baseline):.2f}")
print(f"Patched:  {np.mean(patched):.2f} +/- {np.std(patched):.2f}")
print(f"Effect:   {effect_size:.2f}%")
print(f"p-value:  {p_value:.4f}")

if p_value > 0.05:
    print("Result not statistically significant at p<0.05 — consider more runs")

If the p-value is borderline (0.01 < p < 0.10) or the effect size is small relative to the observed variance, collect additional runs and retest. But do not exceed 20 runs per configuration — if 20 runs on each side still can't produce a significant result, the effect is too close to the noise floor to be meaningfully measured this way. That's a signal to step back and reconsider: either the optimization isn't having the expected impact, or the workload needs to be restructured to isolate the effect better (e.g. more iterations of the hot path, a more targeted microbenchmark).
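The stopping rule above can be written down as a small decision function. This is one possible reading of the guidance, not a definitive policy: the function name and return strings are hypothetical, and treating any inconclusive result below the 20-run cap as "collect more runs" is an assumption.

```python
MAX_RUNS = 20  # cap on runs per configuration, from the guidance above


def next_step(p_value, runs_per_config, alpha=0.05):
    """Decide what to do after a significance test on the current samples."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= 0.01:
        return "significant"  # clearly below the borderline band
    if runs_per_config >= MAX_RUNS:
        # At the cap: accept what the data says, or step back and rethink.
        return "significant" if p_value < alpha else "reconsider"
    return "collect more runs"


print(next_step(0.004, 5))   # significant
print(next_step(0.04, 5))    # collect more runs
print(next_step(0.20, 20))   # reconsider
```

"reconsider" corresponds to stepping back: either the optimization is not having the expected impact, or the workload needs restructuring to isolate the effect.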

4.2 Profile the patched build

Don't just measure — profile again to confirm the patch is having the expected effect. The profile should show reduced time in the targeted code path. If it doesn't, investigate why.

4.3 Safety evaluation

After each patch is written, but before it's committed, run the correctness test suites.

Both suites below must pass. Test with the opt-nodebug build first (because you already have it), then test with an opt-debug build as well, since many debug-only assertions catch errors that an opt-nodebug run will not.

./mach jit-test
./mach jstests

If the patch touches GC-related code, run both suites with --jitflags=all for more thorough coverage:

./mach jit-test --jitflags=all
./mach jstests --jitflags=all

Beyond the test suites, consider adding test cases to address:

  • Edge cases the optimization might mishandle
  • Whether the patch changes general-purpose code paths that could regress other workloads

Investigation document

Produce a summary document (outside the source tree, e.g. in artifacts/) that records:

  1. Objective: What workload was being investigated and why
  2. Methodology: Build configuration, profiling setup, iteration counts
  3. Hypotheses investigated: For each hypothesis:
    • What the profile data suggested
    • What evidence was gathered (instrumentation results, source analysis)
    • Whether a patch was written and what it does
    • Measured performance impact (with numbers and variance)
  4. Hypotheses rejected: Hypotheses that were investigated but didn't pan out, and why — this is valuable for future investigators
  5. Results: Summary of total improvement achieved, per-patch breakdown
  6. Remaining opportunities: Observations from profiling that weren't pursued but could be investigated in future work

Anti-patterns to avoid

  • Patching without evidence: Never write an optimization patch based on intuition alone. Profile first, instrument if needed, then patch.
  • Knob tuning: Adjusting existing heuristic thresholds (inlining limits, IC stub counts, GC triggers) tends to overfit to the specific benchmark. Prefer structural improvements that make the engine generally better over threshold adjustments that win one benchmark.
  • Measuring too few iterations: A single run or a 2-second profile is not reliable. Ensure sufficient samples for statistical confidence.
  • Forgetting --strict-benchmark-mode: Without this flag, the shell may be in a configuration that produces misleading numbers. Always use it.
  • Comparing across builds without controlling for noise: Use pref-gated patches or carefully controlled build pairs. Random rebuild-to-rebuild variation can mask or exaggerate real differences.
  • Mixing independent changes together in a single patch: This makes it impossible to measure each change's contribution in isolation.
  • Advocating for changes that can't even be measured on a targeted microbenchmark. If the optimization can't show a clear improvement in an idealized scenario, it's unlikely to produce meaningful improvement in the real workload.
Repository
mozilla/enterprise-firefox