Search

Configure VitePress local search so queries on code-heavy documentation behave predictably: every query term must match, partial identifiers resolve, camelCase and hyphenated terms split correctly, and each page appears at most once.

When to use

The default VitePress local search is well-suited for general prose docs. For code-heavy skill or reference sites, three pain points show up quickly:

Duplicate results. A page with five headings can return five hits, all linking to different anchors of the same page. The user sees the same page listed repeatedly and has to figure out which hit is the one they want.
Identifier misses. HTTPResponse typed as "response" doesn't match because the default tokenizer splits at whitespace and punctuation but not at camelCase boundaries, so the whole identifier is indexed as a single term.
OR noise. With the default fuzzy/OR behavior, a two-word query ("http response") matches every page that mentions either word — most of the site.

The pattern below addresses all three.

Design decisions

Result granularity is one document per page, collapsed at index time. Each page appears once in results; there are no per-section anchors. This is simpler than query-time dedup and requires no runtime code.

Term matching uses AND, so every query term must appear somewhere in the document. A two-word query narrows results rather than widening them.

Tokenization splits at whitespace, hyphens, underscores, and camelCase boundaries. Code identifiers and hyphenated terms split correctly without any special cases. The original unsplit token is also retained alongside the parts, so a query like gethttpresponse still matches via prefix.

Field weighting boosts title (4×) and headings (2×) over body (1×), so the most relevant page ranks first.

Prefix matching is on, fuzzy is off. Partial identifiers match ("resp" → "response") without the noise that fuzzy matching introduces. Add fuzzy only after real queries show that exact and prefix matching are not enough.

The pattern

All configuration lives inside the search key of your VitePress config.

Helpers module

Extract the functions into a separate module that both the config and tests can import. This keeps the config readable and lets tests import the real implementations rather than duplicating them.

// .vitepress/search-helpers.js

// Headings we flatten into paragraphs at index time (see renderForIndex).
const COLLAPSED_HEADINGS = new Set(["h2", "h3", "h4"]);

export function renderForIndex(src, env, md) {
  // Turn the markdown into markdown-it's token list, relabel each heading
  // token as a paragraph, then render that adjusted list back to HTML. See
  // "One result per page" below for why we work on tokens, not the HTML.
  const tokens = md.parse(src, env);
  for (const token of tokens) {
    const isHeading =
      token.type === "heading_open" || token.type === "heading_close";
    if (isHeading && COLLAPSED_HEADINGS.has(token.tag)) {
      token.tag = "p";
    }
  }
  return md.renderer.render(tokens, md.options, env);
}

// CRITICAL: tokenize and processTerm are serialized to the browser (see "The
// serialization trap" below), so each must be fully self-contained — inline
// every regex and helper, never reference module-scope code.
export function tokenize(text) {
  // Split at whitespace, hyphens, and underscores: `my-func` → ["my", "func"].
  const WORD_SEPARATORS = /[\s\-_]+/;
  // camelCase boundaries: a lowercase letter or digit before an uppercase one
  // (`getHTTP` → ["get", "HTTP"]), and an uppercase letter before an
  // uppercase+lowercase pair (`HTTPResponse` → ["HTTP", "Response"]).
  const CAMELCASE_BOUNDARY = /(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/;
  return text
    .split(WORD_SEPARATORS)
    .filter(Boolean)
    .flatMap((word) => {
      const parts = word.split(CAMELCASE_BOUNDARY).filter(Boolean);
      // Keep the original token alongside its parts so a whole-identifier query
      // like "httpresponse" still prefix-matches `HTTPResponse` in the index.
      return parts.length > 1 ? [...parts, word] : parts;
    });
}

export function processTerm(term) {
  return term.toLowerCase();
}

The serialization trap

tokenize and processTerm look like ordinary config — but VitePress hands them to the browser, not just the build. It serializes each one with Function.prototype.toString(), embeds the source string in the client bundle, and rebuilds the function on the page. Only the function's own source makes the trip. The moment the rebuilt function references a module-scope helper, constant, or import, it throws ReferenceError in the browser — and MiniSearch swallows that error, so every query silently returns zero results with nothing in the console to explain it.

That is exactly why the tokenizer above inlines its regexes and its splitting logic instead of factoring them into tidy named helpers. A clean tokenize = (t) => splitOnSeparators(t).flatMap(expandCamelCase) builds fine, indexes fine, and is completely broken in the browser. Test the round-trip directly so a future refactor can't reintroduce it:

test("tokenize survives serialization to the client", () => {
  const rebuilt = new Function(`return (${tokenize.toString()})`)();
  assert.deepEqual(rebuilt("getHTTPResponse"), tokenize("getHTTPResponse"));
});
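
The failure mode is easy to reproduce outside VitePress. The sketch below mimics the round trip with the same new Function trick the test uses; roundTrip, makeTidyTokenizer, and selfContainedTokenize are hypothetical names for illustration, not part of VitePress or MiniSearch.

```javascript
// Mimic the round trip: source text out, fresh function back in. The rebuilt
// function sees only globals, not the scope it was originally defined in.
function roundTrip(fn) {
  return new Function(`return (${fn.toString()})`)();
}

// Hypothetical "tidy" tokenizer that leans on a helper from its enclosing scope.
function makeTidyTokenizer() {
  const splitOnSeparators = (text) => text.split(/[\s\-_]+/).filter(Boolean);
  return function tokenize(text) {
    return splitOnSeparators(text);
  };
}

// Self-contained variant: everything it needs lives inside the function body.
function selfContainedTokenize(text) {
  return text.split(/[\s\-_]+/).filter(Boolean);
}

const tidy = makeTidyTokenizer();
const original = tidy("my-func"); // ["my", "func"], works where it was defined

let rebuiltError = null;
try {
  roundTrip(tidy)("my-func");
} catch (err) {
  rebuiltError = err; // ReferenceError: splitOnSeparators is not defined
}

const rebuilt = roundTrip(selfContainedTokenize)("my-func"); // ["my", "func"]
```

The tidy version passes any test that calls it directly, which is why the round-trip test matters: only the rebuilt copy exposes the missing helper.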

_render is exempt — VitePress runs it only at build time and never serializes it to the client, so it may reference module scope freely (as renderForIndex does with COLLAPSED_HEADINGS).

One result per page (index-time collapse)

VitePress's default indexer splits each page at H2/H3 heading boundaries, creating one MiniSearch document per section. A page with five headings produces five separate search hits. To collapse this to one hit per page, provide a _render hook that turns subheadings into paragraphs before indexing. VitePress's section splitter finds section boundaries by looking for heading tags, so once the headings are gone the page stays whole.

Why rewrite the tokens instead of the HTML? The obvious approach is to render the page to an HTML string and find-and-replace <h2>…</h2> with <p>…</p>. That works, but it means parsing HTML with regular expressions — fiddly once headings carry attributes (anchor ids, classes) and a classic source of bugs. markdown-it gives us a cleaner path. It first turns the markdown into a flat list of tokens — one per element (heading_open, inline, heading_close, paragraph_open, …) — and only then walks that list to produce HTML. md.parse hands us the token list before it becomes HTML, so we just relabel each heading token's tag from "h2" to "p" and let md.renderer produce the HTML. No HTML string is ever parsed, and any attributes on a heading ride along untouched because we change only the tag name. This is the same token-rewriting technique the skillLinkPlugin in config.js uses to fix up links.
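
To see the relabeling in isolation, the sketch below runs the same loop over hand-written objects shaped like markdown-it tokens. The token list and the tiny renderTags function are stand-ins made up for illustration; in the real hook the tokens come from md.parse and the HTML from md.renderer.

```javascript
// Hand-written objects shaped like markdown-it tokens (only the fields used
// here). In the real hook these come from md.parse(src, env).
const tokens = [
  { type: "heading_open", tag: "h2", attrs: [["id", "install"]] },
  { type: "inline", tag: "", content: "Install" },
  { type: "heading_close", tag: "h2" },
  { type: "paragraph_open", tag: "p" },
  { type: "inline", tag: "", content: "Run the installer." },
  { type: "paragraph_close", tag: "p" },
];

const COLLAPSED = new Set(["h2", "h3", "h4"]);

// Relabel heading tokens as paragraphs in place; nothing else is touched.
function collapseHeadings(list) {
  for (const token of list) {
    const isHeading =
      token.type === "heading_open" || token.type === "heading_close";
    if (isHeading && COLLAPSED.has(token.tag)) token.tag = "p";
  }
  return list;
}

// Minimal stand-in renderer, just enough to show the effect (not md.renderer).
function renderTags(list) {
  return list
    .map((t) => {
      if (t.type === "inline") return t.content;
      const attrs = (t.attrs || []).map(([k, v]) => ` ${k}="${v}"`).join("");
      return t.type.endsWith("_open") ? `<${t.tag}${attrs}>` : `</${t.tag}>`;
    })
    .join("");
}

const html = renderTags(collapseHeadings(tokens));
// <p id="install">Install</p><p>Run the installer.</p>
```

The id attribute survives on the relabeled paragraph because only the tag name changed.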

_render lives inside search.options and is called only when VitePress generates the search index — not during page rendering. The HTML your visitors receive is unaffected.

Trade-off: results always link to the top of the page, not the best-matching section. If your pages are short (one or two screens), users find what they want quickly. For long pages with many unrelated sections, filtering the result array to the best-scoring entry per URL preserves section anchors at the cost of more code in the search handler.
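
For reference, the query-time alternative could look roughly like the sketch below. It assumes each result carries a numeric score and an id of the form "/page.html#anchor"; check the shape of the results in your VitePress version before relying on that. bestHitPerPage is a hypothetical helper, and the sample results are made up.

```javascript
// Keep only the highest-scoring hit for each page.
// Assumes ids look like "/page.html#anchor"; adjust pageOf() if yours differ.
function bestHitPerPage(results) {
  const pageOf = (id) => id.split("#")[0];
  const best = new Map();
  for (const hit of results) {
    const page = pageOf(hit.id);
    const current = best.get(page);
    if (!current || hit.score > current.score) best.set(page, hit);
  }
  // Restore ranking order, best score first.
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Made-up results for illustration.
const deduped = bestHitPerPage([
  { id: "/guide.html#install", score: 4.2 },
  { id: "/guide.html#configure", score: 7.1 },
  { id: "/api.html#tokenize", score: 5.0 },
]);
// Two hits remain: "/guide.html#configure" first, then "/api.html#tokenize".
```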

AND term matching + prefix + no fuzzy

miniSearch: {
  searchOptions: {
    combineWith: "AND",
    fuzzy: false,
    prefix: true,
    boost: { title: 4, titles: 2, text: 1 },
  },
},

combineWith: "AND" means every query term must appear somewhere in the document. A two-word query is a narrowing filter, not a union.
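
To make the narrowing concrete, here is a toy model of the two combination modes over hand-written term sets. It is not how MiniSearch is implemented, and the page names and term lists are invented for illustration.

```javascript
// Toy index: page → set of indexed terms (hypothetical pages, not real data).
const pages = {
  "/http-client": new Set(["http", "client", "response", "request"]),
  "/cli-flags": new Set(["cli", "flags", "http"]),
  "/errors": new Set(["error", "response", "codes"]),
};

// Return the pages whose term set satisfies the query under the given mode.
function match(queryTerms, mode) {
  return Object.keys(pages).filter((page) => {
    const has = (term) => pages[page].has(term);
    return mode === "AND" ? queryTerms.every(has) : queryTerms.some(has);
  });
}

const orHits = match(["http", "response"], "OR"); // all three pages
const andHits = match(["http", "response"], "AND"); // only "/http-client"
```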

prefix: true matches any indexed term that starts with the query term. "resp" matches "response", and "httpres" matches "httpresponse" because the unsplit identifier is indexed alongside its camelCase parts. Useful for identifiers the user has only partly typed.
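
A toy version of the prefix check shows why the retained whole identifier matters. The term list is what the tokenizer on this page would produce for HTTPResponse after lowercasing; prefixMatches is a hypothetical stand-in, not a MiniSearch function.

```javascript
// Terms indexed for `HTTPResponse` after lowercasing: the two camelCase parts
// plus the retained original.
const indexedTerms = ["http", "response", "httpresponse"];

// Toy prefix match: an indexed term matches when it starts with the query term.
function prefixMatches(queryTerm, terms) {
  const q = queryTerm.toLowerCase();
  return terms.filter((term) => term.startsWith(q));
}

const byPart = prefixMatches("resp", indexedTerms); // ["response"]
const byWhole = prefixMatches("httpres", indexedTerms); // ["httpresponse"]
// Without the retained original, the whole-identifier prefix finds nothing.
const withoutOriginal = prefixMatches("httpres", ["http", "response"]); // []
```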

fuzzy: false keeps results predictable. If exact + prefix still misses queries in practice, add fuzzy: 0.2 and re-judge against real examples.

camelCase and punctuation tokenization

The tokenizer runs in two passes. First it splits at whitespace, hyphens, and underscores (my-func → ["my", "func"]). Then it splits each token at camelCase boundaries and retains the original unsplit token alongside the parts.

The camelCase split is one regex with two alternatives — kept inline rather than composed from named fragments, because the function must be self-contained for the browser (see "The serialization trap" above):

(?<=[a-z0-9])(?=[A-Z]) — split when a lowercase letter or digit precedes an uppercase letter: getHTTP → ["get", "HTTP"]
(?<=[A-Z])(?=[A-Z][a-z]) — split when an uppercase letter precedes an uppercase+lowercase pair: HTTPResponse → ["HTTP", "Response"]

Together: getHTTPResponse → ["get", "HTTP", "Response", "getHTTPResponse"], lowercased to ["get", "http", "response", "gethttpresponse"].

Retaining the original means a query like "gethttpresponse" prefix-matches "gethttpresponse" in the index. processTerm lowercases every token before indexing and every query term before searching, making all matches case-insensitive. "HTTP" and "http" hit the same indexed entry.
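
A few worked inputs, run through both steps. The snippet restates tokenize and processTerm inline only so it runs on its own; indexTerms is a hypothetical wrapper that chains them the way the text above describes.

```javascript
// Same logic as tokenize/processTerm in search-helpers.js, restated inline so
// this snippet is runnable by itself.
function tokenize(text) {
  const WORD_SEPARATORS = /[\s\-_]+/;
  const CAMELCASE_BOUNDARY = /(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])/;
  return text
    .split(WORD_SEPARATORS)
    .filter(Boolean)
    .flatMap((word) => {
      const parts = word.split(CAMELCASE_BOUNDARY).filter(Boolean);
      return parts.length > 1 ? [...parts, word] : parts;
    });
}
const processTerm = (term) => term.toLowerCase();

// Hypothetical wrapper: split first, then lowercase each token.
const indexTerms = (text) => tokenize(text).map(processTerm);

const hyphenated = indexTerms("my-func"); // ["my", "func"]
const snake = indexTerms("snake_case_name"); // ["snake", "case", "name"]
const camel = indexTerms("getHTTPResponse");
// ["get", "http", "response", "gethttpresponse"]
const acronym = indexTerms("HTTP"); // ["http"], no boundary so no extra token
```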

Wiring it up

// .vitepress/config.js
import { defineConfig } from "vitepress";
import { renderForIndex, tokenize, processTerm } from "./search-helpers.js";

export default defineConfig({
  themeConfig: {
    search: {
      provider: "local",
      options: {
        _render: renderForIndex,
        miniSearch: {
          options: { tokenize, processTerm },
          searchOptions: {
            combineWith: "AND",
            fuzzy: false,
            prefix: true,
            boost: { title: 4, titles: 2, text: 1 },
          },
        },
      },
    },
  },
});

Tests

// skills/web/vitepress/search.test.js
import { test, describe } from "node:test";
import assert from "node:assert/strict";
import {
  renderForIndex,
  tokenize,
  processTerm,
} from "../../../.vitepress/search-helpers.js";

Import the real implementations from search-helpers.js — never copy them into the test file. Copied implementations can diverge silently; imports break loudly when the behavior changes.

Trade-offs

Index-time collapse vs. query-time dedup. Collapsing at index time (_render approach) is simpler and requires no runtime code. The trade-off is that results always link to the top of the page. Query-time dedup — keeping one result per page with a section anchor — requires filtering the result array to the highest-ranked hit per URL, but MiniSearch exposes enough metadata to do this reliably.
Fuzzy off by default. Turning on fuzzy matching (fuzzy: 0.2) adds tolerance for typos but also adds noise. Validate first that exact + prefix covers real queries before enabling it.
Tokenizer performance. The camelCase tokenizer runs over all indexed text at build time and over every query in the browser. For very large sites (thousands of pages), measure build time before and after enabling it; for typical skill sites it is not a bottleneck.
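
If build time alone is too coarse a signal, a micro-benchmark of the tokenizer is one way to isolate its cost. This is a rough sketch, not a rigorous benchmark: timeTokenizer is a hypothetical helper, the input is synthetic, and the baseline shown is a plain whitespace split standing in for whatever you want to compare against.

```javascript
// Time a tokenizer over the same text several times and report the average.
// `performance.now()` is a global in current Node.js and in browsers.
function timeTokenizer(tokenizeFn, text, runs = 20) {
  let tokenCount = 0;
  const start = performance.now();
  for (let i = 0; i < runs; i++) {
    tokenCount = tokenizeFn(text).length;
  }
  const elapsedMs = performance.now() - start;
  return { tokenCount, avgMs: elapsedMs / runs };
}

// Synthetic input: made-up identifiers repeated to approximate a large page.
const sample = "getHTTPResponse my-func snake_case_name ".repeat(5000);

// Baseline: whitespace-only split. Pass the real tokenize to compare.
const baseline = timeTokenizer((t) => t.split(/\s+/).filter(Boolean), sample);
```

Run it once with the baseline and once with the tokenize function from search-helpers.js, then compare avgMs on the same machine.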
