
fix(filesystem): apply files.watcherExclude to all watchers #17630

Open
safisa wants to merge 4 commits into eclipse-theia:master from safisa:safi-fix/watcher-exclude-all-watchers

Conversation

safisa commented Jun 9, 2026

Contributor

What it does

Fixes #17247
Fixes #10794

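FileService.watch() previously appended only the temporary-upload exclude and relied on individual callers to apply files.watcherExclude. As a result:

  • Watchers requesting excludes: [] (internal recursive watchers and plugin/language-server watchers created via vscode.workspace.createFileSystemWatcher) placed unbounded OS watches.
  • Overlapping watchers whose exclude lists differed were never subsumed and produced duplicate change events.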

This resolves files.watcherExclude generically inside FileService.watch(), so every watcher shares the same base excludes. The native @parcel/watcher ignore option then prunes the excluded directories for all watchers, and the existing watcher subsumption collapses overlapping watchers into a single OS watch.
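
To illustrate why a shared exclude base matters for subsumption, here is a minimal Python model. It assumes, as a simplification that is not taken from the Theia source, that a watcher is subsumed only when an existing watcher sits at or above its path and carries an identical exclude set. `Watch` and `is_subsumed` are hypothetical names used only for this sketch.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class Watch:
    """Hypothetical stand-in for a watch request (not a Theia type)."""
    path: str
    excludes: frozenset


def is_subsumed(new: Watch, existing: Watch) -> bool:
    """Assumed rule: identical excludes, and `existing` is at or above `new`."""
    if new.excludes != existing.excludes:
        return False
    new_path = PurePosixPath(new.path)
    existing_path = PurePosixPath(existing.path)
    return new_path == existing_path or existing_path in new_path.parents


base = frozenset({"**/node_modules/**", "**/theia_upload_*"})
root = Watch("/ws", base)

# Before: a child requesting `excludes: []` only received the temp-upload
# exclude, so its set differed from the root's and both watchers stayed alive.
child_before = Watch("/ws/src", frozenset({"**/theia_upload_*"}))

# After: every watcher shares the same base excludes, so the child can
# collapse into the root watcher.
child_after = Watch("/ws/src", base)

print(is_subsumed(child_before, root), is_subsumed(child_after, root))
```

Under this assumed rule, the only thing that changes between the two cases is the exclude set, which is the part this PR normalizes.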

Builds on #17598, which made the backend honor excludes via parcel's native ignore.

How to test

  1. Configure files.watcherExclude (e.g. "**/node_modules/**": true) in a workspace containing a large excluded subtree.
  2. Cause an internal/plugin watcher to request a recursive watch with empty excludes (e.g. a feature that watches a directory, or a language server that uses createFileSystemWatcher).
  3. On Linux, observe that the excluded subtree no longer accumulates OS watches, and that editing a file under a directory covered by two overlapping watchers now yields a single change event instead of duplicates.

Unit tests in packages/filesystem/src/browser/file-service-watcher.spec.ts cover the merge (enabled vs. false-valued patterns, caller-supplied excludes preserved, temporary-upload exclude retained) and that a child watcher requesting empty excludes is now subsumed by the root watcher.
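
The merge cases listed above can be modelled in a few lines of Python. This is a sketch of the described behaviour, not the TypeScript implementation; `resolve_watcher_excludes` and `TEMP_UPLOAD_EXCLUDE` are illustrative names, and the `configured` dict stands in for the `files.watcherExclude` preference value.

```python
TEMP_UPLOAD_EXCLUDE = "**/theia_upload_*"


def resolve_watcher_excludes(caller_excludes, configured):
    """Merge caller excludes, the temp-upload exclude, and enabled patterns.

    `configured` maps glob -> bool, like `files.watcherExclude`; patterns whose
    value is falsy are skipped. Order of first appearance is preserved.
    """
    resolved = list(dict.fromkeys(caller_excludes))  # de-duplicate, keep order
    enabled = [p for p, on in (configured or {}).items() if on]
    for pattern in [TEMP_UPLOAD_EXCLUDE, *enabled]:
        if pattern not in resolved:
            resolved.append(pattern)
    return resolved


merged = resolve_watcher_excludes(
    ["**/dist/**"],
    {"**/node_modules/**": True, "**/.git/**": False},
)
print(merged)
```

The caller-supplied pattern stays first, the temporary-upload exclude is always present, and the false-valued `**/.git/**` entry is dropped.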

Follow-ups

A separate driver remains and is intentionally out of scope here: a watcher rooted at an ancestor of a workspace folder (e.g. a language server watching the parent directory to detect deletion of the workspace folder itself) still triggers a recursive crawl of sibling trees, because the backend always watches recursively. files.watcherExclude cannot bound it since the watch root is outside the workspace. This warrants a dedicated issue/PR.
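
As a rough aid for spotting the ancestor case described above, the following Python sketch classifies a watch root relative to a workspace folder. The function name and the three labels are assumptions made for illustration; the paths are examples only.

```python
import os


def classify_watch_root(watch_root: str, workspace: str) -> str:
    """Return 'inside', 'ancestor', or 'unrelated' relative to the workspace."""
    root = os.path.normpath(watch_root)
    ws = os.path.normpath(workspace)
    common = os.path.commonpath([root, ws])
    if common == ws:
        return "inside"      # the root is the workspace or lies below it
    if common == root:
        return "ancestor"    # the root lies above the workspace
    return "unrelated"


for candidate in ("/data/ws/src", "/data", "/opt/tools"):
    print(candidate, "->", classify_watch_root(candidate, "/data/ws"))
```

Only roots classified as `inside` fall within the scope of `files.watcherExclude`; an `ancestor` root is the case this follow-up is about.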

Breaking changes

  • This PR introduces breaking changes and requires careful review. If yes, the breaking changes section in the changelog has been updated.


`FileService.watch()` only appended the temporary-upload exclude and left
`files.watcherExclude` to individual callers, so watchers requesting
`excludes: []` - internal recursive watchers and plugin/language-server
watchers created via `vscode.workspace.createFileSystemWatcher` - placed
unbounded OS watches, and overlapping watchers whose exclude lists differed
were never subsumed and produced duplicate change events.

Resolve `files.watcherExclude` generically in `FileService.watch()` so every
watcher shares the same base excludes: the native parcel `ignore` prunes
excluded directories for all watchers, and the existing watcher subsumption
collapses overlapping watchers into a single OS watch.

Closes eclipse-theia#17247, closes eclipse-theia#10794
@github-project-automation github-project-automation Bot moved this to Waiting on reviewers in PR Backlog Jun 9, 2026
safisa commented Jun 9, 2026

Contributor Author

For reviewers on Linux: here is a self-contained diagnostic to verify the effect of this change on real inotify usage.

It reads files.watcherExclude from the workspace settings, enumerates every node process holding inotify watches, checks that excluded paths inside the workspace are no longer watched, and reports watch roots outside the workspace (useful for spotting watchers, e.g. language servers, that are rooted outside the workspace folder).

Requires Linux /proc (inotify introspection is not available on macOS/Windows), so run it on the host/container where the Theia backend runs.

python3 check-watcher-excludes.py --workspace /path/to/workspace -v
check-watcher-excludes.py
#!/usr/bin/env python3
"""
check-watcher-excludes.py

Diagnose inotify usage in a Theia / VS Code-style application.

What it does
------------
1. Reads `files.watcherExclude` from .theia/settings.json (or .vscode/settings.json).
2. Enumerates every `node` process that owns any inotify instance, and reports:
     - number of inotify instances (file descriptors of type inotify)
     - total number of inotify watches across those instances
     - a label identifying the process (Theia backend, plugin host, a specific
       extension by id, language server, etc.)
3. Auto-detects the main Theia backend (the node process with the most watches)
   and verifies that every active `files.watcherExclude` pattern is honored for
   paths inside the workspace folder.
4. For every node process, computes the *watch roots* outside the workspace -
   the topmost watched ancestors (a watched path is a root if none of its
   parents are also watched). Reports the count of watches under each root,
   attributed to the owning process. This is how you find which extension is
   responsible for watches under e.g. `/data/<sibling-dir>/...`.

Notes on `files.watcherExclude` semantics (VS Code / Theia):
- The setting only applies to paths inside an *open workspace folder*.
- A bare pattern like `node_modules` matches only at the workspace root.
  Use `**/node_modules` to match at any depth.
- Patterns ending in `/**` also match the folder itself plus all descendants.
- Extensions that run their own file watchers (Git, Java LS, etc.) bypass this
  setting entirely. Those watches will show up under their own PID below.

Usage
-----
    python3 check-watcher-excludes.py                    # auto everything
    python3 check-watcher-excludes.py --workspace /path  # specify workspace
    python3 check-watcher-excludes.py --pid 1229         # force backend PID
    python3 check-watcher-excludes.py --scan-root /path/to/parent-of-workspace
    python3 check-watcher-excludes.py -v                 # show sample paths

Exit code: 0 if no exclude leaks, 1 if leaks found, 2 on errors.
"""

import argparse
import json
import os
import platform
import re
import sys
from collections import defaultdict


# ---------- settings loading -------------------------------------------------

def strip_jsonc(text: str) -> str:
    """Remove // and /* */ comments, string-aware. Also trims trailing commas."""
    out = []
    i, n = 0, len(text)
    in_string = False
    while i < n:
        c = text[i]
        if in_string:
            out.append(c)
            if c == "\\" and i + 1 < n:
                out.append(text[i + 1])
                i += 2
                continue
            if c == '"':
                in_string = False
            i += 1
            continue
        if c == '"':
            in_string = True
            out.append(c)
            i += 1
            continue
        if c == "/" and i + 1 < n and text[i + 1] == "/":
            while i < n and text[i] != "\n":
                i += 1
            continue
        if c == "/" and i + 1 < n and text[i + 1] == "*":
            i += 2
            while i + 1 < n and not (text[i] == "*" and text[i + 1] == "/"):
                i += 1
            i += 2
            continue
        out.append(c)
        i += 1
    cleaned = "".join(out)
    cleaned = re.sub(r",(\s*[}\]])", r"\1", cleaned)
    return cleaned


def load_excludes(settings_path: str) -> dict:
    if not os.path.exists(settings_path):
        return {}
    with open(settings_path, "r", encoding="utf-8") as f:
        raw = f.read()
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        try:
            data = json.loads(strip_jsonc(raw))
        except json.JSONDecodeError as e:
            print(f"WARN: failed to parse {settings_path}: {e}", file=sys.stderr)
            return {}
    return data.get("files.watcherExclude", {}) or {}


# ---------- glob -> regex ----------------------------------------------------

def glob_body_to_regex(pattern: str) -> str:
    out = []
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == "*":
            if i + 1 < len(pattern) and pattern[i + 1] == "*":
                if i + 2 < len(pattern) and pattern[i + 2] == "/":
                    out.append("(?:.*/)?")
                    i += 3
                else:
                    out.append(".*")
                    i += 2
            else:
                out.append("[^/]*")
                i += 1
        elif c == "?":
            out.append("[^/]")
            i += 1
        elif c in r".+()|^$[]{}\\":
            out.append("\\" + c)
            i += 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


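def compile_pattern(pattern: str, workspace: str) -> re.Pattern:
    ws_prefix = re.escape(workspace.rstrip("/") + "/")
    p = pattern
    # A trailing "/**" must match the folder itself as well as its descendants.
    # The "(?:/.*)?$" suffix below already covers descendants, so drop the
    # trailing "/**" here; otherwise the regex would require a slash after the
    # folder name and miss a watch on the folder itself.
    if len(p) > 3 and p.endswith("/**"):
        p = p[:-3]
    if p.startswith("**/"):
        body = glob_body_to_regex(p[3:])
        regex = ws_prefix + "(?:.*/)?" + body
    elif p.startswith("/"):
        body = glob_body_to_regex(p.lstrip("/"))
        regex = ws_prefix + body
    else:
        body = glob_body_to_regex(p)
        regex = ws_prefix + body
    regex += r"(?:/.*)?$"
    return re.compile(regex)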


# ---------- /proc helpers ----------------------------------------------------

_INO_RE = re.compile(r"\bino:([0-9a-fA-F]+)")


def count_inotify_instances(pid: int) -> int:
    """Number of inotify file descriptors (instances) held by `pid`."""
    fd_dir = f"/proc/{pid}/fd"
    count = 0
    try:
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(f"{fd_dir}/{fd}")
            except OSError:
                continue
            if "inotify" in target:
                count += 1
    except OSError:
        return 0
    return count


def count_inotify_watches(pid: int) -> int:
    """Number of inotify watches (sum of `inotify wd` lines across fdinfo)."""
    total = 0
    fdinfo = f"/proc/{pid}/fdinfo"
    try:
        for fd in os.listdir(fdinfo):
            try:
                with open(f"{fdinfo}/{fd}", "r") as f:
                    for line in f:
                        if line.startswith("inotify wd"):
                            total += 1
            except OSError:
                pass
    except OSError:
        return 0
    return total


def get_watched_inodes(pid: int) -> set:
    inodes = set()
    fdinfo = f"/proc/{pid}/fdinfo"
    try:
        for fd in os.listdir(fdinfo):
            try:
                with open(f"{fdinfo}/{fd}", "r") as f:
                    for line in f:
                        if not line.startswith("inotify wd"):
                            continue
                        m = _INO_RE.search(line)
                        if m:
                            inodes.add(int(m.group(1), 16))
            except OSError:
                pass
    except OSError:
        pass
    return inodes


def read_cmdline(pid: int) -> str:
    try:
        with open(f"/proc/{pid}/cmdline", "rb") as f:
            return f.read().replace(b"\x00", b" ").decode("utf-8", "replace").strip()
    except OSError:
        return ""


def read_comm(pid: int) -> str:
    try:
        with open(f"/proc/{pid}/comm", "r") as f:
            return f.read().strip()
    except OSError:
        return ""


def label_process(cmdline: str) -> str:
    """Heuristic label for a node process based on its command line."""
    # VS Code-style plugin path: .../plugins/<publisher.ext>/extension.js
    m = re.search(r"/plugins/([^/]+)/", cmdline)
    if m:
        return f"extension: {m.group(1)}"
    # Theia / VS Code shapes
    if "ipc-bootstrap" in cmdline:
        return "Theia worker (ipc-bootstrap)"
    if "plugin-host" in cmdline:
        return "Theia plugin host (extension runtime)"
    if re.search(r"/lib/backend/main\.js\b|/backend/main\.js\b", cmdline):
        return "Theia backend main"
    if "language-server" in cmdline.lower() or "languageserver" in cmdline.lower():
        return "language server"
    if "tsserver" in cmdline:
        return "TypeScript server"
    if "eslintServer" in cmdline:
        return "ESLint server"
    return "node"


def list_node_processes_with_inotify() -> list:
    """All `node` processes that hold at least one inotify instance."""
    procs = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        pid = int(entry)
        if read_comm(pid) != "node":
            continue
        instances = count_inotify_instances(pid)
        watches = count_inotify_watches(pid)
        if instances == 0 and watches == 0:
            continue
        cmdline = read_cmdline(pid)
        procs.append({
            "pid": pid,
            "instances": instances,
            "watches": watches,
            "cmdline": cmdline,
            "label": label_process(cmdline),
        })
    procs.sort(key=lambda p: (-p["watches"], -p["instances"]))
    return procs


# ---------- filesystem scan --------------------------------------------------

# Directories we never recurse into during the inode scan.
_SCAN_SKIP_NAMES = {"proc", "sys", "dev", "run", ".npm", ".cache"}


def build_inode_index(scan_roots: list, all_inodes: set) -> dict:
    """
    Walk each scan root once and build a dict: inode -> path (first hit).
    Only inodes present in `all_inodes` are kept (to bound memory).
    """
    index = {}
    seen_roots = set()
    for root in scan_roots:
        root = os.path.realpath(root)
        if not os.path.isdir(root) or root in seen_roots:
            continue
        seen_roots.add(root)
        try:
            st = os.stat(root)
            if st.st_ino in all_inodes and st.st_ino not in index:
                index[st.st_ino] = root
        except OSError:
            pass
        for dirpath, dirnames, filenames in os.walk(root, followlinks=False):
            # prune well-known noisy dirs
            dirnames[:] = [d for d in dirnames if d not in _SCAN_SKIP_NAMES]
            for name in dirnames:
                p = os.path.join(dirpath, name)
                try:
                    ino = os.lstat(p).st_ino
                except OSError:
                    continue
                if ino in all_inodes and ino not in index:
                    index[ino] = p
            for name in filenames:
                p = os.path.join(dirpath, name)
                try:
                    ino = os.lstat(p).st_ino
                except OSError:
                    continue
                if ino in all_inodes and ino not in index:
                    index[ino] = p
    return index


def compute_watch_roots(paths: list) -> list:
    """
    Given a list of watched absolute paths, return only the topmost ones:
    a path is a root if none of its ancestors are also in the set.
    """
    s = set(paths)
    roots = []
    for p in paths:
        ancestor = os.path.dirname(p)
        is_root = True
        while ancestor and ancestor != "/":
            if ancestor in s:
                is_root = False
                break
            ancestor = os.path.dirname(ancestor)
        if is_root:
            roots.append(p)
    return roots


def print_subtree_breakdown(root: str, paths_set: set, *,
                            indent: int = 6, depth: int = 0,
                            max_depth: int = 4, min_count: int = 20,
                            top_n: int = 10) -> None:
    """
    Recursively show how many watched paths live under each immediate
    subdirectory of `root`, drilling down while the count stays above
    `min_count` and the depth budget isn't exhausted.
    """
    if depth >= max_depth:
        return
    prefix = root.rstrip("/") + "/"
    groups: dict = defaultdict(int)
    for p in paths_set:
        if not p.startswith(prefix):
            continue
        rest = p[len(prefix):]
        head = rest.split("/", 1)[0]
        if head:
            groups[prefix + head] += 1
    if not groups:
        return
    items = sorted(groups.items(), key=lambda kv: -kv[1])[:top_n]
    for sub, n in items:
        if n < min_count:
            continue
        print(f"{' ' * indent}{n:6d}  {sub}")
        # Only drill deeper while there's still something to split.
        if n >= min_count * 2:
            print_subtree_breakdown(
                sub, paths_set,
                indent=indent + 2, depth=depth + 1,
                max_depth=max_depth, min_count=min_count, top_n=top_n,
            )


# ---------- main -------------------------------------------------------------

def main() -> int:
    ap = argparse.ArgumentParser(
        description=__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    ap.add_argument("--workspace", default=os.getcwd(),
                    help="Workspace folder to test (default: current dir)")
    ap.add_argument("--pid", type=int,
                    help="Theia backend PID for the exclude check "
                         "(default: auto = node process with most watches)")
    ap.add_argument("--settings",
                    help="Settings file (default: <workspace>/.theia/settings.json, "
                         "fallback to .vscode/settings.json)")
    ap.add_argument("--scan-root", action="append", default=[],
                    help="Extra directory to scan when mapping inodes -> paths. "
                         "Repeatable. By default the workspace and its parent are scanned.")
    ap.add_argument("--top-roots", type=int, default=10,
                    help="Max watch roots to print per process outside the workspace "
                         "(default 10)")
    ap.add_argument("--max-depth", type=int, default=4,
                    help="Max depth for the per-root hierarchical breakdown (default 4)")
    ap.add_argument("--min-count", type=int, default=20,
                    help="Minimum watch count to show a sub-group in the breakdown (default 20)")
    ap.add_argument("-v", "--verbose", action="store_true",
                    help="Show sample leaked paths per exclude pattern")
    args = ap.parse_args()

    workspace = os.path.realpath(args.workspace)
    if not os.path.isdir(workspace):
        print(f"ERROR: workspace not a directory: {workspace}", file=sys.stderr)
        return 2

    settings = args.settings
    if not settings:
        cand = os.path.join(workspace, ".theia", "settings.json")
        if not os.path.exists(cand):
            cand = os.path.join(workspace, ".vscode", "settings.json")
        settings = cand

    excludes = load_excludes(settings)
    active = [p for p, v in excludes.items() if v]

    print(f"Workspace : {workspace}")
    print(f"Settings  : {settings}")
    print(f"Excludes  : {len(active)} active patterns "
          f"({len(excludes) - len(active)} disabled)")

    if not os.path.isdir("/proc"):
        print()
        print(f"ERROR: this script requires Linux /proc (detected: {platform.system()}).",
              file=sys.stderr)
        print("       inotify introspection is not available on macOS or Windows.",
              file=sys.stderr)
        print("       Run this script on the Linux host where the Theia backend is running",
              file=sys.stderr)
        print("       (e.g. inside the container/VM that hosts the Theia process).",
              file=sys.stderr)
        return 2

    procs = list_node_processes_with_inotify()
    if not procs:
        print("ERROR: no node process is holding any inotify resources", file=sys.stderr)
        return 2

    # ---------- Process summary ----------
    print()
    print("=== Node processes with inotify usage ===")
    print(f"{'PID':>7}  {'Inst':>5}  {'Watches':>8}  Label")
    print(f"{'-'*7}  {'-'*5}  {'-'*8}  {'-'*60}")
    for p in procs:
        print(f"{p['pid']:>7}  {p['instances']:>5}  {p['watches']:>8}  {p['label']}")
    total_inst = sum(p["instances"] for p in procs)
    total_watches = sum(p["watches"] for p in procs)
    print(f"{'-'*7}  {'-'*5}  {'-'*8}")
    print(f"{'TOTAL':>7}  {total_inst:>5}  {total_watches:>8}")

    # ---------- Resolve all watched inodes -> paths ----------
    per_pid_inodes = {p["pid"]: get_watched_inodes(p["pid"]) for p in procs}
    all_inodes = set()
    for s in per_pid_inodes.values():
        all_inodes |= s

    scan_roots = [workspace, os.path.dirname(workspace)] + list(args.scan_root)

    print()
    print(f"Scanning {len(set(map(os.path.realpath, scan_roots)))} root(s) "
          f"to resolve {len(all_inodes)} watched inodes -> paths...")
    inode_index = build_inode_index(scan_roots, all_inodes)
    print(f"  resolved {len(inode_index)}/{len(all_inodes)} inodes "
          f"({len(all_inodes) - len(inode_index)} unresolved - outside scan roots)")

    per_pid_paths = {pid: [inode_index[i] for i in inos if i in inode_index]
                     for pid, inos in per_pid_inodes.items()}

    # ---------- Exclude check on the backend ----------
    backend_pid = args.pid or procs[0]["pid"]
    print()
    print(f"=== Exclude check for backend PID {backend_pid} "
          f"({next((p['label'] for p in procs if p['pid'] == backend_pid), '?')}) ===")
    backend_paths = per_pid_paths.get(backend_pid, [])
    inside_ws = [p for p in backend_paths
                 if p == workspace or p.startswith(workspace + os.sep)]
    print(f"  watched paths inside workspace: {len(inside_ws)}")

    leaks_total = 0
    if active and inside_ws:
        for pat in active:
            try:
                rx = compile_pattern(pat, workspace)
            except re.error as e:
                print(f"  ??    {pat:<40s} (skipped: invalid pattern: {e})")
                continue
            leaks = [p for p in inside_ws if rx.match(p)]
            marker = "OK   " if not leaks else "LEAK "
            print(f"  {marker} {pat:<40s} -> {len(leaks)} watched")
            if leaks:
                leaks_total += len(leaks)
                if args.verbose:
                    for p in leaks[:10]:
                        print(f"            {p}")
                    if len(leaks) > 10:
                        print(f"            ... ({len(leaks) - 10} more)")
    elif not active:
        print("  (no active exclude patterns to verify)")

    # ---------- Watch roots outside workspace, per process ----------
    print()
    print("=== Watch roots OUTSIDE the workspace folder ===")
    print("(topmost watched ancestors; counts include all watched descendants)")
    print()

    any_outside = False
    for p in procs:
        paths = per_pid_paths.get(p["pid"], [])
        outside = [x for x in paths
                   if not (x == workspace or x.startswith(workspace + os.sep))]
        unresolved = len(per_pid_inodes[p["pid"]]) - len(paths)

        if not outside and not unresolved:
            continue
        any_outside = True

        print(f"--- PID {p['pid']}  ({p['label']}) ---")
        print(f"    cmdline: {p['cmdline'][:140]}{'...' if len(p['cmdline']) > 140 else ''}")
        if unresolved:
            print(f"    {unresolved} watched inodes not found in the scan roots "
                  f"(retry with --scan-root <dir>)")

        if outside:
            outside_set = set(outside)
            roots = compute_watch_roots(outside)
            counts = []
            for r in roots:
                n = sum(1 for x in outside_set
                        if x == r or x.startswith(r + os.sep))
                counts.append((n, r))
            counts.sort(reverse=True)
            print(f"    {len(outside)} watched paths under {len(roots)} root(s):")
            for n, r in counts[:args.top_roots]:
                print(f"      {n:6d}  {r}")
                print_subtree_breakdown(
                    r, outside_set,
                    indent=10,
                    max_depth=args.max_depth,
                    min_count=args.min_count,
                    top_n=args.top_roots,
                )
            if len(counts) > args.top_roots:
                print(f"      ... ({len(counts) - args.top_roots} more roots)")
        print()

    if not any_outside:
        print("  (no watches outside the workspace folder)")
        print()

    # ---------- Final result ----------
    if leaks_total:
        print(f"Result: {leaks_total} watched paths inside the workspace "
              f"violate files.watcherExclude")
        return 1
    print("Result: all active exclude patterns are honored inside the workspace.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
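
One caveat when reading the output: `get_watched_inodes` keys watches by inode number alone, and inode numbers are only unique within a single filesystem. If the scan roots span several mounts, an unrelated path with the same inode number could be reported. The `inotify` lines in `/proc/<pid>/fdinfo/<fd>` also carry an `sdev` field (device ID, hexadecimal), so a possible refinement is to key on the pair. The sketch below only shows the parsing step; `parse_inotify_line` is a hypothetical helper and the sample line is made up in the documented format.

```python
import re

# Matches the leading fields of an inotify fdinfo line; `ino` and `sdev` are
# hexadecimal. The `wd` value is captured as text and not interpreted.
_INOTIFY_LINE = re.compile(
    r"^inotify wd:(?P<wd>[0-9a-f]+) ino:(?P<ino>[0-9a-f]+) sdev:(?P<sdev>[0-9a-f]+)"
)


def parse_inotify_line(line: str):
    """Return (device_id, inode) for one inotify fdinfo line, or None."""
    m = _INOTIFY_LINE.match(line)
    if not m:
        return None
    return int(m.group("sdev"), 16), int(m.group("ino"), 16)


sample = "inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0"
print(parse_inotify_line(sample))
```

Comparing the parsed device ID with `os.stat(...).st_dev` would need care, since the two values may not use the same encoding; that mapping is left out of this sketch.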

Comment on lines +1461 to +1485
/**
 * Resolve the effective exclude globs for a watcher: the caller-supplied `excludes`, the
 * always-on temporary-upload exclude, and the user's `files.watcherExclude` preference.
 *
 * Applying `files.watcherExclude` here, for every watcher, rather than relying on individual
 * callers, keeps the number of OS file watches (e.g. inotify watches on Linux) bounded even for
 * watchers that request `excludes: []` - internal recursive watchers as well as plugin and
 * language-server watchers created via `vscode.workspace.createFileSystemWatcher`. It also gives
 * overlapping watchers a consistent set of excludes, so the watcher subsumption in `doWatch` can
 * collapse them into a single OS watch instead of leaving duplicates that emit duplicate events.
 */
protected resolveWatcherExcludes(resource: URI, excludes: string[]): string[] {
    const resolved = new Set(excludes);
    // always ignore temporary upload files
    resolved.add('**/theia_upload_*');
    const configured = this.preferences.get('files.watcherExclude', undefined, resource.toString());
    if (configured) {
        for (const pattern of Object.keys(configured)) {
            if (configured[pattern]) {
                resolved.add(pattern);
            }
        }
    }
    return Array.from(resolved);
}

Contributor


This looks like it closely mirrors MainFileSystemEventService.getExcludes. Do we want to keep both, or does this implementation render that one obsolete, since this is closer to the actual call to create the watcher?

Separately, (how) do we handle preference changes in this area? Now all watches are stamped with the excludes set at the time they're created. If the user changes their exclude preferences, who's responsible for making sure that watches are updated to reflect the change?

Contributor Author


Good catch on both.

Done the first — now that FileService.watch applies the excludes for every watcher, getExcludes here was redundant, so I dropped it and $watch just delegates. The merge behaviour is still covered by file-service-watcher.spec.ts.

On preference changes: fair point, and it's a real gap — watches are stamped at creation and nothing re-issues them when files.watcherExclude changes. That's pre-existing though (we've never re-watched on a change to that setting); this PR only widens where the excludes get applied. I'd rather not grow this PR for it and handle live re-watching as a follow-up, which pairs nicely with the ancestor-watch follow-up already noted in the description.
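
For the live re-watching follow-up, one possible shape is a small registry that remembers active watch roots and re-issues them when the exclude list changes. The Python sketch below is purely illustrative: `RewatchingRegistry`, `start_watch`, and `on_excludes_changed` are hypothetical names, and nothing here reflects existing Theia APIs.

```python
from typing import Callable, Dict, List

Dispose = Callable[[], None]
StartWatch = Callable[[str, List[str]], Dispose]


class RewatchingRegistry:
    """Hypothetical: re-issues watches when the exclude preference changes."""

    def __init__(self, start_watch: StartWatch, excludes: List[str]):
        self._start_watch = start_watch
        self._excludes = list(excludes)
        self._active: Dict[str, Dispose] = {}

    def watch(self, path: str) -> None:
        """Start watching `path` with the current excludes (no-op if active)."""
        if path in self._active:
            return
        self._active[path] = self._start_watch(path, list(self._excludes))

    def on_excludes_changed(self, excludes: List[str]) -> None:
        """Dispose every active watch and restart it with the new excludes."""
        if excludes == self._excludes:
            return
        self._excludes = list(excludes)
        for path, dispose in list(self._active.items()):
            dispose()
            self._active[path] = self._start_watch(path, list(excludes))
```

A real implementation would also have to decide how to avoid missing events between the dispose and the restart, which this sketch does not address.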

@safisa safisa requested a review from colin-grant-work June 18, 2026 12:29
