documents-vector-search

Project allows document indexing in a local vector database and then search (supports Jira, Confluence and local files, can be integrated via MCP)

Base info

Key points:

  • Supports Jira/Confluence Data Center/Server and Cloud. For Jira, each ticket is a document; for Confluence, each page is a document.
  • Supports local files from a specified folder in various formats such as .pdf, .pptx, .docx, etc. Uses Unstructured for parsing local files.
  • Does NOT send any data to any third-party systems. All data is processed and stored locally (except when you use it as an MCP tool with a non-local AI agent).
  • Supports the MCP protocol, so the vector search can be used as a tool in AI agents.
  • Supports an "update" operation, so there is no need to fully recreate the vector database each time.
  • Provides an abstraction to add more data sources and to use different technologies (embeddings, vector databases, etc.).

Key technologies used:

  • FAISS and ChromaDB (vector databases);
  • all-MiniLM-L6-v2 (embedding model);
  • Unstructured (local file parsing);
  • MCP (integration with AI agents);
  • uv (Python environment and dependency management).

Please check this article for more context: https://medium.com/@shnax0210/mcp-tool-for-vector-search-in-confluence-and-jira-6beeade658ba

Communication:

Updates

2026/01/25 - Added ChromaDB and metafields filtering

From the beginning, the FAISS library was used as the vector database. ChromaDB was added because it can filter search results by metafields, which can be pretty convenient for the tool; for example, Confluence search results can be filtered by space or modification time. ChromaDB is now used by default, but if you still want to use FAISS (it has slightly better performance), just pass --indexes "indexer_FAISS_IndexFlatL2__embeddings_all-MiniLM-L6-v2" during collection creation.
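For example, a Confluence collection could be created with the FAISS index like this (a sketch; the URL and CQL values are illustrative placeholders):

uv run confluence_collection_create_cmd_adapter.py --collection "confluence" --url "https://confluence.example.com" --cql "space = 'MySpaceName'" --indexes "indexer_FAISS_IndexFlatL2__embeddings_all-MiniLM-L6-v2"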

Common use case

  1. You create a collection via a dedicated script (there are separate scripts for the Jira, Confluence and local files cases). During collection creation, data is loaded onto your local machine and then indexed. Results are stored in a subfolder of ./data/collections with the name that you specify via the "--collection ${collectionName}" parameter. So a collection is just a folder with all the information needed for search: loaded documents, index files, metadata, etc. Once a collection is created, it can be used for search and update. The creation process can take a while; it depends on the number of documents in your collection and on your local machine's resources.
  2. After some time, you may want to update an existing collection to pick up new data; you can do it via a dedicated script. You will need to specify the collection name used during collection creation. A collection update reads and indexes only new/updated documents, so it should be much faster than collection creation.
  3. You can search in an existing collection via a dedicated script (see the end-to-end sketch after this list).
  4. You can set up an MCP tool for an existing collection, so an AI agent will be able to use the search.
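A minimal end-to-end sketch of that flow, using the scripts described in the sections below (the Jira URL, JQL and search query are illustrative placeholders):

# 1. Create a Jira collection (requires JIRA_TOKEN or ATLASSIAN_EMAIL/ATLASSIAN_TOKEN, see below)
uv run jira_collection_create_cmd_adapter.py --collection "jira" --url "https://jira.example.com" --jql "project = MyProjectName"

# 2. Later, pick up new/updated issues
uv run collection_update_cmd_adapter.py --collection "jira"

# 3. Search in the collection
uv run collection_search_cmd_adapter.py --collection "jira" --query "How to set up react project locally"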

How to set up and use

  1. Clone the repository
  2. Install uv: https://docs.astral.sh/uv/
  3. Navigate to the root project folder and run: uv sync

Create collection for Confluence:

  1. Set env variables needed for authentication/authorization:
  • For Confluence Server/Data Center: set CONF_TOKEN env variable with your Confluence Bearer token (optionally, you can set CONF_LOGIN and CONF_PASSWORD env variables instead with your Confluence user login and password, but the token variant is more recommended).
  • For Confluence Cloud: set ATLASSIAN_EMAIL env variable with your Atlassian account email and ATLASSIAN_TOKEN env variable with your Atlassian Cloud API token. (Generate API token at: https://id.atlassian.com/manage/api-tokens)
  2. Run a command like:
uv run confluence_collection_create_cmd_adapter.py --collection "confluence" --url "${baseConfluenceUrl}" --cql "${confluenceQuery}"

Notes:

  • The script automatically detects whether your Confluence instance is Cloud or Server/Data Center based on the URL:
    • URLs ending with .atlassian.net are treated as Confluence Cloud
    • All other URLs are treated as Confluence Server/Data Center
  • You can use different values for the "collection" parameter, but you will need to use the same value during collection updates and searches. It defines the collection name, and all collection data will be stored in a folder with that name under ./data/collections;
  • Please update ${baseConfluenceUrl} to the real Confluence base URL;
  • Please update ${confluenceQuery} to the real Confluence query, for example: "(space = 'MySpaceName') AND (created >= '2025-01-01' OR lastModified >= '2025-01-01')".
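Putting it together, a complete invocation for a hypothetical Server/Data Center instance could look like this (the token, URL and CQL are placeholders):

export CONF_TOKEN="<your Confluence Bearer token>"
uv run confluence_collection_create_cmd_adapter.py --collection "confluence" --url "https://confluence.example.com" --cql "(space = 'MySpaceName') AND (lastModified >= '2025-01-01')"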

Create collection for Jira:

  1. Set env variables needed for authentication/authorization:
  • For Jira Server/Data Center: set JIRA_TOKEN env variable with your Jira Bearer token (optionally, you can set JIRA_LOGIN and JIRA_PASSWORD env variables instead with your Jira user login and password, but the token variant is more recommended).
  • For Jira Cloud: set ATLASSIAN_EMAIL env variable with your Atlassian account email and ATLASSIAN_TOKEN env variable with your Atlassian Cloud API token. (Generate API token at: https://id.atlassian.com/manage/api-tokens)
  2. Run a command like:
uv run jira_collection_create_cmd_adapter.py --collection "jira" --url "${baseJiraUrl}" --jql "${jiraQuery}"

Notes:

  • The script automatically detects whether your Jira instance is Cloud or Server/Data Center based on the URL:
    • URLs ending with .atlassian.net are treated as Jira Cloud
    • All other URLs are treated as Jira Server/Data Center
  • You can use different values for the "collection" parameter, but you will need to use the same value during collection updates and searches. It defines the collection name, and all collection data will be stored in a folder with that name under ./data/collections;
  • Please update ${baseJiraUrl} to the real Jira base URL;
  • Please update ${jiraQuery} to the real Jira query, for example: "project = MyProjectName AND created >= -183d".
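Putting it together, a complete invocation for a hypothetical Cloud instance could look like this (the email, token and URL are placeholders):

export ATLASSIAN_EMAIL="me@example.com"
export ATLASSIAN_TOKEN="<your Atlassian Cloud API token>"
uv run jira_collection_create_cmd_adapter.py --collection "jira" --url "https://mycompany.atlassian.net" --jql "project = MyProjectName AND created >= -183d"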

Create collection for local files

  1. Run a command like:
uv run files_collection_create_cmd_adapter.py --basePath "${pathToFolderWithFiles}"

Notes:

  • Please update ${pathToFolderWithFiles} to the actual folder path.
  • By default, the collection will be named after the last folder in --basePath (for example, if --basePath is "/Users/a/b", the collection name will be "b"). You can override this by adding --collection ${collectionName}, as in all other scripts.
  • By default, if a file cannot be read, it is just skipped and written to the log. You can override this by adding the --failFast parameter, so the script will fail immediately after the first error.
  • By default, all files from ${pathToFolderWithFiles} are included (except for some predefined types, like zip, jar, etc.). You can adjust this by adding --includePatterns and --excludePatterns parameters with regexes. If you specify both --includePatterns and --excludePatterns, only files that match --includePatterns and do not match --excludePatterns will be included. Examples:
    • Example of --includePatterns (the parameter can be used multiple times): --includePatterns "subfolder1/.*" "subfolder2/.*".
    • Example of --excludePatterns (the parameter can be used multiple times): --excludePatterns "subfolder1/.*" "subfolder2/.*".
  • The script uses the Unstructured Python library, which supports many file formats such as .pdf, .pptx, .docx, etc. Some file formats may require additional software to be installed; see the Unstructured documentation for details.
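A complete invocation could look like this (a sketch; the path, collection name and patterns are illustrative placeholders):

uv run files_collection_create_cmd_adapter.py --basePath "/Users/me/Documents/knowledge-base" --collection "local-docs" --includePatterns "architecture/.*" "guides/.*" --excludePatterns ".*\.zip"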

Update existing collection:

  1. Set env variables needed for authentication/authorization (not needed for local files):
  • For Confluence Server/Data Center: set CONF_TOKEN env variable with your Confluence Bearer token (optionally, you can set CONF_LOGIN and CONF_PASSWORD env variables instead with your Confluence user login and password, but the token variant is more recommended).
  • For Confluence Cloud: set ATLASSIAN_EMAIL env variable with your Atlassian account email and ATLASSIAN_TOKEN env variable with your Atlassian Cloud API token. (Generate API token at: https://id.atlassian.com/manage/api-tokens)
  • For Jira Server/Data Center: set JIRA_TOKEN env variable with your Jira Bearer token (optionally, you can set JIRA_LOGIN and JIRA_PASSWORD env variables instead with your Jira user login and password, but the token variant is more recommended).
  • For Jira Cloud: set ATLASSIAN_EMAIL env variable with your Atlassian account email and ATLASSIAN_TOKEN env variable with your Atlassian Cloud API token. (Generate API token at: https://id.atlassian.com/manage/api-tokens)
  2. Run a command like:
uv run collection_update_cmd_adapter.py --collection "${collectionName}"

Notes:

  • Please update ${collectionName} to the real collection name (the one used during collection creation), for example: "confluence" or "jira".

Search in collection:

Run a command like:

uv run collection_search_cmd_adapter.py --collection "${collectionName}" --query "${searchQuery}"

Notes:

  • Please update ${collectionName} to the real collection name (the one used during collection creation), for example: "confluence" or "jira";
  • Please update ${searchQuery} to the text that you would like to search, for example: "How to set up react project locally";
  • You can add the "--includeMatchedChunksText" parameter to include matched chunks of a document's text in the search results;
  • You can use the "--filter" parameter to add filtering by metafields (see Filtering by metafields below).
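For example, a search in a Confluence collection with chunk texts and a space filter could look like this (a sketch; the query and space key are placeholders, and --includeMatchedChunksText is assumed to be a boolean flag):

uv run collection_search_cmd_adapter.py --collection "confluence" --query "How to set up react project locally" --includeMatchedChunksText --filter '{"space": "SPACE_KEY"}'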

Filtering by metafields

Filtering is available only for ChromaDB. Query syntax: https://cookbook.chromadb.dev/core/filters/

Confluence

Available metafields:

  • createdAt: date when the page was created;
  • createdBy: lowercased email of the user who created the page;
  • lastModifiedAt: date when the page was last updated;
  • space: space key.

Examples:

  • --filter '{"space": "SPACE_KEY"}'
  • --filter '{"$and": [{"space": "SPACE_KEY"}, {"lastModifiedAt": {"$gte": "2026-01-01"}}]}'

Jira

Available metafields:

  • createdAt: date when issue was created;
  • createdBy: lowercased user email who created an issue;
  • lastModifiedAt: last date when issue was updated;
  • project: project key (extracted from issue key);
  • type: issue type name (e.g., Bug, Task, Story);
  • epic: epic key or parent issue key;
  • priority: priority name (e.g., High, Medium, Low);
  • assignee: lowercased assignee email;
  • status: status name (e.g., Open, In Progress, Done).

Examples:

  • --filter '{"project": "PROJECT_KEY"}'
  • --filter '{"$and": [{"project": "PROJECT_KEY"}, {"lastModifiedAt": {"$gte": "2026-01-01"}}]}'

Set up MCP:

Add MCP configuration like:

{
    "servers": {
        ...
        "search_${collectionName}_stdio": {
            "type": "stdio",
            "command": "uv",
            "args": [
                "--directory",
                "${fullPathToRootProjectFolder}",
                "run",
                "collection_search_mcp_stdio_adapter.py",
                "--collection",
                "${collectionName}",
            ]
        },
        ...
    }
}

If you use the VS Code IDE with GitHub Copilot, you can add the configuration to a .vscode/mcp.json file in the root of your project. You can find more details on YouTube.

Notes:

  • Please update ${collectionName} to the real collection name (the one used during collection creation), for example: "confluence" or "jira".
  • Please update ${fullPathToRootProjectFolder} to the real full path to this project root folder.
  • It can be useful to increase the number of returned matched text chunks by setting "--maxNumberOfChunks ${number}". A bigger number generally improves search results, but too large a value may break GitHub Copilot, probably because the output no longer fits into the model context window.
  • You can use the "--filter" parameter to add filtering by metafields (check Filtering by metafields for more details). An example with these extra parameters is shown below.
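For example, a server entry for a Confluence collection with a larger chunk limit and a space filter might look like this (a sketch; the path, collection name, chunk count and space key are placeholders):

"search_confluence_stdio": {
    "type": "stdio",
    "command": "uv",
    "args": [
        "--directory",
        "/Users/me/documents-vector-search",
        "run",
        "collection_search_mcp_stdio_adapter.py",
        "--collection",
        "confluence",
        "--maxNumberOfChunks",
        "30",
        "--filter",
        "{\"space\": \"SPACE_KEY\"}"
    ]
}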

Prompt examples:

  • "Find information about AI use cases, search info on Confluence, include all used links in response"
  • "Find information about PDP carousel, search info on Jira, include all used links in response"

Collection structure

A collection is a subfolder of the ./data/collections folder. A collection folder contains all files needed to perform vector search in the collection.

A collection folder consists of:

  • the documents folder contains documents read by a reader and converted by a converter from the ./main/sources package;
  • the indexes folder contains the available indexes (usually just one index, but multiple are also supported);
  • the manifest.json file contains information about the collection, such as its name, last update time, reader details, and indexes.
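For example, a collection created with --collection "confluence" would have roughly this layout (illustrative, based on the description above):

./data/collections/confluence/
    documents/       (loaded and converted documents)
    indexes/         (the available indexes)
    manifest.json    (collection name, last update time, reader details, indexes)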

Please check the ./main/core/documents_collection_creator.py code to find most of the details about collection creation or updating.

Please check the ./main/core/documents_collection_searcher.py code to find most of the details about searching in a collection.

Other useful info

  • Collection update reads only new information, so it should be much faster than collection creation. Collection update uses information from the collection manifest file located in ./data/collections/${collectionName}/manifest.json.
  • A collection update usually reads a few more documents than were actually updated since the last run. Currently, the logic is as follows: it reads all documents that were created/updated since the "lastModifiedDocumentTime" field value from the ./data/collections/${collectionName}/manifest.json file, minus 1 day. This is done to guarantee that no document update is lost due to parallel document creation (the 1-day margin could probably be reduced to something much smaller, like a couple of seconds, but it does not look like a big deal to me and I prefer to be sure that everything is updated). The "lastModifiedDocumentTime" field contains the latest update time across all documents in the collection.
  • There is a cache mechanism for Jira/Confluence collection creation, so if you create a collection multiple times with the same parameters (url, query (JQL or CQL), etc.), documents are read from the cache located in the ./data/caches subfolder. All important parameters are collected together and hashed; the hash is used as the folder name (./data/caches/{hash}) for the cached documents. There is also a ./data/caches/{hash}_completed file that indicates whether all documents were successfully read; the cache is used only when both the ./data/caches/{hash}_completed file and the ./data/caches/{hash} folder are present. The cache is useful during testing, but it can lead to a situation where new data is not read. In such a case, you can either run the "update" script after collection creation, or remove the cache manually before collection creation, for example as shown below.
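A minimal sketch of clearing the cache before recreating a collection (this removes all cached Jira/Confluence reads; the cache is rebuilt during the next collection creation):

rm -rf ./data/caches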
