Search Engine (Inverted Index)

Introduction

This project provides a program that builds and queries an inverted index, a core data structure in information retrieval that maps each term to the documents containing it. The index is sorted and compressed before it is stored, which supports fast and accurate document retrieval.
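
To make the term-to-document mapping concrete, here is a minimal in-memory sketch in C++. It is an illustration only, not the code in form_inverted.cpp: the names InvertedIndex, normalize, addDocument, and lookup are hypothetical, and the whitespace tokenization and lowercase normalization are assumptions.

```cpp
#include <cassert>
#include <cctype>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical in-memory index: term -> sorted list of document IDs.
using InvertedIndex = std::map<std::string, std::vector<int>>;

// Lowercase a token and drop non-alphanumeric characters (assumed normalization).
std::string normalize(const std::string& token) {
    std::string out;
    for (unsigned char c : token) {
        if (std::isalnum(c)) out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Add every term of one document to the index. Documents should be added in
// increasing docId order so that each posting list stays sorted.
void addDocument(InvertedIndex& index, int docId, const std::string& text) {
    assert(docId >= 0);
    std::istringstream stream(text);
    std::string token;
    while (stream >> token) {
        std::string term = normalize(token);
        if (term.empty()) continue;
        std::vector<int>& postings = index[term];
        // Record each document at most once per term.
        if (postings.empty() || postings.back() != docId) postings.push_back(docId);
    }
}

// Return the posting list for a term, or an empty list if the term is unknown.
std::vector<int> lookup(const InvertedIndex& index, const std::string& term) {
    auto it = index.find(normalize(term));
    return it == index.end() ? std::vector<int>{} : it->second;
}
```

For example, after addDocument(index, 0, "The quick brown fox") and addDocument(index, 1, "The lazy dog"), lookup(index, "the") returns the IDs 0 and 1.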

The search engine includes:

An index creation module
Query ranking using the BM25 algorithm
A user interface for querying through a browser

Features

Index Formation: Builds an inverted index from a set of input documents.
Efficient Sorting and Compression: Uses heap sort and VarByte encoding for optimized data storage and retrieval.
Query Handling: Processes user queries with BM25 ranking, returning the top-ranked documents by relevance.
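
The VarByte encoding named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the format written by io_efficient_merge_sort.cpp: the byte layout (7 payload bits per byte, least significant group first, high bit meaning "more bytes follow"), the gap encoding of document IDs, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append one value as VarByte: 7 payload bits per byte, least significant
// group first; a set high bit means another byte follows (assumed convention).
void varByteEncode(std::uint32_t value, std::vector<std::uint8_t>& out) {
    while (value >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decode one value starting at pos and advance pos past it.
std::uint32_t varByteDecode(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint32_t value = 0;
    int shift = 0;
    while (true) {
        assert(pos < in.size());  // input must not end mid-value
        std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) break;
        shift += 7;
    }
    return value;
}

// Encode a sorted posting list as the gaps between neighbouring IDs,
// which keeps most encoded values small.
std::vector<std::uint8_t> compressPostings(const std::vector<std::uint32_t>& docIds) {
    std::vector<std::uint8_t> out;
    std::uint32_t previous = 0;
    for (std::uint32_t id : docIds) {
        assert(id >= previous);  // list must be sorted
        varByteEncode(id - previous, out);
        previous = id;
    }
    return out;
}

// Reverse compressPostings by summing the decoded gaps.
std::vector<std::uint32_t> decompressPostings(const std::vector<std::uint8_t>& bytes) {
    std::vector<std::uint32_t> docIds;
    std::uint32_t previous = 0;
    std::size_t pos = 0;
    while (pos < bytes.size()) {
        previous += varByteDecode(bytes, pos);
        docIds.push_back(previous);
    }
    return docIds;
}
```

With this layout, any gap below 128 takes a single byte, so dense posting lists shrink well compared with fixed 4-byte integers.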

Requirements

Operating System: Linux or Windows with a C++ compiler (GCC 7.5 or higher recommended).
Libraries: Standard C++ library, optional multithreading library for large datasets.
Dataset: Directory of text files for indexing.
File Permissions: Ensure read and write permissions for input, intermediate, and output directories.

Installation and Setup

Prepare Input Files: Store the text files to be indexed in a directory.

Compile: Navigate to the project directory and compile the files:

cd search-engine
g++ -o form_inverted form_inverted.cpp
g++ -o io_efficient_merge_sort io_efficient_merge_sort.cpp
g++ -o query query.cpp

Run the Program:
- Create Index: ./form_inverted
- Merge and Compress: ./io_efficient_merge_sort
- Query: ./query "your search terms"

User Interface (Optional)

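Compile the query program with C++20 enabled: g++ -std=c++20 query.cpp -o query. This flag needs a newer compiler than the GCC 7.5 minimum listed under Requirements (GCC 10 or later accepts -std=c++20).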
Start a PHP server on localhost: php -S localhost:8000.
Access the UI at http://localhost:8000.

Internal Mechanics

Modules

form_inverted.cpp: Parses documents to create the intermediate inverted index.
io_efficient_merge_sort.cpp: Merges and compresses the index, saving it in binary format.
query.cpp: Processes queries and ranks documents using BM25.
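
The merge step can be pictured as a k-way merge of sorted runs driven by a min-heap. The sketch below works on in-memory vectors for brevity, whereas an I/O-efficient version would stream runs from intermediate files. The Posting type, the (term, document ID) ordering, and the mergeRuns function are assumptions for illustration, not the actual layout used by io_efficient_merge_sort.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical posting record: (term, document ID). Pairs compare by term,
// then by document ID.
using Posting = std::pair<std::string, int>;

// Merge several individually sorted runs into one sorted sequence.
std::vector<Posting> mergeRuns(const std::vector<std::vector<Posting>>& runs) {
    // Heap entry: (next posting, (run index, offset within that run)).
    using Entry = std::pair<Posting, std::pair<std::size_t, std::size_t>>;
    // std::greater turns the default max-heap into a min-heap.
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    // Seed the heap with the first posting of every non-empty run.
    for (std::size_t r = 0; r < runs.size(); ++r) {
        if (!runs[r].empty()) heap.emplace(runs[r][0], std::make_pair(r, std::size_t{0}));
    }

    std::vector<Posting> merged;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        // If every run is sorted, the output never goes backwards.
        assert(merged.empty() || !(top.first < merged.back()));
        merged.push_back(top.first);

        // Refill the heap from the run the smallest posting came from.
        std::size_t run = top.second.first;
        std::size_t next = top.second.second + 1;
        if (next < runs[run].size()) heap.emplace(runs[run][next], std::make_pair(run, next));
    }
    return merged;
}
```

The heap holds at most one posting per run, so memory use depends on the number of runs rather than on the total index size.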

Performance

The program is designed for efficient indexing and querying. Metrics are available for runtime analysis, file size estimation, and query processing speed.

Limitations

Memory Constraints: Large datasets may strain available memory and slow down indexing and querying.
Index File Size: A high document volume can produce substantial index files.
Character Constraints: Terms containing semicolons are ignored.

Future Directions

Potential improvements include deep-learning-based ranking, parallel processing, and more advanced compression techniques.

Example Usage

./query "sample search terms"

Expected output: A ranked list of document IDs based on BM25 relevance scores.
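
The ranking behind this output can be sketched with a common formulation of BM25. This is a hedged illustration, not the scoring code in query.cpp: the parameter values k1 = 1.2 and b = 0.75 are widely used defaults rather than values taken from the project, the IDF variant (with +1 inside the logarithm) is one of several in use, and the names Bm25Params, bm25Idf, bm25TermScore, and topK are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical parameter bundle with commonly used default values.
struct Bm25Params {
    double k1 = 1.2;   // term-frequency saturation
    double b = 0.75;   // strength of document-length normalization
};

// Inverse document frequency; the +1 inside the log keeps it non-negative.
double bm25Idf(double totalDocs, double docsWithTerm) {
    assert(docsWithTerm >= 0 && docsWithTerm <= totalDocs);
    return std::log((totalDocs - docsWithTerm + 0.5) / (docsWithTerm + 0.5) + 1.0);
}

// Contribution of one query term to one document's score. A document's
// total score is the sum of this value over all query terms.
double bm25TermScore(double termFreq, double docLen, double avgDocLen,
                     double totalDocs, double docsWithTerm,
                     const Bm25Params& p = Bm25Params{}) {
    assert(avgDocLen > 0);
    double lengthNorm = 1.0 - p.b + p.b * (docLen / avgDocLen);
    double tfPart = termFreq * (p.k1 + 1.0) / (termFreq + p.k1 * lengthNorm);
    return bm25Idf(totalDocs, docsWithTerm) * tfPart;
}

// Return the k highest-scoring (docId, score) pairs, best first.
std::vector<std::pair<int, double>> topK(std::vector<std::pair<int, double>> scored,
                                         std::size_t k) {
    k = std::min(k, scored.size());
    std::partial_sort(scored.begin(),
                      scored.begin() + static_cast<std::ptrdiff_t>(k),
                      scored.end(),
                      [](const std::pair<int, double>& a, const std::pair<int, double>& b) {
                          return a.second > b.second;
                      });
    scored.resize(k);
    return scored;
}
```

Under this formulation a term scores higher when it occurs more often in a document, when the document is shorter than average, and when the term is rare across the collection.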
