Overview of Search Architecture

The search functionality in this application is powered by a custom, in-memory inverted index implementation located in app.services.search_service.SearchIndex. This system provides full-text search capabilities across bookmark titles and descriptions without requiring an external search engine like Elasticsearch or Typesense.

The Inverted Index Architecture

The core of the search system is the SearchIndex class, which maintains a mapping of tokens (words) to sets of bookmark IDs. This structure allows for efficient retrieval of bookmarks that contain specific terms.

Data Structure

The index is stored in a private attribute _index, defined as:

self._index: Dict[str, Set[str]] = defaultdict(set)

When a bookmark is indexed, its title and description are broken down into tokens. Each token becomes a key in this dictionary, and the bookmark's ID is added to the corresponding set.
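
To make the shape of this mapping concrete, here is a minimal standalone sketch. The bookmark IDs and texts are invented for illustration, and plain whitespace splitting stands in for the real tokenizer.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical sample data: bookmark ID -> combined title and description.
docs = {
    "b1": "python tutorial",
    "b2": "python packaging guide",
}

index: Dict[str, Set[str]] = defaultdict(set)
for bookmark_id, text in docs.items():
    for token in text.split():  # simplified stand-in for _tokenize
        index[token].add(bookmark_id)

print(sorted(index["python"]))    # ['b1', 'b2']
print(sorted(index["tutorial"]))  # ['b1']
```

Each key holds the set of bookmarks containing that token, so looking up a single term is one dictionary access.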

Lifecycle and Initialization

The SearchIndex is managed by the BookmarkService (found in app/services/bookmark_service.py). It is initialized once as part of the service's singleton setup:

def _init_services(self) -> None:
    """Bootstrap repository, cache, and search index."""
    self._repo = BookmarkRepository()
    self._cache: LRUCache[Bookmark] = LRUCache(max_size=256)
    self._search = SearchIndex(self._repo)

Upon initialization, the SearchIndex calls its _rebuild() method, which fetches all existing bookmarks from the BookmarkRepository and indexes them, producing a complete index before any search is served.

The Indexing Pipeline

The indexing process transforms raw text into a searchable format through several steps defined in SearchIndex.

Tokenization and Filtering

The _tokenize method handles the conversion of text into searchable terms:

Normalization: Text is converted to lowercase.
Regex Splitting: The _TOKEN_RE ([a-z0-9]+) extracts alphanumeric sequences.
Stop Word Removal: Common words that provide little search value (e.g., "the", "and", "is") are filtered out using the _STOP_WORDS set.

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it"}

def _tokenize(self, text: str) -> List[str]:
    """Split text into lowercase tokens, removing stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]
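
The behaviour of these steps on a sample string can be checked with the standalone sketch below. It assumes _TOKEN_RE is compiled with re.compile from the pattern quoted above, and it uses a module-level function named tokenize in place of the method. The input string is made up.

```python
import re
from typing import List, Set

# Assumption: the pattern from the text, compiled with re.compile.
_TOKEN_RE = re.compile(r"[a-z0-9]+")
_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but",
                         "in", "on", "at", "to", "for", "is", "it"}

def tokenize(text: str) -> List[str]:
    """Lowercase, extract alphanumeric runs, drop stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]

print(tokenize("The Guide to Python's asyncio!"))
# ['guide', 'python', 's', 'asyncio']
```

Two side effects are worth noting: punctuation splits tokens, so "Python's" yields both "python" and "s", and characters outside a-z and 0-9 act as separators, so "café" is indexed as "caf".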

Incremental Updates

The index is updated incrementally whenever a bookmark is created or modified. In BookmarkService.create_bookmark and BookmarkService.update_bookmark, the service calls self._search.index_bookmark(bookmark).

The index_bookmark method ensures consistency by first removing any existing entries for that bookmark ID before re-indexing the new content:

def index_bookmark(self, bookmark: Bookmark) -> None:
    self._remove_bookmark_from_index(bookmark.id)
    tokens = self._tokenize(f"{bookmark.title} {bookmark.description}")
    for token in tokens:
        self._index[token].add(bookmark.id)
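
The body of _remove_bookmark_from_index is not shown here. One way such a helper could work is sketched below, purely as an assumption: it scans every posting set, discards the ID, and deletes sets that become empty. The function name, the standalone index variable, and the sample data are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Set

def remove_bookmark_from_index(index: Dict[str, Set[str]], bookmark_id: str) -> None:
    """Drop bookmark_id from every posting set and delete sets left empty."""
    for token in list(index):  # copy the keys because entries may be deleted
        index[token].discard(bookmark_id)
        if not index[token]:
            del index[token]

index: Dict[str, Set[str]] = defaultdict(set)
for token in ("python", "tutorial"):
    index[token].add("b1")
index["python"].add("b2")

# Re-index b1 after its title changes to "rust tutorial".
remove_bookmark_from_index(index, "b1")
for token in ("rust", "tutorial"):
    index[token].add("b1")

print(sorted(index["python"]))  # ['b2']
```

Without the removal step, "b1" would remain in the "python" set and keep matching a term it no longer contains.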

Search and Ranking Logic

The search mechanism employs a strict "AND" strategy combined with a frequency-based ranking algorithm.

Query Processing

When a user performs a search via SearchIndex.search(query), the query string is tokenized using the same pipeline as the indexing process. The system then finds the intersection of bookmark IDs for all tokens in the query:

candidate_ids: Set[str] = self._index.get(tokens[0], set()).copy()
for token in tokens[1:]:
    candidate_ids &= self._index.get(token, set())

This means a bookmark must contain all tokens from the search query to be considered a match.
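
The intersection logic can be exercised in isolation with a small sketch. The index contents and the candidates function are hypothetical. The excerpt above does not show how an empty token list is handled, so the guard below is an assumption; without one, tokens[0] would raise IndexError for a query made up only of stop words.

```python
from typing import Dict, List, Set

# Hypothetical index contents.
index: Dict[str, Set[str]] = {
    "python": {"b1", "b2"},
    "tutorial": {"b1"},
}

def candidates(tokens: List[str]) -> Set[str]:
    if not tokens:  # assumed guard, not shown in the original excerpt
        return set()
    ids = index.get(tokens[0], set()).copy()  # copy so the index is not mutated
    for token in tokens[1:]:
        ids &= index.get(token, set())
    return ids

print(candidates(["python", "tutorial"]))  # {'b1'}
print(candidates(["python", "rust"]))      # set()
```

A single unknown token empties the result, which is the practical consequence of the strict "AND" strategy.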

Relevance Ranking

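Once candidate bookmarks are retrieved from the repository, they are ranked using the _rank_results static method. The score for a bookmark is the total number of times the query tokens occur in its lowercased title and description. Because the count uses str.count, it measures substring occurrences rather than whole-token matches: the token "cat" also scores a hit inside "category". Bookmarks with equal scores keep their input order, since sorted is stable.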

@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Rank results by number of token occurrences in title + description."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)

    return sorted(bookmarks, key=score, reverse=True)
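
The scoring can be tried on two made-up bookmarks with the sketch below. The Bookmark dataclass is a hypothetical stand-in with only the two fields the scorer reads, and rank_results is a module-level copy of the method above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Bookmark:  # hypothetical stand-in for the real model
    title: str
    description: str

def rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)

a = Bookmark("Cat care", "Feeding your cat")
b = Bookmark("Cat category catalog", "Concatenate files")

ranked = rank_results([a, b], ["cat"])
print([x.title for x in ranked])
# ['Cat category catalog', 'Cat care']
```

The second bookmark wins with a score of 4 against 2, because "cat" is counted inside "category", "catalog", and "concatenate" as well as on its own.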

Integration and Constraints

Integration Points

The search functionality is exposed to the API through the BookmarkService.search method, which is called by the route handler in app/routes/bookmarks.py:

@bookmarks_bp.route("/search", methods=["GET"])
def search_bookmarks():
    query = request.args.get("q", "")
    limit = request.args.get("limit", 20, type=int)
    results = _service.search(query, limit=limit)
    return jsonify({"results": [b.to_dict() for b in results], "count": len(results)})

Important Considerations

In-Memory Only: The index is not persisted to disk. It is entirely rebuilt from the database every time the application starts.
Soft Deletes: BookmarkService.delete_bookmark currently marks a bookmark as "trashed" (a status update) but does not remove it from the SearchIndex. Because SearchIndex.search loads each matching bookmark from the repository before returning it, trashed bookmarks may still appear in search results if the repository's get_bookmark method returns them.
Scale: As noted in the app/services/search_service.py module docstring, this implementation is intended for small datasets. For larger production environments, this component is designed to be replaced by a dedicated search engine.
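
One possible mitigation for the soft-delete issue is to filter trashed bookmarks at query time. The sketch below is only an illustration: the status field, its "trashed" value, the visible_results helper, and the dictionary standing in for the repository are all assumptions, not the application's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Bookmark:  # hypothetical stand-in for the real model
    id: str
    title: str
    status: str = "active"  # assumed field name and values

def visible_results(candidate_ids: Set[str], store: Dict[str, Bookmark]) -> List[Bookmark]:
    """Resolve candidate IDs, skipping missing and trashed bookmarks."""
    results = []
    for bookmark_id in sorted(candidate_ids):
        bookmark = store.get(bookmark_id)
        if bookmark is not None and bookmark.status != "trashed":
            results.append(bookmark)
    return results

store = {
    "b1": Bookmark("b1", "Python tutorial"),
    "b2": Bookmark("b2", "Python guide", status="trashed"),
}
visible = visible_results({"b1", "b2", "b3"}, store)
print([b.id for b in visible])  # ['b1']
```

The alternative is to remove the bookmark's ID from the index when it is trashed, which keeps stale IDs out of the posting sets but requires re-indexing if the bookmark is later restored.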
