Overview of Search Architecture
The search functionality in this application is powered by a custom, in-memory inverted index implementation located in app.services.search_service.SearchIndex. This system provides full-text search capabilities across bookmark titles and descriptions without requiring an external search engine like Elasticsearch or Typesense.
The Inverted Index Architecture
The core of the search system is the SearchIndex class, which maintains a mapping of tokens (words) to sets of bookmark IDs. This structure allows for efficient retrieval of bookmarks that contain specific terms.
Data Structure
The index is stored in a private attribute _index, defined as:
self._index: Dict[str, Set[str]] = defaultdict(set)
When a bookmark is indexed, its title and description are broken down into tokens. Each token becomes a key in this dictionary, and the bookmark's ID is added to the corresponding set.
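As a minimal sketch of this structure (the bookmark IDs and token lists below are made up for illustration and are not taken from the application), populating a token-to-ID mapping might look like this:

```python
from collections import defaultdict
from typing import Dict, List, Set

# Token -> set of bookmark IDs, mirroring the shape of SearchIndex._index.
index: Dict[str, Set[str]] = defaultdict(set)

# Hypothetical bookmarks, already reduced to tokens for brevity.
docs: Dict[str, List[str]] = {
    "bm-1": ["python", "tutorial"],
    "bm-2": ["python", "packaging", "guide"],
}

for bookmark_id, tokens in docs.items():
    for token in tokens:
        index[token].add(bookmark_id)

print(sorted(index["python"]))     # ['bm-1', 'bm-2']
print(sorted(index["packaging"]))  # ['bm-2']
```

Using sets rather than lists means a token that appears several times in one bookmark still yields a single entry for that bookmark.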
Lifecycle and Initialization
The SearchIndex is managed by the BookmarkService (found in app/services/bookmark_service.py). It is initialized once as part of the service's singleton setup:
def _init_services(self) -> None:
    """Bootstrap repository, cache, and search index."""
    self._repo = BookmarkRepository()
    self._cache: LRUCache[Bookmark] = LRUCache(max_size=256)
    self._search = SearchIndex(self._repo)
Upon initialization, the SearchIndex performs a full rebuild by fetching all existing bookmarks from the BookmarkRepository via the _rebuild() method.
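The body of _rebuild() is not shown here, so the following is only a sketch of what a rebuild-on-construction pattern could look like. The InMemoryRepo class, its list_all method, the TinyIndex class, and the whitespace-based tokenization are all assumptions made for illustration, not the application's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set


@dataclass
class Bookmark:
    """Simplified stand-in for the application's Bookmark model."""
    id: str
    title: str
    description: str = ""


class InMemoryRepo:
    """Hypothetical stand-in for BookmarkRepository."""

    def __init__(self, bookmarks: Iterable[Bookmark]) -> None:
        self._bookmarks = list(bookmarks)

    def list_all(self) -> List[Bookmark]:  # assumed method name
        return list(self._bookmarks)


class TinyIndex:
    """Illustrative index that rebuilds itself when constructed."""

    def __init__(self, repo: InMemoryRepo) -> None:
        self._repo = repo
        self._index: Dict[str, Set[str]] = defaultdict(set)
        self._rebuild()

    def _rebuild(self) -> None:
        # Start from an empty index, then re-add every stored bookmark.
        self._index.clear()
        for bookmark in self._repo.list_all():
            text = f"{bookmark.title} {bookmark.description}".lower()
            for token in text.split():  # simplified tokenization
                self._index[token].add(bookmark.id)


idx = TinyIndex(InMemoryRepo([Bookmark("bm-1", "Flask docs", "web framework")]))
print(sorted(idx._index["flask"]))  # ['bm-1']
```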
The Indexing Pipeline
The indexing process transforms raw text into a searchable format through several steps defined in SearchIndex.
Tokenization and Filtering
The _tokenize method handles the conversion of text into searchable terms:
- Normalization: Text is converted to lowercase.
- Regex Splitting: The _TOKEN_RE pattern ([a-z0-9]+) extracts alphanumeric sequences.
- Stop Word Removal: Common words that provide little search value (e.g., "the", "and", "is") are filtered out using the _STOP_WORDS set.
_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it"}

def _tokenize(self, text: str) -> List[str]:
    """Split text into lowercase tokens, removing stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]
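To see what this pipeline produces, here is a standalone version of the same logic. The compiled _TOKEN_RE below is an assumption based on the [a-z0-9]+ pattern quoted above, and the free function tokenize is a stand-in for the method.

```python
import re
from typing import List, Set

# Assumed definition; the source only quotes the pattern [a-z0-9]+.
_TOKEN_RE = re.compile(r"[a-z0-9]+")
_STOP_WORDS: Set[str] = {
    "the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it",
}


def tokenize(text: str) -> List[str]:
    """Lowercase, extract alphanumeric runs, then drop stop words."""
    tokens = _TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in _STOP_WORDS]


print(tokenize("The Quick-Start Guide to Python 3 and Flask"))
# ['quick', 'start', 'guide', 'python', '3', 'flask']
```

Because the pattern only matches letters and digits, punctuation acts as a separator: "Quick-Start" becomes two tokens, and a string made up entirely of stop words yields an empty list.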
Incremental Updates
The index is updated incrementally whenever a bookmark is created or modified. In BookmarkService.create_bookmark and BookmarkService.update_bookmark, the service calls self._search.index_bookmark(bookmark).
The index_bookmark method ensures consistency by first removing any existing entries for that bookmark ID before re-indexing the new content:
def index_bookmark(self, bookmark: Bookmark) -> None:
    self._remove_bookmark_from_index(bookmark.id)
    tokens = self._tokenize(f"{bookmark.title} {bookmark.description}")
    for token in tokens:
        self._index[token].add(bookmark.id)
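The implementation of _remove_bookmark_from_index is not shown here. One plausible approach, sketched below with hypothetical free functions and made-up IDs, is to scan every token's set and discard the ID. This illustrates why the remove-then-add order matters when a bookmark's text changes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

index: Dict[str, Set[str]] = defaultdict(set)


def remove_bookmark(bookmark_id: str) -> None:
    """Drop the ID from every token's set; delete sets that become empty."""
    for token in list(index):  # copy the keys because entries are deleted below
        index[token].discard(bookmark_id)
        if not index[token]:
            del index[token]


def index_bookmark(bookmark_id: str, tokens: Iterable[str]) -> None:
    """Re-index a bookmark, clearing any stale entries first."""
    remove_bookmark(bookmark_id)
    for token in tokens:
        index[token].add(bookmark_id)


index_bookmark("bm-1", ["python", "tutorial"])
index_bookmark("bm-1", ["rust", "tutorial"])  # same bookmark after an edit

print("python" in index)      # False
print(sorted(index["rust"]))  # ['bm-1']
```

Without the removal step, the stale "python" entry would keep matching the bookmark after its title no longer contains that word. A full scan costs time proportional to the vocabulary size, which is acceptable for the small datasets this index targets.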
Search and Ranking Logic
The search mechanism employs a strict "AND" strategy combined with a frequency-based ranking algorithm.
Query Processing
When a user performs a search via SearchIndex.search(query), the query string is tokenized using the same pipeline as the indexing process. The system then finds the intersection of bookmark IDs for all tokens in the query:
candidate_ids: Set[str] = self._index.get(tokens[0], set()).copy()
for token in tokens[1:]:
    candidate_ids &= self._index.get(token, set())
This means a bookmark must contain all tokens from the search query to be considered a match.
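A small worked example of the intersection, using a hand-built index with made-up IDs. The empty-token guard is an assumption: the excerpt above indexes tokens[0] directly, so the real method presumably handles an empty token list before reaching that line.

```python
from typing import Dict, List, Set

# Hand-built index for illustration only.
index: Dict[str, Set[str]] = {
    "python": {"bm-1", "bm-2"},
    "tutorial": {"bm-1", "bm-3"},
}


def candidates(tokens: List[str]) -> Set[str]:
    """Return IDs of bookmarks that contain every token."""
    if not tokens:  # assumed guard: an empty query matches nothing
        return set()
    # Copy the first set so the in-place intersection never mutates the index.
    result = index.get(tokens[0], set()).copy()
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


print(candidates(["python", "tutorial"]))  # {'bm-1'}
print(candidates(["python", "golang"]))    # set()
```

A single unknown token empties the result, which is the expected behavior of a strict AND query.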
Relevance Ranking
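Once candidate bookmarks are retrieved from the repository, they are ranked using the _rank_results static method. The score for a bookmark is the total number of times the query tokens occur in its lowercased title and description. Because the count uses str.count, it matches substrings rather than whole tokens, so the token "art" is also counted inside words such as "start":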
@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Rank results by number of token occurrences in title + description."""
    def score(b: Bookmark) -> int:
        text = f"{b.title} {b.description}".lower()
        return sum(text.count(t) for t in tokens)
    return sorted(bookmarks, key=score, reverse=True)
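The scoring can be tried in isolation. The sketch below copies the ranking logic into a free function and uses a simplified Bookmark dataclass with made-up data, both of which are stand-ins for the application's own types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bookmark:
    """Simplified stand-in for the application's Bookmark model."""
    id: str
    title: str
    description: str = ""


def score(bookmark: Bookmark, tokens: List[str]) -> int:
    """Count occurrences of each token in the lowercased title + description."""
    text = f"{bookmark.title} {bookmark.description}".lower()
    return sum(text.count(t) for t in tokens)


def rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
    """Sort bookmarks by descending score."""
    return sorted(bookmarks, key=lambda b: score(b, tokens), reverse=True)


items = [
    Bookmark("bm-1", "Art history", "Art movements explained"),
    Bookmark("bm-2", "Art supplies", "Getting started with smart shopping"),
]

print([score(b, ["art"]) for b in items])             # [2, 3]
print([b.id for b in rank_results(items, ["art"])])   # ['bm-2', 'bm-1']
```

Here bm-2 outranks bm-1 even though it mentions the word "art" only once, because "started" and "smart" each contribute a substring match. Since sorted is stable, bookmarks with equal scores keep their incoming order.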
Integration and Constraints
Integration Points
The search functionality is exposed to the API through the BookmarkService.search method, which is called by the route handler in app/routes/bookmarks.py:
@bookmarks_bp.route("/search", methods=["GET"])
def search_bookmarks():
    query = request.args.get("q", "")
    limit = request.args.get("limit", 20, type=int)
    results = _service.search(query, limit=limit)
    return jsonify({"results": [b.to_dict() for b in results], "count": len(results)})
Important Considerations
- In-Memory Only: The index is not persisted to disk. It is entirely rebuilt from the database every time the application starts.
- Soft Deletes: In the current implementation of BookmarkService.delete_bookmark, bookmarks are "trashed" (a status update) but not explicitly removed from the SearchIndex. Since SearchIndex.search retrieves the bookmark object from the repository to return it, these trashed bookmarks may still appear in search results if the repository's get_bookmark method returns them.
- Scale: As noted in the app/services/search_service.py module docstring, this implementation is intended for small datasets. For larger production environments, this component is designed to be replaced by a dedicated search engine.
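One possible way to address the soft-delete caveat is to filter trashed bookmarks out when candidate IDs are resolved. The sketch below is not part of the current code: the status field, its "active" and "trashed" values, the visible_results helper, and the dictionary lookup standing in for the repository are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Bookmark:
    """Simplified stand-in; the status field name and values are assumptions."""
    id: str
    title: str
    status: str = "active"


def visible_results(candidate_ids: Set[str], lookup: Dict[str, Bookmark]) -> List[Bookmark]:
    """Resolve candidate IDs, skipping missing and trashed bookmarks."""
    results: List[Bookmark] = []
    for bookmark_id in sorted(candidate_ids):
        bookmark = lookup.get(bookmark_id)
        if bookmark is not None and bookmark.status != "trashed":
            results.append(bookmark)
    return results


lookup = {
    "bm-1": Bookmark("bm-1", "Python tutorial"),
    "bm-2": Bookmark("bm-2", "Old Python notes", status="trashed"),
}

print([b.id for b in visible_results({"bm-1", "bm-2", "bm-9"}, lookup)])  # ['bm-1']
```

An alternative would be to remove the bookmark's entries from the index at delete time, which keeps searches cheap but requires re-indexing if a trashed bookmark is later restored.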