Skip to main content

Search & Indexing

The search functionality in this project is implemented as an in-memory inverted index that provides full-text search capabilities across bookmark titles and descriptions. This design prioritizes speed and simplicity for small-to-medium datasets by avoiding the overhead of external search engines like Elasticsearch or Typesense.

The Inverted Index Architecture

The core of the search system is the SearchIndex class located in app/services/search_service.py. It maintains a mapping of lowercase tokens to sets of bookmark identifiers:

class SearchIndex:
def __init__(self, repository: "BookmarkRepository") -> None:
self._repo = repository
self._index: Dict[str, Set[str]] = defaultdict(set)
self._rebuild()

When the application starts, the BookmarkService singleton initializes the SearchIndex, which triggers a full rebuild by fetching all existing bookmarks from the repository. This ensures the in-memory state is synchronized with the persistent storage.

Tokenization and Normalization

Before text is indexed or searched, it undergoes a normalization process in the _tokenize method. The implementation uses a regular expression [a-z0-9]+ to extract alphanumeric tokens and filters out common English stop words (e.g., "the", "and", "is") defined in the _STOP_WORDS set.

_STOP_WORDS: Set[str] = {"the", "a", "an", "and", "or", "but", "in", "on", "at", "to", "for", "is", "it"}
_TOKEN_RE = re.compile(r"[a-z0-9]+")

def _tokenize(self, text: str) -> List[str]:
"""Split text into lowercase tokens, removing stop words."""
tokens = _TOKEN_RE.findall(text.lower())
return [t for t in tokens if t not in _STOP_WORDS]

Search Logic and Ranking

The search implementation in SearchIndex.search follows a strict AND-based matching strategy. For a bookmark to appear in the results, it must contain all tokens present in the search query.

Intersection-based Matching

The algorithm starts with the set of bookmark IDs associated with the first token and iteratively intersects it with the sets for subsequent tokens:

candidate_ids: Set[str] = self._index.get(tokens[0], set()).copy()
for token in tokens[1:]:
candidate_ids &= self._index.get(token, set())

This approach ensures high precision but means that a query like "python tutorial" will not return bookmarks that only contain "python" or only contain "tutorial".

Frequency-based Ranking

Once the candidate bookmarks are identified, they are ranked using a simple frequency-based scoring mechanism in _rank_results. The score is calculated by counting the total occurrences of all query tokens within the combined string of the bookmark's title and description:

@staticmethod
def _rank_results(bookmarks: List[Bookmark], tokens: List[str]) -> List[Bookmark]:
def score(b: Bookmark) -> int:
text = f"{b.title} {b.description}".lower()
return sum(text.count(t) for t in tokens)

return sorted(bookmarks, key=score, reverse=True)

Integration with Bookmark Lifecycle

The BookmarkService acts as a facade that orchestrates updates between the repository and the search index.

Incremental Updates

When a bookmark is created or updated, the service automatically updates the index. In update_bookmark, the index entry is refreshed by first removing the old entries and then re-indexing the new content:

# app/services/bookmark_service.py
def update_bookmark(self, bookmark_id: str, data: Dict[str, Any]) -> Tuple[Optional[Bookmark], Optional[str]]:
# ... update logic ...
self._repo.save_bookmark(bookmark)
self._search.index_bookmark(bookmark) # Updates the inverted index
self._cache.invalidate(bookmark.id)
return bookmark, None

The "Soft-Delete" Constraint

A notable design choice in the current implementation is how "trashed" (soft-deleted) bookmarks are handled. The delete_bookmark method in BookmarkService marks a bookmark as trashed but does not remove it from the SearchIndex:

def delete_bookmark(self, bookmark_id: str) -> bool:
bookmark = self._repo.get_bookmark(bookmark_id)
if not bookmark:
return False
bookmark.trash()
self._repo.save_bookmark(bookmark)
self._cache.invalidate(bookmark_id)
return True

Because the SearchIndex is not notified of the status change, trashed bookmarks will still appear in search results. This reflects a tradeoff where the search index remains a pure text-to-ID mapping, leaving status filtering to the caller or the repository layer, though the current search implementation does not filter by status.

Tradeoffs and Limitations

  1. Memory Usage: Since the entire index is held in memory, the application's RAM footprint grows linearly with the number of unique tokens and bookmarks. The _rebuild method uses a hardcoded limit of 10,000 bookmarks, suggesting this implementation is intended for personal or small-team use.
  2. Startup Latency: The index is rebuilt every time the application starts. For large datasets, this could lead to significant delays before the API is ready to serve requests.
  3. Consistency: While create and update operations are synchronized, the lack of index removal during soft-deletion (trashing) creates a discrepancy between the search results and the active bookmark list.
  4. No Partial Matching: The regex-based tokenization and exact token matching mean that searching for "pyth" will not match "python". Users must provide full words as they appear in the text.