Indexing and Updating Bookmarks
To manage the bookmark search index manually, use the SearchIndex class. This service maintains an in-memory inverted index that maps each token found in a bookmark's title or description to the set of IDs of the bookmarks containing it, which enables full-text search.
Indexing or Updating a Bookmark
You can add a new bookmark to the index or update an existing one using the index_bookmark method. If the bookmark ID already exists in the index, it is first removed and then re-indexed with the latest content.
from app.services.search_service import SearchIndex
from app.models.bookmark import Bookmark
# Assuming 'repo' is an instance of BookmarkRepository
search_index = SearchIndex(repository=repo)
# Create or retrieve a bookmark
bookmark = Bookmark(
    id="b123",
    url="https://example.com",
    title="Example Site",
    description="A useful site for testing search indexing.",
)
# Add or update the bookmark in the index
search_index.index_bookmark(bookmark)
The index_bookmark method performs the following:
- Calls _remove_bookmark_from_index to clear any existing tokens associated with the bookmark ID.
- Tokenizes the combined string of bookmark.title and bookmark.description.
- Adds the bookmark ID to the set of IDs for each generated token.
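The three steps above can be illustrated with a minimal, self-contained sketch. It is not the actual SearchIndex code: TinyIndex, its method names, and the token pattern are assumptions made for illustration, and the real tokenizer may behave differently (for example, around stop words).

```python
import re

# Assumed token pattern; the real _TOKEN_RE may differ.
TOKEN_RE = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index mapping token -> set of bookmark IDs."""

    def __init__(self) -> None:
        self.tokens: dict[str, set[str]] = {}

    def _remove(self, bookmark_id: str) -> None:
        # Drop the ID from every token set and delete sets left empty.
        for token in list(self.tokens):
            self.tokens[token].discard(bookmark_id)
            if not self.tokens[token]:
                del self.tokens[token]

    def index(self, bookmark_id: str, title: str, description: str) -> None:
        self._remove(bookmark_id)  # 1. clear tokens from any earlier version
        combined = f"{title} {description}".lower()
        for token in TOKEN_RE.findall(combined):  # 2. tokenize title + description
            self.tokens.setdefault(token, set()).add(bookmark_id)  # 3. record the ID


idx = TinyIndex()
idx.index("b123", "Example Site", "A useful site for testing search indexing.")
idx.index("b123", "Renamed Page", "Fresh content.")  # update replaces old tokens
print(sorted(idx.tokens))  # ['content', 'fresh', 'page', 'renamed']
```

Removing before re-indexing is what keeps stale tokens such as "example" from matching the bookmark after its title changes.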
Removing a Bookmark from the Index
To stop a bookmark from appearing in search results, use the remove_bookmark method.
# Remove a bookmark by its ID
search_index.remove_bookmark("b123")
This method removes the bookmark ID from all token sets in the index. If a token set becomes empty after removal, the token itself is deleted from the index to save memory.
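As a rough illustration of that pruning behavior, the sketch below works on a plain dictionary rather than the real index. The function name remove_from_index and the sample data are hypothetical.

```python
# Toy inverted index: token -> set of bookmark IDs (hypothetical data).
index: dict[str, set[str]] = {
    "example": {"b123", "b456"},
    "testing": {"b123"},
}


def remove_from_index(index: dict[str, set[str]], bookmark_id: str) -> None:
    """Remove bookmark_id from every token set, deleting emptied tokens."""
    # Iterate over a snapshot of the keys so entries can be deleted safely.
    for token in list(index):
        index[token].discard(bookmark_id)
        if not index[token]:
            del index[token]


remove_from_index(index, "b123")
print(index)  # {'example': {'b456'}}
```

The "testing" entry disappears because b123 was its only bookmark, while "example" survives with the remaining ID.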
How Search Works
The SearchIndex.search method uses AND-logic for multi-token queries. A bookmark is returned only if every token in the query appears somewhere in its title or description.
# Search for bookmarks containing both "example" AND "testing"
results = search_index.search("example testing", limit=10)
for bookmark in results:
    print(f"Found: {bookmark.title}")
- Tokenization: The query is lowercased and split into tokens using _TOKEN_RE. Stop words (e.g., "the", "and", "is") are filtered out.
- Intersection: The index finds the set of bookmark IDs for each token and performs an intersection.
- Ranking: Results are ranked by relevance using _rank_results, which calculates a score based on the total number of occurrences of the query tokens in the bookmark's title and description.
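The pipeline above (tokenize, intersect, rank) can be sketched as follows. This is a simplified stand-in, not the real search method: the token pattern, the stop-word set, the sample documents, and the alphabetical tie-break are all assumptions.

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9]+")   # assumed pattern, not the real _TOKEN_RE
STOP_WORDS = {"the", "and", "is"}     # only the examples named in the text

# Hypothetical bookmarks: ID -> title and description combined.
docs = {
    "b1": "Example Site A useful site for testing search indexing",
    "b2": "Testing notes Example example example of testing",
    "b3": "Example only",
}


def tokenize(text: str) -> list[str]:
    """Lowercase, split into tokens, and drop stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOP_WORDS]


# Build the inverted index: token -> set of bookmark IDs.
index: dict[str, set[str]] = {}
for doc_id, text in docs.items():
    for token in tokenize(text):
        index.setdefault(token, set()).add(doc_id)


def search(query: str, limit: int = 10) -> list[str]:
    tokens = tokenize(query)
    if not tokens:
        return []
    # AND-logic: keep only IDs present in every token's set.
    matches = set.intersection(*(index.get(t, set()) for t in tokens))

    def score(doc_id: str) -> int:
        # Total occurrences of the query tokens in the document text.
        doc_tokens = tokenize(docs[doc_id])
        return sum(doc_tokens.count(t) for t in tokens)

    # Highest score first; ties broken by ID for a stable order.
    return sorted(matches, key=lambda d: (-score(d), d))[:limit]


print(search("example testing"))  # ['b2', 'b1']
```

Here b3 is excluded because it lacks "testing", and b2 outranks b1 because the query tokens occur five times in it versus twice in b1.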
Troubleshooting and Limitations
Soft-Deletes and Status Changes
In the current implementation of BookmarkService, operations like delete_bookmark (which moves a bookmark to the trash), archive_bookmark, and restore_bookmark do not automatically update the SearchIndex.
If you need the search index to reflect these status changes, you must manually call remove_bookmark or index_bookmark:
# Example: Manually removing a trashed bookmark from search
if service.delete_bookmark(bookmark_id):
    search_index.remove_bookmark(bookmark_id)
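One way to avoid repeating that pattern at every call site is a thin wrapper that forwards to the service and then updates the index. The sketch below is hypothetical: SyncedBookmarkService is not part of the codebase, the stand-in classes exist only to make the example runnable, and it assumes restore_bookmark returns the restored bookmark (or None), which the real BookmarkService may not do.

```python
class SyncedBookmarkService:
    """Hypothetical wrapper that mirrors status changes into the search index."""

    def __init__(self, service, search_index):
        self._service = service
        self._index = search_index

    def delete_bookmark(self, bookmark_id):
        deleted = self._service.delete_bookmark(bookmark_id)
        if deleted:
            self._index.remove_bookmark(bookmark_id)
        return deleted

    def restore_bookmark(self, bookmark_id):
        # Assumption: the service returns the restored bookmark, or None.
        bookmark = self._service.restore_bookmark(bookmark_id)
        if bookmark is not None:
            self._index.index_bookmark(bookmark)
        return bookmark


# Minimal stand-ins for BookmarkService and SearchIndex (illustration only).
class FakeService:
    def __init__(self):
        self.active = {"b123": {"id": "b123", "title": "Example Site"}}
        self.trash = {}

    def delete_bookmark(self, bookmark_id):
        if bookmark_id not in self.active:
            return False
        self.trash[bookmark_id] = self.active.pop(bookmark_id)
        return True

    def restore_bookmark(self, bookmark_id):
        bookmark = self.trash.pop(bookmark_id, None)
        if bookmark is not None:
            self.active[bookmark_id] = bookmark
        return bookmark


class FakeIndex:
    def __init__(self):
        self.indexed = {"b123"}

    def index_bookmark(self, bookmark):
        self.indexed.add(bookmark["id"])

    def remove_bookmark(self, bookmark_id):
        self.indexed.discard(bookmark_id)


index = FakeIndex()
synced = SyncedBookmarkService(FakeService(), index)
synced.delete_bookmark("b123")
print(index.indexed)  # set()
synced.restore_bookmark("b123")
print(index.indexed)  # {'b123'}
```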
In-Memory Persistence
The SearchIndex lives entirely in memory and is not persisted. It is rebuilt from the BookmarkRepository each time a SearchIndex is instantiated (typically when the BookmarkService singleton is initialized).
# Inside SearchIndex.__init__
self._rebuild()
The _rebuild method fetches up to 10,000 bookmarks from the repository to populate the index.
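To make the effect of that cap concrete, here is a small sketch under stated assumptions: FakeRepository and its list_bookmarks(limit=...) method are invented for the example and may not match the real BookmarkRepository API, and tokenization is reduced to a lowercase whitespace split. If the real repository holds more bookmarks than the cap, the remainder would presumably be left out of the index in the same way.

```python
REBUILD_LIMIT = 10_000  # the cap mentioned in the text


class FakeRepository:
    """Hypothetical repository; the real BookmarkRepository API may differ."""

    def __init__(self, count: int) -> None:
        self._rows = [(f"b{i}", f"Bookmark {i}") for i in range(count)]

    def list_bookmarks(self, limit: int) -> list[tuple[str, str]]:
        return self._rows[:limit]


def rebuild(repository: FakeRepository, limit: int = REBUILD_LIMIT) -> dict[str, set[str]]:
    """Build a fresh token -> IDs index from at most `limit` bookmarks."""
    index: dict[str, set[str]] = {}
    for bookmark_id, title in repository.list_bookmarks(limit=limit):
        for token in title.lower().split():
            index.setdefault(token, set()).add(bookmark_id)
    return index


index = rebuild(FakeRepository(count=12_000))
print(len(index["bookmark"]))  # 10000
```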
Field Limitations
The index only considers the title and description fields. Tags, URLs, and collection names are not currently indexed for search.