Advanced Search Guide

Readur's search goes well beyond simple text matching. This guide covers the search modes, advanced filtering, query syntax, and optimization techniques.

Overview

Readur's search system is built on PostgreSQL's full-text search capabilities with additional enhancements for document-specific requirements.

Search Capabilities

  • Full-Text Search: Search within document content and OCR-extracted text
  • Multiple Search Modes: Simple, phrase, fuzzy, and boolean search options
  • Advanced Filtering: Filter by file type, date, size, labels, and source
  • Real-Time Suggestions: Auto-complete and query suggestions as you type
  • Faceted Search: Browse documents by categories and properties
  • Cross-Language Support: Search in multiple languages with OCR text
  • Relevance Ranking: Intelligent scoring and result ordering

Search Sources

Readur searches across multiple content sources:

  1. Document Content: Original text from text files and PDFs
  2. OCR Text: Extracted text from images and scanned documents
  3. Metadata: File names, descriptions, and document properties
  4. Labels: User-created and system-generated tags
  5. Source Information: Upload source and file paths

Search Modes

Simple Search

Best for: General purpose searching and quick document discovery

How it works:

  • Automatically applies stemming and fuzzy matching
  • Searches across all text content and metadata
  • Provides intelligent relevance scoring
  • Handles common typos and variations

Example:

invoice 2024

Finds: "Invoice Q1 2024", "invoicing for 2024", "2024 invoice data"

Features:

  • Auto-stemming: "running" matches "run", "runs", "runner"
  • Fuzzy tolerance: "recieve" matches "receive"
  • Partial matching: "doc" matches "document", "documentation"
  • Relevance ranking: More relevant matches appear first

Phrase Search (Exact Match)

Best for: Finding exact phrases or specific terminology

How it works:

  • Searches for the exact sequence of words
  • Case-insensitive but order-sensitive
  • Useful for finding specific quotes, names, or technical terms

Syntax: Use quotes around the phrase

"quarterly financial report"
"John Smith"
"error code 404"

Features:

  • Exact word order: Only matches the precise sequence
  • Case insensitive: "John Smith" matches "john smith"
  • Punctuation ignored: "error-code" matches "error code"
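
The phrase rules above (exact word order, case-insensitive, punctuation ignored) can be illustrated with a small Python sketch. This is not Readur's implementation; the function names are hypothetical, and the tokenizer only keeps ASCII letters and digits, which is a simplification.

```python
import re


def normalize(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_matches(phrase, document):
    """Return True if the phrase's words appear consecutively and in order."""
    needle = normalize(phrase)
    haystack = normalize(document)
    if not needle:
        return False
    n = len(needle)
    # Slide a window of n words over the document and compare word lists.
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))
```

With this sketch, `phrase_matches("error code", "Got error-code 404")` is true because the hyphen is treated as a word separator, while `phrase_matches("Smith John", "John Smith")` is false because the order differs.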

Fuzzy Search (Approximate Matching)

Best for: Handling typos, OCR errors, and spelling variations

How it works:

  • Uses trigram similarity to find approximate matches
  • Configurable similarity threshold (default: 0.8)
  • Particularly useful for OCR-processed documents with errors

Syntax: Use the ~ operator

invoice~     # Finds "invoice", "invoce", "invoise"
contract~    # Finds "contract", "contarct", "conract"

Configuration:

  • Threshold adjustment: Configure sensitivity via user settings
  • Language-specific: Different languages may need different thresholds
  • OCR optimization: Higher tolerance for OCR-processed documents
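
To give a feel for what a trigram similarity score means, here is a minimal Python sketch that follows the convention used by PostgreSQL's pg_trgm extension for a single word (lowercase, two leading spaces and one trailing space of padding, shared trigrams divided by all distinct trigrams). Whether Readur scores candidates exactly this way is an assumption, and `fuzzy_match` is a hypothetical helper.

```python
def trigrams(word):
    """Set of 3-character windows over the padded, lowercased word."""
    padded = "  " + word.lower() + " "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def fuzzy_match(term, candidate, threshold=0.8):
    """True when the similarity reaches the threshold (0.8 mirrors the default above)."""
    return similarity(term, candidate) >= threshold
```

Under this definition `similarity("invoice", "invoce")` is 0.5 (5 shared trigrams out of 10 distinct ones), so whether a given misspelling is returned depends on the configured threshold and on how Readur actually computes the score. Lowering the threshold in user settings makes matching more tolerant.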

Boolean Search (Logical Operators)

Best for: Complex queries with multiple conditions and precise control

Operators:

  • AND: Both terms must be present
  • OR: Either term can be present
  • NOT: Exclude documents with the term
  • Parentheses: Group conditions

Examples:

budget AND 2024                    # Both "budget" and "2024"
invoice OR receipt                  # Either "invoice" or "receipt"
contract NOT draft                  # "contract" but not "draft"
(budget OR financial) AND 2024      # Complex grouping
marketing AND (campaign OR strategy) # Marketing documents about campaigns or strategy

Advanced Boolean Examples:

# Find completed project documents
project AND (final OR completed OR approved) NOT draft

# Financial documents excluding personal items
(invoice OR receipt OR budget) NOT personal

# Recent important documents
(urgent OR priority OR critical) AND date:this-month
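
The operator semantics can be made concrete with a small Python evaluator. It is only a sketch under stated assumptions: OR binds loosest, adjacent terms are joined with AND, `NOT` negates the term or group that follows it, and field filters and quoted phrases are not handled. The function name `matches` is hypothetical and this is not how Readur parses queries internally.

```python
import re


def matches(query, document):
    """Evaluate a boolean query against the set of words in a document."""
    words = set(re.findall(r"[a-z0-9]+", document.lower()))
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_unary()
        # AND is optional: "contract NOT draft" means contract AND (NOT draft).
        while peek() not in (None, "OR", ")"):
            if peek() == "AND":
                advance()
            rhs = parse_unary()
            value = value and rhs
        return value

    def parse_unary():
        token = advance()
        if token == "NOT":
            return not parse_unary()
        if token == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        return token.lower() in words

    return parse_or()
```

For example, `matches("(budget OR financial) AND 2024", "Financial summary 2024")` is true, and `matches("contract NOT draft", "contract draft v2")` is false.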

Query Syntax

Field Search

Search within specific document fields for precise targeting.

Available Fields

Field       Description               Example
filename:   Search in file names      filename:invoice
content:    Search in document text   content:"project status"
label:      Search by labels          label:urgent
type:       Search by file type       type:pdf
source:     Search by upload source   source:webdav
size:       Search by file size       size:>10MB
date:       Search by date            date:2024-01-01

Field Search Examples

filename:contract AND date:2024        # Contracts from 2024
label:"high priority" OR label:urgent  # Priority documents
type:pdf AND content:budget            # PDF files containing "budget"
source:webdav AND label:approved       # Approved docs from WebDAV
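
If you build queries programmatically, it can help to see how a field query breaks down into filters and free-text terms. The Python sketch below is illustrative only: `split_query` is a hypothetical helper, it recognizes just the field names from the table above, and it does not handle the `-` exclusion prefix or grouped values such as `type:(pdf OR doc)`.

```python
import re

FIELDS = {"filename", "content", "label", "type", "source", "size", "date"}

# Optional "field:" prefix, followed by a quoted value or a run of non-spaces.
TOKEN = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def split_query(query):
    """Return (filters, terms): field/value pairs and the remaining tokens."""
    filters = []
    terms = []
    for field, quoted, bare in TOKEN.findall(query):
        value = quoted if quoted else bare
        if field in FIELDS:
            filters.append((field, value))
        elif field:
            # Unknown prefix: keep the token as plain text.
            terms.append(field + ":" + value)
        else:
            terms.append(value)
    return filters, terms
```

Boolean operators are left in the term list, so `split_query('type:pdf AND content:budget')` yields the two filters plus the term `AND`.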

Range Queries

Date Ranges

date:2024-01-01..2024-03-31    # Q1 2024 documents
date:>2024-01-01               # After January 1, 2024
date:<2024-12-31               # Before December 31, 2024
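
The three date forms above map naturally onto a pair of bounds. A minimal Python sketch, assuming ISO `YYYY-MM-DD` values; the helper name is hypothetical, and whether Readur treats the bounds as inclusive or exclusive is not specified here.

```python
from datetime import date


def parse_date_filter(expr):
    """Return (start, end) for the value after 'date:'; None means unbounded."""
    if ".." in expr:
        start, end = expr.split("..", 1)
        return date.fromisoformat(start), date.fromisoformat(end)
    if expr.startswith(">"):
        return date.fromisoformat(expr[1:]), None
    if expr.startswith("<"):
        return None, date.fromisoformat(expr[1:])
    day = date.fromisoformat(expr)
    return day, day
```

For example, `parse_date_filter("2024-01-01..2024-03-31")` returns the first and last day of Q1 2024, and `parse_date_filter(">2024-01-01")` returns a start date with no end.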

Size Ranges

size:1MB..10MB                 # Between 1MB and 10MB
size:>50MB                     # Larger than 50MB
size:<1KB                      # Smaller than 1KB
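
Size expressions can be converted to byte bounds in the same way. The sketch below assumes binary units (1MB = 1,048,576 bytes), which is consistent with the byte values in the Advanced Search API example later in this guide but is still an inference; the helper names are hypothetical.

```python
import re

UNITS = {"B": 1, "KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_size(text):
    """Convert a value such as '10MB' to bytes (binary units assumed)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(B|KB|MB|GB)", text.strip().upper())
    if not match:
        raise ValueError("unrecognized size: " + repr(text))
    return int(float(match.group(1)) * UNITS[match.group(2)])


def parse_size_filter(expr):
    """Return (min_bytes, max_bytes) for the value after 'size:'; None means unbounded."""
    if ".." in expr:
        low, high = expr.split("..", 1)
        return parse_size(low), parse_size(high)
    if expr.startswith(">"):
        return parse_size(expr[1:]), None
    if expr.startswith("<"):
        return None, parse_size(expr[1:])
    size = parse_size(expr)
    return size, size
```

The resulting pair can be used directly as the `min` and `max` of a `size_range` filter in an API request.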

Wildcards

Use wildcards for partial matching:

proj*           # Matches "project", "projects", "projection"
*report         # Matches "annual report", "status report"
doc?ment        # Matches "document" (? = exactly one character)
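
Wildcard behaviour can be tried out locally with Python's standard `fnmatch` module, which uses the same `*` (any run of characters) and `?` (exactly one character) conventions. This is only an approximation of Readur's matching: `fnmatch` also gives `[` and `]` a special meaning, and `wildcard_match` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def wildcard_match(pattern, word):
    """Case-insensitive match of a single word against a * and ? pattern."""
    return fnmatchcase(word.lower(), pattern.lower())
```

For example, `wildcard_match("proj*", "Projection")` is true, and `wildcard_match("doc?ment", "document")` is true because `?` stands for the single letter "u".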

Exclusion Operators

Exclude unwanted results:

invoice -draft                 # Invoices but not drafts
budget NOT personal           # Budget documents excluding personal
-label:archive proposal       # Proposals not in archive

Advanced Filtering

File Type Filters

Filter by specific file formats:

Common File Types:

  • Documents: PDF, DOC, DOCX, TXT, RTF
  • Images: PNG, JPG, JPEG, TIFF, BMP, GIF
  • Spreadsheets: XLS, XLSX, CSV
  • Presentations: PPT, PPTX

Filter Interface:

  1. Checkbox Filters: Select multiple file types
  2. MIME Type Groups: Filter by general categories
  3. Custom Extensions: Add specific file extensions

Search Syntax:

type:pdf                       # Only PDF files
type:(pdf OR doc)              # PDF or Word documents
-type:image                    # Exclude all images

Date and Time Filters

Predefined Ranges:

  • Today, Yesterday, This Week, Last Week
  • This Month, Last Month, This Quarter, Last Quarter
  • This Year, Last Year

Custom Date Ranges:

  • Start Date: Documents uploaded after a specific date
  • End Date: Documents uploaded before a specific date
  • Date Range: Documents within a specific period

Advanced Date Syntax:

created:today                  # Documents uploaded today
modified:>2024-01-01          # Modified after January 1st
accessed:last-week            # Accessed in the last week

Size Filters

Size Categories:

  • Small: < 1MB
  • Medium: 1MB - 10MB
  • Large: 10MB - 50MB
  • Very Large: > 50MB

Custom Size Ranges:

size:>10MB                     # Larger than 10MB
size:1MB..5MB                  # Between 1MB and 5MB
size:<100KB                    # Smaller than 100KB

Label Filters

Label Selection:

  • Multiple Labels: Select multiple labels with AND/OR logic
  • Label Hierarchy: Navigate nested label structures
  • Label Suggestions: Auto-complete based on existing labels

Label Search Syntax:

label:project                  # Documents with "project" label
label:"high priority"          # Multi-word labels in quotes
label:(urgent OR critical)     # Documents with either label
-label:archive                 # Exclude archived documents

Source Filters

Filter by document source or origin:

Source Types:

  • Manual Upload: Documents uploaded directly
  • WebDAV Sync: Documents from WebDAV sources
  • Local Folder: Documents from watched folders
  • S3 Sync: Documents from S3 buckets

Source-Specific Filters:

source:webdav                  # WebDAV synchronized documents
source:manual                  # Manually uploaded documents
source:"My Nextcloud"          # Specific named source

OCR Status Filters

Filter by OCR processing status:

Status Options:

  • Completed: OCR successfully completed
  • Pending: Waiting for OCR processing
  • Failed: OCR processing failed
  • Not Applicable: Text documents that don't need OCR

OCR Quality Filters:

  • High Confidence: OCR confidence > 90%
  • Medium Confidence: OCR confidence 70-90%
  • Low Confidence: OCR confidence < 70%

Search Interface

Header Search Bar

Location: Available in the header on all pages

Features:

  • Real-time suggestions: Shows results as you type
  • Quick results: Top 5 matches with snippets
  • Fast navigation: Direct access to documents
  • Search history: Recent searches for quick access

Usage:

  1. Click on the search bar in the header
  2. Start typing your query
  3. View instant suggestions and results
  4. Click a result to navigate directly to the document

Advanced Search Page

Location: Dedicated search page with full interface

Features:

  • Multiple search modes: Toggle between search types
  • Filter sidebar: All filtering options in one place
  • Result options: Sorting, pagination, view modes
  • Export capabilities: Export search results

Interface Sections:

Search Input Area

  • Query builder: Visual query construction
  • Mode selector: Choose search type (simple, phrase, fuzzy, boolean)
  • Suggestions: Auto-complete and query recommendations

Filter Sidebar

  • File type filters: Checkboxes for different formats
  • Date range picker: Calendar interface for date selection
  • Size sliders: Visual size range selection
  • Label selector: Hierarchical label browser
  • Source filters: Filter by upload source

Results Area

  • Sort options: Relevance, date, filename, size
  • View modes: List view, grid view, detail view
  • Pagination: Navigate through result pages
  • Export options: CSV, JSON export of results

Search Results

Result Display Elements

Document Cards:

  • Filename: Primary document identifier
  • Snippet: Highlighted text excerpt showing search matches
  • Metadata: File size, type, upload date, labels
  • Relevance Score: Numerical relevance ranking
  • Quick Actions: Download, view, edit labels

Highlighting:

  • Search terms: Highlighted in yellow
  • Context: Surrounding text for context
  • Multiple matches: All instances highlighted
  • Snippet length: Configurable in user settings
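
As a rough illustration of how a highlighted snippet can be produced, the Python sketch below cuts a window of text around the first match and wraps every matching term in `<mark>` tags, the same tag that appears in the API response example in this guide. The function name and the windowing strategy are assumptions; Readur's own snippet generation may differ (for instance, this sketch can cut a word in half at the window edge).

```python
import re


def make_snippet(text, terms, length=200):
    """Return up to `length` characters around the first match, with terms marked."""
    terms = [t for t in terms if t]
    if not terms:
        return text[:length]
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    first = pattern.search(text)
    # Center the window on the first match when there is one.
    start = max(0, first.start() - length // 2) if first else 0
    window = text[start:start + length]
    return pattern.sub(lambda m: "<mark>" + m.group(0) + "</mark>", window)
```

The `length` parameter plays the role of the snippet length setting; the tags are added after the window is cut, so they do not count toward it.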

Result Sorting

Sort Options:

  • Relevance: Best matches first (default)
  • Date: Newest or oldest first
  • Filename: Alphabetical order
  • Size: Largest or smallest first
  • Score: Highest search score first

Secondary Sorting:

  • Apply secondary criteria when primary sort values are equal
  • Example: Sort by relevance, then by date
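
If you post-process exported results yourself, the "relevance, then date" ordering can be reproduced with a composite sort key. A minimal Python sketch, assuming each result carries the `score` and `uploaded_at` fields shown in the API response format, with timestamps in the same ISO 8601 UTC form so that they compare correctly as strings:

```python
def sort_results(results):
    """Highest score first; ties are broken by most recent upload."""
    return sorted(
        results,
        key=lambda r: (r["score"], r["uploaded_at"]),
        reverse=True,
    )
```

Because the key is a tuple, the second element is only consulted when two results have exactly the same score.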

Search Configuration

User Preferences

Search Settings (accessible via Settings → Search):

  • Results per page: 10, 25, 50, 100
  • Snippet length: 100, 200, 300, 500 characters
  • Fuzzy threshold: Sensitivity for approximate matching
  • Default sort: Preferred default sorting option
  • Search history: Enable/disable query history

Search Behavior

  • Auto-complete: Enable search suggestions
  • Real-time search: Search as you type
  • Search highlighting: Highlight search terms in results
  • Context snippets: Show surrounding text in results

Search Optimization

Query Optimization

Best Practices

  1. Use Specific Terms: More specific queries yield better results

    Good: "quarterly sales report Q1"
    Poor: "document"
    
  2. Combine Search Modes: Use appropriate mode for your needs

    Exact phrases: "status update"
    Flexible terms: project~
    Complex logic: (budget OR financial) AND 2024
    
  3. Leverage Filters: Combine text search with filters

    Query: budget
    Filters: Type = PDF, Date = This Quarter, Label = Finance
    
  4. Use Field Search: Target specific document aspects

    filename:invoice date:2024
    content:"project milestone" label:important
    

Performance Tips

Efficient Searching

  1. Start Broad, Then Narrow: Begin with general terms, then add filters
  2. Use Filters Early: Apply filters before complex text queries
  3. Avoid Wildcards at Start: *report is slower than report*
  4. Combine Short Queries: Use multiple short terms rather than long phrases

Search Index Optimization

The search system automatically optimizes for:

  • Frequent Terms: Common words are indexed for fast retrieval
  • Document Updates: New documents are indexed immediately
  • Language Support: Multi-language stemming and analysis
  • Cache Management: Frequent searches are cached

OCR Search Optimization

Handling OCR Text

OCR-extracted text may contain errors that affect search:

Strategies:

  1. Use Fuzzy Search: Handle OCR errors with approximate matching
  2. Try Variations: Search for common OCR mistakes
  3. Use Context: Include surrounding words for better matches
  4. Check Original: Compare with original document when possible

Common OCR Issues:

  • Character confusion: "m" vs "rn", "cl" vs "d"
  • Word boundaries: "some thing" vs "something"
  • Special characters: Missing or incorrect punctuation

Optimization Examples:

# Original: "invoice"
# OCR might produce: "irwoice", "invoce", "mvoice"
# Solution: Use fuzzy search
invoice~

# Or search for context
"invoice number" OR "irwoice number" OR "invoce number"

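Writing out OCR variations by hand gets tedious, so a small script can generate them. The Python sketch below applies the character confusions listed under Common OCR Issues ("m" vs "rn", "cl" vs "d") one at a time and joins the results into an OR query. The confusion table and function names are illustrative assumptions, not part of Readur; extend the table with the mistakes you actually see in your scans.

```python
# Each pair is (text in the original, what OCR may produce instead).
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("cl", "d"), ("d", "cl")]


def ocr_variants(word):
    """Variants of `word` with a single confusion applied at a single position."""
    variants = set()
    for src, dst in CONFUSIONS:
        start = word.find(src)
        while start != -1:
            variants.add(word[:start] + dst + word[start + len(src):])
            start = word.find(src, start + 1)
    variants.discard(word)
    return sorted(variants)


def or_query(word):
    """Boolean query matching the word or any of its generated variants."""
    return " OR ".join([word] + ocr_variants(word))
```

For example, `or_query("modern")` produces `modern OR moclern OR modem OR rnodern`. For longer words the list grows quickly, in which case fuzzy search is usually the better tool.
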
Saved Searches

Creating Saved Searches

  1. Build Your Query: Create a search with desired parameters
  2. Test Results: Verify the search returns expected documents
  3. Save Search: Click "Save Search" button
  4. Name Search: Provide descriptive name
  5. Configure Options: Set update frequency and notifications

Managing Saved Searches

Saved Search Features:

  • Quick Access: Available in sidebar or dashboard
  • Automatic Updates: Results update as new documents are added
  • Shared Access: Share searches with other users (future feature)
  • Export Options: Export results automatically

Search Organization:

  • Categories: Group related searches
  • Favorites: Mark frequently used searches
  • Recent: Quick access to recently used searches

Smart Collections

Saved searches that automatically include new documents:

Examples:

  • "This Month's Reports": type:pdf AND content:report AND date:this-month
  • "Pending Review": label:"needs review" AND -label:completed
  • "High Priority Items": label:(urgent OR critical OR "high priority")

Search Analytics

Search Performance Metrics

Available Metrics:

  • Query Performance: Average search response times
  • Popular Searches: Most frequently used search terms
  • Result Quality: Click-through rates and user engagement
  • Search Patterns: Common search behaviors and trends

User Search History

History Features:

  • Recent Searches: Quick access to previous queries
  • Search Suggestions: Based on search history
  • Query Refinement: Improve searches based on past patterns
  • Export History: Download search history for analysis

Basic Search API

GET /api/search?query=invoice&limit=20
Authorization: Bearer <jwt_token>

Query Parameters:

  • query: Search query string
  • limit: Number of results (default: 50, max: 100)
  • offset: Pagination offset
  • sort: Sort order (relevance, date, filename, size)
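
A client can assemble this request with Python's standard library. The sketch below only builds the request object; the base URL is a placeholder for your own Readur instance and `build_search_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_search_request(base_url, token, query, limit=50, offset=0, sort="relevance"):
    """Prepare (but do not send) a GET request to /api/search."""
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    params = urlencode({"query": query, "limit": limit, "offset": offset, "sort": sort})
    return Request(
        base_url + "/api/search?" + params,
        headers={"Authorization": "Bearer " + token},
    )
```

Pass the returned object to `urllib.request.urlopen` to send it. `urlencode` takes care of escaping spaces and special characters in the query string.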

Advanced Search API

POST /api/search/advanced
Authorization: Bearer <jwt_token>
Content-Type: application/json

{
  "query": "budget report",
  "mode": "phrase",
  "filters": {
    "file_types": ["pdf", "docx"],
    "labels": ["Q1 2024", "Finance"],
    "date_range": {
      "start": "2024-01-01",
      "end": "2024-03-31"
    },
    "size_range": {
      "min": 1048576,
      "max": 52428800
    }
  },
  "options": {
    "fuzzy_threshold": 0.8,
    "snippet_length": 200,
    "highlight": true
  }
}
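
The same request can be prepared from Python. This sketch mirrors a subset of the body above and builds the sizes from a `MIB` constant to make clear that `size_range` values are bytes; the base URL is a placeholder and `build_advanced_request` is a hypothetical helper name.

```python
import json
from urllib.request import Request

MIB = 1024 * 1024  # size_range values in the example body are plain byte counts


def build_advanced_request(base_url, token, payload):
    """Prepare (but do not send) a POST request to /api/search/advanced."""
    return Request(
        base_url + "/api/search/advanced",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
    )


payload = {
    "query": "budget report",
    "mode": "phrase",
    "filters": {
        "file_types": ["pdf", "docx"],
        "size_range": {"min": 1 * MIB, "max": 50 * MIB},
    },
    "options": {"snippet_length": 200, "highlight": True},
}
# http://localhost:8000 is a placeholder, not a documented default.
request = build_advanced_request("http://localhost:8000", "<jwt_token>", payload)
```

Send it with `urllib.request.urlopen(request)` once the URL and token point at a real instance.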

Search Response Format

{
  "results": [
    {
      "id": "550e8400-e29b-41d4-a716-446655440000",
      "filename": "Q1_Budget_Report.pdf",
      "snippet": "The quarterly <mark>budget</mark> <mark>report</mark> shows a 10% increase in revenue...",
      "score": 0.95,
      "highlights": ["budget", "report"],
      "metadata": {
        "size": 2048576,
        "type": "application/pdf",
        "uploaded_at": "2024-01-15T10:30:00Z",
        "labels": ["Q1 2024", "Finance", "Budget"],
        "source": "WebDAV Sync"
      }
    }
  ],
  "total": 42,
  "limit": 20,
  "offset": 0,
  "query_time": 0.085
}
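
When paging through results, the `total`, `limit`, and `offset` fields tell a client whether another page exists. A minimal Python sketch, assuming the response shape shown above; `summarize` is a hypothetical helper name.

```python
import json


def summarize(response_text):
    """Return ([(filename, score), ...], next_offset or None) for one response page."""
    data = json.loads(response_text)
    rows = [(item["filename"], item["score"]) for item in data["results"]]
    next_offset = data["offset"] + data["limit"]
    # There is another page only if the next offset is still inside the result set.
    return rows, (next_offset if next_offset < data["total"] else None)
```

For the example response (`total` 42, `limit` 20, `offset` 0) the next offset is 20; once the offset reaches 40, the function returns `None` to signal the last page.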

Troubleshooting

Common Search Issues

No Results Found

Possible Causes:

  1. Typos: Check spelling in search query
  2. Too Specific: Query might be too restrictive
  3. Wrong Mode: Using exact search when fuzzy would be better
  4. Filters: Remove filters to check if they're excluding results

Solutions:

  1. Simplify Query: Start with broader terms
  2. Check Spelling: Use fuzzy search for typo tolerance
  3. Remove Filters: Test without date, type, or label filters
  4. Try Synonyms: Use alternative terms for the same concept

Irrelevant Results

Possible Causes:

  1. Too Broad: Query matches too many unrelated documents
  2. Common Terms: Using very common words that appear everywhere
  3. Wrong Mode: Using fuzzy when exact match is needed

Solutions:

  1. Add Specificity: Include more specific terms or context
  2. Use Filters: Add file type, date, or label filters
  3. Phrase Search: Use quotes for exact phrases
  4. Boolean Logic: Use AND/OR/NOT for better control

Slow Search Performance

Possible Causes:

  1. Complex Queries: Very complex boolean queries
  2. Large Result Sets: Queries matching many documents
  3. Wildcard Overuse: Starting queries with wildcards

Solutions:

  1. Simplify Queries: Break complex queries into simpler ones
  2. Add Filters: Use filters to reduce result set size
  3. Avoid Leading Wildcards: Use term* instead of *term
  4. Use Pagination: Request smaller result sets

OCR Search Issues

OCR Text Not Searchable

Symptoms: Can't find text that's visible in document images

Solutions:

  1. Check OCR Status: Verify OCR processing completed
  2. Retry OCR: Manually retry OCR processing
  3. Use Fuzzy Search: OCR might have character recognition errors
  4. Check Language Settings: Ensure correct OCR language is configured

Poor OCR Search Quality

Symptoms: Fuzzy search required for most queries on scanned documents

Solutions:

  1. Improve Source Quality: Use higher resolution scans (300+ DPI)
  2. OCR Language: Verify correct language setting for documents
  3. Image Enhancement: Enable OCR preprocessing options
  4. Manual Correction: Consider manual text correction for important documents

Search Configuration Issues

Settings Not Applied

Symptoms: Search settings changes don't take effect

Solutions:

  1. Reload Page: Refresh browser to apply settings
  2. Clear Cache: Clear browser cache and cookies
  3. Check Permissions: Ensure user has permission to modify settings
  4. Database Issues: Check if settings are being saved to database

Filter Problems

Symptoms: Filters not working as expected

Solutions:

  1. Clear All Filters: Reset filters and apply one at a time
  2. Check Filter Logic: Ensure AND/OR logic is correct
  3. Label Validation: Verify labels exist and are spelled correctly
  4. Date Format: Ensure dates are in correct format

Next Steps