[docs-only] Update search README.md (#7655)

References: #7553 (enhancement: improve content extraction stop word cleaning). Makes the term `stop word` and the use of the envvar clearer.
2026-01-08 05:09:46 -06:00 · 2023-11-03 14:46:02 +01:00
parent 853161e0b9
commit 4b86bd0921
1 changed files with 1 additions and 2 deletions
--- a/services/search/README.md
+++ b/services/search/README.md
@@ -70,8 +70,7 @@ When the search service can reach Tika, it begins to read out the content on dem

 Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files with a certain file size. The default limit is 20MB and can be configured using the `SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT` variable.

-When extracting the content you can specify whether filler words are ignored or not.
-To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to false.
+When extracting content, you can specify whether [stop words](https://en.wikipedia.org/wiki/Stop_word) like `I`, `you`, `the` are ignored or not. Normally, these stop words are removed automatically. To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to `false`.

 When using the Tika container and docker-compose, consider the following: