[docs-only] Update search README.md (#7655)

References: #7553 (enhancement: improve content extraction stop word cleaning)

Clarifies the term `stop word` and how the environment variable is used.
Martin
2023-11-03 14:46:02 +01:00
committed by GitHub
parent 853161e0b9
commit 4b86bd0921


@@ -70,8 +70,7 @@ When the search service can reach Tika, it begins to read out the content on dem
Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files with a certain file size. The default limit is 20MB and can be configured using the `SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT` variable.
-When extracting the content you can specify whether filler words are ignored or not.
-To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to false.
+When extracting content, you can specify whether [stop words](https://en.wikipedia.org/wiki/Stop_word) like `I`, `you`, `the` are ignored or not. Normally, these stop words are removed automatically. To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to `false`.
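
As an illustration of the two settings described in this hunk, here is a minimal sketch of how they might be set in the shell environment that starts the search service. The variable names come from the README text. The byte unit for the size limit is an assumption and should be checked against the service's configuration reference.

```shell
# Keep stop words in the extracted content
# (per the README, they are removed unless this is set to false).
export SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=false

# Assumption: the limit is expressed in bytes. 20 MB is written here
# as 20 * 1024 * 1024, matching the documented default of 20MB.
export SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT=$((20 * 1024 * 1024))

# Show the SEARCH_* values a process started from this shell would inherit.
env | grep '^SEARCH_'
```

In a container setup, the same two names would typically go into the service's environment section instead of being exported in a shell.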
When using the Tika container and docker-compose, consider the following: