mirror of
https://github.com/opencloud-eu/opencloud.git
synced 2026-01-08 05:09:46 -06:00
[docs-only] Update search README.md (#7655)
References: #7553 (enhancement: improve content extraction stop word cleaning) Making the term `stop word` and the use of the envvar more clear.
@@ -70,8 +70,7 @@ When the search service can reach Tika, it begins to read out the content on dem
 
 Content extraction and handling the extracted content can be very resource intensive. Content extraction is therefore limited to files with a certain file size. The default limit is 20MB and can be configured using the `SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT` variable.
 
-When extracting the content you can specify whether filler words are ignored or not.
-To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to false.
+When extracting content, you can specify whether [stop words](https://en.wikipedia.org/wiki/Stop_word) like `I`, `you`, `the` are ignored or not. Normally, these stop words are removed automatically. To keep them, the environment variable `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` must be set to `false`.
 
 When using the Tika container and docker-compose, consider the following:
 
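As a minimal sketch of the two settings touched by this change, the variables could be exported in the environment of the search service as shown here. The byte value for the size limit is an assumption: the README text only states a 20MB default and does not say which unit the variable expects, so check the service's configuration reference before relying on it.

```shell
# Assumption: the limit is given in bytes; 20971520 = 20 * 1024 * 1024.
export SEARCH_CONTENT_EXTRACTION_SIZE_LIMIT=20971520

# Keep stop words in the extracted content (they are removed by default).
export SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS=false
```

Leaving `SEARCH_EXTRACTOR_TIKA_CLEAN_STOP_WORDS` unset keeps the default behavior described above, where stop words are removed.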