selectEmbeddingProperties
Customize the text extraction process for each document.
The selectEmbeddingProperties hook allows you to customize the text extraction and transformation process.
You can use this hook to select the properties of the document that will be used to generate the embeddings.
This hook receives a single document as input and must return one of the following:
- string[]: An array of strings with the names of the properties to use
- string: A single string with the concatenated properties
- string: A single string with the text to use for embeddings
Given that OramaCore is schemaless, this hook is particularly useful to customize the text extraction process depending on the document structure, which can vary from document to document.
Let's take the following documents as an example:
As you can see, the structure of the documents is different. With the selectEmbeddingProperties hook, you can customize the text extraction process for each document.
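As a purely illustrative stand-in, two documents with different shapes might look like the sketch below. The field names (title, description, price, name, summary, tags) and all values are hypothetical and are not taken from OramaCore or from any real dataset.

```javascript
// Hypothetical contents for document1.json: a flat product record.
const document1 = {
  title: "Wireless Headphones",
  description: "Over-ear headphones with active noise cancellation",
  price: 199,
};

// Hypothetical contents for document2.json: different keys, plus an array field.
const document2 = {
  name: "Espresso Machine",
  summary: "Compact machine with a built-in grinder",
  tags: ["kitchen", "coffee"],
};

// The two documents share no top-level property names.
const sharedKeys = Object.keys(document1).filter((key) => key in document2);
console.log(sharedKeys.length);
```

Because the two records have no property names in common, a fixed list of properties to embed cannot serve both, which is the situation the hook is meant to handle.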
Returning a single string
You could write a JavaScript function like this:
This will return the following strings for the documents:
- For document1.json:
- For document2.json:
This way, you can easily produce highly optimized embeddings for each document.
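A minimal sketch of the single-string approach follows. It assumes two hypothetical document shapes (one with title and description, one with name, summary, and tags); these field names, the sample values, and the empty-string fallback are illustrative assumptions, and how the function is registered with OramaCore is not shown here.

```javascript
// Hypothetical hook body: picks fields based on the document's shape
// and returns one concatenated string.
function selectEmbeddingProperties(document) {
  // Shape A (assumed): title + description.
  if (typeof document.title === "string") {
    return [document.title, document.description].filter(Boolean).join(". ");
  }
  // Shape B (assumed): name + summary + tags.
  if (typeof document.name === "string") {
    const tags = Array.isArray(document.tags) ? document.tags.join(", ") : "";
    return [document.name, document.summary, tags].filter(Boolean).join(". ");
  }
  // Fallback for unknown shapes; an arbitrary choice made for this sketch.
  return "";
}

// Hypothetical sample documents.
const document1 = {
  title: "Wireless Headphones",
  description: "Over-ear headphones with active noise cancellation",
  price: 199,
};
const document2 = {
  name: "Espresso Machine",
  summary: "Compact machine with a built-in grinder",
  tags: ["kitchen", "coffee"],
};

const text1 = selectEmbeddingProperties(document1);
const text2 = selectEmbeddingProperties(document2);
console.log(text1);
console.log(text2);
```

In this sketch the numeric price is deliberately left out, on the assumption that it adds little semantic signal to the embedding text.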
Returning a single markdown string
Another approach is to return a single markdown string that will be used for embeddings:
This will produce the following outputs for the two documents:
- For document1.json:
- For document2.json:
This approach allows you to generate complete markdown documents rich in information that can be used for embeddings.
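The markdown variant could be sketched as follows, again assuming two hypothetical document shapes. The field names, the "Untitled" fallback heading, and the heading layout are assumptions made for illustration, not a format that OramaCore requires.

```javascript
// Hypothetical hook body: renders a document as a small markdown page.
function selectEmbeddingProperties(document) {
  // Use whichever heading/body fields the document happens to have.
  const heading = document.title ?? document.name ?? "Untitled";
  const body = document.description ?? document.summary ?? "";
  const lines = [`# ${heading}`, "", body];

  // Add a bullet list only when the document carries tags.
  if (Array.isArray(document.tags) && document.tags.length > 0) {
    lines.push("", "## Tags", "", ...document.tags.map((tag) => `- ${tag}`));
  }
  return lines.join("\n");
}

// Hypothetical sample documents.
const document1 = {
  title: "Wireless Headphones",
  description: "Over-ear headphones with active noise cancellation",
};
const document2 = {
  name: "Espresso Machine",
  summary: "Compact machine with a built-in grinder",
  tags: ["kitchen", "coffee"],
};

const markdown1 = selectEmbeddingProperties(document1);
const markdown2 = selectEmbeddingProperties(document2);
console.log(markdown1);
console.log(markdown2);
```

Keeping headings and lists in the output preserves some of the document's structure in the text handed to the embedding model, at the cost of a slightly longer input.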
Returning an array of strings
Finally, you can return an array of strings with the names of the properties to use for each document. OramaCore will then concatenate the values of these properties to generate the embeddings.
This will produce the following outputs for the two documents:
- For document1.json:
- For document2.json:
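A minimal sketch of the array approach is shown below. The candidate property names and sample documents are hypothetical; the function only returns property names and leaves the concatenation of their values to OramaCore, as described above.

```javascript
// Hypothetical hook body: returns the names of the text properties
// that are actually present on this document.
function selectEmbeddingProperties(document) {
  // Candidate property names are an assumption made for this sketch.
  const candidates = ["title", "description", "name", "summary"];
  return candidates.filter((key) => typeof document[key] === "string");
}

// Hypothetical sample documents.
const document1 = {
  title: "Wireless Headphones",
  description: "Over-ear headphones with active noise cancellation",
  price: 199,
};
const document2 = {
  name: "Espresso Machine",
  summary: "Compact machine with a built-in grinder",
  tags: ["kitchen", "coffee"],
};

const properties1 = selectEmbeddingProperties(document1);
const properties2 = selectEmbeddingProperties(document2);
console.log(properties1);
console.log(properties2);
```

Filtering on string-valued properties keeps numeric fields such as price and array fields such as tags out of the selection in this sketch.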
There is no right or wrong way to use the selectEmbeddingProperties hook. You can use it in the way that best fits your needs and the structure of your documents.