Introduction
Datasources are a NEXTgpt component that stores information indexed on text chunks. The indexed chunks can then be queried to retrieve the most relevant content, which is sent to the LLMs as context.
In addition:
- Datasources are required for specific Frameworks (please refer to Frameworks for further details).
- Datasources can be created by developers through the NEXTgpt Developer Portal.
How to create a Datasource?
To create a new datasource, complete the following steps:
- Have your documents ready in one of the supported formats (.txt, .doc, .docx, .pptx, .pdf, .md or .csv).
- Log in to the NEXTgpt Developer Portal.
- Navigate to the Datasource Creation tab.
- Browse and upload the files.
- Type a datasource name in the Datasource field (it should be unique and must not appear in the Existing Datasources section).
- If the data files are large, which will most likely be the case, tick Doc Slice to enable chunking. Doc Max Section Length, Doc Section Overlap and Doc Sentence Search Limit can then be configured to match your needs.
- If the data file is in .csv format, set the following options:
  - csv_header (bool): whether the csv contains a header row. Column names are inferred from the first row if true, and are named k0, k1, etc. if false.
  - csv_key: content from this column is used to index the datasource.
  - csv_out_template: a Python f-string-style template that uses the column names. This will be the content field when querying the datasource.
  For example, consider a csv file with a header row whose columns include name and age. Setting
  - csv_header as True
  - csv_key as 'name'
  - csv_out_template as 'Employee {name} is {age} years old'
  will create a datasource indexed on the column 'name' with content 'Employee John Doe is 31 years old', 'Employee Alan Ng is 28 years old', etc.
  If the first row of the csv file is not a header, replace 'name' with 'k1' and 'age' with 'k2' in the above and set csv_header to false to get the same datasource.
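The chunking and csv options described above can be illustrated with a minimal Python sketch. It is not NEXTgpt's implementation: the helper names `slice_text` and `render_csv_rows` are hypothetical, the section length and overlap are assumed to be measured in characters, Doc Sentence Search Limit is not modeled, and the sample csv (including its leading `id` column) is made up to match the 'k1' and 'k2' positions mentioned above.

```python
import csv
import io


def slice_text(text, max_section_length, section_overlap):
    """Split text into fixed-size character windows that overlap (illustrative helper)."""
    if not 0 <= section_overlap < max_section_length:
        raise ValueError("section_overlap must be >= 0 and smaller than max_section_length")
    step = max_section_length - section_overlap
    sections = []
    start = 0
    while start < len(text):
        sections.append(text[start:start + max_section_length])
        if start + max_section_length >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return sections


def render_csv_rows(csv_text, csv_header, csv_key, csv_out_template):
    """Return one (index key, content) pair per data row (illustrative helper)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return []
    if csv_header:
        columns, data = rows[0], rows[1:]
    else:
        # No header row: name the columns k0, k1, ... as described above.
        columns = [f"k{i}" for i in range(len(rows[0]))]
        data = rows
    entries = []
    for row in data:
        record = dict(zip(columns, row))
        # The template is filled with str.format, using column names as fields.
        entries.append((record[csv_key], csv_out_template.format(**record)))
    return entries


# Made-up sample data; the leading "id" column is an assumption.
SAMPLE_WITH_HEADER = "id,name,age\n1,John Doe,31\n2,Alan Ng,28\n"
SAMPLE_WITHOUT_HEADER = "1,John Doe,31\n2,Alan Ng,28\n"

print(slice_text("abcdefghij", max_section_length=4, section_overlap=1))
print(render_csv_rows(SAMPLE_WITH_HEADER, True, "name",
                      "Employee {name} is {age} years old"))
print(render_csv_rows(SAMPLE_WITHOUT_HEADER, False, "k1",
                      "Employee {k1} is {k2} years old"))
```

In this sketch, consecutive sections share `section_overlap` characters, so a sentence cut at a section boundary still appears in part in the next section. The two `render_csv_rows` calls produce the same pairs, which mirrors the header and headerless configurations described above.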
How to query a Datasource?
To query a datasource, navigate to the Datasource Query tab, select the datasource you want to query and provide the following information.
- Query: A string to find relevant content.
- Size: The number of top matching results.
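To make the roles of Query and Size concrete, here is a toy Python sketch of top-matching retrieval. It is only an illustration: the function `query_datasource` is hypothetical, and a simple word-count cosine similarity stands in for whatever embedding and ranking the actual datasource backends use.

```python
import math
import re
from collections import Counter


def _vectorize(text):
    """Turn text into a bag-of-words count vector (stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def query_datasource(chunks, query, size):
    """Return the `size` chunks most similar to `query` (hypothetical helper)."""
    query_vector = _vectorize(query)
    scored = [(_cosine(query_vector, _vectorize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:size]]


# Made-up indexed content for illustration.
CHUNKS = [
    "Employee John Doe is 31 years old",
    "Employee Alan Ng is 28 years old",
    "The office is closed on Sundays",
]

print(query_datasource(CHUNKS, "How old is John Doe?", size=1))
```

A larger Size returns more of the ranked chunks, which gives the LLM more context at the cost of a longer prompt.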
Types of Datasources
We currently support three types of datasources.
Datasource | Hosted | Dependency | Remarks |
---|---|---|---|
Azure | Cloud | Azure Subscription | Paid Azure AI search subscription is required. |
FAISS | On-prem | CPU for small indexing | Suitable for short-to-medium length text embedding tasks. |
ColBERT | On-prem | GPU for large indexing | High quality embeddings. Suitable for longer text. |
Please check the following sections for detailed information on different types of Datasources and how they can be used.