
Introduction

Datasources are a type of component in NEXTgpt that store information indexed on text chunks. The indexed chunks can then be queried to retrieve the most relevant content, which is sent as context to the LLMs.

In addition:

  • Datasources are required for specific Frameworks (please refer to Frameworks for further details).
  • Datasources can be created by developers in the NEXTgpt Developer Portal.

How to create a Datasource?

To create a new datasource, follow these steps:

  1. Have your documents ready in the supported formats (.txt, .doc, .docx, .pptx, .pdf, .md or .csv).

  2. Log in to the NEXTgpt Developer Portal.

  3. Navigate to the Datasource Creation tab.

  4. Browse and upload the files.

  5. Type a datasource name in the Datasource field. The name must be unique and must not appear in the Existing Datasources section.

  6. If the datafiles are large, which will most likely be the case, tick Doc Slice to enable chunking. You can then configure Doc Max Section Length, Doc Section Overlap and Doc Sentence Search Limit to match your needs.

  7. If the datafile is in .csv format, configure the following fields:

    • csv_header (bool): whether the csv contains a header row. Column names are taken from the first row if true, and are assigned as k0, k1, etc. if false.

    • csv_key: the column whose content is used to index the datasource.

    • csv_out_template: a Python f-string-style template that uses the column names. The rendered string becomes the content field when querying the datasource.

    For example, consider the following csv file:

    employee_id, name, age
    E1234, John Doe, 31
    E2345, Alan Ng, 28
    E5645, Jess Tan, 30
    

    Setting

    • csv_header as True

    • csv_key as 'name'

    • csv_out_template as 'Employee {name} is {age} years old'

    will create a datasource indexed on the column 'name' with content 'Employee John Doe is 31 years old', 'Employee Alan Ng is 28 years old', etc.

    If the first row of the csv file is not a header, set csv_header to false and replace 'name' with 'k1' and 'age' with 'k2' in the settings above to get the same datasource.
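
The effect of the three csv settings can be sketched in plain Python. This is only an illustration of the behaviour described above, not NEXTgpt's actual implementation: the function name build_records is hypothetical, str.format stands in for the f-string template, and skipinitialspace is used only because the sample file has a space after each comma.

```python
import csv
import io


def build_records(csv_text, csv_header, csv_key, csv_out_template):
    """Return one (index key, content) pair per data row (hypothetical helper)."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    rows = [row for row in reader if row]  # ignore blank lines
    if csv_header:
        # Column names come from the first row.
        columns, data = rows[0], rows[1:]
    else:
        # No header row: columns are named k0, k1, k2, ...
        columns = [f"k{i}" for i in range(len(rows[0]))]
        data = rows
    records = []
    for row in data:
        fields = dict(zip(columns, row))
        # The key column is what gets indexed; the rendered template is the content.
        records.append((fields[csv_key], csv_out_template.format(**fields)))
    return records


SAMPLE = """employee_id, name, age
E1234, John Doe, 31
E2345, Alan Ng, 28
E5645, Jess Tan, 30
"""

# With a header row, the template refers to the real column names.
with_header = build_records(
    SAMPLE, True, "name", "Employee {name} is {age} years old"
)

# Without a header row, the same columns are addressed as k1 and k2.
headerless = SAMPLE.split("\n", 1)[1]
no_header = build_records(
    headerless, False, "k1", "Employee {k1} is {k2} years old"
)

for key, content in with_header:
    print(key, "->", content)
```

Both calls produce the same key and content pairs, which mirrors the note above about replacing 'name' with 'k1' and 'age' with 'k2' when the file has no header row.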

How to query a Datasource?

To query a datasource, navigate to the Datasource Query tab, select the datasource you want to query, and provide the following information:

  • Query: A string to find relevant content.
  • Size: The number of top matching results.
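
To make the roles of Query and Size concrete, here is a toy sketch of top-N retrieval over text chunks. It is an assumption-laden illustration only: the function query_datasource is hypothetical, and a simple word-overlap cosine score stands in for whatever relevance scoring the actual datasource types use.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def query_datasource(chunks, query, size):
    """Return the `size` chunks that score highest against `query` (toy scoring)."""
    q = Counter(tokens(query))
    scored = [(cosine(q, Counter(tokens(chunk))), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:size]]


chunks = [
    "Employee John Doe is 31 years old",
    "Employee Alan Ng is 28 years old",
    "Employee Jess Tan is 30 years old",
]

# Query is the search string; Size caps how many results come back.
top = query_datasource(chunks, "How old is Alan Ng?", size=2)
for chunk in top:
    print(chunk)
```

A larger Size returns more candidate chunks to pass as context, at the cost of including less relevant ones.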

Types of Datasources

We currently support three types of datasources.

Datasource | Hosted  | Dependency             | Remarks
Azure      | Cloud   | Azure Subscription     | Paid Azure AI search subscription is required.
FAISS      | On-prem | CPU for small indexing | Suitable for short-to-medium length text embedding tasks.
ColBERT    | On-prem | GPU for large indexing | High quality embeddings. Suitable for longer text.

Please check the following sections for detailed information on the different types of datasources and how they can be used.

  1. Azure

  2. FAISS

  3. ColBERT