Wikipedia Article Sections

Instructions

Language Model

ChatGPT (openai)

Knowledge

Show as Links

💪 Capabilities

🦻 Speech Recognition & Translation

🗣️ Text to Speech & Lipsync

Text-to-Speech Provider

Google Text-to-Speech

Voice name (Google TTS)

hi-IN-Wavenet-A (Female)

Please refer to the list of voice names here

🫦 Add Lipsync Video

🩻 Photo & Document Intelligence

🧩 Developer Tools and Functions

⚙️ Settings

🗣️ Google Text-to-Speech Settings

Speaking rate

1.0 is the normal native speed of the speaker

Pitch

Increase/Decrease semitones from the original pitch
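As a rough illustration of how the voice name, speaking rate, and pitch settings fit together, the sketch below builds a request body in the shape used by the Google Cloud Text-to-Speech `text:synthesize` REST endpoint. Whether this copilot sends exactly this payload is an assumption; `build_tts_request` is a hypothetical helper, and the range limits are taken from public documentation as best understood and should be verified before relying on them.

```python
# Assumed limits; check the current Google Cloud TTS documentation.
MIN_RATE, MAX_RATE = 0.25, 4.0      # 1.0 is the normal native speed
MIN_PITCH, MAX_PITCH = -20.0, 20.0  # semitones relative to the original pitch


def build_tts_request(text, voice_name="hi-IN-Wavenet-A",
                      speaking_rate=1.0, pitch=0.0):
    """Build a JSON-serializable synthesize request (hypothetical helper)."""
    if not MIN_RATE <= speaking_rate <= MAX_RATE:
        raise ValueError(f"speaking_rate must be in [{MIN_RATE}, {MAX_RATE}]")
    if not MIN_PITCH <= pitch <= MAX_PITCH:
        raise ValueError(f"pitch must be in [{MIN_PITCH}, {MAX_PITCH}]")
    # Voice names start with the language code, e.g. "hi-IN-Wavenet-A" -> "hi-IN".
    language_code = "-".join(voice_name.split("-")[:2])
    return {
        "input": {"text": text},
        "voice": {"languageCode": language_code, "name": voice_name},
        "audioConfig": {
            "audioEncoding": "MP3",
            "speakingRate": speaking_rate,
            "pitch": pitch,
        },
    }


request_body = build_tts_request("namaste", speaking_rate=1.2, pitch=-2.0)
```

Here a speaking rate of 1.2 asks for slightly faster speech, and a pitch of -2.0 lowers the voice by two semitones.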

📄 Knowledge Base

👩‍🏫 Search Instructions

How should the LLM interpret the results from your knowledge base?

Citation Style

Plain Text / WhatsApp Numbers + Footnotes

🔗 Shorten citation links

Cache

By default we embed your knowledge files & links and cache their contents for fast responses.

Always Check for Updates

Create Synthetic Data

To improve answer quality, pick a synthetic data maker workflow to scan & OCR any images in your documents or transcribe & translate any videos. It can also synthesize a helpful FAQ. Adds ~2 minutes of one-time processing per file.

———

Advanced Settings

In general, you should not need to adjust these.

👁‍🗨 Conversation Summarization

These instructions run before the knowledge base is searched and should reduce the conversation into a search query most relevant to the user's last message.

🔑 Keyword Extraction

Instructions to create a query for keyword/hybrid BM25 search. Runs after the Conversation Summarization above and can use its result via {{ final_search_query }}.

✏ Embeddings Model

Text Embedding 3 Large (OpenAI)

Dense Embeddings Weightage

Weightage for dense vs sparse embeddings. 0 for sparse, 1 for dense, 0.5 for equal weight.
Generally speaking, dense embeddings excel at understanding the context of the query, whereas sparse vectors excel at keyword matches.
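One common way to apply such a weight is linear interpolation between the two relevance scores. The sketch below assumes that form; the page does not say how the scores are actually combined, and `hybrid_score` and the sample scores are hypothetical.

```python
def hybrid_score(dense_score, sparse_score, dense_weight=0.5):
    """Blend a dense (embedding) score with a sparse (keyword/BM25) score.

    dense_weight = 0 uses only the sparse score, 1 only the dense score,
    and 0.5 weights both equally. Both scores are assumed to be
    normalized to a comparable range beforehand.
    """
    if not 0.0 <= dense_weight <= 1.0:
        raise ValueError("dense_weight must be between 0 and 1")
    return dense_weight * dense_score + (1.0 - dense_weight) * sparse_score


# Hypothetical (dense, sparse) scores for two documents.
candidates = {"doc_a": (0.9, 0.1), "doc_b": (0.4, 0.8)}


def rank(weight):
    """Return document names ordered from highest to lowest blended score."""
    return sorted(candidates,
                  key=lambda name: hybrid_score(*candidates[name], weight),
                  reverse=True)
```

With these sample scores, a weight near 1 favors `doc_a` (a strong contextual match), while a weight near 0 favors `doc_b` (a strong keyword match).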

Max Citations

The maximum number of document search citations.

Max Snippet Words

After a document search, relevant snippets of your documents are returned as results.
This setting adjusts the maximum number of words in each snippet (tokens = words * 2).
A larger snippet size gives the LLM access to more information from your document results, but it makes the prompt more verbose and can exhaust the input token limit (which can cause the copilot to fail to respond).
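To get a feel for the input-token budget, the rule of thumb above (tokens = words * 2) can be combined with the Max Citations setting. This is a back-of-the-envelope sketch; `estimate_snippet_tokens` is a hypothetical helper and the example numbers are illustrative, not defaults of this page.

```python
def estimate_snippet_tokens(max_snippet_words, max_citations, tokens_per_word=2):
    """Upper-bound estimate of input tokens consumed by search snippets.

    Uses the page's rule of thumb (tokens = words * 2) and assumes every
    citation returns a snippet of the maximum length.
    """
    return max_snippet_words * tokens_per_word * max_citations


# Example: 3 citations of up to 300 words each.
budget = estimate_snippet_tokens(max_snippet_words=300, max_citations=3)
```

Doubling either setting doubles the estimate, so raising both at once is the quickest way to run out of input tokens.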

Snippet Overlap Ratio

Your knowledge base documents are split into overlapping snippets.
This setting adjusts how much those snippets overlap (overlap tokens = snippet tokens / overlap ratio).
In general you shouldn't need to adjust this.
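The overlap formula can be illustrated with a small sliding-window splitter. This is a minimal sketch that treats each word as one token for simplicity; `split_into_snippets` is a hypothetical helper, and the real splitter may differ.

```python
def split_into_snippets(tokens, snippet_size, overlap_ratio):
    """Split a token list into overlapping windows.

    overlap tokens = snippet tokens / overlap ratio, so a ratio of 2
    makes each snippet share half of its tokens with the previous one.
    """
    if overlap_ratio <= 0:
        raise ValueError("overlap_ratio must be positive")
    overlap = int(snippet_size / overlap_ratio)
    step = snippet_size - overlap
    if step <= 0:
        raise ValueError("overlap_ratio must be greater than 1")
    snippets = []
    for start in range(0, len(tokens), step):
        snippets.append(tokens[start:start + snippet_size])
        if start + snippet_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return snippets


words = "the quick brown fox jumps over the lazy dog today".split()
snippets = split_into_snippets(words, snippet_size=4, overlap_ratio=2)
```

In this example each 4-word snippet repeats the last 2 words of the one before it, so a sentence that straddles a boundary still appears whole in at least one snippet.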

🔠 Language Model Settings

Avoid Repetition

Response Format

———

Max Output Tokens

The maximum number of tokens to generate in the completion. Increase to generate longer responses.

Creativity (aka Sampling Temperature)

Higher values allow the LLM to take more risks. Try values larger than 1 for more creative applications, or 0 to ensure that the LLM gives the same answer when given the same user input.
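The effect of sampling temperature can be sketched with a temperature-scaled softmax over next-token scores. This is a generic illustration of the technique, not the provider's implementation; the logits are made up, and treating a temperature of 0 as "always pick the top token" is a simplifying assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw token scores into sampling probabilities."""
    if temperature <= 0:
        # Greedy choice: all probability mass on the highest-scoring token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
greedy = softmax_with_temperature(logits, 0)
normal = softmax_with_temperature(logits, 1.0)
creative = softmax_with_temperature(logits, 2.0)
```

As the temperature rises, probability shifts from the top-scoring token toward the less likely ones, which is why larger values produce more varied output.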

Answer Outputs

How many answers should the copilot generate? Additional answer outputs increase the cost of each run.

Run cost = 4 credits

Breakdown: 1 (ChatGPT (openai)) + 3/run



Assistant

To remove toxicity from a dataset, one approach is to perform dataset cleaning. This involves removing toxic passages from the dataset, discarding low-quality data, and de-duplicating the content [1]. By applying these steps, a cleaned and high-quality dataset can be obtained.

It's worth noting that as large language models (LLMs) generate more content, future datasets may become increasingly "contaminated" by LLM-generated content [1]. Therefore, ongoing efforts to improve dataset cleaning techniques will be crucial in maintaining the quality of datasets used for training LLMs.

If you have any other questions, feel free to ask! 😊

Large language model - Dataset cleaning https://gooey.ai/2/9lNn

User
How to remove toxicity?

💁‍♀️ Sources

Generated in 8.2s on

...