Knowledge Format
The design of the knowledge base plays a central role in the reliability, efficiency, and maintainability of a chatbot. In TUNa, the knowledge base functions as the exclusive source of verified information and directly influences retrieval quality, response latency, and the likelihood of hallucinations. Consequently, the choice of knowledge formats is not only a matter of content representation but also a technical design decision with measurable effects on system behavior.
The main knowledge file format used in this project is Markdown. Markdown offers a strong balance between structural expressiveness and token efficiency. Headings, lists, and concise formatting provide clear semantic boundaries that improve retrieval precision while introducing minimal token overhead. Compared to more heavily formatted documents, Markdown allows relevant sections to be embedded or retrieved with fewer tokens spent on markup, which positively affects inference speed and reduces context saturation. From a reliability perspective, the explicit structure of Markdown supports grounded responses by clearly separating definitions, procedures, and exceptions, thereby lowering the risk of hallucinations. In addition, Markdown files are easy to read, edit, and keep under version control, making them highly maintainable in collaborative development settings such as this project.
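The role of headings as semantic boundaries can be illustrated with a minimal sketch. It is not TUNa's actual ingestion code: the function name `split_markdown_by_heading` and the sample text are hypothetical, and the sketch assumes ATX-style headings (`#` to `######`) and no fenced code blocks containing lines that start with `#`.

```python
import re

def split_markdown_by_heading(text):
    """Split Markdown text into (heading, body) chunks at ATX headings."""
    chunks = []
    heading = None
    body = []
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            # Close the previous chunk before starting a new one.
            if heading is not None or any(l.strip() for l in body):
                chunks.append((heading, "\n".join(body).strip()))
            heading = match.group(2).strip()
            body = []
        else:
            body.append(line)
    # Flush the final chunk.
    if heading is not None or any(l.strip() for l in body):
        chunks.append((heading, "\n".join(body).strip()))
    return chunks

# Hypothetical sample content, for illustration only.
sample = """# Enrollment
## Deadlines
Applications close on the stated date.
## Exceptions
- Late applications need approval.
"""

chunks = split_markdown_by_heading(sample)
for heading, body in chunks:
    print(repr(heading), "->", repr(body))
```

Each chunk carries its heading as a label, so a retriever can return the "Exceptions" section on its own instead of the whole document.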
PDF files are used selectively when information is only available in this format. From a technical standpoint, PDFs are suboptimal for language-model-based retrieval. They often contain dense paragraphs, repeated headers, or layout artifacts that increase token usage without adding semantic value. This can negatively impact retrieval speed and reduce precision, as relevant information may be embedded within large text blocks. Furthermore, the lack of explicit semantic structure increases the risk that the model misinterprets or overgeneralizes content, potentially leading to hallucinations. Nevertheless, PDFs offer high reliability in terms of source authenticity and human readability and are therefore retained when fidelity to the original document outweighs the disadvantages in efficiency and maintainability.
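One way to reduce the overhead from repeated headers is to drop lines that recur on most pages after text extraction. The following is a sketch under stated assumptions, not a description of the project's pipeline: it assumes the PDF text has already been extracted into one string per page (the extraction step is out of scope here), and the function name `strip_repeated_lines`, the `min_fraction` threshold, and the sample pages are all hypothetical.

```python
from collections import Counter

def strip_repeated_lines(pages, min_fraction=0.8):
    """Remove lines that appear on at least min_fraction of all pages."""
    counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        unique_lines = {line.strip() for line in page.splitlines() if line.strip()}
        counts.update(unique_lines)
    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n > 1 and n >= threshold}
    cleaned = []
    for page in pages:
        kept = [line for line in page.splitlines() if line.strip() not in repeated]
        cleaned.append("\n".join(kept).strip())
    return cleaned

# Hypothetical page texts with a running header on every page.
pages = [
    "Study Regulations\nStudents register online.",
    "Study Regulations\nExams take place twice a year.",
    "Study Regulations\nGrades are published online.",
]

cleaned = strip_repeated_lines(pages)
print(cleaned)
```

Lines that vary from page to page, such as page numbers, are not caught by this heuristic and would need a separate pattern.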
JSON represents a contrasting design choice, optimized for machine readability and strict structural consistency. JSON is well suited to representing discrete entities, mappings, and metadata, enabling precise retrieval and minimizing ambiguity at the data level. When used appropriately, this format can reduce hallucinations caused by loosely phrased textual descriptions. However, JSON tends to be token-heavy due to repeated keys and syntax, which can increase context size and slow down inference when large datasets are involved. Moreover, JSON is less suitable for natural language explanations and is harder to maintain for nontechnical contributors, limiting its usefulness for procedural or advisory content.
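The cost of repeated keys can be made concrete with a small comparison. This is an illustrative sketch only: the records are hypothetical, the helper `to_markdown_table` is not part of the project, and character count is used as a crude stand-in for token count, since actual token counts depend on the tokenizer of the model in use.

```python
import json

# Hypothetical records, for illustration only.
records = [
    {"course": "Algebra", "credits": 6, "semester": "winter"},
    {"course": "Statistics", "credits": 5, "semester": "summer"},
    {"course": "Databases", "credits": 6, "semester": "winter"},
]

def to_markdown_table(rows):
    """Render a list of flat dicts as a Markdown table (keys from the first row)."""
    keys = list(rows[0].keys())
    header = "| " + " | ".join(keys) + " |"
    divider = "|" + "|".join(" --- " for _ in keys) + "|"
    lines = [header, divider]
    for row in rows:
        lines.append("| " + " | ".join(str(row[k]) for k in keys) + " |")
    return "\n".join(lines)

json_text = json.dumps(records)
table_text = to_markdown_table(records)

# Character counts as a rough proxy for token usage.
json_chars = len(json_text)
table_chars = len(table_text)
print("JSON characters:    ", json_chars)
print("Markdown characters:", table_chars)
```

In the JSON serialization every key is written once per record, whereas the table states each column name once, so the gap widens as the number of records grows.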
Plain text files provide minimal syntactic overhead and can be token-efficient for very short or static information. However, the absence of explicit structure reduces retrieval reliability and makes it harder for the model to distinguish between different types of information, such as conditions, steps, or exceptions. This lack of structure increases the likelihood of ambiguous interpretation and therefore the risk of hallucinations in more complex queries. Plain text is also less maintainable for growing knowledge bases, as content organization relies entirely on manual conventions rather than formal structure.
In summary, the knowledge format choices in TUNa reflect a trade-off between efficiency, reliability, and maintainability. Markdown is the preferred format due to its favorable balance of readability, token efficiency, and structural clarity. PDFs are used only when required by the source, accepting reduced retrieval efficiency in exchange for authoritative accuracy. JSON and plain text are reserved for narrowly defined technical use cases where their specific advantages outweigh their limitations. This deliberate design contributes to stable model behavior, reduced hallucination risk, and scalable maintenance of the knowledge base.