I've encountered many tasks that could almost be automated, except that they have some kind of ambiguous step that requires manual attention. For example, it's extremely common to find cases where freeform text needs to be flagged if it contains sensitive or important information. For such tasks, it's easy to write software that performs the subsequent steps, such as interacting with a database or moving files around, but the ambiguous text-processing step can be extremely difficult to implement programmatically. Solutions built on regular expressions or other simple text-processing techniques tend to be finicky, inaccurate, and difficult to tune.

LLMs present an opportunity to make big gains in this problem space. Most programming languages can easily call into large commercial models like OpenAI's GPT-4 or Google's Gemini: you can read text into your program, pass it to one of these services' APIs, and act on the results. However, using external services for this can be problematic because of the cost and privacy implications. You may not want to depend on a paid service for an important task, and you really may not want to send organizational data to it.

On the other hand, open source LLMs that run locally on your system have been rapidly improving. They aren't quite as powerful as commercial models, but they can be surprisingly effective, particularly if you give them simple tasks and make efforts to constrain their responses. Additionally, you always have the option of fine-tuning to improve performance.

One of the biggest open source projects for running these models is llama.cpp, which performs inference on open source models. It can be built as a library and used from other languages through bindings to its API. I really like to experiment with new problem spaces in Clojure, so I used llama.clj, a Clojure wrapper for llama.cpp, to experiment with programmatically interfacing with LLMs to automate tasks.

To experiment, I put together an example that simulates a case where we have a bunch of customer service inquiries and want to figure out whether each one is a refund request, an order inquiry, or just general feedback. The goal of these experiments was only to perform this classification step.

(require '[com.phronemophobic.llama :as llama])
(require '[com.phronemophobic.llama.util :as llutil])

;; 8B parameter Llama 3 model with 4-bit quantization that easily runs on my MacBook M1
(def model-path "/Users/jon/development/cpp/llama.cpp/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf")
(def llama-context (llama/create-context model-path {}))

(def inquiries
  [{:classification "Order Inquiry" :inquiry "Where is my order? It was supposed to arrive yesterday."}
   {:classification "Refund Request" :inquiry "I want to return an item and get a refund. Can you help me with that?"}
   {:classification "General Feedback" :inquiry "I think your website could be more user-friendly."}
   ;; ... many more omitted
])

(def classifications ["Order Inquiry" "Refund Request" "General Feedback"])
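
The classify functions below rely on two small prompt helpers, llama3-prompt and llama3-inquiry, that aren't defined in this section. A minimal sketch of what they might look like, assuming the standard Llama 3 Instruct chat template and reusing the prompt wording from greedy-constrained-classify further down (the template details here are my assumptions, not code from the project):

;; Hypothetical helpers: wrap text in the Llama 3 Instruct chat template.
(defn llama3-prompt
  "Wraps a user message in the Llama 3 Instruct chat template."
  [user-message]
  (str "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
       user-message
       "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"))

(defn llama3-inquiry
  "Builds the classification prompt for a single inquiry."
  [inquiry]
  (llama3-prompt
   (str "Inquiries can be one of 'Order Inquiry', 'Refund Request', or "
        "'General Feedback'. What is the classification of the following "
        "inquiry? Reply with only the classification and nothing else: \""
        inquiry "\"")))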

My setup was to run each inquiry through a classify function and attach the experimental result alongside the known classification so that I could compare them.

;; Classify using llama.clj's default sampling function (mirostat v2).
(defn naive-classify [inquiry]
  (llama/generate-string
   llama-context
   (llama3-inquiry inquiry)))

;; Classify using greedy sampling: always take the most likely next token.
(defn naive-greedy-classify [inquiry]
  (llama/generate-string
   llama-context
   (llama3-inquiry inquiry)
   {:samplef llama/sample-logits-greedy}))

(defn classify-inquiries [classify-fn inquiries]
  (map (fn [inquiry]
         (assoc inquiry
                :experimental-classification
                (classify-fn (:inquiry inquiry))))
       inquiries))
       
(defn correct-results [results]
  (filter #(= (:classification %) (:experimental-classification %)) results))

(defn incorrect-results [results]
  (filter #(not= (:classification %) (:experimental-classification %)) results))

(defn summary-results [results]
  {:correct (count (correct-results results))
   :incorrect (count (incorrect-results results))
   :accuracy (/ (count (correct-results results)) (count results))})
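
With those pieces in place, comparing approaches is just a matter of running something like the following (illustrative usage of the functions above; the exact numbers will of course depend on the model, prompt, and data):

;; Run a classifier over the dataset and summarize how it did.
(summary-results (classify-inquiries naive-classify inquiries))
;; => {:correct ..., :incorrect ..., :accuracy ...}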

llama.clj provides a few different sampling functions that you can use to generate text. Sampling functions are the means by which the tokens that form the response are selected. In addition to a few pre-defined options, the library also lets you create your own. For my experiments, I started with mirostat v2 (which naive-classify above uses - it's llama.clj's default) and a greedy sampling function (which naive-greedy-classify requests via the :samplef key).
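
As a rough sketch of what a custom sampling function can look like (this assumes, based on how the built-in greedy sampler is passed in above, that the :samplef function receives the logits for the next token and returns a token id), a hand-rolled greedy sampler might be:

;; Sketch of a custom :samplef. Assumption: it is called with the vector of
;; logits and must return the chosen token id. This one simply picks the
;; most likely token, mirroring llama/sample-logits-greedy.
(defn my-greedy-samplef [logits]
  (->> logits
       (map-indexed vector)   ;; pair each token id with its logit
       (apply max-key second) ;; keep the highest-scoring pair
       first))                ;; return the token id

(defn custom-sample-classify [inquiry]
  (llama/generate-string
   llama-context
   (llama3-inquiry inquiry)
   {:samplef my-greedy-samplef}))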

I generally got about 92% accuracy with the model, data, and prompts I used. When I looked at the incorrect answers, many of them came from responses that didn't follow the specific response format I asked for, e.g. returning something like "[Order Inquiry]" instead of just "Order Inquiry". Since llama.cpp is feeding you text straight from the model, you have to get a little creative to deal with this. Both llama.cpp and llama.clj provide ways to constrain models to produce valid JSON, which you can then further validate. I decided to start simpler and just retry whenever the response doesn't belong to the set of strings I'm looking for:

;; Retry a few times when the response isn't one of the allowed
;; classifications; give up and return nil after too many attempts.
(defn retry-classify [inquiry]
  (loop [count 0]
    (let [response (llama/generate-string
                    llama-context
                    (llama3-inquiry inquiry))]
      (if (some #(= response %) classifications)
        response
        (if (> count 3)
          nil
          (recur (inc count)))))))

This took care of those cases nicely and brought accuracy up to about 95%. It didn't noticeably decrease performance because it doesn't retry often - only in specific failure cases.
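
If you did want to keep an eye on that cost, a quick sanity check (just a sketch using Clojure's built-in time macro) is to wall-clock both classifiers over the same dataset:

;; Compare wall-clock time with and without retries.
(time (summary-results (classify-inquiries naive-classify inquiries)))
(time (summary-results (classify-inquiries retry-classify inquiries)))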

One potential issue is that if the model ever went haywire and started filling the context window with garbage (which does happen occasionally with some models), generating responses would take an inordinate amount of time and still fail, slowing everything to a crawl. To avoid this, I decided to try hand-rolling a greedy sampling routine that only selects tokens which build toward a valid classification. The way llama.cpp works is that it determines the relative probability of every possible token at each point in the response. My routine constrains the response by only allowing tokens that appear in the pre-defined categories (defined by classifications above): for the first generated token it can only select the first token of one of those classifications, and among those candidates it picks the one the model deems most likely.

From there, at each step we filter down to the classifications that start with the response tokens we've accumulated so far, and repeat the process until we have a full response.

(defn greedy-constrained-classify [inquiry]
  (let [prompt (llama3-prompt (str "Inquiries can be one of 'Order Inquiry', 'Refund Request', or 'General Feedback'. What is the classification of the following inquiry? Reply with only the classification and nothing else: \"" inquiry "\""))
        prompt-tokens (llutil/tokenize llama-context prompt)]
    ;; Feed the prompt into the context one token at a time.
    (llama/llama-update llama-context (llama/bos) 0)
    (doseq [token prompt-tokens]
      (llama/llama-update llama-context token))
    ;; Greedily generate tokens, but only ever choose tokens that continue
    ;; one of the allowed classification strings.
    (loop [classification-tokens (map #(llutil/tokenize llama-context %) classifications)
           acc []]
      (let [logits (llama/get-logits llama-context)
            ;; Candidate tokens: the next token of each still-viable classification.
            valid-tokens (map first classification-tokens)
            ;; Pick the candidate with the highest logit.
            token (->> logits
                       (map-indexed (fn [idx p]
                                      [idx p]))
                       (filter #(contains? (set valid-tokens) (first %)))
                       (apply max-key second)
                       first)
            next-acc (conj acc token)
            ;; Keep only the classifications consistent with the token we just
            ;; chose, dropping that token from the front of each.
            next-classification-tokens (->> classification-tokens
                                            (filter #(= (first %) token))
                                            (map rest)
                                            (map vec)
                                            (into []))]
        (if (or (empty? next-classification-tokens) (every? empty? next-classification-tokens))
          ;; A full classification has been generated; convert tokens back to text.
          (llutil/untokenize llama-context next-acc)
          (do
            (llama/llama-update llama-context token)
            (recur next-classification-tokens next-acc)))))))

Unfortunately, in my testing this brought accuracy way down, to around 84%. When I dug into the issue, it seemed that in the failing cases the very first response token generated was for the wrong classification. It's hard for me to tell what the issue is without more knowledge of llama.cpp or how the wrapper works. It's possible that there's an issue with the initial context setup while reading in the prompt. I emulated the setup steps I found in llama.clj's documentation, but perhaps llama.cpp has changed since that documentation was written (the examples and tutorials were written against llama 2), or maybe llama 3 is fundamentally different and requires different processing, or maybe I overlooked some other mistake in my code.
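
One simple way to keep digging, using only the helpers already defined above, is to pull out the failing cases and eyeball them next to their expected labels:

;; Collect the inquiries where the constrained classifier disagrees with the
;; labeled classification, for manual inspection.
(->> (classify-inquiries greedy-constrained-classify inquiries)
     incorrect-results
     (map (juxt :inquiry :classification :experimental-classification)))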

Next, I'm planning to dig into llama.cpp in more depth to learn more and continue experimenting; it would be interesting to look at its internal mechanics and see why my sampling function failed. Before that, I think it might be a good idea to work through Andrej Karpathy's neural network course to get more context on how these libraries work under the hood. It probably isn't strictly necessary, but the more domain knowledge you have, the easier it is to understand the architecture of software projects.

Overall, llama.clj is extremely fun to learn and sketch out ideas with. Even without much knowledge about LLMs, you could probably use it to put together some interesting and effective applications. I came in expecting just a dry but functional wrapper around llama.cpp, but the documentation was fantastic and I learned a lot of new techniques and concepts from the Clojure code used throughout the project. From the little bit of llama.cpp that I looked at, it doesn't seem to strive for API stability, so it must be a challenge for wrapper libraries such as llama.clj to keep up to date. In any case, I had a great time and I'm looking forward to learning more about training and inference.