Beyond the Prompt: The Technical Nuances of GenAI Systems at

Pablo Elgueta

Head of AI Engineering - is a chat-based tool for Microsoft Teams and Slack that simplifies affiliate marketing analytics. It lets you connect your affiliate network accounts via API keys, and ask natural language questions like "How were sales for Advertiser X yesterday?" or "What was the conversion rate for Advertiser X last quarter?" Additionally, you can get AI-generated reports with insights, key KPIs, publisher analysis, and fraud detection – all complete with a data spreadsheet.

How does work?

Discern Intent

The first step in the process is to determine whether the user intends to retrieve data. Users often start their first interaction with the chatbot by saying “Hi” or asking about its features. In that case it’s important to reply in a conversational manner while still limiting the type of output the chatbot can provide. To determine this we use Function Calling, where the function is a simple Pydantic class with a single boolean argument, which is True if the user is asking about data and False if not.
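As a rough illustration of this binary intent check, the sketch below describes a one-field function schema and validates the arguments a model might return. It is not our production code: the names `DataIntent`, `is_data_request`, `INTENT_TOOL` and `route` are hypothetical, the tool envelope only follows the general JSON-schema shape that function-calling APIs tend to accept, and a plain dataclass stands in for the Pydantic class so the example has no dependencies.

```python
import json
from dataclasses import dataclass


@dataclass
class DataIntent:
    """One-field schema the model must fill in (hypothetical name)."""
    is_data_request: bool


# Assumed tool description in the general JSON-schema shape used by
# function-calling APIs; the exact envelope differs between providers.
INTENT_TOOL = {
    "name": "DataIntent",
    "description": "Flag whether the user is asking for affiliate data.",
    "parameters": {
        "type": "object",
        "properties": {"is_data_request": {"type": "boolean"}},
        "required": ["is_data_request"],
    },
}


def parse_intent(tool_arguments: str) -> DataIntent:
    """Validate the JSON arguments returned by the model's function call."""
    payload = json.loads(tool_arguments)
    value = payload.get("is_data_request")
    if not isinstance(value, bool):
        raise ValueError(f"expected a boolean, got {value!r}")
    return DataIntent(is_data_request=value)


def route(intent: DataIntent) -> str:
    """Pick the next pipeline step from the binary intent."""
    return "data_pipeline" if intent.is_data_request else "conversational_reply"
```

With a schema this small, the downstream code only has to branch on one validated boolean, which keeps the step easy to test in isolation.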


This approach poses a simple binary question to the LLM, which it can answer quickly, cheaply and consistently. We’ve had great success with models like Haiku for these types of problems.


It’s worth noting that our original iteration made use of agents for this purpose, but we found them clumsy and slow to use, while not providing the accuracy we needed. We realize defining intent like this is similar to old school NLP chatbots and kind of defeats the purpose of open-ended chatbot interfaces, but in our case our priority is not to provide a conversation to the user, but useful data.


Named Entity Recognition

If the user intent is data related, we move on to Named Entity Recognition. Our task is to identify the advertiser name so we can retrieve its ID and call the business data API later. As users often make spelling mistakes or have their own naming conventions for their brands, simple fuzzy matching is not enough to extract them, so we turned to LLMs.


Previously we had a relatively complex multi-step process to handle NER, which included a Vector Database and semantic search. It needed a specialized data pipeline to keep it updated, and it added quite a bit of latency, while not providing the level of accuracy we needed. After lots of testing and experimentation, we decided to be pragmatic and chose a straightforward solution: to just pull the whole list of advertisers and their ids through an API call and pass it to the LLM prompt, instructing it to decide which one was a match.

In our case this was viable due to the small number of advertisers that each user has access to. With increasing context windows, we see this as a viable solution for other use cases.

We also added few-shot prompting: we give the LLM a diverse range of example questions it might encounter, which significantly improved its accuracy.
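To make the "pass the whole list plus a few examples" idea concrete, here is a minimal sketch of how such a prompt could be assembled and how the reply could be checked. The advertiser names, the `advertiser_id` reply format and both function names are assumptions made for illustration, not our actual prompt.

```python
import json
from typing import List, Optional

# Hypothetical few-shot examples: (advertiser list, user question, expected id).
FEW_SHOT = [
    ("17: Acme Corporation", "How were sales for acme corp yesterday?", 17),
    ("42: Globex", "conversion rate for Globx last quarter", 42),
    ("17: Acme Corporation", "What can you do?", None),
]


def build_ner_prompt(question: str, advertisers: List[dict]) -> str:
    """Put the user's full advertiser list and the examples into one prompt."""
    catalog = "\n".join(f'{a["id"]}: {a["name"]}' for a in advertisers)
    shots = "\n\n".join(
        f"Advertisers:\n{listing}\nQuestion: {q}\n"
        f"Answer: {json.dumps({'advertiser_id': expected})}"
        for listing, q, expected in FEW_SHOT
    )
    return (
        "Pick the advertiser the question refers to, allowing for typos and "
        "informal names. Reply with JSON only; use null if none matches.\n\n"
        f"{shots}\n\n"
        f"Advertisers:\n{catalog}\nQuestion: {question}\nAnswer:"
    )


def resolve_advertiser(reply: str, advertisers: List[dict]) -> Optional[dict]:
    """Accept the model's answer only if the id really is in the list."""
    chosen = json.loads(reply).get("advertiser_id")
    for advertiser in advertisers:
        if advertiser["id"] == chosen:
            return advertiser
    return None
```

Checking the returned id against the list is a cheap guard against the model inventing an advertiser that the user does not have access to.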

With these changes we managed to reduce the code from hundreds of lines to a couple dozen, while also reducing latency and complexity and increasing accuracy.



Query Reconstruction

The next step in the pipeline is Query Reconstruction. Users are often not the best at prompting the model, or some additional context needs to be added to improve accuracy. For example, if the user asks “What were sales like yesterday for Advertiser X?” and then follows up with “and for Advertiser Y?”, the LLM won’t be able to answer, as the second question doesn’t make any sense on its own. Apps normally handle this with memory, but we’ve found that the LLM often gets confused when there’s too much context, especially as the chat history grows.

To solve this, we pass the chat history to the Query Reconstructor module with the sole instruction to rewrite the query based on it, and then pass the rewritten query along to the next step. As before, we’ve found that having more atomic modules leads to better performance, cost, and latency.
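A minimal sketch of such a reconstruction step is shown below. The prompt wording, the function names and the `llm` callable (any function that takes a prompt and returns text) are assumptions for illustration; skipping the call when there is no history is also our simplification here, not something the pipeline necessarily does.

```python
from typing import Callable, List, Tuple


def build_reconstruction_prompt(history: List[Tuple[str, str]], query: str) -> str:
    """Ask only for a self-contained rewrite of the latest question."""
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        "Rewrite the latest user question so it is fully self-contained, "
        "using the chat history only to fill in missing details. "
        "Return only the rewritten question.\n\n"
        f"Chat history:\n{transcript}\n\n"
        f"Latest question: {query}\n"
        "Rewritten question:"
    )


def reconstruct_query(
    history: List[Tuple[str, str]], query: str, llm: Callable[[str], str]
) -> str:
    """Return a standalone query; later steps never see the raw history."""
    if not history:
        return query  # nothing to resolve against, so skip the LLM call
    return llm(build_reconstruction_prompt(history, query)).strip()
```

The point of the design is that only this module ever reads the chat history, so every later step works on a single, complete question.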



Step four is the new addition to our architecture. Previously we had three other steps that occurred sequentially: checking the query for errors, deciding if the data request was for bulk or specific data, and extracting the necessary arguments for the API call.

With optimization in mind, especially in terms of latency, we decided to parallelize the calls using multithreading, specifically LangChain’s RunnableParallel. Here we make the calls for every potential option in the routing tree, and only make a decision once we have the results for all of them. We traded some extra LLM calls that the sequential approach wouldn’t have needed for speed, and the result was a latency reduction of over 70%.
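The fan-out pattern can be sketched with the standard library alone. The example below uses `concurrent.futures.ThreadPoolExecutor` as a stand-in for RunnableParallel, so it shows the idea rather than our LangChain code; the four step functions and their return values are hypothetical placeholders for LLM-backed modules.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def run_parallel(steps: Dict[str, Callable[[str], Any]], query: str) -> Dict[str, Any]:
    """Run every step on the same query concurrently and wait for all results."""
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = {name: pool.submit(step, query) for name, step in steps.items()}
        # .result() blocks, so the dict is complete only once every call is done.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the four LLM-backed modules.
def check_query(query: str) -> dict:
    return {"is_complete": True, "comment": ""}


def route_query(query: str) -> str:
    return "Query"


def extract_report_args(query: str) -> dict:
    return {"advertiser_id": 17, "period": "last_month"}


def extract_query_args(query: str) -> dict:
    return {"advertiser_id": 17, "metric": "sales", "period": "yesterday"}


STEPS = {
    "query_check": check_query,
    "route": route_query,
    "report_args": extract_report_args,
    "query_args": extract_query_args,
}
```

Calling `run_parallel(STEPS, "How were sales for Advertiser X yesterday?")` returns one dictionary keyed by step name. Because LLM calls are mostly spent waiting on the network, threads are enough to overlap them, and total latency approaches that of the slowest call rather than the sum of all calls.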

The first of the parallelized calls is the Query Checker, which, as its name implies, checks if there’s any missing information in the user request that could lead to an API error, such as if they didn't include the metric they need, or a specific time frame. Before, we just sent a message to the user telling them they had missing arguments, but found that actually telling them what was missing and how to correct the question made for better UX.

In this instance, we employ function calling to coerce the model to produce a JSON response with two arguments: a Boolean value to check if the query includes the arguments, and a string containing commentary on the missing data. In this way we can easily route the user and provide quick feedback without having to initiate the data retrieval pipeline.
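The two-argument response described here might be handled along these lines. The field names `is_complete` and `comment`, the fallback message and the helper names are assumptions for illustration, and a dataclass again stands in for whatever schema class is used in practice.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryCheck:
    """Hypothetical shape of the Query Checker's function-call arguments."""
    is_complete: bool  # True when the query has everything the API needs
    comment: str       # what is missing and how to fix it, if anything


def parse_query_check(tool_arguments: str) -> QueryCheck:
    """Validate the JSON the model returned for the function call."""
    payload = json.loads(tool_arguments)
    is_complete = payload.get("is_complete")
    if not isinstance(is_complete, bool):
        raise ValueError(f"expected a boolean, got {is_complete!r}")
    return QueryCheck(is_complete=is_complete, comment=str(payload.get("comment", "")))


def feedback_for_user(check: QueryCheck) -> Optional[str]:
    """Return a message to send back, or None if the query can proceed."""
    if check.is_complete:
        return None
    return check.comment or "Your question is missing some details."
```

Returning `None` for a complete query lets the caller treat "no feedback" as the signal to continue with data retrieval.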


The next module is what we call the Semantic Router. Its role is to determine if the user is requesting a general report or asking for specific data. This can be tricky, as the difference between “Give me a performance analysis for Advertiser X last month” and “What was the performance for Advertiser X in sales last month” can be subtle for the LLM.

We’re also using Function Calling here, having the model pick between a Reports or Query function depending on the user question. Originally it also extracted the arguments, but we found that splitting the functionality into separate calls increased the accuracy, and since we’re doing the calls simultaneously latency is not an issue.
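One way to picture this router is as two argument-free functions whose descriptions carry the distinction, as in the sketch below. The descriptions, the JSON-schema envelope and `pick_pipeline` are illustrative assumptions; only the two function names, Reports and Query, come from our setup.

```python
# Two functions with no parameters: the model only chooses which one to call.
# Argument extraction happens in separate calls.
ROUTER_TOOLS = [
    {
        "name": "Reports",
        "description": "The user wants a general report or performance analysis.",
        "parameters": {"type": "object", "properties": {}},
    },
    {
        "name": "Query",
        "description": "The user wants a specific metric for a specific period.",
        "parameters": {"type": "object", "properties": {}},
    },
]


def pick_pipeline(tool_call_name: str) -> str:
    """Map the function the model chose to a pipeline, rejecting anything else."""
    valid = {tool["name"] for tool in ROUTER_TOOLS}
    if tool_call_name not in valid:
        raise ValueError(f"unexpected function: {tool_call_name!r}")
    return tool_call_name.lower()
```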


At the same time as the previous two calls, we run two more that extract the arguments needed for the API call, one assuming the Reports pipeline will be chosen, and the other for the Query pipeline. Once we get the results for all four calls, we decide which results to take and move forward.
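The decision made once all four results are in could look roughly like this. The dictionary keys, action names and the choice to let the Query Checker take precedence are assumptions for the sake of the example.

```python
from typing import Any, Dict


def decide(results: Dict[str, Any]) -> Dict[str, Any]:
    """Choose which of the four parallel results to act on."""
    check = results["query_check"]
    if not check["is_complete"]:
        # Missing information: give feedback and skip data retrieval entirely.
        return {"action": "ask_user", "message": check["comment"]}
    if results["route"] == "Reports":
        return {"action": "run_report", "arguments": results["report_args"]}
    # Otherwise treat it as a specific data request; the report args are discarded.
    return {"action": "run_query", "arguments": results["query_args"]}
```

Whichever branch is taken, the argument set extracted for the other pipeline is simply thrown away, which is the extra cost accepted in exchange for lower latency.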


Once we have the arguments, we make the API call, retrieve the results and convert them to a DataFrame. This is passed to an LLM again to interpret and convert to natural language, and the answer is returned to the user.

In the case of the Reports module, we do some additional post-processing to convert the data into an Excel file, which is also sent to the user.


The field of AI is marked by relentless evolution. To stay ahead of the curve, it's crucial to maintain a mindset of adaptability. Embracing experimentation with new approaches and techniques is essential as the capabilities of LLMs rapidly expand. What was impossible a month ago might be possible with a newly released model, and the state of the art for a given functionality might no longer be the state of the art. We embody this philosophy, constantly refining our architecture to ensure we deliver the fastest and most insightful affiliate marketing analytics possible.