
AI in your Organization: let's start with the bang-BOT! Part 2

What is an AI Assistant's ideal diet? Your AI Assistant feeds on data just as we feed on food and calories. But the quantity and quality of that data make all the difference in how your assistant will grow and how well it will be able to respond.


As anticipated, in this chapter we talk about data and how to guarantee your assistant a diet that helps it help you as well as possible.



If you missed the first part, these were the starting questions:

"What is an AI-based Bot? How do you go from a simple support Bot to an AI Assistant that interacts in a natural and 'human' way?" I suggest you read the first part if you haven't already.


Before starting, I ask non-technical readers to forgive my technicalities and technicians to forgive my extreme simplifications. Here, we are right on the border between the two worlds. 🙂


I know you were looking forward to spending ten minutes discussing this topic, so let's begin!


Which data to feed your AI

We take it for granted that an averagely structured company has a lot of public data on its websites, including support ones such as the classic knowledge base or support center, product catalogs, price lists, technical characteristics, and all kinds of material. Added to these are more confidential data such as configurators, technical data sheets, product specifications, processes, and, last but not least, thousands of emails and messages exchanged with its users.

This data is fundamental: it lets our AI Assistant understand the context in which it must operate, it is essential for giving relevant and practical answers, and it guides the assistant toward providing exactly the answers you want to share.

Otherwise, it draws on its own general knowledge and gives generic and wrong answers, drifting into the psychedelic world of Hallucinations.


Three important points:


  • At the risk of repeating myself, your information must be well analyzed and selected. This process is called "Data Preparation" and is essential for good results. If garbage comes in, garbage comes out: for computer scientists, the saying is ancient, Garbage In, Garbage Out. Imagine what can happen if the data provided contains conflicting versions of the same product, different prices, or a customer-service answer that turned out to be wrong and has been sent hundreds of times to your users. Remember from the "Hiring an AI in your Organization" series that the responsibility is yours!



  • In addition to providing it with the data, we will have to oblige it to use only our knowledge, to prevent it from starting to talk about other topics with your users at your expense.


  • Furthermore, always keep in mind that the job of any LLM is to produce content based on the knowledge acquired in its model. Therefore, given the same question and source data, you will never get the same answers because LLMs generate new text starting from your prompt.


So, with inadequate or inaccurate data, the risk you run is providing a correct input and getting back an output that is merely the AI model's own interpretation.
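
To make the Garbage In, Garbage Out point a little more concrete, here is a minimal, purely illustrative sketch in Python (with made-up field names like `product` and `updated_at`) of what a first data-preparation pass could look like: keep only the most recent version of each record and drop the obviously unusable ones.

```python
def prepare_records(records):
    """Very first data-preparation pass: one clean record per product.

    Each record is assumed to look like (hypothetical fields):
    {"product": "AB-123", "price": 99.0, "updated_at": "2024-01-31", "text": "..."}
    """
    latest = {}
    for rec in records:
        key = rec["product"]
        # Conflicting versions of the same product: keep only the newest one.
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec

    # Drop records that are obviously unusable (no description, no price).
    return [r for r in latest.values() if r.get("text") and r.get("price") is not None]
```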


Come on, let's start by looking at some types of information you might want to feed it with.


The tickets

Tickets, for those who don't know, are support requests of any type, internal or external, recorded on helpdesk systems to track, sort, manage, and measure them.

Let's say, to include everyone, that a ticket represents a user request and contains conversations and actions from when they make it to when you resolve it.

Often, there are no systems in the company to handle them, and 'tickets' sit unstructured in the various support mailboxes or in chats on Teams/Slack and similar tools. There is, therefore, a lot of hidden knowledge full of questions and answers.

OK?


As mentioned above, especially if you have many of them, it doesn't always make sense to use all the tickets to feed the AI Assistant: not all contain correct answers or well-posed questions. You must select, category by category, the ones that are best in terms of both the quality of the question and of the answer.

And periodic maintenance must be carried out, because excellent answers given perhaps a couple of years ago may have been overtaken by more recent ones where your operators found better solutions.

So, connecting the AI you choose to ALL tickets/emails seems somewhat wrong because it would risk inserting garbage into the model. Poisoning it. Repetita Iuvant.


It essentially means having to carry out a good data analysis and selection phase with sufficiently large samples of positive interactions that have proven effective.
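
Just to give an idea of what that selection phase might look like in practice, here is a rough sketch. The ticket fields and thresholds are assumptions, not a real helpdesk schema; the point is simply to keep resolved, reasonably recent tickets whose answers were confirmed as good.

```python
from datetime import datetime, timedelta

def select_training_tickets(tickets, max_age_days=730, min_confirmations=2):
    """Keep only the tickets worth feeding to the assistant.

    Each ticket is assumed to look like (hypothetical fields):
    {"status": "resolved", "closed_at": datetime, "confirmations": int,
     "question": "...", "answer": "..."}
    """
    cutoff = datetime.now() - timedelta(days=max_age_days)
    selected = []
    for t in tickets:
        if t["status"] != "resolved":
            continue  # unresolved tickets carry no proven answer
        if t["closed_at"] < cutoff:
            continue  # old answers may have been superseded by better ones
        if t["confirmations"] < min_confirmations:
            continue  # not enough evidence that the answer was actually good
        selected.append({"question": t["question"], "answer": t["answer"]})
    return selected
```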


'Spontaneous' feedback

A blessing and delight for many, social media is a mine of essential information. So, one idea is to use all the feedback you collect and bring it into an AI Assistant that, when queried appropriately, can summarize your performance as seen from the market's point of view. Data Preparation work can be important here, too.


What can you do with it? For example, allow the Research & Development department to interact with your customers based on the feedback received, or analyze new product ideas by exploiting existing interactions.

In addition to the feedback, you could provide other 'metadata,' such as the number of confirmations of the validity of that ticket, to help the Assistant take this into account.


The Knowledge Base

Have you ever developed a support site or section where the entire knowledge base is public? I'm talking about things like https://help.disneyplus.com/ or https://help.openai.com/. Or more straightforward things, such as pages with answers to the most frequently asked questions (FAQ) like these: https://www.garanteprivacy.it/faq.

They are complex projects requiring a lot of time, knowledge of the company, and constant maintenance.

But they often represent the crown jewel of your knowledge and, especially when combined with statistical data on their actual use, represent an excellent selection of knowledge to feed your AI Assistant.

From my point of view, these could 100% end up in the knowledge base without much analysis. Do you agree?


The website

If you have a knowledge base, you also have a well-structured website containing up-to-date, correct, clear information about your company and your products and services.

Many AI Assistant solution providers allow you to connect to your site and use all its content as an information source. Very probably here, too, it is worth giving full access to the information, while verifying that the chosen solution can notice when a document is updated: a non-obvious and non-trivial issue in this area.
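
How might a solution "notice" that a document was updated? One simple, admittedly naive approach, sketched here assuming you can fetch your own pages, is to store a fingerprint (hash) of each page and re-ingest only the pages whose fingerprint has changed since the last crawl.

```python
import hashlib
import urllib.request

def page_fingerprint(url):
    """Return a hash of the page content, used to detect changes between crawls."""
    with urllib.request.urlopen(url) as resp:
        return hashlib.sha256(resp.read()).hexdigest()

def pages_to_reingest(urls, previous_fingerprints):
    """Compare current hashes with the ones saved at the last crawl."""
    changed = []
    for url in urls:
        current = page_fingerprint(url)
        if previous_fingerprints.get(url) != current:
            changed.append(url)
        previous_fingerprints[url] = current  # remember for next time
    return changed
```

Note that hashing raw HTML will also flag purely cosmetic changes, so real products are usually smarter about what they compare.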


Then there are the PDFs. (But forget them!)

Regardless of the content, the channel, and the objective, we are all clogged with PDFs describing every detail of our organization.

Consequently, serial Generative AI consumers always keep PDFs at the center of discussions about their use as a reliable source to be fed to an LLM.


Are these the proper documents to use?

NO!


Why?

Because they are documents designed for 'printing' or at least to be suitable for the human eye.


But all the formatting information, the way the text is laid out, the presence of irrelevant graphic content, and so on do nothing but slow things down and confuse the engine that has to analyze the files and turn them into plain text, often in whatever random order results from how the document was composed. The presence of boxes often makes the text disjointed and difficult for an LLM to read.

Furthermore, the text itself is often embedded in images, which some AI engines simply ignore.

They essentially contain very unstructured data.

If you have the text from which they were created - assuming they are your company's documents - do not hesitate to use the original, even better if in TXT, HTML, Markdown, Word, or Excel formats, which are simpler and more digestible for LLMs.

Alternatively, if you can't help it, select the text from the files that interest you, paste it into a text file, and see what comes out.
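
If you really cannot avoid PDFs, the "extract the text and see what comes out" step could look like the sketch below. It uses the pypdf library purely as an example of one of several possible tools; the quality of the result depends entirely on how the PDF was produced.

```python
from pypdf import PdfReader  # assumes the pypdf package is installed

def pdf_to_text(path):
    """Extract raw text from a PDF so it can be reviewed and cleaned by hand."""
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    return "\n\n".join(pages)

# Always inspect the output: boxes, columns, and images often come out scrambled.
print(pdf_to_text("technical_manual.pdf")[:2000])
```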

Avoid PDFs if you can. Thank you. 😀


But the technical manuals are in PDF... and we have many essential images in the company!!!

Sure, your company often offers complex products, with many figures and specific instructions represented in manuals that must be well laid out. Usually, the only solution available is a series of PDFs you have created with so much effort (and expense).

For example, to explain which are the right buttons on a gas boiler, a picture is worth a thousand words (even if I never understand what I should do when it stops).


The exciting thing is that the current versions of ChatGPT or Gemini can process them and use them to give you the answers.


Try some Ikea manuals, and even if they are still far from perfection, you will discover how much the new models can understand from the assembly instructions.

To make your life easier, I ran a test myself, asking Gemini, Gemini Advanced, and ChatGPT. You can see that they all understood what we are talking about, but the nature of their responses is very different...


[Screenshots: responses from Gemini, Gemini Advanced, and ChatGPT 4]


But consider that the AI Assistant you are building is unlikely to have the power of GPT-4 unless, as we will see in the next chapter, it is worth spending 20x for the answers. So, if you have a lot of information in manuals of this type, my advice is to rely on those who know how to manage it: the people who create those PDFs starting from structured data.


For example, I cite a company like ekrai.com (disclaimer: with whom I collaborated), which, having the structured data of each of your procedures or technical specifications available, can provide optimized data for a Generative AI system.


What about image analysis? Keep this as phase two of the project: even if multimodal models are making giant strides, and we will probably have significant innovations in 2024, the most reliable tools available today are based, above all, on text. And if you want to start with a small project, consider excluding them for now.


Everything else is on the cloud file system...

Of course, you have tens of thousands of Word, Excel, and PowerPoint documents. The previous rules on the 'order and selection' concept also apply here. Often, two versions of the same document contradict each other or contain outdated information.

Some systems do indeed allow you to publish and synchronize entire folders of documents, but... re-read the part above about data preparation, please. 😀


Ok, luckily, everything is there on the ERP (or on the CRM).

If you need to provide 100% accurate data and have it appear unchanged in the output, this can be done in several ways.

The critical concept is that Generative AI tends to invent and rewrite everything. This is fine when "Good Enough" results are acceptable, but not when you need exact ones.

Don't despair, therefore, there are many possibilities:

  1. Export endless quantities of files in CSV format from your management systems and load them into the assistant's configuration, remembering to periodically delete them and upload more up-to-date ones when the data changes.

  2. Use Assistants that offer direct access to your systems, using Actions, via API or via access to a database or data lake. As part of analyzing your request, they take care of 'calling' your ERP or CRM and requesting, for example, data on a product, a customer, or an order, transferring it precisely into the result provided by the AI. This is made possible by some tools (e.g., OpenAI GPTs and Assistants), which allow you to make calls directly 'from the prompt' through Actions (see the sketch after this list).

  3. Use systems that offer RAG (Retrieval Augmented Generation) solutions by searching for the data in your repository before or after building the output. In the next chapter, you will find in-depth information on this approach.
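
To give an idea of option 2, here is a minimal, hedged sketch based on the OpenAI Python SDK's tool-calling mechanism. The `get_product_price` function and the ERP behind it are entirely hypothetical, and the model name is just an example; the point is only to show how the model can 'decide' to call your system so that exact data ends up in the answer.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical Action/tool describing a call to your ERP (name and fields are made up).
tools = [{
    "type": "function",
    "function": {
        "name": "get_product_price",
        "description": "Return the current list price of a product from the ERP",
        "parameters": {
            "type": "object",
            "properties": {"product_code": {"type": "string"}},
            "required": ["product_code"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model name, not a recommendation
    messages=[{"role": "user", "content": "How much does product AB-123 cost?"}],
    tools=tools,
)

# If the model decided to call the tool, this is where you would query your ERP
# and send the exact result back to the model in a follow-up message.
print(response.choices[0].message.tool_calls)
```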


In general, the concept to remember is that in a solution of this type, it is possible to mix content generated by AI with content taken exactly from your systems.

Among the various necessary skills, what counts here is writing the prompt well.


And then there's the data that the Assistant generates!

Every good AI Assistant generates significant amounts of data: the conversations with the questions asked and the answers provided, the date and time they were asked, the users asking the questions (if not anonymous), response times, the duration of the conversations, user feedback, detected user sentiment, the classification of topics covered, errors or problems during interactions, attempts to circumvent it with malicious code...

Is that enough for you?


This data is essential for your data scientists to constantly improve the service and ensure you deliver quality answers.
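
As an illustration of the kind of record worth keeping (the field names below are just an example, not a standard), every interaction could be logged along these lines:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ConversationLog:
    """One logged interaction with the assistant, for later analysis."""
    question: str
    answer: str
    timestamp: datetime
    user_id: str | None          # None if the user is anonymous
    response_time_ms: int
    user_feedback: str | None    # e.g. "thumbs_up" or "thumbs_down"
    sentiment: str | None        # detected sentiment, if you classify it
    topics: list[str] = field(default_factory=list)
    error: str | None = None     # errors, or suspected attempts to manipulate the bot
```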


Why am I talking about attempts to circumvent it? Because your AI 'intern' is very naive and can be quite easily manipulated to extract sensitive information of all kinds.

We will talk about this in the following chapters as well!


And who can see all this stuff?

When we talk about data (and in this case I'm referring to all the data used to power the solution, the analytical data on the conversations, and the conversations themselves, which may contain personal data of your users), one of the main questions is: "Where is this data kept, and who has access to it?"


It is easy to understand that managing this data with high security is vital. Knowing its geographical location is not just common sense; thinking about the GDPR is not optional but a necessary, responsible action (yes, right, an ESG topic) to ensure protection and compliance for your information.


When you use third-party solutions, you effectively hand over all your public and private information and instructions to external providers.


Delving into the reliability, terms, and conditions of the solution's infrastructure provider is another crucial step. This allows us to understand whether the supplier itself can access the data and, consequently, to evaluate the level of trust to place in the solution.

In short, data management in the back end (behind the scenes) is not only a question of accessibility but also of trust and transparency in its custody.


The same question must be answered considering the type of users who will access your assistant, asking yourself: "Am I only providing public information available to anyone?" If not, identifying and profiling users makes a lot of sense.


But suppose you want to create several AI Assistants: one internal, with everything the technical office could provide; one dedicated to the partner network; and one dedicated to registered customers, each with potentially sensitive or confidential data. In that case, you will have to select solutions that allow you to distinguish between them and provide customized levels of security and accessibility.
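
As a rough illustration of the "different assistants, different audiences" idea (the labels and fields are arbitrary), each document in the knowledge base could carry an audience tag, and each assistant would only ever see the documents allowed for its audience:

```python
# Hypothetical audience levels, from most to least restricted.
AUDIENCES = {"internal": 3, "partner": 2, "customer": 1, "public": 0}

def documents_for_assistant(documents, assistant_audience):
    """Return only the documents an assistant with this audience level may use.

    Each document is assumed to carry an "audience" tag, e.g.
    {"title": "Price list 2024", "audience": "partner", "text": "..."}
    """
    level = AUDIENCES[assistant_audience]
    return [d for d in documents if AUDIENCES[d["audience"]] <= level]

# The internal assistant sees everything; the customer-facing one only sees
# documents tagged "customer" or "public".
```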

Fortunately, there are already tools that provide all these approaches without writing a line of code. We will also talk about this in the next chapter.


So consider carefully:

  1. Will the first Bot be public or private? And will it be based on confidential or public data?

  2. How much risk is it worth taking?


Start thinking seriously about an area of your company, or of your work, that would be a great candidate for the first AI assistant.

Start small!

Think seriously about where the data is, and you will see that next time you will have all the elements to start creating something exciting!


 

📢 If you have thoughts or comments or want to help spread these thoughts, please share this page with anyone you think might appreciate it. Your opinion matters so much!


🚀 To keep you updated on my content:

📝 Subscribe to Blog Glimpse so you don't miss a single update

📚 Check out 'Glimpse', my novel about artificial intelligence

🗓️ Contact me if you would like to organize a Workshop on AI or for any ideas.


See you soon!

Massimiliano



