🔓 Unlocking the Secrets of AI Models

The evolution from structured data to large language models

Large language AI models give you near-instant access to a huge swath of the world's information. How do they work, and how did we get there?

Executive Summary: Science fiction has become a reality, allowing us to ask questions or create content and receive instant, seemingly magical responses. But how does this 'magic' work? The secret lies in AI models, powerful tools that have revolutionized how computers store, retrieve, and generate data. In this article, we'll journey through the evolution of data storage and explore the new frontier of large language models, uncovering how these technologies shape our digital experiences today.

The dawn of the structured database

Computers have always had ways of storing structured information, but they were big and expensive, and not everyone had access. In the late 1970s and early 1980s, inexpensive computers that could store ever-larger amounts of data arrived, promising a new computing era.

As in any industry, standards are essential: if you get trained on a topic, your skills are [mostly] transferable. For databases, those transferable skills came from two technologies: the RDBMS (relational database management system) and SQL (structured query language).

Those two base technologies made (and still make) it easier for software developers to create and manage database applications using common standards. Big companies like Oracle and Microsoft developed their own RDBMS and SQL products. Most of the standards were kept in place, so their value came from ease of use and value-added features like reporting, data analysis, and application development tools.

Let's pretend for a moment that we want to create a database of our customers. We can create an RDBMS table we call CUSTOMER, and then we can define a relationship with another table we call ADDRESS. We also decide what fields we want to capture, like first_name, last_name, address, and so on. Since one customer can have multiple addresses, each address references the customer to which it belongs. That is where we introduce the concept of a primary key (PK) and foreign key (FK). All that is to say, an RDBMS is a highly opinionated representation of your data.
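For the technically curious, here is a minimal sketch of how those two tables might be defined, with the primary and foreign keys called out. The column names and types are illustrative assumptions rather than a prescription, and I use SQLite from Python only because it needs no setup:

import sqlite3

# A throwaway in-memory database, purely for illustration.
conn = sqlite3.connect(":memory:")

conn.executescript("""
CREATE TABLE Customer (
    customer_id   INTEGER PRIMARY KEY,  -- primary key (PK)
    first_name    TEXT,
    last_name     TEXT,
    email         TEXT,
    phone_number  TEXT
);

CREATE TABLE Address (
    address_id      INTEGER PRIMARY KEY,  -- primary key (PK)
    customer_id     INTEGER,              -- foreign key (FK) back to Customer
    street_address  TEXT,
    city            TEXT,
    state           TEXT,
    postal_code     TEXT,
    country         TEXT,
    FOREIGN KEY (customer_id) REFERENCES Customer (customer_id)
);
""")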

Here is how the database tables might look in the RDBMS:

An example of a relational database schema illustrating how structured data is organized in tables with defined relationships.

With the RDBMS structure, we can add data to the tables. As you can see in the following image, Alice, Clara, and Edward all have multiple addresses. Notice that the customer_id is repeated numerous times in the Address_Data table view.

Finally, we can use SQL to read data from and write data to the RDBMS. Using that previous example, let's write some SQL to get information on Clara Brown:

SELECT 
    c.customer_id, 
    c.first_name, 
    c.last_name, 
    c.email, 
    c.phone_number, 
    a.address_id, 
    a.street_address, 
    a.city, 
    a.state, 
    a.postal_code, 
    a.country
FROM 
    Customer c
JOIN 
    Address a 
ON 
    c.customer_id = a.customer_id
WHERE 
    c.first_name = 'Clara' AND 
    c.last_name = 'Brown';

At first, the SQL might look a little convoluted, but what we are doing is telling the RDBMS that we want to SELECT certain fields FROM the Customer and Address tables. We then JOIN the two tables using that customer_id field, WHERE the customer's first name is Clara and last name is Brown.

The RDBMS will only return the selected information, and then the software developer or report writer can format it as they see fit. The following image shows an example of a report using that same SQL to provide information on Clara:

Example of a report generated using that same SQL to provide information on Clara Brown.
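If you are a developer and want to see how a report writer might consume that query, here is a self-contained sketch that builds the same tables, inserts a made-up row for Clara, runs the query, and prints a simple one-line report. Everything here is illustrative sample data, not real information:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database for illustration
conn.executescript("""
CREATE TABLE Customer (customer_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT,
                       email TEXT, phone_number TEXT);
CREATE TABLE Address (address_id INTEGER PRIMARY KEY, customer_id INTEGER, street_address TEXT,
                      city TEXT, state TEXT, postal_code TEXT, country TEXT,
                      FOREIGN KEY (customer_id) REFERENCES Customer (customer_id));
-- made-up sample data
INSERT INTO Customer VALUES (3, 'Clara', 'Brown', 'clara.brown@example.com', '555-0103');
INSERT INTO Address VALUES (7, 3, '12 Picnic Way', 'Sacramento', 'CA', '95814', 'USA');
""")

query = """
SELECT c.first_name, c.last_name, c.email, a.street_address, a.city, a.state
FROM Customer c
JOIN Address a ON c.customer_id = a.customer_id
WHERE c.first_name = 'Clara' AND c.last_name = 'Brown';
"""

for first, last, email, street, city, state in conn.execute(query):
    # The report writer decides how to present these fields.
    print(f"{first} {last} <{email}>: {street}, {city}, {state}")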
Further reading: I will not cover the entire history of databases and how they led to AI models. Still, if you are interested in more database history, I recommend reading about Big Data and NoSQL.

Accessing databases with SQL is not something the everyday user could (or should) be doing. Further, databases are highly structured, organized, and consistent. Businesses focused on that structured data for a long time, but by the late 90s and early 2000s, nearly everyone working at a company had a computer. While they might be accessing structured data, they were also generating and accessing unstructured data, such as documents, presentations, scans, etc. For example:

  • Contracts
  • Purchase orders and invoices
  • Physicians' notes
  • HR policies
  • Marketing copy
  • Project plans
  • Sales presentations
  • Websites
  • Blogs

With the number of documents growing exponentially, it was more important than ever to offer a way to reliably search this information, which gave rise to the index. An index is not technically a database but a data storage and retrieval mechanism designed to extract keywords, essential topics, and elements from documents to make them searchable.

For example, let's say you are in the facilities department and receive the following document, which is a pitch to build a picnic area on your premises:

Document: This is a proposal to elevate your employees' break time with our new scenic picnic area at the corporate office. Designed for relaxation and team bonding, this outdoor space offers a refreshing change from the usual work environment, fostering creativity and collaboration in a serene natural setting.

You might upload this document to a shared workspace in your company so others can review and comment on it. When you upload that document, you are not storing it in a database. Instead, a database records the document's existence, which tells a little piece of software called a crawler to index the document when it gets a chance. That document becomes searchable in the index, and people on your team can search for picnic proposal and get a result!

The index would take forever to return the results you are looking for if it had to scan hundreds or thousands of full documents, so it optimizes what it stores. For this reason, the index might keep only what it considers the relevant information and store something like the following content:

Index: Proposal for new scenic picnic area designed for relaxation and team bonding, offering a refreshing break from the work environment. Enhances creativity and collaboration in a natural setting.

That is about a 32% space savings, and the index uses various other tricks to reduce the size of the information it stores. As you might expect, search engines like Google use indexing to store all the web's data and quickly serve you the results.
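Under the hood, a basic index might work something like the following sketch: break each document into keywords, drop common "stop words," and map every keyword back to the documents that contain it. Real search indexes are far more sophisticated, so treat this purely as an illustration with made-up file names:

import re
from collections import defaultdict

STOP_WORDS = {"this", "is", "a", "to", "the", "and", "for", "with", "our", "at", "your"}

def tokenize(text):
    # Lowercase the text, split it into words, and skip the stop words.
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

def build_index(documents):
    # Map each keyword to the set of document names that contain it.
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

documents = {
    "picnic-proposal.docx": "This is a proposal to elevate your employees' break time "
                            "with our new scenic picnic area at the corporate office.",
    "hr-policy.pdf": "HR policies for remote work and office hours.",
}

index = build_index(documents)
print(index["picnic"])    # {'picnic-proposal.docx'}
print(index["proposal"])  # {'picnic-proposal.docx'}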

The graph

In the early 2000s, social media became a hot way to connect with old friends, new friends, distant family members, and people you like to yell at randomly. In the early days, you created a web page where you could share your favorite music and maybe follow people. As the technology advanced, though, it became imperative, especially for social media companies, to surface content relevant to you, help you find people you might want to connect with, and much more.

Have you ever been on a social media site and thought, "You know what? When I left college, I lost touch with Clara Brown. Let me look and see if she's on my socials." So you go to the search bar, and as you type the letters "Cl," Clara Brown displays at the top of the search results! Wait, how did that happen? Enter the graph database!

A graph database can utilize structured and unstructured data and efficiently traverse and query massive amounts of interconnected data. Going back to the social media example, take a look at the following image:

An example of a social media graph: nodes representing individuals, connected by edges that indicate relationships.

In that image, you are the yellow dot, or what a social graph might call a node. Those lines are edges that connect you to other people's nodes. The closer a node is to you, the more interconnected you are; the further away, the less connected. You can see how some outliers spring off your deeply connected nodes. Perhaps one of those groupings represents indirect connections, such as people who share the same alma mater, birthday, or beliefs.

This social media graph can surface a probabilistic pattern: you and Clara attended the same college, graduated around the same time, and even share some affinity relationships. When you go to that search bar looking for someone, the search engine might be optimized to help you find new people to connect with. Therefore, when you type "Cl," the search engine immediately sees a connection in the graph to someone you are not yet connected to whose name starts with "Cl" and determines that Clara Brown is a solid suggestion.
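A toy version of that "type Cl and see Clara" behavior might look like the following sketch: people are nodes, connections are edges, and the suggestions are friends-of-friends whose names start with whatever you have typed so far. Real social graphs live in dedicated graph databases with far richer ranking signals; the names here are made up for illustration:

# Each person (node) maps to the set of people they are connected to (edges).
graph = {
    "You":         {"Alice", "Edward"},
    "Alice":       {"You", "Clara Brown"},
    "Edward":      {"You", "Clint Park"},
    "Clara Brown": {"Alice"},
    "Clint Park":  {"Edward"},
}

def suggest(graph, me, typed):
    # Suggest people you are not yet connected to, reachable through your
    # connections, whose names start with what you have typed so far.
    matches = set()
    for friend in graph[me]:
        for friend_of_friend in graph[friend]:
            if friend_of_friend == me or friend_of_friend in graph[me]:
                continue
            if friend_of_friend.lower().startswith(typed.lower()):
                matches.add(friend_of_friend)
    return matches

print(suggest(graph, "You", "Cl"))  # {'Clara Brown', 'Clint Park'} (order may vary)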

AI vs. Interconnectedness

At this point, you have learned:

  • 📊 Structured databases use a predefined schema to organize and store data consistently using tables, rows, and columns.
  • 📚 Unstructured databases are designed to store and manage data without a fixed structure. They allow flexible and scalable storage of diverse data types (text, images, or documents).
  • 🕸️ Graph databases enable efficient traversal and querying of complex, interconnected relationships.

Let's pretend for a minute that I can access an app that allows me to search for Clara Brown. The app can search our corporate databases (structured data), corporate portals (unstructured documents), and some outside information from social media sites. The following image is what the results of my search might look like:

The search results for Clara Brown, showing profile information pulled from corporate databases, unstructured documents, and social media sites.

But did you notice something important missing from all this data? The computers can see the information and, with some algorithmic wizardry, can identify Clara's interconnectedness in and out of our business. However, they do not understand the data.

With an AI system, I could enter a prompt like this: Who is Clara Brown? And receive a response like this:

Based on your role in the company as a Facilities Manager, I can see the interactions between you, Clara, and the rest of the Internet. Clara appears to be your preferred vendor for creating open spaces in your organization. Most recently, Clara pitched an idea for a picnic space outside your Sacramento facility, which includes a sales pitch, an insurance policy estimate, and a detailed cost estimate totaling about $50,000. It also appears that another expert in the field of open spaces, by the name of Shawn Lomi, is connected with Clara, and they both work for Acme Corp. Is there more information you would like me to provide?

You can then follow up with a question about something your company has no awareness of: Does Clara write or speak about outside spaces? If so, does the proposal for our picnic space match what Clara might consider best practices?

I have reviewed Clara's latest articles and the transcription of a recent podcast on the topic. It would appear Clara's proposal mostly follows their own best practices, with one caveat: Clara believes in planting with native species, but the proposal you are reviewing only mentions plants. You might want to follow up with Clara on this point.

As you can see, the AI recognized and understood a lot about Clara and reached well beyond the basic knowledge contained within your corporate databases. So, how was AI able to do that?

Introducing AI models and neural networks

Humans are hungry for data, so the more we store, the better. However, how do we make sense of it all, and how can you interact with this data in a way that makes sense for your needs? Enter large language models (or LLMs).

The purpose of an LLM is to collect information in a way that allows you to interact with that knowledge using a natural language prompt. So-called foundational models attempt to vacuum up as much data as possible and store it for recall. Foundational LLMs include OpenAI's GPT models (which power ChatGPT and Microsoft Copilot), Google's Gemini, and Meta's Llama.

Generally speaking, data is collected, and then that information can be categorized by a human or computer. In the following image, you can see it is possible to organize and relate information, making it look something like a much more extensive graph database:

However, the model has more tricks than simply storing information. Its design is based on how we believe the human brain functions. Consider how you teach a child about the world.

You open a picture book to teach the child the difference between a cat and a dog. You ask the child to point to the dog, but they point to a cat, so you say "no" and gently move their finger to point to the dog. Later, you are outside and ask the child, "What is that animal?" They say, "Dog," and you get excited and praise the child for giving the correct answer.

If you were to create an AI model that explicitly knows the difference between a cat and a dog, you would follow a similar process. This time, however, you would share hundreds or even thousands of pictures of cats and dogs. You would label them so the AI can determine how best to store the information. You would even hold back a sample data set to check that the AI was smart enough to decide on its own. Training the model with right and wrong answers is called reinforcement learning.

💡
Reinforcement learning is one method of training AI models (training purely on labeled examples is usually called supervised learning). I use this example because it is useful to know that early model training can benefit from this kind of feedback.

Reinforcement learning means giving the computer information about cats and dogs and then asking it to provide correct answers during data sampling. The way this works is you give the AI a picture or description of a dog as input, the neural network analyzes the data in hidden layers (a topic for another day), and it then outputs how correct it believes its answer to be, as a percentage. In other words, the AI won't necessarily output "this is a dog"; instead, it will output something like "I am 89.2353% sure this is a dog."
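To make that last point concrete, here is a tiny sketch of how a model turns raw scores into a confidence percentage. The scores are invented; in a real neural network, they would come out of the hidden layers after training on all those labeled pictures:

import math

def softmax(scores):
    # Convert raw model scores into probabilities that add up to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["cat", "dog"]
scores = [0.9, 3.0]  # invented raw outputs for a single input picture

probabilities = softmax(scores)
best = max(range(len(labels)), key=lambda i: probabilities[i])
print(f"I am {probabilities[best]:.2%} sure this is a {labels[best]}")
# Prints something like: I am 89.09% sure this is a dog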

The beauty of a foundational model is that it can contain world knowledge, so you can converse with it as if you are talking to a human being. However, that world knowledge does not necessarily extend to information about you or your company.

While you may hear a lot about AGI (Artificial General Intelligence) right now, there is no single one-size-fits-all model. Instead, we can chain models together and get better outcomes.

As you can see in the example below, ChatGPT is great at world knowledge. It can analyze and categorize purchases if you provide your credit card history. It can even be smart about the type of purchase simply from where you purchased the product (Whole Foods gets categorized as 'Food,' Apple purchases get classified as 'Electronics,' etc.).

You could connect the output of that standardized categorization of purchases and run it through another AI model trained specifically on fraud detection. Both models can work together to let you know if fraudulent activities appear. As you can see in the following image, ChatGPT does what it does best (categorization, understanding intent, and communications), and the fraud detection model does what it does best (providing a risk score based on historical and current purchase history).
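Here is a rough sketch of what that chaining could look like in code. Both functions are stand-ins I made up for illustration: in a real system, categorize_purchase would call an LLM such as ChatGPT through its API, and fraud_risk_score would call a model trained specifically on fraud detection:

def categorize_purchase(merchant):
    # Stand-in for an LLM call that infers a category from the merchant name.
    known = {"Whole Foods": "Food", "Apple": "Electronics"}
    return known.get(merchant, "Other")

def fraud_risk_score(purchase, history):
    # Stand-in for a specialized fraud-detection model that returns a risk percentage.
    typical = sum(p["amount"] for p in history) / len(history)
    # Toy heuristic: the further a purchase is from your typical spend, the riskier it looks.
    return min(99.0, abs(purchase["amount"] - typical) / typical * 100)

history = [
    {"merchant": "Whole Foods", "amount": 82.00},
    {"merchant": "Apple", "amount": 120.00},
]
purchase = {"merchant": "Apple", "amount": 1900.00}

# Chain the two models: categorize first, then score the categorized purchase.
purchase["category"] = categorize_purchase(purchase["merchant"])
risk = fraud_risk_score(purchase, history)
print(f"{purchase['merchant']} ({purchase['category']}): {risk:.0f}% fraud risk")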

Many companies and governments build AI models to perform these specific functions and then write apps against them or connect them to a foundational model for an even better experience.

If you want to see how companies build new models for a specific industry, I suggest starting with the following video, which shows how Google's Med-PaLM 2 LLM can answer expert-level medical questions.

AI models and facts

As you may recall from earlier in this article, databases contain facts. If a database cannot find the data you asked for, it does not try to give you something else; it either has the data or it doesn't. An AI model, on the other hand, tries its best to provide you with a result no matter what. That result comes with a probability, so you may or may not get the answer you are looking for. Going back to that fraud detection model: it did not say you have a fraudulent charge; it said there is a 68% chance it was fraud.

However, a large language model (LLM) usually does not give you the percentage, so you must fact-check its work. A basic example would be, "Just give me the name of the best movie ever made." You have locked the LLM into giving you the name of a movie, so you have to decide how to interpret that. Maybe the LLM is giving you this information because of its social popularity or box office results, or perhaps it responds with a movie because of its ground-breaking use of cinematography.

The good news here is that LLM AI models are conversational, so imagine this conversation:

Prompt: Just give me the name of the best movie ever made
AI: "The Godfather" (1972)

At this point, I could go online and research this answer, so let me continue the conversation:

Prompt: Reference data you are using so I can fact check this.
AI: The claim that "The Godfather" is the best movie ever made is supported by its frequent top rankings in critical lists from the American Film Institute and the British Film Institute's Sight & Sound poll. Renowned critics like Roger Ebert have praised its storytelling, direction, and performances. The film's cultural impact, including its memorable quotes and influence on the crime genre, is well-documented in film studies and media analyses, making it a widely recognized cinematic masterpiece.

Now, I can review a few sites for accuracy and determine whether the facts hold up for the purposes of my research. Granted, this was a simple (if not problematic) example, but very often, all you need to do is ask for references. You can then make the editorial determination of whether the reference material is trustworthy enough to judge the factual accuracy of the AI's response.
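If you want to script that kind of follow-up rather than type it into a chat window, the pattern is simply to send the earlier turns back to the model along with your new question so it keeps the context. Here is a sketch that assumes the OpenAI Python SDK, an API key in your environment, and an illustrative model name; other providers follow a similar pattern:

from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set
messages = [{"role": "user", "content": "Just give me the name of the best movie ever made"}]

first = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(first.choices[0].message.content)

# Keep the earlier turns so the model retains context, then ask it to cite its sources.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Reference data you are using so I can fact check this."})

second = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(second.choices[0].message.content)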

In summary

We started with databases, indexes, and graphs, where the data is often factually correct. Models attempt to be accurate, but they "think" about an answer and always provide a result. Sometimes that result is factual, and sometimes it is not. Sometimes you are to blame for the model's lack of factualness because you did not give a good enough prompt, and sometimes the model is responsible because it wants to please you with a response, accurate or not.

I hope this article helped you better understand AI models. If you want to discuss this topic further, schedule a one-on-one session on my calendar.


Dr. Deborah Chen provided editing services for this article.