
Build Enterprise AI: Function Calling, MCP, & Knowledge Graphs

Apr 23, 2026 by Bal Heroor

Enterprise AI systems today go far beyond chat interfaces. To build production-grade AI, you need three core capabilities working in concert: function calling (letting models invoke real tools and APIs), Model Context Protocol (a standardized interface for tool integration), and knowledge graphs (structured semantic memory that enhances retrieval).

This blog is part four of our Agentic AI Architectures series. In the previous blog, we explored how prompt caching can cut AI costs by up to 90% on AWS Bedrock.

This blog is a technical deep-dive into each layer, how they work individually, how they compose, and what it takes to run them securely at enterprise scale. The architectural patterns covered here are derived from real-world deployments, including the use of Amazon Bedrock Agents, LLM-assisted graph construction, and graph-enhanced RAG pipelines.

If you prefer a visual walkthrough instead of reading, we’ve also created a video for this blog—watch it here.

 

What Is Function Calling in Enterprise AI?

Function calling, also called tool use, is the mechanism by which a large language model (LLM) decides to invoke an external function rather than generate a free-text response. The model does not execute the function itself. Instead, it emits a structured JSON payload describing the function name and arguments, which your application layer intercepts, executes, and returns as context for the model to continue reasoning.

This is not prompt engineering. It is a first-class capability supported natively by models like Claude, GPT-4o, Gemini, and the Bedrock-hosted model families. The distinction matters at enterprise scale: function calling is deterministic in structure (the model must conform to a schema you define), auditable (you control what gets invoked and when), and composable (multiple tools can be chained across a single reasoning turn).

 

How the Function Calling Loop Works

The loop operates in four discrete steps:

  1. Schema registration: You declare available tools as JSON schemas in the system prompt or API request. Each tool has a name, description, and parameter definition.

  2. Model decision: Given a user query, the model determines whether to answer directly or invoke a tool. If it chooses a tool, it returns a tool_use block instead of a text response.

  3. Execution: Your application receives the structured payload, calls the actual function (database, API, service), and captures the result.

  4. Result injection: You return the function result to the model as a tool_result message. The model then generates its final response, now grounded in real data.

This loop can be single-turn (one tool call, one response) or multi-turn (the model chains multiple tool invocations before producing output). The latter is where enterprise complexity begins.
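In application code, the four steps above reduce to a dispatch loop. Here is a minimal sketch with a stubbed model; the `fake_model` function, the tool registry, and the `get_weather` tool are illustrative stand-ins for a real model SDK, not any particular API:

```python
import json

# Step 1: tool registry. In production these are real database/API calls;
# get_weather is a hypothetical example tool.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def fake_model(messages):
    """Stand-in for a real LLM API call. It requests a tool on the first
    turn, then answers once a tool_result is present in the conversation."""
    for m in messages:
        if m["role"] == "tool_result":
            return {"type": "text", "text": f"Result: {m['content']}"}
    return {"type": "tool_use", "name": "get_weather", "input": {"city": "Oslo"}}

def run_loop(user_query, max_turns=5):
    messages = [{"role": "user", "content": user_query}]
    for _ in range(max_turns):                       # step 2: model decision
        response = fake_model(messages)
        if response["type"] == "text":
            return response["text"]
        fn = TOOLS[response["name"]]                 # step 3: execution
        result = fn(**response["input"])
        messages.append({"role": "tool_result",      # step 4: result injection
                         "content": json.dumps(result)})
    raise RuntimeError("step limit reached")

print(run_loop("What's the weather in Oslo?"))
```

Note that the application, not the model, owns the loop, the registry, and the step limit; that ownership is what makes the pattern auditable.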

 

Function Calling Schema Example

Below is a canonical function definition for a database query tool:


{
  "name": "query_sales_database",
  "description": "Executes a read-only SQL query against the enterprise sales database. Use this to retrieve revenue, pipeline, or account data. Never use for write operations.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "A valid SQL SELECT statement. Must not contain INSERT, UPDATE, DELETE, DROP, or TRUNCATE."
      },
      "database": {
        "type": "string",
        "enum": ["sales_prod", "sales_staging"],
        "description": "Target database environment."
      },
      "max_rows": {
        "type": "integer",
        "description": "Maximum rows to return. Defaults to 100. Hard cap at 1000.",
        "default": 100
      }
    },
    "required": ["query", "database"]
  }
}

Notice the intentional design decisions here. The description explicitly prohibits write operations, the database field is constrained to an enum, and max_rows is capped at the schema level. These are not optional niceties — in production, they are your first line of defense.
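The same constraints should be re-enforced in the application layer before anything executes, since the model may not respect description text. A sketch of such a validator follows; the keyword list mirrors the schema above, but the exact checks are illustrative, not exhaustive:

```python
import re

FORBIDDEN = re.compile(r"\b(INSERT|UPDATE|DELETE|DROP|TRUNCATE)\b", re.IGNORECASE)
ALLOWED_DATABASES = {"sales_prod", "sales_staging"}

def validate_tool_input(tool_input):
    """Re-check the model's arguments against the schema's constraints.
    Never trust the model to have honored the description alone."""
    query = tool_input.get("query", "")
    if not query.lstrip().upper().startswith("SELECT"):
        raise ValueError("only SELECT statements are permitted")
    if FORBIDDEN.search(query):
        raise ValueError("write/DDL keyword detected")
    if tool_input.get("database") not in ALLOWED_DATABASES:
        raise ValueError("unknown database")
    # Clamp max_rows to the hard cap instead of trusting the default.
    tool_input["max_rows"] = min(int(tool_input.get("max_rows", 100)), 1000)
    return tool_input
```

A real deployment would add parameterization and parsing rather than regex alone, but the principle stands: the schema documents the contract, the application enforces it.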

 

The Four Enterprise Tool Patterns

Not all tools are created equal. In enterprise environments, the tools your AI system invokes vary dramatically in risk profile, latency characteristics, and failure behavior. Treating them as interchangeable is one of the most common architectural mistakes. A code-execution tool that times out is a very different operational problem from an SQL tool that returns stale data, which is different again from an API chain that partially succeeds before hitting a rate limit.

The four patterns below represent the categories that cover the overwhelming majority of enterprise tool use cases. Each has a distinct implementation profile, a distinct security surface, and distinct failure modes you need to design for explicitly. Before you register a single tool with your AI system, you should know which category it falls into, because that categorization determines your sandboxing strategy, your logging requirements, and your error-handling contract.

1. SQL and Query Tools

SQL tool use is among the most powerful and highest-risk enterprise patterns. When implemented correctly, it allows non-technical users to extract business intelligence through natural language. When implemented carelessly, it becomes a data exfiltration surface.

The model receives a user question — "What was the EMEA quarterly pipeline by segment for Q1 2025?" — and generates a SQL query from it. Your application validates the query, executes it against a read replica, and returns the result set.


{
  "type": "tool_use",
  "id": "tool_abc123",
  "name": "query_sales_database",
  "input": {
    "query": "SELECT segment, SUM(pipeline_value) AS total_pipeline FROM opportunities WHERE region = 'EMEA' AND quarter = 'Q1-2025' GROUP BY segment ORDER BY total_pipeline DESC",
    "database": "sales_prod",
    "max_rows": 50
  }
}

Your execution layer must enforce: parameterized queries (never raw string interpolation), read-only database credentials, row-level security filters that scope results to the authenticated user's access tier, and query timeout limits. The model should never have the ability to construct a query that bypasses your access controls — enforce this at the database driver level, not just in the prompt.
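Driver-level enforcement can be sketched with SQLite's authorizer hook. This is illustrative only; in production you would point read-only credentials at a read replica and layer on row-level security and query timeouts:

```python
import sqlite3

# Deny write/DDL actions at the driver, regardless of what SQL the model
# generated. Prompt instructions alone cannot guarantee this.
DENIED_ACTIONS = {
    sqlite3.SQLITE_INSERT, sqlite3.SQLITE_UPDATE, sqlite3.SQLITE_DELETE,
    sqlite3.SQLITE_CREATE_TABLE, sqlite3.SQLITE_DROP_TABLE,
    sqlite3.SQLITE_ALTER_TABLE,
}

def lock_read_only(conn):
    conn.set_authorizer(
        lambda action, *_: sqlite3.SQLITE_DENY if action in DENIED_ACTIONS
        else sqlite3.SQLITE_OK)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opportunities (region TEXT, pipeline_value REAL)")
conn.execute("INSERT INTO opportunities VALUES ('EMEA', 120000)")
lock_read_only(conn)

# The model's SELECT runs; any DELETE/DROP it emits now fails in the driver.
rows = conn.execute(
    "SELECT region, SUM(pipeline_value) FROM opportunities GROUP BY region"
).fetchall()
```

The design point is that the guarantee lives below the model: even a perfectly crafted injection that convinces the model to emit `DROP TABLE` dies at the connection.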

 

2. API Chaining

API chaining enables the model to orchestrate multi-step workflows across external services. A single user request might trigger a CRM lookup, followed by a Slack notification, followed by a calendar event creation. Each step depends on data from the previous one.

The key design challenge here is state management between tool calls. Each tool result must be injected back into the conversation before the model decides its next action. Your application layer owns this conversation state. The model itself is stateless.

Define clear tool boundaries: one tool per external system, scoped to the minimum permission set required. Avoid tools with overlapping functionality, as this increases the probability of the model making ambiguous routing decisions. When two tools could plausibly satisfy the same intent, your descriptions must be written with surgical precision to differentiate them.
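The state-management contract can be sketched as follows: the application threads each tool result into the state the next step reads from, which is the dependency the model would otherwise express across reasoning turns. The CRM and Slack tools here are hypothetical stubs:

```python
def chain(steps, tools):
    """Run dependent tool calls in order. Each step is (tool_name,
    build_args), where build_args reads earlier results out of `context`,
    the conversation state the application owns (the model is stateless)."""
    context = {}
    for name, build_args in steps:
        context[name] = tools[name](**build_args(context))
    return context

# Hypothetical tools for a CRM-lookup -> Slack-notify chain.
tools = {
    "crm_lookup": lambda account: {"owner": "dana@example.com", "account": account},
    "slack_notify": lambda channel, text: f"sent to {channel}: {text}",
}

result = chain(
    [
        ("crm_lookup", lambda ctx: {"account": "Acme Corp"}),
        # The second step consumes the first step's output.
        ("slack_notify", lambda ctx: {
            "channel": "#sales",
            "text": f"Owner is {ctx['crm_lookup']['owner']}"}),
    ],
    tools,
)
```

In a real system the model, not a fixed plan, chooses each next step, but the ownership boundary is identical: your layer holds the state and replays it into the context window.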

 

3. Code Execution

Code execution tools let the model write and run code (Python for data analysis, JavaScript for transformations, shell scripts for automation) and observe the output. This is what powers capabilities like data visualization, statistical analysis, and file processing.

At the enterprise level, code execution must be sandboxed. Containers with no network egress, no filesystem access beyond a designated scratch directory, and CPU/memory limits are non-negotiable. Tools like AWS Lambda with tight IAM policies, Firecracker microVMs, or Kubernetes jobs with network policies are appropriate primitives.

The model should not be able to install arbitrary packages at runtime. Pre-bake your execution environment with the exact package versions your use cases require, and treat that image as a versioned, audited artifact.
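As a minimal sketch of the outermost control only, model-generated code can run in a separate process with a hard timeout. This does not replace the container or microVM isolation described above; it sits inside it:

```python
import subprocess
import sys

def run_untrusted(code, timeout_s=5):
    """Run model-generated Python in a child process with a hard timeout.
    Only the outermost control: real deployments wrap this in a container
    or microVM with no network egress and strict CPU/memory limits."""
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode, no site dirs
            capture_output=True, text=True, timeout=timeout_s)
        return {"stdout": proc.stdout, "returncode": proc.returncode}
    except subprocess.TimeoutExpired:
        return {"stdout": "", "returncode": -1, "timed_out": True}

print(run_untrusted("print(sum(range(10)))")["stdout"])
```

A runaway loop is killed at the timeout instead of pinning a worker, and the scratch process inherits nothing from the host application's credentials if the surrounding sandbox is configured correctly.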

 

4. Internal Tools and Business Logic

Internal tools expose proprietary business logic as callable functions: ticket escalation systems, approval workflows, internal knowledge bases, and HR policy engines. These are tools you build, own, and maintain.

Design them with the same rigor you would apply to any internal microservice. Version your tool schemas. Treat breaking changes to a tool's parameter structure as API breaking changes. Implement idempotency keys for tools that trigger side effects. Log every invocation with the model's reasoning trace, the input payload, and the execution result.
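The idempotency-key requirement can be sketched in a few lines. This in-memory version is illustrative; production systems persist seen keys in a durable store with a TTL:

```python
import functools

def idempotent(fn):
    """Deduplicate side-effecting calls by an explicit idempotency key,
    so a retried model invocation does not escalate the same ticket twice."""
    seen = {}
    @functools.wraps(fn)
    def wrapper(idempotency_key, **kwargs):
        if idempotency_key not in seen:
            seen[idempotency_key] = fn(**kwargs)
        return seen[idempotency_key]
    return wrapper

calls = []

@idempotent
def escalate_ticket(ticket_id, tier):
    calls.append(ticket_id)            # stands in for the real side effect
    return {"ticket_id": ticket_id, "escalated_to": tier}

escalate_ticket("key-1", ticket_id="T-42", tier=2)
escalate_ticket("key-1", ticket_id="T-42", tier=2)   # replay: no second side effect
```

The key should be generated per logical action (for example, derived from the reasoning-trace step ID), so that a genuine second escalation still goes through under a new key.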

 

RAG vs. Tool Use: Choosing the Right Pattern

RAG and Tool Use patterns are often conflated. They solve different problems and belong to different layers of your architecture.

| Dimension | Retrieval-Augmented Generation (RAG) | Tool Use / Function Calling |
| --- | --- | --- |
| Primary purpose | Inject relevant documents into context | Execute actions and retrieve live data |
| Data freshness | Depends on index update frequency | Real-time (at invocation) |
| Latency | Low (vector search) | Variable (depends on downstream API/DB) |
| Structured output | No — retrieves unstructured text chunks | Yes — returns structured JSON or tabular data |
| Write operations | Not applicable | Supported (with appropriate safeguards) |
| Failure mode | Missing or stale context | API errors, timeouts, schema mismatches |
| Primary use case | Long-form document Q&A, policy lookup | Business process automation, live analytics |
| Cost driver | Embedding + vector DB | API calls, compute, database query cost |

In practice, production enterprise systems use both. RAG surfaces relevant documents as background context; tool use retrieves or modifies live operational data. The architecture is additive, not either/or.

 

What Is the Model Context Protocol (MCP)?

The Model Context Protocol (MCP) is an open standard, introduced by Anthropic in late 2024, that defines a common interface between LLM hosts and the tools, data sources, and services they connect to. Think of it as a USB-C standard for AI integrations: any MCP-compatible tool can plug into any MCP-compatible host without custom glue code.

Before MCP, every AI application required bespoke integration logic for each tool. You wrote a Salesforce connector for Claude, a different one for GPT-4, and maintained both. MCP decouples the tool definition from the model host, which means you write the integration once and it works across any compliant runtime.

 

MCP vs. Custom Integrations

| Dimension | MCP | Custom Integration |
| --- | --- | --- |
| Standardization | Defined schema; works across hosts | Ad hoc per model/platform |
| Portability | Tool works with any MCP-compatible LLM | Locked to specific model or SDK |
| Discovery | Server exposes tool manifest dynamically | Manually declared per deployment |
| Maintenance burden | Low — one implementation | High — multiplied per host |
| Ecosystem | Growing library of pre-built MCP servers | Built from scratch |
| Security model | Defined at the server level | Implemented per integration |
| Versioning | Protocol-level versioning support | Custom version management |

 

MCP Architecture

An MCP deployment has three components: the MCP Host (the LLM application), the MCP Client (the protocol handler embedded in the host), and the MCP Server (the service that exposes one or more tools). Communication between client and server uses JSON-RPC 2.0 over stdio (for local servers) or HTTP with Server-Sent Events (for remote servers).

 

When your MCP Host initializes, it queries each configured server for its tool manifest. The manifest describes available tools, their schemas, and their capabilities. From that point, the model can invoke any listed tool using the standard protocol. No custom routing logic is required in your application.
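On the wire, that discovery and invocation traffic is plain JSON-RPC 2.0. A sketch of the two request shapes, using the MCP method names `tools/list` and `tools/call`; the tool name and arguments here are hypothetical:

```python
import json

def make_rpc_request(method, params=None, req_id=1):
    """Build a JSON-RPC 2.0 request as exchanged between MCP client and server."""
    msg = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        msg["params"] = params
    return json.dumps(msg)

# Manifest discovery: the client lists the server's tools...
list_req = make_rpc_request("tools/list")

# ...then invokes one by name with schema-conformant arguments.
call_req = make_rpc_request(
    "tools/call",
    {"name": "query_sales_database",
     "arguments": {"query": "SELECT 1", "database": "sales_staging"}},
    req_id=2,
)
```

Because every server speaks this same shape, the host needs no per-tool routing code; it only needs to relay manifests to the model and payloads to the server.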

For enterprise deployments, you host your own MCP servers, one per domain (CRM, ERP, ITSM, data warehouse). Each server enforces its own authentication, authorization, and rate limiting. The MCP layer becomes your AI tool governance boundary.

 

Amazon Bedrock Agents: ReAct, Action Groups, and Guardrails

Amazon Bedrock Agents provides a managed implementation of multi-step AI orchestration on AWS. It is built around three concepts you need to understand deeply before deploying it in production.

 

The ReAct Loop

Bedrock Agents uses the ReAct (Reasoning + Acting) prompting framework to drive multi-step tool use. In each iteration of the loop, the model: Thinks (produces internal reasoning about what to do next), Acts (invokes a tool via an action group), and Observes (receives the tool result and updates its reasoning).

This loop continues until the model determines it has sufficient information to produce a final response, or until a configured step limit is reached. In production, you must set explicit step limits. Without them, a poorly defined task can generate runaway invocation chains that accumulate cost and latency.

The ReAct loop is not a black box. Bedrock exposes the intermediate reasoning traces, which you should capture and log. They are invaluable for debugging failure modes: you can inspect exactly why the model chose a particular tool, what it expected the result to be, and how it updated its plan after receiving the actual output.
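The step-limit and trace-capture guidance reduces to a small control loop. In this sketch the `planner` function is a stub standing in for the model; Bedrock Agents manages this loop for you, but the shape is the same:

```python
def react(planner, tools, query, max_steps=8):
    """Generic ReAct loop with an explicit step cap and a captured trace.
    `planner` stands in for the model: it returns either
    ("act", tool_name, args) to invoke a tool or ("final", answer)."""
    trace, observation = [], None
    for step in range(max_steps):
        decision = planner(query, observation)       # Think
        trace.append({"step": step, "decision": decision})
        if decision[0] == "final":
            return decision[1], trace
        _, name, args = decision
        observation = tools[name](**args)            # Act, then Observe
    raise RuntimeError(f"step limit {max_steps} reached; trace={trace}")

# Stub planner: look the figure up once, then answer.
def planner(query, observation):
    if observation is None:
        return ("act", "get_revenue", {"region": "EMEA"})
    return ("final", f"EMEA revenue is {observation}")

answer, trace = react(planner, {"get_revenue": lambda region: 1_200_000}, "revenue?")
```

The trace list is exactly what you should persist per invocation: it records which tool the model chose at each step and what it observed before its next decision.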

 

Action Groups

Action groups are the unit of tool registration in Bedrock Agents. Each action group is backed either by an AWS Lambda function or an OpenAPI schema pointing to an HTTP endpoint. You define the available operations, their parameters, and the conditions under which the agent can invoke them.

Design your action groups around business domains, not technical systems. An action group called customer_data that exposes get_customer_profile, get_order_history, and update_contact_info is easier to govern than three separate action groups organized by the underlying microservices they call. The domain-oriented boundary makes permission scoping more intuitive and audit trails more interpretable.

 

Guardrails

Bedrock Guardrails operate as a policy enforcement layer that wraps both the model's input (the user's query) and the model's output (its response). They are not optional in enterprise deployments — they are how you enforce content policies, data residency requirements, and regulatory constraints without modifying your model or application code.

Guardrails support: topic denial (blocking queries outside permitted domains), content filtering (detecting and blocking harmful content categories), PII detection and redaction, grounding checks (validating that the model's response is supported by the retrieved context), and word/phrase blocklists. You configure these policies as a named Guardrail resource and attach it to your agent. Every invocation passes through it — input and output — before the model or the caller sees the content.

 

What Are Knowledge Graphs in Enterprise AI?

A knowledge graph is a structured representation of entities (people, products, contracts, locations) and the typed relationships between them. Unlike a vector store, which retrieves semantically similar text chunks, a knowledge graph lets you traverse explicit connections: "Which contracts does Customer A have? Which products are covered by those contracts? Which support tickets reference those products?"

 

This traversal capability is what makes knowledge graphs indispensable for enterprise AI. Vector search tells you what is similar. A knowledge graph tells you what is connected.

 

Graph-Enhanced RAG

Standard RAG retrieves the top-k most semantically similar chunks to a query. Graph-enhanced RAG adds a traversal step: after identifying the most relevant entity from vector search, the system walks the graph to pull in connected context that a pure embedding search would miss.

The retrieval pipeline works as follows:

  1. Embed the user query and perform vector similarity search against entity embeddings.

  2. Identify the anchor entity or entities most relevant to the query.

  3. Traverse the graph from those anchors — one or two hops — collecting connected entities and relationship labels.

  4. Serialize the subgraph as structured context (entity-relationship triples or a natural-language summary).

  5. Inject both the vector-retrieved chunks and the graph-derived context into the model's prompt.

The result is a model that can answer questions like "What is the risk exposure of our contracts with vendors in the APAC region who use Component X?" — a query that requires traversing vendor → contract → component relationships rather than finding a semantically similar document.
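The five retrieval steps above can be sketched over a toy in-memory graph. The entities and relations are illustrative, and a real deployment would run the traversal against Neptune or Neo4j rather than Python lists:

```python
# Toy knowledge graph as (source, relation, target) triples.
TRIPLES = [
    ("VendorX", "SUPPLIES", "ComponentX"),
    ("VendorX", "LOCATED_IN", "APAC"),
    ("ContractA", "COVERS", "ComponentX"),
    ("ContractA", "SIGNED_WITH", "VendorX"),
]

def neighbors(entity):
    """Yield every triple that touches the entity, in either direction."""
    for s, r, t in TRIPLES:
        if entity in (s, t):
            yield (s, r, t)

def subgraph(anchors, hops=2):
    """Step 3: collect triples within `hops` of the anchor entities."""
    frontier, seen = set(anchors), set()
    for _ in range(hops):
        next_frontier = set()
        for e in frontier:
            for s, r, t in neighbors(e):
                if (s, r, t) not in seen:
                    seen.add((s, r, t))
                    next_frontier.update({s, t})
        frontier = next_frontier
    return seen

def serialize(triples):
    """Step 4: flatten the subgraph into prompt-ready relationship lines."""
    return "\n".join(f"{s} {r} {t}" for s, r, t in sorted(triples))

context = serialize(subgraph({"ComponentX"}, hops=2))
```

Starting from `ComponentX`, two hops pull in the vendor, its region, and the covering contract: exactly the connected context an embedding search over documents would miss.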

 

LLM-Assisted Graph Building

Building a production knowledge graph at enterprise scale requires extracting entities and relationships from unstructured sources: contracts, emails, support tickets, and product documentation. LLMs are highly effective at this extraction task when prompted with precision.

A standard extraction prompt defines the entity types (Organization, Person, Product, Contract, Location), the relationship types (SUPPLIES, EMPLOYS, LICENSES, COVERS, LOCATED_IN), and returns structured triples:


{
  "entities": [
    { "id": "e1", "type": "Organization", "name": "Acme Corp" },
    { "id": "e2", "type": "Product", "name": "DataBridge v3" }
  ],
  "relationships": [
    {
      "source": "e1",
      "target": "e2",
      "type": "LICENSES",
      "properties": {
        "contract_id": "CTR-2024-0892",
        "expiry_date": "2026-03-31",
        "tier": "Enterprise"
      }
    }
  ]
}

 

You run this extraction pipeline over your document corpus, deduplicate and merge entities (entity resolution is its own discipline — use deterministic matching before probabilistic), and load the resulting triples into a graph database such as Amazon Neptune, Neo4j, or a property graph layer on top of your existing data lake.
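The deterministic first pass of entity resolution can be sketched as exact matching on a normalized key; the suffix list and normalization rules here are illustrative, and fuzzy or probabilistic matching should only run on what survives this pass:

```python
import re

def normalize(name):
    """Deterministic key: lowercase, strip punctuation and common suffixes."""
    key = re.sub(r"[^\w\s]", "", name.lower())
    key = re.sub(r"\b(inc|corp|corporation|ltd|llc)\b", "", key)
    return " ".join(key.split())

def merge_entities(entities):
    """Merge extracted entities that share (type, normalized name),
    keeping every surface form as an alias for later review."""
    merged = {}
    for e in entities:
        key = (e["type"], normalize(e["name"]))
        merged.setdefault(key, {**e, "aliases": set()})["aliases"].add(e["name"])
    return list(merged.values())

extracted = [
    {"id": "e1", "type": "Organization", "name": "Acme Corp"},
    {"id": "e7", "type": "Organization", "name": "ACME Corporation"},
    {"id": "e9", "type": "Product", "name": "DataBridge v3"},
]
resolved = merge_entities(extracted)   # the two Acme records collapse into one
```

Keying on entity type as well as name prevents a person and a company that share a name from being merged, which is a common silent corruption in naive pipelines.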

Treat the graph as a living schema. As your business evolves, new relationship types will emerge. Version your entity and relationship schemas the same way you version a relational schema — with migrations, not manual edits.

 

Security Architecture for Enterprise AI Systems

Security is not a feature you add to an AI system. It is a structural property you design in from the beginning. Four control planes matter most.

 

 

Input Validation

Every user query entering your AI system should be validated before the model sees it. This means: length limits (protect against prompt stuffing), character set enforcement (reject unexpected encoding), PII detection (flag or redact sensitive data before it enters the model context), and injection pattern detection (prompt injection attempts often follow recognizable patterns).

Do not rely on the model to detect and reject malicious inputs. The model is not your security layer. Your application is.
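A sketch of that application-side gate follows. The length cap and injection patterns are illustrative; a deny-list is a starting point, not a complete defense, and it should sit alongside PII detection and guardrail policies:

```python
import re

MAX_LEN = 4000
# Common prompt-injection markers. Attackers vary phrasing, so treat a
# match as a signal to flag and log, not the only line of defense.
INJECTION_PATTERNS = re.compile(
    r"(ignore (all|previous) instructions|reveal your system prompt|you are now)",
    re.IGNORECASE)

def validate_input(query):
    """Run before the model ever sees the query."""
    if len(query) > MAX_LEN:
        raise ValueError("query too long")
    if not query.isprintable():
        raise ValueError("unexpected control characters")
    if INJECTION_PATTERNS.search(query):
        raise ValueError("possible prompt injection")
    return query
```

Rejections should be logged with the raw (PII-redacted) input, because repeated near-miss patterns from one account are themselves an adversarial-usage signal.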

 

Row-Level Access Control

When your AI system queries databases or knowledge stores, the results it retrieves must be scoped to what the authenticated user is permitted to see. This is non-trivial when the model is generating queries dynamically.

The correct pattern is to inject user identity and access tier as a non-overridable query filter at the database driver level — not as an instruction in the prompt. A prompt instruction like "only return data the user has access to" can be overridden by a sufficiently adversarial user query. A parameterized WHERE clause at the driver level cannot.

 

This pattern ensures that regardless of what SQL the model generates, the result set is always constrained to the user's permitted scope.
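As a sketch of the wrapping technique, the model's SELECT can be nested inside a parameterized outer filter bound to the authenticated user's scope; the table and column names are illustrative, and production systems would typically use the database's native row-level security instead:

```python
import sqlite3

def run_scoped(conn, model_sql, user_region):
    """Wrap whatever SELECT the model produced in a parameterized region
    filter. The filter is applied by the driver, not by prompt text, so
    an adversarial query cannot talk its way past it."""
    scoped = f"SELECT * FROM ({model_sql}) AS q WHERE q.region = ?"
    return conn.execute(scoped, (user_region,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opportunities (region TEXT, value REAL)")
conn.executemany("INSERT INTO opportunities VALUES (?, ?)",
                 [("EMEA", 100.0), ("AMER", 250.0)])

# The model's SQL selects every region; the user still only sees EMEA.
rows = run_scoped(conn, "SELECT region, value FROM opportunities", "EMEA")
```

The user's scope arrives as a bound parameter from the authenticated session, never from model output, which is the property that makes the control non-overridable.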

 

Output Filtering

The model's output must be inspected before it is returned to the user. Output filtering catches: PII that slipped through input validation and was echoed back, hallucinated data presented as fact (grounding checks compare output claims against retrieved context), and policy violations (content the model should not have produced given your domain restrictions).

Grounding checks are particularly important for enterprise deployments where factual accuracy is a contractual or regulatory requirement. If the model asserts a number or a policy statement, your output layer should verify that assertion against the retrieved context before surfacing it to the user.
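A narrow illustration of that idea: checking that every numeric claim in the response also appears in the retrieved context. Production grounding checks go further, verifying entities and quoted policy statements as well:

```python
import re

NUM = re.compile(r"\d[\d,]*(?:\.\d+)?")

def grounding_check(response, context):
    """Return numeric claims in the response that never appear in the
    retrieved context. An empty list means the numbers are grounded."""
    claimed = set(NUM.findall(response))
    supported = set(NUM.findall(context))
    return sorted(claimed - supported)

retrieved = "Q1-2025 EMEA pipeline totaled 4,200,000 across 37 accounts."
grounded = grounding_check("Pipeline was 4,200,000 over 37 accounts.", retrieved)
ungrounded = grounding_check("Pipeline was 5,000,000.", retrieved)
```

Any non-empty result should block the response or route it for regeneration, with the failing claim logged for the audit trail.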

 

Observability

You cannot govern what you cannot observe. Every AI interaction in production should emit structured logs containing: the user query (sanitized of PII), the tools invoked and their input payloads, the tool results, the model's reasoning trace (if available), the final response, and the latency breakdown per step.
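One structured record per interaction is enough; the field names below are illustrative, and in production the JSON line goes to a log pipeline rather than being returned:

```python
import json
import time
import uuid

def log_interaction(user_query, tool_calls, response, latencies_ms):
    """Serialize one AI interaction as a structured log record."""
    record = {
        "interaction_id": str(uuid.uuid4()),
        "ts": time.time(),
        "user_query": user_query,      # assumed already PII-sanitized upstream
        "tool_calls": tool_calls,      # name, input payload, result per call
        "response": response,
        "latency_ms": latencies_ms,    # per-step breakdown
    }
    return json.dumps(record)

line = log_interaction(
    "Q1 pipeline?",
    [{"name": "query_sales_database",
      "input": {"query": "SELECT 1"}, "result_rows": 12}],
    "Pipeline was 4.2M.",
    {"retrieval": 82, "tool": 140, "generation": 610},
)
```

Keeping every field in one record, rather than scattered across service logs, is what lets compliance teams reconstruct a full interaction from a single query.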

These logs serve three purposes: debugging (why did the model give the wrong answer?), auditing (did the system comply with policy during this interaction?), and continuous improvement (which tool invocations fail most often, and why?). Store them in a system your security and compliance teams can query independently of the AI application.

Build dashboards that track tool invocation error rates, RAG retrieval quality (measured by grounding check pass rate), average reasoning chain length, and p95 end-to-end latency. Anomalies in any of these are early indicators of either model degradation or adversarial usage patterns.

 

Putting It Together: A Reference Architecture

A production enterprise AI system integrating all three layers — function calling, MCP, and knowledge graphs — looks like this in practice:

The user query is entered through an API Gateway with authentication and rate limiting. It passes through an input validation layer (PII, injection, length). The validated query reaches the AI orchestration layer — Bedrock Agents or a custom ReAct loop — which has access to registered tools via MCP servers.

When the model needs to retrieve information, it may invoke a vector search tool (for unstructured document retrieval), a graph traversal tool (for structured relationship queries), or a SQL tool (for live operational data). All tool invocations are logged with full payloads. Results are injected back into the model context. The model produces a candidate response, which passes through output filtering and grounding checks before being returned to the user.

This architecture is not hypothetical. The patterns described here reflect the depth of systems built and deployed by teams operating at the intersection of enterprise data complexity and LLM capability. Getting the layers right individually is necessary. However, production reliability comes from how they compose under load, failure, and adversarial conditions.

 

Conclusion

Function calling, MCP, and knowledge graphs are not independent capabilities you can adopt incrementally without a plan. A weak tool schema undermines your knowledge graph retrieval. A missing row-level security control corrupts your function-calling safety model. An unobserved output filtering layer leaves your compliance posture unverifiable.

The teams that succeed with enterprise AI at scale treat these systems with the same engineering rigor applied to any distributed system: defined interfaces, versioned schemas, layered security controls, and instrumentation at every boundary. The AI layer is powerful. The infrastructure surrounding it is what makes it trustworthy.

Mactores specializes in enterprise AI architecture, from knowledge graph design and MCP server implementation to production-grade Bedrock Agent deployments. The patterns in this post reflect real deployment experience across regulated industries and large-scale data environments.
