Kevin Sylvestre

Exploring Common AI Patterns with Ruby

With the advent of LLMs, it is increasingly important to understand common patterns for integrating them into applications. This article explores three integration patterns using OmniAI, a Ruby gem that supports OpenAI, Anthropic, Google, Mistral, and others:

  1. Parsing PDF Receipts into CSV
  2. Indexing and Searching Product Manuals
  3. Building an AI Web Browsing Agent

Example 1: Parsing PDF Receipts into CSV

Building data scrapers is a common task for engineers. Often the source is plain text (e.g. HTML), but sometimes it’s a semi-structured format like a PDF or scanned document. Fortunately, this is an area where LLMs are especially helpful. This example demonstrates a script that loops through a directory of PDF receipts and generates a CSV with the following structure:

| PATH       | MERCHANT | CATEGORY | DATE       | DESCRIPTION | TAX | SUBTOTAL | TOTAL |
| ---------- | -------- | -------- | ---------- | ----------- | --- | -------- | ----- |
| ./acme.pdf | ACME Inc | supplies | 2025-12-31 | Stationery  | 2.0 | 7.0      | 9.0   |

The example uses the vision capabilities built into most LLMs, pairing them with structured output to parse each receipt in a directory. The code uses Google, since it natively supports PDFs and is relatively inexpensive. Not every LLM supports PDFs, so as a fallback a PDF may first be converted to images using a tool like MuPDF prior to processing (see Using OmniAI to Convert PDFs to Markdown with LLMs); a sketch of that fallback follows below.
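A minimal sketch of that image fallback, assuming MuPDF’s mutool CLI is installed and the provider accepts PNG attachments (the receipt path and page file names here are illustrative):

require 'omniai/openai'

client = OmniAI::OpenAI::Client.new

path = "./acme.pdf"

# Render each page of the PDF to a PNG (page-1.png, page-2.png, ...) using MuPDF.
system("mutool", "draw", "-o", "page-%d.png", path)

files = Dir.glob("./page-*.png").sort.map { |image| File.open(image, "rb") }

response = client.chat do |prompt|
  prompt.system("You are an expert at processing receipts.")
  prompt.user do |message|
    message.text("Process the attached receipt images for the requested data.")
    files.each { |file| message.file(file, "image/png") }
  end
end

files.each(&:close)

puts response.text

The main example, using Google’s native PDF support, is as follows: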

require 'csv'
require 'omniai/google'

client = OmniAI::Google::Client.new

format = OmniAI::Schema.format(name: "Receipt", schema: OmniAI::Schema.object(
  description: "A receipt for a purchase.",
  properties: {
    merchant: OmniAI::Schema.string(description: "The merchant (e.g. 'ACME Inc')"),
    category: OmniAI::Schema.string(enum: %w[advertising rent utilities supplies travel]),
    date: OmniAI::Schema.string(description: "The date of the receipt as 'YYYY-MM-DD'"),
    description: OmniAI::Schema.string(description: "A description of the receipt."),
    tax: OmniAI::Schema.number(description: "The sum of all taxes (PST, GST, etc)."),
    subtotal: OmniAI::Schema.number(description: "The total without taxes for the receipt."),
    total: OmniAI::Schema.number(description: "The total with taxes for the receipt.")
  },
  required: %i[tax subtotal total]
))

result = CSV.generate do |csv|
  csv << %w[PATH MERCHANT CATEGORY DATE DESCRIPTION TAX SUBTOTAL TOTAL] # header row
  Dir.glob("./**/*.pdf") do |path|
    File.open(path, "rb") do |file|
      response = client.chat(format:) do |prompt|
        prompt.system("You are an expert at processing PDF receipts.")
        prompt.user do |message|
          message.text("Process the attached PDF receipt for the requested data.")
          message.file(file, "application/pdf")
        end
      end

      data = format.parse(response.text)

      csv << [
        path,
        data[:merchant],
        data[:category],
        data[:date],
        data[:description],
        data[:tax],
        data[:subtotal],
        data[:total]
      ]
    end
  end
end

puts result

PATH,MERCHANT,CATEGORY,DATE,DESCRIPTION,TAX,SUBTOTAL,TOTAL
./acme.pdf,ACME Inc,supplies,2025-12-31,Stationery,2.0,7.0,9.0

The above example demonstrates a basic AI integration pattern: sending input and parsing output. The input is composed of a system message and a user message with multiple parts (some text and a file). The output is structured and matches a specific schema.

Example 2: Indexing and Searching Product Manuals

Retrieval-Augmented Generation (RAG) is a method for narrowing down large datasets so a language model can respond more effectively to a prompt. A common example is the “AI overview” sections now appearing on many websites. This example will define a function ai_overview that takes text and returns a domain-specific summary. Here, the domain is PDF manuals for various products (e.g., toasters, blenders, etc.). The goal is to implement a function like the following:

ai_overview("How often do I need to clean my Bambino Plus?")

To get started, the manuals (in this case PDFs) must be converted into a machine-readable text format that can be passed back and forth to an LLM. One option is to reuse the chat and vision capabilities from the previous example with a prompt that asks for a Markdown conversion.
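A minimal sketch of that chat-based conversion, assuming a provider with native PDF support (the file paths here are illustrative):

require 'omniai/google'

client = OmniAI::Google::Client.new

File.open("./manual.pdf", "rb") do |file|
  response = client.chat do |prompt|
    prompt.system("You are an expert at converting documents to Markdown.")
    prompt.user do |message|
      message.text("Convert the attached PDF to Markdown. Reply with only the Markdown.")
      message.file(file, "application/pdf")
    end
  end

  File.write("./manual.md", response.text)
end

This example instead takes a shortcut via a dedicated OCR API offered by Mistral: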

require 'fileutils'
require 'omniai/mistral'

client = OmniAI::Mistral::Client.new

DOCUMENTS = [
  {
    name: "the-bambino-plus-instruction-book",
    url: "https://assets.breville.com/Instruction-Booklets/ANZ/BES500BSS_ANZ_IB_I21_FA_WEB.pdf",
  },
  {
    name: "the-smart-toast-instruction-book",
    url: "https://assets.breville.com/Instruction-Booklets/ANZ/BTA825_IB_D18_WEB.pdf",
  },
  {
    name: "the-fresh-and-furious-instruction-book",
    url: "https://assets.breville.com/BBL620/BBL620W_ANZ_IB_F22_FA_LR.pdf",
  },
]

DOCUMENTS.each do |document|
  FileUtils.mkdir_p("./manuals/#{document[:name]}")

  response = client.ocr(document[:url])
  response.pages.each do |page|
    number = page.index.next # page indexes are zero-based

    File.write("./manuals/#{document[:name]}/#{number}.md", <<~TEXT)
      ---
      name: "#{document[:name]}"
      page: "#{number}"
      ---

      #{page.markdown}
    TEXT
  end
end

This script generates a folder for each manual, splitting each page into a separate Markdown file. Each file includes front matter with metadata (e.g., document name and page number). Next, these pages need to be converted to embeddings. Embeddings are vector representations of text; texts with similar meanings map to nearby vectors, which makes them useful for semantic search. The OmniAI#embed method with the OpenAI provider is used to generate each embedding and save it to a file:

require 'omniai/openai'

client = OmniAI::OpenAI::Client.new

Dir.glob("./manuals/**/*.md") do |path|
  next if File.exist?("#{path}.embedding")

  File.open(path, "rb") do |file|
    response = client.embed(file.read)
    File.write("#{path}.embedding", response.embedding.join("\n"))
  end
end

Inspecting the generated embeddings confirms that each Markdown file has its own vector. Since any text can be converted into an embedding, this leads to the final step: the user prompt is also turned into an embedding. This embedding is compared against the precomputed embeddings generated earlier. The closest matching manual pages are selected and sent to the LLM to generate a summary in response to the original prompt:

require 'omniai/openai'

ENTRIES = []

Dir.glob("./manuals/**/*.md") do |path|
  ENTRIES << {
    path: path,
    embedding: File.read("#{path}.embedding").split("\n").map { |entry| Float(entry) },
  }
end

# OpenAI embeddings are normalized to unit length, so ranking by Euclidean
# distance gives the same ordering as cosine similarity.
#
# @param src [Array<Float>]
# @param dst [Array<Float>]
#
# @return [Float]
def euclidean_distance(src, dst)
  Math.sqrt(src.zip(dst).map { |a, b| (a - b)**2 }.reduce(:+))
end

def search(text, limit: 5)
  client = OmniAI::OpenAI::Client.new
  response = client.embed(text)
  embedding = response.embedding

  ENTRIES
    .sort_by { |entry| euclidean_distance(entry[:embedding], embedding) }
    .first(limit)
    .map { |entry| File.read(entry[:path]) }
end

# @param text [String]
def ai_overview(text)
  client = OmniAI::OpenAI::Client.new
  client.chat(stream: $stdout) do |prompt|
    prompt.system <<~TEXT
      You are an expert at formatting information found in product manuals:

      1. Use the provided <pages>...</pages> to answer the <question>...</question>.
      2. Do not use any other information in answering the question.
      3. Be as concise and accurate as possible when answering the question.
    TEXT

    prompt.user <<~TEXT
      <question>
      #{text}
      </question>

      <pages>
        #{search(text).map { |page| "<page>#{page}</page>" }.join("\n")}
      </pages>
    TEXT
  end
end

ai_overview("How often do I need to clean my Bambino Plus?")
You need to perform a cleaning cycle on your Bambino Plus every 200 extractions (uses), as indicated by the 1 CUP and 2 CUP buttons alternately flashing. Additionally, you should clean certain parts after each use:

- The steam wand should always be cleaned after each milk texturing.
- The filter baskets and portafilter should be rinsed under hot water directly after use.
- The drip tray should be emptied and cleaned after each use or when the drip tray indicator rises.
- The group head interior and shower screen should be wiped with a damp cloth and periodically rinsed with hot water.

Descaling is required when the machine indicates it, which will be when the 1 CUP and STEAM button and the 2 CUP button flash alternately for 15 seconds.

You can also manually enter the cleaning cycle before the alert is triggered if desired.

The above example offers a more complex integration: OCR tooling converts PDFs to Markdown, and embeddings are generated for each page. A realistic deployment might use a vector database such as pgvector to store the embeddings and associate them with the Markdown.
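A minimal sketch of that persistence using the pg gem and pgvector (the database name, table layout, and the 1536-dimension vector size of OpenAI’s default embedding model are assumptions):

require 'pg'

db = PG.connect(dbname: "manuals")

db.exec("CREATE EXTENSION IF NOT EXISTS vector")
db.exec(<<~SQL)
  CREATE TABLE IF NOT EXISTS pages (
    id SERIAL PRIMARY KEY,
    path TEXT,
    content TEXT,
    embedding vector(1536)
  )
SQL

# Index a page by storing its Markdown alongside its embedding.
path = "./manuals/the-bambino-plus-instruction-book/1.md"
content = File.read(path)
embedding = File.read("#{path}.embedding").split("\n").map { |value| Float(value) }

db.exec_params(
  "INSERT INTO pages (path, content, embedding) VALUES ($1, $2, $3)",
  [path, content, "[#{embedding.join(',')}]"]
)

# Search using the L2 distance operator (<->), mirroring euclidean_distance above.
results = db.exec_params(
  "SELECT content FROM pages ORDER BY embedding <-> $1 LIMIT 5",
  ["[#{embedding.join(',')}]"]
)

results.each { |row| puts row["content"] }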

Example 3: Building an AI Web Browsing Agent

The final example gives an LLM a tool and asks it to perform a set of complex tasks. In this case, the tool is a browser, and the tasks are provided by the user through a simple chat interface over the CLI. To begin, a browser needs to be configured using Watir (a Selenium wrapper):

require 'watir'

browser = ::Watir::Browser.new
browser.goto('https://news.ycombinator.com')
browser.element(css: '.submission .title a').click
puts browser.html

Running the above code opens Chrome, visits Hacker News, clicks a link, and prints the HTML.

This browser snippet is easily wrapped in a tool. A tool provides a structured schema for interacting with Watir. In this case, it offers three parameters that may be passed with any invocation:

  1. action: an enum of either goto, click, or html.
  2. url: a URL to visit in the case of a goto action.
  3. selector: a CSS selector to find in the case of a click action.

Tools may also return text to an LLM. For example:

  • The html action is expected to return the HTML of the page.
  • The click and goto actions might return a status indicating whether or not they worked.

require 'logger'
require 'omniai'
require 'watir'

class BrowserTool < OmniAI::Tool
  module Action
    HTML = "html"
    GOTO = "goto"
    CLICK = "click"
  end

  ACTIONS = [
    Action::HTML,
    Action::GOTO,
    Action::CLICK,
  ]

  description <<~TEXT
    A chrome browser that can be used to goto sites, click elements, and capture HTML.
  TEXT

  parameter :action, :string, enum: ACTIONS, description: <<~TEXT
    An action to be performed:
    * `#{Action::GOTO}`: manually navigate to a specific URL
    * `#{Action::HTML}`: retrieve the full HTML of the page
    * `#{Action::CLICK}`: click an element using a selector (e.g. '.btn', '#submit', etc)
  TEXT

  parameter :url, :string, description: <<~TEXT
    e.g. 'https://example.com/some/page'

    Required for the following actions:
    * `#{Action::GOTO}`
  TEXT

  parameter :selector, :string, description: <<~TEXT
    e.g. 'button#submit', '.link', '#main > a', etc.

    Required for the following actions:
    * `#{Action::CLICK}`
  TEXT

  required %i[action]

  # @param logger [Logger]
  def initialize(logger: Logger.new($stdout))
    super()
    @browser = ::Watir::Browser.new
    @logger = logger
  end

  # @param action [String]
  # @param selector [String] optional
  # @param url [String] optional
  def execute(action:, url: nil, selector: nil)
    case action
    when Action::GOTO then goto(url:)
    when Action::HTML then html
    when Action::CLICK then click(selector:)
    end
  rescue StandardError => error
    { status: :error, message: error.message }
  end

  private

  # @param url [String]
  def goto(url:)
    @logger.info("goto url=#{url.inspect}")

    raise ArgumentError, "goto requires url" unless url

    @browser.goto(url)

    return { status: :ok }
  end

  # @param selector [String]
  def click(selector:)
    @logger.info("click selector=#{selector.inspect}")

    raise ArgumentError, "click requires selector" unless selector

    @browser.element(css: selector).click

    return { status: :ok }
  end

  # @return [String]
  def html
    @logger.info("html")

    @browser.html
  end
end

Using the browser tool, a basic CLI may be configured that loops, asking the user for requests and passing them to the LLM. A history of prior user and assistant messages is maintained as a thread, giving the LLM extra context:

require "omniai/anthropic"

client = OmniAI::Anthropic::Client.new
logger = Logger.new($stdout)
logger.formatter = proc { |_, _, _, message| "[browser] #{message}\n" }

browser = BrowserTool.new(logger:)

puts "Type 'exit' or 'quit' to leave."

prompt = OmniAI::Chat::Prompt.build do |builder|
  builder.system <<~TEXT
    You are tasked with assisting a user in browsing the web.
  TEXT
end

loop do
  print "> "
  text = gets&.strip
  break if text.nil? || %w[exit quit].include?(text)

  prompt.user(text)
  response = client.chat(prompt, stream: $stdout, tools: [browser])
  prompt.assistant(response.text)
end

Testing the CLI tool produces the following result:

Type 'exit' or 'quit' to leave.

> Visit hacker news and tell me the top 3 articles today.

I'll help you visit Hacker News and find the top 3 articles. Let me navigate to the site and retrieve the content.

[browser] goto url="https://news.ycombinator.com"

Now let me get the HTML content to see the top articles:

[browser] html

## Top 3 Articles on Hacker News Today:

### 1. **Meta: Shut Down Your Invasive AI Discover Feed. Now**
- **Source:** Mozilla Foundation (mozillafoundation.org)
- **Points:** 90 points
- **Comments:** 50 comments

### 2. **Decreasing Gitlab repo backup times from 48 hours to 41 minutes**
- **Source:** about.gitlab.com
- **Points:** 53 points
- **Comments:** 10 comments

### 3. **Why Bell Labs Worked**
- **Source:** fabiomanganiello.com
- **Points:** 32 points
- **Comments:** 12 comments

> Click on the 2nd article and give me a 1-2 paragraph summary.

[browser] click selector=".title a"
[browser] html

...

The above example introduces tools. They are a very useful option for providing user-specific data to an LLM and allow an LLM to accomplish more complex workflows. It also demonstrates tracking a thread of user and assistant messages. An actual deployment might need to handle permissions and eventually truncate messages.
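A minimal sketch of that truncation, assuming the history is kept in a plain array and the prompt is rebuilt each turn (MAX_MESSAGES is an arbitrary cap, and the builder is assumed to support assistant messages just as the prompt object does):

MAX_MESSAGES = 20 # arbitrary cap on remembered user / assistant messages

# @param history [Array<Hash>] entries like { role: :user, text: "..." }
def build_prompt(history)
  OmniAI::Chat::Prompt.build do |builder|
    builder.system("You are tasked with assisting a user in browsing the web.")

    # Keep only the most recent messages to bound the context size.
    history.last(MAX_MESSAGES).each do |message|
      case message[:role]
      when :user then builder.user(message[:text])
      when :assistant then builder.assistant(message[:text])
      end
    end
  end
end

Each loop iteration would then append the user text and the response text to the history before rebuilding the prompt.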