Document inlining: Crossing the modality gap with Compound AI
By Fireworks AI Team|12/23/2024
DeepSeek V3, a state-of-the-art open model, is now available. Try it now!
By Fireworks AI Team|12/23/2024
Most of the world’s data, such as medical records, podcasts and financial statements, live as images, PDFs, audio files, or dedicated knowledge stores, formats that LLMs do not process well or accept. Accessing and processing this data is critical for AI applications to solve real-world use cases. VLMs and multi-modal models claim they have filled this gap, but in reality, these are incomplete solutions that can only handle limited types of inputs and lack reasoning capabilities on these modalities, leading to lower-quality results and higher costs. For example, due to a lack of vision training data, VLMs experience “modality gaps” where identical tasks can have significantly better results when inputs are processed via text instead of image.
Bridging this quality gap requires manually setting up complex workflows and pipelines to convert multi-media sources into a format LLMs can understand. Users need to parse data, format it into plain text and potentially chunk/embed it.
Our vision at Fireworks is to deliver the highest quality across all modalities while abstracting away this complexity through compound AI: by building an automated pipeline that transforms any digital asset format to be LLM compatible for processing and logical reasoning. This approach enables you to achieve higher quality results across any input type with the ease of use of a LLM.
Today, we are excited to launch a public preview of our first use case, Document Inlining, a compound system that automatically turns any LLM into a vision model to ingest images or PDFs for document-based vision tasks. Document Inlining parses images and pipes them directly into an LLM of your choice to deliver:
Text-only LLMs have limited utility in handling documents. Recently, vision models have become popular but face 2 major challenges in processing content.
To improve quality and usability of document use cases, Fireworks’ is introducing Document Inlining - a compound system that composes prompt transformation techniques to enable LLMs to handle PDFs and multiple images of documents.
Document Inlining transcribes images and PDFs into structured text to be ingested by LLMs, using a two-step approach:
Key Challenges: Under the hood, our parsing pipeline solves several key challenges
Advantages: This approach provides benefits including:
We can see the benefits of Document Inlining end-to-end in the following example. Without Document Inlining, we prompted Qwen 2VL vision model with “How many letter Ts are there in the table in total?” and received an obviously incorrect answer. However, with Document Inlining, we can use the smarter “Qwen 2.5 72B instruct”, and receive a correct response (responses vary per model run).
While this approach excels at typical document layouts, there are still limitations when handling highly visual (little text), spatially dependent, or layout heavy content that does not translate well into structured text.
To evaluate the effectiveness of document inlining, we conducted two experiments on a dataset of arXiv articles paired with related questions. Each article was provided in PDF form, and we randomly selected 100 article–question pairs. We then ran these pairs through selected models, using Claude 3.5-Sonnet to choose which responses were preferred. We selected Claude as the evaluator because the Anthropic API natively supports PDF ingestion.
In the first experiment, we compared an open-weight, text-only LLM (Qwen2.5-72B-Instruct) with GPT4o. The open-weight model used document inlining to process each PDF, while GPT4o received each page as an image. We found that Qwen2.5-72B-Instruct’s responses were preferred over GPT4o’s in 68% of the comparisons. Detailed results are provided in the chart below.
In the second experiment, we used the same VLM (Qwen2-VL-72B-Instruct) under two different setups: one with document inlining, and the other ingesting PDFs page by page as images. As illustrated in the chart below, document inlining led to a clear improvement in response quality.
Get started today in our docs. Use Document Inlining with a 1-line code edit for any LLM, including serverless, on-demand or fine-tuned models. Simply follow the OpenAI API specification for vision models and append #transform=inline to the content URL. You can also use Document Inlining in our UI playground for any model by enabling the “Transform” option. See this end to end demo for more!
Document_Inlining_Playground_Demo.mov
During public preview, Document Inlining incurs no added costs compared to our typical text models. You’ll pay only for output tokens and input tokens (including transcribed content) but will NOT incur additional costs for document parsing. See docs for more info.
Document Inlining is still compatible with LLM features like structured output / json mode to extract structured information from documents. Check out the below code snippet for usage of Document Inlining with JSON mode
For early access to the dedicated Fireworks Parser for use cases that require document storage, fill out this form.
Document Inlining showcases the power of compound AI systems. With Document Inlining, instead of relying on one vision model to handle all tasks, we achieve higher-quality, faster and more cost-efficient results by using a specialized parser and reasoning model. We’ll expand Document Inlining with other input transformations, including audio file inlining and inference-time search over long documents.
Fireworks makes it easy to build compound AI systems, by providing one place for:
Keep in touch with us on Discord or Twitter. Stay tuned for more updates coming soon!