Multimodal GEO Audit with Gemini-Embedding-2

Classic SEO (TF-IDF, keywords) is not suitable for RAG systems.

LLMs do not work with lexical strings but with vector representations.

The main criterion is the semantic similarity between a query and the content.

Tools like Screaming Frog SEO Spider already make it possible to estimate this similarity on the text side.

For my part, I developed scripts to go further (chunking, scoring per passage, etc.).

But a web page today is multimodal. It contains:

Text (HTML)
Images
Videos
Metadata
Sometimes audio

These elements all contribute to the overall signal used during retrieval.

Most current audits use text-only models (such as OpenAI's text-embedding series). This approach is incomplete.

Before Gemini Embedding 2, traditional tools offered:

No vectorization of all these assets
No unified scoring
No visibility into the contribution of each element

Result of a multimodal audit:

Identification of weak chunks
Image impact measurement
Prioritization of optimizations based on scores

TLDR

Why Gemini-Embedding-2?

With the release of Google's Gemini-Embedding-2, this limitation is resolved.

What the model actually makes possible:

Text + image vectorization in the same space
Homogeneous similarity calculation
Scoring by asset
Aggregation into an overall score

The structural advantage of gemini-embedding-2-preview lies in its ability to natively process composite objects. The script does not settle for analyzing an image's alternative text; it sends the image's raw bytes directly to the embedding model via types.Part.from_bytes, so that they are projected into the same vector space as the text query.

Multimodal GEO Audit with Gemini-Embedding-2

RAG engines do not ingest a web page as a single block. They ‘chunk’ it (cut it into passages) to maximize relevance within their context window.

The Multimodal GEO auditor reproduces this exact behavior:

Extraction of the central entity: If no query is provided, the tool uses an LLM (gemini-2.5-flash) in ‘Quality Rater’ mode to read the DOM and extract the Core Search Intent. Dan Petrovic (Dejan AI) has widely demonstrated that modern optimization targets features, not character strings.

Contextual chunking: The HTML is cleaned of noise (header, footer, nav). The <p> tags are concatenated with their parent heading (<h2> or <h3>).

Native multimodal evaluation: This is the strength of Gemini-Embedding-2. Instead of reading only the alt attribute of an image, the script downloads the image and sends its bytes directly to the model. The image is vectorized and compared to the text query in the same multidimensional space.

Mathematical calculation: Relevance is no longer a matter of opinion; it is derived from the cosine distance between the chunk vector (text or image) and the query vector.

How to use the multimodal GEO auditor

Extraction of the entity and intent (LLM fallback)

If the audit is launched without a strict target query, the query is detected by the model: the script runs gemini-2.5-flash with a Search Quality Rater style prompt on the first 20,000 characters of the raw DOM.

The goal is to isolate the ‘Core Search Intent’. The LLM therefore identifies the entity covered by the page before any vector calculation.
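The fallback logic above can be sketched as follows. This is a minimal illustration, not the script's actual code: the prompt wording, the function names and the `ask_llm` callable (a stand-in for the call to gemini-2.5-flash) are all assumptions. Only the 20,000-character limit and the Quality Rater framing come from the description above.

```python
from typing import Callable, Optional

MAX_DOM_CHARS = 20_000  # limit stated in the article

# Hypothetical prompt wording; the real prompt used by the script is not shown.
RATER_PROMPT = (
    "You are a Search Quality Rater. Read the HTML below and return, "
    "in a few words, the Core Search Intent of the page.\n\nHTML:\n{dom}"
)


def build_intent_prompt(raw_dom: str) -> str:
    """Keep only the first 20,000 characters of the raw DOM and wrap them in the prompt."""
    return RATER_PROMPT.format(dom=raw_dom[:MAX_DOM_CHARS])


def resolve_query(
    user_query: Optional[str],
    raw_dom: str,
    ask_llm: Callable[[str], str],
) -> str:
    """Return the user's query if one was given, otherwise ask the LLM for the intent.

    `ask_llm` is any function that takes a prompt and returns the model's text
    answer (for example a thin wrapper around a gemini-2.5-flash call).
    """
    if user_query and user_query.strip():
        return user_query.strip()
    return ask_llm(build_intent_prompt(raw_dom)).strip()
```

Passing the LLM call in as a function keeps the fallback testable offline: a fake `ask_llm` that returns a fixed string is enough to check the truncation and the branching.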

Chunking and passage indexing

Martin Splitt (Google) as well as the director of Perplexity have repeatedly confirmed that indexing and relevance assessment operate at the passage level, not at the level of the whole page. The script uses BeautifulSoup to clean the markup (removal of <nav>, <footer>, <aside>, <script>).

The chunking logic preserves the context:

The <p> tags are extracted and concatenated with their parent heading (<h1>, <h2>, <h3>).
For <img> tags, the source URL is extracted. The file is downloaded, converted if necessary (to JPEG for compatibility), and associated with the immediately surrounding paragraph to provide positional context.
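To make the chunking rules concrete, here is a dependency-free sketch. It uses the standard library's html.parser as a stand-in for BeautifulSoup, so it is not the script's implementation. The class and function names are hypothetical, the image context is simplified to the nearest preceding paragraph, and the download and JPEG conversion steps are left out. It also assumes reasonably well-formed HTML (explicitly closed <p> tags).

```python
from html.parser import HTMLParser

# Noise containers named in the article (header, footer, nav, aside, script).
NOISE_TAGS = {"header", "footer", "nav", "aside", "script"}
HEADING_TAGS = {"h1", "h2", "h3"}


class ChunkParser(HTMLParser):
    """Collects 'Heading: paragraph' text chunks and image references."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._noise_depth = 0      # > 0 while inside a noise container
        self._heading = ""         # text of the most recent h1/h2/h3
        self._last_paragraph = ""  # used as positional context for images
        self._open_tag = None      # 'p' or a heading tag currently being read
        self._buffer = None        # text pieces of the open tag

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1
        elif self._noise_depth:
            return
        elif tag in HEADING_TAGS or tag == "p":
            self._open_tag, self._buffer = tag, []
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.chunks.append({
                    "type": "image",
                    "src": src,
                    "heading": self._heading,
                    "context": self._last_paragraph,
                })

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self._noise_depth = max(0, self._noise_depth - 1)
        elif self._noise_depth or tag != self._open_tag:
            return
        else:
            # Join the collected pieces and normalize whitespace.
            text = " ".join("".join(self._buffer).split())
            self._open_tag, self._buffer = None, None
            if tag in HEADING_TAGS:
                self._heading = text
            elif text:
                self._last_paragraph = text
                content = f"{self._heading}: {text}" if self._heading else text
                self.chunks.append({"type": "text", "content": content})

    def handle_data(self, data):
        if not self._noise_depth and self._buffer is not None:
            self._buffer.append(data)


def chunk_html(html: str) -> list:
    """Return the list of text and image chunks found in the HTML string."""
    parser = ChunkParser()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each text chunk carries its parent heading in the string itself, which is what allows a single embedding call per chunk to capture both the passage and its context.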

Vector projection

Google's models require the task type (TASK_TYPE) to be set explicitly in order to calibrate the latent space.

The target query vector is generated with the directive Task: Search Result.
Each extracted asset (text, image bytes, PDF link) is vectorized with Task: retrieval_document.

Cosine distance calculation and weighting matrix

Semantic proximity is calculated with the cosine function from the SciPy library: score = float(1 - cosine(query_vec, vec)). A score of 1.0 indicates a perfect vector match.
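SciPy's cosine function returns a cosine distance, which is why the score is written as 1 minus that value. For readers who want to see what the number means, the same similarity can be written by hand. The sketch below is a pure-Python equivalent for illustration only (the function name is an assumption), not the code used by the script.

```python
import math


def cosine_similarity(query_vec, vec):
    """Cosine similarity: dot product divided by the product of the norms.

    Equivalent to 1 - scipy.spatial.distance.cosine(query_vec, vec).
    """
    if len(query_vec) != len(vec):
        raise ValueError("vectors must have the same dimension")
    dot = sum(q * v for q, v in zip(query_vec, vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_v = math.sqrt(sum(v * v for v in vec))
    if norm_q == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_q * norm_v)
```

Two vectors pointing in the same direction score 1.0 whatever their length, and orthogonal vectors score 0.0. Only the direction in the embedding space matters, not the magnitude.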

To evaluate the page, I apply a weighting matrix:

HTML (text): 45%
Image (bytes): 25%
Image metadata (if download failed): 10%
Video: 10%
Audio / PDF: 5%

A RAG viability threshold is set at 0.75.

The ‘gap’ is calculated for each asset (0.75 - score). The intervention priority is obtained by multiplying this gap by the weight of the element (weight * gap).
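The arithmetic can be sketched as follows. The weights, the 0.75 threshold and the weight * gap formula come from the text above. The category keys and function names are hypothetical, and the way the overall score is aggregated is an assumption: the listed weights add up to 95%, so this sketch averages the scores per category and renormalizes by the weights of the categories actually present on the page. The real script may aggregate differently.

```python
# Weights from the article; the dictionary keys are illustrative names.
WEIGHTS = {
    "html": 0.45,
    "image": 0.25,
    "image_metadata": 0.10,
    "video": 0.10,
    "audio_pdf": 0.05,
}
THRESHOLD = 0.75  # RAG viability threshold


def priority(score: float, category: str) -> float:
    """Intervention priority = weight * gap, with gap = 0.75 - score.

    Assets above the threshold get a negative value and sort last.
    """
    gap = THRESHOLD - score
    return WEIGHTS[category] * gap


def overall_score(assets) -> float:
    """Weighted mean of per-category averages (assumed aggregation).

    `assets` is an iterable of (category, score) pairs.
    """
    by_category = {}
    for category, score in assets:
        by_category.setdefault(category, []).append(score)
    if not by_category:
        raise ValueError("no assets to score")
    total_weight = sum(WEIGHTS[c] for c in by_category)
    weighted = sum(
        WEIGHTS[c] * sum(scores) / len(scores)
        for c, scores in by_category.items()
    )
    return weighted / total_weight
```

For example, a paragraph scoring 0.50 has a gap of 0.25 and a priority of 0.45 * 0.25 = 0.1125, while an image scoring 0.35 has a larger gap (0.40) but a lower priority of 0.25 * 0.40 = 0.10, because images weigh less than text.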

Example of a multimodal GEO audit with Gemini Embedding 2

I tested with the following URL: https://aioseo.fr/en/bing-ai-performance-how-to-exploit-the-data-for-ranker-on-chatgpt-copilot-and-other-llm/

Data extracted

Resolved query: Bing AI Data for LLM Ranking
Audited assets: 35 (text blocks and media)
Overall GEO score: 0.6585 (i.e. 65.85/100)

Multimodal GEO Performance Analysis

The overall score is 0.658, below the required minimum cosine similarity of 0.75.

The data generated in the reports (in particular the semantic flow and the priority matrix) reveal the mechanics of the penalty:

Semantic dilution: Text passages (HTML) show a strong variance. Some paragraphs drift away from the entity Bing AI Data. Lily Ray and Aleyda Solis regularly stress the importance of information density (information gain) for E-E-A-T. An LLM penalizes peripheral content (fluff). If a chunk drops to 0.50, it drags down the HTML average, which carries 45% of the overall score.
Multimodal mismatch: Generic or decorative images generate very low cosine scores. The model literally vectorizes pixels. If the image does not contain visual data correlated with the intent (Bing dashboards, ranking diagrams), its vector diverges strongly from the query.

Breaking down the graphs

Here is how to read and use the 6 graphs generated during this audit.

The multimodal global score

The result: 0.6585 / 1.0 (i.e. 65.85%)

What you need to understand: This score is a weighted average (HTML 45%, images 25%, etc.) of the cosine similarities of all page elements. To be systematically extracted by a RAG system, the empirical target is 0.75.
SEO/GEO interest: A score of 0.65 indicates that the document is relevant but diluted. If an LLM has to choose between your page and a competing page scoring 0.80 to formulate its AI Overview, it will choose the densest and closest vector. Your page will not cross the retrieval threshold.

Multimodal footprint

What we observe: These graphs compare the average of each format (text, image, video) with the critical red line at the 0.75 threshold.
SEO/GEO interest: They are used to diagnose failure by format. In GEO, we often forget the images. If your text scores 0.80 but your images score 0.30, your page is penalized. With Gemini-Embedding-2, a generic stock photo has no vector correlation with the entity Bing AI Data. The model sees pixels that mean nothing from a technical point of view. Result: the ‘image’ modality collapses and drags the score down.

The semantic flow of passages

What we observe: a curve that follows the score of each paragraph (<p>) in its order of appearance on the page, with a trend line (moving average).
SEO/GEO interest: This is the most powerful diagnostic tool for information gain. As Lily Ray and Aleyda Solis point out regarding the E-E-A-T criteria, information density is king. If the curve plunges in certain sections (for example, an overly long introduction or an off-topic H2), these chunks are drifting semantically away from the target query. For LLMs, this ‘fluff’ is toxic: it dilutes the relevance of your document in the vector space.
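The trend line can be reproduced with a few lines of code. The sketch below assumes a trailing moving average with a window of 3 passages. The window size, the function names and the use of the 0.75 threshold to flag weak sections are illustrative choices, not details confirmed by the audit script.

```python
def moving_average(scores, window=3):
    """Trailing moving average: each point averages the last `window` scores."""
    if window < 1:
        raise ValueError("window must be at least 1")
    averaged = []
    for i in range(len(scores)):
        start = max(0, i - window + 1)
        recent = scores[start:i + 1]
        averaged.append(sum(recent) / len(recent))
    return averaged


def weak_sections(scores, threshold=0.75, window=3):
    """Indices of passages where the trend line falls below the threshold."""
    trend = moving_average(scores, window)
    return [i for i, value in enumerate(trend) if value < threshold]
```

Working on the smoothed curve rather than on raw scores avoids flagging a single weak sentence in an otherwise strong section. The indices returned point to the stretches of the page where relevance is durably low.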

Semantic relevance distribution

What we observe: the dispersion (volatility) of scores within the same category (HTML vs images).
SEO/GEO interest: This indicates the consistency of your content. A very stretched boxplot for HTML shows a heterogeneous page: some passages are surgically precise, others are pure filler. In GEO, regularity (a short boxplot above 0.75) ensures that whichever chunk the engine extracts will be highly relevant for powering its response.
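The numbers behind a boxplot are easy to compute directly, which helps when comparing categories without looking at the chart. This sketch uses the standard library's statistics.quantiles. The function name and the returned keys are hypothetical, and the quartile method (Python's default) may differ slightly from the one used by the plotting library in the audit.

```python
import statistics


def boxplot_summary(scores):
    """Quartiles and interquartile range (IQR) of a list of cosine scores.

    A small IQR means consistent passages; a large one means a mix of
    precise passages and filler.
    """
    if len(scores) < 2:
        raise ValueError("at least two scores are needed")
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}
```

Comparing the `iqr` of the HTML scores with that of the image scores shows which category is the least consistent, and a `q1` above 0.75 corresponds to the short boxplot above the threshold described above.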

The prioritization matrix

What we observe: a cross between the relevance gap (the distance separating the element from the target score of 0.75) and the weight of the element in the page. The bigger and higher the bubble, the greater the urgency.
SEO/GEO interest: This is your mathematical editorial roadmap. In a classic audit, you are told to ‘improve the content’. Here, the matrix (and the associated CSV export) tells you exactly which H2 or which image must be deleted or rewritten first to gain leverage on the overall score. You no longer edit blind.

Examples of recommendations to optimize multimodal GEO

Based on these results, we can suggest:

Pruning weak chunks: Open the action_backlog.csv file generated by the script and sort by the priority column. Any <p> block whose cosine score is below 0.70 must be ruthlessly removed or densified with technical vocabulary strictly related to scraping, APIs, or LLM architecture.
Replacing images with visual data: Byte-level evaluation requires informative visuals. Replace illustrative images with annotated screenshots of Bing Webmaster Tools reports. The Gemini model will align the vectors of these new images with that of the query, mechanically raising the score of the ‘Image’ category (which accounts for 25% of the final score).
Hn/P structural alignment: Mark Williams-Cook validated that semantic search systems depend on deterministic markup. Each text located under an <h2> must respond immediately to the promise of that heading. The concatenation performed during chunking (Heading: text) requires this semantic proximity to generate a strong vector.
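The first recommendation can be automated with a short helper. The file name action_backlog.csv and the priority column come from the text above. The other column names used here (asset, score) and the function name are assumptions about the export format and should be adapted to the real header row.

```python
import csv
import io


def weak_blocks(csv_text: str, max_score: float = 0.70):
    """Rows whose cosine score is below `max_score`, highest priority first.

    Expects a header row containing at least 'score' and 'priority'
    ('score' is an assumed column name).
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    weak = [row for row in rows if float(row["score"]) < max_score]
    return sorted(weak, key=lambda row: float(row["priority"]), reverse=True)
```

To run it on the real export, read the file first, for example with `weak_blocks(open("action_backlog.csv", encoding="utf-8").read())`. The result is the ordered list of blocks to rewrite or remove.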
