VLM Safety Failures: Why safe scenes get flagged as dangerous

Jun 18, 2026

Engineering

Youngsook Song
Researcher
Dasol Choi
Principal Researcher @AIM Intelligence

An urgent fire report comes in. Someone believes a house is on fire. Firefighters arrive, enter the home, and find everything calm. There is no smoke, no smell of burning, and no visible danger.

The source turns out to be a high-definition television. The homeowner had left a fireplace video playing, and someone outside saw the vivid flames through the window and mistook them for a real fire. Similar incidents were reported in New York and Seoul in 2023.

The episode shows how strongly visual information can shape judgment. Fire is a salient danger cue, and the surrounding context, in this case the fact that the flames were on a television screen, can be easy to miss.

If an AI system monitoring a smart home, a city CCTV feed, or an industrial site made the same mistake repeatedly, the issue would no longer be a small anecdote. False alarms affect operating costs and user trust.

This is the question explored in Better Safe Than Sorry? Overreaction Problem of Vision Language Models in Visual Emergency Recognition:

Can current vision-language models distinguish genuine emergencies from visually similar but safe situations?

The overreaction problem in vision-language models. Left: an actual fire, correctly identified as dangerous. Right: a fireplace video playing on a television, which looks similar but is safe, and is misclassified as dangerous.

Overreaction in Safety Judgments

Vision-language models have become increasingly capable of describing images, answering questions, and reasoning about visual scenes. These capabilities are moving beyond image captioning into areas connected to real-world safety, including smart home monitoring, CCTV analysis, accessibility tools, and content moderation.

When building safety systems, developers naturally take a defensive stance. It seems safer for a system to react too often than to miss a serious danger. But when this principle becomes too strong, it creates a different problem. A car alarm that keeps going off because of small vibrations may be useful at first, but users will eventually start ignoring it.

AI alerts can fail in the same way. If a model raises an alarm whenever it sees visual cues such as smoke, blood, knives, flames, or a person lying down, without weighing the surrounding context, false alarms can accumulate quickly in real deployments. A plastic toy knife may be interpreted as an armed threat, a fire drill as a real disaster, or a fireplace video on a television as an actual fire.

In this post, overreaction refers to this tendency: a model detects real danger reasonably well, but also labels safe situations as dangerous, producing many false positives.

Why Context Matters in Image Understanding

This problem is not limited to safety applications. For image understanding, it marks the gap between detecting objects and interpreting scenes. Detecting visual cues such as fire, blood, knives, or a person lying down is not the same as deciding whether those cues indicate real danger.

Traditional image recognition focused heavily on identifying what objects appear in an image. As vision-language models enter products and services, the task becomes broader. Models have to interpret both what is visible and the situation in which it appears. The same flame can mean an actual fire, a fireplace video, a movie poster, welding work, or a training exercise.

Real input environments often mix these cases. Cameras may capture images on screens, reflections in glass, billboards, posters, training equipment, or augmented reality elements alongside the physical world. In video input, these scenes arrive not as a single still image but as a stream of frames. A model that cannot combine visual cues with context can be fragile in safety applications and in broader multimodal systems that interpret real-world imagery.

Designing the Visual Emergency Recognition Benchmark

A standard image classification dataset is not enough to measure this problem. If we collect photos of fires, knives, and accidents from the web, we can test whether a model detects objects that look dangerous. That does not tell us whether the model understands the situation.

To measure this, the paper introduces VERI (Visual Emergency Recognition), a diagnostic benchmark built to test whether a model understands the semantic difference between two scenes that look similar.

VERI is built around contrastive image pairs. Each pair contains two visually similar images. One depicts a genuine emergency that requires intervention, while the other shows a visually similar but safe situation.

For example, one image may show a person receiving CPR on the street in a real emergency. Its counterpart may show CPR training with a mannequin in a classroom. Visually, both images contain a body-like figure lying down and someone pressing on the chest. Semantically, however, one is a life-threatening event and the other is a training scenario. To answer correctly, the model has to read the purpose and context of the scene, not just the arrangement of objects.

The dataset is organized as follows.

Item	Composition
Synthetic images	100 pairs, 200 images
Real-world validation set	25 pairs, 50 images
Categories	Accidents and unsafe behaviors, personal medical emergencies, natural disasters
Evaluation questions	Risk identification, emergency response
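
As a rough illustration of how the paired layout in the table could be represented in code, here is a minimal sketch. The class name, field names, and file paths are hypothetical and are not taken from the paper's release; only the pair counts and category names come from the table.

```python
from dataclasses import dataclass

# Category names follow the table above; the identifiers are hypothetical.
CATEGORIES = (
    "accidents_and_unsafe_behaviors",
    "personal_medical_emergencies",
    "natural_disasters",
)


@dataclass(frozen=True)
class ContrastivePair:
    pair_id: str
    category: str
    emergency_image: str  # path to the genuine-emergency image
    safe_image: str       # path to the visually similar safe image

    def labeled_images(self):
        """Return both images with their ground-truth labels."""
        return [(self.emergency_image, "dangerous"), (self.safe_image, "safe")]


def flatten(pairs):
    """Expand pairs into a flat list of (image, label) evaluation items."""
    return [item for pair in pairs for item in pair.labeled_images()]


# Toy stand-in for the 100 synthetic pairs; the paths are placeholders.
pairs = [
    ContrastivePair(
        pair_id=f"p{i:03d}",
        category=CATEGORIES[i % len(CATEGORIES)],
        emergency_image=f"p{i:03d}_emergency.png",
        safe_image=f"p{i:03d}_safe.png",
    )
    for i in range(100)
]
items = flatten(pairs)
print(len(pairs), len(items))  # 100 200
```

Keeping the two images in one record makes the evaluation set balanced by construction: every dangerous image has exactly one safe counterpart.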

The main VERI dataset uses synthetic images for two reasons.

First, real emergency images raise privacy, consent, and ethical concerns. Collecting and distributing large numbers of images showing injured people or accident scenes requires particular care.

Second, real-world images are difficult to control. It is almost impossible to find a real patient photo and a CPR training photo with the same lighting, composition, and placement of people. VERI is designed to test context, not background noise. Synthetic images allow the visual elements of two scenes to be matched closely, making it possible to evaluate whether a model can read the difference in meaning.

For that level of control, the image generation process was built around contrastive pairs. The researchers used GPT-4o to create prompts that preserved composition and similar visual elements while making one image depict a genuine emergency and the other a safe situation. They then set validation criteria: emergency images had to depict situations requiring actual intervention, while safe images had to be clearly non-threatening yet visually ambiguous enough to confuse humans or models. The researchers and an independent evaluator repeated image generation and review until the pairs met those criteria.

Contrastive pairs from the Visual Emergency Recognition dataset across accidents and unsafe behaviors, personal medical emergencies, and natural disasters. The top row (red backgrounds) shows genuine emergencies requiring intervention, while the bottom row (green backgrounds) shows visually similar but safe scenarios.

Risk Identification and Emergency Response

The evaluation is divided into two parts. First, the model judges whether the scene is dangerous or safe. Then, for scenes judged to be dangerous, it suggests an appropriate emergency response.

The first task is risk identification. The model sees an image and performs binary classification: dangerous or safe. The evaluation uses precision, recall, and F1 score, which together capture whether the model catches dangerous scenes and whether it correctly recognizes safe scenes as safe.
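
For reference, the three metrics can be computed from the standard confusion counts, with "dangerous" as the positive class. The sketch below uses a toy set of labels and predictions invented for illustration; it is not the paper's evaluation code.

```python
def binary_metrics(labels, predictions, positive="dangerous"):
    """Compute precision, recall, and F1 for the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)

    # Guard against empty denominators (no positive predictions or labels).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy example: every emergency is caught, but 3 of 4 safe scenes are flagged.
labels = ["dangerous"] * 4 + ["safe"] * 4
predictions = ["dangerous"] * 4 + ["dangerous", "dangerous", "dangerous", "safe"]

metrics = binary_metrics(labels, predictions)
print({name: round(value, 3) for name, value in metrics.items()})
# {'precision': 0.571, 'recall': 1.0, 'f1': 0.727}
```

The toy model has perfect recall, yet fewer than three in five of its "dangerous" calls are correct, which is the shape of result the benchmark is designed to expose.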

The second task is emergency response. When the model correctly identifies a genuine emergency, it is asked what response should be taken. The answer is scored from 0 to 1 against a reference response bank. This tests whether a model can provide appropriate action guidance after detecting danger.

The experiment evaluates 17 vision-language models, including open-source models and models accessed through commercial APIs. The list includes Qwen2.5-VL, LLaVA-Next, InternVL3, Pixtral, the Mistral family, Idefics2, Phi-3.5-vision, GPT-4o, Gemini-2.5-Flash, and Claude-4-Sonnet. Model sizes range from 2B to 124B parameters.

High False Positive Rates on Safe Scenes

The results are consistent across the evaluated models. Recall on emergency images ranges from 70% to 100%, meaning the models catch most genuine emergencies.

The problem appears on safe scenes. Precision remains between 0.51 and 0.72, so roughly 28% to 49% of the scenes a model flags as dangerous are actually safe. Models lean toward not missing danger, but often classify safe scenes as dangerous.

The false positive rate, where safe images are misclassified as dangerous, ranges from 31% to 96% depending on the model. By contrast, missed emergencies range from 2% to 30%. Errors where safe scenes are treated as dangerous are much more common than errors where real danger is missed.
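
The link between these rates and precision can be made explicit. On a balanced set, which the paired design implies, precision follows directly from recall and the false positive rate. The sketch below is a worked calculation under that assumption; the two input pairs are illustrative combinations, not rows from the paper's results tables.

```python
def precision_from_rates(recall, false_positive_rate, n_dangerous=100, n_safe=100):
    """Derive precision from recall and false positive rate.

    true positives  = recall * n_dangerous
    false positives = false_positive_rate * n_safe
    """
    true_positives = recall * n_dangerous
    false_positives = false_positive_rate * n_safe
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0


# Illustrative (recall, false positive rate) combinations, not measured results.
for recall, fpr in [(1.00, 0.96), (0.90, 0.38)]:
    precision = precision_from_rates(recall, fpr)
    print(f"recall={recall:.2f} fpr={fpr:.2f} -> precision={precision:.2f}")
# recall=1.00 fpr=0.96 -> precision=0.51
# recall=0.90 fpr=0.38 -> precision=0.70
```

A model that flags almost every image reaches perfect recall but only about 0.51 precision on a balanced set, barely better than labeling everything dangerous.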

Commercial models follow the same pattern. GPT-4o has a 38% false positive rate, Gemini-2.5-Flash 39%, and Claude-4-Sonnet 42%. This suggests that the issue is not limited to smaller open-source models. It appears across current vision-language models when they interpret safety-related visual cues.

Model size also does not solve the problem consistently. In the Qwen2.5-VL family, precision improves somewhat as model size increases. In other families, however, mid-sized models sometimes outperform larger ones. Even Pixtral-Large, the largest model in the experiment at 124B parameters, still shows overreaction. Scaling alone does not reliably improve the ability to distinguish a television fire from a real one.

Precision-recall tradeoff across models. Point size indicates model parameter count. Most models cluster in the high-recall, lower-precision region, showing a tendency toward overreaction.

False positives (safe classified as dangerous) and false negatives (dangerous classified as safe) across models. False positives outnumber missed emergencies for all models, including commercial ones.

Two Sources of Overreaction

The error analysis shows two recurring patterns.

The first is visual misinterpretation. The model misreads what is in the image. For example, it may mistake a mannequin for a real person, or interpret fire on a television screen as a real fire.

The second is contextual overinterpretation. The model correctly identifies the visual elements, but exaggerates their safety implications. For example, it may treat routine farming without protective equipment as a life-threatening danger, interpret a card trick as a choking hazard, or read a controlled water or fire environment as an actual disaster.

Contextual overinterpretation dominates the errors. At least 88% of model false positives fall into this category. In the natural disaster category, all errors are classified as contextual overinterpretation. Models can recognize elements such as fire, water, smoke, or waves, but they do not reliably judge whether those elements appear in a controlled environment, a staged scene, or a representation on a screen.

In each of these cases, the model detects visual cues associated with danger but underuses cues that signal safety, such as people smiling, training equipment, a movie screen, or a controlled facility. A controlled artificial wave pool, for example, is classified as a tsunami disaster, and a magician pulling cards from the mouth as a choking emergency.

Seven safe scenarios are misclassified as dangerous by all 17 evaluated models. These cases reveal common triggers for overreaction across current vision-language models. When visual elements strongly associated with danger appear, models may fail to incorporate nearby cues that indicate safety.
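
One straightforward way to find such universally misclassified cases is to intersect each model's set of false positives. The sketch below assumes per-model sets of misclassified safe image IDs are already available; the model names and image IDs are made up for illustration.

```python
def universal_false_positives(false_positives_by_model):
    """Return the safe images that every model labeled as dangerous."""
    image_sets = [set(ids) for ids in false_positives_by_model.values()]
    if not image_sets:
        return set()
    return set.intersection(*image_sets)


# Hypothetical per-model false positives (safe images flagged as dangerous).
false_positives_by_model = {
    "model_a": {"drive_in_storm", "ketchup_spill", "wave_pool"},
    "model_b": {"drive_in_storm", "ketchup_spill"},
    "model_c": {"drive_in_storm", "ketchup_spill", "cpr_mannequin"},
}

print(sorted(universal_false_positives(false_positives_by_model)))
# ['drive_in_storm', 'ketchup_spill']
```

Images that survive the intersection are useful diagnostic cases because they point to shared triggers rather than the quirks of one model.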

One representative case is a drive-in theater. Cars are parked in an open lot, and a large thunderstorm scene is playing on the screen. To a human observer, the scene is a movie. All 17 models classify it as a real natural disaster. Even though the movie screen provides a clear frame, the models treat the storm on the screen as a storm in the world.

These seven universally misclassified cases differ somewhat from the dominant pattern of contextual overinterpretation. The paper analyzes them mainly as cases of visual misinterpretation. Some involve media representations mistaken for reality, such as storms on drive-in theater screens or floods on posters. Others involve visually similar objects being confused, such as ketchup for blood or training mannequins for real people in danger. In these cases, the error starts before contextual reasoning, at the level of what the model thinks it sees.

Examples of safe situations misclassified as dangerous. Models either misread visual elements (visual misinterpretation) or correctly identify the elements but exaggerate their risk (contextual overinterpretation), such as treating a child's shirt as a choking hazard or a car mirror as a blinding danger for a cyclist.

Risk Judgment and Response Generation Are Different

The ability to suggest a response after correctly recognizing danger improves more clearly with model size. Qwen2.5-VL improves from 0.46 to 0.70 in emergency response score, InternVL3 from 0.50 to 0.64, and the Mistral family from 0.59 to 0.68.

Risk identification precision, however, does not scale consistently with model size. This suggests that procedural knowledge for emergency response and contextual reasoning for judging whether a scene is actually dangerous may be different capabilities.

Current models can use linguistic knowledge about emergency procedures relatively well, but they can still struggle to determine whether the scene in front of them is a real emergency or a training scenario. They may answer "what should be done if this is dangerous?" reasonably well, while still being unstable on "is this actually dangerous?" If the initial judgment is wrong, the response that follows becomes part of the false alarm.

This distinction matters in real services. A model that generates high-quality emergency guidance can still reduce system trust if it triggers too many false alarms. The quality of a safety system depends both on the quality of its answers and on whether its alerts fire only when they should.

Prompt Sensitivity Analysis

One possible mitigation is to make the prompt more conservative. For example, the model can be instructed to choose "dangerous" only when there is a clear life-threatening risk.

The paper tests this by adding a conservative condition to the risk identification prompt: choose Dangerous only if there is a clear and immediate risk requiring action; otherwise choose Safe.
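
A minimal sketch of how the two prompt variants might be assembled and their answers parsed is shown below. Only the conservative condition is taken from the description above; the base prompt wording, the function names, and the parsing rule are assumptions made for illustration.

```python
# Hypothetical base instruction; the paper's exact default wording is not shown here.
BASE_PROMPT = (
    "Look at the image and decide whether the situation is Dangerous or Safe. "
    "Answer with a single word."
)

# Conservative condition as described in the text above.
CONSERVATIVE_CONDITION = (
    "Choose Dangerous only if there is a clear and immediate risk "
    "requiring action; otherwise choose Safe."
)


def build_prompt(conservative=False):
    """Return the default prompt, optionally with the conservative condition."""
    if conservative:
        return f"{BASE_PROMPT} {CONSERVATIVE_CONDITION}"
    return BASE_PROMPT


def parse_label(response):
    """Map a free-text model answer to 'dangerous', 'safe', or None."""
    text = response.strip().lower()
    if text.startswith("dangerous"):
        return "dangerous"
    if text.startswith("safe"):
        return "safe"
    return None  # unparseable answers should be logged, not guessed


print(build_prompt(conservative=True))
print(parse_label(" Safe."))  # safe
```

Returning None for unparseable answers keeps formatting failures separate from genuine misclassifications when the two prompts are compared.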

Precision increases, but recall decreases. In other words, false alarms decrease, but missed emergencies increase. In a safety system, which error matters more depends on the application. In home monitoring, user trust may make precision more important. In industrial sites or emergency response, missing real danger may be more costly.

Prompting alone does not remove the underlying problem. If the condition is too conservative, the model may hesitate even on real emergencies. False alarms go down, but the risk of missing real danger goes up. Post-hoc threshold adjustment shows a similar limitation. Overreaction is tied less to output formatting than to how current vision-language models combine visual cues with contextual cues.
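
The same tradeoff can be seen with post-hoc thresholding. The sketch below assumes the model exposes a danger score between 0 and 1 for each image, which is an assumption about the setup rather than something the post specifies; the scores and labels are invented.

```python
def precision_recall_at(threshold, scored_items):
    """Compute precision and recall when flagging scores >= threshold.

    scored_items is a list of (danger_score, is_dangerous) tuples.
    """
    tp = sum(1 for score, truth in scored_items if score >= threshold and truth)
    fp = sum(1 for score, truth in scored_items if score >= threshold and not truth)
    fn = sum(1 for score, truth in scored_items if score < threshold and truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scores: safe lookalikes score nearly as high as real emergencies.
scored_items = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, False), (0.60, True), (0.55, False), (0.30, False),
]

for threshold in (0.50, 0.75, 0.88):
    precision, recall = precision_recall_at(threshold, scored_items)
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
# threshold=0.50 precision=0.57 recall=1.00
# threshold=0.75 precision=0.75 recall=0.75
# threshold=0.88 precision=1.00 recall=0.50
```

Because the safe lookalikes score almost as high as the real emergencies, raising the threshold buys precision only by giving up recall. Moving the cutoff does not change how the scores were produced.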

Effect of conservative prompting on precision and recall. Arrows show movement from the default prompt to the conservative prompt. Precision consistently increases while recall decreases: prompting can reduce false alarms, but it can also increase missed emergencies.

The Pattern Repeats on Real Images

The paper also checks whether this behavior appears on real images. It uses an additional validation set of 25 real-world contrastive pairs, for a total of 50 images.

The result is similar. Representative models achieve high recall, between 88% and 100%, but precision remains between 57% and 68%. False positives continue to dominate. Real-world pairs are harder to match as precisely as synthetic ones, and the validation set is smaller, but the consistency of the result supports the diagnostic value of the synthetic benchmark.

Synthetic data does not replace the real world. But when the goal is to isolate a specific capability, it can be useful. In this case, synthetic data makes it possible to test visually similar but semantically different scenes under controlled conditions.

False Alarms Are Operational Costs

When vision-language models enter real products and services, they no longer respond to one input at a time in a demo setting. They become connected to cameras, sensors, alert systems, dashboards, and user workflows.

Consider a city CCTV control room or a smart factory. Saying that a model "detects flames immediately" is not enough to judge operational performance. The system also needs to distinguish flames on a screen, training smoke, movie posters, welding sparks, and actual fire.

In a stream of many frames per second, a model that misreads context can generate thousands of false alarms, and those false positives are not just numbers. They can cause unnecessary dispatches, alert fatigue, operator stress, and loss of trust in the system. If alerts are repeatedly wrong, real warnings may eventually be ignored.
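
A back-of-the-envelope calculation shows how quickly this adds up. Every input below is an assumed value chosen for illustration: the sampling rate, the share of frames containing a danger-like but safe cue, and the reuse of a benchmark false positive rate (38%, the GPT-4o figure above) on live video, where it may not transfer directly.

```python
def expected_false_alerts(frames_per_second, seconds, lookalike_fraction,
                          false_positive_rate):
    """Estimate false alerts if each sampled frame is judged independently.

    lookalike_fraction: share of frames showing a danger-like but safe cue.
    false_positive_rate: chance such a frame is labeled dangerous.
    """
    frames = frames_per_second * seconds
    return frames * lookalike_fraction * false_positive_rate


# Assumed values: 1 sampled frame per second, one camera, one day,
# 1% of frames contain a safe lookalike (for example, a TV showing a fire).
per_day = expected_false_alerts(
    frames_per_second=1,
    seconds=24 * 60 * 60,
    lookalike_fraction=0.01,
    false_positive_rate=0.38,
)
print(round(per_day))  # 328
```

Real systems would deduplicate consecutive alerts, so the raw count overstates what an operator sees. Even so, the estimate shows why per-image error rates need to be translated into per-camera, per-day terms before deployment.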

Deploying a risk judgment model requires more than model scores. Teams need to decide when alerts should be triggered, how false alarms and missed emergencies should be logged, how thresholds should vary by category, and where human review should enter the workflow.

The results suggest several practical directions.

Evaluate both precision and recall. Deploying a risk detection model based only on high recall can hide the cost of false alarms.
Use category-specific policies. Medical emergencies, accidents, and natural disasters involve different visual cues and different costs for false alarms, so a single threshold or prompt is unlikely to fit all cases.
Evaluate contextual reasoning separately. Recognizing a dangerous-looking element and judging whether it indicates real danger are different capabilities.
Design the operational loop. Alert frequency, user feedback, human review, and post-hoc analysis logs can affect the system as much as model performance.
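
The category-specific policy and human-review points above could be sketched as a small routing function. The threshold values, category keys, and the assumption that the model emits a danger score are all hypothetical; a real deployment would set them from measured false alarm and miss costs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryPolicy:
    alert_threshold: float   # at or above this score, alert automatically
    review_threshold: float  # at or above this score, send to human review


# Hypothetical per-category thresholds, chosen only for illustration.
POLICIES = {
    "medical": CategoryPolicy(alert_threshold=0.80, review_threshold=0.40),
    "accident": CategoryPolicy(alert_threshold=0.85, review_threshold=0.50),
    "natural_disaster": CategoryPolicy(alert_threshold=0.90, review_threshold=0.60),
}


def route(category, danger_score, log):
    """Decide what to do with one model judgment and record the decision."""
    policy = POLICIES[category]
    if danger_score >= policy.alert_threshold:
        decision = "alert"
    elif danger_score >= policy.review_threshold:
        decision = "human_review"
    else:
        decision = "no_action"
    # Logging every decision supports later false alarm and miss analysis.
    log.append({"category": category, "score": danger_score, "decision": decision})
    return decision


audit_log = []
print(route("medical", 0.85, audit_log))           # alert
print(route("natural_disaster", 0.85, audit_log))  # human_review
print(route("accident", 0.20, audit_log))          # no_action
```

The same score leads to different actions in different categories, and the middle band gives human reviewers a defined place in the workflow instead of forcing a binary choice.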

Next Step: Temporal and Physical Context

Current vision-language models are sensitive to visual cues strongly associated with danger, such as fire, blood, or a person lying down. But they do not reliably distinguish whether those cues indicate a real emergency or appear in safe contexts such as training, simulation, or representation. Scaling alone does not solve this. Conservative prompting also does not address the cause directly: it reduces false alarms, but it also increases the chance that real emergencies will be missed. What needs to be evaluated is not object recognition alone, but the ability to interpret the situation in which a visual cue appears.

This issue becomes more important as interest in physical AI grows. Robots, autonomous vehicles, and field-monitoring agents do not judge from a single image alone. They need to observe how people move over time, whether flames or smoke spread, whether nearby people are evacuating or participating in a drill, and how objects interact in physical space. Temporal perception and physical context become more important than static visual recognition alone.
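
A very simple form of temporal context is to require agreement across recent frames before raising an alert. The sketch below is a sliding-window vote with hypothetical window and vote counts; it is not a method from the paper.

```python
from collections import deque


class TemporalVote:
    """Raise an alert only when enough recent frames were judged dangerous."""

    def __init__(self, window=5, required=4):
        if not 0 < required <= window:
            raise ValueError("required must be between 1 and window")
        self.required = required
        self.recent = deque(maxlen=window)  # oldest judgments drop off

    def update(self, frame_is_dangerous):
        """Add one per-frame judgment and return whether to alert now."""
        self.recent.append(bool(frame_is_dangerous))
        return sum(self.recent) >= self.required


# Flickering single-frame detections never reach the vote threshold.
flicker = TemporalVote(window=5, required=4)
print([flicker.update(f) for f in [True, False, False, True, False, False]])
# [False, False, False, False, False, False]

# A sustained detection triggers on the fourth consecutive dangerous frame.
sustained = TemporalVote(window=5, required=4)
print([sustained.update(f) for f in [True, True, True, True]])
# [False, False, False, True]
```

Voting of this kind only suppresses transient misfires. A fireplace video on a television is flagged in every frame, so it would still pass the vote, which is why richer cues about motion, spread, and physical context are needed.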

The Visual Emergency Recognition benchmark sits at this transition. Safer models should be evaluated on whether they understand what dangerous-looking cues mean in context, not just whether they detect those cues. Reducing both false alarms and missed emergencies, setting category-specific policies, and involving human review are practical starting points.
