Vision Language Models are Biased
Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks...
Multi-head self-attention (MHSA) is a key component of Transformers, a widely popular architecture in both language and vision. Multiple heads...
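For orientation, a minimal NumPy sketch of the multi-head self-attention operation this entry refers to; the shapes, the fused per-head split, and the random inputs are illustrative assumptions, not the exact formulation studied in the paper.

```python
# Minimal multi-head self-attention (MHSA) sketch in NumPy.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); w_*: (d_model, d_model) projection matrices."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # one attention map per head
    out = (attn @ v).transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return out @ w_o

rng = np.random.default_rng(0)
d_model, n_heads = 64, 8
x = rng.standard_normal((10, d_model))
weights = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)]
y = mhsa(x, *weights, num_heads=n_heads)  # -> shape (10, 64)
```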
Interpretability for Table Question Answering (Table QA) is critical, particularly in high-stakes industries like finance or healthcare. Although recent approaches...
Detecting object-level changes between two images across possibly different views is a core task in many applications that involve visual...
An Achilles heel of Large Language Models (LLMs) is their tendency to hallucinate nonfactual statements. A response mixing factual...
Generative AI (GenAI) holds significant promise for automating everyday image editing tasks, especially following the recent release of GPT-4o on...
Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate...
Most face identification approaches employ a Siamese neural network to compare two images at the image embedding level. Yet, this...
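As context for this entry, a hedged sketch of the embedding-level comparison a Siamese pipeline performs: both faces are encoded by the same network, then a single similarity score is thresholded. The embeddings here are random stand-ins and the threshold is an illustrative assumption.

```python
# Embedding-level face comparison, as in a Siamese setup (illustrative only).
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_identity(emb_a, emb_b, threshold=0.5):
    # One similarity score between the two image embeddings decides the match.
    return cosine_similarity(emb_a, emb_b) >= threshold

rng = np.random.default_rng(0)
emb_a, emb_b = rng.standard_normal(512), rng.standard_normal(512)  # hypothetical encoder outputs
print(same_identity(emb_a, emb_b))
```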
Nearest neighbors (NN) are traditionally used to compute final decisions, e.g., in Support Vector Machines or k-NN classifiers, and to...
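A minimal sketch of the nearest-neighbor decision rule this entry mentions, over a bank of precomputed embeddings; the data and the choice of Euclidean distance are illustrative assumptions.

```python
# k-NN decision: rank the support set by distance to the query, vote among the top k.
import numpy as np
from collections import Counter

def knn_predict(query, bank_embeddings, bank_labels, k=5):
    dists = np.linalg.norm(bank_embeddings - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(bank_labels[nearest].tolist()).most_common(1)[0][0]

rng = np.random.default_rng(0)
bank = rng.standard_normal((100, 32))          # hypothetical support embeddings
labels = rng.integers(0, 3, size=100)          # hypothetical class labels
print(knn_predict(rng.standard_normal(32), bank, labels, k=7))
```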
While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini-1.5 Pro, are powering various image-text applications and scoring...
CLIP-based classifiers rely on the prompt containing a {class name} that is known to the text encoder....
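For context, a minimal sketch of the prompt-based, zero-shot classification setup this refers to, using the openai/CLIP package (https://github.com/openai/CLIP); the label set, prompt template, and image path are illustrative assumptions, not taken from the paper.

```python
# Zero-shot classification with CLIP-style prompts containing a {class name}.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["dog", "cat", "horse"]  # hypothetical label set
prompts = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical image file

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```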
Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities, such as visual inputs....
Since BERT (Devlin et al., 2018), learning contextualized word embeddings has been a de-facto standard in NLP. However, the progress...
Image classifiers are information-discarding machines, by design. Yet, how these models discard information remains mysterious. We hypothesize that one way...
Face identification (FI) is ubiquitous and drives many high-stake decisions made by law enforcement. State-of-the-art FI approaches compare two images...
Explaining how important each input feature is to a classifier’s decision is critical in high-stake applications. An underlying principle behind...
Large-scale, multimodal models trained on web data such as OpenAI’s CLIP are becoming the foundation of many applications. Yet, they...
Explaining artificial intelligence (AI) predictions is increasingly important and even imperative in many high-stakes applications where humans are the ultimate...
Three important criteria of existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While...
Recent research in adversarially robust classifiers suggests their representations tend to be aligned with human perception, which makes them attractive...
Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stake applications. Hundreds of papers...
Multi-agent spatiotemporal modeling is a challenging task from both an algorithmic design and computational complexity perspective. Recent work has explored...
Do state-of-the-art natural language understanding models care about word order – one of the most important characteristics of a sequence?...
Interpretability methods often measure the contribution of an input feature to an image classifier’s decisions by heuristically removing it via...
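A short sketch of the removal-based attribution idea this entry describes: occlude one image region at a time and record how much the classifier's score for the target class drops. The patch size, fill value, and toy classifier are illustrative assumptions.

```python
# Occlusion-style attribution: larger score drop => more important region.
import numpy as np

def occlusion_map(model, image, target_class, patch=16, fill=0.5):
    h, w, _ = image.shape
    base = model(image)[target_class]
    heatmap = np.zeros((h // patch, w // patch))
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = fill  # heuristically "remove" this feature
            heatmap[i // patch, j // patch] = base - model(occluded)[target_class]
    return heatmap

dummy = lambda img: np.array([img.mean(), 1 - img.mean()])  # toy stand-in classifier
print(occlusion_map(dummy, np.random.rand(64, 64, 3), target_class=0).shape)  # (4, 4)
```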
Adversarial training has been the topic of dozens of studies and a leading method for...
Attribution methods can provide powerful insights into the reasons for a classifier’s decision. We argue that a...
Despite excellent performance on stationary test sets, deep neural networks (DNNs) can fail to generalize to out-of-distribution (OoD) inputs, including...
Large, pre-trained generative models have been increasingly popular and useful to both the research and wider communities. Specifically, BigGANs, a...
Motion-sensor cameras in natural habitats offer the opportunity to inexpensively and unobtrusively gather vast amounts of data on animals in...
Training deep neural networks on images represented as grids of pixels has brought to light an interesting phenomenon known as...