LLM Evaluation

Cosine Similarity as Logits?: A Scalable Knowledge Probe Using Embedding Vectors from Generative Language Models

There has been growing interest in utilizing pre-trained language models (PLMs) as soft knowledge bases. Knowledge probes evaluate a PLM's ability to fulfil such a role using relational knowledge stored in a knowledge graph (KG). However, knowledge probes for generative language models are slow and do not scale with input size. To this end, we propose a fast and scalable knowledge probe for generative PLMs and demonstrate its ability to probe using KGs that were previously infeasible to use.
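As a rough illustration of the idea named in the title, the sketch below ranks candidate entities by the cosine similarity between a query embedding and each candidate's embedding, treating the similarities as ranking scores. The abstract does not describe the actual procedure, so the function names (`cosine`, `rank_candidates`), the toy vectors, and the way embeddings are obtained are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_candidates(query_vec, candidate_vecs):
    """Return candidate names sorted by cosine similarity, highest first."""
    scores = {name: cosine(query_vec, vec) for name, vec in candidate_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy embeddings (hypothetical); in practice these would come from a generative PLM.
query = [1.0, 0.0, 1.0]
candidates = {
    "Paris": [0.9, 0.1, 0.8],
    "Tokyo": [0.0, 1.0, 0.1],
    "Berlin": [0.5, 0.5, 0.5],
}
ranking = rank_candidates(query, candidates)
print(ranking)
```

In a setup like this, candidate embeddings could be computed once and reused across queries, which is one plausible way such a probe might scale to large KGs.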

Mar 24, 2026

Evaluating Image Generation with Generation Probabilities Based on the Noisy Channel Model

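Recent advances in text-to-image (T2I) models have greatly improved the expressiveness and diversity of generated images. However, for generation involving long or complex instructions, it is difficult to evaluate outputs with a single metric, and existing evaluation methods have not kept pace with these more advanced generation capabilities. In this work, we reformulate T2I evaluation as a noisy channel based on generation probabilities and propose a probabilistic metric that captures an image's text alignment and visual quality in a unified way. The proposed method combines alignment evaluation, which uses the reasoning ability of an LVLM as a teacher-forced likelihood, with quality evaluation based on the likelihood of an autoregressive image generation model, so that each image can be evaluated independently without relying on relative comparisons among generated results. In our experiments, the proposed method shows high agreement with human image preferences and consistently outperforms existing scoring methods. We also confirm that, by switching the evaluation perspective, diverse human judgments can be flexibly captured within the same probabilistic framework.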
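The combination of an alignment likelihood and a quality likelihood described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's formulation: the length normalization, the `quality_weight` parameter, and the function names are assumptions, and the per-token log-probabilities are toy numbers standing in for outputs of an LVLM (text given image) and an autoregressive image model (image tokens).

```python
def mean_log_prob(token_log_probs):
    """Length-normalized log-likelihood (an assumed normalization)."""
    return sum(token_log_probs) / len(token_log_probs)

def noisy_channel_score(text_given_image_lp, image_lp, quality_weight=1.0):
    """Combine alignment, log p(text | image), with quality, log p(image).

    quality_weight is a hypothetical knob for trading off the two terms.
    """
    alignment = mean_log_prob(text_given_image_lp)
    quality = mean_log_prob(image_lp)
    return alignment + quality_weight * quality

# Toy per-token log-probabilities for two images generated from the same prompt.
score_a = noisy_channel_score([-0.2, -0.4], [-1.0, -1.2])
score_b = noisy_channel_score([-1.5, -2.5], [-1.0, -1.0])
print(score_a, score_b)
```

Because each score depends only on one image and its prompt, every image can be scored independently, with no pairwise comparison against other outputs.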

Mar 5, 2026

The knowledge graph completion (KGC) task aims to predict missing relations in knowledge graphs (KGs). Recently, text-based KGC approaches have gained attention, but they present challenges: encoder-based methods require fine-tuning, making them non-ideal when an ideal KG for training cannot be obtained, such as when the KG is sparse or when predicting new relation types. Meanwhile, decoder-based methods make predictions by generating tokens, where entity disambiguation becomes a challenge. KGC is also used in knowledge probing, which aims to evaluate the knowledge retrieval capability of pre-trained language models (PLMs), but existing probes for generative PLMs capable of ranking all multi-token and single-token entities are computationally inefficient. To address these problems, we propose DEER, an encoder-based few-shot KGC method leveraging a generative PLM that achieves linear inference time complexity. Our experiments show that DEER outperforms a fine-tuned KGC model in a relationally inductive setting and aligns with an existing knowledge-probing method, positioning it as a possible alternative.

Mar 10, 2025