Speaker
Description
Vision-Language Models (VLMs) map image information into the model's representation space through a visual encoder, overcoming the information bottleneck of inputs made up of text tokens alone. Motivated by the structural characteristics of the LHAASO detector arrays, this study explores the feasibility of using array trigger images as a new input modality, with the aim of improving particle identification through a more complete representation of the signal. We constructed and evaluated a VLM based on the GLM-4.1V-9B architecture using simulated data. The results show that the model exhibits clear scaling-law behavior and understands detector information substantially better than text-only models. This work validates the feasibility of multimodal approaches to data analysis for the Water Cherenkov Detector Array (WCDA) and provides a new technical pathway for improving WCDA particle identification and reconstruction.