Detecting and Understanding Vulnerabilities in Language Models via Mechanistic Interpretability
ISSN: 1045-0823
ISBN: 9781956792041
Publication year: 2024
Proceedings of the 33rd International Joint Conference on Artificial Intelligence, IJCAI 2024
Pages: 385-393
Type: Conference paper