I sometimes hear about archives which scan for and remove malware from the content they ingest. It is true that archives contain malware, but this isn't a good idea:
- Most content in archives is never accessed by a reader who might be a target for malware, so most of the malware scan effort is wasted. It is true that increasingly these days data mining accesses much of an archive's content, but it does so in ways that are unlikely to activate malware.
- At ingest time, the archive doesn't know what it is about the content future scholars will be interested in. In particular, they don't know that the scholars aren't studying the history of malware. By modifying the content during ingest they may be destroying its usefulness to future scholars.
- Scanning and removing malware during ingest doesn't guarantee that the archive contains no malware, just that it doesn't contain any malware known at the time of ingest. If an archive wants to protect readers from malware, it should scan and remove it as the preserved content is being disseminated, creating a safe surrogate for the reader. This will guarantee that the reader sees no malware known at access time, likely to be a much more comprehensive set.
See, for example, the Internet Archive's Malware Museum, which contains access surrogates of malware which has been defanged.