I sometimes hear about archives that scan the content they ingest for malware and remove it. It is true that archives contain malware, but removing it at ingest isn't a good idea:
- Most content in archives is never accessed by a reader who might be a target for malware, so most of the effort spent scanning is wasted. It is true that data mining increasingly accesses much of an archive's content, but it does so in ways that are unlikely to activate malware.
- At ingest time, the archive doesn't know which aspects of the content future scholars will be interested in. In particular, it doesn't know that those scholars won't be studying the history of malware. By modifying the content during ingest it may destroy its usefulness to future scholars.
- Scanning and removing malware during ingest doesn't guarantee that the archive contains no malware, just that it doesn't contain any malware known at the time of ingest. If an archive wants to protect readers from malware, it should scan and remove it as the preserved content is being disseminated, creating a safe surrogate for the reader (a sketch of this approach appears below). This guarantees that the reader sees no malware known at access time, likely a much more comprehensive set.
See, for example, the Internet Archive's Malware Museum, which contains access surrogates of malware that has been defanged.
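Here is a minimal sketch of dissemination-time scanning, assuming ClamAV's command-line scanner clamscan is installed; the surrogate-building step and function names are purely illustrative, not any particular archive's implementation. The point is that the preserved bits stay untouched in the repository, and the check runs against whatever signature database is current at access time:

```python
# Sketch: scan preserved content at dissemination time and hand the reader
# a "safe surrogate" rather than the raw archived bytes.
# Assumes ClamAV's clamscan is on the PATH; everything else is illustrative.
import shutil
import subprocess
from pathlib import Path

def disseminate(preserved: Path, surrogate_dir: Path) -> Path:
    """Return a copy of `preserved` that is safe to hand to a reader."""
    surrogate_dir.mkdir(parents=True, exist_ok=True)
    surrogate = surrogate_dir / preserved.name

    # clamscan exits 0 if no malware was found, 1 if it found something,
    # and 2 on error. The signature database is whatever is current at
    # access time, so the check improves as new malware becomes known.
    result = subprocess.run(
        ["clamscan", "--no-summary", str(preserved)],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        shutil.copy2(preserved, surrogate)   # clean: serve a plain copy
    elif result.returncode == 1:
        # Hypothetical "defanging" step: substitute a tombstone explaining
        # why the original is not served directly. The original is preserved
        # unchanged for scholars who do want to study the malware.
        surrogate.write_text(
            "Withheld at access time: known malware detected.\n" + result.stdout
        )
    else:
        raise RuntimeError(f"clamscan failed: {result.stderr}")
    return surrogate
```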
The impracticality of excluding malware from digital collections is emphasized by Tobias Lauinger et al.'s Thou Shalt Not Depend on Me: Analysing the Use of Outdated JavaScript Libraries on the Web (also here). They report:
"97% of ALEXA sites and 83.6% of COM sites contain JavaScript."
and:
"ALEXA and COM crawls contain a median of 24 and 9 inline scripts, respectively, with 5% of the sites having hundreds of inline scripts — the maximum observed was 19K and 25K."
and:
"we detect at least one of our 72 target libraries on 87.7% of all ALEXA sites and 46.5% of all sites in COM"
and:
"Overall, we find that 37.8% of [ALEXA] sites use at least one library version that we know to be vulnerable, and 9.7% use two or more different vulnerable library versions (COM ... 37.4% and 4.1%)."
and:
"To characterise lag from a per-site point of view, we calculate the maximum lag of all inclusions on each site and find that 61.4% of ALEXA sites are at least one patch version behind on one of their included libraries (COM: 46.2%). Similarly, the median ALEXA site uses a version released 1,177 days (COM: 1,476 days) before the newest available release of the library."
So Web archives are full of really old, really vulnerable JavaScript.