For the SHA-1 hash collision part of your question, this has been addressed by a few of the answers.
However, a big portion of this hinges on the type of file we're working with:
"Maintains the file's overall content and operation (but of course now includes malicious content that was not originally there)"
What this means depends greatly on what is detecting the alterations:
- If it's a signed executable, there is no reasonable chance: you'd somehow have to get two hash collisions at once, one for the SHA-1 of the file and one for the internal .exe signature.
- If it's an unsigned executable, .com, unsigned .dll, or similar, data can be added to its resources in ways that will not change its operation, and thus you could (eventually) get a hash collision that is not detectable in 'normal' operation.
- If it's a source code file or similar structure (.cs, .c, .h, .cpp, .rb, .yml, .config, .xml, .pl, .bat, .ini) the additions, modifications, or removals can be constrained to valid comment syntax such that the change would not be discernible by most uses (compiling or running it, not opening it up with a text editor).
- If it's an .iso or .zip or other container format, it is also less likely, since most random changes will corrupt the container. It is possible to do: add a bogus file entry or alter an item within the container and recheck it, but you're adding a layer of complexity and additional time to check the result, and you have limited degrees of freedom in how and what contents may be changed.
- If it's a text or text-like format, it can be changed almost any way you like while still being a 'valid' file, though the change will probably be noticeable.
- With many formats like .rtf, .doc, .html, .xlsx, and other markup-esque formats, content can be added or modified in ways that parsers ignore. Other than the length (and a constrained length just means less freedom), the files can be altered to (eventually) get a hash collision while still being not only valid, but not noticeably changed in any way that would be visible in the typical applications they are used with.
So, what you're left with is how to get a collision within whatever structure you have, in a way that is non-corrupting and, ideally, undetectable to some degree:
1. Make any functional changes you desire (perhaps insertion of malicious content) and make any additional changes needed to retain file-format-specific validity.
2. Add a section that will be non-functional (between comment blocks, at the very end of a text file with 3k carriage returns above it, or by isolating an existing comment block).
3. Add or select a character/code point/byte for modification and try every possible valid combination (not every byte combination is valid for different encodings, for example).
4. Recompute the hash and see if it matches the original file's hash.
5. If it does not, go to 3.
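To make the loop above concrete, here is a minimal toy sketch in Python. It is only an illustration under loud assumptions: it matches just the first 16 bits of the SHA-1 digest (a full 160-bit match is the infeasible part), it treats the file as a script where a trailing `#` comment is non-functional, and the names `truncated_sha1` and `find_padding` are made up for this example.

```python
import hashlib
from itertools import count

PREFIX_BITS = 16  # toy assumption: only match the first 16 bits of SHA-1


def truncated_sha1(data: bytes, bits: int = PREFIX_BITS) -> int:
    """Return the leading `bits` bits of SHA-1(data) as an integer."""
    digest = hashlib.sha1(data).digest()
    return int.from_bytes(digest, "big") >> (160 - bits)


def find_padding(original: bytes, tampered: bytes, bits: int = PREFIX_BITS):
    """Append a varying comment to `tampered` until its truncated hash
    matches the truncated hash of `original` (steps 3 to 5)."""
    target = truncated_sha1(original, bits)
    for attempt in count():
        # Step 3: vary a non-functional region (a trailing comment here).
        candidate = tampered + b"# " + str(attempt).encode("ascii") + b"\n"
        # Step 4: recompute and compare; step 5: otherwise keep looping.
        if truncated_sha1(candidate, bits) == target:
            return candidate, attempt + 1


original = b"print('hello')\n"
tampered = b"print('hello')\nprint('malicious')\n"  # step 1: functional change
forged, tries = find_padding(original, tampered)
print(f"matched {PREFIX_BITS} bits after {tries} tries")
```

With 16 bits you should expect on the order of 2^16 tries, which finishes almost instantly. Each additional bit doubles the expected work, which is where the numbers below come from.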
Let's say you have a super fast computer and a smallish file, such that modification with a valid byte sequence and recomputing the hash takes 1 millisecond (probably requiring some dedicated hardware). If the hash output is perfectly random and uniformly distributed across the range, you will on average need 2^160 attempts to match a given SHA-1 hash (brute forcing it).
2^160/1000/60/60/24/365.24
= 4.63x10^37 years
= 46,300,000,000,000,000,000,000,000,000,000,000,000 years
= 46 undecillion years.
But hey, let's try the 2^60 and 2^52 versions, and pretend that they allow us to modify the file any way we like (they don't) and that they, too, can be done in 1ms each try:
2^52 yields 142,714 years
/*humans might still be around to care, but not about these antiquated formats*/
2^60 yields 3.65x10^7 years = 36,500,000 years
/*machines will probably have taken over anyway*/
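If you want to check the arithmetic yourself, here is a small Python sketch that redoes the same division. The helper name `years_for` is just something made up for this example, and it keeps the assumptions above: 1 ms per attempt and a 365.24-day year.

```python
def years_for(attempts_log2: int, ms_per_attempt: float = 1.0) -> float:
    """Years needed for 2**attempts_log2 tries at ms_per_attempt each."""
    total_ms = 2 ** attempts_log2 * ms_per_attempt
    # ms -> s -> min -> h -> days -> years (365.24 days, as above)
    return total_ms / 1000 / 60 / 60 / 24 / 365.24


for bits in (160, 60, 52):
    print(f"2^{bits}: {years_for(bits):.3g} years")
```

Changing `ms_per_attempt` shows how little faster hardware helps: even a million-fold speedup only removes about 20 bits from the exponent.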
But hey, you might get lucky. Really, really, more-of-a-miracle-than-anything-people-call-miracles lucky.