Closed
Description
It seems that git is incorrectly considering some PDFs as text files, e.g. papers/n4431.pdf. One can force git to consider files with certain endings to be binary using the .gitattributes file. According to the docs, the line
*.pdf -diff
should solve the issue.
Activity
Ensure PDF files are not compared with text-based diff
tkoeppe commented on Nov 13, 2016
I cannot reproduce this problem. My git detects binary files just fine as is. Is this really a problem that needs to be solved? PDFs don't really get changed, they only ever get added.
aaronpuchert commented on Nov 14, 2016
With git 2.10.2, run git show 31ac8124469b466750cc18396b72547b6d1c613d:
This is how it should look (after applying #1022):
commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
Binary files /dev/null and b/papers/n4296.pdf differ
[...]
Specifically, the following PDF files are recognized as text files: papers/n4140.pdf, papers/n4296.pdf, papers/n4431.pdf, papers/n4567.pdf. All other papers are correctly recognized as binary.
Well, git log -- papers/N3376.pdf yields
jensmaurer commented on Nov 28, 2016
@tkoeppe: Care to reconsider my pull request, then?
tkoeppe commented on Nov 28, 2016
@jensmaurer: OK, considering. @zygoloid, do you mind a new file in the repository?
tkoeppe commented on Nov 29, 2016
@aaronpuchert: To be pedantic, the PDF files you are seeing as text are serialized as text. That's why they're 11 MB, rather than the binary-serialized versions that come in at under 6 MB. Your proposal would be adding a deliberate override to not treat those text files as text.
aaronpuchert commented on Nov 29, 2016
They're not really text files, but rather a mixture of plain text and binary blobs. Run grep -A2 "^stream$" papers/n4296.pdf and you'll find plenty of them. The text/binary decision is probably a little bit fuzzy by design, or it only looks at the first n bytes and doesn't find the blobs. This isn't a science.
If you look at the other PDF files, you find that they are a mixture of text and binary as well, but there are more binary blobs, or they are right at the beginning. Otherwise, they are in no way different. They might have been generated by different applications though.
Also, the diff of PDFs will most likely not be useful at all, whether they are more text or more binary.
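To illustrate the "only looks at the first n bytes" guess, here is a minimal Python sketch of such a check. The 8000-byte window and the NUL-byte test are assumptions about how git's heuristic works, not something established in this thread, and the names looks_binary, late_blob and early_blob are made up for the example.

```python
# Sketch of a "first n bytes" binary check.
# ASSUMPTION: the 8000-byte window and the NUL-byte test mirror git's
# heuristic; neither value is taken from this thread.
FIRST_FEW_BYTES = 8000


def looks_binary(data: bytes, window: int = FIRST_FEW_BYTES) -> bool:
    """Return True if a NUL byte occurs within the first `window` bytes."""
    return b"\0" in data[:window]


# Hypothetical PDF-like buffers: a long run of plain-text content and a
# small binary stream placed either after or before it.
text_part = b"% plain text object data\n" * 400  # 10000 bytes of text
blob = b"stream\n\x00\x01\x02\nendstream\n"

late_blob = b"%PDF-1.5\n" + text_part + blob   # blob lies beyond the window
early_blob = b"%PDF-1.5\n" + blob + text_part  # blob lies inside the window

print(looks_binary(late_blob))   # blob is not seen, so treated as text
print(looks_binary(early_blob))  # blob is seen, so treated as binary
```

Under these assumptions, a PDF whose first binary stream starts late in the file would be classified as text, while an otherwise similar file with an early stream would be classified as binary, which matches the behaviour described above.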
tkoeppe commented on Nov 29, 2016
OK, fair enough. I'll let @zygoloid make the commit, so he's aware there's a new file now.
Ensure PDF files are not compared with text-based diff (#1022)
aaronpuchert commented on Nov 30, 2016
Thanks for fixing the issue!