Forcing git to consider PDFs as binary files #484

Closed

Closed

Forcing git to consider PDFs as binary files#484

Assignees

It seems that git is incorrectly considering some PDFs as text files, e.g. papers/n4431.pdf. One can force git to consider files with certain endings to be binary using the .gitattributes file. According to the docs, the line

*.pdf -diff

should solve the issue.

added a commit that references this issue

on Nov 13, 2016

Ensure PDF files are not compared with text-based diff

mentioned this

on Nov 13, 2016

Ensure PDF files are not compared with text-based diff #1022

self-assigned this

on Nov 13, 2016

Contributor

I cannot reproduce this problem. My git detects binary files just fine as is. Is this really a problem that needs to be solved? PDFs don't really get changed, they only ever get added.

Author

With git 2.10.2, run git show 31ac8124469b466750cc18396b72547b6d1c613d:

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
--- /dev/null
+++ b/papers/n4296.pdf
@@ -0,0 +1,354848 @@
+%PDF-1.4
+%320324305330
+1 0 obj
+<< /S /GoTo /D (section*.1) >>
+endobj
+4 0 obj
+(Contents)
+endobj
[...]

This is how it should look like (after applying #1022):

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
Binary files /dev/null and b/papers/n4296.pdf differ
[...]

Specifically, the following PDF files are recognized as text files: papers/n4140.pdf, papers/n4296.pdf, papers/n4431.pdf, papers/n4567.pdf. All other papers are correctly recognized as binary.

PDFs don't really get changed, they only ever get added.

Well, git log -- papers/N3376.pdf yields

commit 672675e01528fba6ca02f5340eee747566ebdca8
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Tue Feb 28 15:43:25 2012 -0500

    Updated version of N3376

commit 8bb4e26a7d769882ae469ecb87e1daa046cb70a1
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Mon Feb 27 00:10:22 2012 -0500

    N3376 draft

Member

@tkoeppe: Care to reconsider my pull request, then?

Contributor

@jensmaurer: OK, considering. @zygoloid, do you mind a new file in the repository?

Contributor

@aaronpuchert: To be pedantic, the PDF files you are seeing as text are serialized as text. That's why they're 11MB, rather than the binary-serialized versions that come it at under 6MB. Your proposal would be adding a deliberate override to not treat those text files as text.

Author

They're not really text files, rather a mixture of plain text and binary blobs. Run grep -A2 "^stream$" papers/n4296.pdf and you'll find plenty of them. The text/binary-decision is probably a little bit fuzzy by design, or it only looks at the first n bytes, and doesn't find the blobs. This isn't a science.

If you look at the other PDF files, you find that they are a mixture of text and binary as well, but there are more binary blobs, or they are right at the beginning. Otherwise, they are in no way different. They might have been generated by different applications though.

Also, the diff of PDFs will most likely not be useful at all, whether they are more text or more binary.

Contributor

OK, fair enough. I'll let @zygoloid make the commit, so he's aware there's a new file now.

closed this as completedin #1022

on Nov 30, 2016

added a commit that references this issue

on Nov 30, 2016

Ensure PDF files are not compared with text-based diff (#1022)

Author

Thanks for fixing the issue!

to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Assignees

jensmaurer

Labels

No labels

No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Participants