Skip to content

Forcing git to consider PDFs as binary files #484

Closed
@aaronpuchert

Description

@aaronpuchert

It seems that git is incorrectly considering some PDFs as text files, e.g. papers/n4431.pdf. One can force git to consider files with certain endings to be binary using the .gitattributes file. According to the docs, the line

*.pdf -diff

should solve the issue.

Activity

added a commit that references this issue on Nov 13, 2016
adfe7dc
self-assigned this
on Nov 13, 2016
tkoeppe

tkoeppe commented on Nov 13, 2016

@tkoeppe
Contributor

I cannot reproduce this problem. My git detects binary files just fine as is. Is this really a problem that needs to be solved? PDFs don't really get changed, they only ever get added.

aaronpuchert

aaronpuchert commented on Nov 14, 2016

@aaronpuchert
Author

With git 2.10.2, run git show 31ac8124469b466750cc18396b72547b6d1c613d:

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
--- /dev/null
+++ b/papers/n4296.pdf
@@ -0,0 +1,354848 @@
+%PDF-1.4
+%320324305330
+1 0 obj
+<< /S /GoTo /D (section*.1) >>
+endobj
+4 0 obj
+(Contents)
+endobj
[...]

This is how it should look like (after applying #1022):

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
Binary files /dev/null and b/papers/n4296.pdf differ
[...]

Specifically, the following PDF files are recognized as text files: papers/n4140.pdf, papers/n4296.pdf, papers/n4431.pdf, papers/n4567.pdf. All other papers are correctly recognized as binary.

PDFs don't really get changed, they only ever get added.

Well, git log -- papers/N3376.pdf yields

commit 672675e01528fba6ca02f5340eee747566ebdca8
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Tue Feb 28 15:43:25 2012 -0500

    Updated version of N3376

commit 8bb4e26a7d769882ae469ecb87e1daa046cb70a1
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Mon Feb 27 00:10:22 2012 -0500

    N3376 draft
jensmaurer

jensmaurer commented on Nov 28, 2016

@jensmaurer
Member

@tkoeppe: Care to reconsider my pull request, then?

tkoeppe

tkoeppe commented on Nov 28, 2016

@tkoeppe
Contributor

@jensmaurer: OK, considering. @zygoloid, do you mind a new file in the repository?

tkoeppe

tkoeppe commented on Nov 29, 2016

@tkoeppe
Contributor

@aaronpuchert: To be pedantic, the PDF files you are seeing as text are serialized as text. That's why they're 11MB, rather than the binary-serialized versions that come it at under 6MB. Your proposal would be adding a deliberate override to not treat those text files as text.

aaronpuchert

aaronpuchert commented on Nov 29, 2016

@aaronpuchert
Author

They're not really text files, rather a mixture of plain text and binary blobs. Run grep -A2 "^stream$" papers/n4296.pdf and you'll find plenty of them. The text/binary-decision is probably a little bit fuzzy by design, or it only looks at the first n bytes, and doesn't find the blobs. This isn't a science.

If you look at the other PDF files, you find that they are a mixture of text and binary as well, but there are more binary blobs, or they are right at the beginning. Otherwise, they are in no way different. They might have been generated by different applications though.

Also, the diff of PDFs will most likely not be useful at all, whether they are more text or more binary.

tkoeppe

tkoeppe commented on Nov 29, 2016

@tkoeppe
Contributor

OK, fair enough. I'll let @zygoloid make the commit, so he's aware there's a new file now.

added a commit that references this issue on Nov 30, 2016
8c8b4a2
aaronpuchert

aaronpuchert commented on Nov 30, 2016

@aaronpuchert
Author

Thanks for fixing the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

    Development

    No branches or pull requests

      Participants

      @aaronpuchert@tkoeppe@jensmaurer

      Issue actions

        Forcing git to consider PDFs as binary files · Issue #484 · cplusplus/draft