Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Forcing git to consider PDFs as binary files #484

Closed
aaronpuchert opened this issue Apr 26, 2015 · 8 comments
Closed

Forcing git to consider PDFs as binary files #484

aaronpuchert opened this issue Apr 26, 2015 · 8 comments
Assignees

Comments

@aaronpuchert
Copy link

It seems that git is incorrectly considering some PDFs as text files, e.g. papers/n4431.pdf. One can force git to consider files with certain endings to be binary using the .gitattributes file. According to the docs, the line

*.pdf -diff

should solve the issue.

@tkoeppe
Copy link
Contributor

tkoeppe commented Nov 13, 2016

I cannot reproduce this problem. My git detects binary files just fine as is. Is this really a problem that needs to be solved? PDFs don't really get changed, they only ever get added.

@aaronpuchert
Copy link
Author

aaronpuchert commented Nov 14, 2016

With git 2.10.2, run git show 31ac8124469b466750cc18396b72547b6d1c613d:

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
--- /dev/null
+++ b/papers/n4296.pdf
@@ -0,0 +1,354848 @@
+%PDF-1.4
+%320324305330
+1 0 obj
+<< /S /GoTo /D (section*.1) >>
+endobj
+4 0 obj
+(Contents)
+endobj
[...]

This is how it should look like (after applying #1022):

commit 31ac8124469b466750cc18396b72547b6d1c613d
Author: Richard Smith <richard@metafoo.co.uk>
Date:   Wed Nov 19 11:59:42 2014 -0800

    Added papers N4296 (new working draft with post-Urbana motions) and
    N4297 (editor's report for same).

diff --git a/papers/n4296.pdf b/papers/n4296.pdf
new file mode 100644
index 0000000..2cfe24f
Binary files /dev/null and b/papers/n4296.pdf differ
[...]

Specifically, the following PDF files are recognized as text files: papers/n4140.pdf, papers/n4296.pdf, papers/n4431.pdf, papers/n4567.pdf. All other papers are correctly recognized as binary.

PDFs don't really get changed, they only ever get added.

Well, git log -- papers/N3376.pdf yields

commit 672675e01528fba6ca02f5340eee747566ebdca8
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Tue Feb 28 15:43:25 2012 -0500

    Updated version of N3376

commit 8bb4e26a7d769882ae469ecb87e1daa046cb70a1
Author: Stefanus Du Toit <sjdutoit@gmail.com>
Date:   Mon Feb 27 00:10:22 2012 -0500

    N3376 draft

@jensmaurer
Copy link
Member

@tkoeppe: Care to reconsider my pull request, then?

@tkoeppe
Copy link
Contributor

tkoeppe commented Nov 28, 2016

@jensmaurer: OK, considering. @zygoloid, do you mind a new file in the repository?

@tkoeppe
Copy link
Contributor

tkoeppe commented Nov 29, 2016

@aaronpuchert: To be pedantic, the PDF files you are seeing as text are serialized as text. That's why they're 11MB, rather than the binary-serialized versions that come it at under 6MB. Your proposal would be adding a deliberate override to not treat those text files as text.

@aaronpuchert
Copy link
Author

They're not really text files, rather a mixture of plain text and binary blobs. Run grep -A2 "^stream$" papers/n4296.pdf and you'll find plenty of them. The text/binary-decision is probably a little bit fuzzy by design, or it only looks at the first n bytes, and doesn't find the blobs. This isn't a science.

If you look at the other PDF files, you find that they are a mixture of text and binary as well, but there are more binary blobs, or they are right at the beginning. Otherwise, they are in no way different. They might have been generated by different applications though.

Also, the diff of PDFs will most likely not be useful at all, whether they are more text or more binary.

@tkoeppe
Copy link
Contributor

tkoeppe commented Nov 29, 2016

OK, fair enough. I'll let @zygoloid make the commit, so he's aware there's a new file now.

@aaronpuchert
Copy link
Author

Thanks for fixing the issue!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants