Using MS Office To Convert Topps’ PDF Checklists To More Useful Formats

Now that 2016 Topps baseball has hit the shelves, it seemed like as good a time as any to share some of the various formatting tricks I’ve learned while trying to make use of the information found in the checklists off of the Topps website.

Because I track my collection in an Access database, I’m always looking to get that downloaded checklist information into a format I can upload into my database.  From playing around with these online checklists, I’ve got it down to something of a science.

Some of you might find some or all of these tricks helpful, even if you don’t track your collection the same way I do.

Now bear in mind that this is going to get a little wordy and technical, and what I’m going to tell you is done in Windows 7 with Office 2013.  You may need to make adjustments for your own computer… and the format I use isn’t necessarily what you would want. Nevertheless, I’m sure many of you can leverage what I do and suit it for your own purposes.

OK, first off, Topps’ checklists are on their websites in PDF files.  This gives them a uniform format on the website, but pose some minor challenges.  When you pull up a PDF checklist in your browser, you could just right-click on the checklist, chose “Select All” and then right-click and select “Copy”, but I’ve had mixed results with this method.  If you do get your information this way, I recommend scrolling all the way to the bottom of the PDF so that the entire document is in the browser’s memory (It does seem to make a difference).

What I prefer doing is to download that PDF file to my computer, go to the Downloads folder, open it up (assuming you’ve got some version of Adobe Acrobat or some other PDF viewer installed) and work on it that way.  I get better results, and if I ever want to go back and reference the original, it’s right there on my laptop.

OK, once I’ve got the PDF opened, I do a ctrl-A to select All, and then a ctrl-C to copy.  Now the text of the checklist is in the computer’s clipboard.

At this point, I go to Word and do a Paste/Keep Text Only (how you do this depends on the version of Word you have.  Older versions it’s something like Paste/Special).

At this point, your text looks like this:

13 Drew Hutchison Toronto Blue Jays®
14 Michael Taylor Washington Nationals® Rookie
15 Gregory Polanco Pittsburgh Pirates®

Now the format I need requires a tab between each “data element” in the line – between the card number and player name, between the player and team name, and also between the team name and any Topps notes like “Rookie” for Michael Taylor above. Right now we don’t have tabs, we have spaces.

This is where we use the “Replace” option.  I bring this up by doing Ctrl-H, but it’s also available through menus and the Word “ribbons”.  As with anything computer-y, there are several different ways of doing it.  I’m all about the keyboard shortcuts, so I tend to think of actions that way.

Many of you have probably used Replace to replace one word with another, for example, but how do you do this with non-printing characters like tabs and Paragraph marks?  You can’t just type them in, it doesn’t work that way.  This is where we start to play with Control Codes, which is a sort of Microsoft shorthand for these special characters. These Control Codes all involved the caret character (^), which is the character you get when you hold down shift and hit the number 6 your keyboard. The caret with a lower-case T following it (^t) represents a tab, and the caret with a lower-case p following it (^p) represents a paragraph mark you find at the end of most lines.  Sometimes the way the original document was generated means the line ends with a “new line” mark, which is represented by ^l… that’s a lower-case L, not an upper-case I.

Using “Find And Replace” combined with these control characters will let you do a lot of funky little replaces.

But the thing is, that you don’t want to go ahead and replace all of the spaces with tabs; if you do that, you’ll get a tab between “Michael” and “Taylor” or between “Washington” and “Nationals”.

What you’ll want to do instead is leverage the formatting that Topps uses for these checklists. To get a tab between the player name and the city/team name, you can replace the space before the city name with a tab by doing something like this:

PDF checklist - replace space city with tab city

Now it’s not completely obvious from this image, but in the “Find what” field there’s a space before the word Toronto.

When you click on the Replace All button, it will take all of the instances of a space and Toronto and replace them with a tab and Toronto.  That will put a tab between “Drew Hutchison” and “Toronto Blue Jays” in the line below.
13 Drew Hutchison Toronto Blue Jays®

Since this is tedious to do for every city in MLB, what I’ve done is to set up a Macro – essentially a little computer program – to do this for every city. That way, once I’ve got my checklist pasted into Word, I can just run my “tab-before-city” macro and within a second or two it puts a tab in the document before every city I have set up in the macro (My macro includes every MLB city, plus those from the past and other sports… Brooklyn, Montreal, California, Edmonton, Green Bay, etc.) I’m not going to teach you how to create and run macros – that would make a long post really really long – but I just wanted to point out that this is an option.

OK, so how do you do put a tab after the card number? This time, you take advantage of the fact that the last digit of the card number is followed by a space, so you replace “1 and a space” with “1 and a tab”, “2 and a space” with “2 and a tab” and so on. Again, macros aren’t essential but they make life easier.
PDF checklist - replace 1 space with 1 tab

And again, you can’t see it in the image but the “1” in the “Find What” box above is followed by a space.

If you’re annoyed by the little copyright logos and you don’t need them because you’re using this data for your own private, non-commercial use, here’s another little trick. Select one of the little characters and THEN bring up your “Replace”. Word will assume that the text you have selected is what you want to replace.

PDF checklist - replace copyright

Click on the Replace All button and – POOF!  Goodbye, registered copyright.

By the way, to completely get all of those characters, you’d have to do a similar Replace involving the ™ character.

You might notice that I’m replacing the ® symbol with a tab. Why would I want to do that? Why not just whack the stupid little character and be done with it?

Well, lines like this are why:
14 Michael Taylor Washington Nationals® Rookie

For the Drew Hutchinson line, having a tab at the end doesn’t make much difference, but for the above line, my Replace will get a tab between the team name and “Rookie”… This also works for other little notations Topps makes like “Checklist”, “League Leader”, “Short print” and so on.

After we’re done, we have tabs where we want tabs… Here’s a screen shot where I have the “Show/hide control characters” feature set to “Show”.
PDF checklist - with tabs

We’ve got tabs in there now… So now what?

Well, if we copy the lines with tabs and just paste it right into an Excel spreadsheet… voila!
PDF checklist - Excel spreadsheet
Excel uses tabs as an indication where to break the data into different cells, so we get nicely-formatted spreadsheet!

FYI, Upper Deck and Panini already have their downloads available in an Excel format (XLS).

Having the checklist in a spreadsheet is useful enough as it is, but I also take those spreadsheets and upload them into my Excel database. There’s a pretty decent amount of busywork involved, but the end result is worth it to me.

One more thing… I’ve had checklists, like a Bowman checklist from a year or two ago,  that were formatted in such a way that when I pasted it into a Word document, the information didn’t come out on lines but instead came out with each bit of information on its own line, like this:
BDPP1
Dominic Smith
New York Mets®
BDPP2
Kohl Stewart
Minnesota Twins®

This is where we leverage the ^p Control character, which represents the “Paragraph” mark that usually comes at the end of a line.  (Again, sometimes you’ll have a “new line” character at the end of a line;  in those cases, go with ^l – that’s a lower case L – rather than ^p).

So we replace all of the “paragraph” characters with tabs…

PDF Checklist bowman paragraph to tab

That puts everything on one very long line.  Once again, we leverage the copyright at the end of the line to replace that character and a tab with a paragraph mark.

PDF Checklist copyright tab to paragraph

The end result is this:

PDF checklist bowman final result

…And you know where to take it from here!

BTW, if in this Bowman example you wanted to separate the BDPP prefix from the card number, all you’d need to do is replace BDPP with BDPP^t.

I should point out that even though you’re ready to start attacking PDF documents with this plan of attack, it’s not immune to failure.  Multiple players on one card can cause issues, as can insert cards where the “card number” doesn’t end with digits (TA-AL, BGA-BG, 66B-MS).  My main objective of this post was to get people started down this road, and hopefully you’ve got enough knowledge to sort things out on your own.

Of course, if you have any questions, I’ll do my best to answer them for you!

Advertisements

3 thoughts on “Using MS Office To Convert Topps’ PDF Checklists To More Useful Formats

    • Like most bloggers, my goal is to write something I would like to read, and since today’s subject falls into the category of “I wish someone had told me this as saved me the effort”, I figured I’d share it with everyone.

      I like the “…For Dummies” book cover idea! If I do another post like this, I’ll see if I can’t whip up some artwork!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s