Linda Bawcom shares the second of her two-part journey on syllabus crawling and scraping. Read Part 1 here.
A Synopsis and Unanswered Questions
For those of you who may have missed Part 1 or are old like me and have forgotten, here’s the Cliff Notes (If you’re too young to know about those, it means “the Wiki synopsis”):
I volunteered to get syllabi from Houston Community College. They couldn’t dump the syllabi, so I tried to teach myself how to crawl and scrape (after finding out that didn’t mean begging) for them using Google Chrome. I couldn’t figure out how to write the function in the spreadsheet. I asked for help and Outwit Hub was suggested.
Now, I’m writing Part 2 for two reasons. First, you probably have a few unanswered questions if you read Part 1. The first one is likely, “How did you ever get your PhD?”, followed by “Don’t you think your ‘skills’ might be better put to use in some other area?” But maybe you’re wondering “Did you use Outwit Hub?” If the latter is the case, then this part is really for you, though the other questions will be answered, too, so this part won’t be a total waste of your time.
Another Synopsis about Outwit Hub Pro
I’ll do another summary here because I figure you might be reading this while your class is taking a test, and you’re wondering if you should do anything about that kid in the corner who is looking down at his/her lap and smiling all the time. Or maybe you’re in one of those meetings where you have to really pay attention so that you don’t get volunteered to be the head of a committee. So here’s the long and short of it:
Outwit Hub Pro works and there’s no problem with a limit on the number of documents or links you can get (you can also catch images). This version is $90.00 for the first year. If you download it in Firefox, an icon for it appears on the Firefox tool bar. There’s a free Beta version for Documents which I finally managed to install, but I can’t get it to work yet (you probably can). Anyway, I’ve now sent along about 17,000 syllabi. Not bad, right? Thing is, it took me about 30 hours of tutorials and hit and miss to get there. I’ll get back to that in a minute (if you’re a speed reader).
Using Outwit Hub Pro and HCCS
Only about 3,000 syllabi come from Houston Community College. That’s because Outwit Hub can be stupid. Okay, “stupid” is not politically correct. Let’s just say Outwit might be dyslexic. Right at the bottom of the page for the syllabi by course, it plainly says ‘Next’, yet the program can’t read it and won’t go to the next page automatically, so I have to do that manually. There are over 100,000 syllabi, with only ten to a page. I’ll have to go manually through 100 pages or 1,000? Let me get the calculator. Wow! That’s a lot of pages. It was taking me a long time to do this manually in the scraper because I had to:
1) Find the source code I wanted to use (there’s a breadcrumb for that in the program, though I don’t really know if ‘breadcrumb’ is the right word. Anyway, there’s a whatchamacallit for that).
2) Copy and paste the beginning marker, e.g. <a href, and the end marker, e.g. acct (the prefix of the course), of the page I wanted to scrape, which would return links to syllabi. You can also sometimes set it up so that you have descriptions of each column where you’ve put markers (e.g. Department, Course, Link). It depends a little on how the source code is written. This is the cool part because you don’t have to know anything about math or weird symbols.
3) Click on “Execute” (you’d think they could have come up with a friendlier command like “Fetch” or “Your wish is my command”).
4) Look in “scraped” to find out if I used the right markers. If I did, then the right links to the syllabi appear. By the way, you have to be careful when deleting one link in the “scraped” window or you’ll delete all. Guess how I found that out?
5) Click on an icon that lets me see the web site.
6) Click on ‘next’ (because of the dyslexic program).
7) And finally, copy and paste the new URL into the scraper and hit ‘execute’ again.
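For anyone who would rather script the routine above than click through it, here is a minimal Python sketch of the same two ideas: pulling out whatever sits between a beginning marker and an end marker, and finding the URL behind a ‘Next’ link. It is not how Outwit Hub works internally. The function names, the sample HTML, and the example.edu address are all made up for illustration, and a real syllabus page may well be structured differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def extract_between(text, start_marker, end_marker):
    """Return every substring found between start_marker and end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break
        results.append(text[start:end])
        pos = end + len(end_marker)
    return results


class NextLinkFinder(HTMLParser):
    """Remember the href of the first <a> whose visible text is 'Next'."""

    def __init__(self):
        super().__init__()
        self._current_href = None  # href of the <a> we are currently inside
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current_href = dict(attrs).get("href")

    def handle_data(self, data):
        is_next = data.strip().lower() == "next"
        if self._current_href and self.next_href is None and is_next:
            self.next_href = self._current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self._current_href = None


def find_next_url(html, page_url):
    """Return the absolute URL of the 'Next' link, or None if there is none."""
    finder = NextLinkFinder()
    finder.feed(html)
    if finder.next_href is None:
        return None
    return urljoin(page_url, finder.next_href)


if __name__ == "__main__":
    # Hypothetical page fragment, not copied from any real site.
    page = (
        '<a href="/syllabus/1.pdf">ACCT 2301</a> '
        '<a href="/syllabi?page=2">Next</a>'
    )
    print(extract_between(page, '<a href="', '"'))
    print(find_next_url(page, "https://example.edu/syllabi?page=1"))
```

A loop that downloads a page, collects its links, and then asks find_next_url where to go would replace steps 5 through 7. The download step is left out here on purpose, since it depends on the site and on its terms of use.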
Learning about Documents
I did the above for over 1,000 syllabi until I saw a thingy that said “Documents.” Okay, I’ll be truthful here. I kind of skimmed those tutorials you have to read and the million FAQs, which by the way, never seemed to have the questions I did. That could be because mine were too dumb. I did send an e-mail asking a question, but I confused the poor technician so much that he suggested I pay for getting a special program that will do what I want (for a mere $200.00). You have to be very very specific with technical questions. Anyway, what I did was watch a few YouTube tutorials (over and over) and just went for it. I do the same with manuals that come with electric or electronic things (drives my husband nuts). But really, how hard can it be to plug something in and hit “power”? Admittedly, the result of that has occasionally led to going back to YouTube and learning how to fix it.
Getting back to “Documents”: The program automatically gets the links to any kind of document after you’ve told it which URL to go to. Just copy and paste. You don’t have to scrape at all. You have to be careful here too, though. At the bottom of the page, there’s a doohickey that gives you the option to “Empty automatically” or “Empty on demand.” The default is “Empty automatically,” so when you go to the next page in a series, it empties the first page. Guess how I found that out? You have to be careful about deleting here, too. (For those of you who are reading this because you are definitely wondering now how I got my PhD, the answer is simple: a trip to Lourdes [see Part 1].)
When you finish, in the “Scraped” or “Documents” window, there’s a big doodad that says “Export,” and you get to choose from a number of different formats like Excel, JSON (I don’t know what that is, but Alex really likes it), HTML, and some other letters. I save in both JSON (for Alex) and HTML (for me). Then I copy the folder with the syllabi for the course into DropBox for Alex. And voila!
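The “Documents” idea, including the “Empty on demand” behavior, can be imitated in a few lines of Python. This is only a sketch under assumptions: the list of file extensions is a guess at what counts as a document, the names document_links and add_page are invented for the example, and none of it reflects Outwit Hub’s actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of "document" file types; adjust to taste.
DOCUMENT_EXTENSIONS = (".pdf", ".doc", ".docx", ".rtf")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def document_links(html, page_url, extensions=DOCUMENT_EXTENSIONS):
    """Return absolute URLs of links whose path ends in a document extension."""
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).path.lower().endswith(extensions):
            found.append(absolute)
    return found


def add_page(collected, html, page_url):
    """Add a page's document links to the running list without emptying it.

    This mirrors "Empty on demand": earlier pages stay in the list, and
    links already collected are not added twice.
    """
    for link in document_links(html, page_url):
        if link not in collected:
            collected.append(link)
    return collected
```

Calling add_page once per page with the same list keeps everything gathered so far, which is the behavior you want when working through a series of pages.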
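For the curious: JSON is a plain-text way of writing down lists and labeled values so that other programs can read them easily, which is probably why Alex likes it. The sketch below shows roughly what saving the same rows as JSON and as an HTML table could look like in Python. The column names come from the Department, Course, Link example earlier; the rows themselves and the function names are invented for illustration, and the output will not match Outwit Hub’s export files exactly.

```python
import json
from html import escape


def to_json(rows):
    """Serialize a list of row dictionaries as readable JSON text."""
    return json.dumps(rows, indent=2, ensure_ascii=False)


def to_html(rows, columns):
    """Render the same rows as a bare-bones HTML table."""
    header = "".join(f"<th>{escape(col)}</th>" for col in columns)
    lines = ["<table>", f"<tr>{header}</tr>"]
    for row in rows:
        cells = "".join(
            f"<td>{escape(str(row.get(col, '')))}</td>" for col in columns
        )
        lines.append(f"<tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rows; the link is a placeholder, not a real syllabus.
    rows = [
        {
            "Department": "Accounting",
            "Course": "ACCT 2301",
            "Link": "https://example.edu/syllabus/1.pdf",
        },
    ]
    print(to_json(rows))
    print(to_html(rows, ["Department", "Course", "Link"]))
```

The JSON version is the one a program can load straight back into a list, while the HTML version is the one a person can open in a browser and skim.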
The Texas A & M Problem
I got a little tired of doing HCCS, so I thought I’d browse the next largest college, which is Texas A & M (HCCS has around 70,000-90,000 students depending on where you look, and Texas A & M about 60,000). After a little snooping around, I realized that they don’t have a web site like HCCS that is not password protected for the syllabi by department or discipline. The good news is, I found the web site where all the syllabi are listed on pages where Outwit recognized “Next” (so maybe it’s not dyslexic but classist and will only do this if it feels the university is up to its standards). The bad news is that they are listed not by department or course, but by CRN (class number). The good news is that we now have 14,500 syllabi spanning three years. The bad news is that we won’t know what course each syllabus is for until we open it…or could we?
How to Fix the Problem
From what I’ve heard, Dennis is a genius as far as computer stuff is concerned. I figure he knows a way to write a little algorithm or two (that’s something to do with math, but I thought I’d impress you by using it). He could write them so that the syllabi will magically open up and automatically sort themselves by subject. I’m sure I’ve seen that on YouTube.
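One crude way such a sorting step might work, offered purely as a sketch and not as anything Dennis has actually written: once the text of each syllabus has been extracted (a separate job not shown here), look for the first thing that resembles a course code, such as four capital letters followed by a number, and group the syllabi by those letters. The pattern and the function names below are assumptions, and the sample texts are invented.

```python
import re
from collections import defaultdict

# Assumed shape of a course code: 3-4 capital letters, then 3-4 digits,
# optionally separated by a space or hyphen (e.g. "ENGL 104", "CHEM-101").
COURSE_CODE = re.compile(r"\b([A-Z]{3,4})[ -]?(\d{3,4})\b")


def guess_subject(text):
    """Return the subject prefix of the first course code found, or None."""
    match = COURSE_CODE.search(text)
    return match.group(1) if match else None


def group_by_subject(syllabi):
    """Group syllabus identifiers (here, CRNs) by their guessed subject.

    `syllabi` maps an identifier to the plain text of that syllabus.
    Anything without a recognizable course code lands under "UNKNOWN".
    """
    groups = defaultdict(list)
    for crn, text in syllabi.items():
        groups[guess_subject(text) or "UNKNOWN"].append(crn)
    return dict(groups)


if __name__ == "__main__":
    # Invented CRNs and snippets, for illustration only.
    sample = {
        "10234": "ENGL 104 Composition and Rhetoric",
        "10877": "Syllabus for CHEM-101",
        "11002": "Welcome to class!",
    }
    print(group_by_subject(sample))
```

A pattern this simple will be fooled now and then, for example by a heading typed in capitals like "FALL 2015", so whatever ends up under a surprising subject or under "UNKNOWN" would still need a human look.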
But even if for some very odd reason that can’t be done, I’m thinking “extra credit.” All the teachers in the project could assign opening a couple of hundred links to those students who haven’t done anything all semester, realize they are going to fail, and beg for something to do in order to raise their grade. Of course, we know it won’t raise it enough, but we should give them the opportunity to “feel good about themselves.” Or what about those wonderful overachievers in our classes who break down in tears because it looks like they’ll get an A-? Let’s give them the opportunity to “develop their spirit of volunteerism” and “give back to the community” (in linguistics, we’re a “discourse community” and I say that counts).
So these have been my adventures in OSPLand. I’m sure there will be plenty more, and I can think of no other devoted members to a worthy project that I’d rather share them with. A special thank you, however, goes to Alex, without whose support and humor I would have lost interest and momentum a long time ago, and to Kristine for so kindly making sure I was connected to everyone.
Finally, for those who were wondering if my “skills” might be better put to use in some other area, I’ll leave that up to you.
Y’all take care,