One of the recent problems I solved, crawling all the LinkedIn skills. Without any adieu, here is the source code.
Well, most of it is self-explanatory. Have also attached the complete skill list used by LinkedIn for anyone to download. Note: This is something available for free on the public domain. Also, the program written is good for the current date, I could write a dynamic one that would automatically update, but then I was too lazy to do that :)
Line 1: Requiring anemone library which is a web spider framework written in ruby.
Line 3-6: I am initialising the characters to pages map obtained from the URL. For example, take a look at this link https://www.linkedin.com/directory/topics-o/ Here the character O has 99 sub pages. Similarly, character x has 73 sub pages. I manually assigned it here for the crawler to go that many times
Line 8: The variable all_urls consists all the possible combinations from a to z at the max each character having 99 subpages. The variable skipped_urls is to catch the URLs whose values are not crawlable because LinkedIn detected scrapping is going on. That will be collected and will be printed for recrawling later.
Line 9: Mapping all the possible URLs mentioned above into the variable all_urls
Line 11: Open a file called skills.txt in write mode and make it ready
Line 12: Iterate over each of the URLs present in all_url variable
Line 15-20: This is where the real crawling occurs. The XPath selector searches for class=column and collects all the skills in the given page and writes them directly into the file
Line 22: Capture all the skipped_urls in case LinkedIn blocks the scraper (which is the program)
Line 27: Print the skipped_urls using which we will have to rerun the program again - which I leave it to the reader to figure out how.