17 November 2016

Crawl all the LinkedIn skills



One of the problems I solved recently was crawling all the LinkedIn skills. Without further ado, here is the source code.

Well, most of it is self-explanatory. I have also attached the complete skill list used by LinkedIn for anyone to download. Note: this is information that is freely and publicly available. Also, the program is accurate only as of the date of this post. I could have written a dynamic one that updates automatically, but I was too lazy to do that :)

Code Explanation
Line 1: Requires the anemone library, a web spider framework written in Ruby.
Lines 3-6: Initialises the map from characters to page counts, obtained from the URL. For example, take a look at this link https://www.linkedin.com/directory/topics-o/ Here the character o has 99 sub pages. Similarly, the character x has 73 sub pages. I assigned these counts manually so that the crawler visits that many pages for each character.
Line 8: The variable all_urls holds every possible URL for the characters a to z, each character having at most 99 sub pages. The variable skipped_urls catches the URLs that could not be crawled because LinkedIn detected that scraping was going on. These are collected and printed so they can be recrawled later.
Line 9: Maps all the possible URLs mentioned above into the variable all_urls.
Line 11: Opens a file called skills.txt in write mode.
Line 12: Iterates over each of the URLs in the all_urls variable.
Lines 15-20: This is where the real crawling occurs. The XPath selector searches for class=column, collects all the skills on the given page, and writes them directly into the file.
Line 22: Captures the URL in skipped_urls in case LinkedIn blocks the scraper (that is, this program).
Line 27: Prints the skipped_urls, which have to be fed back into the program for another run. I leave it to the reader to figure out how.
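
For readers who want to see the shape of the steps above, here is a rough sketch in plain Ruby. It is not the original program: the sub page URL pattern (topics-<letter>-<page>/), the names build_urls and crawl, and the idea of passing in a fetcher are all assumptions made for illustration. Only the two page counts (o has 99, x has 73) come from the post. The actual page fetching and XPath extraction, which the post does with anemone, is left to whatever callable you pass in.

```ruby
# Page counts per character. Only these two values appear in the post;
# the remaining characters would have to be filled in by hand.
PAGES_PER_LETTER = { "o" => 99, "x" => 73 }.freeze

DIRECTORY_BASE = "https://www.linkedin.com/directory"

# Build one URL per sub page.
# ASSUMPTION: sub pages follow the pattern topics-<letter>-<page>/.
def build_urls(pages_per_letter)
  pages_per_letter.flat_map do |letter, count|
    (1..count).map { |page| "#{DIRECTORY_BASE}/topics-#{letter}-#{page}/" }
  end
end

# Visit every URL, write each skill on its own line, and return the
# URLs that could not be read so they can be retried later.
#
# fetcher: hypothetical callable that returns an array of skill names
#          for a URL, or nil when the page was blocked or unreadable.
# out:     anything that responds to << (an open File, or an Array).
def crawl(urls, fetcher, out)
  skipped = []
  urls.each do |url|
    skills = fetcher.call(url)
    if skills.nil?
      skipped << url
    else
      skills.each { |skill| out << "#{skill}\n" }
    end
  end
  skipped
end
```

Because crawl returns the skipped URLs, one way to recrawl is simply to call it again with that returned list as the urls argument, after waiting a while, until the list comes back empty.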

Cheers,
Braga

10 comments:

Girish Kamath said...

You are a genius....I need this badly for one of my personal project, thanks a ton mate

bragadeesh said...

You're welcome, enjoy!

Christophe Pellier said...

Hi,
Thanks a lot for this.
One question though. Do you think it is possible to get the skills localized in other languages such as French, German, or Italian? Currently the list is only in English.
Thanks a lot for your answer.

bragadeesh said...

@Christophe - Sure, it would need some tweaks in the code. Which language do you want?

Christophe Pellier said...

It would be great if we could have French.

Thanks in advance

blog copa said...

Where can I find the source code?

Christophe Pellier said...

Hi Bragadeesh,
It would be very nice if I could get the French ones :)

bragadeesh said...

Chris,

I searched for a French version of the LinkedIn directory structure; unfortunately, they do not have one. If you can find the link and paste it here, I can surely tweak the program and get you all the skills.

https://www.linkedin.com/directory/topics/

The alternative is to translate the English skill sets to French 1:1 using Google, but I am not sure about the quality and throughput of the translation.

Thanks for stopping by and apologies for the delay in replying.

Christophe Pellier said...

Hi Brag,
No issues. Thanks a lot for your answer.

Vib said...

Hi Brag,

I got to know about you and have gone through the code; it seems impressive.

I need to get in touch with you. I have a research project, and I think you can shed some light on it.

Kindly share your interest by mailing me back at vibhavshetye@gmail.com

Thank you
Vibhav