17 November 2016

Crawl all the LinkedIn skills



One of the recent problems I solved was crawling all the LinkedIn skills. Without further ado, here is the source code.

Most of it is self-explanatory. I have also attached the complete skill list used by LinkedIn for anyone to download. Note: this list is freely available in the public domain. Also, the program is accurate only as of the current date, because the page counts are hard-coded. I could have written a dynamic one that updates automatically, but I was too lazy to do that :)

Code Explanation
Line 1: Requires the anemone library, a web spider framework written in Ruby.
Lines 3-6: Initialise the character-to-page-count map obtained from the URLs. For example, take a look at https://www.linkedin.com/directory/topics-o/ where the character o has 99 sub pages. Similarly, the character x has 73 sub pages. I assigned these counts manually so that the crawler visits that many pages for each character.
Line 8: The variable all_urls holds all the possible combinations from a to z, with each character having at most 99 sub pages. The variable skipped_urls catches the URLs that could not be crawled because LinkedIn detected that scraping was going on. These are collected and printed so they can be re-crawled later.
Line 9: Maps all the possible URLs mentioned above into the variable all_urls.
Line 11: Opens a file called skills.txt in write mode.
Line 12: Iterates over each of the URLs in the all_urls variable.
Lines 15-20: This is where the real crawling occurs. The XPath selector searches for class=column, collects all the skills on the given page, and writes them directly into the file.
Line 22: Captures the URL in skipped_urls in case LinkedIn blocks the scraper (that is, this program).
Line 27: Prints the skipped_urls, which have to be fed back into the program for another run. I leave it to the reader to figure out how.
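
For readers who want a starting point for the rerun, here is a minimal sketch of the flow described above. It is not the original program. The sub-page URL pattern (topics-<char>-<page>/), the fallback of 99 pages for letters other than o and x, and the names build_urls, crawl, crawl_with_retries and fetcher are all assumptions made for illustration. The fetcher is a hypothetical callable where the real page fetch and XPath extraction would plug in; it returns an array of skill names, or nil when the page could not be crawled.

```ruby
# Sketch only: the sub-page URL pattern and most page counts are assumptions.
BASE_URL = "https://www.linkedin.com/directory/topics"

# Character-to-page-count map. Only "o" => 99 and "x" => 73 come from the
# post; every other letter falls back to DEFAULT_PAGES (assumed maximum).
PAGES_PER_CHAR = { "o" => 99, "x" => 73 }.freeze
DEFAULT_PAGES = 99

# Build every candidate URL from a to z, one per sub page.
def build_urls(pages_per_char = PAGES_PER_CHAR, default_pages = DEFAULT_PAGES)
  ("a".."z").flat_map do |char|
    count = pages_per_char.fetch(char, default_pages)
    (1..count).map { |page| "#{BASE_URL}-#{char}-#{page}/" }
  end
end

# Crawl each URL once. `fetcher` is a hypothetical callable that returns an
# array of skill names, or nil when the page could not be crawled.
# Skills are written to `out`; URLs that failed are returned.
def crawl(urls, out, fetcher)
  skipped_urls = []
  urls.each do |url|
    skills = fetcher.call(url)
    if skills.nil?
      skipped_urls << url
    else
      skills.each { |skill| out.puts(skill) }
    end
  end
  skipped_urls
end

# Re-crawl the skipped URLs until none remain or max_rounds is used up.
# Returns whatever is still skipped after the last round.
def crawl_with_retries(urls, out, fetcher, max_rounds: 3)
  pending = urls
  max_rounds.times do
    break if pending.empty?
    pending = crawl(pending, out, fetcher)
  end
  pending
end
```

To use it, open skills.txt in write mode, pass the file as out together with build_urls and your own fetcher, and print whatever crawl_with_retries returns. Feeding only the skipped URLs back in means already-written skills are not fetched twice. A pause between rounds would probably help if LinkedIn is blocking the scraper.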

Cheers,
Braga

19 comments:

My first blog said...

You are a genius... I need this badly for one of my personal projects, thanks a ton mate

bragadeesh said...

You're welcome, enjoy!

cpellier said...

Hi,
Thanks a lot for this.
One question though: do you think it is possible to get the localization of the skills in different native languages, like French, German, or Italian? Currently the list is only in English.
Thanks a lot for your answer.

bragadeesh said...

@Christophe - Sure, it would need some tweaks in the code. What language do you want?

cpellier said...

It would be great if we could have French.

Thanks in advance

blog copa said...

Where can I find the source code?

cpellier said...

Hi Bragadeesh,
It would be very nice if I could get the French ones :)

bragadeesh said...

Chris,

I searched for a French version of the LinkedIn directory structure; unfortunately, they do not have one. If you can find the link and paste it here, I can surely tweak the program and get you all the skills.

https://www.linkedin.com/directory/topics/

The alternative is to translate the English skill set to French 1:1 using Google, but I am not sure about the quality and throughput of the translation.

Thanks for stopping by and apologies for the delay in replying.

cpellier said...

Hi Brag,
No issues. Thanks a lot for your answer.

Vib said...

Hi Brag,

I got to know about you and went through the code; it seems impressive.

I need to get in touch with you. I have a research project, and I think you can shed some light on it.

Kindly let me know if you are interested by mailing me back at vibhavshetye@gmail.com

thank you
Vibhav

Sunil Ojha said...

Hi,

Thanks a lot for this. This is really awesome.

I need your help. Can we also find the list of companies on LinkedIn? These are also available in the public domain and have exactly the same kind of URL.

https://www.linkedin.com/directory/companies-a/

Also, could you help me with how to execute this particular script?

I need this for one of my projects.

Unknown said...

Hey bra you don't know how happy I am right now. Thank you so much. Now I can sleep in peace.

Unknown said...

Hi!

Is there a way to get the skills in Spanish? In Spanish, skills are called "aptitudes". Let me know, thanks!
Diego

bragadeesh said...

Hola Diego. Do you have a directory/URL for that?

Unknown said...

Hi! The opposite. I was wondering if, using your code, I could crawl the LinkedIn skills in Spanish.

Srinivas said...

Great job, really useful

Unknown said...


Can you please share the final file?

I tried to execute it online and could not get the output: https://repl.it/repls/ImaginativeEvenNasm

thanks!

Unknown said...

The links to the topics do not work.

Unknown said...

Just one more thing: how did you implement Blogger on your site? Did you use the API?