Ramblings of a techie: Crawl all the LinkedIn skills

17 November 2016

Crawl all the LinkedIn skills

One of the problems I solved recently was crawling all the LinkedIn skills. Without further ado, here is the source code.

Well, most of it is self-explanatory. I have also attached the complete skill list used by LinkedIn for anyone to download. Note: this is freely available in the public domain. Also, the program is accurate as of the date of writing. I could write a dynamic version that updates the page counts automatically, but I was too lazy to do that :)

Code Explanation
Line 1: Requires the anemone library, a web spider framework written in Ruby.
Lines 3-6: Initialise the map from characters to page counts, obtained from the URL. For example, take a look at https://www.linkedin.com/directory/topics-o/. Here the character o has 99 sub pages; similarly, the character x has 73. I assigned these counts manually so that the crawler visits that many pages for each character.
Line 8: The variable all_urls holds every possible URL from a to z, with each character having at most 99 sub pages. The variable skipped_urls catches the URLs that could not be crawled because LinkedIn detected that scraping was going on. These are collected and printed so they can be recrawled later.
Line 9: Maps all the possible URLs mentioned above into the variable all_urls.
Line 11: Opens a file called skills.txt in write mode.
Line 12: Iterates over each of the URLs in the all_urls variable.
Lines 15-20: This is where the real crawling occurs. The XPath selector searches for class=column, collects all the skills on the given page, and writes them directly into the file.
Line 22: Captures the URL in skipped_urls in case LinkedIn blocks the scraper (that is, this program).
Line 27: Prints the skipped_urls, which we will have to use to rerun the program. I leave it to the reader to figure out how.
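
For anyone who wants to see the flow described above in one place, here is a minimal sketch of the URL map, the crawl loop and the rerun of skipped URLs. It is not the original program. The sub-page pattern "topics-<letter>-<n>/", the helper names (build_urls, crawl, crawl_with_retries) and the pluggable fetcher are all assumptions made for illustration; only the counts for o (99) and x (73) come from the explanation above.

```ruby
require "stringio"

# Page counts per letter. Only "o" => 99 and "x" => 73 are taken from the
# post; a real run would need an entry for every letter from a to z.
PAGES_PER_LETTER = { "o" => 99, "x" => 73 }.freeze

BASE_URL = "https://www.linkedin.com/directory"

# Build every sub-page URL for a letter => page-count map.
# ASSUMPTION: the "topics-<letter>-<n>/" pattern is hypothetical; the post
# only shows the per-letter URL ".../directory/topics-o/".
def build_urls(pages_per_letter)
  pages_per_letter.flat_map do |letter, count|
    (1..count).map { |n| "#{BASE_URL}/topics-#{letter}-#{n}/" }
  end
end

# Visit each URL once. `fetcher` is any callable that takes a URL and returns
# an array of skill names, or nil when the page could not be crawled.
# Skills are written to `out`; the URLs that failed are returned.
def crawl(urls, out, fetcher)
  skipped = []
  urls.each do |url|
    skills = fetcher.call(url)
    if skills.nil?
      skipped << url
    else
      skills.each { |skill| out.puts(skill) }
    end
  end
  skipped
end

# Rerun the skipped URLs for up to `max_rounds` rounds, pausing `pause`
# seconds between rounds. Returns whatever is still skipped at the end.
def crawl_with_retries(urls, out, fetcher, max_rounds: 3, pause: 0)
  pending = urls
  max_rounds.times do
    break if pending.empty?
    pending = crawl(pending, out, fetcher)
    sleep(pause) unless pending.empty?
  end
  pending
end
```

In a real run, the fetcher would wrap the anemone call and the class=column XPath lookup, and out would be the skills.txt file opened in write mode. Keeping the fetcher pluggable means the rerun logic can be tried out with a stub before pointing it at LinkedIn.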

Cheers,
Braga

19 comments:

My first blog said...: You are a genius....I need this badly for one of my personal projects, thanks a ton mate; November 19, 2016 at 4:04 PM
bragadeesh said...: You're welcome, enjoy!; November 23, 2016 at 5:37 AM
cpellier said...: Hi,
thanks a lot for this.
One question, though: do you think it is possible to get the skills localized in other languages such as French, German, or Italian? Currently the list is only in English.
Thanks a lot for your answer.; February 3, 2017 at 11:33 PM
bragadeesh said...: @Christophe - Sure, it would need some tweaks in the code. Which language do you want?; February 5, 2017 at 6:44 AM
cpellier said...: It would be great if we could have French.

Thanks in advance; February 6, 2017 at 6:57 AM
blog copa said...: Where can I find the source code?; February 14, 2017 at 10:56 AM
cpellier said...: Hi Bragadeesh,
It would be very nice if I could get the French ones :); March 23, 2017 at 7:36 AM
bragadeesh said...: Chris,

I looked for a French version of the LinkedIn directory structure; unfortunately, they do not have one. If you can find the link and paste it here, I can surely tweak the program and get you all the skills.

https://www.linkedin.com/directory/topics/

The alternative is to translate the English skill sets to French 1:1 using Google, but I am not sure about the quality and throughput of the translation.

Thanks for stopping by and apologies for the delay in replying.; March 25, 2017 at 8:48 PM
cpellier said...: Hi Brag,
No issues. thanks a lot for your answer.; March 27, 2017 at 2:15 AM
Vib said...: Hi Brag,

I came across your work and went through the code; it seems impressive.

I need to get in touch with you. I have a research project, and I think you can shed some light on it.

kindly share your interest by mailing me back vibhavshetye@gmail.com

thank you
Vibhav; June 11, 2017 at 11:34 PM
Sunil Ojha said...: Hi,

Thanks a lot for this. This is really awesome.

I need your help. Can we also get the list of companies on LinkedIn? These are also publicly available and have exactly the same kind of URL.

https://www.linkedin.com/directory/companies-a/

Also, could you help me with how to execute this script?

I need this for one of my projects.; August 7, 2017 at 9:39 AM
Unknown said...: Hey bra you don't know how happy I am right now. Thank you so much. Now I can sleep in peace.; September 8, 2017 at 6:24 PM
Unknown said...: Hi!

Is there a way to get the skills in Spanish? In Spanish, skills are called "aptitudes". Let me know, thanks!
Diego; September 7, 2018 at 7:01 PM
bragadeesh said...: Hola Diego. Do you have a directory/URL for that?; September 8, 2018 at 12:28 AM
Unknown said...: Hi! The opposite. I was wondering if, using your code, I could crawl the LinkedIn skills in Spanish.; September 11, 2018 at 3:46 PM
Srinivas said...: Great job, really useful; December 30, 2018 at 9:57 PM
Unknown said...: Can you please share the final file?

I tried to execute it online and could not get the output: https://repl.it/repls/ImaginativeEvenNasm

thanks!; May 4, 2019 at 12:02 PM
Anonymous said...: The links to the topics do not work.; April 20, 2021 at 12:34 PM
Anonymous said...: Just one more thing: how did you implement Blogger on your site? Did you use the API?; April 20, 2021 at 12:42 PM