One of the problems I solved recently was crawling all the LinkedIn skills. Without further ado, here is the source code.
Most of it is self-explanatory. I have also attached the complete skill list used by LinkedIn for anyone to download. Note: this data is available for free in the public domain. Also, the program is accurate only as of the date of writing, because the page counts are hard-coded. I could have written a dynamic version that updates them automatically, but I was too lazy to do that :)
Code Explanation
Line 1: Require the anemone library, a web spider framework written in Ruby.
Line 3-6: Initialise the map from characters to page counts, obtained from the site. For example, take a look at https://www.linkedin.com/directory/topics-o/ where the character o has 99 sub-pages. Similarly, the character x has 73 sub-pages. I assigned these counts manually so that the crawler knows how many pages to visit for each character.
Line 8: The variable all_urls holds every possible URL for the characters a to z, each character having at most 99 sub-pages. The variable skipped_urls collects the URLs that could not be crawled because LinkedIn detected that scraping was going on. These are printed at the end so they can be recrawled later.
Line 9: Map all the possible URLs mentioned above into the variable all_urls.
Line 11: Open a file called skills.txt in write mode.
Line 12: Iterate over each of the URLs in the all_urls variable.
Line 15-20: This is where the real crawling occurs. The XPath selector searches for class=column, collects all the skills on the given page, and writes them directly into the file.
Line 22: Capture the URL in skipped_urls in case LinkedIn blocks the scraper (that is, this program).
Line 27: Print skipped_urls, for which the program has to be rerun. I leave it to the reader to figure out how.
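For readers who want a starting point for the URL generation and the rerun step, here is a minimal sketch and not the original script. It uses only plain Ruby and leaves the actual page fetch to a block that you supply (for example, one wrapping anemone). The method names build_urls and crawl_with_retries, the max_passes parameter, and the sub-page URL pattern topics-<letter>-<n>/ are all assumptions made for illustration. Only the counts for o and x come from the explanation above.

```ruby
# Build one URL per sub-page from a map of character => number of sub-pages.
# ASSUMPTION: sub-pages follow the pattern ".../topics-<letter>-<n>/".
def build_urls(pages_per_char, base: "https://www.linkedin.com/directory/topics")
  pages_per_char.flat_map do |char, pages|
    (1..pages).map { |n| "#{base}-#{char}-#{n}/" }
  end
end

# Visit every URL using the supplied block. The block must return an array of
# skills for a page, or nil when the page could not be crawled (e.g. blocked).
# URLs that returned nil are tried again on the next pass, up to max_passes.
# Returns [all collected skills, URLs still skipped after the last pass].
def crawl_with_retries(urls, max_passes: 3, &fetch)
  skills = []
  pending = urls
  max_passes.times do
    break if pending.empty?
    skipped = []
    pending.each do |url|
      result = fetch.call(url)
      if result.nil?
        skipped << url          # remember it for the next pass
      else
        skills.concat(result)
      end
    end
    pending = skipped
  end
  [skills, pending]
end

# Only these two counts are given in the post; a full run needs every letter.
pages_per_char = { "o" => 99, "x" => 73 }
urls = build_urls(pages_per_char)

# Placeholder fetcher so the sketch runs without touching the network.
skills, still_skipped = crawl_with_retries(urls) { |url| ["placeholder skill for #{url}"] }
puts "#{skills.size} skills collected, #{still_skipped.size} URLs still skipped"
```

In a real run the block would fetch the page and extract the skill names, returning nil whenever LinkedIn refuses the request. Adding a pause between passes would probably help too, though how long to wait is something I have not measured.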
Cheers,
Braga
19 comments:
You are a genius... I need this badly for one of my personal projects. Thanks a ton, mate.
You're welcome, enjoy!
Hi,
Thanks a lot for this.
One question though: do you think it is possible to get the localized skills in other languages, such as French, German, or Italian? At the moment the list is only in English.
Thanks a lot for your answer.
@Christophe - Sure, it would need some tweaks to the code. Which language do you want?
It would be great if we could have French.
Thanks in advance.
Where can I find the source code?
Hi Bragadeesh,
It would be very nice if I could get the French ones :)
Chris,
I searched for a LinkedIn directory structure for French; unfortunately, they do not have one. If you can find the link and paste it here, I can surely tweak the program and get you all the skills.
https://www.linkedin.com/directory/topics/
The alternative is to translate the English skill names to French 1:1 using Google, but I am not sure about the quality and throughput of the translation.
Thanks for stopping by and apologies for the delay in replying.
Hi Brag,
No issues. Thanks a lot for your answer.
Hi Brag,
I got to know about you and have also gone through the code; it seems impressive.
I need to get in touch with you. I have a research project, and I think you can shed some light on it.
Kindly share your interest by mailing me back at vibhavshetye@gmail.com
Thank you,
Vibhav
Hi,
Thanks a lot for this. This is really awesome.
I need your help. Can we also get the list of companies on LinkedIn? These are also available in the public domain and have a very similar kind of URL.
https://www.linkedin.com/directory/companies-a/
Also, could you help me with how to execute this script?
I need this for one of my projects.
Hey bra you don't know how happy I am right now. Thank you so much. Now I can sleep in peace.
Hi!
Is there a way to get the skills in Spanish? In Spanish, skills are called "aptitudes". Let me know, thanks!
Diego
Hola Diego. Do you have a directory/url for that?
Hi! The opposite. I was wondering if, using your code, I could crawl the LinkedIn skills in Spanish.
Great job, really useful.
Can you please share the final file?
I tried to execute it online and could not get the output: https://repl.it/repls/ImaginativeEvenNasm
Thanks!
The links to the topics do not work.
Just one more thing: how did you implement Blogger on your site? Did you use the API?