17 November 2016

Crawl all the linkedin skills

One of the recent problems I solved, crawling all the LinkedIn skills. Without any adieu, here is the source code.

Well, most of it is self-explanatory. Have also attached the complete skill list used by LinkedIn for anyone to download. Note: This is something available for free on the public domain. Also, the program written is good for the current date, I could write a dynamic one that would automatically update, but then I was too lazy to do that :)

Code Explanation
Line 1: Requiring anemone library which is a web spider framework written in ruby.
Line 3-6: I am initialising the characters to pages map obtained from the URL. For example, take a look at this link https://www.linkedin.com/directory/topics-o/ Here the character O has 99 sub pages. Similarly, character x has 73 sub pages. I manually assigned it here for the crawler to go that many times
Line 8: The variable all_urls consists all the possible combinations from a to z at the max each character having 99 subpages. The variable skipped_urls is to catch the URLs whose values are not crawlable because LinkedIn detected scrapping is going on. That will be collected and will be printed for recrawling later.
Line 9: Mapping all the possible URLs mentioned above into the variable all_urls
Line 11: Open a file called skills.txt in write mode and make it ready
Line 12: Iterate over each of the URLs present in all_url variable
Line 15-20: This is where the real crawling occurs. The XPath selector searches for class=column and collects all the skills in the given page and writes them directly into the file
Line 22: Capture all the skipped_urls in case LinkedIn blocks the scraper (which is the program)
Line 27: Print the skipped_urls using which we will have to rerun the program again - which I leave it to the reader to figure out how.


22 September 2016

State of Rails Releases

I am dealing with multiple Rails applications at once some of which are funded right and some of which have ran into maintenance modes. I just wanted to have a pictorial representation of what is the state of each of the Rails versions, unfortunately, I cannot get one. So I spent half an hour trying to decode the Rails official releases page and came up with this.

If you closely observe, the Rails releases 3.0.x and 3.1.x are history (I don't even want to pull up data for releases before that). It is high time you plan to upgrade your stack to a minimum of Rails 4.2 before you loose all the goodness the Rails community has to offer. Yes, I know it is a herculean task for folks in 3.x in which case I recommend you to do a complete rewrite of your application piecemeal by piecemeal!

Data Source: http://weblog.rubyonrails.org/releases/


26 July 2016

20 Tips for an Effective Code Review

It is a well-established fact that most of the bugs in the Software Development life cycle could be prevented literally right at the source (code). Since Code Review is almost an inevitable process in the Agile paradigm, keep in mind these 20 tips/guidelines (in no particular order) to become an effective reviewer of code. This is not restrictive to any one language but applicable to all. I've been reviewing code for many years and one of my core successes lie in stressing these points across the team. This is also the only way to effectively nurture and scale teams across the organisation.

  1. Identify the right tool: Identifying the right tool is very important. Because one should not be thrown off for adding a review just because the tool is not efficient enough. There are many open source tools out there. In most cases, you may have to host it yourself or you can also opt for services that do the hosting for you. If you are an Open Source Contributor, you would know how effective Github can be which also happens to be my personal favorite.
  2. Pre-conditions/Checklist: Any patch or a pull request that is submitted should have a minimum set of pre-conditions like it should have a Green build. A lot of Review tools have hooks to be configured to poll the SCM automatically and run the build. Build tools like Jenkins, Travis support these with minimal to no configuration. Ensure that you use them! Because it will definitely save you time and heartache instead of seeing stuff getting pushed to your trunk/master/production branch.
  3. Avoid Repeat Mistakes: As Gustavo Fring from The Breaking Bad rightly says "Never Repeat the same mistake twice", it is crucial that repetitive patterns are broken. In the context of code review, this means the developers should not get the same review comment that they had received earlier. This ensures that with each Iteration - the Quality of the patches improves so that if at all any new review comments are there - they are only new and anything that is given in the past are assumed to have been implemented in the later ones. If this is not happening, it is up to the Reviewer to go and identify to see where the leak is.
  4. Self Review: The person who is submitting the diff should first review it him/herself. Many obvious things like debugger statements, extra/missing files, ignorable files could be identified here. And I would recommend even doing a full fledged review of his/her own code as if he/she would do another one's. This culture also reduces the burden on the reviewer of concentrating on the Meat of the patch and not the obvious ones.
  5. Design Review: The Reviewer should also be able to decode the design introduction/changes that the Pull Request has and should be in a position to judge and give appropriate feedback. This is very important. 
  6. UI Review: Although software developers look at just the code and give feedback (because that is all they can be seeing in a Pull Request or a diff), they often neglect how the end product would look in a browser or the device where the code was intended to. It is extremely difficult to guess on how it will look. I recommend everyone to go the extra mile of looking how it looks and whether it relates to the original functionality. This is going to take some extra time. In my experience, this has insane returns in terms of identity and squashing obvious UI related issues. 
  7. Non-Logical Checklist: Code Review does not only involve in vetting the Logical Integrity but also some non-logical things like Naming convention, Spacing/Indentation, Object oriented compliance checks. Ensure that there is such a checklist in the first place.
  8. Keeping a pulse on the industry: This is true not only in this context but also on the overall wholeness of a Programmer. You should be up to date on what is going on in the Programming world, at least in the particular language you are part of. Knowledge of things like critical security patches, feature additions, language enhancements, performance improvements proves to be really powerful in assisting an effective review process.
  9. Encourage feedback: One does not always have to agree with what is said in the Review. If there are some contradictions - it is best they are addressed between the Reviewer and the Reviewed (or Reviewee). I also encourage that all the review comments are responded to. This process gives confidence to the entire team that any review comment will not go unanswered. 
  10. Avoid Oral Reviews: When a patch or pull requested gets created that too for teams and developers co-located or sitting next to each other, it is tempting to just go through it and give all the feedback orally. This may be fine if the team is small (only 2) and they fully own the codebase. However, this has some negative effects in terms of follow up and broadcasting. What I mean by broadcast is that there could be a review comment which could be applicable for the entire team.
  11. Learn from other reviews: Encourage to team members to not just read your own reviews and apply however to read the other reviews within the team. I've heard this famous quote - 'An intelligent person learns from their own mistakes, but a genius learns from the mistakes of others'. Let's make everyone in the team a Genius! 
  12. Dual Reviews: Similar to a Doubly refined sugar or Oil, the throughput and Quality of the code review could improve if it has a Second reviewer if that's possible.
  13. Review the Reviewer: It is a bit over-zealous to expect anyone coming new to the team who is relatively younger to the software development or who has not involved in Review process in the past to quickly catch up to all the nuances in the code review process. It would be nice if these guidelines are slowly implemented and mentoring/onboarding is in the organisation's culture. In simple terms, there could be a reviewer who can review whether reviewer complies to all the best practices out there.
  14. Over Engineering: At times it is tempting for Reviewers to comment on things that may look like Over Engineering work. These cases it is okay to voice your opinion to the reviewer.
  15. Enterprise Adherence: An Enterprise will have an adherence in different horizontals in terms on what tools they need to use, what style guide they need to follow, what frameworks has been used across different projects. It is up to the Enterprise Architect or the Senior Member of the team to proactively absorb all these facts and ensure that the entire review process is in Adherence with the overall Enterprise. This is crucial because each Atomic commit may slowly introduce things that could stray away from what the Enterprise would want. It may not look like a problem at all in the initial phase. However, should there be a consolidation happen across various projects - having multiple stacked apps across the enterprise would result in painful Refactors and often ends up leaving a huge amount of technical debt behind.
  16. Dependency Injection: Be wary of addition or removal of a new Library to the code. This again falls under the adherence of standards across the projects. Make sure that any introduction of a new library is well evaluated across the team and that it has enough support both in the near and long run. I have seen a lot of libraries which were started by individual contributors go unmaintained for years. Ensure that there is a strong community following and is very active.
  17. Against the Right branch: This may seem like something that may not belong here but in my personal experience I've faced this issue multiple times where a Reviewed creates a pull request against a different (or default) branch instead of the one that it actually has to go.
  18. Tech Debt Identification: During the course of the review, the reviewer may stumble upon an issue which involves a good amount of effort. In such cases, it is not advised to block it and hamper the delivery commitments. Instead, the right thing to do here is to add these things to a technical debt backlog where it could be groomed and picked up in future.
  19. Copy Paste excuses: "I did not do this - it was already there - I just copied/moved it" - Yes this is a very common statement every developer says when his code is challenged - however ensure that any code that is touched has to comply to the coding standards set by the team.
  20. Make Guidelines explicit: It is a very good process for all the developers on-boarding to a new team to have a set of guidelines (you could use this) explicit and review it from time to time. This could be done across the organisation.

The above list may look overwhelming. However, if you have the knack and right drive to implement some or all of these - the productivity of the Engineering team would increase by multi-folds.


08 June 2016

Jenkins bump from 1.x to 2.x

Jenkins is one of the amazing open source softwares especially after it forked itself from its predecessor Hudson. Amazing for multiple reasons - but the one that has really "amazed" me is the painless upgrades it provide. Trust me, I am a Rails developer and I know in & out on what a nightmare it is to perform an upgrade!

Recently I had to perform an upgrade for Jenkins and it was as dead simple as replacing a .war file and I was err.... DONE! All I did was stopped and started the process whilst replacing the war file. Once I rebooted, it all just worked. They seriously think about making their installations to run the latest software which I personally consider a true value.

Anyways, the reason me posting this is to help my fellow developers who are running into an issue where their slave would not start up after the upgrade process.

You would face an error something like this below when you go and inspect in the slave agent.

What this error means is that it could not invoke the slave because of an outdated java running in the slave. All you've got to do is upgrade it and you should be all good to go.

This was how the overall upgrade felt like :)


31 May 2016

Responsive Email design with Rails

It is almost imperative in the recent times, the emails we send out are expected to be responsive with a heavy number of users preferring to read or more like skim through emails from their smartphones. To find the ideal sweet spot that aids in not only developing fully responsive emails, but also to do it quickly and easily is vital. There are lot of factors should be taken into account both from a business perspective and from a developer standpoint. I am listing them here (in no particular order)

  1. Responsive design - works consistent across all the devices from mobile layout to the most stringent Outlook Email client.
  2. The UI should be consistent with ways to freeze the Headers, Footers and should follow a proper template similar to Rails Action View Layouts.
  3. Should be able to easily testable in developer mode with support for Plain text view besides supporting HTML View.
  4. Avoiding hardcoding of styles in each and every HTML tag. Hardcoding styles in the email has been the norm in Rails community and other web frameworks as well for a very long time.
  5. Should be easily testable in all types of browsers. Even a minor modification/tweak should be tested quickly instead of painfully sending emails again and again.
The following may seem a shorter list, but believe me - to quench all the above criterions I had to go through a lot of different phases with varied learning curves. To attack all the above problems - I would suggest the following tools and libraries to make our lives super simple.

  1. Zurb's "Foundation for emails" (previously called Ink) that provides with ready-made available templates to kick start and later customise on top of it to our heart's content.
  2. Premailer-Rails - A wonderful Rails pre-processor that makes the email design entirely stylesheet driven as opposed to hardcoding styles directly in the tag. Not only does it removes the pain of having hardcoded styles, it also provides a packages view to render the Plain text automagically - with 0 amount of code required from the developer.
  3. Letter Opener - A classical tool by Ryan to quickly preview the emails in development mode.
  4. Litmus - If you are into Responsvie email design, you have no reason not to subscribe to Litmus as they provide a comprehensive way to template, design and test your email in in-numerous email clients.
That is it! Combine these tools and with a slight learning curve, you can claim yourself as a fully responsive e-mail designer.


22 January 2016

Streaming vs Synchronus Replication in Postgres

I recently faced one strange issue in Rails which usually questioned some of the basic Relation Database principles. It gave me almost a sleepless night until I was able to get to the Root cause of the issue.

The problem

The problem was pretty straightforward. A Rake task generates an email and the email had two places where the count of documents was mentioned. Ideally they are supposed to be the same - but for some reason it was different.

The pain point

The reason this particular problem was painful because this has not occurred for few years and that it occurred only intermittently. The problem with intermittency is that there is always some theory behind. Here too there was something. Here are steps I had to perform to find the Root cause.

The approach

I first looked into the Rake task's log file which is outputted when my specific Email job runs. Things looked fine there - meaning it completed in under 90 seconds as expected.
The next step was to look at the production logs. The logs as expected was having 30 insert statements - Check. And it also has a read statement for the insert statements before and it was a typical count(*) query. The problem occurred at this point. The count(*) should have returned 30 but instead it returned 4. There comes another count(*) somewhere below in the code - but that returned 30 as expected!

The above step revealed that this problem is not with the Rails layer but something to do with our production database setup. So routed my energy towards there.
The production database environment is a Master-Slave configuration with Master taking Writes and Reads and Slave purely configured to take Reads. Both these nodes are load balanced via a PG Pool server. My initial gut said to investigate some time in the PGPool but that is not much useful as all PG Pool going to do is route traffic.
So I went and read about the Master - Slave Replication configuration. I read about two types of replication. One being synchronous replication and the other being Streaming Replication. Digging into that I found my root cause!

Synchronous vs Streaming Replication

Assume you have two databases A and B with A being a R/W Master and B being R-only Slave. If an insert or update command is issued, it goes and writes that entry to A as its configured for write. If the database returns after it ensures that all the slaves got this write - it is called as Synchronus or 2-Safe Replication. If A does not wait for this step however acknowledge whether it wrote successfully and later streams that value to B - this is called as Streaming Replication.

Both has their obvious own pros and cons. Streaming Replication is for Raw Speed and is also a very good configuration where there are too many writes. And Synchronous Replication although not as fast as Streaming provides 100% consistency. We unfortunately were in Streaming Replication mode. The 30 inserts happened so fast at A, that before even it could stream them to B, the count query intervened and read the half baked data from B. I am talking in terms of millisecond speed.

How did we fix it?

We isolated all our cron jobs to run in a dedicated node and pointed the database directly to the Master database server skipping the PG Pool in the process. In a single database configuration the concept of Streaming or Synchronous Replication does not apply. Hope this was helpful!


07 January 2016

Quick way to import a jar in Eclipse

Following is a quick demonstration to import .jar files via Eclipse IDE

See Also: TRIE ADT Tutorial

You will need to understand why you are doing this. Any java program, for it to execute, first needs to be compiled without errors. The compilation process (using "javac") is going to convert your source code into bytecode (.class) file which you can then use the "java" executable to run. An IDE like Eclipse will do the compilation automatically in the background. That is why when you have any sort of error in the program (syntax/logical etc.,), you see it getting highlighted immediately. 

So, when you are referencing an external library, javac executable needs to know which library you are referencing. You can do this by setting the CLASSPATH variable before executing the program if it is via command line. Eclipse makes it easier, instead of specifying the Java library file's location through the command line, you can do it through the editor by Right clicking on the project and selecting Build Path --> Configure Build Path. With the Libraries tab selected, click on "Add External Archives" and give your .jar files. Your program should now compile smoothly without errors.

Using an IDE introduces insane efficiency for designing and development. However, always have an understanding of what is going behind the scenes. Hope you learned something useful today!