I’m excited to tell you, but quite honestly, I actually can’t believe it’s taken me this long to put 2 and 2 together. Have you ever really watched the video clip below that you can find on LinkedIn’s Learning Center as well as on YouTube?
If you ignore the information regarding the new features and pay close attention to the video, you can hear Esteban talk about how LinkedIn is always on the lookout for talented Lucene Open Source engineers and watch him search for them. Lucene is an open source text search engine that I’ve written about in multiple posts for its advanced search functionality, including extended Boolean.
LinkedIn uses Lucene as their Text Search Engine
When I first watched the video, I never gave the Lucene stuff a second thought because LinkedIn doesn’t actually offer any of Lucene’s truly advanced search functionality – LinkedIn doesn’t even support root-word/wildcard searching, let alone extended Boolean search. I figured if they were already using Lucene for their text search engine they would offer all of Lucene’s search functionality, which they don’t.
Then I watched the video again the other day (not exactly sure why) and it made me curious. Had they already implemented Lucene, or were they looking to do so? I did some research to see if I could confirm a link between LinkedIn and Lucene (pun intended). Although TechCrunch reported that LinkedIn upgraded its people search, they failed to mention the technology behind the upgrade. I was then able to dig up an article that verified that LinkedIn had implemented Lucene as their text search engine.
So What Can LinkedIn Do With Lucene?
I’m glad you asked – be prepared to be amazed!
Lucene supports single and multiple character wildcard searches within single terms. That means you could search for the term develop* and LinkedIn would return results of people who mention any word that begins with the root “develop”: develop, developed, developing, developer, develops, etc. That would mean no more having to type out long OR statements where you have to think about all of the different ways a particular term can be written.
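To make the wildcard idea concrete, here is a minimal Python sketch of how a term like develop* could be matched against the words in a profile. It is only an illustration of the matching rule (* for any run of characters, ? for exactly one), not how Lucene implements it internally; the function names and the sample word list are my own assumptions.

```python
import re

def wildcard_to_regex(pattern):
    """Translate a wildcard term into a compiled regex.

    '*' matches zero or more characters, '?' matches exactly one.
    (Hypothetical helper for illustration only.)
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def matching_terms(pattern, terms):
    """Return the terms that match the wildcard pattern in full."""
    regex = wildcard_to_regex(pattern)
    return [t for t in terms if regex.fullmatch(t)]

# Made-up sample of words that might appear across profiles.
profile_terms = ["develop", "developed", "developing", "developer",
                 "develops", "devops", "design"]

print(matching_terms("develop*", profile_terms))
print(matching_terms("te?t", ["test", "text", "tent", "toast"]))
```

The first call returns the five "develop" variants and skips "devops" and "design"; the second shows the single-character wildcard matching "test", "text", and "tent" but not "toast".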
Lucene supports configurable proximity search – or the ability to find words that are within a specific distance from each other (3 words, 8 words, your choice). For example, if you wanted to find people who mention that they have experience configuring routers, you can use Lucene’s proximity search functionality via the tilde symbol (~) to target phrases where some mention of config* is made within 5 words of router or routers.
This functionality is HUGE, as it allows sourcers and recruiters to drastically increase the relevance of search results by targeting people based on their responsibilities rather than basic keyword search (aka buzzword bingo). Without forcing some variant of the word “configure” to be within 5 words of “router” or “routers,” you can just as easily return results of people who do not mention that they have been specifically responsible for configuring routers – you could end up finding people who mention that they’ve configured other things (e.g. servers), and who make one mention of the word “router” in their skill summary because they have a router at home (but no paid professional experience). That would be what I call a false positive hit. The result mentioned the search terms, but it did not match the intent of my search – which is to find people who have been responsible for configuring routers.
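In Lucene's classic query syntax, a proximity search is written as a quoted phrase followed by a tilde and a number, for example "configure router"~5. The Python sketch below is not Lucene's implementation; it is a simplified check, under my own assumptions, of whether any word starting with "config" sits within 5 word positions of "router" or "routers". The two sample profiles are invented to show a relevant hit and a false positive.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_proximity(text, prefix, targets, max_distance):
    """True if a token starting with `prefix` occurs within
    `max_distance` word positions of any token in `targets`."""
    tokens = tokenize(text)
    prefix_positions = [i for i, t in enumerate(tokens) if t.startswith(prefix)]
    target_positions = [i for i, t in enumerate(tokens) if t in targets]
    return any(abs(p - q) <= max_distance
               for p in prefix_positions
               for q in target_positions)

# Invented examples: both mention "configured" and a router term.
relevant = "Configured and maintained Cisco routers and switches for branch offices."
false_positive = ("Configured Windows servers and Active Directory for a 500-user "
                  "organization. Skills: servers, storage, backup, firewall, router.")

routers = {"router", "routers"}
print(within_proximity(relevant, "config", routers, 5))        # True
print(within_proximity(false_positive, "config", routers, 5))  # False
```

A plain keyword search would return both profiles. The proximity check keeps only the first one, because in the second profile "configured" and "router" are 16 words apart.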
When I talk about targeting people based on their responsibilities, I mean searching for responsibility verbs (administer, manage, develop, design, configure, file, reconcile, audit, etc.) mentioned in close proximity (in the same sentence) to skill/technology nouns (oracle, statements, servers, projects, reports, Microsoft Dynamics, SAP, etc.). Being able to control how close words like those are in proximity to each other – down to the sentence level – allows sourcers and recruiters to perform semantic search (aka natural language search). Essentially, you are able to find people based on what they DO, not just the words they happen to mention in their profile.
If you’re new to the concept of semantic search, I strongly suggest you read these articles (Semantic Search 1, Semantic Search 2, Semantic Search with Proximity, Semantic Search without Proximity) that will thoroughly explain the concept as well as show you how you can currently leverage proximity search to your advantage on Monster and Exalead.
Variable Term Weighting
Here’s the other biggie – Lucene allows you to control the relevance weighting of your search terms. Lucene calls it “boosting.” In other words – you can tell Lucene that specific terms in your search string are more important/relevant to you than others. That’s right – instead of the search engine taking all of your search terms and “deciding” which results are the most relevant, YOU control the search relevance based on which terms you think are more critical and match the intent of what you’re specifically looking for.
To boost a term with Lucene you can use the caret (^) symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more weight that term carries when the results are ranked.
For example, if you are searching for the following terms: Unix, Windows, Citrix, VMware, storage, and you really need people who have significant Citrix experience, you can boost that term with the ^ symbol:
Unix AND Windows AND Citrix^5 AND VMware AND storage
This will make profiles with more mentions of the term Citrix appear more relevant and thus rank higher in the search results. This is important, because people who have a lot of experience with Citrix (in terms of specific responsibilities and/or multiple positions in their career history in which they use Citrix) will likely have multiple mentions of Citrix in their profile. Boosting Citrix will result in bubbling all of the profiles with many mentions of Citrix to the top of the results.
This is especially critical because without the ability to “tell” the search engine which specific terms are actually most relevant to you, the search engine makes its own “decision” as to what’s relevant. And in the case of my example – the search engine may see profiles that mention the word Windows 20 times as highly relevant, even if they only mention Citrix once – which isn’t likely to actually be someone who matches my need of a strong Citrix professional.
In addition to boosting single terms, you can also boost phrases. By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).
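Here is a rough Python sketch of why boosting changes the ranking in the Citrix example. Lucene's real scoring formula is considerably more involved than this; the toy score below (term count multiplied by boost factor, summed over all terms) is purely my own simplifying assumption, as are the function names and the two made-up profiles.

```python
import re
from collections import Counter

def parse_boosted_terms(query):
    """Parse a query like 'Unix AND Citrix^5' into {term: boost}.
    Terms without a caret get the default boost of 1."""
    boosts = {}
    for token in query.split():
        if token.upper() == "AND":
            continue
        term, _, boost = token.partition("^")
        boosts[term.lower()] = float(boost) if boost else 1.0
    return boosts

def toy_score(profile, boosts):
    """Toy relevance score (NOT Lucene's formula): sum of count * boost.
    Every term must appear at least once, mimicking Boolean AND."""
    counts = Counter(re.findall(r"[a-z0-9]+", profile.lower()))
    if any(counts[term] == 0 for term in boosts):
        return 0.0
    return sum(counts[term] * boost for term, boost in boosts.items())

def rank(profiles, query):
    """Return profile names ordered from highest to lowest toy score."""
    boosts = parse_boosted_terms(query)
    scored = {name: toy_score(text, boosts) for name, text in profiles.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Invented profiles: one mentions Windows 20 times, the other Citrix 6 times.
profiles = {
    "windows_heavy": "windows " * 20 + "unix citrix vmware storage",
    "citrix_heavy": "citrix " * 6 + "unix windows vmware storage",
}

print(rank(profiles, "Unix AND Windows AND Citrix AND VMware AND storage"))
print(rank(profiles, "Unix AND Windows AND Citrix^5 AND VMware AND storage"))
```

Without the boost, the Windows-heavy profile comes out on top (24 versus 10 in this toy model). With Citrix^5, the Citrix-heavy profile moves to first place (34 versus 28).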
More Lucene Search Functionality
Lucene also supports fuzzy searching (finding matches of misspellings and similar words) based on the Levenshtein Distance, and range searches (similar to Google’s numrange search). To learn more, here is a page that lists all of Lucene’s search functionality.
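For readers who have not come across the Levenshtein Distance, it is the minimum number of single-character insertions, deletions, and substitutions needed to turn one word into another. The Python sketch below computes it with the standard dynamic programming approach and uses it to pick out near matches. The fuzzy_matches helper and its max_edits threshold are hypothetical names of mine, not part of Lucene.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning `a` into `b`."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(term, candidates, max_edits=1):
    """Return candidates within `max_edits` edits of `term`."""
    return [c for c in candidates if levenshtein(term, c) <= max_edits]

print(levenshtein("kitten", "sitting"))  # 3
print(fuzzy_matches("roam", ["foam", "roams", "room", "router"]))
```

With a threshold of one edit, a fuzzy search for "roam" picks up "foam", "roams", and "room" but not "router".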
Now that you know that LinkedIn uses Lucene as their text search engine and you’ve seen all of the powerful search functionality Lucene has to offer – wouldn’t you like to be able to use wildcard searching, proximity search, term weighting, and fuzzy search when searching LinkedIn? I know I would! Those features can make a HUGE difference in the relevance of search results.
I’m still trying to figure out why LinkedIn doesn’t offer users all of Lucene’s search functionality as they’ve been using Lucene as their text search engine for at least 7 months now.
I’ve tried to communicate my search improvement suggestions to LinkedIn a couple of different ways. In June I sent a message to Esteban Kozak – Senior Product Manager overseeing search at LinkedIn – via LinkedIn (of course) that detailed all of my suggestions for improving LinkedIn’s search functionality, including wildcard search, proximity search, and term weighting – and I haven’t received a response.
I also caught William Uranga Tweeting from a LinkedIn customer advisory session last week, so I DM’d him and let him know I had a list of search recommendations and he kindly let me send them to him via email so he could share them during the session at LinkedIn. William wrote a post about his customer advisory session experience at LinkedIn – you can read it here.
We can only hope that sometime in the near future LinkedIn taps into the awesome search power of Lucene, enabling users to take control of search relevance and tap into semantic search. I know I’ve got my fingers crossed!
Esteban Kozak replied to my message with this helpful response:
1- Prefix matching: we are currently evaluating the release of prefix matching for names in order to enable a quick way to navigate your contacts from the mobile application. Prefix matching for free text queries is very expensive because the query needs to be translated into a huge OR statement in the back end. There are better ways to solve this problem more elegantly. We are investigating alternative approaches like stemming, automatic expansion at query time and other techniques to ensure good recall.
2- Proximity search / Term weighting: These two are much easier to open up and will be available shortly.
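Esteban's first point, that a prefix query has to be translated into a huge OR statement, can be pictured with a small Python sketch. This is only an illustration under my own assumptions: the expand_prefix function and the tiny term dictionary are made up, and a real index would hold vastly more terms, which is exactly why the expansion gets expensive.

```python
def expand_prefix(prefix, term_dictionary):
    """Rewrite a prefix query as an OR of every indexed term that
    starts with the prefix (hypothetical helper for illustration)."""
    matches = sorted(t for t in term_dictionary if t.startswith(prefix))
    return " OR ".join(matches)

# Tiny made-up term dictionary standing in for an index vocabulary.
term_dictionary = {"develop", "developed", "developer", "developing",
                   "develops", "design", "devops"}

print(expand_prefix("develop", term_dictionary))
# develop OR developed OR developer OR developing OR develops
```

Five matching terms produce a five-way OR. A short prefix against a vocabulary drawn from millions of profiles could match thousands of terms, and the rewritten query grows accordingly.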
Also – be sure not to miss LinkedIn Principal Search Engineer Jake Mannix’s thorough and detailed comments below.
It appears we have much to look forward to with regard to LinkedIn search functionality!