LinkedIn Search: What it COULD and SHOULD be

Posted by | July 06, 2009 | Extended Boolean, LinkedIn | 20 Comments

Did you know that LinkedIn currently has the ability to deliver incredibly powerful search functionality to its users – WELL beyond what we all have access to now?  What am I talking about?

I’m excited to tell you, but quite honestly, I can’t believe it’s taken me this long to put 2 and 2 together. Have you ever really watched the video clip below that you can find on LinkedIn’s Learning Center as well as on YouTube?

If you ignore the information regarding the new features and pay close attention to the video, you can hear Esteban talk about how LinkedIn is always on the lookout for talented Lucene Open Source engineers and watch him search for them. Lucene is an open source text search engine that I’ve written about in multiple posts for its advanced search functionality, including extended Boolean.

LinkedIn uses Lucene as their Text Search Engine

When I first watched the video, I never gave the Lucene stuff a second thought because LinkedIn doesn’t actually offer any of Lucene’s truly advanced search functionality – LinkedIn doesn’t even support root-word/wildcard searching, let alone extended Boolean search. I figured if they were already using Lucene for their text search engine they would offer all of Lucene’s search functionality, which they don’t.

Then I watched the video again the other day (not exactly sure why) and it made me curious. Had they already implemented Lucene, or were they looking to do so? I did some research to see if I could confirm a link between LinkedIn and Lucene (pun intended). Although TechCrunch reported that LinkedIn upgraded its people search, they failed to mention the technology behind the upgrade. I was then able to dig up an article that verified that LinkedIn had implemented Lucene as their text search engine.

So What Can LinkedIn Do With Lucene?

I’m glad you asked – be prepared to be amazed! 

Wildcard Searches

Lucene supports single and multiple character wildcard searches within single terms. That means you could search for the term develop* and LinkedIn would return results of people who mention every word that begins with the root of “develop:” develop, developed, developing, developer, develops, etc. That would mean no more having to type out long OR statements where you have to think about all of the different ways a particular term can be written.
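To make the idea concrete, here is a minimal Python sketch of what a trailing wildcard does conceptually: the engine looks up every indexed term that starts with the prefix and searches for any of them. This is not Lucene code and says nothing about how LinkedIn implements search; the term list and the function name expand_prefix are made up for illustration.

```python
# Illustrative only: a toy term dictionary standing in for a search index.
INDEXED_TERMS = ["design", "develop", "developed", "developer",
                 "developing", "develops", "device"]

def expand_prefix(pattern, terms):
    """Expand a trailing-wildcard pattern such as 'develop*' into matching terms."""
    if not pattern.endswith("*"):
        return [pattern]  # no wildcard, nothing to expand
    prefix = pattern[:-1].lower()
    return sorted(t for t in terms if t.lower().startswith(prefix))

expanded = expand_prefix("develop*", INDEXED_TERMS)

# The engine effectively builds the long OR statement so you don't have to.
or_query = " OR ".join(expanded)
print(or_query)
```

The more terms a prefix expands to, the longer that hidden OR statement gets, which is worth keeping in mind when thinking about the cost of wildcard searches.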

Proximity Search

Lucene supports configurable proximity search – or the ability to find words that are within a specific distance from each other (3 words, 8 words, your choice). For example, if you wanted to find people who mention that they have experience configuring routers, you can use Lucene’s proximity search functionality via the tilde symbol (~) to target phrases where some mention of config* is made within 5 words of router or routers.

“config* rout*”~5

This functionality is HUGE, as it allows sourcers and recruiters to drastically increase the relevance of search results by targeting people based on their responsibilities rather than basic keyword search (aka buzzword bingo). Without forcing some variant of the word “configure” to be within 5 words of “router” or “routers,” you can just as easily return results of people who do not mention that they have been specifically responsible for configuring routers – you could end up finding people who mention that they’ve configured other things (e.g. servers), and who make 1 mention of the word “router” in their skill summary because they have a router at home (but no paid professional experience). That would be what I call a false positive hit. The result mentioned the search terms, but it did not match the intent of my search – which is to find people who have been responsible for configuring routers.
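The difference between a true hit and a false positive can be sketched in a few lines of Python. This is a conceptual check only, not Lucene’s actual slop calculation or query parsing; the function name within_distance and both sample profile texts are invented for illustration.

```python
import re

def within_distance(text, prefix_a, prefix_b, max_distance):
    """Return True if a word starting with prefix_a appears within
    max_distance word positions of a word starting with prefix_b."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w.startswith(prefix_a)]
    positions_b = [i for i, w in enumerate(words) if w.startswith(prefix_b)]
    return any(abs(a - b) <= max_distance
               for a in positions_a for b in positions_b)

# Hypothetical profile snippets.
relevant = "Responsible for configuring and troubleshooting Cisco routers and switches."
false_positive = ("Configured Windows servers for a regional office and supported "
                  "end users with desktop issues. Skills: servers, Active Directory, router.")

print(within_distance(relevant, "config", "rout", 5))        # True
print(within_distance(false_positive, "config", "rout", 5))  # False
```

Both snippets contain a form of “configure” and a form of “router,” so a plain keyword search would return both. Only the first one keeps the two terms close enough together to suggest the person actually configured routers.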

When I talk about targeting people based on their responsibilities, I mean searching for responsibility verbs (administer, manage, develop, design, configure, filing, reconcile, audit, etc.) mentioned in close proximity (in the same sentence) to skill/technology nouns (oracle, statements, servers, projects, reports, Microsoft Dynamics, SAP, etc.). Being able to control how close words like those are in proximity to each other – down to the sentence level – allows sourcers and recruiters to perform semantic search (aka, natural language search). Essentially, you are able to find people based on what they DO, not just the words they happen to mention in their profile.

If you’re new to the concept of semantic search, I strongly suggest you read these articles (Semantic Search 1, Semantic Search 2, Semantic Search with Proximity, Semantic Search without Proximity) that will thoroughly explain the concept as well as show you how you can currently leverage proximity search to your advantage on Monster and Exalead.

Variable Term Weighting

Here’s the other biggie – Lucene allows you to control the relevance weighting of your search terms. Lucene calls it “boosting.” In other words – you can tell Lucene that specific terms in your search string are more important/relevant to you than others. That’s right – instead of the search engine taking all of your search terms and “deciding” which results are the most relevant, YOU control the search relevance based on which terms you think are more critical and match the intent of what you’re specifically looking for.

To boost a term with Lucene you can use the caret (^) symbol with a boost factor (a number) at the end of the term you are searching. The higher the boost factor, the more relevant the term will be, so boosting allows you to control the relevance of your results by boosting specific terms.

For example, if you are searching for the following terms: Unix, Windows, Citrix, VMware, storage, and you really need people who have significant Citrix experience, you can boost that term with the ^ symbol:

Unix AND Windows AND Citrix^5 AND VMware AND storage

This will make profiles with more mentions of the term Citrix appear more relevant and thus rank higher in the search results. This is important, because people who have a lot of experience with Citrix (in terms of specific responsibilities and/or multiple positions in their career history in which they use Citrix) will likely have multiple mentions of Citrix in their profile. Boosting Citrix will result in bubbling all of the profiles with many mentions of Citrix to the top of the results.

This is especially critical because without the ability to “tell” the search engine which specific terms are actually most relevant to you, the search engine makes its own “decision” as to what’s relevant. And in the case of my example – the search engine may see profiles that mention the word Windows 20 times as highly relevant, even if they only mention Citrix once – which isn’t likely to be someone who matches my need of a strong Citrix professional.
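Here is a rough Python sketch of that Windows-versus-Citrix scenario. It assumes a deliberately simplified scoring rule (mention count multiplied by boost); Lucene’s real scoring formula is more involved, so treat this as an illustration of the ranking effect only. The profile names and mention counts are hypothetical.

```python
def score(term_counts, boosts):
    """Toy relevance score: each term's mention count times its boost (default 1)."""
    return sum(count * boosts.get(term, 1.0)
               for term, count in term_counts.items())

# Hypothetical profiles: how many times each search term is mentioned.
profiles = {
    "windows_heavy": {"unix": 1, "windows": 20, "citrix": 1, "vmware": 1, "storage": 1},
    "citrix_heavy":  {"unix": 1, "windows": 2,  "citrix": 6, "vmware": 1, "storage": 1},
}

def rank(boosts):
    """Return profile names ordered from highest to lowest toy score."""
    return sorted(profiles,
                  key=lambda name: score(profiles[name], boosts),
                  reverse=True)

print(rank({}))               # ['windows_heavy', 'citrix_heavy']
print(rank({"citrix": 5.0}))  # ['citrix_heavy', 'windows_heavy']
```

Under this toy rule the unboosted search puts the Windows-heavy profile first (24 vs. 11), while boosting Citrix by 5 flips the order (35 vs. 28), which is exactly the kind of control over ranking that boosting is meant to provide.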

In addition to boosting single terms, you can also boost phrases. By default, the boost factor is 1. Although the boost factor must be positive, it can be less than 1 (e.g. 0.2).

More Lucene Search Functionality

Lucene also supports fuzzy searching (finding matches of misspellings and similar words) based on the Levenshtein Distance, and range searches (similar to Google’s numrange search). To learn more, here is a page that lists all of Lucene’s search functionality.
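For the curious, the Levenshtein Distance is simply the number of single-character edits (insertions, deletions, or substitutions) needed to turn one word into another. The Python sketch below shows the standard dynamic-programming calculation and a toy fuzzy match built on it. It is not Lucene’s implementation; the helper fuzzy_matches, the word list, and the edit threshold are assumptions chosen for illustration.

```python
def levenshtein(a, b):
    """Count the single-character edits needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def fuzzy_matches(query, terms, max_edits):
    """Return the terms that are within max_edits edits of the query."""
    return [t for t in terms if levenshtein(query.lower(), t.lower()) <= max_edits]

# Hypothetical indexed terms; "recuiter" is a deliberate misspelling.
terms = ["recruiter", "recruiting", "receiver", "router"]

print(levenshtein("recuiter", "recruiter"))   # 1
print(fuzzy_matches("recuiter", terms, 1))    # ['recruiter']
```

Loosening the threshold to 2 edits would also pull in “receiver,” which shows why fuzzy search helps with misspellings but needs a sensible limit to avoid unrelated words.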

Conclusion

Now that you know that LinkedIn uses Lucene as their text search engine and you’ve seen all of the powerful search functionality Lucene has to offer – wouldn’t you like to be able to use wildcard searching, proximity search, term weighting, and fuzzy search when searching LinkedIn? I know I would! Those features can make a HUGE difference in the relevance of search results.

I’m still trying to figure out why LinkedIn doesn’t offer users all of Lucene’s search functionality as they’ve been using Lucene as their text search engine for at least 7 months now.

I’ve tried to communicate my search improvement suggestions to LinkedIn a couple of different ways. In June I sent a message to Esteban Kozak – Senior Product Manager overseeing search at LinkedIn – via LinkedIn (of course) that detailed all of my suggestions for improving LinkedIn’s search functionality, including wildcard search, proximity search, and term weighting – and I haven’t received a response.

I also caught William Uranga Tweeting from a LinkedIn customer advisory session last week, so I DM’d him and let him know I had a list of search recommendations and he kindly let me send them to him via email so he could share them during the session at LinkedIn. William wrote a post about his customer advisory session experience at LinkedIn – you can read it here.

We can only hope that sometime in the near future LinkedIn taps into the awesome search power of Lucene, enabling users to take control of search relevance and tap into semantic search. I know I’ve got my fingers crossed!

Update!

Esteban Kozak replied to my message with this helpful response:

1- Prefix matching: we are currently evaluating the release of prefix matching for names in order to enable a quick way to navigate your contacts from the mobile application. Prefix matching for free text queries is very expensive because the query needs to be translated into a huge OR statement in the back end. There are better ways to solve this problem more elegantly. We are investigating alternative approaches like stemming, automatic expansion at query time and other techniques to ensure good recall.

2- Proximity search / Term weighting: These two are much easier to open up and will be available shortly.

Also – be sure not to miss LinkedIn Principal Search Engineer Jake Mannix’s thorough and detailed comments below.

It appears we have much to look forward to with regard to LinkedIn search functionality!

About Glen Cathey

Glen Cathey is a sourcing and recruiting thought leader with over 16 years of experience working in large staffing agency and global RPO environments (>1,000 recruiters and nearly 100,000 hires annually). Starting out his career as a top producing recruiter, he quickly advanced into senior management roles and now currently serves as the SVP of Strategic Talent Acquisition and Innovation for Kforce, working out of their renowned National Recruiting Center with over 300 recruiters. Often requested to speak on sourcing and recruiting best practices, trends and strategies, Glen has traveled internationally to present at many talent acquisition conferences (5X LinkedIn Talent Connect - U.S. '10, '11, '12, Toronto '12, London '12, 2X Australasian Talent Conference - Sydney & Melbourne '11, '12, 6X SourceCon, 2X TruLondon, 2X HCI) and is regularly requested to present to companies (e.g., PwC, Deloitte, Intel, Booz Allen Hamilton, Citigroup, etc.). This blog is his personal passion and does not represent the views or opinions of anyone other than himself.

  • http://www.CruiterTalk.com Ryan Leary

    It’s amazing that they have not yet opened this feature. I agree that having the ability to narrow your search is critical. We’ve got the same functionality internally here and the results as compared to those that do not are tremendous. I’m starting to feel that LinkedIn may be smelling their roses a little too much.

    I guess it goes back to the argument of monetization, but in the end that ruins the fun and efficiency of the tool to the masses. I get it that LI has to make money, but they can’t forget about the people that put them in the position they are in. Great posting.

  • http://www.linkedin.com/in/adamnash Adam Nash

    Glen,

    This is a very clever post. Yes, we actually take advantage of some of these features already, particularly around our search intelligence, query handling and relevance. But it is still early days.

    I think you’ll be very pleased with the continued enhancements we make to LinkedIn search this year. We have a team completely focused on this area.

    Take care,
    Adam

  • phil

    This is great information. I landed on your site because it seems the folks at LinkedIn have broken their search. I was trying to do a standard OR search for more than one company and the simple Boolean searches using AND/OR do not work. Any thoughts on why this is?

  • phil

    Thank you for contacting LinkedIn Customer Support.

    Our search function currently does not support the “OR” keyword search.

    I am also including a link to our free Learning Center ( http://learn.linkedin.com/training ). This page gives you access to free e-learning modules and webinar sessions designed to meet your LinkedIn learning needs. Many members have found watching one of the learn-at-your-own-pace modules like ‘Creating and Managing your Profile’ to be extremely helpful. You can also register to attend one of our webinars like ‘LinkedIn Subscriptions: Get the Most out of your Premium Subscription’. I hope that you will find these learning opportunities useful as well.

  • http://aces.arbita.net Glenn Gutmacher

    @Glen – nice, and we’ll push for it with our LinkedIn contacts, too (Shally Steckerl was the first external vendor officially authorized by LinkedIn to do advanced LI training).

    @Phil – that is simply not true. I just ran a simple search of the type you indicated this morning — chief AND (oncologist OR neurologist) — under LI Advanced People Search just to make sure I’d have distinct results and it definitely pulls up both expected result sets.

  • Henry

    Here are more search tips right from linkedin:
    http://www.linkedin.com/static?key=pop/pop_more_search

  • http://www.linkedin.com/in/jakemannix Jake Mannix

    Glen,
    While I’m not in a position to speak for the company, I can say, as one of the engineers responsible for search at LinkedIn, that we’ve been using Lucene since the start of the company, not just the past 7 months, and in fact multiple engineers here have contributed code both back to the core Lucene library itself and to additional search libraries which extend and enhance Lucene (in particular: the Zoie realtime search package: http://zoie.googlecode.com was developed at LinkedIn, and released as an open-source extension of Lucene).
    As Glen Gutmacher notes above, full boolean functionality (as well as exact phrase matching) is available in LinkedIn search – try something like: “venture capital” AND (equity OR fund AND NOT hedge)
    Briefly regarding some of Lucene’s capabilities which we are *not* exposing currently (especially things like wildcard or fuzzy matching) – remember that we are providing search across a result set of more than forty million user profiles as they are being updated in real time, and the text alone is not the only component to the relevance: the searcher’s personal view on the social graph plays a strong role, and every user has a different set of connections (and 2nd degree connections, etc) which informs this relevance component.
    In short: there’s a lot going on when you do just a simple search, and keeping the performance within desired latency specifications puts strong constraints on what kinds of queries we can perform as we continue to scale the site (imagine the amount of processing our servers would have to do to retrieve the results of the simple query: “manag* team”~10 ). Performance is on the forefront of our minds when doing due diligence on whether to implement any given feature.
    Allowing users to provide their own boosting parameters is an interesting thought, but as this is not something typical search engine users are accustomed to (look at how few, statistically speaking, people even use boolean queries, either on LinkedIn, or with other search pages) I would be surprised if this would be a heavily used feature.

  • Boolean Black Belt

    Adam,
    Thank you for reading my post and leaving your comment. I am definitely looking forward to the upcoming search enhancements!

  • Boolean Black Belt

    Jake,
    Thank you VERY much for responding to my post with your detailed comment. You’ve essentially answered one of the questions I posed in my article, which was why LinkedIn does not allow users to use the wildcard, proximity, and term weighting/boosting search functionality that Lucene supports – that was very helpful. I also appreciated finding out that LinkedIn has always used Lucene – I’d done some research online and I could not find anything that definitively linked LinkedIn and Lucene prior to November 2008.

    I have had the privilege of working with software and database engineers responsible for an enterprise portal (involving search) and a 40+ TB data warehouse for a Fortune 50 company, so although I am clearly non-technical, I can appreciate (although certainly nowhere near your level of understanding) the challenges associated with searching profiles that are being updated in real time, and I am aware that prefix searching can be a significant drain on resources (although it can be implemented with extremely fast search execution).

    I do have a few questions for you if you have the time – I’d love to tap your significant experience and insight on a couple of things:

    #1 When you sort results by keyword, does LinkedIn search performance still suffer from the challenge of taking into consideration the searcher’s personal network in terms of determining relevance? If sorting results by keyword only is NOT encumbered by the searcher’s connections when considering relevance, could it be less resource-intensive to offer more search functionality such as prefix and configurable proximity? If so, perhaps offering a truly separate keyword-only search and results sorting, unaffected by the user’s network/connections, could offer users more advanced search functionality that could otherwise be too complex/resource dependent when sorting by what LinkedIn terms as “relevance” – which incorporates both keyword and connections. Thoughts?

    #2 If not for the real-time updates/search aspect, does enabling wildcard/prefix search still pose as serious of a resource drain for searching LinkedIn? There are databases of well over 40M records where new records are being added every minute (but each record is typically not altered after being entered) that seem to effortlessly handle queries with 10+ prefix/wildcard terms. What in your opinion poses the biggest challenge for LinkedIn in offering prefix searching?

    I’m glad you find the idea of offering users the ability to boost specific search terms interesting, and I do agree that most people are unaware of this kind of search functionality. However, that fact alone should not determine whether or not such a feature is offered. In fact, being able to offer search term boosting could be a nice way to differentiate LinkedIn search, in that I am not aware of any social network, social media application, online resume database, or Internet search engine that offers this feature. Bragging rights, if you will (e.g., the Ferrari of people search at no additional cost). Plus, boosting terms is not rocket science – I am confident that many users will be able to easily take advantage of the ability to control the relevance of their own results based on what they feel are the most important search terms.

    Ultimately, information management is all about retrieval – data is worthless without the ability to find exactly what you need when you need it. Basic keyword search with Boolean logic (AND/OR/NOT) is an approach that only allows for basic retrieval, which is (IMO) intrinsically limited and imprecise when it comes to retrieving relevant results and is prone to a large percentage of false positive results. I believe that most people only know how to use basic keyword search because they have not been offered anything else.

    When users with premium access to LinkedIn can view 300, 500, 700, or 1000+ results, I’m pretty confident most of them aren’t really viewing past the first 100 – 200, as most people don’t have the time to do so. As such, to truly provide value to your paying customers, it is critical for users to be able to take some control over the relevance of their search results so that the first 100 to 200 are in fact the most relevant to them – in other words, that the first 100 – 200 results are the ones that most closely match the INTENT of the user’s search, not just the keywords. Search terms, in and of themselves, do not determine relevance – the search engine does – and the search engine does not and cannot “know” the intent of the user. I can tell you from personal experience that boosting (controlling specific term relevance weighting) and configurable proximity search can play a HUGE role in users being able to take true control over the relevance of their results.

  • http://www.linkedin.com/in/jakemannix Jake Mannix

    Hi Glen,
    I’m glad my comments could help you understand a little of what we’re doing behind the scenes at LinkedIn. I can’t get into too many details about all of the “Product”-centric decisions on what gets implemented when, but I’ll briefly try to address some of your technical questions:
    If we had 2 completely separate systems, one for doing strictly keyword-based relevance, and not being as up to date (i.e. not being realtime – taking only batch updates on a daily or weekly basis), and one for using the social-graph-incorporated relevance with realtime updates, with a 100-millisecond backend SLA, then it’s certainly possible that we could allow much more resource intensive queries such as arbitrary wildcard and prefix queries on the former system. But what would the expense be, in terms of development and maintenance, as well as hardware, to maintain both systems? If you try to run both kinds of queries on the same exact system, you run into the problem of resource allocation: the small minority of users who query with very computationally expensive queries end up locking up searching resources for everyone else. It’s a delicate balancing act, and we have to weigh the costs to the many of exposing additional functionality to the few. To give you an idea about the amount of CPU time I’m talking about, wildcard queries can take anywhere from 10-100 times (or more!) longer to execute in Lucene than simple boolean queries (boosting doesn’t really affect performance: we do boosting behind the scenes already, it’s just not exposed in the UI), and the user may not know how slow it’s going to be when executing it, because the latency is highly dependent on the number of terms the wildcard expands to, and in turn how many hits those terms generate. Because we have a distributed system to serve the search requests (your query goes not to one index, but to roughly 10 at the same time, each with a subset of the userbase), 10-100 times the latency means one user could be hogging the resources of 10 CPU-cores for anywhere from 100ms to 10 seconds.
    Another way to put it: if we allow queries which are 100 times as expensive, and only 5% of our querying userbase takes advantage of this functionality, we would be using up 5x the *total load* on our search servers! This being a popular public site under non-insignificant load, this is a serious concern.

    This is not to say that one can’t do prefix/wildcard-based searching on a Lucene-backed search system with a large number of documents, in a performant way. It’s just that doing so while also serving a heavy load of textually-simpler (but also taking into account the multiple language preferences of our users, and as mentioned before, the social-graph component) more popular queries with low latency, in the same system is highly nontrivial, and having multiple systems serving the same data in different ways poses its own resourcing challenges.

    Allowing much more advanced control over query relevance is more a question of “what Products LinkedIn should provide”, and is not really my bailiwick, but also digs into the question of who LinkedIn builds products for: we obviously try to serve our Power Users, who do sourcing for a living, but LinkedIn is not just for them – it’s for everyone who wants to take control of their career as if it were a small business, for hiring managers who aren’t search-engine experts, for people looking to connect with former coworkers and clients. The average user, while wanting the search system to take their “intent” into account, may not have the time or inclination to spend a lot of time learning query syntax or a new UI to plug in how much they care about each term in their query.

    On the other hand, users in the past decade have been trained by Google to assume that the search engine will be “smart enough” to know what they mean without them being very specific. Similarly, at LinkedIn, we do a lot of work with offline data mining to do things like figuring out that when a user searches for “VP IBM”, they’re looking for someone with the *title* VP and the *company* IBM, even without specifying ccompany:IBM AND ctitle:VP (because a minuscule fraction of our userbase uses the query-field based syntax we expose that you are familiar with). Similarly, since the typical user is looking for people who currently do that thing, instead of in the past, we dynamically turn VP IBM into (ccompany:IBM AND ctitle:VP)^current_boost OR (pcompany:IBM AND ptitle:VP)^past_boost OR (VP AND IBM)^body_boost, where current_boost > past_boost > body_boost are boost parameters we need to figure out based on how well it serves our users (of course, there’s more going on here as well: when someone puts in “Dell” – are they looking for Michael Dell, the CEO, or are they looking for someone *at* Dell – we do some fancy magic to figure out the relative probabilities of both, if the user doesn’t specify by using the company or name fields, and adjust the boosts accordingly: (lname:Dell^last_name_boost)^last_name_probability OR (ccompany:Dell^current_boost OR pcompany:Dell^past_boost)^company_probability and other things like this).

    But since you really want more control over the kinds of searches you can do on the site… well, I can’t say anything now, but just wait a few weeks or so, you’ll start to get a taste of some of the stuff our Search Team has been brewing for quite some time, and I hope you’ll like it. :)

  • http://www.booleanblackbelt.com Boolean Black Belt

    Jake,
    It’s been a short while since we exchanged comments and ideas regarding LinkedIn search – I was wondering when some of the new search functionality you had hinted at might finally be released?

    Also, I have a question for you regarding some claims I have heard someone make about the ability to alter their search ranking in LinkedIn – I was hoping to point you towards a video and either debunk or confirm what this person is claiming.

    Looking forward to your response. Thanks!
