Website source code searching made easier

NerdyData is a new search engine that lets you mine the code behind web pages

By Phil Johnson


NerdyData's source code search interface

Image credit: ITworld/Phil Johnson

We’re all familiar with search engines like Google and Bing that search through website text and keywords, but what if you’re interested in querying the code behind a site? I did a little source code mining for a recent article, which involved querying raw source code stored in GitHub using Google BigQuery. It wasn’t especially hard, but it definitely required jumping through a few hoops. Well, now there’s NerdyData, a new tool for searching the source code of live websites.

NerdyData, which launched in July, has indexed the HTML, JavaScript, CSS and plain text of more than 140 million websites. Users can run several kinds of searches: a free-form source code search for a given phrase; a comparative search of up to five terms, to find the domains using those terms; a backlink/image search, to find sites referencing a given URL; and an SEO search, to query inside a number of predefined tags, such as TITLE and META tags, Google Analytics and AdSense tags and Twitter buttons.
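To make the SEO search idea concrete, here is a minimal sketch in Python of what querying inside TITLE and META tags could look like on a handful of pages you already have in hand. It is only an illustration of the concept, not NerdyData's implementation or API; the names `SeoTagExtractor`, `seo_fields` and `seo_search` are made up for this example, and the pages are assumed to be a simple mapping of domain to HTML.

```python
from html.parser import HTMLParser


class SeoTagExtractor(HTMLParser):
    """Collects the TITLE text and META name/content pairs from one page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands us tag and attribute names already lowercased.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            name = attr_map.get("name")
            content = attr_map.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def seo_fields(html):
    """Return the title and named META tags found in one HTML document."""
    parser = SeoTagExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "meta": parser.meta}


def seo_search(pages, term):
    """Return the domains whose TITLE or META content contains term.

    `pages` is a hypothetical dict of domain -> HTML, standing in for an index.
    Body text is deliberately ignored; only the predefined tags are searched.
    """
    needle = term.lower()
    hits = []
    for domain, html in pages.items():
        fields = seo_fields(html)
        haystacks = [fields["title"], *fields["meta"].values()]
        if any(needle in h.lower() for h in haystacks):
            hits.append(domain)
    return hits
```

The point of the sketch is the restriction: a term that appears only in a page's body text does not count as a hit, because only the TITLE and META fields are searched.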

Here’s the main catch: the site is subscription based. Two subscription levels are currently offered: Professional ($99/month) and Enterprise ($149/month). A subscription buys you credits on the site, which are then used to pay for queries (each source code search costs 2 credits). Anyone can try the service for free; just go to the site and you’ll have 20 credits to play with.

I tinkered around with NerdyData and was pretty impressed. However, if I were a developer using it to find particular chunks of code to see how something was implemented, I think I’d find the tool a little lacking. Ironically, its simplicity, which is one of its strengths, is also one of its weaknesses. It only searches for exact matches of the phrase you enter, and it only matches against alphanumeric characters, hyphens and dots (so you can’t match on <, ==, & or other code symbols). You can’t do anything really fancy like, say, use regular expressions, as I did when using BigQuery to search through GitHub code.
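To illustrate the difference, here is a rough Python sketch that contrasts an exact-phrase match limited to alphanumeric characters, hyphens and dots with a regular expression search. It is not how NerdyData actually works under the hood; in particular, treating every other character as a separator is my assumption, and the function names are invented for the example.

```python
import re

# Assumption: only letters, digits, dots and hyphens are searchable;
# anything else (<, ==, &, braces, quotes, whitespace) acts as a separator.
TOKEN = re.compile(r"[A-Za-z0-9.-]+")


def tokens(text):
    """Split text into runs of searchable characters."""
    return TOKEN.findall(text)


def exact_phrase_match(source, phrase):
    """True if the phrase's tokens appear consecutively, in order, in source."""
    src = tokens(source)
    want = tokens(phrase)
    if not want:
        # The phrase was made up entirely of symbols, so nothing can match.
        return False
    n = len(want)
    return any(src[i:i + n] == want for i in range(len(src) - n + 1))


def regex_match(source, pattern):
    """True if the regular expression matches anywhere in the raw source."""
    return re.search(pattern, source) is not None


source = 'if (a == b) { jQuery.ajax({url: "x"}); }'

print(exact_phrase_match(source, "jQuery.ajax"))   # dots are searchable
print(exact_phrase_match(source, "=="))            # symbols alone are not
print(regex_match(source, r"\w+\s*==\s*\w+"))      # a regex can see the ==
```

Under these assumptions, a search for "a == b" would behave the same as a search for "a b", which is exactly the kind of imprecision that makes hunting for specific code constructs hard without regular expressions.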

However, the tool should be of a lot more interest to marketers, as you can use the other types of searches for more business-oriented reasons. For example, you can use the comparative search to see how many sites are talking about your business, or the backlink search to see who’s linking back to you. The source code search could also be interesting if you want to see who’s using your open source code, for example.

While these are pretty valuable services, time will tell if they’re valuable enough to induce people to pay the subscription fees. We’ll just have to wait and see.

Read more of Phil Johnson's #Tech blog and follow the latest IT news at ITworld. Follow Phil on Twitter at @itwphiljohnson. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.
