Source Code Search Engine

Concept

Programmers spend 50% of their time just looking at source code. When trying to understand how a system is organized, they often must look at and across many files that make up the system.

It is difficult to find code in large software systems of thousands of files coded in multiple programming languages. Programmers often use string search tools such as Unix grep or an IDE editor command. grep searches are not fast on thousands of files, and do not provide any easy way to view the matching text in its source file. IDE searches are limited at best to the current project, not the entire source code base.

The Search Engine provides an interactive interface for searching a large source code base quickly. It uses the language structure of each language to provide far more precise answers than simple string searches can produce. For any query, the Search Engine offers a list of matches with surrounding context; the user can select a specific match and immediately inspect the source file.

Search Engine features

Screen Shots

Metrics

The Search Engine computes Cyclomatic and Halstead Complexity metrics, as well as Source Line, Code Line, Comment Line and Blank Line counts for each of the files indexed. This gives users an easy way to determine the relative complexity of system modules of interest. You can see an example metrics result file.
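As a rough illustration of the line counts mentioned above, the following Python sketch classifies the lines of a C-like source text. It is not the Search Engine's implementation. The function name `count_lines` and the classification rule (a line holding both code and a comment counts as a code line) are assumptions made for this example; the product's exact definitions are not stated here.

```python
def count_lines(text):
    """Count source, code, comment and blank lines in C-like source text.

    Assumed rule: a line containing any code counts as a code line,
    even if it also carries a comment.
    """
    counts = {"source": 0, "code": 0, "comment": 0, "blank": 0}
    in_block = False  # True while inside a /* ... */ comment spanning lines
    for line in text.splitlines():
        counts["source"] += 1
        has_code = False
        has_comment = in_block  # a line inside a block comment is a comment line
        quote = None            # current string/char delimiter, if any
        i = 0
        while i < len(line):
            ch = line[i]
            two = line[i:i + 2]
            if in_block:
                if two == "*/":
                    in_block = False
                    i += 2
                else:
                    i += 1
            elif quote:
                if ch == "\\":
                    i += 2      # skip the escaped character
                else:
                    if ch == quote:
                        quote = None
                    i += 1
            elif two == "/*":
                in_block = True
                has_comment = True
                i += 2
            elif two == "//":
                has_comment = True
                break           # rest of the line is a comment
            else:
                if ch in "\"'":
                    quote = ch  # comment markers inside literals are ignored
                if not ch.isspace():
                    has_code = True
                i += 1
        if has_code:
            counts["code"] += 1
        elif has_comment:
            counts["comment"] += 1
        else:
            counts["blank"] += 1
    return counts
```

Calling `count_lines(open(path).read())` on each indexed file would give a per-file table comparable in spirit to the metrics result file, although the real tool also computes Cyclomatic and Halstead measures that this sketch does not attempt.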

Productivity Comparison with grep on Linux kernel
(7.3 million lines, 18030 files, mixed C and ASM files)

2.8 Seconds: Source Code Search Engine
Using a search query:

        I=Interrupt*

to find an identifier starting with Interrupt takes the Search Engine 2.8 seconds. It finds 229 hits, all of them in identifiers (because that is what was asked for). It looked only at .c, .h, or .S files. Using the UI, you can easily scroll forwards and backwards through the short list of hits to select one. You can click on a hit to instantly see it in the context of the full source text file with the hit highlighted.

56.6 Seconds: grep
Using cygwin grep for the same task:

        grep Interrupt -R C:\work\linux-2.6.19.2

takes 56.6 seconds and produces 5297 hits (most of them in comments or in the middle of identifiers we didn't want). Looking at 5297 hits is frankly crazy. After deciding what the right hit is, you still have to type the file name into your editor to see the full source text around the hit. With considerable thought you might write a grep regular expression that weakly approximates what the Source Code Search Engine does more carefully (consider ignoring hits in strings and comments). But that will take you much longer than a minute. grep also climbed through some 2000 additional files in the Linux directories that aren't .c, .h, or .S files, adding to its cost. You can write a more complex find and grep command that filters out the unwanted files, but that requires thought and more typing.

Difference in productivity: 20x or better on just the search part. Since the Search Engine also shows you the full source text with a single click, you can examine a lot of hits in context very quickly.

Examples were run on an Intel i7 2.39 GHz machine under Windows XP with a 5200 RPM disk and 6GB RAM; source code files were defragged before the test. Both samples were run twice to fill the cache, and the second value is reported here.

Download an evaluation version for Java, C#, C++, COBOL and Pseudo code

Technology

Computer languages are typically structured from a set of allowed elements ("lexemes"), such as identifiers, strings, numbers, operators and punctuation, as well as various kinds of text blocks such as blanks and comments which are ignored by language processors. The Search Engine uses a language-specific scanner to scan each source file and break it into lexemes according to the precise rules for that language. These scanners are derived from the language definitions used by the DMS Software Reengineering Toolkit, which is used for language-accurate analysis and transformation. Lexemes with variable content (identifiers, strings, comments, numbers) are converted from their source code format to a normal form so that character escapes and radix differences are removed, making searches much easier to specify across languages. Scanned lexemes are then indexed to enable fast searches.
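The scan, normalize and index steps described above can be pictured with a small Python sketch. This is a toy under stated assumptions, not the DMS-derived scanners: the regular expression covers only a small C-like subset, and the names `TOKEN_RE`, `normalize`, `build_index` and `search`, the one-letter lexeme kinds other than `I`, and the use of shell-style wildcards for queries are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict
from fnmatch import fnmatchcase

# Hypothetical lexeme kinds: C=comment, S=string, N=number, I=identifier, O=other.
TOKEN_RE = re.compile(r"""
    (?P<C>/\*.*?\*/|//[^\n]*)
  | (?P<S>"(?:\\.|[^"\\\n])*")
  | (?P<N>0[xX][0-9a-fA-F]+|0[0-7]*|[1-9]\d*)
  | (?P<I>[A-Za-z_]\w*)
  | (?P<O>[^\s\w])
""", re.VERBOSE | re.DOTALL)

_ESCAPES = {"n": "\n", "t": "\t", "0": "\0"}

def _unescape(body):
    """Replace a few C-style escapes (\\n, \\t, \\0, \\xHH) by the characters they denote."""
    def repl(m):
        esc = m.group(1)
        if len(esc) == 3:                       # form xHH
            return chr(int(esc[1:], 16))
        return _ESCAPES.get(esc, esc)
    return re.sub(r"\\(x[0-9a-fA-F]{2}|.)", repl, body, flags=re.DOTALL)

def normalize(kind, text):
    """Convert a lexeme to a normal form independent of radix and escapes."""
    if kind == "N":
        if text[:2].lower() == "0x":
            return str(int(text, 16))
        if len(text) > 1 and text[0] == "0":
            return str(int(text, 8))            # C-style octal literal
        return str(int(text, 10))
    if kind == "S":
        return _unescape(text[1:-1])
    if kind == "C":
        body = text[2:-2] if text.startswith("/*") else text[2:]
        return " ".join(body.split())           # collapse whitespace
    return text

def build_index(files):
    """Map kind -> normalized lexeme -> list of (path, line) for all files."""
    index = defaultdict(lambda: defaultdict(list))
    for path, source in files.items():
        for m in TOKEN_RE.finditer(source):
            kind = m.lastgroup
            line = source.count("\n", 0, m.start()) + 1
            index[kind][normalize(kind, m.group())].append((path, line))
    return index

def search(index, query):
    """Answer a query such as 'I=Interrupt*' with sorted (lexeme, path, line) hits."""
    kind, pattern = query.split("=", 1)
    hits = []
    for value, places in index.get(kind, {}).items():
        if fnmatchcase(value, pattern):
            hits.extend((value, path, line) for path, line in places)
    return sorted(hits)

if __name__ == "__main__":
    demo = {"irq.c": "int InterruptCount = 0x1F; /* Interrupt handler */\n"}
    print(search(build_index(demo), "I=Interrupt*"))
```

Because the index is keyed by lexeme kind, a query restricted to identifiers never touches comments or strings, and because numbers are stored in normal form, a search for 31 would also find 0x1F. That is the intuition behind the precision claimed for the Search Engine, even though the real scanners follow each language's full lexical rules.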

It is expected that the complete set of source files of interest is collected, scanned and indexed on a periodic basis, such as daily or weekly. The collected sources are available to the Search Engine for display.

The Search Engine is presently available on Windows 2000, XP and Vista.

Available Lexical Scanners

SD offers a family of lexical scanners on DMS. Presently available are:

The following scanners are Beta. Early adopters, please inquire:

Custom Scanning Options

Semantic Designs can build custom scanners with special features:



