SPLevo/Variation Point Analysis/Semantic Relationship VPM Analyzer

CAUTION

This project is still under development. For further details please contact Benjamin Klatt.

General

Rough Introduction

The Semantic Relationship VPM analyzer identifies semantic relationships between Variation Points based on the terms used in the software artifacts implementing the VPs' variants. The necessary text analysis is done via the Lucene search framework. The analyzer offers several configurations to adjust the analysis.

Major Steps

The semantic analysis consists of two major steps:

  1. Index the terms contained in the Variation Points' implementing artifacts
  2. Find relationships based on the terms and their frequencies in the variation points.

The indexing can be adjusted through configuration (see below).
To find relationships, the analyzer provides a set of finders implementing different strategies to interpret the term occurrences and frequencies.


Relationship Identification

The analyzer provides three different strategies to detect shared terms, described in the following subsections.

By default, only the Shared Term Finder is activated and configured to react to at least one shared term.

Overall-Similarity-Finder (Basic Shared Term Finder)

Finds variation points sharing at least one term, or a configured share of terms.
This Finder computes the overall similarity of two VPs. To do so, it extracts all terms of a document and then searches for documents having at least a specified percentage (a configuration value) of these terms in common.
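
For illustration, here is a minimal sketch of how such a query could be built with Lucene's BooleanQuery and a minimum-should-match constraint (assuming a Lucene 4.x API; the class and method names are illustrative and not the actual SPLevo implementation):

<source lang="java">
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class OverallSimilarityQuerySketch {

    /** Builds a query matching documents that share at least minSimilarity of the given terms. */
    public static Query buildQuery(String fieldName, Map<String, Integer> termFrequencies,
            double minSimilarity) {
        BooleanQuery query = new BooleanQuery();
        for (String term : termFrequencies.keySet()) {
            // Every term of the current document is an optional (SHOULD) clause ...
            query.add(new TermQuery(new Term(fieldName, term)), Occur.SHOULD);
        }
        // ... but a minimum share of the clauses has to match a candidate document.
        int minimumMatches = (int) Math.ceil(termFrequencies.size() * minSimilarity);
        query.setMinimumNumberShouldMatch(minimumMatches);
        return query;
    }
}
</source>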

Rare-Term-Finder

Calculates the rarest terms used in each variation point and detects relationships only for those.
This Finder uses the rarest terms of a VP and searches the index using these terms to find related variation points. Use the max. percentage configuration to adjust the rarity threshold. Rare terms are assumed to carry more semantics than terms used in nearly every variation point. A term becomes part of the search if the following holds:

<math>\frac{\text{term frequency}}{\text{total number of terms in the document}} \leq \text{max. percentage}</math>
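
The following plain-Java sketch illustrates this selection rule; the class, method, and parameter names are hypothetical and not taken from the SPLevo sources:

<source lang="java">
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RareTermSelectionSketch {

    /** Returns all terms whose relative frequency in the document is at most maxPercentage (0..1). */
    public static Set<String> selectRareTerms(Map<String, Integer> termFrequencies, double maxPercentage) {
        int totalTermCount = 0;
        for (int frequency : termFrequencies.values()) {
            totalTermCount += frequency;
        }
        Set<String> rareTerms = new HashSet<String>();
        for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
            double relativeFrequency = (double) entry.getValue() / totalTermCount;
            if (relativeFrequency <= maxPercentage) {
                rareTerms.add(entry.getKey());
            }
        }
        return rareTerms;
    }
}
</source>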

Top-N-Finder

Determines the most frequent terms of the whole index and detects relationships only for those.
The Top-N-Finder first calculates the top N most frequent terms in the whole index. Then it filters out all terms that occur in less than the specified "least document frequency" percentage of all documents. The remaining terms are considered to carry information about functional features and are used to form clusters: all documents containing such a term are clustered, as sketched below.
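
A plain-Java sketch of this filtering and clustering idea (all names are hypothetical and only illustrate the strategy):

<source lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TopNClusteringSketch {

    /**
     * @param topNTerms          the N most frequent terms of the whole index
     * @param termDocuments      maps each term to the IDs of the documents containing it
     * @param leastDocFrequency  minimum share of documents a term has to occur in (0..1)
     * @param totalDocumentCount number of documents in the index
     */
    public static List<Set<String>> buildClusters(List<String> topNTerms,
            Map<String, Set<String>> termDocuments, double leastDocFrequency, int totalDocumentCount) {
        List<Set<String>> clusters = new ArrayList<Set<String>>();
        for (String term : topNTerms) {
            Set<String> documents = termDocuments.get(term);
            if (documents == null) {
                continue;
            }
            // Terms occurring in too few documents are filtered out.
            double documentShare = (double) documents.size() / totalDocumentCount;
            if (documentShare < leastDocFrequency) {
                continue;
            }
            // All documents containing the remaining term form one cluster;
            // relationships are created between the cluster members.
            clusters.add(documents);
        }
        return clusters;
    }
}
</source>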

Term Selection

The terms to use for the analysis are provided by technology-specific extensions (SemanticContentProvider). Such an extension accepts a software element and returns a list of terms potentially carrying semantics (separated into code and comments). For example, the JaMoPP extension returns the names of methods or classifiers.

Once all terms are collected, a term processing is applied to prepare the terms for the analysis (a sketch follows the list):

  1. Split on Whitespace Characters (Lucene's WhitespaceTokenizer / java.lang.Character.isWhitespace() )
  2. Split on case change (CamelCase)
  3. Filter out Stop-Words
  4. Transform to lower-case
  5. Filter out words with fewer than 3 characters
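
The following plain-Java sketch illustrates these processing steps conceptually; the actual analyzer uses a Lucene filter chain (see LuceneCodeAnalyzer below), and all names here are hypothetical. Lower-casing is applied before the stop-word check because stop-word matching ignores case:

<source lang="java">
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TermProcessingSketch {

    /** Applies the five processing steps to a raw text and returns the resulting terms. */
    public static List<String> process(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.split("\\s+")) {                    // 1. split on whitespace
            for (String part : token.split("(?<=[a-z])(?=[A-Z])")) { // 2. split on case change
                String term = part.toLowerCase();                    // 4. transform to lower-case
                if (stopWords.contains(term)) {                      // 3. filter out stop-words
                    continue;
                }
                if (term.length() < 3) {                             // 5. drop words shorter than 3 characters
                    continue;
                }
                terms.add(term);
            }
        }
        return terms;
    }
}
</source>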

Usage

Starting an Analysis

To start a semantic analysis, press "Analyze VPM" in the workbench and select the "Semantic VPM Analyzer" in the analysis wizard. Modify the configurations as needed and press "Next". Select "VPMGraph only" if you want the results to be displayed in the VPMGraph window; otherwise, choose the "Refinement Browser". Click "Finish" to start the analysis.

Configurations

General Configurations

Access the analyzer's configurations by clicking "Semantic VPM Analyzer" in the wizard. The following configurations are currently available to adjust the comparison.

  • Include Comments
    • This flag determines whether comments should be part of the analysis or not. If true, the VPMAnalyzer includes the comments of the Variation Points' implementing software elements for similarity comparisons.
  • Use Rare-Finder
    • This flag determines whether to use the Rare-Term-Finder or not.
  • Rare-Terms Max. Percentage
    • As the Rare-Term-Finder uses the rarest terms of a VP, this configuration specifies the maximum relative frequency a term may have within a VP to still count as rare. Use values between 0 and 1.
  • Use Overall-Similarity Finder
    • This flag determines whether to use the Overall-Similarity-Finder or not.
  • Minimum Similarity
    • Specifies the minimum similarity of two documents to be matched by the Overall-Similarity-Finder. Use values between 0 and 1.
  • UseTopNTermFinder
    • This flag determines whether to use the Top-N-Term-Finder or not.
  • Least document frequency
    • Defines the least document frequency for the Top-N-Term-Finder.
  • N
    • The max. number of terms to be searched for in the Top-N-Term-Finder.
  • Stop-Words
    • Define your own stop-words here. Stop-words are filtered out and are therefore not part of the analysis. Use common words that should not imply semantic relations. Separate the words with whitespace; case is ignored. A default stop-word list is used if the field is left blank. Examples:
      • integer value field case
      • hello ignore this words

Stop-Word Configuration

The semantic analyzer already provides a default list of stop-words, which can be found in the Constants.java class. The list contains the following terms:

get, set, new, case, remove, class, type, create, arg, default, configure, clear, value, misc, fig

Technical Documentation

Lucene

General

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

In general, a Lucene index consists of multiple documents, each having multiple fields. For Variation Points, individual documents are created with a unique VP-ID field and a content field. The index is held in memory to allow faster analysis and because it is not required after the analysis has been performed. Term frequencies are indexed for the later analysis techniques.

Lucene indexing options

Lucene offers various options when it comes to indexing text. As these options significantly change the search results, the options are explained below.

Indexed
Determines whether to index the field (i.e., make it searchable) or not.
Tokenized
Determines whether to tokenize the text or not.
Stored
Determines whether to store the original text or not.
StoreTermVectors
Determines whether to store term vectors or not.
IndexOptions
  • DOCS_ONLY
    • Only documents are indexed: term frequencies and positions are omitted.
  • DOCS_AND_FREQS
    • Only documents and term frequencies are indexed: positions are omitted.
  • DOCS_AND_FREQS_AND_POSITIONS
    • Indexes documents, frequencies and positions.
  • DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS
    • Indexes documents, frequencies, positions and offsets.

org.splevo.vpm.analyzer.semantic

Package Description

This is the base package of the semantic analyzer. It holds the analyzer class, a constants class, and some helper classes.


Constants.java

This class holds all constant values for the semantic analyzer. This way all default values can be changed easily. The constants are structured as follows:

  • Analyzer Settings
  • Index field names
  • Labels for the configurations
  • Default values for the configurations
  • General constants

IndexASTNodeSwitch.java

This class extracts all relevant text content from a given EObject. The constructor offers the option to either index comments or ignore them. The switch does not iterate through a node's children; it only indexes the direct content. When the doSwitch(EObject) method is called, the extracted text is stored in the switch. Retrieve it via the getContent() and getComments() methods, as sketched below.
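
A minimal usage sketch based on the methods described above (the boolean constructor argument and the softwareElement variable are assumptions):

<source lang="java">
// Extract code and comment text from a software element (an EObject).
IndexASTNodeSwitch indexSwitch = new IndexASTNodeSwitch(true); // true: also index comments (assumed parameter)
indexSwitch.doSwitch(softwareElement);                         // softwareElement: some EObject of the VP's implementation
String codeText = indexSwitch.getContent();
String commentText = indexSwitch.getComments();
</source>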

SemanticVPMAnalyzer.java

This is the entry point for the analysis process. When an analysis is started, the analyze(VPMGraph) method is called. In this method, the semantic analyzer first extracts all relevant text content from the VPMGraph and stores it in a Lucene index (fillIndex(VPMGraph)). After that it searches this index for similar Variation Points (findRelationships(VPMAnalyzerResult)). Afterwards a clean-up is performed and all index content is deleted. A rough explanation of the main methods follows:

  • fillIndex(VPMGraph)
    • Iterates through every VP from the VPMGraph and uses the IndexASTNodeSwitch's doSwitch(EObject)-method to extract the text content. The content gets stored in the index with the Indexer's addToIndex(...)-method.
  • findRelationships(VPMAnalyzerResult)
    • First gets all user configurations
    • Uses the Searcher's findSemanticRelationships(...) to find similar VPs.
    • Iterates through all found similar VPs and adds the EdgeDescriptors to the VPMAnalyzerResult.

The configuration methods use the labels and default-values from the Constants.java-file.

StructuredMap.java

This is a helper class that stores links between string IDs. For each link, the StructuredMap can store an explanation containing a rough description of why the nodes are related. The usage of the class is described below and illustrated by a short sketch after the list:

  • void addLink(String id1, String id2, String explanation)
    • Use this method to add a link between the two String IDs with an explanation that describes why there is a link between these nodes.
  • String getExplanation(String id1, String id2)
    • Get the explanation for the link with the given node IDs. Returns null if there is no explanation for this link or the IDs do not exist.
  • Map<String, Set<String>> getAllLinks()
    • Returns a Map that contains all links from the StructuredMap. The key is the ID of the left node; the value set contains the IDs of the right nodes. This way every key holds values.size() links.
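
A minimal usage sketch of these methods (the no-argument constructor is an assumption):

<source lang="java">
// Link two variation points with an explanation and read the data back.
StructuredMap links = new StructuredMap();
links.addLink("VP_1", "VP_2", "shared term: account");
links.addLink("VP_1", "VP_3", "shared term: transfer");

String explanation = links.getExplanation("VP_1", "VP_2"); // "shared term: account"
Map<String, Set<String>> allLinks = links.getAllLinks();   // e.g. "VP_1" -> {"VP_2", "VP_3"}
</source>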

org.splevo.vpm.analyzer.semantic.lucene

Package Description

This package contains all Lucene-related classes. Classes that interact with a Lucene index or override Lucene's default behaviour should be part of this package.


CustomPerFieldAnalyzer.java

This is just a wrapper class. It has a static method that returns an AnalyzerWrapper. When creating a Lucene index, the whole content is normally analyzed with one single analyzer. The AnalyzerWrapper allows using an individual analyzer for each field in the index. Since the index consists of three fields, each containing different semantics, a separate analyzer per field is a good approach. The index structure is explained in the Indexer topic. As the ID field is not analyzed, it does not need an analyzer. The default analyzers for the fields are listed below (a sketch of the wrapping follows the list):

  • Code-field: LuceneCodeAnalyzer
    • This analyzer is explained below.
  • Comment-field: StandardAnalyzer
    • The StandardAnalyzer is designed for common English text, which makes it a good fit for comments.
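
A sketch of how such a per-field analyzer could be assembled with Lucene's PerFieldAnalyzerWrapper (assuming a Lucene 4.x API; the field names, the Version constant, and the codeAnalyzer parameter are placeholders, not the actual SPLevo constants):

<source lang="java">
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

public class PerFieldAnalyzerSketch {

    /** Wraps a code-specific analyzer for the code field and a StandardAnalyzer for the comment field. */
    public static Analyzer createAnalyzer(Analyzer codeAnalyzer) {
        Map<String, Analyzer> fieldAnalyzers = new HashMap<String, Analyzer>();
        fieldAnalyzers.put("CONTENT", codeAnalyzer);                            // program code terms
        fieldAnalyzers.put("COMMENT", new StandardAnalyzer(Version.LUCENE_43)); // natural language comments
        // The first argument is the default analyzer for all remaining fields;
        // the ID field is not analyzed at all, so it needs no entry here.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_43), fieldAnalyzers);
    }
}
</source>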

Indexer.java

This class is responsible for the index creation. All content is stored in one central index. A Lucene index consists of documents, and those documents contain fields. Fields can be seen as key-value pairs with a name as key and the content as value. The structure of the index held by this class is as follows (a small sketch follows the list):

  • Index
    • The main index that has all documents.
  • Document
    • For each VP a document is created.
  • Field
    • Each document has two mandatory fields and one optional field. There are several options for indexing fields in Lucene; have a look at the Lucene indexing options above for further information.
      • ID: This is the unique ID of the VP.
        • Indexed: false
        • Tokenized: false
        • Stored: true
        • Store Term Vectors: false
        • Index Options: DOCS_ONLY
      • Content: The program code of the VP is stored in this field.
        • Indexed: true
        • Tokenized: true
        • Stored: false
        • Store Term Vectors: true
        • Index Options: DOCS_AND_FREQS
      • Comment(optional): The code comments of the VP are stored in this field.
        • Indexed: true
        • Tokenized: true
        • Stored: false
        • Store Term Vectors: true
        • Index Options: DOCS_AND_FREQS

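A minimal sketch of how a document with these field options could be created with the Lucene 4.x API (the field names are placeholders, not the actual SPLevo constants; the comment field would be configured analogously to the content field):

<source lang="java">
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class VariationPointDocumentSketch {

    /** Creates a document with a stored-only ID field and an indexed content field. */
    public static Document createDocument(String vpId, String content) {
        Document document = new Document();

        // ID field: stored only, not indexed or tokenized.
        document.add(new StoredField("ID", vpId));

        // Content field: indexed and tokenized, term vectors and frequencies kept,
        // original text not stored.
        FieldType contentType = new FieldType();
        contentType.setIndexed(true);
        contentType.setTokenized(true);
        contentType.setStored(false);
        contentType.setStoreTermVectors(true);
        contentType.setIndexOptions(IndexOptions.DOCS_AND_FREQS);
        contentType.freeze();
        document.add(new Field("CONTENT", content, contentType));

        return document;
    }
}
</source>
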
The Indexer uses the RAMDirectory class, which means that the index is stored in memory. Some important methods of this class are explained below:

  • void setStopWords(String[])
    • Sets the stop-word-list for the main index. Those words are filtered out and are not part of the analysis.
  • DirectoryReader getIndexReader()
    • Gets a reader that grants access to reading and searching operations.
  • boolean addToIndex(String variationPointId, String content, String comments)
    • Adds a document to the index with the given content. If you don't want comments to be indexed, just set the comments parameter to null.

Searcher.java

This class is responsible for the search. Use the findSemanticRelationships(...) method to start a search. The method gets an IndexReader from the Indexer class and uses the FinderExecutor to execute the analysis as configured. This is the point where you would add more Finders.

LuceneCodeAnalyzer.java

This is a Lucene Analyzer class specialized in processing code. In multiple steps, the input text is first tokenized and then filtered. The exact steps in order are:

  1. Tokenization with Lucene's WhitespaceTokenizer
  2. Split on case change
  3. Stop-Words are filtered out
  4. Transformation to lower-case
  5. Words with a length < 3 get filtered out

org.splevo.vpm.analyzer.semantic.lucene.finder

Package Description

This package contains the finder logic. The AbstractRelationshipFinder class defines a finder's behaviour. All Finder implementations should be in this package.


AbstractRelationshipFinder.java

The abstract class that defines the behaviour of a Finder. A Finder has to implement the abstract method findSimilarEntries(). Use the reader member to access the index. The matchComments member indicates whether to respect comments for the analysis or not.

AbstractLuceneQueryFinder.java

This is a special AbstractRelationshipFinder that already implements the querying of an index.

To implement an AbstractLuceneQueryFinder, just override the Query buildQuery(String fieldName, Map<String, Integer> termFrequencies) method. This method is called for each document in the index. The fieldName parameter identifies the field to be used. The termFrequencies map contains the document's terms and their frequencies. Use this information to build a query that matches similar documents.

FinderExecutor.java

This helper class takes multiple AbstractRelationshipFinders and executes their findSimilarEntries() methods. The results of all searches are merged into one StructuredMap.

OverallSimilarityFinder.java

By building a BooleanQuery containing all terms of the document, this Finder finds similar documents. A minimum similarity defines how many terms a similar document has to have in common with the current one.

RareTermFinder.java

Extracts terms whose relative frequency is below the specified max. percentage. Those terms are used to search for similar documents.

TopNTermFinder.java

As a first step, this finder calculates the top N terms of the whole index. Then it filters out all terms that occur in less than a specified percentage of the documents. Afterwards, each of the remaining terms makes up a cluster.

Expanding the Semantic Analyzer

Custom Finders

Adding new Finders is easy. Just add a new Finder class to the org.splevo.vpm.analyzer.semantic.lucene.finder package and extend either AbstractRelationshipFinder or AbstractLuceneQueryFinder. The differences between the two classes are explained below.

  • AbstractRelationshipFinder
    • Top-Level Abstract class.
    • Implement the findSimilarEntries() method to execute the search.
      • Use the reader member to query the index and implement new comparison techniques.
  • AbstractLuceneQueryFinder
    • Special AbstractRelationshipFinder implementation to ease the use of lucene queries.
    • Executes a query on the index and adds all results to a StructuredMap.
    • Implement the buildQuery(Map<String, Integer> termFrequencies) method.
      • A semantic relationship is built between all documents (i.e. variation points) found by the query.

After creating the new Finder, it must be added to the executor. Do this by adding it to the analysisExecutor in the findSemanticRelationships(...) method of the Searcher class.
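
A minimal sketch of what a custom finder based on AbstractLuceneQueryFinder could look like (the constructor signature, the visibility of buildQuery, and the finder itself are assumptions; check the actual classes in the sources for the exact contract):

<source lang="java">
// Assumed to be placed in the org.splevo.vpm.analyzer.semantic.lucene.finder package.
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

/** Example strategy: relate documents that share any term with the current document. */
public class AnySharedTermFinder extends AbstractLuceneQueryFinder {

    public AnySharedTermFinder(DirectoryReader reader, boolean matchComments) {
        super(reader, matchComments); // constructor signature assumed
    }

    @Override
    public Query buildQuery(String fieldName, Map<String, Integer> termFrequencies) {
        BooleanQuery query = new BooleanQuery();
        for (String term : termFrequencies.keySet()) {
            // Any shared term is enough to create a match.
            query.add(new TermQuery(new Term(fieldName, term)), Occur.SHOULD);
        }
        return query;
    }
}
</source>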

Allow the user to enable or disable the Finder as required. For this, a new configuration should be built. Have a look into the SemanticRelationshipAnalyzer class to see how configurations are built.