class SyntaxSuggest::CleanDocument

Parses and sanitizes source into a lexically aware document

Internally the document is represented by an array with each index containing a CodeLine correlating to a line from the source code.

There are three main phases in the algorithm:

  1. Sanitize/format input source

  2. Search for invalid blocks

  3. Format invalid blocks into something meaningful

This class handles the first part.

The reason this class exists is to format input source for better/easier/cleaner exploration.

The CodeSearch class operates at the line level so we must be careful to not introduce lines that look valid by themselves, but when removed will trigger syntax errors or strange behavior.

## Join Trailing slashes

Code with a trailing slash is logically treated as a single line:

1 it "code can be split" \
2    "across multiple lines" do

In this case removing line 2 would add a syntax error. We get around this by internally joining the two lines into a single “line” object

## Logically Consecutive lines

Code that can be broken over multiple lines such as method calls are on different lines:

1 User.
2   where(name: "schneems").
3   first

Removing line 2 can introduce a syntax error. To fix this, all lines are joined into one.

## Heredocs

A heredoc is an way of defining a multi-line string. They can cause many problems. If left as a single line, the parser would try to parse the contents as ruby code rather than as a string. Even without this problem, we still hit an issue with indentation:

1 foo = <<~HEREDOC
2  "Be yourself; everyone else is already taken.""
3    ― Oscar Wilde
4      puts "I look like ruby code" # but i'm still a heredoc
5 HEREDOC

If we didn’t join these lines then our algorithm would think that line 4 is separate from the rest, has a higher indentation, then look at it first and remove it.

If the code evaluates line 5 by itself it will think line 5 is a constant, remove it, and introduce a syntax errror.

All of these problems are fixed by joining the whole heredoc into a single line.

## Comments and whitespace

Comments can throw off the way the lexer tells us that the line logically belongs with the next line. This is valid ruby but results in a different lex output than before:

1 User.
2   where(name: "schneems").
3   # Comment here
4   first

To handle this we can replace comment lines with empty lines and then re-lex the source. This removal and re-lexing preserves line index and document size, but generates an easier to work with document.