class Prism::Translation::Parser::Lexer
Accepts a list of prism tokens and converts them into the expected format for the parser gem.
Constants
- COMMENT_CONTINUATION_TYPES
-
Types of tokens that are allowed to continue a method call with comments in-between. For these, the parser gem doesn’t emit a newline token after the last comment.
- DELIMITER_SYMETRY
-
When one of these delimiters is encountered, then the other one is allowed to be escaped as well.
- ESCAPES
-
Escape sequences that have special and should appear unescaped in the resulting string.
- HeredocData
-
Heredocs are complex and require us to keep track of a bit of info to refer to later
- LAMBDA_TOKEN_TYPES
-
It is used to determine whether ‘do` is of the token type `kDO` or `kDO_LAMBDA`.
NOTE: In edge cases like ‘-> (foo = -> (bar) {}) do end`, please note that `kDO` is still returned instead of `kDO_LAMBDA`, which is expected: github.com/ruby/prism/pull/3046
- LPAREN_CONVERSION_TOKEN_TYPES
-
The ‘PARENTHESIS_LEFT` token in
Prism
is classified as either `tLPAREN` or `tLPAREN2` in theParser
gem. The following token types are listed as those classified as `tLPAREN`. - REGEXP_META_CHARACTERS
-
github.com/whitequark/parser/blob/v3.3.6.0/lib/parser/lexer-strings.rl#L14
- TYPES
-
The direct translating of types between the two lexers.
Attributes
An array of tuples that contain prism tokens and their associated lex state when they were lexed.
A hash that maps offsets in bytes to offsets in characters.
The Parser::Source::Buffer that the tokens were lexed from.
Public Class Methods
Source
# File lib/prism/translation/parser/lexer.rb, line 227 def initialize(source_buffer, lexed, offset_cache) @source_buffer = source_buffer @lexed = lexed @offset_cache = offset_cache end
Initialize the lexer with the given source buffer, prism tokens, and offset cache.
Public Instance Methods
Source
# File lib/prism/translation/parser/lexer.rb, line 237 def to_a tokens = [] index = 0 length = lexed.length heredoc_stack = [] quote_stack = [] # The parser gem emits the newline tokens for comments out of order. This saves # that token location to emit at a later time to properly line everything up. # https://github.com/whitequark/parser/issues/1025 comment_newline_location = nil while index < length token, state = lexed[index] index += 1 next if %i[IGNORED_NEWLINE __END__ EOF].include?(token.type) type = TYPES.fetch(token.type) value = token.value location = range(token.location.start_offset, token.location.end_offset) case type when :kDO types = tokens.map(&:first) nearest_lambda_token_type = types.reverse.find { |type| LAMBDA_TOKEN_TYPES.include?(type) } if nearest_lambda_token_type == :tLAMBDA type = :kDO_LAMBDA end when :tCHARACTER value.delete_prefix!("?") # Character literals behave similar to double-quoted strings. We can use the same escaping mechanism. value = unescape_string(value, "?") when :tCOMMENT if token.type == :EMBDOC_BEGIN while !((next_token = lexed[index][0]) && next_token.type == :EMBDOC_END) && (index < length - 1) value += next_token.value index += 1 end value += next_token.value location = range(token.location.start_offset, lexed[index][0].location.end_offset) index += 1 else is_at_eol = value.chomp!.nil? location = range(token.location.start_offset, token.location.end_offset + (is_at_eol ? 0 : -1)) prev_token = lexed[index - 2][0] if index - 2 >= 0 next_token = lexed[index][0] is_inline_comment = prev_token&.location&.start_line == token.location.start_line if is_inline_comment && !is_at_eol && !COMMENT_CONTINUATION_TYPES.include?(next_token&.type) tokens << [:tCOMMENT, [value, location]] nl_location = range(token.location.end_offset - 1, token.location.end_offset) tokens << [:tNL, [nil, nl_location]] next elsif is_inline_comment && next_token&.type == :COMMENT comment_newline_location = range(token.location.end_offset - 1, token.location.end_offset) elsif comment_newline_location && !COMMENT_CONTINUATION_TYPES.include?(next_token&.type) tokens << [:tCOMMENT, [value, location]] tokens << [:tNL, [nil, comment_newline_location]] comment_newline_location = nil next end end when :tNL next_token = next_token = lexed[index][0] # Newlines after comments are emitted out of order. if next_token&.type == :COMMENT comment_newline_location = location next end value = nil when :tFLOAT value = parse_float(value) when :tIMAGINARY value = parse_complex(value) when :tINTEGER if value.start_with?("+") tokens << [:tUNARY_NUM, ["+", range(token.location.start_offset, token.location.start_offset + 1)]] location = range(token.location.start_offset + 1, token.location.end_offset) end value = parse_integer(value) when :tLABEL value.chomp!(":") when :tLABEL_END value.chomp!(":") when :tLCURLY type = :tLBRACE if state == EXPR_BEG | EXPR_LABEL when :tLPAREN2 type = :tLPAREN if tokens.empty? || LPAREN_CONVERSION_TOKEN_TYPES.include?(tokens.dig(-1, 0)) when :tNTH_REF value = parse_integer(value.delete_prefix("$")) when :tOP_ASGN value.chomp!("=") when :tRATIONAL value = parse_rational(value) when :tSPACE value = nil when :tSTRING_BEG next_token = lexed[index][0] next_next_token = lexed[index + 1][0] basic_quotes = ["\"", "'"].include?(value) if basic_quotes && next_token&.type == :STRING_END next_location = token.location.join(next_token.location) type = :tSTRING value = "" location = range(next_location.start_offset, next_location.end_offset) index += 1 elsif value.start_with?("'", '"', "%") if next_token&.type == :STRING_CONTENT && next_token.value.lines.count <= 1 && next_next_token&.type == :STRING_END # the parser gem doesn't simplify strings when its value ends in a newline if !(string_value = next_token.value).end_with?("\n") && basic_quotes next_location = token.location.join(next_next_token.location) value = unescape_string(string_value, value) type = :tSTRING location = range(next_location.start_offset, next_location.end_offset) index += 2 tokens << [type, [value, location]] next end end quote_stack.push(value) elsif token.type == :HEREDOC_START quote = value[2] == "-" || value[2] == "~" ? value[3] : value[2] heredoc_type = value[2] == "-" || value[2] == "~" ? value[2] : "" heredoc = HeredocData.new( identifier: value.match(/<<[-~]?["'`]?(?<heredoc_identifier>.*?)["'`]?\z/)[:heredoc_identifier], common_whitespace: 0, ) if quote == "`" type = :tXSTRING_BEG end # The parser gem trims whitespace from squiggly heredocs. We must record # the most common whitespace to later remove. if heredoc_type == "~" || heredoc_type == "`" heredoc.common_whitespace = calculate_heredoc_whitespace(index) end if quote == "'" || quote == '"' || quote == "`" value = "<<#{quote}" else value = '<<"' end heredoc_stack.push(heredoc) quote_stack.push(value) end when :tSTRING_CONTENT if (lines = token.value.lines).one? # Heredoc interpolation can have multiple STRING_CONTENT nodes on the same line. is_first_token_on_line = lexed[index - 1] && token.location.start_line != lexed[index - 2][0].location&.start_line # The parser gem only removes indentation when the heredoc is not nested not_nested = heredoc_stack.size == 1 if is_first_token_on_line && not_nested && (current_heredoc = heredoc_stack.last).common_whitespace > 0 value = trim_heredoc_whitespace(value, current_heredoc) end value = unescape_string(value, quote_stack.last) else # When the parser gem encounters a line continuation inside of a multiline string, # it emits a single string node. The backslash (and remaining newline) is removed. current_line = +"" adjustment = 0 start_offset = token.location.start_offset emit = false lines.each.with_index do |line, index| chomped_line = line.chomp backslash_count = chomped_line[/\\{1,}\z/]&.length || 0 is_interpolation = interpolation?(quote_stack.last) is_percent_array = percent_array?(quote_stack.last) if backslash_count.odd? && (is_interpolation || is_percent_array) if is_percent_array # Remove the last backslash, keep potential newlines current_line << line.sub(/(\\)(\r?\n)\z/, '\2') adjustment += 1 else chomped_line.delete_suffix!("\\") current_line << chomped_line adjustment += 2 end # If the string ends with a line continuation emit the remainder emit = index == lines.count - 1 else current_line << line emit = true end if emit end_offset = start_offset + current_line.bytesize + adjustment tokens << [:tSTRING_CONTENT, [unescape_string(current_line, quote_stack.last), range(start_offset, end_offset)]] start_offset = end_offset current_line = +"" adjustment = 0 end end next end when :tSTRING_DVAR value = nil when :tSTRING_END if token.type == :HEREDOC_END && value.end_with?("\n") newline_length = value.end_with?("\r\n") ? 2 : 1 value = heredoc_stack.pop.identifier location = range(token.location.start_offset, token.location.end_offset - newline_length) elsif token.type == :REGEXP_END value = value[0] location = range(token.location.start_offset, token.location.start_offset + 1) end if percent_array?(quote_stack.pop) prev_token = lexed[index - 2][0] if index - 2 >= 0 empty = %i[PERCENT_LOWER_I PERCENT_LOWER_W PERCENT_UPPER_I PERCENT_UPPER_W].include?(prev_token&.type) ends_with_whitespace = prev_token&.type == :WORDS_SEP # parser always emits a space token after content in a percent array, even if no actual whitespace is present. if !empty && !ends_with_whitespace tokens << [:tSPACE, [nil, range(token.location.start_offset, token.location.start_offset)]] end end when :tSYMBEG if (next_token = lexed[index][0]) && next_token.type != :STRING_CONTENT && next_token.type != :EMBEXPR_BEGIN && next_token.type != :EMBVAR && next_token.type != :STRING_END next_location = token.location.join(next_token.location) type = :tSYMBOL value = next_token.value value = { "~@" => "~", "!@" => "!" }.fetch(value, value) location = range(next_location.start_offset, next_location.end_offset) index += 1 else quote_stack.push(value) end when :tFID if !tokens.empty? && tokens.dig(-1, 0) == :kDEF type = :tIDENTIFIER end when :tXSTRING_BEG if (next_token = lexed[index][0]) && !%i[STRING_CONTENT STRING_END EMBEXPR_BEGIN].include?(next_token.type) # self.`() type = :tBACK_REF2 end quote_stack.push(value) when :tSYMBOLS_BEG, :tQSYMBOLS_BEG, :tWORDS_BEG, :tQWORDS_BEG if (next_token = lexed[index][0]) && next_token.type == :WORDS_SEP index += 1 end quote_stack.push(value) when :tREGEXP_BEG quote_stack.push(value) end tokens << [type, [value, location]] if token.type == :REGEXP_END tokens << [:tREGEXP_OPT, [token.value[1..], range(token.location.start_offset + 1, token.location.end_offset)]] end end tokens end
Convert the prism tokens into the expected format for the parser gem.
Private Instance Methods
Source
# File lib/prism/translation/parser/lexer.rb, line 561 def calculate_heredoc_whitespace(heredoc_token_index) next_token_index = heredoc_token_index nesting_level = 0 previous_line = -1 result = Float::MAX while (lexed[next_token_index] && next_token = lexed[next_token_index][0]) next_token_index += 1 next_next_token = lexed[next_token_index] && lexed[next_token_index][0] # String content inside nested heredocs and interpolation is ignored if next_token.type == :HEREDOC_START || next_token.type == :EMBEXPR_BEGIN # When interpolation is the first token of a line there is no string # content to check against. There will be no common whitespace. if nesting_level == 0 && next_token.location.start_column == 0 result = 0 end nesting_level += 1 elsif next_token.type == :HEREDOC_END || next_token.type == :EMBEXPR_END nesting_level -= 1 # When we encountered the matching heredoc end, we can exit break if nesting_level == -1 elsif next_token.type == :STRING_CONTENT && nesting_level == 0 common_whitespace = 0 next_token.value[/^\s*/].each_char do |char| if char == "\t" common_whitespace = (common_whitespace / 8 + 1) * 8; else common_whitespace += 1 end end is_first_token_on_line = next_token.location.start_line != previous_line # Whitespace is significant if followed by interpolation whitespace_only = common_whitespace == next_token.value.length && next_next_token&.location&.start_line != next_token.location.start_line if is_first_token_on_line && !whitespace_only && common_whitespace < result result = common_whitespace previous_line = next_token.location.start_line end end end result end
Wonky heredoc tab/spaces rules. github.com/ruby/prism/blob/v1.3.0/src/prism.c#L10548-L10558
Source
# File lib/prism/translation/parser/lexer.rb, line 715 def interpolation?(quote) quote != "'" && !quote.start_with?("%q", "%w", "%i") end
Determine if characters preceeded by a backslash should be escaped or not
Source
# File lib/prism/translation/parser/lexer.rb, line 532 def parse_complex(value) value.chomp!("i") if value.end_with?("r") Complex(0, parse_rational(value)) elsif value.start_with?(/0[BbOoDdXx]/) Complex(0, parse_integer(value)) else Complex(0, value) end rescue ArgumentError 0i end
Parse a complex from the string representation.
Source
# File lib/prism/translation/parser/lexer.rb, line 525 def parse_float(value) Float(value) rescue ArgumentError 0.0 end
Parse a float from the string representation.
Source
# File lib/prism/translation/parser/lexer.rb, line 518 def parse_integer(value) Integer(value) rescue ArgumentError 0 end
Parse an integer from the string representation.
Source
# File lib/prism/translation/parser/lexer.rb, line 547 def parse_rational(value) value.chomp!("r") if value.start_with?(/0[BbOoDdXx]/) Rational(parse_integer(value)) else Rational(value) end rescue ArgumentError 0r end
Parse a rational from the string representation.
Source
# File lib/prism/translation/parser/lexer.rb, line 725 def percent_array?(quote) quote.start_with?("%w", "%W", "%i", "%I") end
Determine if the string is part of a %-style array.
Source
# File lib/prism/translation/parser/lexer.rb, line 513 def range(start_offset, end_offset) Range.new(source_buffer, offset_cache[start_offset], offset_cache[end_offset]) end
Creates a new parser range, taking prisms byte offsets into account
Source
# File lib/prism/translation/parser/lexer.rb, line 720 def regexp?(quote) quote == "/" || quote.start_with?("%r") end
Regexp
allow interpolation but are handled differently during unescaping
Source
# File lib/prism/translation/parser/lexer.rb, line 607 def trim_heredoc_whitespace(string, heredoc) trimmed_whitespace = 0 trimmed_characters = 0 while (string[trimmed_characters] == "\t" || string[trimmed_characters] == " ") && trimmed_whitespace < heredoc.common_whitespace if string[trimmed_characters] == "\t" trimmed_whitespace = (trimmed_whitespace / 8 + 1) * 8; break if trimmed_whitespace > heredoc.common_whitespace else trimmed_whitespace += 1 end trimmed_characters += 1 end string[trimmed_characters..] end
Wonky heredoc tab/spaces rules. github.com/ruby/prism/blob/v1.3.0/src/prism.c#L16528-L16545
Source
# File lib/prism/translation/parser/lexer.rb, line 642 def unescape_string(string, quote) # In single-quoted heredocs, everything is taken literally. return string if quote == "<<'" # OPTIMIZATION: Assume that few strings need escaping to speed up the common case. return string unless string.include?("\\") # Enclosing character for the string. `"` for `"foo"`, `{` for `%w{foo}`, etc. delimiter = quote[-1] if regexp?(quote) # Should be escaped handled to single-quoted heredocs. The only character that is # allowed to be escaped is the delimiter, except when that also has special meaning # in the regexp. Since all the symetry delimiters have special meaning, they don't need # to be considered separately. if REGEXP_META_CHARACTERS.include?(delimiter) string else # There can never be an even amount of backslashes. It would be a syntax error. string.gsub(/\\(#{Regexp.escape(delimiter)})/, '\1') end elsif interpolation?(quote) # Appending individual escape sequences may force the string out of its intended # encoding. Start out with binary and force it back later. result = "".b scanner = StringScanner.new(string) while (skipped = scanner.skip_until(/\\/)) # Append what was just skipped over, excluding the found backslash. result << string.byteslice(scanner.pos - skipped, skipped - 1) # Simple single-character escape sequences like \n if (replacement = ESCAPES[scanner.peek(1)]) result << replacement scanner.pos += 1 elsif (octal = scanner.check(/[0-7]{1,3}/)) # \nnn # NOTE: When Ruby 3.4 is required, this can become result.append_as_bytes(chr) result << octal.to_i(8).chr.b scanner.pos += octal.bytesize elsif (hex = scanner.check(/x([0-9a-fA-F]{1,2})/)) # \xnn result << hex[1..].to_i(16).chr.b scanner.pos += hex.bytesize elsif (unicode = scanner.check(/u([0-9a-fA-F]{4})/)) # \unnnn result << unicode[1..].hex.chr(Encoding::UTF_8).b scanner.pos += unicode.bytesize elsif scanner.peek(3) == "u{}" # https://github.com/whitequark/parser/issues/856 scanner.pos += 3 elsif (unicode_parts = scanner.check(/u{.*}/)) # \u{nnnn ...} unicode_parts[2..-2].split.each do |unicode| result << unicode.hex.chr(Encoding::UTF_8).b end scanner.pos += unicode_parts.bytesize end end # Add remainging chars result << string.byteslice(scanner.pos..) result.force_encoding(source_buffer.source.encoding) result else delimiters = Regexp.escape("#{delimiter}#{DELIMITER_SYMETRY[delimiter]}") string.gsub(/\\([\\#{delimiters}])/, '\1') end end
Apply Ruby string escaping rules