SciTE Script Lexer
A lexer may be written as a script in the Lua language instead of in C++. This is a little simpler and allows lexers to be developed without using a C++ compiler.
A script lexer is attached by setting the file lexer to be a name that starts with "script_". Styles and other properties can then be assigned using this name. For example,

    style.script_zog.0=fore:#7f007f,bold
    style.script_zog.1=fore:#000000
    style.script_zog.2=fore:#000080,bold
    style.script_zog.3=fore:#008000,font:Georgia,italics,size:9

Then the lexer is implemented in Lua similar to this:

    function OnStyle(styler)
        local S_DEFAULT = 0
        local S_IDENTIFIER = 1
        local S_KEYWORD = 2
        local S_UNICODECOMMENT = 3
        local identifierCharacters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

        styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
        while styler:More() do

            -- Exit state if needed
            if styler:State() == S_IDENTIFIER then
                if not identifierCharacters:find(styler:Current(), 1, true) then
                    local identifier = styler:Token()
                    if identifier == "if" or identifier == "end" then
                        styler:ChangeState(S_KEYWORD)
                    end
                    styler:SetState(S_DEFAULT)
                end
            elseif styler:State() == S_UNICODECOMMENT then
                if styler:Match("»") then
                    styler:ForwardSetState(S_DEFAULT)
                end
            end

            -- Enter state if needed
            if styler:State() == S_DEFAULT then
                if styler:Match("«") then
                    styler:SetState(S_UNICODECOMMENT)
                elseif identifierCharacters:find(styler:Current(), 1, true) then
                    styler:SetState(S_IDENTIFIER)
                end
            end

            styler:Forward()
        end
        styler:EndStyling()
    end

The result looks like:
« Clip into the positive zone »
if (a > 0) a
0
end
The lexer loops through the indicated part of the document, assigning a style to each character.

    styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
    while styler:More() do
        -- Code that examines the text and sets lexical states
        styler:Forward()
    end
    styler:EndStyling()

There are many different ways to structure the code that examines the text and sets lexical states. A structure that has proven useful in C++ lexers is to write two blocks of code as shown in the example. The first block checks if the current state should end and if so sets the state to the default 0. The second block is responsible for detecting whether a new state should be entered from the default state. This structure means everything is dealt with as switching from or to the default state and avoids having to consider many combinations of states.
The styler iterates over whole characters rather than bytes. Thus, if the document is encoded in UTF-8, styler:Current() may be a multibyte string. If the script is also encoded in UTF-8, then it is easy to check against Unicode characters with code like styler:Match("»"), as in the example lexer.
If the document uses an encoding like Latin-1 and the script is saved in the same encoding, then literals can be used in the same way.
If the language can be encoded in different ways then more complex code may be needed along with encoding-specific code.
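As a rough illustration of encoding-specific code, the sketch below classifies the string returned by styler:Current() by its byte length. The helper names isMultibyte and isIdentifierChar are hypothetical and not part of the styler API, and treating every multibyte character as an identifier character is an assumption made only for this example.

```lua
-- Hypothetical helpers; not part of the styler API.
local asciiIdentifierCharacters =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

-- In a UTF-8 document styler:Current() can return several bytes.
-- The # operator counts bytes, so a length above 1 means a multibyte character.
local function isMultibyte(ch)
    return #ch > 1
end

-- Assumption for this sketch: any multibyte character may appear in an identifier.
local function isIdentifierChar(ch)
    if ch == "" then
        return false
    end
    if isMultibyte(ch) then
        return true
    end
    -- Plain find (fourth argument true) avoids pattern matching surprises.
    return asciiIdentifierCharacters:find(ch, 1, true) ~= nil
end
```

In a single-byte encoding such as Latin-1, an accented letter is one byte long and is not in the ASCII list, so this sketch would reject it. That is the kind of case where separate code per encoding may be needed.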
Sometimes a lexer needs to see information from earlier in the file: perhaps a declaration changes the syntax, or the particular form of quote at the start of a string must be matched at its end. Since the standard loop only goes forward from the starting position, other calls such as CharAt and StyleAt must be used to look back. These use byte positions and do not treat multi-byte characters as single entities.
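One way to look back might be sketched as follows, for the case where the range to be lexed begins inside a string and the lexer needs to know which quote opened it. The helper name openingQuote and the style number S_STRING are hypothetical. The sketch assumes that the whole string, including its opening quote, was given the same style in an earlier pass, that positions start at 0, and that two strings never touch each other.

```lua
local S_STRING = 4 -- hypothetical style number for strings

-- Walk backwards from pos while the preceding byte still has the string style,
-- then return the byte that opened the string as a one-character string.
-- CharAt and StyleAt take byte positions, so this is only reliable when the
-- opening quote is a single-byte character such as " or '.
local function openingQuote(styler, pos)
    local start = pos
    while start > 0 and styler:StyleAt(start - 1) == S_STRING do
        start = start - 1
    end
    return string.char(styler:CharAt(start))
end

-- Possible use at the top of OnStyle:
--   if styler.initStyle == S_STRING then
--       local quote = openingQuote(styler, styler.startPos)
--   end
```

Only text before styler.startPos is inspected here, since styles inside the range currently being lexed may not have been written yet.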
The lexer above can lex approximately 90K per second on a 2.4 GHz Athlon 64. For most situations, this will feel completely fluid.
More complex lexers will be slower. If a lexer is so slow that the application becomes unresponsive then the lexer can choose to split up each request. It can do so by deciding upon a range of whole lines and using this range as the arguments to StartStyling. This allows the user's keystrokes and mouse moves to be processed. The lexer will automatically be called again to lex more of the document.
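A minimal sketch of splitting a request is shown below. The names chunkLength and MAX_LINES are hypothetical, and the limit of 500 lines is an arbitrary value chosen for illustration. The sketch assumes that styler.startPos falls at the start of a line and that the second argument offers PositionFromLine, as SciTE's editor pane does.

```lua
local MAX_LINES = 500 -- hypothetical limit on lines styled per call

-- Return a length, no longer than styler.lengthDoc, that ends on a line
-- boundary at most maxLines lines after the start of the requested range.
local function chunkLength(styler, ed, maxLines)
    local firstLine = styler:Line(styler.startPos)
    local lastLine = styler:Line(styler.startPos + styler.lengthDoc)
    if lastLine - firstLine < maxLines then
        return styler.lengthDoc -- small request: style all of it
    end
    local chunkEnd = ed:PositionFromLine(firstLine + maxLines)
    return chunkEnd - styler.startPos
end

-- Inside OnStyle the opening call would then become something like:
--   styler:StartStyling(styler.startPos,
--       chunkLength(styler, editor, MAX_LINES), styler.initStyle)
```

A suitable limit depends on how expensive the lexer is, so the value is best found by experiment.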
The API of the styler object passed to OnStyle:
Name | Explanation |
---|---|
StartStyling(startPos, length, initStyle) | Start setting styles from startPos for length with initial style initStyle |
EndStyling() | Styling has been completed so tidy up |
More() → boolean | Are there any more characters to process |
Forward() | Move forward one character |
Position() → integer | What is the position in the document of the current character |
AtLineStart() → boolean | Is the current character the first on a line |
AtLineEnd() → boolean | Is the current character the last on a line |
State() → integer | The current lexical state value |
SetState(state) | Set the style of the current token to the current state and then change the state to the argument |
ForwardSetState(state) | Combination of moving forward and setting the state. Useful when the current character is a token terminator like " for a string. |
ChangeState(state) | Change the current state so that the state of the current token will be set to the argument |
Current() → string | The current character |
Next() → string | The next character |
Previous() → string | The previous character |
Token() → string | The current token |
Match(string) → boolean | Is the text from the current position the same as the argument? |
Line(position) → integer | Convert a byte position into a line number |
CharAt(position) → integer | Unsigned byte value at argument |
StyleAt(position) → integer | Style value at argument |
LevelAt(line) → integer | Fold level for a line |
SetLevelAt(line, level) | Set the fold level for a line |
LineState(line) → integer | State value for a line |
SetLineState(line, state) | Set state value for a line. This can be used to store extra information from lexing, such as a current language mode, so that there is no need to look back in the document. |
startPos : integer | Start of the range to be lexed |
lengthDoc : integer | Length of the range to be lexed |
initStyle : integer | Starting style |
language : string | Name of the language. Allows implementation of multiple languages with one OnStyle function. |
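
The language field makes it possible to serve several languages from one OnStyle function. The following is a sketch only: the table name lexers and the language names "script_zog" and "script_conf" are hypothetical, and it is an assumption that styler.language holds the full lexer name including the "script_" prefix.

```lua
-- Hypothetical per-language lexers, keyed by the value expected in styler.language.
local lexers = {}

function lexers.script_zog(styler)
    styler:StartStyling(styler.startPos, styler.lengthDoc, styler.initStyle)
    while styler:More() do
        -- Code that examines the text and sets lexical states
        styler:Forward()
    end
    styler:EndStyling()
end

function lexers.script_conf(styler)
    -- A second language would be lexed here.
end

-- Single entry point: pick the lexer for the current language, if there is one.
function OnStyle(styler)
    local lex = lexers[styler.language]
    if lex then
        lex(styler)
    end
end
```

Keeping each language in its own function means the two-block structure described earlier can be applied to each one separately.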
This example is for a line-oriented language of the kind sometimes used for configuration files. It uses low-level direct calls on the editor instead of the StartStyling/More/Forward/EndStyling calls.

    function OnStyle(styler)
        local lineStart = editor:LineFromPosition(styler.startPos)
        local lineEnd = editor:LineFromPosition(styler.startPos + styler.lengthDoc)
        editor:StartStyling(styler.startPos, 31)
        for line=lineStart,lineEnd,1 do
            local lengthLine = editor:PositionFromLine(line+1) - editor:PositionFromLine(line)
            local lineText = editor:GetLine(line)
            local first = string.sub(lineText,1,1)
            local style = 0
            if first == "+" then
                style = 1
            elseif first == " " or first == "\t" then
                style = 2
            end
            editor:SetStyling(lengthLine, style)
        end
    end