Scraping documentation for syntax definitions

lexikos
Posts: 9592
Joined: 30 Sep 2013, 04:07


23 Dec 2022, 20:11

People maintaining AutoHotkey support in editors would benefit from a standard source of syntax data.

Currently the most complete source is the documentation, as it must be kept up to date and it includes its own syntax highlighter. The documentation's syntax definitions include parameter names, whereas AutoHotkey v1 itself doesn't include them in the program or source code, and v2 includes them only in the source code, and not yet for all functions.

@fincs wrote some scripts that get basic keyword lists; I'm not sure whether he's made any further progress.

I started to write a script for scraping function definitions from the documentation, but it's really not something I want to spend much time on. There are so many more interesting things I could be working on, like AutoHotkey v2.1.

So I'm starting this topic in the hope that there will be volunteers to take up this project.

Goals:
  • Produce a script (or scripts) for v2 that scrapes both v1 and v2 documentation.
  • Scrape keyword lists (easy).
  • Scrape parameter definitions for functions, directives and v1 commands.
    Exclude control flow statements, since each can have its own unique syntax.
  • Produce maps or arrays of objects, which tool authors can iterate over to dump data in an appropriate format.
  • Optional: include a function for dumping the data in standard formats such as JSON and XML.
Once it is in a working state, it should be submitted to the documentation repository and kept in sync with the documentation.
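
To make the "maps or arrays of objects" goal concrete, here is a minimal sketch of what the output could look like and how it might be dumped as JSON or XML. It is written in Python purely for illustration; the real script is meant to be AutoHotkey v2. The field names ("kind", "params", "optional"), the entries in EXAMPLE and both helper functions are assumptions, not an agreed format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical scraped result: a map of name -> definition object.
# Field names and entries are illustrative only, not an agreed schema.
EXAMPLE = {
    "ExampleFunc": {
        "kind": "function",
        "params": [
            {"name": "Required1", "optional": False},
            {"name": "Optional1", "optional": True},
        ],
    },
    "#ExampleDirective": {
        "kind": "directive",
        "params": [{"name": "Setting", "optional": True}],
    },
}


def to_json(definitions):
    """Dump the whole map as pretty-printed JSON with stable key order."""
    return json.dumps(definitions, indent=2, sort_keys=True)


def to_xml(definitions):
    """Dump the map as XML: one element per definition, one <param> per parameter."""
    root = ET.Element("definitions")
    for name, entry in sorted(definitions.items()):
        item = ET.SubElement(root, entry["kind"], name=name)
        for param in entry["params"]:
            ET.SubElement(
                item,
                "param",
                name=param["name"],
                optional=str(param["optional"]).lower(),
            )
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_json(EXAMPLE))
    print(to_xml(EXAMPLE))
```

Tool authors who need some other format would skip the dump helpers and iterate over the map directly.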

My script is in a very rough state. I would not suggest using it as a base directly, but perhaps it will be of some use for reference.
Spoiler

Some notes:
  • The script is expected to be in the root directory of a clone of the v2 documentation repository, with a v1 directory in the parent directory if isV1 is set to true.
  • I started out using eval for the index data, but later realized that it would be so trivial to parse with RegExMatch that it is probably better to do so, removing the dependency on ActiveScript/ScriptControl/MSHTML.
  • I started out trying to use MSHTML to parse the HTML (ComObjGet or WebBrowser.Navigate), but something in the documentation sidebar scripts caused it to "html-encode" all of the HTML, utterly breaking it. Parsing it with RegEx instead wasn't difficult. An alternative would be to strip out the <script> elements and use HTMLDocument.write().
  • My script uses the optional? convention new to v2, just for debug output. There are some parameter names in v1 that have a "?" suffix with no particular meaning.
  • In the loop "for item, uri in functions", "functions" can be changed to "directives" or "commands" to test those.
  • When !isV1, the script tries to verify the results by checking the properties of Func. There are several cases where I think it is not worthwhile to correct inconsistencies (e.g. Control is actually mandatory in most cases even when it is preceded by an optional parameter). It also picks up some cases where the scraping code is insufficient, because the function has multiple usages.
  • There are some cases where functions lack a <pre class="Syntax"> block in the documentation, or have multiple definitions or "overloads". It would be acceptable to special-case these within the script, or adjust the documentation.
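
As a rough illustration of the RegEx approach described in the notes above, the following sketch pulls function-style definitions out of <pre class="Syntax"> blocks. It is Python rather than AutoHotkey and is not my script. The markup in SAMPLE, the function name ExampleFunc and the output field names are all hypothetical, and the code assumes optional parameters are enclosed in literal square brackets. Default values containing commas or brackets are not handled.

```python
import html
import re

# Assumed shape of a documentation syntax block (hypothetical sample).
SAMPLE = (
    '<pre class="Syntax">Result := <span class="func">ExampleFunc</span>'
    '(Required1, Required2 <span class="optional">'
    '[, Optional1 := 0, Optional2]</span>)</pre>'
)

SYNTAX_BLOCK = re.compile(r'<pre class="Syntax"[^>]*>(.*?)</pre>', re.S)
TAG = re.compile(r'<[^>]+>')
# Optional "ReturnValue :=" prefix, then Name(arguments).
CALL = re.compile(r'(?:(\w+)\s*:=\s*)?(\w+)\s*\((.*)\)', re.S)


def parse_params(arglist):
    """Split an argument list; a parameter is optional if it sits inside [ ]."""
    params = []
    depth = 0
    # Tokens are either a single bracket or a run of text between commas.
    for token in re.findall(r'[\[\]]|[^\[\],]+', arglist):
        if token == '[':
            depth += 1
        elif token == ']':
            depth -= 1
        elif token.strip():
            name, _, default = token.partition(':=')
            params.append({
                'name': name.strip(),
                'optional': depth > 0,
                'default': default.strip() or None,
            })
    return params


def parse_syntax_blocks(page_html):
    """Return one dict per function-style syntax block found in the page."""
    results = []
    for block in SYNTAX_BLOCK.findall(page_html):
        # Drop markup such as <span class="func">, then decode entities.
        text = html.unescape(TAG.sub('', block)).strip()
        match = CALL.search(text)
        if not match:
            continue  # not function-style (e.g. a v1 command); needs its own handling
        returns, name, arglist = match.groups()
        results.append({
            'name': name,
            'returns': returns,
            'params': parse_params(arglist),
        })
    return results


if __name__ == '__main__':
    for definition in parse_syntax_blocks(SAMPLE):
        print(definition)
```

A page with several syntax blocks yields several entries, so functions with multiple usages would still need to be merged or special-cased afterwards.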
jj4156
Posts: 19
Joined: 17 Jun 2019, 07:03

Re: Scraping documentation for syntax definitions

20 Jan 2023, 13:26

I think I have done part of this job. For my own VS Code extension (focused on v1 for now, with plans to shift to v2), I collected most of the v1 functions and commands from the forum: builtin.ts. It is basically a big JSON object in a TypeScript file.
Another file may be helpful: LangSpec. It covers part of the v1 syntax.
lexikos

Re: Scraping documentation for syntax definitions

23 Jan 2023, 21:28

There are other files defining v1 and v2 syntax for other projects. That is not the point. As per the first line of my post, it is about having a standard source of syntax metadata. If a new function is added, for instance, it will be added to the documentation and can then be scraped by a script to produce up-to-date syntax definitions. As an extension author, you can hypothetically use this to keep your definitions up to date.
jj4156

Re: Scraping documentation for syntax definitions

04 Feb 2024, 07:44

This snippet scrapes the members of a class. Instead of using IE, which will execute JavaScript, htmlfile merely loads the HTML, so everything remains the same. I think it is helpful.
Spoiler
