Sort removes characters

Report problems with documented functionality
just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Sort removes characters

18 Nov 2014, 08:20

Related topic: /viewtopic.php?f=5&t=5157

Simplified example:

Code: Select all

#NoEnv
Var := "
(Join
2`r`n
3`n
4`n
5`n
6`n
7`n
8`n
9`n
1
)"
MsgBox, 0, Before Sort, %Var%
Sort, Var
MsgBox, 0, After Sort, %Var%`n`nWhere is the 9?
User avatar
jballi
Posts: 724
Joined: 29 Sep 2013, 17:34

Re: Sort removes characters

21 Nov 2014, 17:02

just me wrote:Where is the 9?
It's still there, it's just hidden from the display because of the extra `r character in the data. Don't believe me? Check the before and after length of the string.
just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Sort removes characters

22 Nov 2014, 02:35

It's not there. The length of the string is the same because two characters (`r`n) are added to the originally last line (1) and two characters are removed from the new last line (9`n). So one `r is added and the 9 is removed.

A sort function must not change the content of a variable besides the order in any case. The added `r might be acceptable because it actually wouldn't do harm in the very most cases, but the loss of a character is not.
HotKeyIt
Posts: 2364
Joined: 29 Sep 2013, 18:35
Contact:

Re: Sort removes characters

22 Nov 2014, 07:36

It assumes here that last item is terminated with CRLF and removes 2 characters but it actually ends with LF and only 1 character shoud be removed.
If I get it right following needs to be added to ResultType Line::PerformSort() to terminate LF correctly.

Code: Select all

	// ResultType Line::PerformSort()

	// Terminate the variable's contents.
	if (trailing_crlf_added_temporarily) // Remove the CRLF only after its presence was used above to simplify the code by reducing the number of types/cases.
	{
		if (dest[-2] == '\r') // ends with CRLF
		{
			dest[-2] = '\0';
			output_var.ByteLength() -= 2 * sizeof(TCHAR);
		}
		else // ends with LF
		{
			dest[-1] = '\0';
			output_var.ByteLength() -= sizeof(TCHAR);
		}
	}
	else
	...
lexikos
Posts: 9583
Joined: 30 Sep 2013, 04:07
Contact:

Re: Sort removes characters

22 Nov 2014, 18:04

Sort is not designed to handle both `n and `r`n in the same content. It's not even really designed to handle `r`n as a delimiter. `r is treated as part of the content, not the delimiter, and trailing_crlf_added_temporarily works around this by ensuring the last item ends with `r. The implication is that if it detects `r in the first item, there must be `r at the end of every other item, otherwise the "unique" option will not work correctly.

On the other hand, the documentation does not actually say that `r`n acts as a delimiter. It says "x defaults to linefeed (`n), which correctly sorts VarName if its lines end in either LF (`n) or CR+LF (`r`n)." If all lines end in `n or `r`n but not consistently, they are still sorted correctly, but an item ending with `r is not considered to be a duplicate of the same item without `r.

In just me's example, there is one line which does not end in `n or `r`n: the last line. If every line ends with `n or `r`n, as the documentation says, the items are sorted correctly.

So HotKeyIt's change should fix the truncation problem, but Sort still won't be designed to handle a mix of `n and `r`n.
just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Sort removes characters

23 Nov 2014, 03:10

Just a few thoughts:
  • lexikos wrote:... otherwise the "unique" option will not work correctly.
    That's how I understand Chris' comments, too. And I think that all other options (besides F) should work correctly. But surprisinglytrailing_crlf_added_temporarily = true; is set regardless of the U option. This should be changed.
  • The removal of duplicate items is done separately after sorting:

    Code: Select all

    		if (omit_dupes && item_prev)
    		{
    			// Update to the comment below: Exact dupes will still be removed when sort_by_naked_filename
    			// or g_SortColumnOffset is in effect because duplicate lines would still be adjacent to
    			// each other even in these modes.  There doesn't appear to be any exceptions, even if
    			// some items in the list are sorted as blanks due to being shorter than the specified 
    			// g_SortColumnOffset.
    			// As documented, special dupe-checking modes are not offered when sort_by_naked_filename
    			// is in effect, or g_SortColumnOffset is greater than 1.  That's because the need for such
    			// a thing seems too rare (and the result too strange) to justify the extra code size.
    			// However, adjacent dupes are still removed when any of the above modes are in effect,
    			// or when the "random" mode is in effect.  This might have some usefulness; for example,
    			// if a list of songs is sorted in random order, but it contains "favorite" songs listed twice,
    			// the dupe-removal feature would remove duplicate songs if they happen to be sorted
    			// to lie adjacent to each other, which would be useful to prevent the same song from
    			// playing twice in a row.
    			if (g_SortNumeric && !g_SortColumnOffset)
    				// if g_SortColumnOffset is zero, fall back to the normal dupe checking in case its
    				// ever useful to anyone.  This is done because numbers in an offset column are not supported
    				// since the extra code size doensn't seem justified given the rarity of the need.
    				keep_this_item = (ATOF(*item_curr) != ATOF(item_prev)); // ATOF() ignores any trailing \r in CRLF mode, so no extra logic is needed for that.
    			else
    				keep_this_item = tcscmp2(*item_curr, item_prev, g_SortCaseSensitive); // v1.0.43.03: Added support for locale-insensitive mode.
    				// Permutations of sorting case sensitive vs. eliminating duplicates based on case sensitivity:
    				// 1) Sort is not case sens, but dupes are: Won't work because sort didn't necessarily put
    				//    same-case dupes adjacent to each other.
    				// 2) Converse: probably not reliable because there could be unrelated items in between
    				//    two strings that are duplicates but weren't sorted adjacently due to their case.
    				// 3) Both are case sensitive: seems okay
    				// 4) Both are not case sensitive: seems okay
    				//
    				// In light of the above, using the g_SortCaseSensitive flag to control the behavior of
    				// both sorting and dupe-removal seems best.
    		}
    
    So any extra logic needed to handle a trailing \r should be implemented here (maybe it had been here before v1.0.47.05?). It will cost performance for the Unique sort, but would be much cleaner than the current solution.
  • A hint should be added to the docs to make clear that the `r will stay part of the item.
lexikos
Posts: 9583
Joined: 30 Sep 2013, 04:07
Contact:

Re: Sort removes characters

23 Nov 2014, 22:09

just me wrote:trailing_crlf_added_temporarily = true; is set regardless of the U option. This should be changed.
It affects the sort order, in the same way that using `r`n vs `n affects the sort order; shown below. At the moment, if you always use `r`n or always use `n, the results should be consistent (between the last item and other items). If trailing_crlf_added_temporarily depended on the U option, sort order would also depend on the U option.

Code: Select all

x =
(Join
a`a2`r`n
a
)
y =
(Join
a`a2`n
a
)
Sort x
Sort y
MsgBox % x "`n---`n" y
just me wrote:It will cost performance for the Unique sort, but would be much cleaner than the current solution.
I doubt that it would be "cleaner", mainly because I think it would be more convoluted, but I suppose it would produce better results.

However, the current solution also ensures that the new final item (after sorting) doesn't end up with a trailing `r, and that the former final item (before sorting) ends up delimited by `r`n, not just `n.
maybe it had been here before v1.0.47.05?
I believe the bugs trailing_crlf_added_temporarily was intended to fix existed because `r wasn't handled at all. You can probably produce the same results artificially by using some other delimiter preceded by `r. You can also see the effect by using some other visible character in place of `r.
just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Sort removes characters

24 Nov 2014, 02:27

lexikos wrote:It affects the sort order, in the same way that using `r`n vs `n affects the sort order; shown below.
Interesting, but personally I'd prefer to get the sort order of y in both cases. I doubt that there are 'real life' cases where the CR of a CRLF combination is intended to be a part of the items to be sorted. Also, we already get a different sort order depending on wether x has been read from a file by FileRead or from an edit control, because AHK 'kindly' removes the CR in the latter case.

Edit: Added the following example:

Code: Select all

#NoEnv
X := "a`ta`r`na"
Gui, Add, Edit, vED, %X%
GuiControlGet, Y, , ED
Sort, X
Sort, Y
FileDelete, Tmp.txt
FileAppend, %Y%, Tmp.Txt
FileRead, Z, Tmp.txt
Sort, Z
MsgBox % X "`n---`n" Y "`n---`n" Z
ExitApp
Edit2: I'ts a rather theoretical problem. So I suggest to implement HotKeyIt's fix and change the docs to:
Dx: Specifies x as the delimiter character, which determines where each item in VarName begins and ends. If this option is not present, x defaults to linefeed (`n), which correctly sorts VarName if all of its lines either end in LF (`n) or CR+LF (`r`n). In the latter case CR (`r) is treated as part of the line to be sorted.
lexikos
Posts: 9583
Joined: 30 Sep 2013, 04:07
Contact:

Re: Sort removes characters

24 Nov 2014, 20:33

The treatment of `r is halfway between "as part of the line" and "as part of the delimiter", due to the behaviour I mentioned earlier:
However, the current solution also ensures that the new final item (after sorting) doesn't end up with a trailing `r, and that the former final item (before sorting) ends up delimited by `r`n, not just `n.
just me wrote:Also, we already get a different sort order depending on wether x has been read from a file by FileRead or from an edit control
The sort order depends on the contents, not on the options. If the U option affected trailing_crlf_added_temporarily, the sort order would depend on both the contents and the options. Anyway, it has to be applied regardless of U for the workaround-behaviour I noted above, which is also described in the comments.
Interesting, but personally I'd prefer to get the sort order of y in both cases.
Putting aside performance or simplicity, that would be ideal; it would mean treating `r as part of the delimiter, not in any way part of the item.
just me
Posts: 9453
Joined: 02 Oct 2013, 08:51
Location: Germany

Re: Sort removes characters

25 Nov 2014, 03:49

lexikos wrote:The treatment of `r is halfway between "as part of the line" and "as part of the delimiter"
I think it's entirely treated "as part of the line" and also halfways "as part of the delimiter". And it shouldn't be treated "as part of the delimiter" at all, because of
If this option is not present, x defaults to linefeed (`n) ...
And only the delimiter should be added to the last line.
Putting aside performance or simplicity, that would be ideal;
I agree, and I don't know how it could be done correctly without a perceptible loss of performance.
Chris wrote: // OLD/OBSOLETE comment from a section that was removed because it's now handled by this section:
// Check if special handling is needed due to the following situation:
// Delimiter is LF but the contents are lines delimited by CRLF, not just LF
// and the original/unsorted list's last item was not terminated by an
// "allowed delimiter". The symptoms of this are that after the sort, the
// last item will end in \r when it should end in no delimiter at all.
// This happens pretty often, such as when the clipboard contains files.
// In the future, an option letter can be added to turn off this workaround,
// but it seems unlikely that anyone would ever want to.
Well, the new option could be L (line sort). It would strip all CRLF as well as sole LF from the lines before sorting and add back the line delimiter of the first line after the sort. This shouldn't have severe affects on existing sorts. Also, it would solve the problem of unrecognized dupes for the last line.

Return to “Bug Reports”

Who is online

Users browsing this forum: No registered users and 48 guests