How to split text into characters

Question

I would like to define a function which splits an input text into its constituent letters, so I can process each letter individually (the background to this question is that I would like to put each letter into an individual box, and would like to avoid calling my boxing function by hand over and over).

So how would I go about splitting a text into its letters, so that for each letter, I can call an appropriate subfunction?

Ulrike Fischer · Accepted Answer · 2010-09-03 16:19:16Z

24

The answer depends a lot on what you mean by a "character" and what your input looks like (e.g., does it contain commands?). One possibility is to use soul, which contains a lot of code to analyze text. For example, you can get your boxes simply by misusing the \so command:

\documentclass{article}
\usepackage{soul}

\makeatletter
\def\SOUL@soeverytoken{%
 \fbox{\the\SOUL@token}}
\makeatother
\begin{document}
\so{abcADBC kdkkk dkdk kdkdk }
\end{document}

answered Sep 3, 2010 at 16:19

Ulrike Fischer


Soul's parsing features are great.
– Will Robertson
Commented Sep 3, 2010 at 16:23
1

soul is nice as long as you are dealing with 8-bit characters (i.e. use pdfTeX); but if you use Unicode you should use XeTeX or LuaTeX and forget about soul.
– Martin Schröder
Commented Oct 29, 2012 at 12:21


egreg · Accepted Answer · 2023-06-20 09:47:39Z

This seems a job for expl3. Let's say we want to split a string of characters into its constituents, for later processing. So we define a macro that takes two arguments, the macro for the processing and the string:

\documentclass{article}

\ExplSyntaxOn

\NewDocumentCommand{\stringprocess}{ m m }
 {
  \egreg_string_process:nn { #1 } { #2 }
 }
\cs_new_protected:Npn \egreg_string_process:nn #1 #2
 {
  \text_map_inline:nn { #2 } { #1 { ##1 } }
 }
\ExplSyntaxOff

\newcommand{\boxchar}[1]{\fbox{\strut#1} } % leave a space after the box

\begin{document}

\stringprocess{\boxchar}{abcdef}

\stringprocess{\boxchar}{ábcdefß}

\end{document}

Here is a different toy problem, but “complex” characters are more problematic in this case, so I'll stick with ASCII. We want to input a string and get as a result a token list that contains only the digits in the string, separated by commas. We assume that the input is controlled, so it contains only alphanumeric characters.

All we need is to define a suitable auxiliary function, instead of the simple \boxchar used before. However, it's better to use sequences instead of token lists, so I'll rework the solution from the start.

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\seq_new:N \l__egreg_input_string_seq
\seq_new:N \l__egreg_output_string_seq

\cs_new_protected:Npn \egreg_string_process:nnn #1 #2 #3
% #1 = preprocess macro
% #2 = postprocess macro
% #3 = string
 {
  \seq_clear:N \l__egreg_output_string_seq
  \seq_set_split:Nnn \l__egreg_input_string_seq { } { #3 }
  \seq_map_inline:Nn \l__egreg_input_string_seq
   { #1 { ##1 } }
  #2
 }

\NewDocumentCommand{\boxchars}{ m }
 {
  \egreg_boxchars:n { #1 }
 }

\cs_new_protected:Npn \egreg_boxchars:n #1
 {
  \egreg_string_process:nnn
   { \egreg_fbox_strut:n }
   { \seq_use:Nnnn \l__egreg_output_string_seq { ~ } { ~ } { ~ } }
   { #1 }
 }
\cs_new_protected:Npn \egreg_fbox_strut:n #1
 {
  \seq_put_right:Nn \l__egreg_output_string_seq { \fbox { \strut #1 } }
 }
\ExplSyntaxOff

\begin{document}
\boxchars{abcdef}
\end{document}

This gives the same result as before, except that there is no trailing space after the last box, so no \unskip is needed.

The string passed as the last argument to \egreg_string_process:nnn is split into its components; the delimiter is empty here, so each character becomes one item (the four-argument variant \egreg_string_process:nnnn in the complete code below takes the delimiter as its third argument and the string as its fourth); an auxiliary "output" sequence is cleared for possible subsequent use by the preprocess or postprocess macros.
Each element of the sequence is passed to the "preprocess" macro, which should be a one-argument function.
The "postprocess" macro is applied.

In the example, the preprocess macro stores each character, wrapped in \fbox together with a \strut, in the output sequence; the postprocess macro just delivers the items of the output sequence, separated by spaces.
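The splitting step described above can be looked at in isolation. The following is only a minimal sketch, not part of the original answer: the names \showsplit and \l__demo_chars_seq are hypothetical and chosen for this illustration. It splits its argument with an empty delimiter and prints the number of items followed by the items themselves.

```latex
\documentclass{article}

\ExplSyntaxOn
% Hypothetical names, used only for this illustration
\seq_new:N \l__demo_chars_seq

\NewDocumentCommand{\showsplit}{ m }
 {
  % Empty delimiter: every token (or brace group) becomes one item
  \seq_set_split:Nnn \l__demo_chars_seq { } { #1 }
  % Print the number of items, then the items separated by slashes
  \seq_count:N \l__demo_chars_seq {} ~ items: ~
  \seq_use:Nn \l__demo_chars_seq { ~ / ~ }
 }
\ExplSyntaxOff

\begin{document}
\showsplit{abc def}
\end{document}
```

With an empty delimiter the input is treated as a token list, so the space in "abc def" is skipped and six items should be reported. If spaces matter for your application, this approach needs extra care.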

What about the toy problem? The preprocess macro should test whether the item is a digit and, if so, add it to the output sequence. Let's add this code before \ExplSyntaxOff:

\cs_new_protected:Npn \egreg_store_digit:n #1
 {
  \bool_if:nT
   {
    \int_compare_p:n { `#1 >= `0 } && \int_compare_p:n { `#1 <= `9 }
   }
   {
    \seq_put_right:Nn \l__egreg_output_string_seq { #1 }
   }
 }
% No parameter here: the postprocess code is inserted as is,
% so a function expecting an argument would grab the next token
\cs_new:Npn \egreg_print_list_commas:
 {
  \seq_use:Nnnn \l__egreg_output_string_seq { , } { , } { , }
 }

\NewDocumentCommand{\extractdigits}{ m }
 {
  \egreg_string_process:nnn
   { \egreg_store_digit:n }
   { \egreg_print_list_commas: }
   { #1 }
 }

and try with

\begin{document}
\extractdigits{a1b2c3}
\end{document}

to get

1,2,3

Complete code for the toy problem (here \extractdigits also takes an optional argument that sets the delimiter used for splitting):

\documentclass{article}
\usepackage{xparse}

\ExplSyntaxOn
\seq_new:N \l__egreg_input_string_seq
\seq_new:N \l__egreg_output_string_seq

\cs_new_protected:Npn \egreg_string_process:nnnn #1 #2 #3 #4
% #1 = preprocess macro
% #2 = postprocess macro
% #3 = separator
% #4 = string
 {
  \seq_clear:N \l__egreg_output_string_seq
  \seq_set_split:Nnn \l__egreg_input_string_seq { #3 } { #4 }
  \seq_map_inline:Nn \l__egreg_input_string_seq
   { #1 { ##1 } }
  #2
 }

\cs_new_protected:Npn \egreg_store_digit:n #1
 {
  \bool_if:nT
   {
    \int_compare_p:n { `#1 >= `0 } && \int_compare_p:n { `#1 <= `9 }
   }
   {
    \seq_put_right:Nn \l__egreg_output_string_seq { #1 }
   }
 }
% No parameter here: the postprocess code is inserted as is,
% so a function expecting an argument would grab the next token
\cs_new:Npn \egreg_print_list_commas:
 {
  \seq_use:Nnnn \l__egreg_output_string_seq { , } { , } { , }
 }

\NewDocumentCommand{\extractdigits}{ O{} m }
 {
  \egreg_string_process:nnnn
   { \egreg_store_digit:n }
   { \egreg_print_list_commas: }
   { #1 }
   { #2 }
 }

\ExplSyntaxOff

\begin{document}
\extractdigits{a1b2c3d}
\end{document}

Stefan Kottwitz · Accepted Answer · 2010-09-03 18:53:47Z

13

The xstring package provides macros for splitting strings, extracting characters of strings and replacement in strings. This may be combined with the forloop and ifthen packages if needed.
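As a rough illustration of that combination, here is a minimal sketch using \StrLen and \StrChar from xstring together with \whiledo from ifthen. The names \boxeach, \textlen, \onechar and the counter charpos are hypothetical, and the input is assumed to consist of plain ASCII characters without commands.

```latex
\documentclass{article}
\usepackage{xstring}
\usepackage{ifthen}

\newcounter{charpos}% hypothetical counter: position of the current character

% \boxeach{<text>}: hypothetical helper that boxes every character of <text>
\newcommand{\boxeach}[1]{%
  \StrLen{#1}[\textlen]% number of characters, stored in \textlen
  \setcounter{charpos}{0}%
  \whiledo{\value{charpos}<\textlen}{%
    \stepcounter{charpos}%
    \StrChar{#1}{\value{charpos}}[\onechar]% character at this position
    \fbox{\strut\onechar} % the space after the brace separates the boxes
  }%
}

\begin{document}
\boxeach{abcdef}
\end{document}
```

Every pass rescans the string to find one character, so this is best suited to short inputs. xstring expands its arguments by default, which is another reason to keep the input free of fragile commands.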

answered Sep 3, 2010 at 18:53

Stefan Kottwitz♦


1

xstring seems to be a very well thought package. Thanks!
– yannisl
Commented Sep 3, 2010 at 19:55


Holle · Accepted Answer · 2012-10-31 18:32:33Z

Stop! I think we cannot finish without a LuaLaTeX example:). Here it is:

\documentclass{article}
\usepackage{luacode}

\begin{luacode}
    function GetDigits(str) str:gsub("%d",tex.print) end
\end{luacode}

\def\getDigits#1{\directlua{GetDigits("#1")}}

\begin{document}
\getDigits{a123b4c5}
\end{document}

Ok, this is a really minimal example. Instead of the simple tex.print you can do what you want with each of the digits (print, do calculations or store them for later use). If you want boxed digits try this:

str:gsub("%d", function(d) tex.print(string.format("\\fbox{\\strut%s}",d)) end)
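Put together as a complete document, the boxed variant might look like the sketch below. It is an assumption-laden illustration rather than the answer's own code: BoxDigits and \boxDigits are hypothetical names, the luacode* environment is used so that the Lua code is read verbatim, and \luastring from the luacode package takes care of quoting the argument. It has to be compiled with LuaLaTeX.

```latex
\documentclass{article}
\usepackage{luacode}

\begin{luacode*}
-- Hypothetical helper: print every digit of str inside \fbox{\strut ...}
function BoxDigits(str)
  for d in str:gmatch("%d") do
    tex.sprint("\\fbox{\\strut " .. d .. "} ")
  end
end
\end{luacode*}

% \luastring quotes and escapes the argument for use as a Lua string
\newcommand{\boxDigits}[1]{\directlua{BoxDigits(\luastring{#1})}}

\begin{document}
\boxDigits{a123b4c5}
\end{document}
```

Replacing the pattern "%d" with another Lua pattern selects a different set of characters. Note that Lua patterns work on bytes, so non-ASCII input needs a UTF-8 aware iteration instead.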

TH. · Accepted Answer · 2010-09-03 18:01:14Z

It's fairly easy to write such a macro for yourself, at least for simple cases. Here's one I just wrote.

\newcommand*\foreachletter[2]{%
        \begingroup
        \let\templettercommand#1%
        \let\tempspacecommand#2%
        \catcode`\ 12
        \foreachlettergo
}
\def\foreachlettergo#1{%
        \testletter#1\relax
        \endgroup
}
\def\testletter#1#2\relax{%
        \if#1\otherspace
                \tempspacecommand
        \else
                \templettercommand{#1}%
        \fi
        \ifx\relax#2\relax
                \let\next\relax
        \else
                \let\next\testletter
        \fi
        \next#2\relax
}
\catcode`\ 12
\def\otherspace{ }%
\catcode`\ 10

\foreachletter\fbox\textvisiblespace{Here is some text!}

The \foreachletter macro takes three arguments. The first is a macro that takes a single argument (like \fbox); it will be expanded for each nonspace token in the third argument. The second argument is the macro that will be expanded for each space token. The third argument should only contain letters, "other" characters (for example punctuation), and spaces.

This doesn't work in all cases, but I think it's more readable than using \futurelet, which is probably the better way to do it.

Of course, one should go with an existing, well-tested solution. I just wanted to point out that it's not black magic. — TH., Commented Sep 4, 2010 at 7:26

Community · Accepted Answer · 2020-06-10 12:32:59Z

Here is a version that uses the literate option from the listings package. After a call to \ExtractDigits{a123b4c5}, all the digits are extracted and available as a comma-separated list in the macro \ListOfDigits for further processing.

Notes:

The argument to the literate option needs to include all the characters that should be ignored. So if you want to allow for any alphabetic character, then all 52 letters need to be listed (26 uppercase and 26 lowercase). If you want to allow for punctuation or other characters, those can be added as well.
I used \IfStrEq from the xstring package to ensure that a comma separator is only added after the first member of the list. If this additional package is not desired, it should be easy to rewrite this so that it is not needed.
The showframe package was used just to show the page margins.
It seems that if the last parameter, the <length> to the literate option is specified as zero, then the replacement text is not executed, but using -1 seemed to work. Not sure if this is a feature, as I do not see it in the documentation, but it does what was required without adding space in the output.

I would think that there would have been some sort of wildcard option, but I have not been able to find that yet.

References:

A solution from "How keep a running list of strings and then process them one at a time" is used to accumulate the digits in the \ListOfDigits macro.

Code:

\documentclass{article}
\usepackage{listings}
\usepackage{pgffor}
\usepackage{xstring}
\usepackage{showframe}

% https://tex.stackexchange.com/questions/14393/how-keep-a-running-list-of-strings-and-then-process-them-one-at-a-time
\newcounter{NumberOfDigits}
\def\ListOfDigits{}
\makeatletter
\newcommand{\AddToListOfNumbers}[1]{%
    \IfStrEq{\ListOfDigits}{}{}{\g@addto@macro\ListOfDigits{,}}%
    \g@addto@macro\ListOfDigits{#1}%
    \stepcounter{NumberOfDigits}%
}
\makeatother
\lstdefinestyle{FormattedNumber}{%
    literate={0}{\AddToListOfNumbers{0}}{-1}%
             {1}{\AddToListOfNumbers{1}}{-1}%
             {2}{\AddToListOfNumbers{2}}{-1}%
             {3}{\AddToListOfNumbers{3}}{-1}%
             {4}{\AddToListOfNumbers{4}}{-1}%
             {5}{\AddToListOfNumbers{5}}{-1}%
             {6}{\AddToListOfNumbers{6}}{-1}%
             {7}{\AddToListOfNumbers{7}}{-1}%
             {8}{\AddToListOfNumbers{8}}{-1}%
             {9}{\AddToListOfNumbers{9}}{-1}% 
             {a}{}{0}%
             {b}{}{0}%
             {c}{}{0}%
            % .... code here missing ... list ALL characters that are to be ignored
             {y}{}{0}%
             {z}{}{0}%
             {A}{}{0}%
             {B}{}{0}%
             {C}{}{0}%
            % .... code here missing ... list ALL characters that are to be ignored
             {Y}{}{0}%
             {Z}{}{0}%
}
\newcommand{\ExtractDigits}[1]{%
    \setcounter{NumberOfDigits}{0}%
    \lstinline[style=FormattedNumber]{#1}%
}

\begin{document}
\noindent
\ExtractDigits{a123b4c5}%
The number of digits is \arabic{NumberOfDigits}.
The digits are:\par%
\foreach \Digit in \ListOfDigits {%
    ~\Digit\par%
}%
\end{document}

yannisl · Accepted Answer · 2013-01-25 12:28:46Z

3

Just a quick hack using the coolstr package for \substr and the ifthen package for \whiledo:

  \newcounter{scancount}
   \whiledo{\value{scancount}<8}{%
      \stepcounter{scancount} 
      \thescancount 
      \substr{abcdefgh}{\thescancount}{1}
   }

This will print:

 1a 2b 3c 4d 5e 6f 7g 8h

edited Jan 25, 2013 at 12:28

answered Sep 3, 2010 at 19:49

yannisl


Which package? CTAN has no coolstring package. Do you mean substr?
– Martin Schröder
Commented Jan 25, 2013 at 12:16
@MartinSchröder My bad, should read coolstr, will fix it thanks.
– yannisl
Commented Jan 25, 2013 at 12:28


score 3 · Accepted Answer · 2019-08-23 09:15:19Z

3

This is based on some undocumented pgf routine. It doesn't require packages. You can redefine \dosomething to fit your needs.

\documentclass{article}
\makeatletter
\def\prg@token@stop{\prg@token@stop}% <- thanks to Joseph Wright
\newcommand\prg[1]{\expandafter\prg@i\@firstofone#1\prg@token@stop}
\def\prg@i#1{%
    \ifx\prg@token@stop#1% compare with the end marker defined above
    \else
     \dosomething{#1}
      \expandafter\prg@i
    \fi}  
\makeatother
\begin{document}
\newcommand\dosomething[1]{\fbox{#1}}
\prg{abcdef}
\end{document}

edited Aug 23, 2019 at 9:15

answered Aug 23, 2019 at 8:27

user194703

1

I'd be tempted to \def\prg@token@stop{\prg@token@stop} so it doesn't match other undefined tokens
– Joseph Wright ♦
Commented Aug 23, 2019 at 8:55
@JosephWright Yes, that is a good suggestion.
– user194703
Commented Aug 23, 2019 at 9:00


Steven B. Segletes · Accepted Answer · 2019-08-23 10:37:44Z

The newly released tokcycle package can do this, all the while keeping track of macros, spaces, groups (explicit and implicit), active characters, etc.

\pars are not a problem and, in this MWE, spaces are expanded a bit for clarity.

\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage{tokcycle}
\makeatletter
\begin{document}
\catcode`?=\active
\def ?{package}
\tokencycle
  {\addcytoks{\fbox{#1}}}% CHARACTERS
  {\addcytoks{\textup{\{}}\processtoks{#1}\addcytoks{\textup{\}}}}% GROUPS
  {\addcytoks{#1}}% MACROS
  {\addcytoks{\hspace{2em minus 1em}}}% SPACES
Here is my example \textit{of a \bgroup \scshape Token Cycle\egroup{} using}
the

\textsf{tokcycle} ?.
\endtokencycle
\end{document}
