1

As a pet project, I'm writing a Python CSS-selector library that lets one write expressions like the one below in Python. The goal of the library is to represent selectors in a flat, intuitive and interesting way; all valid syntax defined by the Selectors Level 4 Draft must be supported, in one way or another.

# lorem|foo.bar[baz^="qux"]:has(:invalid)::first-line
# Wrapped in parentheses so that comments can sit between the chained parts
# (a comment after a backslash continuation would end the statement).
selector = (
    (Namespace('lorem') | Tag('foo'))
    .bar
    # Can also be written as [Attribute('baz').starts_with('qux')]
    [Attribute('baz', '^=', 'qux')]
    # '>>' is used instead of ' '.
    [:'has', (Selector.SELF >> PseudoClass('invalid'),)]
    [::'first-line']
)

Here's what the hierarchy looks like (/ signifies an alias, () mixin superclasses):

Selector(ABC)  # Enum too?
├── PseudoElement
├── ComplexSelector(Sequence[CompoundSelector | Combinator])
├── CompoundSelector(Sequence[SimpleSelector])
├── SimpleSelector
│   ├── TypeSelector / Tag
│   ├── UniversalSelector
│   ├── AttributeSelector / Attribute
│   ├── ClassSelector / Class
│   ├── IDSelector / ID
│   └── PseudoClass
├── SELF / PseudoClass('scope')
└── ALL / UniversalSelector()

Combinator
├── ChildCombinator: '__gt__' / '>'
├── DescendantCombinator: '__rshift__' / '>>'
├── NamespaceSeparator: '__or__' / '|'
├── NextSiblingCombinator: '__add__' / '+'
├── SubsequentSiblingsCombinator: '__sub__' / '-'
└── ColumnCombinator: '__floordiv__' / '//'

This design has some disadvantages:

  • The replacements of combinators:

    • Descendant combinator ( ) → right shift (>>)
    • Column combinator (||) → floor division (//)
    • Subsequent-siblings combinator (~) → minus/subtract (-)

    >> and // are currently not valid combinators, but may become valid in the future. The last replacement (-) is much safer, since - is already a valid character in <ident-token>s and so is unlikely ever to become a combinator.

  • Functional pseudo-classes need a comma between their name (a string/non-callable) and their arguments (a tuple):

    • [:'where', (Class('foo'), Class('bar'))]

These disadvantages may need to be weighed when reworking the design around the following limitations:

  • HTML classes with hyphens cannot be added with Python dotted attribute syntax (.foo-bar); not to mention, this also means that any classes that implement this syntax using __getattr__/__getattribute__ won't be able to have any methods.
  • Currently there is no way to add an ID in the middle of a compound selector. Since Python doesn't have a # operator I'm at a loss. I have thought about overloading __call__ but Tag('foo').bar('baz') or Tag('foo')[Attribute('qux')]('baz') would look too much like a normal method call.

How should I go about working around these limitations?

5
  • How would this library be used? Is it simply to represent CSS selectors with a DSL, or would it include other functionality, for example parsing selector strings, or applying selectors to a DOM?
    – Jasmijn
    Commented Oct 28, 2023 at 0:09
  • @Jasmijn Yes, I intend to write a parser too, and the eventual parsing results will be represented using this hierarchy, but that is not the focus of the question.
    – InSync
    Commented Oct 28, 2023 at 0:44
  • Minor comment: you have some typos in the syntax of pseudo-classes and -elements, writing :'has' instead of ':has' etc.
    – Jasmijn
    Commented Oct 28, 2023 at 18:15
  • @Jasmijn That's not a typo. selector[:'has'] calls __getitem__() with a slice(None, 'has', None) (and [::'first-line'] with slice(None, None, 'first-line')).
    – InSync
    Commented Oct 28, 2023 at 23:10
  • Oh right, this is a slicing operation, not a list display. I hope you reconsider, this is an extremely unintuitive use of slicing.
    – Jasmijn
    Commented Oct 29, 2023 at 2:59
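
For readers following the exchange above, the slicing behaviour can be checked with a minimal sketch. SliceProbe is a hypothetical helper, not part of the library under discussion; it only echoes whatever key __getitem__ receives.

```python
class SliceProbe:
    """Hypothetical helper that returns the key passed to __getitem__."""

    def __getitem__(self, key):
        return key


probe = SliceProbe()

# [:'has'] is parsed as a slice whose stop is the string 'has'.
has_key = probe[:'has']
assert has_key == slice(None, 'has', None)

# [::'first-line'] is a slice whose step is the string 'first-line'.
element_key = probe[::'first-line']
assert element_key == slice(None, None, 'first-line')

# With a comma, the key becomes a tuple: (slice, arguments).
where_key = probe[:'where', ('foo', 'bar')]
assert where_key == (slice(None, 'where', None), ('foo', 'bar'))
```

So a class that supports this syntax has to inspect which slot of the slice is filled in order to tell a pseudo-class from a pseudo-element.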

2 Answers

2

Because Python has very different syntax and semantics from CSS selectors, I think these problems will only get worse. You'll end up with something that doesn't look like CSS does and something that doesn't work like Python usually does. Therefore I would like to propose a different way of approaching the syntax.

CSS selectors are mostly a linear combination of simple selectors and combinators. I would suggest using that, and representing something like ns|p a:link as something like Tag('p', namespace='ns') + Descendant() + Tag('a') + PseudoClass('link').

That is, you only use a single magic method to represent concatenation. Everything else is just regular Python objects, using regular Python constructors.

Your example could be

# lorem|foo.bar[baz^="qux"]:has(:invalid)::first-line
selector = Tag('foo', namespace='lorem') + \
           Class('bar') + \
           Attribute('baz', '^=', 'qux') + \
           PseudoClass('has', Selector.SELF + Descendant() + PseudoClass('invalid')) + \
           PseudoElement('first-line')

It may not be exactly what you were looking for, but it has the advantage that it is much easier to learn for Python users because it has far fewer rules and exceptions, and you don't need to worry about new selectors or incompatible syntax.

You can also use & instead of +, in which case you can represent a selector list with |, for example: p.warning, #bigwarning can become Tag('p') & Class('warning') | ID('bigwarning').
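
To make the idea concrete, here is a minimal sketch of how the & and | variant might be wired up. Everything in it is an assumption for illustration: the Node, Chain and SelectorList names are hypothetical, only a handful of selector types are covered, and nothing validates that the resulting sequence is a legal selector.

```python
from dataclasses import dataclass


class Node:
    """Hypothetical base class: '&' concatenates, '|' builds a selector list."""

    def __and__(self, other):
        return Chain(as_parts(self) + as_parts(other))

    def __or__(self, other):
        return SelectorList(as_items(self) + as_items(other))


@dataclass(frozen=True)
class Tag(Node):
    name: str
    namespace: str = ""

    def __str__(self):
        prefix = f"{self.namespace}|" if self.namespace else ""
        return prefix + self.name


@dataclass(frozen=True)
class Class(Node):
    name: str

    def __str__(self):
        return f".{self.name}"


@dataclass(frozen=True)
class ID(Node):
    name: str

    def __str__(self):
        return f"#{self.name}"


@dataclass(frozen=True)
class PseudoClass(Node):
    name: str

    def __str__(self):
        return f":{self.name}"


@dataclass(frozen=True)
class Descendant(Node):
    def __str__(self):
        return " "


@dataclass(frozen=True)
class Chain(Node):
    parts: tuple

    def __str__(self):
        return "".join(str(part) for part in self.parts)


@dataclass(frozen=True)
class SelectorList(Node):
    items: tuple

    def __str__(self):
        return ", ".join(str(item) for item in self.items)


def as_parts(node):
    # Flatten nested chains so that a & b & c stays one flat sequence.
    return node.parts if isinstance(node, Chain) else (node,)


def as_items(node):
    # Flatten nested lists so that a | b | c stays one flat list.
    return node.items if isinstance(node, SelectorList) else (node,)


warning = Tag('p') & Class('warning') | ID('bigwarning')
link = Tag('p', namespace='ns') & Descendant() & Tag('a') & PseudoClass('link')
print(warning)  # p.warning, #bigwarning
print(link)     # ns|p a:link
```

Because & binds more tightly than | in Python, the selector list needs no extra parentheses, which happens to match how the comma works in CSS.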


An alternative idea is to use no magic at all, and represent compound and/or complex selectors using lists or wrapper objects.

foo.bar > a might be something like Child([Tag('foo'), Class('bar')], [Tag('a')]) (compound selectors are lists, complex selectors are wrapped by combinators) or [Tag('foo'), Class('bar'), Child(), Tag('a')] (complex selectors are lists containing the selectors).
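
A minimal sketch of the second, flat-list form, assuming hypothetical serialize and split_compounds helpers and only the three types needed for this example:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    name: str

    def __str__(self):
        return self.name


@dataclass(frozen=True)
class Class:
    name: str

    def __str__(self):
        return f".{self.name}"


@dataclass(frozen=True)
class Child:
    def __str__(self):
        return " > "


def serialize(parts):
    """Turn a flat list of selectors and combinators back into CSS text."""
    return "".join(str(part) for part in parts)


def split_compounds(parts):
    """Group a flat list into compound selectors separated by combinators."""
    groups, current = [], []
    for part in parts:
        if isinstance(part, Child):
            groups.append(current)
            groups.append(part)
            current = []
        else:
            current.append(part)
    groups.append(current)
    return groups


flat = [Tag('foo'), Class('bar'), Child(), Tag('a')]
css_text = serialize(flat)
grouped = split_compounds(flat)
```

With plain lists, users get slicing, iteration and concatenation for free, at the cost of having no place to validate the sequence until it is serialized or matched.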

The best option depends on ergonomics, and the ergonomics depend on how users will build and manipulate selectors and for what purpose.

1

You want to represent CSS concepts using valid python syntax which "looks like" the CSS source text.

The simplest approach would be to stick with straight CSS source text, which we can roundtrip through deserializers and serializers. Representing punctuation-heavy CSS as python source will be inherently lossy, so you're going to have to store the details somewhere, perhaps in a global dict or in various """docstrings""".

It would be worthwhile to explicitly write down your various goals and tradeoffs. For example, getting IDE navigation / autocompletion "for free" might be one of the things you find attractive about your proposed scheme.

Python notation has been exploited for representing SI units, algebra, vector math, and pathnames. The notation is already a good fit for these domains, in some cases because the language strove to be a good fit. So lossless representation can often be achieved.

There are two mature problem domains that you might wish to take inspiration from.


sqlalchemy

The SQLAlchemy community uses at least one python DSL, arguably multiple ones, to represent SQL operations.

The impedance match is not perfect. Operator precedence is a bit of a rough edge, with a OR b turning into (a) | (b) when the two terms are complex. For some operators, such as IN, we resort to .in_() method call notation despite the in keyword seeming to be available.
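
The precedence rough edge can be reproduced without SQLAlchemy. The sketch below uses a hypothetical Expr class (not SQLAlchemy's API) that overloads == and | to build SQL-like text; it relies only on Python's own rule that | binds more tightly than ==.

```python
class Expr:
    """Hypothetical expression node that renders SQL-like text."""

    def __init__(self, sql):
        self.sql = sql

    def __eq__(self, other):
        return Expr(f"{self.sql} = {other!r}")

    def __or__(self, other):
        return Expr(f"({self.sql}) OR ({other.sql})")


a, b = Expr("a"), Expr("b")

# Parenthesised terms behave as intended.
combined = (a == 1) | (b == 2)

# Without parentheses Python parses this as a == (1 | b) == 2,
# so it tries to evaluate 1 | b first and fails.
try:
    a == 1 | b == 2
    unparenthesised_failed = False
except TypeError:
    unparenthesised_failed = True
```

A selector DSL that overloads >, +, - and >> together would face the same kind of surprise, since each of those operators has its own fixed precedence in Python.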

Table or column names in principle can incorporate SPACE and many other characters, especially when "`" or other quoting mechanisms are used. But in practice DBAs will often choose to adhere to a conservative regex such as r"^\w+$". Your approach might offer enough advantages that web designers would choose to adhere to conservative naming conventions, so e.g. "a-b" --> a_b --> "a-b" could be safely round-tripped.
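
As a rough sketch of that round trip, assuming a convention where CSS class names contain letters, digits and hyphens but never underscores. The Compound class and both helper functions are hypothetical names introduced for illustration.

```python
import re

# Assumed convention: letters, digits and hyphens only, no underscores.
CONSERVATIVE_NAME = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def css_to_python(name):
    """Map a conservative CSS class name to a Python attribute name."""
    if not CONSERVATIVE_NAME.fullmatch(name):
        raise ValueError(f"not a conservative class name: {name!r}")
    return name.replace("-", "_")


def python_to_css(name):
    """Inverse mapping; lossless only under the convention above."""
    return name.replace("_", "-")


class Compound:
    """Hypothetical compound selector; attribute access appends a class."""

    def __init__(self, tag, classes=()):
        self._tag = tag
        self._classes = tuple(classes)

    def __getattr__(self, name):
        # Only called when normal lookup fails, so real attributes and
        # methods win; a method name would shadow a same-named CSS class.
        if name.startswith("_"):
            raise AttributeError(name)
        return Compound(self._tag, self._classes + (python_to_css(name),))

    def __str__(self):
        return self._tag + "".join(f".{cls}" for cls in self._classes)


selector_text = str(Compound("foo").bar.foo_bar)
round_tripped = python_to_css(css_to_python("a-b"))
```

Names outside the convention, such as "a_b", are rejected by css_to_python rather than silently mangled, so they would still need an explicit constructor call.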

SQL JOINs are commonly more than a hundred lines long, and a great many production queries have been recast to fit within this DSL.


type annotation

Type hinting continues to be something of a moving target in the python community. An application's source code might be read by an "old" or "new" interpreter, or type checker.

Expressing types in a back-compatible way for old interpreters or checkers has been a source of tension, often relieved via a string annotation escape valve. Forward references sometimes raise challenges that are resolved in the same way. In recent years we've had less need for this escape valve.

We see annotations appearing in the AST, and also in comment text.

The experiences of the type annotation community seem most relevant to your CSS goals.


Your goals are still a bit nebulous at this point. Several developer communities have traveled down this road, showing what works well, or poorly, or would work better after adopting some PEPs. You may be able to draw inspiration, learn from mistakes, and better predict a path to success by looking back at this history and incorporating some elements in your project goals.

2
  • I don't see how this answers my question. I need a working hierarchy so that I can use it to represent the parsing results of my (to-be-written) CSS-selector parser. That SQLAlchemy also uses Python operator overloading has nothing to do with my CSS selector library, and type annotation is a matter of implementation, not of API design/library architecture.
    – InSync
    Commented Oct 28, 2023 at 23:20
  • 1
    The OP question needs focus, details, or clarity. I was encouraging you to refine your goals. Otherwise my short answer would be, "a python parse tree is inadequate to represent the CSS details you care about; you should abandon this hopeless end game." Which I phrased as "stick with straight CSS source text". I was trying to channel your requirements gathering efforts in a constructive direction, to turn a Quixotic cause into one that stood a chance of delivering value to end users who might choose to adopt it.
    – J_H
    Commented Oct 29, 2023 at 1:38
