There are always a wide variety of Options, Tools and Approaches available. Picking the CORRECT ones to do the work as EFFICIENTLY and ELEGANTLY as possible, is the ART to be mastered.
At its core, a parser is a software tool designed to extract specific data from various types of content.
Parsing by definition:
Parse Definition & Meaning - Merriam-Webster
1 a : to divide (a sentence) into grammatical parts and identify the parts and their relations to each other
b : to describe (a word) grammatically by stating the part of speech and explaining the inflection (see inflection sense 2a) and syntactical relationships
2 : to examine in a minute way : analyze critically
So, in essence, we are sifting through phrases, sentences, and paragraphs of content (written in this case), looking for the relationships among these words, and trying to understand the message they convey in CONTEXT to where, why, and how they are used.
The data sources can vary, but typically in Web Scraping, it would be scraped from the HTML of the webpage or website. JSON, XML, or any other type of structured data syntaxes are solvable using the same approach.
From Business Model to Data
The scenario for a parser leads to another thought pattern, namely DDD. It would be prudent to design the Parser to behave like a “black/white/pink/X/Y/Z” box, isolated and stand-alone.
We can then interface with the parser via translators and ACLs to ensure it doesn’t affect the rest of the main application when modifications are made to its model.
We design the parser to do only one core thing, Parse HTML or Parse XML, if we need to “interpret” both language models, we design two unique and dedicated Parsers.
Approaches
First, we need to look at quite a few data examples to get a good sense of what we are dealing with before we start writing code for a dedicated parser.
We need to understand the “structure” of the language targeted, compensating for a variety of subtle, but very important syntactical “markers” for example.
Approach Example A)
Employing Behaviour-Driven Design (BDD)
Using BDD’s approach where we generate Specifications of the Required Behaviour of the code for the Model, we can use the testing language called Gherkin as one approach example when developing a Gherkin Parser. 1
Gherkin uses a scenario to express a testable behaviour within the feature test. Every Gherkin scenario belongs to a feature, and a scenario may have many steps.
Gherkin’s parent-child relationships are indicated by the indentation of the file.
From the data example in Figure 1.1, we have the following program structure:
1. Feature
1. Scenario
1. Step
2. Step
2. Scenario
1. Step
2. Step
3. Step
The empty lines on lines two and seven can largely be ignored for our purposes. Lines three and eight contain an interesting construct we need to consider:
The Gherkin #comment.
Gherkin comments can appear on any line, with any number of leading whitespace, but will always start with the # character. These few facts will make it relatively painless for us to parse these later.
We can update our mental program structure to:
1. Feature
1. Comment
2. Scenario
1. Step
2. Step
3. Comment
4. Scenario
1. Step
2. Step
3. Step
The next question is whether these structures are identified exclusively by their indentation, or does Gherkin provide a different way to distinguish a scenario from a step?
Continuing to peruse the documentation, we find that every non-blank line must begin with a Gherkin keyword. They call out that the exception to this rule is free-form descriptions, so we will need to look at this as well later.
We notice there are certain keywords used repeatedly and this is another indication of how we would need to construct the model to accommodate the syntactical structure of the “content-language” the parser will be working with.
The same approach could be applied to an HTML or any other parser.
Approach Example B)
Applying Domain-Driven Design (DDD) to Features
When running a business, especially online, in the digital domain the business owners or marketing and sales division will often request a new “flashy feature”. Why? To torment the dev team? Eh, maybe sometimes…, but mostly because they have noticed the clients out there are keen on having a more efficient way to get things done. The competition is offering such, and the “Firm” needs to stay current, living on the edge and competitive.
So, there we go. Yesterday’s cool feature is simply not cutting it anymore and changes are needed.
Now comes the nightmare: 5k lines of code need to be “re-spaghettified” to accommodate “Change my Default Language”.
If the “feature” has grown “heavy roots” into the DOM’s structure and isn’t flexible and suitably isolated from the rest of the app’s HTML, Database Schemas, etc., I’m suspecting an; “Aww-rrR…, Pain will be scooped and will be had, by the ladle-full, me mateys!” situation.
If, however, the original feature was isolated in its own domain, and not tangled deeply into the whole site’s Frontend and Backend, it would be far simpler and more reliable to modify or create a completely new “Feature Black-box” and simply slot it in when tested and ready. This is fairly common knowledge. Whether it’s common practice though, is another question.
This scenario directly boils down to our parser models as well. But how?
Well, let’s theorise for a moment on the best tool today vs the best tool tomorrow:
Imagine we have a great app, PHP-based, solid as rock, reliable and doing exactly what the business model requires. And there was bliss…
Soon we have more and more requirements for improved versions of features and futures.
Business: “We need to do better. Our MRR is dropping.”
DevLead: “Why fiddle with a working feature, it is as solid as it will ever be!”
Business: “The other folks are killing us with the speeds they are processing their data!”
DevLead: “But we will have to redevelop the whole app. It will take several months if not years!”
Business: “There’s got to be a way... Come on people, you’ve always done the impossible!”
Business: “What’s this RUST language I hear people are using these days to get stuff done at, what…? Warp 9.95…?”
DevLead: ... tearing up…“If only we built this app another way…”
So, from the silly conversation, one might form a variety of possible reasons why things are what they are, and what they could have been. “All’s fair in love and war,” they say. And yes, that is how life goes. What we can attempt though, is to continue to think out of the box, and in doing so design for these possible scenarios.
Say we applied DDD and the Business agreed from the onset that more time and money spent on diligent strategic planning UPFRONT with the Dev Team would be more prudent.
Say the Dev Team advised that yes, PHP will be the most cost-effective and elegant way to build this parser as things are today, but we know this world, it is CONSTANTLY changing, so let’s talk, research, and plan for probable scenarios then we can all come to a well-informed conclusion whether it will be worth it to spend the extra X% now and probably benefit from it by Y% tomorrow.
How? Well, we need to build the parser in such a way that we can switch from using a PHP-based version to say a Python, or RUST version. All while leaving the rest of the platform untouched. We provide enough “space” to fit in that 20k horsepower V-36 of the future and make sure the “chassis and drive chain” can manage, or at least be adapted more easily and effectively. We won’t need all the actual “space” right from the start, but to be able to put that baby in, we would only need a dedicated server to run the parser on, using RUST and WASM for example.
Crazy? Yes, it does sound like it, but if that’s where the world is going to go, and it probably will, we should foresee this possibility and spend a bit more time, and therefore cash on architecting the design to accommodate such a “freak of future”.
And while we’re at it let’s say we stick to REST API calls to the DB for now, but ensure we can scrape all reactions, comments and other highly dynamic data using GraphQL as well. So, depending on how this “Data Processing Monster” grows in future, it would be far easier to adapt, reconfigure and stay in the Data Game.
Granted, it has zero to do with the Parser’s implementation details, but it has ALL to do with the Business surviving another day, and doing what it does best, Scrape, Parse, and Process Data for the intended Market. The rest is the devil’s tail. Always there, always present, and always in the details.
Conclusion on the Approach
Using this approach helps to produce valuable software objects for the Business Model and allows us to think through the end result of the Business Requirement without getting distracted by all of the finer details of developing the parser. That will, or should preferably only come later. So, relax, the “pain” of implementation WILL come, it WILL happen, but why not regulate the “masochism” to your own “design” instead of “rolling down the rabbit holes” in a nonconsensual way?
When too much of the parser’s implementation detail leaks UP, through into the final business data models, it generally produces a result that’s rather cumbersome, error-prone and unfriendly to work with.
It is tricky at first to “disconnect” our thought patterns and use a more abstract, call it “reverse-thinking”, Modelling process, but once the value of focussing on What the Business Model needs to achieve, Why it is needed, When it’s needed, Where it’s needed, and by Whom it might be needed, the Business Model gets satisfied and the intended Business Operation remains intact. Last, we can look at the How we will need to do it elegantly and satisfy all those lovely “Business-Ws”.
Once the model was proven to be sound and suitable by the Domain Experts on the matter — “What do the business people actually want to achieve?” — it is much easier to apply the best software tools and techniques to ensure it does exactly what the Business Model requires, nothing more, nothing less.
When it comes to the parser’s implementation details, we can now focus on just that: The devil’s de-tailing.
These examples were taken from the book “Advanced PHP Strings, Text Analysis, Generation, and Parsing via Laravel” by John Koster.