hydrus/help/downloader_parsers.html

<html>
	<head>
		<title>downloader - parsers</title>
		<link href="hydrus.ico" rel="shortcut icon" />
		<link href="style.css" rel="stylesheet" type="text/css" />
	</head>
	<body>
		<div class="content">
			<p><a href="downloader_url_classes.html"><---- Back to URL Classes</a></p>
			<p class="warning">This system still needs work. The user interface remains a hellscape, so I won't put in screenshots for now.</p>
			<h3>parsers</h3>
			<p>In hydrus, a parser is an object that takes a single block of HTML or JSON data (as returned by a URL) and returns many kinds of hydrus-level metadata.</p>
			<p>Parsers are flexible and potentially complicated. You might like to open <i>network->manage parsers</i> and explore the UI as you read this page. Check out how the examples in the client work, and if you want to write a new one, see if there is something already in there that is similar--it is usually easier to duplicate an existing parser and then alter it than to create a new one from scratch every time.</p>
			<p>There are three main components in the parsing system:</p>
			<ul>
				<li><b>Formulae:</b> Take parsable data, search it in some manner, and return 0 to n strings.</li>
				<li><b>Content Parsers:</b> Take parsable data, apply a formula to it to get some strings, and apply a single metadata 'type' and perhaps some additional modifiers.</li>
				<li><b>Page Parsers:</b> Take parsable data, apply content parsers to it, and return all the metadata.</li>
			</ul>
			<p>Formulae do the grunt work of parsing and string conversion, content parsers turn the strings into something richer, and page parsers are the containers.</li>
			<h3 id="formulae">formulae</h3>
			<p>A formula takes some data and returns some strings. The different kinds are:</p>
			<ul>
				<li><h3 id="html_formula">html</h3></li>
				<li>This takes HTML or a sample of HTML--and any regular sort of XML <i>should</i> also work, it is not at all strict--searches for nodes with certain tag names and/or attributes, and then returns those nodes' particular attribute value, string content, or html beneath.</li>
				<li>The search occurs in steps:</li>
				<li>(image of a decent formula with several steps)</li>
				<li>Each step will be applied in turn, starting at the root node and searching beneath the nodes found in the previous step.</li>
				<li>For instance, if you have this html:</li>
				<li><pre>&lt;html&gt;
  &lt;body&gt;
    &lt;div class="media_taglist"&gt;
      &lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;blonde hair&lt;/a&gt; (3456)&lt;/span&gt;
      &lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;blue eyes&lt;/a&gt; (4567)&lt;/span&gt;
      &lt;span class="generaltag"&gt;&lt;a href="(search page)"&gt;bodysuit&lt;/a&gt; (5678)&lt;/span&gt;
      &lt;span class="charactertag"&gt;&lt;a href="(search page)"&gt;samus aran&lt;/a&gt; (2345)&lt;/span&gt;
      &lt;span class="artisttag"&gt;&lt;a href="(search page)"&gt;splashbrush&lt;/a&gt; (123)&lt;/span&gt;
    &lt;/div&gt;
    &lt;div class="content"&gt;
      &lt;span class="media"&gt;(a whole bunch of content that doesn't have tags in)&lt;/span&gt;
    &lt;/div&gt;
  &lt;/body&gt;
&lt;/html&gt;</pre></li>
				<li>To find the artist, "splashbrush", you would want to:</li>
				<ul>
					<li>get every &lt;div&gt; tag with attributes class=media_taglist</li>
					<li>and then get every &lt;span&gt; tag with attributes class=artisttag</li>
					<li>and then get the string content of those tags</li>
				</ul>
				<li>This will return a single string, "splashbrush". Changing the "artisttag" to "charactertag" or "generaltag" would give you "samus aran" and "blonde hair","blue eyes","bodysuit" respectively.</li>
				<li>You might be tempted to just go straight for the &lt;span&gt; with class=artisttag, but many sites use the same class to render a sidebar of favourite/popular tags or some other sponsored content, so it is best to make sure you narrow down to the larger &lt;div&gt; container so you don't get anything you don't want.</li>
				<li>When you add or edit one of these rules, you get this:</li>
				<li>(image of rule edit panel)</li>
				<li>Note that you can select to get only the 1st or xth instance of a found tag if you like, which can be useful in situations like this:</li>
				<li><pre>&lt;span class="generaltag"&gt;
  &lt;a href="(add tag)"&gt;+&lt;/a&gt;
  &lt;a href="(remove tag)"&gt;-&lt;/a&gt;
  &lt;a href="(search page)"&gt;blonde hair&lt;/a&gt; (3456)
&lt;/span&gt;</pre></li>
				<li>Without any more attributes, there isn't a good way to distinguish the &lt;a&gt; with "blonde hair" from the other two--so just set 'get the 3rd &lt;a&gt; tag' and you are good.</li>
				<li>Once you have narrowed down the right nodes you want, you can decide what to return. So, given a node of:</li>
				<li><pre>&lt;a href="(URL A)" class="thumb"&gt;Forest Glade&lt;/a&gt;</pre></li>
				<li>Returning the 'href' attribute would return the string "(URL A)", returning the string content would give "Forest Glade", and returning the full html would give "&lt;a href="(URL A)" class="thumb"&gt;Forest Glade&lt;/a&gt;". This last choice is useful in complicated situations where you want a second, separated layer of parsing, which we will get to later.</li>
				<li><h3 id="json_formula">json</h3></li>
				<li>This takes some JSON and does a similar style of search:</li>
				<li>(image of edit formula panel)</li>
				<li>It is a bit simpler than HTML--if the current node is a list (called an 'Array' in JSON), you can fetch every item or the xth item, and if it is a dictionary (called an 'Object' in JSON), you can fetch a particular string entry. Since you can't jump down several layers with attribute lookups or tag names, you have to go down every layer one at a time. In any case, if you have something like this:</li>
				<li><a href="json_thread_example.png"><img src="json_thread_example.png" width="50%" height="50%"/></a></li>
				<li>Then searching for "posts"->1st list item->"sub" will give you "Nobody like kino here.".</li>
				<li>Then searching for "posts"->all list items->"tim" will give you the three file hashes (since the third post has no file attached, the parser skips over it without complaint).</li>
				<li>Searching for "posts"->1st list item->"com" will give you the OP's comment, <span class="dealwithit">~AS RAW UNPARSED HTML~</span>.</li>
				<li>The default is to fetch the final nodes' 'data content', which means coercing simple variables into strings. If the current node is a list or dict, no string is returned.</li>
				<li>But if you like, you can return the json beneath the current node (which, like HTML, includes the current node). This again will come in useful later.</li>
				<li><h3 id="compound_formula">compound</h3></li>
				<li>If you want to create a string from multiple parsed strings--for instance by appending the 'tim' and the 'ext' in our json example together--you can use a Compound formula. This fetches multiple lists of strings and tries to place them into a single string using \1 \2 \3 regex substitution syntax:</li>
				<li>(image of the edit panel--use the thread watcher one with complicated gubbins)</li>
				<li>This is where the magic happens, sometimes, so keep it in mind if you need to do something cleverer than the data you have seems to provide.</li>
				<li><h3 id="context_variable_formula">context variable</h3></li>
				<li>desc</li>
				<li>ui walkthrough</li>
				<li>misc</li>
			</ul>
			<p>talk about string match and string converter</p>
			<p>how to test</p>
			<p>It is a great idea to check the html or json you are trying to parse with your browser. Most web browsers have great developer tools that let you walk through the different nodes in a pretty way. The JSON image above is one of the views Firefox provides if you simply enter a JSON URL.</p>
			<h3 id="content_parsers">content parsers</h3>
			<p>different types and what they mean</p>
			<p>hash needs conversion to bytes</p>
			<p>vetos</p>
			<h3 id="page_parsers">page parsers</h3>
			<p>pre-parsing conversion example for tumblr</p>
			<p>example urls are helpful</p>
			<p>mention vetos again</p>
			<p>subsidiary page parsers and what that is for</p>
			<h3>page example</h3>
			<p>do a danbooru example with sample image stuff</p>
			<h3>gallery example</h3>
			<p>something with an API?</p>
			<h3>thread example</h3>
			<p>subsidiary page parsers in the example</p>
			<p>source time and subject->comment fallback fun</p>
			<p>The context variable bit to fetch the right board for the file url</p>
			<p class="right"><a href="downloader_searches.html">Let's learn about Searches ----></a></p>
		</div>
	</body>
</html>