reading huge xml with xmlreader

sep 22 2019, 5:04am

The DOMDocument class works well for small XML files, but it loads the whole document into memory. With a large or huge file the script may stall or die without giving you any error at all. For large XML, use XMLReader instead: it streams through the file node by node, which keeps your server's memory usage low.
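
To illustrate the streaming model, here is a minimal sketch that counts elements by name without ever building a full tree. The helper name countElements and the inline sample XML are made up for this example; a real script would call open() on a file rather than XML() on a string.

```php
<?php
// Hypothetical helper: counts elements named $name while streaming through $xml.
function countElements(string $xml, string $name): int
{
	$reader = new XMLReader();
	if (!$reader->XML($xml)) {
		throw new RuntimeException('Cannot load XML');
	}

	$count = 0;
	// read() advances one node at a time; only the current node is held in memory
	while ($reader->read()) {
		if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $name) {
			$count++;
		}
	}
	$reader->close();

	return $count;
}

// Assumed sample data, much smaller than a real catalog
$sample = '<artists><artist id="1"><name>A</name></artist><artist id="2"><name>B</name></artist></artists>';
echo countElements($sample, 'artist'), "\n";
```

Because nothing is kept between iterations, memory use should stay roughly flat no matter how many elements the file contains.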

XML source

A huge sample XML file was downloaded from this page of the Karaoke Version affiliate program. It generally has the following structure:

<artists>
	<artist id="2000">
		<name>The Solids</name>
		<name_sorted>Solids, The</name_sorted>
		<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/</url>
		<rank>5866</rank>
		<songs>
			<song id="5022">
				<name>Hey Beautiful</name>
				<url>http://www.karaoke-version.com/mp3-backingtrack/the-solids/hey-beautiful.html</url>
				<rank>24467</rank>
				<preview>http://www.karaoke-version.com/preview/57278/</preview>
				...
			

If we need to save that data into our database, we can format each record as one data row (line): the artist's name, the song's name and the link for previewing the audio. Example:

The Solids, Hey Beautiful, http://www.karaoke-version.com/preview/57278/
...
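
One caveat with a plain comma-joined line: values such as "Solids, The" contain commas themselves, so the row becomes ambiguous. A minimal sketch of one way around that, assuming CSV is an acceptable intermediate format, is to let fputcsv() do the quoting. The helper name formatRow is hypothetical.

```php
<?php
// Hypothetical helper: formats one record as a single CSV line (no trailing newline).
function formatRow(string $artist, string $song, string $preview): string
{
	$buffer = fopen('php://memory', 'r+');
	// delimiter, enclosure and escape are passed explicitly instead of relying on defaults
	fputcsv($buffer, [$artist, $song, $preview], ',', '"', '\\');
	rewind($buffer);
	$line = stream_get_contents($buffer);
	fclose($buffer);

	return rtrim($line, "\r\n");
}

echo formatRow('Solids, The', 'Hey Beautiful', 'http://www.karaoke-version.com/preview/57278/'), "\n";
```

A line produced this way can be split back into its three fields with str_getcsv(), even when a name contains a comma.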
			

PHP script

<?php
	$t = time();
	$m = memory_get_usage();

	const XML_FILENAME = 'karaokeversion_catalog_en_GBP.xml';
	
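	$liner = new XMLReader();
	//stop early if the file is missing or unreadable
	if(!$liner->open(XML_FILENAME)){
		exit('cannot open ' . XML_FILENAME . "\n");
	}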

	$artistCount = 0;	//number of artists
	$songCount = 0;		//number of songs (all artists)

	while($liner->read()){
		if($liner->nodeType === XMLReader::ELEMENT && $liner->name === 'artist'){
			
			//expand the current element into a DOM node (a copy of this artist's whole subtree)
			$node = $liner->expand();
			
			//for each artist node found, assume unknown artist name, initialize it
			$artistName = '';
			
			//walk through this artist node's child nodes to find artist name and songs
			for($j = 0; $j < $node->childNodes->length; $j++){
				$nodeChild = $node->childNodes->item($j);
				if($nodeChild->nodeType === XML_ELEMENT_NODE && $nodeChild->nodeName === 'name')
					$artistName = $nodeChild->nodeValue;
				elseif($nodeChild->nodeName === 'songs'){
					
					//walk through this songs node's child nodes
					for($k = 0; $k < $nodeChild->childNodes->length; $k++){
						$nodeGrandChild = $nodeChild->childNodes->item($k);
						if($nodeGrandChild->nodeType === XML_ELEMENT_NODE && $nodeGrandChild->nodeName === 'song'){

							//for each song node found, assume unknown song details, initialize them
							$songName = '';
							$songPreview = '';
					
							//walk through this song node's child nodes
							for($l = 0; $l < $nodeGrandChild->childNodes->length; $l++){
								$nodeGrandGrandChild = $nodeGrandChild->childNodes->item($l);
								if($nodeGrandGrandChild->nodeType === XML_ELEMENT_NODE){
									if($nodeGrandGrandChild->nodeName === 'name')
										$songName = $nodeGrandGrandChild->nodeValue;
									elseif($nodeGrandGrandChild->nodeName === 'preview')
										$songPreview = $nodeGrandGrandChild->nodeValue;
								}
							}
							//add validation first here then format a new entry line to be stored somewhere
							if(!empty($artistName) && !empty($songName) && filter_var($songPreview, FILTER_VALIDATE_URL)){
								echo "$artistName, $songName, $songPreview\n";
								$songCount++;
							}
						}
					}
				}
			}
			$artistCount++;
		}
	}//end while
	
	$liner->close();

	//report
	$t = time() - $t;
	//peak usage shows the cost of the largest expanded node, not just what is left at the end
	$m = memory_get_peak_usage() - $m;
	
	echo "time spent: $t seconds.\n",
		"memory usage: $m bytes.\n",
		"artist count: $artistCount.\n",
		"song count: $songCount.\n";
?>
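
The nested loops in the script can be flattened. As a hedged alternative sketch (not the script used for the demo), the version below jumps from one artist element to the next with next(), which skips the subtree that was just handled, and parses each artist with SimpleXML through readOuterXml(). The helper name extractRows and the inline sample are assumptions for illustration; it takes a string so it can run on its own, whereas the real catalog would be opened with open().

```php
<?php
// Hypothetical helper: returns [artist, song, preview] rows from a catalog string.
function extractRows(string $xml): array
{
	$reader = new XMLReader();
	if (!$reader->XML($xml)) {
		throw new RuntimeException('Cannot load XML');
	}

	// move to the first <artist> element
	$found = false;
	while ($reader->read()) {
		if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'artist') {
			$found = true;
			break;
		}
	}

	$rows = [];
	while ($found) {
		// parse only the current artist's subtree
		$artist = new SimpleXMLElement($reader->readOuterXml());
		$artistName = trim((string) $artist->name);

		// xpath() gives an empty list when the artist has no songs
		foreach ($artist->xpath('songs/song') ?: [] as $song) {
			$songName = trim((string) $song->name);
			$preview = trim((string) $song->preview);
			if ($artistName !== '' && $songName !== '' && filter_var($preview, FILTER_VALIDATE_URL)) {
				$rows[] = [$artistName, $songName, $preview];
			}
		}

		// jump to the next <artist>, skipping the subtree just handled
		$found = $reader->next('artist');
	}
	$reader->close();

	return $rows;
}

// Assumed sample data in the same shape as the catalog
$sample = '<artists><artist id="2000"><name>The Solids</name><songs>'
	. '<song id="5022"><name>Hey Beautiful</name>'
	. '<preview>http://www.karaoke-version.com/preview/57278/</preview></song>'
	. '</songs></artist></artists>';

foreach (extractRows($sample) as $row) {
	echo implode(', ', $row), "\n";
}
```

The trade-off is that each artist is serialized and parsed a second time, so this may be slower than expand(). In exchange the extraction logic is shorter and the reader never descends into nodes that have already been processed.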
			

Demo

Click on the following link to test: xmlreader.

